Custom Segmentation

Each time you upload XML, HTML, MD, or any other source file without a key-value structure, the predefined segmentation rules (SRX 2.0) are applied to segment the content automatically. In some cases, however, the default rules may split a source file differently from what you expect. When that happens, you can define your own segmentation rules for each source file individually using the SRX 2.0 standard.

Change Segmentation

Segmentation can be changed in the Project settings > Files tab.

  1. Open the project where you’d like to adjust the segmentation rules and switch to the Project settings > Files tab.
  2. Click (or right-click) the file you want to change and select the Change segmentation option.
  3. In the dialog that appears, paste in your SRX segmentation rules and click Save.

After you save the new segmentation rules, the source file is automatically reimported and segmented according to them.

Segmentation Examples

A typical SRX file looks similar to the following:

<?xml version="1.0" encoding="UTF-8"?>
<srx version="2.0" 
    xmlns="http://www.lisa.org/srx20"
    xsi:schemaLocation="http://www.lisa.org/srx20 srx20.xsd"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <header segmentsubflows="yes" cascade="yes">
        <formathandle type="start" include="no"/>
        <formathandle type="end" include="yes"/>
        <formathandle type="isolated" include="yes"/>
    </header>
    <body>
        <languagerules>
            <languagerule languagerulename="Default">
                <!-- Common rules for most languages -->
                <rule break="no">
                    <beforebreak>^\s*[0-9]+\.</beforebreak>
                    <afterbreak>\s</afterbreak>
                </rule>
                <rule break="yes">
                    <afterbreak>\n</afterbreak>
                </rule>
                <rule break="yes">
                    <beforebreak>[\.\?!]+</beforebreak>
                    <afterbreak>\s</afterbreak>
                </rule>
            </languagerule>
        </languagerules>
        <maprules>
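            <!-- List exceptions first. Each languagerulename must match a <languagerule> defined above; the English, French, and Japanese rule sets are omitted from this example -->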
            <languagemap languagepattern="[Ee][Nn].*" languagerulename="English"/>
            <languagemap languagepattern="[Ff][Rr].*" languagerulename="French"/>
            <!-- Japanese breaking rules -->
            <languagemap languagepattern="[Jj][Aa].*" languagerulename="Japanese"/>
            <!-- Common breaking rules -->
            <languagemap languagepattern=".*" languagerulename="Default"/>
        </maprules>
    </body>
</srx>
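
In SRX, the rules inside a language rule are checked in order, and the first rule that matches at a given position decides whether a break occurs there. Exceptions (break="no") therefore generally need to come before the general break rules, as the numbered-list exception does in the example above. As a minimal sketch, assuming you want to keep a few common English abbreviations from ending a segment (the abbreviation list is illustrative and not part of the default rules), an exception could be placed ahead of the sentence-ending rule:

```xml
<!-- Hypothetical exception: do not break after these abbreviations -->
<rule break="no">
    <beforebreak>\b(e\.g|i\.e|Mr|Mrs|Dr)\.</beforebreak>
    <afterbreak>\s</afterbreak>
</rule>
<!-- General rule from the example above: break after sentence-ending punctuation -->
<rule break="yes">
    <beforebreak>[\.\?!]+</beforebreak>
    <afterbreak>\s</afterbreak>
</rule>
```

If the two rules were swapped, the general rule would match first and the exception would never take effect.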

Change Sentence Separator for Asian Languages

Usually, the full stop is used as a sentence separator, but this is not the case for some Asian languages. For example, in Chinese, the typical sentence separator is the ideographic full stop (。, U+3002). For such cases, you may want to use the following rule:

<rule break="yes">
    <beforebreak>[\u3002]+</beforebreak>
    <afterbreak></afterbreak>
</rule>
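
A rule takes effect only inside a language rule that is mapped to the file's language. The following is a sketch of how a rule for the ideographic full stop might be wired in, assuming a language rule named Chinese (a hypothetical name) and language codes that start with zh. Adjust both to your project:

```xml
<languagerules>
    <!-- "Chinese" is a hypothetical rule name chosen for this sketch -->
    <languagerule languagerulename="Chinese">
        <rule break="yes">
            <!-- \u3002 is the ideographic full stop -->
            <beforebreak>[\u3002]+</beforebreak>
            <afterbreak></afterbreak>
        </rule>
    </languagerule>
</languagerules>
<maprules>
    <!-- Assumes the language code starts with "zh"; keep this above any catch-all ".*" map -->
    <languagemap languagepattern="[Zz][Hh].*" languagerulename="Chinese"/>
</maprules>
```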

Break Text into Smaller Parts

The following simple example shows how to split one piece of text into two (or more) strings.

Text with default segmentation rules:

This is the first part of the sample sentence and this is the second part.

Text with new segmentation rules:

This is the first part of the sample sentence
and this is the second part.

For this particular case, the following rule will break the initial sentence into two parts:

<rule break="yes">
    <beforebreak>sentence</beforebreak>
    <afterbreak>\u0020</afterbreak>
</rule>
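
A rule like this applies everywhere in the file: any occurrence of "sentence" followed by a space produces a break. If that is too broad, one option is to make both sides of the break more specific. A sketch, assuming the break should occur only before the word "and" in this exact phrase:

```xml
<rule break="yes">
    <!-- Text that must end immediately before the break -->
    <beforebreak>sample sentence</beforebreak>
    <!-- Text that must start immediately after the break: a space, then the word "and" -->
    <afterbreak>\u0020and\b</afterbreak>
</rule>
```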

Create Segmentation Rules with SRX Editors

SRX segmentation rules can be created and maintained with tools like Ratel, which provides a visual interface for creating segmentation rules from scratch or editing existing ones.

Seek Assistance

Need help setting up your custom segmentation rules or have any questions? Contact the Support Team.
