Custom Segmentation

Each time you upload XML, HTML, MD, or any other source file without a key-value structure, the predefined segmentation rules (SRX 2.0) are used for automatic content segmentation. However, the default rules may sometimes segment a source file differently from what you expect. In such cases, you can define your own segmentation rules for each source file individually using the SRX 2.0 standard.

Change Segmentation

You can change segmentation in Sources > Files.

  1. Open the project where you’d like to adjust the segmentation rules and go to Sources > Files.
  2. Click (or right-click) the file you want to adjust and select Settings.
  3. In the dialog that appears, switch to the Parser configuration tab.
  4. In the Excluded elements field, specify all elements that should not be imported.
  5. Select Enable content segmentation and Use custom segmentation rules.
  6. Paste your SRX segmentation rules and click Save.

After you save your new segmentation rules, your source file will be automatically reimported and segmented according to these new rules.

Segmentation Examples

Note: Regular expressions used in SRX rules must be compatible with PHP (PCRE2) and Node.js.

A typical SRX file looks like the following example:

<?xml version="1.0" encoding="UTF-8"?>
<srx version="2.0" 
    xmlns="http://www.lisa.org/srx20"
    xsi:schemaLocation="http://www.lisa.org/srx20 srx20.xsd"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <header segmentsubflows="yes" cascade="yes">
        <formathandle type="start" include="no"/>
        <formathandle type="end" include="yes"/>
        <formathandle type="isolated" include="yes"/>
    </header>
    <body>
        <languagerules>
            <languagerule languagerulename="Default">
                <!-- Common rules for most languages -->
                <rule break="no">
                    <beforebreak>^\s*[0-9]+\.</beforebreak>
                    <afterbreak>\s</afterbreak>
                </rule>
                <rule break="yes">
                    <afterbreak>\n</afterbreak>
                </rule>
                <rule break="yes">
                    <beforebreak>[\.\?!]+</beforebreak>
                    <afterbreak>\s</afterbreak>
                </rule>
            </languagerule>
        </languagerules>
        <maprules>
            <!-- List exceptions first -->
            <languagemap languagepattern="[Ee][Nn].*" languagerulename="English"/>
            <languagemap languagepattern="[Ff][Rr].*" languagerulename="French"/>
            <!-- Japanese breaking rules -->
            <languagemap languagepattern="[Jj][Aa].*" languagerulename="Japanese"/>
            <!-- Common breaking rules -->
            <languagemap languagepattern=".*" languagerulename="Default"/>
        </maprules>
    </body>
</srx>
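
The sample above maps language codes matching `[Ee][Nn].*` to a rule set named English but does not define that rule set. The following is a minimal sketch of what it might contain, placed inside `<languagerules>` next to the Default rule set. The abbreviation list is an illustrative assumption, not part of the original sample:

<languagerule languagerulename="English">
    <!-- Assumed exception: do not break after a few common abbreviations -->
    <rule break="no">
        <beforebreak>\b(Mr|Mrs|Dr|etc)\.</beforebreak>
        <afterbreak>\s</afterbreak>
    </rule>
    <!-- Break after sentence-ending punctuation followed by whitespace -->
    <rule break="yes">
        <beforebreak>[\.\?!]+</beforebreak>
        <afterbreak>\s</afterbreak>
    </rule>
</languagerule>

Because the header sets cascade="yes", every matching language map applies in the order listed, so for English content these rules would be checked before the Default ones. Within a rule set the first matching rule wins, which is why the exception is listed before the general break rule.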

Change the Sentence Separator for Asian Languages

Usually, a period is used as the sentence separator. However, this is not the case for some Asian languages. For example, the typical sentence separator in Chinese is the ideographic full stop (。). For such cases, you may want to use the following rule:

<rule break="yes">
    <!-- Literal ideographic full stop (U+3002); save the rules as UTF-8 -->
    <beforebreak>[。]+</beforebreak>
    <afterbreak></afterbreak>
</rule>
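
Chinese and Japanese text also commonly ends sentences with the fullwidth exclamation mark (！, U+FF01) and the fullwidth question mark (？, U+FF1F). The following is a sketch of a rule covering all three characters, assuming you want a break after each of them. This extension is an assumption and is not described in the article itself:

<rule break="yes">
    <!-- Assumed extension: ideographic full stop plus fullwidth ! and ? -->
    <beforebreak>[。！？]+</beforebreak>
    <afterbreak></afterbreak>
</rule>

The empty afterbreak means the break does not depend on what follows, which suits languages that do not put a space between sentences.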

Break Text into Shorter Parts

The following simple example shows a case where one piece of text needs to be segmented into two (or more) strings.

Text with the default segmentation rules:

This is the first part of the sample sentence, and this is the second part.

Text with the new segmentation rules:

This is the first part of the sample sentence,
and this is the second part.

For this particular case, the following rule breaks the initial sentence into two parts:

<rule break="yes">
    <!-- Break after "sentence," when a space (\x20) follows -->
    <beforebreak>sentence,</beforebreak>
    <afterbreak>\x20</afterbreak>
</rule>
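
Rules within a rule set are evaluated in order and the first match at a given position wins, so a text-specific rule like this one is normally listed before the general rules. The following is a sketch of how it might sit inside the Default rule set from the sample file. The pattern and the placement are illustrative assumptions:

<languagerule languagerulename="Default">
    <!-- Specific rule first: break after "sentence," followed by whitespace -->
    <rule break="yes">
        <beforebreak>sentence,</beforebreak>
        <afterbreak>\s</afterbreak>
    </rule>
    <!-- General rule: break after sentence-ending punctuation -->
    <rule break="yes">
        <beforebreak>[\.\?!]+</beforebreak>
        <afterbreak>\s</afterbreak>
    </rule>
</languagerule>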

Create Segmentation Rules with SRX Editors

SRX segmentation rules can be created and maintained using tools such as Ratel. It has a visual interface where you can create segmentation rules from scratch or edit existing ones.

Seeking Assistance

Need help setting up custom segmentation rules or have any questions? Contact the Support Team.
