Thai On-Line Library - Bitext Corpus
Maintained by Doug Cooper (bugs to doug@th.net)
Center for Research in Computational Linguistics, Bangkok http://seasrc.th.net


Implementation Details
Files that underlie the bitext corpus are marked up as simply as possible: <S> precedes any line of text, and <P> precedes any group of lines. Optionally, <T> can tag the title, and <C> can tag a chapter head, instead of <S> (see the examples below). These simple tags are easily adapted to more sophisticated formats, e.g. CESANA (see an example of that here).
      Each language of a bitext goes in its own file. Both files should have the same number of <P> groups of lines, and each corresponding group should have the same number of <S> (or <T> or <C>) tags. If a <T> tag is present, it is used for labeling the search result source.
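
The consistency rules above (matching numbers of <P> groups, and matching numbers of tagged lines within each group) can be checked mechanically. What follows is a minimal Python sketch, not the corpus's own tooling: the names parse_groups and check_alignment are hypothetical, and it assumes each tag starts a line, with any untagged line treated as a continuation of the previous tagged line.

```python
import re

# One tag per line: <P> opens a group; <S>, <T>, <C> each mark one line of text.
TAG_RE = re.compile(r"^\s*<([PSTC])>(.*)$", re.IGNORECASE)


def parse_groups(text):
    """Return a list of <P> groups; each group is a list of (tag, text) pairs."""
    groups = []
    for raw in text.splitlines():
        m = TAG_RE.match(raw)
        if m:
            tag, body = m.group(1).upper(), m.group(2).strip()
            if tag == "P":
                groups.append([])
            else:
                if not groups:  # tagged line before any <P>: open a group
                    groups.append([])
                groups[-1].append((tag, body))
        elif raw.strip() and groups and groups[-1]:
            # Untagged text continues the previous tagged line (assumption).
            tag, body = groups[-1][-1]
            groups[-1][-1] = (tag, (body + " " + raw.strip()).strip())
    return groups


def check_alignment(groups_a, groups_b):
    """Return a list of mismatch messages; an empty list means the files agree."""
    problems = []
    if len(groups_a) != len(groups_b):
        problems.append(
            f"group count differs: {len(groups_a)} vs {len(groups_b)}"
        )
    for n, (ga, gb) in enumerate(zip(groups_a, groups_b), start=1):
        if len(ga) != len(gb):
            problems.append(f"group {n}: {len(ga)} vs {len(gb)} lines")
    return problems
```

Running check_alignment on the parsed groups of the two language files before building the side-by-side lines would flag any group whose line counts have drifted apart.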
      Paragraph and sentence alignment is done by hand. In general, paragraphs are easy to align automatically, but sentences sometimes require special consideration, as discussed below. As a rule, breaks in the source language are taken as the final arbiter.
      Segmentation points (in Thai files) are marked with <wbr>. This implementation of the Bitext Corpus is testing 'weak segmentation' tools - we produce text that is not fully segmented, but which is guaranteed not to contain segmentation errors. This is partly meant to cut down on text size increases caused by embedding allowable breakpoints.
      Since <wbr> isn't implemented properly in some Web browsers, we use the following workaround to mark allowable text breakpoints:  in effect, we set the point size to 2, insert a single space, then return to the default point size (in practice, we override the <B> tag). A 2-point space appears to be the smallest unit that's rendered consistently.
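
As a rough illustration of this workaround, the Python sketch below rewrites each <wbr> as a space wrapped in <B>, together with a style rule that renders <B> at 2 points. The exact markup and style rule used on the site are not given here, so BREAK_STYLE and mark_breakpoints are assumptions for illustration only.

```python
import re

# Assumed style rule: render <b> as a 2-point, non-bold span.
BREAK_STYLE = "<style>b { font-size: 2pt; font-weight: normal; }</style>"

WBR_RE = re.compile(r"<wbr>", re.IGNORECASE)


def mark_breakpoints(segmented):
    """Replace each <wbr> with a tiny space that browsers may break on."""
    return WBR_RE.sub("<b> </b>", segmented)


# Example: one allowable breakpoint between two Thai words.
html_line = mark_breakpoints("ภาษา<wbr>ไทย")
```

A page using this sketch would emit BREAK_STYLE once in its head and pass each Thai line through mark_breakpoints before output.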
      The search feature uses Eric Bohlman's Text::Query::Advanced package (available from CPAN), with ~ added as a synonym for near, and with build_near overridden to handle Thai properly.
      In effect, we grab the pre-built files of side-by-side lines, label each line with part of any <T>-tagged title, and return a random assortment of hits. We sort by line length to help suppress long, multi-sentence hits that arise from translation artifacts.
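
The result-selection step might look roughly like the Python sketch below. It is an illustration under assumptions rather than the site's code: the dictionary keys, the pool_factor parameter, and the choice to sort first and then sample from the shortest hits are all hypothetical, and the query match itself is passed in as a plain function instead of using Text::Query::Advanced.

```python
import random


def label(pair, width=20):
    """Prefix an aligned pair with part of its <T>-tagged title."""
    return f"[{pair['title'][:width]}] {pair['source']} | {pair['target']}"


def pick_hits(pairs, matches, limit=10, pool_factor=3, rng=None):
    """Return up to `limit` random hits, drawn from the shortest matches.

    pairs:   list of dicts with 'title', 'source' and 'target' keys
    matches: function taking a pair and returning True for a hit
    """
    rng = rng or random.Random()
    hits = [p for p in pairs if matches(p)]
    # Shortest lines first, so long multi-sentence items fall out of the pool.
    hits.sort(key=lambda p: len(p["source"]))
    pool = hits[: limit * pool_factor]
    return rng.sample(pool, min(limit, len(pool)))
```

Sampling from a pool a few times larger than the requested limit keeps the results varied while still favouring short, single-sentence lines.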

Markup Examples and Issues
In the first set of examples below, tags are placed in-line with text for convenience only:
<P>
<T>A Tale of Two Paragraphs
<S>by The Author
<P>
<C>Chapter 1
<P>
<S>It was the best of lines ...
<S>And then blah blah ...
<P>
<S>The second paragraph.
<S>Yet more blah blah ...
   
(The <T> and <C> lines above could each also be tagged with <S>.)

Alignment:  There is not always an ordered, one-to-one correspondence in translation, especially at the sentence level. Because the sentence and paragraph marks are not numbered (in the interests of simplicity), this can cause alignment problems. For example, a sequence of L1 sentences (S1, S2, S3) may be translated into L2 in a different order (as S2, S1, S3), or using fewer sentences (S1 and S2 translated as a single S1).
Workaround:  We let the tag <S> mark the smallest unit that can be reasonably aligned.
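
Alignment in this corpus is done by hand, but the idea of a "smallest unit that can be reasonably aligned" can be illustrated in code. The Python sketch below is hypothetical (the function name and the links representation are assumptions): given which L1 sentences correspond to which L2 sentences, it groups them into the smallest consecutive units such that no correspondence crosses a unit boundary. Each resulting unit would receive one <S> tag in each file.

```python
def smallest_alignable_units(n1, n2, links):
    """Group sentences into minimal units with no link crossing a boundary.

    n1, n2: number of sentences in L1 and L2
    links:  set of (i, j) pairs, meaning L1 sentence i corresponds to
            L2 sentence j (both 0-based)
    Returns a list of (l1_indices, l2_indices) units, in document order.
    """
    units = []
    start1 = start2 = 0
    for i in range(n1):
        # Furthest L2 sentence touched by L1 sentences 0..i.
        reach2 = max((j for a, j in links if a <= i), default=-1)
        # Furthest L1 sentence touched by L2 sentences 0..reach2.
        reach1 = max((a for a, j in links if j <= reach2), default=-1)
        if reach1 <= i and reach2 >= start2:
            units.append(
                (list(range(start1, i + 1)), list(range(start2, reach2 + 1)))
            )
            start1, start2 = i + 1, reach2 + 1
    # Any unlinked trailing sentences are folded into the last unit.
    if start1 < n1 or start2 < n2:
        rest1, rest2 = list(range(start1, n1)), list(range(start2, n2))
        if units:
            last1, last2 = units[-1]
            units[-1] = (last1 + rest1, last2 + rest2)
        else:
            units.append((rest1, rest2))
    return units


# Sentence order differs: L1 S1, S2, S3 translated as S2, S1, S3.
reordered = smallest_alignable_units(3, 3, {(0, 1), (1, 0), (2, 2)})
# Fewer sentences: L1 S1 and S2 translated as L2 S1; L1 S3 as L2 S2.
merged = smallest_alignable_units(3, 2, {(0, 0), (1, 0), (2, 1)})
```

In both cases the first two L1 sentences end up sharing a single <S> unit, and the third sentence gets a unit of its own.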

markup for L1 (sentence order differs):
<S>S1, S2
<S>S3
<S>...

markup for L2:
<S>S2, S1
<S>S3
<S>...

markup for L1 (different number of sentences):
<S>S1, S2      (translated below as S1)
<S>S3      (translated below as S2)
<S>...

markup for L2:
<S>S1
<S>S2
<S>