|
Implementation Details
Files that underlie the bitext corpus are marked up as simply as possible: <S> precedes any line of text, and <P> precedes any group of lines. Optionally, <T> can tag the title, and <C> can tag a chapter head, instead of <S> (see examples). These simple tags are easily adapted to more sophisticated formats, e.g. CESANA (see an example of that here). | ||||
| Each language of a bitext goes in its own file. Both files should have the same number of <P> groups of lines, and each corresponding group should have the same number of <S> (or <T> or <C>) tags. If a <T> tag is present, it is used for labeling the search result source. | ||||
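The consistency rule above (equal numbers of <P> groups, and equal numbers of tagged lines in corresponding groups) can be checked mechanically. The following is a minimal Python sketch, not the corpus's own tooling. It assumes each tag appears at the start of its line, and the function names parse_groups and alignment_problems are hypothetical.

```python
import re

# Assumption: every tag sits at the start of its own line.
TAG_RE = re.compile(r"^\s*<([PSTC])>\s*(.*)$")

def parse_groups(text):
    """Return a list of <P> groups; each group is a list of (tag, body) pairs."""
    groups = []
    for raw in text.splitlines():
        m = TAG_RE.match(raw)
        if not m:
            continue  # untagged lines are ignored in this sketch
        tag, body = m.group(1), m.group(2)
        if tag == "P":
            groups.append([])  # <P> opens a new group of lines
        else:
            if not groups:
                groups.append([])  # tolerate tagged lines before the first <P>
            groups[-1].append((tag, body))
    return groups

def alignment_problems(src_text, tgt_text):
    """List the places where two files of a bitext disagree in structure."""
    src, tgt = parse_groups(src_text), parse_groups(tgt_text)
    problems = []
    if len(src) != len(tgt):
        problems.append(f"<P> group count differs: {len(src)} vs {len(tgt)}")
    for i, (a, b) in enumerate(zip(src, tgt), start=1):
        if len(a) != len(b):
            problems.append(f"group {i}: {len(a)} vs {len(b)} tagged lines")
    return problems
```

An empty result means the two files have the same shape; it says nothing about whether the aligned lines are actually translations of each other.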
| Paragraph and sentence alignment is done by hand. In general, paragraphs are easy to align automatically, but sentences sometimes require special consideration, as discussed below. As a rule, breaks in the source language are taken as the final arbiter. | ||||
| Segmentation points (in Thai files) are marked with <wbr>. This implementation of the Bitext Corpus is testing 'weak segmentation' tools: we produce text that is not fully segmented, but which is guaranteed not to contain segmentation errors. This is partly meant to cut down on the text size increases caused by embedding allowable breakpoints. | ||||
| Since <wbr> isn't implemented properly in some Web browsers, we use the following workaround to mark allowable text breakpoints: in effect, we set the point size to 2, insert a single space, then return to the default point size (in practice, we override the <B> tag). A 2-point space appears to be the smallest unit that's rendered consistently. | ||||
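As an illustration only, the substitution could look like the Python sketch below. The exact markup used by the site is not reproduced here, so both the replacement string and the style rule are assumptions: the sketch guesses that the overridden <B> tag wraps a single space and that a style rule shrinks it to 2 points. The names SMALL_SPACE, ASSUMED_STYLE and mark_breakpoints are hypothetical.

```python
# Assumption: a <b> element wrapping one space stands in for <wbr>.
SMALL_SPACE = "<b> </b>"

# Assumption: a style rule like this overrides <b> so the space renders at 2 points.
ASSUMED_STYLE = "b { font-size: 2pt; font-weight: normal; }"

def mark_breakpoints(line, marker=SMALL_SPACE):
    """Replace each <wbr> segmentation point with the small-space marker."""
    return line.replace("<wbr>", marker)
```

Because the marker is an ordinary space, browsers that ignore <wbr> can still wrap the line at these points.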
| The search feature uses Eric Bohlman's Text::Query::Advanced package (available from CPAN), with ~ added as a synonym for near, and with build_near overridden to handle Thai properly. | ||||
|
In effect, we grab the pre-built
side-by-side lines files, label each line with part of any
<T>-tagged title, and return a random assortment
of hits. We sort by line length to help suppress hits that,
because of translation artifacts, would return long, multi-sentence items.
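The procedure just described can be sketched roughly as follows in Python. This is not the site's code: a plain substring test stands in for the Text::Query::Advanced query, and the way sorting and random selection are combined (keep the shortest matches, then sample from them) is an assumption. The function name search and the parameters limit and pool_factor are hypothetical.

```python
import random

def search(pairs, query, limit=10, pool_factor=3, rng=random):
    """pairs: list of (title_label, source_line, target_line) tuples.

    Returns up to `limit` matching tuples, drawn at random from the
    shortest matches so that long, multi-sentence items are suppressed.
    """
    # Stand-in for the real query engine: simple substring match on either side.
    hits = [p for p in pairs if query in p[1] or query in p[2]]
    # Shortest combined line length first.
    hits.sort(key=lambda p: len(p[1]) + len(p[2]))
    # Keep only the shortest candidates, then pick a random assortment.
    pool = hits[: limit * pool_factor]
    return rng.sample(pool, min(limit, len(pool)))
```

Each returned tuple carries its title label, so a result can be displayed with its source alongside the aligned pair of lines.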
Markup Examples and Issues
Workaround: We let the tag <S> mark the smallest unit that can be reasonably aligned.
<S> markup for L1 (sentence order differs): S1, S2 <S> S3 <S> ...
<S> markup for L2: S2, S1 <S> S3 <S> ...
<S> markup for L1 (different number of sentences): S1, S2 translated below as S1 <S> S3 translated below as S2 <S> ...
<S> markup for L2: S1 <S> S2 <S> |