Thai On-Line Library - Bitext Corpus
Maintained by Doug Cooper (bugs to doug@th.net)
Center for Research in Computational Linguistics, Bangkok http://seasrc.th.net


Research Projects / Tools to Develop
If you have any interest in working on or otherwise supporting any of these bitext-related projects, please contact us. Most are suitable as Master's-level projects, and are likely to lead to publishable results.
   If you are a Thai student, published work of this sort is the most persuasive possible argument you can supply to the admissions committee of the graduate school you'd like to attend. For overseas students interested in Thai, Lao, Burmese or Khmer, these projects let you make a real contribution to scholarship in a region that is woefully short on research funding and vision. Finally, overseas professionals who occasionally vacation in Thailand can combine sanuk with some interesting (not to mention socially productive) work, while building tools that will help their own language studies.
Image / text alignment. We frequently find that while a text may be available in electronic form in one language, its translation is available in print only (and for one reason or another, OCR is not an option). We would like to be able to align scanned images with e-text. Primary considerations are speed and low cost: we'd like to scan a page, and click on (or automatically detect) sentence and paragraph boundaries. An e-text search tool should then return both matched e-text and the corresponding slice of image data.
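One way to picture the result of such an alignment is a record that ties each e-text sentence to a slice of a scanned page. The sketch below is only an illustration under assumed names: the classes, the page file name, the pixel coordinates and the second sample sentence are all hypothetical and do not describe any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageSlice:
    page: str    # hypothetical file name of the scanned page
    top: int     # first pixel row of the sentence on that page
    bottom: int  # last pixel row of the sentence on that page


@dataclass(frozen=True)
class AlignedSentence:
    etext: str         # sentence in the language available as e-text
    image: ImageSlice  # where its printed translation sits in the scan


def search(corpus, query):
    """Return every aligned sentence whose e-text contains the query."""
    needle = query.lower()
    return [s for s in corpus if needle in s.etext.lower()]


# Made-up sample data: two sentences marked by hand on one scanned page.
corpus = [
    AlignedSentence("Finally it came to the point where he had to choose.",
                    ImageSlice("page_012.png", 140, 188)),
    AlignedSentence("The field lay fallow that year.",
                    ImageSlice("page_012.png", 188, 236)),
]

for hit in search(corpus, "choose"):
    print(hit.etext, "->", hit.image)
```

A search hit then carries both the matched e-text and the coordinates needed to crop the corresponding image slice.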
Sense disambiguation. This project is a necessary part of work in phrase alignment (see below), as well as more general issues of translation and semantic matching. Given an L1 term and its context, along with an L2 translation of the context (which presumably includes a translation of the term), can we guess at the term's meaning? Can we get additional information from other translated contexts in the bitext corpus?
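A very simple baseline for these two questions, offered only as a sketch: score each candidate sense of the L1 term by how many of its L2 glosses occur in the translated context, then let several translated contexts vote. The sense inventory below is hypothetical, and exact word matching is an assumption that a real system would replace with stemming or a proper lexicon.

```python
import re
from collections import Counter

# Hypothetical sense inventory for one L1 term: sense label -> L2 glosses.
SENSES = {
    "choose": {"choose", "select", "pick"},
    "elect": {"elect", "vote"},
}


def guess_sense(senses, l2_context):
    """Pick the sense whose glosses overlap most with the L2 context."""
    if not senses:
        return None
    tokens = set(re.findall(r"[a-z]+", l2_context.lower()))
    scores = {sense: len(glosses & tokens) for sense, glosses in senses.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None: no evidence at all


def vote(senses, l2_contexts):
    """Combine guesses from several translated contexts by majority vote."""
    guesses = (guess_sense(senses, c) for c in l2_contexts)
    votes = Counter(g for g in guesses if g is not None)
    return votes.most_common(1)[0][0] if votes else None


print(guess_sense(SENSES, "he had to choose between expanding the former"))
```

Contexts that contain none of the glosses return no guess, which is exactly the case where the other translated contexts in the corpus would have to supply the evidence.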
Detection of parallel Web pages. Some Web sites (particularly government and NGO sites) have a considerable amount of relatively high-quality parallel translation (at this point, corporate sites seem to be of much lower quality). We are interested in spidering Thai webspace, and finding and downloading likely candidate pages.
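One common heuristic for finding candidate pages, sketched here under assumptions: URLs of parallel pages often differ only in a language marker. The markers `th` and `en`, the delimiters around them, and the example.org URLs are all illustrative guesses, not observations about any particular site, and a real spider would also need to compare page content.

```python
import re
from collections import defaultdict

# Assumed convention: a language code set off by /, _, . or - on both sides.
LANG_MARKER = re.compile(r"(?<=[/_.\-])(th|en)(?=[/_.\-])")


def candidate_pairs(urls):
    """Pair URLs that are identical except for their language marker."""
    groups = defaultdict(dict)
    for url in urls:
        match = LANG_MARKER.search(url)
        if match:
            # Replace the marker with a placeholder to get a shared key.
            key = url[:match.start()] + "*" + url[match.end():]
            groups[key][match.group(1)] = url
    return sorted((g["th"], g["en"]) for g in groups.values()
                  if "th" in g and "en" in g)


urls = [
    "http://example.org/th/news/1.html",
    "http://example.org/en/news/1.html",
    "http://example.org/about_th.html",
    "http://example.org/about_en.html",
    "http://example.org/th/contact.html",  # no English counterpart
]
for thai_url, english_url in candidate_pairs(urls):
    print(thai_url, "<->", english_url)
```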
Web page disassembly. Extracting useful (and potentially parallelizable) text content from Web pages is not as easy as it would seem. We are interested in this project as part of our larger Thai e-corpus as well.
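As a starting point only, the sketch below uses Python's standard `html.parser` module to collect visible text while skipping script and style content. It deliberately ignores the hard parts (navigation menus, boilerplate, broken markup, character encodings), and the class and function names are illustrative.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text chunks, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


page = ("<html><head><title>T</title><style>p{color:red}</style></head>"
        "<body><script>var x=1;</script><p>Hello <b>world</b></p>"
        "<p>สวัสดี</p></body></html>")
print(extract_text(page))
```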
Phrase alignment. We want to focus returned contexts on the search query, e.g.:
Search Results for เลือก (95% @ English +/- 3 words, Thai +/- 15 chars):

. . . he had to choose between expanding the . . . [Man]

จุดที่ เขา จะต้อง   เลือก   เอา ระหว่างการบุก

instead of returning the entire sentence(s):

Finally it came to the point where he had to choose between expanding the former or caring for the latter. [Man]

ในที่ สุด ก็มา ถึงจุดที่ เขา จะต้อง เลือก เอาระหว่างการบุก เบิก พื้นที่อย่างแรก กับ การดูแล พื้นที่อย่างหลัง

   This problem is trivial in cases like the above, where the English context contains a word-for-word match (to choose) to the Thai word's most likely first sense definition. In many cases, though, the returned context will have to be considerably larger than the 3 English words / 15 Thai characters suggested here.
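For the trivial case described above, the windowing itself might look like the following sketch. It assumes the matching English word is already known, counts Thai "characters" as Unicode code points (so vowel signs and tone marks each count as one), and uses illustrative function names.

```python
def english_window(sentence, term, radius=3):
    """Return the term with up to `radius` words on either side."""
    words = sentence.split()
    for i, word in enumerate(words):
        if word.strip(".,;:!?").lower() == term.lower():
            start = max(0, i - radius)
            return " ".join(words[start:i + radius + 1])
    return None  # term not present as a whole word


def thai_window(sentence, term, radius=15):
    """Return the term with up to `radius` code points on either side."""
    i = sentence.find(term)
    if i < 0:
        return None
    return sentence[max(0, i - radius):i + len(term) + radius]


english = ("Finally it came to the point where he had to choose between "
           "expanding the former or caring for the latter.")
thai = ("ในที่ สุด ก็มา ถึงจุดที่ เขา จะต้อง เลือก เอาระหว่างการบุก "
        "เบิก พื้นที่อย่างแรก กับ การดูแล พื้นที่อย่างหลัง")

print(english_window(english, "choose"))
# -> he had to choose between expanding the
print(thai_window(thai, "เลือก"))
```

Cutting Thai text at a fixed code-point offset can split a syllable or strand a combining mark, so a practical tool would probably snap the window to word or cluster boundaries.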

Parallel alignment. We are interested in automating parallel alignment at the sentence and paragraph level (as seen in the Bitext Corpus).
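A common starting point for sentence alignment is length-based dynamic programming, in the spirit of Gale and Church's method. The sketch below is a simplified stand-in, not the method used to build the Bitext Corpus: the cost function, the penalties, and the `ratio` parameter (expected target length per source character) are all assumptions chosen for illustration.

```python
# Allowed alignment "beads" (source sentences, target sentences) and an
# assumed penalty for each; 1-1 is free, skips are expensive.
PENALTY = {(1, 1): 0.0, (1, 0): 1.0, (0, 1): 1.0, (2, 1): 0.1, (1, 2): 0.1}


def bead_cost(src_group, tgt_group, ratio):
    """Relative length mismatch in [0, 1] plus the bead-type penalty."""
    ls = sum(len(s) for s in src_group) * ratio
    lt = sum(len(t) for t in tgt_group)
    mismatch = abs(ls - lt) / max(ls + lt, 1)
    return mismatch + PENALTY[(len(src_group), len(tgt_group))]


def align(src, tgt, ratio=1.0):
    """Return the cheapest sequence of (src_indices, tgt_indices) beads."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in PENALTY:
                a, b = i + di, j + dj
                if a > n or b > m:
                    continue
                c = cost[i][j] + bead_cost(src[i:a], tgt[j:b], ratio)
                if c < cost[a][b]:
                    cost[a][b] = c
                    back[a][b] = (i, j)
    # Walk the back-pointers from the end to recover the alignment.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]


# Two short source sentences rendered as one long target sentence.
print(align(["x" * 8, "y" * 27], ["z" * 35]))
# -> [([0, 1], [0])]
```

For a Thai/English pair, `ratio` would need to be estimated from already-aligned text, since the two scripts differ in how many characters a sentence typically takes.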