|
Research Projects / Tools to Develop If you have any interest in working on or otherwise supporting any of these bitext-related projects, please contact us. Most are suitable as Master's-level projects, and are likely to lead to publishable results. | ||
|
If you are a Thai student, published work of
this sort is the most persuasive possible argument you can
supply to the admissions committee of the graduate school you'd
like to attend. For overseas students interested in Thai, Lao,
Burmese or Khmer, these projects let you make a real contribution
to scholarship in a region that is woefully short on research
funding and vision. Finally, overseas professionals who occasionally
vacation in Thailand can combine sanuk with some interesting
(not to mention socially productive) work, while building tools that
will help their own language studies.
Image / text alignment We frequently find that while a text may be available in electronic form in one language, its translation is available in print only (and for one reason or another, OCR is not an option). We would like to be able to align scanned images with e-text. The primary considerations are speed and low cost: we'd like to scan a page, then click on (or automatically detect) sentence and paragraph boundaries. An e-text search tool should then return both the matched e-text and the corresponding slice of image data. | ||
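One way to picture the data such a tool would need is sketched below. This is only an illustration under assumptions: the class names `Region` and `AlignedUnit`, the `search` function, and the pixel-box layout are all hypothetical, and the boundaries are assumed to have been marked already (by clicking or by detection).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    """A rectangle on a scanned page (hypothetical layout)."""
    page: str                        # path to the scanned page image
    box: Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels


@dataclass
class AlignedUnit:
    """One sentence or paragraph of e-text tied to its image region."""
    etext: str
    region: Region


def search(units: List[AlignedUnit], query: str) -> List[Tuple[str, Region]]:
    """Return (e-text, region) for every unit whose e-text contains the query."""
    q = query.lower()
    return [(u.etext, u.region) for u in units if q in u.etext.lower()]
```

The region returned with each hit is enough for a display layer to cut the matching slice out of the page image with whatever imaging library is at hand.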
| Sense disambiguation This project is a necessary part of work in phrase alignment (see below), as well as more general issues of translation and semantic matching. Given an L1 term and its context, along with an L2 translation of the context (which presumably includes a translation of the term), can we guess at the term's meaning? Can we get additional information from other translated contexts in the bitext corpus? | ||
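A very rough starting point for the question above might look like the following sketch. It assumes a hypothetical sense inventory in which each sense of the L1 term lists a few L2 glosses (the toy entry for เลือก is invented for illustration), and it simply counts which glosses turn up in the translated contexts. The names `SENSES`, `tokenize` and `guess_sense` are not part of any existing tool.

```python
import re
from collections import Counter

# Hypothetical toy sense inventory: each sense of an L1 term lists L2 glosses.
SENSES = {
    "เลือก": {
        "choose": ["choose", "chose", "chosen", "select", "pick"],
        "elect": ["elect", "elected", "vote"],
    }
}


def tokenize(text):
    """Lowercase an English context and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def guess_sense(term, l2_contexts, senses=SENSES):
    """Score each sense by gloss hits across all translated contexts.

    Returns (best_sense, scores); best_sense is None when no gloss appears.
    """
    scores = Counter()
    for context in l2_contexts:
        tokens = set(tokenize(context))
        for sense, glosses in senses.get(term, {}).items():
            scores[sense] += sum(1 for g in glosses if g in tokens)
    if not scores or max(scores.values()) == 0:
        return None, dict(scores)
    return max(scores, key=scores.get), dict(scores)
```

Passing several translated contexts for the same term is the crude analogue of drawing additional information from the rest of the bitext corpus: each extra context adds to the evidence for one sense or another.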
| Detection of parallel Web pages Some Web sites (particularly government and NGO sites) have a considerable amount of relatively high-quality parallel translation (at this point, corporate sites seem to be of much lower quality). We are interested in spidering Thai webspace to find and download likely candidate pages. | ||
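A cheap first filter for candidate pages is to look for URLs that differ only in a language marker. The sketch below assumes sites that use markers such as `th` / `en` in paths or file names; the marker list, the function names and the example URLs are illustrative assumptions, and real sites vary widely. Pairs found this way would still need a content check before being trusted.

```python
import re

# Assumed language markers, longest first; a marker must not be part of a longer word.
LANG_MARKER = re.compile(r"(?<![a-z])(english|thai|eng|en|th)(?![a-z])", re.IGNORECASE)


def url_key(url):
    """Replace language markers with a placeholder so parallel URLs collide."""
    return LANG_MARKER.sub("*", url.lower())


def candidate_pairs(urls):
    """Return (thai_url, english_url) pairs whose URLs differ only by marker."""
    groups = {}
    for url in urls:
        markers = LANG_MARKER.findall(url)
        if markers:
            lang = markers[0].lower()[:2]   # "thai" -> "th", "english" -> "en"
            groups.setdefault(url_key(url), {})[lang] = url
    return [(g["th"], g["en"]) for g in groups.values() if "th" in g and "en" in g]
```

For example, `http://example.org/th/about.html` and `http://example.org/en/about.html` reduce to the same key and are returned as a pair, while a page with no marker is ignored.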
| Web page disassembly Extracting useful (and potentially parallelizable) text content from Web pages is not as easy as it would seem. We are interested in this project as part of our larger Thai e-corpus as well. | ||
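To give a sense of the simplest possible starting point, the sketch below uses Python's standard `html.parser` module to pull visible text out of a page, dropping scripts and styles and breaking the text at block-level tags. The class name, the tag lists and the `extract_blocks` helper are assumptions made for illustration; navigation menus, boilerplate and malformed markup are exactly the parts this does not handle.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, split into blocks at block-level tags."""

    SKIP = {"script", "style", "noscript"}
    BLOCK = {"p", "div", "br", "li", "tr", "td", "th", "table", "title",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # greater than zero while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def blocks(self):
        """Return non-empty text blocks with internal whitespace collapsed."""
        text = "".join(self._chunks)
        return [" ".join(line.split()) for line in text.split("\n") if line.strip()]


def extract_blocks(html):
    """Parse an HTML string and return its visible text blocks in order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks()
```

Keeping the output as a list of blocks rather than one long string preserves the paragraph-like units that a later alignment step would want to work with.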
Phrase alignment We want to focus returned contexts on
the search query, e.g.:
| ||
| instead of returning the entire sentence(s): | ||
This problem is trivial in cases like the above, where the English context contains a word-for-word match ("to choose") for the most likely first-sense definition of the Thai word. In many cases, though, the returned context will have to be considerably larger than the 3 English words / 15 Thai characters suggested here. | ||
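The trivial case can be sketched in a few lines. The example assumes the English side is already tokenized and that the dictionary gloss for the query is known; the function name `focus_context` and the `margin` parameter are hypothetical. When the gloss is not found word for word, the sketch falls back to the whole sentence, which is where the real work of this project begins.

```python
def focus_context(tokens, gloss_tokens, margin=1):
    """Return the tokens around the first word-for-word match of the gloss.

    Falls back to the full token list when the gloss is empty or absent.
    """
    n = len(gloss_tokens)
    if n == 0:
        return tokens
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == gloss_tokens:
            start = max(0, i - margin)
            end = min(len(tokens), i + n + margin)
            return tokens[start:end]
    return tokens
```

A larger `margin` widens the returned context, which is the simplest knob to turn when a few words on either side are not enough.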
| Parallel alignment We are interested in automating parallel alignment at the sentence and paragraph level (as seen in the Bitext Corpus). |
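As a rough idea of what automation could start from, the sketch below aligns two already-segmented texts with a simplified length-based dynamic program: sentences that translate each other tend to have proportional lengths. It is not the method used to build the Bitext Corpus. The bead shapes, the penalty values and the function names are assumptions chosen for illustration, and segmenting Thai text into sentences in the first place is assumed to have been done elsewhere.

```python
import math

# Allowed bead shapes (source count, target count) with assumed penalties.
BEADS = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 2.0, (1, 2): 2.0}


def length_cost(src_len, tgt_len, ratio):
    """Penalty that grows as the observed length ratio departs from the expected one."""
    if src_len == 0 or tgt_len == 0:
        return 0.0
    return abs(math.log((tgt_len / src_len) / ratio))


def align(src, tgt):
    """Return ordered (src_indices, tgt_indices) beads covering both texts."""
    total_src = sum(len(s) for s in src) or 1
    total_tgt = sum(len(t) for t in tgt) or 1
    ratio = total_tgt / total_src          # expected target/source length ratio
    inf = float("inf")
    n, m = len(src), len(tgt)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + penalty + length_cost(s_len, t_len, ratio)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the cheapest path.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]
```

Lengths are measured in characters here, which sidesteps word segmentation for Thai; whether character length is a good enough signal for Thai and English pairs is an open question the project would have to test.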