Thai On-Line Library - Bitext Corpus
Maintained by Doug Cooper (bugs to doug@th.net)
Center for Research in Computational Linguistics, Bangkok http://seasrc.th.net


Why a Thai Bitext Corpus?
Bitext plays an important role in language education and computational linguistics research around the world. Bitexts are readily available on-line for English, French, Chinese, Swedish, German, Norwegian, and a dozen other languages (see the links below).

In education . . . Bitexts can markedly increase student reading and comprehension in a second language. Because the raw volume of text they read jumps so dramatically, students are exposed to a much wider vocabulary; when text is easier to read, students can begin to understand large-scale features of style and grammar. Bitexts have long been a mainstay of second-language education for European languages, and are equally valuable for students of English and Thai.

   Bitext search tools are a cornerstone of data-driven learning. Calling up a dozen examples of a word, phrase, or construction helps students understand and retain subtle distinctions of meaning and usage. Such searches are even more helpful in teaching writing than reading, because they let real-world experts - writers and translators - provide on-the-spot advice and examples.

In research . . . Thai IT tools are easily 10 years behind the times, and are falling further behind every year. Building good tools requires a solid foundation of data - books, articles, speeches, stories, and all other forms of text - that just isn't available for Thai.

   Bitexts are an essential part of research in translation, word-sense disambiguation, and lexicography. Because they let us leverage tools and techniques from other languages, particularly English, they are extremely important for learning how to build search engines, summarize documents, align texts, and so on in Thai.

You can help - please add texts . . . If you are an author or translator, please consider including your work in the Thai Bitext Corpus, either as:

  • A text that can be read and searched on-line in parallel translation.
  • Part of a database that can only be searched for one-line usage examples. Text will be credited or anonymous at your option.
Frequently Asked Questions:
Is any special preparation required?  No. We'll do all the necessary setup - all we need is the work either in plain text format, or in a standard word-processor format (such as .rtf).
Is my copyright affected?  No. You still retain all rights to print publication and any commercial use.
What kinds of text do you need?  Every kind. Novels, speeches, children's stories, journalism, and articles all demonstrate particular features of English and/or Thai writing.

Links on bitexts:
Jean Veronis' Bibliography on Parallel Text Processing
Michael Barlow's Parallel Corpora page.

Links on data-driven learning:
Tim Johns' Data-Driven Learning Page; particularly his Virtual DDL Library Page.
Joseph Rezeau's site; particularly his concordance page.