<aside> <img src="/icons/translate_pink.svg" alt="/icons/translate_pink.svg" width="40px" /> The difference between the right word and the almost right word is really a large matter – it's the difference between lightning ⚡ and a lightning bug 🐛. *– Mark Twain*

</aside>

I. Translation Methods


All books featured on the LitMT website were translated using large language models, specifically the GPT-3.5 family and GPT-4. We plan to add more options in the future (such as *nllb-200-distilled-600M* with post-editing). Further details on the translation methodologies can be found in the expandable sections below.
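As a rough illustration of passage-level translation with a chat-completions API (the prompt wording, temperature, and model name below are assumptions for the sketch, not the exact LitMT configuration):

```python
# Sketch of how a single passage might be submitted to a chat-completions
# API (e.g., the OpenAI or Azure OpenAI endpoints). Prompt wording, model
# name, and temperature are illustrative assumptions, not the exact
# settings used for the LitMT translations.

def build_translation_request(passage: str, src_lang: str, tgt_lang: str,
                              model: str = "gpt-3.5-turbo-16k") -> dict:
    """Assemble the request payload for translating one passage."""
    return {
        "model": model,
        "temperature": 0.3,  # assumed; lower values favor more literal output
        "messages": [
            {"role": "system",
             "content": (f"You are a literary translator. Translate the "
                         f"user's text from {src_lang} to {tgt_lang}, "
                         f"preserving style and formatting.")},
            {"role": "user", "content": passage},
        ],
    }

request = build_translation_request("Olha para o mar.", "Portuguese", "English")
# The payload would then be sent with the provider's client library, e.g.:
# response = client.chat.completions.create(**request)
```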

The total cost of translations so far is about $675 (excluding current and future retranslations).

II. Quality Checks


Currently, the LitMT collection contains $318$ different language pairs, though not all are uploaded to the website. This includes rare pairs like Catalan-to-Japanese or Serbian-to-Korean, which are difficult to evaluate. Our translation approach is passage-based, with around $1,800$ tokens per passage. Consequently, we cannot employ automatic quality estimation metrics, such as **CometKiwi**, which are typically designed for shorter sequences of up to 512 tokens. Moreover, conventional metrics such as BLEU, known for their inherent limitations (Reiter, 2018; Mathur et al., 2020; Kocmi et al., 2021), require a reference for comparison – a gold translation. This presents a challenge for us, as our focus is on bringing new, previously untranslated works to light, and thus we lack gold translations. Given these constraints, we use simplified filters to identify passages with serious issues:
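For illustration, simple reference-free checks of this kind might look like the following (the specific heuristics and thresholds below are assumptions for the sketch, not the exact LitMT filters):

```python
# Illustrative reference-free filters for flagging clearly broken passage
# translations. Thresholds and heuristics are assumptions, not the exact
# LitMT filter set.

def length_ratio_ok(source: str, translation: str,
                    low: float = 0.3, high: float = 3.0) -> bool:
    """Flag translations that are implausibly short or long."""
    if not translation.strip():
        return False
    ratio = len(translation) / max(len(source), 1)
    return low <= ratio <= high

def not_too_repetitive(translation: str, max_frac: float = 0.5) -> bool:
    """Flag degenerate output where one line dominates the passage."""
    lines = [l for l in translation.splitlines() if l.strip()]
    if len(lines) < 4:  # too few lines to judge repetition
        return True
    most_common = max(lines.count(l) for l in set(lines))
    return most_common / len(lines) <= max_frac

def passes_filters(source: str, translation: str) -> bool:
    """Combine the individual checks into a single pass/fail decision."""
    return (length_ratio_ok(source, translation)
            and not_too_repetitive(translation))
```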

We also visually inspect the uploaded translations. This won't, of course, catch every low-quality translation, but it helps us filter out instances that are clearly poor even though they passed all the checks listed above (e.g., cases where the model clearly mixed languages).

Note that we could also have tried using a quality estimation model (i.e., one that does not require references), like the CometKiwi model mentioned above, with a sliding window (as we cannot fit the entire passage). However, that would produce only approximate scores (with potential errors) and may be less accurate for low-resource than for high-resource languages.
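A sliding-window scheme of the sort described could be sketched as follows, with a hypothetical `score_window` callable standing in for a real quality-estimation model such as CometKiwi (window and stride sizes here are assumptions, and tokens are approximated by whitespace-split words):

```python
# Sketch of sliding-window quality estimation for passages that exceed a
# QE model's input limit. `score_window` is a stand-in for a real scorer
# (e.g., CometKiwi); sizes are in tokens, approximated here by
# whitespace-split words for simplicity.

from typing import Callable, List

def windows(tokens: List[str], size: int = 512,
            stride: int = 256) -> List[List[str]]:
    """Overlapping windows, so text cut at one boundary is seen twice."""
    if len(tokens) <= size:
        return [tokens]
    return [tokens[start:start + size]
            for start in range(0, len(tokens) - stride, stride)]

def passage_score(src: str, hyp: str,
                  score_window: Callable[[str, str], float]) -> float:
    """Average window-level scores into one approximate passage score."""
    src_w = windows(src.split())
    hyp_w = windows(hyp.split())
    # Pair windows positionally; a real system would align them properly,
    # which is one source of the inaccuracy noted above.
    pairs = zip(src_w, hyp_w)
    scores = [score_window(" ".join(s), " ".join(h)) for s, h in pairs]
    return sum(scores) / len(scores)
```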

III. Translation Issues


So far, we have translated a total of $18$ books, encompassing drama and prose. The translations were done in two batches. The first batch consisted of $11$ books, translated predominantly with GPT-35-turbo (azure). The second batch, comprising $7$ books, was primarily translated with GPT-3.5-turbo-16k (openai). However, translations into Mongolian, Tamil, Hausa, Urdu, Bengali, and Hindi in the second batch were done exclusively with GPT-4 (openai). In this discussion, we focus on these initial translations by the GPT-3.5 family and the translations by GPT-4. Since then, we have begun retranslating certain passages where issues were identified, but these retranslations are not included in our current analysis.

| model 🤖 | api 🖥️ | context 📕 | batch 1️⃣ [11 books] | batch 2️⃣ [7 books] |
| --- | --- | --- | --- | --- |
| GPT-35-turbo | azure | $8$k 🤷‍♀️ | 11,468 chunks | 0 chunks |
| GPT-3.5-turbo-16k | openai | $16$k | 45 chunks | 7,319 chunks |
| GPT-4 | openai | $8$k | 172 chunks | 2,137 chunks |