<aside> <img src="/icons/translate_pink.svg" alt="/icons/translate_pink.svg" width="40px" /> "The difference between the right word and the almost right word is really a large matter – it's the difference between lightning and a lightning bug."* – *Mark Twain*
</aside>
All books featured on the LitMT website were translated using large language models, specifically the GPT-3.5 family and GPT-4. We plan to add more options in the future (such as *nllb-200-distilled-600M* with post-editing). Further details on the translation methodologies can be found in the expandable sections below.
The total cost of translations so far is about $675 (excluding current and future retranslations).
Currently, the LitMT collection contains $318$ different language pairs, though not all are uploaded to the website. These include rare pairs like Catalan-to-Japanese or Serbian-to-Korean, which are difficult to evaluate. Our translation approach is passage-based, with around $1,800$ tokens per passage. Consequently, we cannot employ automatic quality-estimation metrics such as **CometKiwi**, which are typically designed for shorter sequences of up to 512 tokens. Moreover, conventional metrics such as BLEU, known for their inherent limitations (Reiter, 2018; Mathur et al., 2020; Kocmi et al., 2021), require a gold translation against which the hypothesis is compared. This presents a challenge for us: our focus is on bringing new, previously untranslated works to light, so no gold translations exist. Given these constraints, we use simplified filters to identify passages with serious issues:
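To make the passage-based setup concrete, here is a minimal sketch of splitting a text into passages of at most ~1,800 tokens. The whitespace tokenizer and the function name are our own simplification for illustration, not the actual pipeline's tokenizer:

```python
def split_into_passages(text, max_tokens=1800):
    """Split a text into consecutive passages of at most `max_tokens`
    whitespace-separated tokens (a stand-in for a real tokenizer)."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```

A 4,000-token text, for example, would yield three passages of 1,800, 1,800, and 400 tokens.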
- **langid** and **polyglot** (for Hausa), to check whether the produced translation is actually in the correct language. This also tests the models' ability to follow basic instructions, i.e., "translate from X into Y". We use langid because we noticed that other options work poorly for Mongolian, and we additionally use polyglot because it was the only option that worked reasonably for Hausa.
- Visual inspection of the uploaded translations. This won't, of course, help us identify translations of terrible quality, but it helps us filter out instances that are clearly poor even though they passed the automatic checks (e.g., examples where the model clearly mixed languages).
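The language-ID filter above can be sketched as a small helper. To keep the sketch library-agnostic, the detector is injected as a callable; in practice a wrapper around `langid.classify` or polyglot's `Detector` would fill that role, and the `toy_detect` function here is purely hypothetical:

```python
def passes_language_check(translation, expected_lang, detect):
    """Return True if the detected language of `translation` matches
    `expected_lang`. `detect` is any callable mapping a text to an
    ISO language code (e.g. a wrapper around langid or polyglot)."""
    return detect(translation) == expected_lang

# Toy detector for illustration only -- not a real language identifier.
def toy_detect(text):
    return "en" if "the" in text.lower().split() else "xx"
```

Passages whose translations fail this check can then be queued for retranslation.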
Note that we could also have tried a quality-estimation model (i.e., a model that does not require references), like the CometKiwi mentioned above, with a sliding window (since we cannot fit the entire passage). However, that would produce only approximate scores (with potential errors), and the scores may be less accurate for low-resource than for high-resource languages.
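The sliding-window idea could look roughly like this. The window size, stride, and averaging of per-window scores are our assumptions, and `score_segment` merely stands in for a QE model such as CometKiwi:

```python
def windowed_score(tokens, score_segment, window=512, stride=256):
    """Score a long passage by sliding a fixed-size window over its
    tokens and averaging the per-window scores. Overlapping windows
    (stride < window) smooth over window-boundary effects."""
    if len(tokens) <= window:
        return score_segment(tokens)
    scores = [
        score_segment(tokens[i:i + window])
        for i in range(0, len(tokens) - window + stride, stride)
    ]
    return sum(scores) / len(scores)
```

Averaging is only one aggregation choice; taking the minimum window score would instead flag the worst stretch of a passage.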
So far, we have translated a total of $18$ books,* encompassing drama and prose. The translations were done in two batches. The first batch consisted of $11$ books, translated predominantly with GPT-35-turbo (azure). The second batch, comprising $7$ books, was primarily translated with GPT-3.5-turbo-16k (openai). However, translations into Mongolian, Tamil, Hausa, Urdu, Bengali, and Hindi in the second batch were done exclusively with GPT-4 (openai). In this discussion, we focus on these initial translations by the GPT-3.5 family and GPT-4. Since then, we have begun retranslating certain passages where issues were identified, but these retranslations are not included in our current analysis.
| model | api | context | batch-1 [11 books] | batch-2 [7 books] |
|---|---|---|---|---|
| GPT-35-turbo | azure | $8$k | 11,468 chunks | 0 chunks |
| GPT-3.5-turbo-16k | openai | $16$k | 45 chunks | 7,319 chunks |
| GPT-4 | openai | $8$k | 172 chunks | 2,137 chunks |