<aside> <img src="/icons/translate_pink.svg" alt="/icons/translate_pink.svg" width="40px" /> "The difference between the right word and the almost right word is really a large matter – it's the difference between lightning and a lightning bug."* – *Mark Twain*
</aside>
All books featured on the LitMT website were translated using large language models, specifically the GPT-3.5 family and GPT-4. We plan to add more options in the future (such as *nllb-200-distilled-600M* with post-editing). Further details on the translation methodologies can be found in the expandable sections below.
The total cost of translations so far is about $675 (excluding current and future retranslations).
Currently, the LitMT collection contains $318$ different language pairs, though not all are uploaded to the website. These include rare pairs like Catalan-to-Japanese or Serbian-to-Korean, which are difficult to evaluate. Our translation approach is passage-based, with around $1,800$ tokens per passage. Consequently, we cannot employ automatic quality-estimation metrics such as **CometKiwi**, which are typically designed for shorter sequences of up to 512 tokens. Moreover, conventional metrics such as BLEU, known for their inherent limitations (Reiter, 2018; Mathur et al., 2020; Kocmi et al., 2021), require a gold translation against which the hypothesis is compared. This presents a challenge for us: our focus is on bringing new, previously untranslated works to light, so no gold translations exist. Given these constraints, we use simplified filters to identify passages with serious issues:
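To make the passage-based setup concrete, here is a minimal sketch of splitting a text into passages of at most ~1,800 tokens. The whitespace tokenizer and the function name are our own simplification for illustration, not the actual pipeline's tokenizer:

```python
def split_into_passages(text, max_tokens=1800):
    """Split a text into consecutive passages of at most `max_tokens`
    whitespace-separated tokens (a stand-in for a real tokenizer)."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```

A 4,000-token text, for example, would yield three passages of 1,800, 1,800, and 400 tokens.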
- **langid** and **polyglot** (for Hausa), to check whether the produced translation is actually in the correct language. This also tests the models' ability to follow basic instructions, i.e., "translate from X into Y". We use langid because we noticed that other options work poorly for Mongolian, and we additionally use polyglot because it was the only option that worked reasonably for Hausa.
- Visual inspection of the uploaded translations. This won't, of course, help us identify translations of terrible quality, but it helps us filter out instances that are clearly poor even though they passed the automatic checks (e.g., examples where the model clearly mixed languages).
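The language-ID filter above can be sketched as a small helper. To keep the sketch library-agnostic, the detector is injected as a callable; in practice a wrapper around `langid.classify` or polyglot's `Detector` would fill that role, and the `toy_detect` function here is purely hypothetical:

```python
def passes_language_check(translation, expected_lang, detect):
    """Return True if the detected language of `translation` matches
    `expected_lang`. `detect` is any callable mapping a text to an
    ISO language code (e.g. a wrapper around langid or polyglot)."""
    return detect(translation) == expected_lang

# Toy detector for illustration only -- not a real language identifier.
def toy_detect(text):
    return "en" if "the" in text.lower().split() else "xx"
```

Passages whose translations fail this check can then be queued for retranslation.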
Note that we could also have tried a quality-estimation model (i.e., a model that does not require references), like the CometKiwi mentioned above, with a sliding window (since we cannot fit the entire passage). However, that would produce only approximate scores (with potential errors), and the scores may be less accurate for low-resource than for high-resource languages.
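The sliding-window idea could look roughly like this. The window size, stride, and averaging of per-window scores are our assumptions, and `score_segment` merely stands in for a QE model such as CometKiwi:

```python
def windowed_score(tokens, score_segment, window=512, stride=256):
    """Score a long passage by sliding a fixed-size window over its
    tokens and averaging the per-window scores. Overlapping windows
    (stride < window) smooth over window-boundary effects."""
    if len(tokens) <= window:
        return score_segment(tokens)
    scores = [
        score_segment(tokens[i:i + window])
        for i in range(0, len(tokens) - window + stride, stride)
    ]
    return sum(scores) / len(scores)
```

Averaging is only one aggregation choice; taking the minimum window score would instead flag the worst stretch of a passage.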
So far, we have translated a total of $18$ books,* encompassing drama and prose. The translations were done in two batches. The first batch consisted of $11$ books, translated predominantly with GPT-35-turbo (azure). The second batch, comprising $7$ books, was primarily translated with GPT-3.5-turbo-16k (openai). However, translations into Mongolian, Tamil, Hausa, Urdu, Bengali, and Hindi in the second batch were done exclusively with GPT-4 (openai). In this discussion, we focus on these initial translations by the GPT-3.5 family and GPT-4. Since then, we have begun retranslating certain passages where issues were identified, but these retranslations are not included in our current analysis.
| model | api | context | batch-1 [11 books] | batch-2 [7 books] |
|---|---|---|---|---|
| GPT-35-turbo | azure | $8$k | 11,468 chunks | 0 chunks |
| GPT-3.5-turbo-16k | openai | $16$k | 45 chunks | 7,319 chunks |
| GPT-4 | openai | $8$k | 172 chunks | 2,137 chunks |