More and more companies use machine translation and, to save money and speed up the preparation of texts in different languages, plan only random human spot checks ‘for safety’s sake.’ Modern models for automated, reference-free translation quality estimation, such as COMETKiwi, seemed like the perfect means to that end: ‘Let the AI tell us which sentences are good and which ones need correcting.’ Companies began implementing systems that accept or reject sentences based on these scores. It sounds cost-effective. But can you really rely on AI to identify which sentences are correct?
You can’t. At least, not for now. And we will now show you why—in simple terms and with concrete, real-life examples.
This article contains terms that may be unfamiliar to those not involved in AI translation quality assessment. Here are some quick explanations:
| Abbreviation / Term | What it means |
|---|---|
| MT (Machine Translation) | A translation produced automatically by an AI system such as DeepL, Google Translate, or ChatGPT. |
| PE (post-editing) | The manual correction of a machine translation by a human translator. It can be full or partial. |
| AQE (Automated Quality Estimation) | Estimating the quality of a translation using another AI—without comparison to a reference (human) translation. It is intended to help assess which sentences ‘are good’ and which ones require correction. |
| COMETKiwi | A popular machine learning-based system for reference-free Translation Quality Estimation (TQE). It assigns each sentence a score, e.g., 0.84; the higher the score, the better the translation (at least in theory). See the minimal scoring sketch after this table. |
| MQM (Multidimensional Quality Metrics) | A professional framework for human translation quality assessment—based on error types (critical, major, minor etc.) and categories (language, terminology, mistranslation etc.). |
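For readers who want to see what such a score looks like in practice, below is a minimal sketch using the open-source `unbabel-comet` Python package and the publicly released `Unbabel/wmt22-cometkiwi-da` checkpoint; the example sentences and the values mentioned in the comments are our own illustrations.

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

# Download and load a reference-free quality estimation checkpoint.
# Note: this checkpoint is gated on Hugging Face and requires accepting
# its license (and logging in) before it can be downloaded.
model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

# Reference-free QE needs only the source text and its machine translation.
data = [
    {"src": "The device must not be immersed in water.",
     "mt": "El dispositivo no debe sumergirse en agua."},
    {"src": "Press the red button to stop the machine.",
     "mt": "Pulse el botón rojo para detener la máquina."},
]

# gpus=0 runs on CPU; each segment receives its own score.
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # per-segment scores, e.g. values around 0.84
print(output.system_score)  # the average over all segments
```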
Promises vs. Reality
In theory: ‘If AI (COMETKiwi) says a sentence has a score of 0.9, it must be correct—no need to check it!’ In practice:
Even translations that receive a high automated quality score can contain critical errors that should never reach the client.
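To make the mechanism (and the risk) concrete, here is roughly what such a score-threshold pipeline looks like; the 0.85 threshold and the data shapes are illustrative assumptions, not a recommended configuration:

```python
# A sketch of the score-threshold workflow described above, shown to
# illustrate the risk, not as a recommended practice.
AUTO_APPROVE_THRESHOLD = 0.85  # arbitrary example value

def route_segments(scored_segments):
    """Split scored segments into 'auto-approved' and 'sent to post-editing'."""
    approved, needs_pe = [], []
    for seg in scored_segments:  # seg = {"src": ..., "mt": ..., "score": float}
        if seg["score"] >= AUTO_APPROVE_THRESHOLD:
            approved.append(seg)  # skips human review entirely
        else:
            needs_pe.append(seg)
    return approved, needs_pe

# The catch: a segment with a critical error can score 0.90 and land in
# 'approved', while a perfectly fine segment scores 0.80 and gets re-checked.
```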
A Real-Life Example—20,000 Translated and Scored Segments
The enthusiasm for reference-free AI translation quality estimation is not currently supported by reliable empirical research. Extensive, statistically sound tests on this subject have been conducted by, among others, a team led by Fred Bane (Director of Data Science at TransPerfect). The research was carried out on translations for one of the largest global clients (technical content, billions of words per year), which made the results highly representative. Tests for a language pair considered one of the easiest for machine translation (English to Spanish) compared:
- COMETKiwi scores (automated)
- the actual quality of the sentences as assessed by experienced linguists (using the MQM framework).
The result? At first glance, the numbers look logical: average scores rise with quality. But a closer look at the level of individual segments (usually single sentences) blurs the picture: segments with critical errors receive scores just as high as correct ones. Links to the original presentation and a secondary, detailed analysis of the results can be found in the sources at the end of this article.
The analyses we periodically conduct with our team of localization engineers and AI specialists at Studio Gambit in many other language pairs (e.g., Polish to English, English to Polish, and English to German) confirm these observations. To date, we have found no significant correlation between AQE scores and the scores produced by estimators that use reference translations.
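If you want to run a similar sanity check on your own data, the sketch below shows one straightforward approach: compute a rank correlation between AQE scores and human MQM-based scores, and compare how clean segments and segments with critical errors are distributed. The field names and sample values are illustrative assumptions:

```python
# pip install scipy
from statistics import mean

from scipy.stats import spearmanr

# Illustrative records: each segment has an AQE score and a human verdict.
segments = [
    {"aqe": 0.91, "human_mqm": 95, "severity": "none"},
    {"aqe": 0.88, "human_mqm": 40, "severity": "critical"},
    {"aqe": 0.86, "human_mqm": 90, "severity": "none"},
    {"aqe": 0.90, "human_mqm": 35, "severity": "critical"},
    {"aqe": 0.79, "human_mqm": 88, "severity": "minor"},
]

# 1. Rank correlation between AQE scores and human scores.
rho, p_value = spearmanr([s["aqe"] for s in segments],
                         [s["human_mqm"] for s in segments])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")

# 2. Do critical-error segments actually score lower than clean ones?
clean = [s["aqe"] for s in segments if s["severity"] == "none"]
critical = [s["aqe"] for s in segments if s["severity"] == "critical"]
print(f"mean AQE, clean: {mean(clean):.2f} vs. critical: {mean(critical):.2f}")
# If the two distributions overlap heavily, no score threshold can separate
# publishable segments from dangerous ones.
```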
The conclusion, therefore, is:
Current automated quality estimation tools cannot guarantee the effective identification of text that is free of errors—including critical ones.
What Does This Mean for Your Company?
If you rely on automated quality estimation (AQE), you must accept that at least 5% of critical errors will reach the client. For companies translating medical, legal, financial, or technical content, this is an unacceptable risk. That is why professional AI translation requires a hybrid approach, combining modern AI solutions with the skills of professional specialist translators who perform full post-editing (PE).
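A quick back-of-the-envelope calculation shows what this leakage means at scale. The volume and error-rate figures below are illustrative assumptions; only the 5% miss rate comes from the research discussed above:

```python
# Illustrative assumptions: adjust the first two figures to your own project.
segments_per_year = 1_000_000  # assumed annual volume
critical_error_rate = 0.01     # assumed: 1% of MT segments contain a critical error
aqe_miss_rate = 0.05           # at least 5% of critical errors pass AQE (see above)

critical_errors = segments_per_year * critical_error_rate
leaked = critical_errors * aqe_miss_rate
print(f"Critical errors produced per year: {critical_errors:,.0f}")
print(f"Reaching the client despite AQE: at least {leaked:,.0f}")
# With these assumptions: 10,000 critical errors, of which at least 500
# reach the client every year.
```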
Common Problems with Automated Quality Estimation (AQE)
| Problem | Why it happens |
|---|---|
| Undeserved high scores | Mistranslated sentences receive good scores because the AI matches ‘patterns’ rather than understanding meaning. |
| Missing translation | The AI ‘doesn’t notice’ that a segment has not been translated, because the target text looks identical to the source. |
| Wrong language | A segment is in the wrong language (e.g., French in a German document), but the AI does not detect this; see the guardrail sketch below this table. |
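The last two failure modes in this table can be caught with cheap, deterministic guardrails that run before (or alongside) any AQE model. Here is a minimal sketch, assuming segments arrive as source/target string pairs; `langdetect` is one commonly used open-source language identifier:

```python
# pip install langdetect
from langdetect import detect, LangDetectException

def basic_guardrails(src: str, mt: str, expected_lang: str) -> list[str]:
    """Deterministic checks for problems that AQE models routinely miss."""
    issues = []

    # Missing translation: the target is empty or identical to the source.
    if not mt.strip() or mt.strip() == src.strip():
        issues.append("possibly untranslated segment")

    # Wrong language: the detected target language differs from the expected one.
    try:
        if detect(mt) != expected_lang:
            issues.append(f"target language is not '{expected_lang}'")
    except LangDetectException:
        issues.append("target language could not be identified")

    return issues

# An English 'translation' in a supposedly German document:
print(basic_guardrails("The red button stops the machine.",
                       "The red button stops the machine.", "de"))
# Expected: ['possibly untranslated segment', "target language is not 'de'"]
```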
Random PE? NOT if you care about quality
Random human checks of translations are only viable if you can accept the risk of major errors. If your content must be correct every time, the only safe solution at present is to have 100% of your content post-edited by professional translators.
Our Recommendations
At this stage, COMETKiwi and similar tools are not accurate enough for sentence-level quality assessment. There is no safe score threshold that would allow segments to be approved automatically, which is why random post-editing is a risk, not a cost-saving measure.
Here are our recommendations for managing machine translation quality:
- Avoid automated quality assessment at the sentence level—it will not reduce the risk of errors.
- Plan for quality control at the document level, not the segment level (one simple way to do this is sketched after this list).
- Invest in people, not just algorithms.
- If you cannot accept the risk of major errors, commission full post-editing in accordance with the ISO 18587 standard.
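To illustrate the document-level recommendation, here is one possible, deliberately conservative reading of it: aggregate per-segment signals only to prioritize whole documents for review, never to skip human review of individual segments. All thresholds below are illustrative assumptions:

```python
# A sketch of document-level triage under our own assumptions: per-segment
# signals are aggregated to route whole documents, never to auto-approve
# individual segments. All thresholds are illustrative, not recommendations.

def triage_document(segment_scores: list[float],
                    guardrail_issue_count: int,
                    min_score_floor: float = 0.70,    # assumed value
                    mean_score_floor: float = 0.85):  # assumed value
    """Decide how urgently a whole document needs full post-editing."""
    if guardrail_issue_count > 0:
        return "full post-editing, high priority"
    if min(segment_scores) < min_score_floor:
        return "full post-editing, high priority"
    if sum(segment_scores) / len(segment_scores) < mean_score_floor:
        return "full post-editing, high priority"
    # Still post-edited in full (per ISO 18587), just later in the queue.
    return "full post-editing, normal priority"

print(triage_document([0.91, 0.88, 0.64], guardrail_issue_count=0))
# -> 'full post-editing, high priority' (one weak segment flags the document)
```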
Not sure how to assess the quality of machine translations? Don’t want to delve into the complexities of professional AQE analysis? We will be happy to help you choose and implement a solution that considers data security, quality and cost-effectiveness—tailored to your specific languages, subject matter and industry.
👉 Write to us or get in touch with the experts at Studio Gambit.
Sources:


