Dive into the myths and realities of translation quality estimation and assurance through the lens of the MQM (Multidimensional Quality Metrics) methodology. MQM is a comprehensive system for assessing and monitoring the quality of translated content, and it serves as a standardized Linguistic Quality Assurance (LQA) framework for evaluating translations across a range of error categories. Assessing translations under the MQM framework can help you identify strengths in your localization process as well as opportunities to improve.
In this fireside chat, we explore the common mistakes and best practices employed to ensure top-tier linguistic quality. Discover how the MQM methodology can empower localization managers and linguists alike to minimize errors, remove subjectivity and improve their translation output.
Our experts for this session are:
- Olga Beregovaya | VP of AI and Machine Translation
- Valérie Dehant | Senior Director, Language Services
- Alex Yanishevsky | Director of AI and Machine Translation Deployments
Translation Quality: Understanding the MQM Methodology
The translation industry, like any other, thrives on quality. But how do you evaluate the quality of translations? Episode seven of Smartling’s ‘Reality Series’ provided valuable insights on translation quality, drawing on everything from machine translation (MT) and human translation (HT) to the MQM (Multidimensional Quality Metrics) framework to shed light on this complex issue.
Myth: A native speaker can evaluate quality
The speakers started off by debunking the persistent myth that any native speaker can evaluate translation quality. The measurement of 'translation quality' is indeed much more complex. In fact, quality evaluation is quite subjective, demanding a keen understanding of the context and nuances of both the source and target languages.
MQM Framework
The main topic of the session was the introduction of the MQM (Multidimensional Quality Metrics) framework. This model steps away from traditional adequacy and fluency evaluations, providing a more objective method for assessing translation quality. It still takes into account factors like adequacy, fluency, and actionability, while also encouraging blind evaluation. The speakers stressed the importance of blind evaluation in MQM, in which evaluators remain unaware of whether a translation was produced by a human or a machine, and underlined the vital role this technique plays in eliminating bias from the evaluation.
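As a concrete illustration of what a blind setup can look like in practice, here is a minimal sketch in Python; the segment texts, ids, and field names are invented for this example and are not Smartling's implementation.

```python
# Sketch of a blind evaluation round: the origin of each translation (MT vs. HT)
# is kept in a separate answer key, so evaluators never see it.
import random

segments = [
    {"id": 1, "target": "La reunión se pospuso hasta el viernes.", "origin": "MT"},
    {"id": 2, "target": "El informe se publicará la próxima semana.", "origin": "HT"},
]

random.shuffle(segments)  # remove any ordering cues

# Evaluators see only the segment id and the target text; provenance stays hidden.
evaluation_view = [{"id": s["id"], "target": s["target"]} for s in segments]
answer_key = {s["id"]: s["origin"] for s in segments}

print(evaluation_view)
```

Keeping the answer key separate until scoring is complete is what prevents evaluators from, consciously or not, grading MT output differently from human output.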
How does MQM differ from conventional methods?
Olga Beregovaya stated that it's all about the classification and quantification of 'translation errors'. In the MQM model, errors are categorized, and severity weights are assigned to calculate an overall quality score. This methodology allows us to quantify the concept of translation quality, transforming it into a numerical value that can be utilized for improvement.
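As a rough sketch of how such a score can be computed, consider the following; the severity weights, category names, and normalization here are illustrative assumptions, not Smartling's formula or the official MQM specification.

```python
# Illustrative only: the weights and scoring formula are hypothetical.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_style_score(errors, word_count, max_score=100):
    """Turn categorized errors into a single numeric quality score.

    errors: list of (category, severity) tuples logged by evaluators,
            e.g. [("accuracy/mistranslation", "major"), ("fluency/grammar", "minor")]
    """
    penalty = sum(SEVERITY_WEIGHTS[severity] for _, severity in errors)
    # Normalize by the size of the evaluated sample so long and short texts
    # are comparable, then subtract from a perfect score.
    normalized_penalty = penalty / word_count * 100
    return max(0.0, max_score - normalized_penalty)

errors = [("accuracy/mistranslation", "major"), ("fluency/grammar", "minor")]
print(mqm_style_score(errors, word_count=250))  # 97.6
```

The key idea is that every logged error subtracts a penalty proportional to its severity, normalized by the amount of text evaluated, so scores stay comparable across samples of different lengths.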
The speakers also touched on other relevant industry evaluation metrics, such as BLEU and TER, as well as quality estimation with large language models (LLMs). These tools, combined with ongoing experimentation with LLMs for quality estimation and semantic evaluation, significantly deepen our understanding of engine behavior.
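For readers who want to experiment with the reference-based metrics mentioned above, here is a minimal sketch using the open-source sacrebleu package; the example sentences are made up.

```python
# pip install sacrebleu
from sacrebleu.metrics import BLEU, TER

hypotheses = ["The cat sat on the mat."]           # machine translation output
references = [["The cat is sitting on the mat."]]  # one human reference stream

bleu = BLEU().corpus_score(hypotheses, references)
ter = TER().corpus_score(hypotheses, references)

print(bleu.score)  # higher is better: n-gram overlap with the reference
print(ter.score)   # lower is better: edit operations needed to match the reference
```

BLEU rewards n-gram overlap with the reference, while TER counts the edits needed to reach it, which is why the two scores move in opposite directions.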
Olga Beregovaya brought to light the difference between textual and semantic scoring. Textual scoring primarily considers how many characters or words would need to change to turn the translation into the reference, while semantic scoring examines the associations between the words and concepts in the sentences. She also emphasized the significance of human involvement in identifying statistical outliers and exceptions in the scores.
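A minimal sketch of the contrast, assuming the sentence-transformers package is available; the model name and example sentences are illustrative choices, not recommendations from the episode.

```python
# pip install sentence-transformers
from difflib import SequenceMatcher
from sentence_transformers import SentenceTransformer, util

reference = "The meeting has been postponed until Friday."
candidate = "They pushed the meeting back to Friday."

# Textual scoring: how much of the surface form (characters) matches.
surface_similarity = SequenceMatcher(None, reference, candidate).ratio()

# Semantic scoring: how close the two sentences are in meaning.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([reference, candidate])
semantic_similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

print(surface_similarity, semantic_similarity)
```

In a case like this, the semantic score will typically rate the pair as closer than the surface comparison does, because the wording differs far more than the meaning.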
Alex Yanishevsky raised the issue of data quality in the context of deploying large language models (LLMs). He asserted that high-quality data is fundamental and underscored the need to capture hallucinations, cases in which the model's output deviates significantly from the actual meaning of the source.
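One hedged way to operationalize that idea is a simple threshold on a semantic similarity score like the one sketched above; the function name and the 0.5 cutoff are assumptions for illustration only.

```python
# Hypothetical helper: flag a translation for human review when its semantic
# similarity to the source meaning drops below a chosen cutoff. The 0.5 value
# is an arbitrary illustration, not a recommended threshold.

def flag_possible_hallucination(semantic_similarity: float, threshold: float = 0.5) -> bool:
    """Return True when the output has drifted far enough from the source
    meaning that it should be reviewed before entering any training data."""
    return semantic_similarity < threshold

print(flag_possible_hallucination(0.32))  # True -> route to a human reviewer
```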
Arbitration and KPIs
Valérie Dehant emphasized arbitration's role in resolving disagreements among linguists and achieving consistent labeling of errors. She highlighted the pivotal role of the MQM methodology in facilitating arbitration in scenarios where conflicting error-category labels would otherwise harm model learning. Because MQM draws clear distinctions between error types, arbitration becomes more straightforward and model training can proceed smoothly.
Alex Yanishevsky remarked that Key Performance Indicators (KPIs) for machine translation and human translation depend on the purpose of the content. He cited emotional engagement, user satisfaction, conversions, and support ticket resolution as potential KPIs, depending on the content type and whether it was serviced by MT or HT.
Valérie Dehant introduced Smartling’s toolkit, which streamlines schema creation, error logging, and collaboration between evaluators through a dashboard equipped with MQM scores that provides detailed insights into errors and potential areas for improvement. This granular analysis of errors makes it easier to devise action plans for quality improvement.
The Verdict
By understanding the science behind translation quality, and by implementing the MQM framework, we can approach the evaluation of quality with a standardized, reliable method. In addition, episode seven reinforces that the combination of automation and human analysis is essential for enhancing models, identifying anomalies, and advancing the scalability of the evaluation process. Watch the episode in full above!