Translating with AI isn’t enough. You need to know how.

There’s a widespread assumption among people who work with international markets: AI models improve from one generation to the next, so machine translations will improve accordingly. It sounds logical. But it’s not quite how it works.

Last week we were at the CIUTI Conference 2026 at IULM University in Milan — one of the world’s most important academic conferences on translation, interpreting, and the language industry, with over 70 speakers from 17 countries. Davide Maestroni from Landoor, together with PhD candidate Antonella Bove from Ca’ Foscari University of Venice and Paola Di Cataldo, presented the findings of a study conducted as part of the PNRR-funded doctoral research project that Landoor carries out in partnership with Ca’ Foscari.

The starting question was simple: if you change the way you ask an AI to translate — the so-called prompt — does the quality of the output change too? And does that difference hold in the same way across different model generations?

Where the research came from

The starting point wasn’t a lab. It was our daily work. During our monthly update meetings with translators, a shared perception had emerged: translations produced by newer models didn’t seem better. In some cases, they seemed worse. Not in the number of errors, but in their ability to capture style, nuance, and the communicative intent of the original text.

Rather than filing this away as subjective, we decided to verify it with scientific rigor. That’s how the study presented at IULM was born.

How the study was conducted

The corpus comprised 8 advertising texts. Advertising is one of the most demanding domains in translation, because it requires both terminological precision and creative flair. The language pairs tested were Italian-English, English-Italian, Spanish-Italian, and Italian-Spanish.

For each text, 6 progressively more detailed prompts were used: from the most generic (“Translate into Italian”) to the most complete, enriched with information about domain, specific topic, purpose, target audience, and desired style. This approach draws on Reiss and Vermeer’s functionalist translation theory: the same information a human translator would receive in a professional brief.
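As an illustration only, here is a minimal sketch of how six progressively richer prompts could be assembled from a translation brief. The `Brief` class, the `build_prompts` function, the wording of each instruction, and the order in which details are added are all assumptions made for this example; the prompts actually used in the study are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Brief:
    """Hypothetical container for the information in a translation brief."""
    target_language: str
    domain: str
    topic: str
    purpose: str
    audience: str
    style: str


def build_prompts(brief: Brief) -> list[str]:
    """Return six prompts, from generic (P1) to full brief (P6).

    Each level adds one more piece of context to the previous one.
    The order of the details is an assumption for this sketch.
    """
    base = f"Translate into {brief.target_language}."
    details = [
        f"Domain: {brief.domain}.",
        f"Topic: {brief.topic}.",
        f"Purpose: {brief.purpose}.",
        f"Target audience: {brief.audience}.",
        f"Desired style: {brief.style}.",
    ]
    # Level 0 uses no details (P1); level 5 uses all five (P6).
    return [" ".join([base, *details[:level]]) for level in range(len(details) + 1)]


# Example brief with made-up values.
example = Brief(
    target_language="Italian",
    domain="advertising",
    topic="kitchen appliances",
    purpose="persuade readers to buy",
    audience="home cooks aged 30 to 50",
    style="warm and informal",
)
prompts = build_prompts(example)
for number, prompt in enumerate(prompts, start=1):
    print(f"P{number}: {prompt}")
```

Building the prompts cumulatively keeps each level comparable with the one before it, so any change in output can be linked to the single detail that was added.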

For each text-prompt combination, 5 translations were generated, for a total of 240 texts analyzed (8 texts × 6 prompts × 5 translations). Evaluation combined a quantitative metric (Levenshtein distance, to measure how much the model varied its responses) with human qualitative analysis following the MQM framework, the international standard for translation quality assessment, with annotation by the researcher and validation by expert translators.
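To make the quantitative side concrete, here is a minimal sketch of how Levenshtein distance can express how much a set of translations varies. Averaging the distance over every pair of translations is an assumption made for this example, and the function names are hypothetical; the study's exact procedure may differ.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def mean_pairwise_distance(translations: list[str]) -> float:
    """Average Levenshtein distance over all pairs of translations.

    Higher values mean the model varied its responses more.
    """
    pairs = list(combinations(translations, 2))
    if not pairs:
        return 0.0
    return sum(levenshtein(x, y) for x, y in pairs) / len(pairs)


# Five made-up outputs for one text-prompt combination.
outputs = [
    "Taste the difference.",
    "Taste the difference!",
    "Savor the difference.",
    "Feel the difference.",
    "Taste the difference.",
]
print(mean_pairwise_distance(outputs))
```

With 5 translations per combination there are 10 pairs to compare. A mean of zero would indicate that the model returned the same text every time.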

The findings: three things that matter

The first is about the prompt. Enriching instructions with contextual information — exactly as you would in a brief for a professional translator — improves output quality. But not linearly, and not in the same way across all models. With GPT-4o, the best results are distributed across different levels of prompt detail. With GPT-5.3, the prompt that includes target audience information (P5) produces the highest scores in 5 out of 8 cases.

The second is about the generational comparison. In terms of number and severity of errors, GPT-4o and GPT-5.3 are largely on par. But there’s a significant qualitative difference: the newer model loses stylistic range. It produces more uniform translations, less capable of capturing the creative nuances of the source text. For a technical document, that might not matter. For an advertising text — where tone, rhythm, and word choice are everything — it’s a serious problem.

The third is about the errors themselves. Most serious errors recur regardless of which prompt is used. This means the prompt alone doesn’t resolve the models’ structural weaknesses: accuracy errors (Italian culinary terms left untranslated or mistranslated), register inconsistencies (switches between informal and formal address within the same text), and failure to convert units of measurement for the target market.

Why this matters to everyone

Anyone responsible for international communications in a company faces a choice today: rely entirely on automation, or work with partners who know how to manage it.

This research shows that AI-assisted translation is not an automatic process. It’s a process that requires linguistic expertise upstream — in prompt design — and downstream — in reviewing and validating the output. The newest model isn’t necessarily the best model for every type of text. And final quality depends on the skill of whoever is steering the process, not on the power of the engine.

That’s why we invest in research. Not as an academic exercise, but as the foundation of the service we deliver to our clients every day. Because our mission — breaking down language barriers for companies operating in global markets — requires powerful tools and the expertise to use them.

AI accelerates. We deliver.
