ChatGPT is a chatbot developed by OpenAI. It is based on InstructGPT: it has been trained to respond to instructions or “prompts” written by users.

ChatGPT shows an impressive ability to provide detailed, consistent and relevant answers. It appears to be particularly good at natural language processing (NLP) tasks such as summarisation, question answering, text generation and machine translation.

However, as a very new system, ChatGPT still needs to be scientifically evaluated to compare its natural language processing performance with previous work.

To this end, Tencent AI has published a preliminary study on ChatGPT’s translation capabilities:

Is ChatGPT a good translator? A Preliminary Study by Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang and Zhaopeng Tu (Tencent AI)

The Tencent team answers this question by looking at a limited dataset. The team explained: “Getting translation results from ChatGPT is time-consuming because it can only be interacted with manually and cannot respond to large batches. Therefore, we randomly sample 50 sentences from each set for evaluation”. So let’s see what information the team gathered from those 50 sentences.
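
For illustration, the sampling step described above amounts to something like the following sketch (the file name, the one-sentence-per-line format and the fixed seed are assumptions made for the example, not details from the paper):

```python
# Randomly sample 50 sentences from a test set, as described in the study.
# File name and seed are hypothetical; the paper does not specify them.
import random

random.seed(42)

def sample_sentences(path, k=50):
    with open(path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    return random.sample(sentences, k)

subset = sample_sentences("wmt19.de-en.src.txt")  # hypothetical file name
print(len(subset))  # 50
```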

According to the paper, ChatGPT performs “comparably” to commercial machine translation (MT) solutions such as Google Translate, DeepL and Tencent’s own system on high-resource European languages, but struggles with low-resource language pairs.

For this “preliminary study”, researchers from Tencent’s artificial intelligence lab, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang and Zhaopeng Tu, evaluated translation prompts, multilingual translation and translation robustness.

Translation prompt development

When using generative language models, one of the most important steps is the design of the prompt.

We need to find an appropriate natural language formulation to query the model for our target task. Here we want ChatGPT to translate a sentence in a source language, denoted by “[SRC]”, into a target language, denoted by “[TGT]”.
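
As a rough illustration (not code from the paper), filling in such a template is a simple string substitution, shown here with the first of the prompts listed below:

```python
# Fill the [SRC] and [TGT] placeholders of a prompt template and append
# the sentences to translate. Purely illustrative; in the study ChatGPT
# was queried manually through its interface.
TEMPLATE = "Translate these sentences from [SRC] to [TGT]:"

def build_prompt(template, src_lang, tgt_lang, sentences):
    header = template.replace("[SRC]", src_lang).replace("[TGT]", tgt_lang)
    return header + "\n" + "\n".join(sentences)

print(build_prompt(TEMPLATE, "Chinese", "English", ["你好，世界。"]))
```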

To find good prompts, Tencent AI directly asked ChatGPT to give 10 prompts, with the following prompt:

Give ten concise prompts or templates that can make you translate.

ChatGPT returned 10 prompts as expected, but with only a few differences between them. The team decided to keep only the following 3 prompts, which they considered the most representative of the original 10 returned by ChatGPT:

– Prompt 1: Translate these sentences from [SRC] to [TGT]:

– Prompt 2: Answer without quotation marks. What do these phrases mean in [TGT]?

– Prompt 3: Give the [TGT] translation of these sentences:

The prompt that produced the best Chinese-English translations (prompt 3) was then used for the rest of the study – 12 directions in total between Chinese, English, German and Romanian.
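
For reference, those 12 directions are simply all ordered pairs of the four languages, as this small sketch makes explicit:

```python
# Enumerate the 12 translation directions between the four languages
# covered by the study (4 x 3 ordered pairs).
from itertools import permutations

languages = ["Chinese", "English", "German", "Romanian"]
directions = list(permutations(languages, 2))
print(len(directions))  # 12
for src, tgt in directions:
    print(f"{src} -> {tgt}")
```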

The researchers were curious to see how ChatGPT’s performance would vary depending on the language pair. While ChatGPT performed “comparably” to Google Translate and DeepL for English-German translation, its BLEU score for English-Romanian translation was 46.4% lower than Google Translate’s.

The team attributed this poor performance to the marked difference in the monolingual data for English and Romanian, which “limits the linguistic modelling capability of Romanian”.

Romanian-English translation, on the other hand, “can benefit from the strong linguistic modelling capability of English, so that the lack of parallel data resources can be somewhat compensated for”, with a BLEU score only 10.3% lower than Google Translate’s.
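
For readers who want to reproduce this kind of comparison, BLEU can be computed with the sacrebleu library. The sketch below uses made-up sentences and hypothetical scores, chosen only to show how a relative gap such as “46.4% lower” is derived:

```python
# Compute a corpus-level BLEU score with sacrebleu, then derive a
# relative gap between two systems. All values here are illustrative.
import sacrebleu

hypotheses = ["The cat sits on the mat ."]
references = [["The cat is sitting on the mat ."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")

# Hypothetical BLEU scores chosen to reproduce the 46.4% gap quoted above.
chatgpt_bleu, google_bleu = 20.0, 37.3
gap = (google_bleu - chatgpt_bleu) / google_bleu * 100
print(f"ChatGPT is {gap:.1f}% lower than Google Translate")
```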

Differences between language families

Beyond differences in resources, the authors write, translation between language families is considered more difficult than translation within the same language family. The difference in the quality of ChatGPT results for German-English and Chinese-English translations seems to confirm this.

The researchers observed an even larger performance gap between ChatGPT and commercial MT systems for low-resource language pairs belonging to different families, such as Romanian-Chinese.

“Since ChatGPT processes different tasks in a single model, low-resource translation tasks not only compete with high-resource translation tasks, but also with other NLP tasks for model capacity, which explains their poor performance,” they wrote.

Google Translate and DeepL both outperformed ChatGPT in translation robustness on two of the three test sets, WMT19 Bio (Medline abstracts) and WMT20 Rob2 (Reddit comments), likely because, as real-world applications, they are continually improved with domain-specific and noisy sentences.

However, ChatGPT outperformed Google Translate and DeepL “significantly” on the WMT20 Rob3 test set, which contains a crowdsourced speech recognition corpus. The authors believe this result suggests that ChatGPT is “capable of generating more natural spoken languages than these commercial translation systems”, and see this as a possible area for future study.

Future prospects

While machine translation models are inherently predictive, i.e. they are expected to be both accurate and precise, generative models like ChatGPT can open up new avenues for translators and the localisation industry.

The cost of content creation will fall, which means that more content will be created. This creates new demand for language services to review, adapt and certify AI-generated output. The concept of machine translation post-editing will extend to linguistic validation, cultural adaptation, tone adjustment, fact-checking and bias removal.