ChatGPT is a chatbot developed by OpenAI. It is based on InstructGPT: it has been trained to respond to instructions, or “prompts”, written by users.

ChatGPT shows an impressive ability to provide detailed, consistent and relevant answers. It appears to be particularly good at natural language processing (NLP) tasks such as summarisation, question answering, text generation and machine translation.

However, as a very new system, ChatGPT still needs to be evaluated scientifically so that its natural language processing performance can be compared with previous work.

To this end, Tencent AI has published a preliminary study on ChatGPT’s translation capabilities:

Is ChatGPT a good translator? A Preliminary Study by Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang and Zhaopeng Tu (Tencent AI)

The Tencent team answers this question by looking at a limited dataset. The team explained: “Getting translation results from ChatGPT is time-consuming because it can only be interacted with manually and cannot respond to large batches. Therefore, we randomly sample 50 sentences from each set for evaluation”. So let’s see what information the team gathered from those 50 sentences.
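
To get a feel for this sampling step, the snippet below draws 50 random sentences from a test set. The file name and the fixed seed are assumptions made for the sketch; the paper does not describe these details.

```python
import random

# Load a test set, one sentence per line (the file name is hypothetical).
with open("wmt_test_set.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

# Fix the seed so the 50-sentence sample is reproducible
# (an assumption; the paper does not say how sampling was seeded).
random.seed(42)
sample = random.sample(sentences, 50)

for s in sample:
    print(s)
```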

According to the paper, ChatGPT performs “comparably” to commercial machine translation (MT) solutions such as Google Translate, DeepL and Tencent’s own system on high-resource European languages, but struggles with low-resource language pairs.

For this “preliminary study”, researchers from Tencent’s artificial intelligence lab, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang and Zhaopeng Tu, evaluated translation prompts, multilingual translation and translation robustness.

Translation prompt development

When using generative language models, one of the most important steps is the design of the prompt.

We need to find an appropriate natural-language formulation with which to query the model for our target task. Here we want ChatGPT to translate a sentence from a source language, denoted “[SRC]”, into a target language, denoted “[TGT]”.

To find good prompts, the Tencent team asked ChatGPT itself to suggest ten, using the following prompt:

Provide ten concise prompts or templates that can make you translate.

ChatGPT returned ten prompts as expected, with only minor differences between them. The researchers decided to keep the following three, which they considered the most representative of the original ten:

– Prompt 1: Translate these sentences from [SRC] to [TGT]:

– Prompt 2: Answer with no quotes. What do these sentences mean in [TGT]?

– Prompt 3: Please provide the [TGT] translation for these sentences:

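To make the template idea concrete, here is a minimal sketch of how Prompt 3 might be filled in and sent to a model through OpenAI’s chat API. The model name and the example sentence are assumptions for illustration; the authors actually interacted with ChatGPT manually through its web interface.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def translate(sentences, tgt):
    # Fill the [TGT] placeholder of Prompt 3 and append the sentences.
    prompt = f"Please provide the {tgt} translation for these sentences:\n"
    prompt += "\n".join(sentences)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed stand-in; the study used the ChatGPT web UI
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Hypothetical usage: translate one Chinese sentence into English.
print(translate(["你好，世界。"], "English"))
```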

The prompt that produced the best Chinese-English translations (Prompt 3) was then used for the rest of the study, which covered 12 translation directions in total between Chinese, English, German and Romanian.

The researchers were curious to see how ChatGPT’s performance would vary depending on the language pair. While ChatGPT performed “comparably” to Google Translate and DeepL for English-German translation, its BLEU score for English-Romanian translation was 46.4% lower than Google Translate’s.

The team attributed this poor performance to the marked difference in the monolingual data for English and Romanian, which “limits the linguistic modelling capability of Romanian”.

Romanian-English translation, on the other hand, “can benefit from the strong linguistic modelling capability of English, so that the lack of parallel data resources can be somewhat compensated for”, with a BLEU score only 10.3% lower than Google Translate’s.
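
BLEU comparisons like these are typically computed with a tool such as sacreBLEU. The sketch below shows the mechanics on invented sentences; the scores at the end are illustrative numbers only, not the study’s results.

```python
import sacrebleu

# Invented system outputs and reference translations.
hypotheses = ["The cat sits on the mat.", "He reads a book."]
references = [["The cat is sitting on the mat.", "He is reading a book."]]

# corpus_bleu takes the hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")

# A "46.4% lower" score is a relative difference between two BLEU values:
chatgpt_bleu, google_bleu = 26.8, 50.0  # illustrative numbers only
print(f"Relative drop: {(google_bleu - chatgpt_bleu) / google_bleu:.1%}")
```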

Differences between language families

Beyond differences in resources, the authors write, translation between language families is considered more difficult than translation within the same language family. The difference in the quality of ChatGPT results for German-English and Chinese-English translations seems to confirm this.

The researchers observed an even larger performance gap between ChatGPT and commercial MT systems for low-resource language pairs belonging to different families, such as Romanian-Chinese.

“Since ChatGPT processes different tasks in a single model, low-resource translation tasks not only compete with high-resource translation tasks, but also with other NLP tasks for model capacity, which explains their poor performance,” they wrote.

Google Translate and DeepL both outperformed ChatGPT in translation robustness on two of the three test sets, WMT19 Bio (Medline abstracts) and WMT20 Rob2 (Reddit comments), likely because, as real-world applications, they are continually improved with domain-specific and noisy sentences.

However, ChatGPT outperformed Google Translate and DeepL “significantly” on the WMT20 Rob3 test set, which contains a crowdsourced speech recognition corpus. The authors believe this result suggests that ChatGPT is “capable of generating more natural spoken languages than these commercial translation systems”, pointing to a possible area for future study.

Cost aspects

The cost of using models like GPT for translation should also be taken into account. While GPT-3.5 Turbo is found to be cheaper than Google Translate and DeepL, GPT-4 is more expensive than other available translation services. The lower cost of GPT-3.5 Turbo might be attributed to economies of scale from its heavy chat traffic, as well as to a smaller, more efficient model serving chat-based tasks. The higher cost of GPT-4, on the other hand, can be attributed to its larger size and complexity.
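
As a back-of-the-envelope illustration of that comparison, the sketch below estimates what translating a 100,000-character document might cost. All rates and the characters-per-token rule of thumb are placeholder assumptions, not quoted pricing; check each provider’s current price list before relying on such figures.

```python
# Placeholder prices for illustration only; real pricing changes over time.
GPT35_PER_1K_TOKENS = 0.002   # USD per 1K tokens (input and output combined)
GPT4_PER_1K_TOKENS = 0.06     # USD per 1K tokens
GOOGLE_PER_1M_CHARS = 20.0    # USD per million characters

doc_chars = 100_000
doc_tokens = doc_chars / 4     # rough rule of thumb: ~4 characters per token
total_tokens = doc_tokens * 2  # assume output roughly as long as the input

print(f"GPT-3.5 Turbo: ${total_tokens / 1000 * GPT35_PER_1K_TOKENS:.2f}")
print(f"GPT-4:         ${total_tokens / 1000 * GPT4_PER_1K_TOKENS:.2f}")
print(f"Google:        ${doc_chars / 1_000_000 * GOOGLE_PER_1M_CHARS:.2f}")
```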

The performance of large language models (LLMs) on low-resource languages is also considered. The models are found to perform significantly worse on these languages. However, generating synthetic data is suggested as a potential way to improve their performance.
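
One common way to realise that suggestion is back-translation: machine-translate monolingual text in a high-resource language into the low-resource one, then pair the output with the original to form synthetic training data. Below is a minimal sketch under the same API assumptions as the earlier prompt example; the sentences and the Romanian-English direction are chosen purely for illustration.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def back_translate(sentence, tgt="Romanian"):
    # Reuse the Prompt 3 template to get a [TGT] translation.
    prompt = f"Please provide the {tgt} translation for these sentences:\n{sentence}"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model, as in the earlier sketch
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Invented monolingual English sentences become the target side; the
# machine-generated Romanian becomes the synthetic source side.
monolingual_english = [
    "The weather was unusually warm for October.",
    "She finished the report before the deadline.",
]
synthetic_pairs = [(back_translate(s), s) for s in monolingual_english]

for src, tgt in synthetic_pairs:
    print(f"{src}\t{tgt}")
```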

Future prospects: what is the most effective approach?

While machine translation models are inherently predictive, i.e. they are expected to be both accurate and precise, generative models like ChatGPT can open up new avenues for translators and the localisation industry.

The cost of content creation will fall, which means that more content will be created. This creates new demand for linguistic services to review, adapt and certify AI output. The concept of machine translation post-editing will extend to linguistic validation, cultural adaptation, tone adjustment, fact-checking and bias removal.

As for the future direction of building translation systems, the most effective approach might be to take a general-purpose language model and subsequently fine-tune it for specific translation tasks. This approach leverages the broad training of a general-purpose model, which has been exposed to diverse linguistic structures and vocabulary, and then narrows its focus to translation.
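
Here is a minimal sketch of that recipe, using Hugging Face transformers to fine-tune a small general-purpose language model on translation-formatted examples. The model choice, the prompt format and the two-sentence corpus are all placeholder assumptions; a real fine-tune would need far more data.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # stand-in for a general-purpose language model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny invented parallel corpus, formatted as translation prompts.
pairs = [("Guten Morgen.", "Good morning."),
         ("Wie geht es dir?", "How are you?")]
texts = [f"Translate German to English:\n{de}\n{en}{tokenizer.eos_token}"
         for de, en in pairs]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = Dataset.from_dict({"text": texts}).map(
    tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mt-finetune",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```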

One of the challenges of using general-purpose models like GPT for translation is the occasional production of odd outputs. For example, the models may output transliterated text in addition to the actual translation. This could be a result of the training data, which often includes transliterations, especially for languages that use non-Latin scripts. These anomalies can lead to incorrect translations if not addressed appropriately.

One proposed solution is to apply post-edit checks: adding a layer on top of these models to manage and control their outputs, thereby improving translation quality. Developing quality estimation metrics, which assess the quality of model outputs, can also help identify and correct these anomalies.
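
As a toy example of such a post-edit check, the snippet below flags output lines whose characters do not belong to the expected target script, which would catch a stray transliteration mixed into, say, a Chinese translation. The heuristic and the 0.5 threshold are assumptions for illustration, not a method from the discussion above.

```python
import unicodedata

def fraction_in_script(text, script_keyword="CJK"):
    """Fraction of alphabetic characters whose Unicode name contains
    the given script keyword (e.g. "CJK" for Chinese ideographs)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for c in letters if script_keyword in unicodedata.name(c, ""))
    return hits / len(letters)

# Hypothetical model output: a Chinese translation with a stray
# transliterated (pinyin) line appended.
output_lines = ["你好，世界。", "ni hao, shi jie"]

for line in output_lines:
    # Flag lines where fewer than half the letters are CJK characters.
    ok = fraction_in_script(line, "CJK") >= 0.5
    print(("ok  " if ok else "FLAG") + "\t" + line)
```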

Looking ahead, then, machine translation may well move towards training a general-purpose language model first and fine-tuning it for translation afterwards.