Submitted: 14 March 2023
Posted: 14 March 2023
Abstract
Keywords:
1. Introduction

- Our proposed Error Analysis (EA) prompting outperforms standard prompting [1] at the segment level and yields human-like evaluations at both the system level and the segment level.

- When designing prompts, itemized responses work better than lengthy, detailed explanations of errors. Moreover, splitting the instruction into two steps, identifying errors and then scoring the translation, can improve evaluation stability.

- On text-davinci-003, the gain from EA prompting is observed in the zero-shot scenario rather than in the few-shot scenario, which indicates that prompt settings need to be adjusted when using other GPT models.

- Despite its good performance, we show that ChatGPT is NOT a stable evaluator and may score the same translation differently.

- It is NOT advisable to combine multiple translations into a single query, as ChatGPT tends to favor the translations that appear earlier in the input.
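
The instability noted in the contributions above can be quantified by scoring the same translation several times and looking at the spread. The following is a minimal sketch under stated assumptions: `score_fn` stands for any function that sends one query per translation to a model and returns a numeric score, and `make_fake_scorer` is a hypothetical stand-in that replays fixed scores so the example runs without an API call. Neither name comes from the paper.

```python
import itertools
import statistics
from typing import Callable, Dict, List


def score_spread(
    score_fn: Callable[[str, str, str], float],
    source: str,
    reference: str,
    translation: str,
    runs: int = 5,
) -> Dict[str, object]:
    """Score the same (source, reference, translation) triple `runs` times
    and summarize how much the returned scores vary."""
    scores: List[float] = [
        score_fn(source, reference, translation) for _ in range(runs)
    ]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "range": max(scores) - min(scores),
        "stdev": statistics.pstdev(scores),  # population standard deviation
    }


def make_fake_scorer(outputs: List[float]) -> Callable[[str, str, str], float]:
    """Hypothetical stand-in for an LLM call: replays `outputs` in a cycle."""
    replay = itertools.cycle(outputs)

    def fake_score(source: str, reference: str, translation: str) -> float:
        return next(replay)

    return fake_score


if __name__ == "__main__":
    # Made-up scores that imitate an evaluator disagreeing with itself.
    scorer = make_fake_scorer([-3, -2, -3, -1, -2])
    summary = score_spread(scorer, "source text", "reference text", "candidate", runs=5)
    print(summary)
```

A non-zero range on identical input is the behavior described above. Sending each translation in its own query, as `score_fn` does here, also avoids the ordering preference that arises when several translations share one prompt.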
2. ChatGPT As An Evaluation Metric
2.1. Experiment Setup
Dataset
Human Evaluation
Meta Evaluation
Baseline
Large Language Models
2.2. ChatGPT as a metric attains SOTA performance at the system level
- At the system level, ChatGPT achieves SOTA performance compared with existing evaluation metrics for both language pairs. However, text-davinci-003 obtains inferior results compared with other metrics. Our results are consistent with the findings of Kocmi and Federmann [1], who tested the performance of large language models on the full test set of the WMT22 metric task.
- ChatGPT and text-davinci-003 lag behind state-of-the-art metrics for En-De at the segment level. For Zh-En, while text-davinci-003 remains suboptimal, ChatGPT with EA prompting exhibits superior performance relative to all other metrics, with the exception of COMET.
2.3. Error analysis prompting with ChatGPT is better than standard prompting at the segment level
2.4. Error analysis prompting empowers ChatGPT to produce human-like evaluations
(i) ChatGPT becomes more adept at identifying errors when instructed by error analysis.
(ii) An itemized template response is better than a detailed explanation.
(iii) Separating the scoring process from error identification may improve the stability of ChatGPT.
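
The three observations above (error-analysis instruction, itemized response, separate scoring step) can be sketched as a two-query flow. This is an illustration under assumptions, not the authors' prompt: the prompt wording, the `Major errors:` / `Minor errors:` template, and the penalty weights (5 per major error, 1 per minor error) are all hypothetical choices made for this example, and the score is computed locally from the itemized reply as one possible way to keep scoring separate from error identification.

```python
import re

# Assumed penalty weights; the source text does not state them.
MAJOR_PENALTY = 5
MINOR_PENALTY = 1


def build_error_prompt(source: str, reference: str, translation: str) -> str:
    """Query 1 (hypothetical wording): ask only for an itemized error list."""
    return (
        f"Source: {source}\n"
        f"Reference: {reference}\n"
        f"Translation: {translation}\n"
        "Based on the source and reference, list the errors in the translation.\n"
        "Answer with this template and nothing else:\n"
        "Major errors:\n(1) ...\n"
        "Minor errors:\n(1) ...\n"
    )


def build_scoring_prompt() -> str:
    """Query 2 (hypothetical wording): ask for a score in a separate turn."""
    return (
        "Count the major and minor errors listed in your last response. "
        f"Deduct {MAJOR_PENALTY} points per major error and "
        f"{MINOR_PENALTY} point per minor error. "
        "A translation without errors scores 0."
    )


def score_from_itemized(response: str) -> int:
    """Count numbered items under each header of an itemized reply and
    turn the counts into a non-positive score."""
    counts = {"major": 0, "minor": 0}
    current = None
    for line in response.splitlines():
        stripped = line.strip().lower()
        if stripped.startswith("major errors"):
            current = "major"
        elif stripped.startswith("minor errors"):
            current = "minor"
        elif current is not None and re.match(r"\(\d+\)", stripped):
            counts[current] += 1
    return -(MAJOR_PENALTY * counts["major"] + MINOR_PENALTY * counts["minor"])


if __name__ == "__main__":
    # Made-up model reply that follows the itemized template.
    reply = (
        "Major errors:\n"
        "(1) 'bank' is rendered as a river bank\n"
        "Minor errors:\n"
        "(1) awkward word order\n"
        "(2) missing article\n"
    )
    print(build_error_prompt("source text", "reference text", "candidate"))
    print(build_scoring_prompt())
    print(score_from_itemized(reply))  # one major, two minor -> -7
```

The itemized template is what makes the reply easy to count, whether the counting is done by the model in the second query or by a small parser like the one above.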
3. Case Study
3.1. ChatGPT is unstable when conducting the evaluation process
3.2. ChatGPT prefers earlier inputs when provided with multiple translations
3.3. ChatGPT may directly adopt existing evaluation metrics
4. Conclusion
Limitations
Appendix A. Prompt Contexts

References
- Kocmi, T.; Federmann, C. Large Language Models Are State-of-the-Art Evaluators of Translation Quality. arXiv preprint arXiv:2302.14520 2023.
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Chi, E.; Le, Q.; Zhou, D. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 2022.
- Lu, Q.; Ding, L.; Xie, L.; Zhang, K.; Wong, D.F.; Tao, D. Toward Human-Like Evaluation for Natural Language Generation with Error Analysis. arXiv preprint 2022.
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I.; others. Language models are unsupervised multitask learners. OpenAI blog 2019.
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; others. Language models are few-shot learners. NeurIPS 2020.
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; others. Training language models to follow instructions with human feedback. arXiv preprint 2022.
- Qin, C.; Zhang, A.; Zhang, Z.; Chen, J.; Yasunaga, M.; Yang, D. Is ChatGPT a General-Purpose Natural Language Processing Task Solver? arXiv preprint 2023.
- Zhong, Q.; Ding, L.; Liu, J.; Du, B.; Tao, D. Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT. arXiv preprint 2023.
- Hendy, A.; Abdelrehim, M.; Sharaf, A.; Raunak, V.; Gabr, M.; others. How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation. arXiv preprint 2023.
- Freitag, M.; Foster, G.; Grangier, D.; others. Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation. TACL 2021.
- Rei, R.; Stewart, C.; Farinha, A.C.; Lavie, A. COMET: A Neural Framework for MT Evaluation. EMNLP, 2020.
- Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. ICLR, 2020.
- Sellam, T.; Das, D.; Parikh, A. BLEURT: Learning Robust Metrics for Text Generation. ACL, 2020.
- Freitag, M.; Rei, R.; Mathur, N.; Lo, C.k.; Stewart, C.; Avramidis, E.; Kocmi, T.; Foster, G.; Lavie, A.; Martins, A.F.T. Results of WMT22 Metrics Shared Task: Stop Using BLEU – Neural Metrics Are Better and More Robust. WMT, 2022.
- Zerva, C.; Blain, F.; Rei, R.; Lertvittayakumjorn, P.; C. De Souza, J.G.; Eger, S.; Kanojia, D.; Alves, D.; Orăsan, C.; Fomicheva, M.; Martins, A.F.T.; Specia, L. Findings of the WMT 2022 Shared Task on Quality Estimation. WMT, 2022.
- Kocmi, T.; Federmann, C.; Grundkiewicz, R.; Junczys-Dowmunt, M.; Matsushita, H.; Menezes, A. To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation. WMT, 2021.
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: a Method for Automatic Evaluation of Machine Translation. ACL, 2002.
- Ding, L.; Wang, L.; Liu, X.; Wong, D.F.; Tao, D.; Tu, Z. Understanding and Improving Lexical Choice in Non-Autoregressive Translation. ICLR, 2021.
- Popović, M. Relations between comprehensibility and adequacy errors in machine translation output. CoNLL, 2020.
- Barrault, L.; Bojar, O.; Costa-jussà, M.R.; Federmann, C.; Fishel, M.; others. Findings of the 2019 Conference on Machine Translation (WMT19). WMT, 2019.
- Anastasopoulos, A.; Bojar, O.; Bremerman, J.; Cattoni, R.; Elbayad, M.; Federico, M.; others. Findings of the IWSLT 2021 evaluation campaign. IWSLT, 2021.
- Ding, L.; Wu, D.; Tao, D. The USYD-JD Speech Translation System for IWSLT2021. IWSLT, 2021.
- Kocmi, T.; Bawden, R.; Bojar, O.; Dvorkovich, A.; Federmann, C.; Fishel, M.; Gowda, T.; others. Findings of the 2022 Conference on Machine Translation (WMT22). WMT, 2022.
- Zan, C.; Peng, K.; Ding, L.; Qiu, B.; others. Vega-MT: The JD Explore Academy Machine Translation System for WMT22. WMT, 2022.
- Specia, L.; Raj, D.; Turchi, M. Machine translation evaluation versus quality estimation. Machine translation 2010.
- Qiu, B.; Ding, L.; Wu, D.; Shang, L.; Zhan, Y.; Tao, D. Original or Translated? On the Use of Parallel Data for Translation Quality Estimation. arXiv preprint 2022.





| Language Pair | Segments | Systems | Systems Selected |
|---|---|---|---|
| En-De | 40 | 7 | Tohoku-AIP-NTT, OPPO, eTranslation, Tencent_Translation, Huoshan_Translate, Online-B, Online-A |
| Zh-En | 40 | 8 | Huoshan_Translate, WeChat_AI, Tencent_Translation, OPPO, THUNLP, DeepMind, DiDi_NLP, Online-B |

| Metrics | En-De System (%) | En-De Segment (%) | Zh-En System (%) | Zh-En Segment (%) |
|---|---|---|---|---|
| BLEU [17] | 71.43 | 3.55 | 21.43 | 14.71 |
| BERTscore [12] | 76.19 | 12.30 | 25.00 | 26.75 |
| BLEURT [13] | 76.19 | 33.44 | 57.14 | 32.76 |
| COMET [11] | 71.43 | 33.47 | 50.00 | 38.97 |
| text-davinci-003 | 42.86 | 11.86 | 53.57 | 23.08 |
| ChatGPT-EA | 76.19 | 26.40 | 60.71 | 36.73 |
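
The system-level figures in the table above are consistent with a pairwise accuracy over all system pairs: 7 En-De systems give 21 pairs and 16/21 ≈ 76.19%, and 8 Zh-En systems give 28 pairs and 17/28 ≈ 60.71%. The sketch below shows how such a meta-evaluation could be computed. It is an assumption-laden illustration: the segment-level statistic is taken here to be a Kendall-tau-style correlation that ignores ties, which the text does not confirm, and the function names and toy scores are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def pairwise_accuracy(metric: Sequence[float], human: Sequence[float]) -> float:
    """Fraction of item pairs that the metric orders the same way as the
    human scores (ties count as disagreement)."""
    pairs = list(combinations(range(len(metric)), 2))
    agree = sum(
        1
        for i, j in pairs
        if (metric[i] - metric[j]) * (human[i] - human[j]) > 0
    )
    return agree / len(pairs)


def kendall_tau_like(metric: Sequence[float], human: Sequence[float]) -> float:
    """(concordant - discordant) / (concordant + discordant), skipping ties.
    Assumed variant; other tie-handling conventions exist."""
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(metric)), 2):
        product = (metric[i] - metric[j]) * (human[i] - human[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0


if __name__ == "__main__":
    # Toy scores for four hypothetical systems, not data from the paper.
    human_scores = [1.0, 2.0, 3.0, 4.0]
    metric_scores = [1.0, 3.0, 2.0, 4.0]
    print(f"pairwise accuracy: {pairwise_accuracy(metric_scores, human_scores):.4f}")
    print(f"tau-like:          {kendall_tau_like(metric_scores, human_scores):.4f}")
```

In the toy data one of the six pairs is ordered differently by the metric and by the human scores, so the pairwise accuracy is 5/6 and the tau-like value is (5 - 1)/6.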

Scores are for Segment #38.

| Instruction | Response | Separation | sys1 | sys2 | sys3 | sys4 | sys5 | sys6 | sys7 | sys8 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Standard | Detailed | - | -3 | -2 | 0 | -3 | -1 | -3 | -1 | -2 | -15 |
| Standard | Itemized | - | -3 | -3 | -2 | 0 | -2 | -2 | -3 | -2 | -17 |
| EA | Detailed | ✗ | -1 | -1 | -3 | -1 | -1 | 0 | -1 | -2 | -10 |
| EA | Detailed | ✓ | -2 | -2 | -2 | -3 | -1 | -2 | -2 | -2 | -16 |
| EA | Itemized | ✗ | -5 | -5 | -3 | -4 | -5 | -4 | -4 | -3 | -28 |
| EA | Itemized | ✓ | -4 | -4 | -3 | -6 | -3 | -4 | -4 | -3 | -26 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).