Version 1
: Received: 14 March 2023 / Approved: 14 March 2023 / Online: 14 March 2023 (10:01:51 CET)
How to cite:
Lu, Q.; Qiu, B.; Ding, L.; Xie, L.; Tao, D. Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models: A Case Study on ChatGPT. Preprints 2023, 2023030255. https://doi.org/10.20944/preprints202303.0255.v1
APA Style
Lu, Q., Qiu, B., Ding, L., Xie, L., & Tao, D. (2023). Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models: A Case Study on ChatGPT. Preprints. https://doi.org/10.20944/preprints202303.0255.v1
Chicago/Turabian Style
Lu, Q., Qiu, B., Ding, L., Xie, L., and Tao, D. 2023. "Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models: A Case Study on ChatGPT." Preprints. https://doi.org/10.20944/preprints202303.0255.v1
Abstract
Generative large language models (LLMs), e.g., ChatGPT, have demonstrated remarkable proficiency across several NLP tasks, such as machine translation, question answering, text summarization, and natural language understanding. Recent research has shown that utilizing ChatGPT to assess the quality of machine translation (MT) achieves state-of-the-art performance at the system level but performs poorly at the segment level. To further improve the performance of LLMs on MT quality assessment, we investigated several prompting methods. Our results indicate that by combining Chain-of-Thought and Error Analysis into a new prompting method, called Error Analysis Prompting, LLMs such as ChatGPT can generate human-like MT evaluations at both the system and segment level. Additionally, we discovered some limitations of ChatGPT as an MT evaluator, such as unstable scoring and biases when multiple translations are provided in a single query. Our findings aim to provide preliminary guidance for appropriately evaluating translation quality with ChatGPT, while offering a variety of practical tips for designing prompts for in-context learning. We anticipate that this report will shed new light on advancing the field of translation evaluation with LLMs by enhancing both the accuracy and reliability of such metrics. The project can be found at https://github.com/Coldmist-Lu/ErrorAnalysis_Prompt.
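To make the idea behind Error Analysis Prompting concrete, the sketch below shows one way such a prompt could be issued to ChatGPT through the OpenAI chat API: the model is first asked to list major and minor errors (a chain-of-thought-style intermediate step) and then to derive a segment-level score from them. The prompt wording, the error weighting, the model name, and the function names here are illustrative assumptions based on the abstract, not the authors' exact template, which is available in the linked repository.

```python
# Minimal sketch of an Error-Analysis-style MT evaluation prompt.
# Illustrative only: the prompt text and scoring rule are assumptions;
# the authors' actual templates live in their GitHub repository.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def error_analysis_prompt(source: str, translation: str) -> str:
    # Ask the model to reason about errors first, then score the segment.
    return (
        "You will evaluate a machine translation.\n"
        f"Source: {source}\n"
        f"Translation: {translation}\n"
        "Step 1: List all major errors (errors that distort the meaning) and "
        "minor errors (smaller issues such as awkward wording).\n"
        "Step 2: Count the major and minor errors.\n"
        "Step 3: Give a final quality score from 0 to 100, subtracting 5 points "
        "for each major error and 1 point for each minor error."
    )


def evaluate(source: str, translation: str, model: str = "gpt-3.5-turbo") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": error_analysis_prompt(source, translation)}],
        temperature=0,  # reduce the scoring instability noted in the abstract
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(evaluate("Der Hund schläft auf dem Sofa.", "The dog sleeps on the couch."))
```

Evaluating one translation per query, as in this sketch, also sidesteps the bias the abstract reports when multiple translations are placed in a single query.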
Keywords
ChatGPT; Machine Translation
Subject
Computer Science and Mathematics, Artificial Intelligence and Machine Learning
Copyright:
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.