Version 1: Received: 11 October 2021 / Approved: 13 October 2021 / Online: 13 October 2021 (11:37:27 CEST)
Version 2: Received: 7 February 2023 / Approved: 7 February 2023 / Online: 7 February 2023 (12:12:55 CET)
How to cite:
Kim, A.; Bak, Y.; Sun, J.; Lyu, S.; Lee, C. The Suboptimal WMT Test Sets and Its Impact on Human Parity. Preprints.org 2021, 2021100199. https://doi.org/10.20944/preprints202110.0199.v2
Abstract
With the advent of Neural Machine Translation (NMT), claims of human-machine parity at WMT have become more frequent, and with them grows the question of whether the evaluation environment can be trusted. In this paper, we argue that the low quality of the source test set of the WMT news track may lead to overrated human parity claims. First, we report nine types of so-called technical contaminants in the data set, which originate from a lack of meticulous inspection after web crawling. Our empirical findings show that, once these contaminants are corrected, the human parity claim turns out to be statistically invalid for about 5% of the segments that previously achieved it. This tendency becomes more evident when only the contaminated sentences are considered. To the best of our knowledge, this is the first attempt to question the “source” side of the test set as a potential cause of overclaimed human parity. As evidence for this phenomenon, we show that, according to sentence-level TER scores, these trivial errors change a good part of the system translations. We conclude that overlooking them would be a mistake, especially in NMT evaluation.
Keywords
human parity; NMT evaluation; WMT
Subject
Computer Science and Mathematics, Analysis
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.