Preprint Article, Version 1. Preserved in Portico. This version is not peer-reviewed.

The Suboptimal WMT Test Sets and Its Impact on Human Parity

Version 1 : Received: 11 October 2021 / Approved: 13 October 2021 / Online: 13 October 2021 (11:37:27 CEST)

How to cite: Kim, A.; Bak, Y.; Sun, J.; Lyu, S.; Lee, C. The Suboptimal WMT Test Sets and Its Impact on Human Parity. Preprints 2021, 2021100199 (doi: 10.20944/preprints202110.0199.v1).

Abstract

With the advent of Neural Machine Translation (NMT), the more often human-machine parity is claimed at WMT, the more we must ask whether its evaluation environment can be trusted. In this paper, we argue that the low quality of the source test set of the WMT news track may lead to overstated human parity claims. First, we report nine types of so-called technical contaminants in the data set, which originate from a lack of meticulous inspection after web crawling. Our empirical findings show that, when these contaminants are corrected, about 5% of the segments that previously achieved a human parity claim turn out to be statistically invalid. This tendency becomes evident when only the contaminated sentences are considered. To the best of our knowledge, this is the first attempt to question the “source” side of the test set as a potential cause of overclaimed human parity. As evidence for this phenomenon, we show that, according to sentence-level TER scores, these trivial errors change a good part of the system translations. We conclude that overlooking this issue would be a mistake, especially in NMT evaluation.

Keywords

human parity; NMT evaluation; WMT

Comments (1)

Comment 1
Received: 21 October 2021
Commenter: Tom Kocmi
The commenter has declared there is no conflict of interests.
Comment: Hello, I have analyzed your results. Is it possible that your inter-annotator agreement is low and that the claims therefore cannot be concluded? Please see my full description here:
https://github.com/ahrii-kim/suboptimal_test_set/issues/2