The Suboptimal WMT Test Sets and Its Impact on Human Parity

Ahrii Kim; Yunju Bak; Jimin Sun; Sungwon Lyu; Changmin Lee

doi:10.20944/preprints202110.0199.v1

Submitted: 11 October 2021

Posted: 13 October 2021

Abstract

With the advent of Neural Machine Translation (NMT), the more often human-machine parity is claimed at WMT, the more we are led to ask whether its evaluation environment can be trusted. In this paper, we argue that the low quality of the source test set of the WMT news track may lead to overrated human parity claims. First, we report nine types of so-called technical contaminants in the data set, which originate from a lack of meticulous inspection after web crawling. Our empirical findings show that when these contaminants are corrected, the human parity claim turns out to be statistically invalid for about 5% of the segments that previously achieved it. This tendency becomes evident when only the contaminated sentences are considered. To the best of our knowledge, this is the first attempt to question the “source” side of the test set as a potential cause of overclaimed human parity. As evidence for this phenomenon, sentence-level TER scores show that these seemingly trivial errors change a good part of the system translations. We conclude that overlooking this issue would be a mistake, especially in NMT evaluation.
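To illustrate the kind of sentence-level comparison the abstract refers to, the following is a minimal sketch of a simplified TER-style score. It rests on several assumptions: the abstract does not describe the exact TER configuration used, the function names `word_edit_distance` and `simplified_ter` are hypothetical, and this version omits the block-shift operation of full TER, so it reduces to a word-level edit distance normalized by reference length. Which output serves as the reference is also an assumption; here the translation of the corrected source plays that role.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over token lists (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis token
                            curr[j - 1] + 1,      # insert a reference token
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simplified_ter(hypothesis, reference):
    """Edits per reference token; no shift operation (unlike full TER)."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    return word_edit_distance(hyp, ref) / len(ref)


# Hypothetical example: system output for the original (contaminated) source
# versus system output for the manually corrected source.
mt_from_original = "the cat sat on the mat"
mt_from_corrected = "the cat sat on a mat"
score = simplified_ter(mt_from_original, mt_from_corrected)
```

A score of 0 means the correction left the system translation unchanged; larger values indicate that a source-side fix altered more of the output. For actual evaluation, an established TER implementation would be preferable to this sketch.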

Keywords: human parity; NMT evaluation; WMT

Subject: Computer Science and Mathematics - Analysis

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
