Guidance to Pre-tokenization for SacreBLEU: Meta-Evaluation in Korean

SacreBLEU, by incorporating a text normalizing step in the pipeline, has been well received as an automatic evaluation metric in recent years. With agglutinative languages such as Korean, however, the metric cannot provide a conceivable result without the help of customized pre-tokenization. In this regard, this paper endeavors to examine the influence of diversified pre-tokenization schemes (word, morpheme, character, and subword) on the aforementioned metric by performing a meta-evaluation with manually constructed into-Korean human evaluation data. Our empirical study demonstrates that the correlation of SacreBLEU (to human judgment) fluctuates consistently by the token type. The reliability of the metric even deteriorates due to some tokenization, and MeCab is not an exception. Guiding through the proper usage of a tokenizer for each metric, we stress the significance of the character level and the insignificance of the Jamo level in MT evaluation. Our code is available at http://github.com/ko_sacrebleu


Introduction
For almost two decades, BLEU (Papineni et al., 2002) has played a vital role in Machine Translation (MT) evaluation as an all-time favored automatic metric, whether we actually "like" it or not. Marie et al. (2021) statistically backed up this trend, reporting that in the past decade (2010-2020), about 98.8% of MT research papers submitted to ACL regarded BLEU as a prime evaluation metric. Although we kept getting stern warnings against its use (Tan et al., 2015; Callison-Burch et al., 2006), 89% of new cutting-edge metrics that exhibited better correlation with human judgment (than other metrics, including BLEU) were never deployed in actual research to date (Marie et al., 2021). The research community opted for stabilizing it instead of exploring new ones, and the best alternative seemed to be SacreBLEU (Post, 2018).
The biggest strength of SacreBLEU was that it reduced the influence of the pre-processing scheme on the score computation, which could otherwise fluctuate upon minor changes such as the type of tokenizer, the splitting of compound nouns, the use of unknown tokens for rare words, or casing (Post, 2018). By embracing the text normalizing step in the architecture, SacreBLEU could provide more trustworthy evaluation scores.
While it was gaining weight in the literature, the trust issue remained prominent in agglutinative languages such as Korean. Languages of this typology by design require language-specific tokenization to capture semantic features, which is absent from the SacreBLEU pipeline. The rule of thumb has been to apply MeCab-ko 1 beforehand, as directed in the Workshop on Asian Translation (Nakazawa et al., 2020), but its correlation to human judgment in MT evaluation has not been officially confirmed.
In the context where Korean cannot take advantage of SacreBLEU's protective layer, we shed light on the influence of a varied pre-tokenization pipeline on the given automatic metric, which features three lexical-similarity-based types: BLEU, TER (Snover et al., 2006), and ChrF (Popović, 2015). At the same time, we share empirical lessons for applying SacreBLEU to Korean in MT evaluation, some of which are as follows. At the segment level:

1. Pre-tokenization enhances the credibility of BLEU and TER in almost all cases.

[Table 1 (columns: Level, Denomination, Particle, Ending, Affix, Example): Definition of a Korean word (Nam et al., 2019) in comparison to other meta-levels.]
2. When it comes to ChrF, there is a good chance that pre-tokenization damages its reliability.
3. The influence of the subword level is insignificant. In the worst case, segmentation at this level can even degrade the credibility of ChrF and TER.

The general definition of a word, as shown above, conjectures that it is separated by spaces. Such an assumption, however, is arguable in Korean, whose words do not always come with a space on either side.
As displayed in Table 1, there are three approaches to defining a word: comprehensive, compromising, and analytic. They diverge on whether three components (the post-positional particle, the ending, and the affix) qualify as independent elements (Nam et al., 2019). Following the comprehensive standpoint, what is typically understood as a word in Western languages is equivalent to Eojeol in Korean. The compromising perspective sees endings and affixes as not independent, while the analytic approach recognizes the self-reliance of endings. That such active discussion is possible over the morpheme boundary is what makes Korean word decomposition complex.
In this paper, we describe two peculiar aspects of the Korean script (Hangeul). In the first place, a character has a sub-layer. The English word read, for instance, is composed of four characters: r-e-a-d. The equivalent Korean word 읽 in Table 1 is likewise a single character, but at the same time it is a combination of two consonants (ㅇ, ㄺ) and one vowel (ㅣ). We call this sub-layer Jamo (ㅇ-ㅣ-ㄺ).
Secondly, Jamo are position-dependent; they occupy the fixed positions of Choseong (initial, ㅇ), Jungseong (middle, ㅣ), and Jongseong (final, ㄺ), respectively. Some affixes or morphemes take the form of a Jongseong, making diversified token scenarios between the morpheme and Jamo levels possible. An elaborate example is given in Table 2.
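The Choseong/Jungseong/Jongseong layout maps directly onto the Unicode encoding of precomposed Hangul syllables, so a Jamo-level token stream can be derived arithmetically. A minimal stdlib-only sketch (the function name is ours; a library such as python-jamo offers the same decomposition):

```python
# Decompose a precomposed Hangul syllable into its Jamo triple
# (Choseong, Jungseong, Jongseong) via Unicode arithmetic.
# Compatibility Jamo listed in standard Unicode order.
CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
JONGSEONG = [""] + "ㄱ ㄲ ㄳ ㄴ ㄵ ㄶ ㄷ ㄹ ㄺ ㄻ ㄼ ㄽ ㄾ ㄿ ㅀ ㅁ ㅂ ㅄ ㅅ ㅆ ㅇ ㅈ ㅊ ㅋ ㅌ ㅍ ㅎ".split()

def decompose(syllable: str) -> tuple:
    """Return (initial, medial, final) Jamo of one Hangul syllable."""
    code = ord(syllable) - 0xAC00          # precomposed block starts at U+AC00
    assert 0 <= code < 11172, "not a precomposed Hangul syllable"
    cho, rest = divmod(code, 21 * 28)      # 21 medials x 28 finals per initial
    jung, jong = divmod(rest, 28)
    return CHOSEONG[cho], JUNGSEONG[jung], JONGSEONG[jong]
```

For example, decompose("읽") yields the triple (ㅇ, ㅣ, ㄺ) discussed above; Jamo-level tokenization simply concatenates these triples across a sentence.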

Token Level
We define four meta-levels of segmentation for our experiment: word, morpheme, character, and subword. As discussed in Section 2.1, the particle (or Josa), the ending, and the affix are the criteria by which the token levels diverge.
• Word: We adopt the comprehensive perspective. Hence, this token level is conceptually equivalent to Eojeol, which does not consider particles, endings, or affixes as independent elements.
• Morpheme: This token level considers particles, endings, and affixes independent. The level of segmentation varies from tokenizer to tokenizer.

Tag Set
Most Korean morphological analyzers have their roots in the 21st Century Sejong Project, launched in 1998 with the aim of building a national framework for large-scale Korean corpora (21st Sejong Project, 1999). The tokenizers feature different numbers of tags derived from the Sejong tag sets, as described in Appendix A. The prototype tag set is preserved in Komoran, and similarly in MeCab and Khaiii. The tokenizer with the largest tag set is Kkma, with 56 tags; it can provide a detailed analysis of endings (Eomi). The reverse case is Okt, a tokenizer targeted at Twitter, with 19 tags. Woo and Jung (2019) report its outstanding performance with typos, emojis, and punctuation. Hannanum also features a small tag set (22 tags); its exceptionally compressed division of particle tags is noticeable, among others. As expected, the central divergence of the tag sets is observed in particles, endings, and affixes.

3 Although SPM is generally considered a subword tokenizer, this concept does not coincide with Korean linguistics. Hence, while its output takes the shape of morphemes, we categorize SentencePiece under the subword level.
4 https://github.com/bab2min/Kiwi
5 https://github.com/kakao/khaiii
6 https://github.com/JDongian/python-jamo

Tokenization Scenario
As displayed in Table 2, an exemplary sentence is extracted from our data set to show various tokenization scenarios. It accumulates all possible scenarios of the meta-levels studied in our experiment.
The instance shows that segmentation is most diversified in the verb (활보했다), with nine possibilities. It is also intriguing that some tokenizers consider a Jongseong an independent token (하, -ㄴ); such tokenizers are Hannanum, Kkma, Komoran, Khaiii, and Kiwi.

Experiment Setup
As Korean evaluation data is rarely available, we organized a human evaluation of four commercial NMT systems in the English-to-Korean translation direction with Direct Assessment (DA), the conventional human metric at the Conference on Machine Translation (Barrault et al., 2020). Subsequently, a machine evaluation with the BLEU, TER, and ChrF of SacreBLEU is performed. With the resources at hand, the correlation between the two evaluation results is computed at the segment and corpus levels.

Dataset
• Source Test Set: The original English texts are borrowed from the WMT 2020 English III-type test set, composed of 2,048 sentences (61 documents) with segment split. The segment-split format allows translators the freedom of translating beyond the sentence level (Barrault et al., 2020).
• Reference Translation: The Korean reference translation is created by a group of professional translators, who are advised not to post-edit MT output. To guarantee the highest translation quality, we double-check the final revision; a revision is implemented only if a sentence is semantically erroneous.
• System Translation: We employ four online MT models including our own, Kakao i 7. They are denominated SysA, SysB, SysP, and SysQ in order to keep them anonymous for legal reasons. The system translations were obtained on July 21, 2021.
In terms of normalizing the data, errors in the source test sets and their subsequent impact on the system translations (Kim et al., 2021) remain unaddressed. Only some minor technical issues, i.e., different single quotation marks (' and '), are normalized.

Human Evaluation
DA is a metric where an evaluator scores each sentence on a continuous scale of 0-100 in the categories of Adequacy and Fluency. We hire 25 professional translators and assign each person a HIT of roughly 300 translated sentences with the document context maintained. They are advised to consider the context when making a judgment and are allowed to reverse their previous decisions within a document.
They are either holders of a master's degree in interpretation and translation in the English-Korean language pair or freelance translators with a minimum of two years of experience. In light of the fact that all participants are new to MT evaluation, we provide a detailed guideline for the experiment. One judgment per system translation is gathered, amounting to 16,384 evaluation data points (8,192 each for Adequacy and Fluency). The judgment on Fluency is only utilized as supplementary information.

Quality Control
Out of the 8,192 Adequacy judgments, the first ten judgments of each evaluator are considered an adjusting step and are therefore removed. The scores are then normalized with judge-wise z-scores. With them, the Inter-Quartile Range (IQR) is computed as in Equation 1, where Q1 and Q3 signify the first and third quartile values. An outlier x belongs to the range that meets the condition:

IQR = Q3 − Q1,    x < Q1 − 1.5 × IQR  or  x > Q3 + 1.5 × IQR    (1)

Having removed 5.67% of the data, we base our observation on 7,727 judgments.

7 https://www.translate.kakao.com
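The quality-control pipeline (judge-wise z-scores followed by the IQR fence) can be sketched as follows; the coarse quartile computation and the function names are ours, and the 1.5 × IQR multiplier is the conventional Tukey fence assumed for Equation 1.

```python
# Judge-wise z-score normalization followed by IQR-based outlier removal.
from statistics import mean, stdev

def zscores(scores):
    """Normalize one judge's raw DA scores to z-scores."""
    m, s = mean(scores), stdev(scores)
    return [(x - m) / s for x in scores]

def iqr_filter(values):
    """Drop values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = ordered[n // 4]          # coarse quartiles; adequate for a sketch
    q3 = ordered[(3 * n) // 4]
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if lo <= v <= hi]
```

Applying iqr_filter to the pooled z-scores removes the extreme judgments, mirroring the 5.67% reduction reported above.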

Computation
The hypothesis and reference translations are tokenized by the aforementioned 11 token units without applying any additional normalization. Subsequently, the scores of the automatic metrics are computed, and their Pearson correlation coefficient r to the human Adequacy judgment is measured as follows:

r = Σi (Hi − H̄)(Mi − M̄) / √( Σi (Hi − H̄)² × Σi (Mi − M̄)² )    (2)

H and M refer to the machine and human DA scores, respectively, and H̄ and M̄ their mean values. Pearson's r measures the linear relationship between them.
During the process, several issues concerned us:

• Do we adjust the n-gram parameters?
By default, the BLEU score is a geometric mean of n-gram precisions up to four-grams. As the token units diverge, on the one hand, we attempt to avoid a circumstance where any tokenizer benefits from the n-gram parameter. On the other hand, the default word n-gram order of ChrF is zero. To make the consequence of the token unit visible and comparable, we acquire in advance the best-correlated n-gram order of each token unit to the human judgment. The n-gram order of the token unit, therefore, is specified together with the outcome.
• TER scores over 1.0? Theoretically, TER = 0 represents a perfect match between reference and hypothesis sentences, and the score grows with the number of edits. However, we detect that when a hypothesis is too short for its reference, the (sentence-level) TER score exceeds 1.0. Such cases are avoided by being normalized to 1.0 afterward.
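The capping behavior can be illustrated with a simplified sentence-level TER: word-level edit distance divided by the reference length. This is a sketch only; real TER (and the SacreBLEU implementation) additionally allows phrase shifts, which this version omits.

```python
# Simplified sentence-level TER: word-level Levenshtein distance
# normalized by reference length, with scores above 1.0 capped.

def word_edit_distance(hyp, ref):
    """Levenshtein distance over token lists (insert/delete/substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # delete h
                           cur[j - 1] + 1,           # insert r
                           prev[j - 1] + (h != r)))  # substitute
        prev = cur
    return prev[-1]

def capped_ter(hypothesis: str, reference: str) -> float:
    hyp, ref = hypothesis.split(), reference.split()
    ter = word_edit_distance(hyp, ref) / len(ref)
    return min(ter, 1.0)   # normalize scores that exceed 1.0
```

When the edit count exceeds the reference length (e.g., a severe length mismatch between hypothesis and reference), the raw ratio passes 1.0 and is clipped.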
• Is there a tie rank?
We run a Wilcoxon rank-sum test to identify the statistical significance between the ranks. First, we set a cluster boundary with the p-value, based on the assumption that two token types whose p-value is less than 0.05 (p < 0.05) are statistically different. After counting such cases for all combinations, those with the same number of counts are considered a tie.

[Table 3: Average segment-wise ranks of the token unit for SacreBLEU resampled by n samples (n = 500 to 5000) for BLEU, TER, and ChrF. The highest ranking is in bold while the lowest is marked with an asterisk (*).]
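The significance test behind the tie-breaking can be sketched with a stdlib-only rank-sum test via the normal approximation (scipy.stats.ranksums is the usual tool; this version ignores tie correction and is ours for illustration).

```python
# Two-sided Wilcoxon rank-sum test (normal approximation, no tie
# correction), used to decide whether two token types' segment-level
# scores differ significantly (p < 0.05).
from math import sqrt, erf

def ranksum_p(x, y):
    """Approximate two-sided p-value of the Wilcoxon rank-sum test."""
    pooled = sorted((v, i) for i, v in enumerate(x + y))
    ranks = {idx: r for r, (_, idx) in enumerate(pooled, 1)}
    w = sum(ranks[i] for i in range(len(x)))     # rank sum of sample x
    n1, n2 = len(x), len(y)
    mu = n1 * (n1 + n2 + 1) / 2                  # mean of W under H0
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # std of W under H0
    z = (w - mu) / sigma
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # 2 * (1 - Phi(|z|))
```

Counting, for each token type, how many pairwise comparisons fall below p = 0.05 then yields the clusters described above; token types with equal counts are tied.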
• Is the sample size enough?
To yield a credible result, we apply the bootstrap resampling suggested by Koehn (2004). Out of 7,727 sentences, different blocks of subsamples are extracted in a binary round (N out of 7,727 and M out of N) on a random basis. Iterating I times, M out of M samples are randomly selected to print the final result. Koehn (2004) reported a 95% confidence level with 394 samples (N) and near 100% with 3,000 samples when assessing MT systems with BLEU. While we follow this precedent by setting the parameters similarly, as M = 6000, N = 3000, I = 1000, we provide additional results with variations in N and I in Table 3.
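A single-round simplification of this bootstrap can be sketched as follows: resample n segment pairs with replacement and recompute the segment-level Pearson correlation on each of iters iterations. The parameter names echo N and I in the text; the collapsing of the binary round into one draw, the function names, and the data are our assumptions.

```python
# Bootstrap resampling of the segment-level Pearson correlation,
# in the spirit of Koehn (2004): resample with replacement, recompute.
import random
from math import sqrt

def _pearson(h, m):
    n = len(h)
    hb, mb = sum(h) / n, sum(m) / n
    cov = sum((a - hb) * (b - mb) for a, b in zip(h, m))
    return cov / sqrt(sum((a - hb) ** 2 for a in h) *
                      sum((b - mb) ** 2 for b in m))

def bootstrap_correlations(machine, human, n=3000, iters=1000, seed=0):
    """Draw `iters` resamples of size `n` (with replacement) and return
    the list of per-resample Pearson correlations."""
    rng = random.Random(seed)
    size = len(machine)
    results = []
    for _ in range(iters):
        idx = [rng.randrange(size) for _ in range(n)]
        results.append(_pearson([machine[i] for i in idx],
                                [human[i] for i in idx]))
    return results
```

Sorting the returned correlations and reading off the 2.5th and 97.5th percentiles gives a 95% confidence interval for each token unit.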

Experiment Result
BLEU, TER, and ChrF scores and their correlation results are analyzed at the segment and corpus levels. While finding the best pre-tokenization scheme is intriguing, our primary focus is to examine how susceptible SacreBLEU results are to the pre-tokenization scenarios. The segment-level results are presented in Table 3.

BLEU.
Without resampling, the highest correlation is observed in a broad spectrum of token units: Jamo, Kiwi, Komoran, and Character. The lowest correlation occurs when pre-tokenization is absent, which is consistently witnessed in the resampled outcome, except for one instance of Hannanum (n = 2500). It is safe to conclude, therefore, that any tokenizer can enhance the credibility of this metric, but Hannanum is not an option. We also note the moderate impact of MeCab.
TER. Before resampling, the morpheme level (Kiwi and Kkma) goes best with this metric, followed by MeCab. Considering the lowest correlation of the word level, it is reasonable to infer that pre-tokenization is essential for this metric. This trend remains valid when the data is resampled, aside from two negative cases on the Jamo level. Among the tokenizers, the positive influence of Kkma is noteworthy as the sample size grows. Moreover, MeCab seems to be a good fit for this metric in Korean. The least recommendable token unit is Jamo, not only because its correlation is low on average but also because the increased token count is markedly disadvantageous to the computational cost of this metric.
ChrF. With the full 7,727-segment data set, the optimal token unit for this metric is the morpheme (Komoran and Khaiii). Komoran is remarkably well fitted to this metric even when the data is resampled. However, the most distinctive aspect of this metric is that tokenization can be harmful. For instance, in a pilot study we find that the best-correlated word n-gram order for Kkma and Hannanum is zero, meaning that the metric correlates best when their word n-grams are not taken into consideration. All levels but the character level have a history of deteriorating the correlation of the metric. In that respect, Hannanum and Jamo produce the most unstable results. It draws our attention that when the sample size is small (n = 500), most pre-tokenization worsens its correlation, and MeCab is not an exception.
All in all, we conclude that at the segment level, the correlation of the three metrics fluctuates by the pre-tokenization scheme, but there is no direct relationship between the token shape and the correlation of the metrics. Nevertheless, if we are to draw a meaningful conclusion from the result, any pre-tokenization is better than the word level for BLEU and TER, whereas for ChrF the token unit should be carefully selected. The combination of MeCab and BLEU/TER stands out, while the popular tokenizer can be detrimental to the credibility of ChrF.
Contrary to our expectations, the subword level does not serve as a dependable token unit for Korean MT evaluation. Instead, there is a good chance that it harms the correlation of ChrF and TER. Furthermore, the expanded vocabulary size increases the computational cost exponentially. As a substitute, we highlight the effectiveness of character-level segmentation, which guarantees fast deployment without any tokenizer and, at the same time, proves to be as reliable as or often better than MeCab in all metrics.

Corpus Level
Tables 4 to 6 show the human DA scores and SacreBLEU scores of the four online systems in relation to the tokenization scheme. The noticeable finding is that in all three cases the spectrum of the score is substantial. The system ranked highest by the human z-score (SysA) obtained a 28.09 BLEU score without segmentation, but 48.71 with character-level tokenization. Likewise, the most overestimated version of TER is the one before tokenization (82.33 to 89.69), as expected, while the most underestimated scores are on the Jamo level (51.96 to 54.69). In ChrF, the corresponding range runs from 42.74 to 45.72 on the None level to 51.19 to 53.80 on the Jamo level. We thus confirm the possibility that the absolute score of SacreBLEU can nearly double just by selecting a different tokenizer.
Even so, the more severe problem is that the ranks by score, irrespective of the pre-tokenization typology, do not comply with human perception. The average human scores place the systems in the order [SysA = 1, SysB = 2, SysP = 3, SysQ = 4], but almost all automatic scores position them as [SysA = 2, SysB = 1, SysP = 3, SysQ = 4]. Moreover, in many cases the BLEU scores rank them as [SysA = 2, SysB = 1, SysP = 4, SysQ = 3], except when tokenized by MeCab, Kiwi, and Khaiii. Such unreliable performance of this metric may derive from either the small number of systems or the existence of outlier systems in the comparison (Mathur et al., 2020). We raise the issue of a questionable SacreBLEU score at the corpus level, leaving its verification to future work.

Figure 2 shows the Pearson correlation of SacreBLEU with pre-tokenization applied. The most faithful scores are achieved with Khaiii and Kiwi in all three metrics. The correlation of MeCab is also striking. Despite their discernible performance on this level, however, we reiterate that none of the options represent the system rankings as humans perceive them. In this respect, we propose the mean value of the segment-wise SacreBLEU scores as a substitute.

Related Works
Recently, word segmentation came into the limelight with the outstanding achievements of subword-level pipelines such as SPM and Byte-Pair Encoding (BPE) (Sennrich et al., 2015) in many NLP tasks (Zhang et al., 2015). In MT specifically, segmentation is tightly entangled with translation quality, mainly due to the handling of unseen vocabulary. In that respect, many studies observed that identifying the best tokenization is language-dependent. Huck et al. (2017) discovered that their model displayed the highest performance when BPE was coupled with a suffix split in German. In a similar manner, Lee et al. (2016) suggested that their fully character-level NMT model outperformed BPE models, especially in the Finnish-English pair. Domingo et al. (2018) observed that the choice of tokenizer could lead to a more refined translation quality for all languages, remarking that this phenomenon was striking in morphologically rich languages such as Japanese.
Regarding Korean, the concept of detailed segmentation has intrigued many NLP researchers in general (Park et al. 2018; Kim et al. 2020; Yongseok and Lee 2020; Park et al. 2020), for a Korean word carries a large volume of affixes and morphemes. In MT specifically, various combinations of token units have been suggested. Among them, Park et al. (2019) stated that when their NMT model was trained with a dataset of subtitles tokenized with SPM Unigram after removing post-positional particles, it obtained a higher BLEU score than with simple BPE.
While they mentioned that a smaller token unit was not always the answer in the case of Korean, recent studies have paid attention to the smallest unit, the Jamo. Moon and Okazaki (2020) introduced Jamo-Pair Encoding, combining Jamo with BPE. Eo et al. (2021) utilized Jamo from a functional viewpoint by regarding Choseong and Jungseong as one token and leaving Jongseong as another. They demonstrated that the model with such segmentation outperformed the model of Park et al. (2019).
We differ from the aforementioned studies in that we explore the impact of tokenization on MT evaluation. As the gold standard in this field is human judgment, we prioritize examining the pattern of tokenization's impact on SacreBLEU over discovering a superior token unit. Assuming that the n-gram matching of BLEU and ChrF and the edit distance of TER are exceptionally vulnerable to the lexical shape of tokens, we observe how such changes align with human perception. We believe this is a pitfall in the MT evaluation of synthetic languages 8 such as Korean.

Conclusion
Given that pre-tokenization is obligatory for agglutinative languages such as Korean when computing automatic evaluation metrics, we endeavor to demonstrate the influence of diverse token units on the credibility of SacreBLEU, covering BLEU, TER, and ChrF, at both the segment and corpus levels in the into-Korean translation direction.
For the meta-evaluation, we perform a human evaluation with 25 professional translators on system translations produced by four NMT models. When the Pearson correlation is measured, the result shows that the impact of the token type differs by metric and computation level. At the segment level, we report that any pre-tokenization enhances the correlation of BLEU and TER, while it should be carefully selected for ChrF. At the corpus level, ranking by the SacreBLEU scores turns out to be inaccurate regardless of the pre-tokenization scheme. Contrary to our expectation, the diminutive segmentation of the subword level shows signs of ineffectiveness. Instead, we put emphasis on the role of the character level.