Dual-Source Transformer Model for Neural Machine Translation with Linguistic Knowledge

Incorporating source-side linguistic knowledge into the neural machine translation (NMT) model has recently achieved impressive performance on machine translation tasks. One popular method is to generalize the word embedding layer of the encoder to encode each word together with its linguistic features. The other is to change the architecture of the encoder to encode syntactic information. However, the former cannot explicitly balance the contributions of the word and its linguistic features, while the latter cannot flexibly utilize various types of linguistic information. Focusing on these issues, this paper proposes a novel NMT approach that models the words in parallel with the linguistic knowledge by using two separate encoders. Compared with the single-encoder NMT model, the proposed approach additionally employs a knowledge-based encoder dedicated to encoding linguistic features. Moreover, it shares parameters across the encoders to enhance the model's ability to represent the source-side language. Extensive experiments show that the approach achieves significant improvements of up to 2.4 and 1.1 BLEU points on the Turkish→English and English→Turkish machine translation tasks, respectively, which indicates that it better utilizes the external linguistic knowledge and effectively improves machine translation quality.


Introduction
Neural machine translation (NMT) models based on the encoder-decoder architecture are widely used for high-resource language pairs [1][2][3][4][5]. The NMT model employs the encoder to map the source sentence to a dense representation vector, and then feeds the resulting vector to the decoder to produce the desired target sentence. In recent years, by exploiting advanced neural network mechanisms such as gating [2] and attention [3], the NMT model has surpassed the previously dominant statistical machine translation (SMT) model [6] and achieved state-of-the-art performance on many well-known machine translation tasks.
Instead of explicitly modeling linguistic features as the SMT model does, the NMT model learns translation knowledge directly from bilingual parallel sentences. Shi et al. showed that the encoder captures both local and global syntactic information of the source language, and that different types of source syntax are stored in different layers with varying degrees of concentration [7]. However, they also pointed out that the NMT model cannot capture in-depth syntactic or semantic information of the source-side language. It is therefore difficult to perform machine translation on low-resource language pairs, especially for morphologically-rich languages with a considerable vocabulary and complex morphology. For the Turkish→English machine translation task, the main problem is that the bilingual parallel sentences are far from sufficient and cannot provide enough linguistic information for the NMT model to effectively learn the relationship between these two distinct languages. Moreover, the source-side Turkish is a morphologically-rich language with derivational morphology [8]. It has a large vocabulary even in a low-resource training corpus, which makes translation difficult and leads to many inaccurate results [9][10]. For the English→Turkish machine translation task, many English words share the same surface form but have different word types and meanings, which correspond to different translations. We consider external linguistic knowledge beneficial for enhancing the NMT model's representation of the source-side language, and exploit three types of features. The first is the lemma, which is widely used in information retrieval tasks. Since the vocabulary of a morphologically-rich language is large, lemmatization improves generalization by allowing the inflected and morphological variants of the same word to share their representations. The second is the part-of-speech (POS) tag, which provides the syntactic role of each word in context and is helpful for extracting information and reducing ambiguity. The third is the morphological tag. Since different word types have diverse sets of morphological features, morphological analysis can effectively reduce data sparseness.
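To make these three feature types concrete, the sketch below annotates a single Turkish word. The token and its analysis are a hand-picked illustration; the tag names are examples rather than Zemberek's actual output format.

```python
# Illustrative annotation of one Turkish word with the three feature types.
# The tag inventory here is hypothetical; the actual labels depend on the
# morphological analyzer being used.
token = {
    "word":  "evlerinde",      # surface form, roughly 'in his/her houses'
    "lemma": "ev",             # 'house'
    "pos":   "Noun",
    "morph": "A3pl|P3sg|Loc",  # plural, 3rd-person possessive, locative case
}
```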
Recently, incorporating source-side linguistic knowledge into the NMT model has achieved impressive performance on machine translation tasks. One popular method, proposed by Sennrich and Haddow, generalizes the word embedding layer of the encoder to encode each word and its linguistic features without changing the other parts of the conventional NMT model [11]. However, since their method reduces the word embedding size to accommodate the additional input features, the model lacks an effective mechanism to balance the contributions of the word and its linguistic features. Instead, we propose an NMT approach that models the word and its linguistic knowledge separately. Another method, proposed by Currey and Heafield, changes the architecture of the encoder to encode syntactic information [12]. They incorporate the source-side syntax into the NMT model by employing two separate encoders for the word sequence and its linearized parse. However, since their method combines the vector representations of words and syntax at the sentence level, it lacks the flexibility to utilize other types of linguistic information. Instead, we employ the additional encoder to incorporate various linguistic features at the word level.
Focusing on the above limitations, this paper proposes a novel NMT approach that models the words in parallel with the linguistic knowledge by using two separate encoders. Compared with the single-encoder NMT model, the proposed approach follows the multi-source framework [13] and additionally employs a knowledge-based encoder dedicated to encoding linguistic knowledge. This is achieved by generalizing its word embedding layer to allow for an arbitrary number and type of linguistic features. Moreover, the proposed approach shares all parameters across the encoders to enhance the model's ability to represent the source-side language. Experimental results show that the approach achieves significant improvements on both Turkish→English and English→Turkish machine translation tasks.

Related Work
Many researchers have shown great interest in explicitly utilizing source-side syntactic and linguistic information as prior knowledge to improve their models. Eriguchi et al. modified the NMT model by building a tree-based encoder following the source-side parse tree [14]. Their model has an attention mechanism that allows the model to align both the input words and the input phrases with the output words. Yang et al. extended this work by encoding each node of the syntactic tree with both local and global context information, and presented a weighted variant of the attention model to adjust the proportion of conditional information [15]. Li et al. linearized the source-side syntactic structure into a label sequence and combined it with the word sequence, which enables the model to automatically learn useful information [16]. Aqlan et al. integrated linguistic features on top of the word surface form into the translation model of SMT, and iteratively trained the model to find the most optimized parameters [17]. Li et al. introduced a knowledge-aware approach that jointly models the words and the linguistic features, and added an attention gate to control the contribution of each encoder [18].
The multi-source NMT model was first proposed by Zoph and Knight for multilingual translation [13]. It is a many-to-one setting of the multi-task learning (MTL) approach [19]. The model consists of multiple encoders, one per source language, and combines the resulting sentence representations before feeding them into the decoder. Based on the multi-source NMT model, Currey and Heafield incorporated source-side syntax into the NMT model by employing two separate encoders, one for the word sequence and the other for the linearized parse [12]. Junczys-Dowmunt and Grundkiewicz employed a Transformer model with two encoders that ties word embeddings and shares parameters across encoders for the automatic post-editing task [20].
Moreover, many semantic parsing tasks also benefit from linguistic knowledge and the multi-source framework. Liu et al. used the linguistic information of lemmas under the conventional NMT model [21]. Duong et al. utilized a multi-source NMT model with multiple encoders to represent the languages [22]. van Noord et al. investigated exploiting linguistic knowledge in a multi-encoder setup for neural semantic parsing [23]. Different from previous models that simply combine the additional linguistic features, the proposed approach provides a reasonable and flexible way to incorporate external linguistic knowledge into the NMT model. It generalizes the word embedding layer of the knowledge-based encoder to allow for an arbitrary number and type of linguistic features to enrich the word representations.

Transformer Model
The Transformer model with the self-attention mechanism is used in the proposed approach [5]. The model consists of an encoder and a decoder. The encoder maps the source sequence $x = (x_1, \ldots, x_n)$ to a continuous representation vector $z = (z_1, \ldots, z_n)$ by employing the multi-head attention component. The decoder then produces the target sequence $y = (y_1, \ldots, y_m)$ based on all the previously generated symbols, the representation vector $z$, and an attention model.
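As a reference point for the components named above, here is a minimal NumPy sketch of scaled dot-product attention, the operation at the core of the multi-head attention component. Shapes and names are illustrative, not the OpenNMT-tf implementation.

```python
# Minimal NumPy sketch of scaled dot-product attention, the core of the
# Transformer's multi-head attention component (Vaswani et al., 2017).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (seq_len, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # (seq_len, d_k)

# Toy usage: a 3-token sequence with d_k = 4, attending to itself.
x = np.random.randn(3, 4)
out = scaled_dot_product_attention(x, x, x)          # self-attention
print(out.shape)  # (3, 4)
```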

Knowledge-based Encoder
Inspired by the NMT model proposed by Sennrich and Haddow [11], we generalize the input embedding layer of the knowledge-based encoder to incorporate external linguistic features into the NMT model. The architecture of the knowledge-based encoder is shown in Figure 1.
As in [11], the representation in the knowledge-based encoder is built from the concatenated embeddings of all input features:

$h_j = \tanh\left(W \left(\bigcup_{k=1}^{|F|} E_k x_{jk}\right) + U h_{j-1}\right)$

where $\bigcup$ is a vector concatenation operator, $E_k$ is the embedding matrix of the $k$-th feature, $|F|$ is the number of features, and $W$ and $U$ are weight matrices. The length of the concatenated feature embedding vector in the knowledge-based encoder equals the source word embedding size.
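A minimal sketch of this generalized embedding layer follows, assuming the per-feature embedding sizes reported later for Turkish (lemma 352, POS 64, morphological tag 96, summing to the 512-dimensional word embedding); the vocabulary sizes and variable names are illustrative.

```python
# Illustrative sketch of the generalized input embedding layer: each token's
# embedding is the concatenation of per-feature embeddings whose sizes sum
# to the source word embedding size (512).
import numpy as np

rng = np.random.default_rng(0)
VOCAB = {"lemma": 50000, "pos": 32, "morph": 1000}   # illustrative vocab sizes
DIMS  = {"lemma": 352,   "pos": 64, "morph": 96}     # 352 + 64 + 96 = 512

# One embedding matrix E_k per linguistic feature.
E = {k: rng.normal(size=(VOCAB[k], DIMS[k])) for k in VOCAB}

def embed_token(feature_ids):
    """feature_ids: dict mapping feature name -> integer id for one token."""
    return np.concatenate([E[k][feature_ids[k]] for k in DIMS])  # (512,)

vec = embed_token({"lemma": 17, "pos": 3, "morph": 42})
print(vec.shape)  # (512,)
```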

Dual-Source Transformer Model
Inspired by the work of Zoph and Knight [13], the multi-source NMT model is employed to combine the vector representations of the word sequence and its corresponding linguistic features. Since the proposed model consists of two separate encoders, it can be treated as a dual-source model. The architecture of the proposed dual-source Transformer model is shown in Figure 2.
The combined hidden state concatenates the hidden states from the two encoders and applies a linear projection followed by a $\tanh$ nonlinearity [13]:

$h = \tanh\left(W_c [h_1; h_2]\right)$

where $W_c$ is a weight matrix. The cell state is the sum of the two cell states from each encoder:

$c = c_1 + c_2$
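A hedged sketch of this combination, following the formulas above from Zoph and Knight; dimensions and initialization are illustrative.

```python
# Sketch of the multi-source combination: concatenate the two encoders'
# hidden states and project, h = tanh(Wc [h1; h2]); sum the cell states,
# c = c1 + c2.
import numpy as np

rng = np.random.default_rng(0)
d = 512
Wc = rng.normal(scale=0.02, size=(d, 2 * d))    # combination weight matrix

def combine(h1, c1, h2, c2):
    """h*, c*: hidden/cell state vectors of shape (d,) from each encoder."""
    h = np.tanh(Wc @ np.concatenate([h1, h2]))  # concatenate, project, squash
    c = c1 + c2                                 # sum of the two cell states
    return h, c

h, c = combine(rng.normal(size=d), rng.normal(size=d),
               rng.normal(size=d), rng.normal(size=d))
print(h.shape, c.shape)  # (512,) (512,)
```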

Linguistic Knowledge for Turkish and English
For the Turkish→English machine translation task, the Java toolkit Zemberek with morphological disambiguation [24] is utilized to annotate the Turkish linguistic features: lemma, POS tag, and morphological features. Each word's morphological features are concatenated to form its morphological tag. For the English→Turkish machine translation task, the Python toolkit NLTK is utilized to annotate the English linguistic features: lemma and POS tag.
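For the English side, a minimal sketch of what the NLTK-based annotation could look like: POS tagging followed by lemmatization. The tag-mapping heuristic and the pipeline details here are our assumptions, not necessarily the paper's exact setup.

```python
# Hedged sketch of English annotation with NLTK: POS-tag each word, then
# lemmatize it. Turkish annotation with Zemberek is analogous but uses the
# Java toolkit and is not shown here.
import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads (uncomment on first run):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger"); nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

def to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS for lemmatization."""
    return {"J": "a", "V": "v", "R": "r"}.get(tag[0], "n")

sentence = "The talks were held in Ankara"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
features = [(w, tag, lemmatizer.lemmatize(w.lower(), to_wordnet_pos(tag)))
            for w, tag in tagged]
print(features)  # e.g. [('The', 'DT', 'the'), ('talks', 'NNS', 'talk'), ...]
```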
We use the BPE technique [25] to segment both the words and their lemmas into sub-word units, and append "@@" to each non-final sub-word. We annotate the segmented lemma sequence with the other linguistic features by copying the original word's feature value to all the sub-word units of its lemma, so that all the linguistic feature sequences have the same length. Examples of the word sequence and its corresponding linguistic feature sequences in the proposed dual-source model for Turkish and English are shown in Table 1 and Table 2, respectively.
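A minimal sketch of this feature-copying step under the stated convention (every sub-word inherits the original word's feature value); the function name and data layout are illustrative.

```python
# Align feature sequences to BPE sub-word units: each sub-word of a
# segmented lemma inherits the original word's feature value, so every
# feature sequence keeps the same length as the sub-word sequence.
def copy_features_to_subwords(bpe_lemmas, word_features):
    """bpe_lemmas: list of sub-word lists, one list per original word.
    word_features: list of feature values, one per original word."""
    out = []
    for subwords, feat in zip(bpe_lemmas, word_features):
        out.extend([feat] * len(subwords))   # copy the value to each unit
    return out

# Toy usage: a word split into two sub-word units repeats its POS tag.
bpe = [["trans@@", "lation"], ["task"]]
pos = ["NN", "NN"]
print(copy_features_to_subwords(bpe, pos))  # ['NN', 'NN', 'NN']
```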

Data Preparation
Following Sennrich et al. [26], we merge the WIT corpus [27], which consists of TED talks, and the SETimes corpus [28], which consists of news, as the training corpus; merge dev2010 and tst2010 as the validation corpus; and use tst2011, tst2012, tst2013, and tst2014 as the test corpora. The detailed training and validation corpus statistics of the Turkish-English machine translation task are shown in Table 3.

Training Parameter
We implemented the proposed dual-source Transformer model using the OpenNMT-tf toolkit. Both the encoder and the decoder have 6 layers. The number of hidden units is 512. The number of heads for self-attention is 8. Both the source and target word embedding sizes are 512, and the number of hidden units in the feed-forward layers is 1024. The batch size is 48 sentences. The maximum sentence length is 100. The label smoothing is 0.1. The dropout rate of the Transformer is 0.1. The length penalty is 0.6, and the gradient clipping threshold is 5.0 [29]. The parameters are uniformly initialized in [-0.1, 0.1]. We train the model for 120,000 steps using the Adam optimizer [30] with a learning rate of 0.0002, and we report the result of averaging the 5 last saved checkpoints (saved every 5,000 steps). Decoding is performed by beam search with a beam size of 5.
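A small sketch of the checkpoint-averaging step mentioned above: the final parameters are the element-wise mean over the last saved checkpoints. Checkpoints are represented as plain name-to-array dicts for illustration; OpenNMT-tf ships its own averaging utility, which is not reproduced here.

```python
# Average model parameters over several checkpoints (element-wise mean).
import numpy as np

def average_checkpoints(checkpoints):
    """checkpoints: list of dicts mapping variable name -> np.ndarray."""
    names = checkpoints[0].keys()
    return {n: np.mean([ckpt[n] for ckpt in checkpoints], axis=0)
            for n in names}

# Toy usage with two 'checkpoints' of a single 2x2 weight.
c1 = {"w": np.ones((2, 2))}
c2 = {"w": 3 * np.ones((2, 2))}
print(average_checkpoints([c1, c2])["w"])  # all entries 2.0
```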
We normalize and tokenize the training data, and we use BPE to segment the words and lemmas in Turkish and English by learning separate vocabularies with 32K merge operations. Moreover, we report the case-sensitive tokenized BLEU [31] score and the ChrF3 [32] score to evaluate the translation performance. The model parameters for the Turkish→English (TR-EN) and English→Turkish (EN-TR) machine translation tasks are shown in Table 4.
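For evaluation, here is a hedged sketch using the sacrebleu library, one common way to compute these metrics (not necessarily the exact tooling used in the paper): corpus-level BLEU, and chrF with beta set to 3 for ChrF3.

```python
# Score a set of hypotheses against references with sacrebleu.
from sacrebleu.metrics import BLEU, CHRF

hyps = ["the talks were held in ankara"]
refs = [["the talks took place in ankara"]]   # one reference stream

bleu = BLEU()
chrf3 = CHRF(beta=3)                          # ChrF3: recall-weighted chrF
print(bleu.corpus_score(hyps, refs).score)
print(chrf3.corpus_score(hyps, refs).score)
```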

Baseline Models
• NMT baseline: The Transformer model is employed as our NMT baseline [5]. The model utilizes a single encoder to encode the word sequence.
• Single-source model: We employ the single-source model with linguistic knowledge for comparison [11]. We keep the total embedding size of the linguistic features fixed at 512, the same as the source word embedding size. More specifically, the embedding sizes of the lemma, POS tag, and morphological tag for Turkish are 352, 64, and 96, and the embedding sizes of the lemma and POS tag for English are 384 and 128, matching the settings of the proposed dual-source model. In consideration of model efficiency, we only keep the top 50K most frequent lemmas for model training [33].

Results
Experimental results of the Turkish→English (TR-EN) and English→Turkish (EN-TR) machine translation tasks are shown in Table 5 and Table 6, respectively. For the Turkish→English task, we can see from Table 5 that the proposed dual-source model outperforms both the NMT baseline and the single-source model. It achieves the highest BLEU and ChrF3 scores on all the test datasets, which indicates that the approach effectively improves machine translation quality. Moreover, the dual-source model achieves its largest improvements on tst2014, with gains of 2.4 BLEU points and 1.6 ChrF3 points. For the English→Turkish task, we can see from Table 6 that the dual-source model achieves the highest BLEU and ChrF3 scores on tst2012, tst2013, and tst2014. It achieves its largest improvements on tst2012, with gains of 1.1 BLEU points and 1.5 ChrF3 points. On the test dataset tst2011, the proposed model is worse than the NMT baseline in BLEU score but better in ChrF3 score. The main reason is that the BLEU score is based on the precision of the Turkish words, while the ChrF3 score is based on both precision and recall, so the two metrics may occasionally disagree. The ChrF3 score has been found to correlate well with human judgments, especially for translation out of English [34]. Thus, we consider that our model still outperforms the NMT baseline in translation quality. Nevertheless, it is not better than the single-source model on tst2011. The main reason is that tst2011 is less suitable for our dual-source model, since its English linguistic features are not sufficiently accurate or informative.
To further evaluate the effectiveness of using various linguistic features in the dual-source model, we incorporate each linguistic feature separately for comparison. Experimental results of using different features for the Turkish→English (TR-EN) and English→Turkish (EN-TR) machine translation tasks are shown in Table 7 and Table 8, respectively.

Discussion
For the Turkish→English machine translation task, we can see from Table 7 that incorporating the lemma feature of Turkish into the proposed dual-source model achieves the highest BLEU and ChrF3 scores on tst2011, tst2012, and tst2013, while incorporating the morphological tag achieves the highest BLEU and ChrF3 scores on tst2014. For the English→Turkish machine translation task, we can see from Table 8 that incorporating the lemma feature of English achieves the highest BLEU and ChrF3 scores on tst2013, while incorporating the POS tag achieves the highest BLEU and ChrF3 scores on tst2011, tst2012, and tst2014. This indicates that external linguistic knowledge is beneficial for enhancing the conventional NMT model's representation of the source-side language, and that different linguistic features suit different translation tasks and test datasets.
In addition, we also find that incorporating a single linguistic feature into the dual-source model sometimes fails to yield improvements in BLEU or ChrF3 scores. The main reason is that a single linguistic feature does not carry enough effective information for model training, so the dual-source framework alone cannot improve the model's representation of the source-side language. By contrast, incorporating all the features together yields more consistent gains, which indicates that the approach is capable of better utilizing the external linguistic knowledge and effectively integrating the features together.

Conclusions
This paper proposed a novel dual-source NMT approach that models the words in parallel with the linguistic knowledge by using two separate encoders. The proposed approach is based on the multi-source framework and additionally employs a knowledge-based encoder dedicated to encoding linguistic knowledge. It generalizes the word embedding layer of the knowledge-based encoder to allow for an arbitrary number and type of linguistic features, and it shares all parameters across the two encoders to enhance the representation of the source-side language. Moreover, we evaluated the effectiveness of incorporating each linguistic feature separately into the proposed dual-source model, and found that external linguistic knowledge is beneficial for the NMT model. Extensive experiments show that the proposed dual-source model achieves significant BLEU and ChrF3 improvements on both Turkish→English and English→Turkish machine translation tasks.
In the future, we plan to employ a more effective method for combining the hidden states and cell states from the two encoders to further enhance the dual-source model's representation of the source-side language. In addition, we also plan to perform machine translation tasks on other high-resource language pairs and morphologically-rich languages.