Unified Model for Paraphrase Generation and Paraphrase Identification

Paraphrase Generation is one of the most important and challenging tasks in the field of Natural Language Generation. Paraphrasing techniques help to identify, extract, or generate phrases and sentences that convey a similar meaning. The paraphrasing task can be divided into two sub-tasks, namely Paraphrase Identification (PI) and Paraphrase Generation (PG). Most existing state-of-the-art systems can solve only one of these problems at a time. This paper proposes a light-weight unified model that can both classify whether a given pair of sentences are paraphrases of each other and generate multiple paraphrases for an input sentence. The Paraphrase Generation module aims to generate fluent and semantically similar paraphrases, and the Paraphrase Identification system aims to classify whether a sentence pair is a paraphrase pair or not. The proposed approach combines data sampling or data variety with a granular fine-tuned Text-To-Text Transfer Transformer (T5) model, so that both Paraphrase Identification and Paraphrase Generation are addressed using carefully selected data points and a single fine-tuned T5 model. The highlight of this study is that the same light-weight model, trained with the objective of Paraphrase Generation, can also be used for the Paraphrase Identification task. Hence, the proposed system is light-weight in terms of both the model's size and the data used to train it, which allows the model to learn quickly without compromising on results.
The proposed system is then evaluated using popular evaluation metrics: BLEU (BiLingual Evaluation Understudy), ROUGE (Recall-Oriented Understudy for Gisting Evaluation), METEOR, WER (Word Error Rate), and GLEU (Google-BLEU) for Paraphrase Generation, and classification metrics such as accuracy, precision, recall, and F1-score for the Paraphrase Identification system. The proposed model achieves state-of-the-art results on both tasks, Paraphrase Identification and Paraphrase Generation.


Natural Language Generation (NLG) can be viewed as the task of developing systems that can automatically write summaries, explanations, or narratives in English or other languages. These NLG systems aim to generate or produce unambiguous and clear repositories. The most recent advancements in solving the Paraphrase Generation task involve Generative Adversarial Networks (GANs), sequence-to-sequence models, encoder-decoder models, and transformer-based models.

This paper presents a unified system that combines the data selection variety parameter along with a custom fine-tuned T5 model especially for the Paraphrase Generation

Then, for both tasks, a mathematical problem formulation is given and the proposed unified system architecture is discussed. The paper gives a thorough analysis of the results and concludes with the future scope and directions for improvement.

Version April 21, 2021 submitted to Journal Not Specified 3 of 14

value tending to 1 indicates that the sentence pair is a paraphrase pair; otherwise it is not.

In some cases, the identification system outputs a semantic score which, when normalized, can help to discriminate between sentence pairs. The Paraphrase Generation task aims to automatically generate one or multiple candidate paraphrases given the reference or input sentence. The aim is to generate semantically equivalent and fluent paraphrases. The PI task is viewed as a supervised machine learning task and is modeled as follows: given a sentence pair $(S_1, S_2)$, the aim is to find the target label (1 or 0, indicating that the given sentence pair is or is not a paraphrase pair, respectively), where $S_1 = \{w_1, w_2, w_3, \ldots, w_n\}$ and $S_2 = \{w_1, w_2, w_3, \ldots, w_m\}$; that is, the two sentences may differ in length. The output can be a probability between 0 and 1 or some normalized semantic score. In the PG task, the aim is to generate a candidate sentence given an input sentence.
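The PI formulation above can be illustrated with a minimal sketch. It assumes a stand-in scoring function (cosine similarity over word counts) purely for illustration; the paper's system derives its score from the fine-tuned T5 model, and the names `cosine_score`, `identify_paraphrase`, and the default threshold of 0.5 are hypothetical.

```python
import math
from collections import Counter


def cosine_score(s1: str, s2: str) -> float:
    """Stand-in normalized score in [0, 1]: cosine similarity of word counts.

    This is an illustrative assumption, not the scoring used in the paper.
    """
    c1, c2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(
        sum(v * v for v in c2.values())
    )
    return dot / norm if norm else 0.0


def identify_paraphrase(s1: str, s2: str, threshold: float = 0.5) -> int:
    """Map a sentence pair (S1, S2) to the target label.

    Returns 1 if the normalized score reaches the threshold, else 0.
    The two sentences may have different lengths (n and m words).
    """
    return 1 if cosine_score(s1, s2) >= threshold else 0
```

Any scorer that returns a value between 0 and 1 could be substituted for `cosine_score`; the thresholding step is what turns the score into the binary target.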

3. In the Quora and MSRP datasets, select only the sentence pairs labelled 1 (here, 1 denotes that the sentences in a pair are paraphrases of each other).
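The selection of positively labelled pairs can be sketched as a simple filter. This is a minimal sketch that assumes the data has already been loaded into memory as labelled pairs; the `LabelledPair` type and `select_paraphrase_pairs` function are hypothetical names, and the actual column layout of the Quora and MSRP files is not shown here.

```python
from typing import List, NamedTuple, Tuple


class LabelledPair(NamedTuple):
    sentence1: str
    sentence2: str
    label: int  # 1 = paraphrase pair, 0 = not a paraphrase pair


def select_paraphrase_pairs(pairs: List[LabelledPair]) -> List[Tuple[str, str]]:
    """Keep only the pairs labelled 1, returned as (source, target) tuples."""
    return [(p.sentence1, p.sentence2) for p in pairs if p.label == 1]
```

The retained tuples can then serve as input and target sentences when training for Paraphrase Generation.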

By performing this step, the three main important parameters of diversity, seman-

The Paraphrase Generation task is evaluated by using the following metrics:

proper threshold value, the Paraphrase Identification task achieved respectable results.
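To show how the choice of threshold affects the Paraphrase Identification metrics named in this paper (accuracy, precision, recall, and F1-score), here is a minimal sketch. The function name `classification_metrics` and the input format (a list of normalized scores and a list of gold 0/1 labels) are assumptions for illustration, not the paper's evaluation code.

```python
from typing import Dict, Sequence


def classification_metrics(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Dict[str, float]:
    """Threshold normalized scores and compute accuracy, precision, recall, F1."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Calling this function over a range of candidate thresholds on held-out data is one simple way to pick a threshold; whether the paper selected its threshold this way is not stated in the text above.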

The proposed approach is designed in such a way that the end-to-end system can be