3.1. Overview of the PDAM-FAQ Framework
To address insufficient training data and limited domain-specific semantic understanding in FAQ question-answering systems, this paper proposes a paraphrasing-based data augmentation model that integrates syntactic information and edit vectors. First, a rule-based approach retrieves from the corpus a template sentence for each original sentence. Next, the template sentences are part-of-speech tagged, and special characters mask the words with the relevant parts of speech, such as nouns, verbs, adjectives, and adverbs. Finally, GloVe word vectors are used to construct edit vectors between the original and reference paraphrase sentences. These edit vectors are incorporated into the encoding layer of a pre-trained model so that it better learns the differences between the original and reference paraphrase sentences. The augmented dataset is then used to train, validate, and test a mixed-feature semantic matching model. This model, based on SimBERT, extracts keyword features from the text and constructs intention features by replacing the keywords in the text with special characters. The user question, keyword features, and intention features are concatenated as the model’s input.
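The part-of-speech masking step can be illustrated with a minimal sketch. It assumes the template sentence has already been tagged with Penn Treebank tags and that the masking character is "[MASK]"; the function name `mask_template`, the tag set, and the toy sentence are hypothetical choices for illustration, not details taken from the paper.

```python
# Penn Treebank tag prefixes for nouns, verbs, adjectives and adverbs
# (assumed tag set; the paper does not name its tagger).
CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ", "RB")


def mask_template(tagged_tokens, mask_token="[MASK]"):
    """Replace content words of a POS-tagged template with mask_token."""
    return [
        mask_token if tag.startswith(CONTENT_TAG_PREFIXES) else word
        for word, tag in tagged_tokens
    ]


# Hand-tagged toy template (hypothetical); a real pipeline would call a tagger.
template = [
    ("How", "WRB"), ("does", "VBZ"), ("the", "DT"),
    ("pump", "NN"), ("work", "VB"), ("?", "."),
]
masked = mask_template(template)
print(" ".join(masked))
```

Function words and punctuation survive, so the masked template keeps the syntactic skeleton of the sentence while leaving the content slots open.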
Figure 1.
PDAM-FAQ framework
3.3. Mixed-Feature Semantic Matching Model Based on SimBERT
Specialized fields contain a large number of professional terms, some of which are compound nouns formed from several nouns. Because the SimBERT pre-trained model segments sentences with the WordPiece method, such terms may be split into subword pieces, which makes it more difficult for the model to understand the deep semantics of the text. To enable the model to focus on the semantic information of specific words and on the intent of user questions, this paper proposes a mixed-feature semantic matching model based on SimBERT that adds two kinds of features: keyword features and intent features. Keywords are representative and important words extracted from a text, and they indicate the theme of the text to a certain extent. Adding keyword features helps the model keep its semantic focus when retrieving similar questions.
Let $D = \{(x, y, l)\}$ denote a dataset, where $x$ denotes the first sentence, $y$ denotes the second sentence, and $l$ is the similarity label. Given the sentence $x$ and the sentence $y$, the goal is to learn a text matching model for determining whether sentence $x$ and sentence $y$ are similar. A mixed-feature semantic matching model based on SimBERT is illustrated in Figure 5; it mainly contains a text input layer, a model layer and a result output layer:
(1) Input layer: the input layer concatenates the different features, including the keywords and the masked text. It is built in the following steps:
a. Keywords: use a keyword extraction tool to extract keywords from the first and second sentences separately;
b. Masked text: use the special token “[MASK]” to replace the keywords in the sentence. For example, if the sentence is “What is the basic principle of ship anti-sinking?”, the keywords are “anti-sinking, basic principle”, and the masked text is “What is the [MASK] of ship [MASK]?”. Here the two highest-scoring words are taken as the keywords;
c. Concatenating input text: use the special token “[unused1]” from the pretrained model’s vocabulary to connect the sentence, the keywords, and the masked text, forming the structure “sentence[unused1]keyword|masked sentence”. Between the first and second sentences, use the “[SEP]” separator to mark the boundary between the two input texts. Before the first sentence, add the special token “[CLS]”. This results in the sequence “[CLS]sentence1[unused1]keyword|masked sentence[SEP]sentence2[unused1]keyword|masked sentence” as the input for the pretrained model. Inside the model, the token vectors, position vectors, and segment vectors of the input text are obtained and combined into a single vector (in BERT-style models, by element-wise summation), which represents the input text’s embedding.
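Steps a to c can be sketched with plain string operations. This is a minimal illustration rather than the authors' implementation: the keywords are supplied by hand instead of coming from an extraction tool, joining multiple keywords with ", " is an assumption, the function names are hypothetical, and the second sentence is an invented example.

```python
UNUSED = "[unused1]"  # connector token named in the text


def mask_keywords(sentence, keywords, mask_token="[MASK]"):
    """Step b: replace every keyword occurrence with the mask token."""
    masked = sentence
    for keyword in keywords:
        masked = masked.replace(keyword, mask_token)
    return masked


def build_segment(sentence, keywords):
    """Build 'sentence[unused1]keyword|masked sentence' for one sentence."""
    masked = mask_keywords(sentence, keywords)
    # Joining keywords with ", " is an assumption; the text does not specify it.
    return f"{sentence}{UNUSED}{', '.join(keywords)}|{masked}"


def build_input(sentence1, keywords1, sentence2, keywords2):
    """Step c: '[CLS]' + segment 1 + '[SEP]' + segment 2."""
    first = build_segment(sentence1, keywords1)
    second = build_segment(sentence2, keywords2)
    return f"[CLS]{first}[SEP]{second}"


s1 = "What is the basic principle of ship anti-sinking?"
k1 = ["anti-sinking", "basic principle"]
s2 = "How does ship anti-sinking work?"  # hypothetical second sentence
k2 = ["anti-sinking", "work"]

model_input = build_input(s1, k1, s2, k2)
print(model_input)
```

In practice the resulting string would be passed to the pretrained model's tokenizer, which maps “[CLS]”, “[SEP]”, “[MASK]” and “[unused1]” to their reserved vocabulary entries.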
(2) Model layer: in natural language processing, input sequences can be very long, and different parts of a sequence may have varying levels of importance. The attention mechanism improves the model’s expressive power by dynamically assigning different weights to different parts of the input. The self-attention mechanism is used to capture internal information within the sequence. It treats each position in the input sequence as a query, calculates attention scores against the other positions, and uses these scores as weights. A weighted sum over all positions then gives the representation for each position. The computation process of the self-attention mechanism is as follows:
a. Calculate the Q, K and V vectors: denote the N inputs by $X = [x_1, \dots, x_N]$. The encoder takes these word vectors and applies linear transformations to obtain the Query, Key and Value vectors, each obtained by multiplying the word vectors by one of three parameter matrices:
$$Q = XW^{Q}, \quad K = XW^{K}, \quad V = XW^{V}$$
where $W^{Q}$, $W^{K}$ and $W^{V}$ are the parameter matrices.
b. Calculate the attention scores: each score is the dot product of the Query vector of one word with the Key vector of a word at another position in the sequence:
$$\mathrm{score}(Q, K) = QK^{\top}$$
c. To keep gradients stable during backpropagation, the attention scores are divided by $\sqrt{d_k}$, with $d_k$ being the dimension of the key vector. The scaled scores are then normalised by softmax so that the weights for each query sum to 1:
$$\mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)$$
d. Finally, the normalised scores are multiplied with the corresponding Value vectors and summed; the positions with the larger weights are where the model pays more attention:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
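Steps a to d can be traced in a small pure-Python sketch of single-head scaled dot-product self-attention. It assumes a row-per-token layout (each row of `x` is one word vector) and uses toy identity matrices in place of learned parameters; the helper names are illustrative only, and a real model would use multiple heads and an optimised tensor library.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    # Step a: linear projections to Query, Key and Value.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Step b: dot-product scores between every query and every key.
    scores = matmul(q, transpose(k))
    # Step c: scale by sqrt(d_k), then normalise each row with softmax.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Step d: weighted sum of the Value vectors.
    return matmul(weights, v), weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N = 3 toy word vectors, d = 2
identity = [[1.0, 0.0], [0.0, 1.0]]       # stand-in for learned matrices
output, weights = self_attention(x, identity, identity, identity)
print(weights[0])
```

Each row of `weights` is a probability distribution over the N positions, and each row of `output` is the corresponding weighted mixture of Value vectors.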
(3) Output layer: a softmax function is added at the end of the model to classify the similarity result. The “[CLS]” vector output by the model serves as the representation of the input sentences. Passing it through the softmax function yields the category probability vector, and the category with the largest predicted probability is selected as the prediction.
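The output layer can be sketched as follows. The linear projection from the “[CLS]” vector to class logits is an assumption (the text only states that softmax is applied at the end), and the vector, weights, and bias values are made-up toy numbers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, weight, bias):
    """Project the [CLS] vector to logits (assumed linear layer), then softmax."""
    logits = [
        sum(w * h for w, h in zip(row, cls_vector)) + b
        for row, b in zip(weight, bias)
    ]
    probs = softmax(logits)
    # Select the category with the largest predicted probability.
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return predicted, probs


cls_vector = [0.5, -1.0, 2.0]                   # toy [CLS] representation
weight = [[0.1, 0.2, -0.3], [-0.2, 0.1, 0.4]]   # one row per category
bias = [0.0, 0.0]
label, probs = classify(cls_vector, weight, bias)
print(label, probs)
```

With two categories, the probabilities correspond to the two possible similarity labels; which index denotes "similar" depends on how the dataset labels are encoded.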