4.1. Overview of Temporal Knowledge Graph BERT
BERT (Bidirectional Encoder Representations from Transformers) [27] is a pre-trained language model based on a multi-layer Transformer encoder [35]. BERT learns deep bidirectional representations from unlabeled text by jointly conditioning on both the left and right context in all layers. Like other pre-trained language models, BERT is applied in two steps: pre-training and fine-tuning. During pre-training, BERT is trained on a large-scale unlabeled general-domain corpus. In the fine-tuning phase, BERT is initialized with the pre-trained parameters and then fine-tuned on a domain-specific corpus and on tasks such as named entity recognition, question answering, and sentence pair classification.
To leverage these rich language patterns and contextual representations, we fine-tune the pre-trained BERT model for temporal knowledge completion tasks. We concatenate the entity tokens, relation tokens, and timestamp tokens into a word sequence and feed it into BERT for fine-tuning. We call this architecture TKG-BERT (Temporal Knowledge Graph based on BERT). TKG-BERT uses the pre-trained BERT (BERT_base) and is fine-tuned for sequence classification on a temporal knowledge graph corpus.
The left part of Figure 2 shows the architecture of TKG-BERT for modeling knowledge represented as a tuple (triple or quadruple). For each tuple, we represent the entities and the relation as their textual word sequences, and TKG-BERT takes these word sequences as the input sentence for fine-tuning. As shown in Figure 2, we concatenate the word sequences of (s, r, o) into a single input sequence, i.e., the input token sequence to BERT. This is the general architecture; the input tokens and the output labels may differ depending on the modeling mode. For example, given the temporal quadruple
(Islamic Rebirth Party, Make a visit, Tatarstan, 2014-03-21),
Figure 2. Illustrations of fine-tuning TKG-BERT with different time modeling approaches on various tasks.
TKG-BERT takes the following token sequence as input:
([CLS], Islamic, Rebirth, Party, [SEP], Make, visit, [SEP], Tatarstan, [SEP], 2014-03-21, [SEP]).
In the original BERT, [CLS] is the special symbol for classification output, and [SEP] is the special symbol that separates non-consecutive token sequences. In TKG-BERT, the first token of each input sequence is always [CLS], whose final representation serves as the tuple representation and is fed into an output layer for classification. The word sequences of entities and relations are separated by [SEP].
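The construction of this input sequence can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_input_tokens` is hypothetical, and plain whitespace splitting stands in for BERT's WordPiece tokenizer, so every word of each element is kept (including "a" in "Make a visit") and the timestamp stays a single token.

```python
from typing import List, Optional

CLS, SEP = "[CLS]", "[SEP]"


def build_input_tokens(subject: str, relation: str, obj: str,
                       timestamp: Optional[str] = None) -> List[str]:
    """Concatenate tuple elements into one [CLS]/[SEP]-delimited token list.

    Hypothetical helper: whitespace splitting is an assumption that
    replaces BERT's real subword tokenizer.
    """
    elements = [subject, relation, obj]
    if timestamp is not None:
        # Quadruple input: the timestamp is appended after the triple.
        elements.append(timestamp)

    tokens = [CLS]
    for element in elements:
        tokens.extend(element.split())
        tokens.append(SEP)
    return tokens


quad_tokens = build_input_tokens(
    "Islamic Rebirth Party", "Make a visit", "Tatarstan", "2014-03-21")
triple_tokens = build_input_tokens(
    "Islamic Rebirth Party", "Make a visit", "Tatarstan")
```

Leaving out the timestamp argument yields the triple-only sequence, so one helper covers both the triple and the quadruple input.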
The token sequences of knowledge are fed into the pre-trained BERT to generate embeddings (blue blocks marked with E in Figure 2). The embedding of each token is the sum of its token embedding, segment embedding, and position embedding. The token embedding is the original word embedding of the current token. The segment embedding distinguishes tokens in different segments: tokens separated by [SEP] have different segment embeddings, whereas tokens within the same entity or relation share the same segment embedding. The position embedding encodes position information, so different tokens at the same position have the same position embedding. The summed embeddings are fed into BERT, which generates the final hidden vectors (green blocks marked with "C" and "T" in Figure 2) after Transformer encoding. The final hidden vector "C" aggregates the sequence representation and is used to compute the final label. The other hidden vectors, marked "T", correspond to the entity tokens, relation tokens, and [SEP] tokens. The label denotes the final output for the given input tuple, and it differs depending on the training task and mode.
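The summation of the three embeddings can be illustrated with toy vectors. The sketch below rests on several assumptions and is not BERT's actual implementation: the lookup tables are random, the dimension is 4 rather than the 768 of BERT-base, the helper names are hypothetical, and the segment index is assumed to advance after every [SEP], which is only one possible reading of the description above.

```python
import random
from typing import Dict, Hashable, List

DIM = 4  # toy size (assumption); BERT-base uses 768


def lookup(table: Dict[Hashable, List[float]], key: Hashable,
           rng: random.Random) -> List[float]:
    """Return the vector stored for key, creating a random one if missing."""
    if key not in table:
        table[key] = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
    return table[key]


def segment_ids(tokens: List[str]) -> List[int]:
    """Assign a segment index that increases after each [SEP] (assumption)."""
    ids, current = [], 0
    for token in tokens:
        ids.append(current)
        if token == "[SEP]":
            current += 1
    return ids


def embed(tokens: List[str], seed: int = 0) -> List[List[float]]:
    """Sum token, segment, and position embeddings for every token."""
    rng = random.Random(seed)
    token_table: Dict[Hashable, List[float]] = {}
    segment_table: Dict[Hashable, List[float]] = {}
    position_table: Dict[Hashable, List[float]] = {}
    embeddings = []
    for position, (token, segment) in enumerate(zip(tokens, segment_ids(tokens))):
        tok = lookup(token_table, token, rng)
        seg = lookup(segment_table, segment, rng)
        pos = lookup(position_table, position, rng)
        embeddings.append([a + b + c for a, b, c in zip(tok, seg, pos)])
    return embeddings


tokens = ["[CLS]", "Islamic", "Rebirth", "Party", "[SEP]",
          "Make", "visit", "[SEP]", "Tatarstan", "[SEP]"]
segments = segment_ids(tokens)
vectors = embed(tokens)
```

In this toy version the three [SEP] tokens share one token vector but still receive different summed embeddings, because their segment and position vectors differ.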
The pre-trained BERT layer consists of 12 bidirectional Transformer encoders, each of which implements multi-head self-attention. Multi-head attention generates multiple sets of (Q, K, V) using different weight matrices, where Q, K, and V refer to the query, key, and value in multi-head self-attention. The Transformer calculates attention according to Equation (1). The output of multi-head attention is given by Equation (2). The final hidden vectors T are calculated by Equation (3), in which the intermediate term is the normalized multi-head output after the residual connection, calculated by Equation (4).
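For reference, the standard Transformer encoder formulas [35] that Equations (1) to (4) describe in words can be sketched as follows. The symbols $X$ (layer input), $X'$ (normalized attention output), $d_k$ (key dimension), $h$ (number of heads), $\mathrm{FFN}$ (position-wise feed-forward network), and the projection matrices $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, $W^{O}$ are assumptions taken from the standard formulation and may differ from the notation of the original equations.

```latex
\begin{align}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
  \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \\
T &= \mathrm{LayerNorm}\big(X' + \mathrm{FFN}(X')\big) \\
X' &= \mathrm{LayerNorm}\big(X + \mathrm{MultiHead}(X, X, X)\big)
\end{align}
```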
Building on the above, we designed three approaches for fine-tuning TKG-BERT on temporal knowledge graph reasoning tasks. This enables us to investigate whether temporal information plays a role when using language models for knowledge graph reasoning, as well as the extent to which different temporal modeling methods affect the reasoning outcomes.
4.2. Vanilla Knowledge Embedding of TKG-BERT
The vanilla knowledge embedding design of TKG-BERT (abbreviated as TKG-BERT (Van.)) is intended to measure performance on temporal knowledge graph tasks without incorporating temporal information. The task modeling approach for TKG-BERT (Van.) is illustrated in the middle section of Figure 2. Under this configuration, TKG-BERT does not model the temporal information present in the temporal knowledge graph; instead, it trains and predicts using only the static triple component (s, r, o) of each temporal quadruple. The tasks are identical to those performed on static knowledge graphs.
The original BERT randomly masks some tokens of the input sequence and then predicts the masked tokens. Inspired by this masked language modeling, TKG-BERT masks an entity or a relation in a triple to learn their embeddings. As depicted in Figure 2, the three tasks are entity prediction, relation prediction, and triple classification. Masked entity modeling is employed for the entity prediction task, and masked relation modeling for the relation prediction task.
Masked entity modeling constructs positive and negative tuple samples by randomly corrupting the subject entity s or the object entity o. TKG-BERT learns embeddings that separate the triple scores of positive and negative samples as far as possible. During the test phase, the masked entity is then predicted according to these scores.
Masked relation modeling removes the relation from the input tuple sequence, so that only the subject entity and the object entity are fed into fine-tuning. The relations are treated as labels, and TKG-BERT learns entity embeddings that fit the relation labels.
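As a rough illustration of masked relation modeling, the snippet below turns triples into (input tokens, relation label) pairs. The helper names `build_relation_vocab` and `relation_example` are hypothetical, whitespace splitting stands in for BERT's tokenizer, and the second triple uses placeholder names invented purely for the example.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def build_relation_vocab(triples: List[Triple]) -> Dict[str, int]:
    """Map each distinct relation to an integer class label."""
    vocab: Dict[str, int] = {}
    for _, relation, _ in triples:
        if relation not in vocab:
            vocab[relation] = len(vocab)
    return vocab


def relation_example(triple: Triple,
                     vocab: Dict[str, int]) -> Tuple[List[str], int]:
    """Build the entity-only input sequence and the relation label."""
    subject, relation, obj = triple
    # The relation is removed from the input and used as the target class.
    tokens = ["[CLS]"] + subject.split() + ["[SEP]"] + obj.split() + ["[SEP]"]
    return tokens, vocab[relation]


triples = [
    ("Islamic Rebirth Party", "Make a visit", "Tatarstan"),
    ("Entity A", "Placeholder relation", "Entity B"),  # placeholder data
]
vocab = build_relation_vocab(triples)
tokens, label = relation_example(triples[0], vocab)
```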
The architecture of TKG-BERT (Van.) for the triple classification mode is shown in Figure 2. For this task, TKG-BERT also takes the concatenated word sequences of the entities and the relation as the input token sequence, and the output label denotes whether the triple is true or false.
4.3. Explicit Time Modeling of TKG-BERT
Explicit temporal modeling of TKG-BERT (abbreviated as TKG-BERT (Exp.)) explicitly incorporates time-related information into the model. This typically involves representing timestamps or other temporal features explicitly and using them in the model's learning and inference mechanisms.
A temporal knowledge graph is usually represented formally as a set of quadruples (s, r, o, t), where t is the time at which the triple fact happens. Explicit time modeling treats the timestamp as an individual element, like an entity or a relation, and learns its embedding. Compared with TKG-BERT (Van.), which has no temporal modeling, TKG-BERT (Exp.) embeds timestamps alongside entities and relations by appending the timestamp token after the entity-relation triple, so that full temporal quadruples are fed into the model. Tasks under explicit temporal modeling include entity prediction with timestamps, relation prediction with timestamps, and quadruple classification. The inputs and outputs for these three tasks are illustrated in the right part of Figure 2.
TKG-BERT (Exp.) for entity prediction takes the concatenation of the subject entity, relation, object entity, and timestamp of a quadruple as the input token sequence. The pre-trained BERT layer transforms the token embeddings into final hidden vectors. The hidden vector C of the special token [CLS] aggregates the sequence representation and is used to calculate the quadruple score, which is the model output.
TKG-BERT (Exp.) for relation prediction uses only the tokens of the subject entity s, the object entity o, and the timestamp t to predict the relation r between the entities. In the preliminary experiments of KG-BERT [36], predicting the relation directly from the two entities performed better than using the entity prediction mode with relation corruption, so we adopt the same approach. After encoding and generation of the final hidden vectors, the model outputs the relation label of the given entity pair.
TKG-BERT (Exp.) for quadruple classification takes the quadruple as the input token sequence. The only difference between quadruple classification and triple classification is the addition of the timestamp. Quadruple classification is also a binary classification task that distinguishes positive from negative quadruple samples.
4.4. Implicit Time Modeling of TKG-BERT
In previous research on temporal knowledge graphs, knowledge prediction follows one of two modes: interpolation and extrapolation. As shown in the left part of Figure 3, "interpolation" randomly selects a portion of the knowledge for model training and infers the missing knowledge. In this mode, the model may infer missing historical knowledge from knowledge at future time points. The interpolation mode corresponds to the vanilla knowledge modeling of TKG-BERT. "Extrapolation", on the other hand, trains the model on historical data and then reasons about or predicts knowledge at future time points. In the previous subsection on explicit temporal modeling, we adopted the interpolation setting. However, the extrapolation mode is better aligned with practical applications. Therefore, we designed a time modeling approach under the extrapolation setting, namely implicit temporal modeling (abbreviated as TKG-BERT (Imp.)).
As illustrated in the right part of Figure 3, we restructured the two datasets used in this study according to their temporal order, selecting the earlier 80% of the data as the training set and the more recent data as the test set. Under this setting, as in TKG-BERT (Van.), we do not explicitly embed temporal information such as timestamps. Instead, we model time implicitly through the restructuring of the dataset, learning from history to predict the future.
TKG-BERT (Imp.) captures temporal dynamics within a KG without explicitly encoding or representing time-related information. In this method, the model learns to infer temporal patterns and dependencies from the input data itself rather than relying on explicit timestamps or time intervals. For a given temporal knowledge graph, we reconstruct the graph, create the training set by selecting the fact quadruples that occurred relatively earlier, and conduct entity or relation prediction on the fact quadruples that occur at a relatively later time.
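The temporal restructuring of a dataset can be sketched as a chronological split. The code below is an illustrative assumption rather than the authors' preprocessing script: the function names are hypothetical, timestamps are assumed to be ISO-formatted dates (which sort correctly as strings), the toy quadruples are placeholders, and facts sharing the timestamp at the cut point may end up on both sides of the split.

```python
from typing import List, Tuple

Quad = Tuple[str, str, str, str]  # (subject, relation, object, timestamp)
Triple = Tuple[str, str, str]


def chronological_split(quads: List[Quad],
                        train_ratio: float = 0.8) -> Tuple[List[Quad], List[Quad]]:
    """Sort facts by time; the earliest share trains, the rest tests."""
    ordered = sorted(quads, key=lambda quad: quad[3])
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]


def drop_timestamps(quads: List[Quad]) -> List[Triple]:
    """Keep only (s, r, o): time is modeled implicitly by the split."""
    return [(s, r, o) for s, r, o, _ in quads]


# Placeholder facts, deliberately listed in reverse chronological order.
toy_quads = [(f"e{day}", "rel", f"e{day + 1}", f"2014-01-{day:02d}")
             for day in range(10, 0, -1)]
train_quads, test_quads = chronological_split(toy_quads)
train_triples = drop_timestamps(train_quads)
```

Every training fact is dated no later than any test fact, which is the property that distinguishes this extrapolation-style split from a random one.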
4.5. Training Loss
The three modeling approaches of TKG-BERT use different training objectives. For the entity prediction and tuple classification tasks, we train the model to score tuples so that positive samples score much higher than negative samples. Negative samples are generated through negative sampling.
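The scoring function for a quadruple $\tau = (s, r, o, t)$ is $s_{\tau} = f(s, r, o, t) = \mathrm{sigmoid}(CW^{T})$, where $W$ is the classification layer weights and "C" is the output embedding of token [CLS]. Here $s_{\tau}$ is a two-dimensional vector with components $s_{\tau 0}$ and $s_{\tau 1}$. Given the positive quadruple set $\mathbb{D}^{+}$ and a negative quadruple set $\mathbb{D}^{-}$ constructed accordingly, we calculate the cross-entropy loss with $s_{\tau}$ and the quadruple labels:
$$\mathcal{L} = -\sum_{\tau \in \mathbb{D}^{+} \cup \mathbb{D}^{-}} \left( y_{\tau} \log(s_{\tau 0}) + (1 - y_{\tau}) \log(s_{\tau 1}) \right)$$
where $y_{\tau} \in \{0, 1\}$ is the label (negative or positive) of that quadruple. The negative quadruple set $\mathbb{D}^{-}$ is generated simply by replacing the subject entity $s$ or the object entity $o$ in a positive quadruple $(s, r, o, t) \in \mathbb{D}^{+}$ with a random entity $s'$ or $o'$:
$$\mathbb{D}^{-} = \{(s', r, o, t) \mid s' \in \mathbb{E} \wedge s' \neq s \wedge (s', r, o, t) \notin \mathbb{D}^{+}\} \cup \{(s, r, o', t) \mid o' \in \mathbb{E} \wedge o' \neq o \wedge (s, r, o', t) \notin \mathbb{D}^{+}\}$$
where $\mathbb{E}$ is the set of entities. A quadruple is not treated as a negative example if it is already in the positive set $\mathbb{D}^{+}$. The pre-trained parameter weights and the new weights $W$ can be updated via gradient descent.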
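The corruption procedure for building negative samples can be sketched as follows. This is a minimal example under stated assumptions, not the authors' code: the function name `corrupt`, the equal chance of replacing the subject or the object, the retry limit, and the toy entities are all hypothetical choices.

```python
import random
from typing import List, Optional, Set, Tuple

Quad = Tuple[str, str, str, str]  # (subject, relation, object, timestamp)


def corrupt(quad: Quad, entities: List[str], positives: Set[Quad],
            rng: random.Random, max_tries: int = 100) -> Optional[Quad]:
    """Replace the subject or the object with a random entity.

    Candidates that equal the original quadruple or that already appear
    in the positive set are rejected, so a known fact is never labeled
    as negative. Returns None if no valid negative is found.
    """
    s, r, o, t = quad
    for _ in range(max_tries):
        candidate = rng.choice(entities)
        if rng.random() < 0.5:  # assumption: corrupt either side equally often
            negative = (candidate, r, o, t)
        else:
            negative = (s, r, candidate, t)
        if negative != quad and negative not in positives:
            return negative
    return None


rng = random.Random(0)
entities = ["a", "b", "c", "d"]  # placeholder entity names
positives = {("a", "rel", "b", "2014-01-01"), ("a", "rel", "c", "2014-01-01")}
negatives = {quad: corrupt(quad, entities, positives, rng) for quad in positives}
```

The relation and the timestamp are left untouched, so each negative differs from its positive source in exactly one entity slot.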
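For the relation prediction task, the final hidden state "C" corresponding to [CLS] is used as the representation of the two entities. The only new parameters introduced in relation prediction fine-tuning are the classification layer weights $W' \in \mathbb{R}^{R \times H}$, where $R$ is the number of relations in a KG and $H$ is the hidden size of BERT. The scoring function for a quadruple $\tau = (s, r, o, t)$ is $s'_{\tau} = f(s, r, o, t) = \mathrm{softmax}(CW'^{T})$.
We compute the following cross-entropy loss with $s'_{\tau}$ and relation labels:
$$\mathcal{L}' = -\sum_{\tau \in \mathbb{D}^{+}} \sum_{i=1}^{R} y'_{\tau i} \log(s'_{\tau i})$$
where $\tau$ is an observed positive triple from the positive set $\mathbb{D}^{+}$, $y'_{\tau i}$ is the relation indicator for the triple $\tau$, $y'_{\tau i} = 1$ when $r = i$ and $y'_{\tau i} = 0$ when $r \neq i$.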