2. A (Semantic) Space Odyssey
Fortunately, natural language processing (NLP) and machine learning are progressing at a very fast pace, and new tools and techniques are being developed that could greatly enhance the efficiency and effectiveness of this process. Several such systems are now available on the market, even for literature searches [6]. At the root of all algorithms for the analysis of text documents lies their ability to create numerical representations of the meaning of texts through vectors or, as they are commonly known, embeddings [7].
The concept that a word, a sentence, or a whole document can be represented as a vector in a semantic space, based on its co-occurrence with other words within a given lexical context, has its origins in the work of Leonard Bloomfield and Zellig Harris, which forms the basis of distributional semantics [8,9]. According to this approach, the meaning of a word is defined by, and can be represented through, its context. In the famous words of J.R. Firth, "You shall know a word by the company it keeps!"
An example of this could be the somewhat quirky sentence:
“She took a xylinth, and applying a very light pressure, she was able to remove most of the calculus”.
The readers may not know what a xylinth is (it is admittedly a made-up word); they may never have heard this word at all, but based on the context, most of them would probably agree that this odd word very likely denotes some kind of tool, perhaps a surgical instrument, that can be used in dental practice to remove calculus, maybe from alien teeth.
Distributional semantics has a clear advantage that can be leveraged in NLP: all the elements necessary to describe the meaning of a word are present in the sentence itself. A meaning representation rooted in distributional semantics does not need to refer to any knowledge repository, lexicon, or ontology outside the very sentence being represented, and this makes computation leaner and faster.
There are several approaches for constructing semantic vectors from a document or corpus of documents in NLP. One approach to sentence vectors relies on co-occurrence matrices that measure how frequently words appear together in the text [10]. These vectors are typically long, sparse (i.e. they contain a lot of zeros) and, frankly, quite inefficient.
Let’s take two simple sentences, to get a better grasp of the process:
- a) Red blood cells transport oxygen.
- b) Inflamed tissues are red.
We can build a vocabulary W from these two sentences, using all the occurring words and listing them by lemma (i.e. their dictionary entry):
W = [red, blood, cell, to transport, oxygen, inflamed, tissue, to be]
We can now turn each sentence into a vector, possibly ignoring common grammatical words such as articles or prepositions, which are known in NLP as stopwords [11], by assigning the value 1 to each word of W when it is present in the sentence and 0 when it is absent (Table 1).
The first sentence now has the vectorial representation [1,1,1,1,1,0,0,0], while the second is [1,0,0,0,0,1,1,1]. Such vectors can sometimes be effective in simple NLP tasks but suffer from strong limitations. The length of W is given by the size of the vocabulary considered, which can be quite large (in the order of tens or hundreds of thousands, depending on the language and the topic considered). Another important limitation is that the representation of a sentence is independent of the order of its words, so that the sentences:
- a) Red blood cells transport oxygen
- a') Oxygen transports red blood cells
have identical representations, even though their meaning is different. For this reason, this technique is called "Bag of words" (BoW) [12].
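As a minimal illustration, the short Python sketch below reproduces the binary bag-of-words vectors above; the hand-lemmatized word lists and the variable names are our own simplifications for demonstration purposes, not part of any specific library.

```python
# Minimal bag-of-words sketch for the two example sentences (hand-lemmatized).
vocabulary = ["red", "blood", "cell", "to transport", "oxygen", "inflamed", "tissue", "to be"]

def bow_vector(lemmas, vocabulary):
    """Return 1 for each vocabulary entry present in the sentence, 0 otherwise."""
    return [1 if word in lemmas else 0 for word in vocabulary]

sentence_a = ["red", "blood", "cell", "to transport", "oxygen"]  # "Red blood cells transport oxygen."
sentence_b = ["inflamed", "tissue", "to be", "red"]              # "Inflamed tissues are red."

print(bow_vector(sentence_a, vocabulary))  # [1, 1, 1, 1, 1, 0, 0, 0]
print(bow_vector(sentence_b, vocabulary))  # [1, 0, 0, 0, 0, 1, 1, 1]
```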
It is easy to see that very common words, which have little bearing on the meaning of a sentence, can easily receive extra weight in the vectorial representation of a sentence in a document-term matrix. To mitigate this, weighting schemes have been developed, such as TF-IDF, which increases the weight of rarer words [13] while reducing the bearing of more frequent ones (to highlight the peculiarities of individual sentences).
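As a rough sketch, the code below applies TF-IDF weighting to the two example sentences using scikit-learn's TfidfVectorizer; the library choice and the tiny two-document corpus are assumptions made purely for illustration.

```python
# TF-IDF weighting of the two example sentences with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "red blood cells transport oxygen",
    "inflamed tissues are red",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)

# "red" appears in both sentences and therefore receives a lower weight than
# words that occur in only one of them (e.g. "oxygen" or "inflamed").
for word, weight in zip(vectorizer.get_feature_names_out(), tfidf.toarray()[0]):
    print(f"{word}: {weight:.3f}")
```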
Embeddings are dense vectors, and several algorithms exist to compute them [14]. An example of such an approach is the Word2vec algorithm introduced by Mikolov in 2013, which employs shallow neural networks with just one hidden layer to create word embeddings and has been widely influential in this area [15]. Word2vec relies on the fundamental premise of the distributional hypothesis, assuming that words appearing in similar contexts share similar meanings. The first step is to set a context window, i.e. how many words around a given word should be considered; it could be 1 word or more, although there is consensus that the context should not exceed 3-4 words [16]. Let's assume we wish to analyze the (probably quite uncontroversial) sentence:
- a) Neurology is a difficult but interesting topic
We could eliminate "is", "a" and "but" (because they are stopwords and presumably do not add much to the semantic content of the embedding), and we would be left with 4 semantically 'heavier' lexical elements:
- b) Neurology difficult interesting topic
For each word in the sentence, we identify its context; so, assuming a context window of 1, if we take the word "neurology", its context is "difficult"; if we consider "difficult" as the target word, its context will be "neurology" and "interesting", and so on. We can thus create tuples of (target, context) words:
(neurology, difficult),
(difficult, neurology), (difficult, interesting)
(interesting, difficult), (interesting, topic)
(topic, interesting)
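The following short sketch, written for illustration only, generates the same (target, context) tuples for an arbitrary context window; the token list and window size mirror the example above.

```python
# Generate (target, context) pairs for a given context window.
tokens = ["neurology", "difficult", "interesting", "topic"]  # stopwords already removed
window = 1

pairs = []
for i, target in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((target, tokens[j]))

print(pairs)
# [('neurology', 'difficult'), ('difficult', 'neurology'), ('difficult', 'interesting'),
#  ('interesting', 'difficult'), ('interesting', 'topic'), ('topic', 'interesting')]
```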
Now that we have tuples of target-context words, Word2Vec offers two primary methods for generating word embeddings: the Continuous Bag of Words (CBOW) and Skip-gram models. In the CBOW model, the objective is to predict a target word based on its context words within a specified window. Conversely, the Skip-gram model flips this process by aiming to predict context words given a target word. Word2vec uses a neural network architecture where the input layer corresponds to the context (or target) words, and the output layer contains its counterpart in the tuple. Training involves adjusting the weights between these layers so that the model gets better at predicting one word given the other. The word embeddings are extracted from the hidden layer's weights, representing words as dense, continuous vectors in a high-dimensional space.
Practically, the algorithm would first assign an arbitrary vector to each word (usually relying on the so-called "one-hot encoding" approach), e.g. as follows:
Neurology = [1,0,0,0]
Difficult = [0,1,0,0]
Interesting = [0,0,1,0]
Topic = [0,0,0,1]
Word2vec would then apply weights (w1 in Figure 1, for the skip-gram architecture) to the input vector; these weights are adjusted during training so that the output of the neural network becomes as similar as possible to the vector of the word to be predicted, i.e. the other word in the tuple.
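To make this concrete, here is a toy numpy sketch of a single skip-gram forward pass under the one-hot scheme above; the random weight matrices, the two-dimensional hidden layer and all variable names are illustrative assumptions, not a faithful reproduction of any particular implementation.

```python
# Toy skip-gram forward pass: one-hot input -> hidden layer -> softmax over vocabulary.
import numpy as np

vocab = ["neurology", "difficult", "interesting", "topic"]
embedding_dim = 2                                   # toy size; real models use 100-300
rng = np.random.default_rng(0)

W1 = rng.normal(size=(len(vocab), embedding_dim))   # input -> hidden weights
W2 = rng.normal(size=(embedding_dim, len(vocab)))   # hidden -> output weights

one_hot = np.zeros(len(vocab))
one_hot[vocab.index("neurology")] = 1               # target word as input

hidden = one_hot @ W1                               # selects one row of W1
scores = hidden @ W2
probabilities = np.exp(scores) / np.exp(scores).sum()  # softmax over candidate context words

print(probabilities)
# Training adjusts W1 and W2 so that the probability of the observed context word
# ("difficult") increases; the rows of W1 are then taken as the word embeddings.
```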
Once trained, Word2Vec generates word embeddings by discarding the original input vectors and taking only the weights associated with each word in the input layer of the neural network (Figure 3). The nature of the learning task means that similar weights (and therefore similar embeddings) are assigned to words that can occur interchangeably within similar contexts. For instance, the embedding of the word "cat" is closer to that of "dog" than to that of "airplane". Moreover, semantic structures tend to be preserved by the embedding, so it is typical to observe relations like
King - Man + Woman ≈ Queen
These weights are thus used to produce embeddings that encode semantic similarities and relationships among words. Word2Vec's efficiency and ability to capture semantic meaning have made it a cornerstone of natural language processing, enabling a wide range of downstream applications such as text classification, sentiment analysis, and recommendation systems [17,18,19,20].
Once word embeddings are generated, a whole sentence, or even a whole document, can be represented, e.g. by concatenating or averaging the embeddings of the words that compose it.
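As a rough, hedged sketch of this workflow, the code below trains a tiny Word2vec model with the gensim library (our choice of library, on a corpus far too small to yield meaningful embeddings) and then averages word vectors to obtain a crude sentence embedding.

```python
# Train a toy Word2vec model and average word vectors into a sentence vector.
import numpy as np
from gensim.models import Word2Vec

corpus = [
    ["neurology", "difficult", "interesting", "topic"],
    ["red", "blood", "cells", "transport", "oxygen"],
    ["inflamed", "tissues", "red"],
]

model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

# Similarity between two word embeddings learned by the model
print(model.wv.similarity("red", "blood"))

# A simple sentence embedding: the average of its word embeddings
sentence = ["red", "blood", "cells", "transport", "oxygen"]
sentence_vector = np.mean([model.wv[w] for w in sentence], axis=0)
print(sentence_vector.shape)  # (50,)
```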
Word2vec suffered from some limitations, such as its inability to represent entire documents and its dependency on pre-defined context windows. It was also unable to represent the different meanings a word can take in different contexts: the word "bank", for instance, can refer to a financial institution or to the edge of a river, but Word2Vec would assign it a single embedding [21].
Figure 3.
Word2Vec creates word embeddings by discarding the one-hot encoded input vector and retaining the weights.
5. Far Away, so Close
A literature search requires the operator to be able to discriminate among papers [29]. This usually means that the investigator has one or more questions that need to be answered. The process of choosing or discarding literature is then usually one whereby an article retrieved by the search is compared to the query [30]. To put it simply, if the topics are closely related, it is a match and the paper can be saved; otherwise it will be discarded. For an A.I. to work, it is imperative that the algorithm possess a way to measure the relatedness of a document's meaning to that of other documents [31]. Fortunately, once embeddings are created, it is possible to measure their similarity mathematically and get a sense of how similar the meanings of words or sentences are. One such method is cosine similarity, a widely used concept in NLP that measures the similarity between two non-zero vectors in a multi-dimensional space. Specifically, it quantifies the cosine of the angle between these vectors when they are represented as points in this space. In simpler terms, cosine similarity evaluates how closely aligned the directions of two vectors are, regardless of their magnitudes. This property makes it particularly valuable for comparing documents or text data, where the vectors can represent the frequency of words or other features. To compute cosine similarity, the cosine of the angle between the two vectors is calculated as the dot product of the vectors divided by the product of their magnitudes [32].
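A minimal sketch of this computation is shown below; the two short vectors are hypothetical stand-ins for real, much longer embeddings.

```python
# Cosine similarity: dot product divided by the product of the vectors' magnitudes.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_embedding = [0.12, -0.45, 0.83, 0.10]    # hypothetical query embedding
paper_embedding = [0.10, -0.40, 0.80, 0.05]    # hypothetical article embedding

print(cosine_similarity(query_embedding, paper_embedding))  # close to 1: semantically related
```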
Figure 4.
Principal Component Analysis is one algorithm for dimensionality reduction, which works by picking the components (i.e. the dimensions) along which the data show the greatest degree of variation.
The resulting similarity score ranges from -1 to 1, with 1 indicating perfect similarity (when the vectors are in the same direction), 0 indicating no similarity (when they are orthogonal), and -1 indicating perfect dissimilarity (when they are in opposite directions). Cosine similarity offers advantages over other similarity metrics, such as the Jaccard similarity [33], because it takes into account not only the presence or absence of elements but also their relative frequencies. Moreover, cosine similarity is computationally efficient and can handle high-dimensional data effectively.
Similarity between embeddings can also be represented graphically, though there is a notable obstacle: we cannot plot graphs in more than 3 dimensions, due to the physical constraints of the reality we live in, while vectors can have hundreds or even thousands of dimensions. Fortunately, there are mathematical methods to reduce the number of dimensions of a vector while preserving as much information as possible. One common approach to dimensionality reduction is Principal Component Analysis (PCA), whose mathematical basis was defined at the beginning of the 20th century [34] and which identifies orthogonal axes (principal components) along which the data vary the most. By projecting the data onto a subset of these principal components, PCA reduces dimensionality while minimizing information loss (Figure 4) [35].
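The sketch below shows what this looks like in practice with scikit-learn's PCA, applied to randomly generated stand-ins for 384-dimensional embeddings; the synthetic data and the choice of two components are assumptions made purely for illustration.

```python
# Reduce synthetic 384-dimensional "embeddings" to 2 principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 384))   # stand-in for 10 title embeddings

pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)

print(reduced.shape)                      # (10, 2)
print(pca.explained_variance_ratio_)      # share of variance captured by each component
```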
Another widely used technique is t-Distributed Stochastic Neighbor Embedding (t-SNE), which focuses on preserving the pairwise similarity structure between data points, making it particularly useful for visualization and clustering tasks [36]. Linear Discriminant Analysis (LDA) is another method, which seeks to maximize class separability in supervised learning settings, making it valuable for classification problems [53]. Non-linear dimensionality reduction techniques, such as Isomap and Locally Linear Embedding (LLE), address scenarios where data relationships are more complex and cannot be captured by linear transformations [54]. Isomap constructs a low-dimensional representation by estimating the geodesic distances between data points on a manifold, preserving the intrinsic geometry of the data. LLE, on the other hand, seeks to retain local relationships between data points, making it suitable for preserving the fine-grained structure of the data. More recently, UMAP (Uniform Manifold Approximation and Projection) has been introduced with the purpose of preserving the topological structure of high-dimensional data at lower dimensions: it constructs a high-dimensional graph representation of the data, in which each data point is connected to its nearest neighbors, and then creates a lower-dimensional graph that best approximates this structure [55].
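For illustration, the hedged sketch below applies t-SNE from scikit-learn to the same kind of synthetic embeddings; UMAP follows an analogous fit_transform pattern but requires the separate umap-learn package, so it is only indicated in a comment.

```python
# Project synthetic embeddings to 2 dimensions with t-SNE.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 384))

# perplexity must be smaller than the number of samples
tsne_coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(embeddings)
print(tsne_coords.shape)   # (50, 2)

# UMAP (from the umap-learn package) follows the same pattern:
# import umap
# umap_coords = umap.UMAP(n_components=2).fit_transform(embeddings)
```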
As an example, we can take a set of 10 titles from the literature:
- Porous titanium granules in the treatment of peri-implant osseous defects - a 7-year follow-up study;
- Reconstruction of peri-implant osseous defects: a multicenter randomized trial [56];
- Porous titanium granules in the surgical treatment of peri-implant osseous defects: a randomized clinical trial [57];
- D-plex500: a local biodegradable prolonged release doxycycline-formulated bone graft for the treatment for peri-implantitis. A randomized controlled clinical study [58];
- Surgical treatment of peri-implantitis with or without a deproteinized bovine bone mineral and a native bilayer collagen membrane: a randomized clinical trial [59];
- Effectiveness of enamel matrix derivative on the clinical and microbiological outcomes following surgical regenerative treatment of peri-implantitis. A randomized controlled trial [60];
- Surgical treatment of peri-implantitis using enamel matrix derivative, an RCT: 3- and 5-year follow-up [61];
- Surgical treatment of peri-implantitis lesions with or without the use of a bone substitute - a randomized clinical trial [62];
- Peri-implantitis - reconstructive surgical therapy [63];
and turn them into embeddings using the freely available pre-trained all-MiniLM-L6-v2 model. This model is based on a transformer architecture and can turn any sentence into a fixed-length embedding.
As an example, (part of) the 384-dimensional embedding for the first title of the list is reported below:
-2.29407530e-02, 1.49187818e-03, 9.16108266e-02, 1.75204929e-02, -8.36422145e-02, -6.10548146e-02, 8.30101445e-02, 3.96910682e-02, 1.58667186e-04, -2.62408387e-02, -7.69069120e-02, 4.60811984e-03, 8.64421800e-02, 7.87990764e-02, -4.33325134e-02, 2.49587372e-02, 2.24952400e-02, -2.90464610e-02, 3.59166898e-02, 4.27976809e-02, 7.94209242e-02, -5.87367006e-02, -6.49892315e-02, -8.70294198e-02, -5.51731326e-02, 4.95349243e-03, -3.01233679e-02, -3.23325321e-02, -1.54273247e-03, 5.24741262e-02, -7.11492598e-02, 5.16711324e-02, -4.42666225e-02, -6.38814121e-02, 6.46011531e-02, -4.63259555e-02, -9.23364013e-02, -3.56980823e-02, -9.30937752e-02, 1.27522862e-02, 5.05894162e-02, 5.07237464e-02, -9.00633708e-02, 6.91129547e-03, 4.79323231e-02, -6.69493945e-03, 1.27279535e-01, -6.33438602e-02, 2.78936550e-02, -3.34392674e-02, -6.21677283e-03, -4.32619415e-02, 5.89787960e-02, -9.10086110e-02, -2.79910862e-02, -5.80033176e-02, -5.82423434e-02, -6.41746866e-03, 4.17577056e-03, -1.90278993e-03, 6.72984421e-02, -4.39932309e-02, …. 1.52898552e-02, 9.40597132e-02, -3.60338315e-02
The only way we can plot such an embedding is by reducing its dimensions to 3 or 2, so that these can be used as Cartesian coordinates in a 3D or 2D graph, respectively. By applying PCA, the very same embedding can be reduced to:
0.5363835, 0.1012947
Each title is therefore now represented by only 2 values, which are less effective in describing all the semantic nuances of the title but can be used as Cartesian coordinates in an x, y space. In other words, these reduced embeddings can now be plotted as dots within a semantic space in a 2D scatterplot (Figure 5A), where the distance between dots is a visual representation of the semantic distance between titles, or, for an even more detailed representation of the relationships between titles, in a 3D plot (Figure 5B).
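A hedged, end-to-end sketch of this pipeline is shown below; it assumes the sentence-transformers, scikit-learn and matplotlib packages, and for brevity only a few of the titles (abbreviated here) are included.

```python
# Encode titles with all-MiniLM-L6-v2, reduce to 2D with PCA, and plot them.
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

titles = [
    "Porous titanium granules in the treatment of peri-implant osseous defects",
    "Surgical treatment of peri-implantitis lesions with or without the use of a bone substitute",
    "Peri-implantitis - reconstructive surgical therapy",
    # ... the remaining titles from the list above
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(titles)             # shape: (number of titles, 384)

coords = PCA(n_components=2).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), title in zip(coords, titles):
    plt.annotate(title[:35] + "...", (x, y), fontsize=7)
plt.show()
```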
Figure 5.
Embeddings can be reduced to two or three dimensions and used as Cartesian coordinates within a semantic space. The closer the points in the scatterplot, the closer the semantics of the words.
However, dimensionality reduction is not without challenges and trade-offs [64]. Aggressive reduction can lead to information loss, and choosing the appropriate dimensionality reduction method and the number of dimensions to retain requires careful consideration and experimentation [65]. Moreover, it may not always be suitable for all datasets; its effectiveness depends on the data's inherent structure and the specific task at hand. Nevertheless, when applied with discernment, dimensionality reduction techniques can be invaluable tools for simplifying complex data, improving model performance, and gaining insights from high-dimensional datasets in a wide range of applications [66].