Comparison of Author Keywords Clustering With and Without Term Normalization
The easiest way to compare the clustering of author keywords with and without term normalization is to use the VOSviewer program and construct a co-occurrence network of author keywords for both cases.
The co-occurrence network of author keywords based on the original values of the “Author Keywords” field was constructed with the following parameters: the program identified 16102 author keywords in total, of which 966 occur 5 or more times. Using the 400 keywords with the greatest total link strength and a minimum cluster size of 50, 5 clusters were obtained, a fragment of which is shown in Figure 1.
The lack of keyword normalization is clearly visible: for example, “enhanced oil recovery” and “eor” both appear in the blue cluster, while “coalbed methane” and “cbm” both appear in the green cluster.
To examine this network in detail, open the file AuKWsGrt4Top400Min50.json from the archive at https://app.vosviewer.com/.
The co-occurrence network of author keywords with normalized terms in the “Author Keywords” field was constructed with the following parameters: the program identified 15215 normalized author keywords in total, of which 969 occur 5 or more times. Using the 400 keywords with the greatest total link strength and a minimum cluster size of 50, 5 clusters were obtained, a fragment of which is shown in Figure 2.
There are notable similarities between Figs. 1 and 2. For example, the most commonly used terms “enhanced oil recovery” and “permeability” could still serve as the names of their clusters. However, after normalization the term “heavy oil” moved into the red cluster, whereas in the first figure it is in the purple cluster.
To examine this network in detail, open the file AuKWsGrt4Top400Min50norm.json from the archive at https://app.vosviewer.com/.
Let's compare terms in closely related clusters of author keywords obtained by the above two methods.
Note. In the table headings, AuKWs abbreviates “Author Keywords” and N is the number of occurrences of a given author keyword.
Table 1.
Comparison of two clusters, conventionally labeled as “permeability”.
| Normalized AuKWs | N | Original AuKWs | N |
|---|---|---|---|
| permeability | 392 | permeability | 392 |
| numerical simulation | 250 | numerical simulation | 245 |
| hydraulic fracturing | 187 | hydraulic fracturing | 187 |
| coalbed methane | 125 | shale gas | 104 |
| porous media | 80 | coalbed methane | 99 |
| enhanced geothermal system | 71 | horizontal well | 88 |
| shale | 69 | porous media | 80 |
| fracture | 66 | shale | 63 |
| natural gas hydrate | 51 | stress sensitivity | 63 |
| two phase flow | 44 | fracture | 48 |
| natural fracture | 42 | natural gas hydrate | 48 |
| tight gas reservoir | 42 | adsorption | 45 |
| depressurization | 41 | tight oil reservoirs | 45 |
| effective stress | 40 | depressurization | 41 |
| gas hydrate | 38 | effective stress | 40 |
| geothermal energy | 35 | hydraulic fracture | 40 |
| temperature | 32 | two-phase flow | 37 |
| mathematical model | 31 | geothermal energy | 35 |
| stimulated reservoir volume | 30 | enhanced geothermal system | 34 |
| fines migration | 29 | temperature | 32 |
| fracture propagation | 29 | mathematical model | 31 |
| gas production | 28 | apparent permeability | 29 |
| methane hydrate | 28 | fines migration | 29 |
| fracturing fluid | 27 | tight gas reservoir | 29 |
| pore network model | 27 | fracture propagation | 28 |
| coalbed methane reservoir | 26 | gas hydrate | 28 |
| modeling | 26 | gas production | 28 |
| supercritical carbon dioxide | 26 | stimulated reservoir volume | 28 |
| finite element method | 25 | low permeability reservoir | 27 |
| permeability anisotropy | 24 | methane hydrate | 26 |
The terms “permeability” and “hydraulic fracturing”, which have a single spelling, occur in both columns with the same frequency, and “numerical simulation” differs only slightly (250 vs. 245). However, “coalbed methane” occurs 125 times in the left column and only 99 times in the right column. Similarly, “two phase flow” occurs 44 times in the left column, while “two-phase flow” occurs 37 times in the right column, and so on.
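The effect described above, where several spellings of one term are merged and their counts add up, can be illustrated with a minimal sketch. It is not the normalization procedure used in this article: the abbreviation map, the naive plural rule, and the toy counts are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical abbreviation map for illustration; the article keeps its real
# list in thesaurus_terms.txt, whose contents are not reproduced here.
ABBREVIATIONS = {"cbm": "coalbed methane", "eor": "enhanced oil recovery"}


def normalize(keyword: str) -> str:
    """Lower-case, turn hyphens into spaces, expand whole-term abbreviations,
    and strip a trailing plural "s" from the last word (a naive rule)."""
    term = " ".join(keyword.lower().replace("-", " ").split())
    term = ABBREVIATIONS.get(term, term)
    words = term.split()
    if not words:
        return ""
    last = words[-1]
    # Skip short words and endings such as "ss", "is", "us" (stress, analysis).
    if len(last) > 3 and last.endswith("s") and not last.endswith(("ss", "is", "us")):
        words[-1] = last[:-1]
    return " ".join(words)


def merge_counts(raw_counts: dict) -> Counter:
    """Sum the occurrences of all spellings that normalize to the same term."""
    merged = Counter()
    for keyword, n in raw_counts.items():
        merged[normalize(keyword)] += n
    return merged


# Toy counts, not taken from the studied dataset.
raw = {"two-phase flow": 3, "two phase flow": 2, "CBM": 1,
       "coalbed methane": 4, "tight reservoirs": 2, "tight reservoir": 1}
merged = merge_counts(raw)
print(merged.most_common())
```

A rule-based plural stripper like this one will mishandle some words, so in practice a curated replacement list or a proper lemmatizer would likely be preferable.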
Table 2.
Comparison of two clusters, conventionally labeled as “reservoir simulation”.
| Normalized AuKWs | N | Original AuKWs | N |
|---|---|---|---|
| reservoir simulation | 120 | relative permeability | 116 |
| relative permeability | 116 | reservoir simulation | 115 |
| carbon dioxide storage | 96 | machine learning | 90 |
| machine learning | 91 | co2 storage | 85 |
| heterogeneity | 79 | heterogeneity | 79 |
| carbon dioxide sequestration | 72 | co2 sequestration | 66 |
| petroleum engineering | 65 | petroleum engineering | 65 |
| carbon dioxide injection | 61 | oil/gas reservoirs | 58 |
| oil/gas reservoir | 58 | sensitivity analysis | 57 |
| sensitivity analysis | 57 | capillary pressure | 55 |
| artificial neural network | 55 | viscosity | 52 |
| capillary pressure | 55 | waterflooding | 52 |
| carbon dioxide enhanced oil recovery | 52 | co2 injection | 50 |
| viscosity | 52 | optimization | 50 |
| waterflooding | 52 | asphaltene | 44 |
| optimization | 50 | gas injection | 43 |
| asphaltene | 45 | reservoir characterization | 43 |
| in situ combustion | 43 | artificial neural network | 41 |
| reservoir characterization | 43 | carbon dioxide | 40 |
| history matching | 40 | deep learning | 38 |
| deep learning | 38 | history matching | 35 |
| recovery factor | 29 | co2 | 32 |
| artificial intelligence | 28 | asphaltene precipitation | 31 |
| genetic algorithm | 28 | unconventional reservoir | 31 |
| minimum miscibility pressure | 28 | tight reservoirs | 28 |
| bitumen | 27 | bitumen | 27 |
| carbon capture utilization storage | 27 | ccus | 27 |
| numerical modeling | 27 | numerical modeling | 27 |
| unconventional petroleum | 27 | oil recovery factor | 27 |
| crude oil | 26 | unconventional petroleum | 27 |
In the original records, the term “reservoir simulation” appears in both the singular and the plural, which explains the difference between the columns (120 vs. 115). The difference is even larger for “artificial neural network” (55 vs. 41), which also appears in both the singular and the plural.
On the other hand, terms with the single spelling “heterogeneity,” “viscosity,” “deep learning,” “numerical modeling,” and “bitumen” occur the same number of times in both columns.
An example of a significant difference in where a term occurs is “gas injection”, which appears in the right-hand column of Table 2 and in the left-hand column of Table 4.
The term “tight reservoir” does not appear in the cluster of normalized keywords, but it appears in the right-hand column in the plural “tight reservoirs”.
The term “waterflooding” appears 52 times in both columns of this table and in the spelling “water flooding” in Table 3. This term was not normalized in the process of preparing the author keywords.
In the right part of the table there are different spellings of the term: “enhanced oil recovery”, “eor”, and “enhanced oil recovery (eor)”, which can significantly affect the clustering results due to the wide use of this term.
The importance of writing a term with or without a hyphen can also be seen in this table: “low-permeability reservoir” occurs 114 times in the left column and “low-permeability reservoirs” occurs 29 times in the right column, while “low permeability reservoir” occurs 27 times, but in Table 1.
The term “heavy oil reservoir” occurs 68 times in the left column, and “heavy oil reservoir” 44 times and “heavy oil reservoirs” 22 times in the right column.
Reduction to the singular also matters: “nanoparticle” occurs 54 times in the left column, and “nanoparticles” 36 times in the right column.
Table 4.
Comparison of two clusters, conventionally labeled as “tight oil reservoir”.
| Normalized AuKWs | N | Original AuKWs | N |
|---|---|---|---|
| tight oil reservoir | 116 | heavy oil | 192 |
| horizontal well | 107 | tight oil reservoir | 69 |
| shale gas | 107 | co2 flooding | 52 |
| fractured reservoir | 81 | imbibition | 49 |
| unconventional reservoir | 69 | heavy oil reservoir | 44 |
| stress sensitivity | 66 | simulation | 41 |
| low permeability | 58 | low-permeability reservoir | 40 |
| shale reservoir | 52 | unconventional reservoirs | 38 |
| hydraulic fracture | 48 | sagd | 34 |
| adsorption | 45 | air injection | 32 |
| gas injection | 43 | geomechanics | 32 |
| improved oil recovery | 33 | fractured reservoirs | 30 |
| geomechanic | 32 | reservoir heterogeneity | 30 |
| fracture network | 31 | threshold pressure gradient | 30 |
| threshold pressure gradient | 31 | multiphase flow | 29 |
| shale gas reservoir | 30 | oil reservoir | 28 |
| apparent permeability | 29 | recovery | 26 |
| multiphase flow | 29 | steam injection | 25 |
| carbon dioxide huff n puff | 27 | water injection | 24 |
| diffusion | 25 | fractal theory | 23 |
| pressure transient analysis | 25 | analytical model | 22 |
| fractal theory | 23 | pressure transient analysis | 22 |
| huff n puff | 23 | heat transfer | 20 |
| flowback | 19 | steam flooding | 20 |
| production performance | 19 | flowback | 19 |
| dual porosity | 18 | numerical model | 19 |
| gas condensate | 18 | production performance | 19 |
| non darcy flow | 18 | thermal recovery | 19 |
| flow regime | 16 | water cut | 19 |
| rate transient analysis | 16 | water saturation | 19 |
The most important thing to note in this table is that “heavy oil” is the most frequent term in the right column, whereas among the normalized keywords it appears in the left column of Table 3.
The terms “fractured reservoir” and “unconventional reservoir” occur in the singular in the left column, but in the plural in the right column.
The term “shale gas reservoir” does not appear in the right-hand column.
Table 5.
Comparison of two clusters, conventionally labeled as “shale oil” and “porosity”.
| Normalized AuKWs | N | Original AuKWs | N |
|---|---|---|---|
| nuclear magnetic resonance | 156 | shale oil | 137 |
| shale oil | 137 | porosity | 114 |
| porosity | 114 | ordos basin | 113 |
| ordos basin | 113 | pore structure | 101 |
| pore structure | 101 | tight oil | 92 |
| tight oil | 94 | nmr | 74 |
| tight sandstone | 77 | tight sandstone | 70 |
| tight reservoir | 70 | nuclear magnetic resonance | 69 |
| diagenesis | 66 | diagenesis | 66 |
| spontaneous imbibition | 62 | spontaneous imbibition | 61 |
| reservoir quality | 58 | reservoir quality | 58 |
| reservoir | 56 | fractured reservoir | 51 |
| shale oil reservoir | 41 | reservoir | 50 |
| junggar basin | 40 | tight reservoir | 42 |
| tight sandstone reservoir | 37 | junggar basin | 40 |
| sandstone | 34 | sandstone | 33 |
| biomarker | 33 | pore size distribution | 31 |
| pore size distribution | 33 | sichuan basin | 31 |
| sichuan basin | 31 | tarim basin | 28 |
| tarim basin | 28 | yanchang formation | 28 |
| yanchang formation | 28 | shale reservoir | 27 |
| controlling factor | 27 | carbonate | 26 |
| source rock | 26 | shale oil reservoir | 24 |
| oil shale | 23 | tight sandstone reservoir | 24 |
| lithofacy | 22 | controlling factors | 23 |
| reservoir rock | 22 | oil shale | 22 |
| fractal dimension | 21 | fractal dimension | 21 |
| reservoir characteristic | 21 | lithofacies | 21 |
| sandstone reservoir | 21 | reservoir characteristics | 21 |
| songliao basin | 20 | reservoir rock | 20 |
This table is interesting because leaving abbreviations unexpanded (nmr → nuclear magnetic resonance) results in a different most frequent term in the left and right columns.
This table lists the basins “Ordos Basin”, “Junggar Basin”, “Sichuan Basin”, “Tarim Basin”, and “Songliao Basin”, all of which are located in China. This is consistent with the large number of Chinese publications in the studied sample of bibliometric records.
Note: Keep in mind that the above tables only summarize the 30 most frequent terms from the full keyword occurrence tables, so the term occurrence amounts may not match.
The comparison of the terms in the tables above illustrates how strongly term normalization affects the results of term clustering.
The importance of term normalization will become even more evident in the next section.
Using the Scimago Graphica program to identify promising research tasks
Within the framework of this article, the promising research tasks are assumed to be described by the author's keywords that occur more often in new publications that have a high citation rate and connection with other terms. Technically, this can be visualized as a slice of the network of co-occurrence of terms represented in the coordinates “Average publication year of the documents in which a keyword occurs” and “Average normalized number of citations”.
Let us explain the concepts used by quoting from the VOSviewer manual [2]: “Avg. norm. citations. The average normalized number of citations received by the documents in which a keyword or a term occurs.” “The normalized number of citations of a document equals the number of citations of the document divided by the average number of citations of all documents published in the same year and included in the data that is provided to VOSviewer.”
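The two definitions quoted above translate directly into a short computation. The sketch below applies them to a handful of toy records (hypothetical values, not from the studied dataset) and returns, for each keyword, the two coordinates used in this section: the average publication year and the average normalized number of citations. The record layout and function name are assumptions.

```python
from collections import defaultdict
from statistics import mean

# Toy records (hypothetical): publication year, citation count, keywords.
docs = [
    {"year": 2020, "citations": 30, "keywords": ["methane hydrate", "permeability"]},
    {"year": 2020, "citations": 10, "keywords": ["permeability"]},
    {"year": 2022, "citations": 4, "keywords": ["methane hydrate"]},
    {"year": 2022, "citations": 4, "keywords": ["deep learning"]},
]


def keyword_scores(docs):
    """Return {keyword: (avg. pub. year, avg. norm. citations)}."""
    # Average number of citations of all documents published in the same year.
    by_year = defaultdict(list)
    for d in docs:
        by_year[d["year"]].append(d["citations"])
    year_mean = {year: mean(cites) for year, cites in by_year.items()}

    norm = defaultdict(list)
    years = defaultdict(list)
    for d in docs:
        avg = year_mean[d["year"]]
        # Normalized citations: document citations divided by the year average.
        norm_citations = d["citations"] / avg if avg else 0.0
        for kw in set(d["keywords"]):
            norm[kw].append(norm_citations)
            years[kw].append(d["year"])
    return {kw: (mean(years[kw]), mean(norm[kw])) for kw in norm}


scores = keyword_scores(docs)
for kw, (year, anc) in sorted(scores.items()):
    print(f"{kw}: avg. pub. year = {year}, avg. norm. citations = {anc:.2f}")
```

Dividing by the same-year average is what makes recent documents comparable with older ones that have had more time to collect citations.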
Note: For a better understanding of the content of the publications selected as examples, their titles and my abbreviated version of the abstract are given in the body of the paper; they are enclosed in quotation marks to emphasize that this is not plagiarism but citation.
Figure 3 shows the term co-occurrence network constructed for the normalized author keywords. The following constraints were applied in Scimago Graphica: clusters 1, 2, and 3 of the five obtained by VOSviewer for normalized author keywords; total_link_strength >= 44; avg._norm._citations >= 1; degree >= 25. The degree was calculated for the pre-built network; the data from this network were then exported, the co-occurrence network was rebuilt from the exported data, and the degree >= 25 filter was applied to limit the number of terms displayed in the graph. This filter keeps only terms that are well connected to other terms in the graph, which is important for identifying a local topic described by a number of co-occurring terms. An analogous result can be obtained by applying the “link strength” and “total link strength” filters to the original graph, but all these parameters are calculated for the network in question, so the resulting values may differ. Here the degree filter was chosen to demonstrate the capabilities of Scimago Graphica. The filters display only the terms most relevant to the analysis, which improves the readability of the graph in the publication.
Definition: The degree of a node is the number of links it has with other nodes in the network.
According to the VOSviewer manual, the total link strength attribute indicates “the total strength of the links of an item with other items.”
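Both attributes can be computed from a list of weighted links. The following sketch uses toy links (hypothetical strengths) and plain Python; it does not reproduce how VOSviewer or Scimago Graphica compute or filter these values internally, and the function names are assumptions.

```python
from collections import defaultdict

# Toy weighted co-occurrence links (hypothetical): (term_a, term_b, link strength).
links = [
    ("methane hydrate", "gas production", 5),
    ("methane hydrate", "depressurization", 3),
    ("gas production", "depressurization", 2),
    ("depressurization", "fracture", 1),
]


def node_attributes(links):
    """Return (degree, total_link_strength) dictionaries keyed by term."""
    degree = defaultdict(int)
    total_link_strength = defaultdict(int)
    for a, b, weight in links:
        for node in (a, b):
            degree[node] += 1                    # one more link for this node
            total_link_strength[node] += weight  # add the strength of that link
    return dict(degree), dict(total_link_strength)


def filter_by_degree(links, min_degree):
    """Keep only links whose two end nodes both have degree >= min_degree."""
    degree, _ = node_attributes(links)
    keep = {node for node, d in degree.items() if d >= min_degree}
    return [(a, b, w) for a, b, w in links if a in keep and b in keep]


degree, tls = node_attributes(links)
filtered = filter_by_degree(links, min_degree=2)
print(degree, tls, filtered, sep="\n")
```

After filtering, the degrees recomputed on the reduced network can be lower than before, which is one reason why values calculated for different networks may differ.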
Let us consider examples of identifying publications revealing the topics of two clusters (3 and 1) with terms having a high value of the average normalized number of citations.
Third cluster. The results of a sequential search (search within the results found) for terms appearing in the text fields of all records: methane hydrate → 49 records AND gas production → 33 records AND depressurization → 20 AND fracture → 4. Of the four papers, the most cited is “Enhancement of gas production from methane hydrate reservoirs by the combination of hydraulic fracturing and depressurization method” [15]. According to ScienceDirect, this publication had been cited 171 times as of February 12, 2025. Short summary of the abstract: “A fracturing and depressurisation method is proposed to improve the efficiency of gas production from methane hydrate (MH) reservoirs. A model of a fractured MH reservoir was created and the gas production behaviour under different temperature conditions was studied. The effect of increasing fracture zone size and permeability on gas production rate was more prominent in the early stage of depressurisation for the high-temperature reservoir, while the increase in overall gas production was minimal in the low-temperature reservoirs.”
First cluster. The results of a sequential search (search within the results found) for terms appearing in the text fields of all records: nanofluid → 120 records AND nanoparticle → 93 AND carbonate → 33 AND enhanced oil recovery → 24 AND wettability alteration → 16. The most cited paper is “Comparative study of using nanoparticles for enhanced oil recovery: Wettability alteration of carbonate rocks” [16]; this publication had been cited 261 times as of February 12, 2025. Short summary of the abstract: “The effects of various nanofluids based on zirconium dioxide (ZrO2), calcium carbonate (CaCO3), titanium dioxide (TiO2), silicon dioxide (SiO2), magnesium oxide (MgO), aluminium oxide (Al2O3), cerium oxide (CeO2) and carbon nanotubes (CNT) on the wettability of carbonate rocks were investigated for enhanced oil recovery from oil reservoirs. The results of spontaneous imbibition tests and core flooding experiments confirm the active role of CaCO3 and SiO2 nanoparticles in enhanced oil recovery. It is shown that both irreducible water saturation and inlet capillary pressure increased after treatment with CaCO3 nanofluid.”
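The sequential search used in both examples narrows the set of records one term at a time. A minimal sketch is shown below; the article does not specify how the search was implemented, so case-insensitive substring matching over concatenated text fields is an assumption, and the records are invented for illustration.

```python
def sequential_search(records, terms):
    """Filter records term by term ("search in found").

    records: list of strings, each holding the concatenated text fields of a record.
    terms:   search terms applied in order.
    Returns the surviving records and a trace of (term, remaining count).
    """
    found = list(records)
    trace = []
    for term in terms:
        needle = term.lower()
        # Assumption: case-insensitive substring match.
        found = [text for text in found if needle in text.lower()]
        trace.append((term, len(found)))
    return found, trace


# Invented records for illustration only.
records = [
    "Gas production from methane hydrate by depressurization and hydraulic fracture",
    "Methane hydrate gas production under depressurization",
    "Methane hydrate saturation in clayey sediments",
    "Permeability of fractured coal",
]
found, trace = sequential_search(
    records, ["methane hydrate", "gas production", "depressurization", "fracture"]
)
print(" AND ".join(f"{term} → {count}" for term, count in trace))
```

Because each step searches only within the previous result, the counts can never increase along the chain, and the final count is the number of records that contain all the terms.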
Figure 4 shows a graph constructed under the same conditions as the previous one, but for the original author's keywords.
The lack of normalization of term spelling affected the clustering. This is particularly noticeable for the second cluster in this figure: the term “methane hydrate” does not appear in it, although according to the previous figure it occurs in highly cited articles. The third cluster contains the terms “nanoparticles” and “nanofluids”, but not the term “carbonates”, which is present in the previous figure and occurs in more recent publications. The first cluster in this figure does not contain the term “deep learning”, which occurs in newer publications with good citation rates. The interactive web page provided in the archive shows that two spellings, “enhanced oil recovery” and “enhanced oil recovery (eor)”, are both present in the graph. This example illustrates well how the pre-processing of text fields affects the results obtained.
Co-occurrence network analysis of terms in the text of annotations
The VOSviewer program can build a co-occurrence network not only for author and index keywords, but also for terms present in abstract texts. VOSviewer generates a list of noun phrases and converts the terms to the singular; more details can be found in the program's manual.
This approach to defining key terms works well, but abstracts of scientific papers quite often contain abbreviations of the terms most commonly used in the given research area, e.g. EOR → Enhanced Oil Recovery.
Therefore, the previously compiled list of abbreviations (thesaurus_terms.txt file) was used to construct a network of co-occurrence of terms in the annotation texts. It should be noted that VOSviewer performs whole term replacement, i.e. if in file thesaurus_terms.txt we specified that “eor → enhanced oil recovery”, the term “co2 eor process” will not be replaced, but the individual term “eor” will be replaced by “enhanced oil recovery”. Going beyond this example, it is possible to perform an additional iteration of term normalization by analyzing the terms used to build the network and expanding the list of replacements in thesaurus_terms.txt. Another way is to perform term substitution in the original annotation text (or titles).
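The difference between whole-term replacement and substitution inside a compound term can be sketched as follows. The dictionary stands in for the replacement list; its two entries and both function names are assumptions, and the second function is only one possible way to approximate substitution in the original text.

```python
# Hypothetical replacement list standing in for thesaurus_terms.txt.
thesaurus = {"eor": "enhanced oil recovery", "ecbm": "enhanced coalbed methane"}


def replace_whole_term(term, thesaurus):
    """Replace a term only when the entire term matches a thesaurus label."""
    return thesaurus.get(term, term)


def replace_tokens(term, thesaurus):
    """Alternative: replace every word of the term that matches a label."""
    return " ".join(thesaurus.get(token, token) for token in term.split())


for term in ["eor", "co2 eor process", "co2 ecbm process"]:
    print(f"{term!r}: whole-term -> {replace_whole_term(term, thesaurus)!r}, "
          f"token-level -> {replace_tokens(term, thesaurus)!r}")
```

With whole-term matching, a compound term such as “co2 eor process” would need its own entry in the replacement list, which is what an additional normalization iteration would add.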
In the author's observation, the title and abstract text fields are underused in many publications, and keyword networks are constructed far more often. At the same time, building the term co-occurrence network from the title and abstract fields gives, in my opinion, clusters that are more coherent and fewer in number.
The point is that subject matter experts assess the need for further study of an article not by its keywords, but by its title and abstract. The abstract and title fields in the dataset we studied were filled in completely, while the author's keywords had 1470 blank fields out of 8051 total fields.
As another example, an export from OnePetro includes titles and abstracts, but not keywords.
The situation is even worse for The Lens database: a query with the filters Year Published = (2024 - 2024) and Subject = (Geochemistry and Petrology) returned 12,482 scholarly works. Of these 12,482 records, a little more than half (6851) have no abstract, and almost all Keywords fields are empty (11,911).
Another example is using the “Publish or Perish” program, which can query a large number of resources, including OpenAlex, and the saved files have Title and Abstract fields, but no keywords.
Also, RSS feeds from topical sites have header and body fields, but not keyword fields.
For example, on the old Springer site (link.springer.com/search), you can use RSS with the fields: title and description (which includes an abstract), but not keywords.
From the text fields available in each of these examples, it is possible to build a term co-occurrence network and evaluate the topicality of the collected materials using VOSviewer.
In accordance with the above, Figure 5 presents the co-occurrence network of terms contained in the abstract texts, with abbreviations expanded.
First Cluster (Red)
Most frequently appearing terms (Occurrences score): accuracy (499), stress (461), algorithm (384), horizontal well (326), hydraulic fracture (292), reservoir model (288), coal (280).
Terms most frequently used in new publications (Avg. pub. year score): marine hydrate reservoir (2023.1), extreme gradient (2022.6429), seepage field (2022.6), test set (2022.5455), extreme gradient boosting (2022.5), long short term memory (2022.4375), hbs (2022.4)
Here: hbs → hydrate-bearing sediment (HBS). In abstracts, such a term is usually spelled out at first mention and then used only as an abbreviation; it was not included in the list of abbreviations compiled from the author keywords.
Terms most frequently used in highly cited publications (Avg. norm. citations score): rock strength (3.1761), slip length (2.8046), co2 ecbm process (2.5662), gas property (2.5647), ugs (2.4861), hydrate bearing layer (2.4357), water permeability (2.4206)
Here: ugs → underground gas storage (UGS). In abstracts, such a term is usually spelled out at first mention and then used only as an abbreviation; it was not included in the list of abbreviations compiled from the author keywords.
Here: ecbm → enhanced coalbed methane, this term is present in the dictionary of abbreviations compiled using the author's keywords, but it is part of a compound term, so it has not been modified. This feature should be taken into account when compiling the thesaurus_terms.txt file when performing term clustering based on title and abstract texts.
The abbreviations in the following clusters are similar and are not explained further.
Second cluster (green).
Most frequently appearing terms (Occurrences score): oil recovery (1269), enhanced oil recovery (878), concentration (670), crude oil (630), viscosity (560), wettability (422), interfacial tension (410).
Terms most frequently used in new publications (Avg. pub. year score): dispersion stability (2022.6667), promising alternative (2022.4), microfluidic experiment (2022.25), imbibition efficiency (2022.2308), oilwater interfacial tension (2022.1818), md simulation (2022.0833), experimental finding (2022). Here: md → molecular dynamics (MD).
Terms most frequently used in highly cited publications (Avg. norm. citations score): potential solution (2.9926), disjoining pressure (2.8678), viscoelastic property (2.5119), pam (2.4066), nanotechnology (2.1721), early breakthrough (2.1437), co2 foam (2.1364). Here: pam → polyacrylamide (PAM).
Third cluster (blue).
Most frequently appearing terms (Occurrences score): pore (749), content (487), sandstone (460), basin (450), hydrocarbon (434), dissolution (410), pore structure (348).
Terms most frequently used in new publications (Avg. pub. year score): development efficiency (2022.7857), shale oil resource (2022.7727), qingshankou formation (2022.3333), seepage capacity (2022.3214), basalt (2022.2308), movable oil (2022.1579), baikouquan formation (2022.125).
Terms most frequently used in highly cited publications (Avg. norm. citations score): macro pore (2.324), winland (2.2597), pore geometry (2.198), pore distribution (2.185), burial history (2.1773), ct scanning (2.1137), host rock (2.1036). Here: ct → computed tomography (CT).
Fourth cluster (khaki).
Most frequently appearing terms (Occurrences score): heavy oil reservoir (394), heavy oil (326), steam (246), oil recovery factor (197), oil viscosity (188), co2 flooding (158), solvent (154).
Terms most frequently used in new publications (Avg. pub. year score): underground hydrogen storage (2023.2609), hydrogen production (2023.1333), formation energy (2022.6429), nuclear magnetic resonance technology (2022.6), recovery degree (2022.5882), co2 storage efficiency (2022.4615), cushion gas (2022.4545).
Terms most frequently used in highly cited publications (Avg. norm. citations score): cushion gas (3.6855), underground hydrogen storage (3.5665), hydrogen storage (3.5626), hydrogen production (2.897), good potential (2.7012), co2 geological storage (2.317), residual trapping (2.2258).
Explanation: the term “underground hydrogen storage” is more common in new, highly cited publications, but such publications are still few, so this term does not appear in Figure 6 and Figure 7, where a restriction on term occurrences is applied.
Application of the Scimago Graphica Program To Identify Promising Research Tasks Using Different Restrictive Filters
Below are the graphs plotted with Scimago Graphica software in the coordinates “Average normalized number of citations” and “Average publication year of the documents in which a keyword occurs” using different restriction filters.
Figure 6 is plotted using the following sampling constraints: cluster → 1,2,3; total_link_strength → 100; occurrences → 100; avg. norm. citations → 1.
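Such a set of constraints amounts to a row filter over the table of terms, followed by a projection onto the two plot coordinates. The sketch below assumes that each constraint is a lower bound (">="), as in the constraints listed for Figure 3; the field names, term labels, and values are hypothetical.

```python
# Hypothetical term table; labels and values are invented for illustration.
terms = [
    {"label": "term a", "cluster": 1, "total_link_strength": 350,
     "occurrences": 178, "avg_pub_year": 2021.2, "avg_norm_citations": 1.4},
    {"label": "term b", "cluster": 2, "total_link_strength": 90,
     "occurrences": 240, "avg_pub_year": 2020.1, "avg_norm_citations": 1.2},
    {"label": "term c", "cluster": 4, "total_link_strength": 500,
     "occurrences": 300, "avg_pub_year": 2019.5, "avg_norm_citations": 1.8},
    {"label": "term d", "cluster": 3, "total_link_strength": 120,
     "occurrences": 110, "avg_pub_year": 2021.9, "avg_norm_citations": 0.9},
]


def select_terms(rows, clusters, min_tls, min_occurrences, min_norm_citations):
    """Keep rows in the given clusters that meet every lower-bound constraint."""
    return [
        row for row in rows
        if row["cluster"] in clusters
        and row["total_link_strength"] >= min_tls
        and row["occurrences"] >= min_occurrences
        and row["avg_norm_citations"] >= min_norm_citations
    ]


def plot_points(rows):
    """Project rows onto (avg. pub. year, avg. norm. citations, label)."""
    return [(r["avg_pub_year"], r["avg_norm_citations"], r["label"]) for r in rows]


selected = select_terms(terms, clusters={1, 2, 3}, min_tls=100,
                        min_occurrences=100, min_norm_citations=1.0)
print(plot_points(selected))
```

Lowering the occurrence threshold while raising the citation threshold changes which rows survive, which is how different slices of the same network are obtained.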
The first-cluster term “nanopore” has the highest number of links with the terms “shale oil” and “pore size” of the third cluster and the term “molecula” of the second cluster. A sequential search for the occurrence of terms in the text fields of all records yields: nanopore → 178, shale oil → 49, pore size → 14, molecula → 4. Of the four articles, the most relevant is “Molecular dynamics simulations of oil transport through inorganic nanopores in shale” [17], which had been cited 390 times as of February 14, 2025. Short summary of the abstract: “The transport of liquid hydrocarbon through nanopores of inorganic minerals is crucial for the development of fluid-rich shale reservoirs and for understanding oil migration from deep-seated source rocks with extremely low permeability. The authors used non-equilibrium molecular dynamics to study the flow of octane in quartz slits under pressure based on the Navier-Stokes equation with slip boundary and viscosity corrections”.
Note: the Leiden algorithm attempts to optimize modularity when extracting communities from networks [18]. “The modularity is, up to a multiplicative constant, the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random” [19]. The current example is therefore interesting because well-connected terms can belong to different clusters, since cluster membership is determined by the whole set of links within the cluster.
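The quoted definition of modularity can be made concrete with a small calculation. The sketch below evaluates the plain, unweighted modularity of a given partition; it is not the Leiden algorithm and not the clustering routine of any program mentioned here, and the toy graph (two triangles joined by one edge) is invented.

```python
from collections import defaultdict


def modularity(edges, community):
    """Unweighted modularity: sum over communities of
    (edges inside / m) - (sum of member degrees / 2m) ** 2."""
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for a, b in edges:
        degree_sum[community[a]] += 1
        degree_sum[community[b]] += 1
        if community[a] == community[b]:
            internal[community[a]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph: two triangles connected by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {node: "A" for node in range(6)}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(q_two, q_one)
```

Nodes 2 and 3 are directly linked yet fall into different groups in the better-scoring partition, which mirrors the observation that well-connected terms can end up in different clusters.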
The fact that the terms “nanopore” and “nanoparticle” belong to different clusters is also interesting; they are close in spelling but have different contexts.
The term “nanoparticle” often occurs together with the terms “enhanced oil recovery”, “interfacial tension”, and “wettability”, which belong to the same cluster.
A sequential search for the occurrence of terms in the text fields of all records yields: nanoparticle → 248, enhanced oil recovery → 145, interfacial tension → 69, wettability → 53, rock surface → 9.
The fact that a sequential search for the five terms yielded 9 results indicates the stability of the topic they are describing.
Of the nine articles, the most relevant is “Adsorption analysis of natural anionic surfactant for enhanced oil recovery: The role of mineralogy, salinity, alkalinity and nanoparticles” [20], which has 228 citations.
Short summary of annotation: “Anionic surfactants are widely used as effective chemical reagents for enhanced oil recovery. This study deals with the equilibrium adsorption and kinetics of anionic surfactant synthesised from soapnut fruit on sandstone, carbonate and bentonite clay, which are reservoir rocks. The presence of alkali and nanoparticles reduces the loss of surfactant during adsorption and has a synergistic effect in reducing the interfacial tension, which is favourable for the application of surfactant in oil recovery.”
Figure 7 is plotted using the following sampling constraints: cluster → 1 and 3; total_link_strength → 100; occurrences → 20; avg. norm. citations → 1.5.
This graph shows only two of the clusters shown in Figure 5: clusters 1 and 3. A significant difference from Figure 6 is that this data slice includes more rarely occurring terms (occurrences → 20 vs. occurrences → 100) and that the threshold for the average normalized citations is raised (avg. norm. citations → 1.5 vs. avg. norm. citations → 1). Also note the slight shift of the right-most value on the average publication year axis (2022.8 vs. 2021.6).
Thus, this slice can be seen as a reflection of the more promising topics: there are not many publications yet, but they are more cited and published more recently.
For the sake of brevity, let us consider only one new topic that appeared in the first cluster and is described by the terms: methane hydrate, hydrate saturation, hydrate bearing sediment.
A sequential search for the occurrence of terms in the text fields of all records yields: methane hydrate → 49, hydrate saturation → 24, hydrate bearing sediment → 5. Note that the number of publications is lower than in the examples in the previous figure.
The publication “Experimental study on the gas phase permeability of methane hydrate-bearing clayey sediments” [21] has been cited 74 times. Short summary of the abstract: “The permeability of hydrate-bearing sediments is one of the important parameters affecting the rate of gas production in hydrate reservoirs. In this paper, a series of experiments were conducted to investigate the gas-phase permeability of kaolin clay under different hydrate saturation. The results showed that the gas-phase permeability of kaolin clay firstly decreases and then increases with increasing hydrate saturation. The gas-phase permeability of hydrate-saturated clay samples at high effective axial stress is lower than that at low stress, which is due to pore space compaction.”
It was interesting to find the area of application of the “extreme gradient boosting” method. 18 such publications were found.
The most interesting article [22] has been cited 86 times. However, none of the 18 publication abstracts contained the term “hydrate”, suggesting that the inclusion of the term “extreme gradient boosting” in this cluster is due to co-occurrence with other terms, such as “permeability”, but not “hydrate”. The term “permeability” may be relevant in the context of records relating to “cbm recovery”, which in turn is related to the term “methane hydrate”.
This example shows that a term co-occurrence network, although a clear method for identifying relevant or promising research tasks, is not always convenient for extending a literature collection by composing queries from co-occurring terms. For query generation, directly determining the co-occurrence of terms (with algorithms such as Apriori) may be more promising than relying on the assignment of terms to a single cluster.
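As an illustration of the direct approach, the following is a minimal Apriori-style sketch that counts how many records contain each set of terms and keeps the sets that reach a minimum support. It was not used in this study; the records, the threshold, and the function name are assumptions.

```python
from collections import Counter
from itertools import combinations


def apriori(transactions, min_support, max_size=3):
    """Return {frozenset of terms: number of records containing all of them}
    for every term set of size <= max_size with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    counts = Counter(item for t in transactions for item in t)
    frequent = {frozenset([item]): n for item, n in counts.items() if n >= min_support}
    result = dict(frequent)
    size = 2
    while frequent and size <= max_size:
        items = sorted({item for itemset in frequent for item in itemset})
        # Apriori pruning: every (size - 1)-subset of a candidate must be frequent.
        candidates = [
            frozenset(c) for c in combinations(items, size)
            if all(frozenset(sub) in frequent for sub in combinations(c, size - 1))
        ]
        frequent = {}
        for candidate in candidates:
            support = sum(1 for t in transactions if candidate <= t)
            if support >= min_support:
                frequent[candidate] = support
        result.update(frequent)
        size += 1
    return result


# Invented records: the set of terms found in each record's text fields.
records = [
    {"nanoparticle", "enhanced oil recovery", "wettability"},
    {"nanoparticle", "enhanced oil recovery", "interfacial tension"},
    {"nanoparticle", "wettability"},
    {"enhanced oil recovery", "wettability", "nanoparticle"},
    {"permeability", "extreme gradient boosting"},
]
itemsets = apriori(records, min_support=2)
for itemset, support in sorted(itemsets.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), support)
```

Each frequent term set corresponds to a conjunctive query whose result size is known in advance, whereas terms that merely share a cluster come with no such guarantee.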