Submitted: 14 May 2024
Posted: 15 May 2024
Abstract
Keywords:
1. Introduction
2. WSD framework
- Your Text: The initial stage begins by receiving raw text input.
- spaCy NLP: This component processes the text, extracting word lemmas and determining grammatical forms through spaCy's NLP models [11].
- KerasNLP BERT: Leveraging the KerasNLP library [12], the system generates contextual BERT neural network embedding vectors for each word in the sentence.
- Words Alignment: Alignment results from both spaCy and KerasNLP models are combined, creating an index that maps corresponding words between the two representations.
- NLTK WordNet: Employing NLTK [13] WordNet synset tags, this module annotates all words in the text. The algorithm selects the most frequent synset for each word.
- NCA Projections: The system projects each word into an n-dimensional space derived from the training data, capturing essential semantic relationships.
- kNN classification: The kNN method classifies words that can have multiple meanings, that is, words whose lemma maps to several WordNet synsets.
- Final classification: This model consolidates the information from previous stages, providing the final, refined classification for each word.
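The most-frequent-synset step and the hand-off to the kNN classifier can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy annotated corpus, the function names `build_mfs_table` and `classify`, and the `knn_predict` callback are all hypothetical.

```python
from collections import Counter, defaultdict

def build_mfs_table(annotated):
    """Map each lemma to (most frequent synset, set of all observed synsets)."""
    counts = defaultdict(Counter)
    for lemma, synset in annotated:
        counts[lemma][synset] += 1
    return {lemma: (c.most_common(1)[0][0], set(c)) for lemma, c in counts.items()}

def classify(lemma, vector, mfs_table, knn_predict=None):
    """Return a synset label: kNN for ambiguous lemmas, most frequent synset otherwise."""
    if lemma not in mfs_table:
        return None  # lemma never seen in the annotated data
    mfs, senses = mfs_table[lemma]
    if len(senses) > 1 and knn_predict is not None:
        return knn_predict(lemma, vector)  # hypothetical per-lemma kNN classifier
    return mfs

# Toy annotated corpus (illustrative labels in NLTK synset-name style).
corpus = [
    ("bank", "bank.n.01"),
    ("bank", "depository_financial_institution.n.01"),
    ("bank", "depository_financial_institution.n.01"),
    ("dog", "dog.n.01"),
]
table = build_mfs_table(corpus)
print(classify("bank", None, table))  # falls back to the most frequent synset
print(classify("dog", None, table))   # unambiguous lemma
```

Keeping the most frequent synset as the fallback means that a lemma without a trained classifier still receives a label.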
- Tokenization and Lemmatization. spaCy segments the input text into individual words (tokens) and identifies the lemma (base form) for each token. This is crucial for WSD as the synset classifier operates on lemmas, ensuring consistent representation across different word inflections.
- Noun Phrase Chunking. spaCy identifies noun phrases (NPs) within the text. NPs play a twofold role: first, they help pinpoint potential word combinations aligning with distinct synsets in WordNet. Second, chunking facilitates the identification of the head noun within the NP, enabling the selection of the most relevant synset for the entire phrase.
- Part-of-Speech (POS) Tagging. spaCy assigns grammatical tags (POS tags) to each word, indicating its role within the sentence (e.g., noun, verb, adjective). POS tags are valuable features for the synset classifier, providing additional context alongside the lemma for more accurate WSD.
- Dependency Parsing. spaCy can also extract dependency relationships between words in a sentence. While not currently utilized in this WSD framework, dependency information holds promise for future development. Analyzing these relationships could refine word sense disambiguation by considering the syntactic structure of the sentence.
- Named Entity Recognition (NER). NLP libraries often include NER modules that identify pre-defined named entities like people, locations, and organizations. However, NER typically covers a limited range of semantic categories compared to WSD. In spaCy and similar libraries (e.g., Stanford CoreNLP [14]), NER addresses only a subset of noun phrases. This framework proposes aligning NER categories with their corresponding hypernyms in WordNet, potentially leveraging existing NER functionalities to support WSD. By contrast, the WordNet WSD module independently identifies words and phrases suitable for disambiguation, going beyond the limitations of NER.
- The spaCy word table is scanned sequentially, and each word is matched to the beginning of a token from the list generated by the BERT tokenizer.
- To identify the corresponding beginning of the word (token) from the BERT module, consecutive tokens are successively combined and compared against the given spaCy word. Upon finding a match, the index of the first located BERT token is returned. This iterative process ensures the alignment of spaCy words with their corresponding tokens in KerasNLP, facilitating integration within the WSD framework.
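The alignment procedure described in the two steps above can be sketched in a few lines. The sketch assumes an uncased WordPiece vocabulary in which continuation pieces carry a `##` prefix and special tokens such as `[CLS]` appear in the token list; the function names and the example sentence are hypothetical.

```python
def strip_continuation(piece):
    """Drop the WordPiece continuation marker, if any."""
    return piece[2:] if piece.startswith("##") else piece

def align_words(spacy_words, bert_tokens):
    """Return, for each spaCy word, the index of its first BERT token (or None)."""
    alignment = []
    pos = 0  # index of the next unconsumed BERT token
    for word in spacy_words:
        target = word.lower()
        found = None
        for start in range(pos, len(bert_tokens)):
            combined = ""
            # Concatenate consecutive pieces until they spell the spaCy word.
            for end in range(start, len(bert_tokens)):
                combined += strip_continuation(bert_tokens[end]).lower()
                if combined == target:
                    found = (start, end)
                    break
                if not target.startswith(combined):
                    break  # this start position cannot produce the word
            if found is not None:
                break
        if found is not None:
            alignment.append(found[0])
            pos = found[1] + 1
        else:
            alignment.append(None)  # no matching token sequence
    return alignment

words = ["The", "unaffable", "dog", "barked", "."]
tokens = ["[CLS]", "the", "una", "##ffa", "##ble", "dog", "barked", ".", "[SEP]"]
print(align_words(words, tokens))  # [1, 2, 5, 6, 7]
```

Words for which no token sequence matches are mapped to `None` rather than raising an error, so one tokenization mismatch does not break the rest of the sentence.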
- Using the spaCy library, we extract noun and verb phrases.
- Using spaCy's grammatical relation analysis, we find the main (head) word in each phrase.
- Using the WSD module, we select noun phrases where the head word indicates a physical object. We also mark verb phrases where the head word expresses physical body movement.
- Only the selected phrases that meet these criteria are used for the final NLP task, which involves phrases related to physical objects and stage movement, thus enhancing the efficiency and effectiveness of subsequent NLP operations.
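The phrase selection steps above amount to a filter on the semantic class of each phrase head. The sketch below assumes that phrases have already been extracted and that the WSD module has attached a WordNet lexname to each head word. The dictionary layout, the function name `select_phrases`, and the choice of lexname sets standing in for "physical object" and "physical body movement" are illustrative assumptions, not the criteria used in the paper.

```python
# Assumed stand-ins for "physical object" nouns and "body movement" verbs.
PHYSICAL_NOUN_LEXNAMES = {
    "noun.artifact", "noun.object", "noun.animal",
    "noun.person", "noun.plant", "noun.food",
}
MOTION_VERB_LEXNAMES = {"verb.motion"}

def select_phrases(phrases):
    """Keep noun phrases with physical-object heads and verb phrases with motion heads."""
    selected = []
    for phrase in phrases:
        lexname = phrase["head_lexname"]
        if phrase["kind"] == "NP" and lexname in PHYSICAL_NOUN_LEXNAMES:
            selected.append(phrase)
        elif phrase["kind"] == "VP" and lexname in MOTION_VERB_LEXNAMES:
            selected.append(phrase)
    return selected

# Hypothetical phrases with heads already disambiguated.
phrases = [
    {"text": "the red chair", "kind": "NP", "head": "chair", "head_lexname": "noun.artifact"},
    {"text": "a good idea", "kind": "NP", "head": "idea", "head_lexname": "noun.cognition"},
    {"text": "walks slowly", "kind": "VP", "head": "walk", "head_lexname": "verb.motion"},
    {"text": "thinks aloud", "kind": "VP", "head": "think", "head_lexname": "verb.cognition"},
]
for phrase in select_phrases(phrases):
    print(phrase["text"])
```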
3. Training
- Test Dataset Discrepancies: The first source of error was discrepancies within the test datasets themselves. Variability in annotation quality, data distribution, and contextual diversity posed challenges for accurate disambiguation.
- Annotation Errors in Training Data: Secondly, we identified annotation errors within the training dataset [24]. These inaccuracies propagated through the model, affecting its performance during inference. Addressing and rectifying such errors became crucial for enhancing model robustness.
- WordNet Synset Limitations: Our investigation highlighted inadequacies within WordNet synset relations and descriptions. While WordNet serves as a valuable lexical resource, it occasionally fails to capture subtle nuances required for comprehensive sense disambiguation. Notably, we observed an overly granular segmentation of certain words into synsets, hindering accurate sense resolution. Therefore, when modifying WordNet synset lists or relations, it is crucial to consider the intended application of natural language processing. In this work, synset merging decisions were guided by the ultimate goal of using NLP to generate 2D and 3D models, as described in [1]. Over the past five decades, starting with the pioneering SHRDLU system [25], numerous systems have attempted natural language manipulation of computer graphics objects (see [26] for a review of 26 such systems). Many of these systems process multiple sentences to identify physical objects within a 3D scene, often leveraging spatial knowledge to resolve ambiguities (e.g., SceneSeer [27]). This specific NLP application guided the WSD solution presented in this paper.
- Data Scarcity for Specific Words: Another critical constraint was the lack of sufficient data to construct reliable classifiers for specific words. This limitation hindered the model’s ability to generalize effectively across the entire lexicon, particularly for low-frequency or domain-specific terms.
4. Debugging and Knowledge Base Management System

4.1. Select data

4.2. Clustering
- Selection of vector projection method: The clustering algorithm offers flexibility in vector projection methods, including Principal Component Analysis or Neighborhood Component Analysis.
- Vector projection: Embedded vectors undergo projection, a crucial preparatory step for subsequent calculations.
- Cluster determination and evaluation:
  - Application of KMeans clustering algorithm: Employing the KMeans algorithm on the projected vectors enables the delineation of clusters and the assignment of each sample to its corresponding cluster.
  - Metric calculation: Two pivotal metrics, namely silhouette score and adjusted mutual information score, are computed for each potential cluster configuration, providing quantitative insights into the quality and coherence of resultant clusters.
- Optimal cluster selection: The ultimate determination of the ideal number of clusters hinges upon selecting the configuration yielding the highest product of silhouette score and adjusted mutual information score, indicative of optimal clustering performance.
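The selection procedure above can be sketched with scikit-learn. The sketch makes several assumptions that the text does not state: PCA is used as the projection, the adjusted mutual information is computed between cluster assignments and the annotated synset labels, and the candidate cluster counts, the function name `select_cluster_count`, and the toy data are all illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_mutual_info_score, silhouette_score

def select_cluster_count(vectors, synset_labels, k_values, n_components=2, seed=0):
    """Return (best_k, scores), where scores[k] = silhouette score * AMI score."""
    projected = PCA(n_components=n_components, random_state=seed).fit_transform(vectors)
    scores = {}
    for k in k_values:
        assignments = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(projected)
        silhouette = silhouette_score(projected, assignments)
        ami = adjusted_mutual_info_score(synset_labels, assignments)
        scores[k] = silhouette * ami
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Toy data: three well-separated groups of 8-dimensional vectors.
rng = np.random.default_rng(0)
centers = np.array([
    [0.0] * 8,
    [10.0] * 4 + [0.0] * 4,
    [0.0] * 4 + [10.0] * 4,
])
vectors = np.vstack([c + rng.normal(scale=0.5, size=(30, 8)) for c in centers])
labels = np.repeat([0, 1, 2], 30)

best_k, scores = select_cluster_count(vectors, labels, range(2, 6))
print(best_k)
```

Taking the product rewards configurations that are both geometrically compact (silhouette) and consistent with the existing labels (AMI); a configuration that scores well on only one of the two is penalized.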
4.3. Classification
- Data Partitioning: The initial step entails determining the allocation ratio for test and training data. By default, this parameter is set to 0.1, signifying that 10 percent of the total dataset will be randomly earmarked for testing purposes, with the remainder designated for training.
- k-Nearest Neighbors Exploration: The k-nearest neighbors (kNN) method is then fitted on the training portion and evaluated on the test portion.
- Iterative Experimentation: This experimentation process is iterated N times, facilitating the accumulation of a broader spectrum of measurements.
- Confusion Matrix Computation: Following experimentation, the confusion matrix is computed based on the amassed data. These results are then presented to the knowledge engineer, who assumes the pivotal role of making final determinations concerning potential synset redesigns.
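The experiment loop above can be sketched in plain Python. This is a simplified stand-in rather than the system's code: it uses Euclidean distance with a majority vote, and the function names, the toy two-sense data, and the dictionary representation of the confusion matrix are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k=1):
    """Majority label among the k training vectors closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def repeated_knn_confusion(samples, test_ratio=0.1, n_runs=20, k=1, seed=0):
    """Accumulate a confusion matrix {(true, predicted): count} over random splits."""
    rng = random.Random(seed)
    n_test = max(1, round(len(samples) * test_ratio))
    confusion = Counter()
    for _ in range(n_runs):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        for vector, true_label in test:
            confusion[(true_label, knn_predict(train, vector, k))] += 1
    return confusion

# Toy data: two well-separated senses of one lemma, as (vector, synset label) pairs.
samples = [((0.1 * i, 0.0), "sense_a") for i in range(10)]
samples += [((5.0 + 0.1 * i, 5.0), "sense_b") for i in range(10)]

confusion = repeated_knn_confusion(samples)
for (true_label, predicted), count in sorted(confusion.items()):
    print(true_label, predicted, count)
```

Off-diagonal entries with high counts point to synset pairs that the embeddings cannot separate, which is the kind of evidence a knowledge engineer could weigh when considering a merge.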
4.4. Interactive Analysis for Informed Synset Editing
- Visualizing Semantic Clusters: The module offers a 2D scatter plot depicting the projections of word embedding vectors onto the first two principal components (PCs) derived from Principal Component Analysis applied to the training data. Test data projections are overlaid for comparison. Users can interactively select data points or regions on the plot to retrieve the corresponding sentence examples. Additionally, the ability to overlay WordNet lexnames, synset labels, or hypernym labels further aids in comprehending whether specific data clusters align with these semantic categories. This visual exploration empowers knowledge engineers to identify potential synset refinements by uncovering clusters that deviate from expected semantic groupings.
- Dimensionality Reduction and Neighborhood Analysis: While PCA provides a valuable 2D visualization for initial analysis, Neighborhood Component Analysis can be employed to assess the information retained in higher dimensions. The NCA projection, presented as another interactive 2D graph, allows users to evaluate the feasibility of disambiguating the target word based on the inherent structure of the data. This visualization can also serve as a valuable tool for identifying potential annotation errors within the training data by highlighting points that deviate significantly from their expected semantic neighbors.
- 3D Exploration for Enhanced Insights: For a more comprehensive perspective, the system offers the option to explore the data in a 3D scatter plot. Similar to the 2D view, users can select between PCA and NCA for dimensionality reduction, allowing for in-depth examination of potential semantic relationships across multiple dimensions. This interactive 3D environment provides additional opportunities for knowledge engineers to refine their understanding of the data and make informed decisions regarding synset modifications.
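The projection behind the PCA scatter plots can be sketched with NumPy alone. The sketch fits the principal components on the training vectors and reuses them for the test vectors, as the overlay described above requires. Plotting and the NCA variant are omitted; the function names and the random stand-in vectors are assumptions.

```python
import numpy as np

def fit_pca(train, n_components=2):
    """Return the training mean and the top principal directions (as rows)."""
    mean = train.mean(axis=0)
    # Rows of vt are principal directions ordered by decreasing variance.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(vectors, mean, components):
    """Project vectors onto components fitted on the training data."""
    return (vectors - mean) @ components.T

# Random stand-ins for embedding vectors (16 dimensions, decreasing variance).
rng = np.random.default_rng(0)
scale = np.linspace(3.0, 0.1, 16)
train = rng.normal(size=(50, 16)) * scale
test = rng.normal(size=(5, 16)) * scale

mean, components = fit_pca(train, n_components=2)
train_2d = project(train, mean, components)
test_2d = project(test, mean, components)  # overlaid on the training scatter plot
print(train_2d.shape, test_2d.shape)
```

Setting `n_components=3` gives coordinates for a 3D view in the same way.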
5. Results
- Synset Refinement: We replaced specific synsets with more contextually appropriate alternatives to improve the accuracy of sense representation.
- Pruning: Synsets with negligible semantic value, particularly those limited to niche domains, were removed to streamline the knowledge base.
- Lexicon Expansion: Novel synsets were introduced to broaden the scope and capture the evolving nature of language.
6. Discussion
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Laukaitis, A.; Ostašius, E.; Plikynas, D. Deep semantic parsing with upper ontologies. Appl. Sci. 2021, 11, 9423.
- Navigli, R. Word sense disambiguation: A survey. ACM Comput. Surv. 2009, 41, 1–69.
- Loureiro, D.; Jorge, A. Language Modelling Makes Sense: Propagating Representations through WordNet for Full-Coverage Word Sense Disambiguation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 5682–5691.
- Baker, C.F.; Fillmore, C.J.; Lowe, J.B. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Montreal, QC, Canada, 10–14 August 1998; Volume 1, pp. 86–90.
- Fellbaum, C. WordNet. In Theory and Applications of Ontology: Computer Applications; Poli, R., Healy, M., Kameas, A., Eds.; Springer: Dordrecht, The Netherlands, 2010; pp. 231–243.
- Chang, A.; Savva, M.; Manning, C.D. Learning spatial knowledge for text to 3D scene generation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 26–28 October 2014; pp. 2028–2038.
- Niles, I.; Pease, A. Towards a standard upper ontology. In Proceedings of the International Conference on Formal Ontology in Information Systems, Ogunquit, ME, USA, 17–19 October 2001; pp. 2–9.
- Laukaitis, A.; Plikynas, D.; Ostasius, E. Sentence Level Alignment of Digitized Books Parallel Corpora. Informatica 2018, 29, 693–710.
- Das, D.; Chen, D.; Martins, A.F.; Schneider, N.; Smith, N.A. Frame-semantic parsing. Comput. Linguist. 2014, 40, 9–56.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Vasiliev, Y. Natural Language Processing with Python and spaCy: A Practical Introduction; No Starch Press, 2020.
- Watson, M.; Qian, C.; Bischof, J.; Chollet, F.; et al. KerasNLP. 2022. https://github.com/keras-team/keras-nlp.
- Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit; O'Reilly Media, Inc., 2009.
- Manning, C.D.; Surdeanu, M.; Bauer, J.; Finkel, J.R.; Bethard, S.; McClosky, D. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MD, USA, 23–24 June 2014; pp. 55–60.
- Lesk, M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation, 1986; pp. 24–26.
- Goldberger, J.; Hinton, G.E.; Roweis, S.; Salakhutdinov, R.R. Neighbourhood components analysis. Adv. Neural Inf. Process. Syst. 2004, 17.
- Moro, A.; Navigli, R. SemEval-2015 task 13: Multilingual all-words sense disambiguation and entity linking. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, CO, USA, 4–5 June 2015; pp. 288–297.
- Raganato, A.; Camacho-Collados, J.; Navigli, R. Word sense disambiguation: A unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017; pp. 99–110.
- Loureiro, D.; Jorge, A.M.; Camacho-Collados, J. LMMS Reloaded: Transformer-based Sense Embeddings for Disambiguation and Beyond. arXiv 2021, arXiv:2105.12449.
- Edmonds, P.; Cotton, S. Senseval-2: Overview. In Proceedings of the SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems, Toulouse, France, 5–6 July 2001; pp. 1–5.
- Snyder, B.; Palmer, M. The English all-words task. In Proceedings of the SENSEVAL-3, Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, 25–26 July 2004; pp. 41–43.
- Pradhan, S.; Loper, E.; Dligach, D.; Palmer, M. SemEval-2007 task-17: English lexical sample, SRL and all words. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), Prague, Czech Republic, 23–24 June 2007; pp. 87–92.
- Navigli, R.; Jurgens, D.; Vannella, D. SemEval-2013 task 12: Multilingual word sense disambiguation. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics, Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, Georgia, USA, 14–15 June 2013; pp. 222–231.
- Miller, G.A.; Chodorow, M.; Landes, S.; Leacock, C.; Thomas, R.G. Using a semantic concordance for sense identification. In Proceedings of the Workshop Human Language Technology, Plainsboro, NJ, USA, 8–11 March 1994; pp. 8–11.
- Winograd, T. Understanding natural language. Cogn. Psychol. 1972, 3, 1–191.
- Hassani, K.; Lee, W.S. Visualizing natural language descriptions: A survey. ACM Comput. Surv. 2016, 49, 1–34.
- Chang, A.X.; Eric, M.; Savva, M.; Manning, C.D. SceneSeer: 3D scene design with natural language. arXiv 2017, arXiv:1703.00050.
- Doval, Y.; Vilares, J.; Gómez-Rodríguez, C. Towards robust word embeddings for noisy texts. Appl. Sci. 2020, 10, 6893.
- Castro-Bleda, M.J.; Iklódi, E.; Recski, G.; Borbély, G. Towards a Universal Semantic Dictionary. Appl. Sci. 2019, 9, 4060.
- Lenat, D.B. CYC: A large-scale investment in knowledge infrastructure. Commun. ACM 1995, 38, 33–38.
- Schulz, S.; Sutcliffe, G.; Urban, J.; Pease, A. Detecting inconsistencies in large first-order knowledge bases. In Proceedings of the International Conference on Automated Deduction, Gothenburg, Sweden, 6–11 August 2017; pp. 310–325.
- Pease, A.; Sutcliffe, G.; Siegel, N.; Trac, S. Large theory reasoning with SUMO at CASC. AI Commun. 2010, 23, 137–144.



| Method | POS | Senseval All |
|---|---|---|
| MFS | Verb | 49.6 |
| MFS | Noun | 67.5 |
| BERT-2NCA 1-NN | Verb | 61.7 |
| BERT-2NCA 1-NN | Noun | 74.1 |
| BERT-NCA k-NN | Verb | 64.4 |
| BERT-NCA k-NN | Noun | 76.7 |
| BERT1024 1-NN | Verb | 63.9 |
| BERT1024 1-NN | Noun | 76.4 |

| WordNet Modification | Number of records |
|---|---|
| Synset-to-synset | 1452 |
| Delete synset | 335 |
| New synset | 7 |
| New sentences from FrameNet | 3091 |
| ChatGPT sentences | 4355 |
| Train sentence corrections | 579 |
| Test sentence corrections | 126 |

| Method | POS | Senseval All |
|---|---|---|
| BERT-1NCA 1-NN | Verb | 61.7 |
| BERT-1NCA 1-NN | Noun | 90.1 |
| BERT-NCA k-NN | Verb | 64.4 |
| BERT-NCA k-NN | Noun | 91.7 |
| BERT1024 1-NN | Verb | 63.9 |
| BERT1024 1-NN | Noun | 91.4 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).