Preprint Article Version 2 This version is not peer-reviewed

What Do We Learn from Word Associations? Evaluating Machine Learning Algorithms for the Extraction of Contextual Word Meaning in Natural Language Processing

Version 1 : Received: 4 May 2018 / Approved: 7 May 2018 / Online: 7 May 2018 (06:25:55 CEST)
Version 2 : Received: 9 May 2018 / Approved: 10 May 2018 / Online: 10 May 2018 (05:56:56 CEST)

How to cite: Kapetanios, E.; Alshahrani, S.; Angelopoulou, A.; Baldwin, M. What Do We Learn from Word Associations? Evaluating Machine Learning Algorithms for the Extraction of Contextual Word Meaning in Natural Language Processing. Preprints 2018, 2018050102 (doi: 10.20944/preprints201805.0102.v2).

Abstract

“You should know the words by the company they keep!” is one of the most famous slogans attributed to John Rupert Firth (1957). It ignited a whole school of linguistic research known as British empiricist contextualism. Sixty years later, many unsupervised and semi-supervised machine learning algorithms have been successfully designed and implemented to extract word meaning from the context of a text corpus. These algorithms treat words, more or less, as vectors of real numbers representing frequencies of word occurrences within a context, and word meaning as the position of a word in a high-dimensional vector space model. Word associations, in turn, are treated as calculated distances between these positions. With the rise of Deep Learning (DL) and other architectures based on artificial neural networks, learning the positions of words and extracting word associations as measured by their distances has further improved. In this paper, however, we revisit the mainstream algorithmic approaches and set the stage for a partly cross-disciplinary evaluation framework for judging the nature of the word associations extracted by state-of-the-art machine learning algorithms. Our preliminary results are based on word associations extracted by applying a DL framework to a Google News text corpus, and on comparisons with human-created word association lists such as word collocation dictionaries and psycholinguistic experiments. The results and conclusions provide some insights into the inherent limitations in interpreting the type of word associations and the underlying relations between words, with inevitable consequences for other areas, such as the extraction of knowledge graphs or image understanding.
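As a rough illustration of the vector space view described in the abstract, the following minimal sketch builds count-based context vectors from a tiny corpus and compares words by cosine similarity. It is not the method evaluated in the paper: the toy corpus, the window size of 2, and the helper names `cooccurrence_vectors` and `cosine_similarity` are all assumptions made up for this example.

```python
import math
from collections import Counter


def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words seen within `window` tokens of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus, invented for illustration only (not the Google News corpus).
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
]

vectors = cooccurrence_vectors(corpus)
sim_cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
sim_cat_milk = cosine_similarity(vectors["cat"], vectors["milk"])
print(f"cat~dog:  {sim_cat_dog:.3f}")
print(f"cat~milk: {sim_cat_milk:.3f}")
```

In this toy setting, "cat" and "dog" score higher than "cat" and "milk" because they share contexts, even though "cat" and "milk" actually co-occur. The similarity score by itself does not say which kind of relation holds between two words, which is the interpretive question the abstract raises.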

Subject Areas

machine learning; algorithms; natural language processing; deep learning; vector space models; semantic similarity; distributional semantics; latent semantic analysis; word2vec
