Submitted: 07 September 2025
Posted: 08 September 2025
Abstract
Keywords:
1. Introduction
2. Related Work
2.1. Diachronic Word Embeddings and Semantic Change Detection
2.2. Generational Language Variation and Sociolinguistics
2.3. Pretrained Embeddings for Semantic Analysis
2.4. Wikipedia as a Linguistic Corpus
2.5. Cultural Analytics and Digital Humanities
2.6. Research Gap and Contribution
3. Corpus Selection & Time Segmentation
3.1. Domain Control and Corpus Balance
- Technology: Computing, telecommunications, digital media
- Culture: Arts, entertainment, social movements
- Politics: Governance, policy, international relations
- Social Issues: Identity, diversity, environmental concerns
- Education: Learning, pedagogy, institutional practices
3.2. Temporal Boundaries and Generational Segmentation
- Generation Z corpus: Articles primarily created or edited between 1997 and 2012, representing the period before smartphones became ubiquitous (Wikipedia launched in 2001, so in practice this window begins then)
- Generation Alpha corpus: Articles created or edited from 2013 to the present, coinciding with widespread smartphone adoption and social media maturation
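The segmentation rule above can be sketched as a small date-bucketing function. This is a minimal sketch under stated assumptions, not the authors' pipeline: the name `assign_period`, the labels `gen_z` and `gen_alpha`, and the choice to bucket by a single revision date are all hypothetical and introduced only for illustration.

```python
from datetime import date
from typing import Optional

# Period boundaries taken from the segmentation above. The Generation Alpha
# window is open-ended ("present"), so only its start date is fixed.
GEN_Z_START = date(1997, 1, 1)
GEN_Z_END = date(2012, 12, 31)
GEN_ALPHA_START = date(2013, 1, 1)


def assign_period(revision_date: date) -> Optional[str]:
    """Map a revision date to a corpus label (hypothetical labels).

    Returns None for dates before the Generation Z window.
    """
    if GEN_Z_START <= revision_date <= GEN_Z_END:
        return "gen_z"
    if revision_date >= GEN_ALPHA_START:
        return "gen_alpha"
    return None
```

How an article with revisions in both windows is assigned ("primarily edited") is not specified here; one option would be to bucket each revision separately and sample text from the revision snapshots.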
3.3. Wikipedia as a Neutral Linguistic Resource
- Stylistic consistency: The encyclopedic register provides relatively neutral language compared to social media or news corpora, reducing stylistic confounds
- Collaborative editing: The wiki model ensures content reflects contemporary usage patterns while maintaining editorial standards
- Temporal metadata: Detailed edit histories enable precise temporal segmentation based on creation and revision dates
- Topical breadth: Comprehensive coverage across domains facilitates balanced sampling within thematic categories
3.4. Corpus Size and Representativeness
4. Word Selection
- privacy
- love
- faith
- identity
- freedom
- diversity
- gender
- authenticity
- consent
- sustainability
5. Text Preprocessing
- Markup Removal: Strip wiki markup, templates, infoboxes, references, and HTML tags to retain plain text.
- Normalization: Convert text to lowercase and normalize Unicode characters.
- Tokenization: Apply sentence and word tokenization using standard tools (e.g., spaCy) to ensure consistent segmentation.
- Filtering: Remove tokens shorter than three characters and non-alphanumeric tokens.
- Frequency Thresholding: Exclude words occurring fewer than 50 times in either period, so that every retained word has reliable embedding statistics in both periods.
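The preprocessing steps above can be sketched end to end with the standard library alone. This is an illustrative sketch, not the authors' code: the regular expressions are a crude stand-in for a real wikitext parser, the regex tokenizer stands in for spaCy, and the function names (`strip_markup`, `normalize`, `tokenize`, `frequent_vocab`) are hypothetical. The threshold is read here as "at least `min_count` occurrences in every period", which is an assumption.

```python
import re
import unicodedata
from collections import Counter


def strip_markup(text: str) -> str:
    """Crude regex stand-in for a wikitext parser (non-nested templates only)."""
    text = re.sub(r"<ref[^>]*/>", " ", text)                       # self-closing refs
    text = re.sub(r"<ref[^>]*>.*?</ref>", " ", text, flags=re.S)   # references
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)                    # templates, infoboxes
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"<[^>]+>", " ", text)                           # remaining HTML tags
    return re.sub(r"'{2,}", "", text)                              # bold/italic quote marks


def normalize(text: str) -> str:
    """Unicode-normalize (NFKC) and lowercase."""
    return unicodedata.normalize("NFKC", text).lower()


def tokenize(text: str) -> list[list[str]]:
    """Split into sentences, then keep alphanumeric tokens of length >= 3."""
    tokenized = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = [t for t in re.findall(r"[^\W_]+", sentence) if len(t) >= 3]
        if tokens:
            tokenized.append(tokens)
    return tokenized


def frequent_vocab(corpora: dict[str, list[list[str]]], min_count: int = 50) -> set[str]:
    """Words that reach min_count in every period (assumed interpretation)."""
    vocab = None
    for sentences in corpora.values():
        counts = Counter(tok for sent in sentences for tok in sent)
        kept = {w for w, c in counts.items() if c >= min_count}
        vocab = kept if vocab is None else vocab & kept
    return vocab or set()
```

Extracting alphanumeric runs splits tokens such as contractions at the apostrophe, which differs slightly from dropping whole non-alphanumeric tokens; a production pipeline would rely on the tokenizer's own token boundaries.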
6. Embedding Extraction
- Context Window: Identify every sentence in which $w$ occurs, yielding a set $S_p(w)$ for period $p$.
- Embedding Generation: Use OpenAI’s text-embedding-3-small model to map each sentence $s \in S_p(w)$ to a vector $e_s$ [15].
- Centroid Computation: Compute the period-specific centroid embedding $c_p(w)$ by averaging all sentence embeddings:
  $$c_p(w) = \frac{1}{|S_p(w)|} \sum_{s \in S_p(w)} e_s$$
- Storage: Store all sentence embeddings $e_s$ and centroids $c_p(w)$ in a vector database (e.g., ChromaDB) for efficient retrieval and drift computation [22].
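The context-selection and centroid steps above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the embedding model is passed in as a callable so the example runs offline, and `toy_embed` is a hypothetical stand-in that has nothing to do with text-embedding-3-small. In practice the callable would wrap a request to the embeddings endpoint with that model name. Storage in a vector database is left out.

```python
import re
from typing import Callable, Sequence

import numpy as np

# An embedder maps n sentences to an (n, d) array. Hypothetical type alias.
Embedder = Callable[[Sequence[str]], np.ndarray]


def context_sentences(word: str, sentences: Sequence[str]) -> list[str]:
    """Sentences in which `word` occurs as a whole token (case-insensitive)."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


def centroid(
    word: str, sentences: Sequence[str], embed: Embedder
) -> tuple[np.ndarray, np.ndarray]:
    """Return (sentence embeddings, centroid) for one word in one period."""
    contexts = context_sentences(word, sentences)
    if not contexts:
        raise ValueError(f"no sentences contain {word!r}")
    vectors = np.asarray(embed(contexts), dtype=float)
    # The centroid is the plain mean of the sentence embeddings.
    return vectors, vectors.mean(axis=0)


def toy_embed(sentences: Sequence[str]) -> np.ndarray:
    """Stand-in embedder for testing: [character count, token count]."""
    return np.array([[len(s), len(s.split())] for s in sentences], dtype=float)
```

Calling `centroid("privacy", period_sentences, toy_embed)` once per period yields the two centroids that the drift computation compares.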
7. Semantic Drift Computation
7.1. Distributional Variance and Confidence Intervals
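- Draw $B$ bootstrap samples of size $N$ (with replacement) from the sets of sentence embeddings for the Generation Z and Generation Alpha periods.
- Compute the two period centroids and the drift between them for each bootstrap sample $b$.
- Derive the 95% confidence interval from the empirical distribution of the $B$ bootstrap drift values.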
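The bootstrap procedure above can be sketched with NumPy. Two assumptions are made loudly here: drift is taken to be the cosine distance between period centroids, which the steps above do not state, and each period is resampled at its own original size. The function names are hypothetical.

```python
import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity (assumed drift measure)."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def bootstrap_drift_ci(
    emb_z: np.ndarray,
    emb_alpha: np.ndarray,
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, tuple[float, float]]:
    """Point estimate of drift and a percentile bootstrap interval.

    emb_z, emb_alpha: (n, d) arrays of sentence embeddings, one per period.
    """
    rng = np.random.default_rng(seed)
    drifts = np.empty(n_boot)
    for b in range(n_boot):
        # Resample each period with replacement, keeping its original size.
        z = emb_z[rng.integers(0, len(emb_z), size=len(emb_z))]
        a = emb_alpha[rng.integers(0, len(emb_alpha), size=len(emb_alpha))]
        drifts[b] = cosine_distance(z.mean(axis=0), a.mean(axis=0))
    lower, upper = np.quantile(drifts, [alpha / 2, 1 - alpha / 2])
    point = cosine_distance(emb_z.mean(axis=0), emb_alpha.mean(axis=0))
    return point, (float(lower), float(upper))
```

The percentile interval is the simplest variant; bias-corrected intervals are an alternative when the bootstrap distribution is skewed.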
7.2. Alternative Metrics
- Neighborhood Overlap: Proportion of the top-k nearest neighbors of w that persist across periods.
- Earth Mover’s Distance (EMD): Distance between the full sentence-embedding distributions of the two periods, capturing shifts in sense prevalence [36].
- Temporal KL Divergence: Divergence between probabilistic sense representations derived via clustering of embeddings [37].
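Two of the alternative metrics above can be sketched with NumPy alone. This is an illustrative sketch under assumptions: neighbors are ranked by cosine similarity against a hypothetical vocabulary matrix, and sense distributions are built from cluster labels that some clustering step (not shown) has already produced. EMD over full embedding distributions needs an optimal-transport solver and is not covered here.

```python
import numpy as np


def top_k_neighbors(
    target: np.ndarray, vocab_vectors: np.ndarray, vocab_words: list[str], k: int
) -> set[str]:
    """The k vocabulary words with highest cosine similarity to `target`."""
    unit_vocab = vocab_vectors / np.linalg.norm(vocab_vectors, axis=1, keepdims=True)
    unit_target = target / np.linalg.norm(target)
    order = np.argsort(-(unit_vocab @ unit_target))[:k]
    return {vocab_words[i] for i in order}


def neighborhood_overlap(neigh_a: set[str], neigh_b: set[str], k: int) -> float:
    """Proportion of the top-k neighbors shared by both periods."""
    return len(neigh_a & neigh_b) / k


def kl_divergence(
    labels_p: np.ndarray, labels_q: np.ndarray, n_senses: int, eps: float = 1e-9
) -> float:
    """KL(P || Q) between sense-frequency distributions.

    labels_p, labels_q: non-negative integer cluster labels, one per sentence.
    eps smooths empty clusters so the logarithm stays finite.
    """
    p = np.bincount(labels_p, minlength=n_senses) + eps
    q = np.bincount(labels_q, minlength=n_senses) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```

Note that KL divergence is asymmetric, so the direction (earlier period as P or as Q) has to be fixed and reported alongside the score.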
7.3. Significance Testing
8. Proof-of-Concept Results
9. Discussion
9.1. Cultural and Technological Implications
9.2. Methodological Contributions
9.3. Limitations and Potential Issues
- Corpus Representativeness: Reliance on Wikipedia’s encyclopedic register may underrepresent informal language use, limiting applicability to colloquial or social media contexts [32].
- Generational Boundaries: Defining precise birth-year cutoffs (1997–2012 vs. 2013–present) is inherently arbitrary and may not map cleanly onto language adoption cohorts [27].
- Embedding Bias: Pretrained models reflect biases present in their training data, potentially skewing drift scores for culturally sensitive terms [39].
- Domain Control: Although thematic sampling mitigates domain drift, residual topical variation within categories could inflate drift estimates for certain terms [8].
- Sense Conflation: Averaging sentence embeddings conflates multiple senses of polysemous words, obscuring sense-specific drift patterns [35].
9.4. Future Work
- Extend analysis to informal and multimodal corpora (e.g., Twitter, Reddit) to capture colloquial semantic drift [19].
- Apply sense induction methods to disentangle polysemous usage and trace drift at the sense level [36].
- Investigate cross-linguistic generational drift using multilingual embeddings [9].
- Incorporate demographic and geographic metadata to model differential adoption patterns within cohorts.
10. Conclusion
References
- Hamilton, W. L., Leskovec, J., Jurafsky, D. (2016). Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
- Rosin, F. et al. (2022). Time-aware contextualized word representations for semantic change detection. In Findings of EMNLP.
- Martinc, M., et al. (2020). Leveraging contextual embeddings for detecting diachronic semantic shift. In Proceedings of the 12th International Conference on Language Resources and Evaluation.
- Kutuzov, A., Øvrelid, L., Szymanski, T., & Velldal, E. (2018). Diachronic word embeddings and semantic shifts: a survey. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 1384-1397).
- Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 1489-1501).
- Shoemark, P., Liza, F. F., Nguyen, D., Hale, S., & McGillivray, B. (2019). Temporal referencing for robust modeling of lexical semantic change. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
- Schlechtweg, D., McGillivray, B., Hengchen, S., Dubossarsky, H., & Tahmasebi, N. (2020). SemEval-2020 task 1: Unsupervised lexical semantic change detection. In Proceedings of the Fourteenth Workshop on Semantic Evaluation.
- Cassotti, P., McGillivray, B., & Tahmasebi, N. (2021). DUKweb, diachronic word representations from the UK Web Archive corpus. Scientific Data, 8(1), 1-15.
- Cassotti, P. (2023). Computational approaches to language change: Methods and applications in digital humanities. Doctoral dissertation, King’s College London.
- Labov, W. (1994). Principles of linguistic change: Internal factors (Vol. 1). Blackwell.
- Stewart, I., Eisenstein, J., & Pierrehumbert, J. (2025). Semantic change in adults is not primarily a generational phenomenon. Proceedings of the National Academy of Sciences, 122(3), e2426815122.
- Eisenstein, J. (2013). What to do about bad language on the internet. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Grieve, J., Nini, A., & Guo, D. (2017). Analyzing lexical emergence in Modern American English online. English Language and Linguistics, 21(1), 99-127.
- Bailey, G., Wikle, T., Tillery, J., & Sand, L. (1991). The apparent time construct. Language Variation and Change, 3(3), 241-264.
- OpenAI. (2024). Vector embeddings - OpenAI API Documentation. Retrieved from https://platform.openai.com/docs/guides/embeddings.
- Muennighoff, N., et al. (2022). MTEB: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316.
- AIMultiple. (2025). Embedding Models: OpenAI vs Gemini vs Cohere in 2025. Retrieved from https://research.aimultiple.com/embedding-models/.
- McEnery, T., & Hardie, A. (2023). Corpus linguistics: Method, theory and practice. Cambridge University Press.
- Nguyen, D., Doğruöz, A. S., Rosé, C. P., & de Jong, F. (2016). Computational sociolinguistics: A survey. Computational Linguistics, 42(3), 537-593.
- King’s College London. (2025). Quantitative Diachronic Linguistics and Cultural Analytics: Data-driven insights into language and cultural change. Event proceedings.
- Tahmasebi, N., Borin, L., & Jatowt, A. (2021). Survey of computational approaches to lexical semantic change detection. In Computational approaches to semantic change (pp. 1-91). Language Science Press.
- PingCAP. (2024). Analyzing Performance Gains in OpenAI’s Text-Embedding-3-Small. Retrieved from https://www.pingcap.com/article/analyzing-performance-gains-in-openais-text-embedding-3-small/.
- Periti, F., & Tahmasebi, N. (2024). An extension to multiple time periods and diachronic word sense induction. In Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change.
- Kohnen, T. (2007). Text types and the methodology of diachronic corpus linguistics. In Corpus linguistics and the Web (pp. 157-174). Rodopi.
- Huang, X., & Paul, M. J. (2019). Neural temporality adaptation for document classification: Diachronic word embeddings and domain adaptation models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
- Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243-257.
- Mannheim, K. (1952). The problem of generations. In Essays on the sociology of knowledge (pp. 276-322). Routledge & Kegan Paul.
- Ding, G., Sener, F., & Yao, A. (2024). ComplexTempQA: A large-scale dataset for complex temporal question answering. arXiv preprint arXiv:2406.04866.
- Baroni, M., & Lenci, A. (2011). How we BLESSed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop (pp. 1-10).
- Tahmasebi, N., Borin, L., & Jatowt, A. (2021). Survey of computational approaches to lexical semantic change detection. In Computational approaches to semantic change (pp. 1–91). Language Science Press.
- Manovich, L. (2020). Cultural analytics. MIT Press.
- McEnery, T., Xiao, R., & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. Routledge.
- Hilpert, M. (2008). Germanic future constructions: A usage-based approach to language change. John Benjamins.
- Efron, B., & Tibshirani, R. J. (1994). An Introduction to the Bootstrap. CRC Press.
- Yukawa, T., Baraskar, A., & Bollegala, D. (2020). Word sense induction by clustering sub-word enriched contextualized embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 3282-3294.
- Alobaid, A., Frermann, L., & Lapata, M. (2021). Modeling lexical semantic change with EMD. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics.
- Gulordava, K., & Baroni, M. (2011). A distributional similarity approach to the detection of semantic change in the Google Books Ngram corpus. In Proceedings of the GEMS 2011 Workshop.
- Rieger, B., & Rupp, S. (2018). Permutation tests for the difference of two independent samples. Journal of Statistical Computation and Simulation, 88(14), 2675-2691.
- Blodgett, S. L., & O’Connor, B. (2020). Language (technology) is power: A critical survey of “bias” in NLP. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5454–5475.
| Term | Drift Score | 95% CI |
|---|---|---|
| Gender | 0.32 | [0.29, 0.35] |
| Privacy | 0.29 | [0.26, 0.32] |
| Sustainability | 0.27 | [0.24, 0.30] |
| Consent | 0.25 | [0.22, 0.28] |
| Identity | 0.23 | [0.20, 0.26] |
| Authenticity | 0.21 | [0.18, 0.24] |
| Diversity | 0.19 | [0.16, 0.22] |
| Freedom | 0.17 | [0.14, 0.20] |
| Love | 0.15 | [0.12, 0.18] |
| Faith | 0.13 | [0.10, 0.16] |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).