Submitted: 23 September 2025
Posted: 24 September 2025
Abstract

Keywords:
1. Introduction
2. Related Work
2.1. Document Clustering and Hierarchical Organization Methodologies
2.2. Dynamic and Adaptive Clustering Systems
2.3. Context-Aware Information Systems and Large Language Models
2.4. Organizational Document Management and Knowledge Systems
2.5. Multi-dimensional Document Analysis Approaches and Technical Differentiation
2.6. Information-Theoretic Clustering Approaches
3. Methodology
3.1. Dynamic Context Flag System Design
3.2. Flag Extraction Algorithm

3.3. Composite Distance Computation
3.4. Adaptive Hierarchical Clustering Algorithm

3.5. Incremental Update Mechanism
4. System Architecture
5. Experimental Setup and Evaluation Framework
5.1. Comprehensive Dataset Collection and Characteristics
5.2. Baseline Methods and Comparison Framework
5.3. Evaluation Metrics and Validation Protocols
6. Results
6.1. Clustering Accuracy and Hierarchy Quality
6.2. Scalability and Performance Analysis
6.4. Ablation Study and Component Analysis
7. Discussion
7.1. Technical Contributions and Practical Impact
7.2. Limitations and Scope
8. Conclusion
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References


| Dataset Variation | Documents | Domain | Preprocessing | Key Features |
| --- | --- | --- | --- | --- |
| Enron-Kaggle | 50K | Business | Raw format | Complete metadata, threads |
| GSA-Internal | 15K | Enterprise | Anonymized | Real workflows, hierarchies |
| GSA-Admin | 3K | Administration | Anonymized | Approval workflows |
| GSA-Research | 4K | R&D | Anonymized | Project documentation |
| 20news-18828 | 18,828 | Discussion | Deduplicated | Clean headers only |
| Reuters-21578 | 21,578 | Financial | SGML format | Professional terminology |

| Dataset Variation | K-Means | Agglomerative | DBSCAN | FLACON (Proposed) | Performance Gain | Significance Level |
| --- | --- | --- | --- | --- | --- | --- |
| Enron-Kaggle (Raw) | 0.008 | 0.017 | N/A* | 0.311 | Significant improvement | p < 0.001 |
| Enron-Intent (Verified) | 0.012 | 0.023 | 0.009 | 0.287 | Significant improvement | p < 0.001 |
| 20news-18828 (Clean) | 0.016 | 0.029 | 0.014 | 0.251 | Consistent improvement | p < 0.001 |
| 20news-19997 (Original) | 0.021 | 0.034 | 0.018 | 0.289 | Consistent improvement | p < 0.001 |
| 20news-bydate (Temporal) | 0.019 | 0.031 | 0.016 | 0.267 | Consistent improvement | p < 0.001 |
| Reuters-21578 (Financial) | 0.093 | 0.105 | 0.077 | 0.243 | Moderate improvement | p < 0.05 |
| Average Performance | 0.028 | 0.040 | 0.027 | 0.275 | Statistically significant | p < 0.001 |
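
As a quick arithmetic check on the "Average Performance" row, the sketch below recomputes each column mean from the six dataset rows. The variable and function names are illustrative and do not come from the paper. It also assumes that the DBSCAN "N/A*" entry is left out of that column's mean, which the table does not state explicitly.

```python
# Scores copied from the table above; None marks the DBSCAN "N/A*" entry.
scores = {
    "K-Means":       [0.008, 0.012, 0.016, 0.021, 0.019, 0.093],
    "Agglomerative": [0.017, 0.023, 0.029, 0.034, 0.031, 0.105],
    "DBSCAN":        [None, 0.009, 0.014, 0.018, 0.016, 0.077],
    "FLACON":        [0.311, 0.287, 0.251, 0.289, 0.267, 0.243],
}


def column_mean(values):
    """Mean over the available entries (assumption: N/A rows are skipped)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


# Round to three decimals, the precision used in the table.
averages = {method: round(column_mean(vals), 3) for method, vals in scores.items()}

for method, avg in averages.items():
    print(f"{method}: {avg:.3f}")
```

Under that assumption the DBSCAN mean is taken over five datasets rather than six, and the recomputed means agree with the reported row.
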

| Dataset | FLACON | Best Baseline | Performance Gain | Significance |
| --- | --- | --- | --- | --- |
| GSA-Internal | 0.342 | 0.089 | 3.8× improvement | p < 0.001 |
| GSA-Admin | 0.298 | 0.076 | 3.9× improvement | p < 0.001 |
| GSA-Research | 0.367 | 0.112 | 3.3× improvement | p < 0.001 |
| Average GSA | 0.336 | 0.092 | 3.7× improvement | p < 0.001 |
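
The "Performance Gain" column in the GSA table appears to be the ratio of the FLACON score to the best baseline score. A minimal sketch of that calculation follows. The names `gsa_scores` and `gain` are hypothetical, and how the "Average GSA" gain was derived is an assumption, since the table does not say.

```python
# (FLACON score, best baseline score) copied from the GSA table above.
gsa_scores = {
    "GSA-Internal": (0.342, 0.089),
    "GSA-Admin": (0.298, 0.076),
    "GSA-Research": (0.367, 0.112),
}


def gain(flacon: float, baseline: float) -> float:
    """Ratio of the FLACON score to the best baseline score."""
    return flacon / baseline


# Per-subset gains, rounded to one decimal as in the table.
gains = {name: round(gain(f, b), 1) for name, (f, b) in gsa_scores.items()}

# Two plausible readings of the "Average GSA" gain (both are assumptions):
# the mean of the per-subset ratios, and the ratio of the column means.
mean_of_ratios = sum(gain(f, b) for f, b in gsa_scores.values()) / len(gsa_scores)
mean_flacon = sum(f for f, _ in gsa_scores.values()) / len(gsa_scores)
mean_baseline = sum(b for _, b in gsa_scores.values()) / len(gsa_scores)
ratio_of_means = mean_flacon / mean_baseline

print(gains)
print(f"mean of ratios: {mean_of_ratios:.2f}, ratio of means: {ratio_of_means:.2f}")
```

The per-subset ratios round to the reported 3.8×, 3.9×, and 3.3×. The mean of the per-subset ratios is about 3.68, which matches the reported 3.7×, while the ratio of the unrounded column means is slightly lower at about 3.64.
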

| Dataset Size | FLACON Time (s) | BERT Clustering (s) | UPGMA (s) | Update Time (s) | Memory Usage (GB) | Queries/sec |
| --- | --- | --- | --- | --- | --- | --- |
| 10K documents | 60.2 | 89.7 | 118.4 | 0.18 | 1.2 | 1,850 |
| 50K documents | 187.5 | 278.3 | 356.2 | 0.45 | 4.8 | 1,420 |
| 100K documents | 342.8 | 521.6 | 689.5 | 0.78 | 8.9 | 1,180 |
| 500K documents | 823.4 | 1,247.2 | 1,658.3 | 1.52 | 22.4 | 895 |
| 1M documents | 1,284.7 | 1,934.8 | 2,567.1 | 2.31 | 41.7 | 742 |
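
One way to read the scalability table is to compare FLACON's running time with each baseline at every dataset size, and to estimate how the time grows with the number of documents. The sketch below does both. The log-log slope is an illustrative summary added here, not an analysis taken from the paper, and all names in the code are hypothetical.

```python
import math

# Rows copied from the scalability table above:
# (documents, FLACON s, BERT clustering s, UPGMA s).
timings = [
    (10_000, 60.2, 89.7, 118.4),
    (50_000, 187.5, 278.3, 356.2),
    (100_000, 342.8, 521.6, 689.5),
    (500_000, 823.4, 1247.2, 1658.3),
    (1_000_000, 1284.7, 1934.8, 2567.1),
]


def scaling_exponent(n0: int, t0: float, n1: int, t1: float) -> float:
    """Slope of log(time) against log(size) between two measurements."""
    return math.log(t1 / t0) / math.log(n1 / n0)


# How many times slower each baseline is than FLACON at each size.
slowdown_bert = [round(bert / flacon, 2) for _, flacon, bert, _ in timings]
slowdown_upgma = [round(upgma / flacon, 2) for _, flacon, _, upgma in timings]

# Empirical growth exponent for FLACON between the smallest and largest runs.
(n_small, t_small, _, _), (n_large, t_large, _, _) = timings[0], timings[-1]
exponent = scaling_exponent(n_small, t_small, n_large, t_large)

print("BERT / FLACON time ratio:", slowdown_bert)
print("UPGMA / FLACON time ratio:", slowdown_upgma)
print(f"FLACON log-log slope, 10K to 1M documents: {exponent:.2f}")
```

On these figures the BERT clustering baseline takes roughly 1.5 times as long as FLACON at every size and UPGMA roughly twice as long. A log-log slope below 1 between the 10K and 1M rows indicates that the reported FLACON time grows more slowly than the document count over this range.
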

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).