Submitted: 26 May 2026
Posted: 27 May 2026
Abstract
Keywords:
1. Introduction
2. Related Works
2.1. Formats and Description of Entries
- OSCAR (Open Super-large Crawled Aggregated coRpus) [1] provides a multilingual web-crawled corpus derived from Common Crawl, distributed as JSONL (JSON Lines) files in Zstandard-compressed form, with document-level description fields covering language identification confidence scores, quality annotations, harmful content annotation, perplexity-based filtering indicators and crawl provenance.
- HPLT Monolingual 3.0 (High Performance Language Technologies Monolingual Release 3.0) [3] adopts the JSONL format, alongside Parquet, with document-level metadata fields covering language identification, document- and segment-level quality estimates, web register labels, personally identifiable information annotations and crawl provenance.
- ParlaMint (Parliamentary Corpora of the Comparable and Multilingual Type) [4] provides comparable parliamentary corpora encoded in TEI/Parla-CLARIN XML, which offers extensive metadata description capabilities through the TEI schema [5], together with CoNLL-U derivatives for linguistic annotation. The corpora include structured metadata describing speakers (e.g. name, gender, party affiliation and parliamentary role), sessions, legislative terms and meetings, alongside topic annotations and government/opposition status indicators.
- MARCELL (Multilingual Resources for CEF.AT in the Legal Domain) [6] provides comparable national legislative corpora for seven languages (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian), distributed in CoNLL-U Plus format with four additional columns beyond the standard ten. The documents are tokenised, lemmatised, morphosyntactically annotated and dependency-parsed, and are further enriched with named entity labels together with ... enriched with term and descriptor annotations from IATE (Interactive Terminology for Europe – the European Union’s interinstitutional terminology database) [7] and EuroVoc (multilingual, multidisciplinary thesaurus of the European Union, maintained by the Publications Office) [8], providing a topic classification of each legislative document.
- CURLICAT (Curated Multilingual Language Resources for CEF.AT) [9] provides curated multilingual corpora in seven languages (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian), drawn from national corpora and covering domains relevant to the European Digital Service Infrastructures, including eHealth, Europeana, eGovernment, education and science. The data are represented in CoNLL-U Plus format, and are tokenised, lemmatised, morphologically analysed and annotated for named entities and IATE terms, with named entities further replaced for anonymisation purposes. A uniform metadata schema is applied across all language sub-corpora.
- Google AudioSet [10] uses a hierarchically organised JSON ontology of 632 audio classes together with CSV indices of YouTube video identifiers annotated with 10-second start/end offsets and multi-label class assignments. The ontology encodes parent–child relationships, textual descriptions and category restriction markers for each class, thereby supporting hierarchical retrieval, taxonomy-aware evaluation and the selective construction of balanced or class-specific subsets for sound event detection, audio tagging and weakly supervised learning.
- LibriSpeech [11] combines speaker-, chapter- and book-level metadata with utterance-aligned transcripts derived from public-domain LibriVox audiobooks. Per-utterance records include speaker identifiers, chapter and book identifiers, and transcription text, while speaker metadata additionally specifies gender and total contributed audio duration. The corpus is partitioned into predefined clean and other subsets reflecting differences in recording quality and speech recognition difficulty.
- The Million Song Dataset (MSD) [12] stores up to 55 metadata and audio-analysis attributes per track in HDF5 format, including artist and album identifiers, release year, listener-derived tags, MusicBrainz cross-references and similarity information, together with low-level acoustic descriptors such as tempo, key, time signature, loudness, segment-level pitch and timbre vectors, and beat- and bar-level onset times. A companion SQLite database provides indexed catalogue access, enabling efficient queries by artist, year or tag and supporting tasks such as music recommendation, cover song identification and large-scale music information retrieval.
- COCO (Common Objects in Context) [13] presents metadata in a single JSON file structured around the fields info, licenses, categories, images and annotations. Image-level metadata record dimensions, file name, URL, capture date and license identifier, while annotation entries encode bounding boxes, segmentation masks, keypoints, area and crowd flags and are linked to their category and image through identifiers.
- LAION-5B (Large-scale Artificial Intelligence Open Network, 5 Billion image–text pairs) [14] comprises 5.85 billion image–text records as Parquet shards enriched with CLIP cosine similarity, NSFW and watermark probability scores, image width and height, source URL, alt-text language identification and a hashed pointer to the underlying image, enabling subset selection by language, resolution, aesthetic quality or safety threshold.
- YouTube-8M [15] presents large-scale video annotation and feature data as TFRecord files containing precomputed frame-level visual embeddings and audio embeddings, together with a CSV vocabulary of more than 4,800 Knowledge Graph entities organised into 24 thematic verticals. Per-video metadata records the video identifier, duration and multi-label entity assignments, enabling large-scale video classification, retrieval and representation learning without requiring access to the raw media.
2.2. Databases for Metadata Management
3. Motivation and Objectives of the Present Model
- (i) Schema design and descriptive coverage – the definition of a metadata schema that characterises data entries through a set of categories which are either uniform across resources or systematically compatible, and which is sufficiently expressive to accommodate the diversity of resource types, modalities and provenance contexts encountered in practice.
- (ii) Expressive power and semantic integration – the capacity to represent and exploit relationships between metadata entries, both internally and through the integration of external resources such as ontologies, terminological databases and controlled vocabularies, so as to support semantically informed retrieval and reasoning.
- (iii) Storage and processing efficiency – the choice of storage and management technologies appropriate to the intended applications, ensuring that the system can sustain the range of read, write and query workloads required for both routine cataloguing and on-demand dataset extraction.
- (iv) Flexibility and scalability – the ability of the conceptual model and the underlying technical implementation to accommodate new resource types, evolving descriptive requirements and growth in data volume without structural redesign.
- (v) Accessibility and interoperability – the provision of well-defined access levels and interfaces that enable integration into larger research infrastructures and direct use in downstream natural-language-processing and artificial-intelligence applications, including the training and fine-tuning of large language models.
- Cross-modal applicability. The metadata schema is designed to accommodate large datasets across textual, audio, image, video and multimodal collections, rather than being restricted to a single modality, as is the case in most existing resources, including the Bulgarian National Corpus (BulNC).
- Unified system of interconnected categories. The heterogeneous datasets surveyed in Section 2.1 can be converted into a single coherent schema in which descriptive categories are explicitly interrelated, thereby overcoming the fragmentation that characterises current metadata practices.
- Graph-based storage. Metadata are stored in a graph database rather than in flat files or normalised relational tables, so that the hierarchical and many-to-many relations characteristic of contemporary data collections are represented natively, in line with recent developments in the literature on metadata knowledge graphs.
- Expressive power and efficient retrieval. The graph representation supports expressive traversal queries over the metadata repository, enabling the joint filtering of structural, descriptive and analytical criteria in a manner that is not practically achievable through the format-specific interfaces of the resources reviewed above.
- Orientation towards AI and LLM applications. The schema is designed with downstream applications in natural-language processing, computer vision and artificial intelligence in mind, including in particular the description and management of training and fine-tuning data for large language models. A prototype implementation supporting textual data for LLM development has been released and is reported in Section 5.
4. Main Characteristics of the Model
- an extensive and semantically rich metadata schema for the uniform description of data entries across modalities;
- a graph database that stores and manages the resulting metadata, representing both entries and the relationships among them as nodes and typed edges;
- a retrieval mechanism that supports the extraction of dataset subsets according to user-specified criteria, implemented through Cypher queries executed as graph traversals along paths between nodes.
4.1. Extensive Metadata Schema
- To ensure the findability, accessibility, interoperability and reusability of the units of the dataset, in accordance with the FAIR principles for scientific data management [32], which have become the reference framework for the publication and reuse of research data and which the schema operationalises at the level of individual data entries; and
- To enable the effective extraction of specialised subdatasets tailored to specific research and application tasks, including the training and fine-tuning of large language models and the development of various AI applications.
4.2. Organisation and Management of Metadata in a Graph Database
4.3. Efficient Information Retrieval
5. Prototype for Proof-of-Concept
5.1. Prototype Metadata Schema
5.2. Prototype Graph Database
- Document – the central node type, instantiated once per textual entry. It carries the document-specific properties of the schema (including Identifier, DocumentTitle, PublicationDate, URL, etc.), as well as some of the optional descriptors.
- Domain – a node type representing a subject area drawn from the controlled taxonomy of 45 main domains and more than 150 subdomains, with the hierarchical structure of the taxonomy captured by the connections among the Domain nodes themselves.
- Style, Type, Medium – representing, respectively, the stylistic register of the document (such as legal, journalistic, administrative or scientific), its genre (such as book, journal article, administrative document or news report), and medium (written, spoken), all drawn from controlled vocabularies and instantiated once per distinct value so that documents sharing the same register, genre or medium are attached to the same node.
- Author – a node type storing the name of the author and, where available, optional biographical descriptors such as period of activity, principal language and primary professional role.
- Source – a node type representing the publishing organisation, media outlet, web platform or other institutional originator from which the document was acquired, optionally classified by source type (newspaper, journal, repository, broadcaster, and so on).
- Licence – a node type representing the licensing regime under which the document is distributed, identified by name and by a classification of permitted use (open, attribution-only, non-commercial, restricted, and so on), with an optional URL pointing to the full text of the licence; in addition, each licence is automatically mapped to one of two broader categories, free or restricted, defined as a node LicenceCategory, which supports rapid filtering of the collection at retrieval time according to the licensing requirements of the downstream application.
- Statistics – a node type aggregating the quantitative descriptors of the document, namely NumberWords, NumberSentences, NumberParagraphs and NumberTokens, instantiated once per document and represented as a distinct node so that range queries over size or length can be expressed as traversals to a dedicated node rather than as property filters on the central Document node.
- Date – a node type representing a calendar reference, stored in ISO 8601 form and instantiated once per distinct value in the collection. The representation supports two levels of temporal granularity: a full date in yyyy-mm-dd format and a year-only value in yyyy format. The two forms share a single node type, so that documents attached to a year-only Date node and documents attached to a fully specified Date node within that year can be retrieved together through a single traversal pattern.
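The two granularities of the Date node can be illustrated with a minimal Python sketch. It is not the prototype's implementation: the helper names `date_year` and `documents_in_year`, the identifier format `bg-0001` and the in-memory dictionary standing in for PUBLISHED_ON edges are all assumptions made for illustration. The point is only that a yyyy value and a yyyy-mm-dd value can be matched by the same year test.

```python
import re

# Hypothetical helpers: the source does not show how the prototype matches dates.
_DATE_RE = re.compile(r"^(\d{4})(?:-(\d{2})-(\d{2}))?$")


def date_year(value: str) -> int:
    """Return the year of a Date node value given as yyyy or yyyy-mm-dd."""
    match = _DATE_RE.match(value)
    if match is None:
        raise ValueError(f"not a yyyy or yyyy-mm-dd value: {value!r}")
    return int(match.group(1))


def documents_in_year(published_on: dict[str, str], year: int) -> list[str]:
    """Select documents whose Date node falls in `year`, at either granularity."""
    return sorted(
        doc for doc, value in published_on.items() if date_year(value) == year
    )


# Stand-in for PUBLISHED_ON edges: document identifier -> Date node value.
published_on = {
    "bg-0001": "2019-03-14",  # fully specified date
    "bg-0002": "2019",        # year-only date
    "bg-0003": "2021-11-02",
}

print(documents_in_year(published_on, 2019))  # ['bg-0001', 'bg-0002']
```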
5.3. Prototype Web Interface
- Licence filter. Users can restrict results by licence, selecting either a whole licence family (for example, all Creative Commons licences) or specific licences, as well as broader categories such as open or restrictive licences.
- Subject area filter. Documents can be filtered by selecting one or more subject areas. The graph database also allows more complex search and filtering functions to be added through logical operations (and, or, not); these are not implemented in the initial prototype.
- Filter by time period. Users can specify a range of publication years by entering a start and/or end year (if one of the limits is omitted, the corresponding default is applied, namely the earliest or latest year in the database). More complex queries (two or more periods) and more precise time ranges (by month or specific date) can be implemented for cases where the text items carry more precise metadata.
- Keyword filter. Free-text search is supported for multiple keywords, entered as a comma-separated list. In the prototype, keyword search covers the keyword field in the metadata as well as the title, domains, subdomains and other relevant metadata categories.
- a full description of the metadata for the selected documents in JSON format;
- a list of links to the original sources in TSV format;
- full text data as a ZIP archive, upon confirmation of the details and agreement to the terms of use, including the restrictions imposed by the individual licences of the original documents.
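The filters of the web interface can be pictured as a parameterised Cypher query assembled from the user's choices. The sketch below is a hedged illustration rather than the prototype's code. The function `build_query` is hypothetical, and the property names `name`, `value` and `Identifier`, like the example domain `Law`, are assumptions. Only the node labels and relation names are taken from the schema described in this paper. The keyword filter is left out for brevity.

```python
from typing import Any, Optional, Sequence


def build_query(
    licence_category: Optional[str] = None,
    domains: Optional[Sequence[str]] = None,
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
) -> tuple[str, dict[str, Any]]:
    """Assemble a parameterised Cypher query from the selected filters."""
    matches = ["MATCH (d:Document)"]
    conditions: list[str] = []
    params: dict[str, Any] = {}

    if licence_category is not None:
        matches.append(
            "MATCH (d)-[:LICENSED_WITH]->(:Licence)"
            "-[:HAS_LICENCE_CATEGORY]->(c:LicenceCategory)"
        )
        conditions.append("c.name = $licence_category")  # assumed property name
        params["licence_category"] = licence_category

    if domains:
        matches.append("MATCH (d)-[:BELONGS_TO]->(dom:Domain)")
        conditions.append("dom.name IN $domains")  # assumed property name
        params["domains"] = list(domains)

    if year_from is not None or year_to is not None:
        matches.append("MATCH (d)-[:PUBLISHED_ON]->(date:Date)")
        # Both yyyy and yyyy-mm-dd values start with the four-digit year.
        year = "toInteger(substring(date.value, 0, 4))"
        if year_from is not None:
            conditions.append(f"{year} >= $year_from")
            params["year_from"] = year_from
        if year_to is not None:
            conditions.append(f"{year} <= $year_to")
            params["year_to"] = year_to

    query = "\n".join(matches)
    if conditions:
        query += "\nWHERE " + " AND ".join(conditions)
    query += "\nRETURN DISTINCT d.Identifier AS identifier"
    return query, params


query, params = build_query(licence_category="free", domains=["Law"], year_from=2010)
print(query)
print(params)
```

Passing values as parameters rather than interpolating them into the query text keeps user input out of the Cypher string. An omitted year limit simply adds no condition, which gives the same result as defaulting to the earliest or latest year in the database.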
5.4. A Simple Evaluation of the Graph Database



5.5. Advantages of the Model
1. Expressive representation of relationships. The model captures the rich relationships between the descriptive characteristics of textual documents – subject area, licence, source, author, keywords and others – making the resulting metadata more expressive for describing and researching documents across different text resources.
2. Hierarchical and non-hierarchical structures. The graph database allows for a natural representation of the complex hierarchical and non-hierarchical relationships between the metadata categories of textual documents.
3. Efficient retrieval through graph traversals. Complex queries are processed as graph traversals whose cost depends on the part of the graph actually visited rather than on the total volume of data, which supports the efficient extraction of subsets and comparative analyses.
4. Extensibility of the schema. New node types and new relation types – for example, linking the Keywords and Domain nodes – can be introduced without modification of the overall description schema or migration of existing data.
5. Orientation towards LLM and AI applications. The schema includes metadata categories specific to large language models – ratings for personally identifiable information, ratings for biased content, licence link and licence type, and information on applicability to various categories of automated tasks – which makes it suitable both for the construction of training and fine-tuning datasets and for other applications in the field of artificial intelligence.
6. Conclusions
Acknowledgments
References
- Abadji, J.; Ortiz Suárez, P.; Romary, L.; Sagot, B. Towards a Cleaner Document-Oriented Multilingual Crawled Corpus. In Proc. 13th Language Resources and Evaluation Conf. (LREC), 2022; pp. 4344–4355. Available online: https://oscar-project.org/.
- Nguyen, T.; Van Nguyen, C.; Lai, V.D.; Man, H.; Ngo, N.T.; Dernoncourt, F.; Rossi, R.A.; Nguyen, T.H. CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages. In Proc. Joint Int. Conf. Computational Linguistics, Language Resources and Evaluation (LREC-COLING), 2024; pp. 4226–4237. Available online: https://huggingface.co/datasets/uonlp/CulturaX.
- Oepen, S.; et al. HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models. arXiv 2025. Available online: https://hplt-project.org/datasets/v3.0.
- Erjavec, T.; Ogrodniczuk, M.; Osenova, P.; Ljubešić, N.; et al. The ParlaMint Corpora of Parliamentary Proceedings. Lang. Resour. Eval. 2023, 57, 415–448. Available online: https://www.clarin.eu/parlamint.
- TEI Consortium. TEI P5: Guidelines for Electronic Text Encoding and Interchange, version 4.9.0; TEI Consortium, 2024.
- Váradi, T.; Koeva, S.; Yamalov, M.; Tadić, M.; Sass, B.; Nitoń, B.; Ogrodniczuk, M.; Pęzik, P.; Barbu Mititelu, V.; Ion, R.; et al. The MARCELL Legislative Corpus. In Proc. 12th Language Resources and Evaluation Conf. (LREC), 2020; pp. 3761–3768.
- European Union. IATE: Interactive Terminology for Europe. Interinstitutional terminology database of the European Union, 2024. Managed by the Translation Centre for the Bodies of the European Union (CdT) on behalf of the EU institutions.
- Publications Office of the European Union. EuroVoc: The EU’s Multilingual and Multidisciplinary Thesaurus. EU Vocabularies Portal, 2024. Available in 24 official EU languages plus Albanian, Macedonian and Serbian.
- Váradi, T.; Nyéki, B.; Koeva, S.; Tadić, M.; Štefanec, V.; Ogrodniczuk, M.; Nitoń, B.; Pęzik, P.; Barbu Mititelu, V.; Irimia, E.; et al. Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources. In Proc. 13th Language Resources and Evaluation Conf. (LREC), 2022; pp. 100–108.
- Gemmeke, J.F.; Ellis, D.P.W.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An Ontology and Human-Labeled Dataset for Audio Events. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2017; pp. 776–780. Available online: https://research.google.com/audioset/.
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. LibriSpeech: An ASR Corpus Based on Public Domain Audio Books. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2015; pp. 5206–5210.
- Bertin-Mahieux, T.; Ellis, D.P.W.; Whitman, B.; Lamere, P. The Million Song Dataset. In Proc. 12th Int. Soc. for Music Information Retrieval Conf. (ISMIR), 2011; pp. 591–596. Available online: http://millionsongdataset.com/.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proc. European Conf. Computer Vision (ECCV), 2014; pp. 740–755.
- Schuhmann, C.; Beaumont, R.; Vencu, R.; Gordon, C.; Wightman, R.; Cherti, M.; Coombes, T.; Katta, A.; Mullis, C.; Wortsman, M.; et al. LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image–Text Models. In Advances in Neural Information Processing Systems (NeurIPS), 2022. Available online: https://laion.ai/blog/laion-5b/.
- Abu-El-Haija, S.; Kothari, N.; Lee, J.; Natsev, P.; Toderici, G.; Varadarajan, B.; Vijayanarasimhan, S. YouTube-8M: A Large-Scale Video Classification Benchmark. arXiv 2016. Available online: https://research.google.com/youtube8m/.
- Pinoli, P.; Ceri, S.; Martinenghi, D.; Nanni, L. Metadata management for scientific databases. Inf. Syst. 2019, 81, 1–20.
- Niazi, S.; Ismail, M.; Grohsschmiedt, S.; Ronström, M.; Haridi, S.; Dowling, J. HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases. In Proceedings of the 15th USENIX Conference on File and Storage Technologies (FAST ’17), Santa Clara, CA, USA, 2017; pp. 89–104.
- Yang, Y.; Yu, F.; Zhang, J.; Xiao, B.; Wang, F.; Zhang, M. The Metadata Management Based on MongoDB for EAST Experiment. Fusion Eng. Des. 2023, 195, 113896.
- Halevy, A.; Korn, F.; Noy, N.F.; Olston, C.; Polyzotis, N.; Roy, S.; Whang, S.E. Goods: Organizing Google’s Datasets. In Proceedings of the 2016 ACM SIGMOD International Conference on Management of Data, San Francisco, CA, USA, 2016; pp. 795–806.
- Sawadogo, P.N.; Darmont, J. On Data Lake Architectures and Metadata Management. J. Intell. Inf. Syst. 2021, 56, 97–120.
- Cimiano, P.; Chiarcos, C.; McCrae, J.P.; Gracia, J. Linguistic Linked Data: Representation, Generation and Applications; Springer: Cham, 2020.
- Liang, Z. Harmonizing Metadata of Language Resources for Enhanced Querying and Accessibility. arXiv 2025, arXiv:cs.
- Khan, A.; Doroshenko, V.; Sahay, R.; Chronaki, C.; Bohlscheid, H. Using Graph Tools on Metadata Repositories. In Building Continents of Knowledge in Oceans of Data: The Future of Co-Created eHealth; Studies in Health Technology and Informatics; IOS Press, 2018; Vol. 247, pp. 761–765.
- Abu Ahmad, R.; D’Souza, J.; Zloch, M.; Otto, W.; Rehm, G.; Oelen, A.; Dietze, S.; Auer, S. Toward FAIR Semantic Publishing of Research Dataset Metadata in the Open Research Knowledge Graph. In Joint Proceedings of the Onto4FAIR 2023 Workshops, 2024.
- DCMI Usage Board. ANSI/NISO Z39.85-2012; DCMI Metadata Terms. DCMI Recommendation, 2020. Standardised as ISO 15836-1:2017 and ISO 15836-2:2019; supersedes the Dublin Core Metadata Element Set, Version 1.1.
- Broeder, D.; Windhouwer, M.; Van Uytvanck, D.; Goosen, T.; Trippel, T. CMDI: A Component Metadata Infrastructure. In Describing LRs with Metadata: Towards Flexibility and Interoperability in the Documentation of LR, Workshop at LREC 2012, Istanbul, Turkey, 2012; pp. 1–4.
- Zulueta-Coarasa, T.; Jug, F.; Mathur, A.; Moore, J.; Muñoz Barrutia, A.; Babalola, K.; Bankhead, P.; Gilloteaux, P.; Gogoberidze, N.; Jones, M.; et al. MIFA: Metadata, Incentives, Formats, and Accessibility Guidelines to Improve the Reuse of AI Datasets for Bioimage Analysis. arXiv 2023, arXiv:q.
- Akhtar, M.; Benjelloun, O.; Conforti, C.; Foschini, L.; Gijsbers, P.; Giner-Miguelez, J.; Goswami, S.; Jain, N.; Karamousadakis, M.; Krishna, S.; et al. Croissant: A Metadata Format for ML-Ready Datasets. Adv. Neural Inf. Process. Syst. 2024, 37, 82133–82148.
- Ignatowicz, J.; Kutt, K.; Nalepa, G.J. Position Paper: Metadata Enrichment Model: Integrating Neural Networks and Semantic Knowledge Graphs for Cultural Heritage Applications. arXiv 2025, arXiv:cs.
- Koeva, S.; Blagoeva, D.; Kolkovska, S. Bulgarian National Corpus Project. In Proc. 7th Int. Conf. Language Resources and Evaluation (LREC), 2010.
- Koeva, S.; Stoyanova, I.; Leseva, S.; Dimitrova, T.; Dekova, R.; Tarpomanova, E. The Bulgarian National Corpus: Theory and Practice in Corpus Design. J. Lang. Model. 2012, 0, 65–110.
- Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.W.; da Silva Santos, L.B.; Bourne, P.E.; et al. The FAIR Guiding Principles for Scientific Data Management and Stewardship. Sci. Data 2016, 3, 160018.
- Francis, N.; Green, A.; Guagliardo, P.; Libkin, L.; Lindaaker, T.; Marsault, V.; Plantikow, S.; Rydberg, M.; Selmer, P.; Taylor, A. Cypher: An Evolving Query Language for Property Graphs. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD ’18), Houston, TX, USA, 2018; pp. 1433–1445.

| Category | Data type | Description |
|---|---|---|
| Mandatory (15) | ||
| Identifier | String (free) | Unique document identifier with the language prefix bg. |
| Licence | String (from list) | Licence name with classification by type (open, restricted, etc.). |
| PublicationDate | Date (yyyy-mm-dd) | Date of publication of the text. |
| DocumentTitle | String (free) | Title of the document. |
| Source | String (free) | Publishing organisation, media outlet or institutional originator. |
| Medium | String (from list) | Modality of the resource (textual, multimodal). |
| URL | String (URL) | Original web address. |
| Domain | Array[String] (from list) | Up to six subject areas from a controlled vocabulary. |
| Keywords | Array[String] (free) | Up to six keywords characterising the content. |
| NumberWords | Integer | Total number of words. |
| NumberSentences | Integer | Total number of sentences. |
| NumberParagraphs | Integer | Total number of paragraphs. |
| NumberTokens | Integer | Total number of tokens. |
| PersonallyIdentifiableInformation | Array[Double] | Per-sentence vector of the proportion of tokens flagged as personally identifiable. |
| BiasedInformation | Array[Double] | Per-sentence vector of the proportion of tokens flagged as potentially biased. |
| Optional (8) | ||
| Author | Array[String] (free) | Name(s) of the author(s). |
| Style | String (from list) | Stylistic register (e.g. legal, journalistic, administrative). |
| Type | String (from list) | Document genre (e.g. book, document, article). |
| Subdomain | Array[String] (from list) | Narrower thematic classification, hierarchically linked to Domain. |
| TranslatedDocument | Boolean | Indicator of original Bulgarian text vs. translation. |
| CollectionDate | Date (yyyy-mm-dd) | Date of acquisition into the collection. |
| LicenseLink | String (URL) | URL of the licence text. |
| TaskCategories | Array[String] (from list) | Anticipated NLP applications, from a predefined list. |
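The constraints stated in the schema table (fifteen mandatory categories, at most six Domain and Keywords values, the bg prefix on identifiers, and per-sentence proportions for the two rating vectors) can be checked mechanically. The following Python sketch shows one way to do so. The function `validate_record`, the example values and the identifier format `bg-0001` are assumptions for illustration and are not taken from the prototype.

```python
# Mandatory categories, copied from the schema table (15 entries).
MANDATORY_FIELDS = (
    "Identifier", "Licence", "PublicationDate", "DocumentTitle", "Source",
    "Medium", "URL", "Domain", "Keywords", "NumberWords", "NumberSentences",
    "NumberParagraphs", "NumberTokens", "PersonallyIdentifiableInformation",
    "BiasedInformation",
)


def validate_record(record: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the record passes."""
    problems = [
        f"missing mandatory field: {name}"
        for name in MANDATORY_FIELDS
        if name not in record
    ]
    if not str(record.get("Identifier", "")).startswith("bg"):
        problems.append("Identifier lacks the language prefix bg")
    for name in ("Domain", "Keywords"):
        if len(record.get(name, [])) > 6:
            problems.append(f"{name} has more than six values")
    for name in ("PersonallyIdentifiableInformation", "BiasedInformation"):
        # Each entry is a proportion of flagged tokens, so it must lie in [0, 1].
        if any(not 0.0 <= share <= 1.0 for share in record.get(name, [])):
            problems.append(f"{name} contains a value outside [0, 1]")
    return problems


# Hypothetical example record; all values are invented for illustration.
record = {
    "Identifier": "bg-0001",
    "Licence": "CC BY 4.0",
    "PublicationDate": "2019-03-14",
    "DocumentTitle": "Example title",
    "Source": "Example publisher",
    "Medium": "textual",
    "URL": "https://example.org/doc",
    "Domain": ["Law"],
    "Keywords": ["legislation", "example"],
    "NumberWords": 12,
    "NumberSentences": 2,
    "NumberParagraphs": 1,
    "NumberTokens": 14,
    "PersonallyIdentifiableInformation": [0.0, 0.25],
    "BiasedInformation": [0.0, 0.0],
}

print(validate_record(record))  # []
too_many = dict(record, Domain=["a", "b", "c", "d", "e", "f", "g"])
print(validate_record(too_many))  # ['Domain has more than six values']
```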

| Relation | From → To | Interpretation |
|---|---|---|
| BELONGS_TO | Document→Domain | Subject-area assignment (up to six). |
| SUBCATEGORY_OF | Domain→Domain | Hierarchical link to parent domain. |
| LICENSED_WITH | Document→Licence | Licence assignment of the document. |
| HAS_LICENCE_CATEGORY | Licence→LicenceCategory | High-level class (free / restricted). |
| WRITTEN_BY | Document→Author | Authorship. |
| PUBLISHED_IN | Document→Source | Publishing organisation or platform. |
| PUBLISHED_ON | Document→Date | Date of publication. |
| HAS_STYLE | Document→Style | Stylistic register. |
| HAS_TYPE | Document→Type | Document genre. |
| HAS_MEDIUM | Document→Medium | Medium of the dataset entry. |
| HAS_KEYWORD | Document→Keyword | Content-descriptive term. |
| HAS_SIZE | Document→Statistics | Size summary (word, sentence, paragraph and token counts). |
| SUITABLE_FOR | Document→TaskCategory | Suitable NLP application. |
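To make the relation table concrete, the sketch below turns one metadata record into typed edges of the form (source node, relation, target node). It is an illustrative assumption about how schema categories map onto relations, inferred from the two tables rather than confirmed by the text. The names `EDGE_RULES` and `record_to_edges` and the example values are hypothetical. SUBCATEGORY_OF, HAS_LICENCE_CATEGORY and HAS_SIZE are omitted because they do not connect the document to a single controlled value.

```python
# Assumed mapping: schema category -> (relation name, target node label).
EDGE_RULES = {
    "Domain": ("BELONGS_TO", "Domain"),
    "Licence": ("LICENSED_WITH", "Licence"),
    "Author": ("WRITTEN_BY", "Author"),
    "Source": ("PUBLISHED_IN", "Source"),
    "PublicationDate": ("PUBLISHED_ON", "Date"),
    "Style": ("HAS_STYLE", "Style"),
    "Type": ("HAS_TYPE", "Type"),
    "Medium": ("HAS_MEDIUM", "Medium"),
    "Keywords": ("HAS_KEYWORD", "Keyword"),
    "TaskCategories": ("SUITABLE_FOR", "TaskCategory"),
}


def record_to_edges(record: dict) -> list[tuple]:
    """Turn one metadata record into (source node, relation, target node) triples."""
    document = ("Document", record["Identifier"])
    edges = []
    for field, (relation, label) in EDGE_RULES.items():
        value = record.get(field)
        if value is None:
            continue  # optional category not present in this record
        values = value if isinstance(value, list) else [value]
        for item in values:
            # A node is identified by (label, value), so equal values share a node.
            edges.append((document, relation, (label, item)))
    return edges


# Hypothetical records; all values are invented for illustration.
first = {"Identifier": "bg-0001", "Licence": "CC BY 4.0",
         "Domain": ["Law", "Politics"], "Style": "legal",
         "PublicationDate": "2019"}
second = {"Identifier": "bg-0002", "Licence": "CC BY 4.0", "Style": "legal"}

edges = record_to_edges(first) + record_to_edges(second)
nodes = {node for source, _, target in edges for node in (source, target)}
print(len(edges), len(nodes))  # 7 7
```

Because the two example documents share the same licence and style, they attach to the same Licence and Style nodes, which is why seven edges yield only seven distinct nodes.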
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.