Preprint
Article

This version is not peer-reviewed.

A Model for Metadata Organisation and Management for Compilation of Specialised Datasets from Big Data

Submitted: 26 May 2026

Posted: 27 May 2026


Abstract
This paper presents a model for the organisation and management of metadata that enables the efficient compilation of specialised datasets from large and heterogeneous data collections. The model is built around three principal components: (i) an extensive metadata schema, organised into administrative, editorial, structural, descriptive, classificational, analytical and statistical categories, that describes the formal and content-related characteristics of data entries in accordance with the FAIR principles of findability, accessibility, interoperability and reusability; (ii) a graph-database representation in which metadata categories and their values are stored as nodes and relationships as directed, typed edges, capturing both hierarchical structures, such as taxonomies of domains, and non-hierarchical structures, such as cross-references between resources and links to external semantic resources; and (iii) a retrieval mechanism implemented through Cypher queries executed as graph traversals, supporting the extraction of thematically and application-oriented data subsets according to combinations of structural, descriptive and analytical criteria. The model is modality-independent by design and supports the incremental introduction of new categories and relations without migration of existing data. Its feasibility is demonstrated through a prototype for Bulgarian textual data, comprising a graph database of more than 200,000 documents and a web-based filtering interface, which together validate the suitability of the approach for the compilation of specialised datasets for the training and fine-tuning of large language models and other NLP applications.

1. Introduction

The paper presents a model for the representation, organisation and management of metadata for large datasets, facilitating the efficient compilation of subdatasets based on specific selection criteria. The model may find applications in the fields of data science, artificial intelligence and computational linguistics, and more specifically in the management of metadata for large-scale datasets, including those used for the pre-training and fine-tuning of large language models and the development and training of artificial intelligence applications. As the volume, heterogeneity, and provenance complexity of linguistic resources continue to grow, the capacity to locate, analyse, and recombine various resources in an organised and controllable manner has become a prerequisite for reproducible and methodologically sound research in the field of natural language processing and artificial intelligence.
Several significant limitations persist in current practice with respect to dataset management. Metadata schemata across data repositories remain fragmented and highly dependent on the initial task for which they were compiled and applied, which impedes cross-collection retrieval and further reuse. The reliability and compositional structure of datasets employed in model training are rarely captured in an analysable form, and significant effort is required to overview a dataset and evaluate its applicability for a particular task. The relationships between resources, such as derivation, overlap (duplicates and near-duplicates), translation, or domain affiliation, are rarely stated explicitly in the description of the dataset samples, although they are central to assessing the coverage, the quality and the application potential of the data. The filtering and selection of training and evaluation corpora for specific tasks, domains, or languages relies, to a large extent, on pre-established procedures rather than on a data-driven, unified and queryable representation of the available resources.
The present model aims to fill these gaps by proposing a structured yet varied metadata schema coupled with an efficient and flexible representation as a graph database, so as to support the FAIR principles (Findable, Accessible, Interoperable, Reusable) in data compilation.
The present work is guided by the following research questions:
(1) How can a metadata schema be designed and structured to unify the description of heterogeneous language data collections while preserving the expressive power needed to support both general-purpose and task-specific dataset compilation?
(2) How can we achieve an efficient representation of metadata that offers advantages over traditional paradigms in the context of dataset extraction for model training and fine-tuning, as well as various NLP tasks?
(3) How can we offer efficient and varied approaches to compile datasets from large data for specific purposes, domains, tasks, etc.?
The remainder of the paper addresses these questions by situating the proposed model within related work (Section 2), presenting its conceptual architecture and underlying graph schema (Section 4), and discussing its implications for dataset and metadata management in contemporary artificial intelligence research (Section 5.5). The validity of the model is demonstrated through a prototype for the management of language data in Bulgarian and a web interface for dataset filtering and compilation (Section 5). The paper concludes with some directions for future work (Section 6).

3. Motivation and Objectives of the Present Model

The principal challenges that arise in the design of a metadata organisation and management system can be grouped into the following interrelated dimensions:
(i) Schema design and descriptive coverage – the definition of a metadata schema that characterises data entries through a set of categories which are either uniform across resources or systematically compatible, and which is sufficiently expressive to accommodate the diversity of resource types, modalities and provenance contexts encountered in practice.
(ii) Expressive power and semantic integration – the capacity to represent and exploit relationships between metadata entries, both internally and through the integration of external resources such as ontologies, terminological databases and controlled vocabularies, so as to support semantically informed retrieval and reasoning.
(iii) Storage and processing efficiency – the choice of storage and management technologies appropriate to the intended applications, ensuring that the system can sustain the range of read, write and query workloads required for both routine cataloguing and on-demand dataset extraction.
(iv) Flexibility and scalability – the ability of the conceptual model and the underlying technical implementation to accommodate new resource types, evolving descriptive requirements and growth in data volume without structural redesign.
(v) Accessibility and interoperability – the provision of well-defined access levels and interfaces that enable integration into larger research infrastructures and direct use in downstream natural-language-processing and artificial-intelligence applications, including the training and fine-tuning of large language models.
A range of widely adopted metadata schemata aim to establish universal or interoperable standards for resource description, among them Dublin Core for general-purpose bibliographic and digital-object metadata [25], the TEI Guidelines for the encoding of textual resources [5], and the Component MetaData Infrastructure (CMDI) developed within CLARIN for language resources [26]. In practice, however, none of these has achieved universal adoption, owing to a combination of factors: the heterogeneity of resource types and disciplinary requirements, the granularity mismatch between general-purpose and domain-specific descriptors, and the cost of retrofitting collections. More recent examples of standardisation efforts specifically tailored to datasets used in machine learning and NLP are MIFA (Metadata, Incentives, Formats, and Accessibility) [27], which formulates guidelines for the description and sharing of bioimage datasets intended for the training of AI models, and Croissant [28], a JSON-LD-based metadata standard built specifically for machine-learning datasets, which has been integrated into platforms such as Hugging Face, Kaggle, OpenML and Google Dataset Search.
The absence of database-backed metadata storage shifts the burden of indexing, cross-collection alignment and deduplication onto the users of the data, who are left to implement these operations through pipelines that are rarely compatible with one another and that seldom outlive the immediate task for which they were developed. This in turn constrains the choice of storage and management technology, so that solutions offering a large variety of language data in a uniform system which is queryable at scale and supports both routine cataloguing and on-demand dataset extraction remain rare.
A further requirement, increasingly recognised in current practice, is the incorporation of external knowledge resources, including ontologies, terminological databases, controlled vocabularies and other semantic resources that establish both hierarchical and non-hierarchical relations among metadata fields and add to the semantic expressiveness of the metadata. Such resources enable consistent annotation, cross-collection alignment and semantically informed retrieval that purely descriptive metadata cannot support on their own. A recent illustration of this integrative trend is the Metadata Enrichment Model proposed by Ignatowicz et al. [29], which combines fine-tuned computer-vision models, large language models and structured knowledge graphs to enrich the metadata of digitised cultural-heritage collections, demonstrating how neural representations and explicit semantic resources can be jointly mobilised to overcome the limitations of either approach in isolation.
The present model builds on the principles established in the compilation of the Bulgarian National Corpus (BulNC) [30,31], a large reference corpus of Bulgarian (approx. 240,000 text samples) with metadata at the document level, organised into administrative, editorial, structural, descriptive, classificational, analytical and statistical categories, some of which are linked by hierarchical and other relations.
With respect to the structure and metadata management, the proposed model in the current work offers the following extensions, generalisations and advantages:
  • Cross-modal applicability. The metadata schema is designed to accommodate large datasets across textual, audio, image, video and multimodal collections, rather than being restricted to a single modality as is the case in most existing resources, including the BulNC.
  • Unified system of interconnected categories. The heterogeneous datasets surveyed in Section 2.1 can be converted into a single coherent schema in which descriptive categories are explicitly interrelated, thereby overcoming the fragmentation that characterises current metadata practices.
  • Graph-based storage. Metadata are stored in a graph database rather than in flat files or normalised relational tables, so that the hierarchical and many-to-many relations characteristic of contemporary data collections are represented natively, in line with recent developments in the literature on metadata knowledge graphs.
  • Expressive power and efficient retrieval. The graph representation supports expressive traversal queries over the metadata repository, enabling the joint filtering of structural, descriptive and analytical criteria in a manner that is not practically achievable through the format-specific interfaces of the resources reviewed above.
  • Orientation towards AI and LLM applications. The schema is designed with downstream applications in natural-language processing, computer vision and artificial intelligence in mind, including in particular the description and management of training and fine-tuning data for large language models. A prototype implementation supporting textual data for LLM development has been released and is reported in Section 5.
The key features of linguistic datasets that should guide the choice of an approach to metadata management are the heterogeneity of the data, the semantic richness and fine-grained interlinking of metadata categories, and the need for the extensive selective-extraction capabilities that the training and fine-tuning of large language models demand. The model presented in this work positions itself within this landscape by adopting a graph-database substrate as its core representational layer, while incorporating the semantic interoperability principles developed within the Linked Data tradition and the structural metadata categories familiar from data-lake catalogues, in order to support domain-specific and application-oriented dataset extraction across heterogeneous language data collections.

4. Main Characteristics of the Model

The technical problem addressed by the current work is to provide a unified extended description of individual entries via a metadata system applicable for large and heterogeneous datasets, as well as to provide an effective technical solution for the efficient retrieval of suitable subdatasets for specific domains, tasks and applications, such as those related to pre-training, fine-tuning of large language models, Retrieval-Augmented Generation (RAG) applications, and others.
The proposed system for the organisation and management of metadata, oriented towards the extraction of specialised datasets from large heterogeneous collections, comprises three principal components:
  • an extensive and semantically rich metadata schema for the uniform description of data entries across modalities;
  • a graph database that stores and manages the resulting metadata, representing both entries and the relationships among them as nodes and typed edges;
  • a retrieval mechanism that supports the extraction of dataset subsets according to user-specified criteria, implemented through Cypher queries executed as graph traversals along paths between nodes.
The essence of the model consists of the following three main characteristics.

4.1. Extensive Metadata Schema

The metadata schema is organised around three classes of descriptors: mandatory categories, which must be assigned a value for every data entry; data-specific mandatory categories, which are mandatory for specific entry types; and optional categories, whose values are populated where possible by extraction from the original source, by application of automated heuristics, or by post-hoc data analysis. In all cases, values may be discrete or continuous (or free), and of numerical, textual or boolean datatype.
The metadata categories employed in the proposed schema fall into seven broad types, which together provide a comprehensive description of each data unit. Administrative metadata capture information related to the management and curation of the resource, such as identifiers, processing history and access to the data entries. Editorial metadata record the provenance of the entry, including its source, date of acquisition, licensing terms, version. Structural metadata describe the internal organisation of the resource and the relationships among its constituent parts. Descriptive metadata characterise its content through titles, summaries, keywords and other content-oriented indicators. Classificational metadata assign the resource to categories defined by taxonomies or ontologies, such as genre, style and domain. Analytical metadata report the results of linguistic, acoustic, visual or other automated analyses applied to the resource. Statistical metadata summarise quantitative properties of the entry, such as size, length, frequency counts or distributional measures derived from its content.
Mandatory descriptors are restricted to those for which a value can in principle be determined for any entry, regardless of modality, provenance or processing history, and they typically cover the administrative, editorial, and structural types of metadata, which together establish the identity, origin and basic organisation of the entry and without which the entry cannot be located, lawfully reused or meaningfully compared with others.
Data-specific mandatory categories are those that are required only for particular kinds of data, for example a particular modality or domain.
Optional descriptors, by contrast, encompass the descriptive, classificational, analytical and statistical types of metadata, whose population depends on the presence of particular data in the original source or on analytical procedures and tools that can be applied to enrich the data, or the needs of the specific application context. Their absence does not compromise the basic identifiability of the entry or its intended use in standard scenarios.
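The division into mandatory, data-specific mandatory and optional descriptors can be illustrated with a minimal validation sketch. The assignment of category names to classes below is an assumption made purely for illustration (DurationSeconds in particular is a hypothetical category); it is not the assignment used in the schema itself.

```python
from dataclasses import dataclass, field

# Hypothetical assignment of categories to classes, for illustration only.
MANDATORY = {"Identifier", "Source", "Licence"}
DATA_SPECIFIC = {"text": {"NumberWords"}, "audio": {"DurationSeconds"}}


@dataclass
class Entry:
    modality: str
    metadata: dict = field(default_factory=dict)


def missing_categories(entry: Entry) -> set:
    """Return the required categories that have no value for this entry."""
    required = MANDATORY | DATA_SPECIFIC.get(entry.modality, set())
    return {c for c in required if entry.metadata.get(c) is None}


complete = Entry("text", {"Identifier": "doc-1", "Source": "example",
                          "Licence": "CC BY 4.0", "NumberWords": 120,
                          "Keywords": ["law"]})  # Keywords is optional
partial = Entry("audio", {"Identifier": "rec-1", "Source": "example",
                          "Licence": "CC BY 4.0"})

print(missing_categories(complete))  # set()
print(missing_categories(partial))   # {'DurationSeconds'}
```

Optional categories never appear in the result: their absence leaves the entry valid, which mirrors the statement that it does not compromise basic identifiability.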
Each metadata record describes formal or content-related characteristics of a data unit (either a file or a subpart of a file), that forms an entry of the dataset. The schema is designed to serve two complementary purposes:
  • To ensure the findability, accessibility, interoperability and reusability of the units of the dataset, in accordance with the FAIR principles for scientific data management [32], which have become the reference framework for the publication and reuse of research data and which the schema operationalises at the level of individual data entries; and
  • To enable the effective extraction of specialised subdatasets tailored to specific research and application tasks, including the training and fine-tuning of large language models and the development of various AI applications.
In the proposed model, the metadata is represented as a graph structure, in which the nodes correspond to the values of the metadata categories and the edges encode directed relations between them. These relations reflect real-world dependencies among the descriptors and organise them into both hierarchical structures, such as taxonomies of genre or domain, and non-hierarchical structures, such as cross-references between resources, provenance chains and associations with external semantic resources.

4.2. Organisation and Management of Metadata in a Graph Database

The metadata is organised and managed in a graph database, such as Neo4j, with multiple node types whose relations reflect the actual dependencies among the metadata categories. The resulting system of nodes and edges provides a flexible representation that supports both the extraction of subsets from the overall collection and the execution of secondary tasks of data analysis and statistical overview, including the traversal of category hierarchies (for example, the system of domains), the computation of distributions over time periods or domains, and the comparative analysis of how different categories vary across the collection.
The adoption of a graph database for the management of metadata associated with large data collections offers several advantages over relational or document-oriented alternatives. Relationships among metadata categories are represented as first-class objects, which allows both hierarchical (taxonomic) and non-hierarchical (associative or descriptive) structures to be modelled natively rather than reconstructed through joins or nested elements. The schema can be enriched incrementally with new categories and new types of relations without migration of existing data, which is particularly valuable in the context of evolving research and application requirements and in the context of large volumes of heterogeneous data.
Furthermore, the graph-based representation is compatible with the established semantic-web standards, including the Resource Description Framework (RDF), the Simple Knowledge Organisation System (SKOS) and the Web Ontology Language (OWL), and supports integration with external ontologies, terminological databases and other semantic resources. This compatibility makes it possible to enrich the descriptive power of the proposed model, enhances interoperability across infrastructures and facilitates compliance with the FAIR principles [32] at both the representational and the infrastructural level.

4.3. Efficient Information Retrieval

The graph database supports declarative querying through the Cypher language [33], which allows complex retrieval tasks to be expressed through intuitive patterns that mirror the structure of the graph itself, so that nodes, edges and paths in a query appear in the same visual arrangement as in the underlying data. This makes the language particularly accessible to domain experts who are not necessarily trained in database programming, while remaining sufficiently expressive to support sophisticated retrieval logic.
Within this framework, the model supports the extraction of subdatasets according to combinations of structural, descriptive and analytical criteria, the computation of comparative and distributional statistics over the metadata, and the application of specialised graph analysis algorithms, including community detection, centrality measures, shortest-path traversal and similarity-based clustering. Such operations are natively supported by graph-database engines, often through dedicated algorithmic libraries, and can be composed with retrieval queries to express analytical pipelines that would otherwise require external processing.
A further advantage concerns query performance. Traversal of an edge in a property-graph database is an operation of approximately constant complexity with respect to the size of the database, since adjacent nodes are reached through direct pointers rather than through index lookups over the global relation. Equivalent operations in a relational database, by contrast, require recursive join operations whose cost grows rapidly with the depth of the traversal, since each level adds a further join over potentially large tables. This asymmetry becomes increasingly pronounced as the depth of the queries and the size of the underlying metadata repository increase, which is the typical setting in which the proposed model is intended to operate when handling large datasets and expressive metadata schemata.
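The locality argument can be made concrete with a toy comparison. The sketch below counts how many edge records are inspected when following a three-hop path, once through per-node adjacency lists and once by scanning a flat edge table at every hop. The table scan stands in for an unindexed join and is a deliberately pessimistic assumption: a relational engine with suitable indexes would inspect far fewer rows, so the numbers only illustrate the principle and are not a benchmark.

```python
from collections import defaultdict

# Toy edge list: a chain 0 -> 1 -> ... -> 5 plus 1000 unrelated edges.
edges = [(i, i + 1) for i in range(5)] + [(100 + i, 200 + i) for i in range(1000)]

# Graph-style storage: each node holds direct references to its neighbours.
adjacency = defaultdict(list)
for src, dst in edges:
    adjacency[src].append(dst)


def reach_by_adjacency(start, depth):
    """Follow edges hop by hop; the work depends on local out-degree only."""
    frontier, touched = {start}, 0
    for _ in range(depth):
        nxt = set()
        for node in frontier:
            neighbours = adjacency.get(node, [])
            touched += len(neighbours)
            nxt.update(neighbours)
        frontier = nxt
    return frontier, touched


def reach_by_table_scan(start, depth):
    """Emulate one unindexed self-join per hop over the whole edge table."""
    frontier, touched = {start}, 0
    for _ in range(depth):
        touched += len(edges)
        frontier = {dst for src, dst in edges if src in frontier}
    return frontier, touched


print(reach_by_adjacency(0, 3))   # ({3}, 3)
print(reach_by_table_scan(0, 3))  # ({3}, 3015)
```

Both functions reach the same node; only the adjacency version is insensitive to the 1000 unrelated edges.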

5. Prototype for Proof-of-Concept

To demonstrate the feasibility and practical utility of the proposed model, a prototype has been implemented for the management of metadata associated with a large dataset of textual data in Bulgarian intended for the training and fine-tuning of large language models. The prototype develops an extensive metadata schema for the description of textual data with the possibility to extend the schema to other data types, modalities, etc.
The prototype instantiates the metadata schema in a Neo4j graph database and populates it with metadata from a large collection of varied textual resources whose original metadata descriptions differ in format and coverage.
The resulting repository is accessible through a Cypher-based web interface that supports the extraction of domain-specific and application-oriented subdatasets according to combinations of administrative, structural, descriptive, classificational, analytical and statistical criteria.
The implementation confirms that the proposed approach scales to realistic volumes of metadata, supports the expressive retrieval queries required in practice, and integrates with external semantic resources (e.g., ontology of domains) without modification of the underlying schema, thereby validating the principal design advantages of the model.

5.1. Prototype Metadata Schema

In the proposed prototype for describing language data, the metadata is organised into 15 mandatory and 8 optional categories presented in Table 1.
The publication dates span the period from 1935 to 2022 (over 90% of the data is from after 1990). The dataset comprises text documents totalling approximately 718.4 million words.
The classification of documents according to domain is built on a hierarchical taxonomy comprising 45 main domains and more than 150 subdomains, designed to balance breadth of coverage with sufficient granularity for thematically targeted dataset extraction. The 45 main domains span the principal areas of contemporary written production, including news domains, administrative domains related to law and public administration, and knowledge domains; examples are economics, politics, information technology, medicine and health, culture and the arts, education, sport, religion and lifestyle. Up to six main domains can be selected for a single text, and their combination gives a wider range of descriptors for the classification of multidisciplinary texts.
The subdomains refine this top-level partition into more specific thematic areas, which makes it possible to assemble narrowly focused training and evaluation subsets without losing the ability to aggregate over the broader category. The hierarchical relationships are represented natively in the graph structure with each subdomain connected to its parent domain through a directed is-a edge.
Queries can traverse the hierarchy upwards (to broaden a selection) or downwards (to narrow it) without requiring the explicit enumeration of all descendants. The taxonomy is, moreover, designed to be open at the leaf level: new subdomains can be attached to existing main domains as additional thematic areas are encountered in the data, without modification of the higher-level structure or migration of existing metadata.
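Upward and downward traversal of the domain hierarchy can be sketched with a small in-memory structure. The subdomain names below are hypothetical examples rather than entries from the actual taxonomy, and the child-to-parent pairs stand in for the directed edges between Domain nodes.

```python
from collections import defaultdict

# Hypothetical fragment of the taxonomy as (child, parent) pairs.
SUBCATEGORY_OF = [
    ("Cardiology", "Medicine and health"),
    ("Pharmacology", "Medicine and health"),
    ("Banking", "Economics"),
]

parent_of = dict(SUBCATEGORY_OF)
children_of = defaultdict(set)
for child, parent in SUBCATEGORY_OF:
    children_of[parent].add(child)


def descendants(domain):
    """Downward traversal: the domain itself plus everything below it."""
    found, stack = set(), [domain]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children_of.get(current, ()))
    return found


def ancestors(domain):
    """Upward traversal: walk from a subdomain towards its main domain."""
    chain = []
    while domain in parent_of:
        domain = parent_of[domain]
        chain.append(domain)
    return chain


# A new subdomain is attached at the leaf level; nothing above it changes.
parent_of["Epidemiology"] = "Medicine and health"
children_of["Medicine and health"].add("Epidemiology")

print(sorted(descendants("Medicine and health")))
print(ancestors("Banking"))  # ['Economics']
```

In a graph database the same effect is usually obtained with a variable-length path pattern, so the caller names only the starting domain and never enumerates its descendants.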
Other relations both hierarchical and non-hierarchical are also defined. More details can be seen below in Section 5.2.
The two analytical categories PersonallyIdentifiableInformation and BiasedInformation are represented as per-sentence vectors rather than as document-level summary statistics, which preserves the internal distribution of the property across the document and allows downstream applications to filter or weight sentences individually.
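A minimal sketch of how per-sentence vectors might be used downstream is given below. It assumes binary flags (1 meaning the property was detected in that sentence); the source does not specify the encoding of the vectors, and the sentences are invented placeholders.

```python
# Hypothetical document with per-sentence flags (1 = property detected).
sentences = ["Sentence A.", "Sentence B with a phone number.", "Sentence C."]
pii_vector = [0, 1, 0]
bias_vector = [0, 0, 1]


def keep_clean(sentences, *vectors):
    """Keep only the sentences that no vector flags."""
    return [s for s, flags in zip(sentences, zip(*vectors)) if not any(flags)]


def flagged_share(vector):
    """Recover a document-level summary from the vector when one is needed."""
    return sum(vector) / len(vector) if vector else 0.0


print(keep_clean(sentences, pii_vector, bias_vector))  # ['Sentence A.']
print(keep_clean(sentences, pii_vector))               # ['Sentence A.', 'Sentence C.']
```

A document-level statistic can always be derived from the vector, as flagged_share shows, whereas the reverse is not possible; this is the practical reason for storing the finer-grained form.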

5.2. Prototype Graph Database

In the proposed prototype, the metadata schema described above is realised in the graph database through a small set of node types characterised by the relevant subset of the schema’s properties:
  • Document – the central node type, instantiated once per textual entry. It carries the document-specific properties of the schema (including Identifier, DocumentTitle, PublicationDate, URL, etc.), as well as some of the optional descriptors.
  • Domain – a node type representing a subject area drawn from the controlled taxonomy of 45 main domains and more than 150 subdomains, with the hierarchical structure of the taxonomy captured by the connections among the Domain nodes themselves.
  • Style, Type, Medium – representing, respectively, the stylistic register of the document (such as legal, journalistic, administrative or scientific), its genre (such as book, journal article, administrative document or news report), and medium (written, spoken), all drawn from controlled vocabularies and instantiated once per distinct value so that documents sharing the same register, genre or medium are attached to the same node.
  • Author – a node type storing the name of the author and, where available, optional biographical descriptors such as period of activity, principal language and primary professional role.
  • Source – a node type representing the publishing organisation, media outlet, web platform or other institutional originator from which the document was acquired, optionally classified by source type (newspaper, journal, repository, broadcaster, and so on).
  • Licence – a node type representing the licensing regime under which the document is distributed, identified by name and by a classification of permitted use (open, attribution-only, non-commercial, restricted, and so on), with an optional URL pointing to the full text of the licence; in addition, each licence is automatically mapped to one of two broader categories, free or restricted, defined as a node LicenceCategory, which supports rapid filtering of the collection at retrieval time according to the licensing requirements of the downstream application.
  • Statistics – a node type aggregating the quantitative descriptors of the document, namely NumberWords, NumberSentences, NumberParagraphs and NumberTokens, instantiated once per document and represented as a distinct node so that range queries over size or length can be expressed as traversals to a dedicated node rather than as property filters on the central Document node.
  • Date – a node type representing a calendar reference, stored in ISO 8601 form and instantiated once per distinct value in the collection. The representation supports two levels of temporal granularity: a full date (yyyy-mm-dd) and a year-only value (yyyy). The two forms share a single node type, so that documents attached to a year-only Date node and documents attached to a fully specified Date node within that year can be retrieved together through a single traversal pattern.
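The two granularities of the Date node can be handled uniformly because both forms begin with the four-digit year. The sketch below models Date nodes as dictionary keys and attached documents as lists; the values and document identifiers are hypothetical.

```python
# Hypothetical Date node values and the documents attached to them.
published_on = {
    "1998": ["doc-1"],          # year-only Date node
    "1998-03-14": ["doc-2"],    # fully specified Date node
    "2004-11-02": ["doc-3"],
}


def year_of(date_value):
    """Both 'yyyy' and 'yyyy-mm-dd' start with the four-digit year."""
    return int(date_value[:4])


def documents_in_years(start, end):
    """Collect documents whose Date node falls within [start, end], inclusive."""
    hits = []
    for value in sorted(published_on):
        if start <= year_of(value) <= end:
            hits.extend(published_on[value])
    return hits


print(documents_in_years(1998, 1998))  # ['doc-1', 'doc-2']
```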
The relationships between the node types are represented as directed arcs, each capturing a single, semantically distinct connection. The full schema comprises the distinct relation types shown in Table 2.
The graph database of the current prototype contains a total of 237,795 documents, represented as Document nodes and described with metadata in accordance with the schema introduced above. Authors, subject areas, licences, sources, styles, types, keywords and task categories are modelled as separate node types, so that recurrent values are stored once and shared across documents.
New documents are incorporated into the data collection by creating the relevant nodes and establishing directed edges to existing or newly created nodes for domain, author, source, licence, etc.
The graph database supports the efficient introduction of new nodes or new types of relations without altering the existing schema or migrating already entered data. For example, if new subject areas need to be added to the hierarchy, the new elements are integrated via the SUBCATEGORY_OF relation, which represents the subject area as a subcategory of another subject area. In this way, a hierarchical classification by subject area is maintained, and the collection can include new subject areas and subsets without restructuring the database.
When introducing new types of relations, for example when linking keywords to subject areas (the CHARACTERISTIC_OF relation between nodes of the Keyword and Domain types), the new relation is defined without altering existing data, and additional automated methods can be used to establish and supplement new relations between existing nodes in the database.
If the dataset is expanded with multilingual data, a new type of node can be introduced, e.g. Language, and a new relation IS_WRITTEN_IN from a Document to a Language can be defined. With the introduction of the Language node, a new relation IS_TRANSLATED_FROM for translated documents will also be possible.
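The incremental extension described above can be sketched with a small in-memory property graph. The MetadataGraph class, the HAS_DOMAIN relation name and the document identifiers are hypothetical, and IS_TRANSLATED_FROM is assumed here to link a translated document to its source document, which the text does not state explicitly.

```python
class MetadataGraph:
    """Minimal in-memory property graph with typed, directed edges."""

    def __init__(self):
        self.nodes = {}     # (label, key) -> properties
        self.edges = set()  # (source, relation type, target)

    def merge_node(self, label, key, **props):
        """Create the node if absent, otherwise reuse it (get-or-create)."""
        self.nodes.setdefault((label, key), {}).update(props)
        return (label, key)

    def relate(self, src, rel_type, dst):
        self.edges.add((src, rel_type, dst))

    def targets(self, src, rel_type):
        return {d for s, r, d in self.edges if s == src and r == rel_type}


g = MetadataGraph()
doc = g.merge_node("Document", "doc-1", DocumentTitle="Example")
g.relate(doc, "HAS_DOMAIN", g.merge_node("Domain", "Economics"))  # hypothetical name
edges_before = set(g.edges)

# Later extension: a new node type and two new relation types.
g.relate(doc, "IS_WRITTEN_IN", g.merge_node("Language", "bg"))
translation = g.merge_node("Document", "doc-2", DocumentTitle="Example (translation)")
g.relate(translation, "IS_WRITTEN_IN", g.merge_node("Language", "en"))
g.relate(translation, "IS_TRANSLATED_FROM", doc)

print(edges_before <= g.edges)  # True: existing edges are untouched
```

The extension only adds nodes and edges; nothing that was stored before it is rewritten, which is the property the text refers to as avoiding migration.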

5.3. Prototype Web Interface

The proposed prototype includes a web interface that allows users to access the graph database to view the full collection of documents and to search for and retrieve subsets of data based on the metadata system.
Four filtering mechanisms are available.
Licence filter.
Users can restrict results by licence type, selecting either general or specific licences, for example all Creative Commons licences or specific licences, as well as categories such as open licences or restrictive licences.
Subject area filter.
Documents can be filtered by selecting one or more subject areas. The graph database also allows for the addition of more complex search and filtering functions via logical operations (and, or, not), which are not implemented in the initial prototype.
Filter by time period.
Users can specify a range of publication years by entering a start and/or end year to limit the results to a specific period. If one of the limits is omitted, the corresponding default is applied: the earliest or the latest year in the database. More complex queries (two or more periods) and more precise time ranges (by month or specific date) could be implemented for cases where the text items carry more precise metadata.
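The default handling for omitted limits can be sketched as follows. The function name `resolve_year_range` and the database bounds 1990 and 2025 are assumptions made for illustration; the source does not state the actual bounds or how the prototype validates the input.

```python
def resolve_year_range(start, end, earliest, latest):
    """Replace an omitted limit (None) with the earliest/latest year available."""
    lower = earliest if start is None else start
    upper = latest if end is None else end
    if lower > upper:
        raise ValueError(f"start year {lower} is after end year {upper}")
    return lower, upper


# Hypothetical bounds of the collection, used only for this example.
EARLIEST_YEAR, LATEST_YEAR = 1990, 2025

only_end = resolve_year_range(None, 2010, EARLIEST_YEAR, LATEST_YEAR)
only_start = resolve_year_range(2015, None, EARLIEST_YEAR, LATEST_YEAR)
```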
Keyword filter.
Free-text search is supported for multiple keywords, entered as a comma-separated list. In the prototype, keyword search covers the keyword field in the metadata as well as the title, domains, subdomains and other relevant metadata categories.
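A minimal sketch of this behaviour is given below. It assumes that a document matches when any of the entered keywords occurs in one of the searched fields; the source does not say whether the prototype combines keywords with "any" or "all" semantics, and the helper names and the sample document are hypothetical. The field names follow Table 1.

```python
SEARCHED_FIELDS = ("Keywords", "DocumentTitle", "Domain", "Subdomain")


def parse_keywords(raw):
    """Split a comma-separated string into trimmed, lower-cased keywords."""
    return [part.strip().lower() for part in raw.split(",") if part.strip()]


def searchable_text(document):
    """Concatenate the searched metadata fields into one lower-cased string."""
    values = []
    for field in SEARCHED_FIELDS:
        value = document.get(field, [])
        values.extend([value] if isinstance(value, str) else value)
    return " ".join(values).lower()


def matches_any(document, keywords):
    """True if at least one keyword occurs in the searched fields (assumed semantics)."""
    text = searchable_text(document)
    return any(keyword in text for keyword in keywords)


# Hypothetical document record.
doc = {
    "DocumentTitle": "School reform act",
    "Domain": ["Education"],
    "Keywords": ["curriculum", "teachers"],
}
keywords = parse_keywords(" education , , Budget ")
hit = matches_any(doc, keywords)
```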
The search results are displayed as a paginated list of documents, with each text item showing the document title, subject area tags, licence, document type, publication date, source, link to the original source and quantitative characteristics (number of paragraphs, sentences and words).
The interface offers three options for downloading the results:
  • a full description of the metadata for the selected documents in JSON format;
  • a list of links to the original sources in TSV format;
  • full text data as a ZIP archive, upon confirmation of the details and agreement to the terms of use, including the restrictions imposed by the individual licences of the original documents.
The search web interface and the results from the search are shown in Figure 1.
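The first two download options can be sketched with the Python standard library. This is an illustration under stated assumptions, not the prototype's export code: the function names, the TSV column layout (Identifier, URL) and the sample record are hypothetical, while the field names are taken from Table 1.

```python
import csv
import io
import json


def export_metadata_json(documents):
    """Serialise the full metadata records; keep Cyrillic text readable."""
    return json.dumps(documents, ensure_ascii=False, indent=2)


def export_links_tsv(documents):
    """Write one tab-separated row per document: identifier and source URL."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["Identifier", "URL"])  # assumed column layout
    for doc in documents:
        writer.writerow([doc["Identifier"], doc["URL"]])
    return buffer.getvalue()


# Hypothetical search result with a single document.
selected = [
    {"Identifier": "bg-0001", "DocumentTitle": "Пример", "URL": "https://example.org/a"}
]
json_payload = export_metadata_json(selected)
tsv_payload = export_links_tsv(selected)
```

The ZIP download is omitted here because it depends on the confirmation and terms-of-use step, which the source describes only at the level of the interface.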

5.4. A Simple Evaluation of the Graph Database

To assess the feasibility of the prototype and the practical performance of the graph-based representation, a small set of representative Cypher queries was designed and executed against the populated database. The queries span three selectivity cases and are intended to give an indicative rather than exhaustive overview of the model’s behaviour.
Test case 1: Limited selectivity – single-criterion retrieval on a sparse domain, testing the best-case cost of a domain-filtering traversal
The query locates a single Domain node through an index seek and traverses the 156 incoming BELONGS_TO edges, completing in 9 ms with 470 database accesses, or about three accesses per traversed edge. The engine behaves as expected for a graph database, since most of the work contributes directly to the answer. The execution confirms that small, well-indexed retrievals are effectively instantaneous at the current scale.
Test case 2: Medium selectivity – two-criterion retrieval testing how the system handles combined queries, which are the typical user scenario
The query reaches both anchoring nodes through index seeks, expands one set of 4,817 edges and intersects it with the second criterion to return 3,409 documents in 105 ms with 40,539 database accesses. The latency is about an order of magnitude higher than in Test Case 1 but remains well within the range expected for interactive use, and the cost is proportional to the size of the intermediate set rather than to the total size of the collection. Multi-criterion graph retrieval therefore scales with the size of the sets being intersected, not with the size of the database.
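The relative costs of the first two test cases can be checked with a few lines of arithmetic on the figures reported above. The sketch assumes that Test Case 1 returns one document per traversed edge (156 results); the dictionary layout and variable names are illustrative.

```python
# Figures as reported for Test Cases 1 and 2.
REPORTED = {
    "test_case_1": {"results": 156, "latency_ms": 9, "db_accesses": 470},
    "test_case_2": {"results": 3409, "latency_ms": 105, "db_accesses": 40539},
}

# Database accesses needed per returned document.
accesses_per_result = {
    name: round(case["db_accesses"] / case["results"], 1)
    for name, case in REPORTED.items()
}

# How much more expensive the two-criterion query is than the single-criterion one.
latency_ratio = round(
    REPORTED["test_case_2"]["latency_ms"] / REPORTED["test_case_1"]["latency_ms"], 1
)
access_ratio = round(
    REPORTED["test_case_2"]["db_accesses"] / REPORTED["test_case_1"]["db_accesses"], 1
)
```

Under these assumptions the two-criterion query needs roughly four times as many accesses per returned document, which is consistent with the extra work of expanding 4,817 edges before intersecting.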
Test case 3: High selectivity – range filter on a non-indexed property testing the cost of property-scan over the entire dataset
The date-range query demonstrates the expected penalty of filtering on the unindexed property PublicationDate: the engine falls back to a full NodeByLabelScan over all 237,795 documents and completes the range filter in 234 ms with 452,363 database accesses. The query remains fast enough for interactive use, but the cost is now dominated by the linear scan, which leaves substantial room for optimisation. Creating an index on PublicationDate would reduce both the database accesses and the latency, especially for selective ranges.
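A possible form of the suggested index and of a date-range query is sketched below as Cypher text held in Python strings. Several points are assumptions: the statements use Neo4j-style Cypher syntax, PublicationDate is taken to be a date-typed property on Document nodes, and the index name, the helper `year_range_params` and the returned column are hypothetical. The query used in the evaluation is not reproduced in the source, so this is not that query.

```python
# Range index on the property used by the date filter (Neo4j-style syntax, assumed).
CREATE_DATE_INDEX = (
    "CREATE INDEX document_publication_date IF NOT EXISTS "
    "FOR (d:Document) ON (d.PublicationDate)"
)

# Parameterised range filter; date() converts ISO strings to Cypher dates.
DATE_RANGE_QUERY = (
    "MATCH (d:Document) "
    "WHERE d.PublicationDate >= date($start) AND d.PublicationDate <= date($end) "
    "RETURN d.Identifier AS identifier"
)


def year_range_params(start_year, end_year):
    """Turn a year range into inclusive ISO date bounds for the query parameters."""
    return {"start": f"{start_year:04d}-01-01", "end": f"{end_year:04d}-12-31"}


params = year_range_params(2010, 2015)
```

With such an index in place, the planner can in principle replace the label scan with an index range seek, so that the cost follows the number of documents in the requested period rather than the size of the collection.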
Figure 2 shows an example of the typical user scenario, a two-criterion search, modelled as a graph traversal. The documents in the Education domain published under the CC-BY-SA licence are identified by traversing the graph along the BELONGS_TO edges and the LICENSED_WITH edges.
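The traversal described here could be expressed along the following lines. The Cypher text is a sketch rather than the query used in the evaluation: the node labels and relation names follow Table 2, whereas the `name` property on Domain and Licence nodes, the returned columns and the builder function are assumptions.

```python
# Two anchored patterns sharing the same Document variable `d`:
# a document must satisfy both the domain and the licence criterion.
TWO_CRITERION_QUERY = """
MATCH (d:Document)-[:BELONGS_TO]->(:Domain {name: $domain}),
      (d)-[:LICENSED_WITH]->(:Licence {name: $licence})
RETURN d.Identifier AS identifier, d.DocumentTitle AS title
""".strip()


def build_two_criterion_query(domain, licence):
    """Return the query text together with its parameter map."""
    return TWO_CRITERION_QUERY, {"domain": domain, "licence": licence}


query, parameters = build_two_criterion_query("Education", "CC-BY-SA")
```

Passing the criteria as parameters rather than interpolating them into the query text keeps user input out of the query string, which matters when the values come from a web form.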

5.5. Advantages of the Model

The principal advantages of the model, as demonstrated by the prototype, can be summarised as follows:
  1. Expressive representation of relationships. The model captures the rich relationships between the descriptive characteristics of textual documents – subject area, licence, source, author, keywords and others – making the resulting metadata more expressive for describing and researching documents across different text resources.
  2. Hierarchical and non-hierarchical structures. The graph database allows for a natural representation of the complex hierarchical and non-hierarchical relationships between the metadata categories of textual documents.
  3. Efficient retrieval through graph traversals. Complex queries are processed as graph traversals. For index-anchored queries, the cost depends on the size of the traversed neighbourhood rather than on the total volume of data, which supports efficient extraction of subsets and comparative analyses.
  4. Extensibility of the schema. New node types and new relation types – for example, a relation linking Keyword and Domain nodes – can be introduced without modification of the overall description schema or migration of existing data.
  5. Orientation towards LLM and AI applications. The schema includes metadata categories specific to large language models – ratings for personally identifiable information, ratings for biased content, licence link and licence type, and information on applicability to various categories of automated tasks – which makes it suitable both for the construction of training and fine-tuning datasets and for other applications in the field of artificial intelligence.

6. Conclusions

The proposed model for the representation, description and management of large-scale metadata through a graph database, together with its mechanisms for the extraction of specialised subdatasets according to combinations of structural, descriptive and analytical criteria, is suitable for industrial application in the development of large language models, artificial-intelligence systems and other data-intensive applications, as well as for a wide range of research uses.
The principal application of the model is the compilation of specialised datasets tailored to specific tasks. The prototype presented in this work demonstrates this in the textual modality by exploiting the graph-based representation of the metadata of a large number of documents and supporting the construction of subsets based on a variety of criteria. The system is suitable for the compilation of datasets for the pre-training and fine-tuning of large language models, for retrieval-augmented generation (RAG) and for other NLP applications in Bulgarian.
The model has been developed in accordance with the FAIR principles for the management of research data [32], ensuring that the units of the underlying collection are findable, accessible, interoperable and reusable both at the level of the individual entry and at the level of the collection as a whole.
While the prototype reported here is demonstrated for textual data, the schema and the graph-database representation are modality-independent by design, and the extension of the prototype to audio, image, video and multimodal collections is a natural direction for further development.

Acknowledgments

The work presented here is funded as part of the project Infrastructure for Fine-tuning Pre-trained Large Language Models, Grant Agreement No. PVU – 55 from 12.12.2024 /BG-RRP-2.017-0030-C01/.

References

  1. Abadji, J.; Ortiz Suárez, P.; Romary, L.; Sagot, B. Towards a Cleaner Document-Oriented Multilingual Crawled Corpus. In Proc. 13th Language Resources and Evaluation Conf. (LREC), 2022; pp. 4344–4355. Available online: https://oscar-project.org/.
  2. Nguyen, T.; Van Nguyen, C.; Lai, V.D.; Man, H.; Ngo, N.T.; Dernoncourt, F.; Rossi, R.A.; Nguyen, T.H. CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages. In Proc. Joint Int. Conf. Computational Linguistics, Language Resources and Evaluation (LREC-COLING), 2024; pp. 4226–4237. Available online: https://huggingface.co/datasets/uonlp/CulturaX.
  3. Oepen, S.; et al. HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models. arXiv 2025. Available online: https://hplt-project.org/datasets/v3.0.
  4. Erjavec, T.; Ogrodniczuk, M.; Osenova, P.; Ljubešić, N.; et al. The ParlaMint Corpora of Parliamentary Proceedings. Lang. Resour. Eval. 2023, 57, 415–448. Available online: https://www.clarin.eu/parlamint.
  5. TEI Consortium. TEI P5: Guidelines for Electronic Text Encoding and Interchange, version 4.9.0; TEI Consortium, last updated 2024.
  6. Váradi, T.; Koeva, S.; Yamalov, M.; Tadić, M.; Sass, B.; Nitoń, B.; Ogrodniczuk, M.; Pęzik, P.; Barbu Mititelu, V.; Ion, R.; et al. The MARCELL Legislative Corpus. In Proc. 12th Language Resources and Evaluation Conf. (LREC), 2020; pp. 3761–3768.
  7. European Union. IATE: Interactive Terminology for Europe. Interinstitutional terminology database of the European Union, 2024; managed by the Translation Centre for the Bodies of the European Union (CdT) on behalf of the EU institutions.
  8. Publications Office of the European Union. EuroVoc: The EU’s Multilingual and Multidisciplinary Thesaurus. EU Vocab. Portal, 2024. Available in 24 official EU languages plus Albanian, Macedonian and Serbian.
  9. Váradi, T.; Nyéki, B.; Koeva, S.; Tadić, M.; Štefanec, V.; Ogrodniczuk, M.; Nitoń, B.; Pęzik, P.; Barbu Mititelu, V.; Irimia, E.; et al. Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources. In Proc. 13th Language Resources and Evaluation Conf. (LREC), 2022; pp. 100–108.
  10. Gemmeke, J.F.; Ellis, D.P.W.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An Ontology and Human-Labeled Dataset for Audio Events. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2017; pp. 776–780. Available online: https://research.google.com/audioset/.
  11. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. LibriSpeech: An ASR Corpus Based on Public Domain Audio Books. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2015; pp. 5206–5210.
  12. Bertin-Mahieux, T.; Ellis, D.P.W.; Whitman, B.; Lamere, P. The Million Song Dataset. In Proc. 12th Int. Soc. for Music Information Retrieval Conf. (ISMIR), 2011; pp. 591–596. Available online: http://millionsongdataset.com/.
  13. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proc. European Conf. Computer Vision (ECCV), 2014; pp. 740–755.
  14. Schuhmann, C.; Beaumont, R.; Vencu, R.; Gordon, C.; Wightman, R.; Cherti, M.; Coombes, T.; Katta, A.; Mullis, C.; Wortsman, M.; et al. LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image–Text Models. In Advances in Neural Information Processing Systems (NeurIPS), 2022. Available online: https://laion.ai/blog/laion-5b/.
  15. Abu-El-Haija, S.; Kothari, N.; Lee, J.; Natsev, P.; Toderici, G.; Varadarajan, B.; Vijayanarasimhan, S. YouTube-8M: A Large-Scale Video Classification Benchmark. arXiv 2016. Available online: https://research.google.com/youtube8m/.
  16. Pinoli, P.; Ceri, S.; Martinenghi, D.; Nanni, L. Metadata management for scientific databases. Inf. Syst. 2019, 81, 1–20.
  17. Niazi, S.; Ismail, M.; Grohsschmiedt, S.; Ronström, M.; Haridi, S.; Dowling, J. HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases. In Proceedings of the 15th USENIX Conference on File and Storage Technologies (FAST ’17), Santa Clara, CA, USA, 2017; pp. 89–104.
  18. Yang, Y.; Yu, F.; Zhang, J.; Xiao, B.; Wang, F.; Zhang, M. The Metadata Management Based on MongoDB for EAST Experiment. Fusion Eng. Des. 2023, 195, 113896.
  19. Halevy, A.; Korn, F.; Noy, N.F.; Olston, C.; Polyzotis, N.; Roy, S.; Whang, S.E. Goods: Organizing Google’s Datasets. In Proceedings of the 2016 ACM SIGMOD International Conference on Management of Data, San Francisco, CA, USA, 2016; pp. 795–806.
  20. Sawadogo, P.N.; Darmont, J. On Data Lake Architectures and Metadata Management. J. Intell. Inf. Syst. 2021, 56, 97–120.
  21. Cimiano, P.; Chiarcos, C.; McCrae, J.P.; Gracia, J. Linguistic Linked Data: Representation, Generation and Applications; Springer: Cham, 2020.
  22. Liang, Z. Harmonizing Metadata of Language Resources for Enhanced Querying and Accessibility. arXiv 2025, arXiv:cs.
  23. Khan, A.; Doroshenko, V.; Sahay, R.; Chronaki, C.; Bohlscheid, H. Using Graph Tools on Metadata Repositories. In Building Continents of Knowledge in Oceans of Data: The Future of Co-Created eHealth; Studies in Health Technology and Informatics; IOS Press, 2018; Vol. 247, pp. 761–765.
  24. Abu Ahmad, R.; D’Souza, J.; Zloch, M.; Otto, W.; Rehm, G.; Oelen, A.; Dietze, S.; Auer, S. Toward FAIR Semantic Publishing of Research Dataset Metadata in the Open Research Knowledge Graph. In Joint Proceedings of the Onto4FAIR 2023 Workshops, 2024.
  25. DCMI Usage Board. ANSI/NISO Z39.85-2012; DCMI Metadata Terms. DCMI Recommendation, 2020. Standardised as ISO 15836-1:2017 and ISO 15836-2:2019; supersedes the Dublin Core Metadata Element Set, Version 1.1.
  26. Broeder, D.; Windhouwer, M.; Van Uytvanck, D.; Goosen, T.; Trippel, T. CMDI: A Component Metadata Infrastructure. In Describing LRs with Metadata: Towards Flexibility and Interoperability in the Documentation of LR, Workshop at LREC 2012, Istanbul, Turkey, 2012; pp. 1–4.
  27. Zulueta-Coarasa, T.; Jug, F.; Mathur, A.; Moore, J.; Muñoz Barrutia, A.; Babalola, K.; Bankhead, P.; Gilloteaux, P.; Gogoberidze, N.; Jones, M.; et al. MIFA: Metadata, Incentives, Formats, and Accessibility Guidelines to Improve the Reuse of AI Datasets for Bioimage Analysis. arXiv 2023, arXiv:q.
  28. Akhtar, M.; Benjelloun, O.; Conforti, C.; Foschini, L.; Gijsbers, P.; Giner-Miguelez, J.; Goswami, S.; Jain, N.; Karamousadakis, M.; Krishna, S.; et al. Croissant: A Metadata Format for ML-Ready Datasets. Proc. Adv. Neural Inf. Process. Syst. 2024, 37, 82133–82148.
  29. Ignatowicz, J.; Kutt, K.; Nalepa, G.J. Position Paper: Metadata Enrichment Model: Integrating Neural Networks and Semantic Knowledge Graphs for Cultural Heritage Applications. arXiv 2025, arXiv:cs.
  30. Koeva, S.; Blagoeva, D.; Kolkovska, S. Bulgarian National Corpus Project. In Proc. 7th Int. Conf. Language Resources and Evaluation (LREC), 2010.
  31. Koeva, S.; Stoyanova, I.; Leseva, S.; Dimitrova, T.; Dekova, R.; Tarpomanova, E. The Bulgarian National Corpus: Theory and Practice in Corpus Design. J. Lang. Model. 2012, 0, 65–110.
  32. Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.W.; da Silva Santos, L.B.; Bourne, P.E.; et al. The FAIR Guiding Principles for Scientific Data Management and Stewardship. Sci. Data 2016, 3, 160018.
  33. Francis, N.; Green, A.; Guagliardo, P.; Libkin, L.; Lindaaker, T.; Marsault, V.; Plantikow, S.; Rydberg, M.; Selmer, P.; Taylor, A. Cypher: An Evolving Query Language for Property Graphs. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD ’18), Houston, TX, USA, 2018; pp. 1433–1445.
Figure 1. Search interface of the prototype.
Figure 2. Cypher traversal for the two-criterion query (Test Case 2).
Table 1. Metadata categories implemented in the prototype for textual data (15 mandatory, 8 optional), with the data type of each value.
| Category | Data type | Description |
|---|---|---|
| Mandatory (15) | | |
| Identifier | String (free) | Unique document identifier with the language prefix bg. |
| Licence | String (from list) | Licence name with classification by type (open, restricted, etc.). |
| PublicationDate | Date (yyyy-mm-dd) | Date of publication of the text. |
| DocumentTitle | String (free) | Title of the document. |
| Source | String (free) | Publishing organisation, media outlet or institutional originator. |
| Medium | String (from list) | Modality of the resource (textual, multimodal). |
| URL | String (URL) | Original web address. |
| Domain | Array[String] (from list) | Up to six subject areas from a controlled vocabulary. |
| Keywords | Array[String] (free) | Up to six keywords characterising the content. |
| NumberWords | Integer | Total number of words. |
| NumberSentences | Integer | Total number of sentences. |
| NumberParagraphs | Integer | Total number of paragraphs. |
| NumberTokens | Integer | Total number of tokens. |
| PersonallyIdentifiableInformation | Array[Double] | Per-sentence vector of the proportion of tokens flagged as personally identifiable. |
| BiasedInformation | Array[Double] | Per-sentence vector of the proportion of tokens flagged as potentially biased. |
| Optional (8) | | |
| Author | Array[String] (free) | Name(s) of the author(s). |
| Style | String (from list) | Stylistic register (e.g. legal, journalistic, administrative). |
| Type | String (from list) | Document genre (e.g. book, document, article). |
| Subdomain | Array[String] (from list) | Narrower thematic classification, hierarchically linked to Domain. |
| TranslatedDocument | Boolean | Indicator of original Bulgarian text vs. translation. |
| CollectionDate | Date (yyyy-mm-dd) | Date of acquisition into the collection. |
| LicenseLink | String (URL) | URL of the licence text. |
| TaskCategories | Array[String] (from list) | Anticipated NLP applications, from a predefined list. |
Table 2. Relation types of the prototype’s graph schema, with the direction between connected node types.
| Relation | From → To | Interpretation |
|---|---|---|
| BELONGS_TO | Document → Domain | Subject-area assignment (up to six). |
| SUBCATEGORY_OF | Domain → Domain | Hierarchical link to parent domain. |
| LICENSED_WITH | Document → Licence | Licence assignment of the document. |
| HAS_LICENCE_CATEGORY | Licence → LicenceCategory | High-level class (free / restricted). |
| WRITTEN_BY | Document → Author | Authorship. |
| PUBLISHED_IN | Document → Source | Publishing organisation or platform. |
| PUBLISHED_ON | Document → Date | Date of publication. |
| HAS_STYLE | Document → Style | Stylistic register. |
| HAS_TYPE | Document → Type | Document genre. |
| HAS_MEDIUM | Document → Medium | Medium of the dataset entry. |
| HAS_KEYWORD | Document → Keyword | Content-descriptive term. |
| HAS_SIZE | Document → Statistics | Token-count summary. |
| SUITABLE_FOR | Document → TaskCategory | Suitable NLP application. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.