Submitted:
01 November 2025
Posted:
03 November 2025
Abstract
Keywords:
Introduction
The Semantic Web
RDF and RDFS
Large Language Models
GPT-4o
Example Corpus
Literate Programming with Nuweb

Methodology
PDF to Text Conversion


- Our chunking strategy is simple: once extra whitespace has been stripped from the text, we concatenate every 10 lines into a single chunk.

- To identify the chunk most relevant to a given question, we later perform retrieval based on cosine similarity. For that purpose we generate a text embedding for each chunk using OpenAI’s API.
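The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's own listing: the function names `chunk_text` and `cosine_similarity` are hypothetical, and dropping empty lines and joining lines with a single space are assumptions, since the text only states that whitespace is removed and every 10 lines are concatenated. The embedding call itself is omitted; any vectors of equal length can be compared.

```python
import math
import re


def chunk_text(text: str, lines_per_chunk: int = 10) -> list[str]:
    """Collapse extra whitespace, drop empty lines, and join every N lines."""
    # Assumption: runs of whitespace inside a line collapse to one space.
    lines = [re.sub(r"\s+", " ", line).strip() for line in text.splitlines()]
    # Assumption: lines that are empty after stripping are discarded.
    lines = [line for line in lines if line]
    return [
        " ".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 if either vector is all zeros, to avoid dividing by zero.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A fixed line count keeps the chunker trivial, at the cost of occasionally splitting a sentence or table across two chunks.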

Document Schema

Ask the LLM

Prompt Construction for Graph Construction


- For the present task we implement a retriever based on cosine similarity, which provides a Retrieval Augmented Generation (RAG) capability across our chunked document.
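One possible shape for such a retriever is sketched below. The class name `CosineRetriever`, the helper `normalize`, and the `toy_embed` function are all hypothetical. In particular, `toy_embed` is a bag-of-words stand-in over a made-up vocabulary so that the example runs offline. In the setting described here, the `embed` argument would instead wrap a call to an embedding model.

```python
import math
import re
from collections import Counter
from typing import Callable

Vector = list[float]


def normalize(vec: Vector) -> Vector:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class CosineRetriever:
    """Rank stored chunks by cosine similarity to a query (hypothetical sketch)."""

    def __init__(self, chunks: list[str], embed: Callable[[str], Vector]):
        self.chunks = chunks
        self.embed = embed
        # Unit-length vectors let a plain dot product serve as cosine similarity.
        self.vectors = [normalize(embed(c)) for c in chunks]

    def retrieve(self, query: str, k: int = 3) -> list[tuple[float, str]]:
        """Return the k highest-scoring (score, chunk) pairs, best first."""
        q = normalize(self.embed(query))
        scored = [
            (sum(a * b for a, b in zip(q, v)), chunk)
            for v, chunk in zip(self.vectors, self.chunks)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


# Made-up vocabulary for the offline stand-in embedder below.
VOCAB = ["fee", "management", "custodian", "bitcoin", "auditor"]


def toy_embed(text: str) -> Vector:
    """Stand-in embedder: counts of VOCAB words. Not a real embedding model."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [float(counts[word]) for word in VOCAB]
```

Embedding every chunk once at construction time means each query costs a single embedding call plus one dot product per chunk.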

Discussion and Conclusions
- Defined an RDFS schema for a class of financial documents. The schema could equally have come from a pre-existing source, and OWL could have been used in place of RDFS.
- Chunked an input document.
- Defined a RAG function for the document fragments.
- Queried the RDFS properties and used them to generate LLM prompts.
- Those prompts were used to retrieve the document chunks most likely to contain the answer to the prompt question.
- An LLM was sent the prompts and relevant text chunks to extract the requested information.
- The LLM response was used to complete the Knowledge Graph in accordance with the RDFS.
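The flow summarised in the list above can be sketched end to end in plain Python. Everything named here is an assumption made for illustration: the `ex:` properties, the prompt wording, the `NONE` sentinel, and the `retrieve` and `ask_llm` callables (which stand in for the retriever and the LLM call) do not come from the original listing. Triples are held as simple tuples so that the example has no dependencies.

```python
from typing import Callable

Triple = tuple[str, str, str]

RDF_TYPE = "rdf:type"
RDF_PROPERTY = "rdf:Property"
RDFS_LABEL = "rdfs:label"

# Hypothetical schema fragment; the property names are illustrative only.
SCHEMA: list[Triple] = [
    ("ex:managementFee", RDF_TYPE, RDF_PROPERTY),
    ("ex:managementFee", RDFS_LABEL, "management fee"),
    ("ex:custodian", RDF_TYPE, RDF_PROPERTY),
    ("ex:custodian", RDFS_LABEL, "custodian"),
]


def schema_properties(schema: list[Triple]) -> dict[str, str]:
    """Map each property declared in the schema to its rdfs:label."""
    props = {s for s, p, o in schema if p == RDF_TYPE and o == RDF_PROPERTY}
    return {s: o for s, p, o in schema if p == RDFS_LABEL and s in props}


def build_prompt(label: str, context: str) -> str:
    """Assumed prompt template: ask for a single value grounded in the context."""
    return (
        f"Using only the text below, state the {label} given in the document. "
        "Answer with the value only, or NONE if it is not stated.\n\n"
        f"{context}"
    )


def populate_graph(
    subject: str,
    schema: list[Triple],
    retrieve: Callable[[str], list[str]],
    ask_llm: Callable[[str], str],
) -> list[Triple]:
    """For each schema property: retrieve chunks, ask the LLM, emit a triple."""
    graph: list[Triple] = []
    for prop, label in schema_properties(schema).items():
        question = f"What is the {label}?"
        context = "\n".join(retrieve(question))
        answer = ask_llm(build_prompt(label, context)).strip()
        # Skip properties for which the model reports no answer.
        if answer and answer != "NONE":
            graph.append((subject, prop, answer))
    return graph
```

Because the schema drives the loop, adding a property to the schema adds a question without any change to the extraction code.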
Appendix A. Code


References
- Sean Bechhofer, Frank van Harmelen, Jim Hendler, Ian Horrocks, Deborah McGuinness, Peter Patel-Schneider, and Lynn Andrea Stein. OWL Web Ontology Language Reference. Recommendation, World Wide Web Consortium (W3C), 10 February 2004. See http://www.w3.org/TR/owl-ref/.
- Tim Berners-Lee, James Hendler, and Ora Lassila. The Semantic Web, volume 284. Scientific American, 2001.
- Dan Brickley and R.V. Guha. RDF Vocabulary Description Language 1.0: RDF Schema. W3C Recommendation, World Wide Web Consortium, February 2004.
- Preston Briggs. Nuweb, a simple literate programming tool. cs.rice.edu:/public/preston, Rice University, Houston, TX, USA, 1993.
- Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020.
- Gavin Carothers and Eric Prud’hommeaux. RDF 1.1 Turtle. W3C Recommendation, W3C, February 2014.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, 2019.
- Osprey Funds. OSPREY BITCOIN TRUST: Investment Terms & Private Placement Details. https://ospreyfunds.io/wp-content/uploads/Osprey-Bitcoin-Trust-Fact-Sheet.pdf, 2019. [Online; accessed 26-May-2024].
- Aldo Gangemi, Nicola Guarino, Claudio Masolo, Alessandro Oltramari, and Luc Schneider. Sweetening Ontologies with DOLCE, pages 166–181. Springer, Berlin, Heidelberg, 2002.
- Ali Khalili and Sören Auer. WYSIWYM authoring of structured content based on schema.org. In Xuemin Lin, Yannis Manolopoulos, Divesh Srivastava, and Guangyan Huang, editors, Web Information Systems Engineering – WISE 2013, volume 8181 of Lecture Notes in Computer Science, pages 425–438. Springer Berlin Heidelberg, 2013.
- Ora Lassila and Ralph R. Swick. Resource Description Framework (RDF) Model and Syntax Specification. Technical report, 1999.
- Natalya F. Noy, Nigam H. Shah, Patricia L. Whetzel, Benjamin Dai, Michael Dorf, Nicholas Griffith, Clement Jonquet, Daniel L. Rubin, Margaret-Anne Storey, Christopher G. Chute, and Mark A. Musen. BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Research, 37(suppl_2):W170–W173, May 2009.
- OpenAI. GPT-4 technical report, 2024.
- G. G. Petrova, A. F. Tuzovsky, and Nataliya Valerievna Aksenova. Application of the Financial Industry Business Ontology (FIBO) for development of a financial organization ontology. In Journal of Physics: Conference Series, volume 803, page 012116. IOP Publishing, 2017.
- Roie Schwaber-Cohen. Chunking Strategies for LLM Applications. https://www.pinecone.io/learn/chunking-strategies/, 2023. [Online; accessed 2-June-2024].
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).