Submitted:
02 May 2024
Posted:
02 May 2024
You are already at the latest version
Abstract
Keywords:
1. Introduction
2. Related Work
3. Materials and Methods
3.1. Material: Our Evaluation Dataset
3.2. Method: Three Large Language Models Applied for Research Dimensions Generation
3.2.1. The Three LLMs
3.2.2. Research Dimensions Generation Prompt for the LLMs
3.3. Method: Three Types of Similarity Evaluations between ORKG Properties vs. LLM-Generated Research Dimensions
3.3.1. Semantic Alignment and Deviation Evaluations Using GPT-3.5
3.3.2. ORKG Property and LLM-Generated Research Dimension Mappings by GPT-3.5
3.3.3. Scientific Embeddings-Based Semantic Distance Evaluations
3.3.4. Human Assessment Survey Comparing ORKG Properties with LLM-Generated Research Dimensions
- How many of the properties generated by ChatGPT are relevant to your ORKG structured contribution? (Your answer should be a number)
- Considering the ChatGPT-generated content, would you consider making edits to your original ORKG structured contribution?
-
If the ChatGPT-generated content were available to you as suggestions before you created your structured contribution, would it have been helpful?
- (a)
- If you answered "Yes" to the question above, could you describe how it would have been helpful?
- On a scale of 1 to 5, please rate how well the ChatGPT-generated response aligns with your ORKG structured contribution.
- We plan to release an AI-powered feature to support users in creating their ORKG contributions with automated suggestions. In this context, please share any additional comments or thoughts you have regarding the given ChatGPT-generated structured contribution and its relevance to your ORKG contribution.
4. Results and Discussion
4.1. Semantic Alignment and Deviation Evaluations
4.2. Property and Research Dimension Mappings
4.3. Embeddings-Based Evaluations
4.4. Human Assessment Survey
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
References
- Arab Oghli, O.; D’Souza, J.; Auer, S. Clustering Semantic Predicates in the Open Research Knowledge Graph. International Conference on Asian Digital Libraries. Springer, 2022, pp. 477–484.
- Auer, S.; Oelen, A.; Haris, M.; Stocker, M.; D’Souza, J.; Farfar, K.E.; Vogt, L.; Prinz, M.; Wiens, V.; Jaradeh, M.Y. Improving access to scientific literature with knowledge graphs. Bibliothek Forschung und Praxis 2020, 44, 516–529. [Google Scholar] [CrossRef]
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
- Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; Casas, D.d.l.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar]
- Harnad, S. Language Writ Large: LLMs, ChatGPT, Grounding, Meaning and Understanding. arXiv 2024, arXiv:2402.02243. [Google Scholar]
- Karanikolas, N.; Manga, E.; Samaridi, N.; Tousidou, E.; Vassilakopoulos, M. Large Language Models versus Natural Language Understanding and Generation. Proceedings of the 27th Pan-Hellenic Conference on Progress in Computing and Informatics, 2023, pp. 278–290.
- Ostendorff, M.; Rethmeier, N.; Augenstein, I.; Gipp, B.; Rehm, G. Neighborhood contrastive learning for scientific document representations with citation embeddings. arXiv 2022, arXiv:2202.06671. [Google Scholar]
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. ; others. Improving language understanding by generative pre-training 2018.
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; others. Language models are few-shot learners. Advances in neural information processing systems 2020, 33, 1877–1901. [Google Scholar]
- Cai, H.; Cai, X.; Chang, J.; Li, S.; Yao, L.; Wang, C.; Gao, Z.; Li, Y.; Lin, M.; Yang, S. ; others. SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis. arXiv 2024, arXiv:2403.01976. [Google Scholar]
- Jin, H.; Zhang, Y.; Meng, D.; Wang, J.; Tan, J. A Comprehensive Survey on Process-Oriented Automatic Text Summarization with Exploration of LLM-Based Methods. arXiv 2024, arXiv:2403.02901. [Google Scholar]
- Liang, W.; Zhang, Y.; Cao, H.; Wang, B.; Ding, D.; Yang, X.; Vodrahalli, K.; He, S.; Smith, D.; Yin, Y. ; others. Can large language models provide useful feedback on research papers? A large-scale empirical analysis. arXiv, 2023; arXiv:2310.01783. [Google Scholar]
- Antu, S.A.; Chen, H.; Richards, C.K. Using LLM (Large Language Model) to Improve Efficiency in Literature Review for Undergraduate Research 2023.
- Latif, E.; Fang, L.; Ma, P.; Zhai, X. Knowledge distillation of llm for education. arXiv 2023, arXiv:2312.15842. [Google Scholar]
- Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A pretrained language model for scientific text. arXiv 2019, arXiv:1903.10676. [Google Scholar]
- Cohan, A.; Feldman, S.; Beltagy, I.; Downey, D.; Weld, D.S. Specter: Document-level representation learning using citation-informed transformers. arXiv 2020, arXiv:2004.07180. [Google Scholar]
- Singhal, A.; others. Modern information retrieval: A brief overview. IEEE Data Eng. Bull. 2001, 24, 35–43. [Google Scholar]
- Yasunaga, M.; Kasai, J.; Zhang, R.; Fabbri, A.R.; Li, I.; Friedman, D.; Radev, D.R. Scisummnet: A large annotated corpus and content-impact models for scientific paper summarization with citation networks. Proceedings of the AAAI conference on artificial intelligence, 2019, Vol. 33, pp. 7386–7393.
- Banerjee, D.; Singh, P.; Avadhanam, A.; Srivastava, S. Benchmarking LLM powered chatbots: methods and metrics. arXiv 2023, arXiv:2308.04624. [Google Scholar]
- Verma, V.; Aggarwal, R.K. A comparative analysis of similarity measures akin to the Jaccard index in collaborative recommendations: empirical and theoretical perspective. Social Network Analysis and Mining 2020, 10, 43. [Google Scholar] [CrossRef]
- Ferdous, R. ; others. An efficient k-means algorithm integrated with Jaccard distance measure for document clustering. 2009 first asian himalayas international conference on internet. IEEE, 2009, pp. 1–6.
- O’callaghan, D.; Greene, D.; Carthy, J.; Cunningham, P. An analysis of the coherence of descriptors in topic modeling. Expert Systems with Applications 2015, 42, 5645–5657. [Google Scholar] [CrossRef]
- Kocmi, T.; Federmann, C. Large language models are state-of-the-art evaluators of translation quality. arXiv 2023, arXiv:2302.14520. [Google Scholar]
- Leiter, C.; Opitz, J.; Deutsch, D.; Gao, Y.; Dror, R.; Eger, S. The eval4nlp 2023 shared task on prompting large language models as explainable metrics. arXiv 2023, arXiv:2310.19792. [Google Scholar]
- Introducing ChatGPT. https://openai.com/blog/chatgpt. Accessed: 2024-04-23.
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F. ; others. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
- Thakkar, H.; Manimaran, A. Comprehensive Examination of Instruction-Based Language Models: A Comparative Analysis of Mistral-7B and Llama-2-7B. 2023 International Conference on Emerging Research in Computational Science (ICERCS). IEEE, 2023, pp. 1–6.
- Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard. Accessed: 2024-04-23.
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I.; others. Language models are unsupervised multitask learners. OpenAI blog 2019, 1, 9. [Google Scholar]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. others. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 2022, 35, 24824–24837. [Google Scholar]











| Prompting technique | System prompt | Output example |
|---|---|---|
| Zero-Shot | You will be provided with a research problem, your task is to list dimensions that are relevant to find similar papers for the research problem. Respond only in the format of a python list. | ["Natural Language Processing", "Text Analysis", "Machine Learning", "Deep Learning", "Information Retrieval", "Artificial Intelligence", "Language Models", "Document Summarization"] |
| Few-Shot | You will be provided with a research problem, your task is to list dimensions that are relevant to find similar papers for the research problem. Respond only in the format of a python list. The following are two successfully completed task examples. Research problem: "Transformer models" Research dimensions: [’model’, ’date created’, ’pretraining architecture’, ’pretraining task’, ’training corpus’, ’optimizer’, ’tokenization’, ’number of parameters’, ’license’] Research problem: "Liposomes as drug carriers" Research dimensions: [’Type of nanocarrier’, ’Nanoparticle preparation method’, ’Lipid composition’, ’Drug type’, ’Particle size’] | [’Summarization approach’, ’Document type’, ’Language’, ’Evaluation metric’, ’Model type’, ’Training dataset’, ’Compression ratio’, ’Summary length’] |
| Chain-of-Thought | You will be provided with a research problem, your task is to list dimensions that are relevant to find similar papers for the research problem. Provide justification why each dimension is relevant for finding similar papers. Think step-by-step. At the end combine all the relevant dimensions in the format of a python list. | ["Task/Methodology", "Domain/Genre", "Evaluation Metrics", "Language", "Input/Output Format", "Deep Learning/Traditional Methods", "Applications"] |
| Research problem | ORKG properties | GPT-3.5 dimensions | Llama 2 dimensions | Mistral dimensions |
|---|---|---|---|---|
| Limestone mineral identification using Hyperion imagery | ’Data used’, ’Research problem’, ’Study Area’, ’Software’, ’Analysis’, ’Processing’, ’Field instrument’, ’Reference’ | ’Mineralogy’, ’Limestone’, ’Hyperion imagery’, ’Identification’ | ’Remote Sensing’, ’Hyperspectral Imaging’, ’Mineral Identification’, ’Limestone’, ’Hyperion Imagery’, ’Geological Mapping’, ’Image Classification’, ’Spectral Analysis’, ’Earth Observation’, ’Geoscience’ | ’Limestone’, ’Mineral identification’, ’Remote sensing’, ’Hyperion imagery’, ’Spectral signatures’, ’Band ratios’, ’Textural features’, ’Shape descriptors’, ’Machine learning algorithms’, ’Support Vector Machines’, ’Decision trees’, ’Random forests’ |
| Do human noroviruses infect animals | ’Material’, ’Method’, ’Symptoms and signs’, ’Research problem’, ’Species’, ’Location’, ’Antigen used’, ’Immunoglobulin class’, ’Primers’, ’Noroviruses found’, ’Norovirus genotype’ | ’viral infectivity’, ’host range’, ’cross-species transmission’, ’zoonotic potential’ | ’Host species’, ’Viral genotype’, ’Transmission route’, ’Infection prevalence’, ’Pathogenesis’, ’Diagnostic methods’, ’Epidemiology’, ’Evolutionary history’, ’Interaction with other pathogens’, ’Public health implications’ | ’Norovirus’, ’Animals’, ’Transmission’, ’Species’, ’Symptoms’, ’Epidemiology’, ’Serotypes’, ’Immunity’, ’Detection methods’, ’Genome analysis’ |
| Score | Description |
|---|---|
| 1 - Strongly Disaligned | The strings in the lists have minimal or no semantic similarity. |
| 2 - Disaligned | The strings in the lists have limited semantic alignment. |
| 3 - Neutral | The semantic alignment between the lists is moderate or average. |
| 4 - Aligned | The strings in the lists show substantial semantic alignment. |
| 5 - Strongly Aligned | The strings in the lists exhibit high semantic coherence and alignment. |
| Score | Description |
|---|---|
| 1 - Minimal Deviation | The strings in the lists show little or no semantic difference. |
| 2 - Low Deviation | The semantic variance between the lists is limited. |
| 3 - Moderate Deviation | There is a moderate level of semantic difference between the strings in the lists. |
| 4 - Substantial Deviation | The lists exhibit a considerable semantic gap or difference. |
| 5 - Significant Deviation | The semantic disparity between the lists is pronounced and substantial. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).