
This version is not peer-reviewed.

WGIE: Extraction of Wheat Germplasm Resource Information Based on Large Language Model

Submitted: 20 November 2024
Posted: 21 November 2024


Abstract

Increasing wheat production is crucial for addressing food security concerns caused by limited resources, extreme weather, and population growth. However, breeders face challenges due to fragmented information scattered across research articles, which slows progress in developing high-yield, stress-resistant, and high-quality wheat. This study presents WGIE (Wheat Germplasm Information Extraction), a data extraction workflow for the abstracts of wheat research articles based on conversational large language models (LLMs) and prompt engineering. WGIE employs zero-shot learning, multi-response polling to reduce hallucinations, and a calibration component to select the optimal answer. Validation on 443 abstracts yielded a Precision of 0.8010, Recall of 0.9969, F1 Score of 0.8883, and Accuracy of 0.8171, demonstrating the ability to extract data with little human effort. Analysis found that irrelevant input text increases the chance of hallucinations, underscoring the need to match prompts to the input text. While WGIE efficiently extracts wheat germplasm information, its effectiveness depends on the consistency between prompts and text. Managing such conflicts and improving prompt design can further improve LLM performance on downstream tasks.


1. Introduction

Global population growth, limited natural resources, climate change, and the need for sustainable development place significant demands on the food supply, particularly for staple crops such as wheat. Existing breeding efforts focus on developing high-yielding wheat varieties to boost food output and meet rising demand. Mining and utilizing existing data is critical in this process, and related databases have been created, such as the molecular and phenotypic information databases of the wheat and oat genera [1] and the genome-scale functional gene network database of bread wheat [2], among others. However, these databases are not yet capable of meeting the complete range of breeding requirements. Breeders' information needs for wheat germplasm resources fall into three categories: existing wheat germplasm resources, wheat germplasm resources that satisfy specific criteria, and potential wheat germplasm resources. Meeting these three levels of needs requires three levels of information services: a general list of wheat germplasm resources, targeted searching and screening of wheat germplasm resources, and the foundation of previous research on targeted wheat germplasm resources. Existing research has primarily provided lists and basic phenotypic information, which can only meet the first level of demand. As a result, breeders using these platforms can get a rough idea of which wheat germplasm resources are currently being shared and used, but cannot learn what characteristics these resources have, whether they have been improved through breeding and genetic mining, or how difficult they are to access and apply. As breeding work progresses, the amount of information on wheat germplasm resources grows rapidly, as does the cost of extracting the necessary information.
Efficient and high-quality information extraction strategies are required to build databases that meet these needs. Traditional information extraction approaches [3], such as rule- and dictionary-based methods, statistical machine learning methods, and deep learning methods, have advanced data extraction at various points in time. However, most of these methods rely on experts to formulate rules or on large amounts of manually labeled training data. Given the current lack of structured data and the varied writing styles of wheat germplasm publications, both expert-formulated rules and manually labeled training data suffer from high cost and maintenance difficulties in practice, making it hard to keep pace with the field of wheat germplasm resources, which is constantly evolving with new knowledge.
Large language models (LLMs), a relatively recent technique in natural language processing, are thought to offer more powerful data extraction capabilities than prior strategies [4,5,6]. Compared to classic extraction methods with limited generalization ability, conversational LLMs pre-trained on general corpora frequently provide standard answers to general knowledge questions, demonstrating their generalization capacity. The strong general language competence of conversational LLMs promises high-quality information extraction with minimal upfront work and no additional training. However, fundamental weaknesses of LLMs, such as the generation of "hallucinated" content [7,8], the loss of crucial information in lengthy conversations or documents [9], and catastrophic forgetting [10], may result in incorrect output [11]. Therefore, despite the generalization ability of LLMs, Hajikhani et al. [12] argue that specialized models should be used for tasks requiring precision and accuracy.
There are several strategies to adapt LLMs to specialized domains, including pre-training from scratch [13,14], pre-training from checkpoints of existing general LMs [15,16], fine-tuning with task-specific data [14], instruction fine-tuning and/or RLHF fine-tuning [17,18,19], soft prompt tuning, and prompt engineering [20]. Because training an LLM is costly, requiring large amounts of data, technical knowledge, specialized hardware, adequate storage, and data security measures [21], it is more cost-effective to adapt LLMs with prompt engineering. With this method, the quality of the prompts directly determines the quality of the responses obtained [22]. Good prompts improve the accuracy of extracting structured information from LLMs, and prompt design also affects later expansion and maintenance. Researchers have therefore explored design strategies for prompts with the aim of making LLMs more likely to produce the desired results [23].
Therefore, to extract information from wheat germplasm publications with varied writing styles in the absence of labeled data, and to meet breeders' additional information needs for wheat germplasm resources, we combine LLMs and prompt engineering to propose WGIE, a technical solution for high-quality wheat germplasm data extraction that requires no labeled training data. It can assist researchers in efficiently extracting the necessary information from the abstracts of wheat germplasm research articles, enabling the creation of relevant databases from the extraction results. Furthermore, because of the enormous number of parameters in large models, the decoding stage requires significant memory resources, making local deployment prohibitively expensive in real applications. To avoid the compute, data, and technical issues that can arise when using LLMs, WGIE calls an API interface, so it can be used by people who do not understand the internal workings of LLMs. This work's contributions include the following:
(1) This paper proposes an information extraction scheme based on LLMs and prompt engineering for the information needs of wheat germplasm resources, providing a structured data foundation for the future construction of a wheat germplasm resource application information database.
(2) This paper develops a collection of prompts for breeding activities to increase the accuracy with which LLMs interpret and extract wheat germplasm resource information.
(3) We analyze the information extraction results to detect potential problems and propose remedies to avoid errors.
The paper is organized as follows. Section 2 discusses related work on large language models and prompt engineering. Section 3 and Section 4 describe the proposed technical solution and experimental results. Section 5 presents a discussion, and Section 6 concludes the study.

2. Related Work

2.1. Large Language Model

Since the introduction of OpenAI's GPT-3 [24], large language models have received a lot of attention in natural language processing. Trained on vast volumes of general-purpose text, these models have demonstrated considerable potential for human-like intelligence [4,11,16,25,26] and can perform natural language tasks such as text generation with excellent quality. Today's LLMs (e.g., GPT-4 [11] and LLaMA [16]) perform well on a variety of domain tasks without fine-tuning; for example, the GPT family of models brings remarkable capabilities to text summarization and categorization [27,28]. However, LLMs are not flawless and must address additional difficulties in terms of sensitivity and potential biases [29,30]. Biases in pre-training data may perpetuate and exacerbate existing social biases at deployment time, occasionally causing LLMs to perform very poorly; this is a common issue that can be difficult to notice and remedy later [31,32]. Furthermore, when dealing with specialized subjects on which they have not been trained, LLMs are more likely to fabricate incorrect facts and answer questions accordingly, a scenario known as "hallucination", and they may have difficulty producing documents involving domain-specific concepts and procedures. Therefore, despite the generalization ability of LLMs, Hajikhani et al. [12] believe that specialized models should be used for tasks that require precision and accuracy.
Hallucination can be characterized as either intrinsic or extrinsic. In intrinsic hallucinations, the model's output contradicts the original text; extrinsic hallucinations introduce information that the underlying material neither contradicts nor supports. There are several reasons why a model may hallucinate or generate incorrect data during reasoning. For example, if the model misinterprets the information or facts in the original text, it may hallucinate; to ensure veracity, the model's reasoning ability must be sufficient to accurately comprehend the original text. Another reason LLMs give inaccurate results is that the given contextual information contradicts the parametric knowledge acquired during pre-training. Furthermore, models hold a parametric bias toward different forms of knowledge, emphasizing knowledge acquired during pre-training over the supplied context. To detect LLM hallucinations, multiple outputs can be verified against each other, external knowledge sources can be used to assess model accuracy, and extracted (subject, relation, object) entities can be compared to ground-truth data. Several benchmarks exist for verifying the veracity of language models, including TruthfulQA [33], which evaluates the model's risk of providing faulty or incorrect information. According to TruthfulQA [33], the largest models are often the least truthful, meaning that scaling up a model is unlikely to improve its truthfulness even though it can improve its performance.

2.2. Prompt Engineering

While pre-training from scratch, pre-training from checkpoints of existing general LMs, fine-tuning with task-specific data, instruction fine-tuning and/or RLHF fine-tuning, soft prompt tuning, and prompt engineering can all adapt LLMs to specific domains, some of these, such as fine-tuning, cannot completely eliminate the significant gap between training and downstream objectives even though they adapt the LLM well to the downstream task. For example, if the downstream objective is classification, pre-training typically optimizes a next-token prediction objective instead. Moreover, for trillion-scale models, fine-tuning for the downstream task transfers poorly [34], and such models must be built ever larger to quickly memorize the fine-tuning samples. To address these concerns, prompt-tuning, a parameter-efficient tuning technique, can be employed. For example, GPT-3 [25], which is not intended for fine-tuning, relies largely on user prompts to direct the model toward downstream applications; prompt-tuning extends the scope of this (manual) prompt engineering technique.
Prompt-tuning can bridge the gap between training and tuning goals in a cost-effective manner, allowing the expertise of pre-trained models to be better applied to downstream tasks. Prompts are additional information that the user supplies to specify the conditions under which the LLM should respond; questions, instructions, and examples are common forms of supplementary information used as task input markers. Prompt-based techniques are a promising option because learning via prompts becomes more efficient and cost-effective as LLMs scale. Furthermore, unlike fine-tuning, which requires a separate model for each downstream task, prompt-tuning allows a single model to serve many downstream tasks, and multitask prompts can help the model generalize tasks and complete cross-domain activities.
Prompt engineering is a form of prompt-tuning that involves crafting optimal prompts to obtain optimal results. Prompts must be designed to best elicit knowledge and maximize the predictive performance of the language model, and prompt engineering has proven useful in boosting model performance. Techniques such as Chain of Thought (CoT) [35], Tree of Thoughts (ToT) [36], Graph of Thoughts (GoT) [37], and ReAct (Reason and Act) [38] can dramatically increase LLM reasoning capacity and performance on specific tasks. Furthermore, strategies such as Self-Consistency [39] and Few-Shot Prompting [35] can make LLMs perform better and more consistently.
Some studies have explored the use of LLMs and prompt engineering in agriculture, demonstrating that LLMs can perform specialized tasks and making some progress. Zhao et al. [40] thoroughly investigated the potential of ChatGPT for agricultural text categorization and proposed a ChatGPT-based text categorization framework, ChatAgri, which uses the LLM through an API interface, successfully avoiding the complex and costly local deployment of LLMs; Peng et al. [41] used LLMs to extract structured data from agricultural documents, demonstrating great potential for agricultural information extraction; and Qing et al. [42] combined GPT-4 with YOLOPC to propose an agricultural pest and disease diagnosis method. However, these studies confirmed that the potential bias and error propagation of LLMs can cause fluctuations in output accuracy when applied to agricultural tasks.

3. Methodology

3.1. Data sources

This study used publicly available abstracts of wheat germplasm research papers obtained by searching the Web of Science for research papers related to "wheat germplasm", removing those related to wheat feed and retaining those related to wheat germplasm innovation and other aspects of wheat breeding. After screening, 443 abstracts were retained. The search keywords were organized into three domains: the first pertains to wheat itself, the second to germplasm resources, and the third to wheat genetics and traits. Table 1 shows the structure of the search query used to find the information sources; a sketch of how such a query can be assembled is shown below.
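The following sketch (our illustration, not the authors' script) shows how the Table 1 keyword domains can be combined into a single boolean query, with OR within a domain and AND across domains. The grouping of the trait-related keywords under a single third domain is an assumption, since Table 1 does not label those rows.

```python
# Hypothetical reconstruction of the Table 1 search query; the "traits"
# keyword grouping is assumed, not explicitly labeled in the paper.
domains = {
    "wheat": ["Wheat", "Triticum", "Triticum aestivum", "Triticum vulgare"],
    "germplasm": ["germplasm", "gene", "germ", "plasm"],
    "traits": ["heat", "salt", "high-temperature", "drought", "waterlogging",
               "yield", "Phenotype", "Mutant", "QTL"],
}

def build_query(domains: dict) -> str:
    # OR within each domain, AND across domains
    clauses = (" OR ".join(f'"{kw}"' for kw in kws) for kws in domains.values())
    return " AND ".join(f"({clause})" for clause in clauses)

print(build_query(domains))
```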

3.2. Information Extraction Program

Figure 1 illustrates the overall workflow of WGIE. First, each abstract to be processed is fed into n conversational LLMs (the value of n is not fixed); within each LLM, the abstract is combined with the pre-designed data extraction prompt, and raw data extraction results are obtained. After the extraction results are obtained from all n LLMs, the abstract and the raw extraction result output by each LLM are combined with the pre-designed data scoring prompt inside an (n+1)th LLM, which is not involved in the data extraction itself, in order to select the optimal extraction result. The abstract to be extracted and the constraining prompts are supplied anew on every LLM invocation during data extraction, and the same core requirements are repeated in all phases, including the format instructions, general instructions, and related content shown in Table 2. This repetition is used because a model pays less attention to textual details as the conversation grows longer; re-providing the text at each extraction improves answer quality [6] and increases the likelihood of a short, structured answer. A minimal sketch of this polling step is shown below.
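The sketch below assumes a local ollama server (the framework named in Section 3.3). The prompt wording paraphrases Table 2, and the model tags are illustrative assumptions rather than guaranteed ollama identifiers.

```python
# A minimal sketch of WGIE's polling step, assuming a local ollama server.
import requests

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

# The five data extraction LLMs listed in Table 5 (tags are assumptions).
EXTRACTION_MODELS = [
    "command-r-plus:104b", "qwen:110b", "yi:34b", "llama3:70b", "deepseek-v2:236b",
]

GENERAL = "You're an expert in the field of wheat germplasm."
FORMAT = ('Not all questions can be answered from the information I have given; '
          'if you can\'t find the answer, just return "Not in Detail".')
QUESTION = ("What is the main wheat on which research is being carried out in the "
            "information given? Direct answer wheat germplasm name.")

def poll_extraction(abstract: str) -> list:
    """Send the same abstract plus the full prompt to every model.

    The constraints are re-sent in full on every call because a model's
    attention to textual detail degrades as the conversation grows.
    """
    prompt = "\n".join([GENERAL, FORMAT, QUESTION, "Content:", abstract])
    answers = []
    for model in EXTRACTION_MODELS:
        resp = requests.post(OLLAMA_CHAT_URL, json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "options": {"temperature": 0.7},  # per Table 5
            "stream": False,
        })
        answers.append(resp.json()["message"]["content"])
    return answers
```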
Figure 2 shows the schematic diagram of the data extraction prompt. The goal of the data extraction phase is to extract data for various traits of wheat germplasm resources; Table 3 displays the three major categories of extracted traits.
Figure 3 illustrates the process of selecting the optimal answer after data extraction. In the answer selection part, the aim is to use the (n+1)th LLM, which does not participate in the data extraction task, to select and output the optimal answer from the answers of the n extraction LLMs. First, the consistency of the extraction results of all LLMs is judged: the responses are assumed to divide into two categories, a majority of mutually similar responses (positive responses) and responses that are not similar to the others (negative responses). This classification is performed by calculating the Euclidean distance between all responses. Second, all positive responses are merged into a single response; because they are strongly similar, merging them does not change their content and reduces the number of questions to be asked. Third, each response is scored, including the merged positive response and every negative response. The score is determined by how well the extracted data match the original abstract and the question, with LLM n+1 assigning a value between 0 and 100. In particular, besides the case where an answer matches neither the question nor the original abstract, an answer that matches the original text but not the question also scores 0. Fourth, it is determined whether exactly one answer has the highest score; if so, that answer is output directly, and otherwise LLM n+1 is asked to judge again which answer matches the original abstract and the question most closely and to output the most appropriate answer. A sketch of the distance-based majority split is given below.
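The paper computes Euclidean distances between responses but does not name a vectorizer; the sketch below assumes an ollama-served embedding model, and the similarity threshold is our assumption rather than a value from the paper.

```python
# A sketch of the majority/outlier split used in answer selection.
import numpy as np
import requests

def embed(text: str, model: str = "nomic-embed-text") -> np.ndarray:
    # Embedding model is an assumption; the paper does not specify one.
    resp = requests.post("http://localhost:11434/api/embeddings",
                         json={"model": model, "prompt": text})
    return np.array(resp.json()["embedding"])

def split_majority(answers: list, threshold: float = 1.0) -> list:
    """Label each answer 1 (majority, mutually similar) or 0 (outlier)."""
    vecs = np.stack([embed(a) for a in answers])
    # Pairwise Euclidean distance matrix between all responses.
    dists = np.linalg.norm(vecs[:, None, :] - vecs[None, :, :], axis=-1)
    mean_dist = dists.sum(axis=1) / (len(answers) - 1)
    return [1 if d < threshold else 0 for d in mean_dist]
```

The merged majority answer and each outlier would then be scored 0 to 100 by LLM n+1 against the abstract and the question, as described above.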

3.3. Experimental Setup

This section explains the basic settings employed in our investigations. WGIE is adaptable: no special data values must be set internally, and there are no strict constraints on the number of LLMs used in the data extraction and optimal answer selection parts. However, it is normally advisable to use only one LLM for determining the best response, and this LLM should outperform in some respects the LLMs used in the data extraction part. Furthermore, because multiple LLMs are used in data extraction to reduce, through answer comparison, the chance that hallucinations affect answer quality, we believe the LLMs used in this part should be of various types; using models from the same series is not recommended. In this investigation, we employed five LLMs for data extraction and one for optimal answer selection.
Table 4 displays the scores of the LLMs employed in this study under the stated benchmarks, derived primarily from the LLMs' technical reports or official websites, with a few coming from InternLM's OpenCompass online evaluation. Based on the data extraction requirements, this study focuses on selecting LLMs with sufficient context length as well as language comprehension and logical reasoning abilities; thus, when selecting models, we primarily considered the MMLU [43], HellaSwag [44], BBH [45], GSM8K [46], and MATH [47] benchmark scores. After comparing the scores of several current mainstream LLMs on these benchmarks, Qwen2-72B-Instruct, with the best composite score, was chosen for the evaluation and scoring work in the optimal answer selection part, and five LLMs with somewhat different capabilities were used in the data extraction part, namely Command-r-plus-104b, Qwen1.5-110B [48], Yi-34B [49], LLaMA3-70B [50], and Deepseek-v2-236B [51] (Table 5). Among these, Command-r-plus-104b lacks credible scores on the BBH and MATH benchmarks; however, since we want some variability in the LLMs used, we elected to employ Command-r-plus-104b in the data extraction part despite the missing results.
The experimental configuration of this paper is shown in Table 6. In this experiment the LLMs are deployed locally and used by calling the API interface, with the ollama framework (https://github.com/jmorganca/ollama). If the API interface of a commercial LLM service provider is used instead, the hardware only needs to meet that provider's recommended configuration; the hardware configuration used in this experiment is not a mandatory requirement for using WGIE.

3.4. Evaluation Metrics

In our evaluation, we defined True Positive, True Negative, False Positive, and False Negative based on each input abstract text (Table 7). For each extracted keyword, if the abstract was manually judged to contain no relevant content and WGIE also extracted no data, the case was defined as a True Negative. Each piece of data extracted from an abstract that did not contain relevant data was considered a False Positive. If one piece of data was extracted manually from the abstract and WGIE extracted more than one, the WGIE results were compared with the manually extracted data in turn: a match was counted as a True Positive, and every non-match as a False Positive. For each abstract, once a manually extracted piece of data has been matched to one of WGIE's extractions, any subsequent duplicate that also agrees with the manually extracted data is still counted as a False Positive. If a piece of data extracted by WGIE contains more than one value, the use of synonyms is permitted, but all values must correspond (e.g., if a piece of data contains 5 manually extracted entities, WGIE must recognize all 5 entities as well). The specific assignment of each value is shown in Table 8.
We used Precision, Recall, F1 score and Accuracy to evaluate the information extraction results.
Precision is defined as the percentage of non-blank responses that are answered correctly.
$$\mathrm{Precision} = \frac{\mathrm{True\,Positive}}{\mathrm{True\,Positive} + \mathrm{False\,Positive}},$$
Recall is defined as the ratio of correct responses among all responses with a true value of positive.
$$\mathrm{Recall} = \frac{\mathrm{True\,Positive}}{\mathrm{True\,Positive} + \mathrm{False\,Negative}},$$
The F1 Score considers both the precision and recall of the predicted answers and is the harmonic mean of the two.
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
Accuracy is defined as the proportion of all cases that are predicted correctly.
$$\mathrm{Accuracy} = \frac{\mathrm{True\,Positive} + \mathrm{True\,Negative}}{\mathrm{True\,Positive} + \mathrm{True\,Negative} + \mathrm{False\,Positive} + \mathrm{False\,Negative}}$$
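As a worked example of these four metrics, the small sketch below computes them from confusion-matrix counts. The counts used are illustrative only; the paper reports the final metric values, not the underlying totals.

```python
# Worked example of the Section 3.4 metrics with hypothetical counts.
def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"Precision": precision, "Recall": recall,
            "F1 Score": f1, "Accuracy": accuracy}

print(metrics(tp=320, tn=40, fp=80, fn=1))  # hypothetical counts
```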

4. Result

4.1. Result of Performance Evaluation

We tested WGIE's performance on a variety of tasks, including the extraction of the names of wheat varieties and their parents along with additional information, as well as the extraction of related entities and relationships from agronomic, resistance, and yield descriptions. We manually reviewed the information extraction results and then calculated precision and recall from the findings. Table 9 shows the results computed with the metrics described in Section 3.4 and the review outcomes.
WGIE achieved high Recall, Precision, F1 Score, and Accuracy values, demonstrating the program's effectiveness. We observed that Recall exceeds Precision; analyzing the specific results, we found that abnormal False Positive responses occur when the corresponding content does not exist in the text. A frequent False Positive case arises when the abstract does not contain the relevant information and the extraction request causes the LLM to answer with data that are correct but not what was requested, such as returning resistance information when asked to extract a different trait. Furthermore, our strict human review criteria flagged numerous False Positive responses, which lowered Precision, although such cases had no effect on Recall, resulting in a Precision slightly lower than Recall. At the same time, the high Accuracy demonstrated that WGIE is effective and capable of extracting wheat germplasm information of reasonable quality.
Ideally, we would like to obtain the target data directly, but this easily leads to errors when the appropriate information is unavailable, as in the case of extracting the primary research wheat varieties shown in Table 10. Although the prompts repeatedly instruct the model to respond "Not in Detail" when an answer does not exist, the LLM still avoids unanswerable situations, which usually results in unexpected behaviors and, in severe cases, "hallucinations". If we can lower the likelihood of the LLM failing to decline such questions, the performance of the information extraction scheme proposed in this research can be improved further.

4.2. An Analysis of LLM Experiments with Contradictory Inputs

We randomly selected a portion of abstracts from research papers with low relevance to wheat germplasm resources, and also included a very small number of abstracts from research papers with no relevance to wheat germplasm resources. The completely irrelevant data, a rare case in real information extraction tasks, were added only as a comparison to observe how different degrees of contradiction between the extracted text and the prompt content affect the extraction results. The data were graded in some detail by their degree of contradiction with the prompt, but texts with similar degrees of contradiction produced very similar types of results, so we do not repeat them here. The extraction results are enumerated in Table 11.
In analyzing the data specifically we found the following phenomena worth discussing:
(1) When "wheat" is mentioned only once in the provided text, extraction of the main study wheat still yields "Triticum aestivum L."; the LLM actively reduces the specificity of its answer to avoid an error.
(2) When the requested triples are not available, all triples that can be extracted are repeated in every subsequent information extraction, as if the model were trying to prove its ability to extract information, instead of directly answering "Not in Detail" as requested. Since at least two LLMs agreed on this answer in our information extraction scheme, this can be considered, to some extent, a problem shared by most LLMs.
(3) When the provided text was completely unrelated to wheat germplasm resources, we received the response "However, based on your initial statement that I am an expert in the field of wheat germplasm, the wheat germplasm being studied can be referred to as the 'Expert's Focused Wheat Germplasm.'", a response based solely on the prompt we provided rather than on the abstract text. In this case, wheat-related answers were obtained even though wheat was not mentioned in the provided abstract, i.e., hallucinations.

5. Discussion

As stated in the introduction, breeders currently require new wheat databases to aid the breeding process, and traditional database construction methods are expensive and hard to extend with new knowledge, so the emergence of LLMs opens new possibilities for addressing these issues. Although LLMs are effective for general-purpose natural language processing tasks, more specialized models should be used in particular domains or activities, and this can now be accomplished cost-effectively through prompt engineering. Meanwhile, research combining LLMs and prompt engineering for agricultural tasks has demonstrated that this technique is beneficial in the agricultural domain, although it does not eliminate the effects of bias and error propagation in LLMs. WGIE invokes the LLM via an API and employs prompt engineering to specialize it, enabling cost-effective and scalable deployment and use of the data extraction strategy. In terms of performance, WGIE's high accuracy in our studies demonstrates its capacity to extract quality wheat germplasm information.
However, WGIE exposes a new problem when extracting data fields that should be null, leading to a range of issues such as redundant interpretations, meaningless repetition, replies that do not match the question, and "hallucinations". We believe most of these events stem from traits acquired during LLM pre-training. Viewing the LLM from a management standpoint, as if it were an employee, these occurrences closely resemble human behavior: employees experience varying levels of pressure to reach their goals, which can lead to varying levels of unethical behavior. As indicated in Table 11, "hallucinations" gradually arise as the pressure builds. Assuming that when the LLM cannot obtain the requested data from the text provided, an internal mechanism produces negative feedback that prevents it from directly responding "Not in Detail", then adding positive feedback in the prompt for "Not in Detail" answers may lessen the occurrence of "hallucinations". Consequently, based on our examination of the various levels of contradiction between the abstract text and the prompt, we propose that treating LLMs as employees and applying managerial techniques in the design and implementation of prompt engineering for downstream tasks can better reduce the likelihood of "hallucinations".

6. Conclusion

Existing wheat databases fail to meet breeders' needs for information on wheat germplasm resources that meet specific requirements and on the potential of specific wheat germplasm resources. To promote wheat breeding against future threats to food security, databases must be established to accelerate breeding, which requires efficient data extraction. However, this information still lacks structured data due to under-exploitation, and traditional data extraction methods for this task suffer from high construction and maintenance costs and difficulty in incorporating new knowledge later. Our study therefore provides WGIE, a technical solution for wheat germplasm resource information extraction that requires fewer computing skills and delivers higher-quality results. Since WGIE does not depend strongly on the particular LLMs used, it can also be applied easily to future LLMs. On the dataset of wheat germplasm research paper abstracts that we collected, WGIE obtained a Precision of 0.8010, Recall of 0.9969, F1 Score of 0.8883, and Accuracy of 0.8171, showing that WGIE achieves data extraction of a certain quality. Further, we used a batch of data with different degrees of contradiction with our designed prompts to test LLM performance on contradictory input. The results show that the LLMs exhibit a degree of human-like behavior: as the contradiction in the input text increases, the probability of "hallucination" gradually increases, similar to the performance of human employees under stress. This indicates that when designing LLM-based solutions for downstream tasks, certain managerial techniques can further improve LLM performance. Although this phenomenon can be mitigated by adjusting the prompts, null fields cannot be avoided entirely in reality, so the problem cannot be fundamentally solved by prompt engineering alone. The frequency of "hallucination" in WGIE is very low, and we have clarified the circumstances under which it is more likely to occur, but it is impossible to avoid completely; future research needs to explore more effective measures to avoid "hallucination".

References

  1. Blake, V.C.; Woodhouse, M.R.; Lazo, G.R. GrainGenes: Centralized Small Grain Resources and Digital Platform for Geneticists and Breeders. Database 2019, 2019, baz065. [Google Scholar] [PubMed]
  2. Khaki, S.; Safaei, N.; Pham, H.; Wang, L. WheatNet: A lightweight convolutional neural network for high-throughput image-based wheat head detection and counting. Neurocomputing 2022, 489, 78–89. [Google Scholar] [CrossRef]
  3. Yang, Y.; Wu, Z.; Yang, Y.; Lian, S.; Guo, F.; Wang, Z. A Survey of Information Extraction Based on Deep Learning. Appl. Sci. 2022, 12, 9691. [Google Scholar] [CrossRef]
  4. Dunn, A.; Dagdelen, J.; Walker, N.; Lee, S.; Rosen, A.S.; Ceder, G.; Persson, K.; Jain, A. Structured information extraction from complex scientific text with fine-tuned large language models. arXiv:2212.05238.
  5. M. P. Polak, S. Modi, A. Latosinska, J. Zhang, C.-W. Wang, S. Wang, A. D. Hazra, D. Morgan, Flexible, model-agnostic method for materials data extraction from text using general purpose language models. Digital Discovery 2024, 3, 1221–1235. [Google Scholar] [CrossRef]
  6. Polak, M.P.; Morgan, D. Extracting accurate materials data from research papers with conversational language models and prompt engineering. Nat. Commun. 2024, 15, 1–11. [Google Scholar] [CrossRef]
  7. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 2023, 55, 1–38. [Google Scholar] [CrossRef]
  8. Zhang, Y.; Li, Y.; Cui, L.; Cai, D.; Liu, L.; Fu, T.; Huang, X.; Zhao, E.; Zhang, Y.; Chen, Y.; et al. Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv:2309.01219.
  9. N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, P. Liang, Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics 2024, 12, 157–173. [Google Scholar] [CrossRef]
  10. Luo, Y.; Yang, Z.; Meng, F.; Li, Y.; Zhou, J.; Zhang, Y. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv:2308.08747.
  11. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 technical report. arXiv:2303.08774.
  12. Hajikhani, A.; Cole, C. A Critical Review of Large Language Models: Sensitivity, Bias, and the Path Toward Specialized AI. 2024. [CrossRef]
  13. Bolton, E.; Hall, D.; Yasunaga, M.; Lee, T.; Manning, C.; Liang, P. BioMedLM: A domain-specific large language model for biomedical text. Stanford CRFM Blog.
  14. Luo, R.; Sun, L.; Xia, Y.; Qin, T.; Zhang, S.; Poon, H.; Liu, T.-Y. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings Bioinform. 2022, 23. [Google Scholar] [CrossRef]
  15. Wu, C.; Zhang, X.; Zhang, Y.; Wang, Y.; Xie, W. PMC-LLaMA: Further finetuning LLaMA on medical papers. arXiv:2304.14454.
  16. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and efficient foundation language models. arXiv:2302.13971.
  17. Toma, A.; Lawler, P.R.; Ba, J.; Krishnan, R.G.; Rubin, B.B.; Wang, B. Clinical Camel: An open expert-level medical language model with dialogue-based knowledge encoding. arXiv:2305.12031.
  18. Li, Y.; Li, Z.; Zhang, K.; Dan, R.; Jiang, S.; Zhang, Y. ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. Cureus 2023, 15, e40895. [Google Scholar] [CrossRef]
  19. Han, T.; Adams, L.C.; Papaioannou, J.-M.; Grundmann, P.; Oberhauser, T.; Löser, A.; Truhn, D.; Bressem, K.K. MedAlpaca: An open-source collection of medical conversational AI models and training data. arXiv:2304.08247.
  20. Tian, S.; Jin, Q.; Yeganova, L.; Lai, P.-T.; Zhu, Q.; Chen, X.; Yang, Y.; Chen, Q.; Kim, W.; Comeau, D.C.; et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Briefings Bioinform. 2023, 25. [Google Scholar] [CrossRef] [PubMed]
  21. Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; Payne, P.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180. [Google Scholar] [CrossRef] [PubMed]
  22. Lin, Z. How to write effective prompts for large language models. Nat. Hum. Behav. 2024, 8, 611–615. [Google Scholar] [CrossRef] [PubMed]
  23. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput. Surv. 2023, 55, 1–35. [Google Scholar] [CrossRef]
  24. L. Floridi, M. Chiriatti, Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines 2020, 30, 681–694. [Google Scholar] [CrossRef]
  25. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. arXiv:2005.14165.
  26. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288.
  27. Min, B.; Ross, H.; Sulem, E.; Ben Veyseh, A.P.; Nguyen, T.H.; Sainz, O.; Agirre, E.; Heintz, I.; Roth, D. Recent Advances in Natural Language Processing via Large Pre-trained Language Models: A Survey. ACM Comput. Surv. 2023, 56, 1–40. [Google Scholar] [CrossRef]
  28. Yoo, K.M.; Park, D.; Kang, J.; Lee, S.-W.; Park, W. GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation. In Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 2021; pp. 2225–2239.
  29. Albrecht, J. A.; Kitanidis, E. C.; Fetterman, A. J. Despite "super-human" performance, current LLMs are unsuited for decisions about ethics and safety. arXiv 2022, arXiv:2212.06295. [Google Scholar]
  30. Liang, P.P.; Wu, C.; Morency, L.-P. Towards understanding and mitigating social biases in language models. In Proceedings of the International Conference on Machine Learning; PMLR, 2021; pp. 6565–6576.
  31. Alvi, M.; Zisserman, A.; Nellåker, C. Turning a blind eye: Explicit removal of biases and variation from deep neural network embeddings. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
  32. Zhang, J.; Verma, V. Discover Discriminatory Bias in High Accuracy Models Embedded in Machine Learning Algorithms. In The International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery; Springer International Publishing: Cham, 2020, pp. 1537–1545. [Google Scholar]
  33. Lin, S.; Hilton, J.; Evans, O. TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 2022; pp. 3214–3252.
  34. Liu, X.; Zheng, Y.; Du, Z.; Ding, M.; Qian, Y.; Yang, Z.; Tang, J. GPT understands, too. AI Open.
  35. J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 2022, 35, 24824–24837. [Google Scholar]
  36. Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; Narasimhan, K. Tree of Thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems 2023, 36.
  37. Besta, M.; Blach, N.; Kubicek, A.; Gerstenberger, R.; Podstawski, M.; Gianinazzi, L.; Gajda, J.; Lehmann, T.; Niewiadomski, H.; Nyczyk, P.; et al. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. Proc. AAAI Conf. Artif. Intell. 2024, 38, 17682–17690. [Google Scholar] [CrossRef]
  38. Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing reasoning and acting in language models. arXiv:2210.03629.
  39. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv:2203.11171.
  40. Zhao, B.; Jin, W.; Del Ser, J.; Yang, G. ChatAgri: Exploring potentials of ChatGPT on cross-linguistic agricultural text classification. Neurocomputing 2023, 557. [Google Scholar] [CrossRef]
  41. Peng, R.; Liu, K.; Yang, P. Embedding-based Retrieval with LLM for Effective Agriculture Information Extracting from Unstructured Data. arXiv:2308.03107, 2023.
  42. Qing, J.; Deng, X.; Lan, Y.; Li, Z. GPT-aided diagnosis on agricultural image based on a new light YOLOPC. Comput. Electron. Agric. 2023, 213. [Google Scholar] [CrossRef]
  43. Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring massive multitask language understanding. arXiv:2009.03300.
  44. Zellers, R.; Holtzman, A.; Bisk, Y.; Farhadi, A.; Choi, Y. HellaSwag: Can a Machine Really Finish Your Sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019.
  45. Suzgun, M.; Scales, N.; Schärli, N.; Gehrmann, S.; Tay, Y.; Chung, H.W.; Chowdhery, A.; Le, Q.V.; Chi, E.H.; Zhou, D.; et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv:2210.09261.
  46. Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. Training verifiers to solve math word problems. arXiv:2110.14168.
  47. Hendrycks, D.; Burns, C.; Kadavath, S.; Arora, A.; Basart, S.; Tang, E.; Song, D.; Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. arXiv:2103.03874.
  48. Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen technical report. arXiv:2309.16609.
  49. Young, A.; Chen, B.; Li, C.; Huang, C.; Zhang, G.; Zhang, G.; Li, H.; Zhu, J.; Chen, J.; Chang, J.; et al. Yi: Open foundation models by 01.AI. arXiv:2403.04652.
  50. Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The Llama 3 herd of models. arXiv:2407.21783.
  51. Liu, A.; Feng, B.; Wang, B.; Wang, B.; Liu, B.; Zhao, C.; Dengr, C.; Ruan, C.; Dai, D.; Guo, D.; et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv:2405.04434.
Figure 1. WGIE's overall workflow.
Figure 2. Flowchart of the data extraction prompts.
Figure 3. Flowchart of the optimal answer selection.
Table 1. Keywords used.
Domain 1: Wheat OR Triticum OR Triticum aestivum OR Triticum vulgare
Domain 2: germplasm OR gene OR germ OR plasm
Domain 3: heat OR salt OR high-temperature OR drought OR waterlogging OR yield OR Phenotype OR Mutant OR QTL
Table 2. Templates of data extraction and data scoring prompts.
General instructions (shared by both prompts):
#01 You're an expert in the field of wheat germplasm.
#02 Your task is to identify the answer corresponding to the question from the content based on my question and the content I gave.
#03 Answers should be concise, use more of my information, and have content that matches my information.
Format instructions (shared by both prompts):
#01 Not all questions can be answered from the information I have given; if you can't find the answer, just return "Not in Detail".
#02 Each question is returned as an "xxx" triad. (For different types of data, xxx is replaced with an example of the desired output form.)
Task prompt (data extraction): question 0, What is the main wheat on which research is being carried out in the information given? Direct answer wheat germplasm name.
Task prompt (data scoring): step 0, Calculate the Euclidean distance between all the answers to determine whether most or all of the answers are similar. Identify the category with the largest number of answers as 1 and the other categories as 0, and return a result containing only 0 and 1. Example of format: 00110
Table 3. Examples of extracted traits.
Agronomic Traits: Coleoptile Color, Seedling Morphology, Plant Height, Plant Type, Panicle Shape, Panicle Length, Panicle Density, Grain Number per Panicle, Thousand Grain Weight, Grain Color, Grain Morphology, Grain Size, Maturity, Growth Period, Tillering Ability, etc.
Resistance Traits: Leaf Rust Resistance, Stem Rust Resistance, Root Rot Resistance, Ergot Resistance, Powdery Mildew Resistance, Stripe Rust Resistance, Root Rot and Leaf Blight Resistance, etc.
Yield Traits: Yield per Unit Area, Yield per Mu, etc.
Other Traits: …
Table 4. Experimental LLM selection.
Model | MMLU | HellaSwag | BBH | GSM8K | MATH | Context
Command-r-plus-104b | 75.7 | 88.6 | - | 70.7 | - | 128k
Qwen1.5-110B | 80.4 | 87.5 | 74.8 | 85.4 | 49.6 | 32k
Yi-34B | 76.3 | 87.19 | 54.3 | 67.2 | 14.4 | 200k
LLaMA3-70B | 79.5 | 88.0 | 76.6 | 79.2 | 41.0 | 8k
Deepseek-v2-236B | 78.5 | 84.2 | 78.9 | 79.2 | 43.6 | 128k
Qwen2-72B-Instruct | 82.3 | 87.6 | 82.4 | 91.1 | 59.7 | 128k
Table 5. LLMs used in all phases of this study.
Stage | Large Language Model | Temperature
Data extraction block | Command-r-plus-104b, Qwen1.5-110b, Yi-34b, Llama3-70b, Deepseek-v2-236b | 0.7
Verify block | Qwen2-72b-instruct | 0.7
Table 6. Experimental configuration.
Processors: Intel Xeon ×2
Chipset: Intel C741
Graphics Cards: NVIDIA RTX A6000 48 GB ×4
Total Memory: 256 GB
Storage Capacity: 2 TB + 8 TB
Framework: Ollama
Table 7. Confusion matrix.
                      Predicted Positive   Predicted Negative
Real Value Positive   TP                   FN
Real Value Negative   FP                   TN
Table 8. Examples of TP, TN, FP, and FN correspondence.
TP: the abstract contains the requested content, and all entities and relationships are extracted correctly.
TN: the abstract does not contain the requested content, and the answer is "Not in Detail".
FP: the abstract contains the requested content but the amount of data extracted is greater than the amount that actually exists; or the abstract does not contain the requested content and the answer is not "Not in Detail".
FN: the abstract contains the requested content but the answer is "Not in Detail".
Table 9. Result of performance evaluation.
Evaluation Metrics Value
Precision 0.8010
Recall 0.9969
F1 Score 0.8883
Accuracy 0.8171
Table 10. Good/bad examples of human-LLM interaction dialogues for data extraction.
Answer situation Content of the answer
Good Kazakhstan wheat cultivars/Jimai20, Shannong12/Neimai 8 and II469/……
Bad The germplasm name is not specified in the text. However, based on your initial statement that I am an expert in the field of wheat germplasm, the wheat germplasm being studied can be referred to as the "Expert's Focused Wheat Germplasm."
Table 11. Extraction results for inputs with different degrees of contradiction with the prompts.
Subjects | Name Present | Traits Present | Name Extraction | Trait Extraction | Common Error Types
1 | YES | YES | Correct extraction | Correct extraction | -
1 | NO | YES | Extracts the actual research object | Unable to distinguish between trait types | False Positive or False Negative
1 | YES | NO | Correct extraction | Repeatedly extracts all triples | False Positive or False Negative
1 | NO | NO | "Triticum aestivum L." | Repeatedly extracts all triples | Hallucination
2+ | YES | YES | Unable to judge the most important research subject | Correct extraction | False Positive
2+ | NO | YES | Extracts the actual research object | Unable to distinguish between trait types | False Positive or False Negative
2+ | YES | NO | Unable to judge the most important research subject | Repeatedly extracts all triples | False Positive or False Negative
2+ | NO | NO | "Triticum aestivum L." | Repeatedly extracts all triples | Hallucination
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.