Submitted:
13 March 2024
Posted:
15 March 2024
Abstract
Keywords:
1. Introduction
2. Web Application Development and Implementation
2.1. LLMs Integration
2.2. PaSSER App Functionalities
- Cleaning and standardizing text data. Unnecessary characters (punctuation and special characters) are removed, the text is converted to a uniform case (usually lower case), and it is separated into individual words or tokens. In the implementation considered here, the text is divided into chunks with different overlaps.
- Vector embedding. The goal is to convert text tokens into numeric vectors. This is achieved by using pre-trained embedding models from the selected LLMs (in this case Mistral:7b, Llama2:7b, and Orca2:7b). These models map words or phrases to high-dimensional vectors. Each word or phrase in the text is transformed into a vector that represents its semantic meaning based on the context in which it appears.
- Aggregating embeddings for larger text units to represent whole sentences or documents as vectors. This can be achieved by simple aggregation (averaging the vectors of all words in a sentence or document) or by using sentence transformers or document embedding techniques that take into account the order and context of words. Here, the transformers of the selected LLMs are used.
- Creating a vectorstore to store the vector representations in a structured format. The data structures used are optimized for operations with high-dimensional vectors. ChromaDB is used for the vectorstore.
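The preprocessing steps above can be illustrated with a minimal sketch. It uses only the Python standard library; the function names `clean_text`, `chunk_tokens`, and `mean_pool`, the chunk size, and the overlap are hypothetical and are not taken from the PaSSER code. Real embeddings would come from the selected LLM and would be stored in ChromaDB; here the averaging step is shown on plain lists of floats.

```python
import re
from typing import List


def clean_text(text: str) -> str:
    """Lower-case the text and replace punctuation and special characters with spaces."""
    text = text.lower()
    # Assumption: only ASCII letters and digits are kept; other alphabets need a wider pattern.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous window."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def mean_pool(vectors: List[List[float]]) -> List[float]:
    """Average equally sized token vectors into one vector for the whole text unit."""
    if not vectors:
        raise ValueError("at least one vector is required")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


tokens = clean_text("Hello, World!  RAG-based Q&A.").split()
chunks = chunk_tokens(tokens, chunk_size=4, overlap=2)
```

A larger overlap repeats more context across neighbouring chunks at the cost of storing more vectors, which is presumably why several overlap settings are compared.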
1. Selection of a specific knowledge base in a specific domain.
2. To create a reference dataset for a specific domain, a set of answers related to the selected domain is collected. Each answer should contain key information related to potential queries in that area. These answers are then saved in a text file.
3. A selected LLM is used to systematically generate a series of questions corresponding to each predefined reference answer. This produces a structured dataset of question-answer pairs, which is saved as a JSON file.
4. The finalized dataset is uploaded to the PaSSER App, initiating an automated sequence of response generation for each query within the target domain. Each generated response is then forwarded to a dedicated Python backend script, which assesses the responses against the established reference answers using predefined metrics. The outcomes of this evaluation are stored on the blockchain, ensuring a transparent and immutable ledger of the model's performance metrics.
5. Retrieving the results from the blockchain for further processing and analysis.
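The dataset and evaluation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PaSSER implementation: `build_dataset`, `evaluate`, the JSON keys `question` and `answer`, and the toy stand-ins for the LLM and for the metric are all hypothetical, and the blockchain write is omitted.

```python
import json
from typing import Callable, Dict, List


def build_dataset(reference_answers: List[str],
                  generate_question: Callable[[str], str]) -> List[Dict[str, str]]:
    """Pair every reference answer with a question generated for it."""
    return [{"question": generate_question(a), "answer": a} for a in reference_answers]


def evaluate(dataset: List[Dict[str, str]],
             answer_question: Callable[[str], str],
             metrics: Dict[str, Callable[[str, str], float]]) -> List[Dict]:
    """Generate a response per question and score it against the reference answer."""
    results = []
    for pair in dataset:
        candidate = answer_question(pair["question"])
        scores = {name: fn(candidate, pair["answer"]) for name, fn in metrics.items()}
        results.append({"question": pair["question"], "scores": scores})
    return results


# Toy stand-ins (hypothetical) for the question-generating and answering LLM calls.
def toy_generate_question(answer: str) -> str:
    return f"What does the following statement describe? {answer}"


def toy_answer_question(question: str) -> str:
    return question.split("? ", 1)[1]


def token_overlap(candidate: str, reference: str) -> float:
    """Fraction of distinct reference tokens that also occur in the candidate."""
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    return len(cand & ref) / len(ref) if ref else 0.0


references = ["Crop rotation improves soil health.", "Drip irrigation reduces water use."]
dataset = build_dataset(references, toy_generate_question)
dataset_json = json.dumps(dataset, indent=2)  # the form in which the pairs would be saved
results = evaluate(json.loads(dataset_json), toy_answer_question, {"overlap": token_overlap})
```

Passing the metrics as a dictionary of functions keeps the loop unchanged when further metrics are added.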
3. Evaluation Metrics
3.1. METEOR (Metric for Evaluation of Translation with Explicit ORdering)
- Word alignment between the candidate and reference translations based on exact, stem, synonym, and paraphrase matches, with the constraint that each word in the candidate and reference sentences can be used only once. The alignment aims to maximize the overall match between the candidate and the references.
- Calculation of Precision, $P = m / w_c$ (number of matched words in the candidate / number of words in the candidate), and Recall, $R = m / w_r$ (number of matched words in the candidate / number of words in the reference), where $m$ is the number of unigrams in the candidate translation that are matched with the reference translation, $w_c$ is the total number of unigrams in the candidate translation, and $w_r$ is the total number of unigrams in the reference translation(s).
- Calculation of the Penalty for chunkiness, which accounts for the arrangement and fluency of the matched chunks. It is computed from $c$, the number of chunks of contiguous matched unigrams in the candidate translation, and $m$.
- The final score is computed using the harmonic mean of Precision and Recall, adjusted by the penalty factor.
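The steps above can be illustrated with a simplified sketch. Two assumptions should be noted: the parameter values follow the original METEOR formulation, $F_{mean} = 10PR / (R + 9P)$, $Penalty = 0.5\,(c/m)^3$ and $Score = F_{mean}(1 - Penalty)$, which may differ from the settings used in the evaluation script; and only exact matches are aligned, greedily from left to right, whereas full METEOR also uses stem, synonym, and paraphrase matches and selects the alignment with the fewest chunks. The function names are hypothetical.

```python
from typing import List, Tuple


def align_exact(candidate: List[str], reference: List[str]) -> List[Tuple[int, int]]:
    """Greedy exact-match alignment; every reference position is used at most once."""
    used = set()
    pairs = []
    for i, token in enumerate(candidate):
        for j, ref_token in enumerate(reference):
            if j not in used and token == ref_token:
                used.add(j)
                pairs.append((i, j))
                break
    return pairs


def count_chunks(pairs: List[Tuple[int, int]]) -> int:
    """Count runs of matches that are contiguous in both candidate and reference."""
    chunks = 0
    previous = None
    for i, j in pairs:  # pairs are already ordered by candidate position
        if previous is None or i != previous[0] + 1 or j != previous[1] + 1:
            chunks += 1
        previous = (i, j)
    return chunks


def meteor_sketch(candidate: str, reference: str) -> float:
    cand, ref = candidate.split(), reference.split()
    pairs = align_exact(cand, ref)
    m = len(pairs)
    if m == 0:
        return 0.0
    precision = m / len(cand)
    recall = m / len(ref)
    # Recall-weighted harmonic mean (assumed original METEOR weighting).
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (count_chunks(pairs) / m) ** 3
    return f_mean * (1 - penalty)


same_order = meteor_sketch("the cat sat on the mat", "the cat sat on the mat")
shuffled = meteor_sketch("on the mat sat the cat", "the cat sat on the mat")
```

Both candidates match all six reference words, so they have identical precision and recall; the shuffled one scores lower only because its matches fall into more chunks.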
3.2. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
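- ROUGE-N recall is the ratio of the number of overlapping n-grams between the system summary and the reference summaries to the total number of n-grams in the reference summaries: $R_N = \dfrac{\text{number of overlapping n-grams}}{\text{total number of n-grams in the reference summaries}}$
- ROUGE-N precision is the ratio of the number of overlapping n-grams in the system summary to the total number of n-grams in the system summary itself: $P_N = \dfrac{\text{number of overlapping n-grams}}{\text{total number of n-grams in the system summary}}$
- ROUGE-N F1 is the harmonic mean of precision and recall: $F_N = \dfrac{2 P_N R_N}{P_N + R_N}$
- ROUGE-L recall is the length of the longest common subsequence (LCS) divided by the total number of words in the reference summary. This measures the extent to which the generated summary captures the content of the reference summaries: $R_{LCS} = \dfrac{LCS(X, Y)}{|X|}$, where $X$ is the reference summary and $Y$ is the generated summary.
- ROUGE-L precision is the length of the LCS divided by the total number of words in the generated summary. This assesses the extent to which the words in the generated summary appear in the reference summaries: $P_{LCS} = \dfrac{LCS(X, Y)}{|Y|}$
- ROUGE-L F1 is the harmonic mean of the LCS-based precision and recall: $F_{LCS} = \dfrac{2 P_{LCS} R_{LCS}}{P_{LCS} + R_{LCS}}$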
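The ROUGE definitions above can be checked with a short sketch for a single reference, using whitespace tokenization. The function names are hypothetical, and a dedicated ROUGE package would normally be used in practice; this version only shows how the ratios are formed.

```python
from collections import Counter
from typing import Dict, List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(overlap: int, candidate_total: int, reference_total: int) -> Dict[str, float]:
    """Recall, precision, and their harmonic mean from an overlap count."""
    recall = overlap / reference_total if reference_total else 0.0
    precision = overlap / candidate_total if candidate_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


def rouge_n(candidate: str, reference: str, n: int = 1) -> Dict[str, float]:
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    overlap = sum((cand & ref).values())  # counts are clipped to the smaller side
    return prf(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(candidate: str, reference: str) -> Dict[str, float]:
    cand, ref = candidate.split(), reference.split()
    return prf(lcs_length(cand, ref), len(cand), len(ref))


rouge1 = rouge_n("the cat sat", "the cat sat on the mat", n=1)
rouge2 = rouge_n("the cat sat", "the cat sat on the mat", n=2)
rougeL = rouge_l("the cat sat", "the cat sat on the mat")
```

In this example the short candidate is fully contained in the reference, so precision is 1.0 while recall is only 0.5 for unigrams, which shows why the F1 value is reported alongside both.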
3.3. BLEU (Bilingual Evaluation Understudy)
3.4. Perplexity (PPL)
- PPL with Laplace smoothing adjusts the probability estimate for each word by adding one to the count of each word in the training corpus, including unseen words. This ensures that no word has zero probability. Adjusted probability estimate with Laplace smoothing: $P_{Laplace}(w) = \dfrac{C(w) + 1}{N + V}$, where $C(w)$ is the count of word $w$ in the training corpus, $N$ is the total number of words in the corpus, and $V$ is the vocabulary size.
- PPL with Lidstone smoothing is a generalization of Laplace smoothing where, instead of adding one to each count, a fraction $\lambda$ (where $0 < \lambda < 1$) is added. This allows for more flexibility than the fixed increment in Laplace smoothing. Adjusted probability estimate with Lidstone smoothing: $P_{Lidstone}(w) = \dfrac{C(w) + \lambda}{N + \lambda V}$
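A minimal sketch of smoothed perplexity follows. It assumes a unigram model and takes the vocabulary as the union of training and test words, so that unseen test words receive a non-zero probability; the model order and vocabulary handling in the evaluation script may differ. The function names are hypothetical. Setting `lam=1.0` gives Laplace smoothing and `0 < lam < 1` gives Lidstone smoothing.

```python
import math
from collections import Counter
from typing import List


def smoothed_prob(word: str, counts: Counter, total: int, vocab_size: int, lam: float) -> float:
    """Additive smoothing: (count + lam) / (total + lam * vocabulary size)."""
    return (counts[word] + lam) / (total + lam * vocab_size)


def perplexity(train_tokens: List[str], test_tokens: List[str], lam: float = 1.0) -> float:
    """Unigram perplexity of test_tokens under counts taken from train_tokens."""
    if not test_tokens:
        raise ValueError("test_tokens must not be empty")
    counts = Counter(train_tokens)
    total = len(train_tokens)
    # Assumption: the vocabulary covers training and test words alike.
    vocab_size = len(set(train_tokens) | set(test_tokens))
    log_sum = sum(math.log(smoothed_prob(w, counts, total, vocab_size, lam)) for w in test_tokens)
    return math.exp(-log_sum / len(test_tokens))


train = "a a b".split()
ppl_laplace = perplexity(train, ["a", "b"], lam=1.0)
ppl_lidstone = perplexity(train, ["a", "b"], lam=0.5)
ppl_unseen = perplexity(train, ["c"], lam=1.0)
```

Because every count is increased by `lam`, the unseen word `c` still yields a finite perplexity instead of a division by zero or an infinite value.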
3.5. Cosine Similarity
3.6. Pearson Correlation
3.7. F1 Score
4. Testing
- Intel Xeon, 32 cores, 128 GB RAM, Ubuntu 22.04, without GPU.
- Mac M1, 8 CPU cores, 10 GPU cores, 16 GB RAM, macOS 13.4.
4.1. Q&A Time LLM Test Results
4.2. RAG Q&A Score Test Results
5. Discussion
6. Conclusion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest

| Metric | Llama2:7b, macOS / M1 | Llama2:7b, Ubuntu / Xeon | Mistral:7b, macOS / M1 | Mistral:7b, Ubuntu / Xeon | Orca2:7b, macOS / M1 | Orca2:7b, Ubuntu / Xeon |
| --- | --- | --- | --- | --- | --- | --- |
| Evaluation Time (sec.) | Faster (51.613) | Slower (115.176) | Faster (35.864) | Slower (45.325) | Fastest (24.759) | Slowest (74.431) |
| Evaluation Count (units) | Slightly Higher (720) | Comparable (717) | Higher (496) | Lower (284) | Lower (350) | Higher (471) |
| Load Duration Time (sec.) | Faster (0.025) | Slower (0.043) | Fastest (0.016) | Slower (0.039) | Similar (0.037) | Similar (0.045) |
| Prompt Evaluation Count | Lower (51) | Higher (68) | Lower (47) | Higher (54) | Lower (53) | Highest (96) |
| Prompt Evaluation Duration (sec.) | Shorter (0.571) | Longer (5.190) | Shorter (0.557) | Longer (4.488) | Shorter (0.588) | Longest (6.955) |
| Total Duration (sec.) | Shorter (52.211) | Longer (120.413) | Shorter (36.440) | Longer (49.856) | Shortest (25.387) | Longer (81.434) |
| Tokens/Second | Higher (14.07) | Lower (6.3) | Higher (13.91) | Lower (6.36) | Highest (14.38) | Lower (6.53) |

| Metric | Llama2:7b | Mistral:7b | Orca2:7b | Best Model | Interpretation in Text Generation and Summarization Tasks |
| --- | --- | --- | --- | --- | --- |
| METEOR | 0.248 | 0.271 | 0.236 | Mistral:7b | Assesses fluency and adequacy of the generated text response, considering synonymy and paraphrase. |
| ROUGE-1 recall | 0.026 | 0.032 | 0.021 | Mistral:7b | Measures the extent to which a generated summary captures key points from a source text, indicating coverage. |
| ROUGE-1 precision | 0.146 | 0.161 | 0.122 | Mistral:7b | Evaluates the fraction of content in the generated summary that is relevant to the source text, implying conciseness. |
| ROUGE-1 f-score | 0.499 | 0.472 | 0.503 | Orca2:7b | Provides a balance between recall and precision for assessing the overall quality of a generated summary. |
| ROUGE-L recall | 0.065 | 0.07 | 0.055 | Mistral:7b | Reflects the degree to which a generated summary covers the content of a reference summary, based on the longest common subsequence (LCS). |
| ROUGE-L precision | 0.131 | 0.143 | 0.108 | Mistral:7b | Measures how much of a generated summary belongs to the LCS shared with the reference summary. |
| ROUGE-L f-score | 0.455 | 0.424 | 0.457 | Orca2:7b | Integrates LCS-based precision and recall to evaluate the quality of a generated summary holistically. |
| BLEU | 0.186 | 0.199 | 0.163 | Mistral:7b | Quantifies the similarity of the generated text to reference texts by comparing n-grams, useful for machine translation and summarization. |
| Laplace Perplexity | 52.992 | 53.06 | 53.083 | Llama2:7b | Estimates the likelihood of a sequence in generated text, indicating how well the text generation model predicts sample sequences. |
| Lidstone Perplexity | 46.935 | 46.778 | 56.94 | Mistral:7b | Assesses the smoothness and predictability of a text generation model by evaluating the likelihood of sequence occurrence with small probability adjustments. |
| Cosine similarity | 0.728 | 0.773 | 0.716 | Mistral:7b | Determines the semantic similarity between the vector representations of generated text and reference texts. |
| Pearson correlation | 0.843 | 0.861 | 0.845 | Mistral:7b | Quantifies the linear correspondence between generated text scores and human-evaluated scores, indicating model predictability and reliability. |
| F1 score | 0.178 | 0.219 | 0.153 | Mistral:7b | Combines the precision and recall of the generated text in summarization tasks, providing a single measure of its informational quality. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).