Submitted:
24 June 2025
Posted:
25 June 2025
Abstract
Keywords:
1. Introduction
2. Related Work
3. Methodology
3.1. Data Collection
3.1.1. Financial Domain
3.1.2. Biomedical Domain
3.1.3. Scientific Domain
3.2. Question-Answer Pairs Dataset Creation
3.2.1. Financial Domain
3.2.2. Biomedical Domain
3.2.3. Scientific Literature Domain
3.3. Pre-processing
3.3.1. Text Extraction
3.3.2. Lower-Casing and Stripping
3.3.3. Sentence Cleaning
- clean_sentence(): This function cleans an individual sentence. It converts the sentence to lowercase, removes special characters using regular expressions, and optionally removes stop-words using the gensim library's remove_stopwords function.
- get_cleaned_sentences(): This function applies clean_sentence() to a list of sentences. An optional stop-words flag controls whether stop-words are removed from the sentences.
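The two helper functions described above could look roughly like the following minimal sketch. The regular expression, the parameter name remove_stop_words, and the small STOP_WORDS set are assumptions made for illustration; the source relies on gensim's stop-word remover, which is replaced here by a tiny built-in list so that the example has no external dependencies.

```python
import re

# Hypothetical stand-in for a stop-word list; the source uses gensim's
# stop-word removal instead of a hand-written set.
STOP_WORDS = frozenset(
    {"a", "an", "the", "is", "are", "was", "were", "of", "in", "to", "and"}
)


def clean_sentence(sentence: str, remove_stop_words: bool = False) -> str:
    """Lower-case a sentence, strip special characters, optionally drop stop-words."""
    text = sentence.lower().strip()
    # Keep letters, digits and whitespace; replace everything else with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    if remove_stop_words:
        tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    return " ".join(tokens)


def get_cleaned_sentences(sentences, remove_stop_words: bool = False):
    """Apply clean_sentence to every sentence in a list."""
    return [clean_sentence(s, remove_stop_words) for s in sentences]
```

With this particular regular expression, currency symbols and decimal points are also removed, so a string such as "$127.4" becomes "127 4". Whether that is acceptable depends on how numeric answers are matched later.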
3.3.4. Convert Sentences into Tokens
3.3.5. Chunking Strategy
3.4. Model Implementation
3.4.1. Model Selection
- Financial Domain: For the financial domain, two BERT variants were used:
  - BERT Base Model: A base BERT model was employed to capture general financial information and nuances.
  - BERT Large Model: A larger version of BERT was used to capture more complex financial patterns and relationships within the text.
- Biomedical Domain: In the biomedical domain, BioBERT, a specialized BERT model pre-trained on biomedical literature, was chosen. BioBERT suits the study of COVID-19 research publications because it is designed to handle the distinct terminology and concepts found in biomedical texts.
- Scientific Literature Domain: For the scientific literature domain, SciBERT, a BERT model pre-trained on a diverse range of scientific texts, was used. SciBERT is designed to capture the intricacies of scientific language, making it suitable for extracting information from research papers and scientific literature.
3.4.2. Model Fine-tuning
3.5. Question Answering Setup
3.6. Evaluation
3.6.1. Metrics
3.6.2. Cross-Domain Evaluation
- For the biomedical domain, both BERT large and Bio-ClinicalBERT gave partial answers for some question-answer pairs and no answer for a few pairs.
- For the scientific domain, both BERT large and SciBERT models predicted partially correct answers.
3.6.3. Comparative Analysis
4. Design Specification
4.1. Techniques
- Domain-Specific Fine-Tuned BERT: For each of the domains, a domain-specific fine-tuned BERT model was employed.
- Transfer Learning: Transfer learning in BERT (Bidirectional Encoder Representations from Transformers) means taking a model pre-trained on large corpora and fine-tuning it for a specific downstream task. Google's BERT has demonstrated impressive results across a range of natural language processing (NLP) applications. The key idea is to reuse the knowledge encoded in the pre-trained model's parameters and adapt it to a particular task or domain with limited labeled data [19].
4.2. Architecture
4.2.1. Multi-Head Attention Mechanism
4.2.2. Domain-Specific Embeddings
4.3. Framework
- PyTorch Transformers Library: The PyTorch Transformers library was used, which facilitates seamless integration with pre-trained BERT models. This library offers a comprehensive set of tools for tokenization, model configuration, and training.
4.4. Algorithm Description
4.4.1. Algorithm Functionality
4.4.2. Algorithm Requirements
4.5. Tools and Languages
5. Implementation
5.1. Domain-Specific Implementation
5.1.1. Financial Domain
- With the pipeline utility: When the models were run through the pipeline utility of the Transformers library, the confidence scores for both the BERT base and BERT large models were extremely low, even though the answers were correct.
- With tokenization and segmentation: In this approach, the pre-processed question and answer text were tokenized using the pre-trained tokenizer, and the tokenized input was divided into question and answer segments. The pre-trained model was then applied to the tokenized and segmented input to estimate the start and end positions of the answer within the input text. After model inference, post-processing handles any spaces at the start of the answer tokens, and the final answer is reconstructed by concatenating these tokens.
- With chunking strategy: Chunking is the process of breaking a large document, such as a PDF, into smaller segments that the model can process. Because BERT models are limited to 512 tokens, they are not efficient for long documents: only the first 512 tokens are considered. To overcome this limitation, the input text was split into chunks of 512 tokens, which were fed to the model in a loop. While chunking can be effective for lengthy documents, it has certain limitations:
  - Context discontinuity: Breaking a document into chunks may lose contextual information that spans chunks. BERT models use context to interpret words, so if the context relevant to a question is divided between two chunks, the model's performance may suffer.
  - Answer span across chunks: Sometimes the answer to a question is spread over more than one section. If the model processes each chunk independently, it may miss the context needed to identify an answer span that extends beyond a single chunk.
  - Incoherent context: Chunks processed in isolation might not provide coherent context, leading to potential misunderstandings by the model. Since BERT is meant to capture contextual relationships between words, breaking the text into smaller sections can interrupt this continuity.
  - Increased complexity: Chunking adds complexity to the pre-processing and post-processing stages. Managing chunk boundaries and ensuring a seamless flow of information between chunks requires careful handling.
- With a curated data set of question-answer pairs: A data set of ten examples, with question, context, and ground-truth columns, was created manually from the PDFs. The data set was stored in CSV format, read into a data frame, and fed to the model to calculate the F1 score. This strategy overcomes the following limitations of the chunking method:
  - Context preservation: The curated data set contains question-answer pairs carefully crafted so that the context needed to answer each question is preserved. In contrast, chunking large documents may introduce discontinuities in context, potentially affecting the model's performance.
  - Reduced complexity: Using a curated data set might simplify the training process compared with managing the complexities introduced by chunking. Dealing with context boundaries, overlaps, and potential information loss associated with chunking can be challenging.

  A curated data set of question-answer pairs has the following additional advantages:
  - Training data quality: If a curated data set is well constructed and diverse, it provides a clean and controlled environment for training the model. The model learns from examples explicitly designed for the task, which can help it generalize to similar scenarios.
  - Task relevance: If a task is well represented in the curated data set, and the questions and answers cover a diverse range of scenarios, a model may perform better than one trained on chunks of documents. This is particularly true if the curated data set is domain-specific or tailored to the types of documents.
  - Evaluation: Curated data sets often come with predefined evaluation references, such as ground-truth answers, making it easier to assess the model's performance and compare it against other models in the field.
  - Efficiency: Training on a curated data set may be computationally more efficient than training on large, chunked documents, especially if the documents are extensive.
- Fine-tuning DistilBERT on the SQuAD data set: To fine-tune the pre-trained model, the Trainer class from the Hugging Face Transformers library (with a PyTorch backend) was used. A small subset of the SQuAD data set was loaded with the datasets library and split into training and test sets using the train_test_split method. DistilBERT, a distilled version of BERT, was then loaded to process the questions and contexts. The data set was pre-processed to truncate the context and to map the answer tokens to the context; the map function of the datasets library applied this pre-processing to the entire data set. Batches of examples were created using a data collator. Next, hyper-parameters such as the learning rate, number of epochs, and weight decay were defined in the training arguments. The trainer was then given the training arguments together with the model, data set, tokenizer, and data collator, and its train function was called to fine-tune the model. The fine-tuned model was saved and used for inference on the financial data set.
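The tokenization-and-segmentation approach in the list above can be illustrated with a small, library-free sketch. It shows the segment ids for a "[CLS] question [SEP] context [SEP]" input, the selection of an answer span from start and end scores, and the merging of WordPiece continuation tokens. The function names, the max_answer_len limit, the joint start-plus-end scoring, and the hand-written tokens and scores are all assumptions for illustration; the source does not specify how the start and end positions were combined.

```python
from typing import List, Sequence, Tuple


def segment_ids(num_question_tokens: int, num_context_tokens: int) -> List[int]:
    """Segment (token type) ids for '[CLS] question [SEP] context [SEP]'."""
    # Segment 0 covers [CLS], the question and the first [SEP];
    # segment 1 covers the context and the final [SEP].
    return [0] * (num_question_tokens + 2) + [1] * (num_context_tokens + 1)


def best_span(start_logits: Sequence[float], end_logits: Sequence[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return (start, end) maximizing start + end score with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


def join_wordpieces(tokens: Sequence[str]) -> str:
    """Merge '##' continuation pieces into the previous token, then join with spaces."""
    words: List[str] = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)


# Hand-written tokens and scores (not real tokenizer or model output).
tokens = ["[CLS]", "what", "grew", "?", "[SEP]",
          "net", "sales", "increas", "##ed", "[SEP]"]
start_logits = [0, 0, 0, 0, 0, 6.0, 1.0, 0.5, 0, 0]
end_logits = [0, 0, 0, 0, 0, 0.5, 1.0, 2.0, 7.0, 0]

start, end = best_span(start_logits, end_logits)
answer = join_wordpieces(tokens[start:end + 1])
print(answer)
```

Requiring start <= end avoids the empty or reversed spans that can occur when the two positions are chosen independently.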
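The pre-processing step in the fine-tuning item above, mapping the answer onto token positions in a possibly truncated context, can be sketched as follows. This is an illustrative version only: whitespace_offsets is a hypothetical stand-in for the character offsets that a real tokenizer would return, and returning None for answers that fall outside the truncated context is an assumption about how such cases are handled.

```python
import re
from typing import List, Optional, Tuple


def whitespace_offsets(text: str) -> List[Tuple[int, int]]:
    """Character (start, end) offsets of whitespace-separated tokens.

    Hypothetical stand-in for the offset mapping of a real tokenizer.
    """
    return [m.span() for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(offsets: List[Tuple[int, int]], answer_start: int,
                            answer_text: str) -> Optional[Tuple[int, int]]:
    """Map a character-level answer to (first_token, last_token) indices."""
    if not answer_text:
        return None
    answer_end = answer_start + len(answer_text)
    # The answer must lie fully inside the (possibly truncated) context.
    if not offsets or answer_start < offsets[0][0] or answer_end > offsets[-1][1]:
        return None
    start_token = min(i for i, (_, end) in enumerate(offsets) if end > answer_start)
    end_token = max(i for i, (start, _) in enumerate(offsets) if start < answer_end)
    return start_token, end_token


context = "Net sales increased 9% to $127.4 billion"
answer_text = "$127.4 billion"
offsets = whitespace_offsets(context)
span = char_span_to_token_span(offsets, context.index(answer_text), answer_text)
print(span)
```

When the context is truncated before the answer, as in offsets[:5] here, the function returns None, which signals that the example has no usable answer label in that window.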
5.1.2. Biomedical Domain
5.1.3. Scientific Literature Domain
6. Evaluation
6.1. Financial Domain
6.1.1. Case Study 1
6.1.2. Case Study 2
6.1.3. Case Study 3
6.1.4. Case Study 4
| Question | Context | Ground Truth | Predicted answer | F1 score |
|---|---|---|---|---|
| What were Amazon’s net sales in the first quarter of 2023? | PDF Text | Net sales increased 9% to $127.4 billion in the first quarter, compared with $116.4 billion in the first quarter 2022. | $127.4 billion | 0.039 |
| How much did net sales increase compared to the first quarter of 2022? | PDF Text | Excluding the $2.4 billion unfavorable impact from year-over-year changes in foreign exchange rates throughout the quarter, net sales increased 11% compared with the first quarter of 2022. | 9% | 0.0 |
| How did North America segment sales change year-over-year? | PDF Text | North America segment sales increased 11% year-over-year to $76.9 billion. | Foreign exchange rates | 0.199 |
| What was the operating income for the AWS segment? | PDF Text | AWS segment operating income was $5.1 billion, compared with an operating income of $6.5 billion in the first quarter of 2022. | $5.1 billion | 0.0 |
| How did the operating cash flow change for the trailing twelve months? | PDF Text | Operating cash flow increased 38% to $54.3 billion for the trailing twelve months, compared with $39.3 billion for the trailing twelve months ended March 31, 2022. | Net sales increased 9% | 0.074 |
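For reference, one widely used way to score a predicted answer against a ground-truth string is the token-level F1 from SQuAD-style evaluation, sketched below. The source does not state which implementation produced the scores in the table above, so this sketch illustrates the metric and is not expected to reproduce those values. The normalization steps (lower-casing, removing punctuation and English articles) are assumptions taken from the SQuAD convention.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lower-case, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall between two strings."""
    pred, gold = normalize(prediction), normalize(ground_truth)
    if not pred or not gold:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this definition a short but correct prediction still scores below 1.0 when the ground truth is a full sentence, because recall counts every ground-truth token.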
6.1.5. Case Study 5
6.2. Biomedical Domain
6.2.1. Case Study 6
6.2.2. Case Study 7
6.3. Scientific Domain
6.3.1. Case Study 8
6.3.2. Case Study 9
7. Discussion
- Splitting text into 512-token chunks is only practical for small PDFs. Amazon's annual reports used for the analysis are 16 pages long, and the prototype developed here lacks an implementation for longer PDFs. In addition, chunking can cause context loss and incoherent answers when multiple chunks are processed.
- The low F1 scores on the curated data sets for the respective domains, as shown in Table 9, suggest that the proposed approach needs optimization and that fine-tuning on the curated data set should be considered.
- A data set was created for each domain using only a few pages of the PDFs. Creating data sets manually for fine-tuning and evaluation is challenging.
- The cross-domain evaluation conducted in case studies 6, 7, 8, and 9 did not show much difference between the models. By comparison, as described in the literature review, previous studies show that domain-specific BERT models achieve significant results in their respective domains.
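The 512-token chunking mentioned in the first point above can be sketched as a simple windowing function over a token list. The optional overlap parameter is not part of the described prototype; it is included here as an assumption, to show one common way of reducing context loss at chunk boundaries. The function name and defaults are likewise illustrative.

```python
from typing import List


def chunk_tokens(tokens: List[str], max_len: int = 512,
                 overlap: int = 0) -> List[List[str]]:
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens. In practice max_len should
    leave room for the question and the special tokens, because the 512-token
    limit applies to the whole model input.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Example with 1200 placeholder tokens.
tokens = [str(i) for i in range(1200)]
print([len(c) for c in chunk_tokens(tokens)])
print([len(c) for c in chunk_tokens(tokens, overlap=128)])
```

With an overlap, an answer that straddles a boundary in one window can fall entirely inside the next one, at the cost of running the model on more tokens overall.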
8. Conclusions and Future Work
References
- Roy, A., Bhaduri, J., Kumar, T. & Raj, K. WilDect-YOLO: An efficient and robust computer vision-based accurate object localization model for automated endangered wildlife detection. Ecological Informatics. 75 pp. 101919 (2023).
- Khan, W., Raj, K., Kumar, T., Roy, A. & Luo, B. Introducing urdu digits dataset with demonstration of an efficient and robust noisy decoder-based pseudo example generator. Symmetry. 14, 1976 (2022).
- Chandio, A., Gui, G., Kumar, T., Ullah, I., Ranjbarzadeh, R., Roy, A., Hussain, A. & Shen, Y. Precise single-stage detector. ArXiv Preprint ArXiv:2210.04252. (2022).
- Singh, A., Raj, K., Kumar, T., Verma, S. & Roy, A. Deep learning-based cost-effective and responsive robot for autism treatment. Drones. 7, 81 (2023).
- Chandio, A., Shen, Y., Bendechache, M., Inayat, I. & Kumar, T. AUDD: audio Urdu digits dataset for automatic audio Urdu digit recognition. Applied Sciences. 11, 8842 (2021).
- Kumar, T., Park, J., Ali, M., Uddin, A., Ko, J. & Bae, S. Binary-classifiers-enabled filters for semi-supervised learning. IEEE Access. 9 pp. 167663-167673 (2021).
- Singh, A., Ranjbarzadeh, R., Raj, K., Kumar, T. & Roy, A. Understanding EEG signals for subject-wise definition of armoni activities. ArXiv Preprint ArXiv:2301.00948. (2023).
- Turab, M., Kumar, T., Bendechache, M. & Saber, T. Investigating multi-feature selection and ensembling for audio classification. ArXiv Preprint ArXiv:2206.07511. (2022).
- Kumar, T., Park, J. & Bae, S. Intra-Class Random Erasing (ICRE) augmentation for audio classification. Korean Society Of Broadcasting And Media Engineering Conference Proceedings. pp. 246-249 (2020).
- Kumar, T., Park, J., Ali, M., Uddin, A. & Bae, S. Class specific autoencoders enhance sample diversity. Journal Of Broadcast Engineering. 26, 844-854 (2021).
- Park, J., Kumar, T. & Bae, S. Search for optimal data augmentation policy for environmental sound classification with deep neural networks. Journal Of Broadcast Engineering. 25, 854-860 (2020).
- Aleem, S., Kumar, T., Little, S., Bendechache, M., Brennan, R. & McGuinness, K. Random data augmentation based enhancement: a generalized enhancement approach for medical datasets. ArXiv Preprint ArXiv:2210.00824. (2022).
- Ranjbarzadeh, R., Jafarzadeh Ghoushchi, S., Tataei Sarshar, N., Tirkolaee, E., Ali, S., Kumar, T. & Bendechache, M. ME-CCNN: Multi-encoded images and a cascade convolutional neural network for breast tumor segmentation and recognition. Artificial Intelligence Review. pp. 1-38 (2023).
- Kumar, T., Turab, M., Raj, K., Mileo, A., Brennan, R. & Bendechache, M. Advanced Data Augmentation Approaches: A Comprehensive Survey and Future directions. ArXiv Preprint ArXiv:2301.02830. (2023).
- Roy, A., Bhaduri, J., Kumar, T. & Raj, K. A computer vision-based object localization model for endangered wildlife detection. Ecological Economics, Forthcoming. (2022).
- Kumar, T., Turab, M., Talpur, S., Brennan, R. & Bendechache, M. FORGED CHARACTER DETECTION DATASETS: PASSPORTS. DRIVING LICENCES AND VISA STICKERS.
- Kumar, T., Mileo, A., Brennan, R. & Bendechache, M. RSMDA: Random Slices Mixing Data Augmentation. Applied Sciences. 13, 1711 (2023).
- Kumar, T., Brennan, R. & Bendechache, M. Stride Random Erasing Augmentation. CS & IT Conference Proceedings. 12 (2022).
- Kumar, T., Turab, M., Mileo, A., Bendechache, M. & Saber, T. AudRandAug: Random Image Augmentations for Audio Classification. ArXiv Preprint ArXiv:2309.04762. (2023).
- Adhikari, A., Ram, A., Tang, R. and Lin, J. (2019). Docbert: Bert for document classification, arXiv preprint arXiv:1904.08398.
- K. Pearce, T. Zhan, A. Komanduri, and J. Zhan, “A Comparative Study of Transformer-Based Language Models on Extractive Question Answering,” Oct. 2021, [Online]. Available: http://arxiv.org/abs/2110.03142.
- W. Zaghouani, I. Vladimir, and M. Ruiz, “COVID-Twitter-BERT: A natural language processing model to analyze COVID-19 content on Twitter.” [Online]. Available: https://github.com/digitalepidemiologylab/covid-twitter-bert.
- E. Alsentzer et al., “Publicly Available Clinical BERT Embeddings,” Apr. 2019, [Online]. Available: http://arxiv.org/abs/1904.03323.
- V. Zayats, K. Toutanova, and M. Ostendorf, “Representations for Question Answering from Documents with Tables and Text,” Jan. 2021, [Online]. Available: http://arxiv.org/abs/2101.10573.
- Y.-C. Chen, Z. Gan, Y. Cheng, J. Liu, and J. Liu, “Distilling Knowledge Learned in BERT for Text Generation,” Association for Computational Linguistics.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” Oct. 2018, [Online]. Available: http://arxiv.org/abs/1810.04805.
- S. Wadhwa, K. R. Chandu, and E. Nyberg, “Comparative Analysis of Neural QA models on SQuAD,” Jun. 2018, [Online]. Available: http://arxiv.org/abs/1806.06972.
- Y. Kim, S. Bang, J. Sohn, and H. Kim, “Question answering method for infrastructure damage information retrieval from textual data using bidirectional encoder representations from transformers,” Autom Constr, vol. 134, Feb. 2022.
- A. Adhikari, A. Ram, R. Tang, and J. Lin, “DocBERT: BERT for Document Classification,” Apr. 2019, [Online]. Available: http://arxiv.org/abs/1904.08398.
- Y. Liu, “Fine-tune BERT for Extractive Summarization,” Mar. 2019, [Online]. Available: http://arxiv.org/abs/1903.10318.
- W. Yang, Y. Xie, L. Tan, K. Xiong, M. Li, and J. Lin, “Data Augmentation for BERT Fine-Tuning in Open-Domain Question Answering,” Apr. 2019, [Online]. Available: http://arxiv.org/abs/1904.06652.
- A. H. Mohammed and A. H. Ali, “Survey of BERT (Bidirectional Encoder Representation Transformer) types,” in Journal of Physics: Conference Series, IOP Publishing Ltd, Jul. 2021.
- I. Beltagy, K. Lo, and A. Cohan, “SciBERT: A Pretrained Language Model for Scientific Text,” Mar. 2019, [Online]. Available: http://arxiv.org/abs/1903.10676.
- A. H. Mohammed and A. H. Ali, “Survey of BERT (Bidirectional Encoder Representation Transformer) types,” in Journal of Physics: Conference Series, IOP Publishing Ltd, Jul. 2021.
- A. Celikten, A. Ugur, and H. Bulut, “Keyword extraction from biomedical documents using deep contextualized embeddings,” in 2021 International Conference on Innovations in Intelligent Systems and Applications, INISTA 2021 - Proceedings, Institute of Electrical and Electronics Engineers Inc., Aug. 2021.
- V. Kommaraju et al., “Unsupervised Pre-training for Biomedical Question Answering,” Sep. 2020, [Online]. Available: http://arxiv.org/abs/2009.12952.
- Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “ALBERT: A Lite BERT for Self-supervised Learning of Language Representations,” Sep. 2019, [Online]. Available: http://arxiv.org/abs/1909.11942.
- M. Namazifar, A. Papangelis, G. Tur, and D. Hakkani-Tür, “Language model is all you need: Natural language understanding as Question answering,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, Institute of Electrical and Electronics Engineers Inc., 2021, pp. 7803–7807.
- C. Tao, S. Gao, M. Shang, W. Wu, D. Zhao, and R. Yan, “Get the point of my utterance! Learning towards effective responses with multi-head attention mechanism,” in IJCAI International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence, 2018, pp. 4418–4424.
- Raj, K. & Mileo, A. Towards Understanding Graph Neural Networks: Functional-Semantic Activation Mapping. International Conference On Neural-Symbolic Learning And Reasoning. pp. 98-106 (2024).
- Singh, A., Raj, K., Meghwar, T. & Roy, A. Efficient paddy grain quality assessment approach utilizing affordable sensors. AI. 5, 686-703 (2024).
- Vavekanand, R., Das, B. & Kumar, T. DAugSindhi: a data augmentation approach for enhancing Sindhi language text classification. Discover Data. 3, 1-12 (2025).
- Vavekanand, R. & Kumar, T. Data augmentation of ultrasound imaging for non-invasive white blood cell in vitro peritoneal dialysis. Biomedical Engineering Communications. 3, 10-53388 (2024).
- Kumar, T., Mileo, A. & Bendechache, M. Keeporiginalaugment: Single image-based better information-preserving data augmentation approach. IFIP International Conference On Artificial Intelligence Applications And Innovations. pp. 27-40 (2024).
- Barua, M., Kumar, T., Raj, K. & Roy, A. Comparative analysis of deep learning models for stock price prediction in the Indian market. FinTech. 3, 551-568 (2024).
- Vavekanand, R., Sam, K., Kumar, S. & Kumar, T. Cardiacnet: A neural networks based heartbeat classifications using ecg signals. Studies In Medical And Health Sciences. 1, 1-17 (2024).
- Kumar, T., Brennan, R., Mileo, A. & Bendechache, M. Image data augmentation approaches: A comprehensive survey and future directions. IEEE Access. (2024).



| Model | Score |
|---|---|
| BERT base | 3.44E-05 |
| BERT large | 0.441348344 |

| Question | Context | Predicted answer |
|---|---|---|
| How much was the net sales in the year 2022? | Net sales increased 13% to $143.1 billion in the third quarter, compared with $127.1 billion in the third quarter of 2022. | "$ 127.1 billion" |

| Question | Context | Predicted answer |
|---|---|---|
| How AWS help Amazon to grow in the year 2022? | PDF text from Amazon’s quarterly report | Segment sales increased 12% year-over-year to $23.1 billion. Operating income increased to $11.2 billion in the third quarter, compared with $2.5 billion in the third quarter of 2022. North America segment operating income was $4.3 billion, compared with an operating loss of $0.4 billion in the third quarter of 2022. |

| Question | Context | Predicted answer | Score |
|---|---|---|---|
| What are different search engines? | BLOOM has 176 billion parameters and can generate text in 46 natural languages and 13 programming languages. | 176 billion | 0.250225812 |

| Question | Context | Predicted answer | Score |
|---|---|---|---|
| What were Amazon’s net sales in the first quarter of 2023? | PDF text | $127.4 billion | 0.07002584 |
| How much did net sales increase compared to the first quarter of 2022? | PDF text | $127.4 billion | 0.070099174 |
| What was the impact of foreign exchange rates on net sales? | PDF text | $116.4 billion | 0.070099174 |
| How did North America segment sales change year-over-year? | PDF text | $116.4 billion | 0.080088 |
| How did the operating cash flow change for the trailing twelve months? | PDF text | $127.4 billion | 0.03641737 |

| Question-answer pair | Bio-ClinicalBERT F1 score | BERT Large F1 score |
|---|---|---|
| 1 | 0.035 | 0.028 |
| 2 | 0.037 | 0.083 |
| 3 | 0.030 | 0.038 |
| 4 | 0.034 | 0.040 |
| 5 | 0.032 | 0.122 |
| 6 | 0.026 | 0.022 |
| 7 | 0.031 | 0.022 |
| 8 | 0.038 | 0.025 |
| 9 | 0.023 | 0.024 |
| 10 | 0.040 | 0.025 |
| Average | 0.033 | 0.043 |

| Question-answer pair | SciBERT F1 score | BERT Large F1 score |
|---|---|---|
| 1 | 0.050 | 0.026 |
| 2 | 0.029 | 0.110 |
| 3 | 0.091 | 0.058 |
| 4 | 0.070 | 0.045 |
| 5 | 0.044 | 0.054 |
| 6 | 0.062 | 0.029 |
| 7 | 0.038 | 0.031 |
| 8 | 0.036 | 0.071 |
| 9 | 0.054 | 0.034 |
| 10 | 0.060 | 0.068 |
| Average | 0.053 | 0.053 |

| Domain | BERT model used | Average F1 score |
|---|---|---|
| Financial | BERT large uncased | 0.03913 |
| Biomedical | Bio-ClinicalBERT | 0.033 |
| Scientific | SciBERT | 0.053 |
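
As a quick arithmetic check, the average F1 values reported for the biomedical and scientific domains can be recomputed from the per-pair scores in the tables above. The short script below copies those ten scores per model and takes their mean, rounded to three decimals; the variable names are illustrative.

```python
from statistics import mean

# Per-pair F1 scores copied from the biomedical and scientific tables above.
scores = {
    "Bio-ClinicalBERT": [0.035, 0.037, 0.030, 0.034, 0.032,
                         0.026, 0.031, 0.038, 0.023, 0.040],
    "BERT Large (biomedical)": [0.028, 0.083, 0.038, 0.040, 0.122,
                                0.022, 0.022, 0.025, 0.024, 0.025],
    "SciBERT": [0.050, 0.029, 0.091, 0.070, 0.044,
                0.062, 0.038, 0.036, 0.054, 0.060],
    "BERT Large (scientific)": [0.026, 0.110, 0.058, 0.045, 0.054,
                                0.029, 0.031, 0.071, 0.034, 0.068],
}

# Mean over the ten question-answer pairs, rounded like the reported values.
averages = {model: round(mean(values), 3) for model, values in scores.items()}
for model, avg in averages.items():
    print(f"{model}: {avg}")
```

The recomputed means (0.033, 0.043, 0.053, and 0.053) agree with the reported averages, so each reported figure is the unweighted mean over the ten question-answer pairs.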
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).