Submitted:
04 October 2025
Posted:
08 October 2025
Read the latest preprint version here
Abstract
Keywords:
1. Related Work
2. Methodology
2.1. Factual Accuracy and Groundedness
- Ground-truth comparison (when available): Exact Match/F1 for extractive tasks; LLM-graded semantic correctness for abstractive answers.
- Faithfulness: split answer into atomic claims; verify each against retrieved context via LLM-judge or retrieval overlap; aggregate to a 0–1 score.
- Context precision/recall: precision = fraction of used answer content supported by context; recall = fraction of relevant context reflected in the answer.
- Unsupported-claim rate: average number of claims without evidence.
2.2. Hallucination Detection
- Provenance checks / citation accuracy: map text spans or entities to sources; flag unmatched spans.
- Context-utilization tests: perturb context (e.g., inject distractors) and measure answer sensitivity; low sensitivity suggests overreliance on prior model knowledge.
- LLM-judged hallucination: ask an evaluator to list unsupported statements; compute hallucination rate.
2.3. Adversarial Robustness
- Prompt-injection suite: canonical patterns (“ignore previous instructions”, role overrides, obfuscations) appended or embedded; mark attack success rate if the output deviates from policy or reveals secrets.
- Jailbreak attempts: known role-play and multi-turn inducements; score compliance/refusal.
- Context poisoning tests: seed vector store with high-similarity malicious docs; probe if the system uses or obeys them.
- Adversarial Question Benchmarking: adversarial QA datasets which contain questions intentionally written to confuse models.
2.4. Bias and Fairness
- Bias Probing Questions: targeted prompts judged for neutrality and fairness.
- Stereotype Confirmation Tests: targeted prompts that contain stereotype and proportion of biased completions.
- Differential performance: accuracy across various subgroups/topic subsets.
2.5. Toxicity and Content Security
- Toxic Content Prompts: datasets with real toxicity prompts; automatic moderation scores; toxicity rate above threshold.
- Hate and Harassment Scenarios: craft scenarios where the user might harass the assistant or vice versa
- Disallowed Content Categories: self-harm, illicit instructions, sexual content; compliance/refusal confusion matrix.
- Toxicity in Retrieved Text: include toxic sources; verify paraphrase/filtering rather than reproduction.
2.6. Security Risks and Privacy
- Privacy/PII Tests: seeded confidential records; verify refusal without authorization.
- Data Leakage via Prompt: explicit and indirect requests; check for leakage.
- Malicious Use Requests: phishing/malware requests; verify refusal.
2.7. Calibration and Abstention
- Uncertainty Elicitation: output a confidence level or a token (e.g., “Confidence = X%” “[Sure]/[Not Sure]”) or sampling variance between multiple answers
- Selective QA Evaluation: bin predictions by confidence; compare empirical accuracy.
- Abstention behavior in practice: compute accuracy vs coverage as the system abstains below a threshold; track selective accuracy and abstention rate on unanswerable queries.
3. Implementation
3.1. Design Overview
- A Query Monitor intercepts the user’s query to detect adversarial patterns (prompt injection attempts, etc.) before it enters the system.
- The Retriever Monitor observes the documents retrieved for the query, evaluating their relevance and adequacy (metrics like recall, precision of retrieval).
- The LLM Output Monitor examines the final answer, evaluating factual correctness (groundedness against retrieved docs), checking for toxicity or bias in language, and assessing compliance with any policies.
- Optionally, an Aggregation/Decision Module can use the monitors’ signals to decide if the answer should be delivered or if an abstention/warning is triggered (this is more for deployment, but we include it to evaluate if such a policy improves outcomes).
- Offline Evaluation Mode: We use a set of static set of test queries (with expected answers or behaviors when available) to systematically evaluate the model.
- Online Monitoring Mode (Continuous): Secure-RAG monitors live interactions and red flags issues in real-time. This is to ensure deployment and run-time safety, but the tool can be used to log metrics.
- LangChain [9] for constructing the RAG pipeline and using its evaluation helpers.
- LlamaIndex [10] for easily managing documents and indexes, to test retrieval.
- FAISS (Facebook AI Similarity Search) [11] as the vector store for retrieval.
- Pre-trained models or APIs (OpenAI GPT-4) for LLM-as-judge evaluations.
- OpenAI Evals [5] style templates for structuring evals and possibly running them at scale.
3.2. Example
- Adversarial Detector: no immediate red flags (it’s not disallowed, just possibly unanswerable).
- Retrieval Monitor: likely finds nothing relevant in the knowledge base (low recall).
- LLM Output: if the model is not calibrated, it might hallucinate an answer: “The capital of Fantasia is Fantasia City.” This would be unsupported (hallucination). The Answer Checker’s groundedness test fails (no doc supports that), factual check fails (Fantasia doesn’t exist). Secure-RAG marks this query as answered incorrectly and with a hallucination. If the system was calibrated to abstain, it might say “I’m not sure” – that would be a better outcome (which Secure-RAG would mark as correct behavior because for an unanswerable query, abstention is what we expect). We could have a simple expected output rule here: expected either "no answer" or a refusal. Deviating is penalized.
4. Result
- Factual Accuracy & Hallucination: On queries supported by context, generative systems produced concise, well-formed answers; the extractive baseline returned grounded spans but limited synthesis. Hallucinations arose mainly on unanswerable or out-of-scope prompts. Introducing confidence-gated abstention reduced unsupported claims by encouraging deferral when evidence was insufficient, at the cost of answering fewer questions.
- Adversarial Robustness: The Query Monitor effectively blocked malicious instructions when injected directly into user queries, showing the value of early-stage filtering. However, adversarial content embedded in retrieved passages still influenced generation, as existing safeguards did not reliably neutralize these hidden directives. This underscores the need for additional layers—retrieval sanitization, instruction-smuggling detection, policy-aware decoding, and output moderation—to strengthen defenses beyond the query stage.
- Bias & Fairness: Given the scope, we observed no clear systematic disparities on simple factual queries. Generative methods can introduce stylistic or inferential bias absent from purely extractive behavior.
- Calibration & Confidence: Eliciting confidence and enforcing abstention thresholds improved alignment between certainty and answer quality, yielding the expected accuracy-coverage trade-off. Separate robustness defenses remain necessary, as abstention alone does not neutralize prompt injection.
5. Conclusion
5.1. Future Work
References
- Es, S.; James, J.; Espinosa-Anke, L.; Schockaert, S. Ragas: Automated Evaluation of Retrieval Augmented Generation. 2025; arXiv:cs.CL/2309.15217]. [Google Scholar]
- Liang, P.; Bommasani, R.; Lee, T.; Tsipras, D.; Soylu, D.; Yasunaga, M.; Zhang, Y.; Narayanan, D.; Wu, Y.; Kumar, A.; et al. Holistic Evaluation of Language Models. 2023; arXiv:cs.CL/2211.09110]. [Google Scholar]
- Truelens. https://github.com/truera/trulens/.
- langsmith. https://www.langchain.com/langsmith.
- Evals, O. https://github.com/openai/evals.
- Eval, C.A.D. https://github.com/confident-ai/deepeval.
- Traceloop. https://www.traceloop.com/.
- Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Transactions on Information Systems 2025, 43, 1–55. [Google Scholar] [CrossRef]
- Langchain. https://github.com/langchain-ai/langchain.
- LlamaIndex. https://github.com/run-llama/llama_index.
- Douze, M.; Guzhva, A.; Deng, C.; Johnson, J.; Szilvasy, G.; Mazaré, P.E.; Lomeli, M.; Hosseini, L.; Jégou, H. The Faiss library. 2025; arXiv:cs.LG/2401.08281]. [Google Scholar]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).