Near-Saturation Accuracy and Safety-Driven Refusals of Frontier Generative Artificial Intelligence Models on the Japanese Pharmaceutical Benchmark as of June 2026

Hiroyasu Sato; Katsuhiko Ogasawara; Hidehiko Sakurai

doi:10.20944/preprints202606.1845.v1

Submitted: 23 June 2026

Posted: 25 June 2026


Abstract

This brief report evaluated June 2026 frontier generative artificial intelligence models on the Japanese National License Examination for Pharmacists. Four models (ChatGPT GPT-5.5, Gemini 3.5 Flash, Claude Opus 4.8, and Claude Fable 5) were tested using all 345 questions from the 107th examination, including image-based items in the original Japanese format. Overall accuracy counted refusals as incorrect, and conditional accuracy among answered questions was calculated for Fable 5. GPT-5.5, 3.5 Flash, and Opus 4.8 achieved near-saturation accuracies of 99.1%, 98.3%, and 96.8%, respectively. Fable 5 achieved 69.9% overall accuracy but 98.8% conditional accuracy among answered questions, refusing 101 questions (29.3%), especially in Biology (90.0%) and Pharmacology (72.5%). Frontier models showed near-saturation performance, but refusal-aware evaluation is necessary for pharmacy benchmarks.

Keywords: large language model; educational measurement; license examination; pharmacists; refusal

Subject: Medicine and Pharmacology - Pharmacy

Background/Rationale

Generative artificial intelligence (AI) models have been increasingly evaluated using health professions licensure examinations [1,2,3]. In pharmacy, such evaluations are important because pharmacists require knowledge of basic science, pharmacology, toxicology, pharmaceutical formulation, clinical pharmacotherapy, and practical medication management. Previous studies showed that ChatGPT GPT-4 exceeded the passing threshold on the Japanese National License Examination for Pharmacists (JNLEP) [1], and a subsequent comparative study of 18 online chat-based large language models released in 2024 showed substantial improvement, with the highest-performing model achieving 87.2% accuracy [2]. The examination questions and official answers are publicly available from the Ministry of Health, Labour and Welfare, Japan [4]. Since then, frontier AI models have continued to improve rapidly [5,6]. However, evaluating high-capability models in pharmacy examinations requires attention not only to correctness but also to refusal behavior because biology-, pharmacology-, toxicology-, and chemistry-related items may overlap with domains monitored by safety mechanisms for high-risk biological or chemical content [7,8].

Objectives

This brief report aimed to provide a June 2026 snapshot of frontier AI performance on the JNLEP. We evaluated four frontier AI models using all 345 publicly available questions from the 107th JNLEP and distinguished conventional overall accuracy from conditional accuracy among answered questions.

Ethics Statement

This manuscript describes only computational analyses and did not involve human participants, human-originated materials, or animal models. Therefore, ethics approval was not required.

Study Design

This was a descriptive computational analysis of AI model outputs. The study used all 345 questions from the 107th JNLEP, administered in February 2022 [4]. No formal reporting guideline was strictly applicable to this model-evaluation study.

Setting

The evaluation was conducted in June 2026 using four frontier AI models available at that time: ChatGPT GPT-5.5 (GPT-5.5), Gemini 3.5 Flash (3.5 Flash), Claude Opus 4.8 (Opus 4.8), and Claude Fable 5 (Fable 5) [5,6,7,8].

Variables

The primary outcome was overall accuracy, defined as the number of correctly answered questions divided by all 345 questions. For Fable 5, refusal-aware metrics were additionally calculated. A refusal was operationally defined as a response in which no final answer option was provided. Conditional accuracy among answered questions was defined as the number of correct answers divided by the number of responses with a final answer option. The refusal rate was defined as the number of refusals divided by the total number of questions.
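The three metrics defined above reduce to simple ratios of counts. The following is a minimal sketch, not the authors' analysis code; the class and function names (`RefusalAwareMetrics`, `refusal_aware_metrics`) are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefusalAwareMetrics:
    overall_accuracy: float      # correct / all questions (refusals count as incorrect)
    conditional_accuracy: float  # correct / questions with a final answer option
    refusal_rate: float          # refusals / all questions


def refusal_aware_metrics(n_correct: int, n_refused: int, n_total: int) -> RefusalAwareMetrics:
    """Compute the three metrics from raw counts (hypothetical helper)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_refused <= n_total:
        raise ValueError("n_refused must lie between 0 and n_total")
    n_answered = n_total - n_refused
    if not 0 <= n_correct <= n_answered:
        raise ValueError("n_correct must lie between 0 and the number of answered questions")
    # Conditional accuracy is undefined when every question was refused.
    conditional = n_correct / n_answered if n_answered else float("nan")
    return RefusalAwareMetrics(
        overall_accuracy=n_correct / n_total,
        conditional_accuracy=conditional,
        refusal_rate=n_refused / n_total,
    )


if __name__ == "__main__":
    # Counts reported for Fable 5 in this report: 241 correct, 101 refused, 345 total.
    m = refusal_aware_metrics(n_correct=241, n_refused=101, n_total=345)
    print(f"overall {m.overall_accuracy:.1%}, "
          f"conditional {m.conditional_accuracy:.1%}, "
          f"refusal {m.refusal_rate:.1%}")
```

Overall accuracy and refusal rate share the denominator of all 345 questions, whereas conditional accuracy uses only the answered questions, so the two accuracy figures diverge as the refusal count grows.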

Data Sources/Measurement

The original Japanese text of each question was entered without English translation. For questions containing figures, diagrams, graphs, or chemical structures, the visual components were entered as images. Each model was evaluated in a single run for each question. No additional prompt engineering was performed. Model outputs were compared with the official answers published by the Ministry of Health, Labour and Welfare, Japan [4]. A response was considered correct if the selected option matched the official answer. A response was considered incorrect if the selected answer differed from the official answer, if the required number of answer options was not selected, or if no final answer option was provided (that is, the model refused to answer).
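The scoring rule described above can be sketched as a small classification function. This is an illustrative sketch under stated assumptions, not the authors' procedure: answer options are assumed to be represented as sets of integers, a response with no final option is represented as `None` or an empty set, and the names `Outcome` and `score_response` are hypothetical.

```python
from enum import Enum
from typing import FrozenSet, Optional


class Outcome(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    REFUSAL = "refusal"  # tracked separately, but still scored as incorrect overall


def score_response(selected: Optional[FrozenSet[int]], official: FrozenSet[int]) -> Outcome:
    """Classify one model response against the official answer key."""
    # No final answer option provided: operationally a refusal.
    if not selected:
        return Outcome.REFUSAL
    # Exact set match covers both single-answer and multiple-answer items;
    # selecting too few or too many options fails the comparison.
    if selected == official:
        return Outcome.CORRECT
    return Outcome.INCORRECT


def counts_toward_overall_accuracy(outcome: Outcome) -> bool:
    """Only correct responses count; refusals and wrong answers both score zero."""
    return outcome is Outcome.CORRECT
```

Keeping refusals as a distinct outcome, rather than folding them into the incorrect category at scoring time, is what allows conditional accuracy and refusal rate to be derived later from the same item-level records.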

Bias

Potential sources of bias included training-data contamination because the JNLEP questions and official answers are publicly available, single-run variability, and changes in commercial AI services over time. These sources could not be fully eliminated, but refusals were explicitly classified and counted to make answerability transparent.

Main Results

All four models exceeded the official passing threshold of the 107th JNLEP. GPT-5.5 achieved the highest performance, correctly answering 342 of 345 questions, corresponding to an overall accuracy of 99.1%. 3.5 Flash correctly answered 339 questions, corresponding to 98.3% accuracy. Opus 4.8 correctly answered 334 questions, corresponding to 96.8% accuracy (Figure 1). Fable 5 showed a distinct performance profile. When refusals were counted as incorrect responses, the model correctly answered 241 of 345 questions, corresponding to an overall accuracy of 69.9%. However, Fable 5 refused to answer 101 of 345 questions, corresponding to an overall refusal rate of 29.3%. Among the 244 questions for which it provided a valid answer, 241 were correct, giving a conditional accuracy of 98.8%. The refusal rate of Fable 5 was highly subject-dependent. Refusals were most frequent in Biology, where 18 of 20 questions were refused (90.0%), followed by Pharmacology, where 29 of 40 questions were refused (72.5%). In contrast, no refusals occurred in Practice, despite this being the largest subject category with 95 questions (Figure 2). Subject-wise analysis further showed that all four frontier models achieved near-saturation accuracy across all subjects when Fable 5 was evaluated only on questions for which it provided valid answers (Table 1).
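The percentages in this section follow directly from the reported counts. The short script below recomputes them as a consistency check; it uses only counts stated in the text, and the variable and function names are hypothetical.

```python
TOTAL_QUESTIONS = 345

# Correctly answered questions per model, as reported above.
correct_by_model = {
    "GPT-5.5": 342,
    "3.5 Flash": 339,
    "Opus 4.8": 334,
    "Fable 5": 241,
}

# (refused, total) per subject for Fable 5; only the subjects itemized in the text.
fable5_refusals_by_subject = {
    "Biology": (18, 20),
    "Pharmacology": (29, 40),
    "Practice": (0, 95),
}


def percent(numerator: int, denominator: int) -> float:
    """Return a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


overall_accuracy = {
    model: percent(n_correct, TOTAL_QUESTIONS)
    for model, n_correct in correct_by_model.items()
}
subject_refusal_rate = {
    subject: percent(refused, total)
    for subject, (refused, total) in fable5_refusals_by_subject.items()
}

# Fable 5 answered 345 - 101 = 244 questions, of which 241 were correct.
fable5_answered = TOTAL_QUESTIONS - 101
fable5_conditional_accuracy = percent(241, fable5_answered)

if __name__ == "__main__":
    print(overall_accuracy)
    print(subject_refusal_rate)
    print(fable5_answered, fable5_conditional_accuracy)
```

The three subjects itemized here account for 47 of the 101 refusals; the remaining 54 fall in subject categories whose counts are not given in the running text.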

Key Results

June 2026 frontier AI models achieved near-saturation performance on the 107th JNLEP. GPT-5.5, 3.5 Flash, and Opus 4.8 achieved overall accuracies of 99.1%, 98.3%, and 96.8%, respectively. The most notable finding was the discrepancy between overall accuracy and conditional accuracy in Fable 5. Its overall accuracy was 69.9%, but its accuracy among answered questions was 98.8%, suggesting a distinction between competence and answerability.

Interpretation

The subject-wise refusal pattern is unlikely to reflect general task difficulty alone. Biology and Pharmacology questions may include concepts related to biological mechanisms, microorganisms, toxins, receptors, pharmacological actions, adverse effects, and drug mechanisms, which may overlap with domains monitored by safety mechanisms for high-risk biological or chemical content [7,8]. Compared with medical or nursing licensure examinations, pharmacy examinations may place greater emphasis on biology, pharmacology, toxicology, chemical properties, and drug mechanisms [1,2,3]. Therefore, legitimate pharmacy-education questions may be more likely to resemble sensitive biology- or chemistry-related prompts from the perspective of safety classifiers. Although this interpretation cannot be confirmed without access to the internal classifier criteria, it suggests that refusal behavior may affect pharmacy benchmarks more strongly than other healthcare licensure benchmarks.

Comparison with Previous Studies

The near-saturation performance observed for GPT-5.5, 3.5 Flash, and Opus 4.8 represents a substantial increase from the highest accuracy of 87.2% reported in a previous 2024 comparative evaluation using the same examination [2]. However, because the JNLEP questions and official answers are publicly available [4], training-data contamination or memorization cannot be excluded, and the results should not be regarded as direct evidence of autonomous clinical competence [9,10].

Limitations/Generalizability

This study has several limitations. Only one examination year was evaluated. Training-data contamination cannot be excluded because the examination materials are publicly available [4,9,10]. Model behavior was assessed at a single time point in June 2026, and commercial AI services may change over time [5,6,7,8]. The internal safety mechanisms of the models were not accessible, and the causes of refusals could not be directly verified. Finally, this study evaluated final-answer correctness but did not assess the quality of explanations, citation accuracy, or clinical appropriateness. The findings are most generalizable to educational benchmark settings using public licensure examinations with fixed correct answers.

Suggestions

As model performance approaches saturation, refusal-aware evaluation becomes necessary. In clinical settings, conservative refusal behavior may be desirable for unsafe or patient-specific requests. In standardized licensure examinations with defined correct answers, however, excessive refusals can reduce functional utility and distort apparent subject-wise performance.

Conclusion

Frontier generative AI models have reached near-saturation performance on the Japanese National License Examination for Pharmacists. However, Claude Fable 5 demonstrated high conditional accuracy with frequent refusals, especially in Biology and Pharmacology. Refusal-aware evaluation is necessary for interpreting generative AI performance in pharmacy education and healthcare.

Author Contributions

Conceptualization: HSato. Data curation: HSato. Methodology/formal analysis/validation: HSato. Project administration: HSakurai. Funding acquisition: None. Writing - original draft: HSato. Writing - review & editing: HSato, KO, HSakurai.

Conflicts of Interest

No potential conflict of interest relevant to this article was reported.

Funding

None.

Data Availability

Dataset 1. Item-level model responses and refusal classifications. The examination questions and official answers are publicly available from the Ministry of Health, Labour and Welfare, Japan [4].

Acknowledgments

None.

Use of Generative AI

Generative AI tools were used for language editing. The authors reviewed and verified all content, analyses, references, and conclusions and take full responsibility for the final manuscript.

References

1. Sato, H.; Ogasawara, K. ChatGPT (GPT-4) passed the Japanese National License Examination for Pharmacists in 2022, answering all items including those with diagrams: a descriptive study. J. Educ. Eval. Health Prof. 2024, 21, 4.
2. Sato, H.; Ogasawara, K.; Sakurai, H. Performance evaluation of 18 generative AI models (ChatGPT, Gemini, Claude, and Perplexity) in the 2024 Japanese pharmacist licensing examination: comparative study. JMIR Med. Educ. 2025, 11, e76925.
3. Jin, H.K.; Lee, H.E.; Kim, E. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis. BMC Med. Educ. 2024, 24, 1013.
4. Ministry of Health, Labour and Welfare, Japan. The 107th Japanese National License Examination for Pharmacists: questions and answers [Internet]; Ministry of Health, Labour and Welfare: Tokyo, 2022; Available online: https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/0000198924.html (accessed on 13 June 2026).
5. OpenAI. Introducing GPT-5.5 [Internet]; OpenAI: San Francisco, 2026; Available online: https://openai.com/index/introducing-gpt-5-5/ (accessed on 13 June 2026).
6. Google. Gemini 3.5: frontier intelligence with action [Internet]; Google: Mountain View, 2026; Available online: https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/ (accessed on 13 June 2026).
7. Anthropic. Claude Fable 5 and Claude Mythos 5 [Internet]; Anthropic: San Francisco, 2026; Available online: https://www.anthropic.com/news/claude-fable-5-mythos-5 (accessed on 13 June 2026).
8. Amazon Web Services. Anthropic Claude Fable 5 on AWS: Mythos-class capabilities with built-in safeguards now available [Internet]; Amazon Web Services: Seattle, 2026; Available online: https://aws.amazon.com/blogs/aws/anthropic-claude-fable-5-on-aws-mythos-class-capabilities-with-built-in-safeguards-now-available/ (accessed on 13 June 2026).
9. Dong, Y.; Jiang, X.; Liu, H.; Jin, Z.; Gu, B.; Yang, M.; et al. Generalization or memorization: data contamination and trustworthy evaluation for large language models. In: Findings of the Association for Computational Linguistics: ACL 2024. 2024; pp. 12039–12050.
10. Dekoninck, J.; Muller, M.N.; Vechev, M. ConStat: performance-based contamination detection in large language models. arXiv [Preprint]. 2024. Available online: https://arxiv.org/abs/2405.16281 (accessed on 13 June 2026).

Figure 1. Overall accuracy of the four frontier AI models on the 107th Japanese National License Examination for Pharmacists, calculated using all questions and answered questions only.

Figure 2. Refusal rates across subject categories for Claude Fable 5 on the 107th Japanese National License Examination for Pharmacists.

Table 1. Accuracy according to subject for the four frontier AI models on the 107th Japanese National License Examination for Pharmacists.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
