Background: Recent advances in artificial intelligence (AI) have produced ChatGPT-4o, a multimodal large language model (LLM) capable of processing both text and image inputs. Although ChatGPT has demonstrated usefulness on medical examinations, few studies have evaluated its image-analysis performance.

Methods: This study compared GPT-4o and GPT-4 on publicly available questions from the 116th–118th Japan National Medical Licensing Examinations (JNMLE), each consisting of 400 questions. Both models answered in Japanese using simple prompts, with screenshots supplied for image-based questions. Accuracy was analyzed across essential, general, and clinical questions, with statistical comparisons by chi-square tests.

Results: GPT-4o consistently outperformed GPT-4, achieving passing scores on all three examinations. On the 118th JNMLE, GPT-4o scored 457 points versus 425 for GPT-4. GPT-4o showed higher accuracy on image-based questions in the 116th and 117th examinations, although the difference on the 118th was not statistically significant. On text-based questions, GPT-4o demonstrated superior medical knowledge, clinical reasoning, and ethical response behavior, notably avoiding prohibited (contraindicated) options.

Conclusion: Overall, GPT-4o exceeded GPT-4 in both text and image domains, suggesting strong potential as a diagnostic aid and educational resource. Its balanced performance across modalities highlights its promise for integration into medical education and clinical decision support.
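To illustrate the kind of category-level comparison the Methods describe, the following Python sketch runs a chi-square test of independence on two models' correct/incorrect answer counts. This is not the authors' analysis code, and the counts shown are hypothetical placeholders, not the study's data.

```python
# Minimal sketch of a chi-square comparison of two models' accuracy,
# assuming per-model counts of correct and incorrect answers.
from scipy.stats import chi2_contingency

# Rows: models; columns: [correct, incorrect] counts out of 400 questions.
# These numbers are assumptions for illustration only.
observed = [
    [366, 34],  # hypothetical GPT-4o counts
    [340, 60],  # hypothetical GPT-4 counts
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.4f}")
# A p-value below 0.05 would indicate a statistically significant
# difference in accuracy between the two models.
```

In practice, the same test could be applied separately to each subset of questions (essential, general, clinical, or image-based) to obtain the per-category comparisons reported in the Results.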