Submitted:
25 August 2025
Posted:
26 August 2025
Abstract
Keywords:
1. Introduction

- We introduce WebProber, a highly extensible web testing framework that leverages AI agents to simulate human behavior on the web.
- We present a case study on 120 personal websites in the wild, on which WebProber found 29 usability issues.
- We release our code and our human-annotated bug database for future research (see footnote 1).
2. Related Work
Browser-Use Agents
Automated Web Testing

3. WebProber
Prompt Generation
Interaction Simulation
Bug Report Generation
4. Experiments
4.1. Results
Our Approach Effectively Identifies Usability Issues That Impact User Experience
- Broken or misdirected links. The most common class of detected bugs is broken or misdirected links. We present an example in Figure 3: the agent identified that a project description was inconsistent with the paper linked through the "Read more here" button.
- Logical inconsistencies. Finally, we find that WebProber can also detect logical inconsistencies in website content, typically resulting from typographical errors. These errors sometimes lead to factual inaccuracies or user confusion. For example, in Figure 3, the agent identified a spring course syllabus (the term was determined from its calendar dates) that incorrectly scheduled a "Fall break" week.
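To make the syllabus example concrete, the following is a minimal sketch of the kind of inconsistency involved. It is not how WebProber works: WebProber relies on the agent's judgment while browsing, not on a scripted rule. The `ScheduleEntry` type, the `infer_term` month ranges, and the keyword matching are all hypothetical assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ScheduleEntry:
    """One row of a course schedule (hypothetical structure)."""
    week_start: date
    label: str


# Term names that may appear in a schedule label (assumed vocabulary).
TERMS = ("spring", "summer", "fall")


def infer_term(d: date) -> str:
    """Assumed heuristic: Jan-May is spring, Aug-Dec is fall, otherwise summer."""
    if 1 <= d.month <= 5:
        return "spring"
    if 8 <= d.month <= 12:
        return "fall"
    return "summer"


def find_term_mismatches(entries: list[ScheduleEntry]) -> list[str]:
    """Flag labels that name a term other than the one implied by the date."""
    issues = []
    for entry in entries:
        term = infer_term(entry.week_start)
        label = entry.label.lower()
        for other in TERMS:
            if other != term and other in label:
                issues.append(
                    f"{entry.week_start.isoformat()}: label '{entry.label}' "
                    f"mentions {other}, but the date falls in {term}"
                )
    return issues


schedule = [
    ScheduleEntry(date(2025, 3, 10), "Fall break"),
    ScheduleEntry(date(2025, 3, 17), "Spring break"),
]
issues = find_term_mismatches(schedule)
```

A fixed rule like this only catches the one pattern it encodes, which is part of the motivation for letting an agent assess coherence in context instead.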
4.2. Discussion
False Positives
Undetected Bugs
5. Conclusion and Future Work
Appendix A. Example Bug Report Generated by WebProber

Appendix B. Implementation Details
Appendix B.1. Proposing Potential Bugs
| Iteration | Prompt |
|---|---|
| Initial | Go to the website [URL], a personal website. Explore the content, click on links, and occasionally pause to assess whether what is shown and linked on the website is coherent and appropriate. Unreasonable or problematic issues include, but are not limited to: broken or mismatched images/links, UI glitches/incapabilities, illogical or unfunctional web design, or textual errors, etc. |
| One Iteration | Go to the website [URL], a personal website. Systematically explore all accessible content, including menus, links, embedded media, interactive elements, and downloadable materials for WEBSITE BUGS. You should prioritize areas that are often more likely to contain issues. At each stage, critically evaluate whether the displayed information, layout, and behavior align with expectations for a functional and professional web experience. Carefully inspect for issues such as, but not limited to: (1) Broken elements: dead/missing links, 404 pages, failed image or video loads. (2) Interaction failures: non-responsive buttons, malfunctioning forms or filters, non-working download or redirect actions. (3) UI/UX flaws: lack of visual feedback, missing tooltips/ESC buttons, layout inconsistencies, uncustomized templates, poor mobile compatibility. (4) Content inconsistencies: outdated or contradictory data (e.g., dates or names), mismatched references or external links, typos or formatting errors. (5) Domain-specific bugs: for instance, broken external links to publications, projects, GitHub, Google Scholar, etc. Incorrect anchor links (e.g., internal navigation like #about or #projects not working). Outdated or dead email links (e.g., mailto: pointing to deprecated addresses). Missing or malformed citation info (e.g., BibTeX files, DOI links not rendering or downloading properly). Mismatched thumbnails or missing alt-text on research project previews. Videos or talks not embedded properly (e.g., iframe blocked by CORS). For each identified issue, consider its impact, repeatability, and specific trigger (e.g., "clicking X under condition Y leads to error Z"). |
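Both prompts in the table above contain a `[URL]` placeholder. As a minimal sketch of how such a template might be instantiated per website, the helper below substitutes the placeholder and fails loudly if it is absent. The function name `fill_prompt` and the abbreviated template string are assumptions for illustration; the released code may handle this differently.

```python
def fill_prompt(template: str, url: str, placeholder: str = "[URL]") -> str:
    """Substitute the target website into a prompt template.

    Raises ValueError when the placeholder is missing, so a malformed
    template is caught before any agent run is started.
    """
    if placeholder not in template:
        raise ValueError(f"template does not contain {placeholder!r}")
    return template.replace(placeholder, url)


# Only the opening sentence of the prompt is used here, to keep the example short.
opening_sentence = "Go to the website [URL], a personal website."
prompt = fill_prompt(opening_sentence, "https://example.org")
```

Keeping the template text separate from the substitution step means the same helper can serve both the initial prompt and any refined iteration of it.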
Appendix B.2. Generating Bug Reports
| Prompt |
|---|
| Please analyze the following agent run trajectory and identify any potential bugs or glitches in the website being tested. Consider both feature bugs (missing or incorrect functionality) and glitch-like bugs (visual or behavioral anomalies). Note that the type of bug is not always obvious, so don’t be afraid to make an assumption. For example, if the website does not support certain features that the agent is trying to use, that is a bug (e.g. the agent is trying to use the "add to cart" feature, but the website does not have a cart, or that the agent is searching in some language that the website does not support). For each step, I’ll provide: 0. The screenshot of the current browser state 1. The agent’s evaluation of the step 2. The next goal 3. The action taken Please analyze the entire sequence of steps and identify: 1. Any unexpected behaviors or errors of the website itself (*note: not the agent’s actions*) 2. Missing or incorrect functionality 3. Visual glitches or UI inconsistencies 4. Any other anomalies that might indicate bugs Here’s the step-by-step trajectory: [Trajectory] Based on the above trajectory, please provide: 1. A summary of any bugs or glitches identified 2. The specific steps where issues occurred 3. The nature of each issue (feature bug, visual glitch, etc.) 4. Any patterns or recurring problems 5. Recommendations for fixing the identified issues For each identified issue, please specify: - The step number where it occurred - Whether it’s a feature bug or visual glitch - The severity of the issue - The expected behavior vs actual behavior |
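The prompt above expects each step of the trajectory to carry four fields (screenshot, evaluation, next goal, action) and contains a `[Trajectory]` placeholder. A minimal sketch of serializing a trajectory into that slot follows. The `Step` dataclass, the function names, and the use of a file path as a textual stand-in for the screenshot are assumptions; in practice the screenshot would presumably be passed to the model as an image.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One agent step, mirroring the four fields listed in the prompt."""
    screenshot_path: str  # textual stand-in for the screenshot image (assumption)
    evaluation: str
    next_goal: str
    action: str


def render_trajectory(steps: list[Step]) -> str:
    """Render steps as numbered blocks using the prompt's 0-3 field order."""
    blocks = []
    for index, step in enumerate(steps, start=1):
        blocks.append(
            f"Step {index}\n"
            f"0. Screenshot: {step.screenshot_path}\n"
            f"1. Evaluation: {step.evaluation}\n"
            f"2. Next goal: {step.next_goal}\n"
            f"3. Action: {step.action}"
        )
    return "\n\n".join(blocks)


def build_report_prompt(template: str, steps: list[Step]) -> str:
    """Insert the rendered trajectory into the report prompt template."""
    placeholder = "[Trajectory]"
    if placeholder not in template:
        raise ValueError(f"template does not contain {placeholder!r}")
    return template.replace(placeholder, render_trajectory(steps))


steps = [
    Step("step_1.png", "Home page loaded", "Open the projects page", "click('Projects')"),
    Step("step_2.png", "Projects page shows a 404", "Go back", "go_back()"),
]
report_prompt = build_report_prompt(
    "Here is the step-by-step trajectory: [Trajectory]", steps
)
```

Numbering the steps explicitly matters because the prompt asks the model to report the step number at which each issue occurred.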
| 1 | The GitHub repository is available at https://github.com/TianyiPeng/WebProber. The bug database is available at https://webbugvid.netlify.app/. Due to privacy concerns with some personal websites, we included only a subset of the bugs in the database. |
| 2 | Since the authors may not have found all possible bugs, the actual coverage may be lower. |


Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).