Our work consists of two main contributions:
Context: Reproducing game bugs, in our case crash bugs in continuously evolving games like Minecraft, is notoriously manual and time-consuming, and it is difficult to automate. Despite the success of LLM-driven bug reproduction in other software domains, games, with their complex interactive environments, remain largely unaddressed.
Objective: This paper introduces BugCraft, a novel end-to-end framework designed to automate the reproduction of crash bugs in Minecraft directly from user-submitted bug reports, addressing the critical gap in automated game bug reproduction.
Method: BugCraft employs a two-stage approach. First, a Step Synthesizer leverages LLMs and Minecraft Wiki knowledge to transform bug reports into high-quality, structured steps to reproduce (S2R). Second, an Action Model, powered by a vision-based LLM agent (GPT-4o) and a custom macro API, executes these S2R steps within Minecraft to trigger the reported crash.
Results: Evaluated on BugCraft-Bench, our framework successfully reproduced 30.23% of crash bugs end-to-end. The Step Synthesizer achieved 66.28% accuracy in generating correct bug reproduction plans, highlighting its effectiveness in interpreting and structuring bug report information.
To enhance our framework with comprehensive game knowledge, we scraped the official Minecraft Wiki using the MediaWiki API. The extracted content is processed and used for fuzzy matching during bug reproduction, which reduces reliance on the LLM's pre-training data.
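As an illustration of the fuzzy-matching idea, the following minimal sketch looks up a term from a bug report in a small list of wiki page titles using Python's standard `difflib`. The title list, the function name `match_wiki_titles`, and the cutoff value are assumptions made for this example; they are not taken from the BugCraft implementation, which may use a different matching method.

```python
from difflib import get_close_matches

# Hypothetical sample of page titles; a real index would be built from the scraped wiki.
WIKI_TITLES = [
    "Redstone Repeater",
    "Redstone Comparator",
    "Crafting Table",
    "Ender Dragon",
    "Command Block",
]


def match_wiki_titles(term, titles=WIKI_TITLES, limit=3, cutoff=0.6):
    """Return wiki titles that approximately match a term from a bug report."""
    # Compare case-insensitively, then map back to the original title casing.
    by_lowercase = {title.lower(): title for title in titles}
    hits = get_close_matches(term.lower(), list(by_lowercase), n=limit, cutoff=cutoff)
    return [by_lowercase[hit] for hit in hits]


if __name__ == "__main__":
    # A misspelled item name still resolves to the intended page first.
    print(match_wiki_titles("redstone repeter"))
```

The matched titles could then be used to fetch the corresponding page text and hand it to the LLM as context, so that item names and game mechanics come from the wiki rather than from the model's memory.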
Access our curated dataset and contribute to advancing automated game bug reproduction research.
The BugCraft Framework processes bug reports through a two-stage pipeline:
User-submitted crash report
Transforms unstructured bug reports into clear, executable steps
Wiki-powered knowledge enhancement
Multi-stage refinement pipeline
Smart step clustering system
Automated reasoning trajectories
Executes synthesized steps in the game environment
Intelligent step-by-step execution
Screen-based decision making
JSON API integration
Crash trajectory logging
Bug successfully reproduced and crash trajectory logged
Execution limit exceeded or unable to reproduce bug
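The two terminal outcomes above can be sketched as a bounded execution loop. This is a simplified illustration under stated assumptions: `run_action_model`, `Trajectory`, the `act` and `crashed` callbacks, and the `max_actions` budget are hypothetical names, and the real Action Model decides on actions from screenshots rather than mapping one step to one action.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    """Log of executed actions plus the final outcome of the run."""
    actions: List[str] = field(default_factory=list)
    outcome: str = "not_reproduced"


def run_action_model(
    steps: List[str],
    act: Callable[[str], str],
    crashed: Callable[[], bool],
    max_actions: int = 10,
) -> Trajectory:
    """Execute S2R steps until the game crashes or the action budget runs out."""
    trajectory = Trajectory()
    for step in steps:
        if len(trajectory.actions) >= max_actions:
            break  # execution limit exceeded
        # `act` stands in for the agent performing one step in the game.
        trajectory.actions.append(act(step))
        if crashed():
            trajectory.outcome = "reproduced"
            break
    return trajectory


if __name__ == "__main__":
    game = {"actions_taken": 0}

    def fake_act(step: str) -> str:
        game["actions_taken"] += 1
        return f"executed: {step}"

    # Stub game that crashes after the second action.
    result = run_action_model(
        ["open world", "place block", "break block"],
        fake_act,
        lambda: game["actions_taken"] >= 2,
    )
    print(result.outcome, result.actions)
```

Keeping the log in a single object means a failed run still leaves a trajectory that can be inspected afterwards.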
Our evaluation demonstrates that BugCraft successfully reproduced 30.23% of crash bugs end-to-end on the BugCraft-Bench dataset (86 valid reports). This highlights the potential of LLM agents for automated bug reproduction in complex game environments.
Successfully reproduced 26 out of 86 crash bugs end-to-end
Generated correct plans for 57 out of 86 reports
Inter-rater agreement: Cohen's Kappa: 0.70, Percentage Agreement: 83.0%
Successfully executed 22 out of 57 correct plans
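For readers unfamiliar with the inter-rater statistics quoted above, the sketch below computes percentage agreement and Cohen's Kappa for two raters using the standard definition, kappa = (p_o - p_e) / (1 - p_e). The labels in the example are made up for illustration and are not the study's annotation data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Return (observed agreement, Cohen's Kappa) for two equal-length label lists."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # p_o: fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: agreement expected by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both raters used one identical label throughout
    return observed, (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Illustrative labels only: "ok" = plan judged correct, "bad" = judged incorrect.
    a = ["ok", "ok", "ok", "bad", "bad", "bad", "ok", "bad", "ok", "ok"]
    b = ["ok", "ok", "bad", "bad", "bad", "bad", "ok", "bad", "ok", "ok"]
    agreement, kappa = cohens_kappa(a, b)
    print(f"agreement={agreement:.2f}, kappa={kappa:.2f}")
```

Under the same definition, a percentage agreement of 83.0% together with a Kappa of 0.70 implies a chance agreement of roughly 0.43.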
BugCraft demonstrates the potential of LLM agents for automated game bug reproduction. The Step Synthesizer shows promising capability (66.28% accuracy), while the Action Model's execution phase remains the major bottleneck (38.60% success rate with correct plans). The agent's ability to recover from faulty plans (13.79% recovery rate) suggests potential for improvement through enhanced decision-making and interaction capabilities.
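The percentages above follow from the reported counts, as the short calculation below shows. The numbers of faulty plans (29) and recovered reproductions (4) are not stated directly on this page; they are inferred here by subtraction, on the assumption that every end-to-end success came from either a correct plan or a recovered faulty one.

```python
# Counts reported in the results.
TOTAL_REPORTS = 86
CORRECT_PLANS = 57
REPRODUCED_END_TO_END = 26
REPRODUCED_FROM_CORRECT_PLANS = 22


def pct(part, whole):
    """Percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Inferred by subtraction (assumption, see the note above).
faulty_plans = TOTAL_REPORTS - CORRECT_PLANS
recovered = REPRODUCED_END_TO_END - REPRODUCED_FROM_CORRECT_PLANS

rates = {
    "end_to_end": pct(REPRODUCED_END_TO_END, TOTAL_REPORTS),
    "plan_accuracy": pct(CORRECT_PLANS, TOTAL_REPORTS),
    "execution_given_correct_plan": pct(REPRODUCED_FROM_CORRECT_PLANS, CORRECT_PLANS),
    "recovery_from_faulty_plan": pct(recovered, faulty_plans),
}

if __name__ == "__main__":
    for name, value in rates.items():
        print(f"{name}: {value:.2f}%")
```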
For a comprehensive understanding of our methodology, detailed results, and in-depth analysis of BugCraft's performance in automated game bug reproduction, please read our full paper.
If you use BugCraft in your research, please cite our paper:
@article{yapagci2024bugcraft,
  title={End-to-End Crash Bug Reproduction Using LLM Agents in Minecraft},
  author={Yapağcı, Eray and Öztürk, Yavuz Alp Sencer and Tüzün, Eray},
  journal={arXiv preprint arXiv:2503.20036},
  year={2024}
}
For questions, feedback, or collaborations, please reach out to us: