Accepted to ASE 2025 - 40th IEEE/ACM International Conference on Automated Software Engineering

End-to-End Crash Bug Reproduction for Minecraft

Eray Yapagci, Yavuz Alp Sencer Ozturk, Eray Tuzun

Computer Engineering Department

Bilkent University

Leaderboard

Model                        % Success   Org       Date
BugCraft (GPT-4.1)           34.9        Bilkent   2025-02-07
BugCraft (GPT-4o)            30.2        Bilkent   2025-02-07
OpenAI Computer Use Agent    25.5        OpenAI    2025-02-07

Abstract

Context: Reproducing game bugs, in our case crash bugs in continuously evolving games like Minecraft, is notoriously manual and time-consuming, and it remains challenging to automate. Despite the success of LLM-driven bug reproduction in other software domains, games, with their complex interactive environments, remain largely unaddressed.

Objective: This paper introduces BugCraft, a novel end-to-end framework designed to automate the reproduction of crash bugs in Minecraft directly from user-submitted bug reports, addressing the critical gap in automated game bug reproduction.

Method: BugCraft employs a two-stage approach: first, a Step Synthesizer leverages LLMs and Minecraft Wiki knowledge to transform bug reports into high-quality, structured steps to reproduce (S2R). Second, an Action Model, powered by vision-based LLM agents (GPT-4o and GPT-4.1) and a custom macro API, executes these S2R steps within Minecraft to trigger the reported crash.

Results: Evaluated on BugCraft-Bench, our framework with GPT-4.1 successfully reproduced 34.9% of crash bugs end-to-end, a 37% relative improvement over the strongest baseline, OpenAI's Computer Use Agent. The Step Synthesizer achieved 66.28% accuracy in generating correct bug reproduction plans, highlighting its effectiveness in interpreting and structuring bug report information.

BugCraft Overview

Figure 1: Overview of the BugCraft framework

BugCraft Framework Architecture

Figure 2: The BugCraft framework, illustrating the two-stage process of S2R synthesis and action model execution.

Methodology

Our framework processes each bug report through a three-part pipeline: preprocessing, step synthesis, and action execution. The Step Synthesizer uses knowledge augmentation and multi-stage refinement to generate high-quality reproduction steps.
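The two-stage flow can be sketched roughly as follows. This is an illustrative skeleton only, not BugCraft's actual code: `Step`, `synthesize_steps`, `run_pipeline`, and the callables passed in are hypothetical stand-ins for the LLM-backed Step Synthesizer and the vision-based Action Model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One structured step to reproduce (S2R). Field names are hypothetical."""
    index: int
    instruction: str


def synthesize_steps(report_text: str,
                     llm: Callable[[str], List[str]]) -> List[Step]:
    """Stage 1: turn a raw bug report into ordered S2R steps.

    `llm` is an assumed callable standing in for the LLM call plus any
    wiki-based knowledge augmentation.
    """
    instructions = llm(report_text)
    return [Step(i, text) for i, text in enumerate(instructions, start=1)]


def run_pipeline(report_text: str,
                 llm: Callable[[str], List[str]],
                 execute_step: Callable[[Step], bool]) -> bool:
    """Stage 2: execute each step and stop once a crash is observed.

    `execute_step` is an assumed callable standing in for the action model;
    it returns True when the game has crashed after performing the step.
    """
    for step in synthesize_steps(report_text, llm):
        if execute_step(step):
            return True
    return False
```

Passing the LLM and the executor in as plain callables keeps the sketch testable without a game client; a real system would replace both with model-backed components.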

Step Synthesizer Component

Figure 3: Step Synthesizer Component

Action Model Component

Figure 4: Action Model Component

Results

Our evaluation shows that BugCraft with GPT-4.1 reproduced 34.9% of crash bugs end-to-end on the BugCraft-Bench dataset (86 valid reports), a 37% relative improvement over the strongest baseline, OpenAI's Computer Use Agent (25.5%). This highlights the potential of LLM agents for automated bug reproduction in complex game environments.

Multi-Model Performance

GPT-4.1 Success Rate: 34.9% (30 out of 86)

GPT-4o Success Rate: 30.2% (26 out of 86)

Oracle Coverage (Combined): 41.9% (36 out of 86)

Across the two models, 20 bugs were reproduced by both, 10 only by GPT-4.1, and 6 only by GPT-4o, which suggests the models have complementary strengths.
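The per-model and combined rates follow directly from these overlap counts. The short check below uses only the numbers reported above; the variable and function names are illustrative.

```python
TOTAL_REPORTS = 86

# Overlap counts as reported in this section.
both = 20
only_gpt41 = 10
only_gpt4o = 6


def rate(count, total=TOTAL_REPORTS):
    """Success rate as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


gpt41_total = both + only_gpt41                  # bugs reproduced by GPT-4.1
gpt4o_total = both + only_gpt4o                  # bugs reproduced by GPT-4o
oracle_total = both + only_gpt41 + only_gpt4o    # reproduced by at least one

print(rate(gpt41_total), rate(gpt4o_total), rate(oracle_total))
```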

Step Synthesizer Performance

Accuracy: 66.28% (57 out of 86 reports)

Inter-rater Agreement: Cohen's Kappa: 0.70, Percentage Agreement: 83.0%
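For readers unfamiliar with the statistic, Cohen's Kappa corrects raw percentage agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a generic implementation; the example labels are made up for illustration and are not the ratings collected in this study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # p_o: fraction of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: chance agreement from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels, not data from the paper.
toy_a = ["ok", "ok", "ok", "bad", "bad", "bad"]
toy_b = ["ok", "ok", "bad", "bad", "bad", "bad"]
print(cohens_kappa(toy_a, toy_b))
```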

Common Failure Patterns (the percentages sum to more than 100%, so a plan can show more than one pattern):

  • Wrong Command (48.28%): Commands not recognized by Minecraft due to outdated syntax or hallucination
  • Missing Step (34.49%): Critical steps omitted, making reproduction impossible
  • Logic Error (31.03%): Contradictory or impossible step sequences

Baseline Comparison

BugCraft outperformed all baseline models, achieving a 37% relative improvement over OpenAI's Computer Use Agent and a roughly 10-fold reduction in cost per successful reproduction compared to human experts.
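The 37% figure is a relative improvement, not a difference in percentage points. A quick check using the reported success rates for BugCraft (GPT-4.1) and OpenAI's Computer Use Agent (the function name is illustrative):

```python
def relative_improvement(new_rate, baseline_rate):
    """Relative improvement of new_rate over baseline_rate, in percent."""
    return 100 * (new_rate - baseline_rate) / baseline_rate


# 34.9% (BugCraft, GPT-4.1) versus 25.5% (OpenAI CUA).
improvement = relative_improvement(34.9, 25.5)
print(round(improvement))
```

The absolute gap is 9.4 percentage points; dividing by the baseline rate of 25.5% gives the relative figure.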

System               Success Rate   Time        Cost
Human Expert         ~83%           20 min      $28.20
BugCraft (GPT-4.1)   34.9%          10 min      $1.16
BugCraft (GPT-4o)    30.2%          15.56 min   $1.45
OpenAI CUA           25.5%          6.37 min    $0.65
UI-TARS-1.5-7B       0.0%           3.27 min    $0.02

Key Takeaways:

  • 37% relative improvement over OpenAI Computer Use Agent through game-specific optimizations
  • $3.44 per successful reproduction vs. $35.25 for human experts (10x cost reduction)
  • 10 minutes average completion time vs. 3.41 days human turnaround (MTTR)

For detailed failure analysis and comprehensive breakdowns, please refer to our full paper.


BugCraft-Bench Dataset

We introduce BugCraft-Bench, a curated dataset of 86 Minecraft crash bug reports, carefully selected and validated for reproducibility. This dataset serves as a benchmark for evaluating automated bug reproduction systems in game environments.

The dataset contains 86 reports covering 70 game versions.

Dataset Creation

  • Retrieved crash reports from Mojira using its REST API
  • Used regex to extract comments, attachments, and external links
  • Selected reports with "Confirmed" or "Community Consensus" status
  • Excluded reports requiring:
    • Server/multiplayer interactions
    • External files (datapacks, worlds)
    • Machine-dependent setups
  • Randomly sampled final dataset from qualified reports
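The selection steps above could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' scripts: the report dictionaries, the `tags` field, the tag names, and the URL pattern are all hypothetical, and no real Mojira API call is shown.

```python
import random
import re

ACCEPTED_STATUSES = {"Confirmed", "Community Consensus"}
# Hypothetical tags marking the exclusion criteria listed above.
EXCLUDED_TAGS = {"multiplayer", "external_files", "machine_dependent"}
# Simple assumed pattern for external links in report text or comments.
URL_PATTERN = re.compile(r"https?://[^\s<>()]+")


def extract_links(text):
    """Return all http(s) links found in a block of text."""
    return URL_PATTERN.findall(text)


def is_eligible(report):
    """Keep reports with an accepted status and no excluded requirement."""
    return (report["status"] in ACCEPTED_STATUSES
            and not (set(report["tags"]) & EXCLUDED_TAGS))


def sample_reports(reports, k, seed=0):
    """Randomly sample up to k eligible reports, reproducibly."""
    eligible = [r for r in reports if is_eligible(r)]
    rng = random.Random(seed)
    return rng.sample(eligible, min(k, len(eligible)))


# Made-up example records; keys and fields are illustrative only.
example_reports = [
    {"key": "EXAMPLE-1", "status": "Confirmed", "tags": []},
    {"key": "EXAMPLE-2", "status": "Open", "tags": []},
    {"key": "EXAMPLE-3", "status": "Community Consensus", "tags": ["multiplayer"]},
    {"key": "EXAMPLE-4", "status": "Community Consensus", "tags": []},
]
selected = sample_reports(example_reports, k=10)
```

Using a seeded `random.Random` instance keeps the sample reproducible, which matters when a benchmark needs to be regenerated later.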

Quality Assurance

  • Multi-stage filtration process
  • Manual review for quality and relevance
  • Verification of reproducibility

Minecraft Wiki Knowledge Base

To enhance our framework with comprehensive game knowledge, we scraped the official Minecraft Wiki using the MediaWiki API. The extracted content is processed and used for fuzzy matching during bug reproduction, reducing reliance on LLM pre-training data.
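As a rough illustration of fuzzy matching against wiki page titles, the sketch below uses Python's standard `difflib` module. The matching method actually used in BugCraft is not specified here, so treat `match_wiki_titles`, the cutoff value, and the tiny title list as assumptions.

```python
import difflib


def match_wiki_titles(query, titles, limit=3, cutoff=0.6):
    """Return up to `limit` titles that approximately match `query`.

    Matching is case-insensitive; results are ordered by similarity.
    """
    lookup = {title.lower(): title for title in titles}
    hits = difflib.get_close_matches(
        query.lower(), list(lookup), n=limit, cutoff=cutoff
    )
    return [lookup[hit] for hit in hits]


# Hypothetical subset of page titles, for illustration only.
example_titles = ["Obsidian", "Observer", "Oak Planks"]
matches = match_wiki_titles("obsidan", example_titles)
print(matches)
```

A lookup like this lets a misspelled item name in a bug report resolve to the right wiki page before its content is handed to the LLM.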

Access our curated dataset and contribute to advancing automated game bug reproduction research.


How to Cite

If you use BugCraft in your research, please cite our paper:

@inproceedings{Yapagci2025Agents,
  author    = {Yapagci, Eray and Ozturk, Yavuz Alp Sencer and Tuzun, Eray},
  title     = {{Agents in the Sandbox: End-to-End Crash Bug Reproduction for Minecraft}},
  year      = {2025},
  booktitle = {Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE)},
  note      = {To appear. Code available at \url{https://bugcraft2025.github.io}}
}

Contact

For more information, please contact the authors.
