End-to-End Crash Bug Reproduction Using LLM Agents in Minecraft

arXiv preprint

Eray Yapağcı, Yavuz Alp Sencer Öztürk, Eray Tüzün

Computer Engineering Department, Bilkent University

Ankara, Turkey

Leaderboard

Model | % Success | Org | Date
BugCraft End-to-End | 30.23 | Bilkent | 2024-03-31

Introduction

Our work consists of two main contributions:

  • BugCraft-Bench Dataset: A curated collection of 86 reproducible Minecraft crash bug reports, serving as a benchmark for automated bug reproduction systems.
  • BugCraft Framework: An end-to-end automated system that transforms unstructured bug reports into executable steps and reproduces them in-game.

[Figure: pipeline overview. A bug report from BugCraft-Bench enters the BugCraft Framework (S2R Synthesizer, then Action Model Agent) and ends in Success or Failure.]

Abstract

Context: Reproducing game bugs, in our case crash bugs in continuously evolving games like Minecraft, is a notoriously manual and time-consuming process that is hard to automate. Despite the success of LLM-driven bug reproduction in other software domains, games, with their complex interactive environments, remain largely unaddressed.

Objective: This paper introduces BugCraft, a novel end-to-end framework designed to automate the reproduction of crash bugs in Minecraft directly from user-submitted bug reports, addressing the critical gap in automated game bug reproduction.

Method: BugCraft employs a two-stage approach. First, a Step Synthesizer leverages LLMs and Minecraft Wiki knowledge to transform bug reports into high-quality, structured steps to reproduce (S2R). Second, an Action Model, powered by a vision-based LLM agent (GPT-4o) and a custom macro API, executes these S2R steps within Minecraft to trigger the reported crash.

Results: Evaluated on BugCraft-Bench, our framework successfully reproduced 30.23% of crash bugs end-to-end. The Step Synthesizer achieved 66.28% accuracy in generating correct bug reproduction plans, highlighting its effectiveness in interpreting and structuring bug report information.

BugCraft-Bench Dataset

BugCraft-Bench: 86 reports, 70 game versions

Dataset Creation

  • Retrieved crash reports from Mojira using its REST API
  • Used regex to extract comments, attachments, and external links
  • Selected reports with "Confirmed" or "Community Consensus" status
  • Excluded reports requiring:
    • Server/multiplayer interactions
    • External files (datapacks, worlds)
    • Machine-dependent setups
  • Randomly sampled final dataset from qualified reports
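
The selection steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the record layout (`status`, `tags`), the tag names, and the URL pattern are all assumptions made for the example, and no real Mojira API call is shown.

```python
import random
import re

# Hypothetical record layout: these field and tag names are illustrative,
# not the actual Mojira schema.
ACCEPTED_STATUSES = {"Confirmed", "Community Consensus"}
EXCLUDED_TAGS = {"multiplayer", "external_files", "machine_dependent"}

# Simple pattern for external links inside report text (an assumption).
URL_PATTERN = re.compile(r"https?://[^\s\]\)>]+")


def extract_links(text):
    """Return all http(s) URLs found in a block of report text."""
    return URL_PATTERN.findall(text)


def qualifies(report):
    """Apply the status and exclusion criteria to one report."""
    if report["status"] not in ACCEPTED_STATUSES:
        return False
    # Reject the report if it carries any excluded tag.
    return not (set(report.get("tags", ())) & EXCLUDED_TAGS)


def sample_reports(reports, k, seed=0):
    """Randomly sample up to k qualifying reports, reproducibly."""
    pool = [r for r in reports if qualifies(r)]
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))
```

Fixing the seed keeps the sampled benchmark stable across runs, which matters when the sample is later reviewed by hand.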

Quality Assurance

  • Multi-stage filtration process
  • Manual review for quality and relevance
  • Verification of reproducibility

Minecraft Wiki Knowledge Base

To enhance our framework with comprehensive game knowledge, we scraped the official Minecraft Wiki using the MediaWiki API. The extracted content is processed and used for fuzzy matching during bug reproduction, reducing reliance on LLM pre-training data.
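
As a rough illustration of fuzzy matching against wiki page titles, the sketch below uses Python's standard `difflib`. The page does not say which matching algorithm BugCraft uses, so the choice of `difflib`, the sample titles, and the `cutoff` value are assumptions.

```python
from difflib import get_close_matches

# Hypothetical sample of wiki page titles; the real knowledge base is
# built from the scraped Minecraft Wiki.
WIKI_TITLES = [
    "Ender Dragon",
    "End Portal",
    "Enchanting Table",
    "Command Block",
    "Redstone Repeater",
]


def match_wiki_title(mention, titles=WIKI_TITLES, cutoff=0.6):
    """Return the closest wiki title for a free-text mention, or None."""
    # Compare case-insensitively, then map back to the original title.
    lowered = {t.lower(): t for t in titles}
    hits = get_close_matches(mention.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None
```

For example, a misspelled mention such as "comand block" would resolve to the "Command Block" page, whose text could then be supplied to the LLM as context.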

Framework Overview

The BugCraft Framework processes bug reports through a two-stage pipeline:

Bug Report: User-submitted crash report

S2R Synthesizer: Transforms unstructured bug reports into clear, executable steps
  • Wiki-powered knowledge enhancement
  • Multi-stage refinement pipeline
  • Smart step clustering system
  • Automated reasoning trajectories

Action Model Agent: Executes synthesized steps in the game environment
  • Intelligent step-by-step execution
  • Screen-based decision making
  • JSON API integration
  • Crash trajectory logging

Success: Bug successfully reproduced and crash trajectory logged

Failure: Execution limit exceeded or unable to reproduce bug
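
The control flow of the execution stage can be sketched as a bounded loop. This is a simplified illustration under stated assumptions: the JSON action shape, the `MAX_ACTIONS` limit, the one-action-per-step simplification, and the three callback names are hypothetical and do not describe BugCraft's actual macro API.

```python
import json

MAX_ACTIONS = 10  # hypothetical execution limit


def run_agent(steps, choose_action, execute, game_crashed, max_actions=MAX_ACTIONS):
    """Drive the action loop until the game crashes or the budget runs out.

    choose_action(step, history) returns a JSON string, for example
        '{"action": "command", "args": {"text": "/give @s tnt"}}'
    execute(action_dict) performs the action in the game.
    game_crashed() reports whether a crash has been observed.
    """
    trajectory = []
    for step in steps:
        if len(trajectory) >= max_actions:
            break  # execution limit exceeded
        action = json.loads(choose_action(step, trajectory))
        execute(action)
        trajectory.append({"step": step, "action": action})
        if game_crashed():
            return {"outcome": "success", "trajectory": trajectory}
    return {"outcome": "failure", "trajectory": trajectory}
```

Passing the model call, the game interface, and the crash check in as callbacks keeps the loop testable without a running game; the returned trajectory corresponds to the crash trajectory log mentioned above.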

Results & Analysis

Our evaluation demonstrates that BugCraft successfully reproduced 30.23% of crash bugs end-to-end on the BugCraft-Bench dataset (86 valid reports). This highlights the potential of LLM agents for automated bug reproduction in complex game environments.

End-to-End Performance

30.23% Success Rate

Successfully reproduced 26 out of 86 crash bugs end-to-end

Step Synthesizer

66.28% Accuracy

Generated correct plans for 57 out of 86 reports

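Error categories among the 29 incorrect plans (a plan can contain more than one error type, so the shares sum to more than 100%):
  • Wrong Command (48.28%)
  • Missing Step (34.48%)
  • Logic Error (31.03%)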

Inter-rater agreement: Cohen's Kappa: 0.70, Percentage Agreement: 83.0%
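
For readers unfamiliar with the two agreement measures, the sketch below computes percentage agreement and Cohen's kappa for two raters. The labels in the usage note are made up for illustration; the study's actual annotations are not reproduced here.

```python
from collections import Counter


def percentage_agreement(rater_a, rater_b):
    """Share of items on which both raters gave the same label."""
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("raters must label the same non-empty set of items")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    observed = percentage_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

With the illustrative labels `["ok", "ok", "bad", "bad"]` and `["ok", "ok", "bad", "ok"]`, observed agreement is 0.75 and chance agreement is 0.5, giving a kappa of 0.5. Kappa is lower than raw agreement because part of the agreement is expected by chance.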

Action Model

38.60% Success Rate

Successfully executed 22 out of 57 correct plans

Agent Incapability (50.88%)
  • Stuck in Loop (19.30%)
  • Poor Decision Making (29.82%)
Framework Incapability (7.02%)
Recovery from Faulty Plans (13.79% of the 29 incorrect plans)
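
The headline rates can be cross-checked from the counts reported on this page. In the sketch below, the four named counts come from the text; the number of recovered runs is inferred as the difference between end-to-end successes and successes from correct plans, which is an assumption rather than a figure stated directly.

```python
def pct(numerator, denominator):
    """Percentage rounded to two decimals, as reported on the page."""
    return round(100 * numerator / denominator, 2)


TOTAL_REPORTS = 86          # reports in BugCraft-Bench
CORRECT_PLANS = 57          # plans judged correct
EXECUTED_FROM_CORRECT = 22  # correct plans executed successfully
END_TO_END = 26             # crashes reproduced end-to-end

FAULTY_PLANS = TOTAL_REPORTS - CORRECT_PLANS    # 29
RECOVERED = END_TO_END - EXECUTED_FROM_CORRECT  # 4 (inferred, not stated)

rates = {
    "end_to_end": pct(END_TO_END, TOTAL_REPORTS),
    "synthesizer_accuracy": pct(CORRECT_PLANS, TOTAL_REPORTS),
    "action_model_success": pct(EXECUTED_FROM_CORRECT, CORRECT_PLANS),
    "recovery_from_faulty": pct(RECOVERED, FAULTY_PLANS),
}
```

Under this reading, the 13.79% recovery rate is measured against the 29 faulty plans rather than the 57 correct ones, and the 26 end-to-end successes split into 22 from correct plans plus 4 recovered from faulty plans.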

Key Takeaways

BugCraft demonstrates the potential of LLM agents for automated game bug reproduction. The Step Synthesizer shows promising capability (66.28% accuracy), while the Action Model's execution phase remains the major bottleneck (38.60% success rate with correct plans). The agent's ability to recover from faulty plans (13.79% recovery rate) suggests potential for improvement through enhanced decision-making and interaction capabilities.

Read the Full Paper

For a comprehensive understanding of our methodology, detailed results, and in-depth analysis of BugCraft's performance in automated game bug reproduction, please read our full paper.

How to Cite

If you use BugCraft in your research, please cite our paper:

@article{yapagci2024bugcraft,
  title={End-to-End Crash Bug Reproduction Using LLM Agents in Minecraft},
  author={Yapağcı, Eray and Öztürk, Yavuz Alp Sencer and Tüzün, Eray},
  journal={arXiv preprint arXiv:2503.20036},
  year={2024}
}

Contact

For questions, feedback, or collaborations, please reach out to us: