---
pipeline_tag: text-generation
tags:
- minecraft
- action-prediction
- grounded-instruction-following
- task-oriented-dialog
- blocks-world
- embodied-ai
- synthetic-data
- spatial-reasoning
language:
- en
license: llama3
base_model:
- meta-llama/Meta-Llama-3-8B
metrics:
- f1
---

# Llama-CRAFTS: A Minecraft Builder Action Prediction Model

**Llama-CRAFTS** (**C**ontext **R**ich **A**nd **F**ine-**T**uned On **S**ynthetic Data) is a Llama-3-8B model fine-tuned for the **Builder Action Prediction (BAP)** task in Minecraft. The model predicts a sequence of block placements or removals based on the current game context.

This model establishes a new **state-of-the-art** on the task, achieving an F1 score of **53.0**, a 6-point improvement over the previous SOTA ([Nebula](https://arxiv.org/abs/2406.18164)). Its development is part of a holistic re-examination of the BAP task itself, introducing an improved evaluation framework, new synthetic datasets, and enhanced modeling techniques that together form **BAP v2**, an enhanced task framework.

### Key Features:

* **State-of-the-Art Performance**: Achieves the highest score on the BAP v2 benchmark.
* **Trained on Rich Synthetic Data**: In addition to the original Minecraft BAP data, Llama-CRAFTS was fine-tuned on three novel synthetic datasets specifically designed to teach complex spatial reasoning and instruction following.
* **Context-Rich Inputs**: The model leverages richer textual input representations of the game context, which proved crucial for improving spatial awareness (see the illustrative sketch after this list).
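
The exact input serialization and action syntax are defined in the BAP v2 code and data release; the snippet below is only an illustrative sketch of what a context-rich prompt and a predicted action sequence might look like. The tags, coordinate convention, and action strings here are assumptions made for illustration, not the format actually used in training.

```python
# Purely illustrative sketch of a BAP-style input/output pair.
# The real prompt template and action syntax are defined in the BAP v2 repository
# (https://github.com/prashant-jayan21/bap-v2) and may differ from this.

example_context = """\
<dialogue>
Architect: build a red column three blocks tall in the middle of the grid
Builder: ok
</dialogue>
<world_state>
red block at (0, 1, 0)
</world_state>"""

# The model is expected to emit the Builder's next actions as text, for example:
expected_actions = "place red (0, 2, 0)\nplace red (0, 3, 0)"
```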

## Model Details

### Model Description

* **Model type**: A Llama-3-8B model fine-tuned using QLoRA.
* **Language(s)**: English
* **Finetuned from model**: `meta-llama/Meta-Llama-3-8B`
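
A minimal inference sketch with the `transformers` library is shown below. It assumes the fine-tuned weights are released as a full causal language model under this repository's id (the `your-org/llama-crafts` identifier is a placeholder) and that you have accepted the Llama 3 license for the base model; if only a QLoRA adapter is published, attach it to the base model with `peft` instead.

```python
# Minimal inference sketch using Hugging Face transformers.
# "your-org/llama-crafts" is a placeholder: substitute this repository's actual id.
# If only a QLoRA adapter is released, load meta-llama/Meta-Llama-3-8B first and
# attach the adapter with peft.PeftModel.from_pretrained instead.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/llama-crafts"  # placeholder repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# The prompt should follow the serialization used at training time (see the BAP v2 repo).
prompt = "..."  # game context: dialogue history plus world state
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```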

### Training Data

Llama-CRAFTS was trained on the **BAP v2 training set**, which is a combination of:

- **The original BAP Dataset:** Human-human dialogue and game logs from the Minecraft Dialogue Corpus.
- **Three Synthetic Datasets:** Novel datasets generated to provide rich, targeted examples of spatial language for instruction following. These were crucial for overcoming data scarcity and teaching the model spatial skills.

### Evaluation

The model was evaluated on the **BAP v2 benchmark**, which features a cleaner test set and fairer, more insightful metrics to better assess model capabilities, including spatial reasoning.
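
For intuition, the core idea behind an action-level F1 is sketched below: predicted actions are compared against gold actions as exact tuples, and precision and recall are combined in the usual way. This is a simplified approximation written for illustration; the official BAP v2 metrics are more nuanced, so use the evaluation code in the repository for reported numbers.

```python
# Illustrative F1 over predicted vs. gold Builder actions.
# Each action is a hashable tuple, e.g. ("place", "red", (0, 2, 0)).
# The official BAP v2 evaluation is more involved; see the paper and repo.

def action_f1(predicted, gold):
    """F1 between two collections of actions, compared as exact tuples."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(action_f1(
    [("place", "red", (0, 2, 0)), ("place", "red", (0, 3, 0))],
    [("place", "red", (0, 2, 0)), ("remove", "blue", (1, 1, 0))],
))  # 0.5
```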

## Model Sources

- **Paper:** [*BAP v2: An Enhanced Task Framework for Instruction Following in Minecraft Dialogues*](https://arxiv.org/abs/2501.10836)
- **Code and Data:** [https://github.com/prashant-jayan21/bap-v2](https://github.com/prashant-jayan21/bap-v2)
- **Blog:** [https://www.alphaxiv.org/overview/2501.10836v3](https://www.alphaxiv.org/overview/2501.10836v3)

## Citation

If you use this model, please cite our work:

```bibtex
@misc{jayannavar2025bapv2enhancedtask,
  title={BAP v2: An Enhanced Task Framework for Instruction Following in Minecraft Dialogues},
  author={Prashant Jayannavar and Liliang Ren and Marisa Hudspeth and Risham Sidhu and Charlotte Lambert and Ariel Cordes and Elizabeth Kaplan and Anjali Narayan-Chen and Julia Hockenmaier},
  year={2025},
  eprint={2501.10836},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2501.10836},
}
```