---
license: llama3.2
base_model: meta-llama/Llama-3.2-8B-Instruct
tags:
- text-generation
- instruction
- datafusion
- rust
- code
---

![transformers](https://img.shields.io/badge/transformers-yes-green)

**Author:** yarenty
**Model type:** Llama 3.2 (fine-tuned)
**Task:** Instruction following, code Q&A, DataFusion expert assistant
**License:** Llama 3.2 Community License
**Visibility:** Public

---

# Llama 3.2 DataFusion Instruct

This model is a fine-tuned version of **meta-llama/Llama-3.2-8B-Instruct**, specialized for the [Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) ecosystem. It is designed as a helpful assistant for developers: answering technical questions, generating code, and explaining concepts related to DataFusion, Arrow.rs, Ballista, and the broader Rust data engineering landscape.

**GGUF Version:** For quantized, low-resource deployment, you can find the GGUF version [here]() (a usage sketch appears below, after the limitations).

## Model Description

This model was fine-tuned on a curated dataset of high-quality question-answer pairs and instruction-following examples sourced from the official DataFusion documentation, source code, mailing lists, and community discussions.

- **Model Type:** Instruction-following large language model (LLM)
- **Base Model:** `meta-llama/Llama-3.2-8B-Instruct`
- **Primary Use:** Developer assistant for the DataFusion ecosystem

## Prompt Template

To get the best results, format your prompts using the following instruction template:

```
### Instruction:
{Your question or instruction here}

### Response:
```

## Example Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yarenty/llama32-datafusion-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The model was trained with a specific instruction template.
# For optimal performance, your prompt should follow this structure.
prompt_template = """### Instruction:
How do I register a Parquet file in DataFusion?

### Response:"""

inputs = tokenizer(prompt_template, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, eos_token_id=tokenizer.eos_token_id)

# Decode only the newly generated tokens, skipping the prompt and special tokens.
prompt_length = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))
```

## Training Procedure

- **Hardware:** Trained on 1x NVIDIA A100 GPU.
- **Training Script:** Custom script using `SFTTrainer` from the `trl` library (a sketch of this setup appears below, after the limitations).
- **Key Hyperparameters:**
  - Epochs: 3
  - Learning rate: 2e-5
  - Batch size: 4
- **Dataset:** A curated dataset of ~5,000 high-quality QA pairs and instructions related to DataFusion. Data was cleaned and deduplicated as per the notes in `pitfalls.md`.

## Intended Use & Limitations

- **Intended Use:** This model is intended for developers and data engineers working with DataFusion. It can be used for code generation, debugging assistance, and learning the library. It can also serve as a strong base for further fine-tuning on more specialized data.
- **Limitations:** The model's knowledge is limited to the data it was trained on. It may produce inaccurate or outdated information for rapidly evolving parts of the library. It is not a substitute for official documentation or expert human review.
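## Running the GGUF Version (Sketch)

A minimal sketch of local inference with the GGUF build, assuming the `llama-cpp-python` bindings. The model filename is hypothetical, since the GGUF link above is not yet filled in; substitute the actual file you download.

```python
from llama_cpp import Llama

# Hypothetical local path to the downloaded GGUF file.
llm = Llama(model_path="llama32-datafusion-instruct.Q4_K_M.gguf", n_ctx=4096)

prompt = """### Instruction:
How do I register a Parquet file in DataFusion?

### Response:"""

# Stop at the next "### Instruction:" header so the model
# doesn't continue generating a new turn on its own.
output = llm(prompt, max_tokens=256, stop=["### Instruction:"])
print(output["choices"][0]["text"])
```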
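## Reproducing the Fine-Tune (Sketch)

A minimal sketch of the training setup described above, assuming a recent `trl` release with `SFTConfig`. The dataset path and field names are hypothetical, since the curated ~5,000-pair dataset is not published here; only the hyperparameters are taken from this card.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical local dataset of instruction/response pairs.
dataset = load_dataset("json", data_files="datafusion_qa.jsonl", split="train")

def format_example(example):
    # Render each pair in the instruction template this card documents.
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['response']}"
    )

config = SFTConfig(
    output_dir="llama32-datafusion-instruct",
    num_train_epochs=3,             # from the card
    learning_rate=2e-5,             # from the card
    per_device_train_batch_size=4,  # from the card
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.2-8B-Instruct",
    args=config,
    train_dataset=dataset,
    formatting_func=format_example,
)
trainer.train()
```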
## Citation

If you find this model useful in your work, please cite:

```
@misc{yarenty_2025_llama32_datafusion_instruct,
  author       = {yarenty},
  title        = {Llama 3.2 DataFusion Instruct},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/yarenty/llama32-datafusion-instruct}}
}
```

## Contact

For questions or feedback, please open an issue on the Hugging Face repository or the [source GitHub repository](https://github.com/yarenty/trainer).