CodeLlama-13b-MORepair

CodeLlama-13b-MORepair is a program repair model fine-tuned from CodeLlama-13b-instruct using a novel multi-objective fine-tuning framework called MOREPAIR. This model is specifically designed to improve automated program repair capabilities by learning both code transformations and repair logic reasoning.

Paper | Code

Model Description

  • Base Model: CodeLlama-13b-instruct
  • Training Technique: Multi-objective fine-tuning with MOREPAIR framework
  • Supported Languages: Evaluated primarily on C++ and Java; generalization to other languages is expected but not established
  • Primary Use: Automated program repair
  • License: Llama 2 Community License

Training Details

Training Data

  • Dataset: TUTORLLMCODE
  • Size: 1,600 pairs of buggy and repaired code
  • Nature: Programming task corrections with LLM-generated repair guidance

Training Approach

The model was trained using MOREPAIR, which employs:

  • Multi-objective learning with two objectives:
    1. Generating repaired code
    2. Producing repaired code with explanatory guidance
  • QLoRA fine-tuning (only 1.84% of parameters modified)
  • NEFTune for improved generalization
  • LLM-generated guidance for understanding repair logic
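
As a rough illustration of the NEFTune item above, the sketch below adds uniform noise to a sequence of token embeddings, scaled by alpha / sqrt(L * d), where L is the sequence length and d is the embedding dimension. The function name neftune_noise, the default alpha=5.0, and the plain-list representation of embeddings are assumptions made for illustration; this card does not state the noise scale used when training with MOREPAIR.

```python
import math
import random

def neftune_noise(embeddings, alpha=5.0, rng=None):
    """Return a noisy copy of `embeddings` (a list of L vectors, each of length d).

    Hypothetical helper: each entry receives Uniform(-1, 1) noise scaled by
    alpha / sqrt(L * d). The input list is left unmodified.
    """
    if not embeddings:
        return []
    rng = rng or random.Random()
    seq_len = len(embeddings)
    dim = len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    return [[x + scale * rng.uniform(-1.0, 1.0) for x in row] for row in embeddings]

# Example: 4 tokens with 8-dimensional all-zero embeddings (assumed toy data)
clean = [[0.0] * 8 for _ in range(4)]
noisy = neftune_noise(clean, alpha=5.0, rng=random.Random(0))
bound = 5.0 / math.sqrt(4 * 8)  # largest possible perturbation per entry
```

The noise is applied only during training; at inference time the embeddings are used unchanged.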

Usage

Here's how to use the model with the Hugging Face Transformers library:

Installation

pip install transformers torch accelerate bitsandbytes

Basic Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load model and tokenizer
model_name = "barty/CodeLlama-13B-MORepair"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True,
    torch_dtype=torch.float16
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

def repair_code(buggy_code, filename="example.java"):
    # Construct the prompt in the format the model expects
    prompt = f"""[INST] This is an incorrect code({filename}):
```java
{buggy_code}
```
You are a software engineer. Can you repair the incorrect code?
[/INST]
```java
"""
    
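    # Calculate token count for length control (total budget: 500 tokens)
    prompt_tokens = len(tokenizer.tokenize(prompt))
    # Leave room for at least 64 new tokens so min_length never exceeds max_length
    max_new_tokens = max(500 - prompt_tokens, 64)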
    
    # Generate repair
    output = pipe(
        prompt,
        min_length=prompt_tokens + 64,
        max_length=prompt_tokens + max_new_tokens,
        temperature=1.0,
        do_sample=True
    )
    
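    # Extract the generated code: drop the prompt, then take the body of the first code fence
    full_text = output[0]['generated_text']
    fence = "`" * 3  # Markdown code fence marker
    response = full_text.split('[/INST]', 1)[1]
    fixed_code = response.split(fence + 'java', 1)[-1].split(fence, 1)[0].strip()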
    
    return full_text, fixed_code

# Example usage
buggy_code = """
public static int findMinRotated(int[] arr) {
    int left = 0;
    int right = arr.length - 1;
    
    while (left < right) {
        int mid = (left + right) / 2;
        if (arr[mid] > arr[right])
            left = mid;  // Bug: should be mid + 1
        else
            right = mid;
    }
    return arr[left];
}
"""

full_response, fixed_code = repair_code(buggy_code)
print("Fixed code:")
print(fixed_code)
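
The prompt in the example hardcodes a java fence tag. Since C++ is also listed as a tested language, a small helper that picks the tag from the file extension may be convenient. This is only a sketch: the build_prompt name, the extension table, and the cpp tag are assumptions, and this card does not document the exact prompt used for C++ inputs.

```python
import os

# Hypothetical mapping from file extension to Markdown fence tag
FENCE_TAGS = {".java": "java", ".cpp": "cpp", ".cc": "cpp", ".cxx": "cpp"}
FENCE = "`" * 3  # Markdown code fence marker

def build_prompt(buggy_code, filename="example.java"):
    """Build an instruction prompt in the same shape as the Java example."""
    ext = os.path.splitext(filename)[1].lower()
    tag = FENCE_TAGS.get(ext, "")  # unknown extensions get an untagged fence
    return (
        f"[INST] This is an incorrect code({filename}):\n"
        f"{FENCE}{tag}\n"
        f"{buggy_code}\n"
        f"{FENCE}\n"
        "You are a software engineer. Can you repair the incorrect code?\n"
        "[/INST]\n"
        f"{FENCE}{tag}\n"
    )

java_prompt = build_prompt("int x = 1;", "Main.java")
cpp_prompt = build_prompt("int x = 1;", "main.cpp")
```

For a .java file this reproduces the prompt string used in repair_code, so the helper could replace the inline f-string there.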

Important Parameters

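  • load_in_8bit=True: Loads the weights with 8-bit quantization (requires bitsandbytes), roughly halving memory use relative to FP16
  • temperature=1.0: Sampling temperature; 1.0 leaves the model's output distribution unscaled, while lower values make generation more deterministic
  • do_sample=True: Samples from the output distribution instead of decoding greedily, so repeated calls can return different repairs
  • min_length: Minimum total length in tokens, counting the prompt plus generated text; the example sets it to the prompt length plus 64
  • max_length: Maximum total length in tokens, counting the prompt plus generated text; the example derives it from a 500-token budget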
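
Because min_length and max_length both count the prompt as well as the generated tokens, the two bounds can conflict when the prompt is long. The arithmetic can be sketched as follows; length_bounds is a hypothetical helper, and the defaults of 500 and 64 are taken from the usage example rather than from any documented requirement of the model.

```python
def length_bounds(prompt_tokens, total_budget=500, min_new_tokens=64):
    """Return (min_length, max_length), both counted as prompt plus generated tokens."""
    min_length = prompt_tokens + min_new_tokens
    # Never let the upper bound fall below the lower bound for long prompts
    max_length = max(total_budget, min_length)
    return min_length, max_length

short_bounds = length_bounds(120)  # (184, 500): up to 380 new tokens
long_bounds = length_bounds(480)   # (544, 544): exactly 64 new tokens
```

With a 480-token prompt, a fixed max_length of 500 would sit below min_length of 544, which is why the sketch raises the upper bound in that case.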

Limitations

  • Performance varies across different programming languages
  • May require multiple attempts to generate correct fixes
  • Should be used with appropriate test cases to validate repairs
  • May not handle very complex or multi-file program repairs
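
Since a single sample may be wrong, one common pattern is to draw several candidates and keep the first that passes the tests. The sketch below assumes two caller-supplied callables, generate and passes_tests (for instance, a wrapper around repair_code and a project test runner). Both names are hypothetical, and neither is part of this model's API.

```python
def first_passing_repair(buggy_code, generate, passes_tests, max_attempts=10):
    """Sample candidates until one passes the tests.

    Returns (candidate, attempts_used), or (None, max_attempts) if none passed.
    """
    for attempt in range(1, max_attempts + 1):
        candidate = generate(buggy_code)
        if passes_tests(candidate):
            return candidate, attempt
    return None, max_attempts

# Stub demonstration: canned candidates stand in for model samples
candidates = iter(["left = mid;", "left = mid - 1;", "left = mid + 1;"])
fixed, attempts = first_passing_repair(
    "left = mid;",
    generate=lambda code: next(candidates),
    passes_tests=lambda code: "mid + 1" in code,
    max_attempts=3,
)
```

In practice passes_tests would compile and run the candidate against a test suite in a sandbox, since generated code should not be executed unchecked.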

Technical Specifications

  • Architecture: Based on CodeLlama-13b-instruct
  • Parameters: Same as base model (13B)
  • Fine-tuning Method: QLoRA + NEFTune
  • Context Window: Same as CodeLlama-13b-instruct
  • Input Format: Code snippets with optional repair guidance

Citation

If you use this model in your research, please cite:

@article{yang2024multi,
  title={Multi-Objective Fine-Tuning for Enhanced Program Repair with LLMs},
  author={Yang, Boyang and Tian, Haoye and Ren, Jiadong and Zhang, Hongyu and Klein, Jacques and Bissyandé, Tegawendé F. and Le Goues, Claire and Jin, Shunfu},
  journal={arXiv preprint arXiv:2404.12636},
  year={2024}
}

Acknowledgments

This model builds upon the CodeLlama model family developed by Meta AI and incorporates the MOREPAIR framework for enhanced program repair capabilities.
