---
license: llama2
base_model: codellama/CodeLlama-7b-Instruct-hf
tags:
- c
- misra-c
- code-generation
- safety-critical
- fine-tuning
- codellama
datasets:
- oussama120/misra-c-instructions
---

# CodeLlama-7B-Instruct MISRA C Analyzer

This is a fine-tuned version of the `codellama/CodeLlama-7b-Instruct-hf` model, specifically adapted to identify violations of the MISRA C 2012 guidelines in C code snippets and suggest corrections.

The model was fine-tuned on the `oussama120/misra-c-instructions` dataset. Its primary purpose is to act as an assistant for developers working in safety-critical environments (such as automotive, aerospace, or medical devices) to help write more compliant, safe, and reliable C code.

## Model Description

- **Base Model:** `codellama/CodeLlama-7b-Instruct-hf`
- **Fine-tuning Dataset:** `oussama120/misra-c-instructions`
- **Language:** C
- **Primary Task:** Code analysis for MISRA C 2012 compliance

This model takes a C code snippet as input and outputs a description of potential MISRA C violations, often citing the specific rule number and providing a corrected or compliant version of the code.

## Intended Uses & Limitations

### Intended Use

This model is intended to be used as a developer assistant for:

- **Code Review:** Automatically flagging potential MISRA C issues in pull requests or code commits.
- **Educational Tool:** Helping developers learn and understand the nuances of MISRA C rules.
- **Interactive Code Analysis:** Providing quick feedback on C code snippets during development.

### Limitations

⚠️ **This model is not a certified static analysis tool.** It should not be used as a replacement for industry-standard static analysis tools such as Polyspace, Klocwork, or PVS-Studio.

- **Potential for Hallucinations:** As a language model, it can make mistakes, miss violations, or incorrectly flag issues. Always verify its output.
- **Limited Context:** The model was trained on single-function snippets. It may struggle to understand complex, multi-file projects where context is spread across many files.
- **Not a Compliance Guarantee:** Using this model does not guarantee MISRA C compliance.

## How to Get Started

You can use this model directly with the `transformers` library. Ensure you are logged into your Hugging Face account (`huggingface-cli login`).

```python
import torch
from transformers import pipeline

# Use the pipeline for easy text generation
pipe = pipeline(
    "text-generation",
    model="Utkarsh524/CodeLlama-7B-MISRA-C-Finetuned",  # Replace with your repo name
    torch_dtype=torch.float16,
    device_map="auto",
)

# C code with a potential MISRA C violation (pointer arithmetic)
user_c_code = """
#include <stdint.h>

uint16_t get_third_reading(void)
{
    uint16_t sensor_readings[5] = { 100, 105, 102, 108, 101 };
    uint16_t *reading_ptr = &sensor_readings[0];

    // This is a violation of Rule 18.4
    reading_ptr = reading_ptr + 2;

    return *reading_ptr;
}
"""

# The prompt must follow the instruction format the model was trained on
prompt = f"""[INST] Analyze the following C code, identify any MISRA C 2012 violations, and suggest possible corrections.

{user_c_code}

[/INST]"""

# Generate the analysis
sequences = pipe(
    prompt,
    do_sample=True,
    temperature=0.2,
    top_p=0.9,
    num_return_sequences=1,
    eos_token_id=pipe.tokenizer.eos_token_id,
    max_new_tokens=512,
)

# Print the model's output
for seq in sequences:
    print(seq["generated_text"])
```


## Training Procedure

The model was fine-tuned using a script that leverages PEFT (Parameter-Efficient Fine-Tuning) with LoRA and 4-bit quantization to make the process memory-efficient.

### Data Preparation
1.  The `oussama120/misra-c-instructions` dataset was loaded, which contains a single `text` column.
2.  Each sample in the `text` column was parsed to extract the `###Instruction:` and `###Response:` sections.
3.  Any rows that could not be parsed correctly were filtered out and excluded from training.
4.  The extracted instruction and response pairs were formatted into the CodeLlama prompt format: `<s>[INST] {instruction} [/INST] {response} </s>` (see the sketch after this list).
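
A minimal sketch of this preparation pipeline, assuming the `###Instruction:`/`###Response:` markers described above (the helper names and exact `datasets` calls are illustrative, not the original training script):

```python
from datasets import load_dataset

def parse_sample(text: str):
    """Split a raw `text` field into (instruction, response); None if malformed."""
    try:
        _, rest = text.split("###Instruction:", 1)
        instruction, response = rest.split("###Response:", 1)
        return instruction.strip(), response.strip()
    except ValueError:
        return None  # Missing a marker: this row gets filtered out (step 3)

def to_prompt(sample):
    parsed = parse_sample(sample["text"])
    if parsed is None:
        return {"text": ""}
    instruction, response = parsed
    # CodeLlama instruction format used for fine-tuning (step 4)
    return {"text": f"<s>[INST] {instruction} [/INST] {response} </s>"}

dataset = load_dataset("oussama120/misra-c-instructions", split="train")
dataset = dataset.map(to_prompt).filter(lambda s: s["text"] != "")
```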

### Fine-tuning Technique
- **Quantization:** The base model was loaded in 4-bit precision using the `BitsAndBytesConfig` from the `transformers` library. This dramatically reduces the GPU memory footprint. The quantization type used was `nf4`.
- **PEFT with LoRA:** Low-Rank Adaptation (LoRA) was applied to the model. Instead of training all the model's weights, only small, low-rank adapter matrices were added to the attention layers and trained. This allows for efficient fine-tuning with far fewer trainable parameters.
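
As a rough sketch, such a setup is typically assembled as follows with `transformers` (only the `nf4` quantization type is confirmed above; the compute dtype and `device_map` are assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization of the base model (compute dtype is an assumption)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```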

### Hyperparameters

| Hyperparameter                | Value                                                           |
|-------------------------------|-----------------------------------------------------------------|
| `base_model`                  | `codellama/CodeLlama-7b-Instruct-hf`                            |
| `num_train_epochs`            | 3                                                               |
| `per_device_train_batch_size` | 1                                                               |
| `gradient_accumulation_steps` | 8                                                               |
| `learning_rate`               | 2e-4                                                            |
| `lr_scheduler_type`           | `cosine`                                                        |
| `warmup_ratio`                | 0.03                                                            |
| `optimizer`                   | `paged_adamw_8bit`                                              |
| `lora_r` (rank)               | 64                                                              |
| `lora_alpha`                  | 64                                                              |
| `lora_dropout`                | 0.1                                                             |
| `lora_target_modules`         | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
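
For reference, these values translate into `peft` and `transformers` configuration objects roughly as follows (a sketch, not the original training script; `output_dir` is a placeholder):

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter configuration from the table above
peft_config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

# Training arguments from the table above (output_dir is a placeholder)
training_args = TrainingArguments(
    output_dir="./codellama-misra-c",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="paged_adamw_8bit",
)
```

Both objects would then typically be handed to TRL's `SFTTrainer` together with the formatted dataset.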

### Hardware
- **GPU:** 1x NVIDIA A100 (48GB VRAM)
- **Frameworks:** PyTorch, Transformers, PEFT, TRL, Accelerate

## Testing Procedure

The model was evaluated using two primary methods:

### Method 1: Validation on the Dataset Test Split
1.  **Model Loading:** The trained LoRA adapters were merged into the base model to create a standalone, fully fine-tuned model (see the sketch after this list).
2.  **Data Loading:** The `test` split of the `oussama120/misra-c-instructions` dataset was loaded.
3.  **Inference:** An interactive script prompted the user for a row number from the test set.
4.  **Comparison:** The script extracted the instruction from the specified row, generated a response from the model, and printed it side-by-side with the ground-truth response from the dataset.
5.  **Export:** The results, including the instruction, model response, and actual response, were logged to a CSV file for later analysis.
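
The merge step would look roughly like this with `peft` (a sketch; the adapter and output paths are placeholders):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model, attach the trained LoRA adapters, and merge them in
base = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "path/to/lora-adapters")  # placeholder path
model = model.merge_and_unload()  # standalone, fully fine-tuned model
model.save_pretrained("path/to/merged-model")  # placeholder path
```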

### Method 2: Interactive Analysis of Custom C Code
1.  **Model Loading:** The merged model was loaded and prepared for inference.
2.  **User Input:** An interactive script prompted the user to enter a multi-line C code snippet directly into the terminal.
3.  **Prompt Engineering:** The user's code was wrapped in a standardized analysis prompt (`<s>[INST] Analyze the following C code...[/INST]`), as sketched after this list.
4.  **Generation:** The model generated a detailed analysis of potential MISRA C violations.
5.  **Export:** The user's input C code and the model's generated analysis were printed to the console and saved to a CSV file for record-keeping.
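
Steps 2 and 3 amount to something like the following (a sketch; the empty-line end-of-input convention is an assumption):

```python
# Read a multi-line C snippet from the terminal, terminated by an empty line
print("Paste your C code (finish with an empty line):")
lines = []
while True:
    line = input()
    if line == "":
        break
    lines.append(line)
user_c_code = "\n".join(lines)

# Wrap the snippet in the standardized analysis prompt
prompt = (
    "<s>[INST] Analyze the following C code, identify any MISRA C 2012 "
    f"violations, and suggest possible corrections.\n\n{user_c_code}\n\n[/INST]"
)
```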

## Citation

```bibtex
@misc{codellama,
  title = {Code Llama: Open Foundation Models for Code},
  author = {Rozière, Baptiste and Gehring, Jonas and Gloeckle, Fabian and Sootla, Sten and Gat, Itai and Tan, Xiaoqing Ellen and Adi, Yossi and Liu, Jingyu and Remez, Tal and Rapin, Jérémy and others},
  year = {2023},
  month = {August},
  eprint = {2308.12950},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}

@misc{misra_c_dataset,
  author = {Oussama},
  title = {MISRA C Instructions},
  year = {2023},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/oussama120/misra-c-instructions}}
}
```