---
license: apache-2.0
language: c++
tags:
- code-generation
- codellama
- peft
- unit-tests
- causal-lm
- text-generation
- embedded-systems
base_model: codellama/CodeLlama-7b-hf
model_type: llama
pipeline_tag: text-generation
---

# CodeLlama Embedded Test Generator (v9)

This repository hosts an **instruction-tuned CodeLlama-7B model** that generates production-grade C/C++ unit tests for embedded systems. The model combines the base [codellama/CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf) model with a custom LoRA adapter trained on a curated dataset of embedded software tests.

---

## Prompt Schema

```text
<|system|>
Generate unit tests for C/C++ code. Cover all edge cases, boundary conditions, and error scenarios.
Output Constraints:
1. ONLY include test code (no explanations, headers, or main functions)
2. Start directly with TEST(...)
3. End after last test case
4. Never include framework boilerplate

<|user|>
Write test cases for the following C/C++ code:
{your C/C++ function here}

<|assistant|>
```

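Because the model was trained on this exact template, it is worth assembling the prompt programmatically rather than retyping it. Below is a minimal sketch; `build_prompt` is a hypothetical helper name, not part of the released model.

```python
# System block copied verbatim from the prompt schema above.
SYSTEM_BLOCK = """<|system|>
Generate unit tests for C/C++ code. Cover all edge cases, boundary conditions, and error scenarios.
Output Constraints:
1. ONLY include test code (no explanations, headers, or main functions)
2. Start directly with TEST(...)
3. End after last test case
4. Never include framework boilerplate
"""

def build_prompt(code: str) -> str:
    """Assemble the full chat-style prompt for one C/C++ snippet (hypothetical helper)."""
    return (
        SYSTEM_BLOCK
        + "\n<|user|>\nWrite test cases for the following C/C++ code:\n"
        + code.strip()
        + "\n\n<|assistant|>\n"
    )

prompt = build_prompt("int add(int a, int b) { return a + b; }")
print(prompt.startswith("<|system|>"))  # True
```

Keeping the template in one place avoids subtle mismatches (missing newlines, reordered constraints) that can degrade output quality.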
---

## Quick Inference Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Utkarsh524/codellama_utests_full_new_ver9"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = """<|system|>
Generate unit tests for C/C++ code. Cover all edge cases, boundary conditions, and error scenarios.
Output Constraints:
1. ONLY include test code (no explanations, headers, or main functions)
2. Start directly with TEST(...)
3. End after last test case
4. Never include framework boilerplate

<|user|>
Write test cases for the following C/C++ code:
int add(int a, int b) { return a + b; }

<|assistant|>
"""

inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,
    max_length=4096,
).to(model.device)

# temperature/top_p only take effect when sampling is enabled
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.3,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True).split("<|assistant|>")[-1].strip())
```

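Despite the output constraints, decoded text can occasionally contain stray prose before the first `TEST(...)` block. A small post-processing step can trim it; `extract_tests` below is a hypothetical cleanup helper, not part of the model's API.

```python
import re

def extract_tests(generated: str) -> str:
    """Keep only the text from the first TEST(/TEST_F(/TEST_P( onward (hypothetical helper)."""
    match = re.search(r"\bTEST(?:_F|_P)?\s*\(", generated)
    return generated[match.start():].strip() if match else ""

raw = "Sure, here are the tests:\nTEST(AddTest, HandlesPositives) {\n  EXPECT_EQ(add(2, 3), 5);\n}"
print(extract_tests(raw).startswith("TEST(AddTest"))  # True
```

If nothing matches, the helper returns an empty string, which is a useful signal to retry generation with a lower temperature.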
---

## Training & Optimization Details

| Step | Description |
|---------------------|-----------------------------------------------------------------------------|
| **Dataset** | athrv/Embedded_Unittest2 (filtered for valid code-test pairs) |
| **Preprocessing** | Token length filtering (≤4096), special token injection |
| **Quantization** | 8-bit (BitsAndBytesConfig), llm_int8_threshold=6.0 |
| **LoRA Config** | r=64, alpha=32, dropout=0.1 on q_proj/v_proj/k_proj/o_proj |
| **Training** | 4 epochs, batch=4 (effective 8), lr=2e-4, FP16 |
| **Optimization** | Paged AdamW 8-bit, gradient checkpointing, custom data collator |
| **Special Tokens** | Added `<\|system\|>`, `<\|user\|>`, `<\|assistant\|>` |

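The quantization and LoRA rows in the table above correspond to configuration objects along these lines. This is an illustrative sketch assembled from the listed hyperparameters, not the original training script.

```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 8-bit quantization as listed in the table
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)

# LoRA adapter settings as listed in the table
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
```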
---

## Tips for Best Results

- **Temperature:** 0.2–0.4
- **Top-p:** 0.85–0.95
- **Max New Tokens:** 256–512
- **Input Formatting:**
  - Include complete function signatures
  - Remove unnecessary comments
  - Keep functions under 200 lines
  - For long functions, split into logical units

---

## Feedback & Citation

**Dataset Credit:** `athrv/Embedded_Unittest2`
**Report Issues:** [Model's Hugging Face page](https://huggingface.co/Utkarsh524/codellama_utests_full_new_ver9)

**Maintainer:** Utkarsh524
**Model Version:** v9 (trained for 4 epochs)

---