---
license: apache-2.0
language: c++
tags:
- code-generation
- codellama
- peft
- unit-tests
- causal-lm
- text-generation
- embedded-systems
base_model: codellama/CodeLlama-7b-hf
model_type: llama
pipeline_tag: text-generation
---

# CodeLlama Embedded Test Generator (v9)

This repository hosts an **instruction-tuned CodeLlama-7B model** that generates production-grade C/C++ unit tests
for embedded systems. The model combines the base [codellama/CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf) model
with a custom LoRA adapter trained on a curated dataset of embedded software tests.

---

## Prompt Schema

```text
<|system|>
Generate unit tests for C/C++ code. Cover all edge cases, boundary conditions, and error scenarios.
Output Constraints:
1. ONLY include test code (no explanations, headers, or main functions)
2. Start directly with TEST(...)
3. End after last test case
4. Never include framework boilerplate

<|user|>
Write test cases for the following C/C++ code:
{your C/C++ function here}

<|assistant|>
```

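The schema can also be assembled programmatically. The sketch below is a minimal, hypothetical helper (the `SYSTEM_PROMPT` constant and `build_prompt` function are illustrative names, not shipped with the model) that wraps a C/C++ snippet in the tags above.

```python
# Hypothetical helper for building prompts that follow the schema above.
SYSTEM_PROMPT = (
    "Generate unit tests for C/C++ code. Cover all edge cases, boundary conditions, and error scenarios.\n"
    "Output Constraints:\n"
    "1. ONLY include test code (no explanations, headers, or main functions)\n"
    "2. Start directly with TEST(...)\n"
    "3. End after last test case\n"
    "4. Never include framework boilerplate"
)

def build_prompt(code: str) -> str:
    """Wrap a C/C++ function in the <|system|>/<|user|>/<|assistant|> tags."""
    return (
        f"<|system|>\n{SYSTEM_PROMPT}\n\n"
        f"<|user|>\nWrite test cases for the following C/C++ code:\n{code}\n\n"
        "<|assistant|>\n"
    )
```

Calling `build_prompt("int add(int a, int b) { return a + b; }")` yields the same prompt used in the Quick Inference Example below.
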

---

## Quick Inference Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Utkarsh524/codellama_utests_full_new_ver9"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

prompt = """<|system|>
Generate unit tests for C/C++ code. Cover all edge cases, boundary conditions, and error scenarios.
Output Constraints:
1. ONLY include test code (no explanations, headers, or main functions)
2. Start directly with TEST(...)
3. End after last test case
4. Never include framework boilerplate

<|user|>
Write test cases for the following C/C++ code:
int add(int a, int b) { return a + b; }

<|assistant|>
"""

# Tokenize the prompt (no padding needed for a single sequence).
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,
    max_length=4096
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # temperature/top_p only take effect when sampling is enabled
    temperature=0.3,
    top_p=0.9,
)

# Decode only the newly generated tokens (everything after the prompt).
generated = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(generated, skip_special_tokens=True).strip())
```
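The prompt constraints ask the model to emit only `TEST(...)` blocks, but sampling can still produce stray text around them. The helper below is an optional, hypothetical post-processing step (the `extract_tests` name is mine) that trims a generation to the span between the first `TEST(` and the last closing brace.

```python
def extract_tests(generated: str) -> str:
    """Trim generated text to the span from the first TEST( to the last closing brace."""
    start = generated.find("TEST(")
    if start == -1:
        return generated.strip()  # no test block found; return the text unchanged
    end = generated.rfind("}")
    return generated[start:end + 1] if end > start else generated[start:].strip()

# Usage with the decoded output from the example above (hypothetical):
# tests_only = extract_tests(tokenizer.decode(generated, skip_special_tokens=True))
```
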

---

## Training & Optimization Details

| Step | Description |
|---------------------|-----------------------------------------------------------------------------|
| **Dataset** | athrv/Embedded_Unittest2 (filtered for valid code-test pairs) |
| **Preprocessing** | Token length filtering (≤4096), special token injection |
| **Quantization** | 8-bit (BitsAndBytesConfig), llm_int8_threshold=6.0 |
| **LoRA Config** | r=64, alpha=32, dropout=0.1 on q_proj/v_proj/k_proj/o_proj |
| **Training** | 4 epochs, batch=4 (effective 8), lr=2e-4, FP16 |
| **Optimization** | Paged AdamW 8-bit, gradient checkpointing, custom data collator |
| **Special Tokens** | Added `<|system|>`, `<|user|>`, `<|assistant|>` |

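For readers who want to reproduce a comparable setup, the sketch below expresses the table's quantization, LoRA, and trainer settings with `transformers`, `peft`, and `bitsandbytes`. It is an approximation, not the actual training script: the gradient-accumulation value (2, to reach the effective batch of 8), the use of `additional_special_tokens`, and the output directory name are assumptions, and the dataset filtering plus the custom data collator are omitted.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_id = "codellama/CodeLlama-7b-hf"

# 8-bit quantization as listed in the table.
bnb_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=6.0)

tokenizer = AutoTokenizer.from_pretrained(base_id)
# Assumption: the chat tags were registered as additional special tokens.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|system|>", "<|user|>", "<|assistant|>"]}
)

model = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)
model.resize_token_embeddings(len(tokenizer))   # account for the new tokens
model = prepare_model_for_kbit_training(model)  # prepares the 8-bit model for PEFT training

# LoRA on the attention projections, matching the table.
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Trainer settings approximating the table (effective batch 8 = 4 x 2 accumulation).
training_args = TrainingArguments(
    output_dir="codellama-utests",   # assumed name
    num_train_epochs=4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    fp16=True,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    logging_steps=10,
)
```
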

---

## Tips for Best Results

- **Temperature:** 0.2–0.4
- **Top-p:** 0.85–0.95
- **Max New Tokens:** 256–512 (see the snippet after this list)
- **Input Formatting:**
  - Include complete function signatures
  - Remove unnecessary comments
  - Keep functions under 200 lines
  - For long functions, split into logical units

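As a convenience, the decoding settings above can be packaged in a `GenerationConfig`. The values below (temperature 0.3, top_p 0.9, 384 new tokens) are one arbitrary point inside the recommended ranges, and the snippet assumes the `model` and `inputs` objects from the Quick Inference Example.

```python
from transformers import GenerationConfig

# One point inside the recommended ranges; adjust within 0.2-0.4 / 0.85-0.95 / 256-512.
gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.3,
    top_p=0.9,
    max_new_tokens=384,
)

# Reuses `model` and `inputs` from the Quick Inference Example above.
outputs = model.generate(**inputs, generation_config=gen_config)
```
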

---

## Feedback & Citation

**Dataset Credit:** `athrv/Embedded_Unittest2`

**Report Issues:** [Model's Hugging Face page](https://huggingface.co/Utkarsh524/codellama_utests_full_new_ver9)

**Maintainer:** Utkarsh524

**Model Version:** v9 (trained for 4 epochs)

---