---
datasets:
- bigcode/the-stack
- bigcode/the-stack-v2
- bigcode/starcoderdata
- bigcode/commitpack
- codeparrot/github-code
- nvidia/OpenCodeReasoning
library_name: transformers
tags:
- code
license: mit
pipeline_tag: text-generation
---

# Spec Coder V1

**Spec Coder** is an open-source AI model for everyday coding tasks. It is built on the **Llama architecture**, so it can be run with standard tooling such as **llama.cpp** and **Ollama**, making it straightforward to deploy both locally and in the cloud.
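
As a quick illustration of the Ollama route, the sketch below queries the model through Ollama's Python client. It assumes you have already converted the weights to GGUF and registered them locally under the tag `spec-coder`; that tag is hypothetical, not an official name.

```python
# Hypothetical sketch: "spec-coder" is an assumed local Ollama tag created
# from a GGUF conversion of the model; it is not an official model name.
import ollama

response = ollama.generate(
    model="spec-coder",
    prompt="Write a Python function that reverses a string.",
)
print(response["response"])
```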

Trained on large code corpora, **Spec Coder** generates code, completes snippets, and understands programming tasks across multiple languages. It can be used for code completion, debugging, and automated code generation, serving as a versatile assistant for developers.

**Spec Coder** is designed for integration into developer tools, providing coding assistance and supporting research on programming languages. It is a 4-billion-parameter transformer, small enough to run efficiently across a range of environments.

The model can also be further adapted, for example with supervised fine-tuning (SFT) or reinforcement learning (RL), to improve its performance on specific programming tasks.
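
As a rough sketch of the SFT path, the example below fine-tunes the model with the standard Hugging Face `Trainer` on a plain-text file. The file name `train.txt` and all hyperparameters are illustrative assumptions, not values prescribed by this model card.

```python
# Minimal SFT sketch. Assumptions: a local plain-text corpus "train.txt"
# and illustrative hyperparameters; adjust both for real training runs.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "SVECTOR-CORPORATION/Spec-Coder-4b-V1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load a toy text corpus; each line becomes one training example.
dataset = load_dataset("text", data_files={"train": "train.txt"})

def tokenize(batch):
    # Truncate to the model's 8,192-token context window.
    return tokenizer(batch["text"], truncation=True, max_length=8192)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="spec-coder-sft", num_train_epochs=1),
    train_dataset=tokenized,
    # Causal LM collator: labels are the inputs (shifted inside the model).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```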
# Training Data
- Total Training Tokens: ~4.3 trillion
- Corpus: The Stack, StarCoder Training Dataset, The Stack v2, CommitPack, English Wikipedia

# Training Details
- Context Window: 8,192 tokens
- Optimization: Standard language modeling objective (see the sketch after this list)
- Hardware: Cluster of 5× RTX 4090 GPUs
- Training Duration: ~140 days (about 4.5 months)
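
For clarity, the standard language modeling objective is next-token prediction, which in Hugging Face Transformers amounts to passing the inputs as their own labels; the model shifts them internally. A minimal sketch:

```python
# Minimal sketch of the causal language modeling loss.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "SVECTOR-CORPORATION/Spec-Coder-4b-V1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

batch = tokenizer("def add(a, b):\n    return a + b", return_tensors="pt")
# Labels are the input ids; the model computes next-token cross-entropy.
outputs = model(**batch, labels=batch["input_ids"])
print(outputs.loss)
```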
# Benchmarks
## RepoBench 1.1 (Python)
| Model            | 2k     | 4k     | 8k     | 12k    | 16k    | Avg    | Avg ≤ 8k |
|------------------|--------|--------|--------|--------|--------|--------|----------|
| Spec-Coder-4b-V1 | 30.42% | 38.55% | 36.91% | 32.75% | 30.34% | 34.59% | 36.23%   |

## Syntax-Aware Fill-in-the-Middle (SAFIM)
| Model            | Algorithmic | Control | API    | Average |
|------------------|-------------|---------|--------|---------|
| Spec-Coder-4b-V1 | 38.22%      | 41.18%  | 60.45% | 46.28%  |

## HumanEval Infilling
| Model            | Single-Line | Multi-Line | Random Span |
|------------------|-------------|------------|-------------|
| Spec-Coder-4b-V1 | 72.34%      | 45.65%     | 39.12%      |

# Limitations
- **Biases**: The model may reflect biases present in the public codebases it was trained on.
- **Security**: Generated code may contain security vulnerabilities; always review and audit it before use.

# Sample Usage
Here are examples of how to run and interact with **Spec Coder**:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "SVECTOR-CORPORATION/Spec-Coder-4b-V1"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prompt with the start of a function and let the model complete it.
input_code = "def factorial(n):\n    if n == 0:"

inputs = tokenizer(input_code, return_tensors="pt")

# Pass the attention mask along with the input ids, and bound the number
# of newly generated tokens rather than the total sequence length.
outputs = model.generate(**inputs, max_new_tokens=50, num_return_sequences=1)

generated_code = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Generated Python code:\n", generated_code)
```
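
The high-level `pipeline` API offers a shorter route to the same result; the `max_new_tokens` value below is illustrative:

```python
from transformers import pipeline

# Build a text-generation pipeline from the same checkpoint.
generator = pipeline("text-generation", model="SVECTOR-CORPORATION/Spec-Coder-4b-V1")

result = generator("def factorial(n):\n    if n == 0:", max_new_tokens=50)
print(result[0]["generated_text"])
```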