SVECTOR-OFFICIAL committed on
Commit 7b20ef6 · verified · 1 Parent(s): fd7a59c

Update README.md

Files changed (1): README.md (+74, -3)
README.md CHANGED
@@ -1,3 +1,74 @@

Removed:
- ---
- license: mit
- ---

Added:
---
datasets:
- bigcode/the-stack
- bigcode/the-stack-v2
- bigcode/starcoderdata
- bigcode/commitpack
- codeparrot/github-code
- nvidia/OpenCodeReasoning
library_name: transformers
tags:
- code
license: mit
pipeline_tag: text-generation
---

# Spec Coder V1
**Spec Coder** is an open-source AI model designed to assist with fundamental coding tasks. It is built on the **Llama architecture**, so it can be run through tools such as **llama.cpp** and **Ollama**, which makes it easy to deploy both locally and in the cloud.
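
For local inference, here is a minimal sketch using the llama-cpp-python bindings. It assumes the model has first been converted to GGUF with llama.cpp's conversion tooling; the file name below is a hypothetical placeholder, not a published artifact:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./spec-coder-4b-v1.Q4_K_M.gguf",  # hypothetical local GGUF conversion
    n_ctx=8192,  # matches the model's 8,192-token context window
)

result = llm(
    "def fibonacci(n):",
    max_tokens=64,   # bound the length of the completion
    stop=["\n\n"],   # stop at the first blank line
)
print(result["choices"][0]["text"])
```

The same GGUF file can also be registered with Ollama via a Modelfile whose `FROM` line points at it.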

Trained on vast datasets, **Spec Coder** excels in generating code, completing code snippets, and understanding programming tasks across multiple languages. It can be used for code completion, debugging, and automated code generation, acting as a versatile assistant for developers.

**Spec Coder** is optimized for integration into developer tools, providing intelligent coding assistance and facilitating research in programming languages. Its transformer-based architecture, with 4 billion parameters, allows it to perform tasks efficiently across different environments.

The model supports further adaptation, including supervised fine-tuning (SFT) and reinforcement learning (RL), to improve its performance on specific programming tasks.
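
As one illustration of the SFT path, below is a minimal causal-LM fine-tuning sketch using the Hugging Face `Trainer`. The dataset, sequence length, and hyperparameters are placeholders chosen for the example, not the recipe used to train Spec Coder:

```python
# Minimal SFT sketch. Dataset, max_length, and hyperparameters are
# illustrative placeholders, not Spec Coder's training configuration.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "SVECTOR-CORPORATION/Spec-Coder-4b-V1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # causal LMs often lack a pad token

# Any dataset with a text column works; the-stack-smol is used here purely
# as a small public example (its code lives in the "content" column).
dataset = load_dataset("bigcode/the-stack-smol", split="train[:1000]")

def tokenize(batch):
    return tokenizer(batch["content"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# mlm=False gives the standard (causal) language modeling objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="spec-coder-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-5,
    logging_steps=10,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```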

# Training Data
- Total Training Tokens: ~4.3 trillion tokens
- Corpus: The Stack, StarCoder Training Dataset, The Stack v2, CommitPack, English Wikipedia

# Training Details
- Context Window: 8,192 tokens
- Optimization: Standard language modeling objective
- Hardware: Cluster of 5 × RTX 4090 GPUs
- Training Duration: ~140 days

# Benchmarks
## RepoBench 1.1 (Python)
| Model            | 2k     | 4k     | 8k     | 12k    | 16k    | Avg    | Avg ≤ 8k |
|------------------|--------|--------|--------|--------|--------|--------|----------|
| Spec-Coder-4b-V1 | 30.42% | 38.55% | 36.91% | 32.75% | 30.34% | 34.59% | 36.23%   |

## Syntax-Aware Fill-in-the-Middle (SAFIM)
| Model            | Algorithmic | Control | API    | Average |
|------------------|-------------|---------|--------|---------|
| Spec-Coder-4b-V1 | 38.22%      | 41.18%  | 60.45% | 46.28%  |

## HumanEval Infilling
| Model            | Single-Line | Multi-Line | Random Span |
|------------------|-------------|------------|-------------|
| Spec-Coder-4b-V1 | 72.34%      | 45.65%     | 39.12%      |

# Limitations
- **Biases**: The model may reflect biases present in the public codebases it was trained on.
- **Security**: Generated code may contain security vulnerabilities; always verify and audit it for potential risks before use.

# Sample Usage
Here is a basic example of running **Spec Coder** with Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "SVECTOR-CORPORATION/Spec-Coder-4b-V1"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prompt the model with the start of a function to complete.
input_code = "def factorial(n):\n    if n == 0:"

inputs = tokenizer(input_code, return_tensors="pt")

# max_new_tokens bounds the completion length regardless of prompt length
# (max_length would count the prompt tokens as well). Passing **inputs also
# forwards the attention mask.
outputs = model.generate(**inputs, max_new_tokens=50, num_return_sequences=1)

generated_code = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Generated Python code:\n", generated_code)
```
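
The SAFIM and HumanEval Infilling results above imply fill-in-the-middle support, but the sentinel tokens this model uses are not documented in this card. The sketch below, continuing from the example above, assumes StarCoder-style `<fim_prefix>`/`<fim_suffix>`/`<fim_middle>` tokens; inspect `tokenizer.additional_special_tokens` to confirm what the model actually expects before relying on this:

```python
# Hypothetical fill-in-the-middle prompt, reusing `model` and `tokenizer`
# from the example above. The <fim_*> sentinels are assumed from
# StarCoder-style models and are NOT confirmed for Spec Coder.
prefix = "def average(xs):\n    total = "
suffix = "\n    return total / len(xs)"
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)

# Decode only the newly generated tokens (the infilled middle).
middle = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print("Infilled middle:", middle)
```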