md-nishat-008 committed on
Commit 44930f6 · verified · 1 Parent(s): 7bc771f

Update README.md

Files changed (1)
  1. README.md +221 -3
README.md CHANGED
@@ -1,3 +1,221 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ library_name: transformers
4
+ datasets:
5
+ - md-nishat-008/Mojo-Corpus
6
+ - md-nishat-008/Mojo-SFT
7
+ - md-nishat-008/Mojo-mSFT
8
+ pipeline_tag: text-generation
9
+ ---
10
+
11
+ <div align="center">
12
+ <h1>🔥 Mojo-Coder 🔥</h1>
13
+ <em>State-of-the-art Language Model for Mojo Programming</em>
14
+ </div>
15
+
16
+
17
+ <div align="center">
18
+ <table><tr>
19
+ <td><a href="https://arxiv.org/abs/2410.17736"><img src="https://img.shields.io/badge/arXiv-Read_Paper-blue?style=for-the-badge&logo=arxiv" /></a></td>
20
+ <td><a href="mailto:[email protected]"><img src="https://img.shields.io/badge/Email-Contact_Us-blue?style=for-the-badge&logo=gmail" /></a></td>
21
+ </tr></table>
22
+ </div>
23
+
24
+
25
+
26
+
27
+ <div align="center">
28
+ <h2>🎯 Background and Motivation</h2>
29
+ </div>
30
+
31
+ The Mojo programming language, developed by Modular, has emerged as a notable technology for high-performance computing and AI development. Despite its growing popularity and reported speedups of up to 68,000x over Python, existing LLMs struggle with Mojo code generation. Mojo-Coder addresses this gap by providing specialized support for Mojo programming, built on the architecture of [CodeGemma-7B-IT](https://huggingface.co/google/codegemma-7b-it/).
32
+
33
+ <div align="center">
34
+ <h2>🤖 Model Information</h2>
35
+ </div>
36
+
37
+ Mojo-Coder transforms natural language instructions into optimized Mojo code. The multilingual variant accepts instructions in five languages (English, German, French, Spanish, and Bangla) while maintaining high-quality code generation.
38
+
39
+ <div align="center">
40
+ <h2>📝 Description</h2>
41
+ </div>
42
+
43
+ The Mojo-Coder family consists of three specialized 7B-parameter models, each built on CodeGemma's architecture:
44
+
45
+ | | [mojo-coder](https://huggingface.co/md-nishat-008/mojo-coder) | [mojo-coder-it](https://huggingface.co/md-nishat-008/mojo-coder-it) | [**mojo-coder-it-m**](https://huggingface.co/md-nishat-008/mojo-coder-it-m) |
46
+ |---------------------------|:---:|:---:|:---:|
47
+ | 🔄 Code Completion | ✅ | ✅ | ✅ |
48
+ | 💡 NL → Code Generation | | ✅ | ✅ |
49
+ | 🌏 Multilingual Support | | | ✅ |
50
+ | 📝 Instruction Following | | ✅ | ✅ |
51
+
52
+ <div align="center">
53
+ <h2>πŸš€ Sample Usage</h2>
54
+ </div>
55
+
56
+ Choose the model that best fits your needs:
57
+ - For basic Mojo code completion: [mojo-coder](https://huggingface.co/md-nishat-008/mojo-coder)
58
+ - For English instruction-based code generation: [mojo-coder-it](https://huggingface.co/md-nishat-008/mojo-coder-it)
59
+ - For multilingual support: [mojo-coder-it-m](https://huggingface.co/md-nishat-008/mojo-coder-it-m)
60
+
61
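+ Notably, our instruction-tuned models outperform current state-of-the-art models, including GPT-4o (25.5%) and Claude-3.5-Sonnet (39.8%), on the HumanEval-Mojo benchmark: mojo-coder-it reaches 66.4% and mojo-coder-it-m reaches 61.5% Pass@1. The base mojo-coder model (36.7%) outperforms GPT-4o but not Claude-3.5-Sonnet.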
62
+
63
+ <div align="center">
64
+ <h3>✨ Let's revolutionize Mojo programming together! ✨</h3>
65
+ </div>
66
+
67
+
68
+
69
+
70
+ <div style="color: red; text-align: center; padding: 10px; margin: 20px 0; border: 2px solid red; border-radius: 5px;">
71
+ <strong>⚠️ IMPORTANT: When using the model, you MUST explicitly mention "Mojo" in your prompts (e.g., "Write a Mojo function to...", "Create Mojo code that..."); otherwise the model may not generate Mojo code!</strong>
72
+ </div>
73
+
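Because the models rely on the word "Mojo" appearing in the prompt, a small guard can catch prompts that omit it before they reach the model. The helper below is a minimal sketch: the function name `ensure_mojo_mentioned` and the `"In Mojo: "` prefix are illustrative assumptions, not part of the released models or of any library.

```python
import re


def ensure_mojo_mentioned(prompt: str) -> str:
    """Return the prompt unchanged if it names Mojo, else prepend a hint.

    Hypothetical helper: the "In Mojo: " prefix is an assumed wording,
    not something prescribed by the model authors.
    """
    # Case-insensitive whole-word match, so "mojo" counts but "mojolicious" does not.
    if re.search(r"\bmojo\b", prompt, flags=re.IGNORECASE):
        return prompt
    return "In Mojo: " + prompt


print(ensure_mojo_mentioned("Write a function that reverses a string"))
# In Mojo: Write a function that reverses a string
```

Rewording the instruction by hand so that it names Mojo naturally is likely to work at least as well as a fixed prefix; the guard is only a safety net.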
74
+ #### For Code Generation
75
+
76
+ ```python
77
+ from transformers import AutoTokenizer, AutoModelForCausalLM
78
+
79
+ tokenizer = AutoTokenizer.from_pretrained("md-nishat-008/Mojo-Coder-it")
80
+ model = AutoModelForCausalLM.from_pretrained("md-nishat-008/Mojo-Coder-it")
81
+
82
+ input_text = "Write me a Mojo function to calculate the nth fibonacci number."
83
+ input_ids = tokenizer(input_text, return_tensors="pt")
84
+
85
+ outputs = model.generate(**input_ids, max_new_tokens=256)  # the default length cap is short; raise it as needed
86
+ print(tokenizer.decode(outputs[0]))
87
+ ```
88
+
89
+ #### Chat Template
90
+
91
+ The instruction-tuned models use a chat template that must be adhered to for conversational use.
92
+ The easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet.
93
+
94
+ Let's load the model and apply the chat template to a conversation. In this example, we'll start with a single user interaction:
95
+
96
+ ```py
97
+ from transformers import AutoModelForCausalLM, AutoTokenizer
98
+ import torch
99
+
100
+ tokenizer = AutoTokenizer.from_pretrained("md-nishat-008/Mojo-Coder-it")
101
+ model = AutoModelForCausalLM.from_pretrained("md-nishat-008/Mojo-Coder-it")
102
+
103
+ chat = [{"role": "user", "content": "Write a function that calculates factorial of a number in Mojo"}]
104
+ inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)  # keep inputs on the model's device
105
+
106
+ with torch.no_grad():
107
+     outputs = model.generate(
108
+         inputs=inputs,
109
+         max_new_tokens=1000,
110
+         do_sample=True, temperature=0.7,  # sampling must be enabled for temperature and top_p to take effect
111
+         top_p=0.95,
112
+         pad_token_id=tokenizer.eos_token_id,
113
+         eos_token_id=tokenizer.eos_token_id
114
+     )
115
+
116
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
117
+ ```
118
+
119
+ At this point, the prompt contains the following text:
120
+
121
+ ```
122
+ <bos><start_of_turn>user
123
+ Write a function that calculates factorial of a number in Mojo<end_of_turn>
124
+ <start_of_turn>model
125
+ ```
126
+
127
+ As you can see, each turn is preceded by a `<start_of_turn>` delimiter and then the role of the entity
128
+ (either `user`, for content supplied by the user, or `model` for LLM responses). Turns finish with
129
+ the `<end_of_turn>` token.
130
+
131
+ You can follow this format to build the prompt manually, if you need to do it without the tokenizer's
132
+ chat template.
133
+
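As a rough illustration of building the prompt by hand, the sketch below reproduces the turn layout described above in plain Python. The function name `build_gemma_prompt` is hypothetical, and the mapping of an `assistant` role onto `model` is an assumption made for convenience; the tokenizer's own chat template remains the reference.

```python
def build_gemma_prompt(messages):
    """Format chat messages into the <start_of_turn>/<end_of_turn> layout.

    `messages` is a list of {"role": ..., "content": ...} dicts, the same
    shape that the tokenizer's chat template accepts.
    """
    parts = ["<bos>"]
    for message in messages:
        # The template names LLM turns "model"; map "assistant" onto it (assumption).
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n")
    # Open a model turn so generation continues as the model's reply.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


prompt = build_gemma_prompt(
    [{"role": "user", "content": "Write a Mojo function that reverses a string"}]
)
print(prompt)
```

Since the string already begins with `<bos>`, it should be tokenized with `add_special_tokens=False` so the token is not added twice.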
134
+ After the prompt is ready, generation can be performed like this:
135
+
136
+ ```py
137
+ inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")  # prompt is the manually built string, which already starts with <bos>
138
+ outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
139
+ ```
140
+
141
+ <div align="center">
142
+ <h2>⚙️ Inputs and Outputs</h2>
143
+ </div>
144
+
145
+ **Inputs**:
146
+ - For base model (mojo-coder): code prefix and/or suffix for Mojo code completion
147
+ - For instruction-tuned models (mojo-coder-it & mojo-coder-it-m): natural language prompts/instructions
148
+
149
+ <p style="color: red;"><strong>Note: In prompts, you must explicitly mention "Mojo" (e.g., "Write a Mojo function to...", "Write Mojo code to...") otherwise the models may not generate Mojo code.</strong></p>
150
+
151
+ **Outputs**:
152
+ - For all variants: Mojo code snippets and natural language responses
153
+ - Additional explanations and documentation when requested
154
+
155
+ <div align="center">
156
+ <h2>📚 Model Data</h2>
157
+ </div>
158
+
159
+ ### Training Dataset
160
+
161
+ Using [CodeGemma-7B-IT](https://huggingface.co/google/codegemma-7b-it/) as our base model, we further trained on:
162
+ - [Mojo-Corpus](https://huggingface.co/datasets/md-nishat-008/Mojo_Corpus): 6.5M tokens of curated Mojo code from public repositories
163
+ - [Mojo-SFT](https://huggingface.co/datasets/md-nishat-008/Mojo_SFT): 3,200 instruction-code pairs for English
164
+ - [Mojo-mSFT](https://huggingface.co/datasets/md-nishat-008/Mojo_mSFT): Multilingual instruction-code pairs in 5 languages
165
+
166
+ ### Training Data Processing
167
+
168
+ The following data pre-processing techniques were applied:
169
+ - Rigorous filtering pipeline (F1-F6) to ensure code quality
170
+ - Apache 2.0 license compliance
171
+ - Language detection using fastText
172
+ - Duplicate removal and content validation
173
+ - Expert review for instruction-code pairs
174
+
175
+ <div align="center">
176
+ <h2>📊 Evaluation Information</h2>
177
+ </div>
178
+
179
+ ### Evaluation Approach
180
+
181
+ We evaluate Mojo-Coder on:
182
+ - [HumanEval-Mojo](https://huggingface.co/datasets/md-nishat-008/HumanEval-Mojo): First benchmark for Mojo code generation
183
+ - Multi-language instruction following
184
+ - Code quality and execution success
185
+
186
+ ### Evaluation Results
187
+
188
+ #### Code Generation Benchmarks (Pass@1)
189
+
190
+ | Model | HumanEval-Mojo |
191
+ |-------|----------------|
192
+ | GPT-4o | 25.5% |
193
+ | Claude-3.5-Sonnet | 39.8% |
194
+ | mojo-coder | 36.7% |
195
+ | mojo-coder-it-m | 61.5% |
196
+ | mojo-coder-it | 66.4% |
197
+
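This README does not state how Pass@1 was computed. As a point of reference only, the sketch below shows the unbiased pass@k estimator that is commonly used with HumanEval-style benchmarks; treating it as the method behind the table above is an assumption. The function names and the `(n, c)` result format are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c of them pass."""
    # If fewer than k samples fail, every size-k subset contains a passing sample.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Average per-problem pass@k over a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Four hypothetical problems, one sample each, two of which pass.
print(benchmark_pass_at_k([(1, 1), (1, 0), (1, 1), (1, 0)]))  # 0.5
```

With a single sample per problem, pass@1 reduces to the fraction of problems whose generated solution passes all tests.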
198
+ <div align="center">
199
+ <h2>⚠️ Limitations and Usage</h2>
200
+ </div>
201
+
202
+ ### Intended Usage
203
+ - Mojo code completion and generation
204
+ - Multi-language instruction following
205
+ - Code documentation and explanation
206
+ - Educational support for Mojo programming
207
+
208
+ ### Known Limitations
209
+ - Limited to Mojo programming language
210
+ - Requires explicit mention of "Mojo" in prompts
211
+ - Performance may vary with complex algorithms
212
+ - May occasionally generate Python-like syntax
213
+ - Based on data available up to 2024
214
+
215
+ ### Ethical Considerations
216
+ The model is designed for:
217
+ - Educational and development purposes
218
+ - Open-source contribution to Mojo ecosystem
219
+ - Supporting multilingual access to Mojo programming
220
+
221
+ Code should be reviewed and tested before production use, especially for performance-critical applications.