---
datasets:
- GetSoloTech/Code-Reasoning
language:
- en
base_model:
- GetSoloTech/GPT-OSS-Code-Reasoning-20B
pipeline_tag: text-generation
tags:
- coding
- reasoning
- problem-solving
- algorithms
- python
- c++
- code-reasoning
- competitive-programming
---
# GPT-OSS-Code-Reasoning-20B-GGUF
<img src="gpt-oss-reasoning.png" width="700"/>
This is the GGUF quantized version of the [GPT-OSS-Code-Reasoning-20B](https://huggingface.co/GetSoloTech/GPT-OSS-Code-Reasoning-20B) model, optimized for efficient inference with reduced memory requirements.
## Overview
- **Base model**: `openai/gpt-oss-20b`
- **Objective**: Supervised fine-tuning for competitive programming and algorithmic reasoning
- **Format**: GGUF (optimized for llama.cpp and compatible inference engines)
## Model Variants
This GGUF model is available in multiple quantization levels to suit different hardware requirements:
| Quantization | Size | Memory Usage | Quality |
|--------------|------|--------------|---------|
| Q3_K_M | 12.9 GB | ~13 GB | Average |
| Q4_K_M | 15.8 GB | ~16 GB | Good |
| Q5_K_M | 16.9 GB | ~17 GB | Better |
| Q8_0 | 22.3 GB | ~23 GB | Best |
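
To fetch a specific variant programmatically, the file can be pulled with `huggingface_hub`. A minimal sketch; the Q4_K_M filename matches the download URL in the Quick Start below, and the other variants are assumed to follow the same naming pattern:

```python
from huggingface_hub import hf_hub_download

# Download one quantization variant; returns the local cache path.
model_path = hf_hub_download(
    repo_id="GetSoloTech/GPT-OSS-Code-Reasoning-20B-GGUF",
    filename="gpt-oss-code-reasoning-20b.Q4_K_M.gguf",  # swap for Q3_K_M / Q5_K_M / Q8_0 as needed
)
print(model_path)  # pass this path to your inference engine
```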
## Intended Use
- **Intended**: Generating Python/C++ solutions and reasoning for competitive programming tasks
- **Out of scope**: Safety-critical applications; the model may hallucinate or produce incorrect or inefficient code
## Quick Start
### Using llama.cpp
```bash
# Download the model
wget https://huggingface.co/GetSoloTech/GPT-OSS-Code-Reasoning-20B-GGUF/resolve/main/gpt-oss-code-reasoning-20b.Q4_K_M.gguf

# Run inference (llama-cli is the CLI binary produced by a llama.cpp build)
./llama-cli -m gpt-oss-code-reasoning-20b.Q4_K_M.gguf -p "Write a Python function that returns the indices of two numbers summing to a target." -n 512 --repeat-penalty 1.1
```
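If llama.cpp was built with GPU support, adding `-ngl 99` (`--n-gpu-layers`) offloads transformer layers to the GPU and substantially speeds up generation; use a smaller count if VRAM is limited.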
### Using Python with llama-cpp-python
```python
from llama_cpp import Llama
# Load the model
llm = Llama(
    model_path="./gpt-oss-code-reasoning-20b.Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=8,
)
# Example problem
problem_text = """
You are given an array of integers nums and an integer target.
Return indices of the two numbers such that they add up to target.
"""
# Create the prompt
prompt = f"""<|im_start|>system
You are an expert competitive programmer. Read the problem and produce a correct, efficient solution. Include reasoning if helpful.
<|im_end|>
<|im_start|>user
{problem_text}
<|im_end|>
<|im_start|>assistant
"""
# Generate response
output = llm(
    prompt,
    max_tokens=768,
    temperature=0.3,
    top_p=0.9,
    repeat_penalty=1.1,
    stop=["<|im_end|>"],
)
print(output['choices'][0]['text'])
```
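If llama-cpp-python was installed with GPU support (CUDA or Metal), layers can be offloaded at load time. A minimal sketch, assuming a build with offload enabled:

```python
# Same model, with all layers offloaded to the GPU.
llm_gpu = Llama(
    model_path="./gpt-oss-code-reasoning-20b.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # -1 offloads every layer; lower this if VRAM is tight
)
```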
### Using Ollama
```bash
# Create a Modelfile
cat > Modelfile << EOF
FROM ./gpt-oss-code-reasoning-20b.Q4_K_M.gguf
TEMPLATE """<|im_start|>system
{{ .System }}
<|im_end|>
<|im_start|>user
{{ .Prompt }}
<|im_end|>
<|im_start|>assistant
"""
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
EOF
# Create and run the model
ollama create code-reasoning -f Modelfile
ollama run code-reasoning "Solve this competitive programming problem: [your problem here]"
```
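Once created, the model can also be queried over Ollama's local REST API. A minimal sketch, assuming the default server on `localhost:11434`:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "code-reasoning",
        "prompt": "Solve this competitive programming problem: ...",
        "stream": False,  # return one JSON object instead of a token stream
    },
)
print(resp.json()["response"])
```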
## Prompt Format
This model was trained in a chat format. Recommended structure:
```python
messages = [
    {"role": "system", "content": "You are an expert competitive programmer. Read the problem and produce a correct, efficient solution. Include reasoning if helpful."},
    {"role": "user", "content": problem_text},
]
```
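With llama-cpp-python, these messages can go straight through the high-level chat API. A minimal sketch; the `chat_format="chatml"` argument is an assumption that forces the `<|im_start|>`/`<|im_end|>` template in case the GGUF metadata does not carry one:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./gpt-oss-code-reasoning-20b.Q4_K_M.gguf",
    n_ctx=4096,
    chat_format="chatml",  # assumption: force the ChatML template shown below
)
result = llm.create_chat_completion(messages=messages, max_tokens=768, temperature=0.3)
print(result["choices"][0]["message"]["content"])
```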
When building the prompt string by hand (as in the llama.cpp example above), use the following raw format:
```
<|im_start|>system
You are an expert competitive programmer. Read the problem and produce a correct, efficient solution. Include reasoning if helpful.
<|im_end|>
<|im_start|>user
{problem_text}
<|im_end|>
<|im_start|>assistant
```
## Generation Tips
- **Reasoning style**: Lower temperature (0.2–0.5) produces clearer step-by-step reasoning (applied in the sketch after this list)
- **Length**: Use `max_tokens` 512–1024 for full solutions; shorter for hints
- **Stop tokens**: The model uses `<|im_end|>` as a stop token
- **Memory optimization**: Choose the appropriate quantization level based on your hardware
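
A minimal sketch applying these settings with llama-cpp-python, reusing the `llm` and `prompt` from the Quick Start example:

```python
# Shorter budget for a hint; raise max_tokens to 512-1024 for a full solution.
output = llm(
    prompt,
    max_tokens=256,
    temperature=0.3,   # within the 0.2-0.5 range recommended above
    stop=["<|im_end|>"],
)
print(output["choices"][0]["text"])
```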
## Hardware Requirements
| Quantization | Minimum RAM | Recommended RAM | GPU VRAM |
|--------------|-------------|-----------------|----------|
| Q3_K_M | 8 GB | 16 GB | 8 GB |
| Q4_K_M | 12 GB | 24 GB | 12 GB |
| Q5_K_M | 16 GB | 32 GB | 16 GB |
| Q8_0 | 24 GB | 48 GB | 24 GB |
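
As a rough illustration, the table above can drive an automatic choice of variant. A hypothetical helper using `psutil`; the thresholds simply mirror the Recommended RAM column:

```python
import psutil

# (variant, recommended RAM in GB), largest first
QUANTS = [("Q8_0", 48), ("Q5_K_M", 32), ("Q4_K_M", 24), ("Q3_K_M", 16)]

def pick_quant() -> str:
    total_gb = psutil.virtual_memory().total / 1024**3
    for name, needed_gb in QUANTS:
        if total_gb >= needed_gb:
            return name
    return "Q3_K_M"  # fall back to the smallest variant

print(pick_quant())
```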
## Performance Notes
- **Speed**: GGUF enables fast CPU inference, with optional GPU offload in compatible engines
- **Memory**: Significantly reduced footprint compared to the original full-precision weights
- **Quality**: Q5_K_M and Q8_0 stay closest to the original model; Q3_K_M trades quality for size
- **Compatibility**: Works with llama.cpp, llama-cpp-python, Ollama, and other GGUF-compatible engines
## Acknowledgements
- Original model: [GetSoloTech/GPT-OSS-Code-Reasoning-20B](https://huggingface.co/GetSoloTech/GPT-OSS-Code-Reasoning-20B)
- Base model: `openai/gpt-oss-20b`
- Dataset: `nvidia/OpenCodeReasoning-2`
- Upstream benchmarks: TACO, APPS, DeepMind CodeContests, `open-r1/codeforces`