---
license: mit
datasets:
  - gretelai/synthetic_text_to_sql
base_model:
  - eagle0504/openai-gsm8k-codealpaca-20k-enhanced-deepseek-r1-distill-qwen-1.5b
library_name: transformers
---

# 🧠 eagle0504/qwen-distilled-scout-1.5b-gen2

This model is a fine-tuned version of [`deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B), enhanced with chain-of-thought (CoT) reasoning on a hybrid dataset combining GSM8K-style reasoning and structured `text-to-SQL` generation.

Fine-tuning was conducted using DeepSpeed on a multi-A100 GPU setup via RunPod for efficient training in memory-constrained environments. The training data includes synthetically generated SQL queries of varying logical complexity, paired with corresponding natural language prompts and CoT explanations.

An inference notebook is publicly available [here](https://colab.research.google.com/drive/10CJqyIAOd9QnEp0W8NN_SxdiOrFsBz0-?usp=sharing).

---

## 🧾 Model Details

- **Base Model:** [`deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
- **Language:** English
- **Architecture:** Causal Language Model (decoder-only)
- **Tokenizer:** `AutoTokenizer` from the base model
- **Parameter Count:** 1.5 billion
- **Training Framework:** 🤗 Transformers + DeepSpeed
- **Compute Environment:** RunPod (6x A100 SXM, 192 vCPU, 1.5 TB RAM)

---

## 🧪 Training Dataset

**Datasets Used:**

- [`gretelai/synthetic_text_to_sql`](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql)
- [`eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1`](https://huggingface.co/datasets/eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1)
- [`eagle0504/augmented_codealpaca-20k-using-together-ai-deepseek-v1`](https://huggingface.co/datasets/eagle0504/augmented_codealpaca-20k-using-together-ai-deepseek-v1)

The dataset contains structured training examples of the form:

```xml
<!-- natural language question (sql_prompt) -->
...
<!-- chain-of-thought explanation (sql_explanation) -->
...
<!-- final SQL query (sql) -->
...
```

Each example is constructed to include:

* A natural language question about tabular data (`sql_prompt`)
* An intermediate reasoning step (`sql_explanation`)
* The final SQL output (`sql`)

This format allows the model to internalize step-by-step logical reasoning for SQL generation.
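For illustration, here is a minimal sketch of how one such training example could be assembled from the `gretelai/synthetic_text_to_sql` fields. The wrapper tag names and the template layout below are assumptions for demonstration, not the exact preprocessing used during training:

```python
# Hypothetical sketch: build a CoT-style training string from one
# gretelai/synthetic_text_to_sql record. Tag names and layout are assumed.
from datasets import load_dataset

dataset = load_dataset("gretelai/synthetic_text_to_sql", split="train")


def to_training_text(row: dict) -> str:
    # Question, intermediate reasoning, and final SQL, in that order
    return (
        f"<question>{row['sql_prompt']}</question>\n"
        f"<think>{row['sql_explanation']}</think>\n"
        f"<answer>{row['sql']}</answer>"
    )


print(to_training_text(dataset[0]))
```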
---

## 🏗️ Training Configuration

Training was performed with the following configuration:

* **Batch Size:** 2 (with gradient accumulation steps = 4)
* **Epochs:** 15
* **Max Length:** 1024 tokens
* **Optimizer:** AdamW
* **Learning Rate:** 5e-5 (with warmup + linear decay)
* **Precision:** FP16
* **DeepSpeed Config:**
  * ZeRO (Zero Redundancy Optimizer) Stage 2
  * Gradient Clipping: 1.0
  * AllGather + ReduceScatter optimization
* **Checkpoint Saving:** Disabled to minimize disk usage

---

## 🧮 Evaluation Metric

The model is evaluated with a custom token-level accuracy metric:

* **Metric:** Mean token-level accuracy
* **Definition:** Accuracy over all non-masked tokens (`labels != -100`)
* **Implementation:** NumPy-based vectorized comparison between predicted tokens and ground-truth labels (see the illustrative sketch at the end of this card)

---

## 🚀 Use Case

The model is designed for **chain-of-thought text-to-SQL generation**, useful in:

* AI teaching agents
* Conversational agents with data-query capabilities
* Automatic SQL generation tools for tabular backends
* Educational applications in logical reasoning

---

## 📦 How to Use

```python
from transformers import StoppingCriteria, StoppingCriteriaList
import torch


class StopOnTokens(StoppingCriteria):
    def __init__(self, stop_token_ids: list):
        super().__init__()
        self.stop_token_ids = stop_token_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Stop when the most recently generated tokens match any stop token sequence
        return any(
            input_ids[0, -len(token):].tolist() == token
            for token in self.stop_token_ids
        )
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("eagle0504/qwen-distilled-scout-1.5b-gen2")
tokenizer = AutoTokenizer.from_pretrained("eagle0504/qwen-distilled-scout-1.5b-gen2")

# Example stop sequence ("</answer>" is an assumed placeholder; use whatever
# closing tag your prompt format ends on)
stop_sequence = "</answer>"
stop_ids = tokenizer.encode(stop_sequence, add_special_tokens=False)
stopping_criteria = StoppingCriteriaList([StopOnTokens([stop_ids])])

# Run generation with the stop sequence
inputs = tokenizer(
    "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    return_tensors="pt"
)
outputs = model.generate(
    **inputs,
    max_new_tokens=230,
    stopping_criteria=stopping_criteria
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## 📊 Limitations

* The model is tuned for text-to-SQL tasks with CoT supervision and may not generalize well to free-form text generation or other domains without additional fine-tuning.
* The maximum input length is 1024 tokens; longer contexts will be truncated.

---

## 🧑‍💻 Author

* **Name:** Yiqiao Yin
* **Hugging Face:** [eagle0504](https://huggingface.co/eagle0504)
* **Organization:** WYN AI / Independent AI Researcher

---

## 📝 Citation

If you use this model in your work, please cite:

```bibtex
@misc{yin2025enhanceddeepseek,
  title={Enhanced DeepSeek-R1-Distill-Qwen-1.5B Fine-tuned on GSM8K + CoT SQL},
  author={Yiqiao Yin},
  year={2025},
  howpublished={\url{https://huggingface.co/eagle0504/enhanced-deepseek-r1-distill-qwen-1.5b-finetuned-on-gsm8k-codealpaca20k-text2sql}},
}
```

---

## 📬 Contact

For questions or collaborations, reach out via [LinkedIn](https://www.linkedin.com/in/yiqiaoyin) or email: [eagle0504@gmail.com](mailto:eagle0504@gmail.com)
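---

## 🧮 Appendix: Token-Level Accuracy Sketch

The evaluation metric described above can be implemented as a vectorized NumPy comparison over all positions where `labels != -100`. The sketch below is illustrative only; it assumes predictions arrive as token IDs already aligned with the labels, and the exact `compute_metrics` wiring in the training script may differ:

```python
import numpy as np


def token_level_accuracy(predictions: np.ndarray, labels: np.ndarray) -> float:
    """Mean accuracy over all non-masked tokens (labels != -100)."""
    mask = labels != -100                      # ignore padded / masked positions
    correct = (predictions == labels) & mask   # element-wise match on valid tokens
    return float(correct.sum() / mask.sum())


# Tiny worked example: 5 of the 6 unmasked tokens match, so accuracy is ~0.833
labels = np.array([[12, 7, -100, 9], [4, -100, 11, 2]])
preds = np.array([[12, 7, 3, 9], [4, 5, 11, 8]])
print(token_level_accuracy(preds, labels))
```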