ThomasTheMaker committed · Commit 01ae771 · verified · 1 Parent(s): 4620167

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ deepseek-arch.png filter=lfs diff=lfs merge=lfs -text
+ deepseek_training_metrics.png filter=lfs diff=lfs merge=lfs -text
+ training_metrics.png filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2024 IdeaWeaver AI
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md ADDED
@@ -0,0 +1,160 @@
+ # DeepSeek-Children-Stories
+
+ A state-of-the-art DeepSeek-style model optimized for children's story generation, packing an advanced architecture into just ~15-18M parameters.
+
+ ## Architecture Highlights
+
+ ![DeepSeek Architecture](deepseek-arch.png)
+
+ - **Multihead Latent Attention (MLA)** - DeepSeek's efficient attention mechanism
+ - **Mixture of Experts (MoE)** - 4 experts with top-2 routing for increased capacity
+ - **Multi-token Prediction** - Predicts the next 2 tokens simultaneously for efficiency
+ - **Rotary Positional Encodings (RoPE)** - Better position understanding
+
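+ The sketch below shows how these features map onto `DeepSeekConfig` fields as defined in `src/model/deepseek.py`; the values shown are the committed defaults:
+
+ ```python
+ # Assumes src/ is on the Python path, as in src/generate.py
+ from model.deepseek import DeepSeek, DeepSeekConfig
+
+ config = DeepSeekConfig(
+     use_mla=True,           # Multihead Latent Attention instead of standard attention
+     mla_kv_heads=4,         # shared key-value heads used by MLA
+     moe_num_experts=4,      # number of MoE experts
+     moe_top_k=2,            # experts routed per token (top-2 routing)
+     multi_token_predict=2,  # predict the next 2 tokens per step
+ )
+ model = DeepSeek(config)
+ ```
+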
+ ## Model Specifications
+
+ - **Parameters**: ~15-18M (6 layers, 8 heads, 512 embedding dim)
+ - **Context Window**: 1024 tokens
+ - **Vocabulary**: GPT-2 compatible (50,257 tokens)
+ - **Training Data**: 2,000+ children's stories from Hugging Face
+
+ ## Hardware Used
+
+ Training was performed on the following hardware:
+
+ - **GPU**: NVIDIA RTX 4090 (24 GB VRAM)
+ - **RAM**: 41 GB
+ - **CPU**: 6 vCPU
+
+ ## Quick Start
+
+ ### Installation
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/ideaweaver-ai/DeepSeek-Children-Stories-15M-model.git
+ cd DeepSeek-Children-Stories-15M-model
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Set up the environment
+ chmod +x setup.sh
+ ./setup.sh
+ ```
+
+ ### Training
+
+ ```bash
+ # Start training
+ python src/run_training.py
+
+ # With custom parameters
+ python src/run_training.py --batch-size 8 --max-iters 10000 --learning-rate 6e-4
+ ```
+
+ ### Generation
+
+ ```bash
+ # Generate stories
+ python src/generate.py --prompt "Once upon a time, there was a brave little mouse"
+
+ # With custom parameters
+ python src/generate.py --prompt "A magical forest adventure" --max-tokens 200 --temperature 0.8
+ ```
+
+ ## 📖 Example Output
+
+ Here's an example of a story generated by the model:
+
+ **Prompt**: "Once upon a time"
+
+ **Generated Story**:
+ ```
+ it was a bright, sunny day, and lily and her little brother max were playing in their backyard. they found a piece of paper with two sentence written on it. "let's make sense of some of these sentences," said max, pointing to the first sentence. "these people are playing on the grass," "but i don't know," replied lily. she thought for a moment. "maybe they only talk with the others or not, right?" she asked. max nodded. "yeah, and what about 'he', 'he', 'an', 'man', and 'man'?" lily explained, "it means they're playing with their dogs. but they don't say anything about someone talking." max asked, "but what about the others? we don't talk to each other!" lily thought for a moment before answering, "that's right! sometimes, people try to talk to each other. when we talk about something, we need to tell others
+ ```
+
+ ## Training Metrics
+
+ <p align="center">
+   <img src="training_metrics.png" alt="Training and Validation Loss and Learning Rate" width="800"/>
+ </p>
+
+ ## Configuration
+
+ The model can be configured through command-line arguments:
+
+ ```bash
+ # Model configuration
+ --n-layer 6          # Number of transformer layers
+ --n-head 8           # Number of attention heads
+ --n-embd 512         # Embedding dimension
+ --block-size 1024    # Context window size
+
+ # Training configuration
+ --batch-size 12      # Batch size
+ --max-iters 20000    # Maximum training iterations
+ --learning-rate 6e-4 # Learning rate
+ --eval-interval 1000 # Evaluation interval
+
+ # Advanced features
+ --moe-experts 4      # Number of MoE experts
+ --multi-token 2      # Multi-token prediction
+ ```
+
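+ When driving training through `setup.sh`, the same knobs can be overridden via environment variables (names as read in `setup.sh` in this commit):
+
+ ```bash
+ # Override defaults without editing the script
+ BATCH_SIZE=8 MAX_ITERS=10000 LEARNING_RATE=6e-4 ./setup.sh
+ ```
+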
+ ## 🤗 Model Available on Hugging Face
+
+ The trained model is now available on Hugging Face Hub! You can use it directly:
+
+ **Model**: [lakhera2023/deepseek-children-stories](https://huggingface.co/lakhera2023/deepseek-children-stories)
+
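+ A minimal loading sketch, assuming the Hub repo exposes the raw PyTorch checkpoint under the same name as this repository's `checkpoints/best_model.pt` (the filename on the Hub is an assumption):
+
+ ```python
+ import torch
+ from huggingface_hub import hf_hub_download
+
+ # Download the checkpoint file from the Hub (filename assumed)
+ ckpt_path = hf_hub_download(
+     repo_id="lakhera2023/deepseek-children-stories",
+     filename="best_model.pt",
+ )
+
+ # The checkpoint stores the config alongside the weights; see src/generate.py
+ # for full loading, including handling of torch.compile's '_orig_mod.' prefix.
+ checkpoint = torch.load(ckpt_path, map_location="cpu", weights_only=False)
+ ```
+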
+ ## Features
+
+ ### Advanced Architecture
+ - **MLA**: Efficient attention with shared key-value heads
+ - **MoE**: Mixture of experts for increased model capacity
+ - **Multi-token Prediction**: Simultaneous prediction of multiple tokens
+ - **RoPE**: Rotary positional encodings for better position understanding
+
+ ### Training Optimizations
+ - Mixed precision training with gradient scaling (see the sketch below)
+ - PyTorch 2.0 compilation for speed
+ - Automatic checkpointing and model saving
+ - MoE auxiliary loss for load balancing
+
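+ A hedged sketch of the mixed-precision and compilation pattern described above (standard PyTorch APIs; the `model(x, y) -> (logits, loss)` signature and `loader` are assumptions, not taken verbatim from this repo's trainer):
+
+ ```python
+ import torch
+
+ model = torch.compile(model)          # PyTorch 2.0 compilation
+ scaler = torch.cuda.amp.GradScaler()  # gradient scaling for fp16
+
+ for x, y in loader:                   # hypothetical dataloader
+     optimizer.zero_grad(set_to_none=True)
+     with torch.autocast(device_type="cuda", dtype=torch.float16):
+         logits, loss = model(x, y)    # assumed signature
+     scaler.scale(loss).backward()     # scale loss to avoid fp16 underflow
+     scaler.step(optimizer)            # unscales gradients, then steps
+     scaler.update()
+ ```
+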
+ ### Story Generation
+ - Creative and engaging children's stories
+ - Moral lessons and educational content
+ - Age-appropriate language and themes
+ - Consistent character development
+
+ ## Performance
+
+ The model achieves:
+ - Efficient training with ~2.24 GB GPU memory usage (reproducible with the snippet below)
+ - Fast inference for real-time story generation
+ - High-quality output suitable for children
+ - Scalable architecture for different use cases
+
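+ One way to reproduce the peak-memory figure above (standard PyTorch accounting; this measures the CUDA allocator's peak, not total process memory):
+
+ ```python
+ import torch
+
+ torch.cuda.reset_peak_memory_stats()
+ # ... run a few training iterations here ...
+ peak_gb = torch.cuda.max_memory_allocated() / 1024**3
+ print(f"Peak GPU memory allocated: {peak_gb:.2f} GB")
+ ```
+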
+ ## Contributing
+
+ Contributions are welcome! Please feel free to submit a Pull Request.
+
+ ## License
+
+ This project is licensed under the MIT License - see the LICENSE file for details.
+
+ ## Acknowledgments
+
+ - DeepSeek team for the original architecture
+ - Hugging Face for the children's stories dataset
+ - PyTorch team for the excellent framework
+
+ ## Links
+
+ - **GitHub**: https://github.com/ideaweaver-ai/DeepSeek-Children-Stories-15M-model
+
+ ---
+
+ ⭐ **Star this repository if you think Advanced Architecture + Tiny Models can do Big Things!**
checkpoints/best_model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6ed045ca7f558caa89ae9345afc8f279d85838fd88ee066525d3e0497c2b1903
+ size 943196083
checkpoints/checkpoint_0.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:596d20b3003def973b7ed263f8a3bc8178d4527182e3d8b3d73b9f08a880487a
+ size 942850634
checkpoints/checkpoint_1000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ef3cd05895dab3ad7b9605e99fd44437886e504064074283f1c01d35d4bdde2c
+ size 942871055
checkpoints/checkpoint_10000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6e932a30f92cf27cc5cc90220c11cca2a9ba26ca662e1dcc9fbe0c9e56947ea8
+ size 943036196
checkpoints/checkpoint_11000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:741587b23e54005aa59122359d1645f9647509283315b1e688cecaa244e14d37
+ size 943054443
checkpoints/checkpoint_12000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:68e0cfb7e5e1c841ddb4f63a4e35f609ed3ae33a71b2fcae3a42713807716288
+ size 943072690
checkpoints/checkpoint_13000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8faec5132a9233ae661cf85447612451692fd568f55b77a331ef372ce7298b1d
+ size 943091001
checkpoints/checkpoint_14000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:21000abe6d1ce61af34a3a348379b75ceb32c5b3e34934b6bf7f0f56900d57da
+ size 943109248
checkpoints/checkpoint_15000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:989c4f47419fbdb682d798c440eb8fa657f28717fe46a8a977cff5438daf6a32
+ size 943127495
checkpoints/checkpoint_16000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d821499c970ea1176632b80fa534c13cd941991e7219a2e4b98d443f733ea8d9
+ size 943145742
checkpoints/checkpoint_17000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c4239cfcac50b335caba1be84fd182100b30a43e199558bdf4320eb6a569a355
+ size 943164053
checkpoints/checkpoint_18000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a95a2b08118aafbe1649a94cdcd25c092c9e3337fac55ffb176af1c829a997cd
+ size 943182300
checkpoints/checkpoint_19000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dc22f5fc4bbfce05ab2b422929f31666dfc5536e8ffe6cf066ae1aa64feb3215
+ size 943200547
checkpoints/checkpoint_2000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c7031946ec00c7c3433b130ee42a593a8295bb6f9036bef87fa922ce8ef7d3c0
+ size 942889365
checkpoints/checkpoint_20000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7f2abb8b5c12e411ecc662223dcc00ebb8e76bb0e492bf3782d03b4a5f0d5298
+ size 943218531
checkpoints/checkpoint_3000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3e676cb28f0aa2fe17da68bb5df5af422601606c11141e56bd26dfe126238971
+ size 942907611
checkpoints/checkpoint_4000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5f363f026b391e07f1610299450a74a32e217a782cb867a4b4307bb2a7d75559
+ size 942925857
checkpoints/checkpoint_5000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:20a34a77821367b9708831979f89af4f32b45bac592466cc04e778ee99a973b5
+ size 942944103
checkpoints/checkpoint_6000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:935b2adcd4575c3ad9732e898afdff258fd36a7571b7e97ca2796d116526ee0a
+ size 942962413
checkpoints/checkpoint_7000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:390a641bc6d086b41d16c8ec6888fc79f2a7970525f6192528ea35a1b4f44728
+ size 942980659
checkpoints/checkpoint_8000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:90b2614013ed2c2df7078052867a9c19bdd36acc386f3fa2ed17e3013bdbdfb4
+ size 942998905
checkpoints/checkpoint_9000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1cba9b64ec3237fe1c33366159a883c344d103d35b4cced89d514617f89278c2
+ size 943017151
checkpoints/final_model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:62e5cf559d3abdb97036c62c7daa25e61256d98f6cbab403d692ca648db2e078
+ size 942849715
deepseek-arch.png ADDED

Git LFS Details

  • SHA256: 5447c3856700c59997bb6e8463fb50b6e577c3d75db93c5c291a9a6af26ab32d
  • Pointer size: 131 Bytes
  • Size of remote file: 123 kB
deepseek_training_metrics.png ADDED

Git LFS Details

  • SHA256: d4034df4cf3c8bb9f9c463ef6b884aa4724d6bb27a7b2fa82b9ea16a8aeb8f7d
  • Pointer size: 131 Bytes
  • Size of remote file: 288 kB
process_data.py ADDED
@@ -0,0 +1,16 @@
+ import os
+ import sys
+
+ # Add the src directory to Python path
+ sys.path.append(os.path.join(os.path.dirname(__file__), 'src'))
+
+ from data.data_processor import DeepSeekDataProcessor
+
+ def main():
+     print("[+] Processing dataset into binary files...")
+     processor = DeepSeekDataProcessor()
+     processor.prepare_dataset()
+     print("[+] Data processing completed successfully!")
+
+ if __name__ == "__main__":
+     main()
requirements.txt ADDED
@@ -0,0 +1,13 @@
+ torch>=2.0.0
+ transformers>=4.30.0
+ datasets>=2.12.0
+ tiktoken>=0.5.0
+ numpy>=1.20.0
+ tqdm>=4.65.0
+ matplotlib>=3.5.0
+ peft>=0.4.0
+ accelerate>=0.20.0
+ bitsandbytes>=0.41.0
+ huggingface_hub>=0.16.0
+ wandb>=0.15.0
+ psutil>=5.8.0
setup.sh ADDED
@@ -0,0 +1,313 @@
+ #!/bin/bash
+
+ # Colors for output
+ GREEN='\033[0;32m'
+ RED='\033[0;31m'
+ YELLOW='\033[1;33m'
+ BLUE='\033[0;34m'
+ NC='\033[0m' # No Color
+
+ # Default configuration
+ PROJECT_ROOT="${PROJECT_ROOT:-$(pwd)}"
+ VENV_PATH="${VENV_PATH:-${PROJECT_ROOT}/venv}"
+ CHECKPOINT_DIR="${CHECKPOINT_DIR:-${PROJECT_ROOT}/checkpoints}"
+ LORA_CHECKPOINT_DIR="${LORA_CHECKPOINT_DIR:-${PROJECT_ROOT}/lora_checkpoints}"
+ REQUIRED_SPACE_MB="${REQUIRED_SPACE_MB:-2000}"
+
+ # Function to print status messages
+ print_status() {
+     echo -e "${GREEN}[+] $1${NC}"
+ }
+
+ print_error() {
+     echo -e "${RED}[-] $1${NC}"
+ }
+
+ print_warning() {
+     echo -e "${YELLOW}[!] $1${NC}"
+ }
+
+ print_info() {
+     echo -e "${BLUE}[i] $1${NC}"
+ }
+
+ # Function to handle errors
+ handle_error() {
+     print_error "$1"
+     exit 1
+ }
+
+ # Function to check if a command exists
+ command_exists() {
+     command -v "$1" &> /dev/null
+ }
+
+ # Function to check disk space
+ check_disk_space() {
+     local available_space_mb=$(df -m . | awk 'NR==2 {print $4}')
+     if [ "$available_space_mb" -lt "$REQUIRED_SPACE_MB" ]; then
+         print_warning "Low disk space. Only ${available_space_mb}MB available, ${REQUIRED_SPACE_MB}MB required."
+         return 1
+     fi
+     return 0
+ }
+
+ # Function to check GPU memory
+ check_gpu_memory() {
+     if command_exists nvidia-smi; then
+         local total_memory=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits)
+         local free_memory=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits)
+         local used_memory=$((total_memory - free_memory))
+         print_status "GPU Memory: ${used_memory}MB used, ${free_memory}MB free of ${total_memory}MB total"
+
+         # Check if we have enough memory for training
+         if [ "$free_memory" -lt 4000 ]; then
+             print_warning "Low GPU memory. Consider reducing batch size or model size."
+         fi
+     else
+         print_warning "nvidia-smi not found. GPU training may not be available."
+     fi
+ }
+
+ # Function to create project structure
+ create_project_structure() {
+     print_status "Creating project structure..."
+     mkdir -p "${PROJECT_ROOT}/src/data" \
+         "${PROJECT_ROOT}/src/model" \
+         "${PROJECT_ROOT}/src/training" \
+         "${PROJECT_ROOT}/src/inference" \
+         "${CHECKPOINT_DIR}" \
+         "${LORA_CHECKPOINT_DIR}" || handle_error "Failed to create directories"
+ }
+
+ # Function to setup virtual environment
+ setup_virtual_env() {
+     print_status "Creating virtual environment..."
+     python3 -m venv "${VENV_PATH}" || handle_error "Failed to create virtual environment"
+     source "${VENV_PATH}/bin/activate" || handle_error "Failed to activate virtual environment"
+
+     print_status "Installing dependencies..."
+     pip install --upgrade pip
+     pip install -r requirements.txt || handle_error "Failed to install requirements"
+ }
+
+ # Function to prepare dataset
+ prepare_dataset() {
+     print_status "Preparing dataset..."
+     cd "${PROJECT_ROOT}" || handle_error "Failed to change to project directory"
+
+     # Create a Python script to process the data
+     cat > process_data.py << 'EOF'
+ import os
+ import sys
+
+ # Add the src directory to Python path
+ sys.path.append(os.path.join(os.path.dirname(__file__), 'src'))
+
+ from data.data_processor import DeepSeekDataProcessor
+
+ def main():
+     print("[+] Processing dataset into binary files...")
+     processor = DeepSeekDataProcessor()
+     processor.prepare_dataset()
+     print("[+] Data processing completed successfully!")
+
+ if __name__ == "__main__":
+     main()
+ EOF
+
+     # Run the data processing script
+     python3 process_data.py || handle_error "Failed to process dataset"
+
+     # Verify the files were created
+     if [ ! -f "${PROJECT_ROOT}/src/data/train.bin" ] || [ ! -f "${PROJECT_ROOT}/src/data/validation.bin" ]; then
+         handle_error "Data processing failed - required files not created"
+     fi
+ }
+
+ # Function to train base model
+ train_base_model() {
+     print_status "Starting DeepSeek base model training..."
+     cd "${PROJECT_ROOT}" || handle_error "Failed to change to project directory"
+
+     python3 src/run_training.py \
+         --batch-size "${BATCH_SIZE:-12}" \
+         --max-iters "${MAX_ITERS:-20000}" \
+         --eval-interval "${EVAL_INTERVAL:-1000}" \
+         --eval-iters "${EVAL_ITERS:-200}" \
+         --learning-rate "${LEARNING_RATE:-6e-4}" \
+         --weight-decay "${WEIGHT_DECAY:-0.1}" \
+         --warmup-iters "${WARMUP_ITERS:-2000}" \
+         --lr-decay-iters "${LR_DECAY_ITERS:-20000}" \
+         --min-lr "${MIN_LR:-6e-5}" \
+         --moe-experts "${MOE_EXPERTS:-4}" \
+         --multi-token "${MULTI_TOKEN:-2}" || handle_error "Base model training failed"
+ }
+
+ # Function to perform LoRA finetuning
+ finetune_lora() {
+     while true; do
+         read -p "Do you want to perform LoRA finetuning? (y/n) " do_finetune
+         case $do_finetune in
+             [Yy]* )
+                 print_status "Starting LoRA finetuning..."
+                 cd "${PROJECT_ROOT}" || handle_error "Failed to change to project directory"
+
+                 # Create LoRA finetuning script
+                 cat > finetune_lora.py << 'EOF'
+ import torch
+ import os
+ import sys
+ sys.path.append('src')
+
+ from model.deepseek import DeepSeek, DeepSeekConfig
+ from peft import get_peft_model, LoraConfig, TaskType
+
+ def main():
+     print("Loading base model...")
+     checkpoint = torch.load('checkpoints/best_model.pt', map_location='cuda' if torch.cuda.is_available() else 'cpu')
+     model = DeepSeek(checkpoint['config'])
+     model.load_state_dict(checkpoint['model'])
+
+     # Define LoRA configuration
+     lora_config = LoraConfig(
+         task_type=TaskType.CAUSAL_LM,
+         r=8,  # rank
+         lora_alpha=32,
+         lora_dropout=0.1,
+         target_modules=["q_a_proj", "q_b_proj", "kv_a_proj", "kv_b_proj"]
+     )
+
+     # Get PEFT model
+     model = get_peft_model(model, lora_config)
+     model.print_trainable_parameters()
+
+     print("LoRA finetuning setup complete!")
+
+ if __name__ == "__main__":
+     main()
+ EOF
+
+                 python3 finetune_lora.py || handle_error "LoRA finetuning failed"
+                 break
+                 ;;
+             [Nn]* )
+                 print_status "Skipping LoRA finetuning..."
+                 break
+                 ;;
+             * )
+                 echo "Please answer 'y' or 'n'"
+                 ;;
+         esac
+     done
+ }
+
+ # Function to test the trained model
+ test_model() {
+     while true; do
+         read -p "Do you want to test the trained model? (y/n) " do_test
+         case $do_test in
+             [Yy]* )
+                 print_status "Testing the trained model..."
+                 cd "${PROJECT_ROOT}" || handle_error "Failed to change to project directory"
+
+                 # Create test prompts
+                 prompts=(
+                     "Once upon a time"
+                     "In a magical forest"
+                     "The little robot"
+                     "The brave knight"
+                 )
+
+                 # Test each prompt
+                 for prompt in "${prompts[@]}"; do
+                     print_status "Testing with prompt: '$prompt'"
+                     python3 src/generate.py \
+                         --model-path "${CHECKPOINT_DIR}/best_model.pt" \
+                         --prompt "$prompt" \
+                         --max-tokens 100 \
+                         --temperature 0.8 \
+                         --top-k 40
+                     echo
+                 done
+                 break
+                 ;;
+             [Nn]* )
+                 print_status "Skipping model testing..."
+                 break
+                 ;;
+             * )
+                 echo "Please answer 'y' or 'n'"
+                 ;;
+         esac
+     done
+ }
+
+ # Function to show usage information
+ show_usage() {
+     print_info "DeepSeek Children's Stories Model Setup Complete!"
+     print_info ""
+     print_info "Next steps:"
+     print_info "1. Activate virtual environment: source venv/bin/activate"
+     print_info "2. Train the model: python src/run_training.py"
+     print_info "3. Generate stories: python src/generate.py --prompt 'your prompt'"
+     print_info "4. Interactive mode: python src/generate.py --interactive"
+     print_info ""
+     print_info "Model files:"
+     print_info "- Base model: checkpoints/best_model.pt"
+     print_info "- LoRA model: lora_checkpoints/best_lora_model.pt"
+     print_info ""
+     print_info "Configuration options:"
+     print_info "- Adjust model size: --n-layer, --n-head, --n-embd"
+     print_info "- Training parameters: --batch-size, --learning-rate, --max-iters"
+     print_info "- Advanced features: --moe-experts, --multi-token"
+ }
+
+ # Main setup function
+ main() {
+     print_info "DeepSeek Children's Stories Model Setup"
+     print_info "======================================"
+
+     # Check prerequisites
+     if ! command_exists python3; then
+         handle_error "Python 3 is required but not installed"
+     fi
+
+     if ! command_exists pip; then
+         handle_error "pip is required but not installed"
+     fi
+
+     # Check disk space
+     if ! check_disk_space; then
+         print_warning "Continuing with low disk space..."
+     fi
+
+     # Check GPU
+     check_gpu_memory
+
+     # Create project structure
+     create_project_structure
+
+     # Setup virtual environment
+     setup_virtual_env
+
+     # Prepare dataset
+     prepare_dataset
+
+     # Train base model
+     train_base_model
+
+     # Optional LoRA finetuning
+     finetune_lora
+
+     # Optional model testing
+     test_model
+
+     # Show usage information
+     show_usage
+
+     print_status "Setup completed successfully!"
+ }
+
+ # Run main function
+ main "$@"
src/data/__pycache__/data_processor.cpython-310.pyc ADDED
Binary file (8.3 kB).
 
src/data/data_processor.py ADDED
@@ -0,0 +1,287 @@
+ """
+ Data Processor for DeepSeek Children's Stories Model
+ Handles dataset loading, preprocessing, and tokenization for children's story generation
+ """
+
+ import tiktoken
+ import os
+ import numpy as np
+ from datasets import load_dataset
+ from tqdm.auto import tqdm
+ import torch
+ from typing import Dict, List, Optional
+
+ def load_encoder_decoder():
+     """Load the encoder and decoder for text processing"""
+     enc = tiktoken.get_encoding("gpt2")
+     return enc, enc
+
+ class DeepSeekDataProcessor:
+     def __init__(self, config=None):
+         # Initialize tokenizer with GPT-2 encoding
+         self.enc = tiktoken.get_encoding("gpt2")
+
+         # Special tokens for story structure (optimized for children's stories)
+         self.special_tokens = {
+             "story_start": "<|story|>",
+             "story_end": "</|story|>",
+             "prompt_start": "<|prompt|>",
+             "prompt_end": "</|prompt|>",
+             "moral_start": "<|moral|>",
+             "moral_end": "</|moral|>",
+             "character_start": "<|character|>",
+             "character_end": "</|character|>"
+         }
+
+         # Ensure data directory exists
+         self.data_dir = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "data")
+         os.makedirs(self.data_dir, exist_ok=True)
+         print(f"Data directory: {self.data_dir}")
+
+         # Configuration for processing
+         self.max_length = 1024  # DeepSeek context window
+         self.min_length = 50  # Minimum story length
+
+     def preprocess_text(self, text: str) -> str:
+         """Preprocess text for children's stories"""
+         # Basic text cleaning
+         text = text.lower()  # Convert to lowercase for consistency
+         text = text.replace('\n', ' ')  # Replace newlines with spaces
+         text = ' '.join(text.split())  # Normalize whitespace
+
+         # Remove any inappropriate content markers (basic filtering)
+         inappropriate_phrases = ['adult content', 'mature', 'explicit']
+         for phrase in inappropriate_phrases:
+             if phrase in text:
+                 return ""
+
+         # Ensure the text is child-friendly
+         if len(text) < self.min_length:
+             return ""
+
+         return text
+
+     def extract_story_elements(self, example: Dict) -> Dict:
+         """Extract story elements for better structure"""
+         prompt = self.preprocess_text(example.get('prompt', ''))
+         story = self.preprocess_text(example.get('text', ''))
+
+         # Extract potential moral or lesson
+         moral = ""
+         if 'moral' in example:
+             moral = self.preprocess_text(example['moral'])
+         elif 'lesson' in example:
+             moral = self.preprocess_text(example['lesson'])
+
+         # Extract main character if available
+         character = ""
+         if 'character' in example:
+             character = self.preprocess_text(example['character'])
+
+         return {
+             'prompt': prompt,
+             'story': story,
+             'moral': moral,
+             'character': character
+         }
+
+     def process(self, example: Dict) -> Dict:
+         """Process a single example for DeepSeek model"""
+         # Extract story elements
+         elements = self.extract_story_elements(example)
+
+         # Skip if no valid content
+         if not elements['story'] or not elements['prompt']:
+             return {'ids': [], 'len': 0}
+
+         # Create structured text with special tokens
+         full_text = (
+             f"{self.special_tokens['prompt_start']} {elements['prompt']} {self.special_tokens['prompt_end']} "
+         )
+
+         # Add character information if available
+         if elements['character']:
+             full_text += f"{self.special_tokens['character_start']} {elements['character']} {self.special_tokens['character_end']} "
+
+         # Add the main story
+         full_text += f"{self.special_tokens['story_start']} {elements['story']} {self.special_tokens['story_end']}"
+
+         # Add moral if available
+         if elements['moral']:
+             full_text += f" {self.special_tokens['moral_start']} {elements['moral']} {self.special_tokens['moral_end']}"
+
+         # Tokenize with error handling
+         try:
+             ids = self.enc.encode_ordinary(full_text)
+
+             # Ensure the sequence isn't too long
+             if len(ids) > self.max_length:
+                 ids = ids[:self.max_length]
+
+             # Skip if too short
+             if len(ids) < 20:
+                 return {'ids': [], 'len': 0}
+
+             out = {'ids': ids, 'len': len(ids)}
+             return out
+
+         except Exception as e:
+             print(f"Error tokenizing text: {e}")
+             return {'ids': [], 'len': 0}
+
+     def prepare_dataset(self) -> Dict:
+         """Prepare the Children Stories Collection dataset for DeepSeek training"""
+         # Load the Children Stories Collection dataset
+         print("Loading Children Stories Collection dataset...")
+         ds = load_dataset("ajibawa-2023/Children-Stories-Collection")
+
+         train_bin_path = os.path.join(self.data_dir, "train.bin")
+         val_bin_path = os.path.join(self.data_dir, "validation.bin")
+         finetune_bin_path = os.path.join(self.data_dir, "finetune.bin")
+
+         print(f"Checking for existing processed files...")
+
+         # Check if all files exist
+         if (os.path.exists(train_bin_path) and
+             os.path.exists(val_bin_path) and
+             os.path.exists(finetune_bin_path)):
+
+             print("Found existing processed files!")
+             print(f"Train file: {os.path.getsize(train_bin_path) / (1024*1024):.2f} MB")
+             print(f"Validation file: {os.path.getsize(val_bin_path) / (1024*1024):.2f} MB")
+             print(f"Finetune file: {os.path.getsize(finetune_bin_path) / (1024*1024):.2f} MB")
+
+             return {
+                 "train": train_bin_path,
+                 "validation": val_bin_path,
+                 "finetune": finetune_bin_path
+             }
+
+         print("Processing dataset...")
+
+         # Filter out examples that are too short or too long
+         def filter_by_length(example):
+             text_length = len(example.get('text', ''))
+             return self.min_length <= text_length <= 2000  # Reasonable length for children's stories
+
+         ds = ds.filter(filter_by_length)
+         print(f"After filtering: {len(ds['train'])} examples")
+
+         # Split the dataset into train, validation, and finetune sets
+         train_val_test = ds["train"].train_test_split(test_size=0.2, seed=42)
+         val_finetune = train_val_test["test"].train_test_split(test_size=0.5, seed=42)
+
+         # Create a new dataset dictionary with all splits
+         ds = {
+             "train": train_val_test["train"],
+             "validation": val_finetune["train"],
+             "finetune": val_finetune["test"]
+         }
+
+         print(f"Dataset split sizes:")
+         print(f"Training set: {len(ds['train'])} examples")
+         print(f"Validation set: {len(ds['validation'])} examples")
+         print(f"Finetune set: {len(ds['finetune'])} examples")
+
+         # Process each split
+         for split_name, split_data in ds.items():
+             print(f"\nProcessing {split_name} split...")
+
+             # Process the data
+             tokenized = split_data.map(
+                 self.process,
+                 remove_columns=['text', 'prompt', 'text_token_length'],
+                 desc=f"tokenizing {split_name} split",
+                 num_proc=8,
+             )
+
+             # Filter out empty sequences
+             tokenized = tokenized.filter(lambda x: x['len'] > 0)
+             print(f"After processing: {len(tokenized)} valid examples")
+
+             # Save to binary file
+             filename = os.path.join(self.data_dir, f"{split_name}.bin")
+             print(f"Saving {split_name} split to: {filename}")
+
+             # Calculate total length
+             arr_len = np.sum(tokenized['len'], dtype=np.uint64)
+             dtype = np.uint16
+             arr = np.memmap(filename, dtype=dtype, mode='w+', shape=(arr_len,))
+             total_batches = 1024
+
+             idx = 0
+             for batch_idx in tqdm(range(total_batches), desc=f'writing {filename}'):
+                 batch = tokenized.shard(num_shards=total_batches, index=batch_idx, contiguous=True).with_format('numpy')
+                 arr_batch = np.concatenate(batch['ids'])
+                 arr[idx : idx + len(arr_batch)] = arr_batch
+                 idx += len(arr_batch)
+             arr.flush()
+
+             # Verify file was created
+             if os.path.exists(filename):
+                 print(f"Successfully created {filename}")
+                 print(f"File size: {os.path.getsize(filename) / (1024*1024):.2f} MB")
+             else:
+                 raise RuntimeError(f"Failed to create {filename}")
+
+         return {
+             "train": train_bin_path,
+             "validation": val_bin_path,
+             "finetune": finetune_bin_path
+         }
+
+     def load_binary_data(self, filepath: str) -> torch.Tensor:
+         """Load binary data file as tensor"""
+         try:
+             data = np.memmap(filepath, dtype=np.uint16, mode='r')
+             return torch.from_numpy(data.copy())
+         except Exception as e:
+             print(f"Error loading data from {filepath}: {e}")
+             raise
+
+     def get_batch(self, data: torch.Tensor, batch_size: int, block_size: int) -> tuple:
+         """Get a batch of data for training"""
+         # Generate random indices
+         ix = torch.randint(len(data) - block_size, (batch_size,))
+
+         # Get input sequences
+         x = torch.stack([data[i:i+block_size].long() for i in ix])
+         # Get target sequences (shifted by 1)
+         y = torch.stack([data[i+1:i+1+block_size].long() for i in ix])
+
+         return x, y
+
+     def decode_tokens(self, token_ids: List[int]) -> str:
+         """Decode token IDs back to text"""
+         try:
+             return self.enc.decode(token_ids)
+         except Exception as e:
+             print(f"Error decoding tokens: {e}")
+             return ""
+
+     def encode_text(self, text: str) -> List[int]:
+         """Encode text to token IDs"""
+         try:
+             return self.enc.encode_ordinary(text)
+         except Exception as e:
+             print(f"Error encoding text: {e}")
+             return []
+
+
+ def main():
+     """Main function to process the dataset"""
+     print("DeepSeek Children's Stories Data Processor")
+     print("=" * 50)
+
+     processor = DeepSeekDataProcessor()
+     processor.prepare_dataset()
+
+     print("\nData processing completed successfully!")
+     print("Files created:")
+     print("- src/data/train.bin")
+     print("- src/data/validation.bin")
+     print("- src/data/finetune.bin")
+
+
+ if __name__ == "__main__":
+     main()
src/data/finetune.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d4f4819be598b40b35540e069ed44efc44ffc732ab6c29269e5f1a227fb9e77f
+ size 61167196
src/data/train.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:09234a8bf9f55e6e59f48b765bea680f3c2aa4a8305e9a553d9593de4652d0aa
+ size 488961682
src/data/validation.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:98148ea1408dc8b01abe08126221c6ef3cd01362240b9d0bb1ce39e477ae211c
+ size 61065952
src/generate.py ADDED
@@ -0,0 +1,281 @@
+ """
+ DeepSeek Children's Stories Text Generation
+ Generate children's stories using the trained DeepSeek model
+ """
+
+ import os
+ import sys
+ import argparse
+ import torch
+ import tiktoken
+ from typing import List, Optional
+
+ # Add the src directory to Python path
+ sys.path.append(os.path.join(os.path.dirname(__file__), '..'))
+
+ from model.deepseek import DeepSeek, DeepSeekConfig
+
+ # Allowlist DeepSeekConfig for safe deserialization
+ torch.serialization.add_safe_globals([DeepSeekConfig])
+
+ class DeepSeekStoryGenerator:
+     def __init__(self, model_path: str, device: str = 'auto'):
+         """Initialize the story generator"""
+         self.device = self._get_device(device)
+         self.model = self._load_model(model_path)
+         self.tokenizer = tiktoken.get_encoding("gpt2")
+
+         # Special tokens for story structure
+         self.special_tokens = {
+             "story_start": "<|story|>",
+             "story_end": "</|story|>",
+             "prompt_start": "<|prompt|>",
+             "prompt_end": "</|prompt|>",
+             "moral_start": "<|moral|>",
+             "moral_end": "</|moral|>",
+             "character_start": "<|character|>",
+             "character_end": "</|character|>"
+         }
+
+     def _get_device(self, device: str) -> str:
+         """Get the appropriate device"""
+         if device == 'auto':
+             return 'cuda' if torch.cuda.is_available() else 'cpu'
+         return device
+
+     def _load_model(self, model_path: str) -> DeepSeek:
+         """Load the trained model"""
+         print(f"Loading model from {model_path}...")
+
+         # Load checkpoint
+         checkpoint = torch.load(model_path, map_location=self.device, weights_only=False)
+
+         # Create model with the same configuration
+         config = checkpoint['config']
+         model = DeepSeek(config)
+
+         # Handle compiled model state dict by removing _orig_mod prefix
+         state_dict = checkpoint['model']
+         if all(k.startswith('_orig_mod.') for k in state_dict.keys()):
+             state_dict = {k[10:]: v for k, v in state_dict.items()}  # Remove '_orig_mod.' prefix
+
+         # Load model weights
+         model.load_state_dict(state_dict)
+         model.to(self.device)
+         model.eval()
+
+         print(f"Model loaded successfully!")
+         print(f"Model configuration: {config.n_layer}L/{config.n_head}H/{config.n_embd}D")
+         print(f"Device: {self.device}")
+
+         return model
+
+     def encode_prompt(self, prompt: str, character: Optional[str] = None) -> torch.Tensor:
+         """Encode a prompt for generation"""
+         # Create structured prompt
+         full_prompt = f"{self.special_tokens['prompt_start']} {prompt.lower()} {self.special_tokens['prompt_end']}"
+
+         if character:
+             full_prompt += f" {self.special_tokens['character_start']} {character.lower()} {self.special_tokens['character_end']}"
+
+         full_prompt += f" {self.special_tokens['story_start']}"
+
+         # Tokenize
+         token_ids = self.tokenizer.encode_ordinary(full_prompt)
+         return torch.tensor([token_ids], dtype=torch.long, device=self.device)
+
+     def generate_story(self, prompt: str, character: Optional[str] = None,
+                        max_tokens: int = 200, temperature: float = 0.8,
+                        top_k: int = 40, top_p: float = 0.9) -> str:
+         """Generate a children's story"""
+         print(f"Generating story for prompt: '{prompt}'")
+         if character:
+             print(f"Character: {character}")
+
+         # Encode prompt
+         input_ids = self.encode_prompt(prompt, character)
+
+         # Generate
+         with torch.no_grad():
+             generated_ids = self.model.generate(
+                 input_ids,
+                 max_new_tokens=max_tokens,
+                 temperature=temperature,
+                 top_k=top_k
+             )
+
+         # Decode the generated text
+         generated_text = self.tokenizer.decode(generated_ids[0].tolist())
+
+         # Extract the story part
+         story = self._extract_story(generated_text)
+
+         return story
+
+     def _extract_story(self, text: str) -> str:
+         """Extract the story from the generated text"""
+         # Find story start and end markers
+         story_start = text.find(self.special_tokens['story_start'])
+         story_end = text.find(self.special_tokens['story_end'])
+
+         if story_start != -1 and story_end != -1:
+             # Extract story content
+             story_content = text[story_start + len(self.special_tokens['story_start']):story_end].strip()
+             return story_content
+         else:
+             # Fallback: return the text after the last prompt
+             prompt_end = text.find(self.special_tokens['prompt_end'])
+             if prompt_end != -1:
+                 return text[prompt_end + len(self.special_tokens['prompt_end']):].strip()
+             else:
+                 return text.strip()
+
+     def generate_multiple_stories(self, prompts: List[str], num_stories: int = 3,
+                                   **kwargs) -> List[str]:
+         """Generate multiple stories from a list of prompts"""
+         stories = []
+
+         for i, prompt in enumerate(prompts):
+             print(f"\nGenerating story {i+1}/{len(prompts)}...")
+             story = self.generate_story(prompt, **kwargs)
+             stories.append(story)
+
+         return stories
+
+     def interactive_generation(self):
+         """Interactive story generation mode"""
+         print("DeepSeek Children's Stories - Interactive Mode")
+         print("Type 'quit' to exit")
+         print("-" * 50)
+
+         while True:
+             try:
+                 # Get prompt from user
+                 prompt = input("\nEnter a story prompt: ").strip()
+
+                 if prompt.lower() in ['quit', 'exit', 'q']:
+                     print("Goodbye!")
+                     break
+
+                 if not prompt:
+                     print("Please enter a valid prompt.")
+                     continue
+
+                 # Get character (optional)
+                 character = input("Enter a character name (optional): ").strip()
+                 if not character:
+                     character = None
+
+                 # Get generation parameters
+                 try:
+                     max_tokens = int(input("Max tokens (default 200): ") or "200")
+                     temperature = float(input("Temperature (default 0.8): ") or "0.8")
+                 except ValueError:
+                     max_tokens = 200
+                     temperature = 0.8
+
+                 # Generate story
+                 story = self.generate_story(
+                     prompt,
+                     character=character,
+                     max_tokens=max_tokens,
+                     temperature=temperature
+                 )
+
+                 # Display story
+                 print("\n" + "="*50)
+                 print("GENERATED STORY:")
+                 print("="*50)
+                 print(story)
+                 print("="*50)
+
+             except KeyboardInterrupt:
+                 print("\nGoodbye!")
+                 break
+             except Exception as e:
+                 print(f"Error generating story: {e}")
+
+
+ def main():
+     """Main generation function"""
+     parser = argparse.ArgumentParser(description='Generate children\'s stories with DeepSeek')
+
+     # Model configuration
+     parser.add_argument('--model-path', type=str, default='checkpoints/best_model.pt',
+                         help='Path to the trained model checkpoint')
+     parser.add_argument('--device', type=str, default='auto',
+                         help='Device to use (auto, cuda, cpu)')
+
+     # Generation parameters
+     parser.add_argument('--prompt', type=str, help='Story prompt')
+     parser.add_argument('--character', type=str, help='Character name')
+     parser.add_argument('--max-tokens', type=int, default=200, help='Maximum tokens to generate')
+     parser.add_argument('--temperature', type=float, default=0.8, help='Sampling temperature')
+     parser.add_argument('--top-k', type=int, default=40, help='Top-k sampling')
+     parser.add_argument('--top-p', type=float, default=0.9, help='Top-p sampling')
+
+     # Multiple generation
+     parser.add_argument('--num-stories', type=int, default=1, help='Number of stories to generate')
+     parser.add_argument('--interactive', action='store_true', help='Interactive mode')
+
+     args = parser.parse_args()
+
+     # Check if model exists
+     if not os.path.exists(args.model_path):
+         print(f"Error: Model file not found at {args.model_path}")
+         print("Please train the model first or specify the correct path.")
+         return
+
+     # Create generator
+     generator = DeepSeekStoryGenerator(args.model_path, args.device)
+
+     if args.interactive:
+         # Interactive mode
+         generator.interactive_generation()
+     else:
+         # Single or multiple generation
+         if args.prompt:
+             if args.num_stories == 1:
+                 # Single story
+                 story = generator.generate_story(
+                     args.prompt,
+                     character=args.character,
+                     max_tokens=args.max_tokens,
+                     temperature=args.temperature,
+                     top_k=args.top_k,
+                     top_p=args.top_p
+                 )
+
+                 print(f"\nPrompt: {args.prompt}")
+                 if args.character:
+                     print(f"Character: {args.character}")
+                 print("\n" + "="*50)
+                 print("GENERATED STORY:")
+                 print("="*50)
+                 print(story)
+                 print("="*50)
+             else:
+                 # Multiple stories
+                 prompts = [args.prompt] * args.num_stories
+                 stories = generator.generate_multiple_stories(
+                     prompts,
+                     num_stories=args.num_stories,
+                     character=args.character,
+                     max_tokens=args.max_tokens,
+                     temperature=args.temperature,
+                     top_k=args.top_k,
+                     top_p=args.top_p
+                 )
+
+                 for i, story in enumerate(stories):
+                     print(f"\nStory {i+1}:")
+                     print("="*50)
+                     print(story)
+                     print("="*50)
+         else:
+             print("Please provide a prompt or use --interactive mode.")
+             print("Example: python generate.py --prompt 'A brave little mouse' --character 'Mickey'")
+
+
+ if __name__ == "__main__":
+     main()
src/model/__pycache__/deepseek.cpython-310.pyc ADDED
Binary file (13.8 kB).
 
src/model/deepseek.py ADDED
@@ -0,0 +1,513 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ DeepSeek Model Architecture for Children's Stories
3
+ Implements advanced features:
4
+ - Multihead Latent Attention (MLA)
5
+ - Mixture of Experts (MoE)
6
+ - Multi-token prediction
7
+ - Quantization support
8
+ - Rotary Positional Encodings (RoPE)
9
+ - Optimized for children's story generation
10
+ """
11
+
12
+ import math
13
+ import torch
14
+ import torch.nn as nn
15
+ import torch.nn.functional as F
16
+ from typing import Optional, Tuple, List
17
+ from dataclasses import dataclass
18
+
19
+
20
+ @dataclass
21
+ class DeepSeekConfig:
22
+ """Configuration for DeepSeek model optimized for children's stories"""
23
+ vocab_size: int = 50257 # GPT-2 vocabulary size
24
+ n_layer: int = 6 # Reduced for efficiency
25
+ n_head: int = 8 # Number of attention heads
26
+ n_embd: int = 512 # Embedding dimension
27
+ block_size: int = 1024 # Context window
28
+ dropout: float = 0.1 # Dropout rate
29
+ bias: bool = True # Use bias in linear layers
30
+
31
+ # MLA (Multihead Latent Attention) config
32
+ use_mla: bool = True # Enable MLA
33
+ mla_kv_heads: int = 4 # Number of key-value heads for MLA
34
+ mla_q_lora_rank: int = 32 # LoRA rank for query projection
35
+ mla_kv_lora_rank: int = 16 # LoRA rank for key-value projection
36
+
37
+ # MoE (Mixture of Experts) config
38
+ moe_num_experts: int = 4 # Number of experts
39
+ moe_top_k: int = 2 # Number of experts per token
40
+ moe_expert_capacity: float = 1.25
41
+ moe_aux_loss_coeff: float = 0.01
42
+
43
+ # Multi-token prediction
44
+ multi_token_predict: int = 2 # Predict next 2 tokens for children's stories
45
+
46
+ # Quantization
47
+ use_quantization: bool = False
48
+ quantization_bits: int = 8
49
+
50
+
51
+ class RoPEPositionalEncoding(nn.Module):
52
+ """Rotary Positional Encoding (RoPE) for better position understanding"""
53
+
54
+ def __init__(self, dim: int, max_seq_len: int = 2048, base: float = 10000.0):
55
+ super().__init__()
56
+ self.dim = dim
57
+ self.max_seq_len = max_seq_len
58
+ self.base = base
59
+
60
+ # Precompute frequency matrix
61
+ inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
62
+ self.register_buffer('inv_freq', inv_freq)
63
+
64
+ # Cache for efficiency
65
+ self._cached_cos = None
66
+ self._cached_sin = None
67
+ self._cached_seq_len = 0
68
+
69
+ def _compute_cos_sin(self, seq_len: int, device: torch.device):
70
+ """Compute cosine and sine values for given sequence length"""
71
+ if seq_len > self._cached_seq_len or self._cached_cos is None:
72
+ # Create position indices
73
+ t = torch.arange(seq_len, device=device, dtype=self.inv_freq.dtype)
74
+
75
+ # Compute frequencies
76
+ freqs = torch.outer(t, self.inv_freq)
77
+
78
+ # Create rotation matrix components
79
+ cos_vals = torch.cos(freqs)
80
+ sin_vals = torch.sin(freqs)
81
+
82
+ # Cache results
83
+ self._cached_cos = cos_vals
84
+ self._cached_sin = sin_vals
85
+ self._cached_seq_len = seq_len
86
+
87
+ return self._cached_cos[:seq_len], self._cached_sin[:seq_len]
88
+
89
+ def apply_rope(self, x: torch.Tensor, position_ids: Optional[torch.Tensor] = None):
90
+ """Apply RoPE to input tensor"""
91
+ batch_size, seq_len, n_heads, head_dim = x.shape
92
+
93
+ # Get cos/sin values
94
+ cos, sin = self._compute_cos_sin(seq_len, x.device)
95
+
96
+ # Handle position_ids if provided
97
+ if position_ids is not None:
98
+ cos = cos[position_ids]
99
+ sin = sin[position_ids]
100
+
101
+ # Reshape for broadcasting
102
+ cos = cos.unsqueeze(0).unsqueeze(2) # [1, seq_len, 1, head_dim//2]
103
+ sin = sin.unsqueeze(0).unsqueeze(2)
104
+
105
+ # Split x into two halves
106
+ x1 = x[..., ::2] # Even indices
107
+ x2 = x[..., 1::2] # Odd indices
108
+
109
+ # Apply rotation
110
+ rotated_x1 = x1 * cos - x2 * sin
111
+ rotated_x2 = x1 * sin + x2 * cos
112
+
113
+ # Recombine
114
+ rotated_x = torch.stack([rotated_x1, rotated_x2], dim=-1).flatten(-2)
115
+
116
+ return rotated_x
117
+
118
+
119
+ class MultiheadLatentAttention(nn.Module):
120
+ """
121
+ Multihead Latent Attention (MLA) - DeepSeek's efficient attention mechanism
122
+ Uses shared key-value heads with LoRA-style projections for efficiency
123
+ """
124
+
125
+ def __init__(self, config: DeepSeekConfig):
126
+ super().__init__()
127
+ self.config = config
128
+ self.n_head = config.n_head
129
+ self.n_embd = config.n_embd
130
+ self.head_dim = config.n_embd // config.n_head
131
+ self.kv_heads = config.mla_kv_heads
132
+ self.kv_head_dim = self.head_dim
133
+
134
+ # Query projection with LoRA-style decomposition
135
+ self.q_a_proj = nn.Linear(config.n_embd, config.mla_q_lora_rank, bias=False)
136
+ self.q_b_proj = nn.Linear(config.mla_q_lora_rank, config.n_embd, bias=False)
137
+
138
+ # Key-Value projection with shared heads
139
+ self.kv_a_proj = nn.Linear(config.n_embd, config.mla_kv_lora_rank, bias=False)
140
+ self.kv_b_proj = nn.Linear(config.mla_kv_lora_rank, self.kv_heads * self.head_dim * 2, bias=False)
141
+
142
+ # Output projection
143
+ self.out_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
144
+
145
+ # RoPE for positional encoding
146
+ self.rope = RoPEPositionalEncoding(self.head_dim)
147
+
148
+ # Dropout
149
+ self.dropout = nn.Dropout(config.dropout)
150
+
151
+ # Scaling factor
152
+ self.scale = self.head_dim ** -0.5
153
+
154
+ def forward(self, x: torch.Tensor, attention_mask: Optional[torch.Tensor] = None):
155
+ batch_size, seq_len, _ = x.shape
156
+
157
+ # Query projection through LoRA-style decomposition
158
+ q_latent = self.q_a_proj(x) # [B, T, rank]
159
+ q = self.q_b_proj(q_latent) # [B, T, n_embd]
160
+ q = q.view(batch_size, seq_len, self.n_head, self.head_dim)
161
+
162
+ # Key-Value projection through shared heads
163
+ kv_latent = self.kv_a_proj(x) # [B, T, kv_rank]
164
+ kv = self.kv_b_proj(kv_latent) # [B, T, kv_heads * kv_head_dim * 2]
165
+ kv = kv.view(batch_size, seq_len, self.kv_heads, self.head_dim, 2)
166
+ k, v = kv.unbind(dim=-1) # Each: [B, T, kv_heads, kv_head_dim]
167
+
168
+ # Apply RoPE to queries and keys before expansion
169
+ q = self.rope.apply_rope(q)
170
+ k = self.rope.apply_rope(k)
171
+
172
+ # Expand key-value to match query heads
173
+ k = k.repeat_interleave(self.n_head // self.kv_heads, dim=2)
174
+ v = v.repeat_interleave(self.n_head // self.kv_heads, dim=2)
175
+
176
+ # Transpose for attention computation
177
+ q = q.transpose(1, 2) # [B, n_head, T, head_dim]
178
+ k = k.transpose(1, 2) # [B, n_head, T, head_dim]
179
+ v = v.transpose(1, 2) # [B, n_head, T, head_dim]
180
+
181
+ # Compute attention scores
182
+ attn_scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale
183
+
184
+ # Apply causal mask
185
+ if attention_mask is None:
186
+ causal_mask = torch.triu(torch.ones(seq_len, seq_len, device=x.device), diagonal=1).bool()
187
+ attn_scores.masked_fill_(causal_mask, float('-inf'))
188
+ else:
189
+ attn_scores = attn_scores + attention_mask
190
+
191
+ # Apply softmax
192
+ attn_weights = F.softmax(attn_scores, dim=-1)
193
+ attn_weights = self.dropout(attn_weights)
194
+
195
+ # Apply attention to values
196
+ out = torch.matmul(attn_weights, v) # [B, n_head, T, head_dim]
197
+ out = out.transpose(1, 2).contiguous().view(batch_size, seq_len, self.n_embd)
198
+
199
+ # Output projection
200
+ out = self.out_proj(out)
201
+
202
+ return out
203
+
204
+
205
+ class MoEExpert(nn.Module):
206
+ """Expert network for Mixture of Experts"""
207
+
208
+ def __init__(self, config: DeepSeekConfig):
209
+ super().__init__()
210
+ self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
211
+ self.gelu = nn.GELU()
212
+ self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
213
+ self.dropout = nn.Dropout(config.dropout)
214
+
215
+ def forward(self, x: torch.Tensor):
216
+ return self.dropout(self.c_proj(self.gelu(self.c_fc(x))))
217
+
218
+
219
+ class MixtureOfExperts(nn.Module):
+     """Mixture of Experts (MoE) for increased model capacity"""
+ 
+     def __init__(self, config: DeepSeekConfig):
+         super().__init__()
+         self.config = config
+         self.num_experts = config.moe_num_experts
+         self.top_k = config.moe_top_k
+         self.expert_capacity = config.moe_expert_capacity
+ 
+         # Router
+         self.router = nn.Linear(config.n_embd, config.moe_num_experts, bias=False)
+ 
+         # Experts
+         self.experts = nn.ModuleList([MoEExpert(config) for _ in range(config.moe_num_experts)])
+ 
+         # Layer norm
+         self.ln = nn.LayerNorm(config.n_embd, bias=config.bias)
+ 
+     def forward(self, x: torch.Tensor):
+         batch_size, seq_len, hidden_dim = x.shape
+ 
+         # Get router logits
+         router_logits = self.router(x)  # [B, T, num_experts]
+ 
+         # Get top-k experts
+         top_k_logits, top_k_indices = torch.topk(router_logits, self.top_k, dim=-1)
+         top_k_probs = F.softmax(top_k_logits, dim=-1)
+ 
+         # Initialize output
+         output = torch.zeros_like(x)
+ 
+         # Process each expert
+         for expert_idx in range(self.num_experts):
+             # Find tokens that use this expert
+             expert_mask = (top_k_indices == expert_idx).any(dim=-1)  # [B, T]
+ 
+             if expert_mask.any():
+                 # Get tokens for this expert
+                 expert_tokens = x[expert_mask]  # [num_tokens, hidden_dim]
+ 
+                 # Get routing weights for this expert
+                 expert_weights = top_k_probs[expert_mask]  # [num_tokens, top_k]
+                 expert_weights = expert_weights[top_k_indices[expert_mask] == expert_idx]  # [num_tokens]
+ 
+                 # Apply expert
+                 expert_output = self.experts[expert_idx](expert_tokens)  # [num_tokens, hidden_dim]
+ 
+                 # Weight the output
+                 weighted_output = expert_output * expert_weights.unsqueeze(-1)
+ 
+                 # Add to output
+                 output[expert_mask] += weighted_output
+ 
+         # Apply layer norm
+         output = self.ln(output)
+ 
+         return output, router_logits
+ 
+     def _compute_aux_loss(self, router_logits: torch.Tensor):
+         """Compute auxiliary loss for load balancing"""
+         router_probs = F.softmax(router_logits, dim=-1)
+         mean_expert_usage = router_probs.mean(dim=[0, 1])  # [num_experts]
+         target_usage = 1.0 / self.num_experts
+ 
+         aux_loss = torch.sum((mean_expert_usage - target_usage) ** 2)
+         return aux_loss
+ 
+ 
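The routing math above is easier to see on toy tensors. Below is a minimal standalone sketch (not repository code; it assumes only `torch`) of top-k expert selection and the load-balancing auxiliary loss:

```python
# Toy sketch of top-k routing and the load-balancing auxiliary loss above.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, H, E, K = 2, 5, 8, 4, 2             # batch, seq len, hidden dim, experts, top-k

x = torch.randn(B, T, H)
router = torch.nn.Linear(H, E, bias=False)

router_logits = router(x)                  # [B, T, E]
top_k_logits, top_k_indices = torch.topk(router_logits, K, dim=-1)
top_k_probs = F.softmax(top_k_logits, dim=-1)   # weights renormalized over the chosen experts
print(top_k_indices.shape)                 # torch.Size([2, 5, 2])

# Auxiliary loss: squared deviation of mean expert usage from the uniform target
router_probs = F.softmax(router_logits, dim=-1)
mean_usage = router_probs.mean(dim=[0, 1])       # [E]
aux_loss = torch.sum((mean_usage - 1.0 / E) ** 2)
print(aux_loss.item())                     # near 0 when routing is balanced
```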
+ class DeepSeekBlock(nn.Module):
+     """DeepSeek transformer block with MLA and MoE"""
+ 
+     def __init__(self, config: DeepSeekConfig):
+         super().__init__()
+         self.config = config
+ 
+         # Layer norms
+         self.ln1 = nn.LayerNorm(config.n_embd, bias=config.bias)
+         self.ln2 = nn.LayerNorm(config.n_embd, bias=config.bias)
+ 
+         # Attention - use MLA if enabled, otherwise use standard attention
+         if config.use_mla:
+             self.attn = MultiheadLatentAttention(config)
+         else:
+             # Standard multihead attention as fallback
+             self.attn = nn.MultiheadAttention(
+                 config.n_embd,
+                 config.n_head,
+                 dropout=config.dropout,
+                 bias=config.bias,
+                 batch_first=True
+             )
+ 
+         # MoE
+         self.moe = MixtureOfExperts(config)
+ 
+     def forward(self, x: torch.Tensor, attention_mask: Optional[torch.Tensor] = None):
+         # Attention with residual connection
+         if self.config.use_mla:
+             x = x + self.attn(self.ln1(x), attention_mask)
+         else:
+             attn_out, _ = self.attn(self.ln1(x), self.ln1(x), self.ln1(x), attn_mask=attention_mask)
+             x = x + attn_out
+ 
+         # MoE with residual connection
+         moe_output, router_logits = self.moe(self.ln2(x))
+         x = x + moe_output
+ 
+         return x, router_logits
+ 
+ 
+ class MultiTokenPredictor(nn.Module):
+     """Multi-token prediction head for improved training efficiency"""
+ 
+     def __init__(self, config: DeepSeekConfig):
+         super().__init__()
+         self.config = config
+         self.num_tokens = config.multi_token_predict
+ 
+         # Separate prediction heads for each future token
+         self.predictors = nn.ModuleList([
+             nn.Linear(config.n_embd, config.vocab_size, bias=False)
+             for _ in range(config.multi_token_predict)
+         ])
+ 
+     def forward(self, hidden_states: torch.Tensor):
+         """Forward pass for multi-token prediction"""
+         batch_size, seq_len, hidden_dim = hidden_states.shape
+ 
+         # Predict multiple future tokens
+         logits = []
+         for i, predictor in enumerate(self.predictors):
+             # Use hidden states shifted by i+1 positions
+             if i + 1 < seq_len:
+                 token_logits = predictor(hidden_states[:, i+1:i+2, :])  # [B, 1, vocab_size]
+                 logits.append(token_logits)
+             else:
+                 # Pad with zeros if not enough sequence length
+                 token_logits = torch.zeros(batch_size, 1, self.config.vocab_size,
+                                            device=hidden_states.device)
+                 logits.append(token_logits)
+ 
+         return torch.cat(logits, dim=1)  # [B, num_tokens, vocab_size]
+ 
+ 
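As a shape check, here is a standalone toy sketch (not repository code) of the per-head slicing used above: each head maps one slice of the hidden states to vocabulary logits, and the slices are concatenated along the token axis:

```python
# Shape sketch of per-position prediction heads, mirroring MultiTokenPredictor.
import torch
import torch.nn as nn

B, T, H, V, N = 2, 16, 32, 100, 2         # toy sizes; N future tokens
hidden = torch.randn(B, T, H)
heads = nn.ModuleList([nn.Linear(H, V, bias=False) for _ in range(N)])

logits = torch.cat(
    [head(hidden[:, i + 1:i + 2, :]) for i, head in enumerate(heads)],
    dim=1,
)
print(logits.shape)                       # torch.Size([2, 2, 100])
```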
+ class DeepSeek(nn.Module):
+     """DeepSeek model for children's story generation"""
+ 
+     def __init__(self, config: DeepSeekConfig):
+         super().__init__()
+         assert isinstance(config, DeepSeekConfig), "config must be an instance of DeepSeekConfig"
+         self.config = config
+ 
+         # Token and position embeddings
+         self.transformer = nn.ModuleDict(dict(
+             wte=nn.Embedding(config.vocab_size, config.n_embd),
+             wpe=nn.Embedding(config.block_size, config.n_embd),
+             drop=nn.Dropout(config.dropout),
+             h=nn.ModuleList([DeepSeekBlock(config) for _ in range(config.n_layer)]),
+             ln_f=nn.LayerNorm(config.n_embd, bias=config.bias),
+         ))
+ 
+         # Language model head
+         self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
+ 
+         # Multi-token predictor
+         if config.multi_token_predict > 0:
+             self.multi_token_predictor = MultiTokenPredictor(config)
+         else:
+             self.multi_token_predictor = None
+ 
+         # Weight tying
+         self.transformer.wte.weight = self.lm_head.weight
+ 
+         # Initialize weights
+         self.apply(self._init_weights)
+ 
+         # Setup quantization if enabled
+         if config.use_quantization:
+             self._setup_quantization()
+ 
+     def _init_weights(self, module):
+         """Initialize model weights"""
+         if isinstance(module, nn.Linear):
+             nn.init.normal_(module.weight, mean=0.0, std=0.02)
+             if module.bias is not None:
+                 nn.init.zeros_(module.bias)
+         elif isinstance(module, nn.Embedding):
+             nn.init.normal_(module.weight, mean=0.0, std=0.02)
+         elif isinstance(module, nn.LayerNorm):
+             nn.init.ones_(module.weight)
+             if module.bias is not None:
+                 nn.init.zeros_(module.bias)
+ 
+     def _setup_quantization(self):
+         """Setup quantization for the model"""
+         # This would implement quantization logic
+         # For now, just a placeholder
+         pass
+ 
+     def forward(self, input_ids: torch.Tensor, targets: Optional[torch.Tensor] = None):
+         """Forward pass"""
+         device = input_ids.device
+         batch_size, seq_len = input_ids.size()
+         assert seq_len <= self.config.block_size
+ 
+         # Position indices
+         pos = torch.arange(0, seq_len, dtype=torch.long, device=device)
+ 
+         # Token and position embeddings
+         tok_emb = self.transformer.wte(input_ids)
+         pos_emb = self.transformer.wpe(pos)
+ 
+         x = self.transformer.drop(tok_emb + pos_emb)
+ 
+         # Forward through transformer blocks
+         router_logits_list = []
+         for block in self.transformer.h:
+             x, router_logits = block(x)
+             router_logits_list.append(router_logits)
+ 
+         # Final layer norm
+         x = self.transformer.ln_f(x)
+ 
+         if targets is not None:
+             # Training mode
+             if self.multi_token_predictor is not None:
+                 # Multi-token prediction
+                 multi_logits = self.multi_token_predictor(x)
+                 loss = self._compute_multi_token_loss(multi_logits, targets)
+                 logits = multi_logits
+             else:
+                 # Standard single-token prediction
+                 logits = self.lm_head(x)
+                 loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
+                                        targets.view(-1), ignore_index=-1)
+ 
+             # Add MoE auxiliary loss
+             if router_logits_list:
+                 aux_loss = sum(self.transformer.h[i].moe._compute_aux_loss(router_logits_list[i])
+                                for i in range(len(router_logits_list)))
+                 loss += self.config.moe_aux_loss_coeff * aux_loss
+ 
+             return logits, loss
+         else:
+             # Inference mode
+             logits = self.lm_head(x[:, [-1], :])
+             return logits, None
+ 
+     def _compute_multi_token_loss(self, logits: torch.Tensor, targets: torch.Tensor):
+         """Compute loss for multi-token prediction"""
+         batch_size, num_tokens, vocab_size = logits.shape
+ 
+         # The predictor emits logits for num_tokens positions only, so align
+         # the targets to the same number of positions before flattening
+         # (flattening the full [B, T] targets would not match the logits).
+         targets = targets[:, :num_tokens]
+ 
+         # Reshape for loss computation
+         logits_flat = logits.reshape(-1, vocab_size)
+         targets_flat = targets.reshape(-1)
+ 
+         # Compute cross-entropy loss
+         loss = F.cross_entropy(logits_flat, targets_flat, ignore_index=-1)
+ 
+         return loss
+ 
+     @torch.no_grad()
+     def generate(self, input_ids: torch.Tensor, max_new_tokens: int = 100,
+                  temperature: float = 1.0, top_k: Optional[int] = None):
+         """Generate text using the model"""
+         for _ in range(max_new_tokens):
+             # Ensure input doesn't exceed block size
+             idx_cond = input_ids if input_ids.size(1) <= self.config.block_size else input_ids[:, -self.config.block_size:]
+ 
+             # Forward pass
+             logits, _ = self(idx_cond)
+             logits = logits[:, -1, :] / temperature
+ 
+             # Apply top-k filtering
+             if top_k is not None:
+                 v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
+                 logits[logits < v[:, [-1]]] = -float('Inf')
+ 
+             # Sample next token
+             probs = F.softmax(logits, dim=-1)
+             idx_next = torch.multinomial(probs, num_samples=1)
+             input_ids = torch.cat((input_ids, idx_next), dim=1)
+ 
+         return input_ids
+ 
+     @classmethod
+     def from_pretrained(cls, model_type: str, override_args: Optional[dict] = None):
+         """Load a pretrained model"""
+         # This would implement loading from pretrained weights
+         # For now, return a default configuration
+         config = DeepSeekConfig()
+         if override_args:
+             for key, value in override_args.items():
+                 setattr(config, key, value)
+         return cls(config)
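A minimal usage sketch for the model as a whole, assuming the file is importable as `model.deepseek` and that the `DeepSeekConfig` defaults match the README (6 layers, 8 heads, 512-dim embeddings):

```python
import torch
from model.deepseek import DeepSeek, DeepSeekConfig

config = DeepSeekConfig()                 # assumed defaults: 6L / 8H / 512D
model = DeepSeek(config).eval()

prompt = torch.randint(0, config.vocab_size, (1, 8))   # stand-in for GPT-2 token ids
with torch.no_grad():
    logits, _ = model(prompt)             # inference path: logits for the last position
    out = model.generate(prompt, max_new_tokens=20, temperature=0.8, top_k=50)
print(logits.shape, out.shape)            # [1, 1, 50257] and [1, 28]
```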
src/run_training.py ADDED
@@ -0,0 +1,307 @@
+ """
+ DeepSeek Children's Stories Training Script
+ Main training script for the DeepSeek model on children's stories
+ """
+ 
+ import os
+ import sys
+ import argparse
+ import torch
+ from dataclasses import dataclass
+ from typing import Optional
+ 
+ # Add the repository root (the parent of src/) to the Python path
+ sys.path.append(os.path.join(os.path.dirname(__file__), '..'))
+ 
+ from model.deepseek import DeepSeek, DeepSeekConfig
+ from training.trainer import DeepSeekTrainer, create_deepseek_trainer
+ from data.data_processor import DeepSeekDataProcessor
+ 
+ 
+ @dataclass
+ class TrainingConfig:
+     """Configuration for DeepSeek training"""
+     # Model configuration
+     vocab_size: int = 50257
+     n_layer: int = 6
+     n_head: int = 8
+     n_embd: int = 512
+     block_size: int = 1024
+     dropout: float = 0.1
+     bias: bool = True
+ 
+     # MLA configuration
+     use_mla: bool = True
+     mla_kv_heads: int = 4
+     mla_q_lora_rank: int = 32
+     mla_kv_lora_rank: int = 16
+ 
+     # MoE configuration
+     moe_num_experts: int = 4
+     moe_top_k: int = 2
+     moe_expert_capacity: float = 1.25
+     moe_aux_loss_coeff: float = 0.01
+ 
+     # Multi-token prediction
+     multi_token_predict: int = 0  # number of future tokens to predict; 0 disables it (the CLI default is 2)
+ 
+     # Quantization
+     use_quantization: bool = False
+     quantization_bits: int = 8
+ 
+     # Training configuration
+     batch_size: int = 12
+     max_iters: int = 20000
+     eval_interval: int = 1000
+     eval_iters: int = 200
+     learning_rate: float = 6e-4
+     weight_decay: float = 0.1
+     warmup_iters: int = 2000
+     lr_decay_iters: int = 20000
+     min_lr: float = 6e-5
+ 
+     # System configuration
+     checkpoint_dir: str = 'checkpoints'
+     use_mixed_precision: bool = True
+     compile_model: bool = True
+ 
+     # Data configuration
+     dataset_name: str = "ajibawa-2023/Children-Stories-Collection"
+     data_dir: str = 'src/data'
+ 
+ 
+ def setup_environment():
+     """Setup the training environment"""
+     print("Setting up DeepSeek Children's Stories training environment...")
+ 
+     # Check CUDA availability
+     if torch.cuda.is_available():
+         print(f"CUDA available: {torch.cuda.get_device_name(0)}")
+         print(f"CUDA memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
+     else:
+         print("CUDA not available, using CPU")
+ 
+     # Create necessary directories
+     os.makedirs('checkpoints', exist_ok=True)
+     os.makedirs('lora_checkpoints', exist_ok=True)
+     os.makedirs('src/data', exist_ok=True)
+ 
+     print("Environment setup complete!")
+ 
+ 
+ def prepare_data():
+     """Prepare the dataset for training"""
+     print("Preparing dataset...")
+ 
+     processor = DeepSeekDataProcessor()
+     data_files = processor.prepare_dataset()
+ 
+     print("Dataset preparation complete!")
+     return data_files
+ 
+ 
+ def create_model(config: TrainingConfig) -> DeepSeek:
+     """Create the DeepSeek model"""
+     print("Creating DeepSeek model...")
+ 
+     # Create model configuration
+     model_config = DeepSeekConfig(
+         vocab_size=config.vocab_size,
+         n_layer=config.n_layer,
+         n_head=config.n_head,
+         n_embd=config.n_embd,
+         block_size=config.block_size,
+         dropout=config.dropout,
+         bias=config.bias,
+         use_mla=config.use_mla,
+         mla_kv_heads=config.mla_kv_heads,
+         mla_q_lora_rank=config.mla_q_lora_rank,
+         mla_kv_lora_rank=config.mla_kv_lora_rank,
+         moe_num_experts=config.moe_num_experts,
+         moe_top_k=config.moe_top_k,
+         moe_expert_capacity=config.moe_expert_capacity,
+         moe_aux_loss_coeff=config.moe_aux_loss_coeff,
+         multi_token_predict=config.multi_token_predict,
+         use_quantization=config.use_quantization,
+         quantization_bits=config.quantization_bits
+     )
+ 
+     # Create model
+     model = DeepSeek(model_config)
+ 
+     # Compile model if requested
+     if config.compile_model and hasattr(torch, 'compile'):
+         print("Compiling model with torch.compile...")
+         model = torch.compile(model)
+ 
+     # Print model info
+     total_params = sum(p.numel() for p in model.parameters())
+     trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
+ 
+     print("Model created successfully!")
+     print(f"Total parameters: {total_params:,}")
+     print(f"Trainable parameters: {trainable_params:,}")
+     print("Model configuration:")
+     print(f"  - Layers: {config.n_layer}")
+     print(f"  - Heads: {config.n_head}")
+     print(f"  - Embedding dim: {config.n_embd}")
+     print(f"  - MLA enabled: {config.use_mla}")
+     print(f"  - MLA KV heads: {config.mla_kv_heads}")
+     print(f"  - MoE experts: {config.moe_num_experts}")
+     print(f"  - Multi-token prediction: {config.multi_token_predict}")
+ 
+     return model
+ 
+ 
+ def train_model(model: DeepSeek, config: TrainingConfig):
+     """Train the DeepSeek model"""
+     print("[+] Starting training with config:")
+     print(f"  - Model size: {sum(p.numel() for p in model.parameters()):,} parameters")
+     print(f"  - Multi-token prediction: {config.multi_token_predict}")
+     print(f"  - MoE experts: {config.moe_num_experts}")
+     print(f"  - MLA enabled: {config.use_mla}")
+ 
+     # Setup device
+     device = 'cuda' if torch.cuda.is_available() else 'cpu'
+     model = model.to(device)
+ 
+     # Create optimizer
+     optimizer = torch.optim.AdamW(
+         model.parameters(),
+         lr=config.learning_rate,
+         weight_decay=config.weight_decay,
+         betas=(0.9, 0.95)
+     )
+ 
+     # Initialize trainer with individual parameters
+     trainer = DeepSeekTrainer(
+         model=model,
+         optimizer=optimizer,
+         device=device,
+         batch_size=config.batch_size,
+         max_iters=config.max_iters,
+         eval_interval=config.eval_interval,
+         eval_iters=config.eval_iters,
+         learning_rate=config.learning_rate,
+         weight_decay=config.weight_decay,
+         warmup_iters=config.warmup_iters,
+         lr_decay_iters=config.lr_decay_iters,
+         min_lr=config.min_lr,
+         checkpoint_dir=config.checkpoint_dir,
+         use_mixed_precision=config.use_mixed_precision
+     )
+ 
+     try:
+         # Start training
+         trainer.train()
+         print("[+] Training completed successfully!")
+ 
+         # Save final model
+         final_model_path = os.path.join(config.checkpoint_dir, "final_model.pt")
+         torch.save({
+             'model_state_dict': model.state_dict(),
+             'config': config,
+             'optimizer_state_dict': trainer.optimizer.state_dict(),
+         }, final_model_path)
+         print(f"[+] Final model saved to {final_model_path}")
+ 
+     except Exception as e:
+         print(f"[-] Training failed: {e}")
+         import traceback
+         traceback.print_exc()
+         raise
+ 
+ 
+ def main():
+     """Main training function"""
+     parser = argparse.ArgumentParser(description='Train DeepSeek model on children\'s stories')
+ 
+     # Model configuration
+     parser.add_argument('--n-layer', type=int, default=6, help='Number of layers')
+     parser.add_argument('--n-head', type=int, default=8, help='Number of attention heads')
+     parser.add_argument('--n-embd', type=int, default=512, help='Embedding dimension')
+     parser.add_argument('--block-size', type=int, default=1024, help='Context window size')
+ 
+     # Training configuration
+     parser.add_argument('--batch-size', type=int, default=12, help='Batch size')
+     parser.add_argument('--max-iters', type=int, default=20000, help='Maximum iterations')
+     parser.add_argument('--learning-rate', type=float, default=6e-4, help='Learning rate')
+     parser.add_argument('--eval-interval', type=int, default=1000, help='Evaluation interval')
+     parser.add_argument('--eval-iters', type=int, default=200, help='Number of evaluation iterations')
+     parser.add_argument('--weight-decay', type=float, default=0.1, help='Weight decay')
+     parser.add_argument('--warmup-iters', type=int, default=2000, help='Warmup iterations')
+     parser.add_argument('--lr-decay-iters', type=int, default=20000, help='Learning rate decay iterations')
+     parser.add_argument('--min-lr', type=float, default=6e-5, help='Minimum learning rate')
+ 
+     # Advanced features
+     parser.add_argument('--moe-experts', type=int, default=4, help='Number of MoE experts')
+     parser.add_argument('--multi-token', type=int, default=2, help='Number of future tokens to predict')
+     parser.add_argument('--no-compile', action='store_true', help='Disable model compilation')
+     parser.add_argument('--no-mixed-precision', action='store_true', help='Disable mixed precision')
+ 
+     # Resume training
+     parser.add_argument('--resume', type=str, help='Resume from checkpoint')
+ 
+     args = parser.parse_args()
+ 
+     # Create configuration
+     config = TrainingConfig(
+         n_layer=args.n_layer,
+         n_head=args.n_head,
+         n_embd=args.n_embd,
+         block_size=args.block_size,
+         batch_size=args.batch_size,
+         max_iters=args.max_iters,
+         learning_rate=args.learning_rate,
+         eval_interval=args.eval_interval,
+         eval_iters=args.eval_iters,
+         weight_decay=args.weight_decay,
+         warmup_iters=args.warmup_iters,
+         lr_decay_iters=args.lr_decay_iters,
+         min_lr=args.min_lr,
+         moe_num_experts=args.moe_experts,
+         multi_token_predict=args.multi_token,
+         compile_model=not args.no_compile,
+         use_mixed_precision=not args.no_mixed_precision
+     )
+ 
+     print("DeepSeek Children's Stories Training")
+     print("=" * 50)
+     print("Configuration:")
+     print(f"  - Model: {config.n_layer}L/{config.n_head}H/{config.n_embd}D")
+     print(f"  - MoE: {config.moe_num_experts} experts")
+     print(f"  - Multi-token: {config.multi_token_predict}")
+     print(f"  - Batch size: {config.batch_size}")
+     print(f"  - Max iterations: {config.max_iters}")
+     print(f"  - Learning rate: {config.learning_rate}")
+     print(f"  - Weight decay: {config.weight_decay}")
+     print(f"  - Warmup iterations: {config.warmup_iters}")
+     print(f"  - LR decay iterations: {config.lr_decay_iters}")
+     print(f"  - Min learning rate: {config.min_lr}")
+     print("=" * 50)
+ 
+     # Setup environment
+     setup_environment()
+ 
+     # Prepare data
+     data_files = prepare_data()
+ 
+     # Create model
+     model = create_model(config)
+ 
+     # Resume from checkpoint if specified
+     if args.resume:
+         print(f"Resuming from checkpoint: {args.resume}")
+         checkpoint = torch.load(args.resume, map_location='cpu')
+         model.load_state_dict(checkpoint['model'])
+         print("Checkpoint loaded successfully!")
+ 
+     # Train model
+     train_model(model, config)
+ 
+     print("Training completed successfully!")
+     print("Best model saved to: checkpoints/best_model.pt")
+ 
+ 
+ if __name__ == "__main__":
+     main()
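For reference, the same run can be assembled programmatically instead of through the CLI; a sketch mirroring `main()`, with illustrative argument values:

```python
# Programmatic equivalent of the CLI flags above (illustrative values).
from run_training import TrainingConfig, setup_environment, prepare_data, create_model, train_model

config = TrainingConfig(batch_size=8, max_iters=5000, multi_token_predict=2)
setup_environment()
prepare_data()
model = create_model(config)
train_model(model, config)
```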
src/training/__pycache__/trainer.cpython-310.pyc ADDED
Binary file (10.8 kB)
 
src/training/trainer.py ADDED
@@ -0,0 +1,408 @@
+ """
+ DeepSeek Trainer for Children's Stories
+ Advanced training with MLA, MoE, and multi-token prediction
+ """
+ 
+ import torch
+ import numpy as np
+ from tqdm.auto import tqdm
+ from torch.optim.lr_scheduler import LinearLR, SequentialLR, CosineAnnealingLR
+ import matplotlib.pyplot as plt
+ import os
+ import datetime
+ import time
+ import shutil
+ import psutil
+ import math
+ import gc
+ import torch.nn as nn
+ from torch.nn import functional as F
+ from torch.utils.data.distributed import DistributedSampler
+ from torch.nn.parallel import DistributedDataParallel as DDP
+ from torch.distributed import init_process_group, destroy_process_group
+ from typing import Dict, List, Optional, Tuple
+ 
+ class DeepSeekTrainer:
+     def __init__(self, model, optimizer, device, batch_size, max_iters, eval_interval,
+                  eval_iters, learning_rate, weight_decay, warmup_iters, lr_decay_iters,
+                  min_lr, checkpoint_dir='checkpoints', use_mixed_precision=True):
+         self.model = model
+         self.optimizer = optimizer
+         self.device = device
+         self.batch_size = batch_size
+         self.max_iters = max_iters
+         self.eval_interval = eval_interval
+         self.eval_iters = eval_iters
+         self.learning_rate = learning_rate
+         self.weight_decay = weight_decay
+         self.warmup_iters = warmup_iters
+         self.lr_decay_iters = lr_decay_iters
+         self.min_lr = min_lr
+         self.checkpoint_dir = checkpoint_dir
+         self.use_mixed_precision = use_mixed_precision
+         self.best_loss = float('inf')
+ 
+         # Training state
+         self.current_iter = 0
+         self.train_losses = []
+         self.val_losses = []
+         self.learning_rates = []
+ 
+         # Create checkpoint directory if it doesn't exist
+         os.makedirs(checkpoint_dir, exist_ok=True)
+ 
+         # Initialize gradient scaler for mixed precision training
+         if use_mixed_precision and device == 'cuda':
+             self.scaler = torch.cuda.amp.GradScaler()
+         else:
+             self.scaler = None
+ 
+         # Initialize training metrics
+         self.metrics = {
+             'train_loss': [],
+             'val_loss': [],
+             'learning_rates': [],
+             'grad_norm': [],
+             'memory_usage': [],
+             'moe_aux_loss': [],
+             'multi_token_loss': []
+         }
+ 
+         # Load data
+         self.data = self.load_data()
+         self.n = len(self.data)
+ 
+     def load_data(self):
+         """Load the training data"""
+         try:
+             data_file = os.path.join('src', 'data', 'train.bin')
+             if not os.path.exists(data_file):
+                 raise FileNotFoundError(f"Training data file not found at {data_file}")
+ 
+             # Load data as numpy array first
+             data = np.memmap(data_file, dtype=np.uint16, mode='r')
+             # Convert to tensor
+             data = torch.from_numpy(data.copy())  # Make a copy to ensure it's writable
+             return data
+         except Exception as e:
+             print(f"Error loading data: {str(e)}")
+             raise
+ 
+     def get_batch(self, split):
+         """Get a batch of data"""
+         # NOTE: only train.bin is loaded, so the 'split' argument is currently
+         # ignored and 'val' batches are drawn from the same training data.
+         try:
+             # Generate random indices
+             ix = torch.randint(len(self.data) - self.model.config.block_size, (self.batch_size,))
+ 
+             # Get input sequences
+             x = torch.stack([self.data[i:i+self.model.config.block_size].long() for i in ix])
+             # Get target sequences (shifted by 1)
+             y = torch.stack([self.data[i+1:i+1+self.model.config.block_size].long() for i in ix])
+ 
+             # Move to device
+             x, y = x.to(self.device), y.to(self.device)
+             return x, y
+         except Exception as e:
+             print(f"Error in get_batch: {str(e)}")
+             raise
+ 
+     def get_lr(self, it):
+         """Get learning rate for current iteration"""
+         # 1) linear warmup for warmup_iters steps
+         if it < self.warmup_iters:
+             return self.learning_rate * it / self.warmup_iters
+         # 2) if it > lr_decay_iters, return min learning rate
+         if it > self.lr_decay_iters:
+             return self.min_lr
+         # 3) in between, use cosine decay down to min learning rate
+         decay_ratio = (it - self.warmup_iters) / (self.lr_decay_iters - self.warmup_iters)
+         assert 0 <= decay_ratio <= 1
+         coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # coeff ranges 0..1
+         return self.min_lr + coeff * (self.learning_rate - self.min_lr)
+ 
+     def estimate_loss(self):
+         """Estimate average loss on train and validation batches"""
+         out = {}
+         self.model.eval()
+         for split in ['train', 'val']:
+             losses = torch.zeros(self.eval_iters)
+             for k in range(self.eval_iters):
+                 try:
+                     X, Y = self.get_batch(split)
+                     with torch.no_grad():
+                         if self.scaler is not None:
+                             with torch.cuda.amp.autocast():
+                                 logits, loss = self.model(X, Y)
+                         else:
+                             logits, loss = self.model(X, Y)
+                     losses[k] = loss.item()
+                 except Exception as e:
+                     print(f"Error during evaluation: {str(e)}")
+                     continue
+             out[split] = losses.mean()
+         self.model.train()
+         return out
+ 
+     def check_disk_space(self, required_space_mb=1000):
+         """Check if there's enough disk space for saving the model"""
+         try:
+             # Get disk usage statistics
+             disk_usage = psutil.disk_usage('/')
+             free_space_mb = disk_usage.free / (1024 * 1024)  # Convert to MB
+ 
+             if free_space_mb < required_space_mb:
+                 print(f"Warning: Low disk space. Only {free_space_mb:.2f}MB free, {required_space_mb}MB required")
+                 return False
+             return True
+         except Exception as e:
+             print(f"Warning: Could not check disk space: {e}")
+             return True  # Continue anyway if we can't check
+ 
+     def save_checkpoint(self, iter_num, loss, is_best=False):
+         """Save model checkpoint"""
+         try:
+             checkpoint = {
+                 'model': self.model.state_dict(),
+                 'optimizer': self.optimizer.state_dict(),
+                 'iter_num': iter_num,
+                 'loss': loss,
+                 'config': self.model.config,
+                 'train_losses': self.train_losses,
+                 'val_losses': self.val_losses,
+                 'learning_rates': self.learning_rates,
+                 'metrics': self.metrics,
+                 'best_loss': self.best_loss
+             }
+             checkpoint_path = os.path.join(self.checkpoint_dir, f'checkpoint_{iter_num}.pt')
+             torch.save(checkpoint, checkpoint_path)
+ 
+             if is_best:
+                 best_path = os.path.join(self.checkpoint_dir, 'best_model.pt')
+                 torch.save(checkpoint, best_path)
+                 print(f"Saved best model with loss {loss:.4f}")
+ 
+             print(f"Saved checkpoint to {checkpoint_path}")
+             return True
+         except Exception as e:
+             print(f"Error saving checkpoint: {str(e)}")
+             return False
+ 
+     def load_checkpoint(self, checkpoint_path):
+         """Load model checkpoint with error handling"""
+         try:
+             checkpoint = torch.load(checkpoint_path, map_location=self.device)
+             self.model.load_state_dict(checkpoint['model'])
+             self.optimizer.load_state_dict(checkpoint['optimizer'])
+             self.current_iter = checkpoint['iter_num']
+             self.best_loss = checkpoint['loss']
+             self.train_losses = checkpoint.get('train_losses', [])
+             self.val_losses = checkpoint.get('val_losses', [])
+             self.learning_rates = checkpoint.get('learning_rates', [])
+             self.metrics = checkpoint.get('metrics', self.metrics)
+             print(f"Successfully loaded checkpoint from iteration {self.current_iter}")
+             return True
+         except Exception as e:
+             print(f"Error loading checkpoint: {e}")
+             return False
+ 
+     def train(self):
+         """Train the DeepSeek model"""
+         print(f"DeepSeek Training started at: {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
+         print(f"Model: {self.model.config.n_layer} layers, {self.model.config.n_head} heads, {self.model.config.n_embd} dims")
+         print(f"MLA: {self.model.config.mla_kv_heads} KV heads, MoE: {self.model.config.moe_num_experts} experts")
+         print(f"Multi-token prediction: {self.model.config.multi_token_predict} tokens")
+         start_time = time.time()
+         current_loss = None  # defined before the try so the emergency-save path can always see it
+ 
+         try:
+             # Initialize training
+             X, Y = self.get_batch('train')
+             best_loss = float('inf')
+ 
+             for iter_num in range(self.current_iter, self.max_iters):
+                 self.current_iter = iter_num
+ 
+                 # Determine and set the learning rate for this iteration
+                 lr = self.get_lr(iter_num)
+                 for param_group in self.optimizer.param_groups:
+                     param_group['lr'] = lr
+ 
+                 # Forward pass with mixed precision
+                 if self.scaler is not None:
+                     with torch.cuda.amp.autocast():
+                         logits, loss = self.model(X, Y)
+                 else:
+                     logits, loss = self.model(X, Y)
+ 
+                 # Backward pass
+                 if self.scaler is not None:
+                     self.scaler.scale(loss).backward()
+                     self.scaler.step(self.optimizer)
+                     self.scaler.update()
+                 else:
+                     loss.backward()
+                     self.optimizer.step()
+ 
+                 self.optimizer.zero_grad(set_to_none=True)
+ 
+                 # Get new batch
+                 X, Y = self.get_batch('train')
+ 
+                 # Track metrics
+                 current_loss = loss.item()
+                 self.train_losses.append(current_loss)
+                 self.learning_rates.append(lr)
+ 
+                 # Update best loss
+                 if current_loss < best_loss:
+                     best_loss = current_loss
+ 
+                 # Evaluation
+                 if iter_num % self.eval_interval == 0:
+                     losses = self.estimate_loss()
+                     self.val_losses.append(losses['val'])
+ 
+                     # Save checkpoint if it's the best so far
+                     if losses['val'] < self.best_loss:
+                         self.best_loss = losses['val']
+                         self.save_checkpoint(iter_num, losses['val'], is_best=True)
+ 
+                     # Regular checkpoint saving
+                     if iter_num % (self.eval_interval * 5) == 0:
+                         self.save_checkpoint(iter_num, losses['val'])
+ 
+                     # Print progress
+                     elapsed = time.time() - start_time
+                     print(f"iter {iter_num}: train_loss {current_loss:.4f}, val_loss {losses['val']:.4f}, "
+                           f"lr {lr:.2e}, time {elapsed:.2f}s")
+ 
+                     # Memory usage
+                     if self.device == 'cuda':
+                         memory_used = torch.cuda.memory_allocated() / 1024**3
+                         print(f"GPU memory: {memory_used:.2f} GB")
+ 
+                 # Memory cleanup
+                 if iter_num % 100 == 0:
+                     gc.collect()
+                     if self.device == 'cuda':
+                         torch.cuda.empty_cache()
+ 
+             # Final checkpoint
+             self.save_checkpoint(self.max_iters, current_loss)
+ 
+             # Plot training metrics
+             self.plot_metrics()
+ 
+             print(f"Training completed in {time.time() - start_time:.2f} seconds")
+ 
+         except Exception as e:
+             print(f"Error during training: {str(e)}")
+             # Save emergency checkpoint
+             if current_loss is not None:
+                 self.save_checkpoint(self.current_iter, current_loss)
+             raise
+ 
+     def plot_losses(self, train_losses, val_losses):
+         """Plot training and validation losses"""
+         plt.figure(figsize=(12, 4))
+ 
+         plt.subplot(1, 2, 1)
+         plt.plot(train_losses, label='Training Loss')
+         plt.plot(val_losses, label='Validation Loss')
+         plt.title('Training and Validation Loss')
+         plt.xlabel('Iteration')
+         plt.ylabel('Loss')
+         plt.legend()
+         plt.grid(True)
+ 
+         plt.subplot(1, 2, 2)
+         plt.plot(self.learning_rates)
+         plt.title('Learning Rate Schedule')
+         plt.xlabel('Iteration')
+         plt.ylabel('Learning Rate')
+         plt.grid(True)
+ 
+         plt.tight_layout()
+         plt.savefig('training_metrics.png', dpi=300, bbox_inches='tight')
+         plt.close()
+ 
+     def plot_metrics(self):
+         """Plot comprehensive training metrics"""
+         if not self.train_losses or not self.val_losses:
+             print("No metrics to plot")
+             return
+ 
+         fig, axes = plt.subplots(2, 2, figsize=(15, 10))
+ 
+         # Training and validation loss
+         axes[0, 0].plot(self.train_losses, label='Training Loss', alpha=0.7)
+         axes[0, 0].plot(self.val_losses, label='Validation Loss', alpha=0.7)
+         axes[0, 0].set_title('Training and Validation Loss')
+         axes[0, 0].set_xlabel('Iteration')
+         axes[0, 0].set_ylabel('Loss')
+         axes[0, 0].legend()
+         axes[0, 0].grid(True)
+ 
+         # Learning rate
+         axes[0, 1].plot(self.learning_rates)
+         axes[0, 1].set_title('Learning Rate Schedule')
+         axes[0, 1].set_xlabel('Iteration')
+         axes[0, 1].set_ylabel('Learning Rate')
+         axes[0, 1].grid(True)
+ 
+         # Memory usage
+         if self.metrics['memory_usage']:
+             axes[1, 0].plot(self.metrics['memory_usage'])
+             axes[1, 0].set_title('GPU Memory Usage')
+             axes[1, 0].set_xlabel('Iteration')
+             axes[1, 0].set_ylabel('Memory (GB)')
+             axes[1, 0].grid(True)
+ 
+         # Gradient norm
+         if self.metrics['grad_norm']:
+             axes[1, 1].plot(self.metrics['grad_norm'])
+             axes[1, 1].set_title('Gradient Norm')
+             axes[1, 1].set_xlabel('Iteration')
+             axes[1, 1].set_ylabel('Norm')
+             axes[1, 1].grid(True)
+ 
+         plt.tight_layout()
+         plt.savefig('deepseek_training_metrics.png', dpi=300, bbox_inches='tight')
+         plt.close()
+ 
+         print("Training metrics saved to deepseek_training_metrics.png")
+ 
+ 
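The schedule implemented by `get_lr()` is linear warmup followed by cosine decay to a floor. A standalone sketch (plain `math`, using the default values from the training config) evaluates it at a few iterations:

```python
import math

def lr_at(it, lr=6e-4, min_lr=6e-5, warmup=2000, decay=20000):
    """Same warmup + cosine schedule as DeepSeekTrainer.get_lr()."""
    if it < warmup:
        return lr * it / warmup
    if it > decay:
        return min_lr
    ratio = (it - warmup) / (decay - warmup)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))
    return min_lr + coeff * (lr - min_lr)

for it in (0, 1000, 2000, 11000, 20000):
    print(it, f"{lr_at(it):.2e}")
# 0 -> 0.00e+00, 1000 -> 3.00e-04 (mid-warmup), 2000 -> 6.00e-04 (peak),
# 11000 -> 3.30e-04 (cosine midpoint), 20000 -> 6.00e-05 (floor)
```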
+ def create_deepseek_trainer(model, config):
+     """Create a DeepSeek trainer with the given configuration"""
+     # Optimizer
+     optimizer = torch.optim.AdamW(
+         model.parameters(),
+         lr=config.learning_rate,
+         weight_decay=config.weight_decay,
+         betas=(0.9, 0.95)
+     )
+ 
+     # Device
+     device = 'cuda' if torch.cuda.is_available() else 'cpu'
+     model = model.to(device)
+ 
+     # Trainer
+     trainer = DeepSeekTrainer(
+         model=model,
+         optimizer=optimizer,
+         device=device,
+         batch_size=config.batch_size,
+         max_iters=config.max_iters,
+         eval_interval=config.eval_interval,
+         eval_iters=config.eval_iters,
+         learning_rate=config.learning_rate,
+         weight_decay=config.weight_decay,
+         warmup_iters=config.warmup_iters,
+         lr_decay_iters=config.lr_decay_iters,
+         min_lr=config.min_lr,
+         checkpoint_dir=config.checkpoint_dir,
+         use_mixed_precision=config.use_mixed_precision
+     )
+ 
+     return trainer
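A hedged end-to-end usage sketch for the trainer, assuming `src/data/train.bin` already exists (produced by the data processor) and the import layout used by `run_training.py`:

```python
from model.deepseek import DeepSeek, DeepSeekConfig
from run_training import TrainingConfig
from training.trainer import create_deepseek_trainer

config = TrainingConfig(max_iters=1000, eval_interval=200)   # short smoke run
model = DeepSeek(DeepSeekConfig(block_size=config.block_size))
trainer = create_deepseek_trainer(model, config)
trainer.train()                    # checkpoints land in config.checkpoint_dir
```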
training_metrics.png ADDED

Git LFS Details
  • SHA256: 73f8b42b8ab0aa737ed8a7f4c16b607e28ee36b32de6cdf170ec8bd92281d31b
  • Pointer size: 131 Bytes
  • Size of remote file: 209 kB