Updated README.md
README.md
CHANGED
@@ -1,3 +1,87 @@
- ---
- license: apache-2.0
- ---
---
license: apache-2.0
datasets:
- Salesforce/wikitext
language:
- en
metrics:
- perplexity
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation-inference
- custom-code
- from-scratch
- llama-inspired
- educational
- wikitext
- deepseek-tokenizer
- ~200M-params
---
# LLAMA-3-From-Scartch (Custom ~221M Model)

## Model Description

This repository contains a ~221 million parameter decoder-only Transformer language model built "from scratch". The architecture is inspired by models like Llama but significantly scaled down and adapted during development to run within tight hardware constraints (training was demonstrated on a single GPU with ~4 GB of VRAM).

**This model is primarily an educational project** demonstrating the implementation of core Transformer components (RMSNorm, RoPE, attention, an FFN with SwiGLU, weight tying) and a basic training pipeline (checkpointing, LR scheduling, AMP, and validation).

**Key Architectural Features** (a minimal sketch of some of these components follows the list):
* **Parameters:** ~221 million (weight tied)
* **Layers:** 12 Transformer blocks
* **Hidden Size:** 768
* **Attention:** 12 heads (multi-head attention, `head_dim=64`)
* **FFN Intermediate Size:** 2048 (SwiGLU activation)
* **Normalization:** RMSNorm
* **Positional Embeddings:** Rotary Positional Embeddings (RoPE)
* **Weight Tying:** The input embeddings and the output projection layer share weights.
* **Tokenizer:** `deepseek-ai/DeepSeek-R1` (vocab size: 128,000). *Note: requires `trust_remote_code=True`.*
* **Context Length:** Configured with `max_position_embeddings=4096`; the demo training run used `sequence_length=256`.
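To make the list above concrete, here is a minimal PyTorch sketch of two of those components (RMSNorm and the SwiGLU feed-forward block) using the sizes listed. It is an illustration only; the class names, attribute names, and `eps` value are assumptions and may not match the code in `model_architecture.py`.

```python
# Illustrative sketch of RMSNorm and a SwiGLU FFN with hidden_size=768 and
# intermediate_size=2048, as listed above. Names here are assumptions and may
# differ from the classes defined in model_architecture.py.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square norm: scales x by 1/rms(x); no mean subtraction or bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight


class SwiGLUFFN(nn.Module):
    """Llama-style feed-forward block: down(silu(gate(x)) * up(x))."""
    def __init__(self, hidden_size: int = 768, intermediate_size: int = 2048):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


# Shape check: (batch=2, seq=16, hidden=768) in, same shape out.
x = torch.randn(2, 16, 768)
print(SwiGLUFFN()(RMSNorm(768)(x)).shape)  # torch.Size([2, 16, 768])
```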
**Training:**
* **Dataset:** Primarily trained on **WikiText-2** (`wikitext-2-raw-v1`, ~2M tokens) for demonstration purposes. The tokenized versions (`wikitext2_tokens_128k.pt`, `wikitext2_val_tokens_128k.pt`) are included in the repository.
* **Procedure:** Trained on a single GPU using PyTorch with AMP (float16), the AdamW optimizer, and a cosine learning-rate schedule with warmup (a sketch of this setup follows the list). The provided checkpoints (`step_600.pt`, `step_800.pt`, and possibly others) represent states after very limited training.
* **Performance:** Because of the extremely limited training data and duration, the model exhibits basic pattern learning but **lacks coherence, factual accuracy, and instruction-following capabilities.** Training and validation loss decreased but remained high; see `loss_plot_*.png` for a visualization.
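The checkpoints were produced by the repository's own training script, which is not reproduced here. As a hedged illustration of the mechanics named above (AdamW, float16 AMP with a gradient scaler, a cosine LR schedule with warmup, and periodic checkpointing), here is a toy-scale sketch; the stand-in model, random data, and hyperparameter values are placeholders, not the settings used for the released checkpoints.

```python
# Toy-scale sketch of the training mechanics described above: AdamW, float16 AMP
# with a GradScaler, cosine LR schedule with warmup, and periodic checkpoints.
# The stand-in model, random data, and hyperparameters are placeholders only.
import math
import torch
import torch.nn as nn

def cosine_with_warmup(step: int, warmup_steps: int = 50, total_steps: int = 800) -> float:
    # Linear warmup, then cosine decay toward zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
amp_dtype = torch.float16 if use_cuda else torch.bfloat16

vocab_size, hidden = 128, 64  # toy sizes; the real model uses 128,000 and 768
model = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size)).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, cosine_with_warmup)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

for step in range(1, 801):
    # Random next-token batch; the real run streams tokenized WikiText-2 instead.
    inputs = torch.randint(0, vocab_size, (8, 32), device=device)
    targets = torch.randint(0, vocab_size, (8, 32), device=device)
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, dtype=amp_dtype, enabled=use_cuda):
        logits = model(inputs)
        loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    if step % 200 == 0:  # periodic checkpoints, analogous to step_600.pt / step_800.pt
        torch.save({"step": step, "model": model.state_dict()}, f"step_{step}.pt")
```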
**Intended Use:**
* Educational purposes: Studying Transformer architecture implementation and training basics.
* Experimentation: Serving as a base for further training or architectural modifications.
* **Not suitable for production or reliable text generation.**

## How to Use

**Important:** This model requires the custom Python code (`model_architecture.py`) from this repository; it cannot be loaded directly with `AutoModelForCausalLM`.

1. **Clone the repository:**
   ```bash
   git clone https://huggingface.co/DrNerd/LLAMA-3-From-Scartch
   cd LLAMA-3-From-Scartch
   # Ensure LFS files are downloaded (if needed)
   # git lfs pull
   ```
2. **Install dependencies:**
   ```bash
   pip install torch transformers datasets matplotlib tqdm  # add others if needed
   ```
3. **Run inference (using `inference.py`):**
   The `inference.py` script loads a checkpoint (it defaults to `step_1200.pt`, falling back to the latest available checkpoint; **edit the script to point at `step_800.pt` or `step_600.pt`**) and runs generation.
   ```bash
   # Make sure step_600.pt or step_800.pt exists in the directory
   # Edit inference.py to point to the desired checkpoint file
   python inference.py
   ```
   *Alternatively, adapt the loading logic from `inference.py` into your own script; a rough sketch of that flow follows below.*
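For reference, here is a rough sketch of what such an adaptation might look like. The `Model` class name, its constructor arguments, and the checkpoint dictionary keys are assumptions made for illustration; use the actual names from `model_architecture.py` and `inference.py`.

```python
# Rough sketch of the loading/generation flow. The model class name, constructor
# arguments, and checkpoint keys below are assumptions -- check inference.py and
# model_architecture.py for the names actually used in this repository.
import torch
from transformers import AutoTokenizer
from model_architecture import Model  # hypothetical class name

device = "cuda" if torch.cuda.is_available() else "cpu"

# The DeepSeek-R1 tokenizer requires trust_remote_code=True, as noted above.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1", trust_remote_code=True)

# Rebuild the model with the hyperparameters listed in this README (assumed kwargs).
model = Model(vocab_size=128000, hidden_size=768, num_layers=12, num_heads=12,
              intermediate_size=2048, max_position_embeddings=4096).to(device)

# Load one of the provided checkpoints (adjust the file name and keys as needed).
checkpoint = torch.load("step_800.pt", map_location=device)
model.load_state_dict(checkpoint["model"] if "model" in checkpoint else checkpoint)
model.eval()

# Greedy generation loop; the model is assumed to return logits of shape
# [batch, seq_len, vocab_size] when called on a tensor of token IDs.
ids = tokenizer("The history of", return_tensors="pt").input_ids.to(device)
with torch.no_grad():
    for _ in range(50):
        logits = model(ids)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```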