gitesh-grover committed
Commit 960a17b · verified · 1 Parent(s): 8b4bb02

Upload 6 files

Files changed (6)
  1. README.md +226 -7
  2. app.py +61 -0
  3. config.py +40 -0
  4. model.py +306 -0
  5. requirements.txt +8 -0
  6. utils.py +26 -0
README.md CHANGED
@@ -1,13 +1,232 @@
  ---
- title: SmolLM2 135m
- emoji: 🌖
- colorFrom: yellow
- colorTo: gray
  sdk: gradio
- sdk_version: 5.15.0
  app_file: app.py
  pinned: false
- short_description: Demo SmolLM2-135m model trained for only 5k steps
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: SmolLM2 135M Text Generation Demo
+ emoji: 📚
+ colorFrom: blue
+ colorTo: red
  sdk: gradio
+ sdk_version: 3.50.2
  app_file: app.py
  pinned: false
  ---

+ # SmolLM2 Text Generation Demo
+
+ This is a simple text generation demo using the SmolLM2 language model with a Gradio interface.
+
+ ## Description
+
+ This application provides a web interface for text generation using the SmolLM2 language model. Users can enter a prompt and adjust the generation parameters to control the output.
+
+ ## Features
+
+ - Interactive web interface built with Gradio
+ - Adjustable generation parameters:
+   - Maximum new tokens (1-150)
+   - Temperature (0.1-2.0)
+   - Top-K sampling (1-100)
+ - Real-time text generation
+
+ ## Usage
+
+ 1. Enter your prompt in the text input field
+ 2. Adjust the generation parameters (optional):
+    - **Max New Tokens**: controls the length of the generated text
+    - **Temperature**: controls randomness (higher = more creative, lower = more focused)
+    - **Top-K**: controls the diversity of word choices
+ 3. Click Submit to generate text (see the programmatic sketch below for the equivalent Python calls)
+
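+ If you prefer to skip the UI, the same generation path used by `app.py` can be called directly. A minimal sketch, assuming the trained checkpoint `checkpoints/model_final.pt` is present locally:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer
+
+ from config import Config
+ from model import SmolLM2
+ from utils import get_device
+
+ config = Config()
+ device = get_device(config.seed)
+
+ # Load the trained weights on CPU first, then move to the selected device
+ model = SmolLM2(config)
+ model.load_state_dict(torch.load(config.checkpoints_path + "/model_final.pt", map_location="cpu"))
+ model.to(device)
+ model.eval()
+
+ tokenizer = AutoTokenizer.from_pretrained(config.tokenizer_name_or_path)
+ input_ids = tokenizer.encode("What is Gravity?", return_tensors="pt").to(device)
+
+ with torch.no_grad():
+     output_ids = model.generate(input_ids, max_new_tokens=30, temperature=0.8, top_k=50)
+ print(tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True))
+ ```
+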
+ ## Installation
+
+ 1. Clone the repository
+ 2. Install the dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ## Run the application
+ ```bash
+ python app.py
+ ```
+ The interface will be available at `http://localhost:7860`
+
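+ Once the app is running, it can also be queried from Python with the `gradio_client` package (not listed in `requirements.txt`; install it separately). A minimal sketch, assuming the default `/predict` endpoint that `gr.Interface` exposes:
+
+ ```python
+ from gradio_client import Client
+
+ client = Client("http://localhost:7860")
+ # Arguments follow the Interface inputs: prompt, max new tokens, temperature, top-k
+ result = client.predict("What is Gravity?", 30, 0.8, 50, api_name="/predict")
+ print(result)
+ ```
+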
+ ## Train the model
+ ```bash
+ python train.py
+ ```
+
+ # Model details
+ SmolLM2 is a 135M-parameter, Llama-style language model; the checkpoint served here was trained for only 5k steps as a demo. The model uses the HuggingFaceTB/cosmo2-tokenizer tokenizer, loaded via Hugging Face's transformers library (see `config.py`).
+
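+ For reference, the tokenizer can be inspected on its own; the model name below comes from `tokenizer_name_or_path` in `config.py`:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/cosmo2-tokenizer")
+ print(tokenizer.vocab_size)               # should match Config.vocab_size = 49152
+ print(tokenizer.encode("What is Gravity?"))
+ ```
+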
+ ## Llama 2 Architecture
+
+ ![Llama 2 Architecture](./static/llamaModel.jpg)
+ Read https://pub.towardsai.net/llama-explained-a70e71e706e9 for more details.
+
+ # Compare custom SmolLM2-135M with HuggingFaceTB/SmolLM2-135M
+ HuggingFaceTB/SmolLM2-135M:
+ ```bash
+ LlamaForCausalLM(
+   (model): LlamaModel(
+     (embed_tokens): Embedding(49152, 576)
+     (layers): ModuleList(
+       (0-29): 30 x LlamaDecoderLayer(
+         (self_attn): LlamaAttention(
+           (q_proj): Linear(in_features=576, out_features=576, bias=False)
+           (k_proj): Linear(in_features=576, out_features=192, bias=False)
+           (v_proj): Linear(in_features=576, out_features=192, bias=False)
+           (o_proj): Linear(in_features=576, out_features=576, bias=False)
+         )
+         (mlp): LlamaMLP(
+           (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
+           (up_proj): Linear(in_features=576, out_features=1536, bias=False)
+           (down_proj): Linear(in_features=1536, out_features=576, bias=False)
+           (act_fn): SiLU()
+         )
+         (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
+         (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
+       )
+     )
+     (norm): LlamaRMSNorm((576,), eps=1e-05)
+     (rotary_emb): LlamaRotaryEmbedding()
+   )
+   (lm_head): Linear(in_features=576, out_features=49152, bias=False)
+ )
+ ```
+
+ Custom SmolLM2-135M:
+ ```bash
+ SmolLM2(
+   (embedding): Embedding(49152, 576)
+   (layers): ModuleList(
+     (0-29): 30 x LlamaBlock(
+       (attention): LlamaAttention(
+         (q_proj): Linear(in_features=576, out_features=576, bias=False)
+         (k_proj): Linear(in_features=576, out_features=192, bias=False)
+         (v_proj): Linear(in_features=576, out_features=192, bias=False)
+         (o_proj): Linear(in_features=576, out_features=576, bias=False)
+       )
+       (feed_forward): LlamaFFN(
+         (gate): Linear(in_features=576, out_features=1536, bias=False)
+         (up): Linear(in_features=576, out_features=1536, bias=False)
+         (down): Linear(in_features=1536, out_features=576, bias=False)
+         (act_fn): SiLU()
+       )
+       (attention_norm): RMSNorm((576,), eps=1e-05, elementwise_affine=True)
+       (ffn_norm): RMSNorm((576,), eps=1e-05, elementwise_affine=True)
+     )
+   )
+   (norm): RMSNorm((576,), eps=1e-05, elementwise_affine=True)
+   (lm_head): Linear(in_features=576, out_features=49152, bias=False)
+ )
+ ```
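+
+ Both printouts can be reproduced with a few lines (the reference model download assumes network access to the Hugging Face Hub):
+
+ ```python
+ from transformers import AutoModelForCausalLM
+
+ from config import Config
+ from model import SmolLM2
+
+ print(AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M"))  # reference model
+ print(SmolLM2(Config()))                                                   # custom model
+ ```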
+
+ # Training Logs
+ ## Training with 5000 steps (without checkpoint)
+ ```bash
+ (venv) gitesh.grover@Giteshs-MacBook-Pro ai-era-assignment13 % python train.py
+
+ Resolving data files: 100%|██████████| 104/104 [00:00<00:00, 720.56it/s]
+ Resolving data files: 100%|██████████| 104/104 [00:00<00:00, 562123.22it/s]
+ Epoch: 0, Step: 0, Batch: 0, Loss: 10.9101, Time: 1.44s, Token/s: 2842.75
+ Saved checkpoint at step 0
+ What is Gravity? thymopenedi something aneur checklist fertiliserlete hiding Watching [[GuardinnamonGuard thym thym something multilinguali runway astronlighten runwayinnamon nastylighten disadvant snout plumquest
+ Epoch: 0, Step: 1, Batch: 1, Loss: 10.6729, Time: 2.00s, Token/s: 2044.98
+ Epoch: 0, Step: 2, Batch: 2, Loss: 9.2034, Time: 1.16s, Token/s: 3517.56
+ Epoch: 0, Step: 3, Batch: 3, Loss: 8.5723, Time: 1.09s, Token/s: 3766.14
+ Epoch: 0, Step: 4, Batch: 4, Loss: 8.1478, Time: 1.07s, Token/s: 3845.85
+ :
+ :
+ Epoch: 0, Step: 500, Batch: 500, Loss: 5.9723, Time: 1.07s, Token/s: 3825.45
+ Saved checkpoint at step 500
+ What is Gravity? We call us to use, I can create a `e` function to do to add a few to calculate their lives.
+ * An the need
+ Epoch: 0, Step: 501, Batch: 501, Loss: 6.0491, Time: 1.58s, Token/s: 2595.98
+ :
+ :
+ Epoch: 0, Step: 998, Batch: 998, Loss: 5.8647, Time: 1.25s, Token/s: 3289.61
+ Epoch: 0, Step: 999, Batch: 999, Loss: 6.0096, Time: 1.10s, Token/s: 3726.16
+ Epoch: 0, Step: 1000, Batch: 1000, Loss: 6.4388, Time: 1.09s, Token/s: 3763.74
+ Saved checkpoint at step 1000
+ What is Gravity? These tales of sharing a beautiful blend of the art, where will understand these questions where remain.
+
+ III. **4.g., the Individuals
+ :
+ :
+ Epoch: 0, Step: 1498, Batch: 1498, Loss: 7.3296, Time: 1.06s, Token/s: 3878.60
+ Epoch: 0, Step: 1499, Batch: 1499, Loss: 6.0611, Time: 1.06s, Token/s: 3864.26
+ Epoch: 0, Step: 1500, Batch: 1500, Loss: 6.1140, Time: 1.08s, Token/s: 3789.80
+ Saved checkpoint at step 1500
+ What is Gravity?
+
+ Now imagine don't forget, "It have been the game?" But there are just as an 'L', does not can he noticed,
+
+ :
+ :
+ :
+ :
+
+ Epoch: 0, Step: 3498, Batch: 3498, Loss: 5.7145, Time: 1.07s, Token/s: 3830.33
+ Epoch: 0, Step: 3499, Batch: 3499, Loss: 5.7578, Time: 1.09s, Token/s: 3767.61
+ Epoch: 0, Step: 3500, Batch: 3500, Loss: 6.0798, Time: 1.07s, Token/s: 3811.98
+ Saved checkpoint at step 3500
+ What is Gravity? Let's how a "P"? You might need to play and a new environment that makes it up a big planet of the whole piece of the information
+ Epoch: 0, Step: 3501, Batch: 3501, Loss: 5.8375, Time: 1.47s, Token/s: 2790.70
+ Epoch: 0, Step: 3502, Batch: 3502, Loss: 6.3435, Time: 1.07s, Token/s: 3838.95
+ Epoch: 0, Step: 3503, Batch: 3503, Loss: 5.8192, Time: 1.05s, Token/s: 3901.14
+
+ :
+ :
+ Epoch: 0, Step: 4496, Batch: 4496, Loss: 5.5488, Time: 1.06s, Token/s: 3862.06
+ Epoch: 0, Step: 4497, Batch: 4497, Loss: 5.8281, Time: 1.07s, Token/s: 3821.71
+ Epoch: 0, Step: 4498, Batch: 4498, Loss: 5.5703, Time: 1.07s, Token/s: 3844.92
+ Epoch: 0, Step: 4499, Batch: 4499, Loss: 6.0630, Time: 1.06s, Token/s: 3854.04
+ Epoch: 0, Step: 4500, Batch: 4500, Loss: 5.5889, Time: 1.06s, Token/s: 3860.19
+ Saved checkpoint at step 4500
+ What is Gravity?
+
+ V. **Additional 2: Prepare a Power
+
+ * **I and the Eaught of Life
+
+ Before our exploration, understanding
+ :
+ :
+ Epoch: 0, Step: 4996, Batch: 4996, Loss: 6.1501, Time: 1.06s, Token/s: 3865.19
+ Epoch: 0, Step: 4997, Batch: 4997, Loss: 5.9107, Time: 1.05s, Token/s: 3884.67
+ Epoch: 0, Step: 4998, Batch: 4998, Loss: 5.7005, Time: 1.07s, Token/s: 3834.26
+ Epoch: 0, Step: 4999, Batch: 4999, Loss: 5.8820, Time: 1.07s, Token/s: 3814.07
+ Saved final checkpoint
+ What is Gravity? You would be a better big way, there are people have just like!
+
+ As they saw out to the world in the world or making a
+ Training complete
+ ```
+
+ ## Training with Additional 50 steps (with checkpoint)
+ ```bash
+ Loading checkpoint from checkpoints/checkpoint_final.pt
+ Resuming from epoch 0 at step 5000 with loss 5.881985664367676
+ Resolving data files: 100%|██████████| 104/104 [00:00<00:00, 313.79it/s]
+ Resolving data files: 100%|██████████| 104/104 [00:00<00:00, 462574.35it/s]
+ Epoch: 0, Step: 5000, Batch: 0, Loss: 5.6473, Time: 2.69s, Token/s: 1520.90
+ Saved checkpoint at step 5000
+ What is Gravity? Well, remember, there's where those who do something as part of art and animals, family around us. For instance, there's like! But
+ Epoch: 0, Step: 5001, Batch: 1, Loss: 6.1124, Time: 1.54s, Token/s: 2660.36
+ Epoch: 0, Step: 5002, Batch: 2, Loss: 5.8381, Time: 1.11s, Token/s: 3680.22
+ :
+ :
+ Epoch: 0, Step: 5044, Batch: 44, Loss: 6.1118, Time: 1.09s, Token/s: 3749.53
+ Epoch: 0, Step: 5045, Batch: 45, Loss: 5.8618, Time: 1.11s, Token/s: 3676.88
+ Epoch: 0, Step: 5046, Batch: 46, Loss: 5.8893, Time: 1.08s, Token/s: 3784.70
+ Epoch: 0, Step: 5047, Batch: 47, Loss: 5.7507, Time: 1.10s, Token/s: 3729.83
+ Epoch: 0, Step: 5048, Batch: 48, Loss: 5.6882, Time: 1.10s, Token/s: 3715.38
+ Epoch: 0, Step: 5049, Batch: 49, Loss: 5.7396, Time: 1.09s, Token/s: 3745.38
+ Saved final checkpoint
+ What is Gravity? Have you would be wondering what life, you don't just how to do? She needed, they have had to know that "but these things has
+ Training complete
+ ```
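+
+ `train.py` itself is not part of this upload, so the exact checkpoint format is an assumption; the save/resume pattern implied by the logs above would look roughly like this (hypothetical keys `model`, `step`, `epoch`, `loss`):
+
+ ```python
+ import torch
+
+ # Saving (inside the training loop)
+ torch.save({"model": model.state_dict(), "step": step, "epoch": epoch, "loss": loss.item()},
+            f"checkpoints/checkpoint_{step}.pt")
+
+ # Resuming
+ ckpt = torch.load("checkpoints/checkpoint_final.pt", map_location="cpu")
+ model.load_state_dict(ckpt["model"])
+ start_step = ckpt["step"] + 1
+ print(f"Resuming from epoch {ckpt['epoch']} at step {start_step} with loss {ckpt['loss']}")
+ ```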
app.py ADDED
@@ -0,0 +1,61 @@
+ import gradio as gr
+ import torch
+ from model import SmolLM2
+ from transformers import AutoTokenizer
+ from config import Config
+ from utils import get_device
+
+ # Initialize model and tokenizer
+ config = Config()
+ device = get_device(config.seed)
+ print("device: ", device)
+
+ def load_model():
+     model = SmolLM2(config)
+     # Load model weights to CPU first
+     model.load_state_dict(torch.load(config.checkpoints_path + "/model_final.pt", map_location=torch.device("cpu")))
+     model.to(device)
+     model.eval()
+
+     tokenizer = AutoTokenizer.from_pretrained(config.tokenizer_name_or_path)
+     return model, tokenizer
+
+ model, tokenizer = load_model()  # Load the model and tokenizer once at startup
+
+ def generate_text(input_text, max_new_tokens=100, temperature=0.8, top_k=50):
+     """
+     Generate text based on the input prompt
+     """
+     # Tokenize input
+     input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
+
+     # Generate
+     with torch.no_grad():
+         output_ids = model.generate(
+             input_ids=input_ids,
+             max_new_tokens=max_new_tokens,
+             temperature=temperature,
+             top_k=top_k
+         )
+
+     # Move output back to CPU before decoding
+     output_ids = output_ids.cpu()
+     generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
+     return generated_text
+
+ # Create Gradio interface
+ demo = gr.Interface(
+     fn=generate_text,
+     inputs=[
+         gr.Textbox(label="Input Text", placeholder="Enter your prompt here..."),
+         gr.Slider(minimum=1, maximum=150, value=30, step=1, label="Max New Tokens"),
+         gr.Slider(minimum=0.1, maximum=2.0, value=0.8, step=0.1, label="Temperature"),
+         gr.Slider(minimum=1, maximum=100, value=50, step=1, label="Top-K"),
+     ],
+     outputs=gr.Textbox(label="Generated Text"),
+     title="SmolLM2 Text Generation",
+     description="Enter a prompt and the model will generate text based on it.",
+ )
+
+ if __name__ == "__main__":
+     demo.launch()
config.py ADDED
@@ -0,0 +1,40 @@
+ from dataclasses import dataclass
+
+ @dataclass
+ class Config:
+     seed: int = 49
+     vocab_size: int = 49152  # it should match the vocab size of the tokenizer
+     num_hidden_layers: int = 30  # number of layers
+     num_attention_heads: int = 9  # number of heads
+     num_key_value_heads: int = 3  # number of key and value heads
+     nn_embed: int = 576  # embedding dimension or hidden_size
+     max_sequence_len: int = 2048  # max token sequence length (for pos embedding) # Block size
+     ffn_intermediate_size: int = 1536
+     rms_norm_eps: float = 1.0e-05
+     nn_top_k: int = 50  # top k for the model
+     nn_temperature: float = 1.0  # temperature for the model
+     tokenizer_name_or_path: str = "HuggingFaceTB/cosmo2-tokenizer"
+     checkpoint_interval: int = 1000
+     checkpoints_path = "checkpoints"
+     # init_method_std: 0.041666666666666664
+     nn_train_tok_seq: int = 65  # Actual training token sequence block size: 64 + 1, as we are shifting the targets by 1
+     # nn_mlp_expansion: int = 4  # Expansion in the MLP layer
+     batch_size: int = 64
+     # train_tok_size: int = 32
+     # saved_model_path = 'data/model_tf.pth'
+     # train_input_file = 'data/input.txt'
+     optimizer_learning_rate_scheduler_learning_rate: float = 0.003
+     optimizer_learning_rate_scheduler_lr_decay_starting_step: int = 1600000
+     optimizer_learning_rate_scheduler_lr_decay_steps: int = 400000
+     optimizer_learning_rate_scheduler_lr_decay_style: str = "linear"
+     optimizer_learning_rate_scheduler_lr_warmup_steps: int = 2000
+     optimizer_learning_rate_scheduler_lr_warmup_style: str = "linear"
+     optimizer_learning_rate_scheduler_min_decay_lr: float = 0
+     optimizer_factory_adam_beta1: float = 0.9
+     optimizer_factory_adam_beta2: float = 0.95
+     optimizer_factory_adam_eps: float = 1.0e-08
+     optimizer_factory_name: str = "adamW"
+     optimizer_factory_torch_adam_is_fused: bool = True
+     optimizer_weight_decay: float = 0.01
+     optimizer_zero_stage: int = 0
+     optimizer_clip_grad: float = 1.0
model.py ADDED
@@ -0,0 +1,306 @@
+ import torch
+ import torch.nn as nn
+ import math
+ from typing import Optional
+ import torch.nn.functional as F
+
+ # This llama model is based on the paper: https://arxiv.org/pdf/2302.13971.pdf
+ # Model Architecture: static/llamaModel.jpg
+ # It is a transformer model with rotary position embeddings (RoPE) and SwiGLU
+ # activation function. It uses RMSNorm for normalization.
+ # Other good reads: https://pub.towardsai.net/llama-explained-a70e71e706e9
+
+ def precompute_rotary_emb(dim: int, max_seq_len: int, base: int = 10000) -> tuple[torch.Tensor, torch.Tensor]:
+     """
+     Precompute the rotary position embeddings
+     Args:
+         dim: Dimension of the embeddings
+         max_seq_len: Maximum sequence length
+         base: Base for the angle calculations
+     Returns:
+         Tuple of (sin, cos) tensors of shape (max_seq_len, dim//2)
+     """
+     # Create position indices tensor
+     position = torch.arange(max_seq_len).unsqueeze(1)  # (seq_len, 1)
+     # Create dimension indices tensor
+     div_term = torch.exp(torch.arange(0, dim, 2) * (-math.log(base) / dim))  # (dim//2)
+     # Compute angles
+     angles = position * div_term  # (seq_len, dim//2)
+     # Return sin and cos
+     return torch.sin(angles), torch.cos(angles)
+
+ def apply_rotary_emb(x: torch.Tensor, sin: torch.Tensor, cos: torch.Tensor) -> torch.Tensor:
+     """
+     Apply rotary position embeddings to the input tensor
+     Args:
+         x: Input tensor of shape (batch_size, seq_len, num_heads, head_dim)
+         sin: Sine tensor of shape (seq_len, head_dim//2)
+         cos: Cosine tensor of shape (seq_len, head_dim//2)
+     Returns:
+         Tensor with rotary position embeddings applied
+     """
+     # Reshape x to split last dimension in half
+     x_reshape = x.float().reshape(*x.shape[:-1], -1, 2)
+     # Extract even and odd dimensions
+     x1, x2 = x_reshape[..., 0], x_reshape[..., 1]
+
+     # Reshape sin and cos for broadcasting
+     sin = sin.view(1, sin.shape[0], 1, sin.shape[1])  # (1, seq_len, 1, dim//2)
+     cos = cos.view(1, cos.shape[0], 1, cos.shape[1])  # (1, seq_len, 1, dim//2)
+
+     # Apply rotation using the rotation matrix multiplication
+     result = torch.stack([
+         x1 * cos - x2 * sin,
+         x2 * cos + x1 * sin
+     ], dim=-1)
+
+     return result.flatten(-2)  # Flatten last 2 dimensions
+
+ class LlamaAttention(nn.Module):
+     def __init__(self, dim: int, num_heads: int, num_kv_heads: Optional[int] = None, max_position_embeddings=2048):
+         super().__init__()
+         self.num_heads = num_heads
+         self.num_kv_heads = num_kv_heads if num_kv_heads is not None else num_heads
+         self.num_queries_per_kv = self.num_heads // self.num_kv_heads
+         self.head_dim = dim // num_heads
+         self.scale = 1.0 / math.sqrt(self.head_dim)
+
+         # self.q_proj = nn.Linear(dim, dim, bias=False)
+         # self.k_proj = nn.Linear(dim, dim, bias=False)
+         # self.v_proj = nn.Linear(dim, dim, bias=False)
+         # Adjust projections for GQA
+         self.q_proj = nn.Linear(dim, num_heads * self.head_dim, bias=False)  # (B, T, D) -> (B, T, D) or (B, T, H * D/H)
+         self.k_proj = nn.Linear(dim, self.num_kv_heads * self.head_dim, bias=False)  # (B, T, D) -> (B, T, H_kv * D/H)
+         self.v_proj = nn.Linear(dim, self.num_kv_heads * self.head_dim, bias=False)  # (B, T, D) -> (B, T, H_kv * D/H)
+         self.o_proj = nn.Linear(dim, dim, bias=False)
+         # self.o_proj.NANGPT_SCALE_INIT = 1  # TODO: do we need weight initialization scaling?
+
+         # Cache attributes
+         self.k_cache = None
+         self.v_cache = None
+         self.cache_seq_len = 0
+
+         # Precompute sin and cos for all positions
+         self.sin, self.cos = precompute_rotary_emb(self.head_dim, max_position_embeddings)
+
+     def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None, use_cache: bool = False):
+         batch_size, seq_len, _ = x.shape
+
+         # Project inputs
+         q = self.q_proj(x).view(batch_size, seq_len, self.num_heads, self.head_dim)
+         k = self.k_proj(x).view(batch_size, seq_len, self.num_kv_heads, self.head_dim)
+         v = self.v_proj(x).view(batch_size, seq_len, self.num_kv_heads, self.head_dim)
+
+         # Get rotary embeddings for the new tokens
+         # sin = self.sin[self.cache_seq_len:self.cache_seq_len + seq_len].to(x.device)
+         # cos = self.cos[self.cache_seq_len:self.cache_seq_len + seq_len].to(x.device)
+         sin = self.sin[:seq_len].to(x.device)
+         cos = self.cos[:seq_len].to(x.device)
+
+         # Apply rotary embeddings
+         q = apply_rotary_emb(q, sin, cos)
+         k = apply_rotary_emb(k, sin, cos)
+
+         # Handle KV caching
+         # if use_cache:
+         #     if self.k_cache is None:
+         #         # Initialize cache if empty
+         #         self.k_cache = k
+         #         self.v_cache = v
+         #     else:
+         #         # Concatenate new KV with cached KV
+         #         self.k_cache = torch.cat([self.k_cache, k], dim=1)
+         #         self.v_cache = torch.cat([self.v_cache, v], dim=1)
+         #
+         #     # Use concatenated KV pairs
+         #     k = self.k_cache
+         #     v = self.v_cache
+         #
+         #     # Update cache sequence length
+         #     self.cache_seq_len += seq_len
+
+         # Reshape for attention computation
+         q = q.transpose(1, 2)
+         k = k.transpose(1, 2)
+         v = v.transpose(1, 2)
+
+         # Handle GQA (Grouped Query Attention)
+         if self.num_queries_per_kv > 1:
+             k = k.unsqueeze(2).expand(-1, -1, self.num_queries_per_kv, -1, -1)
+             v = v.unsqueeze(2).expand(-1, -1, self.num_queries_per_kv, -1, -1)
+             k = k.reshape(batch_size, self.num_heads, -1, self.head_dim)
+             v = v.reshape(batch_size, self.num_heads, -1, self.head_dim)
+
+         # Compute attention
+         scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale
+
+         if mask is not None:
+             scores = scores.masked_fill(mask == 0, float('-inf'))
+
+         attn = F.softmax(scores, dim=-1)
+         out = torch.matmul(attn, v)
+         out = out.transpose(1, 2).contiguous().view(batch_size, seq_len, -1)
+
+         # Speed up - Flash Attention (computation happens in GPU SRAM, not GPU RAM). TODO: not sure how to apply this with grouped query attention
+         # out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
+
+         return self.o_proj(out)
+
+     def clear_cache(self):
+         self.k_cache = None
+         self.v_cache = None
+         self.cache_seq_len = 0
+
+ class LlamaFFN(nn.Module):
+     def __init__(self, dim: int, hidden_dim: int):
+         super().__init__()
+         self.gate = nn.Linear(dim, hidden_dim, bias=False)
+         self.up = nn.Linear(dim, hidden_dim, bias=False)
+         self.down = nn.Linear(hidden_dim, dim, bias=False)
+         # self.down.NANGPT_SCALE_INIT = 1  # TODO: do we need weight initialization scaling - Optimization?
+         self.act_fn = nn.SiLU()  # SwiGLU activation function
+
+     def forward(self, x):
+         return self.down(self.act_fn(self.gate(x)) * self.up(x))
+
+ class LlamaBlock(nn.Module):
+     def __init__(self, config):
+         # nn_embed or dim is the dimension of the input to the block
+         super().__init__()
+         self.attention = LlamaAttention(
+             config.nn_embed,
+             config.num_attention_heads,
+             config.num_key_value_heads,
+             config.max_sequence_len
+         )
+         self.feed_forward = LlamaFFN(config.nn_embed, config.ffn_intermediate_size)
+         self.attention_norm = nn.RMSNorm(config.nn_embed, eps=config.rms_norm_eps)
+         self.ffn_norm = nn.RMSNorm(config.nn_embed, eps=config.rms_norm_eps)
+
+     def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None, use_cache: bool = False):
+         x = x + self.attention(self.attention_norm(x), mask, use_cache)
+         x = x + self.feed_forward(self.ffn_norm(x))
+         return x
+
+ class SmolLM2(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         # Normal Embedding (position embedding will be part of the Attention layer)
+         self.embedding = nn.Embedding(config.vocab_size, config.nn_embed)
+
+         # total num_hidden_layers blocks (each block has an attention and a feed-forward layer)
+         self.layers = nn.ModuleList([
+             LlamaBlock(config) for _ in range(config.num_hidden_layers)
+         ])
+         self.norm = nn.RMSNorm(config.nn_embed, eps=config.rms_norm_eps)
+         # final layer returning the logits of size (batch_size, vocab_size)
+         self.lm_head = nn.Linear(config.nn_embed, config.vocab_size, bias=False)
+
+         # Optimization: weight sharing between lm_head and embedding
+         self.lm_head.weight = self.embedding.weight
+
+         # Initialize weights
+         self.apply(self._init_weights)
+
+     def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None, use_cache: bool = False, targets: Optional[torch.Tensor] = None):
+         if mask is None:
+             mask = self.create_causal_mask(x.shape[1], device=x.device)
+         x = self.embedding(x)
+         for layer in self.layers:
+             x = layer(x, mask, use_cache)
+         x = self.norm(x)
+         logits = self.lm_head(x)
+         if targets is not None:
+             loss = F.cross_entropy(logits.view(-1, logits.shape[-1]), targets.view(-1))
+             return logits, loss
+         return logits
+
+     # Linear layers (attention projections, FFN layers, lm_head) are initialized from N(0, 0.02)
+     # Embedding layer is initialized from N(0, 0.02)
+     # All RMSNorm weights are initialized to 1.0
+     def _init_weights(self, module):
+         if isinstance(module, nn.Linear):
+             std = 0.02
+             if hasattr(module, 'NANGPT_SCALE_INIT'):
+                 std *= (2 * len(self.layers)) ** -0.5
+             torch.nn.init.normal_(module.weight, mean=0.0, std=std)
+             if module.bias is not None:
+                 torch.nn.init.zeros_(module.bias)
+         elif isinstance(module, nn.RMSNorm):
+             torch.nn.init.ones_(module.weight)
+         elif isinstance(module, nn.Embedding):
+             torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
+
+     def clear_cache(self):
+         """Clear KV cache in all attention layers"""
+         for layer in self.layers:
+             layer.attention.clear_cache()
+
+     def create_causal_mask(self, seq_len, device):
+         """Creates a causal attention mask where each position can only attend to previous positions"""
+         # Create lower triangular matrix (including diagonal)
+         # mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
+         # mask = torch.triu(torch.ones(1, 1, seq_len, seq_len), diagonal=1).bool()
+         # # Invert and convert to float
+         # return (~mask).float()
+         return torch.tril(torch.ones(seq_len, seq_len)).view(1, 1, seq_len, seq_len).to(device)
+
+     @torch.no_grad()
+     def generate(self, input_ids: torch.Tensor, max_new_tokens: int = 20,
+                  temperature: float = 1.0, top_k: int = 50) -> torch.Tensor:
+         """
+         Generate text using the model
+         Args:
+             input_ids: Starting token ids (B, T)
+             max_new_tokens: Number of tokens to generate
+             temperature: Controls randomness (1.0 = neutral, <1.0 = more deterministic, >1.0 = more random)
+             top_k: Number of highest probability tokens to consider for sampling
+         Returns:
+             Generated token ids (B, T+max_new_tokens)
+         """
+         batch_size, seq_len = input_ids.shape
+
+         # Clear any existing KV cache
+         self.clear_cache()
+
+         # Create a new tensor to store the generated tokens
+         input_ids = torch.cat([input_ids, torch.zeros((batch_size, max_new_tokens),
+                                dtype=torch.long, device=input_ids.device)], dim=1)
+
+         # Generate tokens one at a time
+         for idx in range(max_new_tokens):
+             # print(f"Generating token {idx+1} of {max_new_tokens}")
+
+             # Get the current sequence length including cached tokens
+             current_seq_len = seq_len + idx
+
+             next_mask = self.create_causal_mask(current_seq_len, device=input_ids.device)
+
+             # Create mask that includes both the current input and cached tokens
+             # if idx == 0:
+             #     # First iteration - create mask for the full input sequence
+             #     next_mask = self.create_causal_mask(current_seq_len, device=input_ids.device)
+             # else:
+             #     # Subsequent iterations - create mask for the new token attending to all previous tokens
+             #     next_mask = torch.ones((1, 1, 1, current_seq_len), device=input_ids.device)
+
+             # Process including the new tokens
+             logits = self(input_ids[:, :current_seq_len], next_mask, use_cache=False)
+
+             # Get the last token's logits
+             next_token_logits = logits[:, -1, :] / temperature
+
+             # Apply top-k filtering
+             top_k_logits, top_k_indices = torch.topk(next_token_logits, top_k, dim=-1)
+             probs = F.softmax(top_k_logits, dim=-1)
+
+             # Sample from the filtered distribution
+             next_token = top_k_indices[
+                 torch.arange(batch_size, device=input_ids.device),
+                 torch.multinomial(probs, num_samples=1).squeeze(1)
+             ]
+
+             # Update input_ids with the new token
+             input_ids[:, current_seq_len] = next_token
+
+         return input_ids
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ torch>=2.0.0
+ transformers>=4.30.0
+ datasets>=2.12.0
+ numpy>=1.24.0
+ tqdm>=4.65.0
+ huggingface-hub>=0.16.0
+ tokenizers>=0.13.0
+ gradio>=4.0.0
utils.py ADDED
@@ -0,0 +1,26 @@
+ import torch
+
+ def get_device(seed = 1):
+     # Seed is to generate the same random data for each run
+     # For reproducibility
+     torch.manual_seed(seed)
+
+     # Set device
+     device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
+
+     if torch.cuda.is_available():
+         print(f"[INFO] GPU: {torch.cuda.get_device_name(0)}")
+         print(f"[INFO] CUDA Version: {torch.version.cuda}\n")
+         torch.cuda.manual_seed(seed)
+
+     if not torch.backends.mps.is_available():
+         if not torch.backends.mps.is_built():
+             print("MPS not available because the current PyTorch install was not "
+                   "built with MPS enabled.")
+         else:
+             print("MPS not available because the current MacOS version is not 12.3+ "
+                   "and/or you do not have an MPS-enabled device on this machine.")
+     else:
+         torch.mps.manual_seed(seed)
+
+     return device