Upload 9 files
Committed files:
- README.md +169 -7
- app.py +68 -0
- config.py +15 -0
- data/model_tf.pth +3 -0
- datasets.py +37 -0
- inference.py +37 -0
- model.py +143 -0
- requirements.txt +8 -0
- utils.py +26 -0
README.md
CHANGED
@@ -1,13 +1,175 @@
 ---
-title: Shakespeare Coriolanus
+title: Shakespeare Coriolanus Transformer
-emoji:
+emoji: 📚
-colorFrom:
+colorFrom: blue
-colorTo:
+colorTo: red
 sdk: gradio
-sdk_version:
+sdk_version: 3.50.2
 app_file: app.py
 pinned: false
-short_description: Recreated the Space for Assignment 12
 ---

-
+# Shakespeare Coriolanus Transformer
+This is a test project created to train and evaluate a basic small decoder-only transformer with 124M parameters. The code has modules to both train and test the model, and the trained model can be tried on Hugging Face.
+
+# Steps to Run Locally
+1. Create and activate a virtual environment:
+```bash
+python -m venv venv
+source venv/bin/activate  # On Windows: venv\Scripts\activate
+```
+
+2. Install the requirements and the Hugging Face CLI:
+```bash
+pip install -r requirements.txt
+pip install --upgrade huggingface-hub
+```
+
+3. To train the model:
+```bash
+python train.py
+```
+
+4. To run the app:
+```bash
+python app.py
+```
+The interface will be available at `http://localhost:7860` by default.
+
+# Training Logs
+```
+loaded 338025 tokens
+1 epoch = 41 batches
+BatchSize: 256 || Tokens per batch; 32
+[STEP 2] Initializing model...
+[STEP 3] Printing Model Architecture Summary...
+
+Model Architecture:
+DecoderTransformer(
+  (wte): Embedding(50257, 768)
+  (wpe): Embedding(1024, 768)
+  (blocks): ModuleList(
+    (0-11): 12 x Block(
+      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+      (att): Attention(
+        (w_qkv): Linear(in_features=768, out_features=2304, bias=True)
+        (proj): Linear(in_features=768, out_features=768, bias=True)
+      )
+      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+      (mlp): MLP(
+        (fc): Linear(in_features=768, out_features=3072, bias=True)
+        (gelu): GELU(approximate='tanh')
+        (proj): Linear(in_features=3072, out_features=768, bias=True)
+      )
+    )
+  )
+  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
+)
+
+Total Parameters: 124.44M
+Total Steps 41 (epochs 1 , stepsPerEpoch 41)
+[STEP 4] Starting Training...
+(venv) gitesh.grover@Giteshs-MacBook-Pro ai-era-assignment12 % python train.py
+
+[INFO] Using device: mps
+[STEP 1] Preparing datasets...
+/Users/gitesh.grover/Study/AI-ERA/venv/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
+  warnings.warn(
+loaded 338025 tokens
+1 epoch = 41 batches
+BatchSize: 256 || Tokens per batch; 32
+[STEP 2] Initializing model...
+[STEP 3] Printing Model Architecture Summary...
+
+Model Architecture:
+DecoderTransformer(
+  (wte): Embedding(50257, 768)
+  (wpe): Embedding(1024, 768)
+  (blocks): ModuleList(
+    (0-11): 12 x Block(
+      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+      (att): Attention(
+        (w_qkv): Linear(in_features=768, out_features=2304, bias=True)
+        (proj): Linear(in_features=768, out_features=768, bias=True)
+      )
+      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+      (mlp): MLP(
+        (fc): Linear(in_features=768, out_features=3072, bias=True)
+        (gelu): GELU(approximate='tanh')
+        (proj): Linear(in_features=3072, out_features=768, bias=True)
+      )
+    )
+  )
+  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
+)
+
+Total Parameters: 124.44M
+Total Steps 12300 (epochs 300 , stepsPerEpoch 41)
+[STEP 4] Starting Training...
+Epoch 1, Loss: 11.0051
+Epoch 2, Loss: 6.6564
+Epoch 3, Loss: 6.1045
+Epoch 4, Loss: 5.6797
+Epoch 5, Loss: 5.3227
+Epoch 6, Loss: 4.9817
+Epoch 7, Loss: 4.6557
+Epoch 8, Loss: 4.4270
+Epoch 9, Loss: 4.2327
+Epoch 10, Loss: 3.9861
+Epoch 11, Loss: 3.7526
+Epoch 12, Loss: 3.5475
+Epoch 13, Loss: 3.3379
+Epoch 14, Loss: 3.1133
+Epoch 15, Loss: 2.8888
+Epoch 16, Loss: 2.7211
+Epoch 17, Loss: 2.4558
+Epoch 18, Loss: 2.1982
+Epoch 19, Loss: 1.9944
+Epoch 20, Loss: 1.7707
+Epoch 21, Loss: 1.6288
+Epoch 22, Loss: 1.4231
+Epoch 23, Loss: 1.2248
+Epoch 24, Loss: 1.0180
+Epoch 25, Loss: 0.8970
+Epoch 26, Loss: 0.7644
+Epoch 27, Loss: 0.6474
+Epoch 28, Loss: 0.5318
+Epoch 29, Loss: 0.4483
+Epoch 30, Loss: 0.3601
+Epoch 31, Loss: 0.2932
+Epoch 32, Loss: 0.2754
+Epoch 33, Loss: 0.2155
+Epoch 34, Loss: 0.2092
+Epoch 35, Loss: 0.1893
+Epoch 36, Loss: 0.1753
+Epoch 37, Loss: 0.1671
+
+:
+:
+
+Epoch 203, Loss: 0.1224
+Epoch 204, Loss: 0.1243
+Epoch 205, Loss: 0.1308
+Epoch 206, Loss: 0.1358
+Epoch 207, Loss: 0.1413
+Epoch 208, Loss: 0.1425
+Epoch 209, Loss: 0.1281
+Epoch 210, Loss: 0.1264
+Epoch 211, Loss: 0.1305
+Epoch 212, Loss: 0.1399
+Epoch 213, Loss: 0.1266
+Epoch 214, Loss: 0.1135
+Epoch 215, Loss: 0.1127
+Epoch 216, Loss: 0.1137
+Epoch 217, Loss: 0.1045
+Epoch 218, Loss: 0.1074
+Epoch 219, Loss: 0.1014
+Epoch 220, Loss: 0.0997
+
+Target loss achieved at step 8979. Breaking
+0.09973063319921494
+[STEP 5] Saving Model...
+[STEP 6] Testing by predicting next few tokens
+X Shape before test: torch.Size([256, 32])
+256
+Y Shape after test: torch.Size([256, 30])
+```
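Once the app is running, the Space can also be queried programmatically rather than through the browser UI. Below is a minimal sketch using the `gradio_client` package; the package name, the `pip install gradio_client` step, and the `/predict` endpoint (Gradio's default `api_name` for a single-function `Interface`) are assumptions not stated in this commit.

```python
# Sketch: query the locally running app (assumes `pip install gradio_client`
# and that `python app.py` is already serving on port 7860).
from gradio_client import Client

client = Client("http://localhost:7860")
# "/predict" is Gradio's default api_name for a one-function Interface.
result = client.predict("First Citizen:", api_name="/predict")
print(result)  # the 30-token continuation produced by generate_sequence
```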
app.py
ADDED
@@ -0,0 +1,68 @@
+import gradio as gr
+import torch
+import torch.nn as nn
+import tiktoken
+import torchvision.transforms as transforms
+from model import DecoderTransformer
+from config import Config
+from inference import predict
+from utils import get_device
+
+def generate_sequence(text):
+    config = Config()
+    device = get_device()
+    # Load model
+    model = DecoderTransformer(config)
+    model.load_state_dict(torch.load(config.saved_model_path, weights_only=True))
+    model.to(device)
+    model.eval()
+
+    enc = tiktoken.get_encoding('gpt2')
+    tokens = enc.encode(text)
+    T = len(tokens)
+    input_tensor = torch.tensor(tokens, device=device)
+    input_tensor = input_tensor.view(1, T)
+
+    max_output_len = 30
+    y = predict(input_tensor, model, max_output_len=max_output_len)
+    output_tokens = y[0, :].tolist()
+    return enc.decode(output_tokens)
+
+# # Convert input text to tensor using tokenizer
+# input_tensor = torch.tensor([config.tokenizer.encode(text)], device=config.device)
+
+# Generate sequence
+# with torch.no_grad():
+#     # Initialize start token and empty sequence
+#     current_seq = torch.tensor([[config.start_token]], device=config.device)
+
+#     # Generate tokens one by one
+#     for _ in range(config.max_seq_length):
+#         # Get model predictions
+#         output = model(input_tensor, current_seq)
+#         next_token_logits = output[:, -1, :]
+#         next_token = torch.argmax(next_token_logits, dim=-1)
+
+#         # Add predicted token to sequence
+#         current_seq = torch.cat([current_seq, next_token.unsqueeze(0)], dim=1)
+
+#         # Stop if end token is generated
+#         if next_token.item() == config.end_token:
+#             break
+
+#     # Convert tokens to text
+#     generated_sequence = config.tokenizer.decode(current_seq[0].tolist())
+#     return generated_sequence
+
+# Create Gradio interface
+iface = gr.Interface(
+    fn=generate_sequence,
+    inputs=gr.Textbox(),
+    outputs=gr.Textbox(),
+    title="Text Generation",
+    description="Enter text to generate a continuation",
+    allow_flagging="never"  # string form; the boolean False is deprecated in Gradio 3.x
+)
+
+if __name__ == "__main__":
+    iface.launch()
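Because `generate_sequence` loads the checkpoint and tokenizer itself, it can be smoke-tested without starting the server. A minimal sketch, assuming the LFS checkpoint `data/model_tf.pth` has been pulled; importing `app` builds the `Interface` but does not launch it, since `launch()` is guarded by `__main__`:

```python
# Sketch: call the same function the Gradio UI wraps, skipping iface.launch().
from app import generate_sequence

print(generate_sequence("First Citizen:"))  # 30-token continuation
```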
config.py
ADDED
@@ -0,0 +1,15 @@
+from dataclasses import dataclass
+
+@dataclass
+class Config:
+    vocab_size: int = 50257     # number of tokens: 50,000 BPE merges + 256 byte tokens + 1 <|endoftext|> token
+    nn_layer: int = 12          # number of layers
+    nn_head: int = 12           # number of heads
+    nn_embed: int = 768         # embedding dimension
+    nn_max_tok_seq: int = 1024  # max token sequence length (for positional embedding); the block size
+    nn_train_tok_seq: int = 32  # actual training token sequence length
+    nn_mlp_expansion: int = 4   # expansion factor in the MLP layer
+    batch_size: int = 256
+    train_tok_size: int = 32
+    saved_model_path = 'data/model_tf.pth'
+    train_input_file = 'data/input.txt'
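These defaults line up with the training log above; a quick sketch of the arithmetic (the 338,025 token count is the figure printed by the data loader for the Coriolanus input file):

```python
from config import Config

cfg = Config()
assert cfg.nn_embed % cfg.nn_head == 0                     # required by Attention
print(cfg.nn_embed // cfg.nn_head)                         # 64 dims per attention head
print(cfg.batch_size * cfg.nn_train_tok_seq)               # 8192 tokens consumed per batch
print(338025 // (cfg.batch_size * cfg.nn_train_tok_seq))   # 41 batches = 1 epoch, as logged
```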
data/model_tf.pth
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6fe9fa8e75332d711c50372e863ddfe6cfb4f8fc3b56e8cf2455fb8fb7ca605a
+size 548137112
datasets.py
ADDED
@@ -0,0 +1,37 @@
+import tiktoken
+import torch
+
+class DataLoader:
+    def __init__(self, B, T, inputFile):
+        # Batch size and token sequence length
+        self.B = B
+        self.T = T
+
+        # At init, load tokens from disk and store them in memory
+        # Custom input text
+        with open(inputFile, 'r') as f:
+            text = f.read()
+        # Using the GPT-2 encoding for tokens
+        enc = tiktoken.get_encoding('gpt2')
+        tokens = enc.encode(text)
+        self.tokens = torch.tensor(tokens)
+        self.enc = enc
+        print(f'loaded {len(self.tokens)} tokens')
+        print(f'1 epoch = {len(self.tokens) // (B * T)} batches')
+
+        # state
+        self.current_position = 0
+
+    def next_batch(self):
+        B, T = self.B, self.T
+        # Load B*T + 1 tokens (+1 for the shifted targets)
+        buf = self.tokens[self.current_position : self.current_position + B * T + 1]
+        x = (buf[:-1]).view(B, T)  # inputs: tokens [0, B*T)
+        y = (buf[1:]).view(B, T)   # targets: tokens [1, B*T + 1)
+        # advance the position by B*T tokens in the stream
+        self.current_position += B * T
+        # if loading the next batch would be out of bounds, reset (to keep going)
+        if self.current_position + (B * T + 1) > len(self.tokens):
+            self.current_position = 0
+        return x, y
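A short usage sketch, assuming the Coriolanus text sits at `data/input.txt` as configured: `next_batch` yields inputs and targets offset by one token, and wraps around when the stream is exhausted.

```python
from config import Config
from datasets import DataLoader

cfg = Config()
loader = DataLoader(B=cfg.batch_size, T=cfg.nn_train_tok_seq, inputFile=cfg.train_input_file)
x, y = loader.next_batch()
print(x.shape, y.shape)                # torch.Size([256, 32]) torch.Size([256, 32])
assert (x[0, 1:] == y[0, :-1]).all()   # targets are the inputs shifted left by one token
```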
inference.py
ADDED
@@ -0,0 +1,37 @@
+import torch
+import torch.nn as nn
+from torch.nn import functional as F
+from utils import get_device
+from model import DecoderTransformer
+
+def predict(x, model, max_output_len=30):
+    device = get_device(seed=37)
+
+    input_len = x.size(1)
+    # x has shape (B, Tr); Tr is the running token count, increased by 1 on every loop iteration below
+    while x.size(1) < input_len + max_output_len:
+        # forward the model to get the logits
+        with torch.no_grad():
+            # the model returns (logits, loss); [0] selects the logits
+            logits = model(x)[0]  # (B, Tr, vocab_size)
+            # take the logits at the last position, as that is the next-token prediction
+            logits = logits[:, -1, :]  # (B, vocab_size)
+            # get the probabilities over the predicted vocab
+            probs = F.softmax(logits, dim=-1)
+            # do top-k sampling with k=50 (the Hugging Face pipeline default);
+            # topk_probs and topk_indices both have shape (B, 50)
+            topk_probs, topk_indices = torch.topk(probs, 50, dim=-1)
+            # sample a token from the top-k probabilities
+            # note: multinomial does not demand that the input sums to 1
+            ix = torch.multinomial(topk_probs, 1)  # (B, 1)
+            # gather the corresponding vocabulary indices
+            xcol = torch.gather(topk_indices, -1, ix)  # (B, 1)
+            # append to the sequence, increasing Tr by 1
+            x = torch.cat((x, xcol), dim=1)  # (B, Tr+1)
+
+            # Stop if end token is generated
+            # if xcol == config.end_token:
+            #     break
+
+    return x[:, input_len:]  # (B, max_output_len)
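The same top-k sampling loop that app.py drives can be exercised directly. A minimal sketch, assuming the saved checkpoint exists; it runs on CPU as-is, since `predict` only uses `get_device` for seeding and never moves tensors. The prompt string is an arbitrary example.

```python
import tiktoken
import torch

from config import Config
from inference import predict
from model import DecoderTransformer

config = Config()
model = DecoderTransformer(config)
model.load_state_dict(torch.load(config.saved_model_path, weights_only=True))
model.eval()

enc = tiktoken.get_encoding('gpt2')
x = torch.tensor(enc.encode("Hail, Caius Marcius")).view(1, -1)  # (1, T)
y = predict(x, model, max_output_len=30)  # (1, 30): only the generated tokens
print(enc.decode(y[0].tolist()))
```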
model.py
ADDED
@@ -0,0 +1,143 @@
+import os
+import math
+import time
+import inspect
+from dataclasses import dataclass
+import torch
+import torch.nn as nn
+from torch.nn import functional as F
+
+class Attention(nn.Module):
+
+    def __init__(self, config):
+        super().__init__()
+
+        assert config.nn_embed % config.nn_head == 0
+
+        self.nn_head = config.nn_head
+        self.nn_embed = config.nn_embed
+
+        # K, Q, V linear layer, computed for every token of every batch
+        self.w_qkv = nn.Linear(config.nn_embed, config.nn_embed * 3)  # (X, embed) -> (X, 3*embed)
+
+        # Projection layer to mix the heads back together for every token of every batch
+        self.proj = nn.Linear(config.nn_embed, config.nn_embed)  # (X, embed) -> (X, embed)
+        # Register the causal mask as a non-trainable buffer: a lower-triangular matrix of ones,
+        # shaped (1, 1, max_seq, max_seq) so it broadcasts over the batch and head dimensions
+        self.register_buffer("bias", torch.tril(torch.ones(config.nn_max_tok_seq, config.nn_max_tok_seq)).view(1, 1, config.nn_max_tok_seq, config.nn_max_tok_seq))
+
+    def forward(self, x):
+        B, T, E = x.size()  # batch size, number of tokens, embedding dim (nn_embed)
+        q, k, v = self.w_qkv(x).split(self.nn_embed, dim=2)  # split the last dimension into 3 chunks of size embed
+
+        # divide the last dim of q, k, v into groups (heads), then transpose for the attention calculation
+        q = q.view(B, T, self.nn_head, E // self.nn_head).transpose(1, 2)  # (B, head, T, headEmbedSize)
+        k = k.view(B, T, self.nn_head, E // self.nn_head).transpose(1, 2)  # (B, head, T, headEmbedSize)
+        v = v.view(B, T, self.nn_head, E // self.nn_head).transpose(1, 2)  # (B, head, T, headEmbedSize)
+
+        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))  # Q*K^T / sqrt(headEmbedSize) ... (B, head, T, T)
+        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))  # fill positions (T.a, T.b) with -infinity where T.a < T.b, i.e. future tokens
+        att = F.softmax(att, dim=-1)  # masked -infinity values become 0 after softmax
+        y = att @ v  # (B, head, T, headEmbedSize)
+        # move the heads back next to headEmbedSize and concatenate them to recover embed
+        y = y.transpose(1, 2).contiguous().view(B, T, E)  # (B, T, head, headEmbedSize) -> (B, T, E)
+        # projection layer to mix the last dim that the heads were stacked into
+        y = self.proj(y)  # (B, T, E)
+        return y
+
+# Feed-forward layer
+class MLP(nn.Module):
+
+    def __init__(self, config):
+        super().__init__()
+
+        self.fc = nn.Linear(config.nn_embed, config.nn_embed * config.nn_mlp_expansion)
+        self.gelu = nn.GELU(approximate='tanh')
+        self.proj = nn.Linear(config.nn_embed * config.nn_mlp_expansion, config.nn_embed)
+        self.proj.NANGPT_SCALE_INIT = 1
+
+    def forward(self, x):
+        x = self.fc(x)
+        x = self.gelu(x)
+        x = self.proj(x)
+        return x
+
+class Block(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.ln_1 = nn.LayerNorm(config.nn_embed)
+        self.att = Attention(config)
+        self.ln_2 = nn.LayerNorm(config.nn_embed)
+        self.mlp = MLP(config)
+
+    def forward(self, x):
+        x = x + self.att(self.ln_1(x))
+        x = x + self.mlp(self.ln_2(x))
+        return x
+
+
+class DecoderTransformer(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+
+        self.wte = nn.Embedding(config.vocab_size, config.nn_embed)
+        self.wpe = nn.Embedding(config.nn_max_tok_seq, config.nn_embed)
+        self.blocks = nn.ModuleList([Block(config) for _ in range(config.nn_layer)])
+        self.lm_head = nn.Linear(config.nn_embed, config.vocab_size, bias=False)
+
+        # weight tying: the token embedding and LM head share one parameter matrix
+        self.wte.weight = self.lm_head.weight
+
+        # weight initialization
+        self.apply(self._init_weights)
+
+    def _init_weights(self, module):
+        if isinstance(module, nn.Linear):
+            std = 0.02
+            if hasattr(module, 'NANGPT_SCALE_INIT'):
+                std *= (2 * self.config.nn_layer) ** -0.5
+            torch.nn.init.normal_(module.weight, mean=0.0, std=std)
+            if module.bias is not None:
+                torch.nn.init.zeros_(module.bias)
+        elif isinstance(module, nn.Embedding):
+            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
+
+    def forward(self, idx, targets=None):
+        B, T = idx.size()
+        assert T <= self.config.nn_max_tok_seq, f"Token length ({T}) cannot exceed the max allowed sequence size (block size) ({self.config.nn_max_tok_seq})"
+
+        # Embedding layer
+        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)  # 1-D vector 0..T-1 representing the token positions of a single sequence
+        pos_embed = self.wpe(pos)  # position embedding (T, nn_embed): every position of the sequence gets an nn_embed-size vector
+        tok_embed = self.wte(idx)  # token embedding (B, T, nn_embed): every token of every batch gets its own embedding
+        # The position embedding is the same for every batch (it depends on position, not value), so it broadcasts over the batch dim
+        x = pos_embed + tok_embed  # (B, T, nn_embed)
+
+        # Transformer blocks (nn_layer of them)
+        for block in self.blocks:
+            x = block(x)  # (B, T, nn_embed)
+
+        # Head - the final layer
+        logits = self.lm_head(x)  # (B, T, vocab_size)
+
+        # If targets are supplied, compute the loss; otherwise loss stays None. Return both.
+        loss = None
+        if targets is not None:
+            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
+        return logits, loss
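A quick shape-and-size check against the logged numbers. With weight tying, `wte` and `lm_head` share one 50257x768 matrix, which `parameters()` counts once; the random token ids below are placeholder inputs for illustration.

```python
import torch
from config import Config
from model import DecoderTransformer

cfg = Config()
model = DecoderTransformer(cfg)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.2f}M parameters")       # 124.44M, matching the training log

idx = torch.randint(0, cfg.vocab_size, (2, 16))  # (B=2, T=16) random token ids
logits, loss = model(idx, targets=idx)           # teacher-forced; loss is None if targets are omitted
print(logits.shape)                              # torch.Size([2, 16, 50257])
print(loss.item())                               # ~11 at init, i.e. about ln(50257), as in the log
```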
requirements.txt
ADDED
@@ -0,0 +1,8 @@
+torch
+torchvision
+pytest
+numpy
+torchsummary
+gradio
+transformers
+tiktoken
utils.py
ADDED
@@ -0,0 +1,26 @@
+import torch
+
+def get_device(seed=1):
+    # Seed is to generate the same random data for each run
+    # (for reproducibility)
+    torch.manual_seed(seed)
+
+    # Set device
+    device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
+
+    if torch.cuda.is_available():
+        print(f"[INFO] GPU: {torch.cuda.get_device_name(0)}")
+        print(f"[INFO] CUDA Version: {torch.version.cuda}\n")
+        torch.cuda.manual_seed(seed)
+
+    if not torch.backends.mps.is_available():
+        if not torch.backends.mps.is_built():
+            print("MPS not available because the current PyTorch install was not "
+                  "built with MPS enabled.")
+        else:
+            print("MPS not available because the current MacOS version is not 12.3+ "
+                  "and/or you do not have an MPS-enabled device on this machine.")
+    else:
+        torch.mps.manual_seed(seed)
+
+    return device