DrNerd committed
Commit 37db96e · verified · 1 Parent(s): 48b48f8

Updated README.md

Files changed (1)
  1. README.md +87 -3
README.md CHANGED
@@ -1,3 +1,87 @@
- ---
- license: apache-2.0
- ---

---
license: apache-2.0
datasets:
- Salesforce/wikitext
language:
- en
metrics:
- perplexity
pipeline_tag: text-generation
library_name: pytorch
tags:
- text-generation-inference
- custom-code
- custom-model
- from-scratch
- llama-inspired
- educational
- wikitext
- deepseek-tokenizer
- ~200M-params
---

# LLAMA-3-From-Scartch (Custom ~221M Model)

## Model Description

This repository contains a ~221 million parameter decoder-only Transformer language model built from scratch. The architecture is inspired by Llama-style models but significantly scaled down and adapted during development to fit limited hardware (training was demonstrated on a single GPU with ~4 GB of VRAM).

**This model is primarily an educational project.** It demonstrates the implementation of core Transformer components (RMSNorm, RoPE, attention, a SwiGLU feed-forward network, weight tying) and a basic training pipeline (checkpointing, learning-rate scheduling, AMP, and validation).

**Key Architectural Features** (a minimal component sketch follows the list):
* **Parameters:** ~221 million (weight-tied)
* **Layers:** 12 Transformer blocks
* **Hidden Size:** 768
* **Attention:** 12 heads (multi-head attention, `head_dim=64`)
* **FFN Intermediate Size:** 2048 (SwiGLU activation)
* **Normalization:** RMSNorm
* **Positional Embeddings:** Rotary Positional Embeddings (RoPE)
* **Weight Tying:** Input embeddings and the output projection layer share weights.
* **Tokenizer:** `deepseek-ai/DeepSeek-R1` (vocab size: 128,000) - *Note: requires `trust_remote_code=True`*
* **Context Length:** `max_position_embeddings=4096`; the demo training run used `sequence_length=256`.

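As a reference for the components listed above, here is a minimal, self-contained sketch of two of them, RMSNorm and a SwiGLU feed-forward block, using the dimensions from this card (hidden size 768, intermediate size 2048). It is illustrative only; the actual definitions live in `model_architecture.py` and may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm (no mean subtraction, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS over the last dimension, then apply a learned scale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLUFFN(nn.Module):
    """Feed-forward block with SwiGLU activation: down(SiLU(gate(x)) * up(x))."""
    def __init__(self, hidden_size: int = 768, intermediate_size: int = 2048):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Quick shape check: (batch, seq, hidden) -> same shape.
x = torch.randn(2, 16, 768)
y = SwiGLUFFN()(RMSNorm(768)(x))
assert y.shape == x.shape
```
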
**Training:**
* **Dataset:** Primarily trained on **WikiText-2** (`wikitext-2-raw-v1`, ~2M tokens) for demonstration purposes. The tokenized versions (`wikitext2_tokens_128k.pt`, `wikitext2_val_tokens_128k.pt`) are included in the repository.
* **Procedure:** Trained on a single GPU using PyTorch with AMP (float16), the AdamW optimizer, and a cosine learning-rate schedule with warmup (a schematic of this loop follows the list). The provided checkpoints (`step_600.pt`, `step_800.pt`, possibly others) represent states after very limited training.
* **Performance:** Because of the extremely limited training data and duration, the model shows basic pattern learning but **lacks coherence, factual accuracy, and instruction-following capabilities.** Training and validation loss decreased but remained high; see `loss_plot_*.png` for a visualization.

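For readers studying the pipeline, the following is a schematic of the procedure described above (float16 AMP, AdamW, linear warmup followed by cosine decay). It is a sketch under stated assumptions, not the repository's training script: the stand-in model, learning rate, batch size, and step counts are placeholders, and it assumes the tokenized dataset is stored as a flat 1-D tensor of token IDs and that a CUDA GPU is available.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

device = "cuda"  # the sketch assumes a CUDA GPU, as in the original training run

# Assumption: the file stores a flat 1-D LongTensor of token IDs.
tokens = torch.load("wikitext2_tokens_128k.pt")

# Placeholder stand-in so the sketch runs end to end; in practice the ~221M
# model defined in model_architecture.py would be used here instead.
vocab_size, hidden_size = 128_000, 768
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, hidden_size),
    torch.nn.Linear(hidden_size, vocab_size, bias=False),
).to(device)

seq_len, batch_size, total_steps, warmup_steps = 256, 8, 800, 100  # placeholders

def get_batch():
    # Random contiguous windows; targets are the inputs shifted by one token.
    starts = torch.randint(0, tokens.numel() - seq_len - 1, (batch_size,)).tolist()
    x = torch.stack([tokens[s : s + seq_len] for s in starts])
    y = torch.stack([tokens[s + 1 : s + 1 + seq_len] for s in starts])
    return x.to(device), y.to(device)

optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)  # placeholder hyperparameters

def lr_lambda(step):
    # Linear warmup, then cosine decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)
scaler = torch.cuda.amp.GradScaler()
criterion = torch.nn.CrossEntropyLoss()

for step in range(total_steps):
    x, y = get_batch()
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):
        logits = model(x)                                  # (batch, seq, vocab)
        loss = criterion(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
```
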
**Intended Use:**
* Educational purposes: studying Transformer architecture implementation and training basics.
* Experimentation: serving as a base for further training or architectural modifications.
* **Not suitable for production or reliable text generation.**

## How to Use

**Important:** This model requires the custom Python code (`model_architecture.py`) from this repository. It cannot be loaded directly with `AutoModelForCausalLM`.

1. **Clone the repository:**
   ```bash
   git clone https://huggingface.co/DrNerd/LLAMA-3-From-Scartch
   cd LLAMA-3-From-Scartch
   # Ensure LFS files are downloaded (if needed)
   # git lfs pull
   ```
2. **Install dependencies:**
   ```bash
   pip install torch transformers datasets matplotlib tqdm  # Add others if needed
   ```
3. **Run Inference (using `inference.py`):**
   The `inference.py` script loads a checkpoint (it defaults to `step_1200.pt` or the latest it finds; **edit the script to point at `step_800.pt` or `step_600.pt`**) and runs generation.
   ```bash
   # Make sure step_600.pt or step_800.pt exists in the directory
   # Edit inference.py to point to the desired checkpoint file
   python inference.py
   ```

*Alternatively, adapt the loading logic from `inference.py` into your own script.*

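For orientation, here is a rough sketch of what such an adaptation might look like. The import, class name, and checkpoint-dictionary layout (`Model`, `ckpt["config"]`, `ckpt["model_state_dict"]`) are assumptions for illustration, not the actual interface of `model_architecture.py`; consult `inference.py` for the real loading code. The tokenizer call follows the card above (`deepseek-ai/DeepSeek-R1` with `trust_remote_code=True`).

```python
import torch
from transformers import AutoTokenizer

# Hypothetical import: the actual class name in model_architecture.py may differ.
from model_architecture import Model

device = "cuda" if torch.cuda.is_available() else "cpu"

# Tokenizer as documented above (DeepSeek-R1, 128k vocab, custom code).
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1", trust_remote_code=True)

# Assumed checkpoint layout: a dict holding the config and the weights.
ckpt = torch.load("step_800.pt", map_location=device)
model = Model(**ckpt["config"])                   # assumed constructor signature
model.load_state_dict(ckpt["model_state_dict"])   # assumed key name
model.to(device).eval()

# Greedy, token-by-token generation: works for any decoder-only model whose
# forward pass returns logits of shape (batch, seq, vocab).
prompt = "The history of science"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

with torch.no_grad():
    for _ in range(50):
        logits = model(input_ids)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```

Given the very limited training noted above, expect repetitive or incoherent output regardless of the decoding strategy.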