---
title: Transformers from Scratch - Complete Implementation
emoji: 🔮
colorFrom: blue
colorTo: green
sdk: pytorch
app_file: Transformers.ipynb
pinned: false
license: mit
tags:
  - deep-learning
  - transformers
  - attention
  - pytorch
  - nlp
  - text-classification
  - sentiment-analysis
  - educational
  - from-scratch
datasets:
  - synthetic-movie-reviews
---

# Transformers from Scratch: Complete Implementation

A comprehensive PyTorch implementation of the Transformer architecture from "Attention Is All You Need", featuring detailed mathematical foundations, educational content, and a practical text classification application.

## Model Description

This repository contains a complete, from-scratch implementation of the Transformer architecture. The model demonstrates the core concepts behind modern NLP systems such as BERT, GPT, and ChatGPT through a practical sentiment analysis task. The implementation serves as both a working model and an educational resource for understanding the attention mechanism.

### Architecture Details

- **Model Type**: Transformer encoder for text classification
- **Framework**: PyTorch
- **Task**: Binary sentiment classification (positive/negative movie reviews)
- **Model Dimension**: 128
- **Attention Heads**: 8
- **Layers**: 4 Transformer blocks
- **Feed-Forward Dimension**: 256
- **Total Parameters**: ~200K
- **Vocabulary Size**: Dynamic (built from training data)

### Key Components

1. **Multi-Head Attention**: Core mechanism allowing parallel processing of sequences
2. **Positional Encoding**: Sine/cosine embeddings to inject position information
3. **Transformer Blocks**: Attention + feed-forward sub-layers with residual connections
4. **Layer Normalization**: Stabilizes training and improves convergence
5. **Classification Head**: Global average pooling + linear layer for predictions

## Mathematical Foundation

### Scaled Dot-Product Attention

```
Attention(Q, K, V) = softmax(QK^T / √d_k)V
```

### Multi-Head Attention

```
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
```

### Positional Encoding

```
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
```
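To make these formulas concrete, here is a minimal PyTorch sketch of scaled dot-product attention and the sinusoidal positional encoding. The function names and tensor shapes are illustrative only (they are not taken from the notebook), but the arithmetic follows the equations above.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq, seq)
    weights = torch.softmax(scores, dim=-1)              # each row sums to 1
    return weights @ V, weights

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos/10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = torch.arange(max_len).unsqueeze(1).float()                 # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))                # 1 / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# Tiny sanity check with random tensors
Q = K = V = torch.randn(2, 5, 16)                        # (batch=2, seq_len=5, d_k=16)
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)                             # (2, 5, 16), (2, 5, 5)
print(sinusoidal_positional_encoding(24, 128).shape)     # (24, 128)
```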
## Training Details

- **Dataset**: Synthetic movie reviews (positive/negative sentiment)
- **Optimizer**: AdamW with weight decay (0.01)
- **Learning Rate**: 0.0001 with cosine annealing
- **Batch Size**: 16
- **Max Sequence Length**: 24 tokens
- **Training Epochs**: 30
- **Hardware**: Optimized for Apple M4 and CUDA GPUs

## Model Performance

### Metrics

- **Test Accuracy**: 85%+
- **Training Time**: ~10 minutes on Apple M4
- **Model Size**: ~200K parameters
- **Convergence**: Stable training without overfitting

### Capabilities

- ✅ Binary sentiment classification
- ✅ Attention weight visualization
- ✅ Fast inference on modern hardware
- ✅ Educational transparency
- ✅ Easily extensible architecture

## Usage

### Quick Start

```python
import torch
import torch.nn as nn
import math

# Complete implementation (PositionalEncoding and TransformerBlock are defined in Transformers.ipynb)
class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff, max_len, num_classes):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len)
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff) for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, x):
        # Embedding + positional encoding
        x = self.embedding(x) * math.sqrt(self.d_model)
        x = self.pos_encoding(x)

        # Transformer blocks
        for transformer in self.transformer_blocks:
            x = transformer(x)

        # Classification
        x = self.norm(x)
        x = x.mean(dim=1)  # Global average pooling
        return self.classifier(x)

# Load trained model (vocab_size comes from the vocabulary built in the notebook)
model = TransformerClassifier(
    vocab_size=vocab_size,
    d_model=128,
    num_heads=8,
    num_layers=4,
    d_ff=256,
    max_len=24,
    num_classes=2
)
model.load_state_dict(torch.load('best_transformer_model.pth'))
model.eval()

# Example inference (tokenize_text and vocab_to_idx come from the notebook's preprocessing step)
def predict_sentiment(text, model, vocab_to_idx, max_length=24):
    tokens = tokenize_text(text, vocab_to_idx, max_length)
    with torch.no_grad():
        output = model(tokens.unsqueeze(0))
        prediction = torch.softmax(output, dim=1)
        return "Positive" if prediction[0][1] > 0.5 else "Negative"

# Test the model
result = predict_sentiment("This movie was absolutely fantastic!", model, vocab_to_idx)
print(f"Sentiment: {result}")
```
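The Quick Start snippet references `PositionalEncoding`, `TransformerBlock`, `tokenize_text`, and `vocab_to_idx`, all of which are defined in `Transformers.ipynb`. For a self-contained picture, here is a hedged sketch of what those pieces look like, consistent with the architecture described above; the notebook builds multi-head attention from scratch and its tokenizer and normalization details may differ from this sketch.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Adds fixed sine/cosine position information to token embeddings."""
    def __init__(self, d_model, max_len):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer('pe', pe.unsqueeze(0))      # (1, max_len, d_model)

    def forward(self, x):                                 # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]

class TransformerBlock(nn.Module):
    """Self-attention + feed-forward sub-layers, each with residual connection and LayerNorm.
    NOTE: nn.MultiheadAttention is used here only to keep the sketch short; the notebook
    implements multi-head attention from scratch."""
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x

def tokenize_text(text, vocab_to_idx, max_length=24, pad_idx=0, unk_idx=1):
    """Illustrative whitespace tokenizer: map words to ids, then pad/truncate to max_length."""
    ids = [vocab_to_idx.get(w, unk_idx) for w in text.lower().split()]
    ids = ids[:max_length] + [pad_idx] * max(0, max_length - len(ids))
    return torch.tensor(ids, dtype=torch.long)
```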
### Advanced Usage

```python
# Visualize attention weights
def visualize_attention(model, text, vocab_to_idx):
    # Extract attention weights from each layer
    # Create heatmaps showing what the model focuses on
    pass

# Fine-tune on new data
def fine_tune_model(model, new_data_loader, epochs=5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    # Continue training on domain-specific data
    pass
```

## Visualizations and Analysis

1. **Training Curves**: Loss and accuracy evolution over epochs
2. **Attention Heatmaps**: Visualize what the model pays attention to
3. **Performance Metrics**: Precision, recall, and F1-score breakdowns
4. **Architecture Diagrams**: Component-wise model visualization
5. **Error Analysis**: Common failure cases and model limitations

## Files and Outputs

- `Transformers.ipynb`: Complete implementation with educational content
- `best_transformer_model.pth`: Trained model weights
- `m4_transformer_results.png`: Training curves and performance metrics
- Architecture visualization and attention weight examples

## Educational Value

This implementation is designed as a comprehensive learning resource featuring:

### Mathematical Understanding

- **Complete Derivations**: From attention theory to implementation
- **Step-by-Step Breakdown**: Each component explained individually
- **Visual Mathematics**: Attention visualizations and formula explanations
- **Practical Examples**: Concrete numerical calculations

### Implementation Insights

- **Clean Code Architecture**: Modular, readable, and well-documented
- **Best Practices**: Modern PyTorch patterns and techniques
- **Performance Optimization**: Efficient training and inference
- **Debugging Techniques**: How to monitor and improve training

### Real-World Applications

- **End-to-End Pipeline**: From raw text to predictions
- **Production Considerations**: Model deployment and optimization
- **Extension Examples**: How to adapt the model for different tasks
- **Transfer Learning**: Building on pre-trained representations

## Applications

This Transformer implementation can be adapted for:

### Text Classification Tasks

- **Sentiment Analysis**: Movie reviews, product feedback, social media
- **Topic Classification**: News categorization, document organization
- **Spam Detection**: Email filtering, content moderation
- **Intent Recognition**: Chatbot understanding, voice assistants

### Sequence Processing

- **Named Entity Recognition**: Extract people, places, and organizations
- **Part-of-Speech Tagging**: Grammatical analysis
- **Text Similarity**: Document matching, plagiarism detection
- **Feature Extraction**: Dense representations for downstream tasks

### Research and Development

- **Architecture Experiments**: Test new attention mechanisms
- **Ablation Studies**: Understand component contributions
- **Scaling Experiments**: Larger models and datasets
- **Novel Applications**: Domain-specific adaptations

## Comparison with Other Architectures

### Advantages over RNNs

- ✅ **Parallel Processing**: Much faster training and inference
- ✅ **Long-Range Dependencies**: Better handling of distant relationships
- ✅ **Scalability**: Efficient on modern hardware
- ✅ **Interpretability**: Attention weights provide insights

### Advantages over CNNs

- ✅ **Sequence Modeling**: Natural fit for text and time series
- ✅ **Variable Length**: Handles variable-length sequences up to the configured maximum
- ✅ **Global Context**: Attends to the entire sequence simultaneously
- ✅ **Position Awareness**: Explicit positional information

### Educational Benefits

- 🎓 **Foundation Understanding**: Core concepts behind modern NLP
- 🎓 **Mathematical Clarity**: Clean mathematical formulations
- 🎓 **Implementation Practice**: Hands-on coding experience
- 🎓 **Research Preparation**: Basis for advanced architectures

## Citation

If you use this implementation in your research or projects, please cite:

```bibtex
@misc{transformers_from_scratch_2024,
  title={Transformers from Scratch: Complete Implementation},
  author={Gruhesh Kurra},
  year={2024},
  url={https://huggingface.co/karthik-2905/TransformersFromScratch}
}
```

## Future Extensions

Planned improvements and research directions:

- 🔄 **Encoder-Decoder Architecture**: Full sequence-to-sequence implementation
- 🎨 **Pre-training Pipeline**: Large-scale language model training
- 📊 **Alternative Attention**: Sparse, local, and linear attention variants
- 🖼️ **Vision Transformers**: Adapting the architecture for image tasks
- 🎵 **Multimodal Transformers**: Text, image, and audio processing
- 🧬 **Scientific Applications**: Protein sequences, molecular modeling

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Additional Resources

- **GitHub Repository**: [TransformersFromScratch](https://github.com/GruheshKurra/TransformersFromScratch)
- **Original Paper**: "Attention Is All You Need" by Vaswani et al.
- **Educational Content**: Complete mathematical derivations and examples
- **Performance Benchmarks**: Detailed analysis and comparisons

## Model Card Authors

**Gruhesh Kurra** - Implementation, documentation, and educational content

---

**Tags**: transformers, attention, pytorch, nlp, text-classification, educational

**Model Card Last Updated**: December 2024