# Modern-Transformer-Decoder-Tiny

A lightweight PyTorch transformer decoder with modern architecture features, trained on conversational data.
## Model Details
- Model Type: Transformer Decoder with Grouped-Query Attention
- Parameters: 23,744
- Architecture: Custom implementation combining LLaMA and Qwen-3 features
- Training: Word-level tokenization on conversation data
- Format: SafeTensors (ready for further training)
## Features
- Grouped-Query Attention (GQA) for memory efficiency
- Rotary Position Embeddings (RoPE) for position encoding
- RMSNorm pre-normalization for training stability
- SwiGLU activation in feed-forward networks
- KV caching for efficient inference
- SafeTensors format for safe loading
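To make the attention layout concrete, here is a minimal, self-contained sketch of grouped-query attention under the shapes listed in Model Architecture below (2 query heads, 1 KV group, hidden size 32). It illustrates the technique only and is not this repository's implementation; RoPE and KV caching are omitted for brevity.

```python
# Minimal grouped-query attention sketch (illustrative, not this repo's code).
# Queries get num_heads heads; keys/values get only kv_groups heads and are
# repeated so each KV group serves num_heads // kv_groups query heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, embed_dim=32, num_heads=2, kv_groups=1):
        super().__init__()
        self.num_heads, self.kv_groups = num_heads, kv_groups
        self.head_dim = embed_dim // num_heads
        self.q_proj = nn.Linear(embed_dim, num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(embed_dim, kv_groups * self.head_dim, bias=False)
        self.v_proj = nn.Linear(embed_dim, kv_groups * self.head_dim, bias=False)
        self.o_proj = nn.Linear(num_heads * self.head_dim, embed_dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.kv_groups, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.kv_groups, self.head_dim).transpose(1, 2)
        # Share each KV head across num_heads // kv_groups query heads.
        k = k.repeat_interleave(self.num_heads // self.kv_groups, dim=1)
        v = v.repeat_interleave(self.num_heads // self.kv_groups, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```

With `kv_groups=1` this reduces to multi-query attention: both query heads read from a single shared key/value head, which is where the KV-cache memory saving comes from.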
## Model Architecture

- Vocab size: 35
- Hidden size: 32
- Layers: 2
- Attention heads: 2
- KV groups: 1
- Max sequence length: 32
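As a quick sanity check, the attention dimensions implied by this configuration work out as follows (the variable names are illustrative, not attributes of the actual model class):

```python
# Derived attention dimensions for this configuration (illustrative names).
hidden_size, num_heads, kv_groups = 32, 2, 1

head_dim = hidden_size // num_heads      # 16 dimensions per attention head
queries_per_kv = num_heads // kv_groups  # 2 query heads share each KV head
kv_dim = kv_groups * head_dim            # 16-dim K/V projections per token

print(head_dim, queries_per_kv, kv_dim)  # -> 16 2 16
```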
## Usage

### Loading the Model
```python
import json

import torch
from safetensors.torch import load_file

# Import your custom model classes
from your_code import TransformerDecoder, TransformerConfig

# Load the config
with open("config.json", "r") as f:
    config_dict = json.load(f)

config = TransformerConfig(
    vocab_size=config_dict["vocab_size"],
    embed_dim=config_dict["hidden_size"],
    num_layers=config_dict["num_hidden_layers"],
    num_heads=config_dict["num_attention_heads"],
    kv_groups=config_dict["kv_groups"],
    max_seq_len=config_dict["max_position_embeddings"],
)

# Load the SafeTensors weights and create the model
weights = load_file("model.safetensors")
model = TransformerDecoder(config)
model.load_state_dict(weights)
model.eval()
```
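Once the model is loaded, a simple greedy-decoding loop can serve as a smoke test. The sketch below assumes the forward pass takes a `(batch, seq)` tensor of token ids and returns `(batch, seq, vocab)` logits, and that `word_to_id` / `id_to_word` are dictionaries built from the training vocabulary; the calling convention and those names are assumptions, so adapt them to your actual code.

```python
import torch

def generate(model, prompt_ids, max_new_tokens=16, max_seq_len=32):
    # Greedy decoding without KV caching, truncated to the context window.
    ids = torch.tensor([prompt_ids], dtype=torch.long)
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(ids[:, -max_seq_len:])      # assumed: (1, seq, vocab)
            next_id = logits[0, -1].argmax().item()    # pick the most likely token
            ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)
    return ids[0].tolist()

# word_to_id / id_to_word are hypothetical lookups for the 35-word vocabulary.
prompt_ids = [word_to_id[w] for w in "hello how are you".split()]
print(" ".join(id_to_word[i] for i in generate(model, prompt_ids)))
```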
## Training Further
This model is saved in SafeTensors format, making it easy to:
- Continue training with your own data
- Fine-tune for specific tasks
- Integrate with Hugging Face Transformers
- Use with other ML frameworks
## Training Data

Trained on a small conversational dataset, tokenized at the word level, covering common patterns:
- Greetings and responses
- Question-answer pairs
- Basic conversational flow
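The card does not ship a tokenizer implementation, so here is a minimal word-level tokenizer sketch. It assumes the 35-word vocabulary is available as a word-to-id mapping; the `vocab.json` file name and the unknown-token id are assumptions.

```python
import json

# Hypothetical vocabulary file: {"hello": 1, "how": 2, ...}
with open("vocab.json", "r") as f:
    word_to_id = json.load(f)
id_to_word = {i: w for w, i in word_to_id.items()}

def encode(text, unk_id=0):
    # Lowercase, split on whitespace, map unknown words to unk_id (assumed).
    return [word_to_id.get(w, unk_id) for w in text.lower().split()]

def decode(ids):
    return " ".join(id_to_word.get(i, "<unk>") for i in ids)
```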
## Intended Use
- Research: Study modern transformer architectures
- Education: Learn about GQA, RoPE, and efficient attention
- Base Model: Fine-tune for specific conversational tasks
- Experimentation: Test architectural improvements
## Limitations
- Small vocabulary (35 words)
- Limited training data
- Basic tokenization
- Requires custom model code for loading
## Further Training
To continue training:
- Load the SafeTensors weights
- Prepare your dataset
- Use the same architecture configuration
- Resume training with appropriate learning rate
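A minimal continuation-training sketch following these steps might look like the code below. It assumes `model(input_ids)` returns `(batch, seq, vocab)` logits and that `batches` is an iterable of `(batch, seq)` LongTensors of token ids; both are assumptions about your setup, not code from this repository.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # modest LR for resuming
model.train()

for batch in batches:  # `batches` is your own dataloader of token-id tensors
    inputs, targets = batch[:, :-1], batch[:, 1:]           # next-token prediction
    logits = model(inputs)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```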
## Model Source
- Repository: [Your GitHub Repository]
- Architecture: Modern Transformer Decoder
- Implementation: PyTorch with custom layers
## Citation

```bibtex
@misc{modern-transformer-decoder,
  title={Modern Transformer Decoder with GQA and RoPE},
  author={Your Name},
  year={2024},
  howpublished={\url{https://huggingface.co/Modern-Transformer-Decoder-Tiny}}
}
```
## License
MIT License - Feel free to use for research and development.