---
language:
  - en
license: mit
tags:
  - pytorch
  - gpt2
  - text-generation
  - transformer
  - from-scratch
pipeline_tag: text-generation
inference: true
---

# FischGPT-SFT

## Model Description

FischGPT-SFT is a supervised fine-tuned, GPT-2-style decoder-only transformer built entirely from scratch in PyTorch. The implementation covers the full stack, from the attention and MLP blocks to the training loop, following industry-standard training practices.

## Key Features

- **From-scratch implementation**: every component is written directly in PyTorch, with no pre-existing transformer libraries
- **Flash attention**: efficient attention via `F.scaled_dot_product_attention`
- **Clean architecture**: clear separation of attention, MLP, and transformer blocks
- **Industry-standard training**: follows OpenAI's GPT-2 training methodology
- **Production-oriented**: proper weight initialization and distributed training support

## Model Architecture

| Parameter       | Value               |
|-----------------|---------------------|
| Model type      | GPT-2-style decoder |
| Layers          | 12                  |
| Hidden size     | 768                 |
| Attention heads | 12                  |
| Context length  | 1024                |
| Vocabulary size | 50,304              |
| Parameters      | ~124M               |
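For reference, these hyperparameters map onto a configuration object roughly like the one below (a minimal sketch; the `GPTConfig` name and its field names are assumptions, not taken from the actual source):

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # maximum context length
    vocab_size: int = 50304  # larger than GPT-2's 50,257 BPE tokens (commonly padded for efficiency)
    n_layer: int = 12        # number of transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # hidden size
```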

## Training Details

- **Model type**: supervised fine-tuned (SFT)
- **Training data**: OpenAssistant/oasst1 (see the formatting sketch below)
- **Training steps**: 19,999
- **Final validation loss**: 1.726
- **Tokenizer**: GPT-2 BPE (tiktoken)
- **Framework**: PyTorch with mixed precision (bfloat16)
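A rough sketch of how an oasst1 prompt/reply pair might be rendered into the chat format documented below and tokenized with the GPT-2 BPE tokenizer. The `render_example` helper and the example strings are illustrative, not taken from the training code:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2 BPE, as listed above

def render_example(user_message: str, assistant_message: str) -> str:
    # Hypothetical helper: wrap a prompt/reply pair in the SFT chat tags.
    return f"<|user|>{user_message}<|assistant|>{assistant_message}"

text = render_example(
    "Explain quantum computing in simple terms",
    "Quantum computers use qubits, which can hold a mix of 0 and 1 at once...",
)

# disallowed_special=() keeps encoding from raising if the text happens to
# contain a registered special token such as <|endoftext|>.
tokens = enc.encode(text, disallowed_special=())
print(len(tokens))
```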

### Training Infrastructure

- **Distributed training**: multi-GPU support with DistributedDataParallel
- **Optimization**: AdamW with a cosine learning-rate schedule (see the sketch after this list)
- **Regularization**: weight decay, dropout, gradient clipping
- **Monitoring**: comprehensive logging and checkpoint management
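A condensed sketch of what such a training step can look like (DDP wrapping and data loading omitted; the specific learning rates, warmup length, weight decay, and betas below are assumptions, not values from the actual run):

```python
import math
import torch

def lr_at(step: int, max_steps: int = 19999, max_lr: float = 3e-4,
          min_lr: float = 3e-5, warmup: int = 500) -> float:
    # Linear warmup followed by cosine decay down to min_lr.
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, max_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

model = torch.nn.Linear(768, 768)  # stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              weight_decay=0.1, betas=(0.9, 0.95))

x = torch.randn(8, 768)
for step in range(3):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):  # bf16 mixed precision
        loss = model(x).pow(2).mean()  # placeholder loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    optimizer.step()
```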

## Usage

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained('fischgpt-sft')
tokenizer = GPT2Tokenizer.from_pretrained('fischgpt-sft')

# Generate text
input_text = "The future of artificial intelligence"
inputs = tokenizer.encode(input_text, return_tensors='pt')

with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_length=100,
        num_return_sequences=1,
        temperature=0.8,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

### Chat Format (for SFT models)

```python
def chat_format(user_message):
    # Wrap the user message in the special tags used during SFT
    return f"<|user|>{user_message}<|assistant|>"

prompt = chat_format("Explain quantum computing in simple terms")
# ... generate as above
```
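Putting the two snippets together, a complete chat-style generation might look like the following sketch. It reuses the `model`, `tokenizer`, and `chat_format` defined above; splitting the decoded text on the `<|assistant|>` tag to recover the reply is an assumption about how the output is structured:

```python
import torch

prompt = chat_format("Explain quantum computing in simple terms")
inputs = tokenizer.encode(prompt, return_tensors='pt')

with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_length=200,
        temperature=0.8,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

text = tokenizer.decode(outputs[0])
# Everything after the <|assistant|> tag is treated as the model's reply.
reply = text.split("<|assistant|>")[-1].strip()
print(reply)
```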

## Implementation Highlights

### Custom Components

- **CausalSelfAttention**: multi-head self-attention with causal masking
- **MLP**: feed-forward network with GELU activation and custom initialization
- **Block**: transformer block with pre-layer normalization (see the sketch after this list)
- **GPT**: complete model with tied embeddings and generation capabilities
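The following is a condensed, illustrative sketch of how `CausalSelfAttention`, `MLP`, and `Block` typically fit together in a pre-layer-norm, GPT-2-style model. The attribute names (`c_attn`, `c_proj`, `ln_1`, ...) and details such as bias handling are assumptions and may differ from the actual source:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from types import SimpleNamespace

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_head, self.n_embd = config.n_head, config.n_embd
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # fused q, k, v projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # reshape each to (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # flash attention, causal mask
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU(approximate="tanh")
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        # pre-layer-norm residual connections
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

# quick shape check with the values from the architecture table
cfg = SimpleNamespace(n_head=12, n_embd=768)
print(Block(cfg)(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```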

### Advanced Features

```python
# Flash attention (inside the attention forward pass)
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Custom weight initialization: projections flagged with FISCHGPT_SCALE_INIT
# get their init std scaled by 1/sqrt(2 * n_layer)
if hasattr(module, "FISCHGPT_SCALE_INIT"):
    std *= (2 * self.config.n_layer) ** -0.5
```

## Performance & Benchmarks

| Metric            | Value                                   |
|-------------------|-----------------------------------------|
| Training speed    | ~1.2M tokens/sec                        |
| Memory efficiency | Mixed precision (bfloat16)              |
| Context length    | 1024 tokens                             |
| Generation speed  | Fast inference with optimized attention |
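The training-speed figure is tokens processed per second of wall-clock training time; a generic way to measure it is to time the step function (`step_fn`, `batch_size`, and the dummy call below are placeholders, not part of the actual codebase):

```python
import time

def tokens_per_sec(step_fn, batch_size: int, seq_len: int = 1024, n_steps: int = 10) -> float:
    # Time n_steps calls to a training-step callable and report tokens/second.
    start = time.time()
    for _ in range(n_steps):
        step_fn()
    return n_steps * batch_size * seq_len / (time.time() - start)

# example with a dummy step; substitute the real forward/backward/optimizer step
print(f"{tokens_per_sec(lambda: time.sleep(0.01), batch_size=64):,.0f} tokens/sec")
```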

## Technical Specifications

- **Attention pattern**: causal (autoregressive)
- **Activation function**: GELU (approximate='tanh')
- **Normalization**: layer normalization
- **Position encoding**: learned positional embeddings
- **Weight tying**: shared input/output embeddings (see the sketch after this list)
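A self-contained sketch of the learned positional embeddings and weight tying (the `wte`/`wpe`/`lm_head` attribute names follow the usual GPT-2 convention and are assumptions about this implementation):

```python
from types import SimpleNamespace

import torch.nn as nn

class TiedEmbeddings(nn.Module):
    # Illustrative container: token + learned positional embeddings plus a tied output head.
    def __init__(self, config):
        super().__init__()
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)  # token embeddings
        self.wpe = nn.Embedding(config.block_size, config.n_embd)  # learned positional embeddings
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight  # weight tying: one shared parameter tensor

cfg = SimpleNamespace(vocab_size=50304, block_size=1024, n_embd=768)
tied = TiedEmbeddings(cfg)
assert tied.lm_head.weight is tied.wte.weight
```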

## Use Cases

- Conversational AI and instruction following
- Code completion and programming assistance
- Creative writing and storytelling
- Educational content generation
- Research and experimentation

## Limitations

- Context length limited to 1024 tokens
- English-focused training data
- Requires careful prompt engineering for best results
- May generate inconsistent or incorrect information

## Ethics and Safety

This model was trained on publicly available datasets and may reflect biases present in the training data. Users should:

- Validate generated content for accuracy
- Be aware of potential biases in outputs
- Use appropriate content filtering for production applications
- Follow responsible AI practices

## Citation

```bibtex
@misc{fischgpt2024,
  title={FischGPT: A From-Scratch GPT-2 Implementation},
  author={[Your Name]},
  year={2024},
  howpublished={\url{https://github.com/yourusername/FischGPT}}
}
```

## License

MIT License - See LICENSE file for details.


*Built with industry best practices and attention to detail, as a demonstration of transformer architecture and modern NLP engineering.*