Model Card for amusktweewt/tiny-stories-v1

This model is a custom transformer-based language model trained on the TinyStories dataset, designed for creative text-generation tasks such as storytelling and conversational agents. It is purely an academic project and should not be used in production or other practical applications.

Model Details

Model Description

This model uses a custom Byte Pair Encoding (BPE) tokenizer and a deliberately small architecture to balance efficiency and performance. It is designed to generate coherent, contextually relevant short stories. However, a known tokenizer issue inserts spurious spaces between tokens (and occasionally splits words), which degrades output quality.
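As a partial workaround, decoded text can be post-processed. The helper below is a minimal, hypothetical sketch (not part of the released model): it collapses runs of whitespace and re-attaches punctuation, but it cannot reliably rejoin words the tokenizer has split.

import re

def clean_spacing(text: str) -> str:
    # Hypothetical cleanup for the spacing artifacts described above.
    text = re.sub(r"\s+", " ", text)              # collapse repeated whitespace
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)  # drop space before punctuation
    return text.strip()

print(clean_spacing("He was so happy ."))  # -> "He was so happy."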

  • Developed by: amusktweewt
  • Model type: Causal language model (loads via AutoModelForCausalLM)
  • Language(s) (NLP): English
  • License: MIT

Model Sources

  • Repository: https://huggingface.co/amusktweewt/tiny-stories-v1

Uses

Direct Use

This model is intended for academic and research purposes only. It demonstrates a proof of concept for training smaller transformer-based language models.

Out-of-Scope Use

  • Not suitable for tasks requiring factual accuracy
  • Should not be used in production environments or applications involving sensitive content

Bias, Risks, and Limitations

Risks and Biases

The model may reflect biases present in the training data, leading to unintended or inappropriate outputs. Additionally, the tokenizer issue can result in suboptimal and incoherent text generations.

Recommendations

This model is meant for research and demonstration purposes. Users should validate outputs critically and avoid using it for practical applications.

How to Get Started with the Model

from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast

# Load the model and its custom BPE tokenizer from the Hub.
model = AutoModelForCausalLM.from_pretrained("amusktweewt/tiny-stories-v1")
tokenizer = PreTrainedTokenizerFast.from_pretrained("amusktweewt/tiny-stories-v1")

# return_token_type_ids=False keeps generate() from receiving an
# argument this model does not accept.
prompt = "Once upon a time,"
inputs = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False)
outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
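Greedy decoding (as above) tends to repeat itself on small models; sampling usually produces more varied stories. The settings below are illustrative guesses, not values tuned by the author:

# Sampling-based generation; temperature and top_p are illustrative.
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))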

Training Details

Training Data

The model was trained on the TinyStories dataset, consisting of curated short stories. Preprocessing ensured consistent formatting and tokenization using a custom BPE tokenizer.
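A public copy of TinyStories is available on the Hugging Face Hub. The snippet below is a sketch that assumes the widely used roneneldan/TinyStories mirror; the card does not state which copy was actually used.

from datasets import load_dataset

# Assumption: the public roneneldan/TinyStories dataset on the Hub.
dataset = load_dataset("roneneldan/TinyStories")
print(dataset["train"][0]["text"][:200])  # peek at the first story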

Training Procedure

Preprocessing

  • Used a BPE tokenizer with a vocabulary size of 4096
  • Included special tokens: <sos>, <pad>, <|endoftext|>, and <unk> (a training sketch follows this list)
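The tokenizer training script is not published. The sketch below shows one plausible way to build such a tokenizer with the tokenizers library; whitespace pre-tokenization is an assumption, and (because it discards spacing information) would be consistent with the decoding artifacts described earlier.

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# A BPE tokenizer with the vocabulary size and special tokens from this card.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # assumption, not confirmed

trainer = trainers.BpeTrainer(
    vocab_size=4096,
    special_tokens=["<sos>", "<pad>", "<|endoftext|>", "<unk>"],
)

corpus = ["Once upon a time, there was a little girl."]  # placeholder corpus
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("tokenizer.json")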

Training Hyperparameters

  • Batch size: 64
  • Epochs: 3
  • Learning rate: 1e-3
  • Scheduler: Cosine annealing
  • Precision: Mixed precision (FP16); a configuration sketch follows this list
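As a rough illustration, the reported settings map onto Hugging Face TrainingArguments as sketched below; every field not listed above is a guess, and the author's actual training script may differ.

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="tiny-stories-v1",    # hypothetical output path
    per_device_train_batch_size=64,  # batch size: 64
    num_train_epochs=3,              # epochs: 3
    learning_rate=1e-3,              # learning rate: 1e-3
    lr_scheduler_type="cosine",      # cosine annealing schedule
    fp16=True,                       # mixed precision (FP16)
)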

Speeds, Sizes, Times

  • Training time: Approx. 5 hours 30 minutes
  • Model size: 230 MB (57.4M parameters, stored as F32 safetensors)
  • Dataset size: 535.98 million tokens
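Assuming every token was seen once per epoch, three passes over 535.98 million tokens in roughly 5.5 hours works out to an effective throughput on the order of 80,000 tokens per second.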

Evaluation

Testing Data, Factors & Metrics

Testing Data

A subset of the training data was used for evaluation, focusing on coherence and storytelling quality.

Metrics

  • Loss: 0.9723
  • Qualitative Evaluation: Manual assessment of generated outputs for coherence and relevance.
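Assuming the reported loss is the mean per-token cross-entropy in nats, it corresponds to a perplexity of exp(0.9723) ≈ 2.64.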

Results

  • Sample Outputs:
    • Prompt: "in a far away country"
    • Completion: "in a far away coun try . He was so excited to explore the world . He was so happy to be able to explore the world ." (the split "coun try" and spaced punctuation illustrate the tokenizer issue noted above)

Summary

The model generates coherent short stories suitable for research demonstration but is limited by tokenizer issues and should not be used in real-world scenarios.

Environmental Impact

  • Hardware Type: NVIDIA GeForce RTX 4090 (single GPU)
  • Hours used: 5.5
  • Carbon Emitted: Approx. 0.2 kg CO2 eq

Technical Specifications

Model Architecture and Objective

  • Transformer architecture with 8 layers, 12 attention heads, and a hidden size of 768
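The card does not name the configuration class. Assuming a GPT-2-style decoder-only stack, the reported dimensions would correspond to roughly the following hypothetical configuration:

from transformers import GPT2Config

# Hypothetical configuration matching the dimensions reported above.
config = GPT2Config(
    vocab_size=4096,  # from the tokenizer section
    n_layer=8,        # transformer layers
    n_head=12,        # attention heads
    n_embd=768,       # hidden size (head dim = 768 / 12 = 64)
)

With these dimensions, a GPT-2-style stack totals roughly 60M parameters, in the same ballpark as the 57.4M reported for the released weights.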

Compute Infrastructure

Hardware

  • Single GPU (NVIDIA GeForce RTX 4090)

Software

  • Python 3.8+
  • HuggingFace Transformers 4.x
  • PyTorch 1.x

Model Card Authors

amusktweewt

Model Card Contact

For questions or feedback, contact amusktweewt.
