Model Card for meta-llama/Llama-3.1-8B (Instruction-Tuned)
Llama 3.1 8B is a multilingual, instruction-tuned, autoregressive LLM developed by Meta for chat, reasoning, coding, and long-context tasks.
Model Details
Model Description
Llama 3.1 8B is part of Meta's Llama 3.1 collection (released July 23, 2024), which also includes 70B and 405B parameter models. It was pre-trained on ~15 trillion tokens of multilingual text and code and supports a context window of 128K tokens. Instruction tuning combined supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to optimize the model for assistive tasks.
- Developed by: Meta AI
- Model type: Decoder‑only transformer (auto-regressive)
- Input/Output modality: Multilingual text and code
- Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai (plus broad multilingual support)
- Context window: 128,000 tokens
- Knowledge cutoff: December 2023
- License: Llama 3.1 Community License (custom commercial license)
- Finetuned from: Base pretrained Llama 3.1 8B
Model Sources
- Repository: https://huggingface.co/meta-llama/Llama-3.1-8B
- Paper: "Introducing Llama 3" blog post by Meta AI, April 18, 2024; updated for the Llama 3.1 release on July 23, 2024
- Demo: available via the transformers pipeline, or hosted on meta.ai and WhatsApp
Uses
Direct Use
Ideal for multilingual chatbots, reasoning assistants, code generation, summarization, data synthesis, and long-context tasks (document analysis, RAG).
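For chat-style direct use, a minimal sketch with the transformers text-generation pipeline is shown below. It assumes the instruction-tuned checkpoint (meta-llama/Llama-3.1-8B-Instruct), which ships a chat template that the pipeline applies automatically; the base checkpoint does not.

from transformers import pipeline

# device_map="auto" requires the accelerate package
chat = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
messages = [
    {"role": "system", "content": "You are a concise multilingual assistant."},
    {"role": "user", "content": "Summarize Don Quixote in two sentences, in Spanish."},
]
result = chat(messages, max_new_tokens=128)
print(result[0]["generated_text"][-1]["content"])  # the assistant's reply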
Downstream Use
Can be fine-tuned for domain-specific applications such as RAG, summarization, topic-controlled dialogue, coding agents, and multimodal reasoning pipelines (a fine-tuning sketch follows below).
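As a sketch of downstream fine-tuning, the snippet below wraps the base checkpoint in LoRA adapters with the peft library. The rank, alpha, and target modules are illustrative assumptions, and the dataset and training loop are omitted.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", torch_dtype="auto")
lora_config = LoraConfig(
    r=16,                     # adapter rank (assumption)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Llama attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
# From here, `model` can be passed to transformers.Trainer or trl's SFTTrainer.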
Out-of-Scope Use
Not designed for image, audio, or video generation. Do not use for content disallowed by the Llama 3.1 license and Acceptable Use Policy (e.g., illicit or unsafe instructions).
Bias, Risks, and Limitations
- May produce biased, unsafe, or hallucinated content that reflects biases in the training data.
- Very long inputs approaching the 128K-token context limit can degrade output quality or cause unexpected behavior.
- Not safe for sensitive, legal, or medical advice without additional guardrails.
Recommendations
Use with moderation filters, human oversight, prompt safety checks, and evaluation for target domain bias and safety.
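The snippet below is only an illustrative placeholder for where a prompt safety check hooks into an application. A production deployment should rely on a dedicated safety classifier (for example, a Llama Guard model) and human review rather than a keyword list.

# Hypothetical pre-generation gate; the blocklist is a stand-in, not a real policy.
BLOCKED_PHRASES = ("how to build a weapon", "instructions for self-harm")

def safe_generate(generate_fn, prompt: str) -> str:
    """Refuse obviously disallowed prompts before calling the model."""
    if any(phrase in prompt.lower() for phrase in BLOCKED_PHRASES):
        return "I can't help with that request."
    reply = generate_fn(prompt)
    # An output-side check (toxicity classifier, PII scrub, etc.) would go here.
    return reply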
How to Get Started
# Requires transformers >= 4.43 and access to the gated meta-llama repository
# (accept the license on Hugging Face and authenticate, e.g. with `huggingface-cli login`).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# torch_dtype="auto" loads the weights in their native precision (bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

inputs = tokenizer("Tell me a story about a dragon:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
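For GPUs with limited memory, a quantized load is a common option. The sketch below is a hedged example of 4-bit loading with bitsandbytes; it assumes the optional bitsandbytes and accelerate packages are installed.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization roughly quarters the weight memory footprint (~4 GB for 8B parameters)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)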
Training Details
Training Data
Pre-trained on a cleaned corpus of ~15 trillion publicly available tokens (multilingual text and code). Instruction tuning used publicly available instruction datasets plus over 25M synthetically generated examples across the SFT and RLHF stages (Collabnix, Lifewire, Hugging Face).
Training Procedure
- Preprocessing: Public web, code, and instruction data filtered via Meta classifiers.
- Hyperparameters: see Meta's Llama repository and paper for details; post-training combined SFT and RLHF; context length extended up to 128K tokens.
Speeds, Sizes, Times
- Pretraining: 15 trillion tokens; ~1.46M GPU hours for the 8B model (Collabnix).
- Checkpoint size: ~8B parameters; roughly 16 GB in fp16/bfloat16 and 32 GB in fp32 (see the quick arithmetic below).
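As a quick sanity check on checkpoint size, the arithmetic below estimates weight-only memory from an approximate parameter count; optimizer state, activations, and KV cache are not included.

params = 8.0e9  # approximate parameter count (assumption)
for fmt, bytes_per_param in [("fp32", 4), ("bf16/fp16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"{fmt}: ~{params * bytes_per_param / 1e9:.0f} GB")
# fp32: ~32 GB; bf16/fp16: ~16 GB; int8: ~8 GB; 4-bit: ~4 GB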
Evaluation
Testing Data & Metrics
Benchmarked on multilingual tasks (MMLU, coding, reasoning), outperforming many open and closed models (Hugging Face).
- Instruction-tuned 8B: ~69.4% on MMLU; time-to-first-token ~0.28 s; throughput ~193 tokens/sec (Hugging Face).
Results Summary
| Metric | Value |
|---|---|
| MMLU (instruction-tuned) | ~69.4% |
| Perplexity (The Pile) | ~8.28 (fp16) |
| Throughput | ~192.9 tokens/sec |
| Time-to-first-token | ~0.28 sec |
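The figures above are reported, not reproduced here. The sketch below shows one minimal way to compute perplexity on a short text sample with the base checkpoint; it illustrates the general method, not the exact evaluation setup behind the table.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

sample = "The quick brown fox jumps over the lazy dog."  # any held-out text
enc = tokenizer(sample, return_tensors="pt")
with torch.no_grad():
    loss = model(**enc, labels=enc["input_ids"]).loss  # mean token-level cross-entropy
print("perplexity:", torch.exp(loss).item())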
Environmental Impact
- Pretraining compute: ~1.46M GPU hours (H100s) for 8B; ~15T tokens.
- Estimated CO₂e emissions: use the ML CO₂ Impact calculator for specifics (a rough illustrative estimate follows below).
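As a rough illustration of how the GPU-hour figure translates into an emissions estimate, the arithmetic below assumes a 700 W average per-GPU draw and a generic grid carbon intensity; both numbers are assumptions, and Meta's own accounting (which uses renewable-energy matching) reports far lower market-based emissions.

gpu_hours = 1.46e6         # reported pretraining compute for the 8B model
power_kw = 0.7             # assumed average per-GPU draw, including overhead
carbon_kg_per_kwh = 0.4    # assumed grid carbon intensity
energy_kwh = gpu_hours * power_kw
co2e_tonnes = energy_kwh * carbon_kg_per_kwh / 1000
print(f"~{energy_kwh / 1e6:.2f} GWh, ~{co2e_tonnes:,.0f} tCO2e (illustrative only)")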
Technical Specifications
Architecture
- Decoder-only Transformer with SwiGLU, rotary embeddings, RMSNorm, Grouped-Query Attention (GQA); 32 layers, 8B parameters (arXiv, Prompthub, Collabnix, Wikipedia).
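These architecture numbers can be confirmed directly from the checkpoint's configuration; the sketch below assumes access to the gated repository.

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B")
print(cfg.num_hidden_layers, cfg.num_attention_heads, cfg.num_key_value_heads, cfg.hidden_size)
# Expected for the 8B model: 32 layers, 32 attention heads, 8 KV heads (GQA), hidden size 4096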
Compute Infrastructure
- Pretrained on large Meta GPU clusters, likely H100-based.
Software
- Implemented in PyTorch and Hugging Face Transformers (v4.43+) (Hugging Face).
Citation
@misc{metaai2024llama3,
  title        = {Introducing Llama 3},
  author       = {{Meta AI}},
  howpublished = {\url{https://ai.meta.com/blog/meta-llama-3/}},
  year         = {2024},
  note         = {Version 3.1 released July 23, 2024}
}