latam-gpt/Wayra-Perplexity-Estimator-55M

An A100-optimized TensorRT build of WayraPPL for high-throughput perplexity estimation.

[Figure: WayraPPL architecture]

Use Cases

  • High-throughput data quality assessment: Evaluate the quality of massive datasets by measuring text perplexity at 50,000+ samples/sec
  • Real-time perplexity estimation: Instant perplexity computation for live content filtering and moderation
  • Large-scale dataset cleaning: Process millions of documents to remove low-quality samples before model training (see the sketch after this list)
  • Curriculum learning: Rank training examples by difficulty using perplexity for progressive learning
  • Semantic filtering: Keep or drop content relative to a reference distribution using perplexity thresholds
  • Production MLOps pipelines: Automated data quality gates in production ML workflows
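
As a concrete example of the dataset-cleaning use case, here is a minimal sketch of a perplexity gate. It assumes the WayraPPLTensorRT wrapper and engine file shown under Usage below; the threshold is an illustrative placeholder, not a recommended value.

import numpy as np
from transformers import AutoTokenizer
from tensorrt_inference import WayraPPLTensorRT

PPL_THRESHOLD = 100.0  # illustrative cutoff; tune on a held-out sample of your corpus

tokenizer = AutoTokenizer.from_pretrained("latam-gpt/Wayra-Perplexity-Estimator-55M")
model = WayraPPLTensorRT("wayrappl_fp16_bs2048.engine")

def filter_batch(texts):
    """Keep only texts whose estimated perplexity falls below the cutoff."""
    enc = tokenizer(texts, return_tensors="np", padding=True, truncation=True, max_length=512)
    scores = model.infer(enc["input_ids"], enc["attention_mask"])["ppl"]
    return [t for t, p in zip(texts, np.asarray(scores)) if p < PPL_THRESHOLD]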

Hardware Requirements

The TensorRT engine requires an NVIDIA A100 GPU with the following stack (a quick verification snippet follows this list):

  • GPU Architecture: sm_80 (A100-80GB)
  • CUDA: 12.8+
  • TensorRT: 10.13.x
  • Driver: 570.124.06+
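
A quick way to check most of this stack from Python (a sketch; it assumes torch and tensorrt are importable, and the driver version check still goes through nvidia-smi):

import torch
import tensorrt

# sm_80 corresponds to CUDA compute capability (8, 0), i.e. A100
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: sm_{major}{minor}")          # expect sm_80
print(f"CUDA (as seen by torch): {torch.version.cuda}")  # expect 12.8+
print(f"TensorRT: {tensorrt.__version__}")               # expect 10.13.x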

Performance

TensorRT Performance

  • Throughput: ~50,000+ samples/sec (A100)
  • Latency: <1ms per sample
  • Batch Size: Up to 2048
  • Memory: ~2GB GPU memory

Model Versions

Version          | Throughput  | Latency | Memory | Use Case
-----------------|-------------|---------|--------|-----------------------
TensorRT (A100)  | ~50,000/sec | <1ms    | 2GB    | Production inference
PyTorch Standard | ~1,000/sec  | 10ms    | 4GB    | Research & development

Installation

# Install requirements (A100 + CUDA 12.8+ required)
pip install -r tensorrt_requirements.txt

# Verify TensorRT installation
python -c "import tensorrt; print(tensorrt.__version__)"  # Should be 10.13.x

Usage

TensorRT Engine (High Performance) - RECOMMENDED

from tensorrt_inference import WayraPPLTensorRT
from transformers import AutoTokenizer

# Load TensorRT model (A100 required)
model = WayraPPLTensorRT("wayrappl_fp16_bs2048.engine")
tokenizer = AutoTokenizer.from_pretrained("latam-gpt/Wayra-Perplexity-Estimator-55M")

# Multilingual examples
texts = [
    # Spanish
    "La inteligencia artificial está transformando el mundo.",
    # Portuguese  
    "A tecnologia blockchain promete revolucionar sistemas financeiros.",
    # English
    "Natural language processing enables human-computer communication."
]

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.infer(inputs['input_ids'].numpy(), inputs['attention_mask'].numpy())

for i, text in enumerate(texts):
    print(f"Text: {text}")
    print(f"Perplexity: {outputs['ppl'][i]:.2f}\n")

PyTorch Model (Standard)

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("latam-gpt/Wayra-Perplexity-Estimator-55M")
# Depending on how the repo packages the perplexity head, trust_remote_code=True may be required
model = AutoModel.from_pretrained("latam-gpt/Wayra-Perplexity-Estimator-55M")
model.eval()

texts = ["Your text here"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
print(f"PPL: {outputs['ppl']}")

Performance Comparison: 100K Examples

# TensorRT: ~2 hours for 100,000 examples
# PyTorch: ~28 hours for 100,000 examples  
# Speedup: 14x faster with TensorRT

Files Included

TensorRT Engine (A100-optimized) - PRIMARY

  • wayrappl_fp16_bs2048.engine - TensorRT engine (A100 only)
  • tensorrt_config.json - Engine configuration
  • tensorrt_inference.py - Inference code with multilingual examples
  • tensorrt_requirements.txt - Dependencies

PyTorch Model (Standard HuggingFace format)

  • pytorch_model.bin - Model weights
  • config.json - Model configuration
  • tokenizer.json - Tokenizer

TensorRT Optimizations

The A100-optimized engine includes the following build-time optimizations (a rebuild sketch follows these lists):

Layer Fusion:

  • Embedding + Positional Encoding → Single kernel
  • LayerNorm + Linear → Combined operation
  • Attention QKV projections → Single matrix multiplication
  • Multi-head attention → Fused attention kernel

Memory Optimizations:

  • Intermediate attention matrices eliminated
  • Key/Value cache optimized for batch processing
  • Activation recomputation removed (stored in optimized layout)

Graph Optimizations:

  • Constant folding on positional embeddings
  • Dead code elimination of unused heads
  • Operator fusion for perplexity computation
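
These optimizations are baked into the shipped engine at build time. For other hardware (see the note at the end of this card), an engine like wayrappl_fp16_bs2048.engine could be rebuilt roughly as follows. This is a sketch against the TensorRT 10.x Python API; the ONNX file and the input tensor names are assumptions, since this repo ships only the finished engine.

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit batch is the default in TensorRT 10
parser = trt.OnnxParser(network, logger)

# "wayrappl.onnx" is a hypothetical ONNX export of the PyTorch model
with open("wayrappl.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # the shipped engine is FP16

# Dynamic batch up to 2048, fixed 512-token sequences (tensor names assumed)
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", (1, 512), (1024, 512), (2048, 512))
profile.set_shape("attention_mask", (1, 512), (1024, 512), (2048, 512))
config.add_optimization_profile(profile)

with open("wayrappl_fp16_bs2048.engine", "wb") as f:
    f.write(builder.build_serialized_network(network, config))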

Benchmarks (A100)

Model Type     | Throughput  | Latency | Memory | GPU Util | 100K Examples
---------------|-------------|---------|--------|----------|---------------
Wayra TensorRT | ~50,000/sec | <1ms    | 2GB    | 95%      | ~2 hours
Wayra PyTorch  | ~1,000/sec  | 10ms    | 4GB    | 60%      | ~28 hours
Llama 3 1B     | ~200/sec    | 50ms    | 8GB    | 40%      | ~139 hours

Model Details

  • Base: Knowledge distillation from meta-llama/Llama-3.2-1B (the estimated quantity is spelled out after this list)
  • Architecture: GPT2-based Transformer blocks with perplexity heads
  • Languages: Spanish, Portuguese, English
  • Max Length: 512 tokens
  • Precision: FP16 (TensorRT), FP32 (PyTorch)
  • Parameters: 55M
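
For reference, the quantity being estimated is presumably the standard perplexity the teacher assigns to a text, i.e. the exponential of the mean token-level negative log-likelihood (the exact distillation target is an assumption based on the Base entry above):

PPL(x) = exp( -(1/N) * sum_{i=1..N} log p(x_i | x_{<i}) )

Low scores mean the text looks like fluent, in-distribution language to the teacher; high scores flag noisy, garbled, or out-of-distribution text, which is what makes the estimator usable as a data-quality gate.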

Troubleshooting

"TensorRT engine not compatible"

  • Ensure you're using A100-SXM4-80GB GPU (sm_80 architecture)
  • Check CUDA version: nvidia-smi (should be 12.8+)
  • Verify TensorRT: python -c "import tensorrt; print(tensorrt.__version__)" (should be 10.13.x)
  • Confirm driver version: nvidia-smi (should be 570.124.06+)

"CUDA out of memory"

  • Reduce batch size in inference
  • Use smaller sequence lengths
  • Monitor GPU memory: nvidia-smi -l 1

"Import tensorrt failed"

  • Reinstall TensorRT: pip uninstall tensorrt && pip install tensorrt==10.13.0
  • Check CUDA compatibility
  • Verify LD_LIBRARY_PATH includes TensorRT libs

Performance not as expected

  • Ensure GPU is not throttling: nvidia-smi -q -d PERFORMANCE
  • Use dedicated GPU (not shared)
  • Enable persistence mode: nvidia-smi -pm 1

Citation

@software{WayraPPL,
  title={WayraPPL: High-Performance Perplexity Estimation of Data Novelty},
  author={Omar U. Florez and LatamGPT Team},
  year={2025},
  url={https://huggingface.co/latam-gpt/Wayra-Perplexity-Estimator-55M}
}

License

Apache 2.0 - See LICENSE file


Note: This model is optimized for A100 GPUs. For other GPUs, use the PyTorch version or rebuild the TensorRT engine for your specific hardware (see the sketch under TensorRT Optimizations).
