latam-gpt/Wayra-Perplexity-Estimator-55M

An A100-optimized TensorRT build of WayraPPL for high-throughput perplexity estimation.

[Figure: WayraPPL architecture]

Use Cases

  • High-throughput data quality assessment: Evaluate the quality of massive datasets by measuring text perplexity at 50,000+ samples/sec
  • Real-time perplexity estimation: Instant perplexity computation for live content filtering and moderation
  • Large-scale dataset cleaning: Process millions of documents to remove low-quality samples before model training (see the sketch after this list)
  • Curriculum learning: Rank training examples by difficulty using perplexity for progressive learning
  • Semantic filtering: Keep or drop content relative to a reference distribution using perplexity thresholds
  • Production MLOps pipelines: Automated data quality gates in production ML workflows
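
As a concrete example of the dataset-cleaning use case, here is a minimal sketch of a perplexity gate. It assumes the WayraPPLTensorRT wrapper and engine file shown under Usage below; the threshold is an illustrative placeholder, not a recommended value.

import numpy as np
from transformers import AutoTokenizer
from tensorrt_inference import WayraPPLTensorRT

PPL_THRESHOLD = 100.0  # illustrative cutoff; tune on a held-out sample of your corpus

tokenizer = AutoTokenizer.from_pretrained("latam-gpt/Wayra-Perplexity-Estimator-55M")
model = WayraPPLTensorRT("wayrappl_fp16_bs2048.engine")

def filter_batch(texts):
    """Keep only texts whose estimated perplexity falls below the cutoff."""
    enc = tokenizer(texts, return_tensors="np", padding=True, truncation=True, max_length=512)
    scores = model.infer(enc["input_ids"], enc["attention_mask"])["ppl"]
    return [t for t, p in zip(texts, np.asarray(scores)) if p < PPL_THRESHOLD]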

Hardware Requirements

The TensorRT engine requires an NVIDIA A100 GPU with the following stack (a quick verification snippet follows this list):

  • GPU Architecture: sm_80 (A100-80GB)
  • CUDA: 12.8+
  • TensorRT: 10.13.x
  • Driver: 570.124.06+
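
A quick way to check most of this stack from Python (a sketch; it assumes torch and tensorrt are importable, and the driver version check still goes through nvidia-smi):

import torch
import tensorrt

# sm_80 corresponds to CUDA compute capability (8, 0), i.e. A100
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: sm_{major}{minor}")          # expect sm_80
print(f"CUDA (as seen by torch): {torch.version.cuda}")  # expect 12.8+
print(f"TensorRT: {tensorrt.__version__}")               # expect 10.13.x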

Performance

TensorRT Performance

  • Throughput: ~50,000+ samples/sec (A100)
  • Latency: <1ms per sample
  • Batch Size: Up to 2048
  • Memory: ~2GB GPU memory

Model Versions

Version          | Throughput  | Latency | Memory | Use Case
-----------------|-------------|---------|--------|-----------------------
TensorRT (A100)  | ~50,000/sec | <1ms    | 2GB    | Production inference
PyTorch Standard | ~1,000/sec  | 10ms    | 4GB    | Research & development

Installation

# Install requirements (A100 + CUDA 12.8+ required)
pip install -r tensorrt_requirements.txt

# Verify TensorRT installation
python -c "import tensorrt; print(tensorrt.__version__)"  # Should be 10.13.x

Usage

TensorRT Engine (High Performance) - RECOMMENDED

from tensorrt_inference import WayraPPLTensorRT
from transformers import AutoTokenizer

# Load TensorRT model (A100 required)
model = WayraPPLTensorRT("wayrappl_fp16_bs2048.engine")
tokenizer = AutoTokenizer.from_pretrained("latam-gpt/Wayra-Perplexity-Estimator-55M")

# Multilingual examples
texts = [
    # Spanish
    "La inteligencia artificial está transformando el mundo.",
    # Portuguese  
    "A tecnologia blockchain promete revolucionar sistemas financeiros.",
    # English
    "Natural language processing enables human-computer communication."
]

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.infer(inputs['input_ids'].numpy(), inputs['attention_mask'].numpy())

for i, text in enumerate(texts):
    print(f"Text: {text}")
    print(f"Perplexity: {outputs['ppl'][i]:.2f}\n")

PyTorch Model (Standard)

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("latam-gpt/Wayra-Perplexity-Estimator-55M")
# Depending on how the repo packages the perplexity head, trust_remote_code=True may be required
model = AutoModel.from_pretrained("latam-gpt/Wayra-Perplexity-Estimator-55M")
model.eval()

texts = ["Your text here"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
print(f"PPL: {outputs['ppl']}")

Performance Comparison: 100K Examples

# TensorRT: ~2 hours for 100,000 examples
# PyTorch: ~28 hours for 100,000 examples  
# Speedup: 14x faster with TensorRT

Files Included

TensorRT Engine (A100-optimized) - PRIMARY

  • wayrappl_fp16_bs2048.engine - TensorRT engine (A100 only)
  • tensorrt_config.json - Engine configuration
  • tensorrt_inference.py - Inference code with multilingual examples
  • tensorrt_requirements.txt - Dependencies

PyTorch Model (Standard HuggingFace format)

  • pytorch_model.bin - Model weights
  • config.json - Model configuration
  • tokenizer.json - Tokenizer

TensorRT Optimizations

The A100-optimized engine includes the following build-time optimizations (a rebuild sketch follows these lists):

Layer Fusion:

  • Embedding + Positional Encoding → Single kernel
  • LayerNorm + Linear → Combined operation
  • Attention QKV projections → Single matrix multiplication
  • Multi-head attention → Fused attention kernel

Memory Optimizations:

  • Intermediate attention matrices eliminated
  • Key/Value cache optimized for batch processing
  • Activation recomputation removed (stored in optimized layout)

Graph Optimizations:

  • Constant folding on positional embeddings
  • Dead code elimination of unused heads
  • Operator fusion for perplexity computation
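
These optimizations are baked into the shipped engine at build time. For other hardware (see the note at the end of this card), an engine like wayrappl_fp16_bs2048.engine could be rebuilt roughly as follows. This is a sketch against the TensorRT 10.x Python API; the ONNX file and the input tensor names are assumptions, since this repo ships only the finished engine.

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit batch is the default in TensorRT 10
parser = trt.OnnxParser(network, logger)

# "wayrappl.onnx" is a hypothetical ONNX export of the PyTorch model
with open("wayrappl.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # the shipped engine is FP16

# Dynamic batch up to 2048, fixed 512-token sequences (tensor names assumed)
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", (1, 512), (1024, 512), (2048, 512))
profile.set_shape("attention_mask", (1, 512), (1024, 512), (2048, 512))
config.add_optimization_profile(profile)

with open("wayrappl_fp16_bs2048.engine", "wb") as f:
    f.write(builder.build_serialized_network(network, config))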

Benchmarks (A100)

Model Type     | Throughput  | Latency | Memory | GPU Util | 100K Examples
---------------|-------------|---------|--------|----------|---------------
Wayra TensorRT | ~50,000/sec | <1ms    | 2GB    | 95%      | ~2 hours
Wayra PyTorch  | ~1,000/sec  | 10ms    | 4GB    | 60%      | ~28 hours
Llama 3 1B     | ~200/sec    | 50ms    | 8GB    | 40%      | ~139 hours

Model Details

  • Base: Knowledge distillation from meta-llama/Llama-3.2-1B (the estimated quantity is spelled out after this list)
  • Architecture: GPT2-based Transformer blocks with perplexity heads
  • Languages: Spanish, Portuguese, English
  • Max Length: 512 tokens
  • Precision: FP16 (TensorRT), FP32 (PyTorch)
  • Parameters: 55M
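
For reference, the quantity being estimated is presumably the standard perplexity the teacher assigns to a text, i.e. the exponential of the mean token-level negative log-likelihood (the exact distillation target is an assumption based on the Base entry above):

PPL(x) = exp( -(1/N) * sum_{i=1..N} log p(x_i | x_{<i}) )

Low scores mean the text looks like fluent, in-distribution language to the teacher; high scores flag noisy, garbled, or out-of-distribution text, which is what makes the estimator usable as a data-quality gate.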

Troubleshooting

"TensorRT engine not compatible"

  • Ensure you're using A100-SXM4-80GB GPU (sm_80 architecture)
  • Check CUDA version: nvidia-smi (should be 12.8+)
  • Verify TensorRT: python -c "import tensorrt; print(tensorrt.__version__)" (should be 10.13.x)
  • Confirm driver version: nvidia-smi (should be 570.124.06+)

"CUDA out of memory"

  • Reduce batch size in inference
  • Use smaller sequence lengths
  • Monitor GPU memory: nvidia-smi -l 1

"Import tensorrt failed"

  • Reinstall TensorRT: pip uninstall tensorrt && pip install tensorrt==10.13.0
  • Check CUDA compatibility
  • Verify LD_LIBRARY_PATH includes TensorRT libs

Performance not as expected

  • Ensure GPU is not throttling: nvidia-smi -q -d PERFORMANCE
  • Use dedicated GPU (not shared)
  • Enable persistence mode: nvidia-smi -pm 1

Citation

@software{WayraPPL,
  title={WayraPPL: High-Performance Perplexity Estimation of Data Novelty},
  author={Omar U. Florez and LatamGPT Team},
  year={2025},
  url={https://huggingface.co/latam-gpt/Wayra-Perplexity-Estimator-55M}
}

License

Apache 2.0 - See LICENSE file


Note: This model is optimized for A100 GPUs. For other GPUs, use the PyTorch version or rebuild the TensorRT engine for your specific hardware (see the sketch under TensorRT Optimizations).
