Model Card for lyraLLMs

Introduction

We have released lyraLLMs, a highly optimized and easy-to-use inference engine for LLMs.

lyraLLMs supports the following NVIDIA GPU architectures:

  • Volta (V100)
  • Turing (T4)
  • Ampere (A100/A10)
  • Ada Lovelace (RTX 4090, etc.)

lyraLLMs supports many popular HuggingFace models; see the Convert Models section below for the released conversions.

lyraLLMs is fast, memory-efficient, and easy to use, with:

  • State-of-the-art throughput (up to 7K tokens/s for LLaMA 13B)
  • Memory-efficient attention via FlashAttention2
  • Quantization: MEMOPT mode (W8A16, W4A16), KVCache Int8
  • Easy-to-use Python API to serve LLMs
  • Streaming outputs

If you like our work and are considering joining us, feel free to drop a line at benbinwu@tencent.com

Speed

Settings

  • Throughput measured in tokens/s (input + output); a minimal timing sketch follows this list
  • Tested on A100 40G, CUDA 12.0
  • MEMOPT mode and KVCache Int8 enabled
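
Below is a minimal sketch of how these numbers can be reproduced, reusing the lyraLlama API shown in the Python Demo section further down. Loading a HuggingFace tokenizer from the converted model directory is an assumption, as is generate returning only the completions (not the prompts):

import time
from transformers import AutoTokenizer
from lyra_llama import lyraLlama

model_path = 'XXX'  # converted model directory, as in the Python Demo below
model = lyraLlama(model_path, 'fp16', 1)  # memopt_mode=1: MEMOPT mode, matching the settings above
tokenizer = AutoTokenizer.from_pretrained(model_path)  # assumption: HF tokenizer files ship with the converted model

prompts = ['北京的景点:故宫、天坛、万里长城等。\n深圳的景点:'] * 64

start = time.time()
outputs = model.generate(prompts, output_length=150, do_sample=False,
                         top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0)
elapsed = time.time() - start

# tokens/s = (input + output) tokens summed over the whole batch, divided by wall time
total_tokens = sum(len(tokenizer.encode(p)) + len(tokenizer.encode(o))
                   for p, o in zip(prompts, outputs))
print(f'{total_tokens / elapsed:.1f} tokens/s')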

Throughputs

XVERSE-13B-Chat

Input

北京的景点:故宫、天坛、万里长城等。\n深圳的景点:
(English: "Beijing attractions: the Forbidden City, the Temple of Heaven, the Great Wall, etc.\nShenzhen attractions:")

| Version     | Batch Size 1 | Batch Size 64 | Batch Size 128 | Batch Size 256 | Batch Size 512 |
|-------------|--------------|---------------|----------------|----------------|----------------|
| Torch 2.1.0 | 52.9         | 2308.1        | OOM            | -              | -              |
| lyraXVERSE  | 200.4        | 4624.8        | 5759.7         | 6075.6         | 5733           |

Baichuan2-7B-Base

Input

北京的景点:登鹳雀楼->王之涣\n夜雨寄北->
(English: "Beijing attractions: On the Stork Tower -> Wang Zhihuan\nNight Rain, a Letter to the North ->", a poem-title-to-author completion prompt)

| Version      | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
|--------------|--------------|--------------|---------------|---------------|---------------|
| Torch 2.0.1  | 41.2         | 323.2        | 640.0         | 1256.8        | 2231.0        |
| lyraBaichuan | 125.9        | 948.1        | 1749.3        | 2974.0        | 4370.1        |

Baichuan2-13B-Base

Input

北京的景点:登鹳雀楼->王之涣\n夜雨寄北->
(English: "Beijing attractions: On the Stork Tower -> Wang Zhihuan\nNight Rain, a Letter to the North ->", a poem-title-to-author completion prompt)

| Version      | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
|--------------|--------------|--------------|---------------|---------------|---------------|
| Torch 2.0.1  | 40.9         | 307.9        | 555.6         | 1010.4        | 1601.0        |
| lyraBaichuan | 80.0         | 568.2        | 1124.4        | 1942.6        | 2828.0        |

Yi-6B

Input

# write the quick sort algorithm

| Version     | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
|-------------|--------------|--------------|---------------|---------------|---------------|
| Torch 2.1.0 | 31.4         | 247.5        | 490.4         | 987.2         | 1796.3        |
| lyraLLaMA   | 93.8         | 735.6        | 2339.8        | 3020.9        | 4630.8        |

Yi-34B

Due to VRAM limitations, we could not profile the throughput of Yi-34B on an A100 40G with Torch.

Input

Let me tell you an interesting story about cat Tom and mouse Jerry,

| Version   | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
|-----------|--------------|--------------|---------------|---------------|---------------|
| lyraLLaMA | 52.5         | 399.4        | 753.0         | 1138.2        | 1926.2        |

Usage

Environment (Docker recommended)

  • For CUDA 11.x: we recommend nvcr.io/nvidia/pytorch:22.12-py3
  • For CUDA 12.0: we recommend nvcr.io/nvidia/pytorch:23.02-py3

docker pull nvcr.io/nvidia/pytorch:23.02-py3
docker run --rm -it --gpus all -v ./:/lyraLLMs nvcr.io/nvidia/pytorch:23.02-py3

Inside the container, install the Python dependencies:

pip install -r requirements.txt
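
Optionally, you can verify that the container sees the GPU and the expected CUDA runtime. This check is not part of the original setup steps; it only uses the torch package preinstalled in the NGC PyTorch images.

import torch

# Expect True, the GPU name (e.g. an A100), and the CUDA version baked into the image.
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
print(torch.version.cuda)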

Convert Models

We have released multiple optimized models converted from the original HuggingFace checkpoints:

  • ChatGLM-6B
  • XVERSE-13B-Chat
  • LLaMA-Ziya-13B
  • Baichuan-7B, Baichuan-13B-Base, Baichuan-13B-Chat, Baichuan2-7B-Base, Baichuan2-7B-Chat, Baichuan2-13B-Base and Baichuan2-13B-Chat
  • Yi-6B, Yi-34B

Feel free to contact us if you would like to convert a fine-tuned LLM of your own.

Inference

Refer to README.md for how to run inference on converted models with lyraLLMs.

Python Demo

from lyra_llama import lyraLlama

model_path = 'XXX' # directory containing the converted model weights, config, and tokenizer files
data_type = 'fp16'
memopt_mode = 0 # set memopt_mode = 1 to run inference in MEMOPT mode

model = lyraLlama(model_path, data_type, memopt_mode)

# Prompt (Chinese): "List 3 different machine learning algorithms and explain where each applies."
prompts = '列出3个不同的机器学习算法,并说明它们的适用范围.'
prompts = [prompts,] * 64 # replicate the prompt into a batch of 64

output_texts = model.generate(prompts, output_length=150, do_sample=False, top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0)
print(output_texts)
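
Note that the single prompt is replicated into a batch of 64, the same batch size used in the Baichuan and Yi throughput tables above; generate processes the whole batch in one call, and output_texts holds one completion per prompt.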

Citation

@Misc{lyraLLMs2024,
  author =       {Kangjian Wu and Zhengtao Wang and Yibo Lu and Haoxiong Su and Bin Wu},
  title =        {lyraLLMs: A highly optimized and easy-to-use inference engine for LLMs},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraLLMs}},
  year =         {2024}
}
