Snowflake/Llama-3.1-SwiftKV-8B-Instruct-FP8


jeffra committed on Dec 5, 2024 · Commit 39e7c16 (verified) · 1 Parent(s): 81e7fdd

Update README.md

Files changed (1)

README.md +8 -0


@@ -13,6 +13,14 @@ For more details about SwiftKV and how to use it:
 * 📝 [SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation (arXiv)](https://arxiv.org/abs/2410.03960)
 * 🚀 [Getting started guide](https://github.com/Snowflake-Labs/vllm/tree/swiftkv/examples/swiftkv)
+## Performance Metrics
+Combined input and output throughput for Llama 3.1 405B across a range of input lengths.
+<img src="figure-4.png" alt="performance plot of llama-405B w. swiftkv" width="400">
+Legend: blue - baseline FP8, pink - SwiftKV FP8<br>
 ## Eval Metrics
 For a full breakdown on evaluation metrics and performance impact please refer to our [blog](https://www.snowflake.com/engineering-blog/swiftkv-llm-compute-reduction/) and [arXiv paper]((https://arxiv.org/abs/2410.03960)) but below we've outlined some relevant evaluation metrics.