AI & ML interests
Knowledge Distillation, Pruning, Quantization, KV Cache Compression, Latency, Inference Speed
Recent Activity
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
  Paper • 2407.11062 • Published • 9
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
  Paper • 2210.17323 • Published • 8
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
  Paper • 2306.00978 • Published • 9
- A Survey on Model Compression for Large Language Models
  Paper • 2308.07633 • Published • 3
- A Survey on Efficient Inference for Large Language Models
  Paper • 2404.14294 • Published • 3
- Model Compression and Efficient Inference for Large Language Models: A Survey
  Paper • 2402.09748 • Published • 1
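
The quantization papers listed above (GPTQ, AWQ, EfficientQAT) all build on the same baseline idea: mapping floating-point weights to low-bit integers with a per-channel scale. Below is a minimal round-to-nearest sketch of that baseline in Python; it is illustrative only, not any of those methods (which add calibration data, activation awareness, or error compensation), and all names in it are hypothetical.

```python
# Minimal round-to-nearest (RTN) weight quantization sketch.
# Illustrative baseline only; GPTQ/AWQ/EfficientQAT go beyond this.
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int = 4):
    """Symmetric per-output-channel round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1                        # e.g. 7 for 4-bit
    # One scale per output channel (row); guard against all-zero rows.
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover approximate floating-point weights."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    w = np.random.randn(8, 16).astype(np.float32)     # toy weight matrix
    q, scale = quantize_rtn(w, bits=4)
    w_hat = dequantize(q, scale)
    print("mean abs reconstruction error:", np.abs(w - w_hat).mean())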