Possible Improvements / Future Options

#1
by Koitenshin - opened

Future Improvements could include:

  • Even longer Context Length. Some models, such as Nemotron Nano v2, support a 1-million-token context length (i.e., 1048576)
  • Evaluation Batch Size
  • Flash Attention / Sage Attention / Triton / etc.
  • K / V Cache Quantization Type
  • More Quantization Types (e.g., Q4_K_M, Q4_K_L, Q6_K, etc.)
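
For reference, most of the options above already exist as command-line flags in llama.cpp, so a sketch of the equivalent settings there may be useful. The model filename and the specific values below are hypothetical examples, not recommendations, and exact flag names can vary between llama.cpp builds:

```shell
# Hypothetical llama.cpp invocation illustrating the options listed above.
# Model path and values are illustrative only.
llama-cli -m ./model.Q4_K_M.gguf \
  --ctx-size 1048576 \
  --batch-size 2048 \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

Here `--ctx-size` sets the context length, `--batch-size` the evaluation batch size, `--flash-attn` enables Flash Attention, and `--cache-type-k` / `--cache-type-v` select the K/V cache quantization type.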

I frequently use an evaluation batch size (EBS) of 2048, because who wants to wait?
Flash Attention is very model-dependent, as some models never load with FA turned on.

This tool is very helpful, thanks for making it.
