Alternative quantizations.

#13
by ZeroWw - opened

ZeroWw/gemma-2-9b-it-GGUF

These are my own quantizations (updated almost daily).
The difference from normal quantizations is that I quantize the output and embedding tensors to f16,
and the other tensors to q5_k, q6_k, or q8_0.
This produces models that are only slightly degraded, or not degraded at all, while still being smaller in size.
They run at about 3-6 t/sec on CPU only using llama.cpp,
and obviously faster on computers with potent GPUs.
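A minimal sketch of how such a mixed-precision file could be produced with llama.cpp's `llama-quantize` tool, assuming a recent build that supports per-tensor type overrides; the flag names, file paths, and the Q5_K_M target here are my assumptions for illustration, not taken from this post:

```shell
# Starting from an f16 GGUF (path is hypothetical), quantize the bulk of
# the tensors to Q5_K_M while pinning the output and token-embedding
# tensors to f16, as described above.
./llama-quantize \
  --output-tensor-type f16 \
  --token-embedding-type f16 \
  gemma-2-9b-it.f16.gguf \
  gemma-2-9b-it.q5_k.gguf \
  Q5_K_M
```

The same pattern applies for q6_k or q8_0 bodies; only the final type argument changes.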

Example Usage:
llama-cli -m /content/gemma-2-9b-it.q5_k.gguf -t 2 -ngl 99 -p "User: Hi\nBot:Hi\nUser: Tell me all you know about LLMs in 1000 tokens.\nBot:"

Large Language Models (LLMs) are a type of artificial intelligence (AI) that excel at understanding and generating human-like text. They are trained on massive datasets of text and code, enabling them to learn patterns, grammar, and contextual nuances of language.

Key Characteristics of LLMs:

  • Generative: LLMs can create new text, translate languages, write different kinds of creative content, and answer your questions in an informative way.
  • Contextual Understanding: They can analyze text and understand the relationships between words and sentences, allowing for more coherent and meaningful responses.
  • Scale and Training Data: LLMs are typically trained on vast amounts of data, which is crucial for their performance and ability to generalize to new tasks.
  • Transformer Architecture: Many powerful LLMs, like GPT-3 and BERT, are based on the transformer architecture, which allows them to process and understand long-range dependencies in text.

Applications of LLMs:

  • Chatbots and Conversational AI: LLMs power chatbots that can engage in natural-sounding conversations with humans.
  • Text Generation: They can generate creative content such as stories, poems, articles, and marketing copy.
  • Language Translation: LLMs can translate text from one language to another with high accuracy.
  • Code Generation: Some LLMs have been trained on code and can assist developers in writing and debugging code.
  • Summarization and Information Extraction: LLMs can summarize large amounts of text and extract key information.

Challenges and Considerations:

  • Bias and Fairness: LLMs can inherit biases present in the training data, leading to unfair or discriminatory outputs.
  • Explainability: It can be difficult to understand how LLMs arrive at their outputs, which can raise concerns about transparency and accountability.
  • Misinformation and Malicious Use: LLMs can be used to generate convincing fake news, propaganda, or spam.

Future Directions:

Research in LLMs is rapidly progressing, with ongoing efforts to address the challenges and explore new applications. Some key areas of development include:

  • Improving Fairness and Bias Mitigation: Techniques are being developed to identify and mitigate biases in LLMs.
  • Enhancing Explainability: Researchers are working on methods to make LLM decision-making more transparent.
  • Multimodality: Integrating LLMs with other modalities, such as vision and audio, to enable more comprehensive understanding and generation.

Let me know if you have any more questions about LLMs! [end of text]

Statistics on colab CPU ONLY:

llama_print_timings: load time = 51762.26 ms
llama_print_timings: sample time = 226.65 ms / 522 runs ( 0.43 ms per token, 2303.16 tokens per second)
llama_print_timings: prompt eval time = 27039.12 ms / 30 tokens ( 901.30 ms per token, 1.11 tokens per second)
llama_print_timings: eval time = 627527.87 ms / 521 runs ( 1204.47 ms per token, 0.83 tokens per second)
llama_print_timings: total time = 656354.94 ms / 551 tokens
Log end
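The size side of this trade-off can be sketched with back-of-the-envelope arithmetic. The bits-per-weight figures below are approximate values for llama.cpp's quant formats, and the parameter split between embedding and body tensors is a hypothetical round number, not Gemma's real one:

```python
# Rough size estimate for a mixed-precision GGUF quantization, showing
# the cost of keeping only the embedding/output tensors at f16 while
# the rest of the model stays at q5_k.

# Approximate effective bits per weight for each format (assumption).
BITS = {"f16": 16.0, "q8_0": 8.5, "q6_k": 6.56, "q5_k": 5.5}

def size_gb(params, fmt):
    """Size in GB of `params` weights stored at the given bits/weight."""
    return params * BITS[fmt] / 8 / 1e9

embed_params = 0.9e9   # embedding/output tensors (hypothetical count)
body_params  = 8.1e9   # all other tensors (hypothetical count)

pure_q5  = size_gb(embed_params + body_params, "q5_k")
mixed_q5 = size_gb(embed_params, "f16") + size_gb(body_params, "q5_k")

print(f"pure q5_k everywhere:               {pure_q5:.2f} GB")
print(f"f16 embed/output + q5_k body:       {mixed_q5:.2f} GB")
```

Under these assumptions the mixed file is only about a gigabyte larger than a uniform q5_k quantization, while the most quality-sensitive tensors stay at full f16 precision.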

Google org

Hi @ZeroWw , This is great work, and thank you for sharing your custom quantization method and detailed results with the community.

Finding the right balance between model size, CPU performance, and response quality is crucial for making these models accessible to users without high-end GPUs. Your contribution, along with the detailed performance logs and clear example usage, is a fantastic resource for others looking to run models on their own machines.

This is a great contribution. Keep up the fantastic work! 👍

Thank you.
