Alternative quantizations.

#13
by ZeroWw - opened

ZeroWw/gemma-2-9b-it-GGUF

These are my own quantizations (updated almost daily).
The difference from normal quantizations is that I quantize the output and embedding tensors to f16,
and the other tensors to q5_k, q6_k, or q8_0.
This produces models that are only slightly degraded, or not degraded at all, while still being smaller in size.
They run at about 3-6 t/sec on CPU only using llama.cpp,
and obviously faster on computers with potent GPUs.
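A minimal sketch of how such a mixed-precision file could be produced with llama.cpp's `llama-quantize` tool, assuming a recent build that supports per-tensor type overrides; the flag names, file paths, and the Q5_K_M target here are my assumptions for illustration, not taken from this post:

```shell
# Starting from an f16 GGUF (path is hypothetical), quantize the bulk of
# the tensors to Q5_K_M while pinning the output and token-embedding
# tensors to f16, as described above.
./llama-quantize \
  --output-tensor-type f16 \
  --token-embedding-type f16 \
  gemma-2-9b-it.f16.gguf \
  gemma-2-9b-it.q5_k.gguf \
  Q5_K_M
```

The same pattern applies for q6_k or q8_0 bodies; only the final type argument changes.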

Example Usage:
llama-cli -m /content/gemma-2-9b-it.q5_k.gguf -t 2 -ngl 99 -p "User: Hi\nBot:Hi\nUser: Tell me all you know about LLMs in 1000 tokens.\nBot:"

Large Language Models (LLMs) are a type of artificial intelligence (AI) that excel at understanding and generating human-like text. They are trained on massive datasets of text and code, enabling them to learn patterns, grammar, and contextual nuances of language.

Key Characteristics of LLMs:

  • Generative: LLMs can create new text, translate languages, write different kinds of creative content, and answer your questions in an informative way.
  • Contextual Understanding: They can analyze text and understand the relationships between words and sentences, allowing for more coherent and meaningful responses.
  • Scale and Training Data: LLMs are typically trained on vast amounts of data, which is crucial for their performance and ability to generalize to new tasks.
  • Transformer Architecture: Many powerful LLMs, like GPT-3 and BERT, are based on the transformer architecture, which allows them to process and understand long-range dependencies in text.

Applications of LLMs:

  • Chatbots and Conversational AI: LLMs power chatbots that can engage in natural-sounding conversations with humans.
  • Text Generation: They can generate creative content such as stories, poems, articles, and marketing copy.
  • Language Translation: LLMs can translate text from one language to another with high accuracy.
  • Code Generation: Some LLMs have been trained on code and can assist developers in writing and debugging code.
  • Summarization and Information Extraction: LLMs can summarize large amounts of text and extract key information.

Challenges and Considerations:

  • Bias and Fairness: LLMs can inherit biases present in the training data, leading to unfair or discriminatory outputs.
  • Explainability: It can be difficult to understand how LLMs arrive at their outputs, which can raise concerns about transparency and accountability.
  • Misinformation and Malicious Use: LLMs can be used to generate convincing fake news, propaganda, or spam.

Future Directions:

Research in LLMs is rapidly progressing, with ongoing efforts to address the challenges and explore new applications. Some key areas of development include:

  • Improving Fairness and Bias Mitigation: Techniques are being developed to identify and mitigate biases in LLMs.
  • Enhancing Explainability: Researchers are working on methods to make LLM decision-making more transparent.
  • Multimodality: Integrating LLMs with other modalities, such as vision and audio, to enable more comprehensive understanding and generation.

Let me know if you have any more questions about LLMs! [end of text]

Statistics on colab CPU ONLY:

llama_print_timings: load time = 51762.26 ms
llama_print_timings: sample time = 226.65 ms / 522 runs ( 0.43 ms per token, 2303.16 tokens per second)
llama_print_timings: prompt eval time = 27039.12 ms / 30 tokens ( 901.30 ms per token, 1.11 tokens per second)
llama_print_timings: eval time = 627527.87 ms / 521 runs ( 1204.47 ms per token, 0.83 tokens per second)
llama_print_timings: total time = 656354.94 ms / 551 tokens
Log end
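The size side of this trade-off can be sketched with back-of-the-envelope arithmetic. The bits-per-weight figures below are approximate values for llama.cpp's quant formats, and the parameter split between embedding and body tensors is a hypothetical round number, not Gemma's real one:

```python
# Rough size estimate for a mixed-precision GGUF quantization, showing
# the cost of keeping only the embedding/output tensors at f16 while
# the rest of the model stays at q5_k.

# Approximate effective bits per weight for each format (assumption).
BITS = {"f16": 16.0, "q8_0": 8.5, "q6_k": 6.56, "q5_k": 5.5}

def size_gb(params, fmt):
    """Size in GB of `params` weights stored at the given bits/weight."""
    return params * BITS[fmt] / 8 / 1e9

embed_params = 0.9e9   # embedding/output tensors (hypothetical count)
body_params  = 8.1e9   # all other tensors (hypothetical count)

pure_q5  = size_gb(embed_params + body_params, "q5_k")
mixed_q5 = size_gb(embed_params, "f16") + size_gb(body_params, "q5_k")

print(f"pure q5_k everywhere:               {pure_q5:.2f} GB")
print(f"f16 embed/output + q5_k body:       {mixed_q5:.2f} GB")
```

Under these assumptions the mixed file is only about a gigabyte larger than a uniform q5_k quantization, while the most quality-sensitive tensors stay at full f16 precision.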

Google org

Hi @ZeroWw , This is great work, and thank you for sharing your custom quantization method and detailed results with the community.

Finding the right balance between model size, CPU performance, and response quality is crucial for making these models accessible to users without high-end GPUs. Your contribution, along with the detailed performance logs and clear example usage, is a fantastic resource for others looking to run models on their own machines.

This is a great contribution. Keep up the fantastic work! 👍

Thank you.
