Update README.md

README.md CHANGED

@@ -25,83 +25,139 @@ Gemma v2 is a large language model released by Google on Jun 27th 2024.

The model is packaged into executable weights, which we call
[llamafiles](https://github.com/Mozilla-Ocho/llamafile). This makes it
-easy to use the model on Linux, MacOS, Windows, FreeBSD, OpenBSD,
-NetBSD for AMD64 and ARM64.
-
-## License
-
-The llamafile software is open source and permissively licensed. However
-the weights embedded inside the llamafiles are governed by Google's
-Gemma License and Gemma Prohibited Use Policy. This is not an open
-source license. It's about as restrictive as it gets. There's a great
-many things you're not allowed to do with Gemma. The terms of the
-license and its list of unacceptable uses can be changed by Google at
-any time. Therefore we wouldn't recommend using these llamafiles for
-anything other than evaluating the quality of Google's engineering.
-
-See the [LICENSE](LICENSE) file for further details.

## Quickstart

-
-

```
-wget https://huggingface.co/
chmod +x gemma-2-9b-it.Q6_K.llamafile
./gemma-2-9b-it.Q6_K.llamafile
```

-This model has a max context window size of 8k tokens. By default, a
-context window size of 512 tokens is used. You may increase this to the
-maximum by passing the `-c 0` flag.
-
-On GPUs with sufficient RAM, the `-ngl 999` flag may be passed to use
-the system's NVIDIA or AMD GPU(s). On Windows, only the graphics card
-driver needs to be installed. If the prebuilt DSOs should fail, the CUDA
-or ROCm SDKs may need to be installed, in which case llamafile builds a
-native module just for your system.
-
-For further information, please see the [llamafile
-README](https://github.com/mozilla-ocho/llamafile/).

Having **trouble?** See the ["Gotchas"
-section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas)
of the README.

-##

-

-

```
-
-<start_of_turn>{{char}}
```

-

```
-
-{{message}}<end_of_turn>
```

-

```
-
-
-
-
```

## About llamafile

-llamafile is a new format introduced by Mozilla
-
binaries that run on the stock installs of six OSes for both ARM64 and
AMD64.

@@ -109,13 +165,25 @@ AMD64.

This model works well with any quantization format. Q6\_K is the best
choice overall here. We tested, with [our 27b Gemma2
-llamafiles](https://huggingface.co/
that the llamafile implementation of Gemma2 is able to produce
identical responses to the Gemma2 model that's hosted by Google on
aistudio.google.com. Therefore we'd assume these 9b llamafiles are also
faithful to Google's intentions. If you encounter any divergences, then
try using the BF16 weights, which have the original fidelity.

---

# Gemma 2 model card

@@ -25,83 +25,139 @@ Gemma v2 is a large language model released by Google on Jun 27th 2024.

The model is packaged into executable weights, which we call
[llamafiles](https://github.com/Mozilla-Ocho/llamafile). This makes it
+easy to use the model on Linux, MacOS, Windows, FreeBSD, OpenBSD 7.3,
+and NetBSD for AMD64 and ARM64.

## Quickstart

+To get started, you need both the Gemma weights and the llamafile
+software. Both of them are included in a single file, which can be
+downloaded and run as follows:

```
+wget https://huggingface.co/Mozilla/gemma-2-9b-it-llamafile/resolve/main/gemma-2-9b-it.Q6_K.llamafile
chmod +x gemma-2-9b-it.Q6_K.llamafile
./gemma-2-9b-it.Q6_K.llamafile
```
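
If wget isn't available on your system, the same file can be fetched
with curl instead. A sketch of the equivalent download; any HTTP client
that follows redirects will do:

```
curl -L -o gemma-2-9b-it.Q6_K.llamafile https://huggingface.co/Mozilla/gemma-2-9b-it-llamafile/resolve/main/gemma-2-9b-it.Q6_K.llamafile
```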

+The default mode of operation for these llamafiles is our new command
+line chatbot interface.

+

Having **trouble?** See the ["Gotchas"
+section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
of the README.

+## Usage

+By default, llamafile launches a chatbot in the terminal, and a server
+in the background. The chatbot is mostly self-explanatory. You can type
+`/help` for further details. See the [llamafile v0.8.15 release
+notes](https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.15)
+for documentation on our newest chatbot features.

+To instruct Gemma to do role playing, you can customize the system
+prompt as follows:

```
+./gemma-2-9b-it.Q6_K.llamafile --chat -p "you are mosaic's godzilla"
```
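
Longer system prompts get unwieldy inline. One workaround, sketched
here under the assumption that you keep the prompt in a local file
named system_prompt.txt, is to splice it in with command substitution:

```
./gemma-2-9b-it.Q6_K.llamafile --chat -p "$(cat system_prompt.txt)"
```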

+To view the man page, run:

```
+./gemma-2-9b-it.Q6_K.llamafile --help
```

+To send a request to the OpenAI API compatible llamafile server, try:

```
+curl http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "gemma-9b-it",
+    "messages": [{"role": "user", "content": "Say this is a test!"}],
+    "temperature": 0.0
+  }'
```
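
The same endpoint can also stream tokens as they're generated, in the
usual OpenAI style. A sketch, assuming the server is on its default
port of 8080; the reply arrives as a series of server-sent events:

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-9b-it",
    "messages": [{"role": "user", "content": "Write a haiku about RAM."}],
    "stream": true
  }'
```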

+If you don't want the chatbot and you only want to run the server:

```
+./gemma-2-9b-it.Q6_K.llamafile --server --nobrowser --host 0.0.0.0
```
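
To confirm the server came up, one quick check (a sketch, assuming the
standard OpenAI-style model listing endpoint is exposed) is:

```
curl http://localhost:8080/v1/models
```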

+An advanced CLI mode is provided that's useful for shell scripting. You
+can use it by passing the `--cli` flag. For additional help on how it
+may be used, pass the `--help` flag.

```
+./gemma-2-9b-it.Q6_K.llamafile --cli -p 'four score and seven' --log-disable
```

+You then need to fill out the prompt / history template (see below).

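As a rough illustration, a filled-out turn might look like the sketch
below. It assumes Gemma's `<start_of_turn>` / `<end_of_turn>` chat
delimiters; the authoritative template is in the model card further
down:

```
./gemma-2-9b-it.Q6_K.llamafile --cli --log-disable -p '<start_of_turn>user
Why is the sky blue?<end_of_turn>
<start_of_turn>model
'
```
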
+For further information, please see the [llamafile
+README](https://github.com/mozilla-ocho/llamafile/).

+## Troubleshooting

+Having **trouble?** See the ["Gotchas"
+section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
+of the README.

+On Linux, the way to avoid run-detector errors is to install the APE
+interpreter.

```sh
+sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
+sudo chmod +x /usr/bin/ape
+sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
+sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
```
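
If the registration took effect, the kernel exposes an entry named
after the rule; a quick check (a sketch) is:

```
cat /proc/sys/fs/binfmt_misc/APE
```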

+On Windows there's a 4GB limit on executable sizes. This means you
+should download the Q2\_K llamafile. For better quality, consider
+instead downloading the official llamafile release binary from
+<https://github.com/Mozilla-Ocho/llamafile/releases>, renaming it to
+have the .exe file extension, and then saying:

```
+.\llamafile-0.8.15.exe -m gemma-2-9b-it.Q6_K.llamafile
```

+That will overcome the Windows 4GB file size limit, allowing you to
+benefit from bigger, better models.

+## Context Window

+This model has a max context window size of 8k tokens. By default, a
+context window size of 8192 tokens is used. You may limit the context
+window size by passing the `-c N` flag.

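For example, to cap the context at 4096 tokens, halving the memory the
chat history (KV cache) consumes, one might run:

```
./gemma-2-9b-it.Q6_K.llamafile --chat -c 4096
```
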
+## GPU Acceleration

+On GPUs with sufficient RAM, the `-ngl 999` flag may be passed to use
+the system's NVIDIA or AMD GPU(s). On Windows, if you own an NVIDIA
+GPU, only the graphics card driver needs to be installed. If you have
+an AMD GPU on Windows, you should install the ROCm SDK v6.1 and then
+pass the flags `--recompile --gpu amd` the first time you run your
+llamafile.

+On NVIDIA GPUs, by default, the prebuilt tinyBLAS library is used to
+perform matrix multiplications. This is open source software, but it
+doesn't go as fast as closed source cuBLAS. If you have the CUDA SDK
+installed on your system, then you can pass the `--recompile` flag to
+build a GGML CUDA library just for your system that uses cuBLAS. This
+ensures you get maximum performance.

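Putting those flags together, a typical GPU run might look like the
sketch below; the `--recompile` pass matters only once, and only when
the relevant SDK is installed:

```
# offload all model layers to the GPU
./gemma-2-9b-it.Q6_K.llamafile --chat -ngl 999

# first run only, with the CUDA SDK installed: build a native cuBLAS module
./gemma-2-9b-it.Q6_K.llamafile --chat -ngl 999 --recompile
```
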
+For further information, please see the [llamafile
+README](https://github.com/mozilla-ocho/llamafile/).

## About llamafile

+llamafile is a new format introduced by Mozilla on Nov 20th 2023. It
+uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp
binaries that run on the stock installs of six OSes for both ARM64 and
AMD64.

This model works well with any quantization format. Q6\_K is the best
choice overall here. We tested, with [our 27b Gemma2
+llamafiles](https://huggingface.co/Mozilla/gemma-2-27b-it-llamafile),
that the llamafile implementation of Gemma2 is able to produce
identical responses to the Gemma2 model that's hosted by Google on
aistudio.google.com. Therefore we'd assume these 9b llamafiles are also
faithful to Google's intentions. If you encounter any divergences, then
try using the BF16 weights, which have the original fidelity.

+## See Also

+- <https://huggingface.co/Mozilla/gemma-2-2b-it-llamafile>
+- <https://huggingface.co/Mozilla/gemma-2-27b-it-llamafile>

+## License

+The llamafile software is open source and permissively licensed. However,
+the weights embedded inside the llamafiles are governed by Google's
+Gemma License and Gemma Prohibited Use Policy. See the
+[LICENSE](LICENSE) file for further details.

---

# Gemma 2 model card