Tanvir1337/BanglaLLama-3-8b-BnWiki-Instruct-GGUF
This model has been quantized using llama.cpp, a high-performance inference engine for large language models.
System Prompt Format
To interact with the model, use the following prompt format:
{System}
### Prompt:
{User}
### Response:
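As a minimal sketch of filling in this template, the following Python helper substitutes the two placeholders. The function name `build_prompt` is hypothetical, and the exact whitespace (single newlines between parts, a trailing newline after `### Response:`) is an assumption, since the format above does not specify it.

```python
def build_prompt(system: str, user: str) -> str:
    """Fill the {System} and {User} slots of the prompt format above."""
    # Assumption: parts are separated by single newlines, and the prompt
    # ends with a newline after the response marker so the model's
    # answer starts on a fresh line.
    return f"{system}\n### Prompt:\n{user}\n### Response:\n"


if __name__ == "__main__":
    # Example inputs are illustrative only.
    print(build_prompt("You are a helpful assistant.",
                       "What is the capital of Bangladesh?"))
```

The resulting string can be passed as the raw prompt to whichever GGUF-capable inference engine you use.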
Usage Instructions
If you're new to using GGUF files, refer to TheBloke's README for detailed instructions.
Quantization Options
[Graph comparing various quantization types; lower is better.]
For more information on quantization, see Artefact2's notes.
Choosing the Right Model File
To select the optimal model file, consider the following factors:
- Memory constraints: Determine how much RAM and/or VRAM you have available.
- Speed vs. quality: If you prioritize speed, choose a model file that fits entirely within your GPU's VRAM. If you prioritize quality, choose the largest model file that fits within your system's combined RAM and VRAM, accepting slower inference.
Quantization formats:
- K-quants (e.g., Q5_K_M): A good starting point, offering a balance between speed and quality.
- I-quants (e.g., IQ3_M): Newer and more size-efficient, but may require a specific backend build of the inference engine (e.g., cuBLAS for NVIDIA or rocBLAS for AMD).
Hardware compatibility:
- I-quants: Not compatible with the Vulkan build, which is commonly used with AMD cards. If you have an AMD card, ensure you're using the rocBLAS build or a compatible inference engine.
For more information on the features and trade-offs of each quantization format, refer to the llama.cpp feature matrix.
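The selection advice above can be sketched as a small helper. This is only an illustration: `pick_quant`, the `headroom_gb` margin, and the file sizes in the usage example are hypothetical placeholders, not measurements of this repository's files. It also assumes that a larger file means higher quality, which holds only roughly across quantization types.

```python
from typing import Dict, Optional


def pick_quant(
    file_sizes_gb: Dict[str, float],
    vram_gb: float,
    ram_gb: float = 0.0,
    prioritize_speed: bool = True,
    headroom_gb: float = 1.0,
) -> Optional[str]:
    """Return the name of the largest file that fits the memory budget, or None."""
    # Speed: stay within VRAM only. Quality: allow RAM and VRAM combined.
    budget = vram_gb if prioritize_speed else vram_gb + ram_gb
    # Assumed safety margin for context and runtime overhead (hypothetical value).
    budget -= headroom_gb
    fitting = {name: size for name, size in file_sizes_gb.items() if size <= budget}
    if not fitting:
        return None
    # Largest fitting file, used here as a rough proxy for highest quality.
    return max(fitting, key=lambda name: fitting[name])


if __name__ == "__main__":
    # Hypothetical sizes in GB; check the actual file listing before choosing.
    sizes = {"Q4_K_M": 4.9, "Q5_K_M": 5.7, "Q8_0": 8.5}
    print(pick_quant(sizes, vram_gb=8.0))
    print(pick_quant(sizes, vram_gb=8.0, ram_gb=16.0, prioritize_speed=False))
```

The helper deliberately ignores the K-quant versus I-quant distinction; apply the hardware compatibility notes above before settling on a file.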
Model tree for Tanvir1337/BanglaLLama-3-8b-BnWiki-Instruct-GGUF
Base model
BanglaLLM/BanglaLLama-3-8b-BnWiki-Instruct