ctrltokyo/llama-2-7b-hf-dolly-flash-attention

This model is a fine-tuned version of NousResearch/Llama-2-7b-hf on the databricks/databricks-dolly-15k dataset with all training performed using Flash Attention 2.

No further testing or optimisation has been performed.

Model description

Like ctrltokyo/llm_prompt_mask_fill_model, this model could be used for live autocompletion of prompts, but it is designed more as a generalized chatbot (hence the use of the Dolly 15k dataset). Don't use it on code; it won't work. I plan to release a further fine-tuned version using the code_instructions_120k dataset.
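A minimal sketch of chatbot-style inference follows. It rests on several assumptions that this card does not confirm: that the repository holds PEFT adapter weights to be applied on top of NousResearch/Llama-2-7b-hf (suggested only by the PEFT entry under Framework versions), that `accelerate` is installed for `device_map="auto"`, and that prompts follow an instruction/context/response layout. The `build_prompt` template and the `generate` helper are hypothetical; the prompt format actually used in training is not documented here.

```python
BASE_MODEL = "NousResearch/Llama-2-7b-hf"
ADAPTER = "ctrltokyo/llama-2-7b-hf-dolly-flash-attention"


def build_prompt(instruction: str, context: str = "") -> str:
    """Format a Dolly-style record as a prompt (hypothetical template)."""
    parts = [f"### Instruction:\n{instruction.strip()}"]
    if context.strip():
        parts.append(f"### Context:\n{context.strip()}")
    parts.append("### Response:\n")
    return "\n\n".join(parts)


def generate(instruction: str, context: str = "", max_new_tokens: int = 128) -> str:
    """Load the base model plus adapter and answer one instruction."""
    # Imports are local so build_prompt works without the heavy dependencies.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    base = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
    )
    # Assumption: the repository contains adapter weights, not a merged model.
    model = PeftModel.from_pretrained(base, ADAPTER)
    model.eval()

    prompt = build_prompt(instruction, context)
    inputs = tokenizer(prompt, return_tensors="pt").to(base.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens and decode only the newly generated part.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate("Summarise the plot of Hamlet in two sentences.")` would download both repositories on first use. If the repository turns out to contain merged weights instead, loading it directly with `AutoModelForCausalLM.from_pretrained` would be the thing to try.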

Intended uses & limitations

Use as intended.

Training and evaluation data

No evaluation was performed. Training ran on an NVIDIA A100. Inference on the raw model appears to use around 20 GB of VRAM.
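For context on the VRAM figure, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone at different precisions. The parameter count of about 6.74 billion for Llama 2 7B is an assumption brought in for illustration, and the estimate ignores activations, the KV cache, and framework overhead, so it is not an explanation of the observed usage.

```python
LLAMA_2_7B_PARAMS = 6.74e9  # approximate parameter count (assumption)


def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for the weights only, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


for name, nbytes in [("float32", 4), ("float16", 2), ("4-bit", 0.5)]:
    print(f"{name}: {weight_memory_gb(LLAMA_2_7B_PARAMS, nbytes):.1f} GB")
```

Under these assumptions the weights come to roughly 27 GB in float32, 13.5 GB in float16, and 3.4 GB at 4 bits per parameter.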

Training procedure

The following bitsandbytes quantization config was used during training:

load_in_8bit: False
load_in_4bit: True
llm_int8_threshold: 6.0
llm_int8_skip_modules: None
llm_int8_enable_fp32_cpu_offload: False
llm_int8_has_fp16_weight: False
bnb_4bit_quant_type: fp4
bnb_4bit_use_double_quant: False
bnb_4bit_compute_dtype: float32
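The settings above could be expressed in code roughly as follows. This is a sketch, assuming the `transformers` `BitsAndBytesConfig` class is used to hold them; the names `BNB_CONFIG_KWARGS` and `make_bnb_config` are hypothetical, and the original training script is not shown in this card.

```python
# Values copied from the list above; the dtype is kept as a string here
# so the dictionary can be inspected without importing torch.
BNB_CONFIG_KWARGS = {
    "load_in_8bit": False,
    "load_in_4bit": True,
    "llm_int8_threshold": 6.0,
    "llm_int8_skip_modules": None,
    "llm_int8_enable_fp32_cpu_offload": False,
    "llm_int8_has_fp16_weight": False,
    "bnb_4bit_quant_type": "fp4",
    "bnb_4bit_use_double_quant": False,
    "bnb_4bit_compute_dtype": "float32",
}


def make_bnb_config():
    """Build a BitsAndBytesConfig from the settings above."""
    # Local imports keep the dictionary usable without torch/transformers.
    import torch
    from transformers import BitsAndBytesConfig

    kwargs = dict(BNB_CONFIG_KWARGS)
    # Convert the dtype name into the actual torch dtype object.
    kwargs["bnb_4bit_compute_dtype"] = getattr(torch, kwargs["bnb_4bit_compute_dtype"])
    return BitsAndBytesConfig(**kwargs)
```

The resulting object would typically be passed as `quantization_config=make_bnb_config()` to `AutoModelForCausalLM.from_pretrained`, which requires the `bitsandbytes` package and a CUDA-capable GPU.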

Framework versions

PEFT 0.4.0
