---
base_model: NousResearch/Llama-2-7b-hf
tags:
- generated_from_trainer
datasets:
- darulm
model-index:
- name: RuBit-Llama-56M2
  results: []
---

# RuBit-Llama-63M

This model is a fine-tuned version of [NousResearch/Llama-2-7b-hf](https://huggingface.co/NousResearch/Llama-2-7b-hf) on the darulm dataset.
The aphorisms, dramaturgy, history, humor, and literature domains were sampled from darulm, giving 2,125,871,104 training tokens.

Inspired by [abideen/Bitnet-Llama-70M](https://huggingface.co/abideen/Bitnet-Llama-70M).

## Model description

### Sample inference code

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.models.llama.modeling_llama import (
    LlamaDecoderLayer,
    LlamaMLP,
    LlamaRMSNorm,
    LlamaSdpaAttention,
)
# BitLinear is not part of transformers; see the reference sketch at the end of this card.

# Load the pretrained BitNet model and tokenizer
model_id = "igorktech/RuBit-LLama-63M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)


def convert_to_bitnet(model, copy_weights):
    for name, module in model.named_modules():
        # Replace the linear layers inside attention and MLP blocks with BitLinear
        if isinstance(module, (LlamaSdpaAttention, LlamaMLP)):
            for child_name, child_module in module.named_children():
                if isinstance(child_module, nn.Linear):
                    bitlinear = BitLinear(
                        child_module.in_features,
                        child_module.out_features,
                        child_module.bias is not None,
                    ).to(device="cuda:0")
                    if copy_weights:
                        bitlinear.weight = child_module.weight
                        if child_module.bias is not None:
                            bitlinear.bias = child_module.bias
                    setattr(module, child_name, bitlinear)
        # Remove the redundant input_layernorm (BitLinear normalizes its own input)
        elif isinstance(module, LlamaDecoderLayer):
            for child_name, child_module in module.named_children():
                if isinstance(child_module, LlamaRMSNorm) and child_name == "input_layernorm":
                    setattr(module, child_name, nn.Identity().to(device="cuda:0"))


convert_to_bitnet(model, copy_weights=True)
model.to(device="cuda:0")

prompt = "Привет"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
generate_ids = model.generate(inputs.input_ids, max_length=100)
print(tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
```

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a reconstructed `TrainingArguments` sketch is given at the end of this card):
- learning_rate: 0.0015
- train_batch_size: 64
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 128
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 0.1
- num_epochs: 2
- mixed_precision_training: Native AMP

### Training results

### Framework versions

- Transformers 4.40.0
- Pytorch 2.2.1+cu121
- Datasets 2.19.0
- Tokenizers 0.19.1
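
## BitLinear layer (reference sketch)

The `BitLinear` class referenced by the conversion code above is not part of `transformers`. Below is a minimal sketch of the layer, adapted from the BitNet b1.58 reference snippet that [abideen/Bitnet-Llama-70M](https://huggingface.co/abideen/Bitnet-Llama-70M) builds on; treat it as an illustration rather than the exact code used to train this model.

```python
import torch.nn as nn
import torch.nn.functional as F
from transformers.models.llama.modeling_llama import LlamaRMSNorm


def activation_quant(x):
    # Per-token 8-bit activation quantization (absmax scaling)
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp_(min=1e-5)
    return (x * scale).round().clamp_(-128, 127) / scale


def weight_quant(w):
    # Per-tensor ternary (1.58-bit) weight quantization (absmean scaling)
    scale = 1.0 / w.abs().mean().clamp_(min=1e-5)
    return (w * scale).round().clamp_(-1, 1) / scale


class BitLinear(nn.Linear):
    """Training-time BitLinear: quantization is simulated in the forward pass."""

    def forward(self, x):
        w = self.weight
        # RMS-normalize the input here, which is why input_layernorm is dropped during conversion
        x_norm = LlamaRMSNorm(x.shape[-1]).to(w.device)(x)
        # Straight-through estimator: quantized values forward, full-precision gradients backward
        x_quant = x_norm + (activation_quant(x_norm) - x_norm).detach()
        w_quant = w + (weight_quant(w) - w).detach()
        return F.linear(x_quant, w_quant)
```

With a `BitLinear` definition like this in scope, `convert_to_bitnet` in the sample inference code runs as written.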
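
## Reconstructed training configuration (sketch)

The original training script is not published; the following is a hedged reconstruction of Hugging Face `TrainingArguments` from the hyperparameters listed above. The `output_dir` value is a placeholder, and the reported warmup value of 0.1 is interpreted here as a warmup ratio.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="rubit-llama-63m",      # placeholder path, not from the card
    learning_rate=1.5e-3,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=8,
    seed=42,
    gradient_accumulation_steps=2,     # 64 x 2 = total train batch size of 128
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                  # the card reports "lr_scheduler_warmup_steps: 0.1"
    num_train_epochs=2,
    fp16=True,                         # "Native AMP" mixed precision
)
```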