---
license: cc-by-nc-4.0
library_name: transformers
language:
- en
tags:
- writing
base_model:
- maldv/badger-nu-llama-3.1-8B-UltraLong
pipeline_tag: text-generation
datasets:
- SillyTilly/fiction-writer-596
---

[GGUF](https://huggingface.co/mradermacher/praxis-bookwriter-llama3.1-8b-sft-GGUF) [iMat](https://huggingface.co/mradermacher/praxis-bookwriter-llama3.1-8b-sft-i1-GGUF)
# Praxis Bookwriter Llama 3.1 8B
My last iteration of fantasy writer suffered from one glaring flaw: it did not really follow instructions well.
After much consideration, I decided it would make sense to introduce information about the story chapter text
into the prompt itself, linking the instructions to the generated text.

To do this, I took strides of 16,384 tokens across each of the books in the ~140M token dataset and used R1 to generate a summary of each stride. With
some careful modification, I used this summary to build the first user turn. Each subsequent assistant turn carries approximately
512 tokens of content, and the following user turn is either a chapter header or one paragraph of content. This alternation continues until
the entirety of the original stride is consumed.
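To make the turn layout concrete, here is a minimal sketch of how one training conversation could be assembled from a stride. The function, its arguments, and the chunking are illustrative assumptions, not the actual preprocessing code.
```python
# Illustrative sketch of the per-stride conversation layout (not the real preprocessing script).
# `summary` is the R1-generated overview of the stride, `assistant_chunks` are ~512-token
# slices of the original text, and `user_turns` are the chapter headers or single
# paragraphs that sit between assistant turns.

def build_conversation(system_prompt: str, summary: str, chapter: int,
                       assistant_chunks: list[str], user_turns: list[str]) -> list[dict]:
    messages = [
        {"role": "system", "content": system_prompt},
        # First user turn: the stride summary, ending with the chapter marker.
        {"role": "user", "content": f"{summary}\n// Chapter {chapter}"},
    ]
    for i, chunk in enumerate(assistant_chunks):
        # Assistant turn: roughly 512 tokens of story text.
        messages.append({"role": "assistant", "content": chunk})
        # In-between user turn: a chapter header or one paragraph of content.
        if i < len(user_turns):
            messages.append({"role": "user", "content": user_turns[i]})
    return messages
```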
## Crafting the prompt
The system prompt should contain some variation of:
```text
You are the user's helpful writing assistant.
// Title: The Title of Your Story
// Author: Author Name For Style
// Tags: some comma, delimited list, of genres
```
In an initial test, I tried putting the summary in the system prompt. The result was underwhelming. For this
version, the first user turn should contain an overview of the setting (the summary), with the last line being of the format:
```
// Chapter n
```
This block can contain any variety of instruction about what to write in the frame that follows. The summaries I used were between 500 and 1,500 tokens, so the more detail about the setting, location, characters, their relationships, and plot points, the better.
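Putting it together, a generation call might look like the following. This is a hedged sketch: the title, author, tags, and summary text are placeholders, and it assumes the repository tokenizer ships a Llama 3 style chat template (otherwise the prompt string has to be assembled by hand).
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "maldv/praxis-bookwriter-llama3.1-8b-sft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [
    {
        "role": "system",
        "content": (
            "You are the user's helpful writing assistant.\n"
            "// Title: The Shattered Crown\n"              # placeholder title
            "// Author: Author Name For Style\n"           # placeholder style author
            "// Tags: fantasy, adventure, political intrigue"
        ),
    },
    {
        # First user turn: a setting overview (placeholder text), ending with the chapter marker.
        "role": "user",
        "content": (
            "The kingdom of Veyra is fracturing after the death of its queen. "
            "Her two heirs each hold half of the royal seal and neither will yield.\n"
            "// Chapter 1"
        ),
    },
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```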
## Training
This model was trained on a single Paperspace A6000 using Unsloth with rank-stabilized LoRA (rsLoRA):
```python
from datasets import load_from_disk
from dotenv import dotenv_values
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
from transformers import TrainingArguments
from trl import SFTTrainer
import wandb
envconfig = dict(dotenv_values(".env"))

# Loading configuration: auto dtype (bf16 where supported), ~24k context, 4-bit base weights.
dtype = None
max_seq_length = 24576
load_in_4bit = True
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 128,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 128**.5,  # sqrt(r); with rsLoRA this gives a scaling factor of 1
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = True,
    loftq_config = None,
)
dataset = load_from_disk('bookdata')
ds_train = dataset
# 32 randomly sampled rows used for periodic eval (drawn from the training set itself).
ds_eval = dataset.shuffle(seed=12345).select(range(32))
targs = TrainingArguments(
    per_device_train_batch_size = 3,
    gradient_accumulation_steps = 4,  # effective batch size of 12
    learning_rate = 4e-5,
    weight_decay = 0,
    gradient_checkpointing = True,
    max_grad_norm = 1,
    warmup_steps = 5,
    num_train_epochs = 3,
    optim = "paged_adamw_32bit",
    lr_scheduler_type = "cosine",
    seed = 3407,
    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
    logging_steps = 1,
    per_device_eval_batch_size = 1,
    do_eval = True,
    eval_steps = 25,
    eval_strategy = "steps",
    save_strategy = "steps",
    save_steps = 20,
    save_total_limit = 3,
    output_dir = "outputs",
    report_to = "wandb",
)
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = ds_train,
    eval_dataset = ds_eval,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 6,
    packing = False,
    args = targs,
)
wandb.login(key=envconfig['wandb_key'])
wandb.init(
    project = 'bookwriter-596',
    config = {
        "learning_rate": 4e-5,
        "architecture": 'llama 3.1 8b',
        "dataset": 'bookdata',
        "epochs": 3,
    },
)
# trainer_stats = trainer.train()  # fresh run
trainer.train(resume_from_checkpoint=True)  # resume from the latest checkpoint in `outputs`
```

## Merged
The rsLoRA I trained was applied on top of badger-nu-llama-3.1-8B-UltraLong, which is RoPE scaled, so in theory
this model should be able to perform at context lengths exceeding those of my original training data. That said,
my training data was limited to sequence lengths of around 20k tokens, so anything beyond that may be out of distribution.
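The merge itself can be reproduced with peft along these lines; this is a sketch under the assumption that the rsLoRA adapter is available locally, and `lora-checkpoint` is a placeholder path rather than a published artifact.
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the RoPE-scaled base the adapter is applied to.
base_id = "maldv/badger-nu-llama-3.1-8B-UltraLong"
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base_id)

# `lora-checkpoint` is a placeholder path to the trained rsLoRA adapter weights.
merged = PeftModel.from_pretrained(base, "lora-checkpoint")
merged = merged.merge_and_unload()  # fold the adapter into the base weights

merged.save_pretrained("praxis-bookwriter-merged")
tokenizer.save_pretrained("praxis-bookwriter-merged")
```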
## License
This model is released under the terms of both the Llama 3.1 Community License and CC-BY-NC-4.0.
## Author
Praxis Maldevide
## Citation
If you find this work helpful, feel free to cite it:
```
@misc{praxis-bookwriter-llama3.1-8b-sft,
  title = {Praxis Bookwriter Llama3.1 8B},
  url = {https://huggingface.co/maldv/praxis-bookwriter-llama3.1-8b-sft},
  author = {Praxis Maldevide},
  month = {May},
  year = {2025}
}
```