Fine-Tuning xLSTM-7B on a budget: An Experimental Journey with Chat Templates

Community Article · Published October 15, 2025


Exploring transformer alternatives is crucial for a more energy-efficient future.

This article chronicles a humble experiment to fine-tune the promising (and underrated) xLSTM-7b model by NX-AI, adding chat capabilities using only Google Colab resources for the evaluation and a CUDO Compute VM for the training.

The Motivation: Beyond Transformers

The machine learning world is dominated by the Transformer architecture. While incredibly powerful, its quadratic complexity and significant energy consumption are well-known challenges.

This has inspired a search for alternatives, and one of the most exciting recent developments is xLSTM.

My goal for this project was not to break SOTA records, but to explore the practicality of working with these new architectures on a limited budget. My interest in LSTM architectures isn't new.

It started naturally around the time GPT-2 was released, when I was fine-tuning LSTMs for text generation on a Lenovo laptop.

Just for fun. Back then, as a Software Engineer experimenting outside of academia, I was fascinated by the potential of LSTMs, aiming to create my first chatbot.

However, the world was quickly swept up by the Transformer revolution, and like many, I shifted my focus to other topics such as DevOps and MLOps. Still, I never lost my appreciation for the core ideas behind both Transformers and LSTMs.

With the advent of projects like NX-AI's xLSTM, which address some of the original limitations, I felt a renewed excitement to dive back in and explore their potential.

Specifically, I wanted to take the base NX-AI/xLSTM-7b model and teach it to be a conversational agent by fine-tuning it on a chat dataset.

The Process: Old School Grit and a Modern Companion

I want to be transparent about the development process: this small project is a product of uninterrupted focus time, made possible by remote work.

It wasn't generated by an LLM assistant in one shot. Honestly, I've found that for complex coding tasks, even tools like Gemini Pro often hallucinate entire codebases, making them unusable.

I even had to disable the integration in my Colab environment. Instead, I went back to basics: multiple browser tabs open with Hugging Face documentation, library source code, and trusty old Stack Overflow.

This "old school" method of deep work and research was essential, however, I don't want to dismiss LLM assistants entirely: they were an incredibly powerful tool for problem-solving and debugging.

When I had to challenge my assumptions, was stuck on a specific error, or needed to understand a concept from a different angle, they provided a valuable sounding board.

The real workflow was a blend of human-led research and assisted debugging.

The Training Notebook: A Step-by-Step Guide

Let's walk through the key stages and, most importantly, the challenges I encountered and how I solved them.

The Kernel Challenge: Native PyTorch to the Rescue

My initial plan was to leverage Triton kernels and Flash Attention 2 for maximum efficiency.

However, I quickly ran into roadblocks. The xLSTM implementation in transformers does not yet fully support them, leading to compatibility errors during model loading.

After struggling, I decided to fall back to the native PyTorch implementation.

Lesson learned: sometimes the "less optimized" path is the one that actually works. So, let's optimize later.
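
For reference, the fallback pattern looks roughly like this. This is a hedged sketch rather than the exact notebook code, and the load_xlstm helper is hypothetical: it simply tries the optimized attention path first and falls back to the default (native PyTorch) implementation if the kernels are unavailable.

import torch
from transformers import AutoModelForCausalLM

def load_xlstm(model_id="NX-AI/xLSTM-7b"):
    # Hypothetical helper: try the optimized kernel path first,
    # then fall back to the native PyTorch implementation.
    try:
        return AutoModelForCausalLM.from_pretrained(
            model_id,
            attn_implementation="flash_attention_2",
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
        )
    except (ImportError, ValueError) as err:
        print(f"Optimized kernels unavailable ({err}); using the native implementation.")
        return AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
        )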

This is where the power of the open-source community shines. I want to give a huge thank you to John6666, who shared an incredibly useful resource detailing the specific version requirements for xLSTM.

This post was a lifesaver.

Configuration and Model Loading

To make this work on a single Colab GPU (A100), 4-bit quantization using bitsandbytes was non-negotiable.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# --- Quantization Config for Memory Efficiency ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
)

# --- Load the Model and Tokenizer ---
# Note: I removed attn_implementation="flash_attention_2" to avoid errors.
# The model defaults to the native PyTorch implementation
model = AutoModelForCausalLM.from_pretrained(
    "NX-AI/xLSTM-7b",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model.config.use_cache = False

tokenizer = AutoTokenizer.from_pretrained("NX-AI/xLSTM-7b", trust_remote_code=True)
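
The config above sticks to the bitsandbytes defaults. If memory gets tight, bitsandbytes also exposes a few extra 4-bit options; the variant below is a sketch of what I would try next (not what the notebook used), and it would be passed to from_pretrained in place of the default config.

import torch
from transformers import BitsAndBytesConfig

# Hypothetical, more explicit 4-bit setup: NF4 quantization, double quantization,
# and bfloat16 compute to match the rest of the training setup.
bnb_config_nf4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)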

Applying the Chat Template

A base model knows nothing about how to format a conversation. I chose the template from SmolLM3 for its simplicity.

The clone_chat_template utility from trl automatically resizes the model's embedding layer to accommodate the new special tokens. I later removed the reasoning parts of the template manually, since I'm not focusing on a "think mode" for now: I think it's more important to experiment first with a simpler post-training pipeline.

# The SmolLM3 template is simple and effective.
# `trl` handles adding new tokens and resizing embeddings automatically.
from trl import clone_chat_template

model, tokenizer, additional_tokens = clone_chat_template(model, tokenizer, "HuggingFaceTB/SmolLM3-3B")
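
To sanity-check the cloned template, it helps to render a toy conversation and inspect the special tokens it adds. The example conversation below is illustrative; the exact output depends on the SmolLM3 template.

# Render a toy conversation with the newly attached chat template.
messages = [
    {"role": "user", "content": "What is xLSTM?"},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # append the assistant header so the model can reply
)
print(prompt)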

Preparing the Dataset and Training

With the model ready, the final step was to prepare the dataset and launch the training.

I used the first 25,000 samples from the HuggingFaceH4/ultrachat_200k dataset, together with trl for supervised fine-tuning and the peft library with DoRA (use_dora=True) for parameter-efficient fine-tuning.

# Prepare the model for k-bit training and apply LoRA/DoRA
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "proj_up", "proj_down"],
    use_dora=True,
)
model = get_peft_model(model, peft_config)

# Load the dataset
from datasets import load_dataset

chat_dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:25000]")
chat_dataset = chat_dataset.train_test_split(test_size=0.1)
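
The trainer config below points dataset_text_field at a "text" column, so the conversations need to be rendered into plain strings first. The notebook did this in a pre-formatting step; a minimal sketch of that step, assuming the standard "messages" column of ultrachat_200k, could look like this:

# Sketch of the pre-formatting step: render each conversation into a single
# "text" string using the chat template attached above.
def format_example(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

chat_dataset = chat_dataset.map(format_example)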

# Configure the trainer
from trl import SFTConfig, SFTTrainer

training_args = SFTConfig(
    output_dir="xlstm-7b-instruct",
    max_steps=1000,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    logging_steps=10,
    max_length=2048,
    packing=False, # Set to False as per the notebook
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    report_to="wandb",
    bf16=True,
    tf32=True,  # enable TF32 matmuls (Ampere and newer GPUs)
    gradient_checkpointing=True,
    dataset_text_field="text" # The data was pre-formatted into the 'text' field
)

trainer = SFTTrainer(
    model=model,
    train_dataset=chat_dataset['train'],
    eval_dataset=chat_dataset['test'],
    args=training_args,
    peft_config=peft_config,
)

# Let's go!

trainer.train()
trainer.save_model("xlstm-7b-instruct")
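
After training, a quick smoke test is to generate a reply with the fine-tuned model still in memory. This is a minimal sketch with an illustrative prompt; note that the generation cache, disabled earlier for training, is switched back on.

import torch

# Quick smoke test with the freshly fine-tuned model.
model.config.use_cache = True  # re-enable the cache for generation
model.eval()

messages = [{"role": "user", "content": "Explain xLSTM in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))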

Evaluation Results and the Efficiency Surprise

The evaluation is complete, and the results paint a fascinating picture of the trade-offs between model architecture, size, and performance.

The Head-to-Head Comparison

I used the lighteval library to benchmark both my fine-tuned xLSTM-7b-Instruct and the original HuggingFaceTB/SmolLM3-3B on a handful of "standard" tasks.

Here’s how they stack up:

Task              Metric    xLSTM-7b-Instruct  SmolLM3-3B
ARC (Challenge)   acc       0.5401             0.7000
HellaSwag         acc_norm  0.6384             0.4500
TruthfulQA (MC2)  mc2       0.3804             0.5885
WinoGrande        acc       0.7230             0.7500
GSM8K             acc/em    0.0887             0.6000

The results are mixed and incredibly insightful.

  • The fine-tuned xLSTM model on 25k samples shows strong performance on commonsense tasks like HellaSwag.
  • However, the smaller, transformer-based SmolLM3 model pulls ahead significantly on reasoning-heavy benchmarks, especially ARC (science questions) and GSM8K (math word problems).

This result highlights that while a new architecture can be promising, a well-trained model of a smaller size can still be highly competitive, particularly in specialized domains.

The minimal fine-tuning run on only 25k samples was likely not enough to unlock the full potential of the base 7B xLSTM model weights provided by NX-AI.

The Efficiency Surprise: Is Faster Always Better?

Here is where the story takes a twist. As I mentioned previously, my original plan was to compare scores.

But an unexpected observation came from the evaluation process itself: The xLSTM-7b-Instruct model, despite being more than twice the size (7B vs 3B parameters), completed the full benchmark suite significantly faster than the SmolLM3 model.

This is a powerful finding: it suggests that xLSTM's architectural design could lead to more efficient inference, a critical factor for real-world deployment.

The fact that a larger model can be evaluated more quickly challenges the simple assumption that fewer parameters always mean faster inference.
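
To turn this observation into a number, one could time generation directly and compare tokens per second across both models. The helper below is a rough sketch for such a measurement (my own comparison came from lighteval wall-clock time, not from this snippet):

import time
import torch

def tokens_per_second(model, tokenizer, prompt, max_new_tokens=256):
    # Rough throughput measurement: wall-clock time for a single generation call.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()  # assumes a CUDA device
    start = time.perf_counter()
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / elapsed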

This aligns perfectly with the core motivation of this project: exploring alternatives that can lower the energy consumption of LLMs and make better use of our existing data centers.

Conclusion and Looking Ahead

This experiment started with a simple goal: to add a chat template to xLSTM-7B.

It has evolved into a study on the trade-offs between performance and efficiency in next-generation architectures:

  • Low-budget fine-tuning is possible: It's feasible to adapt and experiment with novel architectures using accessible resources.
  • Performance is a complex equation: A larger parameter count and a new architecture don't automatically guarantee superior performance across all tasks, especially with limited fine-tuning data.
  • Inference efficiency is a key differentiator: xLSTM's potential for faster inference, even at a larger scale, is a valid reason to continue research in this area.

This work is just a first step and I am publishing this in the spirit of open research and collaboration.

I am actively seeking access to more GPU compute power to fine-tune on a larger dataset and begin exploring more advanced capabilities like tool use. If you're interested in exploring post-Transformer models or have resources to share, I would love to connect!

A final thank you to NX-AI for the xLSTM-7b release, and to the Hugging Face team for building the incredible community and libraries such as transformers, trl, and datasets that make small experiments like this possible.


Support ethicalabs.ai's ResearchOps:
