Note (16 July 2025):
I have trained a new version of this model that no longer uses space-separated amino acids:
kittawere/Llama-KW-CwP-V1-3B-notseparated

Accuracy is very similar (~79.2% for this model vs. ~80.57% for the non-separated one), and the difference is not statistically significant.

So, while my original idea of separating amino acids made sense for clarity, it’s not needed for classification tasks. However, I still plan to explore it for protein generation, where residue-level control might be useful.
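
For reference, a significance check like this can be done with a two-proportion z-test. A minimal sketch, assuming a hypothetical test-set size of 5,000 examples per model (the real size is whatever the 20% Swiss-Prot split described below yields):

from statsmodels.stats.proportion import proportions_ztest

n = 5000                                      # hypothetical test-set size, not the real one
correct = [int(0.792 * n), int(0.8057 * n)]   # correct answers: this model, non-separated model
stat, p_value = proportions_ztest(count=correct, nobs=[n, n])
print(f"z = {stat:.2f}, p = {p_value:.3f}")   # p > 0.05 here, i.e. not significant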

Kittawere's Chat with Protein, version 1: a 3B-parameter model (well, a LoRA)

👋 Hi there! This is my first LLM fine-tuning project, so feel free to reach out if you spot anything missing or want to collaborate.

This model is the first in a planned series designed to chat with proteins directly. The idea is that by simply placing a protein sequence into the context window, you can interact with the model for biological insights.

What Does This Model Do?

In this first version, the model can:

✅ Predict the taxonomic kingdom of a eukaryotic protein, telling you whether it belongs to:

  • Plant (Viridiplantae)
  • Animal (Metazoa)
  • Fungus (Fungi)

✅ And that is all; this is just a first test.

Training Data

  • Dataset: Entire Swiss-Prot database
  • Data Processing:
    • Trimmed to ensure an equal number of examples from animals, plants, and fungi (a sketch follows this list)
    • Split: 80% training / 20% testing
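
A minimal sketch of the balancing and splitting steps, assuming the Swiss-Prot entries were already parsed into (sequence, kingdom) pairs; `records` and `balanced_split` are hypothetical names, not the actual pipeline:

import random
from collections import defaultdict

def balanced_split(records, test_frac=0.2, seed=42):
    """records: list of (sequence, kingdom) tuples, kingdom in {Viridiplantae, Metazoa, Fungi}."""
    by_kingdom = defaultdict(list)
    for seq, kingdom in records:
        by_kingdom[kingdom].append((seq, kingdom))

    rng = random.Random(seed)
    smallest = min(len(group) for group in by_kingdom.values())

    balanced = []
    for group in by_kingdom.values():
        rng.shuffle(group)
        balanced.extend(group[:smallest])   # trim each kingdom to the smallest one

    rng.shuffle(balanced)
    n_test = int(len(balanced) * test_frac)
    return balanced[n_test:], balanced[:n_test]   # 80% train, 20% test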

Performance

  • Accuracy: 79.2% on the held-out test set (random guessing would score ~33%)

This result demonstrates that a fine-tuned LLM can work with protein sequences without requiring special systems or tokenizers.
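
If you want to reproduce this kind of number, here is a minimal sketch of the evaluation loop, assuming `test_set` is a list of (prompt, expected_kingdom) pairs and reusing the `inference` helper from the Inference section below:

def accuracy(model, tokenizer, test_set):
    correct = 0
    for prompt, expected in test_set:
        answer = inference(model, tokenizer, prompt)
        correct += int(expected.lower() in answer.lower())   # loose substring match
    return correct / len(test_set)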

Input Format

To query the model, simply include the protein sequence in your prompt, wrapped in <seq> tags.

Example input:

<seq> T P P A G P D V G P R <seq> What is the taxonomic classification of the protein?
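
A small sketch of building that prompt from a raw sequence; `make_prompt` is a hypothetical helper, not part of the model:

def make_prompt(sequence: str) -> str:
    spaced = " ".join(sequence)   # "TPPAGPDVGPR" -> "T P P A G P D V G P R"
    return f"<seq> {spaced} <seq> What is the taxonomic classification of the protein?"

print(make_prompt("TPPAGPDVGPR"))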

Why “separated” in the Name?

In this version, each amino acid letter in the protein sequence is space-separated. For example: <seq> T P P A G P D V G P R <seq> instead of <seq> TPPAGPDVGPR <seq>.

This forces the tokenizer to treat each amino acid as a separate token instead of chunking the sequence, which might improve accuracy. I have since trained another version without this separation to test this hypothesis; see the note at the top.
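
You can inspect the difference directly with the tokenizer alone (no need to load the full model); exact token counts depend on the Llama vocabulary:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kittawere/Llama-KW-CwP-V1-3B-separated")

chunked = tokenizer.tokenize("TPPAGPDVGPR")               # BPE merges several residues per token
separated = tokenizer.tokenize("T P P A G P D V G P R")   # roughly one token per residue
print(len(chunked), chunked)
print(len(separated), separated)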

Limitations

Since each amino acid is a separate token, long sequences can push earlier tokens out of the context window; the sketch below shows a quick check.
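
A minimal sketch for checking whether a prompt fits, reusing the tokenizer from the Inference section; the `max_len` value here is an assumption, so check the model config for the real limit:

def fits_in_context(tokenizer, prompt, max_len=8192):   # max_len is an assumed limit
    n_tokens = len(tokenizer(prompt)["input_ids"])
    return n_tokens <= max_len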

Proteins are not exclusive to a single kingdom, but... hey, this is just a test to check that an LLM can understand proteins.

License

Apache 2.0 (for the LoRA; for the base model weights, refer to Meta's documentation).

Inference

For now I run it with:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kittawere/Llama-KW-CwP-V1-3B-separated"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load model and move to GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto"                # Automatically puts model on GPU(s)/CPU(s)
)

prompt = f"<seq> T P P A G P D V G P R <seq> What is the taxonomic classification of the protein?"

def inference(model, tokenizer, prompt):
    messages = [{"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,  # greedy decoding so answers are reproducible
    )

    # Get only the newly generated tokens
    generated_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    answer = tokenizer.decode(generated_tokens, skip_special_tokens=True)

    return answer

print(inference(model, tokenizer, prompt))

Want to help? Reach out to me or join my Discord.
