Note (16 July 2025):
I have trained a new version of this model that no longer uses space-separated amino acids: kittawere/Llama-KW-CwP-V1-3B-notseparated
Accuracy is very similar (~80.57% for the new model vs. ~79.2% for this one), and the difference is not statistically significant.
So, while my original idea of separating amino acids made sense for clarity, it’s not needed for classification tasks. However, I still plan to explore it for protein generation, where residue-level control might be useful.
Kittawere's Chat with Protein, version 1, 3B-parameter model (well, a LoRA)
👋 Hi there! This is my first LLM fine-tuning project, so feel free to reach out if you spot anything missing or want to collaborate.
This model is the first in a planned series designed to chat with proteins directly. The idea is that by simply placing a protein sequence into the context window, you can interact with the model for biological insights.
What Does This Model Do?
In this first version, the model can:
✅ Predict the taxonomic kingdom of a eukaryotic protein, telling you whether it belongs to:
- Plant (Viridiplantae)
- Animal (Metazoa)
- Fungus (Fungi)
✅ And that's all; this is just a first test
Training Data
- Dataset: Entire Swiss-Prot database
- Data Processing:
- Trimmed so that animals, plants, and fungi contribute an equal number of examples (see the sketch after this list)
- Split: 80% training / 20% testing
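Roughly what that preprocessing looks like (a minimal sketch, not the exact script I used; records is assumed to be a list of (sequence, kingdom) pairs parsed from Swiss-Prot):

import random
from collections import defaultdict

def balance_and_split(records, test_frac=0.2, seed=42):
    # Group examples by kingdom
    by_kingdom = defaultdict(list)
    for sequence, kingdom in records:
        by_kingdom[kingdom].append((sequence, kingdom))
    # Downsample every kingdom to the size of the smallest one
    n = min(len(examples) for examples in by_kingdom.values())
    rng = random.Random(seed)
    balanced = []
    for examples in by_kingdom.values():
        balanced.extend(rng.sample(examples, n))
    # Shuffle, then split 80/20
    rng.shuffle(balanced)
    cut = int(len(balanced) * (1 - test_frac))
    return balanced[:cut], balanced[cut:]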
Performance
- Accuracy: 79.2% on the held-out test set (random guessing would give ~33%)
This result demonstrates that a fine-tuned LLM can work with protein sequences without requiring a special architecture or tokenizer.
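For reference, a minimal sketch of how such an accuracy number can be computed, assuming test_set is a list of (sequence, kingdom) pairs and classify wraps the inference code shown further down:

def accuracy(test_set, classify):
    # A prediction counts as correct if the labeled kingdom appears in the answer
    correct = sum(
        1 for sequence, kingdom in test_set
        if kingdom.lower() in classify(sequence).lower()
    )
    return correct / len(test_set)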
Input Format
To query the model, simply include the protein sequence in your prompt, wrapped in <seq> markers with the residues space-separated.
Example input:
<seq> T P P A G P D V G P R <seq> What is the taxonomic classification of the protein?
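Building that prompt from a raw sequence is just a matter of spacing out the residues; a minimal helper (the make_prompt name is mine, not part of the model):

def make_prompt(sequence: str) -> str:
    # Space-separate the residues and wrap them in the <seq> markers
    spaced = " ".join(sequence.upper())
    return f"<seq> {spaced} <seq> What is the taxonomic classification of the protein?"

print(make_prompt("TPPAGPDVGPR"))
# <seq> T P P A G P D V G P R <seq> What is the taxonomic classification of the protein?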
Why “separated” in the Name?
In this version, each amino acid letter in the protein sequence is space-separated, for example <seq> T P P A G P D V G P R <seq> instead of <seq> TPPAGPDVGPR <seq>.
This forces the tokenizer to treat each amino acid as a separate token instead of chunking the sequence, which might improve accuracy. However, I’m currently training another version without this separation to test this hypothesis.
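You can see the difference directly by tokenizing both forms (a quick check using the tokenizer from the Inference section; the exact tokens depend on the Llama vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kittawere/Llama-KW-CwP-V1-3B-separated")
# Without spaces, the BPE vocabulary merges several residues into single tokens
print(tokenizer.tokenize("TPPAGPDVGPR"))
# With spaces, each residue is tokenized on its own
print(tokenizer.tokenize("T P P A G P D V G P R"))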
Limitations
Since each amino acid is a separate token, long proteins can overflow the context window and lose part of the sequence (a simple guard is sketched below).
Proteins are not exclusive to a single kingdom, but... hey, this is just a test to check that an LLM can understand proteins.
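A minimal way to catch the first limitation before generating, assuming the tokenizer and model from the Inference section (the context size is read from the model config):

def fits_context(tokenizer, model, prompt, reserve=128):
    # Count prompt tokens and keep a budget for the generated answer
    n_tokens = len(tokenizer(prompt)["input_ids"])
    max_len = getattr(model.config, "max_position_embeddings", 2048)
    return n_tokens + reserve <= max_len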
License
Apache 2.0 (for the LoRA; for the rest of the weights, refer to Meta's Llama documentation)
Inference
For now I run it with:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kittawere/Llama-KW-CwP-V1-3B-separated"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load model and place it on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # automatically puts the model on GPU(s)/CPU
)

def inference(model, tokenizer, prompt):
    # Wrap the prompt in the chat template the model was fine-tuned with
    messages = [{"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
    )
    # Decode only the newly generated tokens, not the prompt
    generated_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(generated_tokens, skip_special_tokens=True)

prompt = "<seq> T P P A G P D V G P R <seq> What is the taxonomic classification of the protein?"
print(inference(model, tokenizer, prompt))
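If loading the repo directly with AutoModelForCausalLM does not resolve the LoRA adapter in your transformers version, going through peft should also work (an untested sketch, assuming the repo contains a standard PEFT adapter):

from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # the base Llama weights are fetched automatically
)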
Want to help me? Reach out or join my Discord!