Note (16 July 2025):
I have trained a new version of this model that no longer uses space-separated amino acids: kittawere/Llama-KW-CwP-V1-3B-notseparated
Accuracy is very similar (~80.57% for the new model vs. ~79.2% for this one), and the difference is not statistically significant.
So, while my original idea of separating amino acids made sense for clarity, it’s not needed for classification tasks. However, I still plan to explore it for protein generation, where residue-level control might be useful.
Kittawere's Chat with Protein, version 1, 3B-parameter model (well, a LoRA)
👋 Hi there! This is my first LLM fine-tuning project, so feel free to reach out if you spot anything missing or want to collaborate.
This model is the first in a planned series designed to chat with proteins directly. The idea is that by simply placing a protein sequence into the context window, you can interact with the model for biological insights.
What Does This Model Do?
In this first version, the model can:
✅ Predict the taxonomic kingdom of a eukaryotic protein, telling you whether it belongs to:
- Plant (Viridiplantae)
- Animal (Metazoa)
- Fungus (Fungi)
✅ And that's all; this is just a first test
Training Data
- Dataset: Entire Swiss-Prot database
- Data Processing:
- Trimmed so that animals, plants, and fungi contribute an equal number of examples (see the sketch after this list)
- Split: 80% training / 20% testing
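Roughly what that preprocessing looks like (a minimal sketch, not the exact script I used; records is assumed to be a list of (sequence, kingdom) pairs parsed from Swiss-Prot):

import random
from collections import defaultdict

def balance_and_split(records, test_frac=0.2, seed=42):
    # Group examples by kingdom
    by_kingdom = defaultdict(list)
    for sequence, kingdom in records:
        by_kingdom[kingdom].append((sequence, kingdom))
    # Downsample every kingdom to the size of the smallest one
    n = min(len(examples) for examples in by_kingdom.values())
    rng = random.Random(seed)
    balanced = []
    for examples in by_kingdom.values():
        balanced.extend(rng.sample(examples, n))
    # Shuffle, then split 80/20
    rng.shuffle(balanced)
    cut = int(len(balanced) * (1 - test_frac))
    return balanced[:cut], balanced[cut:]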
Performance
- Accuracy: 79.2% on the held-out test set (random guessing would give ~33%)
This result demonstrates that a fine-tuned LLM can work with protein sequences without requiring a special architecture or tokenizer.
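For reference, a minimal sketch of how such an accuracy number can be computed, assuming test_set is a list of (sequence, kingdom) pairs and classify wraps the inference code shown further down:

def accuracy(test_set, classify):
    # A prediction counts as correct if the labeled kingdom appears in the answer
    correct = sum(
        1 for sequence, kingdom in test_set
        if kingdom.lower() in classify(sequence).lower()
    )
    return correct / len(test_set)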
Input Format
To query the model, simply include the protein sequence in your prompt, wrapped in <seq> markers with the residues space-separated.
Example input:
<seq> T P P A G P D V G P R <seq> What is the taxonomic classification of the protein?
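Building that prompt from a raw sequence is just a matter of spacing out the residues; a minimal helper (the make_prompt name is mine, not part of the model):

def make_prompt(sequence: str) -> str:
    # Space-separate the residues and wrap them in the <seq> markers
    spaced = " ".join(sequence.upper())
    return f"<seq> {spaced} <seq> What is the taxonomic classification of the protein?"

print(make_prompt("TPPAGPDVGPR"))
# <seq> T P P A G P D V G P R <seq> What is the taxonomic classification of the protein?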
Why “separated” in the Name?
In this version, each amino acid letter in the protein sequence is space-separated, for example <seq> T P P A G P D V G P R <seq> instead of <seq> TPPAGPDVGPR <seq>.
This forces the tokenizer to treat each amino acid as a separate token instead of chunking the sequence, which might improve accuracy. However, I’m currently training another version without this separation to test this hypothesis.
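You can see the difference directly by tokenizing both forms (a quick check using the tokenizer from the Inference section; the exact tokens depend on the Llama vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kittawere/Llama-KW-CwP-V1-3B-separated")
# Without spaces, the BPE vocabulary merges several residues into single tokens
print(tokenizer.tokenize("TPPAGPDVGPR"))
# With spaces, each residue is tokenized on its own
print(tokenizer.tokenize("T P P A G P D V G P R"))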
Limitations
Since each amino acid is a separate token, long proteins can overflow the context window and lose part of the sequence (a simple guard is sketched below).
Proteins are not exclusive to a single kingdom, but... hey, this is just a test to check that an LLM can understand proteins.
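A minimal way to catch the first limitation before generating, assuming the tokenizer and model from the Inference section (the context size is read from the model config):

def fits_context(tokenizer, model, prompt, reserve=128):
    # Count prompt tokens and keep a budget for the generated answer
    n_tokens = len(tokenizer(prompt)["input_ids"])
    max_len = getattr(model.config, "max_position_embeddings", 2048)
    return n_tokens + reserve <= max_len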
License
Apache 2.0 (for the LoRA; for the rest of the weights, refer to Meta's Llama documentation)
Inference
For now I run it with:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kittawere/Llama-KW-CwP-V1-3B-separated"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load model and place it on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # automatically puts the model on GPU(s)/CPU
)

def inference(model, tokenizer, prompt):
    # Wrap the prompt in the chat template the model was fine-tuned with
    messages = [{"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
    )
    # Decode only the newly generated tokens, not the prompt
    generated_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(generated_tokens, skip_special_tokens=True)

prompt = "<seq> T P P A G P D V G P R <seq> What is the taxonomic classification of the protein?"
print(inference(model, tokenizer, prompt))
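If loading the repo directly with AutoModelForCausalLM does not resolve the LoRA adapter in your transformers version, going through peft should also work (an untested sketch, assuming the repo contains a standard PEFT adapter):

from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # the base Llama weights are fetched automatically
)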
Want to help me? Reach out or join my Discord!