# Kittawere's Chat with Protein version 1 (no-separation) - 3B param model (LoRA)
👋 Hi there! This is the second release in my project to chat with proteins.
In this new version, amino acids are no longer space-separated. You simply input the raw protein sequence as a continuous string of letters.
## What Does This Model Do?
✅ Predicts the phylogeny of a eukaryotic protein, classifying it as:
- Plant (Viridiplantae)
- Animal (Metazoa)
- Fungus (Fungi)
## Why Remove Amino Acid Separation?
My first model, V1-separated, used space-separated amino acids (e.g. `<seq> T P P A G P D V G P R <seq>`) to force the tokenizer to treat each residue as a separate token.
However, further testing showed:
- No significant accuracy difference: the new model achieves ~80.57% accuracy versus ~79.2% for the separated version, a gap that is not statistically significant on the held-out test set.
- Non-separated sequences use fewer tokens and avoid context-length issues for longer proteins.
So the separation I originally thought was needed turns out to be unnecessary for classification; a token-count comparison is sketched below. I still plan to explore separated inputs in future work on protein generation, where residue-level control might be beneficial.
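For reference, here is a minimal sketch of the token-count difference between the two input formats. It assumes the tokenizer shipped with this repo, but any Llama-style BPE tokenizer shows the same trend:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kittawere/Llama-KW-CwP-V1-3B-notseparated")

spaced = "T P P A G P D V G P R"
raw = "TPPAGPDVGPR"

# The spaced form yields roughly one token per residue; the raw form lets the
# BPE vocabulary merge runs of letters, so it usually needs fewer tokens.
print(len(tokenizer.tokenize(spaced)), len(tokenizer.tokenize(raw)))
```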
## Training Data
- Dataset: entire Swiss-Prot database
- Data processing (a rough sketch follows this list):
  - Balanced samples of animals, plants, and fungi
  - 80% training / 20% testing split
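The exact preprocessing script is not included here; the following is only a rough sketch of the steps described above, assuming a pandas DataFrame `df` with `sequence` and `kingdom` columns extracted from Swiss-Prot:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def balance_and_split(df: pd.DataFrame, seed: int = 42):
    # Downsample every kingdom to the size of the smallest one
    n_per_class = df["kingdom"].value_counts().min()
    balanced = df.groupby("kingdom").sample(n=n_per_class, random_state=seed)
    # 80% training / 20% testing, stratified by kingdom
    return train_test_split(
        balanced, test_size=0.2, stratify=balanced["kingdom"], random_state=seed
    )
```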
## Performance
- Accuracy: ~80.57% on the held-out test set
- Baseline (random guess over three balanced classes): ~33%
This demonstrates that LLMs can work directly with protein sequences in a natural language context.
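As a rough illustration of how such an accuracy figure could be checked (not the actual evaluation script), assuming a hypothetical `predict(sequence)` wrapper around the inference code shown below and a `test_set` of (sequence, kingdom) pairs from the 20% held-out split:

```python
# `predict` and `test_set` are placeholders, not part of this repo.
correct = sum(predict(seq) == kingdom for seq, kingdom in test_set)
print(f"accuracy: {correct / len(test_set):.2%}")
```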
## Input Format
Now you can simply paste the raw amino acid sequence into your prompt.

Example input:

`<seq> TPPAGPDVGPR <seq> What is the taxonomic classification of the protein?`
## Limitations
- Phylogenetic predictions remain approximate; proteins may be shared across kingdoms.
- Very long sequences may exceed the model's context window (a simple length check is sketched after this list).
- The model is trained only on eukaryotic sequences (plants, animals, fungi).
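A minimal way to guard against over-long inputs, assuming the tokenizer from the Inference section below and a hypothetical `max_tokens` budget for the base model's context window:

```python
def fits_in_context(sequence: str, tokenizer, max_tokens: int = 2048) -> bool:
    # Build the same prompt used for classification and count its tokens.
    prompt = f"<seq> {sequence} <seq> What is the taxonomic classification of the protein?"
    return len(tokenizer(prompt)["input_ids"]) <= max_tokens
```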
## License
Apache 2.0 (for the LoRA adapter); refer to Meta's license for the base weights.
## Inference
Example usage:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kittawere/Llama-KW-CwP-V1-3B-notseparated"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load model and place it on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # automatically puts the model on GPU(s)/CPU
)

prompt = "<seq> TPPAGPDVGPR <seq> What is the taxonomic classification of the protein?"

def inference(model, tokenizer, prompt):
    messages = [{"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
    )
    # Keep only the newly generated tokens (drop the prompt)
    generated_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    answer = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    return answer

print(inference(model, tokenizer, prompt))
```
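The same function can be reused over several sequences in a loop; the sequences below are purely illustrative:

```python
sequences = ["TPPAGPDVGPR", "MKTAYIAKQR"]  # illustrative inputs, not from the dataset
for seq in sequences:
    q = f"<seq> {seq} <seq> What is the taxonomic classification of the protein?"
    print(seq, "->", inference(model, tokenizer, q))
```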
Want to help me? Reach out or join my Discord.