# Kittawere's Chat with Protein version 1 (no-separation) - 3B param model (LoRA)
👋 Hi there! This is the second release in my project to chat with proteins.
In this new version, amino acids are no longer space-separated. You simply input the raw protein sequence as a continuous string of letters.
## What Does This Model Do?
✅ Predicts the phylogeny of a eukaryotic protein, classifying it as:
- Plant (Viridiplantae)
- Animal (Metazoa)
- Fungus (Fungi)
## Why Remove Amino Acid Separation?
My first model, V1-separated, used space-separated amino acids (e.g. `<seq> T P P A G P D V G P R <seq>`) to force the tokenizer to treat each residue as a separate token.
However, further testing showed:
- No significant accuracy difference: the new model achieves ~80.57% accuracy versus ~79.2% for the separated version, a gap that is not statistically significant on the held-out test set.
- Non-separated sequences use fewer tokens and avoid context-length issues for longer proteins.
So the separation I originally thought was needed turns out to be unnecessary for classification; a token-count comparison is sketched below. I still plan to explore separated inputs in future work on protein generation, where residue-level control might be beneficial.
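For reference, here is a minimal sketch of the token-count difference between the two input formats. It assumes the tokenizer shipped with this repo, but any Llama-style BPE tokenizer shows the same trend:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kittawere/Llama-KW-CwP-V1-3B-notseparated")

spaced = "T P P A G P D V G P R"
raw = "TPPAGPDVGPR"

# The spaced form yields roughly one token per residue; the raw form lets the
# BPE vocabulary merge runs of letters, so it usually needs fewer tokens.
print(len(tokenizer.tokenize(spaced)), len(tokenizer.tokenize(raw)))
```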
## Training Data
- Dataset: entire Swiss-Prot database
- Data processing (a rough sketch follows this list):
  - Balanced samples of animals, plants, and fungi
  - 80% training / 20% testing split
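The exact preprocessing script is not included here; the following is only a rough sketch of the steps described above, assuming a pandas DataFrame `df` with `sequence` and `kingdom` columns extracted from Swiss-Prot:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def balance_and_split(df: pd.DataFrame, seed: int = 42):
    # Downsample every kingdom to the size of the smallest one
    n_per_class = df["kingdom"].value_counts().min()
    balanced = df.groupby("kingdom").sample(n=n_per_class, random_state=seed)
    # 80% training / 20% testing, stratified by kingdom
    return train_test_split(
        balanced, test_size=0.2, stratify=balanced["kingdom"], random_state=seed
    )
```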
## Performance
- Accuracy: ~80.57% on the held-out test set
- Baseline (random guess over three balanced classes): ~33%
This demonstrates that LLMs can work directly with protein sequences in a natural language context.
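As a rough illustration of how such an accuracy figure could be checked (not the actual evaluation script), assuming a hypothetical `predict(sequence)` wrapper around the inference code shown below and a `test_set` of (sequence, kingdom) pairs from the 20% held-out split:

```python
# `predict` and `test_set` are placeholders, not part of this repo.
correct = sum(predict(seq) == kingdom for seq, kingdom in test_set)
print(f"accuracy: {correct / len(test_set):.2%}")
```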
## Input Format
Now you can simply paste the raw amino acid sequence into your prompt.

Example input:

`<seq> TPPAGPDVGPR <seq> What is the taxonomic classification of the protein?`
## Limitations
- Phylogenetic predictions remain approximate; proteins may be shared across kingdoms.
- Very long sequences may exceed the model's context window (a simple length check is sketched after this list).
- The model is trained only on eukaryotic sequences (plants, animals, fungi).
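A minimal way to guard against over-long inputs, assuming the tokenizer from the Inference section below and a hypothetical `max_tokens` budget for the base model's context window:

```python
def fits_in_context(sequence: str, tokenizer, max_tokens: int = 2048) -> bool:
    # Build the same prompt used for classification and count its tokens.
    prompt = f"<seq> {sequence} <seq> What is the taxonomic classification of the protein?"
    return len(tokenizer(prompt)["input_ids"]) <= max_tokens
```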
## License
Apache 2.0 (for the LoRA adapter); refer to Meta's license for the base weights.
## Inference
Example usage:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kittawere/Llama-KW-CwP-V1-3B-notseparated"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load model and place it on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # automatically puts the model on GPU(s)/CPU
)

prompt = "<seq> TPPAGPDVGPR <seq> What is the taxonomic classification of the protein?"

def inference(model, tokenizer, prompt):
    messages = [{"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
    )
    # Keep only the newly generated tokens (drop the prompt)
    generated_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    answer = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    return answer

print(inference(model, tokenizer, prompt))
```
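The same function can be reused over several sequences in a loop; the sequences below are purely illustrative:

```python
sequences = ["TPPAGPDVGPR", "MKTAYIAKQR"]  # illustrative inputs, not from the dataset
for seq in sequences:
    q = f"<seq> {seq} <seq> What is the taxonomic classification of the protein?"
    print(seq, "->", inference(model, tokenizer, q))
```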
Want to help me? Reach out or join my Discord.