Kittawere's Chat with Protein version 1 (no-separation) - 3B param model (LoRA)

👋 Hi there! This is the second release in my project to chat with proteins.

In this new version, amino acids are no longer space-separated. You simply input the raw protein sequence as a continuous string of letters.

What Does This Model Do?

✅ Predicts the taxonomic kingdom of a eukaryotic protein, classifying it as:

  • Plant (Viridiplantae)
  • Animal (Metazoa)
  • Fungus (Fungi)

Why Remove Amino Acid Separation?

My first model, V1-separated, used space-separated amino acids (e.g. <seq> T P P A G P D V G P R <seq>) to force the tokenizer to treat each residue as a separate token.

However, further testing showed:

  • No significant accuracy difference. The new model reaches ~80.57% accuracy versus ~79.2% for the separated version, a gap that is not statistically significant on the held-out test set.
  • Non-separated sequences use fewer tokens and avoid context-length issues for longer proteins (see the token-count sketch below).

So my original rationale for separation turned out to be unnecessary for classification. However, I still plan to explore separated inputs in future work on protein generation, where residue-level control might be beneficial.
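
If you want to verify the token savings yourself, a quick check with the model's tokenizer looks like this (exact counts depend on the underlying Llama tokenizer's vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kittawere/Llama-KW-CwP-V1-3B-notseparated")

seq = "TPPAGPDVGPR"
separated = " ".join(seq)  # "T P P A G P D V G P R"

# Space-separated input forces roughly one token per residue;
# the raw string lets the tokenizer merge runs of letters into fewer tokens.
print(len(tokenizer(separated)["input_ids"]))
print(len(tokenizer(seq)["input_ids"]))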

Training Data

  • Dataset: the entire Swiss-Prot database
  • Data processing:
    • Balanced sampling of animal, plant, and fungal proteins
    • 80% training / 20% testing split
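
For reference, here is a minimal sketch of how such a balanced 80/20 split could be produced. The pandas/scikit-learn pipeline and column names are my illustration, not the actual preprocessing code:

import random

import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for records parsed from Swiss-Prot (real data would come from the database)
random.seed(0)
kingdoms = ["Metazoa", "Viridiplantae", "Fungi"]
df = pd.DataFrame({
    "sequence": ["".join(random.choices("ACDEFGHIKLMNPQRSTVWY", k=60)) for _ in range(300)],
    "kingdom": [random.choice(kingdoms) for _ in range(300)],
})

# Downsample every kingdom to the size of the smallest class
n = df["kingdom"].value_counts().min()
balanced = df.groupby("kingdom").sample(n=n, random_state=42)

# 80% training / 20% testing, stratified so all three kingdoms appear in both splits
train_df, test_df = train_test_split(
    balanced, test_size=0.2, stratify=balanced["kingdom"], random_state=42
)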

Performance

  • Accuracy: ~80.57% on the held-out test set
  • Baseline (random guess over three balanced classes): ~33%

This demonstrates that LLMs can work directly with protein sequences in a natural-language context.
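
That accuracy can be reproduced with a simple loop over the held-out set, comparing the model's free-text answer against the gold kingdom. This is an illustrative sketch: inference is the helper defined in the Inference section below, and test_pairs is a hypothetical list of (sequence, kingdom) tuples:

# Hypothetical: test_pairs = [("TPPAGPDVGPR", "Metazoa"), ...]
def evaluate(model, tokenizer, test_pairs):
    correct = 0
    for seq, kingdom in test_pairs:
        prompt = f"<seq> {seq} <seq> What is the taxonomic classification of the protein?"
        answer = inference(model, tokenizer, prompt)
        # Count a hit if the gold kingdom name appears in the answer
        correct += kingdom.lower() in answer.lower()
    return correct / len(test_pairs)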

Input Format

Now you can simply paste the raw amino acid sequence into your prompt:

Example input:

<seq> TPPAGPDVGPR <seq> What is the taxonomic classification of the protein?
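
If you are scripting many queries, a tiny helper (my own convenience function, not part of the model card) keeps the format consistent:

def build_prompt(sequence: str) -> str:
    # Wrap the raw sequence in the <seq> markers used during training
    return f"<seq> {sequence} <seq> What is the taxonomic classification of the protein?"

print(build_prompt("TPPAGPDVGPR"))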

Limitations

  • Taxonomic predictions remain approximate; closely related proteins may be shared across kingdoms.
  • The model's context window limits very long sequences (see the check after this list).
  • Model is trained only on eukaryotic sequences (plants, animals, fungi).
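
Before prompting with a very long protein, you can check whether it fits in the context window. A small pre-flight sketch (max_position_embeddings is the standard field on Llama-family configs; the fallback value and the helper itself are my assumptions):

def fits_in_context(tokenizer, model, prompt, reserve=128):
    # Assumed fallback if the config field is missing
    max_ctx = getattr(model.config, "max_position_embeddings", 4096)
    n_tokens = len(tokenizer(prompt)["input_ids"])
    # Leave headroom for the generated answer (max_new_tokens=128 below)
    return n_tokens + reserve <= max_ctx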

License

Apache 2.0 for the LoRA adapter; refer to Meta's license for the base Llama weights.

Inference

Example usage:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kittawere/Llama-KW-CwP-V1-3B-notseparated"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load model (the repo is a LoRA adapter, so the `peft` package must be installed;
# transformers will pull the base weights and apply the adapter automatically)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # automatically places the model on available GPU(s)/CPU
)

prompt = "<seq> TPPAGPDVGPR <seq> What is the taxonomic classification of the protein?"

def inference(model, tokenizer, prompt):
    messages = [{"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
    )

    # Get only the newly generated tokens
    generated_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    answer = tokenizer.decode(generated_tokens, skip_special_tokens=True)

    return answer

print(inference(model, tokenizer, prompt))

Want to help? Reach out to me or join my Discord.
