# LLaDA-8B-BioGRID-BioPAX
This repository contains a specialized LoRA adapter for GSAI-ML/LLaDA-8B-Instruct, fine-tuned by Proximile LLC for protein interaction network prediction using the BioPAX format. This adapter combines LLaDA's diffusion-based generation with comprehensive biological knowledge from BioGRID, UniProt, and AlphaFold databases.

## 🧬 Model Description
LLaDA-8B-BioGRID-BioPAX is a LoRA (Low-Rank Adaptation) adapter that specializes the base LLaDA model for predicting and completing protein interaction networks. The adapter enables the model to understand both sequence-level and structural characteristics of proteins while maintaining LLaDA's iterative denoising process to generate biologically plausible protein networks in compressed BioPAX format.

### Key Capabilities
- Sequence-Aware Network Prediction: Generate complete interaction networks from protein lists with sequence/structure context
- Structure-Guided Network Completion: Complete partial networks using structural compatibility information
- New Protein Integration: Predict interactions for novel proteins based on sequence similarity and structural features
- Multi-Modal Biological Reasoning: Combine interaction patterns with sequence and structural data
- BioPAX Format Generation: Output structured biological pathway data in compressed BioPAX XML

## 🚀 Quick Start

### Installation

```bash
pip install transformers peft torch bitsandbytes
```

### Basic Usage
```python
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
import torch

# Load base model and tokenizer
base_model_name = "GSAI-ML/LLaDA-8B-Instruct"
adapter_name = "Proximile/LLaDA-8B-BioGRID-BioPAX"

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
base_model = AutoModel.from_pretrained(base_model_name, trust_remote_code=True, device_map="auto")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, adapter_name)

# Example: predict a protein interaction network
messages = [
    {
        "role": "system",
        "content": "You are a protein interaction prediction system. Given a list of proteins with their sequence and structural information, predict all likely interactions between them in compressed BioPAX format."
    },
    {
        "role": "user",
        "content": """Predict the protein interaction network for these proteins:
PROTEIN: TP53
UniProt ID: P04637
Full Name: Tumor protein p53
Organism: Homo sapiens
Sequence Length: 393 amino acids
AlphaFold Structure: Available
Function: Tumor suppressor that prevents cancer formation
PROTEIN: MDM2
UniProt ID: Q00987
Full Name: E3 ubiquitin-protein ligase Mdm2
Organism: Homo sapiens
Sequence Length: 491 amino acids
AlphaFold Structure: Available
Function: Regulates p53 tumor suppressor"""
    }
]

# Generate the network prediction using LLaDA's diffusion process
# (an implementation of generate() is required - see the complete example below)
```
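
The install line above includes `bitsandbytes`, which the basic example does not use. If GPU memory is tight, the base model can typically be loaded in 4-bit before attaching the adapter. This is a minimal sketch using the standard `transformers` quantization API; it has not been verified against this specific adapter, so compare outputs with the full-precision model before relying on it:

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig

# Assumption: 4-bit NF4 quantization is compatible with this remote-code model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModel.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    trust_remote_code=True,
    device_map="auto",
    quantization_config=bnb_config,
)
```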

## 🔬 Training Details

### Base Model
- Architecture: LLaDA (Large Language Diffusion with mAsking)
- Base Model: GSAI-ML/LLaDA-8B-Instruct
- Parameters: 8.02B (base model)
- Adapter Type: LoRA (Low-Rank Adaptation)

### LoRA Configuration
- Method: Supervised Fine-Tuning (SFT) with LoRA
- LoRA Settings (see the illustrative `peft` config after this list):
  - Rank (r): 256 (16 × 16 multiplier)
  - Alpha: 512 (2 × rank, i.e., an alpha/r ratio of 2)
  - Target Modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- Training Data: BioGRID-Conv dataset with 5,000+ protein neighborhoods
- Context Length: Up to 1,024 tokens (context) + 512 tokens (generation)
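
For reference, the settings above correspond to a `peft` configuration along these lines. This is an illustrative reconstruction, not the actual training script; in particular, the `lora_dropout` value and `task_type` are assumptions that the training details above do not document:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=256,                 # rank, as listed above
    lora_alpha=512,        # alpha/r ratio of 2
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,     # assumption: not stated in the training details
    task_type="CAUSAL_LM", # assumption: peft task type used for LLaDA
)
```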

### Data Sources
- BioGRID 4.4.246: 2.8M+ protein/genetic interactions from 86K+ publications
- UniProt: Protein sequences, functional annotations, organism data
- AlphaFold: AI-predicted protein structures, confidence scores

## 💻 Complete Generation Example
```python
import torch
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel

# LLaDA's reserved mask token id
MASK_TOKEN_ID = 126336

def add_gumbel_noise(logits, temperature):
    """Add Gumbel noise for categorical sampling in diffusion models."""
    if temperature <= 0:
        return logits
    logits = logits.to(torch.float64)
    noise = torch.rand_like(logits, dtype=torch.float64)
    gumbel_noise = (- torch.log(noise)) ** temperature
    return logits.exp() / gumbel_noise

def get_num_transfer_tokens(mask_index, steps):
    """Compute how many masked tokens to commit at each denoising step."""
    mask_num = mask_index.sum(dim=1, keepdim=True)
    if steps == 0:
        steps = 1
    base = mask_num // steps
    remainder = mask_num % steps
    num_transfer_tokens = torch.zeros(mask_num.size(0), steps, device=mask_index.device, dtype=torch.int64) + base
    # Spread any remainder over the earliest steps
    for i in range(mask_num.size(0)):
        if remainder[i] > 0:
            num_transfer_tokens[i, :remainder[i]] += 1
    return num_transfer_tokens

def generate(model, prompt, steps=128, gen_length=128, block_length=32, temperature=0.,
             remasking='low_confidence', mask_id=MASK_TOKEN_ID):
    """Generate text using LLaDA's block-wise diffusion (iterative denoising) process."""
    device = next(model.parameters()).device
    prompt = prompt.to(device)

    # Start from the prompt followed by a fully masked completion
    x = torch.full((1, prompt.shape[1] + gen_length), mask_id, dtype=torch.long).to(device)
    x[:, :prompt.shape[1]] = prompt.clone()

    # Denoise the completion block by block, left to right
    assert gen_length % block_length == 0
    num_blocks = gen_length // block_length
    assert steps % num_blocks == 0
    steps_per_block = steps // num_blocks

    for num_block in range(num_blocks):
        block_start = prompt.shape[1] + num_block * block_length
        block_end = prompt.shape[1] + (num_block + 1) * block_length
        block_mask_index = (x[:, block_start:block_end] == mask_id)
        num_transfer_tokens = get_num_transfer_tokens(block_mask_index, steps_per_block)

        for i in range(steps_per_block):
            mask_index = (x == mask_id)
            if not mask_index.any():
                break
            outputs = model(x)
            logits = outputs.logits

            # Sample a candidate token for every position
            logits_with_noise = add_gumbel_noise(logits, temperature=temperature)
            x0 = torch.argmax(logits_with_noise, dim=-1)

            # Score candidates to decide which masked positions to commit this step
            if remasking == 'low_confidence':
                p = torch.nn.functional.softmax(logits.to(torch.float64), dim=-1)
                x0_p = torch.squeeze(
                    torch.gather(p, dim=-1, index=torch.unsqueeze(x0, -1)), -1)
            elif remasking == 'random':
                x0_p = torch.rand((x0.shape[0], x0.shape[1]), device=x0.device)
            else:
                raise NotImplementedError(remasking)

            # Never commit tokens beyond the current block
            x0_p[:, block_end:] = -float('inf')
            x0 = torch.where(mask_index, x0, x)
            confidence = torch.where(mask_index, x0_p, -float('inf'))

            # Unmask the highest-confidence positions for this step
            transfer_index = torch.zeros_like(x0, dtype=torch.bool, device=x0.device)
            for j in range(confidence.shape[0]):
                _, select_index = torch.topk(confidence[j], k=num_transfer_tokens[j, i])
                transfer_index[j, select_index] = True
            x[transfer_index] = x0[transfer_index]
    return x

def predict_protein_network(model, tokenizer, messages, temperature=0.1, gen_length=512, steps=128):
    """Generate a protein network prediction from chat-formatted messages."""
    formatted_input = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    input_ids = tokenizer(formatted_input, return_tensors="pt")["input_ids"]

    with torch.no_grad():
        output_ids = generate(
            model,
            input_ids,
            steps=steps,
            gen_length=gen_length,
            block_length=32,
            temperature=temperature,
            remasking='low_confidence'
        )

    # Decode only the completion and truncate at the first special-token marker
    generated_text = tokenizer.decode(
        output_ids[0, input_ids.shape[1]:], skip_special_tokens=False
    ).split("<|")[0]
    return generated_text

# Load model
base_model_name = "GSAI-ML/LLaDA-8B-Instruct"
adapter_name = "Proximile/LLaDA-8B-BioGRID-BioPAX"

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
base_model = AutoModel.from_pretrained(base_model_name, trust_remote_code=True, device_map="auto")
model = PeftModel.from_pretrained(base_model, adapter_name)

# Example prediction
messages = [
    {
        "role": "user",
        "content": """Predict the protein interaction network for these proteins in compressed BioPAX format:
PROTEIN: TP53
UniProt ID: P04637
Full Name: Tumor protein p53
Organism: Homo sapiens
Sequence Length: 393 amino acids
AlphaFold Structure: Available
PROTEIN: MDM2
UniProt ID: Q00987
Full Name: E3 ubiquitin-protein ligase Mdm2
Organism: Homo sapiens
Sequence Length: 491 amino acids
AlphaFold Structure: Available"""
    }
]

result = predict_protein_network(model, tokenizer, messages)
print("Predicted Network:")
print(result)
```
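
For intuition, the defaults used by `predict_protein_network` imply the following denoising schedule, using the same arithmetic as `generate`:

```python
gen_length, block_length, steps = 512, 32, 128

num_blocks = gen_length // block_length            # 16 blocks of 32 masked tokens
steps_per_block = steps // num_blocks              # 8 denoising steps per block
tokens_per_step = block_length // steps_per_block  # 4 tokens committed per step
```

In other words, each of the 128 forward passes commits the 4 highest-confidence tokens inside the active 32-token block, so a 512-token completion costs 128 model calls rather than 512.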

## 📊 BioPAX Output Format
The model generates protein networks in compressed BioPAX format:
```xml
<biopax>
  <proteins>
    <p id="tp53" name="TP53" uniprot="P04637" fullname="Tumor protein p53"/>
    <p id="mdm2" name="MDM2" uniprot="Q00987" fullname="E3 ubiquitin-protein ligase Mdm2"/>
  </proteins>
  <interactions>
    <i id="1" a="tp53" b="mdm2" type="Affinity Capture-Western"/>
    <i id="2" a="tp53" b="mdm2" type="Biochemical Activity"/>
  </interactions>
</biopax>
```
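
Because the compressed format is plain XML, well-formed outputs can be parsed with the Python standard library. A minimal sketch (assuming the completion has already been truncated to the `<biopax>` element, as `predict_protein_network` does; malformed generations will raise a `ParseError`):

```python
import xml.etree.ElementTree as ET

def parse_compressed_biopax(xml_text: str) -> dict:
    """Turn a compressed BioPAX snippet into plain Python dicts."""
    root = ET.fromstring(xml_text.strip())
    return {
        "proteins": [p.attrib for p in root.iter("p")],
        "interactions": [i.attrib for i in root.iter("i")],
    }

# With the snippet above:
# parse_compressed_biopax(result)["interactions"]
# -> [{'id': '1', 'a': 'tp53', 'b': 'mdm2', 'type': 'Affinity Capture-Western'}, ...]
```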

## 🧪 Supported Task Types
- Complete Network Prediction: Generate full interaction networks from protein lists
- New Protein Integration: Predict interactions for new proteins in existing networks
- Partial Network Completion: Fill in missing interactions in incomplete networks (see the example prompt after this list)
- Property-Constrained Generation: Generate networks meeting specific biological constraints
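
As an illustration of partial network completion, a prompt might look like the following. The exact prompt schema the adapter expects for this task is an assumption here, extrapolated from the prediction examples above:

```python
# Hypothetical partial-completion prompt; verify the format against your own results
messages = [
    {
        "role": "user",
        "content": """Complete this partial protein interaction network in compressed BioPAX format:
<biopax>
  <proteins>
    <p id="tp53" name="TP53" uniprot="P04637" fullname="Tumor protein p53"/>
    <p id="mdm2" name="MDM2" uniprot="Q00987" fullname="E3 ubiquitin-protein ligase Mdm2"/>
  </proteins>
  <interactions>
    <i id="1" a="tp53" b="mdm2" type="Affinity Capture-Western"/>
  </interactions>
</biopax>"""
    }
]

result = predict_protein_network(model, tokenizer, messages)
```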

## ⚠️ Limitations
- Diffusion-Based Generation: LLaDA's iterative denoising requires custom sampling code (such as `generate()` above) and may behave differently from standard autoregressive decoding
- BioPAX Format Specificity: the model emits the compressed XML schema shown above, a simplification of full BioPAX; outputs may need conversion or validation before use with standard BioPAX tooling
- Biological Accuracy: Predictions are based on training data patterns and may not reflect all biological realities
- Computational Requirements: Diffusion generation requires more compute than standard inference

## 📝 Citation
If you use this model in your research, please cite:
```bibtex
@misc{llada-8b-biogrid-biopax,
  author       = {Proximile LLC},
  title        = {LLaDA-8B-BioGRID-BioPAX: LoRA Adapter for Diffusion-Based Protein Interaction Network Prediction},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Proximile/LLaDA-8B-BioGRID-BioPAX}}
}
```
Also cite the original LLaDA paper and BioGRID database.
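
For the base model, an entry along these lines may be used as a starting point; verify the fields against the arXiv record before citing:

```bibtex
@misc{nie2025llada,
  author        = {Nie, Shen and others},
  title         = {Large Language Diffusion Models},
  year          = {2025},
  eprint        = {2502.09992},
  archivePrefix = {arXiv}
}
```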

## 🏢 About Proximile LLC
Proximile LLC provides secure, cost-effective, and private AI solutions tailored to small and medium-sized businesses. We specialize in:
- On-premise AI inference solutions that ensure unparalleled privacy
- Cost-effective hardware configurations including specialized bioinformatics workstations
- Secure Local AI applications for life sciences, including protein analysis and drug discovery tools
- Specialized services for compliance & governance in regulated industries
Visit proximile.llc to learn more about our secure, local AI solutions for your business.

## 🔄 Model Updates
- June 16, 2025: Initial LoRA adapter release with BioGRID 4.4.246 training data
- Enhanced with UniProt and AlphaFold integration for comprehensive protein context

## 📄 License
This LoRA adapter is released under the same license as the base LLaDA model.