LLaDA-8B-BioGRID-BioPAX

This repository contains a specialized LoRA adapter for GSAI-ML/LLaDA-8B-Instruct, fine-tuned by Proximile LLC for protein interaction network prediction using the BioPAX format. This adapter combines LLaDA's diffusion-based generation with comprehensive biological knowledge from BioGRID, UniProt, and AlphaFold databases.

🧬 Model Description

LLaDA-8B-BioGRID-BioPAX is a LoRA (Low-Rank Adaptation) adapter that specializes the base LLaDA model for predicting and completing protein interaction networks. The adapter enables the model to understand both sequence-level and structural characteristics of proteins while maintaining LLaDA's iterative denoising process to generate biologically plausible protein networks in compressed BioPAX format.

Key Capabilities

Sequence-Aware Network Prediction: Generate complete interaction networks from protein lists with sequence/structure context
Structure-Guided Network Completion: Complete partial networks using structural compatibility information
New Protein Integration: Predict interactions for novel proteins based on sequence similarity and structural features
Multi-Modal Biological Reasoning: Combine interaction patterns with sequence and structural data
BioPAX Format Generation: Output structured biological pathway data in compressed BioPAX XML

🚀 Quick Start

Installation

pip install transformers peft torch bitsandbytes

Basic Usage

from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
import torch

# Load base model and tokenizer
base_model_name = "GSAI-ML/LLaDA-8B-Instruct"
adapter_name = "Proximile/LLaDA-8B-BioGRID-BioPAX"

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
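base_model = AutoModel.from_pretrained(
    base_model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # assumption: bf16 roughly halves memory versus the float32 default; drop if unsupported on your hardware
    device_map="auto",
)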

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, adapter_name)

# Example: Predict protein network
messages = [
    {
        "role": "system",
        "content": "You are a protein interaction prediction system. Given a list of proteins with their sequence and structural information, predict all likely interactions between them in compressed BioPAX format."
    },
    {
        "role": "user",
        "content": """Predict the protein interaction network for these proteins:

PROTEIN: TP53
  UniProt ID: P04637
  Full Name: Tumor protein p53
  Organism: Homo sapiens
  Sequence Length: 393 amino acids
  AlphaFold Structure: Available
  Function: Tumor suppressor that prevents cancer formation

PROTEIN: MDM2
  UniProt ID: Q00987
  Full Name: E3 ubiquitin-protein ligase Mdm2
  Organism: Homo sapiens
  Sequence Length: 491 amino acids
  AlphaFold Structure: Available
  Function: Regulates p53 tumor suppressor"""
    }
]

# Generate the network prediction using LLaDA's diffusion process.
# This requires the generate() function defined in the Complete Generation Example section below.

🔬 Training Details

Base Model

Architecture: LLaDA (Large Language Diffusion with mAsking)
Base Model: GSAI-ML/LLaDA-8B-Instruct
Parameters: 8.02B (base model)
Adapter Type: LoRA (Low-Rank Adaptation)

LoRA Configuration

Method: Supervised Fine-Tuning (SFT) with LoRA
LoRA Settings:
- Rank (r): 256 (16 with a 16× multiplier)
- Alpha: 512 (2 × rank, i.e. an alpha/r ratio of 2)
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Training Data: BioGRID-Conv dataset with 5,000+ protein neighborhoods
Context Length: Up to 1,024 tokens (context) + 512 tokens (generation)
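
To make the LoRA settings above concrete, the sketch below computes the scaling factor that standard (non-rsLoRA) LoRA applies to the low-rank update, and the number of trainable parameters one adapted linear layer adds. The function names are illustrative, and the 4096 × 4096 layer shape is an assumption chosen for the example, not a figure from this model card; substitute the actual projection shapes of the base model.

```python
def lora_scaling(alpha: int, r: int) -> float:
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


def lora_params_per_layer(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added to one linear layer.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * (d_in + d_out)


R, ALPHA = 256, 512  # values from the LoRA settings listed above

print("scaling:", lora_scaling(ALPHA, R))
# Assumed layer shape, for illustration only
print("params for a 4096x4096 layer:", lora_params_per_layer(4096, 4096, R))
```

A rank of 256 is large for LoRA, so the adapter is correspondingly heavier than a typical low-rank (r = 8 to 64) adapter on the same target modules.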

Data Sources

BioGRID 4.4.246: 2.8M+ protein/genetic interactions from 86K+ publications
UniProt: Protein sequences, functional annotations, organism data
AlphaFold: AI-predicted protein structures, confidence scores

💻 Complete Generation Example

import torch
import json
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel

# Constants for LLaDA generation
MASK_TOKEN_ID = 126336

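def add_gumbel_noise(logits, temperature):
    """Perturb logits for Gumbel-max sampling.

    Returns exp(logits) / (-log u) ** temperature with u ~ Uniform(0, 1).
    Taking the argmax of the result samples from softmax(logits / temperature).
    With temperature <= 0 the logits are returned unchanged (greedy decoding).
    """
    if temperature <= 0:
        return logits

    # float64 limits precision loss in the exp/log arithmetic
    logits = logits.to(torch.float64)
    noise = torch.rand_like(logits, dtype=torch.float64)
    gumbel_noise = (-torch.log(noise)) ** temperature
    return logits.exp() / gumbel_noise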

def get_num_transfer_tokens(mask_index, steps):
    """Compute tokens to transition at each denoising step."""
    mask_num = mask_index.sum(dim=1, keepdim=True)
    
    if steps == 0:
        steps = 1
        
    base = mask_num // steps
    remainder = mask_num % steps
    
    num_transfer_tokens = torch.zeros(mask_num.size(0), steps, device=mask_index.device, dtype=torch.int64) + base
    
    for i in range(mask_num.size(0)):
        if remainder[i] > 0:
            num_transfer_tokens[i, :remainder[i]] += 1
            
    return num_transfer_tokens

def generate(model, prompt, steps=128, gen_length=128, block_length=32, temperature=0.,
             remasking='low_confidence', mask_id=MASK_TOKEN_ID):
    """Generate text using LLaDA's diffusion-based process."""
    device = next(model.parameters()).device
    prompt = prompt.to(device)
    
    # Start from the prompt followed by gen_length masked positions
    x = torch.full((1, prompt.shape[1] + gen_length), mask_id, dtype=torch.long, device=device)
    x[:, :prompt.shape[1]] = prompt.clone()
    
    assert gen_length % block_length == 0, "gen_length must be a multiple of block_length"
    num_blocks = gen_length // block_length
    
    assert steps % num_blocks == 0, "steps must be a multiple of the number of blocks"
    steps_per_block = steps // num_blocks
    
    for num_block in range(num_blocks):
        # Positions of the current block within the full sequence
        block_start = prompt.shape[1] + num_block * block_length
        block_end = block_start + block_length
        block_mask_index = (x[:, block_start:block_end] == mask_id)
        num_transfer_tokens = get_num_transfer_tokens(block_mask_index, steps_per_block)
        
        for i in range(steps_per_block):
            mask_index = (x == mask_id)
            if not mask_index.any():
                break
                
            outputs = model(x)
            logits = outputs.logits
            
            logits_with_noise = add_gumbel_noise(logits, temperature=temperature)
            x0 = torch.argmax(logits_with_noise, dim=-1)
            
            if remasking == 'low_confidence':
                p = torch.nn.functional.softmax(logits.to(torch.float64), dim=-1)
                x0_p = torch.squeeze(
                    torch.gather(p, dim=-1, index=torch.unsqueeze(x0, -1)), -1)
            elif remasking == 'random':
                x0_p = torch.rand((x0.shape[0], x0.shape[1]), device=x0.device)
            else:
                raise NotImplementedError(remasking)
            
            x0_p[:, prompt.shape[1] + (num_block + 1) * block_length:] = -float('inf')
            
            x0 = torch.where(mask_index, x0, x)
            confidence = torch.where(mask_index, x0_p, -float('inf'))
            
            # Unmask the k most confident predictions for this step
            transfer_index = torch.zeros_like(x0, dtype=torch.bool, device=x0.device)
            for j in range(confidence.shape[0]):
                k = int(num_transfer_tokens[j, i])
                _, select_index = torch.topk(confidence[j], k=k)
                transfer_index[j, select_index] = True
            x[transfer_index] = x0[transfer_index]
    
    return x

def predict_protein_network(model, tokenizer, messages, temperature=0.1, gen_length=512, steps=128):
    """Generate protein network prediction."""
    formatted_input = tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
    )
    
    input_ids = tokenizer(formatted_input, return_tensors="pt")["input_ids"]
    
    with torch.no_grad():
        output_ids = generate(
            model, 
            input_ids, 
            steps=steps,
            gen_length=gen_length,
            block_length=32,
            temperature=temperature,
            remasking='low_confidence'
        )
    
    generated_text = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=False).split("<|")[0]
    return generated_text

# Load model
base_model_name = "GSAI-ML/LLaDA-8B-Instruct"
adapter_name = "Proximile/LLaDA-8B-BioGRID-BioPAX"

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
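base_model = AutoModel.from_pretrained(
    base_model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # assumption: bf16 roughly halves memory versus the float32 default; drop if unsupported on your hardware
    device_map="auto",
)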
model = PeftModel.from_pretrained(base_model, adapter_name)

# Example prediction
messages = [
    {
        "role": "user",
        "content": """Predict the protein interaction network for these proteins in compressed BioPAX format:

PROTEIN: TP53
  UniProt ID: P04637
  Full Name: Tumor protein p53
  Organism: Homo sapiens
  Sequence Length: 393 amino acids
  AlphaFold Structure: Available

PROTEIN: MDM2
  UniProt ID: Q00987
  Full Name: E3 ubiquitin-protein ligase Mdm2
  Organism: Homo sapiens
  Sequence Length: 491 amino acids
  AlphaFold Structure: Available"""
    }
]

result = predict_protein_network(model, tokenizer, messages)
print("Predicted Network:")
print(result)
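
The generate() function above asserts that gen_length is a multiple of block_length and that steps is a multiple of the number of blocks, so an incompatible combination only fails after the model has been loaded. The helper below performs the same checks up front and reports how many tokens are unmasked at each step of a block. The name check_schedule and the returned dictionary are hypothetical, not part of the original code; the per-step counts mirror get_num_transfer_tokens under the assumption that every position in a block starts out masked.

```python
def check_schedule(steps: int, gen_length: int, block_length: int) -> dict:
    """Validate sampling settings and describe the resulting denoising schedule."""
    if steps <= 0 or gen_length <= 0 or block_length <= 0:
        raise ValueError("steps, gen_length and block_length must be positive")
    if gen_length % block_length != 0:
        raise ValueError("gen_length must be a multiple of block_length")
    num_blocks = gen_length // block_length
    if steps % num_blocks != 0:
        raise ValueError("steps must be a multiple of the number of blocks")

    steps_per_block = steps // num_blocks
    # Spread the block's tokens over its steps; earlier steps absorb the remainder
    base, remainder = divmod(block_length, steps_per_block)
    tokens_per_step = [base + 1 if i < remainder else base
                       for i in range(steps_per_block)]
    return {
        "num_blocks": num_blocks,
        "steps_per_block": steps_per_block,
        "tokens_per_step": tokens_per_step,
        # One full forward pass per denoising step, at most
        "max_forward_passes": steps,
    }


# Defaults used by predict_protein_network above
print(check_schedule(steps=128, gen_length=512, block_length=32))
```

With these defaults each of the 16 blocks is filled in 8 steps, 4 tokens at a time. Raising steps lowers the number of tokens committed per step at the cost of more forward passes.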

📊 BioPAX Output Format

The model generates protein networks in compressed BioPAX format:

<biopax>
  <proteins>
    <p id="tp53" name="TP53" uniprot="P04637" fullname="Tumor protein p53"/>
    <p id="mdm2" name="MDM2" uniprot="Q00987" fullname="E3 ubiquitin-protein ligase Mdm2"/>
  </proteins>
  <interactions>
    <i id="1" a="tp53" b="mdm2" type="Affinity Capture-Western"/>
    <i id="2" a="tp53" b="mdm2" type="Biochemical Activity"/>
  </interactions>
</biopax>
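
Because the output is plain XML, it can be checked with Python's standard library before it is used downstream. The sketch below parses the example above and verifies that every interaction refers to a declared protein. The function name parse_compressed_biopax and the returned structure are illustrative; only the element and attribute names (p, i, id, a, b, type) are taken from the example, and actual model output may deviate from it.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<biopax>
  <proteins>
    <p id="tp53" name="TP53" uniprot="P04637" fullname="Tumor protein p53"/>
    <p id="mdm2" name="MDM2" uniprot="Q00987" fullname="E3 ubiquitin-protein ligase Mdm2"/>
  </proteins>
  <interactions>
    <i id="1" a="tp53" b="mdm2" type="Affinity Capture-Western"/>
    <i id="2" a="tp53" b="mdm2" type="Biochemical Activity"/>
  </interactions>
</biopax>"""


def parse_compressed_biopax(xml_text: str):
    """Return (proteins, interactions) or raise if the XML is malformed or inconsistent."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    if root.tag != "biopax":
        raise ValueError(f"unexpected root element: {root.tag}")

    # Map protein id -> attribute dict
    proteins = {p.get("id"): dict(p.attrib) for p in root.findall("./proteins/p")}

    interactions = []
    for node in root.findall("./interactions/i"):
        a, b = node.get("a"), node.get("b")
        if a not in proteins or b not in proteins:
            raise ValueError(f"interaction {node.get('id')} references an undeclared protein")
        interactions.append((a, b, node.get("type")))
    return proteins, interactions


proteins, interactions = parse_compressed_biopax(SAMPLE)
for a, b, kind in interactions:
    print(f"{proteins[a]['name']} -- {proteins[b]['name']}: {kind}")
```

Generated text that is truncated at gen_length will usually fail at ET.fromstring, which makes this a cheap first filter before any biological validation.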

🧪 Supported Task Types

Complete Network Prediction: Generate full interaction networks from protein lists
New Protein Integration: Predict interactions for new proteins in existing networks
Partial Network Completion: Fill in missing interactions in incomplete networks
Property-Constrained Generation: Generate networks meeting specific biological constraints
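
The usage examples in this card describe each protein in a fixed PROTEIN: layout. A small helper can build that prompt from structured records, as sketched below for the complete network prediction task only; the prompt wording for the other task types is not documented here. The function names, the dictionary keys, and the "Not available" wording for proteins without an AlphaFold structure are assumptions made for illustration.

```python
HEADER = "Predict the protein interaction network for these proteins:"


def format_protein(protein: dict) -> str:
    """Render one protein record in the layout used by the usage examples."""
    lines = [
        f"PROTEIN: {protein['name']}",
        f"  UniProt ID: {protein['uniprot']}",
        f"  Full Name: {protein['full_name']}",
        f"  Organism: {protein['organism']}",
        f"  Sequence Length: {protein['length']} amino acids",
        # "Not available" is an assumed wording; the card only shows "Available"
        f"  AlphaFold Structure: {'Available' if protein.get('alphafold') else 'Not available'}",
    ]
    if protein.get("function"):
        lines.append(f"  Function: {protein['function']}")
    return "\n".join(lines)


def build_network_prompt(proteins: list) -> str:
    """Join the header and all protein blocks, separated by blank lines."""
    return HEADER + "\n\n" + "\n\n".join(format_protein(p) for p in proteins)


proteins = [
    {"name": "TP53", "uniprot": "P04637", "full_name": "Tumor protein p53",
     "organism": "Homo sapiens", "length": 393, "alphafold": True},
    {"name": "MDM2", "uniprot": "Q00987", "full_name": "E3 ubiquitin-protein ligase Mdm2",
     "organism": "Homo sapiens", "length": 491, "alphafold": True},
]
prompt = build_network_prompt(proteins)
messages = [{"role": "user", "content": prompt}]
print(prompt)
```

The resulting messages list can be passed to the chat template in the same way as the hand-written prompts.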

⚠️ Limitations

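Diffusion-Based Generation: LLaDA's iterative denoising may behave differently from standard autoregressive models, and results depend on sampling settings such as steps, block_length, and the remasking strategy
BioPAX Format Specificity: Generated output is not guaranteed to be well-formed or to match the compressed BioPAX XML schema, so validate it before downstream use
Biological Accuracy: Predictions reflect patterns in the training data and are not experimentally verified; they may not reflect all biological realities
Computational Requirements: Each denoising step runs a full forward pass over the whole sequence (up to 128 passes with steps=128), so generation typically requires more compute than standard autoregressive inference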

📚 Citation

If you use this model in your research, please cite:

@misc{llada-8b-biogrid-biopax,
  author = {Proximile LLC},
  title = {LLaDA-8B-BioGRID-BioPAX: LoRA Adapter for Diffusion-Based Protein Interaction Network Prediction},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Proximile/LLaDA-8B-BioGRID-BioPAX}}
}

Please also cite the original LLaDA paper and the BioGRID database.

🏢 About Proximile LLC

Proximile LLC provides secure, cost-effective, and private AI solutions tailored to small and medium-sized businesses. We specialize in:

On-premise AI inference solutions that ensure unparalleled privacy
Cost-effective hardware configurations including specialized bioinformatics workstations
Secure Local AI applications for life sciences, including protein analysis and drug discovery tools
Specialized services for compliance & governance in regulated industries

Visit proximile.llc to learn more about our secure, local AI solutions for your business.

🔄 Model Updates

June 16, 2025 – Initial LoRA adapter release with BioGRID 4.4.246 training data
Enhanced with UniProt and AlphaFold integration for comprehensive protein context

📄 License

This LoRA adapter is released under the same license as the base LLaDA model.
