Phi-4-mini N3 Transform to Knowledge Graph Fine-tune
This model is a fine-tuned version of microsoft/Phi-4-mini-instruct optimized for transforming entity and schema information into JSON-LD format, trained as part of the WIM (Wikipedia to Knowledge Graph) pipeline.
Model Details
Model Description
- Developed by: UWV InnovatieHub
- Model type: Causal Language Model with LoRA fine-tuning
- Language(s): Dutch (nl)
- License: MIT
- Finetuned from: microsoft/Phi-4-mini-instruct (3.82B parameters)
- Training Framework: Unsloth (optimized for training at extreme context lengths)
Training Details
- Dataset: UWV/wim-instruct-wiki-to-jsonld-agent-steps
- Dataset Size: 10,593 N3-specific examples (JSON-LD transformation tasks)
- Training Duration: 41 hours 54 minutes
- Hardware: NVIDIA A100 80GB
- Context Length: 131,072 tokens (128K)
- Steps: 1,000
- Training Metrics:
  - Final Training Loss: 0.11
  - Final Eval Loss: 0.119
- Trainable Parameters: ~178M (4.4% of model)
LoRA Configuration
{
    "r": 320,              # Rank (Microsoft's recommended config)
    "lora_alpha": 320,     # Alpha (1:1 ratio for Phi-4)
    "lora_dropout": 0.0,   # No dropout
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ]
}
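For reference, the same settings expressed as a peft LoraConfig. This is an illustrative mapping only; the actual run went through Unsloth's LoRA wrapper rather than this exact call:

```python
from peft import LoraConfig

# Illustrative only: the hyperparameters above correspond to this peft configuration,
# but training used Unsloth's get_peft_model wrapper rather than raw peft.
lora_config = LoraConfig(
    r=320,
    lora_alpha=320,
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```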
Training Configuration
{
    "model": "phi4-mini",
    "max_seq_length": 131072,          # 128K context
    "batch_size": 1,
    "gradient_accumulation_steps": 8,
    "effective_batch_size": 8,
    "learning_rate": 1e-5,
    "warmup_steps": 20,
    "max_grad_norm": 1.0,
    "lr_scheduler": "linear",
    "optimizer": "paged_adamw_8bit",
    "bf16": True,
    "gradient_checkpointing": True,
    "seed": 42
}
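As orientation, below is a hedged sketch of how these hyperparameters could be wired into an Unsloth + TRL training run. The dataset split and text column name, the output directory, and the exact trainer arguments are assumptions (trl argument names also shift between versions); this is not the project's actual training script:

```python
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load Phi-4-mini through Unsloth at the full 128K context length
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="microsoft/Phi-4-mini-instruct",
    max_seq_length=131072,
    dtype=None,  # auto-selects bfloat16 on supported hardware
)

# Attach the LoRA adapters described in the LoRA Configuration above
model = FastLanguageModel.get_peft_model(
    model,
    r=320,
    lora_alpha=320,
    lora_dropout=0.0,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

train_dataset = load_dataset("UWV/wim-instruct-wiki-to-jsonld-agent-steps", split="train")  # split name assumed

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",  # assumption: the actual column name may differ
    max_seq_length=131072,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=1e-5,
        warmup_steps=20,
        max_steps=1000,
        max_grad_norm=1.0,
        lr_scheduler_type="linear",
        optim="paged_adamw_8bit",
        bf16=True,
        seed=42,
        output_dir="outputs",  # assumption
    ),
)
trainer.train()
```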
Intended Uses & Limitations
Intended Uses
- JSON-LD Generation: Transform entity and schema information into valid JSON-LD format
- Knowledge Graph Construction: Third step (N3) in the WIM pipeline
- Structured Data Creation: Convert unstructured entity descriptions to Schema.org-compliant JSON-LD
- Long Context Processing: Handle extremely long input sequences (up to 128K tokens)
Limitations
- Requires extensive context (average input ~40K tokens)
- Memory intensive due to long sequences
- Best performance with Phi-4's specific prompt format
- May require post-processing validation (N4 step)
How to Use
Option 1: Using the Merged Model (Recommended)
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import json
# Load the merged model (ready to use)
model = AutoModelForCausalLM.from_pretrained(
    "UWV/wim-n3-phi4-mini-merged",  # Update with actual repo
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n3-phi4-mini-merged")
# Prepare input (typically very long with entity and schema information)
entities = [
    {"name": "Amsterdam", "type": "City"},
    {"name": "Netherlands", "type": "Country"}
]
schemas = {
    "City": "https://schema.org/City",
    "Country": "https://schema.org/Country"
}

messages = [
    {
        "role": "system",
        "content": "You are an expert in creating JSON-LD representations using Schema.org vocabulary."
    },
    {
        "role": "user",
        "content": f"""Transform the following entities into JSON-LD format using Schema.org:
Entities: {json.dumps(entities, ensure_ascii=False)}
Schemas: {json.dumps(schemas, ensure_ascii=False)}
Create a complete JSON-LD representation with proper @context and @type declarations."""
    }
]
# Apply chat template and generate
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=131072)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=4096,  # JSON-LD can be long
        temperature=0.1,      # Low temperature for valid JSON
        do_sample=True,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
# Decode only the newly generated tokens (the completion after the prompt)
generated = outputs[0][inputs["input_ids"].shape[1]:]
json_ld = tokenizer.decode(generated, skip_special_tokens=True).strip()
print(json_ld)
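Because the model may occasionally wrap its answer in markdown fences or extra prose, a light best-effort extraction step (separate from the pipeline's N4 validation) can help recover a parseable JSON object. This is an illustrative helper, not part of the released pipeline, and extract_json_ld is a hypothetical name:

```python
import json
import re

def extract_json_ld(raw: str) -> dict:
    """Best-effort extraction of a JSON object from raw model output (illustrative helper)."""
    # Strip surrounding markdown code fences if present
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    # Fall back to the outermost braces if extra text surrounds the JSON
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in model output")
    return json.loads(cleaned[start:end + 1])

parsed = extract_json_ld(json_ld)
print(parsed.get("@context"), len(parsed.get("@graph", [])))
```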
Option 2: Using the LoRA Adapter
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Load adapter
model = PeftModel.from_pretrained(
    base_model,
    "UWV/wim-n3-phi4-mini-adapter"  # Update with actual repo
)
tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n3-phi4-mini-adapter")
# Use same inference code as above...
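When working from the adapter, you can optionally fold the LoRA weights into the base model with peft's merge_and_unload for adapter-free inference; the local output directory below is just an example name:

```python
# Optional: merge the LoRA weights into the base model for adapter-free inference
merged = model.merge_and_unload()
merged.save_pretrained("wim-n3-phi4-mini-merged-local")   # example local path
tokenizer.save_pretrained("wim-n3-phi4-mini-merged-local")
```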
Expected Output Format
The model outputs valid JSON-LD with Schema.org vocabulary:
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "City",
      "@id": "_:amsterdam",
      "name": "Amsterdam",
      "containedInPlace": {
        "@id": "_:netherlands"
      }
    },
    {
      "@type": "Country",
      "@id": "_:netherlands",
      "name": "Netherlands"
    }
  ]
}
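Downstream consumers can load this JSON-LD into an RDF graph, for example with rdflib (which includes a JSON-LD parser from version 6.0 onward). This is an optional consumer-side sketch, not part of the WIM pipeline; note that resolving the remote schema.org @context requires network access:

```python
import json
from rdflib import Graph

# The example output above, as a Python dict
doc = {
    "@context": "https://schema.org",
    "@graph": [
        {
            "@type": "City",
            "@id": "_:amsterdam",
            "name": "Amsterdam",
            "containedInPlace": {"@id": "_:netherlands"},
        },
        {"@type": "Country", "@id": "_:netherlands", "name": "Netherlands"},
    ],
}

# Parse into an RDF graph (fetches the schema.org context over the network)
g = Graph()
g.parse(data=json.dumps(doc), format="json-ld")
print(g.serialize(format="turtle"))
```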
Dataset Information
The model was trained on the UWV/wim-instruct-wiki-to-jsonld-agent-steps dataset, which contains:
- Source: Dutch Wikipedia articles processed through N1 and N2 steps
- Processing: Multi-agent pipeline converting text to JSON-LD
- N3 Examples: 10,593 transformation tasks
- Average Token Length: ~40,388 tokens (extremely long sequences)
- Max Token Length: 520,575 tokens
- Format: ChatML-formatted instruction-following examples
- Task: Transform entity and schema information into valid JSON-LD
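A quick way to inspect the dataset and sanity-check the token lengths reported above; the split and column names ("train", "messages") are assumptions and may differ from the published layout:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("UWV/wim-instruct-wiki-to-jsonld-agent-steps", split="train")  # split name assumed
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-instruct")

# Token length of a single example (the "messages" column name is an assumption)
example = ds[0]
text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
print(len(tokenizer(text)["input_ids"]), "tokens")
```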
Training Results
Training and evaluation losses converged closely, with minimal overfitting:
- Final Training Loss: 0.11 (good convergence)
- Final Eval Loss: 0.119 (very close to the training loss)
- Train/Eval Loss Ratio: 0.92 (indicating good generalization)
This was achieved despite the extreme context lengths and complex transformation task.
Model Versions
Merged Model: UWV/wim-n3-phi4-mini-merged (base model with the 681MB adapter merged in)
- Ready to use without adapter loading
- Recommended for production inference

LoRA Adapter: UWV/wim-n3-phi4-mini-adapter (681MB)
- Requires the base microsoft/Phi-4-mini-instruct model
- More flexible for further fine-tuning
Pipeline Context
This model is part of the WIM (Wikipedia to Knowledge Graph) pipeline:
- N1: Entity Extraction
- N2: Schema.org Type Selection
- N3 (This Model): Transform to JSON-LD
- N4: Validation
- N5: Add Human-Readable Labels
N3 is the most computationally intensive step, handling the complex transformation from structured entity information to valid JSON-LD format.
Technical Notes
- Memory Requirements: ~53GB VRAM for 128K context inference
- Optimization: Uses Unsloth's custom kernels for efficient long-context processing
- Special Configuration: Requires TORCH_COMPILE_DISABLE=1 for Phi-4 compatibility
- Context Handling: Can process full Wikipedia articles with extensive entity information
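For example, the flag can be set at the top of an inference script, before torch and transformers are imported:

```python
import os

# Disable torch.compile for Phi-4 compatibility (set before importing torch/transformers)
os.environ["TORCH_COMPILE_DISABLE"] = "1"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
```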
Citation
If you use this model, please cite:
@misc{wim-n3-phi4-mini,
  author = {UWV InnovatieHub},
  title = {Phi-4-mini N3 Transform to JSON-LD Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/UWV/wim-n3-phi4-mini-merged}
}