This model is a fine-tuned version of occiglot/occiglot-7b-eu5-instruct for generating SPARQL queries from German natural language questions, specifically targeting the Wikidata knowledge graph.

Model Details

Model Description

The model was fine-tuned with QLoRA using 4-bit quantization. It takes a German natural language question as input and aims to produce a corresponding SPARQL query that can be executed against the Wikidata knowledge graph. It is part of a series of experiments investigating the impact of continual multilingual pre-training on cross-lingual transferability and task-specific performance.

  • Developed by: Julio Cesar Perez Duran
  • Funded by: DFKI
  • Model type: Decoder-only Transformer-based language model
  • Language(s) (NLP): de (German)
  • License: MIT
  • Finetuned from model: occiglot/occiglot-7b-eu5-instruct

Bias, Risks, and Limitations

  • Entity/Relationship Linking Bottleneck: A primary limitation of this model (and of v1 models generally) is a significant deficiency in mapping German textual entities and relationships to their correct Wikidata identifiers (QIDs and PIDs) without explicit contextual aid. The model may generate structurally valid SPARQL in which the entities or properties are incorrect. This significantly reduced recall.
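
One way to inspect this failure mode is to list the identifiers a generated query references and check them by hand against Wikidata. The snippet below is a minimal sketch, not part of the model's own tooling: the helper name `wikidata_ids` and the prefix list are assumptions, and the query string is only an illustration.

```python
import re

# Hypothetical helper (an assumption, not from the model card): collect the
# Wikidata item (Q...) and property (P...) identifiers used in a SPARQL string.
# Only the common prefixes wd:, wdt:, p:, ps: and pq: are recognised.
_ID_PATTERN = re.compile(r"\b(?:wd|wdt|p|ps|pq):([QP]\d+)\b")


def wikidata_ids(sparql: str) -> dict:
    """Return the sorted, de-duplicated QIDs and PIDs referenced in a query."""
    ids = _ID_PATTERN.findall(sparql)
    return {
        "items": sorted({i for i in ids if i.startswith("Q")}),
        "properties": sorted({i for i in ids if i.startswith("P")}),
    }


# Illustrative query string; whether the identifiers are the right ones for the
# question is exactly what has to be verified manually.
query = "SELECT ?bp WHERE { wd:Q283 wdt:P2102 ?bp . }"
print(wikidata_ids(query))
```

A query can be executable and still reference the wrong item or property, so a listing like this only supports manual review; it does not validate the query.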

How to Get Started with the Model

The following Python script provides an example of how to load the model and tokenizer using the Hugging Face Transformers and PEFT libraries to generate a SPARQL query.

import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Model ID for the Occiglot German v1.1 fine-tuned model
model_id = "julioc-p/occiglot-7b-eu5-instruct-txt-de-sparql_4bit"
base_model_for_tokenizer = "occiglot/occiglot-7b-eu5-instruct"

# Configuration for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # Use torch.float16 on GPUs without bfloat16 support (e.g. V100)
    bnb_4bit_use_double_quant=False,
)

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model_for_tokenizer)
tokenizer.pad_token = tokenizer.eos_token

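def extract_sparql(text):
    """Return the first SPARQL query found in text, or "" if there is none."""
    # The non-greedy pattern stops at the first closing brace, so queries
    # with nested groups (e.g. OPTIONAL { ... }) are cut short.
    match = re.search(
        r"(SELECT|ASK|CONSTRUCT|DESCRIBE).*?\}", text, re.DOTALL | re.IGNORECASE
    )
    if match:
        return match.group(0).strip()
    return ""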

# --- Example usage ---
question = "Was ist der Siedepunkt von Wasser?" # German example question
# knowledge_graph_target = "Wikidata" # This model is fine-tuned for Wikidata

system_prompt_content = "Sie sind ein Experte für die Generierung von SPARQL-Anfragen. Generieren Sie die SPARQL-Anfrage, die die Frage des Benutzers beantwortet." # German system prompt

chat_template = [
    {"role": "system", "content": system_prompt_content},
    {"role": "user", "content": question},
]

inputs = tokenizer.apply_chat_template(
    chat_template,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Generate the output
with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=256, pad_token_id=tokenizer.eos_token_id)

generated_text_assistant_part = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
cleaned_sparql = extract_sparql(generated_text_assistant_part)

print(f"Frage: {question}")
print(f"Generierte SPARQL: {cleaned_sparql}")
print(f"Textausgabe (Assistent): {generated_text_assistant_part}")
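
The `extract_sparql` helper in the script matches up to the first closing brace, which truncates queries containing nested groups such as `OPTIONAL { ... }`. The following is a minimal alternative sketch that counts braces instead. The function name `extract_sparql_balanced` is an assumption and not part of the original script. Like the original, it drops `PREFIX` declarations before the query keyword and solution modifiers (for example `LIMIT`) after the final brace, and it does not handle braces inside string literals.

```python
import re


def extract_sparql_balanced(text: str) -> str:
    """Return the first query whose braces are balanced, or "" if none is found."""
    start = re.search(r"\b(SELECT|ASK|CONSTRUCT|DESCRIBE)\b", text, re.IGNORECASE)
    if not start:
        return ""
    depth = 0
    for i in range(start.start(), len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Outermost group closed: the query ends here.
                return text[start.start():i + 1].strip()
            if depth < 0:
                # Closing brace without a matching opening brace.
                return ""
    # Braces never balanced (e.g. generation was cut off by max_new_tokens).
    return ""


sample = "Antwort: SELECT ?x WHERE { ?x wdt:P31 wd:Q5 . OPTIONAL { ?x wdt:P18 ?img } } Ende"
print(extract_sparql_balanced(sample))
```

Returning an empty string for unbalanced output keeps the same contract as the original helper, so the function can be swapped in without changing the rest of the script.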

Training Data

The model was fine-tuned on a subset of the julioc-p/Question-Sparql dataset. Specifically, a 35,000-sample German subset (translated from English and filtered for Wikidata-related queries) was used.

Training Hyperparameters

The following hyperparameters were used for the v1.1 Occiglot German fine-tuning:

  • LoRA Configuration (for Occiglot v1.1):
    • r (LoRA rank): 64
    • lora_alpha: 16
    • lora_dropout: 0.1
    • bias: "none"
    • task_type: "CAUSAL_LM"
    • target_modules: "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "lm_head"
  • Training Arguments:
    • num_train_epochs: 5
    • per_device_train_batch_size: 1
    • gradient_accumulation_steps: 8 (Effective batch size of 8)
    • gradient_checkpointing: True
    • optim: "paged_adamw_32bit"
    • learning_rate: 1e-5
    • weight_decay: 0.05
    • bf16: False
    • fp16: True
    • max_grad_norm: 1.0
    • warmup_ratio: 0.01
    • lr_scheduler_type: "cosine"
    • group_by_length: True
    • packing: False
  • BitsAndBytesConfig:
    • load_in_4bit: True
    • bnb_4bit_quant_type: "nf4"
    • bnb_4bit_compute_dtype: torch.float16
    • bnb_4bit_use_double_quant: False
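
To make the interplay of these settings concrete, the sketch below derives a few quantities from them. It assumes that all 35,000 samples mentioned under Training Data were used for training and that none were dropped, which the card does not state explicitly, so the step counts are estimates only. The variable names are illustrative.

```python
import math

# Values taken from the hyperparameter lists above.
lora_r = 64
lora_alpha = 16
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_train_epochs = 5
warmup_ratio = 0.01
num_gpus = 1  # the card reports a single V100

# Assumption (not confirmed by the card): the full 35,000-sample subset
# was used as the training split.
num_train_samples = 35_000

# Standard LoRA scaling factor applied to the low-rank update: alpha / r.
lora_scaling = lora_alpha / lora_r

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
steps_per_epoch = math.ceil(num_train_samples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"LoRA scaling: {lora_scaling}")
print(f"Effective batch size: {effective_batch_size}")
print(f"Optimizer steps: {total_steps} ({steps_per_epoch} per epoch)")
print(f"Warmup steps: {warmup_steps}")
```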

Speeds, Sizes, Times

  • The training took approximately 19-20 hours for 5 epochs on a single NVIDIA V100 GPU.

Evaluation

Testing Data, Factors & Metrics

Testing Data

  1. QALD-10 test set (German)
  2. v1 Test Set (German): 3,500 German held-out examples randomly sampled from the julioc-p/Question-Sparql dataset (Wikidata-focused).

Metrics

The primary evaluation metrics were the QALD-standard macro-averaged F1-score, Precision, and Recall. Non-executable queries were scored with Precision, Recall, and F1 of 0. The percentage of Executable Queries was also tracked. Correctness was further broken down into "Correct (Exact Match)" and "Correct (Both Empty)".
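
As an illustration of macro-averaging over answer sets, here is a minimal sketch. It is not the evaluation script used for the results below. The function names are hypothetical, and the conventions are assumptions: a non-executable query is represented as `None` and scores 0, two empty answer sets count as a full match, and an empty prediction against a non-empty gold set gets precision 0. QALD-style scripts differ in how they treat empty predictions, so this sketch may not reproduce the reported numbers.

```python
from typing import Iterable, Optional, Set, Tuple

Scores = Tuple[float, float, float]


def prf(gold: Set[str], pred: Optional[Set[str]]) -> Scores:
    """Precision, recall and F1 for a single question."""
    if pred is None:           # query failed to execute
        return 0.0, 0.0, 0.0
    if not gold and not pred:  # "both empty" counts as correct
        return 1.0, 1.0, 1.0
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_average(pairs: Iterable[Tuple[Set[str], Optional[Set[str]]]]) -> Scores:
    """Mean of the per-question scores over all (gold, predicted) pairs."""
    scores = [prf(gold, pred) for gold, pred in pairs]
    n = len(scores)
    return tuple(sum(s[k] for s in scores) / n for k in range(3))


examples = [
    ({"Q1", "Q2"}, {"Q1"}),  # partial match
    (set(), set()),          # both empty
    ({"Q3"}, None),          # non-executable query
]
print(macro_average(examples))
```

Macro F1 here is the mean of the per-question F1 values, not the harmonic mean of macro precision and macro recall.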

Results

On QALD-10 (German, N=391):

  • Macro F1-Score: 0.0691
  • Macro Precision: 0.6957
  • Macro Recall: 0.0691
  • Executable Queries: 97.95% (383/391)
  • Correctness (Exact Match + Both Empty): 6.91% (27/391)
    • Correct (Exact Match): 5.37% (21/391)
    • Correct (Both Empty): 1.53% (6/391)

On v1 Test Set (German, N=3500):

  • Macro F1-Score: 0.3021
  • Macro Precision: 0.8827
  • Macro Recall: 0.3065
  • Executable Queries: 98.60% (3451/3500)
  • Correctness (Exact Match + Both Empty): 29.57% (1035/3500)
    • Correct (Exact Match): 22.91% (802/3500)
    • Correct (Both Empty): 6.66% (233/3500)

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: 1 x NVIDIA V100 32GB GPU
  • Hours used: Approx. 19-20 hours for fine-tuning.
  • Cloud Provider: DFKI HPC Cluster
  • Compute Region: Germany
  • Carbon Emitted: Approx. 2.96 kg CO2eq.

Technical Specifications

Compute Infrastructure

Hardware

  • NVIDIA V100 GPU (32 GB RAM)
  • Approx. 60 GB system RAM

Software

  • Slurm, NVIDIA Enroot, CUDA 11.8.0
  • Python, Hugging Face transformers, peft (0.13.2), bitsandbytes, trl, PyTorch.

More Information

Framework versions

  • PEFT 0.13.2
  • Transformers 4.39.3
  • BitsAndBytes 0.43.0
  • TRL 0.8.6
  • PyTorch 2.1.0