Vietnamese Legal Pretrained Model - Qwen3-4B-A1.7B (pretrained on the VNLegal1M dataset)

This is a Vietnamese legal-domain model pretrained from Qwen3-1.7B-Pretrain-Legal-Document and adapted specifically for legal text understanding and legal question answering tasks.

Note: You must load the model with trust_remote_code=True so the custom model format loads correctly.

# Load model and tokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("huyhoangvbck/Qwen3-4B-A1.7B-MoeLegalVN1M", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "huyhoangvbck/Qwen3-4B-A1.7B-MoeLegalVN1M",
    trust_remote_code=True,
    device_map={"": 0},
)

Overview

  • Base model: huyhoangvbck/Qwen3-1.7B-Pretrain-Legal-Document
  • Architecture: Qwen3-1.7B expanded into the MoE model Qwen3-4B-A1.7B
  • Total parameters: 4B
  • Active parameters during inference: 1.7B
  • Total experts: 3 (one per task)
  • Active experts per token: 1
  • Domain: Vietnamese legal language
  • Training objective: SFT
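The routing implied above (3 experts, 1 active) can be sketched as top-1 selection over router scores. This is a hypothetical illustration in plain Python, not the model's actual implementation (`route_top1` and the example logits are made up):

```python
import math

def route_top1(logits):
    """Pick the single active expert: argmax of the softmax over router scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

# Three task experts: 0 = syllogism, 1 = NLI, 2 = multichoice (per this card)
expert, weight = route_top1([0.2, 2.1, -0.5])
```

In the real model the router is part of the custom code loaded with trust_remote_code=True; only the selected expert's parameters participate in the forward pass, which is why 1.7B of the 4B parameters are active.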

Training Data

The model was pretrained on the VNLegal1M dataset, comprising:

  • Syllogism: 311,116 samples

  • NLI: 279,861 samples

  • Multichoice: 362,966 samples

  • Total: 953,943 samples

The VNLegal1M data was generated by an LLM (Qwen3-235B-A22B) from 96,000 Vietnamese legal documents.
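The per-task counts above do sum to the stated total, as a quick check shows:

```python
# Per-task sample counts of the VNLegal1M split described above
counts = {"syllogism": 311116, "nli": 279861, "multichoice": 362966}
total = sum(counts.values())
assert total == 953943  # matches the reported total
```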

Training Configuration

For the syllogism task, all linear layers in expert 0 were trained; for the NLI and multiple-choice tasks, only the FFN modules in experts 1 and 2 were trained.
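A minimal sketch of that per-expert module selection, assuming parameters are named with an `experts.<i>` component; the real parameter names depend on the custom MoE code shipped with the model, and `is_trainable` is hypothetical:

```python
def is_trainable(name, task):
    """Decide whether a parameter (by name) is trained for a given task."""
    if task == "syllogism":              # all linear layers in expert 0
        return ".experts.0." in name
    if task in ("nli", "multichoice"):   # only FFN/MLP modules in experts 1 and 2
        expert = ".experts.1." if task == "nli" else ".experts.2."
        return expert in name and ("ffn" in name or "mlp" in name)
    return False

# Illustrative (made-up) parameter names:
params = [
    "model.layers.0.experts.0.q_proj.weight",
    "model.layers.0.experts.1.mlp.gate_proj.weight",
    "model.layers.0.experts.2.mlp.up_proj.weight",
]
assert is_trainable(params[0], "syllogism")
assert is_trainable(params[1], "nli")
assert not is_trainable(params[1], "multichoice")
```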

Model & Tokenization

  • Base model: huyhoangvbck/Qwen3-1.7B-Pretrain-Legal-Document
  • Training setup: 1 × B200 GPU
  • Maximum sequence length: 1024
  • Epochs: 4
  • Batch size: 16
  • Gradient accumulation steps: 2
  • Mixed precision: bf16
  • LR warmup ratio: 0.1
  • LR scheduler: cosine
  • Seed: 1308
  • Total training time: 48 hours
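With a batch size of 16 and 2 gradient-accumulation steps, each optimizer step sees 32 sequences:

```python
batch_size = 16
grad_accum_steps = 2
effective_batch = batch_size * grad_accum_steps
assert effective_batch == 32  # sequences per optimizer step
```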

Performance

  • Accuracy on the Public Test (VLSP 2025):

  • Multichoice: 88.36%

  • NLI: 86%

Example Usage

You must pass tokenizer=tokenizer when generating so that the expert router works properly, and load the correct task-specific system prompt as shown in the examples below.

from datasets import load_dataset

ds = load_dataset("VLSP2025-LegalSML/Public-Test", "syllogism")  # or "nli", "multichoice_questions"
val_raw = ds["train"]
  • For the Syllogism task:

val_raw format:

{
  "question": "string"       // Detailed legal scenario with background
}

Example code:

import torch
import json

index = 0
question = json.dumps({"question": val_raw[index]["question"]}, ensure_ascii=False)
# System prompt (in Vietnamese, as used in training): "You are a Vietnamese legal
# assistant. Answer the syllogistic reasoning question concisely and logically.
# Output JSON in the exact format: {"answer": "Major premise: ... Minor premise: ... Conclusion: ..."}"
system_prompt = {
    "role": "system",
    "content": """Bạn là trợ lý pháp lý Việt Nam. Trả lời câu hỏi suy luận tam đoạn luận ngắn gọn, súc tích và hợp logic. Xuất kết quả đúng định dạng JSON: {"answer": "Tiền đề lớn: ... . Tiền đề nhỏ: ... . Kết luận: ... ."}""",
}
messages = [{"role": "user", "content": question}]
prompt = tokenizer.apply_chat_template([system_prompt] + messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=1024, tokenizer=tokenizer, do_sample=False)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
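The decoded string contains the whole chat transcript around the JSON answer. A hypothetical helper (`extract_answer`, not part of this model card) that grabs the last well-formed `{"answer": ...}` object works for all three tasks:

```python
import json
import re

def extract_answer(text):
    """Return the value of the last parseable {"answer": ...} object in text."""
    for candidate in reversed(re.findall(r"\{[^{}]*\}", text)):
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "answer" in obj:
            return obj["answer"]
    return None

assert extract_answer('assistant ... {"answer": 1} ...') == 1
assert extract_answer("no json at all") is None
```

The regex only matches non-nested objects, which is enough here because the expected answer JSON contains no inner braces.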
  • For the NLI task:

val_raw format:

{
  "legal_document": "string",         // Source legal text or excerpt
  "specific_question": "string",      // Concrete legal question derived from the text
  "question": "string",               // Inference question (yes/no or entailment-based)
  "choices": ["Có", "Không"]         // Possible responses
}

Example code

index = 0
question_json = {
    "legal_document": val_raw[index]["legal_document"],
    "specific_question": val_raw[index]["specific_question"],
    "question": val_raw[index]["question"],
    "choices": val_raw[index]["choices"],
}
question_str = json.dumps(question_json, ensure_ascii=False)
# System prompt (in Vietnamese): "You are a Vietnamese legal assistant. Answer
# whether the provided legal document (legal_document) helps answer the related
# question (specific_question). Output JSON in the exact format: {"answer": 0 or 1}"
system_prompt = {
    "role": "system",
    "content": """Bạn là trợ lý pháp lý Việt Nam. Trả lời câu hỏi văn bản pháp luật(legal_document) được cung cấp có giúp trả lời được câu hỏi(specific_question) mà liên quan đến văn bản đó không. Xuất kết quả đúng định dạng JSON: {"answer": 0 or 1}""",
}
messages = [{"role": "user", "content": question_str}]
prompt = tokenizer.apply_chat_template([system_prompt] + messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=1024, tokenizer=tokenizer, do_sample=False)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
  • For the Multichoice task:

val_raw format:

{
  "question": "string",        // The legal question text
  "choices": [
    "Option 1: Full answer text.",
    "Option 2: Full answer text.",
    "Option 3: Full answer text.",
    "Option 4: Full answer text."
  ]  // List of possible answer choices
}

Example code

index = 0
question_json = {
    "question": val_raw[index]["question"],
    "choices": val_raw[index]["choices"],
}
question_str = json.dumps(question_json, ensure_ascii=False)
# System prompt (in Vietnamese): "You are a Vietnamese legal assistant. Answer the
# Vietnamese legal multiple-choice question. Output JSON in the exact format: {"answer": 0|1|2|3}"
system_prompt = {
    "role": "system",
    "content": """Bạn là trợ lý pháp lý Việt Nam. Trả lời câu hỏi trắc nghiệm pháp lí Việt Nam. Xuất kết quả đúng định dạng JSON: {"answer": 0|1|2|3}""",
}
messages = [{"role": "user", "content": question_str}]
prompt = tokenizer.apply_chat_template([system_prompt] + messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=1024, tokenizer=tokenizer, do_sample=False)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
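For the NLI and multichoice tasks the model returns an index into `choices`, so mapping it back to the answer text is a one-liner; the `choices` list below is the placeholder from the format description above:

```python
import json

choices = [
    "Option 1: Full answer text.",
    "Option 2: Full answer text.",
    "Option 3: Full answer text.",
    "Option 4: Full answer text.",
]
predicted = json.loads('{"answer": 2}')["answer"]
answer_text = choices[predicted]
assert answer_text == "Option 3: Full answer text."
```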

Maintainers

This model is developed and maintained by huyhoangvbck. For inquiries, please contact: [email protected]
