Vietnamese Legal Pretrain Model - Qwen3-4B-A1.7B (pretrained on the VNLegal1M dataset)
This is a Vietnamese legal-domain model pretrained from Qwen3-1.7B-Pretrain-Legal-Document and adapted specifically for legal text understanding and legal question answering tasks.
Note: you must load the model with trust_remote_code=True so that the correct custom model format is loaded.
# Load model
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("huyhoangvbck/Qwen3-4B-A1.7B-MoeLegalVN1M", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "huyhoangvbck/Qwen3-4B-A1.7B-MoeLegalVN1M",
    trust_remote_code=True,
    device_map={"": 0},  # place the whole model on GPU 0
)
Overview
- Base model: huyhoangvbck/Qwen3-1.7B-Pretrain-Legal-Document
- Architecture: the dense Qwen3-1.7B model upcycled into the MoE model Qwen3-4B-A1.7B
- Total parameters: 4B
- Active parameters during inference: 1.7B
- Number of experts: 3 (one per task)
- Active experts per token: 1
- Domain: Vietnamese legal language
- Training objective: SFT
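The overview above describes top-1 routing over three task experts: a gate scores the experts and only the best-scoring one runs per token. A minimal sketch of that mechanism (hypothetical module names and sizes for illustration; the actual implementation ships with the repository's trust_remote_code code):

```python
import torch
import torch.nn as nn

class Top1Router(nn.Module):
    """Illustrative top-1 MoE routing: 3 experts, 1 active per token."""
    def __init__(self, hidden_size, num_experts=3):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, 4 * hidden_size),
                          nn.SiLU(),
                          nn.Linear(4 * hidden_size, hidden_size))
            for _ in range(num_experts)
        )

    def forward(self, x):
        scores = self.gate(x)          # (tokens, num_experts)
        best = scores.argmax(dim=-1)   # index of the single active expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = best == i
            if mask.any():
                out[mask] = expert(x[mask])  # only the selected expert computes
        return out

router = Top1Router(hidden_size=8)
y = router(torch.randn(5, 8))
print(y.shape)  # torch.Size([5, 8])
```

Because only one expert fires per token, the compute cost tracks the 1.7B active parameters rather than the 4B total.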
Training Data
The model was trained on the VNLegal1M dataset, consisting of:
- Syllogism: 311,116 samples
- NLI: 279,861 samples
- Multichoice: 362,966 samples
- Total: 953,943 samples
The VNLegal1M data was generated by an LLM (Qwen3-235B-A22B) from 96,000 Vietnamese legal documents.
Training Configuration
For the syllogism task, all linear modules of expert 0 are trained; for the NLI and multichoice tasks, only the FFN modules of experts 1 and 2 (respectively) are trained.
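That per-task module selection can be sketched as a requires_grad mask over parameter names. The naming patterns ("experts.N", "ffn") and the toy model below are assumptions for illustration, not the model's actual parameter names:

```python
import torch.nn as nn

def select_trainable(model, task):
    """Freeze everything, then unfreeze the modules trained for `task`.
    Syllogism trains every linear layer of expert 0; NLI and multichoice
    train only the FFN modules of experts 1 and 2, respectively."""
    expert = {"syllogism": "experts.0", "nli": "experts.1", "multichoice": "experts.2"}[task]
    trainable = []
    for name, param in model.named_parameters():
        keep = expert in name and (task == "syllogism" or "ffn" in name)
        param.requires_grad = keep
        if keep:
            trainable.append(name)
    return trainable

# Toy stand-in that follows the assumed naming scheme:
toy = nn.ModuleDict({"experts": nn.ModuleList(
    nn.ModuleDict({"ffn": nn.Linear(4, 4), "attn": nn.Linear(4, 4)})
    for _ in range(3)
)})
print(select_trainable(toy, "nli"))  # ['experts.1.ffn.weight', 'experts.1.ffn.bias']
```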
Model & Tokenization
- Base model: huyhoangvbck/Qwen3-1.7B-Pretrain-Legal-Document
- Training setup: 1 × B200 GPU
- Maximum sequence length: 1024
- Epochs: 4
- Batch size: 16
- Gradient accumulation steps: 2
- Mixed precision: bf16
- LR warmup ratio: 0.1
- LR scheduler: cosine
- Seed: 1308
- Total training time: 48 hours
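With these settings, the effective batch size per optimizer step follows directly from the per-device batch size and the accumulation steps:

```python
# Gradients are accumulated over 2 micro-batches before each optimizer step,
# so the effective batch size is the product of the two settings above.
batch_size = 16
grad_accum_steps = 2
effective_batch_size = batch_size * grad_accum_steps
print(effective_batch_size)  # 32
```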
Performance
Accuracy on the VLSP 2025 public test:
- Multichoice: 88.36%
- NLI: 86%
Example Usage
You must pass tokenizer=tokenizer when calling generate() so that the expert router works properly, and load the correct task-specific system prompt, as shown in the examples below.

from datasets import load_dataset

# "syllogism/nli/multichoice_questions" stands for one of the three config names
ds = load_dataset("VLSP2025-LegalSML/Public-Test", "syllogism/nli/multichoice_questions")
val_raw = ds["train"]
- For the Syllogism task:
val_raw format:
{
  "question": "string"  // detailed legal scenario with background
}
Example code:
import json
import torch

index = 0
question = json.dumps({"question": val_raw[index]["question"]}, ensure_ascii=False)
# System prompt (in Vietnamese): "You are a Vietnamese legal assistant. Answer the
# syllogistic reasoning question concisely and logically. Output JSON in the form
# {"answer": "Major premise: ... . Minor premise: ... . Conclusion: ... ."}"
system_prompt = {
    "role": "system",
    "content": """Bạn là trợ lý pháp lý Việt Nam. Trả lời câu hỏi suy luận tam đoạn luận ngắn gọn, súc tích và hợp logic. Xuất kết quả đúng định dạng JSON: {"answer": "Tiền đề lớn: ... . Tiền đề nhỏ: ... . Kết luận: ... ."}"""
}
messages = [{"role": "user", "content": question}]
prompt = tokenizer.apply_chat_template([system_prompt] + messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=1024, tokenizer=tokenizer, do_sample=False)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
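Note that generate() returns the prompt tokens followed by the new tokens, so decoding outputs[0] repeats the prompt. Slicing at the prompt length keeps only the model's answer; the pattern, shown here on dummy token ids so it runs stand-alone:

```python
import torch

prompt_ids = torch.tensor([[11, 12, 13, 14]])          # stands in for inputs["input_ids"]
output_ids = torch.tensor([[11, 12, 13, 14, 21, 22]])  # stands in for outputs
# Keep only the tokens generated after the prompt:
new_ids = output_ids[0][prompt_ids.shape[1]:]
print(new_ids.tolist())  # [21, 22]
```

In the examples here, you would decode `outputs[0][inputs["input_ids"].shape[1]:]` instead of `outputs[0]`.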
- For the NLI task:
val_raw format:
{
  "legal_document": "string",    // source legal text or excerpt
  "specific_question": "string", // concrete legal question derived from the text
  "question": "string",          // inference question (yes/no or entailment-based)
  "choices": ["Có", "Không"]     // possible responses
}
Example code:
index = 0
question_json = {
    "legal_document": val_raw[index]["legal_document"],
    "specific_question": val_raw[index]["specific_question"],
    "question": val_raw[index]["question"],
    "choices": val_raw[index]["choices"],
}
question_str = json.dumps(question_json, ensure_ascii=False)
# System prompt (in Vietnamese): "You are a Vietnamese legal assistant. Answer whether
# the provided legal document (legal_document) helps answer the related question
# (specific_question). Output JSON in the form {"answer": 0 or 1}"
system_prompt = {
    "role": "system",
    "content": """Bạn là trợ lý pháp lý Việt Nam. Trả lời câu hỏi văn bản pháp luật(legal_document) được cung cấp có giúp trả lời được câu hỏi(specific_question) mà liên quan đến văn bản đó không. Xuất kết quả đúng định dạng JSON: {"answer": 0 or 1}"""
}
messages = [{"role": "user", "content": question_str}]
prompt = tokenizer.apply_chat_template([system_prompt] + messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=1024, tokenizer=tokenizer, do_sample=False)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
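The decoded result should end with the JSON object requested in the system prompt. A small sketch for pulling it out; the sample string is made up, and the regex assumes the flat, non-nested {"answer": ...} format the prompts request:

```python
import json
import re

def extract_answer(decoded):
    """Return the last flat {...} object found in the decoded text, or None."""
    matches = re.findall(r'\{[^{}]*\}', decoded)
    return json.loads(matches[-1]) if matches else None

# Hypothetical decoded output: prompt text followed by the model's JSON answer.
sample = '... văn bản pháp luật ... {"answer": 1}'
print(extract_answer(sample))  # {'answer': 1}
```

Taking the last match skips any JSON template echoed from the prompt itself.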
- For the Multichoice task:
val_raw format:
{
  "question": "string",  // the legal question text
  "choices": [
    "Option 1: Full answer text.",
    "Option 2: Full answer text.",
    "Option 3: Full answer text.",
    "Option 4: Full answer text."
  ]  // list of possible answer choices
}
Example code:
index = 0
question_json = {
    "question": val_raw[index]["question"],
    "choices": val_raw[index]["choices"],
}
question_str = json.dumps(question_json, ensure_ascii=False)
# System prompt (in Vietnamese): "You are a Vietnamese legal assistant. Answer the
# Vietnamese legal multiple-choice question. Output JSON in the form {"answer": 0|1|2|3}"
system_prompt = {
    "role": "system",
    "content": """Bạn là trợ lý pháp lý Việt Nam. Trả lời câu hỏi trắc nghiệm pháp lí Việt Nam. Xuất kết quả đúng định dạng JSON: {"answer": 0|1|2|3}"""
}
messages = [{"role": "user", "content": question_str}]
prompt = tokenizer.apply_chat_template([system_prompt] + messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=1024, tokenizer=tokenizer, do_sample=False)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
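Since the multichoice answer is an index into choices, mapping it back to the option text is straightforward (illustrative data only, mirroring the val_raw format above):

```python
import json

choices = [
    "Option 1: Full answer text.",
    "Option 2: Full answer text.",
    "Option 3: Full answer text.",
    "Option 4: Full answer text.",
]
# Hypothetical model output in the requested {"answer": 0|1|2|3} format:
answer = json.loads('{"answer": 2}')["answer"]
print(choices[answer])  # Option 3: Full answer text.
```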
Maintainers
This model is developed and maintained by huyhoangvbck. For inquiries, please contact: [email protected]
Model tree for huyhoangvbck/Qwen3-4B-A1.7B-MoeLegalVN1M
- Base model: Qwen/Qwen3-1.7B-Base
- Finetuned from: Qwen/Qwen3-1.7B