Vietnamese Legal Pretrain Model - Qwen3-4B-A1.7B (pretrained on the VNLegal1M dataset)
This is a Vietnamese legal-domain model pretrained from Qwen3-1.7B-Pretrain-Legal-Document and adapted specifically for legal text understanding and legal question answering tasks.
Note: you must load the model with trust_remote_code=True so that the correct custom model format is loaded.
# Load model
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("huyhoangvbck/Qwen3-4B-A1.7B-MoeLegalVN1M", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "huyhoangvbck/Qwen3-4B-A1.7B-MoeLegalVN1M",
    trust_remote_code=True,
    device_map={"": 0},  # place the whole model on GPU 0
)
Overview
- Base model: huyhoangvbck/Qwen3-1.7B-Pretrain-Legal-Document
- Architecture: the dense Qwen3-1.7B model upcycled into the MoE model Qwen3-4B-A1.7B
- Total parameters: 4B
- Active parameters during inference: 1.7B
- Number of experts: 3 (one per task)
- Active experts per token: 1
- Domain: Vietnamese legal language
- Training objective: SFT
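The overview above describes top-1 routing over three task experts: a gate scores the experts and only the best-scoring one runs per token. A minimal sketch of that mechanism (hypothetical module names and sizes for illustration; the actual implementation ships with the repository's trust_remote_code code):

```python
import torch
import torch.nn as nn

class Top1Router(nn.Module):
    """Illustrative top-1 MoE routing: 3 experts, 1 active per token."""
    def __init__(self, hidden_size, num_experts=3):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, 4 * hidden_size),
                          nn.SiLU(),
                          nn.Linear(4 * hidden_size, hidden_size))
            for _ in range(num_experts)
        )

    def forward(self, x):
        scores = self.gate(x)          # (tokens, num_experts)
        best = scores.argmax(dim=-1)   # index of the single active expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = best == i
            if mask.any():
                out[mask] = expert(x[mask])  # only the selected expert computes
        return out

router = Top1Router(hidden_size=8)
y = router(torch.randn(5, 8))
print(y.shape)  # torch.Size([5, 8])
```

Because only one expert fires per token, the compute cost tracks the 1.7B active parameters rather than the 4B total.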
Training Data
The model was trained on the VNLegal1M dataset, consisting of:
- Syllogism: 311,116 samples
- NLI: 279,861 samples
- Multichoice: 362,966 samples
- Total: 953,943 samples
The VNLegal1M data was generated by an LLM (Qwen3-235B-A22B) from 96,000 Vietnamese legal documents.
Training Configuration
For the syllogism task, all linear modules of expert 0 are trained; for the NLI and multichoice tasks, only the FFN modules of experts 1 and 2 (respectively) are trained.
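That per-task module selection can be sketched as a requires_grad mask over parameter names. The naming patterns ("experts.N", "ffn") and the toy model below are assumptions for illustration, not the model's actual parameter names:

```python
import torch.nn as nn

def select_trainable(model, task):
    """Freeze everything, then unfreeze the modules trained for `task`.
    Syllogism trains every linear layer of expert 0; NLI and multichoice
    train only the FFN modules of experts 1 and 2, respectively."""
    expert = {"syllogism": "experts.0", "nli": "experts.1", "multichoice": "experts.2"}[task]
    trainable = []
    for name, param in model.named_parameters():
        keep = expert in name and (task == "syllogism" or "ffn" in name)
        param.requires_grad = keep
        if keep:
            trainable.append(name)
    return trainable

# Toy stand-in that follows the assumed naming scheme:
toy = nn.ModuleDict({"experts": nn.ModuleList(
    nn.ModuleDict({"ffn": nn.Linear(4, 4), "attn": nn.Linear(4, 4)})
    for _ in range(3)
)})
print(select_trainable(toy, "nli"))  # ['experts.1.ffn.weight', 'experts.1.ffn.bias']
```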
Model & Tokenization
- Base model: huyhoangvbck/Qwen3-1.7B-Pretrain-Legal-Document
- Training setup: 1 × B200 GPU
- Maximum sequence length: 1024
- Epochs: 4
- Batch size: 16
- Gradient accumulation steps: 2
- Mixed precision: bf16
- LR warmup ratio: 0.1
- LR scheduler: cosine
- Seed: 1308
- Total training time: 48 hours
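With these settings, the effective batch size per optimizer step follows directly from the per-device batch size and the accumulation steps:

```python
# Gradients are accumulated over 2 micro-batches before each optimizer step,
# so the effective batch size is the product of the two settings above.
batch_size = 16
grad_accum_steps = 2
effective_batch_size = batch_size * grad_accum_steps
print(effective_batch_size)  # 32
```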
Performance
Accuracy on the VLSP 2025 public test:
- Multichoice: 88.36%
- NLI: 86%
Example Usage
You must pass tokenizer=tokenizer when calling generate() so that the expert router works properly, and load the correct task-specific system prompt, as shown in the examples below.

from datasets import load_dataset

# "syllogism/nli/multichoice_questions" stands for one of the three config names
ds = load_dataset("VLSP2025-LegalSML/Public-Test", "syllogism/nli/multichoice_questions")
val_raw = ds["train"]
- For the Syllogism task:
val_raw format:
{
  "question": "string"  // detailed legal scenario with background
}
Example code:
import json
import torch

index = 0
question = json.dumps({"question": val_raw[index]["question"]}, ensure_ascii=False)
# System prompt (in Vietnamese): "You are a Vietnamese legal assistant. Answer the
# syllogistic reasoning question concisely and logically. Output JSON in the form
# {"answer": "Major premise: ... . Minor premise: ... . Conclusion: ... ."}"
system_prompt = {
    "role": "system",
    "content": """Bạn là trợ lý pháp lý Việt Nam. Trả lời câu hỏi suy luận tam đoạn luận ngắn gọn, súc tích và hợp logic. Xuất kết quả đúng định dạng JSON: {"answer": "Tiền đề lớn: ... . Tiền đề nhỏ: ... . Kết luận: ... ."}"""
}
messages = [{"role": "user", "content": question}]
prompt = tokenizer.apply_chat_template([system_prompt] + messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=1024, tokenizer=tokenizer, do_sample=False)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
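Note that generate() returns the prompt tokens followed by the new tokens, so decoding outputs[0] repeats the prompt. Slicing at the prompt length keeps only the model's answer; the pattern, shown here on dummy token ids so it runs stand-alone:

```python
import torch

prompt_ids = torch.tensor([[11, 12, 13, 14]])          # stands in for inputs["input_ids"]
output_ids = torch.tensor([[11, 12, 13, 14, 21, 22]])  # stands in for outputs
# Keep only the tokens generated after the prompt:
new_ids = output_ids[0][prompt_ids.shape[1]:]
print(new_ids.tolist())  # [21, 22]
```

In the examples here, you would decode `outputs[0][inputs["input_ids"].shape[1]:]` instead of `outputs[0]`.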
- For the NLI task:
val_raw format:
{
  "legal_document": "string",    // source legal text or excerpt
  "specific_question": "string", // concrete legal question derived from the text
  "question": "string",          // inference question (yes/no or entailment-based)
  "choices": ["Có", "Không"]     // possible responses
}
Example code:
index = 0
question_json = {
    "legal_document": val_raw[index]["legal_document"],
    "specific_question": val_raw[index]["specific_question"],
    "question": val_raw[index]["question"],
    "choices": val_raw[index]["choices"],
}
question_str = json.dumps(question_json, ensure_ascii=False)
# System prompt (in Vietnamese): "You are a Vietnamese legal assistant. Answer whether
# the provided legal document (legal_document) helps answer the related question
# (specific_question). Output JSON in the form {"answer": 0 or 1}"
system_prompt = {
    "role": "system",
    "content": """Bạn là trợ lý pháp lý Việt Nam. Trả lời câu hỏi văn bản pháp luật(legal_document) được cung cấp có giúp trả lời được câu hỏi(specific_question) mà liên quan đến văn bản đó không. Xuất kết quả đúng định dạng JSON: {"answer": 0 or 1}"""
}
messages = [{"role": "user", "content": question_str}]
prompt = tokenizer.apply_chat_template([system_prompt] + messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=1024, tokenizer=tokenizer, do_sample=False)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
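The decoded result should end with the JSON object requested in the system prompt. A small sketch for pulling it out; the sample string is made up, and the regex assumes the flat, non-nested {"answer": ...} format the prompts request:

```python
import json
import re

def extract_answer(decoded):
    """Return the last flat {...} object found in the decoded text, or None."""
    matches = re.findall(r'\{[^{}]*\}', decoded)
    return json.loads(matches[-1]) if matches else None

# Hypothetical decoded output: prompt text followed by the model's JSON answer.
sample = '... văn bản pháp luật ... {"answer": 1}'
print(extract_answer(sample))  # {'answer': 1}
```

Taking the last match skips any JSON template echoed from the prompt itself.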
- For the Multichoice task:
val_raw format:
{
  "question": "string",  // the legal question text
  "choices": [
    "Option 1: Full answer text.",
    "Option 2: Full answer text.",
    "Option 3: Full answer text.",
    "Option 4: Full answer text."
  ]  // list of possible answer choices
}
Example code:
index = 0
question_json = {
    "question": val_raw[index]["question"],
    "choices": val_raw[index]["choices"],
}
question_str = json.dumps(question_json, ensure_ascii=False)
# System prompt (in Vietnamese): "You are a Vietnamese legal assistant. Answer the
# Vietnamese legal multiple-choice question. Output JSON in the form {"answer": 0|1|2|3}"
system_prompt = {
    "role": "system",
    "content": """Bạn là trợ lý pháp lý Việt Nam. Trả lời câu hỏi trắc nghiệm pháp lí Việt Nam. Xuất kết quả đúng định dạng JSON: {"answer": 0|1|2|3}"""
}
messages = [{"role": "user", "content": question_str}]
prompt = tokenizer.apply_chat_template([system_prompt] + messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=1024, tokenizer=tokenizer, do_sample=False)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
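Since the multichoice answer is an index into choices, mapping it back to the option text is straightforward (illustrative data only, mirroring the val_raw format above):

```python
import json

choices = [
    "Option 1: Full answer text.",
    "Option 2: Full answer text.",
    "Option 3: Full answer text.",
    "Option 4: Full answer text.",
]
# Hypothetical model output in the requested {"answer": 0|1|2|3} format:
answer = json.loads('{"answer": 2}')["answer"]
print(choices[answer])  # Option 3: Full answer text.
```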
Maintainers
This model is developed and maintained by huyhoangvbck. For inquiries, please contact: [email protected]
Model tree for huyhoangvbck/Qwen3-4B-A1.7B-MoeLegalVN1M
- Base model: Qwen/Qwen3-1.7B-Base
- Finetuned from: Qwen/Qwen3-1.7B