Malaysian-Normalizer-Qwen3-8B

Qwen/Qwen3-8B fine-tuned on mesolitica/Malaysian-Normalizer.

Prompt

given the text
text: {text}

normalize to {language} language
  • text is the text you want to normalize.
  • language is the target language for normalization. You can omit the line "normalize to {language} language"; the model then normalizes based on the language of the text itself.
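The template above can be filled with a small helper. This is a minimal sketch: the name `build_prompt` is hypothetical and not part of the model repository, and the sample sentence is made up for illustration. It only reproduces the template and leaves out the last line when no language is given.

```python
def build_prompt(text, language=None):
    """Fill the prompt template; `language` is optional."""
    prompt = f"given the text\ntext: {text}"
    if language:
        # Append the target-language line only when a language is requested.
        prompt += f"\n\nnormalize to {language} language"
    return prompt


print(build_prompt("The price rose 2.5% yesterday.", "english"))
```

Calling `build_prompt(text)` without a language yields the shorter prompt, which leaves the choice of language to the model.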

Output

The model returns a JSON object with the normalized text and a mapping from each original token to its normalized form:

{"normalized_text": "All suspects, aged twenty five to thirty seven, were remanded for seven days beginning today to facilitate investigations under Sections twelve open parenthesis two close parenthesis, thirty nine A open parenthesis one close parenthesis and thirty nine A open parenthesis two close parenthesis of the Dangerous Drugs Act one thousand nine hundred fifty two. dash Bernama", "normalizer_mapping": {"25": "twenty five", "37": "thirty seven", "12(2)": "twelve open parenthesis two close parenthesis", "39A(1)": "thirty nine A open parenthesis one close parenthesis", "39A(2)": "thirty nine A open parenthesis two close parenthesis", "1952": "one thousand nine hundred fifty two", "\u2014": "dash"}}
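Since the reply is plain JSON, it can be read with the standard library. The sketch below uses a shortened, hand-written sample in the same shape as the output above (the `raw` string is illustrative, not real model output) and assumes the reply is valid JSON; `json.loads` raises `json.JSONDecodeError` otherwise.

```python
import json

# Shortened sample in the same shape as the model reply above.
raw = (
    '{"normalized_text": "aged twenty five to thirty seven", '
    '"normalizer_mapping": {"25": "twenty five", "37": "thirty seven"}}'
)

result = json.loads(raw)
normalized = result["normalized_text"]

# List each original token next to its normalized (spoken) form.
for original, spoken in result["normalizer_mapping"].items():
    print(f"{original!r} -> {spoken!r}")
```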

Example

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'malaysia-ai/Malaysian-Normalizer-Qwen3-8B'

# Load the weights in their stored dtype and move the model to the GPU.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype='auto').cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

user = """
given the text
text: “Oleochemical exports dropped 2.72 per cent m-o-m to 210,924 tonnes from 216,816 tonnes while biodiesel exports fell 48.89 per cent m-o-m to 23,689 tonnes from 46,345 tonnes,” it said.

normalize to english language
"""
message = [
    {'role': 'user', 'content': user.strip()}
]

# Render the chat template as a string that ends with the assistant turn marker.
prompt = tokenizer.apply_chat_template(message, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(prompt, return_tensors='pt').to('cuda')

generation_output = model.generate(
    **inputs,
    max_new_tokens=1024,
    top_p=0.9,
    top_k=50,
    temperature=0.9,
    do_sample=True,
    repetition_penalty=1.0,
)

# Decode the prompt and the completion together, keeping the special tokens.
print(tokenizer.decode(generation_output[0]))

Output:

<|im_start|>user
given the text
text: “Oleochemical exports dropped 2.72 per cent m-o-m to 210,924 tonnes from 216,816 tonnes while biodiesel exports fell 48.89 per cent m-o-m to 23,689 tonnes from 46,345 tonnes,” it said.

normalize to english language<|im_end|>
<|im_start|>assistant
<think>

</think>

{"normalized_text": "open quote Oleochemical exports dropped two point seven two per cent m dash o dash m to two hundred ten thousand nine hundred twenty four tonnes from two hundred sixteen thousand eight hundred sixteen tonnes while biodiesel exports fell forty eight point eight nine per cent m dash o dash m to twenty three thousand six hundred eighty nine tonnes from forty six thousand three hundred forty five tonnes, close quote it said.", "normalizer_mapping": {"\u201c": "open quote", "2.72": "two point seven two", "m-o-m": "m dash o dash m", "210,924": "two hundred ten thousand nine hundred twenty four", "216,816": "two hundred sixteen thousand eight hundred sixteen", "48.89": "forty eight point eight nine", "23,689": "twenty three thousand six hundred eighty nine", "46,345": "forty six thousand three hundred forty five", "\u201d": "close quote"}}<|im_end|>
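The decoded string above still contains the prompt, the empty think block, and the chat markers. A minimal sketch of pulling out just the JSON follows; the helper name `extract_result` is hypothetical, and the `decoded` sample is a shortened hand-written string in the same shape as the output above. It assumes the assistant turn contains only the think block followed by valid JSON.

```python
import json
import re


def extract_result(decoded):
    """Return the JSON object from the assistant turn of a decoded generation."""
    # Keep only the text after the last assistant marker.
    answer = decoded.rsplit("<|im_start|>assistant", 1)[-1]
    # Drop the (possibly empty) think block and the end-of-turn token.
    answer = re.sub(r"<think>.*?</think>", "", answer, flags=re.DOTALL)
    answer = answer.replace("<|im_end|>", "").strip()
    return json.loads(answer)


# Shortened sample in the same shape as the decoded output above.
decoded = (
    "<|im_start|>user\ngiven the text\ntext: up 2.72 per cent<|im_end|>\n"
    "<|im_start|>assistant\n<think>\n\n</think>\n\n"
    '{"normalized_text": "up two point seven two per cent", '
    '"normalizer_mapping": {"2.72": "two point seven two"}}<|im_end|>'
)
result = extract_result(decoded)
print(result["normalized_text"])
```

If generation stops at `max_new_tokens` before the JSON is complete, `json.loads` raises `json.JSONDecodeError`, so callers may want to catch that case.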

Revision

current stage, 7e4483ac0c66fef90556113d8b32665c80786b5f

  1. This revision was trained on mesolitica/Malaysian-SFT/malaysian_normalizer and mesolitica/Malaysian-SFT/malaysian_normalizer_pseudolabel.
  2. This revision was trained on the proper train set.

older stage, 7b502263c605355fbc93a1b76f6712461812f863

  1. This revision was initially trained on mesolitica/Malaysian-SFT/malaysian_normalizer.
  2. This revision was used to pseudolabel more data, released at mesolitica/Malaysian-Normalizer#pseudolabel.
  3. This revision was trained on a leaked test set.
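To pin one of the stages listed above, the `revision` argument of `from_pretrained` can be used. This sketch assumes the two hashes are commit hashes of the malaysia-ai/Malaysian-Normalizer-Qwen3-8B repository; the names `REVISIONS` and `load_model` are hypothetical.

```python
MODEL_ID = "malaysia-ai/Malaysian-Normalizer-Qwen3-8B"

# Hashes copied from the revision list above (assumed to be repository commits).
REVISIONS = {
    "current": "7e4483ac0c66fef90556113d8b32665c80786b5f",
    "older": "7b502263c605355fbc93a1b76f6712461812f863",
}


def load_model(stage="current"):
    """Load the model weights pinned to the given stage."""
    # Imported lazily so the mapping above can be used without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        revision=REVISIONS[stage],
        torch_dtype="auto",
    )
```

Pinning the current stage avoids silently picking up a later commit, and keeps evaluation away from the older stage that saw the test set.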

Acknowledgement

Special thanks to the Lambda Research Grant program for the Lambda cloud credits!
