Malaysian-Normalizer-Qwen3-8B
Finetune of Qwen/Qwen3-8B on mesolitica/Malaysian-Normalizer.
Prompt
given the text
text: {text}
normalize to {language} language
text is the text you want to normalize. language is the language you want to normalize to. You can omit the "normalize to {language} language" line; this makes the model normalize based on the language of the text.
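A minimal sketch of assembling that prompt in Python (the build_prompt helper and the sample sentence are illustrative, not part of the model or any library):

def build_prompt(text, language=None):
    # Follows the template above; the "normalize to {language} language" line is optional.
    prompt = f'given the text\ntext: {text}'
    if language is not None:
        prompt += f'\nnormalize to {language} language'
    return prompt

print(build_prompt('Harga naik 2.5 peratus pada 2024.', language='malay'))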
Output
The model returns a JSON object, for example:
{"normalized_text": "All suspects, aged twenty five to thirty seven, were remanded for seven days beginning today to facilitate investigations under Sections twelve open parenthesis two close parenthesis, thirty nine A open parenthesis one close parenthesis and thirty nine A open parenthesis two close parenthesis of the Dangerous Drugs Act one thousand nine hundred fifty two. dash Bernama", "normalizer_mapping": {"25": "twenty five", "37": "thirty seven", "12(2)": "twelve open parenthesis two close parenthesis", "39A(1)": "thirty nine A open parenthesis one close parenthesis", "39A(2)": "thirty nine A open parenthesis two close parenthesis", "1952": "one thousand nine hundred fifty two", "\u2014": "dash"}}
Example
from transformers import TextStreamer, AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    'malaysia-ai/Malaysian-Normalizer-Qwen3-8B',
    torch_dtype='auto'
).cuda()
tokenizer = AutoTokenizer.from_pretrained('malaysia-ai/Malaysian-Normalizer-Qwen3-8B')

# Build the prompt described above.
user = """
given the text
text: “Oleochemical exports dropped 2.72 per cent m-o-m to 210,924 tonnes from 216,816 tonnes while biodiesel exports fell 48.89 per cent m-o-m to 23,689 tonnes from 46,345 tonnes,” it said.
normalize to english language
"""
message = [
    {'role': 'user', 'content': user.strip()}
]
prompt = tokenizer.apply_chat_template(message, add_generation_prompt=True, tokenize=False)

# Generate and decode.
generate_kwargs = dict(
    **tokenizer(prompt, return_tensors='pt').to('cuda'),
    max_new_tokens=1024,
    top_p=0.9,
    top_k=50,
    temperature=0.9,
    do_sample=True,
    repetition_penalty=1.0,
)
generation_output = model.generate(**generate_kwargs)
print(tokenizer.decode(generation_output[0]))
Output:
<|im_start|>user
given the text
text: “Oleochemical exports dropped 2.72 per cent m-o-m to 210,924 tonnes from 216,816 tonnes while biodiesel exports fell 48.89 per cent m-o-m to 23,689 tonnes from 46,345 tonnes,” it said.
normalize to english language<|im_end|>
<|im_start|>assistant
<think>
</think>
{"normalized_text": "open quote Oleochemical exports dropped two point seven two per cent m dash o dash m to two hundred ten thousand nine hundred twenty four tonnes from two hundred sixteen thousand eight hundred sixteen tonnes while biodiesel exports fell forty eight point eight nine per cent m dash o dash m to twenty three thousand six hundred eighty nine tonnes from forty six thousand three hundred forty five tonnes, close quote it said.", "normalizer_mapping": {"\u201c": "open quote", "2.72": "two point seven two", "m-o-m": "m dash o dash m", "210,924": "two hundred ten thousand nine hundred twenty four", "216,816": "two hundred sixteen thousand eight hundred sixteen", "48.89": "forty eight point eight nine", "23,689": "twenty three thousand six hundred eighty nine", "46,345": "forty six thousand three hundred forty five", "\u201d": "close quote"}}<|im_end|>
Revision
Current stage, 7e4483ac0c66fef90556113d8b32665c80786b5f
- This revision was trained on mesolitica/Malaysian-SFT/malaysian_normalizer and mesolitica/Malaysian-SFT/malaysian_normalizer_pseudolabel.
- This revision was trained on the proper train set.
Older stage, 7b502263c605355fbc93a1b76f6712461812f863
- This revision was initially trained on mesolitica/Malaysian-SFT/malaysian_normalizer.
- This revision was used to pseudolabel more data, released at mesolitica/Malaysian-Normalizer#pseudolabel.
- This revision was trained on a leaked test set.
Acknowledgement
Special thanks to the Lambda Research Grant program for the Lambda Cloud credits!