# Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON
This repository hosts a modified version of the IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese model. Its primary purpose is to include the `tokenizer.json` file, which was missing from the original release.
## Motivation for this Repository
The original IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese model is an excellent T5-based model for various Chinese NLP tasks. However, it was released with only a `spiece.model` file for its tokenizer, lacking the `tokenizer.json` file.

While the Python `transformers` library can generally load the tokenizer from `spiece.model`, its absence caused issues for environments that strictly prefer or require `tokenizer.json` (e.g., certain versions or implementations of the Rust `tokenizers` library, or other frameworks that rely on this standardized format).

To enhance usability and compatibility across platforms and libraries, this repository provides the model together with the commonly expected `tokenizer.json` file.
## Changes Made
The following modifications have been made to the original IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese model files:

- **Added `tokenizer.json`**: The primary change is the inclusion of the `tokenizer.json` file, generated from the original `spiece.model` using the Python `transformers` library's `save_pretrained()` method. This ensures broader compatibility and easier loading across applications.
- **No model weight changes**: Crucially, the model weights (`pytorch_model.bin` or `model.safetensors`) themselves have not been altered in any way. This repository provides the exact same powerful pre-trained model, just with an added tokenizer serialization format.
## How to Use
You can load this model and its tokenizer using the Hugging Face `transformers` library:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "your-username/Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON"  # Replace with your actual repository name

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "你好,这是一个测试。"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
For Rust users (and others requiring `tokenizer.json`):
```rust
use tokenizers::Tokenizer;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let model_id = "your-username/Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON"; // Replace with your actual repository name

    // Tokenizer::from_pretrained (available behind the crate's `http`
    // feature) will now find and use tokenizer.json. It is a blocking
    // call, so no async runtime is needed.
    let tokenizer = Tokenizer::from_pretrained(model_id, None)?;

    let text = "你好,这是一个中文文本。";
    let encoding = tokenizer.encode(text, true)?;

    println!("Original text: {}", text);
    println!("Tokens: {:?}", encoding.get_tokens());
    println!("IDs: {:?}", encoding.get_ids());

    let decoded_text = tokenizer.decode(encoding.get_ids(), true)?;
    println!("Decoded text: {}", decoded_text);

    Ok(())
}
```
## Original Model Information
For more details about the original IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese model, its training, capabilities, and benchmarks, please refer to its official repository: IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese.