Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON

This repository hosts a modified version of the IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese model. The primary purpose of this repository is to include the tokenizer.json file, which was missing in the original release.

Motivation for this Repository

The original IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese model is an excellent T5-based model for various Chinese NLP tasks. However, it was released with only a spiece.model file for its tokenizer, lacking the tokenizer.json file.

While the Python transformers library can generally load the tokenizer from spiece.model, this absence caused issues for environments that strictly prefer or require tokenizer.json (e.g., certain versions or implementations of the Rust tokenizers library, or other frameworks that rely on this standardized format).

To enhance usability and compatibility across different platforms and libraries, this repository was created to provide the model with the commonly expected tokenizer.json file.

Changes Made

The following modifications have been made to the original IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese model files:

  • Added tokenizer.json: The primary change is the inclusion of the tokenizer.json file, generated from the original spiece.model by loading it with the transformers library's fast (Rust-backed) tokenizer class and saving with save_pretrained(), which writes tokenizer.json alongside the other tokenizer files. This enables broader compatibility and easier loading for various applications.
  • No changes to model weights: Crucially, the model weights (pytorch_model.bin or model.safetensors) have not been altered in any way. This repository provides the exact same pre-trained model, just with an additional tokenizer serialization format.

How to Use

You can load this model and its tokenizer using the Hugging Face transformers library:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "your-username/Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON"  # Replace with your actual repository name

# AutoTokenizer now loads the fast tokenizer directly from tokenizer.json.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "你好,这是一个测试。"  # "Hello, this is a test."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

For Rust users (and others requiring tokenizer.json):

use tokenizers::Tokenizer;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let model_id = "your-username/Randeng-T5-784M-MultiTask-Chinese-with-Tokenizer-JSON"; // Replace with your actual repository name

    // Tokenizer::from_pretrained is synchronous and requires the `http` feature
    // of the tokenizers crate; it will now find and use tokenizer.json.
    let tokenizer = Tokenizer::from_pretrained(model_id, None)?;

    let text = "你好,这是一个中文文本。"; // "Hello, this is a Chinese text."
    let encoding = tokenizer.encode(text, true)?;

    println!("Original text: {}", text);
    println!("Tokens: {:?}", encoding.get_tokens());
    println!("IDs: {:?}", encoding.get_ids());

    let decoded_text = tokenizer.decode(encoding.get_ids(), true)?;
    println!("Decoded text: {}", decoded_text);

    Ok(())
}

Original Model Information

For more details about the original IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese model, its training, capabilities, and benchmarks, please refer to its official repository: IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese.
