MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System

MoC was obtained by fully fine-tuning Qwen2.5-1.5B-Instruct on 20K data entries from the CRUD benchmark, which were prepared with GPT-4o. Using the segmented data generated by GPT-4o, we assigned each text a granularity label from 0 to 3, corresponding to the average chunk length intervals (0, 120], (120, 150], (150, 180], and (180, +∞), respectively.
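
The mapping from average chunk length to granularity label can be illustrated with a minimal sketch. The function name `granularity_label` and the constant `UPPER_BOUNDS` are hypothetical, and measuring chunk length in characters via `len` is an assumption, since the text does not state the unit; this is not the authors' implementation.

```python
from bisect import bisect_left

# Inclusive upper bounds of the first three intervals:
# (0, 120] -> 0, (120, 150] -> 1, (150, 180] -> 2, (180, +inf) -> 3
UPPER_BOUNDS = (120, 150, 180)


def granularity_label(chunks):
    """Return the granularity label (0 to 3) for one segmented text.

    `chunks` is the list of chunk strings produced for a single text.
    Length is measured in characters here, which is an assumption.
    """
    if not chunks:
        raise ValueError("at least one chunk is required")
    avg_len = sum(len(chunk) for chunk in chunks) / len(chunks)
    if avg_len == 0:
        raise ValueError("average chunk length must be positive")
    # bisect_left counts the bounds strictly below avg_len, so a value
    # equal to a bound stays in the lower, right-closed interval.
    return bisect_left(UPPER_BOUNDS, avg_len)
```

Because the intervals are closed on the right, an average length of exactly 120 receives label 0, while any value above 120 and up to 150 receives label 1.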