MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System
Abstract
Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper first introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness, to enable the direct quantification of chunking quality. Leveraging this assessment method, we highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances, thereby substantiating the necessity of integrating LLMs into the chunking process. To address the inherent trade-off between computational efficiency and chunking precision in LLM-based approaches, we devise the granularity-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism. Notably, our objective is to guide the chunker toward generating a structured list of chunking regular expressions, which are subsequently employed to extract chunks from the original text. Extensive experiments demonstrate that both our proposed metrics and the MoC framework effectively address the challenges of the chunking task, revealing the chunking kernel while enhancing the performance of the RAG system.
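The abstract names the two metrics but does not reproduce their formulas here, so the sketch below shows only one plausible instantiation: Boundary Clarity scored as the ratio between a chunk's perplexity conditioned on its predecessor and its unconditioned perplexity, under any small causal LM. The `gpt2` scorer, the function names, and the exact ratio are illustrative assumptions, not the paper's specification; the graph-based Chunk Stickiness metric is omitted.

```python
# Minimal sketch (assumed formula, not the paper's exact definition):
# Boundary Clarity ~= ppl(next_chunk | prev_chunk) / ppl(next_chunk).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in scorer; the paper's choice may differ
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def perplexity(text: str, context: str = "") -> float:
    """Perplexity of `text`, optionally conditioned on `context`."""
    txt_ids = tokenizer(text, return_tensors="pt").input_ids
    if context:
        ctx_ids = tokenizer(context, return_tensors="pt").input_ids
        input_ids = torch.cat([ctx_ids, txt_ids], dim=1)
        labels = input_ids.clone()
        # Mask context tokens so only `text` tokens contribute to the loss.
        labels[:, : ctx_ids.shape[1]] = -100
    else:
        input_ids, labels = txt_ids, txt_ids.clone()
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss
    return float(torch.exp(loss))

def boundary_clarity(prev_chunk: str, next_chunk: str) -> float:
    return perplexity(next_chunk, context=prev_chunk) / perplexity(next_chunk)
```

A ratio near 1 suggests the preceding chunk barely helps predict the next one, i.e. the boundary separates largely independent content; a much smaller ratio suggests the two chunks were split apart despite strong dependence.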
Community
🚀 MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System
💡 Abstract
Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper first proposes a dual-metric evaluation framework of Boundary Clarity and Chunk Stickiness, enabling direct quantification of chunking quality. Building on this framework, we reveal the semantic-boundary ambiguity that traditional semantic chunking suffers in complex contexts, and thereby argue for the necessity of involving LLMs in the chunking task. To resolve the tension between LLM computational efficiency and chunking precision, we design the MoC framework with a three-stage processing mechanism. By routing to sparsely activated, adapted experts, it optimizes overall efficiency without compromising chunking accuracy. Notably, we guide the experts to generate a highly structured list of chunking regular expressions, which are used to precisely extract chunks from the original text. Extensive experiments show that both our proposed metrics and the MoC framework handle the chunking task effectively, revealing the chunking kernel while improving the performance of the RAG system.
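To make the regex-guided step concrete, here is a minimal sketch assuming the chunker returns an ordered list of regular expressions that are matched one after another against the source text, each match becoming one chunk. `extract_chunks` and the example patterns are hypothetical illustrations, not the paper's API.

```python
# A minimal sketch of regex-guided chunk extraction, assuming the chunker
# emits an ordered list of patterns (names here are illustrative).
import re

def extract_chunks(text: str, patterns: list[str]) -> list[str]:
    """Apply each generated regex in order; every match becomes one chunk.
    Text skipped between matches is simply dropped in this sketch."""
    chunks = []
    cursor = 0
    for pattern in patterns:
        match = re.compile(pattern, re.DOTALL).search(text, cursor)
        if match is None:
            continue  # a correction step (e.g. edit-distance alignment) could recover here
        chunks.append(match.group(0))
        cursor = match.end()
    if cursor < len(text):  # keep any trailing text as a final chunk
        chunks.append(text[cursor:])
    return chunks

# Example: patterns a chunker might return for a two-topic passage.
doc = "RAG retrieves evidence. It grounds generation. Chunking splits text."
print(extract_chunks(doc, [r".*?generation\.", r"\s*Chunking.*"]))
```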
🧠 Inspiration
1️⃣ We break through the traditional indirect evaluation paradigm, propose the dual metrics of Boundary Clarity and Chunk Stickiness, and achieve direct quantification of chunk quality. Furthermore, by deconstructing the mechanism of semantic chunking failure, we provide experimental validation for involving LLMs in chunking tasks.
2️⃣ We design a hybrid chunking expert architecture called MoC, which dynamically schedules lightweight chunking experts through a multi-granularity-aware routing network. This architecture innovatively integrates a regular-expression-guided chunking method, a multi-granularity chunking mechanism based on sparse activation, and an edit distance-driven correction algorithm (see the sketch after this list).
3️⃣ To verify the effectiveness of our proposed metrics and chunking methods, we conduct multi-dimensional experiments on four question answering datasets utilizing five different language models, and perform an in-depth analysis.
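As a hedged picture of the edit distance-driven correction in point 2️⃣: chunk text generated by an expert can drift slightly from the source document, so a correction step can snap it back to the closest source substring. The brute-force sketch below is an illustration under that assumption; `snap_to_source` and `levenshtein` are hypothetical names, and the paper's actual algorithm is surely more efficient than this window scan.

```python
# A minimal sketch of edit-distance-driven correction (illustrative only).
def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert into a
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def snap_to_source(generated: str, source: str, slack: int = 5) -> str:
    """Return the substring of `source` closest to `generated` in edit
    distance, scanning windows of roughly matching length."""
    n = len(generated)
    best, best_dist = generated, float("inf")
    for start in range(0, max(1, len(source) - n + slack + 1)):
        for length in range(max(1, n - slack), n + slack + 1):
            cand = source[start:start + length]
            dist = levenshtein(generated, cand)
            if dist < best_dist:
                best, best_dist = cand, dist
    return best

source = "Semantic chunking splits documents at topic boundaries."
print(snap_to_source("Semantic chunking split documents", source))
```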