Semantic Chunking Strategies for RAG in Practice: An In-Depth Comparison

2026-06-20

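In a RAG system, parsing a PDF yields structured or semi-structured data. The key next step is to split that data into semantic chunks, extract fine-grained features from them, and embed them as vectors that capture their meaning. As Figure 1 shows, this step sits at the very front of the retrieval pipeline.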

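Rule-based chunking is the most common approach: a fixed chunk size, with an overlap window between adjacent chunks. For documents with a hierarchical structure, LangChain's RecursiveCharacterTextSplitter lets you define multiple levels of separators.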
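
To make the fixed-size rule concrete, here is a minimal sketch in plain Python. The function name fixed_size_chunks and the character-based window are assumptions for illustration; it is not the implementation used by LangChain or LlamaIndex.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Illustrative input: the window boundaries ignore sentence boundaries.
text = "BERT uses a masked language model. It also uses next sentence prediction."
for chunk in fixed_size_chunks(text, chunk_size=30, overlap=10):
    print(repr(chunk))
```

Printing the chunks shows windows that start and end in the middle of words and sentences, because the boundaries depend only on character counts.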

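In production, however, these methods show clear weaknesses: fixed rules (a fixed chunk size, a fixed overlap length) often truncate context and lose key information, or produce chunks so large that they introduce noise.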

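The better option is semantic chunking, whose goal is for each chunk to hold information that is semantically self-contained and complete. This article reviews several mainstream semantic chunking methods, covering both principles and practice, in three categories:

Embedding-based methods
Methods based on pre-trained models
Methods based on large language models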

1. Embedding-Based Chunking

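Both LlamaIndex and LangChain provide an embedding-based semantic chunker, and the two implementations follow a similar idea. LlamaIndex is used as the example below.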

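This requires a recent version of LlamaIndex. The 0.9.45 release I had been using does not include this algorithm, so I created a new conda environment and installed 0.10.12: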

pip install llama-index-core
pip install llama-index-readers-file
pip install llama-index-embeddings-openai

Installation in 0.10.12 is modular, so only the key components are installed here. The installed versions are:

(llamaindex_010) Florian:~ Florian$ pip list | grep llama
llama-index-core                    0.10.12
llama-index-embeddings-openai       0.1.6
llama-index-readers-file            0.1.5
llamaindex-py-client                0.1.13

The test code is as follows:

from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import SimpleDirectoryReader
import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPEN_AI_KEY"

dir_path = "YOUR_DIR_PATH"
documents = SimpleDirectoryReader(dir_path).load_data()

embed_model = OpenAIEmbedding()
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)
nodes = splitter.get_nodes_from_documents(documents)
for node in nodes:
    print('-' * 100)
    print(node.get_content())

Figure 2 shows the core flow of splitter.get_nodes_from_documents.

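The 'sentences' object in this flow is a Python list, and each element is a dictionary with four key fields:

sentence: the current sentence
index: the index of the sentence
combined_sentence: a sliding window that, with buffer_size=1, contains the three sentences [index-1, index, index+1]; it is intended to reduce single-sentence noise and strengthen the semantic link between consecutive sentences
combined_sentence_embedding: the embedding vector of that combined sentence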

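In essence, the method slides a window over the sentences and computes the distance between the embeddings of adjacent windows. Wherever that distance exceeds the percentile set by breakpoint_percentile_threshold (95 in the test code), a breakpoint is placed; the sentences between two breakpoints are merged into one chunk.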
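
The flow can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the LlamaIndex source: toy_embed is a hypothetical bag-of-words stand-in for a real embedding model, and the helper names (cosine_distance, percentile, semantic_chunks) are made up for this example.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def percentile(values, q):
    """q-th percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_chunks(sentences, buffer_size=1, percentile_threshold=95, embed=toy_embed):
    """Group sentences into chunks, breaking where adjacent windows diverge most."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    # 1. Combine each sentence with its neighbours (the sliding window).
    windows = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    vectors = [embed(w) for w in windows]
    # 2. Distance between each pair of adjacent windows.
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    # 3. Break wherever the distance exceeds the chosen percentile.
    cutoff = percentile(distances, percentile_threshold)
    chunks, start = [], 0
    for i, dist in enumerate(distances):
        if dist > cutoff:
            chunks.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks
```

Because the cutoff is a percentile of the distances observed in the document itself, a high value such as 95 yields few breakpoints, which is consistent with coarse chunks on longer texts.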

Using the BERT paper as the test document, the output is as follows:

(llamaindex_010) Florian:~ Florian$ python /Users/Florian/Documents/june_pdf_loader/test_semantic_chunk.py
......
----------------------------------------------------------------------------------------------------
We argue that current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). Such restrictions are sub-optimal for sentence-level tasks, and could be very harmful when applying fine-tuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions.
In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers. BERT alleviates the previously mentioned unidirectionality constraint by using a "masked language model" (MLM) pre-training objective, inspired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked
arXiv:1810.04805v2 [cs.CL] 24 May 2019
----------------------------------------------------------------------------------------------------
word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, we also use a “next sentence prediction” task that jointly pre-trains text-pair representations. The contributions of our paper are as follows:
• We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pre-trained deep bidirectional representations. This is also in contrast to Peters et al.
----------------------------------------------------------------------------------------------------
......

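Summary: the embedding-based method

In my tests, the chunks come out rather coarse-grained.
As Figure 2 shows, the method works page by page and does not directly handle chunks that span pages.
Results depend heavily on the chosen embedding model and need thorough evaluation in the actual use case.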

2. Chunking Based on Pre-Trained Models

2.1 Naive BERT

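BERT's pre-training includes the NSP (Next Sentence Prediction) task, which learns the relationship between two sentences: given a pair of sentences, the model predicts whether the second follows the first.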

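This suggests a simple chunking scheme: split the document into sentences, then slide a window over them and feed each pair of adjacent sentences to BERT for an NSP prediction (see Figure 3).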

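If the predicted score falls below a threshold, the semantic link between the two sentences is weak and the position can serve as a breakpoint (between sentence 2 and sentence 3 in Figure 3, for example).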

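This method works out of the box, with no extra training or fine-tuning. Its drawbacks are significant, though: each decision relies only on the two adjacent sentences and ignores wider context, and pair-by-pair prediction is inefficient.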
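
A minimal sketch of this scheme is shown below. It assumes the Hugging Face transformers and torch packages and the bert-base-uncased checkpoint; the function names, the 0.5 threshold, and the pluggable score argument are illustrative choices, not part of any published implementation.

```python
from functools import lru_cache
from typing import Callable, List


@lru_cache(maxsize=1)
def _load_nsp(model_name: str):
    """Load tokenizer and NSP model once (needs `torch` and `transformers`)."""
    from transformers import BertForNextSentencePrediction, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertForNextSentencePrediction.from_pretrained(model_name)
    model.eval()
    return tokenizer, model


def nsp_probability(first: str, second: str, model_name: str = "bert-base-uncased") -> float:
    """Probability, per BERT's NSP head, that `second` follows `first`."""
    import torch

    tokenizer, model = _load_nsp(model_name)
    inputs = tokenizer(first, second, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Class 0 of the NSP head means "second is a continuation of first".
    return torch.softmax(logits, dim=-1)[0, 0].item()


def split_by_nsp(
    sentences: List[str],
    score: Callable[[str, str], float] = nsp_probability,
    threshold: float = 0.5,  # assumed value; tune on your own data
) -> List[List[str]]:
    """Start a new chunk wherever the adjacent-pair score drops below threshold."""
    chunks, current = [], []
    for i, sentence in enumerate(sentences):
        current.append(sentence)
        is_last = i == len(sentences) - 1
        if is_last or score(sentence, sentences[i + 1]) < threshold:
            chunks.append(current)
            current = []
    return chunks
```

Passing the scorer as an argument keeps the splitting logic testable without downloading a model, and it makes the cost visible: one model call per adjacent sentence pair.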

2.2 Cross-Segment Attention

The paper "Text Segmentation by Cross Segment Attention" proposes three cross-segment attention models, shown in Figure 4.

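The model in Figure 4(a) treats text segmentation as a classification task at each candidate break. For every candidate break, it takes k tokens on each side as context and uses the hidden state of [CLS] for binary classification.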
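
How the input for each candidate break might be assembled can be sketched as follows. This is an illustration only: it uses whitespace tokens instead of BERT word pieces, the usual BERT sentence-pair layout, and a made-up function name (cross_segment_inputs); the classifier itself is omitted.

```python
from typing import List


def cross_segment_inputs(sentences: List[str], k: int) -> List[List[str]]:
    """Build one [CLS] left [SEP] right [SEP] sequence per candidate break.

    A candidate break sits between sentence i-1 and sentence i. The left
    context is the last k tokens before it, the right context the first k
    tokens after it.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    examples = []
    for i in range(1, len(sentences)):
        left = " ".join(sentences[:i]).split()[-k:]
        right = " ".join(sentences[i:]).split()[:k]
        examples.append(["[CLS]", *left, "[SEP]", *right, "[SEP]"])
    return examples


for tokens in cross_segment_inputs(["a b c", "d e", "f"], k=2):
    print(" ".join(tokens))
```

Each sequence would then go through BERT, with the [CLS] hidden state feeding a binary classifier that decides whether the candidate is a real segment boundary.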

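The other two models differ slightly. One uses BERT to produce a vector for each sentence and feeds the sequence of sentence vectors into a Bi-LSTM (Figure 4(b)); the other feeds those vectors into a second BERT (Figure 4(c)). Both predict whether each position is a segment boundary.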

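All three models were state of the art at the time (see Figure 5), but so far only the training code has been released; no trained model for inference is available.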

2.3 Sequence Model

The cross-segment model vectorizes sentence by sentence and lacks wider context. An improved approach, SeqModel, comes from the paper "Sequence Model with Self-Adaptive Sliding Window for Efficient Spoken Document Segmentation".

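SeqModel encodes multiple sentences with BERT at the same time, modeling longer-range dependencies before deciding whether a segment boundary follows each sentence. It also uses a self-adaptive sliding window to speed up inference without sacrificing accuracy, as illustrated in Figure 6.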

SeqModel can be called through ModelScope with very little code:

from modelscope.outputs import OutputKeys
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

p = pipeline(
    task = Tasks.document_segmentation,
    model = 'damo/nlp_bert_document-segmentation_english-base'
)

print('-' * 100)

result = p(documents='We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pretrained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs. • We show that pre-trained representations reduce the need for many hea vily-engineered taskspecific architectures. BERT is the first finetuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures. Today is a good day')

print(result[OutputKeys.TEXT])

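For the test, I appended 'Today is a good day' to the end of the input. The model did not split it off as a separate segment; the output remains one continuous whole.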

(modelscope) Florian:~ Florian$ python /Users/Florian/Documents/june_pdf_loader/test_seqmodel.py
2024-02-24 17:09:36,288 - modelscope - INFO - PyTorch version 2.2.1 Found.
2024-02-24 17:09:36,288 - modelscope - INFO - Loading ast index from /Users/Florian/.cache/modelscope/ast_indexer
......
----------------------------------------------------------------------------------------------------
...... We demonstrate the importance of bidirectional pre-training for language representations.
Unlike Radford et al.(2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pretrained deep bidirectional representations.
This is also in contrast to Peters et al.(2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.
• We show that pre-trained representations reduce the need for many hea vily-engineered taskspecific architectures.
BERT is the first finetuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.
Today is a good day

3. LLM-Based Chunking

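The paper "Dense X Retrieval: What Retrieval Granularity Should We Use?" introduces a retrieval unit called the proposition: an atomic expression in the text, each one encapsulating a distinct fact and presented in concise, self-contained natural language.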

How are propositions extracted? The paper generates them by prompting an LLM.

Both LlamaIndex and LangChain implement this algorithm; LlamaIndex is used as the example below.

The LlamaIndex implementation uses the prompt provided in the paper:

PROPOSITIONS_PROMPT = PromptTemplate(
"""Decompose the "Content" into clear and simple propositions, ensuring they are interpretable out of
context.
1. Split compound sentence into simple sentences. Maintain the original phrasing from the input
whenever possible.
2. For any named entity that is accompanied by additional descriptive information, separate this
information into its own distinct proposition.
3. Decontextualize the proposition by adding necessary modifier to nouns or entire sentences
and replacing pronouns (e.g., "it", "he", "she", "they", "this", "that") with the full name of the
entities they refer to.
4. Present the results as a list of strings, formatted in JSON.

Input: Title: ¯Eostre. Section: Theories and interpretations, Connection to Easter Hares. Content:
The earliest evidence for the Easter Hare (Osterhase) was recorded in south-west Germany in
1678 by the professor of medicine Georg Franck von Franckenau, but it remained unknown in
other parts of Germany until the 18th century. Scholar Richard Sermon writes that "hares were
frequently seen in gardens in spring, and thus may ha ve served as a convenient explanation for the
origin of the colored eggs hidden there for children. Alternatively, there is a European tradition
that hares laid eggs, since a hare's scratch or form and a lapwing's nest look very similar, and
both occur on grassland and are first seen in the spring. In the nineteenth century the influence
of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular throughout Europe.
German immigrants then exported the custom to Britain and America where it evolved into the
Easter Bunny."
Output: [ "The earliest evidence for the Easter Hare was recorded in south-west Germany in
1678 by Georg Franck von Franckenau.", "Georg Franck von Franckenau was a professor of
medicine.", "The evidence for the Easter Hare remained unknown in other parts of Germany until
the 18th century.", "Richard Sermon was a scholar.", "Richard Sermon writes a hypothesis about
the possible explanation for the connection between hares and the tradition during Easter", "Hares
were frequently seen in gardens in spring.", "Hares may ha ve served as a convenient explanation
for the origin of the colored eggs hidden in gardens for children.", "There is a European tradition
that hares laid eggs.", "A hare's scratch or form and a lapwing's nest look very similar.", "Both
hares and lapwing's nests occur on grassland and are first seen in the spring.", "In the nineteenth
century the influence of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular
throughout Europe.", "German immigrants exported the custom of the Easter Hare/Rabbit to
Britain and America.", "The custom of the Easter Hare/Rabbit evolved into the Easter Bunny in
Britain and America." ]

Input: {node_text}
Output:"""
)

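The key components of LlamaIndex 0.10.12 were already installed for the embedding-based method above, but DenseXRetrievalPack additionally requires llama-index-llms-openai. After installing it, the components are: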

(llamaindex_010) Florian:~ Florian$ pip list | grep llama
llama-index-core                    0.10.12
llama-index-embeddings-openai       0.1.6
llama-index-llms-openai             0.1.6
llama-index-readers-file            0.1.5
llamaindex-py-client                0.1.13

The test code is as follows:

from llama_index.core.readers import SimpleDirectoryReader
from llama_index.core.llama_pack import download_llama_pack
import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"

DenseXRetrievalPack = download_llama_pack(
    "DenseXRetrievalPack", "./dense_pack"
)

dir_path = "YOUR_DIR_PATH"
documents = SimpleDirectoryReader(dir_path).load_data()

dense_pack = DenseXRetrievalPack(documents)

response = dense_pack.run("YOUR_QUERY")
print(response)

Here is the source code of DenseXRetrievalPack:

class DenseXRetrievalPack(BaseLlamaPack):
    def __init__(
        self,
        documents: List[Document],
        proposition_llm: Optional[LLM] = None,
        query_llm: Optional[LLM] = None,
        embed_model: Optional[BaseEmbedding] = None,
        text_splitter: TextSplitter = SentenceSplitter(),
        similarity_top_k: int = 4,
    ) -> None:
        """Init params."""
        self._proposition_llm = proposition_llm or OpenAI(
            model="gpt-3.5-turbo",
            temperature=0.1,
            max_tokens=750,
        )
        embed_model = embed_model or OpenAIEmbedding(embed_batch_size=128)
        nodes = text_splitter.get_nodes_from_documents(documents)
        sub_nodes = self._gen_propositions(nodes)
        all_nodes = nodes + sub_nodes
        all_nodes_dict = {n.node_id: n for n in all_nodes}
        service_context = ServiceContext.from_defaults(
            llm=query_llm or OpenAI(),
            embed_model=embed_model,
            num_output=self._proposition_llm.metadata.num_output,
        )
        self.vector_index = VectorStoreIndex(
            all_nodes, service_context=service_context, show_progress=True
        )
        self.retriever = RecursiveRetriever(
            "vector",
            retriever_dict={
                "vector": self.vector_index.as_retriever(
                    similarity_top_k=similarity_top_k
                )
            },
            node_dict=all_nodes_dict,
        )
        self.query_engine = RetrieverQueryEngine.from_args(
            self.retriever, service_context=service_context
        )

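The flow is straightforward: text_splitter splits the documents into the original nodes, self._gen_propositions generates sub_nodes from them, nodes + sub_nodes are passed to VectorStoreIndex to build the index, and retrieval goes through RecursiveRetriever. The neat part is that retrieval matches against the sub_nodes (small chunks), while the generation stage receives the nodes they link to (large chunks).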
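
The small-to-big idea can be illustrated with a toy example that needs no LLM or vector store. Everything here is an assumption for illustration: the Node class, the Jaccard word-overlap score standing in for embedding similarity, and the function retrieve_parents are not LlamaIndex APIs.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    node_id: str
    text: str
    parent_id: Optional[str] = None  # set on propositions, None on original chunks


def jaccard(query: str, text: str) -> float:
    """Word-overlap score, a hypothetical stand-in for embedding similarity."""
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t) / len(q | t) if q | t else 0.0


def retrieve_parents(query: str, nodes: List[Node], top_k: int = 2) -> List[Node]:
    """Rank all nodes, then swap each proposition hit for its parent chunk."""
    by_id = {n.node_id: n for n in nodes}
    ranked = sorted(nodes, key=lambda n: jaccard(query, n.text), reverse=True)
    results = []
    for hit in ranked[:top_k]:
        target = by_id[hit.parent_id] if hit.parent_id else hit
        if target not in results:  # several propositions may share one parent
            results.append(target)
    return results


nodes = [
    Node("c1", "We argue that current techniques restrict pre-trained representations "
               "because standard language models are unidirectional."),
    Node("c2", "The masked language model randomly masks some of the tokens from the input."),
    Node("p1", "Standard language models are unidirectional.", parent_id="c1"),
    Node("p2", "The masked language model masks input tokens.", parent_id="c2"),
]
for node in retrieve_parents("standard language models are unidirectional", nodes, top_k=1):
    print(node.node_id, node.text)
```

The short proposition scores highest because it contains little besides the queried fact, but what is returned for generation is the larger parent chunk with its surrounding context.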

The test document is again the BERT paper. Debugging shows that sub_nodes[].text has been rewritten by the LLM and is no longer the original text:

> /Users/Florian/anaconda3/envs/llamaindex_010/lib/python3.11/site-packages/llama_index/packs/dense_x_retrieval/base.py(91)__init__()
     90 
---> 91         all_nodes = nodes + sub_nodes
     92         all_nodes_dict = {n.node_id: n for n in all_nodes}

ipdb> sub_nodes[20]
IndexNode(id_='ecf310c7-76c8-487a-99f3-f78b273e00d9', ..., text='Our paper demonstrates the importance of bidirectional pre-training for language representations.', ...)
ipdb> sub_nodes[21]
IndexNode(id_='4911332e-8e30-47d8-a5bc-ed7cbaa8e042', ..., text='Radford et al. (2018) uses unidirectional language models for pre-training.', ...)
ipdb> sub_nodes[22]
IndexNode(id_='83aa82f8-384a-4b06-92c8-d6277c4162bf', ..., text='BERT uses masked language models to enable pre-trained deep bidirectional representations.', ...)
ipdb> sub_nodes[23]
IndexNode(id_='2ac635c2-ccb0-4e62-88c7-bcbaef3ef38a', ..., text='Peters et al. (2018a) uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.', ...)
ipdb> sub_nodes[24]
IndexNode(id_='e37b17cf-30dd-4114-a3c5-9921b8cf0a77', ..., text='Pre-trained representations reduce the need for many hea vily-engineered task-specific architectures.', ...)

Figure 7 shows the relationship between sub_nodes and nodes: a small-to-big index structure.

This index structure is built by self._gen_propositions, whose code is as follows:

async def _aget_proposition(self, node: TextNode) -> List[TextNode]:
    """Get proposition."""
    inital_output = await self._proposition_llm.apredict(
        PROPOSITIONS_PROMPT, node_text=node.text
    )
    outputs = inital_output.split("\n")
    all_propositions = []
    for output in outputs:
        if not output.strip():
            continue
        if not output.strip().endswith("]"):
            if not output.strip().endswith('"') and not output.strip().endswith(","):
                output = output + '"'
            output = output + " ]"
        if not output.strip().startswith("["):
            if not output.strip().startswith('"'):
                output = '"' + output
            output = "[ " + output
        try:
            propositions = json.loads(output)
        except Exception:
            try:
                propositions = yaml.safe_load(output)
            except Exception:
                continue
        if not isinstance(propositions, list):
            continue
        all_propositions.extend(propositions)
    assert isinstance(all_propositions, list)
    nodes = [TextNode(text=prop) for prop in all_propositions if prop]
    return [IndexNode.from_text_node(n, node.node_id) for n in nodes]

def _gen_propositions(self, nodes: List[TextNode]) -> List[TextNode]:
    """Get propositions."""
    sub_nodes = asyncio.run(
        run_jobs(
            [self._aget_proposition(node) for node in nodes],
            show_progress=True,
            workers=8,
        )
    )
    return [node for sub_node in sub_nodes for node in sub_node]
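
The line-repair logic in _aget_proposition is easier to follow in isolation. Below is a standalone sketch that mirrors it with the standard library only; the function name parse_propositions is made up for this example, and the YAML fallback of the original is deliberately left out, so some lines the original would recover are dropped here.

```python
import json
from typing import List


def parse_propositions(raw_output: str) -> List[str]:
    """Parse LLM output line by line into a flat list of proposition strings."""
    propositions = []
    for line in raw_output.split("\n"):
        line = line.strip()
        if not line:
            continue
        # Close a list that was cut off, e.g. by the max_tokens limit.
        if not line.endswith("]"):
            if not line.endswith('"') and not line.endswith(","):
                line += '"'
            line += " ]"
        # Open a list whose start is on an earlier line.
        if not line.startswith("["):
            if not line.startswith('"'):
                line = '"' + line
            line = "[ " + line
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            continue  # the original tries yaml.safe_load here before giving up
        if isinstance(parsed, list):
            propositions.extend(p for p in parsed if p)
    return propositions


print(parse_propositions('[ "BERT is bidirectional.", "GPT is left-to-ri'))
```

A truncated final string is kept as a partial proposition rather than discarded, which is worth knowing when inspecting sub_nodes for incomplete sentences.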

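Note that the original paper used LLM-generated propositions as training data to fine-tune a text generation model. That fine-tuned model is now publicly available for readers to try.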

Summary: the LLM-based method

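Overall, building propositions with an LLM yields finer-grained chunks, which together with the original nodes form a small-to-big index structure and offer a new approach to semantic chunking. The cost is a dependence on the LLM, which is expensive.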

4. Conclusion

This article covered three kinds of semantic chunking methods, along with their principles and implementations.

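Semantic chunking is more elegant than rule-based chunking and is key to improving retrieval quality in a RAG system. Which method to choose depends on the use case and the trade-off between cost and quality.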