Building an Information Extraction Agent in Practice: Structured Data Extraction with xParse + LangChain

2026-06-12

Artificial Intelligence / Agent

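Spend enough time doing information extraction and you find out how unruly real business documents are. Invoices, medical bills, contracts, resumes, product manuals: the formats are all over the place and the key data is scattered everywhere, which makes it hard to collect neatly. This article walks through a hands-on approach: use `xParse` as the parsing layer and an Agent built with `LangChain` on top of it to turn unstructured content into structured JSON or CSV automatically.

### Scenario Overview

#### Business Pain Points

In real information extraction work, the problems are much the same everywhere:

* **Document formats vary widely:** Invoices arrive as PDFs, medical bills as images, contracts as Word files, and resumes as something else again. Just reading the content out of these files takes real effort.
* **Extraction is tedious:** Pulling invoice numbers, medical costs, contract clauses, personal details, product parameters, and API endpoints out of messy documents by hand is pure repetitive labor.
* **Data formats are hard to unify:** Different sources record the same information in very different ways. An invoice from Zhang San and one from Nike look nothing alike, and getting both into one database cleanly takes work.
* **Batch processing is a hard requirement:** Extracting from large volumes of documents one at a time is slow and error-prone.
* **Validation cannot be skipped:** Machine-extracted results need a person or a rule set to check accuracy and completeness; otherwise every downstream step is at risk.
* **Financial compliance and legal risk:** Amounts on invoices and medical bills must be exact, because they feed financial systems and tax filings. In contracts, missing a single clause, especially on liability for breach or dispute resolution, can be costly.

#### Solution

So we need an Agent that can do the following:

* **Parse every document automatically:** PDF, Word, Excel, or image, it recognizes whatever is dropped in.
* **Extract key information intelligently:** It understands the document content and pulls out invoice details, medical costs, contract clauses, resume details, product specifications, API endpoints, and similar structured data in full.
* **Produce standardized output:** However messy the original format, the result is normalized to a standard format such as JSON or CSV that downstream systems can consume directly.
* **Validate data:** It runs a preliminary check on the extraction results and flags values that look doubtful.
* **Process in batches:** It can handle an entire folder of documents in one run.
* **Automate finance and contract work:** Finance staff no longer key in invoices by hand, and legal staff can generate contract summaries and risk notes in one step.

### Architecture Design

The overall flow is shown below:

```
Document (PDF/Word/Excel/image)
    ↓
[xParse Pipeline - Parse]
    └─ Parse the document and extract structured elements
    ↓
Aggregate element text (elements[].text)
    ↓
[LangChain Agent]
    ├─ Tool 1: extract_invoice_info (extract invoice information)
    ├─ Tool 2: extract_medical_bill_info (extract medical bill information)
    ├─ Tool 3: extract_contract_info (extract contract information)
    ├─ Tool 4: extract_resume_info (extract resume information)
    ├─ Tool 5: extract_product_specs (extract product specifications)
    ├─ Tool 6: extract_api_info (extract API information)
    └─ Tool 7: format_data (format data)
    ↓
Structured data (JSON/CSV)
```

**The core idea is simple:**

1. Use `xParse` to take the document apart into meaningful text elements.
2. Join those elements into one complete document text.
3. Feed that text to the large language model and, with a carefully designed prompt, have it return the structured data we want in a single pass.

### Environment Setup

Setting up the environment takes only a few commands.

```bash
python -m venv .venv && source .venv/bin/activate
# dashscope is required by ChatTongyi
pip install "xparse-client>=0.2.5" langchain langchain-community langchain-core dashscope python-dotenv pandas

export XTI_APP_ID=your-app-id                    # obtained by registering on the TextIn website
export XTI_SECRET_CODE=your-secret-code          # obtained by registering on the TextIn website
export DASHSCOPE_API_KEY=your-dashscope-api-key  # this tutorial uses the Tongyi Qianwen (Qwen) model; other models work too
```

> **Note:** `XTI_APP_ID` and `XTI_SECRET_CODE` are issued from the TextIn console (https://www.textin.com/console/dashboard/setting). The examples use Tongyi Qianwen, but any other model can be swapped in the same way.

### Step 1: Configure the xParse Pipeline

For information extraction we only need the parsing capability of `xParse`; chunking and embedding are not required.

* **Parse configuration:** Use the default parse stage, but set the engine to TextIn, which recognizes tables and complex lists considerably better.
* **Table handling:** If the documents contain tables, make sure they are recognized in full. Tables are output as HTML by default.

```python
import os

from dotenv import load_dotenv
from xparse_client import create_pipeline_from_config

load_dotenv()

EXTRACTION_PIPELINE_CONFIG = {
    "source": {
        "type": "local",
        "directory": "./extraction_documents",
        "pattern": ["*.pdf", "*.docx", "*.xlsx", "*.xls", "*.png", "*.jpg"],
    },
    "destination": {
        "type": "local",  # store parse results on local disk
        "output_dir": "./extraction_results",
    },
    "api_base_url": "https://api.textin.com/api/xparse",
    "api_headers": {
        "x-ti-app-id": os.getenv("XTI_APP_ID"),
        "x-ti-secret-code": os.getenv("XTI_SECRET_CODE"),
    },
    "stages": [
        {
            "type": "parse",
            "config": {
                "provider": "textin"  # TextIn parsing engine; handles tables and lists well
            },
        }
    ],
}
```

Next, we add a global store for the parsed document text, because every Tool in the Agent needs access to the document currently being processed.

```python
import json
import os

from langchain_community.chat_models import ChatTongyi
from langchain_core.tools import Tool

# Global store for document text (a real application would use something more persistent)
_document_texts = {}

# Initialize the qwen-max model
llm = ChatTongyi(
    model="qwen-max",
    dashscope_api_key=os.getenv("DASHSCOPE_API_KEY"),
    temperature=0,  # low temperature for more deterministic output
)


def set_document_text(file_path: str, text: str):
    """Store the text content of a document."""
    _document_texts[file_path] = text


def get_document_text(file_path: str = None) -> str:
    """Return the text content of a document."""
    if not _document_texts:
        return ""  # no document has been loaded yet
    if file_path not in ("None", "none", None, "", "null"):
        return _document_texts.get(file_path, "")
    # No file specified: return the text of the first loaded document
    return next(iter(_document_texts.values()), "")
```

### Step 2: Build the LangChain Tools

This is the Agent's toolbox. Each Tool corresponds to one extraction task. The Tool code is long but structurally similar, so we look at the invoice Tool first; the others reuse the core logic through a generic `_extract_with_llm` method. All of the Tools below rely on the `llm` object and the `set_document_text` / `get_document_text` helpers defined at the end of Step 1.

#### Tool 1: Extract Invoice Information

This Tool handles invoices. The prompt is deliberately detailed: it tells the model exactly which fields to extract and spells out the JSON output format.

```python
from langchain_core.messages import HumanMessage


def extract_invoice_info(file_path: str = None) -> str:
    """
    Extract structured information from an invoice (using the qwen-max model).

    Extracted content:
    - Basic invoice information (invoice code, invoice number, issue date)
    - Seller information (name, taxpayer ID, address and phone, bank and account)
    - Buyer information (name, taxpayer ID, address and phone, bank and account)
    - Line items (name, specification, unit, quantity, unit price, amount, tax rate, tax amount)
    - Amounts (total amount, total tax, total including tax)
    - Other information (remarks, payee, reviewer, drawer, etc.)

    Args:
        file_path: document path (optional); if omitted, the currently loaded document is used
    """
    context_text = get_document_text(file_path)
    if not context_text:
        return "Error: no document content found. Load a document with load_document() first, or provide a document path."

    prompt = f"""Extract structured information from the invoice text below and return it as JSON.

Information to extract:
1. Basic invoice information: invoice_code, invoice_number, date (issue date)
2. Seller information: name, tax_id (taxpayer ID), address (address and phone), bank_account (bank and account)
3. Buyer information: name, tax_id (taxpayer ID), address (address and phone), bank_account (bank and account)
4. Line items (array): name, specification, unit, quantity, unit_price, amount, tax_rate, tax_amount
5. Amounts: total_amount, tax_amount (total tax), total_with_tax (total including tax)
6. Other information: remark, payee, reviewer, drawer

Return exactly the following JSON structure. If a field is missing, use an empty string "", an empty object {{}} or an empty array []:
{{
  "invoice_info": {{"invoice_code": "", "invoice_number": "", "date": ""}},
  "seller": {{"name": "", "tax_id": "", "address": "", "bank_account": ""}},
  "buyer": {{"name": "", "tax_id": "", "address": "", "bank_account": ""}},
  "items": [{{"name": "", "specification": "", "unit": "", "quantity": "", "unit_price": "", "amount": "", "tax_rate": "", "tax_amount": ""}}],
  "amounts": {{"total_amount": "", "tax_amount": "", "total_with_tax": ""}},
  "other_info": {{"remark": "", "payee": "", "reviewer": "", "drawer": ""}}
}}

Invoice text:
{context_text}

Return JSON only, with no explanation or commentary."""

    # Call the model to extract the information
    response_text = ""
    try:
        response = llm.invoke([HumanMessage(content=prompt)])
        response_text = response.content.strip() if response.content else ""

        # Pull the JSON out of the reply (it may be wrapped in a markdown code block)
        if "```json" in response_text:
            json_start = response_text.find("```json") + 7
            json_end = response_text.find("```", json_start)
            if json_end != -1:
                response_text = response_text[json_start:json_end].strip()
        elif "```" in response_text:
            json_start = response_text.find("```") + 3
            json_end = response_text.find("```", json_start)
            if json_end != -1:
                response_text = response_text[json_start:json_end].strip()

        invoice_data = json.loads(response_text)
        return json.dumps(invoice_data, ensure_ascii=False, indent=2)
    except json.JSONDecodeError as e:
        error_msg = f"JSON parsing failed: {str(e)}\nRaw model output:\n{response_text}"
        print(error_msg)
        return json.dumps({
            "error": "JSON parsing failed",
            "raw_response": response_text if response_text else "no response",
            "error_detail": str(e),
        }, ensure_ascii=False, indent=2)
    except Exception as e:
        error_msg = f"Error while extracting information: {str(e)}"
        print(error_msg)
        if response_text:
            print(f"Raw model output:\n{response_text}")
        return json.dumps({
            "error": "Information extraction failed",
            "error_detail": str(e),
            "raw_response": response_text if response_text else "no response",
        }, ensure_ascii=False, indent=2)
```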
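The invoice tool above strips Markdown fences inline before calling `json.loads`. As a minimal sketch, that step could be isolated so it can be unit-tested without calling a model. The names `strip_code_fence` and `parse_llm_json` are illustrative and not part of xParse or LangChain, and the regular expression is a swapped-in alternative to the `find`-based slicing used in the tool.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the listing readable

# Matches the first fenced block, with or without a "json" language tag.
_FENCED_BLOCK = re.compile(
    FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE
)


def strip_code_fence(response_text: str) -> str:
    """Return the body of the first fenced block, or the trimmed text if there is none."""
    match = _FENCED_BLOCK.search(response_text)
    return match.group(1).strip() if match else response_text.strip()


def parse_llm_json(response_text: str) -> dict:
    """Parse a model reply as JSON; on failure return an error dict instead of raising."""
    cleaned = strip_code_fence(response_text)
    try:
        return {"ok": True, "data": json.loads(cleaned)}
    except json.JSONDecodeError as exc:
        return {
            "ok": False,
            "error": "JSON parsing failed",
            "error_detail": str(exc),
            "raw_response": response_text,
        }


if __name__ == "__main__":
    reply = "Here you go:\n" + FENCE + 'json\n{"invoice_number": "12345678"}\n' + FENCE
    print(parse_llm_json(reply))
```

Returning a dict with an `ok` flag rather than raising keeps the caller's control flow flat, which is convenient when the result is handed straight back to an Agent as a tool response.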
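#### Tool 2-6: Other Extraction Tools

The Tools for medical bills, contracts, resumes, product specifications, and API interfaces follow exactly the same logic as the invoice Tool. Only the prompt template differs, steering the model toward a different output schema. To keep the code from ballooning, the shared logic can be abstracted into a generic `_extract_with_llm` method, which appears in the final `InformationExtractionAgent` class.

#### Tool 7: Data Formatting

This Tool is the fallback. If the other Tools miss something, or the user wants a custom extraction, it prompts the model to pull every key-value pair out of the raw text and then formats the result as JSON and CSV.

```python
import pandas as pd
from langchain_core.messages import HumanMessage


def format_data(file_path: str = None) -> str:
    """Format extracted data into standard formats (JSON, CSV, etc.)."""
    context_text = get_document_text(file_path)
    if not context_text:
        return "Error: no document content found. Load a document with load_document() first, or provide a document path."

    prompt = f"""Extract all key-value pairs from the text below and return them as JSON.

Requirements:
1. Extract every pair of the form "key：value" or "key:value"
2. Return an array in which each element contains key, value, and source (source file name)

Text:
{context_text}

Return a JSON array in the following format:
[
  {{
    "key": "key name",
    "value": "value",
    "source": "file name"
  }}
]

Return JSON only, with no explanation or commentary."""

    try:
        response = llm.invoke([HumanMessage(content=prompt)])
        response_text = response.content.strip() if response.content else ""

        # Pull the JSON out of the reply
        if "```json" in response_text:
            json_start = response_text.find("```json") + 7
            json_end = response_text.find("```", json_start)
            if json_end != -1:
                response_text = response_text[json_start:json_end].strip()
        elif "```" in response_text:
            json_start = response_text.find("```") + 3
            json_end = response_text.find("```", json_start)
            if json_end != -1:
                response_text = response_text[json_start:json_end].strip()

        data_list = json.loads(response_text)
        if not data_list:
            return "No data found to format"

        json_output = json.dumps(data_list, ensure_ascii=False, indent=2)
        try:
            df = pd.DataFrame(data_list)
            csv_output = df.to_csv(index=False)
        except Exception:
            csv_output = "CSV formatting failed"
        return f"JSON:\n{json_output}\n\nCSV:\n{csv_output}"
    except Exception as e:
        return f"Data formatting failed: {str(e)}"
```

### Assemble the Tools

With the tool functions in place, wrap them as LangChain `Tool` objects so they can be managed together. The list assumes that `extract_medical_bill_info` and the other Tool 2-6 functions are defined in the same way as Tool 1.

```python
tools = [
    Tool(
        name="extract_invoice_info",
        description="Extract structured information from an invoice, including basic invoice information, seller/buyer information, line items, and amounts. If a document is already loaded, call it without arguments; otherwise provide the document path.",
        func=extract_invoice_info,
    ),
    Tool(
        name="extract_medical_bill_info",
        description="Extract structured information from a medical bill, including patient information, medical institution information, visit information, itemized charges, and cost summary. If a document is already loaded, call it without arguments; otherwise provide the document path.",
        func=extract_medical_bill_info,
    ),
    Tool(
        name="extract_contract_info",
        description="Extract structured information from a contract, including basic contract information, party information, subject matter, key clauses, and amounts. If a document is already loaded, call it without arguments; otherwise provide the document path.",
        func=extract_contract_info,
    ),
    Tool(
        name="extract_resume_info",
        description="Extract structured information from a resume, including personal information, education, work experience, and skills. If a document is already loaded, call it without arguments; otherwise provide the document path.",
        func=extract_resume_info,
    ),
    Tool(
        name="extract_product_specs",
        description="Extract product specifications and technical parameters from a product document, including product name, model, technical parameters, features, and price. If a document is already loaded, call it without arguments; otherwise provide the document path.",
        func=extract_product_specs,
    ),
    Tool(
        name="extract_api_info",
        description="Extract API interface information from technical documentation, including API endpoints, request methods, request parameters, and response formats. If a document is already loaded, call it without arguments; otherwise provide the document path.",
        func=extract_api_info,
    ),
    Tool(
        name="format_data",
        description="Format extracted data into standard formats (JSON, CSV, etc.). If a document is already loaded, call it without arguments; otherwise provide the document path.",
        func=format_data,
    ),
]
```

### Step 3: Configure the LangChain Agent

The heart of the Agent is the `system_prompt`, which defines how it should behave. We tell it that it is a professional information extraction assistant, that it should prefer its tools, return structured results, and take extra care with financial and contract documents.

```python
import os

from langchain.agents import create_agent
from langchain_community.chat_models import ChatTongyi

llm = ChatTongyi(
    model="qwen-max",
    dashscope_api_key=os.getenv("DASHSCOPE_API_KEY"),
    temperature=0.2,  # low temperature for more deterministic output
)

agent = create_agent(
    model=llm,
    tools=tools,
    debug=True,
    system_prompt="""You are a professional information extraction assistant. Your job is to help users:
1. Extract structured information from documents (invoices, medical bills, contracts, resumes, product specifications, API interfaces, and so on)
2. Format the extracted information into standard formats (JSON, CSV, and so on)
3. Verify the completeness and accuracy of the extracted data

When answering:
- Provide structured extraction results
- Present data as JSON or tables
- If data is incomplete, state what is missing
- Use the tools to obtain accurate information; do not guess
- For financial documents (invoices, medical bills), make sure amounts and tax information are accurate
- For contracts, focus on key clauses and risk points
""",
)
```

### Step 4: Complete Example (Wrapped in a Class)

In a real project it is better to wrap the scattered code above in a class. It is easier to manage and to reuse.

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
import os
from copy import deepcopy

import pandas as pd
from dotenv import load_dotenv
from langchain.agents import create_agent
from langchain_community.chat_models import ChatTongyi
from langchain_core.messages import HumanMessage
from langchain_core.tools import Tool
from xparse_client import create_pipeline_from_config

load_dotenv()


class InformationExtractionAgent:
    """Information extraction Agent."""

    def __init__(self):
        # Create the text store first so every later setup step can rely on it
        self._document_texts = {}
        self.setup_pipeline()
        self.setup_llm()
        self.setup_agent()

    def setup_pipeline(self):
        self.pipeline_config = {
            "source": {
                "type": "local",
                "directory": "./extraction_documents",
                "pattern": ["*.pdf", "*.docx", "*.xlsx", "*.xls", "*.png", "*.jpg"],
            },
            "destination": {
                "type": "local",
                "output_dir": "./extraction_results",
            },
            "api_base_url": "https://api.textin.com/api/xparse",
            "api_headers": {
                "x-ti-app-id": os.getenv("XTI_APP_ID"),
                "x-ti-secret-code": os.getenv("XTI_SECRET_CODE"),
            },
            "stages": [
                {"type": "parse", "config": {"provider": "textin"}}
            ],
        }

    def setup_llm(self):
        self.llm = ChatTongyi(
            model="qwen-max",
            dashscope_api_key=os.getenv("DASHSCOPE_API_KEY"),
            temperature=0,
        )

    def parse_document(self, file_path: str) -> list:
        # Point a copy of the pipeline config at this single file
        file_dir = os.path.dirname(os.path.abspath(file_path))
        file_name_pattern = os.path.basename(file_path)
        temp_config = deepcopy(self.pipeline_config)
        temp_config["source"] = {
            "type": "local",
            "directory": file_dir,
            "pattern": [file_name_pattern],
        }
        pipeline = create_pipeline_from_config(temp_config)
        pipeline.run()

        # Read the parse result back from the output directory
        output_dir = self.pipeline_config["destination"]["output_dir"]
        os.makedirs(output_dir, exist_ok=True)
        file_name = os.path.splitext(os.path.basename(file_path))[0]
        result_file = os.path.join(output_dir, f"{file_name}.json")
        if not os.path.exists(result_file):
            raise FileNotFoundError(
                f"Parse result file not found: {result_file}\n"
                f"Check the output directory: {output_dir}\n"
                f"Original file path: {file_path}"
            )
        with open(result_file, "r", encoding="utf-8") as f:
            elements = json.load(f)
        return elements

    def aggregate_text_from_elements(self, elements: list) -> str:
        texts = []
        for element in elements:
            if isinstance(element, dict):
                text = element.get("text", "")
            else:
                text = getattr(element, "text", "")
            if text and text.strip():
                texts.append(text.strip())
        return "\n\n".join(texts)

    def load_document(self, file_path: str):
        print(f"Parsing document: {file_path}")
        elements = self.parse_document(file_path)
        text = self.aggregate_text_from_elements(elements)
        self._document_texts[file_path] = text
        print(f"Document parsed, text length: {len(text)} characters")

    def get_document_text(self, file_path: str = None) -> str:
        if not self._document_texts:
            return ""
        if file_path not in ("None", "none", None, "", "null"):
            return self._document_texts.get(file_path, "")
        return next(iter(self._document_texts.values()), "")

    def setup_agent(self):
        # Note: each Tool's func is bound to an instance method
        tools = [
            Tool(
                name="extract_invoice_info",
                description="Extract structured information from an invoice, ...",
                func=self.extract_invoice_info,
            ),
            # ... the other Tools are registered the same way
            Tool(
                name="format_data",
                description="Format extracted data into standard formats...",
                func=self.format_data,
            ),
        ]
        self.agent = create_agent(
            model=self.llm,
            tools=tools,
            debug=True,
            system_prompt="You are a professional information extraction assistant. Your job is to help users: ...",
        )

    def extract_invoice_info(self, file_path: str = None) -> str:
        context_text = self.get_document_text(file_path)
        if not context_text:
            return "Error: no document content found. Load a document with load_document() first, or provide a document path."
        prompt = f"""Extract structured information from the invoice text below and return it as JSON.
... (same as the invoice prompt shown earlier)"""
        return self._extract_with_llm(prompt)

    def _extract_with_llm(self, prompt: str) -> str:
        """Generic model-based extraction method."""
        response_text = ""
        try:
            response = self.llm.invoke([HumanMessage(content=prompt)])
            response_text = response.content.strip() if response.content else ""

            # Handle markdown code blocks
            if "```json" in response_text:
                json_start = response_text.find("```json") + 7
                json_end = response_text.find("```", json_start)
                if json_end != -1:
                    response_text = response_text[json_start:json_end].strip()
            elif "```" in response_text:
                json_start = response_text.find("```") + 3
                json_end = response_text.find("```", json_start)
                if json_end != -1:
                    response_text = response_text[json_start:json_end].strip()

            data = json.loads(response_text)
            return json.dumps(data, ensure_ascii=False, indent=2)
        except json.JSONDecodeError as e:
            return json.dumps({
                "error": "JSON parsing failed",
                "raw_response": response_text if response_text else "no response",
                "error_detail": str(e),
            }, ensure_ascii=False, indent=2)
        except Exception as e:
            return json.dumps({
                "error": "Information extraction failed",
                "error_detail": str(e),
                "raw_response": response_text if response_text else "no response",
            }, ensure_ascii=False, indent=2)

    # The other extraction methods (extract_medical_bill_info, extract_contract_info, etc.)
    # are similar and all call _extract_with_llm. Their prompts are omitted here, so the
    # stubs raise until they are filled in.
    def extract_resume_info(self, file_path: str = None) -> str:
        # ... similar
        raise NotImplementedError

    def extract_product_specs(self, file_path: str = None) -> str:
        # ... similar
        raise NotImplementedError

    def extract_api_info(self, file_path: str = None) -> str:
        # ... similar
        raise NotImplementedError

    def extract_contract_info(self, file_path: str = None) -> str:
        # ... similar
        raise NotImplementedError

    def extract_medical_bill_info(self, file_path: str = None) -> str:
        # ... similar
        raise NotImplementedError

    def format_data(self, file_path: str = None) -> str:
        context_text = self.get_document_text(file_path)
        if not context_text:
            return "Error: no document content found. Load a document with load_document() first, or provide a document path."
        prompt = f"""Extract all key-value pairs from the text below and return them as JSON.
... (same as the format_data prompt shown earlier)"""
        try:
            result_json = self._extract_with_llm(prompt)
            data_list = json.loads(result_json)
            if isinstance(data_list, dict) and "error" in data_list:
                return result_json
            if not data_list:
                return "No data found to format"
            try:
                df = pd.DataFrame(data_list)
                csv_output = df.to_csv(index=False)
                return f"JSON:\n{result_json}\n\nCSV:\n{csv_output}"
            except Exception as e:
                return f"JSON:\n{result_json}\n\nCSV formatting failed: {str(e)}"
        except Exception as e:
            return f"Data formatting failed: {str(e)}"

    def query(self, question: str) -> str:
        response = self.agent.invoke({
            "messages": [HumanMessage(content=question)]
        })
        return response["messages"][-1].content


def main():
    agent = InformationExtractionAgent()
    document_path = "./extraction_documents/invoice.pdf"
    if os.path.exists(document_path):
        agent.load_document(document_path)

    questions = [
        "Extract the invoice code, invoice number, seller and buyer information, line items, and amounts from the invoice",
        "Format the extracted data as JSON",
    ]
    for question in questions:
        print(f"\n{'=' * 60}")
        print(f"Question: {question}")
        print(f"{'=' * 60}")
        answer = agent.query(question)
        print(f"\nAnswer:\n{answer}")


if __name__ == "__main__":
    main()
```

### Usage Examples

Once everything is wrapped in a class, calling it is straightforward.

```python
agent = InformationExtractionAgent()

# 1. Load the document
agent.load_document("./extraction_documents/invoice.pdf")

# 2. Extract information
response = agent.query("Extract the invoice code, invoice number, seller and buyer information, line items, and amounts from the invoice")
print(response)
```

Medical bills, contracts, resumes, product specifications, and API interfaces work the same way; only the document path and the question change.

```python
# Medical bill
agent.load_document("./extraction_documents/medical_bill.pdf")
response = agent.query("Extract the patient name, hospital name, diagnosis, total cost, and amount paid by medical insurance from the medical bill")

# Contract
agent.load_document("./extraction_documents/contract.pdf")
response = agent.query("Extract the contract number, Party A and Party B information, contract amount, payment terms, and liability for breach from the contract")

# Resume
agent.load_document("./extraction_documents/resume.pdf")
response = agent.query("Extract the name, contact details, education, and work experience from the resume")

# Product specification
agent.load_document("./extraction_documents/product_spec.pdf")
response = agent.query("Extract the product name, model, technical parameters, and price from the product document")

# API interface
agent.load_document("./extraction_documents/api_docs.pdf")
response = agent.query("Extract all API endpoints, request methods, and parameters from the technical documentation")
```

Note that `get_document_text` falls back to the first loaded document when no path is given, so after loading several documents in one session the tools need the file path to reach the later ones.

### Best Practices

With the solution in place, a few details are worth tuning before going to production:

1. **The parsing engine matters most:** The `TextIn` engine recognizes tables and lists well, so prefer it for complex documents.
2. **Set up batch processing:** Walking a folder with `os.listdir` and a simple `for` loop is enough to process documents in bulk.
3. **Standardize the format:** Settle on one output format (JSON or CSV); otherwise integration with downstream systems becomes painful.
4. **Take extra care with financial documents:**
    * For invoices, the invoice code, invoice number, and amounts are non-negotiable; none of them may be wrong.
    * Medical bills make it easy to mix up cost categories, so have the model distinguish clearly between "self-paid" and "paid by medical insurance".
    * Computed amounts must be validated, ideally reconciled against the finance system.
5. **Focus on what matters in contracts:**
    * Party A and Party B information, amounts, payment terms, liability for breach, and dispute resolution clauses are the core and must be extracted.
    * Extract the validity period as well, so contract expiry reminders can be set up later.
6. **A good prompt makes a good result:** Tune the prompt for each document type and list the fields and format requirements explicitly; this improves accuracy noticeably.
7. **Do not skip error handling:** The model will occasionally return something that is not JSON. The error handling in `_extract_with_llm` is essential, and outright failures should be logged and queued for manual review.
8. **Plan storage for the long term:** The in-memory dictionary in the example is fine for experimenting. Production needs a database or Redis, or all data is lost when the service restarts.
9. **Mind performance:** For very long documents, such as a contract of several hundred pages, process by page or by chapter rather than sending everything to the model at once.

### FAQ

**Q: What if extraction accuracy is not high enough?**

* **A:** Check these in order: 1) whether the source document is legible, since OCR accuracy is the foundation; 2) whether the prompt is detailed enough, with fields and format stated explicitly; 3) the model understands semantics but has limited background knowledge, so for invoices or contracts in unusual formats, include an example in the prompt.

**Q: How should documents with inconsistent formats be handled?**

* **A:** This is the normal case. Add a preprocessing step that converts all documents to PDF, then handle each scenario by adjusting its prompt.

**Q: How can large numbers of documents be processed in bulk?**

* **A:** Write a script that walks the folder and calls `load_document()` on each file. Multithreading or async can speed this up, but watch the API's concurrency limits.

**Q: The invoice extraction result is wrong. What now?**

* **A:** First confirm that the text in the `elements` returned by `xParse` is complete. If OCR went wrong, nothing downstream can fix it. Then check that the output format in the prompt matches the actual content of the invoice.

**Q: How are itemized charges on a medical bill extracted?**

* **A:** `xParse` is strong at table recognition, so rely on that first. Then state in the prompt that the fields "item name, quantity, unit price, amount, insurance type" must be extracted from the itemized charges table.

**Q: How are key contract clauses identified?**

* **A:** Put keywords such as "liability for breach", "dispute resolution", and "confidentiality" in the prompt, have the model locate them in the source text, and copy the clauses out in full.

**Q: What if the document is too long for the model?**

* **A:** Process it in segments. For example, send every 10 pages to the model as one chunk and merge the results of all chunks at the end. Alternatively, extract only the chapters the user cares about.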
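One way to realize the "process in segments, then merge" advice for long documents is sketched below, under stated assumptions. The article suggests splitting by pages; since `aggregate_text_from_elements` joins elements with blank lines, this sketch splits on those paragraph boundaries with a character budget instead. `split_into_chunks`, `merge_results`, the 6000-character default, and the sample keys are all hypothetical.

```python
from typing import Dict, List


def split_into_chunks(text: str, max_chars: int = 6000) -> List[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def merge_results(partials: List[Dict]) -> Dict:
    """Merge per-chunk dicts: lists are concatenated; otherwise the first non-empty value wins."""
    merged: Dict = {}
    for partial in partials:
        for key, value in partial.items():
            if isinstance(value, list):
                existing = merged.get(key)
                merged[key] = (existing if isinstance(existing, list) else []) + value
            elif key not in merged or merged[key] in ("", None, {}):
                merged[key] = value
    return merged


if __name__ == "__main__":
    sample = "Clause 1 ...\n\nClause 2 ...\n\nClause 3 ..."
    for index, chunk in enumerate(split_into_chunks(sample, max_chars=30)):
        print(index, repr(chunk))
```

Each chunk would be sent through the same extraction prompt, and the parsed results passed to `merge_results`. The "first non-empty value wins" rule is a simplification: conflicting values across chunks are silently dropped, so fields such as amounts deserve a separate consistency check.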

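Best practice 4 asks for amounts to be validated. A rough sketch of such a check follows, written against the JSON schema from the `extract_invoice_info` prompt (`items[].amount`, `amounts.total_amount`, `amounts.tax_amount`, `amounts.total_with_tax`). The helper names and the 0.01 tolerance are assumptions, and a check like this complements reconciliation against the finance system rather than replacing it.

```python
from decimal import Decimal, InvalidOperation
from typing import List, Optional


def to_decimal(raw) -> Optional[Decimal]:
    """Convert strings such as '¥1,130.00' to Decimal; return None if not numeric."""
    cleaned = str(raw).replace(",", "").replace("¥", "").replace("￥", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def check_invoice_amounts(invoice: dict, tolerance: Decimal = Decimal("0.01")) -> List[str]:
    """Return a list of warnings; an empty list means the totals are consistent."""
    warnings: List[str] = []
    amounts = invoice.get("amounts", {})
    total = to_decimal(amounts.get("total_amount", ""))
    tax = to_decimal(amounts.get("tax_amount", ""))
    total_with_tax = to_decimal(amounts.get("total_with_tax", ""))

    # Check 1: total + tax should equal the tax-inclusive total
    if None in (total, tax, total_with_tax):
        warnings.append("missing or non-numeric value in amounts")
    elif abs(total + tax - total_with_tax) > tolerance:
        warnings.append("total_amount + tax_amount does not equal total_with_tax")

    # Check 2: line item amounts should add up to the total
    item_amounts = [to_decimal(item.get("amount", "")) for item in invoice.get("items", [])]
    if None in item_amounts:
        warnings.append("missing or non-numeric amount in items")
    elif item_amounts and total is not None and abs(sum(item_amounts) - total) > tolerance:
        warnings.append("sum of item amounts does not equal total_amount")
    return warnings


if __name__ == "__main__":
    extracted = {
        "items": [{"amount": "600.00"}, {"amount": "400.00"}],
        "amounts": {"total_amount": "1000.00", "tax_amount": "130.00", "total_with_tax": "1,200.00"},
    }
    print(check_invoice_amounts(extracted))
```

`Decimal` is used instead of `float` so that currency values compare exactly. Any non-empty warning list is a natural trigger for the manual review step mentioned under error handling.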