提取标准 OCR 遗漏的图表数据:Elastic Agent Builder 和 LlamaParse 在一个管道中
作者来自 Elastic Jeffrey Rengifo构建一个端到端流水线该流水线从复杂 PDF文档 中提取结构化数据包括图表中的数值并将其导入 Elasticsearch 支持使用 ES|QL 的 agent 查询就绪。Agent Builder 现在已正式发布。通过注册 Elastic Cloud Trial 开始使用并查看 Agent Builder 文档 这里。企业级文档很难处理。包含多栏表格、扫描页面、混合布局以及嵌入图表的 PDF 会让简单的文本提取失效。标准处理流水线会遗漏这些文档中隐藏的结构化数据这意味着你的 agent 在基于不完整上下文进行推理。大多数 agent 框架使用基础 OCR 工具处理 PDF它们在纯文本文档中表现尚可但在更复杂场景下表现不足。为了解决这个问题你需要一个系统能够从这些文档中准确提取结构化数据并自动将其索引以便 agent 使用。Elastic Agent Builder 和 LlamaParse Extract API 的组合覆盖了数据处理与推理两个环节。LlamaParse Extract 负责文档复杂性处理它将 schema 定义的提取模型应用到原始 PDF 上并返回结构化 JSON。Elastic Agent Builder 负责编排通过 Elastic Workflows 调用 LlamaParse Extract API将结果写入 Elasticsearch并让 agent 能够通过 Elasticsearch Query Language (ES|QL) 或语义查询对这些数据进行推理。本文将讲解完整实现流程从 schema 定义到可运行的 agent。完整代码可在配套 notebook 中查看 companion notebook。何时应该使用 LlamaParse Extract 而不是 LlamaParse当你需要返回结构化的 JSON 具体字段时应使用 LlamaParse Extract。当你需要完整文档内容用于 RAG pipeline 时应使用 LlamaParse。LlamaCloud 提供两种文档处理工具。LlamaParse 将文档转换为 LLM-ready 格式如 Markdown保留完整文档布局。它用于 RAGRetrieval Augmented Generationpipeline当你希望将整个文档内容送入向量库或 LLM 上下文窗口时使用。LlamaParse Extract 则相反它基于开发者定义的 schema只返回你指定的字段并以校验过的 JSON 形式输出。它还可以提取图表、图形以及文档中的视觉元素数据这是标准文本解析完全无法覆盖的能力。LlamaParse / LlamaParse Extract输出格式Markdown完整文档结构化 JSON基于 schema 定义的字段最适合场景RAG pipelinesLLM 上下文窗口搜索索引、数据库、类型化字段是否支持图表/图形否是输入PDFPDF 开发者定义的 schema对于这个用例来说Extract 是更好的选择。我们需要具体的数值GDP 占比百分比、投资增长率以及叙述性总结并将其作为类型化的 Elasticsearch 字段进行索引。schema 会告诉提取模型具体要寻找什么包括那些只存在于 PDF 可视化图表中的数值。如果你的用例是将完整文档内容输入到 RAG pipeline那么 LlamaParse 是正确工具。如果你需要用于搜索索引或数据库的结构化记录尤其是当数据存在于图表或表格中时应使用 Extract。LlamaParse Extract 和 Elastic Agent Builder 流水线做了什么这个流水线的工作方式如下用户发送一个问题并附带一个 PDF 的 URL。agent 会调用一个 workflow 工具将 PDF 上传到 LlamaParse通过 LlamaParse Extract API 执行基于 schema 的结构化提取然后把结构化结果索引到 Elasticsearch 中。然后 agent 会使用 ES|QL 对该索引进行查询以回答这个问题。为了演示这一点我们使用 世界银行全球经济展望2026年1月 报告作为源文档。该 agent 将能够直接从该 PDF 中回答关于前沿市场、经济指标以及政策建议的问题。如何为 LlamaParse Extract 构建一个 Elastic Workflows 工具Workflow tool 在 Elastic Agent Builder 中作为基于 YAML 定义的自动化运行。当被触发时它会执行六个步骤。关于将 Agent Builder 和 Workflows 连接起来 的通用流程该文章有更详细的搭建说明。这里我们重点讲 LlamaParse Extract 的特定集成方式。上传 PDF从提供的 URL 将 PDF 上传到 LlamaCloud。该 workflow 接收pdf_url作为输入你的 LlamaCloud API key、project ID 和 configuration ID 作为常量进行配置。创建提取任务使用已保存的配置在已上传文件上启动 Extract v2 任务。等待暂停 15 秒给 LlamaParse Extract 留出处理文档的时间。轮询直到完成在while循环中每 10 秒轮询一次提取状态直到状态变为COMPLETED最多循环 19 次约 3 分钟上限。索引将 PDF 中所有提取字段作为单个 Elasticsearch 文档写入。验证使用document_id以及查询 term 查询 在 Elasticsearch 中执行搜索以确认文档已成功索引。前置条件Elasticsearch 9.3并启用 WorkflowsLlamaParse Cloud 账号并具备 API key API keyPython 3.1x使用 Elastic Agent Builder 实现 LlamaParse Extract分步说明下面的代码展示了实现的关键步骤。完整版本包含所有细节请参考配套 notebook companion notebook。配置环境import os import json from dotenv import load_dotenv load_dotenv() ELASTICSEARCH_URL os.getenv(ELASTICSEARCH_URL) ELASTICSEARCH_API_KEY os.getenv(ELASTICSEARCH_API_KEY) LLAMA_CLOUD_API_KEY os.getenv(LLAMA_CLOUD_API_KEY) KIBANA_URL os.getenv(KIBANA_URL)定义提取 schemaLlamaParse Extract 是基于 schema 驱动的。你需要定义一个 Pydantic 模型用来描述希望提取的字段而 LlamaParse Extract 会基于该模型从原始 PDF 中引导提取。这正是它在复杂文档中保持可靠性的关键。与其 “期望” LLM 找到正确的值不如明确告诉它需要找什么。如上所述agent 需要处理两类问题结构化问题“2025 年前沿市场 GDP 占比是多少” 这类问题需要可用 ES|QL 过滤的数值字段。探索性问题“前沿市场的主要风险是什么” 这类问题需要可进行语义检索的叙述性文本字段。这个 schema 只捕捉支持这两类问题所需的最小信息集合。description字段不是文档说明而是对提取模型的指令因此越具体提取质量越高。其中两个字段frontier_market_gdp_share_pct和frontier_market_investment_growth_2020s_pct直接来自报告中的柱状图图 A 和图 C这些数据是标准文本提取工具无法捕获的。class EconomicReportSummary(BaseModel): report_title: str Field(descriptionFull title of the report) publication_date: str Field( descriptionPublication date in YYYY-MM format, e.g. 2026-01 ) frontier_market_gdp_share_pct: float Field( descriptionFrontier markets share of global GDP in 2025 as a percentage, extracted from Figure ES.A bar chart ) frontier_market_investment_growth_2020s_pct: float Field( descriptionAverage annual per capita investment growth for frontier markets in the early 2020s (2020-24), extracted from Figure ES.C bar chart, as a percentage ) executive_summary: str Field( descriptionConcise summary of the reports main findings and conclusions ) key_vulnerabilities: str Field( descriptionMain vulnerabilities and risks facing frontier markets, as a paragraph ) policy_recommendations: str Field( descriptionKey policy recommendations for frontier market policymakers, as a paragraph )创建 Extract 配置LlamaExtract v2 用 “已保存配置saved configurations” 替代了旧的 extraction agents本质上是一个可复用的参数集合用来将你的 schema 与 extraction tier 绑定在一起。首先获取你的 project ID然后通过POST创建配置。保存打印出来的两个 ID因为你会在 workflow YAML 中用到它们。import requests LLAMA_CLOUD_BASE_URL https://api.cloud.llamaindex.ai llama_headers { Authorization: fBearer {LLAMA_CLOUD_API_KEY}, Content-Type: application/json, } # Fetch the project ID (uses the first available project) projects_response requests.get( f{LLAMA_CLOUD_BASE_URL}/api/v1/projects, headersllama_headers, ) projects_response.raise_for_status() PROJECT_ID projects_response.json()[0][id] print(fProject ID: {PROJECT_ID}) # Create a saved Extract v2 configuration with our schema config_response requests.post( f{LLAMA_CLOUD_BASE_URL}/api/v1/beta/configurations, headersllama_headers, params{project_id: PROJECT_ID}, json{ name: global-economic-extractor, parameters: { product_type: extract_v2, data_schema: EconomicReportSummary.model_json_schema(), extraction_target: per_doc, tier: agentic, }, }, ) config_response.raise_for_status() CONFIGURATION_ID config_response.json()[id] print(fConfiguration ID: {CONFIGURATION_ID})创建 Elasticsearch 索引创建索引时需要使用与 extraction schema 对应的 mapping。数值字段例如frontier_market_gdp_share_pct使用 number 类型以支持 ES|QL 结构化过滤。publication_date字段使用 date 类型以支持时间范围查询。叙述性字段例如executive_summary和key_vulnerabilities使用 text 类型用于 全文检索。标识类字段例如report_title使用 keyword 类型。完整 mapping 定义可在 notebook 中查看。构建 Agent Builder workflow 工具将下面的 YAML 复制并粘贴到 Elastic UI 中的Elasticsearch Workflows Create a new Workflow。Elastic Workflows 文档 覆盖了完整 YAML schema 以及可用的 step types。该 workflow 使用 LlamaCloud 的 REST API通过 Files API/api/v1/files/upload_from_url从公开 URL 上传 PDF并通过 Extract v2 API/api/v2/extract创建任务并轮询结果/api/v2/extract/{id}。name: LlamaParse Extract Economic Report Processor description: Uploads a PDF from a URL to LlamaCloud, runs Extract v2, and indexes the structured results into Elasticsearch. enabled: true inputs: - name: pdf_url type: string description: Public URL of the PDF to process required: true consts: indexName: economic-reports llamaBaseUrl: https://api.cloud.llamaindex.ai projectId: YOUR_PROJECT_ID configurationId: YOUR_CONFIGURATION_ID documentId: global-economic-prospects-jan-2026 llamaCloudApiKey: llx-YOUR-API-KEY-HERE triggers: - type: manual steps: # Upload PDF from URL to LlamaCloud - name: upload_pdf type: http with: url: {{ consts.llamaBaseUrl }}/api/v1/files/upload_from_url method: PUT headers: Authorization: Bearer {{ consts.llamaCloudApiKey }} Content-Type: application/json body: | { url: {{ inputs.pdf_url }} } # Create the Extract v2 job using our saved configuration - name: create_extraction_job type: http with: url: {{ consts.llamaBaseUrl }}/api/v2/extract?project_id{{ consts.projectId }} method: POST headers: Authorization: Bearer {{ consts.llamaCloudApiKey }} Content-Type: application/json body: | { file_input: {{ steps.upload_pdf.output.data.id }}, configuration_id: {{ consts.configurationId }} } - name: wait_for_extraction type: wait with: duration: 15s # Poll every 10s until status is COMPLETED (max ~3 min) - name: poll_until_done type: while condition: not steps.poll_get.output.data.status : COMPLETED max-iterations: limit: 19 on-limit: fail steps: - name: poll_wait type: wait with: duration: 10s - name: poll_get type: http with: url: {{ consts.llamaBaseUrl }}/api/v2/extract/{{ steps.create_extraction_job.output.data.id }}?project_id{{ consts.projectId }} method: GET headers: Authorization: Bearer {{ consts.llamaCloudApiKey }} Accept: application/json - name: index_extracted_data type: elasticsearch.index with: index: {{ consts.indexName }} id: {{ consts.documentId }} document: report_title: {{ steps.poll_get.output.data.extract_result.report_title }} publication_date: {{ steps.poll_get.output.data.extract_result.publication_date }} frontier_market_gdp_share_pct: {{ steps.poll_get.output.data.extract_result.frontier_market_gdp_share_pct }} frontier_market_investment_growth_2020s_pct: {{ steps.poll_get.output.data.extract_result.frontier_market_investment_growth_2020s_pct }} executive_summary: {{ steps.poll_get.output.data.extract_result.executive_summary }} key_vulnerabilities: {{ steps.poll_get.output.data.extract_result.key_vulnerabilities }} policy_recommendations: {{ steps.poll_get.output.data.extract_result.policy_recommendations }} refresh: wait_for - name: verify_document type: elasticsearch.search with: index: {{ consts.indexName }} query: term: _id: {{ consts.documentId }}用你自己的llamaCloudApiKey、projectId和configurationId更新consts部分。与 Agent Builder 连接保存 workflow 之后使用 Agent Builder Kibana API 或 UI 创建两个工具和一个 agent。import requests headers { Authorization: fApiKey {ELASTICSEARCH_API_KEY}, kbn-xsrf: true, Content-Type: application/json, } WORKFLOW_ID workflow-abcabc-0073-4a08-98a8-werwer # Copy from UI after creating the workflow # Create the workflow tool workflow_tool_payload { id: run_llamaextract_workflow, type: workflow, description: ( Triggers the LlamaParse Extract extraction workflow. Use this tool to extract structured data from a PDF URL and index it into Elasticsearch. Requires only the public URL of the PDF to process. ), tags: [llama-extract, workflow], configuration: { workflow_id: WORKFLOW_ID, }, } response requests.post( f{KIBANA_URL}/api/agent_builder/tools, headersheaders, jsonworkflow_tool_payload, ) print(fWorkflow tool: {response.status_code}) # Create the structured indicators query tool structured_query_payload { id: query_structured_indicators, type: esql, description: ( Query structured economic indicators using exact filters on numeric fields. Use this for questions like Which reports show frontier market GDP share below 5%? or Show investment growth for reports published after 2025-01. ), tags: [economic-data, llama-extract], configuration: { query: ( FROM economic-reports | WHERE frontier_market_gdp_share_pct ?max_gdp_share | KEEP report_title, publication_date, frontier_market_gdp_share_pct, frontier_market_investment_growth_2020s_pct | SORT publication_date DESC | LIMIT 10 ), params: { max_gdp_share: { type: double, description: Maximum frontier market GDP share percentage to filter by, } }, }, } response requests.post( f{KIBANA_URL}/api/agent_builder/tools, headersheaders, jsonstructured_query_payload, ) print(fStructured query tool: {response.status_code}) # Create the narrative text search tool text_search_payload { id: search_economic_narratives, type: esql, description: ( Search narrative content in economic reports. Use this for open-ended questions like What are the main risks for frontier markets? or What does the report recommend for policymakers?. ), tags: [economic-data, llama-extract], configuration: { query: ( FROM economic-reports | WHERE MATCH(executive_summary, ?query) OR MATCH(key_vulnerabilities, ?query) OR MATCH(policy_recommendations, ?query) | KEEP report_title, executive_summary, key_vulnerabilities, policy_recommendations | LIMIT 5 ), params: { query: { type: keyword, description: The search query to find relevant narrative content, } }, }, } response requests.post( f{KIBANA_URL}/api/agent_builder/tools, headersheaders, jsontext_search_payload, ) print(fText search tool: {response.status_code}) # Create the agent agent_payload { id: economic-report-analyst, name: Economic Report Analyst, description: Extracts and analyzes economic reports from PDFs using LlamaParse Extract and Elasticsearch., labels: [economics, llama-extract], configuration: { instructions: ( You are an economic research assistant. You have three tools:\n 1. run_llamaextract_workflow: Use this FIRST to extract and index data from a PDF.\n 2. query_structured_indicators: Use this for structured questions that filter on numeric fields like GDP share or investment growth.\n 3. search_economic_narratives: Use this for open-ended questions about risks, vulnerabilities, or policy recommendations.\n When presenting data, use clear formatting with bullet points or tables. Always cite the report title and publication date. ), tools: [ { tool_ids: [ run_llamaextract_workflow, query_structured_indicators, search_economic_narratives, ] } ], }, } response requests.post( f{KIBANA_URL}/api/agent_builder/agents, headersheaders, jsonagent_payload, ) print(fAgent: {response.status_code})workflow ID 可以在 URL 中找到https://4622216ea8cd443ead5bef0a3de05135.us-central1.gcp.cloud.es.io/app/workflows/WORKFLOW-ID通过 Agent Builder 测试 workflow配置好 “经济报告分析员” agent 后打开 Agent Builder 聊天窗口并发送如下消息请替换为你自己的实际 IDProcess this PDF and extract its data: your-pdf-url Once extracted, answer: What are the main vulnerabilities facing frontier markets and what policy recommendations does the report suggest to address them?Agent 会首先调用run_llamaextract_workflow等待 workflow 完成然后再使用search_economic_reports来检索并总结提取后的数据。Process this PDF and extract its data: https://diaphysial-rafael-unscrupulously.ngrok-free.dev/Markets-Executive-Summary.pdf Once extracted, answer: What are the main vulnerabilities facing frontier markets and what policy recommendations does the report suggest to address them?提示在实际的使用中https://raw.githubusercontent.com/Delacrobix/notebook-llamaindex-agent-builder/main/Markets-Executive-Summary.pdf 这个文件的路径可能导致失败。一种做法就是使用如下的命令在本地构建一个文件服务器python3 -m http.server 8000ngrok http 8000你可以参考 ngrok: deliver your apps, APIs, and AI on local and prod 来进行设置。我们可以得到一个公共的访问地址比如https://diaphysial-rafael-unscrupulously.ngrok-free.dev - http://localhost:8000结果What is the frontier market GDP share in 2025?结论从原始 PDF 到 agent 可用数据LlamaParse Extract 解决了文档理解问题它可以从包含图表和表格的 PDF 中提取结构化、基于 schema 的数据。Elastic Agent Builder 解决了编排问题将提取、索引以及查询串联成 workflow并提供 agent 可按需调用的 ES|QL 工具。两者结合在一起就弥合了企业原始文档与 agent 可用数据之间的鸿沟。这一模式不仅适用于经济报告。任何具有明确结构的企业文档合同、技术规范、财务报表都可以通过 Pydantic schema 建模并用同样的 pipeline 进行处理。下一步尝试配套 notebookElastic Agent Builder 文档LlamaParse Extract 入门Elastic Workflows 文档Agent Builder Kibana API原文PDF data extraction with LlamaParse and Elastic Agent Builder - Elasticsearch Labs