Python Crawlers and AI Agents for Beginners: From Site-Specific to General-Purpose Crawlers

2026-06-20

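So far this series has introduced several crawler programs: some scrape a page's content directly, and some also save it as a PDF along the way. This article pulls those methods together from start to finish, moving from the simplest single-page crawler, to targeted information extraction with a large language model (LLM), to a general-purpose crawler built on an AI Agent, and walks through each step.

Note: this series is aimed at crawler beginners, myself included, so it only collects simple programs that work out of the box. There is no sophisticated trickery and no discussion of low-level internals; it suits non-specialists who just want to use a crawler as a data source. If you are after the inner workings of crawling technology, the material here will probably be too shallow.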

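0. Implementing a Site-Specific Crawler for a Single Page

Start with the simplest scenario: a crawler written for a specific web page. The target may be a single page or a series of pages with similar structure.

The core method for this kind of crawler is straightforward: open the page, press F12 to open the developer tools, find the HTML tag or CSS class name that holds the target content, and write code to extract it.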
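As a minimal sketch of that workflow, the example below extracts a title and body paragraphs by tag and class name with BeautifulSoup. The HTML string and the class names `post-title` and `post-content` are made up for illustration; on a real page you would substitute whatever F12 shows.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a page inspected with F12.
HTML = """
<html><body>
  <h1 class="post-title">Hello Crawler</h1>
  <div class="post-content"><p>First paragraph.</p><p>Second paragraph.</p></div>
  <div class="sidebar">Unrelated links</div>
</body></html>
"""


def extract_article(html: str) -> dict:
    """Pull the title and body paragraphs out by tag name and CSS class."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1", class_="post-title").get_text(strip=True)
    body = soup.find("div", class_="post-content")
    paragraphs = [p.get_text(strip=True) for p in body.find_all("p")]
    return {"title": title, "paragraphs": paragraphs}


article = extract_article(HTML)
print(article)
```

The sidebar is ignored simply because nothing selects it, which is the whole point of locating the right tag and class first.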

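0.1 A Basic Crawler Implementation

If you already have some crawling experience, writing a simple crawler with BeautifulSoup is easy once you have seen the page structure. If you are starting from zero, you can ask a tool such as ChatGPT, Wenxin Yiyan (文心一言), or Zhipu Qingyan (智谱清言) to help. For the exact procedure, see this step-by-step tutorial:

[Efficiency] Let GPT Write Your Crawler for You, No Crawling Knowledge Required

That article explains in detail how to find the tag and class name of the target text in the HTML, how to prompt the LLM, and how to iterate on the result. An earlier article also covered the crawling functionality that LangChain wraps, which loads page content from a URL. However, when I tried it on WeChat Official Account articles, it failed.

0.2 Crawling with selenium

In another article in the [Practical Python Skills] series, we used selenium to save web pages as PDFs automatically. selenium can also pull content straight out of a page. The sample code below uses selenium to open a browser, locates an element by XPath, and scrapes its text: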

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def crawl_url(url):
    driver = webdriver.Chrome()
    try:
        print('-' * 100)
        print(f'now: url: {url}')
        driver.get(url)
        # Wait up to 10 seconds for the target element to appear in the DOM.
        wait = WebDriverWait(driver, 10)
        element = wait.until(EC.presence_of_element_located(
            (By.XPATH, "/html/body/div[1]/div[2]/div[1]/div/div[1]/div[2]")
        ))
        content = element.text
        print(content)
        return content
    finally:
        # quit() closes every window and ends the driver process.
        driver.quit()

url_list = [
    'https://mp.weixin.qq.com/s/2m8MrsCxf5boiH4Dzpphrg',
]
for url in url_list:
    crawl_url(url)
    time.sleep(5)  # pause between requests

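Getting the XPath is simple: in the F12 developer panel, find the element, right-click → Copy → Copy full XPath, and paste it into the code. The result is acceptable.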
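One caveat worth illustrating: a full XPath is position-based, so it can stop matching when the page layout shifts. The sketch below uses the standard library's `xml.etree.ElementTree` (not selenium) on two made-up page versions to compare a position-based path with an attribute-based one. The markup and the `content` class are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same page; v2 adds a banner above the nav.
PAGE_V1 = "<html><body><div><div>nav</div><div class='content'>article text</div></div></body></html>"
PAGE_V2 = "<html><body><div><div>banner</div><div>nav</div><div class='content'>article text</div></div></body></html>"

# Paths are relative to the root <html> element in ElementTree.
ABSOLUTE = "./body/div[1]/div[2]"      # position-based, like a copied full XPath
BY_CLASS = ".//div[@class='content']"  # attribute-based


def text_at(page: str, path: str) -> str:
    """Return the text of the first element matching path, or '' if none."""
    root = ET.fromstring(page)
    node = root.find(path)
    return node.text if node is not None else ""


for name, page in [("v1", PAGE_V1), ("v2", PAGE_V2)]:
    print(name, repr(text_at(page, ABSOLUTE)), repr(text_at(page, BY_CLASS)))
```

In selenium the attribute-based form would be written as `//div[@class='content']` and passed to the same `find_element` call.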

0.3 Crawling Web Content with LangChain

So far we have seen two ways to fetch web pages with LangChain:

0.3.1 Loading + Transforming

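The first uses a LangChain Loader to load the HTML page and then a Transformer to convert the HTML into plain text. The Loader can be AsyncHtmlLoader or AsyncChromiumLoader, and the Transformer can be HTML2Text or BeautifulSoup. For usage details, see the first half of the earlier LangChain article in this series.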
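To make the "transform" half concrete without depending on LangChain or the network, here is a standard-library sketch of turning HTML into plain text. It is only an illustration of the idea, not how LangChain's transformers are implemented; the `TextExtractor` class and the sample HTML are invented for this example, and the "load" half is replaced by a literal string.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Body text.</p><script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```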

0.3.2 WebBaseLoader

The other way is to use the WebBaseLoader class directly, which is a higher-level wrapper around the approach above:

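import bs4
from langchain_community.document_loaders import WebBaseLoader

# Parse only the elements whose class is the post title, header, or content.
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()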

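Summary: these are several ways to implement a site-specific crawler for a single page. Whichever you choose, you cannot avoid manually analyzing the HTML structure with F12, because you need the tag or class name to extract content precisely. On a different page the tags may no longer apply, and the analysis has to be redone.

1. Exploring Direct Extraction of Specified Information with an LLM

Above, the LLM helped generate Python code, but that code is not reusable across pages. How can the process be generalized? We tried this before: the second half of an earlier article used LangChain's create_extraction_chain to build a general-purpose crawler that extracts specific content from a page's text.

The core code is: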

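from langchain.chains import create_extraction_chain

# The fields we want the LLM to extract from the page text.
schema = {
    "properties": {
        "article_title": {"type": "string"},
        "article_content": {"type": "string"},
        "article_example_python_code": {"type": "string"},
    },
    "required": ["article_title", "article_content", "article_example_python_code"],
}

# `llm` must be a chat model that supports function calling; it is defined elsewhere.
def extract(content: str, schema: dict):
    return create_extraction_chain(schema=schema, llm=llm).run(content)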

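The key is defining the schema, which tells the LLM what to extract. The principle is simple: internally, the schema is converted into OpenAI's function calling format, and the LLM uses its function calling capability to extract the information. For the detailed steps and principles, see the article referenced above.

Summary: once the full page text is available, this method has the LLM extract information from it directly, with no need to know tags or class names, so it is reasonably general. Its quality, however, depends entirely on the LLM's capability and on how well the schema is defined. For now it is somewhat useful, but it is still far from production-ready.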
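To give a rough picture of what "converted into the function calling format" can mean, the sketch below wraps the schema in a function definition with a name, a description, and JSON Schema parameters. This is an illustration only, not LangChain's actual internal code; the helper `schema_to_function` and the function name `extract_article_fields` are hypothetical.

```python
import json

schema = {
    "properties": {
        "article_title": {"type": "string"},
        "article_content": {"type": "string"},
        "article_example_python_code": {"type": "string"},
    },
    "required": ["article_title", "article_content", "article_example_python_code"],
}


def schema_to_function(schema: dict) -> dict:
    """Wrap a properties/required schema as a function-calling definition."""
    return {
        "name": "extract_article_fields",  # hypothetical function name
        "description": "Extract the requested fields from the passage.",
        "parameters": {
            "type": "object",
            "properties": schema["properties"],
            "required": schema.get("required", []),
        },
    }


function_def = schema_to_function(schema)
print(json.dumps(function_def, indent=2))
```

The model is then asked to "call" this function, and the arguments it fills in are the extracted fields.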

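2. Building a General-Purpose Crawler with an AI Agent

2.1 Approach

A truly general-purpose crawler calls for an AI Agent. While studying MetaGPT recently, I found it includes an example of a general-purpose crawler, which I borrowed from. For the detailed steps, see Part 3 of that article. The core idea: have the LLM understand what data the user wants, have it write crawler code for that requirement, and then execute the code automatically to fetch the content.

2.2 Automated Crawler Code Generator

I extracted this process on its own. The complete code for generating crawler code automatically is: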

import asyncio
from metagpt.tools.web_browser_engine import WebBrowserEngine
from metagpt.utils.common import CodeParser
from metagpt.utils.parse_html import _get_soup
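# Assumed to be the author's own helper wrapping a chat-completion API (not shown in this article).
from openai_test import openai_test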

def get_outline(page):
    soup = _get_soup(page.html)
    outline = []
    def process_element(element, depth):
        name = element.name
        if not name:
            return
        if name in ["script", "style"]:
            return
        element_info = {"name": element.name, "depth": depth}
        if name in ["svg"]:
            element_info["text"] = None
            outline.append(element_info)
            return
        element_info["text"] = element.string
        if "id" in element.attrs:
            element_info["id"] = element["id"]
        if "class" in element.attrs:
            element_info["class"] = element["class"]
        outline.append(element_info)
        for child in element.children:
            process_element(child, depth + 1)
    for element in soup.body.children:
        process_element(element, 1)
    return outline

async def test(url, query):
    page = await WebBrowserEngine().run(url)
    outline = get_outline(page)
    outline = "\n".join(
        f"{' '*i['depth']}{'.'.join([i['name'], *i.get('class', [])])}: {i['text'] if i['text'] else ''}"
        for i in outline
    )
    PROMPT_TEMPLATE = """Please complete the web page crawler parse function to achieve the User Requirement. The parse function should take a BeautifulSoup object as input, which corresponds to the HTML outline provided in the Context.

    ```python
    from bs4 import BeautifulSoup
    # only complete the parse function
    def parse(soup: BeautifulSoup):
        ...
        # Return the object that the user wants to retrieve, don't use print
    ```
    ## User Requirement
    {requirement}
    ## Context
    The outline of the HTML page to scrape is shown below:
    ```tree
    {outline}
    ```
    """
    code_rsp = openai_test.get_chat_completion(PROMPT_TEMPLATE.format(outline=outline, requirement=query))
    code = CodeParser.parse_code(block="", text=code_rsp)
    print(code)

# Requirement, in English: "get the title, all questions in the body, and the code in the body".
asyncio.run(test("https://mp.weixin.qq.com/s/2m8MrsCxf5boiH4Dzpphrg", "获取标题，正文中的所有问题，正文中的代码"))

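The steps, in summary:
(1) Use WebBrowserEngine to fetch the page's HTML: page = await WebBrowserEngine().run(url)
(2) Use get_outline to extract the page's main structure, dropping useless data to reduce token consumption: outline = get_outline(page)
(3) Combine the outline and the user requirement into a prompt and have the LLM write the code: code_rsp = openai_test.get_chat_completion(PROMPT_TEMPLATE.format(outline=outline, requirement=query))
(4) Extract the code block from the LLM's reply: code = CodeParser.parse_code(block="", text=code_rsp)

The result (crawler code for the given URL and requirement):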
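To show what the outline text sent to the LLM looks like, the sketch below applies the same formatting expression as the test function above to a small hand-written outline. The four entries are invented sample data in the shape that get_outline produces, and `render_outline` is a name introduced here for illustration.

```python
# Hypothetical outline entries in the shape produced by get_outline.
outline = [
    {"name": "div", "depth": 1, "text": None, "id": "page", "class": ["rich_media"]},
    {"name": "h1", "depth": 2, "text": "Sample title", "class": ["rich_media_title"]},
    {"name": "p", "depth": 2, "text": "Sample paragraph"},
    {"name": "svg", "depth": 2, "text": None},
]


def render_outline(outline: list) -> str:
    """Render each element as '<indent>tag.class1.class2: text', one per line."""
    return "\n".join(
        f"{' ' * item['depth']}{'.'.join([item['name'], *item.get('class', [])])}: {item['text'] or ''}"
        for item in outline
    )


rendered = render_outline(outline)
print(rendered)
```

Each line carries only the tag, its classes, and any direct text, indented by depth, which is far more compact than the raw HTML.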

def parse(soup: BeautifulSoup):
    title = soup.find('h1', class_='rich_media_title').text
    questions = []
    codes = []
    sections = soup.find_all('section')
    for section in sections:
        blocks = section.find_all(['p', 'h2', 'h3', 'pre', 'ul', 'blockquote'])
        for block in blocks:
            text = block.get_text(strip=True)
            if text:
                if block.name == 'p' or block.name == 'h2' or block.name == 'h3':
                    if text not in ['公众号内文章一览', '原创', '同学小张', '2024-03-13 08:00', '北京']:
                        questions.append(text)
                if block.name == 'pre' or block.name == 'ul' or block.name == 'blockquote':
                    codes.append(text)
    return {'title': title, 'questions': questions, 'codes': codes}

I then tested the crawler the LLM wrote:

import asyncio
from metagpt.tools.web_browser_engine import WebBrowserEngine
from bs4 import BeautifulSoup

def parse(soup: BeautifulSoup):
    title = soup.find('h1', class_='rich_media_title').text
    questions = []
    codes = []
    sections = soup.find_all('section')
    for section in sections:
        blocks = section.find_all(['p', 'h2', 'h3', 'pre', 'ul', 'blockquote'])
        for block in blocks:
            text = block.get_text(strip=True)
            if text:
                if block.name == 'p' or block.name == 'h2' or block.name == 'h3':
                    if text not in ['公众号内文章一览', '原创', '同学小张', '2024-03-13 08:00', '北京']:
                        questions.append(text)
                if block.name == 'pre' or block.name == 'ul' or block.name == 'blockquote':
                    codes.append(text)
    return {'title': title, 'questions': questions, 'codes': codes}

async def test(url):
    page = await WebBrowserEngine().run(url)
    result = parse(page.soup)
    print(result)

asyncio.run(test("https://mp.weixin.qq.com/s/2m8MrsCxf5boiH4Dzpphrg"))

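The output directly contains the title, the questions from the body, and the code, which is a good result. Here I copied the crawler code out and tested it by hand; in a real AI Agent, adding one more Agent dedicated to running this code automatically completes the whole general-purpose crawler workflow. The user only needs to supply a URL and the data they want, and the results are much better than extracting directly with the LLM as in Section 1.

2.3 Problems You May Encounter

Even though get_outline trims the HTML, the prompt sometimes still exceeds the model's token limit, and the request fails with a token-limit error.

The fix is blunt: cap the length of the final prompt. What remains is still enough to convey the HTML structure. Note that the check below counts characters rather than tokens, so 16000 is only a rough stand-in for the model's actual token limit.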
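The "run the generated code automatically" step could look roughly like the sketch below: pull the fenced code out of the model's reply, execute it in a fresh namespace, and call the resulting parse function. This is a simplified stand-in, not MetaGPT's implementation; `extract_code`, `run_generated_parser`, and the sample reply are all hypothetical, and a regex takes the place of CodeParser.parse_code.

```python
import re

from bs4 import BeautifulSoup

FENCE = "`" * 3  # the code-fence marker, built at runtime

# Hypothetical model reply standing in for code_rsp.
llm_reply = (
    "Here is the parse function:\n"
    f"{FENCE}python\n"
    "def parse(soup):\n"
    "    return {'title': soup.find('h1').get_text(strip=True)}\n"
    f"{FENCE}\n"
)


def extract_code(reply: str) -> str:
    """Return the body of the first fenced block, or the whole reply if none."""
    match = re.search(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, reply, re.DOTALL)
    return match.group(1) if match else reply


def run_generated_parser(code: str, html: str):
    """Execute generated code in a fresh namespace and call its parse()."""
    namespace = {"BeautifulSoup": BeautifulSoup}
    exec(code, namespace)  # only run code you trust, ideally in a sandbox
    return namespace["parse"](BeautifulSoup(html, "html.parser"))


result = run_generated_parser(
    extract_code(llm_reply), "<html><body><h1> Demo </h1></body></html>"
)
print(result)
```

Because exec runs whatever the model wrote, a real agent would want to isolate this step, for example in a separate process or container.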

# Truncate the prompt to at most 16000 characters.
if len(prompt) > 16000:
    prompt = prompt[:16000]
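A slightly gentler variant, sketched below, trims only the outline and cuts at line boundaries, so the instructions in the prompt stay intact and no outline line is cut in half. It still counts characters, not tokens. The short `TEMPLATE`, `truncate_outline`, and `build_prompt` are hypothetical names used for illustration; in the real code the template would be PROMPT_TEMPLATE.

```python
MAX_PROMPT_CHARS = 16000  # a character cap, not a token count

# Hypothetical short template standing in for PROMPT_TEMPLATE.
TEMPLATE = "## User Requirement\n{requirement}\n## Context\n{outline}\n"


def truncate_outline(outline: str, max_chars: int) -> str:
    """Keep whole outline lines until the character budget is used up."""
    kept, used = [], 0
    for line in outline.split("\n"):
        cost = len(line) + 1  # +1 for the newline that joins lines
        if used + cost > max_chars:
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)


def build_prompt(outline: str, requirement: str, max_chars: int = MAX_PROMPT_CHARS) -> str:
    """Format the prompt, shrinking only the outline to fit within max_chars."""
    overhead = len(TEMPLATE.format(outline="", requirement=requirement))
    budget = max(0, max_chars - overhead)
    return TEMPLATE.format(
        outline=truncate_outline(outline, budget), requirement=requirement
    )


long_outline = "\n".join(f" p: paragraph {i}" for i in range(5000))
prompt = build_prompt(long_outline, "Get the title")
print(len(prompt))
```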

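3. Summary

This article reviewed all the crawler code used in the series so far and analyzed the idea behind each approach: from site-specific crawlers, to direct extraction of specified information with an LLM, to a general-purpose crawler built on an AI Agent. The progression goes step by step from manual to almost fully automatic and covers a fairly wide range. All the code in this article and the related articles has been tested by me and can be used directly.

Aside: running this code may require a Playwright environment, which LangChain and MetaGPT also often rely on under the hood. I previously wrote up the pitfalls of installing Playwright on a cloud server, which you can refer to.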