ScrapeGraphAI: An LLM-Powered Web Scraper

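In a data-driven world, extracting useful insight from online sources is essential. From market analysis to academic research, the demand for specific data has driven the need for capable web scraping tools.

Traditionally, Python libraries such as BeautifulSoup and Scrapy have been the go-to solutions, but they require users to apply programming expertise to navigate complex page structures. Take this BeautifulSoup example: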

# BeautifulSoup Example
from bs4 import BeautifulSoup
import requests

url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title)

Or this Scrapy example:

# Scrapy Example
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        title = response.css('title::text').get()
        print(title)

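1. Introduction to ScrapeGraphAI

ScrapeGraphAI is a Python library that takes a different approach to web scraping. It uses large language models (LLMs) together with direct graph logic to simplify data collection. Unlike its predecessors, ScrapeGraphAI lets users state what data they need in natural language, which removes much of the complexity of web scraping.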

%%capture
!apt install chromium-chromedriver
!pip install nest_asyncio
!pip install scrapegraphai
!playwright install

# If you plan to use text-to-speech or GPT-4 Vision models, make sure to use the
# correct API key
OPENAI_API_KEY = "YOUR API KEY"
GOOGLE_API_KEY = "YOUR API KEY"

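import nest_asyncio
from scrapegraphai.graphs import SmartScraperGraph

# Allow nested event loops so the scraper can run inside a notebook
nest_asyncio.apply()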

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
}


smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions.",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()

import json

# Pretty-print the result as indented JSON
print(json.dumps(result, indent=2))
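
The shape of `result` depends on the prompt and on the model, so it is worth checking before passing it downstream. The sketch below assumes, purely for illustration, that the scraper returned a dict with a `projects` list whose items carry `title` and `description` keys. The `sample_result` data and the `result_to_csv` helper are hypothetical and not part of ScrapeGraphAI.

```python
import csv
import io


def result_to_csv(result, list_key="projects", fields=("title", "description")):
    """Flatten a list of dicts found under `list_key` into CSV text.

    Missing fields become empty strings so that an incomplete LLM answer
    does not raise an error.
    """
    rows = result.get(list_key, [])
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields), lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({field: row.get(field, "") for field in fields})
    return buffer.getvalue()


# Hypothetical data standing in for the output of smart_scraper_graph.run()
sample_result = {
    "projects": [
        {"title": "Project A", "description": "First demo project"},
        {"title": "Project B"},  # description missing on purpose
    ]
}

csv_text = result_to_csv(sample_result)
print(csv_text)
```

Tolerating missing keys is a deliberate choice here: LLM output can vary between runs, so a small normalization step tends to be safer than indexing into the result directly.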

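2. SpeechGraph

SpeechGraph is a class that represents one of the default scraping pipelines; it generates both an answer and an audio file. It is similar to SmartScraperGraph but adds a TextToSpeechNode.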

from scrapegraphai.graphs import SpeechGraph

# Define the configuration for the graph
graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
    "tts_model": {
        "api_key": OPENAI_API_KEY,
        "model": "tts-1",
        "voice": "alloy"
    },
    "output_path": "website_summary.mp3",
}

# Create the SpeechGraph instance
speech_graph = SpeechGraph(
    prompt="Create a summary of the website",
    source="https://perinim.github.io/projects/",
    config=graph_config,
)

result = speech_graph.run()
answer = result.get("answer", "No answer found")

import json

# Pretty-print the answer as indented JSON
print(json.dumps(answer, indent=2))

from IPython.display import Audio, display

# Play the generated audio file inside the notebook
wn = Audio("website_summary.mp3", autoplay=True)
display(wn)

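3. GraphBuilder (Experimental)

GraphBuilder creates a scraping pipeline from scratch based on a user prompt. It returns a graph made up of nodes and edges.

GraphBuilder is an experimental class that helps you create a custom graph from a prompt. It produces a JSON document containing the basic elements that define the graph and lets you visualize it with graphviz. It knows the node types the library provides by default and connects them to help you reach your goal.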

from scrapegraphai.builders import GraphBuilder

# Define the configuration for the graph
graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
}

# Example usage of GraphBuilder
graph_builder = GraphBuilder(
    user_prompt="Extract the news and generate a text summary with a voiceover.",
    config=graph_config
)

graph_json = graph_builder.build_graph()

# Convert the resulting JSON to Graphviz format
graphviz_graph = graph_builder.convert_json_to_graphviz(graph_json)

# Save the graph to a file and open it in the default viewer
graphviz_graph.render('ScrapeGraphAI_generated_graph', view=True)

graph_json
graphviz_graph
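
To make the nodes-and-edges idea concrete, here is a minimal sketch that turns a small graph description into Graphviz DOT text using only the standard library. The JSON layout (`nodes` with a `node_name` key, `edges` with `from` and `to` keys), the node names, and the `to_dot` helper are assumptions made for illustration; the structure that `build_graph()` actually returns may differ.

```python
import json


def to_dot(graph):
    """Render a {'nodes': [...], 'edges': [...]} dict as Graphviz DOT text."""
    lines = ["digraph pipeline {"]
    for node in graph["nodes"]:
        lines.append(f'    "{node["node_name"]}";')
    for edge in graph["edges"]:
        lines.append(f'    "{edge["from"]}" -> "{edge["to"]}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical graph description, not actual GraphBuilder output
example_graph = json.loads("""
{
  "nodes": [
    {"node_name": "fetch"},
    {"node_name": "parse"},
    {"node_name": "summarize"},
    {"node_name": "text_to_speech"}
  ],
  "edges": [
    {"from": "fetch", "to": "parse"},
    {"from": "parse", "to": "summarize"},
    {"from": "summarize", "to": "text_to_speech"}
  ]
}
""")

dot_text = to_dot(example_graph)
print(dot_text)
```

The resulting text can be pasted into any Graphviz viewer, or passed to `graphviz.Source` if the `graphviz` Python package is installed.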

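4. How ScrapeGraphAI Works

ScrapeGraphAI works by interpreting the user's query and navigating web content to retrieve the requested information. Using an LLM, it builds the scraping pipeline on its own and keeps user intervention to a minimum. This approach improves efficiency and lowers the barrier to entry, so users can focus on data analysis rather than technical details.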
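
The pipeline idea can be illustrated with a toy version in plain Python: each node is a function that reads from and writes to a shared state dict, and the graph runs the nodes in order. This is a simplified sketch, not the library's implementation. The node names are hypothetical, the page is an inline string rather than a fetched URL, and `fake_llm` is a stand-in for a real model call.

```python
import re


def fetch_node(state):
    # A real pipeline would download state["source"]; here it is already HTML.
    state["html"] = state["source"]
    return state


def parse_node(state):
    # Strip tags and collapse whitespace to get plain text for the model.
    text = re.sub(r"<[^>]+>", " ", state["html"])
    state["text"] = re.sub(r"\s+", " ", text).strip()
    return state


def fake_llm(prompt, context):
    # Stand-in for an LLM call: echoes the prompt and reports the context size.
    return {"prompt": prompt, "context_chars": len(context)}


def answer_node(state):
    state["answer"] = fake_llm(state["prompt"], state["text"])
    return state


def run_pipeline(nodes, state):
    """Run each node in sequence, passing the shared state along."""
    for node in nodes:
        state = node(state)
    return state


final_state = run_pipeline(
    [fetch_node, parse_node, answer_node],
    {"prompt": "List the projects.", "source": "<h1>Projects</h1><p>Demo  one</p>"},
)
print(final_state["text"])
print(final_state["answer"])
```

Under this model, a pipeline like SpeechGraph would simply append one more node (for example, a text-to-speech step) to the end of the list.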

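Because it can automate complex scraping tasks while aiming for high accuracy, ScrapeGraphAI is useful to professionals across many industries. Whether the task is monitoring competitors or conducting academic research, the tool helps users put web data to work. As the digital landscape keeps evolving, ScrapeGraphAI is a practical aid for data-driven decision-making.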

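5. Closing Remarks

In a data-centric world, the importance of efficient data extraction is hard to overstate.

ScrapeGraphAI represents a shift in how web scraping is done, offering a user-friendly approach backed by LLMs. As businesses and researchers work to stay ahead in competitive environments, tools like this can help them obtain actionable insights and make informed decisions.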

Original article: LLM Web Scraping with ScrapeGraphAI: A Breakthrough in Data Extraction

Translated and edited by BimAnt. Please credit the source when reposting.