ScrapeGraphAI: An LLM-Powered Web Scraper


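In a fast-moving, data-driven world, extracting valuable insights from online resources is essential. From market analysis to academic research, the need for specific data has driven demand for powerful web scraping tools.

Traditionally, Python libraries such as BeautifulSoup and Scrapy have been the go-to solutions, but they require users to apply programming expertise to navigate complex page structures. For example, with BeautifulSoup: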

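# BeautifulSoup example: fetch a page and print its <title> element
from bs4 import BeautifulSoup
import requests

url = 'https://example.com'
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title)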

Or with Scrapy:

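# Scrapy example: a spider that extracts the page title
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        title = response.css('title::text').get()
        # Yield an item so Scrapy's pipelines and exporters can collect it
        yield {'title': title}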

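1. Introduction to ScrapeGraphAI

ScrapeGraphAI is a Python library that takes a different approach to web scraping. It uses large language models (LLMs) together with direct graph logic to simplify data collection. Unlike its predecessors, ScrapeGraphAI lets users state the data they need as a prompt, which removes much of the complexity of web scraping.

The setup commands below appear to target a notebook environment such as Google Colab (note the %%capture magic and the ! shell prefix).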

%%capture
!apt install chromium-chromedriver
!pip install nest_asyncio
!pip install scrapegraphai
!playwright install

# If you plan to use the text-to-speech and GPT-4 Vision models,
# make sure to supply the correct API keys.
OPENAI_API_KEY = "YOUR API KEY"
GOOGLE_API_KEY = "YOUR API KEY"

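import nest_asyncio
from scrapegraphai.graphs import SmartScraperGraph

# Notebooks already run an event loop; nest_asyncio lets the graph's
# async code run inside it
nest_asyncio.apply()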

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
}


smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions.",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()

import json

# Pretty-print the result as indented JSON
output = json.dumps(result, indent=2)
print(output)
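
The shape of the returned result depends on the prompt and on the model, so the following is only a sketch of one way to post-process it. It assumes, purely for illustration, that the result is a dict holding a `projects` list whose items carry `title` and `description` keys. Those field names and the `result_to_csv` helper are hypothetical; inspect your own output before relying on them.

```python
import csv
import io

# Hypothetical stand-in for the value returned by smart_scraper_graph.run().
# The real keys depend on the prompt and the model.
sample_result = {
    "projects": [
        {"title": "Project A", "description": "First demo project"},
        {"title": "Project B", "description": "Second demo project"},
    ]
}


def result_to_csv(result, list_key="projects", fields=("title", "description")):
    """Flatten a list of dicts stored under `list_key` into CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields), lineterminator="\n")
    writer.writeheader()
    for item in result.get(list_key, []):
        # Missing fields become empty strings instead of raising KeyError.
        writer.writerow({name: item.get(name, "") for name in fields})
    return buffer.getvalue()


csv_text = result_to_csv(sample_result)
print(csv_text)
```

Using `item.get` with a default keeps the export from failing when the model omits a field for some entries, which can happen with LLM-generated output.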

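2. SpeechGraph

SpeechGraph is a class that represents one of the default scraping pipelines. It produces both an answer and an audio file. It is similar to SmartScraperGraph, with a TextToSpeechNode added.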

from scrapegraphai.graphs import SpeechGraph

# Define the configuration for the graph
graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
    "tts_model": {
        "api_key": OPENAI_API_KEY,
        "model": "tts-1",
        "voice": "alloy"
    },
    "output_path": "website_summary.mp3",
}

# Create the SpeechGraph instance
speech_graph = SpeechGraph(
    prompt="Create a summary of the website",
    source="https://perinim.github.io/projects/",
    config=graph_config,
)

result = speech_graph.run()
answer = result.get("answer", "No answer found")

import json

# Pretty-print the answer as indented JSON
output = json.dumps(answer, indent=2)
print(output)

# Play the generated audio file inside the notebook
from IPython.display import Audio, display
wn = Audio("website_summary.mp3", autoplay=True)
display(wn)

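3. GraphBuilder (Experimental)

GraphBuilder creates a scraping pipeline from scratch based on a user prompt. It returns a graph containing nodes and edges.

GraphBuilder is an experimental class that helps you create custom graphs from a prompt. It produces a JSON object containing the basic elements that identify the graph and lets you visualize it with graphviz. It knows the node types the library provides by default and connects them to help you reach your goal.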
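
To make the idea of a JSON graph of nodes and edges more concrete, here is a minimal sketch that turns such a description into Graphviz DOT text using only the standard library. The schema (`nodes` with `name` and `type`, `edges` with `from` and `to`), the node names, and the `to_dot` helper are assumptions made for illustration. The JSON that GraphBuilder actually emits may be structured differently.

```python
# Hypothetical graph description; the schema GraphBuilder really emits may differ.
sample_graph = {
    "nodes": [
        {"name": "fetch", "type": "FetchNode"},
        {"name": "parse", "type": "ParseNode"},
        {"name": "answer", "type": "GenerateAnswerNode"},
        {"name": "speech", "type": "TextToSpeechNode"},
    ],
    "edges": [
        {"from": "fetch", "to": "parse"},
        {"from": "parse", "to": "answer"},
        {"from": "answer", "to": "speech"},
    ],
}


def to_dot(graph, name="pipeline"):
    """Render a nodes-and-edges dict as Graphviz DOT source text."""
    lines = ["digraph " + name + " {"]
    for node in graph["nodes"]:
        label = node["name"] + " (" + node["type"] + ")"
        lines.append('    "{}" [label="{}"];'.format(node["name"], label))
    for edge in graph["edges"]:
        lines.append('    "{}" -> "{}";'.format(edge["from"], edge["to"]))
    lines.append("}")
    return "\n".join(lines)


dot_source = to_dot(sample_graph)
print(dot_source)
```

The resulting text can be pasted into any DOT viewer, which is handy for checking a generated pipeline before running it.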

from scrapegraphai.builders import GraphBuilder

# Define the configuration for the graph
graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
}

# Example usage of GraphBuilder
graph_builder = GraphBuilder(
    user_prompt="Extract the news and generate a text summary with a voiceover.",
    config=graph_config
)

graph_json = graph_builder.build_graph()

# Convert the resulting JSON to Graphviz format
graphviz_graph = graph_builder.convert_json_to_graphviz(graph_json)

# Save the graph to a file and open it in the default viewer
graphviz_graph.render('ScrapeGraphAI_generated_graph', view=True)

graph_json
graphviz_graph

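4. How ScrapeGraphAI Works

ScrapeGraphAI works by interpreting the user's query and navigating web content to retrieve the requested information. Using an LLM, it builds the scraping pipeline on its own, minimizing user intervention. This approach improves efficiency and lowers the barrier to entry, letting users focus on data analysis rather than technical details.

Because it automates complex scraping tasks while maintaining high accuracy, ScrapeGraphAI is useful to professionals across many industries. Whether the task is monitoring competitors or conducting academic research, the tool helps users put web data to work. As the digital landscape continues to evolve, ScrapeGraphAI is positioned as a valuable aid for data-driven decision making.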
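
The pipeline idea can be illustrated with a toy example in which each node is a function that reads and updates a shared state dict. This is not ScrapeGraphAI's implementation: the node functions, the inlined HTML, and the `run_pipeline` helper are all hypothetical, and the "answer" step is a plain stand-in where a real pipeline would call an LLM.

```python
import re


def fetch_node(state):
    # A real pipeline would download state["source"]; the HTML is inlined
    # here so the sketch runs without network access.
    state["html"] = (
        "<html><head><title>Demo</title></head>"
        "<body><h2>Project A</h2><h2>Project B</h2></body></html>"
    )
    return state


def parse_node(state):
    # Extract the text of every <h2> element from the fetched HTML.
    state["headings"] = re.findall(r"<h2>(.*?)</h2>", state["html"])
    return state


def answer_node(state):
    # Stand-in for the LLM step: package the parsed data as the answer.
    state["answer"] = {"projects": state["headings"]}
    return state


def run_pipeline(nodes, state):
    """Run each node in order, passing the shared state along."""
    for node in nodes:
        state = node(state)
    return state


final_state = run_pipeline(
    [fetch_node, parse_node, answer_node],
    {"prompt": "List all projects.", "source": "https://example.com"},
)
print(final_state["answer"])
```

Keeping every step behind the same state-in, state-out interface is what makes it possible to add a node, such as a text-to-speech step, without touching the others.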

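5. Conclusion

In a data-centric world, the importance of efficient data extraction can hardly be overstated.

ScrapeGraphAI represents a paradigm shift in web scraping, offering a user-friendly approach backed by LLMs. As businesses and researchers strive to stay ahead in competitive environments, adopting tools like this is important for gaining actionable insights and making informed decisions.

Original article: LLM Web Scraping with ScrapeGraphAI: A Breakthrough in Data Extraction

Translated and compiled by BimAnt. Please credit the source when republishing.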
