Firecrawl: an AI-based tool for extracting content from links

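🔥 What is Firecrawl?

Firecrawl is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data. We crawl all accessible subpages and give you clean data for each one. No sitemap is required.

Psst. Hey, you, join our stargazers :)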


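How to use it?

Our hosted version provides an easy-to-use API, and a playground and documentation are available for it. You can also self-host the backend if you prefer.

• API
• Python SDK
• Node SDK
• Langchain integration 🦜🔗
• Llama Index integration 🦙
• Langchain JS integration 🦜🔗
• Want an SDK or integration? Let us know by opening an issue.

To run Firecrawl locally, refer to the project's local setup guide.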

API key

To use the API, you need to sign up on Firecrawl and get an API key.

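Crawling

Used to crawl a URL and all of its accessible subpages. This submits a crawl job and returns a job ID that you can use to check the crawl status.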


            
curl -X POST https://api.firecrawl.dev/v0/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://mendable.ai"
    }'

Returns a jobId:


        
            

          
{ "jobId": "1234-5678-9101" }

Check crawl job

Used to check the status of a crawl job and get its results.


          
curl -X GET https://api.firecrawl.dev/v0/crawl/status/1234-5678-9101 \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY'


          
{
  "status": "completed",
  "current": 22,
  "total": 22,
  "data": [
    {
      "content": "Raw Content",
      "markdown": "# Markdown Content",
      "provider": "web-scraper",
      "metadata": {
        "title": "Mendable | AI for CX and Sales",
        "description": "AI for CX and Sales",
        "language": null,
        "sourceURL": "https://www.mendable.ai/"
      }
    }
  ]
}
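
The submit-then-poll flow described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the SDK's implementation: `wait_for_crawl`, `fetch_status`, `interval`, and `max_checks` are hypothetical names, and the status fetcher is injected so the loop can be exercised without network access. It assumes only that a finished job reports `"status": "completed"` with its pages under `"data"`, as in the sample response.

```python
import time
from typing import Any, Callable, Dict, List


def wait_for_crawl(
    fetch_status: Callable[[], Dict[str, Any]],
    interval: float = 2.0,
    max_checks: int = 30,
) -> List[Dict[str, Any]]:
    """Poll a crawl job until it reports "completed", then return its pages.

    fetch_status is a hypothetical callable that returns the decoded JSON
    body of GET /v0/crawl/status/<jobId>.
    """
    for attempt in range(max_checks):
        status = fetch_status()
        if status.get("status") == "completed":
            return status.get("data", [])
        # Not finished yet: wait before asking again (no wait after the last check).
        if attempt < max_checks - 1:
            time.sleep(interval)
    raise TimeoutError(f"crawl not completed after {max_checks} checks")


# Offline demonstration with canned responses instead of real HTTP calls.
# The "active" status value is an assumption; only "completed" appears in the sample.
_responses = iter([
    {"status": "active", "current": 10, "total": 22},
    {"status": "completed", "current": 22, "total": 22,
     "data": [{"markdown": "# Markdown Content"}]},
])
pages = wait_for_crawl(lambda: next(_responses), interval=0.0)
print(len(pages))
```

In real use, `fetch_status` would wrap the GET request shown in the curl example. The sketch does not handle failed jobs, which a production client would need to detect.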

Scraping

Used to scrape a single URL and get its content.


          
curl -X POST https://api.firecrawl.dev/v0/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://mendable.ai"
    }'

Response:


          
{
  "success": true,
  "data": {
    "content": "Raw Content",
    "markdown": "# Markdown Content",
    "provider": "web-scraper",
    "metadata": {
      "title": "Mendable | AI for CX and Sales",
      "description": "AI for CX and Sales",
      "language": null,
      "sourceURL": "https://www.mendable.ai/"
    }
  }
}
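
To show how a client might consume this response, here is a small sketch that pulls the title, source URL, and markdown out of the JSON body. The helper name `extract_page` is hypothetical, and the field names are taken only from the sample response; a real response may carry more fields.

```python
import json
from typing import Any, Dict, Tuple

SAMPLE_RESPONSE = """
{
  "success": true,
  "data": {
    "content": "Raw Content",
    "markdown": "# Markdown Content",
    "provider": "web-scraper",
    "metadata": {
      "title": "Mendable | AI for CX and Sales",
      "description": "AI for CX and Sales",
      "language": null,
      "sourceURL": "https://www.mendable.ai/"
    }
  }
}
"""


def extract_page(body: str) -> Tuple[str, str, str]:
    """Return (title, source URL, markdown) from a /v0/scrape response body."""
    payload: Dict[str, Any] = json.loads(body)
    if not payload.get("success"):
        raise ValueError("scrape request did not succeed")
    data = payload["data"]
    metadata = data.get("metadata", {})
    return (
        metadata.get("title", ""),
        metadata.get("sourceURL", ""),
        data.get("markdown", ""),
    )


title, source_url, markdown = extract_page(SAMPLE_RESPONSE)
print(title, source_url, markdown, sep="\n")
```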

Search (Beta)

Used to search the web, get the most relevant results, scrape each page, and return the content as markdown.


          
# Set fetchPageContent to false for a fast SERP API.
# (JSON does not allow comments, so the note lives outside the request body.)
curl -X POST https://api.firecrawl.dev/v0/search \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "query": "firecrawl",
      "pageOptions": {
        "fetchPageContent": true
      }
    }'

Result:


          
{
  "success": true,
  "data": [
    {
      "url": "https://mendable.ai",
      "markdown": "# Markdown Content",
      "provider": "web-scraper",
      "metadata": {
        "title": "Mendable | AI for CX and Sales",
        "description": "AI for CX and Sales",
        "language": null,
        "sourceURL": "https://www.mendable.ai/"
      }
    }
  ]
}
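
Because JSON does not allow comments, the choice between full page content and a fast SERP-style response is easier to express in code. The sketch below builds, but does not send, the search request with Python's standard library. `build_search_request` is a hypothetical helper, and the endpoint and field names are copied from the curl example.

```python
import json
import urllib.request

SEARCH_URL = "https://api.firecrawl.dev/v0/search"


def build_search_request(
    query: str, api_key: str, fetch_page_content: bool = True
) -> urllib.request.Request:
    """Build (without sending) a POST request for the search endpoint."""
    payload = {
        "query": query,
        # False asks for search results only, per the note in the curl example.
        "pageOptions": {"fetchPageContent": fetch_page_content},
    }
    return urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_search_request("firecrawl", "YOUR_API_KEY", fetch_page_content=False)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out here so the example runs without network access or a real API key.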

Intelligent extraction (Beta)

Used to extract structured data from scraped pages.


          
curl -X POST https://api.firecrawl.dev/v0/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://www.mendable.ai/",
      "extractorOptions": {
        "mode": "llm-extraction",
        "extractionPrompt": "Based on the information on the page, extract the information from the schema.",
        "extractionSchema": {
          "type": "object",
          "properties": {
            "company_mission": {"type": "string"},
            "supports_sso": {"type": "boolean"},
            "is_open_source": {"type": "boolean"},
            "is_in_yc": {"type": "boolean"}
          },
          "required": ["company_mission", "supports_sso", "is_open_source", "is_in_yc"]
        }
      }
    }'


          
{
  "success": true,
  "data": {
    "content": "Raw Content",
    "metadata": {
      "title": "Mendable",
      "description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
      "robots": "follow, index",
      "ogTitle": "Mendable",
      "ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
      "ogUrl": "https://mendable.ai/",
      "ogImage": "https://mendable.ai/mendable_new_og1.png",
      "ogLocaleAlternate": [],
      "ogSiteName": "Mendable",
      "sourceURL": "https://mendable.ai/"
    },
    "llm_extraction": {
      "company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to",
      "supports_sso": true,
      "is_open_source": false,
      "is_in_yc": true
    }
  }
}
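
The `required` list and the property types in the request schema suggest a simple client-side check on the returned `llm_extraction` object. The following is a minimal sketch using only the standard library. `check_extraction` is a hypothetical helper that handles just the `string` and `boolean` types used in this example; it is not a full JSON Schema validator.

```python
from typing import Any, Dict, List

SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "company_mission": {"type": "string"},
        "supports_sso": {"type": "boolean"},
        "is_open_source": {"type": "boolean"},
        "is_in_yc": {"type": "boolean"},
    },
    "required": ["company_mission", "supports_sso", "is_open_source", "is_in_yc"],
}

# Only the two JSON Schema types that appear in the example schema.
PYTHON_TYPES = {"string": str, "boolean": bool}


def check_extraction(result: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the result fits the schema."""
    problems = []
    for name in schema.get("required", []):
        if name not in result:
            problems.append(f"missing field: {name}")
    for name, spec in schema.get("properties", {}).items():
        expected = PYTHON_TYPES.get(spec.get("type"))
        if name in result and expected and not isinstance(result[name], expected):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
    return problems


extraction = {
    "company_mission": "Train a secure AI on your technical resources that "
                       "answers customer and employee questions so your team "
                       "doesn't have to",
    "supports_sso": True,
    "is_open_source": False,
    "is_in_yc": True,
}
print(check_extraction(extraction, SCHEMA))
```

A check like this is useful because the extracted values come from an LLM, so a client may want to confirm the shape before relying on it.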

Using the Python SDK

Installing the Python SDK


        
            

          
pip install firecrawl-py

Crawl a website


          
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")

crawl_result = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/*']}})
for result in crawl_result:
    print(result['markdown'])
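
If you want to keep the crawled pages rather than print them, one option is to index them by source URL. The sketch below assumes each item in `crawl_result` has the same shape as the `data` entries in the crawl status response shown earlier (a `markdown` string and a `metadata.sourceURL`). `index_by_url` is a hypothetical helper, and the sample list stands in for a real crawl.

```python
from typing import Any, Dict, Iterable


def index_by_url(results: Iterable[Dict[str, Any]]) -> Dict[str, str]:
    """Map each page's source URL to its markdown, skipping pages without a URL."""
    pages: Dict[str, str] = {}
    for result in results:
        url = result.get("metadata", {}).get("sourceURL")
        if url:
            pages[url] = result.get("markdown", "")
    return pages


# Stand-in for the list returned by app.crawl_url(...); the shape is an assumption.
sample_results = [
    {"markdown": "# Markdown Content",
     "metadata": {"sourceURL": "https://www.mendable.ai/"}},
    {"markdown": "# No URL", "metadata": {}},
]
pages = index_by_url(sample_results)
print(sorted(pages))
```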

Scraping a single URL

To scrape a single URL, use the scrape_url method. It takes the URL as a parameter and returns the scraped data as a dictionary.


          
url = 'https://example.com'
scraped_data = app.scrape_url(url)

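Extracting structured data from a URL

With LLM extraction, you can easily extract structured data from any URL. Pydantic schemas are supported to make this easier. Here is how to use it: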


          
from pydantic import BaseModel, Field
from typing import List

class ArticleSchema(BaseModel):
    title: str
    points: int
    by: str
    commentsURL: str

class TopArticlesSchema(BaseModel):
    top: List[ArticleSchema] = Field(..., max_items=5, description="Top 5 stories")

# Note: schema_json() returns the schema as a JSON string. In pydantic v2,
# model_json_schema() returns it as a dict instead.
data = app.scrape_url('https://news.ycombinator.com', {
    'extractorOptions': {
        'extractionSchema': TopArticlesSchema.schema_json(),
        'mode': 'llm-extraction'
    },
    'pageOptions': {
        'onlyMainContent': True
    }
})
print(data["llm_extraction"])

Search for a query

Performs a web search, retrieves the top results, extracts data from each page, and returns the content as markdown.


          
query = 'What is Mendable?'
search_result = app.search(query)

Using the Node SDK

Installation

To install the Firecrawl Node SDK, you can use npm:


        
            

          
npm install @mendable/firecrawl-js

Usage

1. Get an API key from firecrawl.dev.
2. Set the API key as an environment variable named FIRECRAWL_API_KEY, or pass it as a parameter to the FirecrawlApp class.

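Scraping a single URL

To scrape a single URL with error handling, use the scrapeUrl method. It takes the URL as a parameter and returns the scraped data as an object.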


          
try {
  const url = 'https://example.com';
  const scrapedData = await app.scrapeUrl(url);
  console.log(scrapedData);
} catch (error) {
  console.error(
    'Error occurred while scraping:',
    error.message
  );
}

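Crawling a website

To crawl a website with error handling, use the crawlUrl method. It takes the starting URL and optional parameters as arguments. The params argument lets you specify additional options for the crawl job, such as the maximum number of pages to crawl, the allowed domains, and the output format.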


          
const crawlUrl = 'https://example.com';
const params = {
  crawlerOptions: {
    excludes: ['blog/'],
    includes: [], // leave empty for all pages
    limit: 1000,
  },
  pageOptions: {
    onlyMainContent: true
  }
};
const waitUntilDone = true;
const timeout = 5;
const crawlResult = await app.crawlUrl(
  crawlUrl,
  params,
  waitUntilDone,
  timeout
);

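Checking crawl status

To check the status of a crawl job with error handling, use the checkCrawlStatus method. It takes the job ID as a parameter and returns the current status of the crawl job.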


          
const status = await app.checkCrawlStatus(jobId);
console.log(status);

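Extracting structured data from a URL

With LLM extraction, you can also easily extract structured data from any URL. Zod schemas are supported to make this easier. Here is how to use it: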


          
import FirecrawlApp from "@mendable/firecrawl-js";
import { z } from "zod";

const app = new FirecrawlApp({
  apiKey: "fc-YOUR_API_KEY",
});

// Define the schema for the content to extract
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe("Top 5 stories on Hacker News"),
});

const scrapeResult = await app.scrapeUrl("https://news.ycombinator.com", {
  extractorOptions: { extractionSchema: schema },
});

console.log(scrapeResult.data["llm_extraction"]);

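Search for a query

With the search method, you can run a query against a search engine and get the top results along with the page content for each result. The method takes the query as a parameter and returns the search results.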


          
const query = 'what is mendable?';
const searchResults = await app.search(query, {
  pageOptions: {
    fetchPageContent: true // fetch the page content for each search result
  }
});