Scrape

Scrape fetches a single URL and returns the page content in LLM-friendly formats such as Markdown, a summary, or structured JSON.

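Convert the main page content to Markdown for easier LLM processing
Extract the links on a page (for targeted follow-up scraping)
Generate an AI summary of a page
Use json mode for structured extraction (prices, table fields, product information, and so on)
Take a screenshot to check how the page renders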

Scrape a URL with XCrawl

/scrape endpoint

Usage

curl -s -X POST 'https://run.xcrawl.com/v1/scrape' \
  -H "Authorization: Bearer $XCRAWL_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "output": {
      "formats": ["markdown"]
    }
  }'

Response example

{
  "scrape_id": "01KKE88ETDN4RE9J7EPC5HR89B",
  "endpoint": "scrape",
  "version": "dca0d4b3bff035e4",
  "status": "completed",
  "url": "https://example.com",
  "data": {
    "markdown": "# Example Domain\nThis domain is for use in documentation examples without needing permission. Avoid use in operations.\n[Learn more](https://iana.org/domains/example)",
    "metadata": {
      "content_type": "text/html",
      "final_url": "https://example.com/",
      "status_code": 200,
      "title": "Example Domain"
    },
    "traffic_bytes": 1410,
    "credits_used": 1
  },
  "started_at": "2026-03-11T10:49:39Z",
  "ended_at": "2026-03-11T10:49:44Z",
  "total_credits_used": 1
}

For parameter details, see the Scrape API Reference.
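The same request can also be built from Python. The following is a minimal sketch using only the standard library; the helper names `build_scrape_request`, `send_request`, and `extract_markdown` are illustrative and not part of any XCrawl SDK. The endpoint, headers, and field names are taken from the curl example and response shown above. Treating any status other than "completed" as "no content" is an assumption.

```python
import json
import urllib.request

SCRAPE_URL = "https://run.xcrawl.com/v1/scrape"  # endpoint from the curl example


def build_scrape_request(url, formats, api_key):
    """Build (but do not send) a POST request for the /scrape endpoint."""
    payload = {"url": url, "output": {"formats": list(formats)}}
    return urllib.request.Request(
        SCRAPE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_request(request, timeout=60):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


def extract_markdown(response):
    """Return data.markdown from a completed response, else None (assumption)."""
    if response.get("status") != "completed":
        return None
    return response.get("data", {}).get("markdown")
```

To run it against the live service, read the key from the environment, for example `send_request(build_scrape_request("https://example.com", ["markdown"], os.environ["XCRAWL_API_KEY"]))`, and pass the result to `extract_markdown`.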

Output formats

Use output.formats to control what is returned:

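markdown - Extracts the page content as Markdown
html - Returns the page HTML
raw_html - The original HTML, without any modification
links - Extracts the links that appear on the page
summary - An AI-generated summary of the page content
screenshot - A screenshot of the page
json - Structured data extraction, used together with prompt and json_schema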

See: Output formats
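Since output.formats is a plain list of names, a typo can be caught before a request is sent. The sketch below checks names against the seven formats listed above and removes duplicates; `build_output` is a hypothetical helper, and how the API itself responds to unknown or repeated names is not described here.

```python
# Format names as listed in this section.
KNOWN_FORMATS = (
    "markdown", "html", "raw_html", "links", "summary", "screenshot", "json",
)


def build_output(formats):
    """Return the output object for a request, rejecting unlisted names."""
    unknown = [name for name in formats if name not in KNOWN_FORMATS]
    if unknown:
        raise ValueError(f"unknown output formats: {unknown}")
    # dict.fromkeys drops duplicates while keeping the caller's order.
    return {"formats": list(dict.fromkeys(formats))}
```

For example, `build_output(["markdown", "links"])` produces the object to place under the "output" key of the request body.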

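JS rendering

Our scrape endpoint enables JS rendering by default so that the full page content is captured. You can set the viewport size used for rendering, or turn JS rendering off by setting js_render.enabled to false. Turning it off reduces compatibility with dynamic pages but can significantly speed up scraping.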

curl -s -X POST 'https://run.xcrawl.com/v1/scrape' \
  -H "Authorization: Bearer $XCRAWL_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://xcrawl.com",
    "js_render": {
      "enabled": true,
      "wait_until": "load",
      "viewport": {"width": 1920, "height": 1080}
    },
    "output": {
      "formats": ["markdown", "html"]
    }
  }'

See: JS rendering
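The trade-off described above comes down to which js_render object is sent. The sketch below builds that object in Python; `js_render_options` is a hypothetical helper, and it assumes that any field left out falls back to the service default. The only wait_until value shown in this section is "load".

```python
def js_render_options(enabled=True, wait_until=None, width=None, height=None):
    """Build the js_render object for a /scrape request body."""
    if not enabled:
        # With rendering off, the other settings would have nothing to act on.
        return {"enabled": False}
    options = {"enabled": True}
    if wait_until is not None:
        options["wait_until"] = wait_until
    if width is not None and height is not None:
        options["viewport"] = {"width": width, "height": height}
    return options


# Request body for a static page where speed matters more than JS support.
static_page_payload = {
    "url": "https://example.com",
    "js_render": js_render_options(enabled=False),
    "output": {"formats": ["markdown"]},
}
```

Calling `js_render_options(wait_until="load", width=1920, height=1080)` reproduces the js_render object from the curl example.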

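Structured data extraction

With the json output format, combined with a prompt and an optional json_schema, you can extract structured data from page content, for example the product name and price from a product page.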

curl -s -X POST 'https://run.xcrawl.com/v1/scrape' \
  -H "Authorization: Bearer $XCRAWL_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "output": {
      "formats": ["json"]
    },
    "json": {
      "prompt": "Extract the page title and main CTA link."
    }
  }'

Response example:

{
  "scrape_id": "01KKE8JTFD20YP1KXPVJHGCS28",
  "endpoint": "scrape",
  "version": "dca0d4b3bff035e4",
  "status": "completed",
  "url": "https://example.com",
  "data": {
    "json": {
      "description": "This domain is for use in documentation examples without needing permission. Avoid use in operations.",
      "domain": {
        "name": "example",
        "permissions_required": false,
        "purpose": "documentation examples"
      },
      "links": [
        {
          "text": "Learn more",
          "url": "https://iana.org/domains/example"
        }
      ],
      "title": "Example Domain"
    },
    "credits_used": 5,
    "credits_detail": {
      "base_cost": 1,
      "traffic_cost": 0,
      "json_extract_cost": 4
    }
  },
  "started_at": "2026-03-11T10:55:19Z",
  "ended_at": "2026-03-11T10:55:26Z",
  "total_credits_used": 5
}
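The extracted object sits under data.json, and the credit fields sit next to it. The sketch below reads both from a trimmed copy of the response above; `credit_breakdown` is a hypothetical helper. In this sample the credits_detail items add up to credits_used (1 + 0 + 4 = 5); whether that holds for every response is an assumption.

```python
# Trimmed copy of the response example above.
SAMPLE_RESPONSE = {
    "status": "completed",
    "data": {
        "json": {"title": "Example Domain"},
        "credits_used": 5,
        "credits_detail": {
            "base_cost": 1,
            "traffic_cost": 0,
            "json_extract_cost": 4,
        },
    },
    "total_credits_used": 5,
}


def credit_breakdown(response):
    """Return credits_used, the per-item detail, and the sum of the detail."""
    data = response["data"]
    detail = data.get("credits_detail", {})
    return {
        "used": data["credits_used"],
        "detail": detail,
        "detail_sum": sum(detail.values()),
    }


summary = credit_breakdown(SAMPLE_RESPONSE)
print(SAMPLE_RESPONSE["data"]["json"]["title"])  # prints "Example Domain"
print(summary["used"], summary["detail_sum"])  # prints "5 5"
```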

When you need fixed fields and data types, provide a json_schema to constrain the output format.

curl -s -X POST 'https://run.xcrawl.com/v1/scrape' \
  -H "Authorization: Bearer $XCRAWL_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://xcrawl.com",
    "output": {
      "formats": ["json"]
    },
    "json": {
      "prompt": "Extract product name and price from the page.",
      "json_schema": {
        "type": "object",
        "properties": {
          "product_name": {"type": "string"},
          "price": {"type": "string"}
        },
        "required": ["product_name", "price"]
      }
    }
  }'

For the fields of the json output and how json_schema constraints work, see the Scrape API Reference.
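When a schema is supplied, the same schema can be reused on the client to confirm that the expected fields came back. The sketch below builds the request body and checks only the "required" list; `build_extraction_payload` and `missing_required` are hypothetical helpers. Whether the service guarantees schema-compliant output is not stated here, so the local check is a defensive assumption and not a full JSON Schema validation.

```python
# Same schema as in the curl example above.
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "string"},
    },
    "required": ["product_name", "price"],
}


def build_extraction_payload(url, prompt, schema=None):
    """Build a /scrape request body for the json output format."""
    json_options = {"prompt": prompt}
    if schema is not None:
        json_options["json_schema"] = schema  # optional, per the text above
    return {"url": url, "output": {"formats": ["json"]}, "json": json_options}


def missing_required(result, schema):
    """List the schema's required keys that are absent from data.json."""
    return [key for key in schema.get("required", []) if key not in result]
```

After a request, `missing_required(response["data"]["json"], PRODUCT_SCHEMA)` returns an empty list when every required field is present.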