Scrape API

概述

POST /v1/scrape 用于抓取指定 URL 内容，支持同步与异步两种模式：

同步：mode=sync（默认）。接口在本次请求内完成抓取，并直接返回结果。
异步：mode=async。接口立即返回 scrape_id，可通过 Scrape Result API 查询结果；也可配置 webhook 接收任务状态回调。

请求

Endpoint

POST https://run.xcrawl.com/v1/scrape

Headers

Content-Type: application/json
Authorization: Bearer <api_key>

请求体

顶层字段

字段	类型	必填	默认值	说明
`url`	string	是	-	目标 URL
`mode`	string	否	`sync`	`sync` 或 `async`
`proxy`	object	否	-	代理配置
`request`	object	否	-	请求配置
`js_render`	object	否	-	JS 渲染配置
`output`	object	否	-	输出配置
`webhook`	object	否	-	异步回调配置（仅异步模式需要）

`proxy`（代理配置）

字段	类型	必填	默认值	说明
`location`	string	否	`US`	ISO-3166-1 alpha-2 国家代码，例如 `US` / `JP` / `SG`
`sticky_session`	string	否	自动生成	粘性会话 ID；相同 ID 会尽量复用相同出口

`request`（请求配置）

字段	类型	必填	默认值	说明
`locale`	string	否	`en-US,en;q=0.9`	影响 `Accept-Language`
`device`	string	否	`desktop`	目前支持 `desktop` / `mobile`，会影响 UA 与 viewport
`cookies`	object map	否	-	Cookie 键值对
`headers`	object map	否	-	Header 键值对
`only_main_content`	boolean	否	`true`	仅返回主要内容（尽量过滤导航、侧栏等）
`block_ads`	boolean	否	`true`	尝试屏蔽广告相关资源
`skip_tls_verification`	boolean	否	`true`	跳过 TLS 证书校验

`js_render`（JS 渲染配置）

字段	类型	必填	默认值	说明
`enabled`	boolean	否	`true`	是否启用浏览器渲染
`wait_until`	string	否	`load`	支持 `load` / `domcontentloaded` / `networkidle`
`viewport.width`	integer	否	-	视口宽度：桌面端默认 `1920`，移动端默认 `402`
`viewport.height`	integer	否	-	视口高度：桌面端默认 `1080`，移动端默认 `874`

`output`（输出配置）

字段	类型	必填	默认值	说明
`formats`	string[]	否	`["markdown"]`	输出格式列表，见下方枚举
`screenshot`	string	否	`viewport`	`full_page` / `viewport`；仅当 `formats` 包含 `screenshot` 时生效
`json.prompt`	string	否	-	结构化抽取提示词
`json.json_schema`	object	否	-	结构化抽取 JSON Schema（https://json-schema.org/docs）

output.formats 支持的枚举值：

html
raw_html
markdown
links
summary
screenshot
json

`webhook`（异步回调配置）

字段	类型	必填	默认值	说明
`url`	string	否	-	回调地址
`headers`	object map	否	-	回调请求自定义请求头
`events`	string[]	否	`["started","completed","failed"]`	回调事件：`started` / `completed` / `failed`

webhook.events 事件含义：

started：任务开始时回调
completed：任务成功完成时回调
failed：任务失败时回调

响应

同步模式响应（`mode=sync`）

字段	类型	说明
`scrape_id`	string	任务 ID
`endpoint`	string	固定为 `scrape`
`version`	string	版本标识
`status`	string	`completed` / `failed`
`url`	string	本次抓取的 URL
`data`	object	抓取结果数据
`started_at`	string	任务开始时间（ISO 8601）
`ended_at`	string	任务结束时间（ISO 8601）
`total_credits_used`	integer	总积分消耗

data 字段说明（按 output.formats 返回相应内容）：

html：剔除 <head>、<script> 等后的正文 HTML
raw_html：原始 HTML
markdown：页面内容转换后的 Markdown
links：页面中解析到的链接列表
metadata：页面元信息（包含 title、status_code、content_type、proxy_location、proxy_sticky_session 等；会根据响应头与 HTML <head> 内容动态扩展）
screenshot：截图下载地址
summary：页面 AI 摘要
json：AI 结构化抽取结果（JSON）
traffic_bytes：本次抓取消耗的流量
credits_used：本次抓取消耗的积分
credits_detail：本次抓取消耗的积分明细

credits_detail 字段说明：

字段	类型	说明
`base_cost`	integer	抓取基础积分
`traffic_cost`	integer	流量消耗积分
`json_extract_cost`	integer	JSON 抽取消耗积分

异步模式响应（`mode=async`）

字段	类型	说明
`scrape_id`	string	任务 ID
`endpoint`	string	固定为 `scrape`
`version`	string	版本标识
`status`	string	创建任务后通常立即返回 `pending`

示例

同步请求示例

{
  "url": "https://example.com",
  "output": {
    "formats": [
      "markdown"
    ]
  }
}

同步响应示例

{
  "scrape_id": "01KKE88ETDN4RE9J7EPC5HR89B",
  "endpoint": "scrape",
  "version": "dca0d4b3bff035e4",
  "status": "completed",
  "url": "https://example.com",
  "data": {
    "markdown": "# Example Domain\nThis domain is for use in documentation examples without needing permission. Avoid use in operations.\n[Learn more](https://iana.org/domains/example)",
    "metadata": {
      "content_type": "text/html",
      "final_url": "https://example.com/",
      "proxy_location": "US",
      "proxy_sticky_session": "sticky-da8ab47a",
      "requested_url": "https://example.com",
      "status_code": 200,
      "title": "Example Domain",
      "viewport": "width=device-width, initial-scale=1"
    },
    "traffic_bytes": 1410,
    "credits_used": 1,
    "credits_detail": {
      "base_cost": 1,
      "traffic_cost": 0,
      "json_extract_cost": 0
    }
  },
  "started_at": "2026-03-11T10:49:39Z",
  "ended_at": "2026-03-11T10:49:44Z",
  "total_credits_used": 1
}

异步请求示例

{
  "url": "https://docs.xcrawl.com/doc/introduction/",
  "mode": "async",
  "output": {
    "formats": [
      "markdown"
    ]
  }
}

异步响应示例

{
  "scrape_id": "01KKE89TSDTP063R5PEX6XWSE1",
  "endpoint": "scrape",
  "version": "dca0d4b3bff035e4",
  "status": "pending"
}