Scrape
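Scrape fetches a single URL and returns the page content in LLM-friendly formats such as Markdown, a summary, or structured JSON.
- Convert the page body to Markdown for easier LLM processing
- Extract the links on the page (for follow-up targeted scraping)
- Generate an AI summary of the page
- Use json mode for structured extraction (prices, table fields, product information, and so on)
- Take a screenshot to check how the page renders
For details, see the Scrape API reference.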
Scrape a URL with XCrawl
/scrape endpoint
Usage
curl -s -X POST 'https://run.xcrawl.com/v1/scrape' \
-H "Authorization: Bearer $XCRAWL_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"url": "https://example.com",
"output": {
"formats": ["markdown"]
}
}'
Response example
{
"scrape_id": "01KKE88ETDN4RE9J7EPC5HR89B",
"endpoint": "scrape",
"version": "dca0d4b3bff035e4",
"status": "completed",
"url": "https://example.com",
"data": {
"markdown": "# Example Domain\nThis domain is for use in documentation examples without needing permission. Avoid use in operations.\n[Learn more](https://iana.org/domains/example)",
"metadata": {
"content_type": "text/html",
"final_url": "https://example.com/",
"status_code": 200,
"title": "Example Domain"
},
"traffic_bytes": 1410,
"credits_used": 1
},
"started_at": "2026-03-11T10:49:39Z",
"ended_at": "2026-03-11T10:49:44Z",
"total_credits_used": 1
}
For parameter descriptions, see the Scrape API reference.
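A script that calls /scrape will usually want to check the status field before using data. The sketch below works on a trimmed copy of the response above; `json_string_field` is a hypothetical helper written for this illustration, and it relies on simple text matching, so a real JSON parser would be more robust.

```shell
# Sample response trimmed from the example above.
response='{"scrape_id": "01KKE88ETDN4RE9J7EPC5HR89B", "status": "completed", "total_credits_used": 1}'

# Hypothetical helper: print the value of a string field by text matching.
# This is not a JSON parser; it only handles simple, unescaped string values.
json_string_field() {
  printf '%s' "$1" | sed -n "s/.*\"$2\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p"
}

status=$(json_string_field "$response" status)
if [ "$status" = "completed" ]; then
  echo "scrape finished"
else
  echo "scrape not finished: $status"
fi
```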
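Output formats
Use output.formats to control what is returned:
- markdown: the page content extracted as Markdown
- html: the page's HTML
- raw_html: the original HTML without any modification
- links: the links that appear on the page
- summary: an AI-generated summary of the page content
- screenshot: a screenshot of the page
- json: structured data extraction, used together with prompt and json_schema
See: Output formats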
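Because output.formats is a plain JSON array, several formats can be requested in one call. The sketch below assembles a request body from a URL and any number of format names; `build_scrape_body` is a hypothetical helper for illustration, and it does not escape special characters in the URL.

```shell
# Hypothetical helper: build a /scrape request body.
# Usage: build_scrape_body URL FORMAT [FORMAT...]
build_scrape_body() {
  url=$1
  shift
  formats=""
  for f in "$@"; do
    # Append each format as a quoted JSON string, comma-separated.
    formats="${formats:+$formats, }\"$f\""
  done
  printf '{"url": "%s", "output": {"formats": [%s]}}\n' "$url" "$formats"
}

build_scrape_body 'https://example.com' markdown links
```

The printed body can be passed to curl with -d in place of the inline JSON used in the examples.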
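JS rendering
The scrape endpoint enables JS rendering by default so that the full page content is captured. You can set the viewport size used for rendering, or set enabled to false to turn JS rendering off. Turning it off reduces compatibility with dynamic pages but can significantly speed up scraping.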
curl -s -X POST 'https://run.xcrawl.com/v1/scrape' \
-H "Authorization: Bearer $XCRAWL_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"url": "https://xcrawl.com",
"js_render": {
"enabled": true,
"wait_until": "load",
"viewport": {"width": 1920, "height": 1080}
},
"output": {
"formats": ["markdown", "html"]
}
}'
See: JS rendering
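For static pages, the text above suggests turning JS rendering off for speed. A minimal sketch of such a request follows; it assumes the other js_render fields (wait_until, viewport) can be omitted when enabled is false, which the source does not confirm.

```shell
# Request body with JS rendering turned off.
# Assumption: wait_until and viewport may be left out when enabled is false.
payload='{
  "url": "https://example.com",
  "js_render": {"enabled": false},
  "output": {"formats": ["markdown"]}
}'

# Send the request only when an API key is set in the environment.
if [ -n "${XCRAWL_API_KEY:-}" ]; then
  curl -s -X POST 'https://run.xcrawl.com/v1/scrape' \
    -H "Authorization: Bearer $XCRAWL_API_KEY" \
    -H 'Content-Type: application/json' \
    -d "$payload"
fi
```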
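Structured data extraction
With the json output format, combined with prompt and an optional json_schema, you can extract structured data from page content, for example a product name and price from a product page. The request below sends only a prompt: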
curl -s -X POST 'https://run.xcrawl.com/v1/scrape' \
-H "Authorization: Bearer $XCRAWL_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"url": "https://example.com",
"output": {
"formats": ["json"]
},
"json": {
"prompt": "Extract the page title and main CTA link."
}
}'
Response example:
{
"scrape_id": "01KKE8JTFD20YP1KXPVJHGCS28",
"endpoint": "scrape",
"version": "dca0d4b3bff035e4",
"status": "completed",
"url": "https://example.com",
"data": {
"json": {
"description": "This domain is for use in documentation examples without needing permission. Avoid use in operations.",
"domain": {
"name": "example",
"permissions_required": false,
"purpose": "documentation examples"
},
"links": [
{
"text": "Learn more",
"url": "https://iana.org/domains/example"
}
],
"title": "Example Domain"
},
"credits_used": 5,
"credits_detail": {
"base_cost": 1,
"traffic_cost": 0,
"json_extract_cost": 4
}
},
"started_at": "2026-03-11T10:55:19Z",
"ended_at": "2026-03-11T10:55:26Z",
"total_credits_used": 5
}
When you need fixed fields and data types, provide a json_schema to constrain the output:
curl -s -X POST 'https://run.xcrawl.com/v1/scrape' \
-H "Authorization: Bearer $XCRAWL_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"url": "https://xcrawl.com",
"output": {
"formats": ["json"]
},
"json": {
"prompt": "Extract product name and price from the page.",
"json_schema": {
"type": "object",
"properties": {
"product_name": {"type": "string"},
"price": {"type": "string"}
},
"required": ["product_name", "price"]
}
}
}'
For the fields of the json output and how json_schema constraints work, see the Scrape API reference.
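Even with a schema in place, a client may want to confirm that the fields listed under required actually came back. The sketch below checks a hypothetical extraction result for the two required keys from the schema above; the sample values are placeholders rather than real API output, and the check is a plain substring match, not a full JSON validation.

```shell
# Hypothetical extraction result; the values are placeholders, not real API output.
result='{"product_name": "Sample Widget", "price": "19.99"}'

# Collect any field from the schema's "required" array that is absent.
missing=""
for field in product_name price; do
  case "$result" in
    *"\"$field\""*) ;;                  # quoted key found in the text
    *) missing="$missing $field" ;;     # quoted key not found
  esac
done

if [ -z "$missing" ]; then
  echo "all required fields present"
else
  echo "missing fields:$missing"
fi
```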
