Crawl API
Overview
POST /v1/crawl starts a bulk crawl that visits pages from an entry URL according to the rules you configure.
Request
Endpoint
POST https://run.xcrawl.com/v1/crawl
Headers
Content-Type: application/json
Authorization: Bearer <api_key>
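Assuming the endpoint and headers above, a request can be assembled with Python's standard library; the payload here is minimal (just the required url field), and <api_key> is a placeholder for your own key:

```python
import json
import urllib.request

# Build a minimal crawl request. It is constructed but not sent here;
# uncomment the urlopen call to actually submit it.
payload = {"url": "https://docs.xcrawl.com/doc/"}
req = urllib.request.Request(
    "https://run.xcrawl.com/v1/crawl",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <api_key>",  # replace with your API key
    },
    method="POST",
)
# urllib.request.urlopen(req)  # submits the crawl
```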
Request body
Top-level fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | - | Site entry URL |
| crawler | object | No | - | Crawler config |
| proxy | object | No | - | Proxy config |
| request | object | No | - | Request config |
| js_render | object | No | - | JS rendering config |
| output | object | No | - | Output config |
| webhook | object | No | - | Async callback config |
crawler
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| limit | integer | No | 100 | Max pages |
| include | string[] | No | - | Only URLs matching patterns (regex supported) |
| exclude | string[] | No | - | Exclude URLs matching patterns (regex supported) |
| max_depth | integer | No | 3 | Max depth from entry URL |
| include_entire_domain | boolean | No | false | Crawl entire site instead of only subpaths |
| include_subdomains | boolean | No | false | Include subdomains |
| include_external_links | boolean | No | false | Include external links |
| sitemaps | boolean | No | true | Use the site's sitemap |
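The service applies include/exclude filtering server-side; the exact matching behavior (e.g. search vs. full match) is not specified here, but the general semantics — keep a URL only if it matches at least one include pattern (when include is given) and no exclude pattern — can be sketched client-side:

```python
import re

def url_allowed(url, include=None, exclude=None):
    """Illustrative mirror of the crawler include/exclude semantics:
    a URL is kept if it matches some include pattern (when include is
    given) and matches no exclude pattern. Uses re.search; the service's
    exact matching rule may differ."""
    if include and not any(re.search(p, url) for p in include):
        return False
    if exclude and any(re.search(p, url) for p in exclude):
        return False
    return True

print(url_allowed("https://docs.xcrawl.com/doc/api", include=[r"/doc/"]))   # True
print(url_allowed("https://docs.xcrawl.com/blog/post", include=[r"/doc/"])) # False
print(url_allowed("https://docs.xcrawl.com/doc/old",
                  include=[r"/doc/"], exclude=[r"/old"]))                   # False
```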
proxy
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| location | string | No | US | ISO-3166-1 alpha-2 country code, e.g. US / JP / SG |
| sticky_session | string | No | Auto-generated | Sticky session ID; requests that reuse the same ID attempt to keep the same proxy exit |
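To keep two crawls on the same proxy exit, send the same sticky_session value in both payloads. The ID format below (a UUID hex string) is an assumption for illustration; the API auto-generates an ID when the field is omitted:

```python
import uuid

# Client-chosen sticky session ID (format is our assumption; any stable
# string should work, and the API generates one if the field is omitted).
session_id = uuid.uuid4().hex

payload_a = {"url": "https://example.com/a",
             "proxy": {"location": "JP", "sticky_session": session_id}}
payload_b = {"url": "https://example.com/b",
             "proxy": {"location": "JP", "sticky_session": session_id}}
```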
request
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| locale | string | No | en-US,en;q=0.9 | Affects Accept-Language |
| device | string | No | desktop | desktop / mobile, affects UA & viewport |
| cookies | object map | No | - | Cookie key/value pairs |
| headers | object map | No | - | Header key/value pairs |
| only_main_content | boolean | No | true | Return main content only |
| block_ads | boolean | No | true | Attempt to block ad resources |
| skip_tls_verification | boolean | No | true | Skip TLS verification |
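An illustrative request-level configuration follows; the cookie and header values are placeholders. How the service serializes the cookies map is not documented here, but the conventional Cookie-header form is shown for reference:

```python
# Illustrative request config; values are placeholders, not API requirements.
request_config = {
    "locale": "ja-JP,ja;q=0.9",
    "device": "mobile",
    "cookies": {"session": "abc123", "theme": "dark"},
    "headers": {"X-Trace-Id": "demo-1"},
    "only_main_content": True,
    "block_ads": True,
}
payload = {"url": "https://example.com/", "request": request_config}

# Conventional serialization of a cookie map into a Cookie header value
# (shown for reference; the service handles this itself).
cookie_header = "; ".join(f"{k}={v}" for k, v in request_config["cookies"].items())
print(cookie_header)  # session=abc123; theme=dark
```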
js_render
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| enabled | boolean | No | true | Enable browser rendering |
| wait_until | string | No | load | load / domcontentloaded / networkidle |
| viewport.width | integer | No | - | Viewport width (desktop 1920, mobile 402) |
| viewport.height | integer | No | - | Viewport height (desktop 1080, mobile 874) |
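The device-dependent viewport defaults from the table (desktop 1920x1080, mobile 402x874) can be expressed as a small helper when building payloads; the helper name is ours, not part of the API:

```python
# Default viewport sizes per the js_render table; explicit
# viewport.width / viewport.height values override them.
DEFAULT_VIEWPORT = {
    "desktop": {"width": 1920, "height": 1080},
    "mobile": {"width": 402, "height": 874},
}

def effective_viewport(device="desktop", width=None, height=None):
    """Resolve the viewport the renderer would use (illustrative helper)."""
    base = dict(DEFAULT_VIEWPORT[device])
    if width is not None:
        base["width"] = width
    if height is not None:
        base["height"] = height
    return base

print(effective_viewport("mobile"))  # {'width': 402, 'height': 874}
```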
output
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| formats | string[] | No | ["markdown"] | Output formats; see the enum below |
| screenshot | string | No | viewport | full_page / viewport; applies only when formats includes screenshot |
| json.prompt | string | No | - | Extraction prompt |
| json.json_schema | object | No | - | JSON Schema (https://json-schema.org/docs) |
output.formats enum:
html / raw_html / markdown / links / summary / screenshot / json
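An output config requesting structured extraction might look like the sketch below; the schema fields (title, author) are invented for illustration, and validating formats before submission is a client-side convenience, not an API requirement:

```python
# The output.formats enum from above.
FORMATS = {"html", "raw_html", "markdown", "links", "summary", "screenshot", "json"}

# Example output config: markdown plus JSON extraction. The schema's
# "title"/"author" fields are illustrative, not mandated by the API.
output = {
    "formats": ["markdown", "json"],
    "json": {
        "prompt": "Extract the page title and author",
        "json_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "author": {"type": "string"},
            },
        },
    },
}

# Client-side sanity check against the enum.
unknown = [f for f in output["formats"] if f not in FORMATS]
assert not unknown, f"unknown formats: {unknown}"
```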
webhook
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | No | - | Callback URL |
| headers | object map | No | - | Custom callback headers |
| events | string[] | No | ["started","completed","failed"] | Events: started / completed / failed |
webhook.events meanings:
started: task started
completed: task completed successfully
failed: task failed
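A receiving endpoint mostly needs to dispatch on the event name. The callback body shape below (an "event" key) is an assumption for illustration; check the payload your deployment actually sends:

```python
def handle_webhook(event_payload):
    """Dispatch on the webhook event name. The 'event' key is assumed
    for illustration; the real callback body may use a different field."""
    event = event_payload.get("event")
    if event == "started":
        return "crawl started"
    if event == "completed":
        return "crawl completed"
    if event == "failed":
        return "crawl failed"
    return "ignored"

print(handle_webhook({"event": "completed"}))  # crawl completed
```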
Response
| Field | Type | Description |
|---|---|---|
| crawl_id | string | Task ID |
| endpoint | string | Always crawl |
| version | string | Version |
| status | string | Queue status at creation, typically pending or crawling |
Examples
Request example
{
"url": "https://docs.xcrawl.com/doc/",
"crawler": {
"limit": 1,
"max_depth": 1
},
"output": {
"formats": [
"markdown"
]
}
}
Response example
{
"crawl_id": "01KKE8BNNVQH9PCYEEKJGXKE07",
"endpoint": "crawl",
"version": "dca0d4b3bff035e4",
"status": "crawling"
}