Website Crawling AI
Learn how to index your public marketing pages and support documentation automatically. Sentrup crawls sitemaps, scrapes content dynamically, checks hashes, and generates high-density vector embeddings.
How Does Sitemap and URL Discovery Work?
Sitemap and URL discovery works by fetching the public sitemap.xml file of a target website. Sentrup parses the URLs using high-performance regex scanners. If no sitemap is present, the crawler automatically extracts outgoing hyperlinks on the homepage to map the website.
What is the Waterfall Scraper Pipeline?
The waterfall scraper pipeline is a multi-tier scraping mechanism that optimizes crawl speed and accuracy. It processes static pages via fast HTML fetches, dynamically rendered guides via our dynamic document reader, and React or Next.js single-page applications via our single-page application scraper.
| Scraper Tier | Service | Target Use Case | Latency | Cost |
|---|---|---|---|---|
| Tier 1 (Cheap Static) | Local HTML Fetch | Static pages, plain markdown, articles | <500ms | Free |
| Tier 2 (Medium Dynamic) | Dynamic Document Reader | Documentation sites, guides, API hubs | ~1.2s | Low |
| Tier 3 (Heavy SPA/CF) | SPA Scraper | React/Next SPAs, Cloudflare-protected domains | ~3.5s | Managed |
How Does Incremental Hashing and Vectorization Work?
Incremental hashing and vectorization prevents redundant processing by computing SHA-256 hashes of extracted text. Sentrup only runs text chunking and vector embedding generation if the content hash differs from the stored values in our vector database, saving API calls.
What is the Crawl Request API Payload?
The crawl request API payload is a structured JSON body sent via POST to initiate an automated crawl. It defines parameters such as target URL, page limit, path filters, and source priority to constrain the scope of the crawl.
POST /api/v1/admin/documents/crawl
Content-Type: application/json
Authorization: Bearer <your-admin-token>
{
"url": "https://www.sentrup.com",
"max_pages": 15,
"source_priority": "high",
"allowed_paths": ["/docs", "/faq", "/pricing"]
}