Website Crawling AI

Learn how to index your public marketing pages and support documentation automatically. Sentrup crawls sitemaps, scrapes page content (including dynamically rendered pages), checks content hashes to skip unchanged pages, and generates dense vector embeddings.

How Does Sitemap and URL Discovery Work?

Sitemap and URL discovery works by fetching the public sitemap.xml file of a target website. Sentrup extracts the listed URLs using high-performance regex scanners. If no sitemap is present, the crawler falls back to extracting the outgoing hyperlinks on the homepage to map the website.
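The discovery flow above can be sketched in a few lines of Python. This is a minimal illustration, not Sentrup's actual scanner: the function names and both regular expressions are hypothetical, and restricting the homepage fallback to same-host links is an assumption made here so that the result maps a single website.

```python
import html
import re
from typing import Optional
from urllib.parse import urljoin, urlparse

# Hypothetical patterns for illustration; a production scanner may differ.
LOC_RE = re.compile(r"<loc>\s*(.*?)\s*</loc>", re.IGNORECASE | re.DOTALL)
HREF_RE = re.compile(r"""<a\s[^>]*?href=["']([^"'#]+)""", re.IGNORECASE)


def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL in a sitemap, in order, without duplicates."""
    urls = [html.unescape(u) for u in LOC_RE.findall(sitemap_xml)]
    return list(dict.fromkeys(urls))


def urls_from_homepage(base_url: str, page_html: str) -> list[str]:
    """Fallback: collect links found on the homepage.

    Assumption: only links on the same host are kept.
    """
    host = urlparse(base_url).netloc
    found = []
    for href in HREF_RE.findall(page_html):
        absolute = urljoin(base_url, html.unescape(href))
        if urlparse(absolute).netloc == host:
            found.append(absolute)
    return list(dict.fromkeys(found))


def discover(base_url: str, sitemap_xml: Optional[str], page_html: str) -> list[str]:
    """Prefer the sitemap; fall back to homepage links if it is missing or empty."""
    if sitemap_xml:
        urls = urls_from_sitemap(sitemap_xml)
        if urls:
            return urls
    return urls_from_homepage(base_url, page_html)
```

The functions take already-fetched text rather than performing HTTP requests, which keeps the parsing logic easy to test in isolation.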

What is the Waterfall Scraper Pipeline?

The waterfall scraper pipeline is a multi-tier scraping mechanism that optimizes crawl speed and accuracy. It processes static pages via fast HTML fetches, dynamically rendered guides via our dynamic document reader, and React or Next.js single-page applications via our single-page application scraper.

Scraper Tier	Service	Target Use Case	Latency	Cost
Tier 1 (Cheap Static)	Local HTML Fetch	Static pages, plain markdown, articles	<500ms	Free
Tier 2 (Medium Dynamic)	Dynamic Document Reader	Documentation sites, guides, API hubs	~1.2s	Low
Tier 3 (Heavy SPA/CF)	SPA Scraper	React/Next SPAs, Cloudflare-protected domains	~3.5s	Managed
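One way to picture the tiers in the table is as an ordered list that the crawler walks from cheapest to most expensive. The sketch below assumes that a page escalates to the next tier when the current tier fails or returns too little text; that escalation rule, the `min_chars` threshold, and the `Tier` and `waterfall_scrape` names are all hypothetical and only illustrate the general waterfall idea.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Tier:
    name: str
    # Takes a URL and returns extracted text, or None if nothing was extracted.
    scrape: Callable[[str], Optional[str]]


def waterfall_scrape(
    url: str, tiers: list[Tier], min_chars: int = 200
) -> Optional[tuple[str, str]]:
    """Try each tier in order and return (tier name, text) for the first
    result with at least min_chars of non-whitespace-trimmed text.

    Returns None if every tier fails or returns too little text.
    """
    for tier in tiers:
        try:
            text = tier.scrape(url)
        except Exception:
            # Assumption: any scraper error means "escalate to the next tier".
            text = None
        if text and len(text.strip()) >= min_chars:
            return tier.name, text
    return None
```

Ordering the list as Tier 1, Tier 2, Tier 3 means the slower, costlier scrapers only run for pages that the cheaper ones could not handle.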

How Does Incremental Hashing and Vectorization Work?

Incremental hashing and vectorization prevent redundant processing by computing a SHA-256 hash of each page's extracted text. Sentrup runs text chunking and vector embedding generation only if the content hash differs from the value stored in our vector database, saving API calls.
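The change check described above can be sketched with Python's standard `hashlib`. The in-memory dictionary stands in for the hashes stored alongside the vectors, and the `content_hash` and `pages_to_reindex` names are hypothetical; how Sentrup actually stores and looks up hashes is not shown here.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return the SHA-256 hex digest of the extracted page text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def pages_to_reindex(pages: dict[str, str], stored: dict[str, str]) -> dict[str, str]:
    """Return {url: new_hash} for pages whose text is new or has changed.

    pages maps URL -> extracted text; stored maps URL -> previously saved hash.
    The stored mapping is not modified, so the caller can save each new hash
    only after chunking and embedding for that page have succeeded.
    """
    changed = {}
    for url, text in pages.items():
        digest = content_hash(text)
        if stored.get(url) != digest:
            changed[url] = digest
    return changed
```

Returning the new hashes instead of writing them immediately avoids marking a page as up to date when its embedding step later fails.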

What is the Crawl Request API Payload?

The crawl request API payload is a structured JSON body sent via POST to initiate an automated crawl. It defines parameters such as target URL, page limit, path filters, and source priority to constrain the scope of the crawl.

POST /api/v1/admin/documents/crawl
Content-Type: application/json
Authorization: Bearer <your-admin-token>

{
  "url": "https://www.sentrup.com",
  "max_pages": 15,
  "source_priority": "high",
  "allowed_paths": ["/docs", "/faq", "/pricing"]
}
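For reference, the same request can be built from Python using only the standard library. This is a sketch under assumptions: the `build_crawl_request` helper is hypothetical, the API base URL must be supplied by the caller because it is not specified above, and the response format is not covered here, so the example stops at constructing the request.

```python
import json
import urllib.request


def build_crawl_request(
    base_url: str, admin_token: str, payload: dict
) -> urllib.request.Request:
    """Build the POST request for the crawl endpoint without sending it."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/admin/documents/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {admin_token}",
        },
        method="POST",
    )


# Same fields as the example payload above.
payload = {
    "url": "https://www.sentrup.com",
    "max_pages": 15,
    "source_priority": "high",
    "allowed_paths": ["/docs", "/faq", "/pricing"],
}

# To send it, pass the request to urllib.request.urlopen(), for example:
#   req = build_crawl_request("<your-api-base-url>", "<your-admin-token>", payload)
#   with urllib.request.urlopen(req) as resp:
#       print(resp.status)
```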
