Building a Website Contact Scraper API in .NET 10: Crawling, Extraction, and a Cloudflare Problem I Can't Fully Solve
I built an API that takes a domain and returns emails, phones, social profiles, and company info. One call:
GET /api/v1/website/contacts?domain=stripe.com
Returns verified emails with confidence scores, phones, LinkedIn/Twitter/GitHub links, and crawl metadata. Here's how the interesting parts work.
Architecture
Clean layered architecture — Api → Application → Domain, with Infrastructure implementing the Application interfaces. The controller is 12 lines of plumbing. Everything real happens in the crawler and extractor.
The Two-Phase Crawler
The crawler uses a priority queue and runs in two phases.
Fast path — first 18 pages, only high-value routes: /contact, /about, /privacy, /legal. Gets real contacts in under 2 seconds for most sites.
Stage two — deferred URLs get promoted once the fast path finishes. Handles sites where contacts are buried under /company/offices/regional/emea/contact.
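The two-phase scheduling above can be sketched with .NET's built-in `PriorityQueue` (a min-heap, so scores are negated to pop high-value URLs first). This is a hedged illustration of the idea, not the actual crawler; the `deferred` list and the score-100 cutoff are my assumptions:

```csharp
// Sketch of two-phase scheduling: low-value URLs are parked during the
// fast path and promoted to the queue once it finishes.
var queue = new PriorityQueue<string, int>();
var deferred = new List<(string Url, int Score)>();

// Illustrative cutoff: below 100 the URL waits for stage two.
void Enqueue(string url, int score, bool fastPathDone)
{
    if (!fastPathDone && score < 100)
        deferred.Add((url, score));
    else
        queue.Enqueue(url, -score); // negate: PriorityQueue is a min-heap
}

Enqueue("https://example.com/blog", 40, fastPathDone: false);
Enqueue("https://example.com/contact", 120, fastPathDone: false);

// Fast path done: promote everything that was deferred.
foreach (var (url, score) in deferred)
    queue.Enqueue(url, -score);

Console.WriteLine(queue.Dequeue()); // → https://example.com/contact
```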
Every URL gets a priority score before entering the queue:
private static readonly (string Segment, int Score)[] PriorityPathSegments =
[
    ("/contact", 120),
    ("/contact-us", 118),
    ("/support", 115),
    ("/privacy", 110),
    ("/about", 95),
    ...
];
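Scoring against that table can be as simple as a longest-match lookup. This is a sketch under my own assumptions (the abbreviated table and the `UrlPriority.Score` name are illustrative, not the production code):

```csharp
using System.Linq;

static class UrlPriority
{
    // Abbreviated copy of the segment table for illustration.
    static readonly (string Segment, int Score)[] PriorityPathSegments =
    [
        ("/contact", 120),
        ("/contact-us", 118),
        ("/support", 115),
    ];

    public static int Score(Uri url)
    {
        var path = url.AbsolutePath.TrimEnd('/');
        // Longest segment first so "/contact-us" wins over "/contact".
        foreach (var (segment, score) in PriorityPathSegments
                     .OrderByDescending(p => p.Segment.Length))
            if (path.EndsWith(segment, StringComparison.OrdinalIgnoreCase))
                return score;
        return 0; // no known high-value segment: defer to stage two
    }
}
```

For example, `UrlPriority.Score(new Uri("https://example.com/en/contact"))` returns 120, while an unscored route like `/blog` returns 0 and waits for stage two.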
Route family deduplication strips locale prefixes so /en/contact, /fr/contact, /de/contact are treated as one family and fetched once. This was the highest-leverage optimization — cut unnecessary fetches dramatically on international sites.
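The locale-stripping step can be sketched as a regex that removes a leading two-letter language code (optionally with a region) before computing the family key. The pattern here is my guess at the approach, not the actual implementation; a real version would validate against a known locale list so routes like `/tv/...` aren't falsely stripped:

```csharp
using System.Text.RegularExpressions;

static class RouteFamily
{
    // Strips a leading locale segment like /en, /fr, or /en-us so localized
    // copies of the same route collapse into one family key.
    static readonly Regex LocalePrefix =
        new(@"^/[a-z]{2}(-[a-z]{2})?(?=/|$)", RegexOptions.IgnoreCase);

    public static string Key(Uri url) =>
        LocalePrefix.Replace(url.AbsolutePath, "") is { Length: > 0 } p ? p : "/";
}
```

With this, `/en/contact`, `/fr/contact`, and `/de/contact` all map to the family key `/contact` and are fetched once.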
Email Extraction
Five passes over each page's DOM: text nodes, mailto: anchors, data-cfemail attributes, element attributes, and JSON-LD blocks.
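In the real pipeline these passes walk the AngleSharp DOM; as a simplified stand-in, the `mailto:` pass boils down to pulling the address out of the href and normalizing it (a regex over raw HTML is my simplification, not the production approach):

```csharp
using System.Linq;
using System.Text.RegularExpressions;

static class MailtoPass
{
    // Captures the address portion of href="mailto:...", stopping at any
    // query string like ?subject=... that may follow.
    static readonly Regex Mailto = new(
        @"href\s*=\s*[""']mailto:([^""'?]+)",
        RegexOptions.IgnoreCase);

    public static IEnumerable<string> Extract(string html) =>
        Mailto.Matches(html)
              .Select(m => m.Groups[1].Value.Trim().ToLowerInvariant());
}
```

So `<a href="mailto:Sales@Example.com?subject=hi">` yields `sales@example.com`.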
The Cloudflare email decoder was satisfying to build: the attribute value is hex, the first byte is an XOR key, and every following byte is one email character XOR'd against it.
private static string? DecodeCloudflareProtectedEmail(IElement element)
{
    var encoded = element.GetAttribute("data-cfemail");
    if (string.IsNullOrWhiteSpace(encoded) || encoded.Length % 2 != 0) return null;

    // First hex pair is the XOR key.
    var key = Convert.ToByte(encoded[..2], 16);

    // Each remaining hex pair is one email character XOR'd with the key.
    var characters = new char[(encoded.Length / 2) - 1];
    for (var i = 2; i < encoded.Length; i += 2)
        characters[(i / 2) - 1] = (char)(Convert.ToByte(encoded.Substring(i, 2), 16) ^ key);

    return new string(characters);
}
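The same XOR scheme works directly on a raw attribute value, which makes it easy to verify by hand. The string-based variant below is mine (the scheme is Cloudflare's); with key byte 0x41, the payload "412001236f22" round-trips to a@b.c:

```csharp
// Standalone variant that takes the data-cfemail value directly,
// useful for unit-testing the XOR scheme without a DOM.
static string? DecodeCfEmail(string encoded)
{
    if (string.IsNullOrWhiteSpace(encoded) || encoded.Length % 2 != 0) return null;
    var key = Convert.ToByte(encoded[..2], 16);            // first pair = XOR key
    var chars = new char[(encoded.Length / 2) - 1];
    for (var i = 2; i < encoded.Length; i += 2)
        chars[(i / 2) - 1] = (char)(Convert.ToByte(encoded.Substring(i, 2), 16) ^ key);
    return new string(chars);
}

Console.WriteLine(DecodeCfEmail("412001236f22")); // → a@b.c
```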
Each email gets a confidence score built from multiple signals: domain match, role-based address, mailto: source, page context, footer placement, surrounding phrase ("email us at", "send resumes to"). Scoring beats hard accept/reject rules — real-world emails are messy.
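The additive scoring can be sketched like this; the signal names mirror the list above, but the weights and the cap are illustrative assumptions, not the production values:

```csharp
// Hypothetical additive scoring over the signals described above.
// Weights are illustrative, not the production tuning.
record EmailSignals(
    bool DomainMatches,      // email domain == crawled domain
    bool FromMailto,         // found in a mailto: anchor
    bool RoleBased,          // info@, sales@, support@ ...
    bool OnContactPage,
    bool InFooter,
    bool NearContactPhrase); // "email us at", "send resumes to"

static class EmailScorer
{
    public static double Confidence(EmailSignals s)
    {
        var score = 0.0;
        if (s.DomainMatches)     score += 0.35;
        if (s.FromMailto)        score += 0.25;
        if (s.OnContactPage)     score += 0.15;
        if (s.InFooter)          score += 0.10;
        if (s.NearContactPhrase) score += 0.10;
        if (s.RoleBased)         score += 0.05; // real but less personal
        return Math.Min(score, 1.0);
    }
}
```

The payoff of scoring over hard rules: a role-based address on a contact page with a mailto: anchor still lands well above the reporting threshold instead of being thrown away by a blanket "reject role addresses" rule.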
Social Extraction
JSON-LD sameAs fields are the most reliable source. Sites that care about SEO publish their structured data carefully. Footer anchor tags are noisier — share buttons, partner links, and embedded widgets all look like profiles. Weighting sameAs much higher than anchors halved the false-positive rate.
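Pulling sameAs out of a JSON-LD block with System.Text.Json is straightforward. This sketch handles only a flat object; real pages often wrap entities in @graph arrays, which it ignores:

```csharp
using System.Linq;
using System.Text.Json;

static class JsonLdSocial
{
    // Returns the sameAs URLs from a single flat JSON-LD object.
    public static IReadOnlyList<string> SameAs(string jsonLd)
    {
        using var doc = JsonDocument.Parse(jsonLd);
        if (doc.RootElement.ValueKind == JsonValueKind.Object &&
            doc.RootElement.TryGetProperty("sameAs", out var sameAs) &&
            sameAs.ValueKind == JsonValueKind.Array)
            return sameAs.EnumerateArray()
                         .Where(e => e.ValueKind == JsonValueKind.String)
                         .Select(e => e.GetString()!)
                         .ToList();
        return [];
    }
}
```

Fed `{"@type":"Organization","sameAs":["https://twitter.com/stripe"]}`, this returns the one profile URL; anchors scraped from the footer then only confirm or down-weight, never outvote, these.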
The Cloudflare Problem I Haven't Fully Solved
This is where I'm stuck and genuinely want input from anyone who's dealt with this.
Locally, the crawler handles Cloudflare-protected sites reasonably well — persistent cookie jar, correct Sec-Fetch-* headers, headless Chrome fallback with a spoofed user agent. Works fine on my machine.
In production on Railway (datacenter IP), the same code gets blocked on a significant percentage of Cloudflare-protected sites. Challenge pages, 403s, silent blocks. The headless fallback helps but doesn't fully solve it.
My current setup:
// Persistent cookie jar across requests
handler.UseCookies = true;
handler.CookieContainer = new CookieContainer();

// Full Chrome header fingerprint
client.DefaultRequestHeaders.TryAddWithoutValidation("Sec-Fetch-Dest", "document");
client.DefaultRequestHeaders.TryAddWithoutValidation("Sec-Fetch-Mode", "navigate");
client.DefaultRequestHeaders.TryAddWithoutValidation("sec-ch-ua",
    "\"Google Chrome\";v=\"135\", \"Not-A.Brand\";v=\"8\"");
I understand the core issue — datacenter IPs are pre-scored as high-risk by Cloudflare regardless of headers. Residential proxies are the obvious answer but add cost and complexity I haven't wired up yet.
What I'm wondering:
- Has anyone solved this cleanly in .NET without proxies?
- Is there a proxy provider that works well for this use case without breaking the bank?
- Any other signals I'm missing that would help on datacenter IPs?
You can test the API yourself and see where it succeeds and fails — free tier, no credit card:
👉 https://rapidapi.com/zoktrapi-zoktrapi-default/api/website-contacts-finder
If you find a domain where results are wrong or missing, drop it in the comments. Genuinely useful for debugging.
Stack
.NET 10 · ASP.NET Core · HtmlAgilityPack · AngleSharp · Redis · headless Chrome · Railway
Happy to answer questions — and really hoping someone has cracked the datacenter IP problem.