Data platform · Cloud · Full-stack .NET

RA Import Platform

A production-grade data-aggregation and contact-enrichment engine that scrapes regulatory records for residential assisted-living facilities across all 50 U.S. states, normalizes them into a single source of truth, verifies contact data, and feeds marketing systems — running fully automated on Azure.

50 U.S. states + DC covered
59 source-specific scrapers
~15.5k lines of service code
.NET 10, containerized on Azure
01

What it does

A single automated pipeline turns fragmented public health-department data into a clean, sales-ready contact database.

🗺️

Nationwide coverage

Harvests every state's assisted-living registry — from modern open-data APIs to PDF-only records and CAPTCHA-gated portals — into one consistent dataset.

Verified contacts

Enriches each facility with a website, email, and phone, then validates deliverability so only real, mailable contacts reach the marketing team.

🔁

Always current

Runs on a schedule, tracks change history per facility, and exports ready-to-import lists for ActiveCampaign (email) and Postalytics (direct mail).

02

Architecture

Layered pipeline — acquire → normalize → enrich → export — behind a secured API, orchestrated by scheduled background workers.

DATA SOURCES
  • REST / Open-Data APIs: ArcGIS · Socrata
  • Portals (Playwright): JS forms · pagination
  • Files & PDFs: Excel · PdfPig
SCRAPER REGISTRY
  • 59 keyed IScraperService implementations resolved from a single registry map
NORMALIZE
  • Ingest + Upsert: stable FacilityKey (idempotent SHA-256)
  • Enrichment: Google Places → site scrape → ZeroBounce verify
SOURCE OF TRUTH
  • Azure SQL: Facility · Address · Person · Email · Phone · Website
  • Details (JSON) + history timeline
  • Dapper · 11 repos
ASP.NET CORE WEB API
  • X-Api-Key · Swagger · scheduled background workers
  • /scrape · /enrich · /latest · /latest-multi · /status · /states
  • on-demand + interval scheduler (default 24h) · in-memory result store · SQL warm-up on boot
DELIVERY
  • ActiveCampaign CSV: email
  • Postalytics CSV: direct mail
  • JSON API: downstream CRM
03

Technology stack

Modern .NET, real browser automation, and a normalized SQL model — deployed as a container on Azure with CI/CD.

Platform & language
C# · .NET 10 · ASP.NET Core Web API · Swagger / OpenAPI · Async/await throughout
Data & scraping
Azure SQL · Dapper 2.1 · Microsoft Playwright 1.54 · HtmlAgilityPack · ClosedXML (Excel) · PdfPig (PDF) · CsvHelper
Cloud & DevOps
Docker multi-stage · Azure Container Apps · Azure Container Registry · Azure DevOps CI/CD · Azure Key Vault · Managed Identity
Integrations
Google Places API · ZeroBounce verification · ActiveCampaign · Postalytics
04

Engineering highlights for technical reviewers

The problems that made this hard, and the patterns used to solve them.

Registry-driven scraper fan-out

A single KnownStates map registers 59 scrapers as keyed DI services. Adding a state is a one-line registration — no factory or switch logic. Multi-track states (OH has 3 license systems, CA/AZ have 3 sources each) coexist cleanly.
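As a rough, dependency-free sketch of the registry idea: the map below stands in for the keyed DI registrations described above, and `Activator.CreateInstance` stands in for the container. The `SourceKey` member, the scraper class names, and the keys ("TX", "OH-A", "OH-B") are hypothetical and not taken from the actual codebase.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical contract; the real IScraperService members are not shown in the source.
public interface IScraperService
{
    string SourceKey { get; }
}

// Hypothetical scrapers. A multi-track state simply gets one key per source.
public sealed class TexasScraper : IScraperService { public string SourceKey => "TX"; }
public sealed class OhioTrackAScraper : IScraperService { public string SourceKey => "OH-A"; }
public sealed class OhioTrackBScraper : IScraperService { public string SourceKey => "OH-B"; }

public static class ScraperRegistry
{
    // One entry per source: adding a state means adding one line here.
    public static readonly IReadOnlyDictionary<string, Type> KnownStates =
        new Dictionary<string, Type>(StringComparer.OrdinalIgnoreCase)
        {
            ["TX"] = typeof(TexasScraper),
            ["OH-A"] = typeof(OhioTrackAScraper),
            ["OH-B"] = typeof(OhioTrackBScraper),
        };

    // Stand-in for resolving a keyed service from the DI container.
    public static IScraperService Resolve(string key)
    {
        if (!KnownStates.TryGetValue(key, out var type))
            throw new KeyNotFoundException($"No scraper registered for '{key}'.");
        return (IScraperService)Activator.CreateInstance(type)!;
    }
}
```

Callers look a scraper up by key, for example `ScraperRegistry.Resolve("oh-a")`, so no switch statement has to grow as sources are added.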

Stable identity & idempotent upserts

Every facility gets a deterministic FacilityKey (SHA-256 of canonical name + address). Re-scrapes upsert by key, so history and hard-won enrichment data survive re-runs instead of being overwritten.
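A minimal sketch of how such a key could be derived. The canonicalization rules here (trim, collapse whitespace, upper-case, join with "|") are assumptions for illustration; the source only states that the key is a SHA-256 of canonical name + address.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using System.Text.RegularExpressions;

public static class FacilityKey
{
    // Assumed canonical form: trimmed, single-spaced, upper-cased.
    private static string Canonicalize(string value) =>
        Regex.Replace(value.Trim(), @"\s+", " ").ToUpperInvariant();

    // Same name + address always yields the same key, so re-scrapes can upsert by it.
    public static string Compute(string name, string address)
    {
        string canonical = Canonicalize(name) + "|" + Canonicalize(address);
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        return Convert.ToHexString(hash); // 64 hex characters
    }
}
```

Because cosmetic differences are normalized away before hashing, `FacilityKey.Compute("  Sunrise  Manor ", "12 Oak St")` and `FacilityKey.Compute("sunrise manor", "12  OAK st")` produce the same key, while a different address produces a different one.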

Normalized schema + JSON grab-bag

Core entities (Facility, Address, Person, Email, Phone, Website) are relational. Volatile state-specific fields live in an ISJSON-checked Details column, with changes recorded in an append-only history table, so the schema stays stable as each state's quirks change.

Cost-aware enrichment

Email verification is metered per call, so verdicts are cached in-process and in SQL to avoid re-billing. A Facebook fallback with a login-wall circuit breaker recovers contacts the primary path misses.
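The in-process half of that caching could look roughly like the sketch below. `VerdictCache`, the `Func<string, string>` verifier delegate, and the string verdicts are hypothetical stand-ins; the SQL-backed layer and the actual ZeroBounce call are not shown.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class VerdictCache
{
    // Email addresses are compared case-insensitively so "A@x.com" and "a@x.com" share a verdict.
    private readonly ConcurrentDictionary<string, string> _verdicts =
        new(StringComparer.OrdinalIgnoreCase);

    private readonly Func<string, string> _paidVerifier; // hypothetical metered call
    private int _billedCalls;

    public VerdictCache(Func<string, string> paidVerifier) => _paidVerifier = paidVerifier;

    // Number of times the metered verifier was actually invoked.
    public int BilledCalls => _billedCalls;

    public string GetVerdict(string email)
    {
        // Only a cache miss reaches the paid verifier. Under concurrent first lookups
        // GetOrAdd may run the factory more than once; wrap values in Lazy<T> if that matters.
        return _verdicts.GetOrAdd(email.Trim(), key =>
        {
            Interlocked.Increment(ref _billedCalls);
            return _paidVerifier(key);
        });
    }
}
```

Repeated lookups of the same address return the stored verdict, so only the first lookup counts toward `BilledCalls`.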

Heterogeneous source handling

One pipeline absorbs ArcGIS/Socrata REST feeds, JS-heavy portals via real Chromium, Excel workbooks, and PDF-only state records parsed positionally — plus a reCAPTCHA-gated portal handled via snapshot.

Cloud-native operations

Multi-stage Docker image bakes Chromium + system deps for headless scraping in-container. Background workers run scrapes on an interval without blocking Kestrel startup; a warm-up service rebuilds in-memory state from SQL on boot.
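The interval loop at the heart of such a worker might be sketched as follows. `IntervalRunner` and the `job` delegate are hypothetical names; in an ASP.NET Core host a loop like this would typically sit inside a hosted background service, which is not shown here.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class IntervalRunner
{
    // Runs the job once immediately, then once per interval until cancellation.
    public static async Task RunAsync(
        Func<CancellationToken, Task> job,
        TimeSpan interval,
        CancellationToken stoppingToken)
    {
        using var timer = new PeriodicTimer(interval);
        try
        {
            do
            {
                try
                {
                    await job(stoppingToken);
                }
                catch (Exception ex) when (ex is not OperationCanceledException)
                {
                    // A failed run is logged and the schedule continues.
                    Console.Error.WriteLine($"Scheduled run failed: {ex.Message}");
                }
            }
            while (await timer.WaitForNextTickAsync(stoppingToken));
        }
        catch (OperationCanceledException)
        {
            // Cancellation is the normal shutdown path.
        }
    }
}
```

With the default described above, the call would pass `TimeSpan.FromHours(24)` as the interval. Catching per-run exceptions inside the loop keeps one bad scrape from ending the schedule.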

05

Data & API surface

Normalized SQL model
  • Facility — identity, source, active/seen/scraped timestamps
  • Address / Person / FacilityPerson — typed addresses, role-carrying links (Owner, Administrator, Agent…)
  • Email / Phone / Website — owned by facility or person; carry source & verification state
  • FacilityDetails + History — current JSON snapshot plus append-only change timeline
  • ScrapeRun — per-run audit: counts, success, errors
API endpoints
  • POST /scrape — trigger an on-demand state scrape
  • POST /enrich · /enrich-all — run contact enrichment
  • GET /latest · /latest-multi — results as JSON or CSV
  • GET /status — facility counts & scrape-run history
  • GET /states — catalog of supported sources
  • Secured with X-Api-Key; documented via Swagger
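For illustration, a client might build an authenticated call to the scrape endpoint along these lines. Only the `POST /scrape` route and the `X-Api-Key` header come from the list above; the `state` query parameter, the `RaImportClient` name, and the base address are assumptions.

```csharp
using System;
using System.Net.Http;

public static class RaImportClient
{
    // Builds (but does not send) a POST /scrape request carrying the API key header.
    public static HttpRequestMessage BuildScrapeRequest(Uri baseAddress, string apiKey, string state)
    {
        // "state" as a query parameter is an assumption; the real request shape is not documented here.
        var uri = new Uri(baseAddress, $"/scrape?state={Uri.EscapeDataString(state)}");
        var request = new HttpRequestMessage(HttpMethod.Post, uri);
        request.Headers.Add("X-Api-Key", apiKey);
        return request;
    }
}
```

The request can then be sent with `await new HttpClient().SendAsync(request)`; the Swagger page mentioned above would be the place to confirm the actual parameters.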
06

Capabilities demonstrated for services & hiring

What building and running this system proves I can deliver.

Full-stack .NET / ASP.NET Core · Web scraping & browser automation at scale · Data engineering & pipeline design · Relational schema design & SQL · Third-party API integration · Data quality & email deliverability · Docker & containerization · Azure cloud (Container Apps, SQL, Key Vault, ACR) · CI/CD pipeline authoring · Background-job / scheduler design · Cost optimization · API design & security · Marketing-ops integration (ActiveCampaign, direct mail)