Data Operations

High-volume multi-source data pipeline

100,000+ records/month, 200+ sources
CLIENT

A national event-aggregator platform

The problem

The client needed to ingest event listings from 200+ disparate sources (municipal calendars, library systems, museum sites, third-party feeds), normalize them into a single structured format, deduplicate across sources, and emit clean records ready for downstream consumption.

The approach

We designed a production pipeline that treats source heterogeneity as a first-class concern: per-source adapters, a unified intermediate schema, an LLM-assisted normalization layer for messy fields, and a deduplication pass that handles fuzzy matches across name, time, and venue. Observability is built in: every source has a freshness signal and a quality score.
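
As a rough illustration of the adapter-plus-unified-schema idea, here is a minimal Python sketch. It is not the production code: the `Event` fields, the `SourceAdapter` interface, and the `IcsLikeAdapter` class are all hypothetical names chosen for the example, and the toy adapter assumes its feed rows have already been parsed into dictionaries.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Protocol


@dataclass(frozen=True)
class Event:
    """Hypothetical unified intermediate record; field names are illustrative."""
    source_id: str
    name: str
    start: datetime
    venue: str


class SourceAdapter(Protocol):
    """Assumed interface: every source, whatever its format, yields Event records."""
    source_id: str

    def fetch(self) -> Iterable[Event]: ...


class IcsLikeAdapter:
    """Toy adapter for a calendar-style feed whose rows are already parsed dicts."""

    def __init__(self, source_id: str, rows: list[dict]):
        self.source_id = source_id
        self.rows = rows

    def fetch(self) -> Iterable[Event]:
        for row in self.rows:
            yield Event(
                source_id=self.source_id,
                name=row["SUMMARY"].strip(),
                start=datetime.fromisoformat(row["DTSTART"]),
                venue=row.get("LOCATION", "").strip(),
            )


def collect(adapters: Iterable[SourceAdapter]) -> list[Event]:
    """Run every adapter and gather its records in the shared schema."""
    events: list[Event] = []
    for adapter in adapters:
        events.extend(adapter.fetch())
    return events
```

Under this shape, adding a source means writing one more adapter class; everything downstream of `collect` only ever sees `Event` records.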

Architecture

  • Source adapters: One per source format, covering scraping (Playwright), API polling, ICS feed parsing, and vendor-specific exports.
  • Normalization layer: LLM-assisted cleanup of messy free-text fields, with output checked against the schema.
  • Deduplication: Fuzzy matching on name + time window + venue.
  • Output: Clean structured records emitted to the client’s primary database.
  • Monitoring: Per-source freshness, per-source quality score, alerting on drift.
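
The deduplication step can be sketched roughly as follows. This is an assumption-laden illustration, not the pipeline's actual matcher: it uses the standard library's `difflib.SequenceMatcher` for string similarity, and the thresholds and 30-minute window are made-up values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Event:
    """Hypothetical minimal record; only the fields used for matching."""
    name: str
    start: datetime
    venue: str


def _similar(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_duplicate(
    a: Event,
    b: Event,
    name_threshold: float = 0.85,          # assumed value, tune per dataset
    venue_threshold: float = 0.80,         # assumed value, tune per dataset
    window: timedelta = timedelta(minutes=30),  # assumed time window
) -> bool:
    """Two events match if they start close together and name and venue are similar."""
    if abs(a.start - b.start) > window:
        return False
    return (
        _similar(a.name, b.name) >= name_threshold
        and _similar(a.venue, b.venue) >= venue_threshold
    )


def deduplicate(events: list[Event]) -> list[Event]:
    """Keep the first event of each fuzzy-matching group, in start-time order."""
    kept: list[Event] = []
    for event in sorted(events, key=lambda e: e.start):
        if not any(is_duplicate(event, k) for k in kept):
            kept.append(event)
    return kept
```

The pairwise comparison here is quadratic in the number of events. At the volumes described above, a real implementation would likely first group candidates (for example by date or city) before comparing, but that is outside what this sketch shows.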

Results

  • 100,000+ records/month ingested, normalized, and emitted.
  • 200+ sources in production, with new sources added in under a day.
  • Client’s editorial team reclaimed approximately 30 hours/week previously spent on manual review.

Stack

Python, Playwright, LLM-assisted normalization, PostgreSQL, scheduled workers on AWS.
