Pipeline Architecture SOP

Staging-first data pipeline for all sources — members, bills, and finance

Reference implementation: New York (NY) — sync_nyleg_members.py, sync_nyleg_bills.py, promote_nyleg_staging_to_canonical.py.
Companion docs: STATE-PRIMARY-INGEST-SOP.md · Site Architecture
1. Core Principle: Staging First

Nothing from the outside touches canonical tables directly.

Every data source writes to its own staging tables. Promotion to canonical is a separate, explicit step with pre-validation and post-validation.

External API
    ↓ ingest (HTTP → JSON)
Staging Tables (raw JSON, per-source)
    ↓ promote (validate → transform → upsert)
Canonical Tables (entities_*, bill_texts, bill_sponsorships)
    ↓ refresh
Materialized Views (mv_bills_list, mv_legislator_profiles, ...)
    ↓ serve
API / Site

Why staging-first

BenefitHow
IsolationIngest can fail/retry/re-run without corrupting canonical data
AuditabilityRaw JSON preserved; inspect before promoting
IdempotencyON CONFLICT upserts on staging; re-ingest is safe
Admin controlFuture admin UI: review staging → click “Promote” → see results
Dependency enforcementPromote pre-validates that upstream data exists (e.g., members before bills)
Decoupled scoringPromote enqueues scoring_queue; scoring is never auto-triggered

DB size

Staging tables are JSON blobs. At current scale (40K NY, 339 federal, 60 Open States) this is negligible. Archive or truncate old staging rows after confirmed promotion if needed.

2. Two Tracks: Members and Bills

Every data source has two parallel tracks with a hard dependency between them.

┌─────────────────────────────────────────────────────┐
│ MEMBERS TRACK (must complete first)                 │
│                                                     │
│ ingest → {source}_member_staging                    │
│ promote → people + entities_legislators             │
│           + people_offices                          │
└──────────────────────┬──────────────────────────────┘
                       │ dependency: sponsor resolution
┌──────────────────────┴──────────────────────────────┐
│ BILLS TRACK                                         │
│                                                     │
│ ingest → {source}_bill_staging                      │
│ promote → entities_bills + bill_texts               │
│           + bill_sponsorships + scoring_queue        │
│ refresh → mv_bills_list + mv_bill_detail            │
│           + mv_legislator_profiles                  │
└─────────────────────────────────────────────────────┘

Why members first: Bill sponsorship resolution maps sponsor IDs to entities_legislators rows. If members aren't promoted, sponsors silently fail (the NY bug we discovered — 79/39,780 bills had sponsors because members were synced after the first promote run).

3. Staging Tables (per source)

3.1 Naming convention

{source}_member_staging    — raw member/legislator JSON
{source}_bill_staging      — raw bill list JSON + optional detail JSON

3.2 Required columns (minimum)

Member staging

ColumnTypePurpose
source_member_idTEXTVendor's unique member ID (PK component)
session_yearINTLegislative session (PK component, nullable if not applicable)
chamberTEXTsenate/house/assembly
raw_jsonJSONBFull API response payload
supplement_jsonJSONBOptional gap-fill data (e.g., Open States photo/bio)
fetched_atTIMESTAMPTZWhen ingested
merged_atTIMESTAMPTZWhen promoted (NULL = not yet promoted)

Bill staging

ColumnTypePurpose
session_yearINTLegislative session (PK component)
print_no / identifierTEXTBill identifier (PK component)
list_jsonJSONBList/search API response (title, sponsor, status, summary)
detail_jsonJSONBFull bill detail (text, amendments, cosponsors) — fetched separately
fetched_atTIMESTAMPTZWhen ingested
merged_atTIMESTAMPTZWhen promoted (NULL = not yet promoted)

3.3 Current staging tables

SourceMembersBillsStatus
NY Open Legislationnyleg_member_stagingnyleg_bill_stagingLive
Federal (Congress.gov)— (direct to people)— (direct to prism_bills)Needs migration
Open States— (direct to people)— (direct to openstates_bills)Needs migration

3.4 Migration path (federal + Open States)

For each source, the migration is:

  1. Create {source}_member_staging and {source}_bill_staging tables
  2. Update ingest scripts to write staging instead of canonical
  3. Create promote scripts following the NY pattern
  4. Backfill staging from existing canonical data (one-time)
  5. Update pipeline shell scripts and _run-all.sh
  6. Retire direct-write code paths

Order: Do federal first (smaller dataset, simpler API), then Open States.

4. Promote Script Pattern (universal)

Every promote script follows this sequence. The NY promote (promote_nyleg_staging_to_canonical.py) is the reference implementation.

4.1 Members promote

1. PRE-VALIDATE
   - Required tables exist (people, entities_legislators, people_offices, offices)
   - Offices seeded for this jurisdiction (e.g., new-york-state-senate)

2. PER STAGING ROW
   a. Parse raw_json + supplement_json
   b. Upsert people (slug, display_name, role_title, image, bio, contact, external_ids)
   c. Upsert people_offices (link person to office)
   d. Upsert entities_legislators (level, state, chamber, party, district, active)
   e. Set merged_at on staging row

3. POST-PROMOTE
   - Refresh mv_legislator_profiles
   - Validate: count people rows, entities_legislators rows, warn on anomalies
   - Log summary

4.2 Bills promote

1. PRE-VALIDATE
   - Required tables exist (entities_bills, bill_texts, bill_sponsorships)
   - Members are synced (entities_legislators rows exist for this jurisdiction)
     → FAIL FAST if no members found

2. PER STAGING ROW
   a. Optionally fetch detail_json if missing (rate-limited HTTP)
   b. Upsert entities_bills (title, source_summary, status, dates, jurisdiction)
   c. Upsert bill_texts from detail_json full text (if available)
   d. Resolve sponsor/cosponsor IDs → entities_legislators
   e. Write bill_sponsorships (delete old + insert fresh)
   f. Enqueue scoring_queue (pending) — scoring is NOT triggered
   g. Set merged_at on staging row

3. POST-PROMOTE
   - Refresh mv_bills_list + mv_bill_detail
   - Validate: count bills, sponsors (% coverage), texts, summaries
   - Warn if sponsor coverage < threshold
   - Log summary with counters

4.3 CLI flags (standard)

Every promote script should support:

FlagPurpose
--dry-runLog what would happen, no DB writes
--all-stagingRe-process rows even if merged_at is set
--no-fetch-detailSkip HTTP calls for detail_json (fast re-promote)
--skip-refreshSkip MV refresh (when caller handles it, e.g., entities.sh)
--limit NCap rows processed (0 = unlimited with safety max)
--session-year YFilter to specific session
5. Open States Whitelist / Exclusion

Open States provides data for all 50 states. When a state has its own native processor (e.g., NY via Open Legislation), Open States data for that state should be excluded from promotion to avoid duplicates.

5.1 Exclusion pattern

OPENSTATES_PROMOTE_EXCLUDE_STATES=NY,AL    # states with native processors

The Open States promote script checks this list and skips promotion for excluded states. Staging still receives the data (useful for gap-fill, comparison, or fallback), but canonical tables only get the native processor's data.

5.2 Current state

StateNative processorOpen States excluded?
NYNY Open LegislationShould be (OPENSTATES_BILLS_EXCLUDE_STATES=ny)
ALNext candidateNot yet
All othersNoneOpen States is canonical

5.3 Members exclusion

Same pattern applies to members. When NY has its own member sync, Open States NY members should not overwrite entities_legislators rows. The duplicate slug problem (ny-leg-* vs state-new-york-*) was caused by both sources writing without exclusion.

6. Scoring (always decoupled)

Scoring is never triggered by ingest or promote. It is always a separate, explicit step.

6.1 Flow

promote → INSERT scoring_queue (status='pending')
                    ↓
        (explicit action: admin click, cron timer, manual script)
                    ↓
score_prism_bills.py → read scoring_queue/bill_texts
                    ↓
        entities_bills.prism_score, prism_summary, prism_rating_category
        bill_taxonomy (categories, policy_actions, special_flags)
                    ↓
        refresh_views.py → mv_bills_list updated

6.2 Why decoupled

  • Scoring uses Ollama (GPU-intensive, slow)
  • Admin may want to review data quality before scoring
  • Scoring can be rate-limited independently
  • Re-scoring doesn't require re-ingestion
  • Future admin UI: “Score” is a separate button from “Promote”

6.3 Env controls

VarPurpose
{SOURCE}_SKIP_SCORING_QUEUESet 1 to skip enqueuing during promote
PRISM_SCORE_MAX_PER_RUNCap bills scored per run
PRISM_SCORE_DELAY_MSPause between Ollama calls
PRISM_SCORE_SOURCESfederal, openstates, or both
7. Pipeline Execution Order

7.1 Full pipeline (poliprism-pipeline.sh --all)

INGEST (Layer 1)
  1. federal-legislators      → people, entities_legislators
  2. state-legislators         → people, entities_legislators (Open States)
  3. ny-legislators            → nyleg_member_staging → promote → people, entities_legislators
  4. federal-bills             → prism_bills (TODO: federal_bill_staging)
  5. federal-bill-texts        → raw_congress_bill_texts
  6. federal-enrichment        → people (committees, sponsored bills)
  7. state-bills               → openstates_bills (TODO: openstates_bill_staging)
  8. ny-bills                  → nyleg_bill_staging
  9. state-bill-texts          → openstates_bills.bill_text
  10. fec                      → raw_fec_*
  11. usaspending              → raw_usaspending_*
  12. state-assets             → state asset files
  13. profile-photos           → /images/profiles/
  14. entities                 → promote all staging → canonical + refresh MVs

PROCESS (Layer 2)
  1. promote-bill-texts        → entities_bills + bill_texts
  2. score-federal-bills       → prism_bills scoring
  3. score-state-bills         → openstates_bills scoring
  4. score-legislators (×4)    → entities_legislators scores
  5. classify-organizations    → organization classifications
  6. entity-resolver           → cross-source entity matching
  7. standardize-profiles      → profile normalization
  8. refresh-views             → all MVs (always last)

BUILD (Layer 3)
  1. css                       → poliprism.css
  2. hubs                      → home, states grid, error pages
  3. search                    → Fuse.js JSON indexes
  4. sitemap                   → sitemap.xml + robots.txt

7.2 Target state (after staging migration)

INGEST (all write to staging only)
  members: federal_member_staging, openstates_member_staging, nyleg_member_staging
  bills:   federal_bill_staging, openstates_bill_staging, nyleg_bill_staging

PROMOTE (staging → canonical, with pre-validation)
  1. promote-federal-members    → people + entities_legislators
  2. promote-openstates-members → people + entities_legislators (exclude native states)
  3. promote-ny-members         → people + entities_legislators
  4. promote-federal-bills      → entities_bills + bill_texts + bill_sponsorships
  5. promote-openstates-bills   → entities_bills + bill_texts + bill_sponsorships (exclude native states)
  6. promote-ny-bills           → entities_bills + bill_texts + bill_sponsorships
  7. refresh-views              → all MVs

SCORE (explicit, never auto)
  score-federal-bills, score-state-bills (from scoring_queue)

BUILD (unchanged)
  css, hubs, search, sitemap
8. Admin UI Mental Model (future)

Each step is a button with status feedback:

┌──────────────────────────────────────────────────┐
│ Data Source: NY Open Legislation                  │
├──────────────────────────────────────────────────┤
│ [Ingest Members]  Last: 2026-04-09  220 rows     │
│ [Promote Members] Last: 2026-04-09  220 promoted │
│ [Ingest Bills]    Last: 2026-04-10  39,780 rows  │
│ [Promote Bills]   Last: 2026-04-10  39,780 done  │
│   Sponsors: 38,758 (97%)                         │
│   Summaries: 39,744 (99.9%)                      │
│   Texts: 79 (0.2%)                               │
│ [Score Bills]     Last: never  0/39,780          │
│ [Refresh Views]   Last: 2026-04-10               │
├──────────────────────────────────────────────────┤
│ Data Source: Federal (Congress.gov)               │
│ ... same button pattern ...                      │
├──────────────────────────────────────────────────┤
│ Data Source: Open States (AL, ...)               │
│ ... same button pattern ...                      │
│ Excluded states: NY                              │
└──────────────────────────────────────────────────┘

Each button:

  • Shows last run timestamp and result counts
  • Can be clicked to run that step
  • Shows progress while running
  • Blocks downstream steps if upstream hasn't completed (e.g., can't promote bills if members aren't promoted)
9. Canonical Tables (shared across all sources)

All sources converge into the same canonical tables:

TablePurposeSource key columns
peopleAll legislators (slug is PK)external_ids JSONB
entities_legislatorsUnified legislator metadatapeople_slug, bioguide_id, openstates_id
people_officesPerson ↔ office linksperson_id, office_id
entities_billsAll bills (one source key per row)prism_bill_id, openstates_bill_id, nyleg_bill_id
bill_textsBill full textbill_id (FK), source discriminator
bill_sponsorshipsSponsor/cosponsor linksbill_id, legislator_id
scoring_queuePending bills for AI scoringbill_id
bill_taxonomyAI-assigned categoriesbill_id

9.1 Source key constraint

entities_bills enforces exactly one source key per row:

CHECK (
  (prism_bill_id IS NOT NULL)::integer
  + (openstates_bill_id IS NOT NULL)::integer
  + (nyleg_bill_id IS NOT NULL)::integer = 1
)

New native processors add a new column + extend this CHECK.

9.2 Summary fields

ColumnSourceContent
entities_bills.source_summaryVendor APIStaff-written legislative digest (CRS for federal, Open Legislation for NY)
entities_bills.prism_summaryOllama AIAI-generated neutral analysis (lives in Prism Factor Panel on site)

These are distinct — never overwrite one with the other.

10. Finance Track: FEC + USASpending

Finance data follows the same staging-first principle but has its own entity relationships. Two sources feed into legislator finance profiles.

10.1 Data sources

SourceAPIWhat it providesRaw tables (today)
FECapi.fec.gov (key required)Candidates, committees, PAC contributions, disbursementsraw_fec_candidates, raw_fec_committees, raw_fec_contributions (partitioned by cycle), raw_fec_disbursements
USASpendingapi.usaspending.gov (no key)Federal spending obligations by congressional districtraw_usaspending_awards, raw_usaspending_by_district

10.2 Current flow

FEC already has a partial staging pattern — sync_fec_data.py runs in phases:

Phase 1: candidates    → raw_fec_candidates + entities_legislators.fec_candidate_id
Phase 2: committees    → raw_fec_committees + entities_organizations
Phase 3: contributions → raw_fec_contributions (partitioned by cycle: 2020, 2022, 2024)
Phase 4: normalize     → campaign_contributions (links legislators ↔ organizations)

The raw_fec_* tables are staging — phase 4 (normalize) is the promote step that writes to campaign_contributions. This is already close to the target pattern.

USASpending writes to raw tables then summarizes:

sync_usaspending.py → raw_usaspending_by_district → usaspending_district_summary

10.3 Canonical finance tables

TablePurposeLinks to
entities_organizationsPACs, committees, donorsFEC committee IDs
campaign_contributionsLegislator ← PAC contribution linksentities_legislators, entities_organizations
raw_fec_disbursementsCommittee spending (partitioned by cycle)raw_fec_committees
usaspending_district_summaryFederal spending per House districtentities_legislators (state + district)
mv_finance_ideology_gapPre-computed donor alignment scoresMaterialized view

10.4 Finance dependency chain

┌─────────────────────────────────────────────────────┐
│ MEMBERS TRACK (must exist first)                    │
│ entities_legislators with fec_candidate_id          │
└──────────────────────┬──────────────────────────────┘
                       │ dependency: candidate ↔ legislator link
┌──────────────────────┴──────────────────────────────┐
│ FEC TRACK                                           │
│                                                     │
│ Phase 1: candidates → raw_fec_candidates            │
│          + backfill entities_legislators.fec_id       │
│ Phase 2: committees → raw_fec_committees            │
│          + entities_organizations                     │
│ Phase 3: contributions → raw_fec_contributions      │
│ Phase 4: normalize → campaign_contributions          │
│                                                     │
│ Score: score_legislators_donor_alignment              │
│        → entities_legislators.donor_alignment_score  │
│        → mv_finance_ideology_gap                     │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ USASPENDING TRACK                                   │
│                                                     │
│ Ingest: raw_usaspending_by_district                 │
│ Summarize: usaspending_district_summary              │
│ Links to House members via state + district          │
└─────────────────────────────────────────────────────┘

10.5 Current data

TableRowsNotes
raw_fec_candidates537Matched federal legislators
raw_fec_committees546Principal campaign committees
raw_fec_contributions25,502PAC contributions (partitioned by 2020/2022/2024 cycles)
campaign_contributions25,502Normalized (phase 4 output)
entities_organizations3,446PACs and committees
raw_usaspending_by_district381District-level spending
usaspending_district_summary425Aggregated per district
raw_fec_disbursements0Not yet ingested

10.6 What's already right

FEC is actually the closest to the target patternraw_fec_* tables are staging, and phase 4 (normalize) is a promote step. The main gaps:

  • No pre-validation (check members exist before linking candidates)
  • No post-validation (count coverage, warn on low matches)
  • No MV refresh built into normalize phase
  • Disbursements not yet ingested

10.7 Target state

INGEST (staging)
  fec-candidates    → raw_fec_candidates
  fec-committees    → raw_fec_committees
  fec-contributions → raw_fec_contributions
  fec-disbursements → raw_fec_disbursements
  usaspending       → raw_usaspending_by_district

PROMOTE (staging → canonical, with validation)
  promote-fec       → entities_organizations + campaign_contributions
                      pre-validate: entities_legislators.fec_candidate_id populated
                      post-validate: match rates, contribution counts
  promote-usaspending → usaspending_district_summary
                        pre-validate: House members exist with state+district

SCORE (explicit, decoupled)
  score-legislators-donor-alignment → entities_legislators scores
  refresh-views → mv_finance_ideology_gap

10.8 Admin UI for finance

┌──────────────────────────────────────────────────┐
│ Finance: FEC (cycle 2024)                        │
├──────────────────────────────────────────────────┤
│ [Ingest Candidates]  537 matched                 │
│ [Ingest Committees]  546 committees              │
│ [Ingest Contributions] 25,502 records            │
│ [Ingest Disbursements] 0 (not yet run)           │
│ [Promote/Normalize]  25,502 → campaign_contribs  │
│ [Score Donor Alignment] Last: 2026-04-10         │
├──────────────────────────────────────────────────┤
│ Finance: USASpending (FY2025)                    │
├──────────────────────────────────────────────────┤
│ [Ingest Districts]   381 district records        │
│ [Summarize]          425 district summaries      │
│ [Refresh Views]      mv_finance_ideology_gap     │
└──────────────────────────────────────────────────┘
11. Verification Checklist (per source)

Members

  • Staging rows match API count for session
  • people rows created with correct slugs
  • entities_legislators rows with level, state, chamber, party, district
  • people_offices links exist
  • No duplicate districts (check for slug collisions across sources)
  • mv_legislator_profiles refreshed and row count matches

Bills

  • Staging rows match expected count (check filters)
  • entities_bills rows: one source key populated, title/session present
  • source_summary populated (% of bills with vendor summary)
  • bill_texts: text rows for bills with detail_json (has_bill_text in MV)
  • bill_sponsorships: sponsor coverage ≥ 80% of bills with sponsor data in staging
  • scoring_queue: pending rows created (unless explicitly skipped)
  • MVs refreshed (mv_bills_list, mv_bill_detail)
  • API returns data: /api/state-bills?state_abbr=XX or /api/bills

Finance (FEC)

  • raw_fec_candidates rows match expected legislator count
  • entities_legislators.fec_candidate_id populated for matched members
  • raw_fec_committees linked to candidates
  • raw_fec_contributions count reasonable for cycle (check partitions)
  • campaign_contributions (normalized) row count matches raw contributions
  • entities_organizations populated from committees
  • Donor alignment scores computed (mv_finance_ideology_gap refreshed)

Finance (USASpending)

  • raw_usaspending_by_district covers expected House districts
  • usaspending_district_summary linked to House members via state+district
  • No orphan districts (districts without matching legislators)
12. Implementation Priority
PriorityWorkScope
DoneNY staging-first (members + bills)Reference implementation
1Federal bills → stagingCreate federal_bill_staging, update sync_prism_bills.py, new promote
2Federal members → stagingCreate federal_member_staging, update federal legislator sync
3Open States bills → stagingCreate openstates_bill_staging, update sync, add exclude list
4Open States members → stagingCreate openstates_member_staging, update sync, add exclude list
5FEC promote hardeningAdd pre/post validation + MV refresh to normalize phase
6USASpending promote hardeningAdd pre/post validation to summarize step
7FEC disbursements ingestNew ingestion for committee spending data
8Deduplicate NY legislatorsMerge ny-leg-* and state-new-york-* slugs
9Admin UIPipeline control page with per-source buttons

Related docs

  • STATE-PRIMARY-INGEST-SOP.md — discovery, API stewardship, per-vendor feasibility
  • processors/ — per-processor config docs
  • SERVER-ENV.md — env var reference
  • BILL-TEXT-FETCH-STRATEGY.md — text fetch patterns
  • /sitedetails/ — live architecture page
  • CLAUDE.md — agent context and deploy rules