Pipeline Architecture SOP

Staging-first data pipeline for all sources — members, bills, and finance

Reference implementation: New York (NY) — sync_nyleg_members.py, sync_nyleg_bills.py, promote_nyleg_staging_to_canonical.py.
Companion docs: STATE-PRIMARY-INGEST-SOP.md · Site Architecture

1. Core Principle: Staging First

Nothing from the outside touches canonical tables directly.

Every data source writes to its own staging tables. Promotion to canonical is a separate, explicit step with pre-validation and post-validation.

External API
    ↓ ingest (HTTP → JSON)
Staging Tables (raw JSON, per-source)
    ↓ promote (validate → transform → upsert)
Canonical Tables (entities_*, bill_texts, bill_sponsorships)
    ↓ refresh
Materialized Views (mv_bills_list, mv_legislator_profiles, ...)
    ↓ serve
API / Site

Why staging-first

Benefit	How
Isolation	Ingest can fail/retry/re-run without corrupting canonical data
Auditability	Raw JSON preserved; inspect before promoting
Idempotency	`ON CONFLICT` upserts on staging; re-ingest is safe
Admin control	Future admin UI: review staging → click “Promote” → see results
Dependency enforcement	Promote pre-validates that upstream data exists (e.g., members before bills)
Decoupled scoring	Promote enqueues `scoring_queue`; scoring is never auto-triggered

DB size

Staging tables are JSON blobs. At current scale (40K NY, 339 federal, 60 Open States) this is negligible. Archive or truncate old staging rows after confirmed promotion if needed.

2. Two Tracks: Members and Bills

Every data source has two parallel tracks with a hard dependency between them.

┌─────────────────────────────────────────────────────┐
│ MEMBERS TRACK (must complete first)                 │
│                                                     │
│ ingest → {source}_member_staging                    │
│ promote → people + entities_legislators             │
│           + people_offices                          │
└──────────────────────┬──────────────────────────────┘
                       │ dependency: sponsor resolution
┌──────────────────────┴──────────────────────────────┐
│ BILLS TRACK                                         │
│                                                     │
│ ingest → {source}_bill_staging                      │
│ promote → entities_bills + bill_texts               │
│           + bill_sponsorships + scoring_queue        │
│ refresh → mv_bills_list + mv_bill_detail            │
│           + mv_legislator_profiles                  │
└─────────────────────────────────────────────────────┘

Why members first: Bill sponsorship resolution maps sponsor IDs to entities_legislators rows. If members aren't promoted, sponsors silently fail (the NY bug we discovered — 79/39,780 bills had sponsors because members were synced after the first promote run).

3. Staging Tables (per source)

3.1 Naming convention

{source}_member_staging    — raw member/legislator JSON
{source}_bill_staging      — raw bill list JSON + optional detail JSON

3.2 Required columns (minimum)

Member staging

Column	Type	Purpose
`source_member_id`	TEXT	Vendor's unique member ID (PK component)
`session_year`	INT	Legislative session (PK component, nullable if not applicable)
`chamber`	TEXT	senate/house/assembly
`raw_json`	JSONB	Full API response payload
`supplement_json`	JSONB	Optional gap-fill data (e.g., Open States photo/bio)
`fetched_at`	TIMESTAMPTZ	When ingested
`merged_at`	TIMESTAMPTZ	When promoted (NULL = not yet promoted)

Bill staging

Column	Type	Purpose
`session_year`	INT	Legislative session (PK component)
`print_no` / `identifier`	TEXT	Bill identifier (PK component)
`list_json`	JSONB	List/search API response (title, sponsor, status, summary)
`detail_json`	JSONB	Full bill detail (text, amendments, cosponsors) — fetched separately
`fetched_at`	TIMESTAMPTZ	When ingested
`merged_at`	TIMESTAMPTZ	When promoted (NULL = not yet promoted)

3.3 Current staging tables

Source	Members	Bills	Status
NY Open Legislation	`nyleg_member_staging`	`nyleg_bill_staging`	Live
Federal (Congress.gov)	— (direct to `people`)	— (direct to `prism_bills`)	Needs migration
Open States	— (direct to `people`)	— (direct to `openstates_bills`)	Needs migration

3.4 Migration path (federal + Open States)

For each source, the migration is:

Create {source}_member_staging and {source}_bill_staging tables
Update ingest scripts to write staging instead of canonical
Create promote scripts following the NY pattern
Backfill staging from existing canonical data (one-time)
Update pipeline shell scripts and _run-all.sh
Retire direct-write code paths

Order: Do federal first (smaller dataset, simpler API), then Open States.

4. Promote Script Pattern (universal)

Every promote script follows this sequence. The NY promote (promote_nyleg_staging_to_canonical.py) is the reference implementation.

4.1 Members promote

1. PRE-VALIDATE
   - Required tables exist (people, entities_legislators, people_offices, offices)
   - Offices seeded for this jurisdiction (e.g., new-york-state-senate)

2. PER STAGING ROW
   a. Parse raw_json + supplement_json
   b. Upsert people (slug, display_name, role_title, image, bio, contact, external_ids)
   c. Upsert people_offices (link person to office)
   d. Upsert entities_legislators (level, state, chamber, party, district, active)
   e. Set merged_at on staging row

3. POST-PROMOTE
   - Refresh mv_legislator_profiles
   - Validate: count people rows, entities_legislators rows, warn on anomalies
   - Log summary

4.2 Bills promote

1. PRE-VALIDATE
   - Required tables exist (entities_bills, bill_texts, bill_sponsorships)
   - Members are synced (entities_legislators rows exist for this jurisdiction)
     → FAIL FAST if no members found

2. PER STAGING ROW
   a. Optionally fetch detail_json if missing (rate-limited HTTP)
   b. Upsert entities_bills (title, source_summary, status, dates, jurisdiction)
   c. Upsert bill_texts from detail_json full text (if available)
   d. Resolve sponsor/cosponsor IDs → entities_legislators
   e. Write bill_sponsorships (delete old + insert fresh)
   f. Enqueue scoring_queue (pending) — scoring is NOT triggered
   g. Set merged_at on staging row

3. POST-PROMOTE
   - Refresh mv_bills_list + mv_bill_detail
   - Validate: count bills, sponsors (% coverage), texts, summaries
   - Warn if sponsor coverage < threshold
   - Log summary with counters

4.3 CLI flags (standard)

Every promote script should support:

Flag	Purpose
`--dry-run`	Log what would happen, no DB writes
`--all-staging`	Re-process rows even if `merged_at` is set
`--no-fetch-detail`	Skip HTTP calls for detail_json (fast re-promote)
`--skip-refresh`	Skip MV refresh (when caller handles it, e.g., entities.sh)
`--limit N`	Cap rows processed (0 = unlimited with safety max)
`--session-year Y`	Filter to specific session

5. Open States Whitelist / Exclusion

Open States provides data for all 50 states. When a state has its own native processor (e.g., NY via Open Legislation), Open States data for that state should be excluded from promotion to avoid duplicates.

5.1 Exclusion pattern

OPENSTATES_PROMOTE_EXCLUDE_STATES=NY,AL    # states with native processors

The Open States promote script checks this list and skips promotion for excluded states. Staging still receives the data (useful for gap-fill, comparison, or fallback), but canonical tables only get the native processor's data.

5.2 Current state

State	Native processor	Open States excluded?
NY	NY Open Legislation	Should be (`OPENSTATES_BILLS_EXCLUDE_STATES=ny`)
AL	Next candidate	Not yet
All others	None	Open States is canonical

5.3 Members exclusion

Same pattern applies to members. When NY has its own member sync, Open States NY members should not overwrite entities_legislators rows. The duplicate slug problem (ny-leg-* vs state-new-york-*) was caused by both sources writing without exclusion.

6. Scoring (always decoupled)

Scoring is never triggered by ingest or promote. It is always a separate, explicit step.

6.1 Flow

promote → INSERT scoring_queue (status='pending')
                    ↓
        (explicit action: admin click, cron timer, manual script)
                    ↓
score_prism_bills.py → read scoring_queue/bill_texts
                    ↓
        entities_bills.prism_score, prism_summary, prism_rating_category
        bill_taxonomy (categories, policy_actions, special_flags)
                    ↓
        refresh_views.py → mv_bills_list updated

6.2 Why decoupled

Scoring uses Ollama (GPU-intensive, slow)
Admin may want to review data quality before scoring
Scoring can be rate-limited independently
Re-scoring doesn't require re-ingestion
Future admin UI: “Score” is a separate button from “Promote”

6.3 Env controls

Var	Purpose
`{SOURCE}_SKIP_SCORING_QUEUE`	Set 1 to skip enqueuing during promote
`PRISM_SCORE_MAX_PER_RUN`	Cap bills scored per run
`PRISM_SCORE_DELAY_MS`	Pause between Ollama calls
`PRISM_SCORE_SOURCES`	`federal`, `openstates`, or both

7. Pipeline Execution Order

7.1 Full pipeline (`poliprism-pipeline.sh --all`)

INGEST (Layer 1)
  1. federal-legislators      → people, entities_legislators
  2. state-legislators         → people, entities_legislators (Open States)
  3. ny-legislators            → nyleg_member_staging → promote → people, entities_legislators
  4. federal-bills             → prism_bills (TODO: federal_bill_staging)
  5. federal-bill-texts        → raw_congress_bill_texts
  6. federal-enrichment        → people (committees, sponsored bills)
  7. state-bills               → openstates_bills (TODO: openstates_bill_staging)
  8. ny-bills                  → nyleg_bill_staging
  9. state-bill-texts          → openstates_bills.bill_text
  10. fec                      → raw_fec_*
  11. usaspending              → raw_usaspending_*
  12. state-assets             → state asset files
  13. profile-photos           → /images/profiles/
  14. entities                 → promote all staging → canonical + refresh MVs

PROCESS (Layer 2)
  1. promote-bill-texts        → entities_bills + bill_texts
  2. score-federal-bills       → prism_bills scoring
  3. score-state-bills         → openstates_bills scoring
  4. score-legislators (×4)    → entities_legislators scores
  5. classify-organizations    → organization classifications
  6. entity-resolver           → cross-source entity matching
  7. standardize-profiles      → profile normalization
  8. refresh-views             → all MVs (always last)

BUILD (Layer 3)
  1. css                       → poliprism.css
  2. hubs                      → home, states grid, error pages
  3. search                    → Fuse.js JSON indexes
  4. sitemap                   → sitemap.xml + robots.txt

7.2 Target state (after staging migration)

INGEST (all write to staging only)
  members: federal_member_staging, openstates_member_staging, nyleg_member_staging
  bills:   federal_bill_staging, openstates_bill_staging, nyleg_bill_staging

PROMOTE (staging → canonical, with pre-validation)
  1. promote-federal-members    → people + entities_legislators
  2. promote-openstates-members → people + entities_legislators (exclude native states)
  3. promote-ny-members         → people + entities_legislators
  4. promote-federal-bills      → entities_bills + bill_texts + bill_sponsorships
  5. promote-openstates-bills   → entities_bills + bill_texts + bill_sponsorships (exclude native states)
  6. promote-ny-bills           → entities_bills + bill_texts + bill_sponsorships
  7. refresh-views              → all MVs

SCORE (explicit, never auto)
  score-federal-bills, score-state-bills (from scoring_queue)

BUILD (unchanged)
  css, hubs, search, sitemap

8. Admin UI Mental Model (future)

Each step is a button with status feedback:

┌──────────────────────────────────────────────────┐
│ Data Source: NY Open Legislation                  │
├──────────────────────────────────────────────────┤
│ [Ingest Members]  Last: 2026-04-09  220 rows     │
│ [Promote Members] Last: 2026-04-09  220 promoted │
│ [Ingest Bills]    Last: 2026-04-10  39,780 rows  │
│ [Promote Bills]   Last: 2026-04-10  39,780 done  │
│   Sponsors: 38,758 (97%)                         │
│   Summaries: 39,744 (99.9%)                      │
│   Texts: 79 (0.2%)                               │
│ [Score Bills]     Last: never  0/39,780          │
│ [Refresh Views]   Last: 2026-04-10               │
├──────────────────────────────────────────────────┤
│ Data Source: Federal (Congress.gov)               │
│ ... same button pattern ...                      │
├──────────────────────────────────────────────────┤
│ Data Source: Open States (AL, ...)               │
│ ... same button pattern ...                      │
│ Excluded states: NY                              │
└──────────────────────────────────────────────────┘

Each button:

Shows last run timestamp and result counts
Can be clicked to run that step
Shows progress while running
Blocks downstream steps if upstream hasn't completed (e.g., can't promote bills if members aren't promoted)

9. Canonical Tables (shared across all sources)

All sources converge into the same canonical tables:

Table	Purpose	Source key columns
`people`	All legislators (slug is PK)	`external_ids` JSONB
`entities_legislators`	Unified legislator metadata	`people_slug`, `bioguide_id`, `openstates_id`
`people_offices`	Person ↔ office links	`person_id`, `office_id`
`entities_bills`	All bills (one source key per row)	`prism_bill_id`, `openstates_bill_id`, `nyleg_bill_id`
`bill_texts`	Bill full text	`bill_id` (FK), `source` discriminator
`bill_sponsorships`	Sponsor/cosponsor links	`bill_id`, `legislator_id`
`scoring_queue`	Pending bills for AI scoring	`bill_id`
`bill_taxonomy`	AI-assigned categories	`bill_id`

9.1 Source key constraint

entities_bills enforces exactly one source key per row:

CHECK (
  (prism_bill_id IS NOT NULL)::integer
  + (openstates_bill_id IS NOT NULL)::integer
  + (nyleg_bill_id IS NOT NULL)::integer = 1
)

New native processors add a new column + extend this CHECK.

9.2 Summary fields

Column	Source	Content
`entities_bills.source_summary`	Vendor API	Staff-written legislative digest (CRS for federal, Open Legislation for NY)
`entities_bills.prism_summary`	Ollama AI	AI-generated neutral analysis (lives in Prism Factor Panel on site)

These are distinct — never overwrite one with the other.

10. Finance Track: FEC + USASpending

Finance data follows the same staging-first principle but has its own entity relationships. Two sources feed into legislator finance profiles.

10.1 Data sources

Source	API	What it provides	Raw tables (today)
FEC	`api.fec.gov` (key required)	Candidates, committees, PAC contributions, disbursements	`raw_fec_candidates`, `raw_fec_committees`, `raw_fec_contributions` (partitioned by cycle), `raw_fec_disbursements`
USASpending	`api.usaspending.gov` (no key)	Federal spending obligations by congressional district	`raw_usaspending_awards`, `raw_usaspending_by_district`

10.2 Current flow

FEC already has a partial staging pattern — sync_fec_data.py runs in phases:

Phase 1: candidates    → raw_fec_candidates + entities_legislators.fec_candidate_id
Phase 2: committees    → raw_fec_committees + entities_organizations
Phase 3: contributions → raw_fec_contributions (partitioned by cycle: 2020, 2022, 2024)
Phase 4: normalize     → campaign_contributions (links legislators ↔ organizations)

The raw_fec_* tables are staging — phase 4 (normalize) is the promote step that writes to campaign_contributions. This is already close to the target pattern.

USASpending writes to raw tables then summarizes:

sync_usaspending.py → raw_usaspending_by_district → usaspending_district_summary

10.3 Canonical finance tables

Table	Purpose	Links to
`entities_organizations`	PACs, committees, donors	FEC committee IDs
`campaign_contributions`	Legislator ← PAC contribution links	`entities_legislators`, `entities_organizations`
`raw_fec_disbursements`	Committee spending (partitioned by cycle)	`raw_fec_committees`
`usaspending_district_summary`	Federal spending per House district	`entities_legislators` (state + district)
`mv_finance_ideology_gap`	Pre-computed donor alignment scores	Materialized view

10.4 Finance dependency chain

┌─────────────────────────────────────────────────────┐
│ MEMBERS TRACK (must exist first)                    │
│ entities_legislators with fec_candidate_id          │
└──────────────────────┬──────────────────────────────┘
                       │ dependency: candidate ↔ legislator link
┌──────────────────────┴──────────────────────────────┐
│ FEC TRACK                                           │
│                                                     │
│ Phase 1: candidates → raw_fec_candidates            │
│          + backfill entities_legislators.fec_id       │
│ Phase 2: committees → raw_fec_committees            │
│          + entities_organizations                     │
│ Phase 3: contributions → raw_fec_contributions      │
│ Phase 4: normalize → campaign_contributions          │
│                                                     │
│ Score: score_legislators_donor_alignment              │
│        → entities_legislators.donor_alignment_score  │
│        → mv_finance_ideology_gap                     │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ USASPENDING TRACK                                   │
│                                                     │
│ Ingest: raw_usaspending_by_district                 │
│ Summarize: usaspending_district_summary              │
│ Links to House members via state + district          │
└─────────────────────────────────────────────────────┘

10.5 Current data

Table	Rows	Notes
`raw_fec_candidates`	537	Matched federal legislators
`raw_fec_committees`	546	Principal campaign committees
`raw_fec_contributions`	25,502	PAC contributions (partitioned by 2020/2022/2024 cycles)
`campaign_contributions`	25,502	Normalized (phase 4 output)
`entities_organizations`	3,446	PACs and committees
`raw_usaspending_by_district`	381	District-level spending
`usaspending_district_summary`	425	Aggregated per district
`raw_fec_disbursements`	0	Not yet ingested

10.6 What's already right

FEC is actually the closest to the target pattern — raw_fec_* tables are staging, and phase 4 (normalize) is a promote step. The main gaps:

No pre-validation (check members exist before linking candidates)
No post-validation (count coverage, warn on low matches)
No MV refresh built into normalize phase
Disbursements not yet ingested

10.7 Target state

INGEST (staging)
  fec-candidates    → raw_fec_candidates
  fec-committees    → raw_fec_committees
  fec-contributions → raw_fec_contributions
  fec-disbursements → raw_fec_disbursements
  usaspending       → raw_usaspending_by_district

PROMOTE (staging → canonical, with validation)
  promote-fec       → entities_organizations + campaign_contributions
                      pre-validate: entities_legislators.fec_candidate_id populated
                      post-validate: match rates, contribution counts
  promote-usaspending → usaspending_district_summary
                        pre-validate: House members exist with state+district

SCORE (explicit, decoupled)
  score-legislators-donor-alignment → entities_legislators scores
  refresh-views → mv_finance_ideology_gap

10.8 Admin UI for finance

┌──────────────────────────────────────────────────┐
│ Finance: FEC (cycle 2024)                        │
├──────────────────────────────────────────────────┤
│ [Ingest Candidates]  537 matched                 │
│ [Ingest Committees]  546 committees              │
│ [Ingest Contributions] 25,502 records            │
│ [Ingest Disbursements] 0 (not yet run)           │
│ [Promote/Normalize]  25,502 → campaign_contribs  │
│ [Score Donor Alignment] Last: 2026-04-10         │
├──────────────────────────────────────────────────┤
│ Finance: USASpending (FY2025)                    │
├──────────────────────────────────────────────────┤
│ [Ingest Districts]   381 district records        │
│ [Summarize]          425 district summaries      │
│ [Refresh Views]      mv_finance_ideology_gap     │
└──────────────────────────────────────────────────┘

11. Verification Checklist (per source)

Members

Staging rows match API count for session
people rows created with correct slugs
entities_legislators rows with level, state, chamber, party, district
people_offices links exist
No duplicate districts (check for slug collisions across sources)
mv_legislator_profiles refreshed and row count matches

Bills

Staging rows match expected count (check filters)
entities_bills rows: one source key populated, title/session present
source_summary populated (% of bills with vendor summary)
bill_texts: text rows for bills with detail_json (has_bill_text in MV)
bill_sponsorships: sponsor coverage ≥ 80% of bills with sponsor data in staging
scoring_queue: pending rows created (unless explicitly skipped)
MVs refreshed (mv_bills_list, mv_bill_detail)
API returns data: /api/state-bills?state_abbr=XX or /api/bills

Finance (FEC)

raw_fec_candidates rows match expected legislator count
entities_legislators.fec_candidate_id populated for matched members
raw_fec_committees linked to candidates
raw_fec_contributions count reasonable for cycle (check partitions)
campaign_contributions (normalized) row count matches raw contributions
entities_organizations populated from committees
Donor alignment scores computed (mv_finance_ideology_gap refreshed)

Finance (USASpending)

raw_usaspending_by_district covers expected House districts
usaspending_district_summary linked to House members via state+district
No orphan districts (districts without matching legislators)

12. Implementation Priority

Priority	Work	Scope
Done	NY staging-first (members + bills)	Reference implementation
1	Federal bills → staging	Create `federal_bill_staging`, update `sync_prism_bills.py`, new promote
2	Federal members → staging	Create `federal_member_staging`, update federal legislator sync
3	Open States bills → staging	Create `openstates_bill_staging`, update sync, add exclude list
4	Open States members → staging	Create `openstates_member_staging`, update sync, add exclude list
5	FEC promote hardening	Add pre/post validation + MV refresh to normalize phase
6	USASpending promote hardening	Add pre/post validation to summarize step
7	FEC disbursements ingest	New ingestion for committee spending data
8	Deduplicate NY legislators	Merge `ny-leg-` and `state-new-york-` slugs
9	Admin UI	Pipeline control page with per-source buttons

Related docs

STATE-PRIMARY-INGEST-SOP.md — discovery, API stewardship, per-vendor feasibility
processors/ — per-processor config docs
SERVER-ENV.md — env var reference
BILL-TEXT-FETCH-STRATEGY.md — text fetch patterns
/sitedetails/ — live architecture page
CLAUDE.md — agent context and deploy rules