Pipeline Architecture SOP
Staging-first data pipeline for all sources — members, bills, and finance
sync_nyleg_members.py, sync_nyleg_bills.py,
promote_nyleg_staging_to_canonical.py.
Companion docs: STATE-PRIMARY-INGEST-SOP.md · Site Architecture
1. Core Principle: Staging First
Nothing from the outside touches canonical tables directly.
Every data source writes to its own staging tables. Promotion to canonical is a separate, explicit step with pre-validation and post-validation.
External API
↓ ingest (HTTP → JSON)
Staging Tables (raw JSON, per-source)
↓ promote (validate → transform → upsert)
Canonical Tables (entities_*, bill_texts, bill_sponsorships)
↓ refresh
Materialized Views (mv_bills_list, mv_legislator_profiles, ...)
↓ serve
API / Site
Why staging-first
| Benefit | How |
|---|---|
| Isolation | Ingest can fail/retry/re-run without corrupting canonical data |
| Auditability | Raw JSON preserved; inspect before promoting |
| Idempotency | ON CONFLICT upserts on staging; re-ingest is safe |
| Admin control | Future admin UI: review staging → click “Promote” → see results |
| Dependency enforcement | Promote pre-validates that upstream data exists (e.g., members before bills) |
| Decoupled scoring | Promote enqueues scoring_queue; scoring is never auto-triggered |
DB size
Staging tables are JSON blobs. At current scale (40K NY, 339 federal, 60 Open States) this is negligible. Archive or truncate old staging rows after confirmed promotion if needed.
2. Two Tracks: Members and Bills
Every data source has two parallel tracks with a hard dependency between them.
┌─────────────────────────────────────────────────────┐
│ MEMBERS TRACK (must complete first) │
│ │
│ ingest → {source}_member_staging │
│ promote → people + entities_legislators │
│ + people_offices │
└──────────────────────┬──────────────────────────────┘
│ dependency: sponsor resolution
┌──────────────────────┴──────────────────────────────┐
│ BILLS TRACK │
│ │
│ ingest → {source}_bill_staging │
│ promote → entities_bills + bill_texts │
│ + bill_sponsorships + scoring_queue │
│ refresh → mv_bills_list + mv_bill_detail │
│ + mv_legislator_profiles │
└─────────────────────────────────────────────────────┘
Why members first: Bill sponsorship resolution maps sponsor IDs to
entities_legislators rows. If members aren't promoted, sponsors silently fail
(the NY bug we discovered — 79/39,780 bills had sponsors because members
were synced after the first promote run).
3. Staging Tables (per source)
3.1 Naming convention
{source}_member_staging — raw member/legislator JSON
{source}_bill_staging — raw bill list JSON + optional detail JSON
3.2 Required columns (minimum)
Member staging
| Column | Type | Purpose |
|---|---|---|
source_member_id | TEXT | Vendor's unique member ID (PK component) |
session_year | INT | Legislative session (PK component, nullable if not applicable) |
chamber | TEXT | senate/house/assembly |
raw_json | JSONB | Full API response payload |
supplement_json | JSONB | Optional gap-fill data (e.g., Open States photo/bio) |
fetched_at | TIMESTAMPTZ | When ingested |
merged_at | TIMESTAMPTZ | When promoted (NULL = not yet promoted) |
Bill staging
| Column | Type | Purpose |
|---|---|---|
session_year | INT | Legislative session (PK component) |
print_no / identifier | TEXT | Bill identifier (PK component) |
list_json | JSONB | List/search API response (title, sponsor, status, summary) |
detail_json | JSONB | Full bill detail (text, amendments, cosponsors) — fetched separately |
fetched_at | TIMESTAMPTZ | When ingested |
merged_at | TIMESTAMPTZ | When promoted (NULL = not yet promoted) |
3.3 Current staging tables
| Source | Members | Bills | Status |
|---|---|---|---|
| NY Open Legislation | nyleg_member_staging | nyleg_bill_staging | Live |
| Federal (Congress.gov) | — (direct to people) | — (direct to prism_bills) | Needs migration |
| Open States | — (direct to people) | — (direct to openstates_bills) | Needs migration |
3.4 Migration path (federal + Open States)
For each source, the migration is:
- Create
{source}_member_stagingand{source}_bill_stagingtables - Update ingest scripts to write staging instead of canonical
- Create promote scripts following the NY pattern
- Backfill staging from existing canonical data (one-time)
- Update pipeline shell scripts and
_run-all.sh - Retire direct-write code paths
Order: Do federal first (smaller dataset, simpler API), then Open States.
4. Promote Script Pattern (universal)
Every promote script follows this sequence. The NY promote
(promote_nyleg_staging_to_canonical.py) is the reference implementation.
4.1 Members promote
1. PRE-VALIDATE
- Required tables exist (people, entities_legislators, people_offices, offices)
- Offices seeded for this jurisdiction (e.g., new-york-state-senate)
2. PER STAGING ROW
a. Parse raw_json + supplement_json
b. Upsert people (slug, display_name, role_title, image, bio, contact, external_ids)
c. Upsert people_offices (link person to office)
d. Upsert entities_legislators (level, state, chamber, party, district, active)
e. Set merged_at on staging row
3. POST-PROMOTE
- Refresh mv_legislator_profiles
- Validate: count people rows, entities_legislators rows, warn on anomalies
- Log summary
4.2 Bills promote
1. PRE-VALIDATE
- Required tables exist (entities_bills, bill_texts, bill_sponsorships)
- Members are synced (entities_legislators rows exist for this jurisdiction)
→ FAIL FAST if no members found
2. PER STAGING ROW
a. Optionally fetch detail_json if missing (rate-limited HTTP)
b. Upsert entities_bills (title, source_summary, status, dates, jurisdiction)
c. Upsert bill_texts from detail_json full text (if available)
d. Resolve sponsor/cosponsor IDs → entities_legislators
e. Write bill_sponsorships (delete old + insert fresh)
f. Enqueue scoring_queue (pending) — scoring is NOT triggered
g. Set merged_at on staging row
3. POST-PROMOTE
- Refresh mv_bills_list + mv_bill_detail
- Validate: count bills, sponsors (% coverage), texts, summaries
- Warn if sponsor coverage < threshold
- Log summary with counters
4.3 CLI flags (standard)
Every promote script should support:
| Flag | Purpose |
|---|---|
--dry-run | Log what would happen, no DB writes |
--all-staging | Re-process rows even if merged_at is set |
--no-fetch-detail | Skip HTTP calls for detail_json (fast re-promote) |
--skip-refresh | Skip MV refresh (when caller handles it, e.g., entities.sh) |
--limit N | Cap rows processed (0 = unlimited with safety max) |
--session-year Y | Filter to specific session |
5. Open States Whitelist / Exclusion
Open States provides data for all 50 states. When a state has its own native processor (e.g., NY via Open Legislation), Open States data for that state should be excluded from promotion to avoid duplicates.
5.1 Exclusion pattern
OPENSTATES_PROMOTE_EXCLUDE_STATES=NY,AL # states with native processors
The Open States promote script checks this list and skips promotion for excluded states. Staging still receives the data (useful for gap-fill, comparison, or fallback), but canonical tables only get the native processor's data.
5.2 Current state
| State | Native processor | Open States excluded? |
|---|---|---|
| NY | NY Open Legislation | Should be (OPENSTATES_BILLS_EXCLUDE_STATES=ny) |
| AL | Next candidate | Not yet |
| All others | None | Open States is canonical |
5.3 Members exclusion
Same pattern applies to members. When NY has its own member sync, Open States NY members
should not overwrite entities_legislators rows. The duplicate slug problem
(ny-leg-* vs state-new-york-*) was caused by both sources writing without exclusion.
6. Scoring (always decoupled)
Scoring is never triggered by ingest or promote. It is always a separate, explicit step.
6.1 Flow
promote → INSERT scoring_queue (status='pending')
↓
(explicit action: admin click, cron timer, manual script)
↓
score_prism_bills.py → read scoring_queue/bill_texts
↓
entities_bills.prism_score, prism_summary, prism_rating_category
bill_taxonomy (categories, policy_actions, special_flags)
↓
refresh_views.py → mv_bills_list updated
6.2 Why decoupled
- Scoring uses Ollama (GPU-intensive, slow)
- Admin may want to review data quality before scoring
- Scoring can be rate-limited independently
- Re-scoring doesn't require re-ingestion
- Future admin UI: “Score” is a separate button from “Promote”
6.3 Env controls
| Var | Purpose |
|---|---|
{SOURCE}_SKIP_SCORING_QUEUE | Set 1 to skip enqueuing during promote |
PRISM_SCORE_MAX_PER_RUN | Cap bills scored per run |
PRISM_SCORE_DELAY_MS | Pause between Ollama calls |
PRISM_SCORE_SOURCES | federal, openstates, or both |
7. Pipeline Execution Order
7.1 Full pipeline (poliprism-pipeline.sh --all)
INGEST (Layer 1)
1. federal-legislators → people, entities_legislators
2. state-legislators → people, entities_legislators (Open States)
3. ny-legislators → nyleg_member_staging → promote → people, entities_legislators
4. federal-bills → prism_bills (TODO: federal_bill_staging)
5. federal-bill-texts → raw_congress_bill_texts
6. federal-enrichment → people (committees, sponsored bills)
7. state-bills → openstates_bills (TODO: openstates_bill_staging)
8. ny-bills → nyleg_bill_staging
9. state-bill-texts → openstates_bills.bill_text
10. fec → raw_fec_*
11. usaspending → raw_usaspending_*
12. state-assets → state asset files
13. profile-photos → /images/profiles/
14. entities → promote all staging → canonical + refresh MVs
PROCESS (Layer 2)
1. promote-bill-texts → entities_bills + bill_texts
2. score-federal-bills → prism_bills scoring
3. score-state-bills → openstates_bills scoring
4. score-legislators (×4) → entities_legislators scores
5. classify-organizations → organization classifications
6. entity-resolver → cross-source entity matching
7. standardize-profiles → profile normalization
8. refresh-views → all MVs (always last)
BUILD (Layer 3)
1. css → poliprism.css
2. hubs → home, states grid, error pages
3. search → Fuse.js JSON indexes
4. sitemap → sitemap.xml + robots.txt
7.2 Target state (after staging migration)
INGEST (all write to staging only)
members: federal_member_staging, openstates_member_staging, nyleg_member_staging
bills: federal_bill_staging, openstates_bill_staging, nyleg_bill_staging
PROMOTE (staging → canonical, with pre-validation)
1. promote-federal-members → people + entities_legislators
2. promote-openstates-members → people + entities_legislators (exclude native states)
3. promote-ny-members → people + entities_legislators
4. promote-federal-bills → entities_bills + bill_texts + bill_sponsorships
5. promote-openstates-bills → entities_bills + bill_texts + bill_sponsorships (exclude native states)
6. promote-ny-bills → entities_bills + bill_texts + bill_sponsorships
7. refresh-views → all MVs
SCORE (explicit, never auto)
score-federal-bills, score-state-bills (from scoring_queue)
BUILD (unchanged)
css, hubs, search, sitemap
8. Admin UI Mental Model (future)
Each step is a button with status feedback:
┌──────────────────────────────────────────────────┐
│ Data Source: NY Open Legislation │
├──────────────────────────────────────────────────┤
│ [Ingest Members] Last: 2026-04-09 220 rows │
│ [Promote Members] Last: 2026-04-09 220 promoted │
│ [Ingest Bills] Last: 2026-04-10 39,780 rows │
│ [Promote Bills] Last: 2026-04-10 39,780 done │
│ Sponsors: 38,758 (97%) │
│ Summaries: 39,744 (99.9%) │
│ Texts: 79 (0.2%) │
│ [Score Bills] Last: never 0/39,780 │
│ [Refresh Views] Last: 2026-04-10 │
├──────────────────────────────────────────────────┤
│ Data Source: Federal (Congress.gov) │
│ ... same button pattern ... │
├──────────────────────────────────────────────────┤
│ Data Source: Open States (AL, ...) │
│ ... same button pattern ... │
│ Excluded states: NY │
└──────────────────────────────────────────────────┘
Each button:
- Shows last run timestamp and result counts
- Can be clicked to run that step
- Shows progress while running
- Blocks downstream steps if upstream hasn't completed (e.g., can't promote bills if members aren't promoted)
9. Canonical Tables (shared across all sources)
All sources converge into the same canonical tables:
| Table | Purpose | Source key columns |
|---|---|---|
people | All legislators (slug is PK) | external_ids JSONB |
entities_legislators | Unified legislator metadata | people_slug, bioguide_id, openstates_id |
people_offices | Person ↔ office links | person_id, office_id |
entities_bills | All bills (one source key per row) | prism_bill_id, openstates_bill_id, nyleg_bill_id |
bill_texts | Bill full text | bill_id (FK), source discriminator |
bill_sponsorships | Sponsor/cosponsor links | bill_id, legislator_id |
scoring_queue | Pending bills for AI scoring | bill_id |
bill_taxonomy | AI-assigned categories | bill_id |
9.1 Source key constraint
entities_bills enforces exactly one source key per row:
CHECK (
(prism_bill_id IS NOT NULL)::integer
+ (openstates_bill_id IS NOT NULL)::integer
+ (nyleg_bill_id IS NOT NULL)::integer = 1
)
New native processors add a new column + extend this CHECK.
9.2 Summary fields
| Column | Source | Content |
|---|---|---|
entities_bills.source_summary | Vendor API | Staff-written legislative digest (CRS for federal, Open Legislation for NY) |
entities_bills.prism_summary | Ollama AI | AI-generated neutral analysis (lives in Prism Factor Panel on site) |
These are distinct — never overwrite one with the other.
10. Finance Track: FEC + USASpending
Finance data follows the same staging-first principle but has its own entity relationships. Two sources feed into legislator finance profiles.
10.1 Data sources
| Source | API | What it provides | Raw tables (today) |
|---|---|---|---|
| FEC | api.fec.gov (key required) | Candidates, committees, PAC contributions, disbursements | raw_fec_candidates, raw_fec_committees, raw_fec_contributions (partitioned by cycle), raw_fec_disbursements |
| USASpending | api.usaspending.gov (no key) | Federal spending obligations by congressional district | raw_usaspending_awards, raw_usaspending_by_district |
10.2 Current flow
FEC already has a partial staging pattern — sync_fec_data.py runs in phases:
Phase 1: candidates → raw_fec_candidates + entities_legislators.fec_candidate_id
Phase 2: committees → raw_fec_committees + entities_organizations
Phase 3: contributions → raw_fec_contributions (partitioned by cycle: 2020, 2022, 2024)
Phase 4: normalize → campaign_contributions (links legislators ↔ organizations)
The raw_fec_* tables are staging — phase 4 (normalize) is the promote step that writes
to campaign_contributions. This is already close to the target pattern.
USASpending writes to raw tables then summarizes:
sync_usaspending.py → raw_usaspending_by_district → usaspending_district_summary
10.3 Canonical finance tables
| Table | Purpose | Links to |
|---|---|---|
entities_organizations | PACs, committees, donors | FEC committee IDs |
campaign_contributions | Legislator ← PAC contribution links | entities_legislators, entities_organizations |
raw_fec_disbursements | Committee spending (partitioned by cycle) | raw_fec_committees |
usaspending_district_summary | Federal spending per House district | entities_legislators (state + district) |
mv_finance_ideology_gap | Pre-computed donor alignment scores | Materialized view |
10.4 Finance dependency chain
┌─────────────────────────────────────────────────────┐
│ MEMBERS TRACK (must exist first) │
│ entities_legislators with fec_candidate_id │
└──────────────────────┬──────────────────────────────┘
│ dependency: candidate ↔ legislator link
┌──────────────────────┴──────────────────────────────┐
│ FEC TRACK │
│ │
│ Phase 1: candidates → raw_fec_candidates │
│ + backfill entities_legislators.fec_id │
│ Phase 2: committees → raw_fec_committees │
│ + entities_organizations │
│ Phase 3: contributions → raw_fec_contributions │
│ Phase 4: normalize → campaign_contributions │
│ │
│ Score: score_legislators_donor_alignment │
│ → entities_legislators.donor_alignment_score │
│ → mv_finance_ideology_gap │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ USASPENDING TRACK │
│ │
│ Ingest: raw_usaspending_by_district │
│ Summarize: usaspending_district_summary │
│ Links to House members via state + district │
└─────────────────────────────────────────────────────┘
10.5 Current data
| Table | Rows | Notes |
|---|---|---|
raw_fec_candidates | 537 | Matched federal legislators |
raw_fec_committees | 546 | Principal campaign committees |
raw_fec_contributions | 25,502 | PAC contributions (partitioned by 2020/2022/2024 cycles) |
campaign_contributions | 25,502 | Normalized (phase 4 output) |
entities_organizations | 3,446 | PACs and committees |
raw_usaspending_by_district | 381 | District-level spending |
usaspending_district_summary | 425 | Aggregated per district |
raw_fec_disbursements | 0 | Not yet ingested |
10.6 What's already right
FEC is actually the closest to the target pattern — raw_fec_* tables are staging,
and phase 4 (normalize) is a promote step. The main gaps:
- No pre-validation (check members exist before linking candidates)
- No post-validation (count coverage, warn on low matches)
- No MV refresh built into normalize phase
- Disbursements not yet ingested
10.7 Target state
INGEST (staging)
fec-candidates → raw_fec_candidates
fec-committees → raw_fec_committees
fec-contributions → raw_fec_contributions
fec-disbursements → raw_fec_disbursements
usaspending → raw_usaspending_by_district
PROMOTE (staging → canonical, with validation)
promote-fec → entities_organizations + campaign_contributions
pre-validate: entities_legislators.fec_candidate_id populated
post-validate: match rates, contribution counts
promote-usaspending → usaspending_district_summary
pre-validate: House members exist with state+district
SCORE (explicit, decoupled)
score-legislators-donor-alignment → entities_legislators scores
refresh-views → mv_finance_ideology_gap
10.8 Admin UI for finance
┌──────────────────────────────────────────────────┐
│ Finance: FEC (cycle 2024) │
├──────────────────────────────────────────────────┤
│ [Ingest Candidates] 537 matched │
│ [Ingest Committees] 546 committees │
│ [Ingest Contributions] 25,502 records │
│ [Ingest Disbursements] 0 (not yet run) │
│ [Promote/Normalize] 25,502 → campaign_contribs │
│ [Score Donor Alignment] Last: 2026-04-10 │
├──────────────────────────────────────────────────┤
│ Finance: USASpending (FY2025) │
├──────────────────────────────────────────────────┤
│ [Ingest Districts] 381 district records │
│ [Summarize] 425 district summaries │
│ [Refresh Views] mv_finance_ideology_gap │
└──────────────────────────────────────────────────┘
11. Verification Checklist (per source)
Members
- Staging rows match API count for session
peoplerows created with correct slugsentities_legislatorsrows with level, state, chamber, party, districtpeople_officeslinks exist- No duplicate districts (check for slug collisions across sources)
mv_legislator_profilesrefreshed and row count matches
Bills
- Staging rows match expected count (check filters)
entities_billsrows: one source key populated, title/session presentsource_summarypopulated (% of bills with vendor summary)bill_texts: text rows for bills with detail_json (has_bill_textin MV)bill_sponsorships: sponsor coverage ≥ 80% of bills with sponsor data in stagingscoring_queue: pending rows created (unless explicitly skipped)- MVs refreshed (
mv_bills_list,mv_bill_detail) - API returns data:
/api/state-bills?state_abbr=XXor/api/bills
Finance (FEC)
raw_fec_candidatesrows match expected legislator countentities_legislators.fec_candidate_idpopulated for matched membersraw_fec_committeeslinked to candidatesraw_fec_contributionscount reasonable for cycle (check partitions)campaign_contributions(normalized) row count matches raw contributionsentities_organizationspopulated from committees- Donor alignment scores computed (
mv_finance_ideology_gaprefreshed)
Finance (USASpending)
raw_usaspending_by_districtcovers expected House districtsusaspending_district_summarylinked to House members via state+district- No orphan districts (districts without matching legislators)
12. Implementation Priority
| Priority | Work | Scope |
|---|---|---|
| Done | NY staging-first (members + bills) | Reference implementation |
| 1 | Federal bills → staging | Create federal_bill_staging, update sync_prism_bills.py, new promote |
| 2 | Federal members → staging | Create federal_member_staging, update federal legislator sync |
| 3 | Open States bills → staging | Create openstates_bill_staging, update sync, add exclude list |
| 4 | Open States members → staging | Create openstates_member_staging, update sync, add exclude list |
| 5 | FEC promote hardening | Add pre/post validation + MV refresh to normalize phase |
| 6 | USASpending promote hardening | Add pre/post validation to summarize step |
| 7 | FEC disbursements ingest | New ingestion for committee spending data |
| 8 | Deduplicate NY legislators | Merge ny-leg-* and state-new-york-* slugs |
| 9 | Admin UI | Pipeline control page with per-source buttons |
Related docs
- STATE-PRIMARY-INGEST-SOP.md — discovery, API stewardship, per-vendor feasibility
processors/— per-processor config docsSERVER-ENV.md— env var referenceBILL-TEXT-FETCH-STRATEGY.md— text fetch patterns- /sitedetails/ — live architecture page
CLAUDE.md— agent context and deploy rules