Three steps, one daily output.
The Federal Register is the U.S. government's daily journal of every new and proposed rule. Each business day it publishes thousands of paragraphs of regulatory text. This pipeline does three things with that text.
First, it downloads today's bulk XML and yesterday's. Second, it diffs them at the word level with difflib.SequenceMatcher, surfacing the exact tokens added, removed, or modified inside each paragraph. Third, it classifies every changed paragraph into one of eleven regulatory domains via a zero-shot DistilRoBERTa NLI model, so the feed can be filtered to whatever a given user actually cares about.
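The second step can be illustrated with a minimal sketch of a word-level diff built on `difflib.SequenceMatcher`. Whitespace tokenization, the `token_diff` helper name, and the sample sentences are assumptions made for illustration; the pipeline's actual tokenizer and function names are not shown on this page.

```python
from difflib import SequenceMatcher


def token_diff(old: str, new: str) -> list[tuple[str, list[str], list[str]]]:
    """Return (op, old_tokens, new_tokens) for every span that is not equal."""
    # Assumption: tokens are whitespace-separated words.
    a, b = old.split(), new.split()
    # autojunk=False keeps frequent words from being treated as junk.
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # tag is "replace", "delete", or "insert"
            changes.append((tag, a[i1:i2], b[j1:j2]))
    return changes


# Made-up sentences, not actual Federal Register text.
yesterday = "The limit is 35 units per day."
today = "The limit is 25 units per day."
print(token_diff(yesterday, today))  # [('replace', ['35'], ['25'])]
```

Each tuple carries the removed tokens and the inserted tokens for one edit, which is enough to highlight the exact words that changed inside a paragraph.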
Records land in PostgreSQL with full provenance (document number, agency, publication date, paragraph hash) and are served through a small FastAPI surface. Most regulatory monitors tell you that a document changed. This one tells you which words.
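One way to picture a stored record is sketched below. The field names `pub_date`, `diff_type`, `domain`, `domain_score`, and `agency` come from the API responses shown on this page; `document_number`, `paragraph_hash`, the `ChangeRecord` class, and the whitespace normalization before hashing are assumptions, not the pipeline's confirmed schema.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


def hash_paragraph(text: str) -> str:
    """SHA-256 hex digest of the paragraph text."""
    # Assumption: collapse whitespace first so reflowed XML hashes the same.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ChangeRecord:
    """Hypothetical shape of one changed-paragraph row with its provenance."""
    document_number: str
    agency: str
    pub_date: date
    paragraph_hash: str
    diff_type: str       # "added", "removed", or "modified"
    domain: str
    domain_score: float  # between 0 and 1


record = ChangeRecord(
    document_number="2026-08874",
    agency="SEC",
    pub_date=date(2026, 4, 24),
    paragraph_hash=hash_paragraph("placeholder paragraph text"),
    diff_type="added",
    domain="financial",
    domain_score=0.976,
)
print(record.paragraph_hash[:12])
```

Keeping the hash next to the document number and date means any record can be traced back to the exact paragraph it was computed from.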
Eight of the 328 paragraphs that differ from yesterday.
Each row is one changed paragraph. Type records whether it was added, removed, or modified in place. Domain is the predicted regulatory area; Conf. is the classifier's softmax score on a 0 to 1 scale.
| Doc | Agency | Type | Domain | Conf. | Excerpt |
|---|---|---|---|---|---|
| 2026-08901 | EPA | modified | environmental | 0.981 | PM2.5 standard reduced from 35 μg/m³ to 25 μg/m³ under revised NAAQS attainment criteria. |
| 2026-08874 | SEC | added | financial | 0.976 | Registrants must disclose material cybersecurity incidents within four business days of determining materiality, pursuant to Item 1.05. |
| 2026-08812 | CMS | modified | healthcare | 0.963 | Readmission reduction program extended to skilled nursing facilities; measurement window extended from 30 to 45 days. |
| 2026-08770 | USDA | removed | agriculture | 0.948 | Interim waiver of origin labeling requirements for processed beef imports under USMCA expires 2026-06-30. |
| 2026-08741 | FMCSA | added | transportation | 0.957 | AV commercial vehicles operating above SAE Level 3 require real-time telemetry logging and FMCSA safety certification. |
| 2026-08712 | FERC | modified | energy | 0.944 | Transmission planning updated to include climate scenario analysis under Order 896; required planning horizon changed from 10-year to 20-year. |
| 2026-08688 | OCC | added | financial | 0.969 | National banks engaging in crypto-asset custody must maintain segregated ledger accounts with monthly attestation. |
| 2026-08651 | DOL | modified | labor | 0.938 | Fiduciary duty standard expanded from ERISA plans only to all tax-advantaged retirement accounts. |
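
The Type column above could be derived along the following lines. This is a sketch under stated assumptions: it aligns the two days' paragraphs by running `SequenceMatcher` over their SHA-256 hashes and maps `insert`, `delete`, and `replace` opcodes to added, removed, and modified. The `label_paragraphs` helper and the positional pairing inside a `replace` span are illustrative choices, not the pipeline's confirmed logic.

```python
import hashlib
from difflib import SequenceMatcher
from typing import Optional


def _sha(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def label_paragraphs(
    old_paras: list[str], new_paras: list[str]
) -> list[tuple[str, Optional[str], Optional[str]]]:
    """Return (diff_type, old_text, new_text) for every changed paragraph."""
    a = [_sha(p) for p in old_paras]
    b = [_sha(p) for p in new_paras]
    out = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes():
        if tag == "equal":
            continue
        olds, news = old_paras[i1:i2], new_paras[j1:j2]
        paired = min(len(olds), len(news))
        # Assumption: inside a replaced span, pair paragraphs by position.
        for k in range(paired):
            out.append(("modified", olds[k], news[k]))
        for p in olds[paired:]:
            out.append(("removed", p, None))
        for p in news[paired:]:
            out.append(("added", None, p))
    return out


print(label_paragraphs(["A", "B", "C"], ["A", "B2", "C", "D"]))
# [('modified', 'B', 'B2'), ('added', None, 'D')]
```

Only the paragraphs labelled modified would then need a word-level diff; added and removed paragraphs are changed in full.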
Financial and environmental rules dominate, as usual.
Two views of the same 328 changes: the share each domain holds, and how those changes split across additions, modifications, and removals.
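Both views reduce to simple counting. The sketch below uses the eight sample rows from the table above rather than the full 328 changes; the helper names `domain_shares` and `type_split` are hypothetical.

```python
from collections import Counter

# (domain, diff_type) for the eight sample rows shown in the table.
rows = [
    ("environmental", "modified"), ("financial", "added"),
    ("healthcare", "modified"), ("agriculture", "removed"),
    ("transportation", "added"), ("energy", "modified"),
    ("financial", "added"), ("labor", "modified"),
]


def domain_shares(rows: list[tuple[str, str]]) -> dict[str, float]:
    """Fraction of all changes that falls in each domain."""
    counts = Counter(domain for domain, _ in rows)
    total = sum(counts.values())
    return {d: n / total for d, n in counts.items()} if total else {}


def type_split(rows: list[tuple[str, str]]) -> Counter:
    """Number of additions, modifications, and removals."""
    return Counter(diff_type for _, diff_type in rows)


print(domain_shares(rows)["financial"])  # 0.25
print(type_split(rows))
```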
What a token-level change actually looks like.
The EPA's PM2.5 rule, paragraph 3 of 12 in document 2026-08901. Strikethrough red is what was removed; solid green is what was inserted. A 4-token edit that tightens an air-quality limit, easy to miss inside a 40-page rule.
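In plain text, the same highlighting can be approximated with inline markers. The sketch below is an assumption about presentation, not the page's actual renderer: it wraps removed tokens in `[-...-]` and inserted tokens in `{+...+}`, and the sample sentence is made up rather than quoted from the rule.

```python
from difflib import SequenceMatcher


def render_diff(old: str, new: str) -> str:
    """Render a word-level diff with [-removed-] and {+inserted+} markers."""
    a, b = old.split(), new.split()
    parts = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes():
        if tag == "equal":
            parts.extend(a[i1:i2])
            continue
        if i2 > i1:  # tokens present only in the old text
            parts.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j2 > j1:  # tokens present only in the new text
            parts.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(parts)


print(render_diff("limit of 35 units", "limit of 25 units"))
# limit of [-35-] {+25+} units
```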
Daily change volume, fourteen business days.
Federal Register volume is naturally bumpy. Quiet days hover near 250 changed paragraphs; busier rule-issuing days push past 320.
How sure the classifier is, by domain.
Average confidence over the last 30 days, n = 4,872 paragraphs. Domains with distinctive vocabulary (environmental, financial) score high. The "other" bucket is the deliberate fallback when no candidate label exceeds the min_confidence threshold, so its mean is lower by construction.
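The fallback rule can be sketched in a few lines. The logits below are invented, the default threshold of 0.5 is an assumption (the page only says `min_confidence` is configurable), and `pick_domain` is a hypothetical helper; in the real pipeline the per-label scores would come from the NLI model.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one score per candidate label."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_domain(
    labels: list[str], logits: list[float], min_confidence: float = 0.5
) -> tuple[str, float]:
    """Return (domain, score), falling back to "other" below the threshold."""
    scores = softmax(logits)
    best = max(range(len(labels)), key=scores.__getitem__)
    if scores[best] < min_confidence:
        # Keep the low top score so the fallback remains auditable.
        return "other", scores[best]
    return labels[best], scores[best]


print(pick_domain(["financial", "environmental"], [4.0, 0.0]))
print(pick_domain(["financial", "environmental", "labor"], [0.0, 0.0, 0.0]))
```

In this sketch every record filed under "other" carries a top score below the threshold, which is why that bucket's average sits lower than the rest.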
How you would consume this feed.
Once records are in Postgres, FastAPI serves them. Filter by pub_date and domain; paginate with limit and offset. Full OpenAPI spec at /docs.
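A client might wrap filtering and pagination as sketched below. The query parameters `pub_date`, `domain`, `limit`, and `offset` and the `/changes` path are taken from this page; the base URL, the helper names, and the assumption that `/changes` returns a body with `total` and `items` are illustrative. The page fetcher is passed in as a callable so the sketch does not depend on any particular HTTP library.

```python
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode


def changes_url(
    base_url: str,
    pub_date: Optional[str] = None,
    domain: Optional[str] = None,
    limit: int = 50,
    offset: int = 0,
) -> str:
    """Build a /changes URL with optional filters and pagination."""
    params: dict[str, object] = {"limit": limit, "offset": offset}
    if pub_date:
        params["pub_date"] = pub_date
    if domain:
        params["domain"] = domain
    return f"{base_url.rstrip('/')}/changes?{urlencode(params)}"


def iter_changes(fetch_page: Callable[[int, int], dict], limit: int = 50) -> Iterator[dict]:
    """Yield every item, calling fetch_page(limit, offset) until exhausted."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page["items"]
        offset += limit
        if offset >= page["total"] or not page["items"]:
            break


# Hypothetical local deployment.
print(changes_url("http://localhost:8000", "2026-04-24", "financial"))
# http://localhost:8000/changes?limit=50&offset=0&pub_date=2026-04-24&domain=financial
```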
    // 200 OK, paginated change records
    {
      "total": 42,
      "limit": 50,
      "offset": 0,
      "items": [
        {
          "id": 18421,
          "pub_date": "2026-04-24",
          "diff_type": "added",
          "domain": "financial",
          "domain_score": 0.976,
          "agency": "SEC"
        }
      ]
    }

    // 200 OK, totals + avg confidence by domain
    [
      { "domain": "financial", "total_changes": 8421, "avg_score": 0.94 },
      { "domain": "environmental", "total_changes": 6188, "avg_score": 0.95 },
      { "domain": "healthcare", "total_changes": 4972, "avg_score": 0.92 }
    ]
Five stages, end to end in about three minutes.
Each stage is its own module so any of them can be tested or swapped. Total runtime on a laptop is dominated by classifier inference, which is batched across the whole day's changed paragraphs.
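Batching the day's changed paragraphs before inference might look like the sketch below. The batch size of 32 and the `classify_batch` callable are assumptions; the callable stands in for whatever the classifier module actually exposes, which also makes the stage easy to test with a stub.

```python
from typing import Callable, Iterator, Sequence


def batches(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify_all(
    paragraphs: Sequence[str],
    classify_batch: Callable[[Sequence[str]], list],
    batch_size: int = 32,
) -> list:
    """Run a batch classifier over all paragraphs, preserving input order."""
    results = []
    for chunk in batches(paragraphs, batch_size):
        results.extend(classify_batch(chunk))
    return results


# A stub classifier in place of the real model.
stub = lambda chunk: [text.upper() for text in chunk]
print(classify_all(["a", "b", "c"], stub, batch_size=2))  # ['A', 'B', 'C']
```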
- SequenceMatcher over each document, paragraphs hashed with SHA-256.
- changed_paragraphs, indexed by pub_date.
- /changes, /domains, /stats.

Terms used on this page.
- Federal Register: The official daily journal of the U.S. government, where every federal agency publishes new and proposed regulations.
- Bulk XML: The full text of a single day's Federal Register, served as one downloadable XML file. Whole-document atomicity is what makes day-over-day diffing reliable.
- diff_type: How a paragraph changed: added, removed, or modified in place.
- Token-level diff: A comparison at the word level rather than the paragraph or document level, so the exact words that changed inside a sentence can be highlighted.
- Zero-shot NLI: A classifier that maps text to candidate labels using natural-language entailment, without any labelled training data. Useful here because there is no public corpus of regulatory paragraphs labelled by domain.
- domain_score: The model's confidence in its top-predicted domain, between 0 and 1. Predictions below the configurable min_confidence are returned as "other", by design.