Goal: use an LLM to generate arbitrary data and send it periodically to registered email addresses
What I learned
- Idempotent = running the same operation multiple times leaves the system state the same as running it once
- Idempotency key (UNIQUE key in DB): prevents duplicate rows/effects under retries (subscribe, send, insert meeting)
- Content cache key (hash you compare): prevents re-computing expensive work when inputs didn’t change (summaries, assembled report); see the sketch after this list
- Principle of least privilege: only the Email service may decrypt email_ciphertext
- Tokens do not always need to expire; it depends on the use case
- The bigger risk isn’t reuse, it’s forgery
- UNIQUE and primary key
- PRIMARY KEY = the row’s identity; UNIQUE = a business rule
- Q. Why keep a PRIMARY KEY and UNIQUE columns?
- A. Stability: Natural identifiers (email, SKU, URL) can change
- A UNIQUE constraint creates a unique B-tree index under the hood (not in all RDBMSs)
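A minimal sketch of the two key types, assuming a psycopg-style conn and the subscribers/meetings columns from the Schema section below:

```python
import hashlib

# Idempotency key: the UNIQUE column turns a retried insert into a no-op
# instead of a duplicate row.
def insert_subscriber(conn, email_hmac: str, ciphertext: bytes) -> None:
    conn.execute(
        "INSERT INTO subscribers (email_hmac_sha256, email_ciphertext) "
        "VALUES (%s, %s) ON CONFLICT (email_hmac_sha256) DO NOTHING",
        (email_hmac, ciphertext),
    )

# Content cache key: recompute from the inputs and compare; if the stored key
# matches, skip the expensive LLM call entirely.
def needs_summary(stored_key, content_hash, prompt_version, model) -> bool:
    new_key = hashlib.sha256(
        f"{content_hash}:{prompt_version}:{model}".encode()
    ).hexdigest()
    return new_key != stored_key
```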
Terms
PM = parliamentary minutes (source docs)
Report = the LLM-generated daily summary derived from all PMs discovered that day
Subscriber = a verified email address receiving reports
User stories
1st priority
- Subscribe (no auth)
- As a visitor, I can submit my email to subscribe so I receive future reports
- Acceptance: POST /subscribe with a valid email → returns success; DB has a new active subscriber (or dedup if already active)
- Unsubscribe (one-click)
- As a subscriber, I can unsubscribe via a link in any email without logging in
- Acceptance: GET /u?t=… flips status to unsubscribed; I no longer receive mail
- Secure email storage
- As an operator, I need emails stored safely
- Acceptance: email is encrypted at rest (AES-GCM via envelope encryption) + HMAC column for dedupe (keyed pepper); no plaintext emails in logs
- Fetch PM daily
- As the system, I check the gov site every day at a set time and record newly published PMs
- Acceptance: new PMs for a date are persisted exactly once (idempotent), with their source ids & hashes
- Multiple sessions in a day
- If multiple PMs occur in the same calendar day, bundle them into one Report (with per-session sections)
- Create daily report if new PM exists
- As the system, if at least one new PM was found today, generate a single consolidated Report for that date via LLM
- Acceptance: one report (day=YYYY-MM-DD) is created with model metadata; retries safe; token/cost caps enforced
- Send at most one email per day
- As a subscriber, I receive at most one email per day, only if a new Report exists for that date
- Acceptance: per-subscriber cap; deliveries recorded; no duplicates for a day
- Deployable service
- As an operator, I can run this in the cloud with scheduled jobs, queues, and an email provider
- Acceptance: infra manifests exist; health check endpoint passes; secrets managed
Design
Components
- Web Client
- where users request to subscribe/unsubscribe
- API Gateway
- Email service
- Report service
- Relational DB (RDB)
API endpoints
POST /subscribe
Body: { "email": "user@example.com" }
Always returns 200 (so the response doesn’t reveal whether an address is already registered). Server normalizes, computes the dedupe HMAC, envelope-encrypts, and inserts if new.
GET /u?t=<token>
One-click unsubscribe. Token verified via KMS VerifyMac; idempotent success page.
GET /.well-known/health
Returns { status:"ok" } when DB reachable and last cron ran within window.
Schema
- subscribers
- id @id
- email_ciphertext
- email_iv
- wrapped_dek
- email_hmac_sha256 @UNIQUE
- status
- created_at
- activated_at
- unsubscribed_at
- sittings
- day: DATE @PK
- index_url: TEXT
- source_id: TEXT @UNIQUE // e.g., hash of index_url or url+etag
- published_at: TIMESTAMPTZ? // if the site provides it
- fetched_at: TIMESTAMPTZ? // when fetch(day) succeeded
- meetings
- id: UUID @PK
- sitting_day: DATE // FK -> sittings(day)
- seq: INT // display/order within the day (10,20,30…)
- title: TEXT
- chamber: TEXT // or Enum
- url: TEXT
- raw_path: TEXT? // where the fetched page/text is stored (e.g., S3 key)
- content_hash: TEXT // hash of normalized “who-said-what” text
- summary_md: TEXT? // LLM output; NULL until summarized
- summary_key: TEXT? // sha256(content_hash : prompt_version : model)
- summarized_at: TIMESTAMPTZ?
- prompt_version: TEXT?
- model: TEXT?
- reports
- day: DATE @PK // FK -> sittings(day)
- summary_md: TEXT // assembled from that day's meetings in seq order
- bundle_key: TEXT? // sha256(concat(summary_key in seq order))
- status: TEXT = 'not_sent'
- created_at: TIMESTAMPTZ = now()
- deliveries
- day: DATE // FK -> reports(day)
- subscriber_id: UUID // FK -> subscribers(id)
- status: TEXT
- provider_message_id: TEXT?
- sent_at: TIMESTAMPTZ?
- error: TEXT?
- @UNIQUE(day, subscriber_id) // enforces at-most-once per subscriber per day
Flows
Users
subscribe
- New user talks to API gateway, specifically POST /subscribe {email}
- The request is received by the Email service
- Email service talks to KMS and computes its HMAC
- Email service looks up the computed HMAC on RDB
- if it exists, do nothing
- else, follow next step
- Talk to KMS, get DEK + wrapped_DEK. Also get CMK version from KMS
- Produce random iv
- Use DEK and iv to perform AES-GCM encryption on the email
- Add new subscriber to the RDB
- email_ciphertext, email_iv, wrapped_dek, email_hmac_sha256 (UNIQUE), created_at
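A sketch of this flow, assuming boto3 for KMS, the cryptography package for AES-GCM, and a psycopg-style conn; HMAC_KEY_ID and CMK_ID are placeholder key ids:

```python
import os
import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

kms = boto3.client("kms")

def subscribe(conn, email: str) -> None:
    norm = email.strip().lower()                     # normalize before hashing
    mac = kms.generate_mac(                          # HMAC key never leaves KMS
        Message=norm.encode(), KeyId="HMAC_KEY_ID", MacAlgorithm="HMAC_SHA_256"
    )["Mac"].hex()
    row = conn.execute(
        "SELECT 1 FROM subscribers WHERE email_hmac_sha256 = %s", (mac,)
    ).fetchone()
    if row:
        return                                       # already subscribed: do nothing
    dk = kms.generate_data_key(KeyId="CMK_ID", KeySpec="AES_256")
    iv = os.urandom(12)                              # fresh 96-bit GCM nonce
    ct = AESGCM(dk["Plaintext"]).encrypt(iv, norm.encode(), None)
    conn.execute(
        "INSERT INTO subscribers "
        "(email_ciphertext, email_iv, wrapped_dek, email_hmac_sha256, created_at) "
        "VALUES (%s, %s, %s, %s, now()) "
        "ON CONFLICT (email_hmac_sha256) DO NOTHING",  # race-safe under retries
        (ct, iv, dk["CiphertextBlob"], mac),
    )
```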
unsubscribe
- Email contains token for unsubscribe request
- Token is created as follows:
  payload = base64url("sid=<uuid>&purpose=unsub")
  mac = KMS.GenerateMac(hmac_key, payload)
  token = payload + "." + mac
- The request is received by the Email service
- Parse subscriber.id from the payload and look up that subscriber’s row in the RDB
- Verify the MAC over the payload via KMS VerifyMac (hmac_key) to confirm the token was issued by us, not forged
- If it is valid, set that subscriber’s status to unsubscribed
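A sketch of token minting and verification with KMS MAC keys, matching the pseudocode above; HMAC_KEY_ID is a placeholder, and an invalid MAC surfaces as KMSInvalidMacException in boto3:

```python
import base64
import boto3

kms = boto3.client("kms")

def make_unsub_token(sid: str) -> str:
    payload = base64.urlsafe_b64encode(f"sid={sid}&purpose=unsub".encode()).decode()
    mac = kms.generate_mac(
        Message=payload.encode(), KeyId="HMAC_KEY_ID", MacAlgorithm="HMAC_SHA_256"
    )["Mac"]
    return payload + "." + base64.urlsafe_b64encode(mac).decode()

def verify_unsub_token(token: str):
    payload, _, mac_b64 = token.partition(".")
    try:
        kms.verify_mac(
            Message=payload.encode(), KeyId="HMAC_KEY_ID",
            MacAlgorithm="HMAC_SHA_256", Mac=base64.urlsafe_b64decode(mac_b64),
        )
    except kms.exceptions.KMSInvalidMacException:
        return None                                  # forged or corrupted token
    return base64.urlsafe_b64decode(payload).decode()  # "sid=<uuid>&purpose=unsub"
```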
Coordination
- fetch(day) — discover + store raw meeting content (no LLM here)
- summarize(day) — enqueue per-meeting summary tasks; a worker does the LLM calls
- assemble_report(day) — build the day’s report from stored meeting summaries
- send(day) — deliver at most once per subscriber
fetch
Responsibility: Discover meetings for day, store raw content + content_hash.
Input: day.
Reads: government pages.
Writes: sittings, meetings.
Flow (idempotent):
1. Build index_url; UPSERT sittings(day, index_url, source_id).
2. Parse meeting links in page order → assign seq (10,20,30…).
3. For each link: fetch detail → normalize → content_hash.
4. UPSERT meetings (sitting_day, url) with seq, title, chamber, content_hash, raw_path.
If content_hash changed, set summary_md=NULL (forces re-summary).
Outputs/logs: { meetings_upserted, content_changed }
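A sketch of step 4’s upsert, assuming PostgreSQL with a UNIQUE (sitting_day, url) constraint (the UPSERT target above) and gen_random_uuid():

```python
def upsert_meeting(conn, day, seq, title, chamber, url, content_hash, raw_path):
    conn.execute(
        """
        INSERT INTO meetings
          (id, sitting_day, seq, title, chamber, url, content_hash, raw_path)
        VALUES (gen_random_uuid(), %s, %s, %s, %s, %s, %s, %s)
        ON CONFLICT (sitting_day, url) DO UPDATE SET
          seq = EXCLUDED.seq,
          title = EXCLUDED.title,
          raw_path = EXCLUDED.raw_path,
          -- if the content changed, drop the stale summary so summarize(day) redoes it
          summary_md = CASE WHEN meetings.content_hash <> EXCLUDED.content_hash
                            THEN NULL ELSE meetings.summary_md END,
          content_hash = EXCLUDED.content_hash
        """,
        (day, seq, title, chamber, url, content_hash, raw_path),
    )
```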
summarize
Responsibility: Ensure each meeting needing a summary gets one (use a queue worker).
Input: day.
Reads: meetings.
Writes: meetings.summary_md, meetings.summary_key.
Enqueue payload: { meeting_id, content_hash, prompt_version, model }
Flow (idempotent):
1. Recompute summary_key; if unchanged → skip.
2. Build prompt from raw (via raw_path), call LLM (caps).
3. Write summary_md, summary_key, summarized_at.
Key: summ:{meeting_id}:{summary_key}
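A sketch of the worker, where call_llm and load_raw are hypothetical helpers standing in for the capped LLM call and raw_path retrieval:

```python
import hashlib

def handle_summary_task(conn, meeting_id, content_hash, prompt_version, model):
    key = hashlib.sha256(
        f"{content_hash}:{prompt_version}:{model}".encode()
    ).hexdigest()
    row = conn.execute(
        "SELECT summary_key, raw_path FROM meetings WHERE id = %s", (meeting_id,)
    ).fetchone()
    if row is None or row[0] == key:
        return                                 # unknown meeting or already summarized
    summary_md = call_llm(load_raw(row[1]), prompt_version, model)  # capped LLM call
    conn.execute(
        "UPDATE meetings SET summary_md=%s, summary_key=%s, summarized_at=now(), "
        "prompt_version=%s, model=%s WHERE id=%s",
        (summary_md, key, prompt_version, model, meeting_id),
    )
```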
assemble_report
Responsibility: Build/refresh the day’s report from whatever summaries exist now.
Input: day.
Reads: meetings.
Writes: reports.
Flow:
1. SELECT * FROM meetings WHERE sitting_day=$day AND summary_md IS NOT NULL ORDER BY seq;
2. If none → no-op. Else concatenate headings + summary_md.
3. bundle_key = sha256(concat(summary_key in seq order)); if unchanged → no-op.
4. UPSERT reports(day, summary_md, bundle_key, status='not_sent').
Note: If later summaries arrive, rerun; this safely overwrites reports(day).
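A sketch of assemble_report; the WHERE clause on the upsert folds the “if unchanged → no-op” check into the same statement:

```python
import hashlib

def assemble_report(conn, day) -> None:
    rows = conn.execute(
        "SELECT title, summary_md, summary_key FROM meetings "
        "WHERE sitting_day = %s AND summary_md IS NOT NULL ORDER BY seq",
        (day,),
    ).fetchall()
    if not rows:
        return                                      # nothing summarized yet: no-op
    bundle_key = hashlib.sha256("".join(r[2] for r in rows).encode()).hexdigest()
    body = "\n\n".join(f"## {title}\n\n{md}" for title, md, _ in rows)
    conn.execute(
        "INSERT INTO reports (day, summary_md, bundle_key, status) "
        "VALUES (%s, %s, %s, 'not_sent') "
        "ON CONFLICT (day) DO UPDATE SET "
        "  summary_md = EXCLUDED.summary_md, "
        "  bundle_key = EXCLUDED.bundle_key, "
        "  status = 'not_sent' "
        "WHERE reports.bundle_key IS DISTINCT FROM EXCLUDED.bundle_key",
        (day, body, bundle_key),
    )
```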
send
Responsibility: Send at most once per subscriber for day.
Input: day.
Reads: reports, subscribers, deliveries.
Writes: deliveries.
Precheck: require reports(day) to exist; otherwise no-op.
Per recipient:
1. Decrypt email (envelope decryption), send via SES with List-Unsubscribe + one-click.
2. Insert deliveries(day, subscriber_id, 'sent', provider_id, now(), NULL) (UNIQUE enforces at-most-once).
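A sketch of the per-recipient send, assuming boto3’s SES v2 client (whose Simple content accepts custom Headers for the one-click unsubscribe fields in recent boto3); decrypt_email and unsub_url are hypothetical helpers for envelope decryption and the GET /u?t=<token> link:

```python
import boto3

ses = boto3.client("sesv2")

def send_one(conn, day, subscriber, report_md: str) -> None:
    if conn.execute(
        "SELECT 1 FROM deliveries WHERE day = %s AND subscriber_id = %s",
        (day, subscriber.id),
    ).fetchone():
        return                                  # already delivered today
    to_addr = decrypt_email(subscriber)         # unwrap DEK via KMS, AES-GCM decrypt
    url = unsub_url(subscriber.id)
    resp = ses.send_email(
        FromEmailAddress="reports@example.com",
        Destination={"ToAddresses": [to_addr]},
        Content={"Simple": {
            "Subject": {"Data": f"Parliamentary report {day}"},
            "Body": {"Text": {"Data": report_md}},
            "Headers": [
                {"Name": "List-Unsubscribe", "Value": f"<{url}>"},
                {"Name": "List-Unsubscribe-Post", "Value": "List-Unsubscribe=One-Click"},
            ],
        }},
    )
    conn.execute(
        "INSERT INTO deliveries (day, subscriber_id, status, provider_message_id, sent_at) "
        "VALUES (%s, %s, 'sent', %s, now()) ON CONFLICT (day, subscriber_id) DO NOTHING",
        (day, subscriber.id, resp["MessageId"]),
    )
```

If a crash between the SES call and the insert is a concern, claiming the (day, subscriber_id) row with a 'sending' status before sending closes that gap.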