0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

system design (HLD) 勉強

Last updated at Posted at 2025-08-25

LLMで任意のデータを作って、登録されたメアドに定期的に送信したい

What I learned

  • Idempotent = running the same operation multiple times leaves the system state the same as running it once
  • Idempotency key (UNIQUE key in DB): prevents duplicate rows/effects under retries (subscribe, send, insert meeting)
  • Content cache key (hash you compare): prevents re-computing expensive work when inputs didn’t change (summaries, assembled report)
  • Principle of least privilege. Only allow mail service to decrypt email_ciphertext
  • token does not always need to be expired. it depends on the use case
    • The bigger risk isn’t reuse, it’s forgery
  • UNIQUE and primary key
    • PRIMARY KEY = the row’s identity; UNIQUE = a business rule
    • Q. Why keep a PRIMARY KEY and UNIQUE columns?
    • A. Stability: Natural identifiers (email, SKU, URL) can change
  • a UNIQUE constraint creates a unique b-tree index under the hood (not on all rdbms)

Terms

PM = parliamentary minutes (source docs)

Report = the LLM-generated daily summary derived from all PMs discovered that day

Subscriber = a verified email address receiving reports

User story

1st priority

  • Subscribe (no auth)
    • As a visitor, I can submit my email to subscribe so I receive future reports
    • Acceptance: POST /subscribe with a valid email → returns success; DB has a new active subscriber (or dedup if already active)
  • Unsubscribe (one-click)
    • As a subscriber, I can unsubscribe via a link in any email without logging in
    • Acceptance: GET /u?t=… flips status to unsubscribed; I no longer receive mail
  • Secure email storage
    • As an operator, I need emails stored safely
    • Acceptance: email is encrypted at rest (AES-GCM via envelope encryption) + HMAC column for dedupe (keyed pepper); no plaintext emails in logs
  • Fetch PM daily
    • As the system, I check the gov site every day at a set time and record newly published PMs
    • Acceptance: new PMs for a date are persisted exactly once (idempotent), with their source ids & hashes
  • Multiple sessions in a day
    • If multiple PMs occur in the same calendar day, bundle them into one Report (with per-session sections)
  • Create daily report if new PM exists
    • As the system, if at least one new PM was found today, generate a single consolidated Report for that date via LLM
    • Acceptance: one report(day=YYYY-MM-DD) is created with model metadata; retries safe; token/cost caps enforced
  • Send at most one email per day
    • As a subscriber, I receive at most one email per day, only if a new Report exists for that date
    • Acceptance: per-subscriber cap; deliveries recorded; no duplicates for a day
  • Deployable service
    • As an operator, I can run this in the cloud with scheduled jobs, queues, and an email provider
    • Acceptance: infra manifests exist; health check endpoint passes; secrets managed

Design

Components

  • Web Client
    • where users request to subscribe/unsubscribe
  • API Gateway
  • Email service
  • Report service
  • Relational DB (RDB)

API endpoints

POST /subscribe
Body: { "email": "user@example.com" }
Always 200. Server normalizes, computes HMAC(dedupe), envelope-encrypts, inserts if new.
GET /u?t=<token>
One-click unsubscribe. Token verified via KMS VerifyMac; idempotent success page.
GET /.well-known/health
Returns { status:"ok" } when DB reachable and last cron ran within window.

Schema

- subscribers
    - id @id
    - email_ciphertext
    - email_iv
    - wrapped_dek
    - email_hmac_sha256 @UNIQUE
    - status
    - created_at
    - activated_at
    - unsubscribed_at

- sittings
    - day: DATE @PK
    - index_url: TEXT
    - source_id: TEXT @UNIQUE        // e.g., hash of index_url or url+etag)
    - published_at: TIMESTAMPTZ?     // if the site provides it
    - fetched_at: TIMESTAMPTZ?       // when fetch(day) succeeded

- meetings
    - id: UUID @PK
    - sitting_day: DATE  // FK -> sittings(day)
    - seq: INT           // display/order within the day (10,20,30…)
    - title: TEXT
    - chamber: TEXT      // or Enum
    - url: TEXT
    - raw_path: TEXT?    // where the fetched page/text is stored (e.g., S3 key)
    - content_hash: TEXT // hash of normalized “who-said-what” text
    - summary_md: TEXT?  // LLM output; NULL until summarized
    - summary_key: TEXT? // sha256(content_hash : prompt_version : model)
    - summarized_at: TIMESTAMPTZ?
    - prompt_version: TEXT?
    - model: TEXT?

- reports
    - day: DATE @PK            // FK -> sittings(day)
    - summary_md: TEXT         // assembled from that day's meetings in seq order
    - bundle_key: TEXT?        // sha256(concat(summary_key in seq order))
    - status: TEXT = 'not_sent'
    - created_at: TIMESTAMPTZ = now()

- deliveries
    - day
    - subscriber_id
    - status
    - provider_message_id
    - sent_at
    - error

Flows

Users

subscribe

  1. New user talks to API gateway, specifically POST /subscribe {email}
  2. The request is received on Email service
  3. Email service will talk to KMS and compute it's HMAC
  4. Email service looks up the computed HMAC on RDB
    • if it exists, do nothing
    • else, follow next step
  5. Talk to KMS, get DEK + wrapped_DEK. Also get CMK version from KMS
  6. Produce random iv
  7. Use DEK and iv to perform AES-GCM encryption on the email
  8. Add new subscriber to the RDB
    • email_ciphertext, email_iv, wrapped_dek, email_hmac_sha256 (UNIQUE), created_at

unsubscribe

  1. Email contains token for unsubscribe request
  2. Token is created as following
    1. payload = base64url("sid=<uuid>&purpose=unsub")
    2. mac = KMS.GenerateMac(hmac_key, payload))
    3. token = payload + "." + mac
  3. The request is received on Email service
  4. Get subscriber.id, and lookup RDB to get email_ciphertext
  5. Apply HMAC(hmac_key, payload) and compare to validate if the request is from expected user
  6. If it matches, set the status of that subscriber to unsubscribed

Coordination

  • fetch(day) — discover + store raw meeting content (no LLM here)
  • summarize(day) — enqueue per-meeting summary tasks; a worker does the LLM calls
  • assemble_report(day) — build the day’s report from stored meeting summaries
  • send(day) — deliver at most once per subscriber
fetch
Responsibility: Discover meetings for day, store raw content + content_hash.

Input: day.
Reads: government pages.
Writes: sittings, meetings.

Flow (idempotent):
1. Build index_url; UPSERT sittings(day, index_url, source_id).
2. Parse meeting links in page order → assign seq (10,20,30…).
3. For each link: fetch detail → normalize → content_hash.
4. UPSERT meetings (sitting_day, url) with seq, title, chamber, content_hash, raw_path.
If content_hash changed, set summary_md=NULL (forces re-summary).

Outputs/logs: { meetings_upserted, content_changed }
summarize
Responsibility: Ensure each meeting needing a summary gets one (use a queue worker).

Input: day.
Reads: meetings.
Writes: meetings.summary_md, meetings.summary_key.

Enqueue payload: { meeting_id, content_hash, prompt_version, model }

Flow (idempotent):
1. Recompute summary_key; if unchanged → skip.
2. Build prompt from raw (via raw_path), call LLM (caps).
3. Write summary_md, summary_key, summarized_at.

Key: summ:{meeting_id}:{summary_key}
assemble_report
Responsibility: Build/refresh the day’s report from whatever summaries exist now.

Input: day.
Reads: meetings.
Writes: reports.

Flow:

1. SELECT * FROM meetings WHERE sitting_day=$day AND summary_md IS NOT NULL ORDER BY seq;
2. If none → no-op. Else concatenate headings + summary_md.
3. bundle_key = sha256(concat(summary_key in seq order)); if unchanged → no-op.
4. UPSERT reports(day, summary_md, bundle_key, status='not_sent').

Note: If later summaries arrive, rerun; this safely overwrites reports(day).
send
Responsibility: Send at most once per subscriber for day.

Input: day.
Reads: reports, subscribers, deliveries.
Writes: deliveries.

Precheck: require reports(day) to exist; otherwise no-op.

Per recipient:

1. Decrypt email (envelope decryption), send via SES with List-Unsubscribe + one-click.

2. Insert deliveries(day, subscriber_id, 'sent', provider_id, now(), NULL) (UNIQUE enforces at-most-once).
0
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?