MARS FLAG Advent Calendar 2025

MARS FLAG Corporation, SP Department

Let's Build a Simple Site Checker! - SEO Scoring -

Last updated at 2025-12-22, posted at 2025-12-22

This article is part 2 of the "Let's Build a Simple Site Checker!" series.

Playwright vs Cheerio: Tuning and Tricks
SEO Scoring (this article)
Frontend: Visualizing Site Structure with Canvas (coming soon)

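Introduction

Hello, I'm Koike from MARS FLAG.
I usually work as a frontend engineer, but I've recently become interested in the backend as well and am trying it out in personal projects.

This is a continuation of the previous article, "Let's Build a Simple Site Checker! - Playwright vs Cheerio, Tuning and Tricks -".

This time, I'll cover the part that scores a site from an SEO perspective, based on the site information collected by the crawler.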

Database Design

First, I thought about the data structures needed for SEO scoring.
I designed them from scratch using dbdiagram.

Overview (simplified)

┌─────────────┐     ┌───────────────┐     ┌─────────────────┐
│  projects   │────▶│ crawl_results │────▶│   crawl_data    │
└─────────────┘     └───────────────┘     └─────────────────┘
                           │
                           ▼
                    ┌─────────────────┐     ┌──────────────────┐
                    │seo_check_results│────▶│ seo_meta_details │
                    └─────────────────┘     └──────────────────┘

DB schema details

// Table for logged-in users
Table profiles {
  id uuid [primary key, note: 'Supabase Auth UUID']
  name text
  avatar_url text
  created_at timestamp [default: 'now()']
  updated_at timestamp [default: 'now()']
}

/************************************
* Projects
************************************/
Table projects {
  id uuid [primary key, default: 'gen_random_uuid()']
  user_id uuid [ref: > profiles.id, not null]
  name text [not null]
  description text
  site_url text [not null]
  max_pages integer [default: 100]
  is_active boolean [default: true]
  created_at timestamp [default: 'now()']
  updated_at timestamp [default: 'now()']
}

/************************************
* Crawler
************************************/
// Crawler settings
Table crawler_configs {
  id uuid [primary key, default: 'gen_random_uuid()']
  project_id uuid [ref: > projects.id, not null]
  max_pages integer
  max_depth integer [default: 3]
  exclude_patterns text[] [note: '[/admin/*, *.pdf]']
  include_patterns text[]
  follow_external_links boolean [default: false]
  respect_robots_txt boolean [default: true]
  delay_between_requests integer [default: 1000, note: 'milliseconds']
  user_agent text [default: 'SiteChecker Bot 1.0']
  created_at timestamp [default: 'now()']
  updated_at timestamp [default: 'now()']
}

// Results referenced from a project
Table crawl_results {
  id uuid [primary key, default: 'gen_random_uuid()']
  project_id uuid [ref: > projects.id, not null]
  user_id uuid [ref: > profiles.id, not null]
  site_url text [not null]
  sitemap_data jsonb [note: 'Sitemap data for VueFlow']
  status text [default: 'in_progress', note: 'in_progress, completed, failed']
  total_pages integer [default: 0]
  successful_pages integer [default: 0]
  failed_pages integer [default: 0]
  is_latest boolean [default: true]
  started_at timestamp [default: 'now()']
  completed_at timestamp
  created_at timestamp [default: 'now()']
  updated_at timestamp [default: 'now()']
}

// Per-page history
// Pages crawled in the same run are grouped under one crawl_results row
Table crawl_data {
  id uuid [primary key, default: 'gen_random_uuid()']
  crawl_results_id uuid [ref: > crawl_results.id, not null]
  page_url text [not null]
  raw_html text [note: 'Raw HTML used for SEO analysis']
  status_code integer
  error_message text
  created_at timestamp [default: 'now()']
}

// Used by the backend: crawl jobs
Table crawl_jobs {
  id uuid [primary key, default: 'gen_random_uuid()']
  user_id uuid [ref: > profiles.id, not null]
  project_id uuid [ref: > projects.id, not null]
  crawl_results_id uuid [ref: > crawl_results.id]
  status text [default: 'pending', note: 'pending, running, completed, failed, cancelled']
  progress integer [default: 0]
  error_message text
  max_pages integer
  started_at timestamp
  completed_at timestamp
  created_at timestamp [default: 'now()']
}

/************************************
* SEO checks
************************************/
// SEO check result summary
Table seo_check_results {
  id uuid [primary key, default: 'gen_random_uuid()']
  project_id uuid [ref: > projects.id, not null]
  crawl_results_id uuid [ref: > crawl_results.id, not null]
  total_score integer
  meta_score integer
  improvement_suggestions text
  checked_at timestamp [default: 'now()']
}

// SEO check details for metadata
Table seo_meta_details {
  id uuid [primary key, default: 'gen_random_uuid()']
  seo_check_results_id uuid [ref: > seo_check_results.id, not null]
  page_url varchar(500) [not null]
  title_text varchar(255)
  title_length integer
  title_has_keywords boolean
  meta_description_text text
  meta_description_length integer
  canonical_url varchar(500)
  og_tags jsonb [note: '{og_title, og_description, og_image}']
  twitter_cards jsonb [note: '{twitter_title, twitter_description, twitter_image}']
  keywords text[]
  status_code integer
  score integer [note: 'SEO score out of 100']
  created_at timestamp [default: 'now()']
}

// SEO check settings
Table seo_check_configs {
  id uuid [primary key, default: 'gen_random_uuid()']
  project_id uuid [ref: > projects.id, not null]
  check_meta boolean [default: true]
  title_min_length integer [default: 30]
  title_max_length integer [default: 60]
  description_min_length integer [default: 120]
  description_max_length integer [default: 160]
  created_at timestamp [default: 'now()']
  updated_at timestamp [default: 'now()']
}

// Used by the backend: SEO check jobs
Table seo_check_jobs {
  id uuid [primary key, default: 'gen_random_uuid()']
  user_id uuid [ref: > profiles.id, not null]
  project_id uuid [ref: > projects.id, not null]
  crawl_results_id uuid [ref: > crawl_results.id]
  status text [default: 'pending', note: 'pending, running, completed, failed, cancelled']
  progress integer [default: 0]
  error_message text
  started_at timestamp
  completed_at timestamp
  created_at timestamp [default: 'now()']
}

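The tricky part: separating crawl_results and crawl_data

At first I considered putting everything in a single table, but I split it for the following reasons.

Problem: raw_html bloat

The crawl_data table stores the raw HTML used for SEO analysis.
A single page can be tens to hundreds of KB, so the table grows rapidly as the number of pages increases.

Solution: split by role

Table	Role	Data size
`crawl_results`	Summary of crawl results (for display)	Light
`crawl_data`	Raw data per page (for analysis)	Heavy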

// crawl_results: summary information
Table crawl_results {
  id uuid [primary key]
  project_id uuid
  status text [note: 'in_progress, completed, failed']
  total_pages integer
  successful_pages integer
  failed_pages integer
  // ...
}

// crawl_data: raw data per page
Table crawl_data {
  id uuid [primary key]
  crawl_results_id uuid [ref: > crawl_results.id]
  page_url text
  raw_html text [note: '← this is the heavy part']
  status_code integer
  // ...
}

This way, the UI fetches only the lightweight crawl_results when rendering, and crawl_data is accessed only during SEO analysis.

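SEO Scoring Logic

Why I chose a deduction-based approach

There are two ways to score: additive and deductive.

Approach	Idea
Additive	Start from 0 and add points for what is good
Deductive	Start from 100 and subtract points for what is bad

I went with the deductive approach, for two reasons.

1. The definition of a perfect score is clear

With an additive approach, you run into the question "what is the maximum score?"
Every time a check item is added, the maximum changes, which makes it hard to compare against past scores.

2. It fits how products should be thought of

A product released to the world should, in principle, be worth 100 points.
I figured that feedback of the form "this is missing, so points are deducted" makes the points to improve clearer.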

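Check items and weights

Note: the following is my own reasoning. I chose the weights by feel.

Seven items are checked, and the score is out of 100.

Check item	Deduction	Condition
title tag	-20	Not set
meta description	-20	Not set
OGP (og:title)	-20	Not set
Twitter Cards	-20	twitter:title not set
canonical URL	-5	Not set
meta keywords	-5	Not set, or none of the keywords appear in the title
Status code	-10	Anything other than 200

Reasoning behind the weights:

title / description / OGP / Twitter Cards (-20 each): the most important elements for SEO
canonical (-5): important, but not fatal if missing
keywords (-5): carries less weight in modern SEO
Status code (-10): anything other than 200 is a problem, though it may just be a redirect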
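
The deduction table above can also be written as data, which makes it easy to confirm that the weights add up to exactly 100. This is only a minimal sketch: the `DEDUCTIONS` constant and the `totalDeduction` helper are illustrative names and do not appear in the actual implementation.

```typescript
// Hypothetical constant mirroring the deduction table (names are illustrative).
const DEDUCTIONS = {
  title: 20,
  metaDescription: 20,
  ogTitle: 20,
  twitterTitle: 20,
  canonical: 5,
  keywords: 5,
  statusCode: 10,
} as const;

type CheckName = keyof typeof DEDUCTIONS;

// Sum the deductions for a list of failed checks.
function totalDeduction(failed: CheckName[]): number {
  return failed.reduce((sum, check) => sum + DEDUCTIONS[check], 0);
}

// A page that fails every check loses all 100 points.
const allChecks = Object.keys(DEDUCTIONS) as CheckName[];
const worstCaseScore = 100 - totalDeduction(allChecks);

// A page that only lacks og:title and twitter:title loses 40 points.
const missingSocialTags = 100 - totalDeduction(["ogTitle", "twitterTitle"]);
```

Because the weights sum to 100, a page that fails every check lands exactly on 0.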

Implementation

Parsing HTML with JSDOM

The crawled raw_html is parsed with JSDOM, and each element is extracted.

import { JSDOM } from "jsdom";

// Parse the raw HTML and extract SEO meta information
for (const item of rawHTMLsAndPageUrl) {
  const { pageUrl, rawHTML } = item;
  const dom = new JSDOM(rawHTML);
  const doc = dom.window.document;

  // Run each check function...
}

Check functions

Here are the check functions actually in use.

// Check the title tag
async function checkMetaTitle(doc: Document) {
  const titleElement = doc.querySelector("title");
  const titleText = titleElement?.textContent || "";

  return {
    title_text: titleText,
    title_length: titleText.length,
    title_has_keywords: checkTitleHasMetaKeywords(doc, titleText),
  };
}

// Check whether the title contains any of the meta keywords
function checkTitleHasMetaKeywords(doc: Document, title: string): boolean {
  const metaKeywords = doc
    .querySelector('meta[name="keywords"]')
    ?.getAttribute("content");
  if (!metaKeywords) return false;

  // Drop empty entries (e.g. from "a,,b"): includes("") is always true
  const keywords = metaKeywords
    .split(",")
    .map((k) => k.trim().toLowerCase())
    .filter((k) => k.length > 0);
  const titleLower = title.toLowerCase();

  return keywords.some((keyword) => titleLower.includes(keyword));
}

// Check the meta description
async function checkMetaDescription(doc: Document) {
  const metaDescription = doc.querySelector('meta[name="description"]');
  const descriptionText = metaDescription?.getAttribute("content") || "";

  return {
    meta_description_text: descriptionText,
    meta_description_length: descriptionText.length,
  };
}

// Check the canonical URL
async function checkCanonicalUrl(doc: Document) {
  const canonicalLink = doc.querySelector('link[rel="canonical"]');
  return {
    canonical_url: canonicalLink?.getAttribute("href") || "",
  };
}

// Check OGP tags
async function checkOpenGraphTags(doc: Document) {
  return {
    og_title: doc.querySelector('meta[property="og:title"]')?.getAttribute("content") || "",
    og_description: doc.querySelector('meta[property="og:description"]')?.getAttribute("content") || "",
    og_image: doc.querySelector('meta[property="og:image"]')?.getAttribute("content") || "",
  };
}

// Check Twitter Cards
async function checkTwitterCards(doc: Document) {
  return {
    twitter_cards: {
      twitter_title: doc.querySelector('meta[name="twitter:title"]')?.getAttribute("content") || "",
      twitter_description: doc.querySelector('meta[name="twitter:description"]')?.getAttribute("content") || "",
      twitter_image: doc.querySelector('meta[name="twitter:image"]')?.getAttribute("content") || "",
    },
  };
}

// Extract keywords
function extractKeywords(doc: Document) {
  const metaKeywords = doc
    .querySelector('meta[name="keywords"]')
    ?.getAttribute("content");

  if (!metaKeywords) return { keywords: [] };

  const keywordList = metaKeywords
    .split(",")
    .map((keyword) => keyword.trim().toLowerCase())
    .filter((keyword) => keyword.length > 0);

  return { keywords: keywordList };
}
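
The keyword handling can be tried without a DOM. The sketch below is a simplified, DOM-free version of the same normalization (split on commas, trim, lowercase, drop empty entries); `parseKeywords` and `titleHasKeyword` are hypothetical names introduced only for this example. It also shows why empty entries need to be filtered out before matching against the title.

```typescript
// Hypothetical helper: normalize the raw content of a meta keywords tag.
function parseKeywords(content: string | null | undefined): string[] {
  if (!content) return [];
  return content
    .split(",")
    .map((keyword) => keyword.trim().toLowerCase())
    .filter((keyword) => keyword.length > 0);
}

// Hypothetical helper: true if any keyword appears in the title (case-insensitive).
function titleHasKeyword(title: string, keywords: string[]): boolean {
  const titleLower = title.toLowerCase();
  return keywords.some((keyword) => titleLower.includes(keyword));
}

// Empty entries (" " and the trailing comma) are dropped.
const keywords = parseKeywords("SEO, ,Site Checker,");
const hit = titleHasKeyword("Simple Site Checker", keywords);
const miss = titleHasKeyword("About us", keywords);

// An empty keyword would match every title, because includes("") is always true.
const emptyMatchesEverything = "About us".includes("");
```

Without the `filter` step, a stray comma in the keywords tag would make every title look like it contains a keyword.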

Score calculation

function calculateScore({
  titleCheckResult,
  descriptionCheckResult,
  canonicalUrlCheckResult,
  ogTagsCheckResult,
  twitterCardsCheckResult,
  keywordsCheckResult,
  statusCodeCheckResult,
}) {
  let score = 100;

  // Title check
  if (!titleCheckResult.title_text) score -= 20;

  // Description check
  if (!descriptionCheckResult.meta_description_text) score -= 20;

  // Canonical URL check
  if (!canonicalUrlCheckResult.canonical_url) score -= 5;

  // OGP check
  if (!ogTagsCheckResult.og_title) score -= 20;

  // Twitter Cards check
  if (!twitterCardsCheckResult.twitter_cards.twitter_title) score -= 20;

  // Keywords check
  if (keywordsCheckResult.keywords.length === 0) {
    score -= 5;
  } else {
    const titleLower = titleCheckResult.title_text.toLowerCase();
    const hasKeywords = keywordsCheckResult.keywords.some((keyword) =>
      titleLower.includes(keyword)
    );
    if (!hasKeywords) score -= 5;
  }

  // Status code check
  if (statusCodeCheckResult.status_code && statusCodeCheckResult.status_code !== 200) {
    score -= 10;
  }

  return Math.max(score, 0); // never goes below 0
}

Saving results to Supabase

The SEO check result for each page is saved to the seo_meta_details table.

const { data, error } = await supabase
  .from("seo_meta_details")
  .insert({
    seo_check_results_id: seoCheckResultId,
    page_url: pageUrl,
    title_text: titleCheckResult.title_text,
    title_length: titleCheckResult.title_length,
    title_has_keywords: titleCheckResult.title_has_keywords,
    meta_description_text: descriptionCheckResult.meta_description_text,
    meta_description_length: descriptionCheckResult.meta_description_length,
    canonical_url: canonicalUrlCheckResult.canonical_url,
    og_tags: ogTagsCheckResult,
    twitter_cards: twitterCardsCheckResult.twitter_cards,
    keywords: keywordsCheckResult.keywords,
    status_code: statusCode,
    score: score,
  })
  .select()
  .single();

Finally, the scores of all pages are summed and averaged, and the result is saved to the seo_check_results table.

// Calculate the average score (guard against an empty result set, which would yield NaN)
const totalScore = seoMetaResults.reduce((sum, result) => sum + result.score, 0);
const averageScore =
  seoMetaResults.length > 0 ? Math.round(totalScore / seoMetaResults.length) : 0;

// Save to seo_check_results
await supabase
  .from("seo_check_results")
  .update({
    total_score: averageScore,
    meta_score: averageScore,
    improvement_suggestions: generateImprovementMessage(averageScore),
    checked_at: new Date().toISOString(),
  })
  .eq("id", seoCheckResultId);
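
As a worked example of the averaging step, the sketch below computes a rounded mean and turns it into a message. The body of `generateImprovementMessage` is not shown in this article, so `improvementMessage` here is a purely hypothetical stand-in, and its thresholds (80 and 50) are assumptions chosen for illustration.

```typescript
// Hypothetical helper: rounded mean of page scores, or null when there are no pages.
function averageScore(scores: number[]): number | null {
  if (scores.length === 0) return null;
  const total = scores.reduce((sum, score) => sum + score, 0);
  return Math.round(total / scores.length);
}

// Illustrative stand-in for a message generator; the thresholds are assumptions.
function improvementMessage(average: number | null): string {
  if (average === null) return "No pages were analyzed.";
  if (average >= 80) return "Good: only minor fixes remain.";
  if (average >= 50) return "Fair: several meta tags are missing.";
  return "Poor: start with title and meta description.";
}

// (100 + 80 + 55) / 3 = 78.33..., which rounds to 78.
const avg = averageScore([100, 80, 55]);
const message = improvementMessage(avg);
```

Returning `null` for an empty list keeps "no data" distinguishable from a genuine score of 0.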

Receiving crawl completion via Webhook

Once the crawler finishes, we want the SEO check to start automatically.
To achieve this, I used a Supabase Database Webhook.

Overall flow

┌─────────────────────────────────────────────────────────────────┐
│ crawler-backend                                                 │
│                                                                 │
│  UPDATE sets crawl_results.status = "completed"                 │
└─────────────────────────────────────────────────────────────────┘
                          ↓
          Supabase Database Webhook (trigger)
          Detects the UPDATE → POSTs to the ngrok URL
                          ↓
┌─────────────────────────────────────────────────────────────────┐
│ ngrok (tunnel)                                                  │
│                                                                 │
│  https://xxxx.ngrok.io → forwarded to localhost:5050            │
└─────────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────────┐
│ seo-checker-backend (localhost:5050)                            │
│                                                                 │
│  Receives POST /completed-crawler                               │
│  Confirms status === "completed"                                │
│  Adds an SEO check job to the queue                             │
└─────────────────────────────────────────────────────────────────┘

Configuration in the Supabase Dashboard

The Database Webhook is configured from the Supabase dashboard.

Location: Supabase Dashboard → Integrations → Database Webhooks

Setting	Value
Table	`crawl_results`
Events	`UPDATE`
HTTP Method	`POST`
URL	`https://<ngrok-url>/completed-crawler`

When the webhook fires, Supabase POSTs JSON like the following:

{
  "type": "UPDATE",
  "table": "crawl_results",
  "record": {
    "id": "uuid",
    "user_id": "uuid",
    "project_id": "uuid",
    "status": "completed",
    ...
  },
  "old_record": { ... }
}
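
Since the request body arrives as untyped JSON, it can help to validate it before queuing a job. The following is a minimal sketch that covers only the fields shown in the payload above; `UpdatePayload`, `isCrawlCompletedPayload`, and `justCompleted` are hypothetical names, not part of Supabase or of the actual backend.

```typescript
// Only the fields shown in the example payload are modeled here.
interface CrawlResultRecord {
  id: string;
  user_id: string;
  project_id: string;
  status: string;
}

interface UpdatePayload {
  type: "UPDATE";
  table: string;
  record: CrawlResultRecord;
  old_record: Partial<CrawlResultRecord> | null;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Hypothetical type guard: an UPDATE on crawl_results whose new status is "completed".
function isCrawlCompletedPayload(body: unknown): body is UpdatePayload {
  if (!isRecord(body) || body.type !== "UPDATE" || body.table !== "crawl_results") {
    return false;
  }
  const record = body.record;
  if (!isRecord(record)) return false;
  return (
    typeof record.id === "string" &&
    typeof record.user_id === "string" &&
    typeof record.project_id === "string" &&
    record.status === "completed"
  );
}

// Hypothetical check: true only when the status changed to "completed" in this update.
function justCompleted(payload: UpdatePayload): boolean {
  return payload.old_record?.status !== "completed";
}
```

Comparing against `old_record` is one possible way to ignore later updates to a row that was already completed; whether that is needed depends on how often `crawl_results` rows are updated after completion.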

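Why use a Supabase Database Webhook

At first I considered calling seo-checker-backend directly from crawler-backend.
But that would tightly couple the two services.

Using a webhook brings the following benefits:

Loose coupling: crawler-backend does not need to know that seo-checker-backend exists
Reliability: retries are available on the Supabase side
Extensibility: other services (such as Slack notifications) are easy to add later

A challenge in local development: ngrok

During local development, Supabase cannot reach localhost:5050 directly.
So I use ngrok to create a tunnel.

When you sign up for ngrok, you are issued a fixed domain.
Starting ngrok with the following command makes the local server reachable through that domain.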

ngrok http --domain=<your-domain> 5050

After startup, set the issued domain (e.g. https://xxxx-xx-xx.ngrok-free.app) as the webhook URL in Supabase, and webhooks can be received even in the local environment.

Supabase Webhook
    ↓ POST
https://xxxx-xx-xx.ngrok-free.app/completed-crawler
    ↓ tunnel
localhost:5050/completed-crawler

Implementing the webhook endpoint

app.post("/completed-crawler", async (req, res) => {
  const { record } = req.body ?? {};

  // Only process when status is completed (also skips payloads without a record)
  if (record?.status !== "completed") {
    return res.status(200).json({ message: "Skipped" });
  }

  // Add an SEO check job to the queue.
  // seoCheckResultId is assumed to be the id of a seo_check_results row
  // created beforehand; that step is not shown here.
  await seoCheckQueue.addJob({
    userId: record.user_id,
    projectId: record.project_id,
    crawlResultDataId: record.id,
    seoCheckResultId: seoCheckResultId,
  });

  res.status(200).json({ message: "SEO check job queued" });
});

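How the queue works

SEO checks take time, so they are managed with a queue.
I wondered whether I should use Redis or Bull, but since this is a simple version, I kept it simple with a hybrid of a Supabase table and in-memory state.

Why I didn't use Redis

Honestly, I did want to use Redis + Bull (lol).
But I passed on it this time for the following reasons:

I wanted to keep the infrastructure simple
I wanted everything to be handled by Supabase alone
A concurrency of 1 was enough

If scaling becomes necessary in the future, I'll consider migrating to Redis + Bull.

State management

Where state lives	Purpose
Supabase `seo_check_jobs`	Job list and history (persistent, shown in the UI)
In-memory (`isProcessing`)	Prevents duplicate processing
In-memory (`processingJobs`)	Tracks concurrent processing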

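import { EventEmitter } from "node:events";

class SupabaseQueue extends EventEmitter {
  private isProcessing = false;           // true while the queue is being processed
  private concurrency = 1;                // number of concurrent jobs (currently 1)
  private processingJobs = new Set<string>(); // IDs of jobs being processed
}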

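Job lifecycle

pending → running → completed / failed

Status	Meaning
`pending`	Waiting in the queue
`running`	Processing
`completed`	Done
`failed`	Failed

Duplicate checks (idempotency)

A webhook may fire more than once.
I wanted to avoid the same job being created twice, so duplicates are checked in two stages.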

1. When adding a job

async addJob(data) {
  // Duplicate check: look for an active (pending / running) job
  // with the same crawl_results_id
  const { data: existingJob } = await supabase
    .from("seo_check_jobs")
    .select("id, status")
    .eq("crawl_results_id", data.crawlResultDataId)
    .eq("user_id", data.userId)
    .in("status", ["pending", "running"])
    .limit(1)
    .maybeSingle();

  // Skip if an active job already exists
  if (existingJob) {
    console.log("An active job already exists; skipping creation");
    return existingJob.id;
  }

  // Add a new job
  const { data: job } = await supabase
    .from("seo_check_jobs")
    .insert([{ status: "pending", ... }])
    .select()
    .single();

  this.processQueue();
  return job.id;
}

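2. When processing the queue

// Prevent duplicate processing: skip jobs that are already being processed
if (this.processingJobs.has(job.id)) {
  console.warn(`Job ${job.id} is already being processed - skipping`);
  continue;
}

// Update to running only if the job is still pending
const { data: claimed, error } = await supabase
  .from("seo_check_jobs")
  .update({ status: "running", started_at: new Date().toISOString() })
  .eq("id", job.id)
  .eq("status", "pending") // ← this is the key point
  .select("id");

// No row updated means another run already claimed the job
if (error || !claimed || claimed.length === 0) continue;

With this, it is safe even if the webhook fires multiple times.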
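
The conditional update in step 2 behaves like a compare-and-set: only the caller that finds the job still `pending` gets to move it to `running`. The sketch below imitates that behavior with an in-memory `Map` standing in for the `seo_check_jobs` table; the `jobs` map and `claimJob` function are hypothetical and exist only to illustrate the idea.

```typescript
type JobStatus = "pending" | "running" | "completed" | "failed";

interface Job {
  id: string;
  status: JobStatus;
}

// In-memory stand-in for the seo_check_jobs table (for illustration only).
const jobs = new Map<string, Job>([["job-1", { id: "job-1", status: "pending" }]]);

// Mimics: UPDATE ... SET status = 'running' WHERE id = ? AND status = 'pending'.
// Returns true only when the update actually matched a row.
function claimJob(jobId: string): boolean {
  const job = jobs.get(jobId);
  if (!job || job.status !== "pending") return false;
  job.status = "running";
  return true;
}

// The first caller wins; a second attempt on the same job is rejected.
const firstClaim = claimJob("job-1");
const secondClaim = claimJob("job-1");
```

In a real database the check and the update happen in a single statement, which is what makes the claim safe when two workers race for the same job.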

Queue processing flow

┌─────────────────────────────────────────────────────┐
│ processQueue()                                      │
│                                                     │
│ 1. Check isProcessing (prevents duplicate runs)     │
│ 2. Fetch one pending job (oldest created_at)        │
│ 3. Update status to running                         │
│ 4. Run processJob() asynchronously                  │
└─────────────────────────────────────────────────────┘
          ↓
┌─────────────────────────────────────────────────────┐
│ processJob()                                        │
│                                                     │
│ 1. Fetch crawl_data in batches (10 rows each)       │
│ 2. Run the SEO checks                               │
│ 3. Save results to seo_check_results                │
│ 4. Update status to completed / failed              │
│ 5. Call processQueue() again after 1 second         │
└─────────────────────────────────────────────────────┘

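Fetching crawl_data in batches

raw_html can be tens to hundreds of KB per page, so fetching every row in a single request makes for a very large response and strains memory.
So I fetch 10 rows at a time using Supabase's range().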

const batchSize = 10;
const totalBatches = Math.ceil(totalRecords / batchSize);

for (let batchIndex = 0; batchIndex < totalBatches; batchIndex++) {
  const offset = batchIndex * batchSize;

  // range() is inclusive on both ends, hence the -1
  const { data, error } = await supabase
    .from("crawl_data")
    .select("*")
    .eq("crawl_results_id", job.crawl_results_id)
    .order("created_at", { ascending: true })
    .range(offset, offset + batchSize - 1);

  // Stop on a failed batch instead of silently continuing with missing pages
  if (error) throw error;

  if (data && data.length > 0) {
    crawlData = crawlData.concat(data);
  }

  // Update the progress of the batch fetch
  const batchProgress = 10 + Math.floor(((batchIndex + 1) / totalBatches) * 30);
  await updateProgress(Math.min(batchProgress, 40));

  // Wait briefly to avoid hitting API rate limits
  if (batchIndex < totalBatches - 1) {
    await new Promise((resolve) => setTimeout(resolve, 100));
  }
}

With this, even 100 pages are fetched in 10 separate requests.
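
To make the offsets concrete, the sketch below lists the inclusive `[from, to]` pairs that would be passed to `range()` for a given record count. `batchRanges` is a hypothetical helper written for this example; clamping the last pair to the record count is an optional nicety and not something the loop above does.

```typescript
// Hypothetical helper: inclusive [from, to] pairs, one per batch.
function batchRanges(totalRecords: number, batchSize: number): Array<[number, number]> {
  if (batchSize <= 0) throw new RangeError("batchSize must be positive");
  const ranges: Array<[number, number]> = [];
  for (let from = 0; from < totalRecords; from += batchSize) {
    // Clamp the upper bound so the last batch does not run past the final row.
    const to = Math.min(from + batchSize, totalRecords) - 1;
    ranges.push([from, to]);
  }
  return ranges;
}

// 25 records in batches of 10 -> [0, 9], [10, 19], [20, 24]
const ranges = batchRanges(25, 10);
```

For 100 records and a batch size of 10 this yields exactly 10 pairs, from `[0, 9]` to `[90, 99]`.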

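Progress tracking

To show progress in the UI in real time, progress is updated in stages.

Progress	Step
10%	Started
10-40%	Fetching crawl_data in batches
50%	Data fetch complete
60%	SEO checks complete
80%	Saving results
100%	Done

This is how the check status can be presented in the UI. I also used Supabase Realtime so that completion can be detected in real time.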
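
The 10-40% band in the table comes from scaling the number of fetched batches into a 30-point range. The sketch below isolates that arithmetic as a pure function; `batchProgress` as a standalone function and its handling of `totalBatches <= 0` are assumptions made for this example.

```typescript
// Progress reported while fetching batches: starts above 10% and is capped at 40%.
function batchProgress(batchIndex: number, totalBatches: number): number {
  if (totalBatches <= 0) return 40; // assumption: nothing to fetch counts as done
  // Multiply before dividing so small integer inputs stay exact.
  const fetched = Math.floor(((batchIndex + 1) * 30) / totalBatches);
  return Math.min(10 + fetched, 40);
}

// With 4 batches the reported values are 17, 25, 32, 40.
const progressSteps = [0, 1, 2, 3].map((index) => batchProgress(index, 4));
```

The final batch always lands on 40, after which the fixed milestones (50, 60, 80, 100) take over.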

Periodic check (self-recovery)

If the server restarts, some jobs may be left in the pending state.
So a periodic check runs every 30 seconds.

startPeriodicCheck() {
  setInterval(() => {
    this.processQueue();
  }, 30000); // every 30 seconds
}

// Start the periodic check when the server starts
seoCheckQueue.startPeriodicCheck();

This prevents jobs from being missed.

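Future extensions

The current logic is simple, so I'm considering the following extensions.

Extension	Description
Length checks	Ideally 30-60 characters for the title and 120-160 for the description
Duplicate h1 check	Multiple h1 tags on one page can hurt SEO
Image alt attribute check	List images with no alt attribute
Broken internal link check	Detect links that return 404
Structured data check	Check whether JSON-LD is present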
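
As an illustration of the first extension, a length check could look roughly like the sketch below. The ranges reuse the defaults from the `seo_check_configs` table (30-60 and 120-160); `checkLength`, `LengthRange`, and `LengthVerdict` are hypothetical names, and how many points each verdict would cost is left open.

```typescript
interface LengthRange {
  min: number;
  max: number;
}

type LengthVerdict = "missing" | "too_short" | "ok" | "too_long";

// Defaults taken from the seo_check_configs table definition.
const TITLE_RANGE: LengthRange = { min: 30, max: 60 };
const DESCRIPTION_RANGE: LengthRange = { min: 120, max: 160 };

// Hypothetical helper: classify a text by its length.
// Uses String.length (UTF-16 code units), like title_length in the check functions.
function checkLength(text: string, range: LengthRange): LengthVerdict {
  const size = text.trim().length;
  if (size === 0) return "missing";
  if (size < range.min) return "too_short";
  if (size > range.max) return "too_long";
  return "ok";
}

const titleVerdict = checkLength("Home", TITLE_RANGE);
const descriptionVerdict = checkLength("x".repeat(140), DESCRIPTION_RANGE);
```

Here `titleVerdict` is `"too_short"` and `descriptionVerdict` is `"ok"`; both bounds of each range are treated as acceptable.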

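Summary

This time I wrote about SEO scoring using crawled data.
Here is a screenshot of the finished screen.

I also made the UI so that OGP images can be checked.

What I learned:

Table design should take into account the nature of the data and the access patterns
A deductive score gives a clear definition of a perfect score and makes points to improve easy to see
Supabase Database Webhooks enable integration between services
ngrok is a strong ally for local development
Simple queue management is possible without Redis

Next time, I'll write about visualizing the site structure from the crawl results with Canvas.

Thank you for reading to the end!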
