An Old-Timer Tinkers with Python (Part 39: Qiita Articles to PDF)

Posted at 2026-04-04

Batch-converting Qiita articles to PDF with Python

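I wanted to save my own Qiita articles to a local PC or similar, so I handed the job almost entirely to ChatGPT and had it write the Python code. This is a memo of how that went.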

Qiita API

Apparently this is done with the Qiita API, specifically the endpoint below.

https://qiita.com/api/v2/users/WatashiNoId/items
(Replace "WatashiNoId" with the Qiita user ID.)
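
As a minimal sketch of how that request can be put together, the helper below builds the URL, query parameters, and headers without sending anything. The function name `build_items_request` and the optional `token` argument are my own illustration and do not appear in the code ChatGPT produced. The `page` and `per_page` parameters are the ones used later in this article, and the Bearer header is, as far as I know, how Qiita API v2 accepts an access token.

```python
def build_items_request(user_id, page=1, per_page=100, token=None):
    """Return (url, params, headers) for one page of a user's items."""
    url = f"https://qiita.com/api/v2/users/{user_id}/items"
    params = {"page": page, "per_page": per_page}
    headers = {}
    if token:
        # Assumption: an access token is optional and sent as a Bearer token
        headers["Authorization"] = f"Bearer {token}"
    return url, params, headers


url, params, headers = build_items_request("WatashiNoId", page=2)
print(url, params, headers)
```

The three values can be handed to `requests.get(url, params=params, headers=headers, timeout=10)` when the request is actually sent.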

The response looks like the following. Only one article is shown, and it has no line breaks, so it is very hard to read; please bear with it.

[{"rendered_body":"\u003ch1 data-sourcepos=\"1:1-1:44\"\u003e\n\u003cspan id=\"sipサーバーをたてるその１\" class=\"fragment\"\u003e\u003c/span\u003e\u003ca href=\"#sip%E3%82%B5%E3%83%BC%E3%83%90%E3%83%BC%E3%82%92%E3%81%9F%E3%81%A6%E3%82%8B%E3%81%9D%E3%81%AE%EF%BC%91\"\u003e\u003ci class=\"fa fa-link\"\u003e\u003c/i\u003e\u003c/a\u003eSIPサーバーをたてる（その１）\u003c/h1\u003e\n\u003cp  ............. }]
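
Since the raw response is hard to read, here is a small sketch of one way to look at it: parse the JSON, keep a few fields, and print them indented. The helper name `summarize_items` is hypothetical, and the `sample` string is a heavily shortened stand-in that I made up for illustration (only the title is taken from the output above); the field names `title`, `url`, `created_at`, and `rendered_body` are the ones used by the final code in this article.

```python
import json


def summarize_items(raw_json, keys=("title", "url", "created_at")):
    """Parse the API response text and keep only a few readable fields."""
    items = json.loads(raw_json)
    slim = [{k: item.get(k) for k in keys} for item in items]
    # ensure_ascii=False keeps Japanese titles readable instead of \uXXXX
    return json.dumps(slim, ensure_ascii=False, indent=2)


# Hypothetical, shortened stand-in for a real response
sample = (
    '[{"title": "SIPサーバーをたてる（その１）", '
    '"url": "https://example.com/a", '
    '"created_at": "2020-01-01T00:00:00+09:00", '
    '"rendered_body": "\\u003ch1\\u003e...\\u003c/h1\\u003e"}]'
)
print(summarize_items(sample))
```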

Note that everything here is run on Ubuntu.

Source code

NG1

The code from the first hand-off to ChatGPT.

import requests
import markdown
import pdfkit
import os

USER = "qiita_user_id"  # <- change this
OUT_DIR = "qiita_pdf"
os.makedirs(OUT_DIR, exist_ok=True)

page = 1
per_page = 100

while True:
    url = f"https://qiita.com/api/v2/users/{USER}/items"
    res = requests.get(url, params={"page": page, "per_page": per_page})

    if res.status_code != 200:
        break

    items = res.json()
    if not items:
        break

    for item in items:
        title = item["title"].replace("/", "_")
        body_md = item["body"]

        html = markdown.markdown(body_md, extensions=["fenced_code"])
        html_full = f"""
        <html>
        <head>
        <meta charset="utf-8">
        <style>
        body {{ font-family: sans-serif; }}
        pre {{ background: #f5f5f5; padding: 10px; }}
        </style>
        </head>
        <body>
        <h1>{title}</h1>
        {html}
        </body>
        </html>
        """

        pdf_path = os.path.join(OUT_DIR, f"{title}.pdf")
        pdfkit.from_string(html_full, pdf_path)

        print(f"Generated: {pdf_path}")

    page += 1

This code relies on "wkhtmltopdf", which creates PDFs from HTML, but I could not install it with "apt install -y wkhtmltopdf". Doing exactly what ChatGPT told me, I also ran

wget https://github.com/wkhtmltopdf/packaging/releases/download/0.12.6-1/wkhtmltox_0.12.6-1.bookworm_amd64.deb

but it still could not be found.
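
In hindsight, a quick check of whether the binary is visible on PATH would have made the failure clearer before running the whole script. The following is only a sketch using the standard library; the helper name `find_tool` is my own, and it only reports whether the executable can be found, it does not install anything.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable on PATH, or None if missing."""
    return shutil.which(name)


path = find_tool("wkhtmltopdf")
if path is None:
    print("wkhtmltopdf not found on PATH; pdfkit cannot render PDFs")
else:
    print(f"wkhtmltopdf found at {path}")
```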

NG2

ChatGPT then told me to use WeasyPrint. First, the preparation.

$ apt install -y \
  libcairo2 \
  libpango-1.0-0 \
  libpangocairo-1.0-0 \
  libgdk-pixbuf-2.0-0 \
  libffi-dev \
  fonts-noto-cjk
$ pip install weasyprint

The code is below.

import requests
import markdown
from weasyprint import HTML
import os
import re

USER = "WatashiNoId"
OUT_DIR = "qiita_pdf"
os.makedirs(OUT_DIR, exist_ok=True)

def safe_filename(name):
    return re.sub(r'[\\/:*?"<>|]', '_', name)

page = 1
per_page = 100

while True:
    url = f"https://qiita.com/api/v2/users/{USER}/items"
    res = requests.get(url, params={"page": page, "per_page": per_page})

    if res.status_code != 200:
        break

    items = res.json()
    if not items:
        break

    for item in items:
        title = item["title"]
        filename = safe_filename(title)
        body_md = item["body"]

        html = markdown.markdown(
            body_md,
            extensions=["fenced_code", "tables"]
        )

        html_full = f"""
        <html>
        <head>
        <meta charset="utf-8">
        <style>
        body {{
            font-family: "Noto Sans CJK JP", sans-serif;
            line-height: 1.6;
        }}
        h1 {{
            border-bottom: 2px solid #ddd;
            padding-bottom: 4px;
        }}
        pre {{
            background: #f5f5f5;
            padding: 10px;
            overflow-x: auto;
        }}
        code {{
            font-family: monospace;
        }}
        </style>
        </head>
        <body>
        <h1>{title}</h1>
        {html}
        </body>
        </html>
        """

        pdf_path = os.path.join(OUT_DIR, f"{filename}.pdf")

        HTML(
            string=html_full,
            base_url="."
        ).write_pdf(pdf_path)

        print(f"Generated: {pdf_path}")

    page += 1

PDFs were generated, but content such as very long lines and wide images was cut off at the right edge.

Getting to the final version

From here it was a long wrestling match with ChatGPT; each time, the code it produced was not quite right. The main points I kept revising were:

CSS matched to Qiita's look
Rendering Python code listings properly
Getting photos to appear as images
Dealing with runtime warnings

OK (the final source code, for now)

It creates one PDF per article, named after the article title, in a given folder. The explanations I got from ChatGPT are included as comments in the code.

# -*- coding: utf-8 -*-
import os
import re
import base64
import requests
import urllib.parse
from weasyprint import HTML

# ===== Settings =====
QIITA_USER = "WatashiNoId" # Qiita user ID
OUT_DIR = "qiita_pdf"      # Output folder for the PDFs
PER_PAGE = 100
API_URL = f"https://qiita.com/api/v2/users/{QIITA_USER}/items"

os.makedirs(OUT_DIR, exist_ok=True)

# ===== CSS (simplified Qiita-like style) =====
QIITA_CSS = """
@page { size: A4; margin: 20mm 15mm; }
body {
  font-family: -apple-system, BlinkMacSystemFont, "Segoe UI",
               "Noto Sans CJK JP", Meiryo, sans-serif;
  font-size: 11pt;
  line-height: 1.6;
  color: #222;
}
h1 {
  font-size: 20pt;
  border-bottom: 2px solid #e1e4e8;
  padding-bottom: 6px;
}
img {
  max-width: 100%;
  height: auto;
  display: block;
  margin: 10px 0;
}
pre {
  background: #f6f8fa;
  padding: 10px;
  white-space: pre-wrap;
  word-break: break-word;
}
code {
  font-family: ui-monospace, SFMono-Regular, Menlo, Consolas, monospace;
  background: #f6f8fa;
  padding: 0.1em 0.3em;
  border-radius: 3px;
}
table {
  width: 100%;
  border-collapse: collapse;
  table-layout: fixed;
}
th, td {
  border: 1px solid #d0d7de;
  padding: 6px;
  word-break: break-word;
}
"""

# ===== Utility =====
def safe_filename(name):  # Replace characters that cannot be used in file names
    name = re.sub(r'[\\/:*?"<>|]', "_", name)
    return name[:200]     # Cap the file name at 200 characters, just in case

# ===== Qiita imgix -> original S3 URL =====
# Converts Qiita's imgix-wrapped image URLs back to the URL of the actual image
def normalize_qiita_image_url(url):
    url = url.replace("&amp;", "&")

    if "qiita-user-contents.imgix.net/" in url:
        encoded = url.split("qiita-user-contents.imgix.net/", 1)[1]
        decoded = urllib.parse.unquote(encoded)  # Undo the percent-encoding
        decoded = decoded.split("?", 1)[0]       # Drop the unneeded part after "?"
        return decoded

    return url

# ===== Convert an image to a data URI =====
def download_image_to_data_uri(url, timeout=10):
    try:
        url = normalize_qiita_image_url(url)

        headers = {
            "User-Agent": "Mozilla/5.0"
        }
        r = requests.get(url, headers=headers, timeout=timeout)  # Download the image
        r.raise_for_status()

        content_type = r.headers.get("Content-Type", "image/png")
        b64 = base64.b64encode(r.content).decode("ascii")  # Base64-encode for WeasyPrint
        return f"data:{content_type};base64,{b64}"

    except Exception as e:
        print(f"  [warn] image failed: {url} -> {e}")
        return None

# ===== Fetch all articles =====
items = []
page = 1

while True:
    params = {"page": page, "per_page": PER_PAGE}
    print(f"Fetching page {page} ...", end=" ")
    r = requests.get(API_URL, params=params, timeout=10)  # Timeout so a stalled request cannot hang the script
    r.raise_for_status()
    batch = r.json()
    print(len(batch))

    if not batch:
        break

    items.extend(batch)
    page += 1

print(f"Total items: {len(items)}")

# ===== Main processing =====
for item in items:  # Each element is one article dict returned by the Qiita API
    title = item.get("title", "untitled")
    url = item.get("url", "")  # Article URL
    created = item.get("created_at", "")[:10]  # Date part only (first 10 characters)

    body_html = item.get("rendered_body", "")  # Article body as HTML rendered by Qiita

    # --- Replace the src of <img> tags with data URIs ---
    # Process the <img> tags in the HTML one by one, swapping each image URL for a data URI
    def replace_img(match):
        tag = match.group(0)
        m = re.search(r'src\s*=\s*["\']([^"\']+)["\']', tag, re.IGNORECASE)
        if not m:
            return tag

        src = m.group(1)
        if src.startswith("data:"):
            return tag

        data_uri = download_image_to_data_uri(src)
        if not data_uri:
            return tag

        return re.sub(
            r'src\s*=\s*["\'][^"\']+["\']',
            f'src="{data_uri}"',
            tag,
            flags=re.IGNORECASE
        )

    body_html = re.sub(
        r'<img\b[^>]*>',
        replace_img,
        body_html,
        flags=re.IGNORECASE | re.DOTALL
    )  # Apply replace_img() to every <img> tag in the HTML body

    # ===== Assemble the HTML =====
    html = f"""
<html>
<head>
  <meta charset="utf-8">
  <style>{QIITA_CSS}</style>
</head>
<body>
  <h1>{title}</h1>
  <p><a href="{url}">{url}</a><br>{created}</p>
  {body_html}
</body>
</html>
"""

    pdf_path = os.path.join(OUT_DIR, safe_filename(title) + ".pdf")

    try:
        HTML(string=html, base_url=".").write_pdf(pdf_path)  # Pass the HTML string to WeasyPrint and save it as a PDF file
        print(f"Generated: {pdf_path}")
    except Exception as e:
        print(f"[error] {title}: {e}")
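
To make the imgix handling easier to follow, here is a worked example of `normalize_qiita_image_url()` on its own. The function body is the same as in the script; the `wrapped` URL is a made-up example (example.com is not a real image host), chosen only to show the shape of an imgix-wrapped URL.

```python
import urllib.parse


def normalize_qiita_image_url(url):
    """Turn an imgix-wrapped URL back into the original image URL."""
    url = url.replace("&amp;", "&")
    marker = "qiita-user-contents.imgix.net/"
    if marker in url:
        encoded = url.split(marker, 1)[1]        # Everything after the imgix host
        decoded = urllib.parse.unquote(encoded)  # Undo the percent-encoding
        return decoded.split("?", 1)[0]          # Drop the imgix query string
    return url


# Hypothetical wrapped URL, for illustration only
wrapped = (
    "https://qiita-user-contents.imgix.net/"
    "https%3A%2F%2Fexample.com%2Fimages%2Fphoto.png"
    "?auto=format&amp;q=75"
)
print(normalize_qiita_image_url(wrapped))
# -> https://example.com/images/photo.png
```

One thing to be aware of: because the query string is cut off after decoding, an original image URL that carries its own query string would lose it as well. URLs that are not imgix-wrapped are returned unchanged apart from the `&amp;` replacement.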

EOF
