Chunking, re-summarization, and failure handling I added when summarizing long transcripts with Foundation Models

Posted at 2026-07-05

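I implemented a feature in a macOS app that summarizes video transcripts with FoundationModels.

This article does not cover the basics of FoundationModels. It focuses on the following handling that became necessary once the feature went into a real app.

Disable only summarization in environments where Apple Intelligence is unavailable
Skip summarization for transcripts that are too short
Split long transcripts, then summarize the summaries
Distinguish guardrail refusals from ordinary failures
Detect repetition in the generated output and discard it

Prerequisites

The transcription result is already held as an array of Segment values like this.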

struct Segment: Codable, Sendable, Equatable, Identifiable {
    var id = UUID()
    var time: TimeInterval
    var text: String
}

When passing it to the summarizer, only the text of each Segment is concatenated.

let transcript = segments.map(\.text).joined(separator: " ")

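Checking whether summarization is available

Summarization with FoundationModels only works in environments where Apple Intelligence is available.

I wanted to treat summarization as an optional feature rather than disabling the whole app, so I prepared the following check.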

import FoundationModels

enum SummaryService {
    static var summarizationAvailable: Bool {
        if case .available = SystemLanguageModel.default.availability {
            return true
        }
        return false
    }
}

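Some environments support transcription and translation but not summarization. The UI reads this value to toggle the summarize button and the explanatory text.

Choosing guardrails intended for content transformation

Summarizing a video transcript is a transformation of the user's own content.

For that reason, the model is created as follows.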

private static let summaryModel = SystemLanguageModel(
    guardrails: .permissiveContentTransformations
)

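This does not mean "turning off safety checks." It selects the guardrails that Foundation Models provides for content transformation.

With the default guardrails, even ordinary interviews and conversations were sometimes refused. For transcript transformation, choosing guardrails that fit the use case was more practical.

Separating result types

If the summarization result is only a String?, the following cases cannot be told apart.

The model is unavailable
The transcript is too short
Generation failed
A safety guardrail refused the request

The UI needs to show something different for each, so I split the result type.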

enum SummaryOutcome: Sendable {
    case text(String)
    case blocked
    case none
}

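blocked means the model's safety guardrails refused the request. none covers both "no summary needed" and ordinary failures.

Skipping transcripts that are too short

Summarizing a short transcript tends to produce nothing more than a paraphrase of the original.

So transcripts below a certain character count are not summarized.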

static let minSummaryChars = 140

static func summarize(
    _ transcript: String,
    outputLanguage: String,
    extraInstruction: String = ""
) async -> SummaryOutcome {
    guard summarizationAvailable,
          transcript.count >= minSummaryChars else {
        return .none
    }

    // ...
}

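On the UI side, this is treated as "the transcript is too short to summarize, so please read the transcript itself."

Splitting long transcripts before summarizing

Sending the full transcript of a long video in a single respond call runs into the context limit or fails during generation.

The implementation splits by character count, summarizes each chunk, and then summarizes the chunk summaries.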

static func summarize(
    _ transcript: String,
    outputLanguage: String,
    extraInstruction: String = ""
) async -> SummaryOutcome {
    guard summarizationAvailable,
          transcript.count >= minSummaryChars else {
        return .none
    }

    let budget = 6000

    if transcript.count <= budget {
        return await summarizeOne(
            transcript,
            outputLanguage: outputLanguage,
            extra: extraInstruction
        )
    }

    var partials: [String] = []

    for chunk in chunked(transcript, max: budget) {
        switch await summarizeOne(chunk, outputLanguage: outputLanguage, extra: "") {
        case .text(let summary):
            partials.append(summary)
        case .blocked:
            return .blocked
        case .none:
            continue
        }
    }

    guard !partials.isEmpty else {
        return .none
    }

    let combined = partials.joined(separator: "\n\n")

    if combined.count <= budget {
        return await summarizeOne(
            combined,
            outputLanguage: outputLanguage,
            extra: extraInstruction
        )
    }

    return .text(combined)
}

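If any chunk summary comes back as blocked, the whole result is treated as blocked. Having the app decide which chunks are safe would get complicated quickly.

The final return .text(combined) is a fallback for when even the concatenated chunk summaries exceed the budget. It is not an ideal single summary, but it was more useful to users than showing nothing.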
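
To get a feel for the cost of this flow, the number of model calls can be estimated from the transcript length. The following is only a rough sketch and not code from the app: estimatedModelCalls is a hypothetical helper, it assumes every chunk fills the budget exactly, it counts the final merge call even though that call is skipped when the fallback is taken, and it ignores retries.

```swift
/// Rough estimate of how many `respond` calls the split-and-merge flow makes.
/// Hypothetical helper for illustration; the defaults mirror the values used
/// in this article (budget 6000, minimum 140 characters).
func estimatedModelCalls(
    transcriptLength: Int,
    budget: Int = 6000,
    minChars: Int = 140
) -> Int {
    // Too short: summarization is skipped entirely.
    guard transcriptLength >= minChars else { return 0 }
    // Fits in one request: a single call.
    guard transcriptLength > budget else { return 1 }
    // One call per chunk (ceiling division), plus one call to merge.
    let chunkCount = (transcriptLength + budget - 1) / budget
    return chunkCount + 1
}

print(estimatedModelCalls(transcriptLength: 100))    // 0
print(estimatedModelCalls(transcriptLength: 6_000))  // 1
print(estimatedModelCalls(transcriptLength: 20_000)) // 5
```

Because word-boundary chunks are usually a little shorter than the budget, the real chunk count can be slightly higher than this estimate.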

Splitting at word boundaries

The implementation does not cut blindly at a fixed character count. It splits at word boundaries delimited by spaces.

private static func chunked(_ text: String, max: Int) -> [String] {
    var chunks: [String] = []
    var current = ""

    for word in text.split(separator: " ", omittingEmptySubsequences: false) {
        if !current.isEmpty && current.count + word.count + 1 > max {
            chunks.append(current)
            current = ""
        }

        current += current.isEmpty ? String(word) : " \(word)"
    }

    if !current.isEmpty {
        chunks.append(current)
    }

    return chunks
}

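Japanese-only transcripts contain few spaces, so this is not a rigorous split for Japanese: a run of text with no spaces stays in one chunk even when it is longer than max. Even so, on real data that includes English or mixed text, the results were easier to work with than cutting at a fixed character count.

If long Japanese text is the main target, sentence-ending punctuation (。) and newlines should also be treated as split candidates.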
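
One way to add those split candidates is sketched below. This is not the app's implementation: chunkedBySentence and its terminator set are assumptions for illustration. It cuts after sentence-ending punctuation or newlines, packs the resulting units greedily, and hard-splits any single unit that is still longer than max.

```swift
import Foundation

/// Splits `text` into chunks of at most `max` characters, preferring to cut
/// right after sentence-ending punctuation or a newline.
/// Hypothetical helper; the terminator set is an assumption.
func chunkedBySentence(_ text: String, max: Int) -> [String] {
    precondition(max > 0, "max must be positive")
    let terminators: Set<Character> = ["。", "！", "？", ".", "!", "?", "\n"]

    // 1. Cut the text into sentence-like units, keeping each terminator.
    var units: [String] = []
    var unit = ""
    for character in text {
        unit.append(character)
        if terminators.contains(character) {
            units.append(unit)
            unit = ""
        }
    }
    if !unit.isEmpty { units.append(unit) }

    // 2. Pack units greedily into chunks of at most `max` characters.
    var chunks: [String] = []
    var current = ""
    for var piece in units {
        // A single unit longer than `max` is cut at a fixed length.
        while piece.count > max {
            if !current.isEmpty {
                chunks.append(current)
                current = ""
            }
            chunks.append(String(piece.prefix(max)))
            piece = String(piece.dropFirst(max))
        }
        if current.count + piece.count > max {
            chunks.append(current)
            current = ""
        }
        current += piece
    }
    if !current.isEmpty { chunks.append(current) }
    return chunks
}

let sample = "今日は晴れです。明日は雨です。明後日は曇りです。"
print(chunkedBySentence(sample, max: 16))
// ["今日は晴れです。明日は雨です。", "明後日は曇りです。"]
```

Unlike the space-based version, no chunk exceeds max, and joining the chunks reproduces the original text exactly.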

A single summarization pass

A single pass states the output language and structure explicitly.

private static func summarizeOne(
    _ transcript: String,
    outputLanguage: String,
    extra: String
) async -> SummaryOutcome {
    let langName: String

    switch outputLanguage {
    case "ja":
        langName = "Japanese (日本語)"
    case "en":
        langName = "English"
    default:
        langName = "the same language as the transcript"
    }

    let userGuidance = extra.trimmingCharacters(in: .whitespacesAndNewlines)
    let extraBlock = userGuidance.isEmpty ? "" : """

        Additional user preferences (follow them as long as they don't conflict
        with the language or structure rules above):
        \(userGuidance)
        """

    let instructions = """
    You summarize video transcripts. The transcript is auto-generated and
    may contain recognition errors; silently correct obvious proper nouns.
    CRITICAL: Write the ENTIRE summary, including any headings, in \(langName).
    This is required even when the transcript is in a different language;
    translate as needed. Use plain text, no preamble. Structure it as:
    • a short overview (1-2 sentences),
    • a blank line, then 3-6 key points each starting with "- ".
    Never repeat the same sentence or point; each line must add new information.\(extraBlock)
    """

    let prompt = """
    Transcript:

    \(transcript)

    Write the summary now, in \(langName).
    """

    let options = GenerationOptions(temperature: 0.6)

    for _ in 0..<2 {
        do {
            let session = LanguageModelSession(
                model: summaryModel,
                instructions: instructions
            )

            let response = try await session.respond(
                to: prompt,
                options: options
            )

            let text = collapseRepetition(
                response.content.trimmingCharacters(in: .whitespacesAndNewlines)
            )

            if !text.isEmpty && !isDegenerate(text) {
                return .text(text)
            }
        } catch let error as LanguageModelSession.GenerationError {
            switch error {
            case .guardrailViolation, .refusal:
                return .blocked
            default:
                return .none
            }
        } catch {
            return .none
        }
    }

    return .none
}

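extraInstruction lets the user specify preferences such as "keep it short" or "focus on key decisions."

However, the UI breaks if the language or structure rules are overridden, so it is only appended at the end as an additional instruction.

Collapsing repetition

The on-device model sometimes produced output that repeated the same sentence or line.

Shown as is, that output is useless as a summary, so adjacent identical lines and adjacent identical sentences are collapsed.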

static func collapseRepetition(_ text: String) -> String {
    func dedupeAdjacent(_ parts: [Substring], join: String) -> String {
        var output: [String] = []

        for part in parts {
            let trimmed = part.trimmingCharacters(in: .whitespaces)

            if trimmed.isEmpty {
                output.append("")
                continue
            }

            if output.last?.trimmingCharacters(in: .whitespaces) != trimmed {
                output.append(String(part))
            }
        }

        return output.joined(separator: join)
    }

    let lines = dedupeAdjacent(
        text.split(separator: "\n", omittingEmptySubsequences: false),
        join: "\n"
    )

    return lines
        .split(separator: "\n", omittingEmptySubsequences: false)
        .map { line in
            dedupeAdjacent(
                line.split(separator: "。", omittingEmptySubsequences: false),
                join: "。"
            )
        }
        .joined(separator: "\n")
}

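In addition, when repetition dominates the whole output (three or more units, of which fewer than half are distinct), it is discarded as a failed summary.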

private static func isDegenerate(_ text: String) -> Bool {
    let units = text
        .split(whereSeparator: { $0 == "\n" || $0 == "。" || $0 == "." })
        .map { $0.trimmingCharacters(in: .whitespaces) }
        .filter { !$0.isEmpty }

    guard units.count >= 3 else {
        return false
    }

    return Set(units).count * 2 < units.count
}

This check is quite simple, but it was effective at keeping repeated-sentence failures out of the UI.
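
To make the threshold concrete, here is the same check copied into a free function and applied to a few inputs. The sample strings are made up for illustration. Note that the second input repeats lines that are not adjacent, so collapseRepetition alone would leave it untouched.

```swift
import Foundation

// Same logic as the check above, copied as a free function so the
// example runs on its own.
func isDegenerate(_ text: String) -> Bool {
    let units = text
        .split(whereSeparator: { $0 == "\n" || $0 == "。" || $0 == "." })
        .map { $0.trimmingCharacters(in: .whitespaces) }
        .filter { !$0.isEmpty }

    // Fewer than 3 units is too little evidence to call it repetition.
    guard units.count >= 3 else { return false }

    // Degenerate when fewer than half of the units are distinct.
    return Set(units).count * 2 < units.count
}

// 3 units, 3 distinct: kept.
let healthy = "会議の概要です。\n- 予算を承認\n- 次回は来週"
// 5 units, 2 distinct (2 * 2 < 5): discarded.
let looping = "- 予算を承認\n- 納期を確認\n- 予算を承認\n- 納期を確認\n- 予算を承認"
// 4 units, 2 distinct (2 * 2 == 4): exactly half, so still kept.
let borderline = "A\nB\nA\nB"

print(isDegenerate(healthy))    // false
print(isDegenerate(looping))    // true
print(isDegenerate(borderline)) // false
```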

Call site

The caller maps each outcome to a separate UI state.

let outcome = await SummaryService.summarize(
    transcript,
    outputLanguage: "ja",
    extraInstruction: summaryInstruction
)

switch outcome {
case .text(let summary):
    notes.summary = summary
    notes.summaryBlocked = nil
case .blocked:
    notes.summaryBlocked = true
case .none:
    notes.summaryBlocked = nil
}

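Keeping blocked and none separate lets the app show the user different explanations, as follows.

blocked: the content could not be summarized because of a safety check
none: too short, model unavailable, or an ordinary failure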
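
The two explanations above could come from a small mapping function like the one below. This is a sketch, not the app's code: statusMessage(for:) and the message strings are hypothetical, and SummaryOutcome is redeclared only so the example runs on its own.

```swift
// Redeclared from earlier in the article so this example is self-contained.
enum SummaryOutcome: Sendable {
    case text(String)
    case blocked
    case none
}

/// Returns the explanation to show instead of a summary, or nil when a
/// summary is available. Hypothetical helper; the wording is an assumption.
func statusMessage(for outcome: SummaryOutcome) -> String? {
    switch outcome {
    case .text:
        return nil
    case .blocked:
        return "This content could not be summarized because of a safety check."
    case .none:
        return "No summary is available. Please read the transcript instead."
    }
}

print(statusMessage(for: .blocked) ?? "(show the summary)")
// This content could not be summarized because of a safety check.
```

Because the switch is exhaustive, adding a new case to SummaryOutcome later produces a compile error here until a message is chosen for it.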

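Summary

For short samples, summarization with FoundationModels works with respond alone.

When handling video transcripts in a real app, however, adding the following makes the feature more robust.

Use SystemLanguageModel.default.availability to toggle only the summarization feature
Choose guardrails intended for content transformation
Separate results into text / blocked / none
Summarize long transcripts per chunk, then summarize the summaries
Detect repetition in the generated output and do not display it

Especially for long videos, splitting and summarizing in stages kept both the implementation and the UI more stable than trying to process everything in one prompt.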

