
[PowerShell] Getting the contents of an MS Word file without opening Word

Posted at 2019-09-23, last updated at 2020-11-17

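Background

In my daily work I needed to search across a large number of Microsoft Word files, but working with documents through COM objects has the following problems.

- It is slow (Word itself has to be started before the content can be read)
- Releasing the process after the run is tedious

I then came across an article (Qiita: Wordさんは今日もおつかれです) explaining that a .docx is essentially a compressed archive whose content is written in the document.xml inside it, and went through some trial and error based on that.

[Note] Everything below applies only to files in the .docx format.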

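Key points

- A .docx is essentially a .zip
- The body of the document is in word/document.xml inside the zip
- In document.xml, w:p is a paragraph and w:r is a run (a stretch of text with the same formatting), which seems to play roughly the role of a Range in VBA terms (I have not worked out the overall structure...)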
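
The points above can be checked directly. The following is a minimal sketch, separate from the function in the next section: it builds a tiny stand-in archive that holds only word/document.xml (the file name and the XML content are made up for illustration, and a real .docx contains many more parts), then lists the entries and reads the XML back.

```powershell
Add-Type -AssemblyName System.IO.Compression
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Hypothetical stand-in for a real .docx: a zip holding only word/document.xml.
$docx = Join-Path ([IO.Path]::GetTempPath()) ("sample_{0}.docx" -f [guid]::NewGuid())
$xml  = '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">' +
        '<w:body><w:p><w:r><w:t>hoge</w:t></w:r></w:p></w:body></w:document>'

# Write the stand-in archive.
$zip = [IO.Compression.ZipFile]::Open($docx, [IO.Compression.ZipArchiveMode]::Create)
try {
    $writer = New-Object IO.StreamWriter($zip.CreateEntry("word/document.xml").Open())
    $writer.Write($xml)
    $writer.Dispose()
}
finally { $zip.Dispose() }

# Reopen it read-only, the same way a real .docx would be read.
$zip = [IO.Compression.ZipFile]::OpenRead($docx)
try {
    $entryNames = @($zip.Entries | ForEach-Object { $_.FullName })
    $reader = New-Object IO.StreamReader($zip.GetEntry("word/document.xml").Open())
    $documentXml = $reader.ReadToEnd()
    $reader.Dispose()
}
finally { $zip.Dispose() }
Remove-Item $docx

$entryNames     # names of the parts inside the archive
$documentXml    # raw XML of the document body
```

Pointing the OpenRead part at a real .docx should list further entries next to word/document.xml, such as styles and relationship files.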

Code

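Add-Type -AssemblyName System.IO.Compression.FileSystem
function Get-TextOfDocxDocumant {
    <#
        .SYNOPSIS
        Returns the paragraph text of a .docx file without starting Word.
        .EXAMPLE
        Get-TextOfDocxDocumant .\test.docx
    #>
    param (
        [string]$path
    )

    $compressed = $null
    $reader = $null
    try {
        $fullPath = (Resolve-Path -Path $path -ErrorAction Stop).Path
        # A .docx is a zip archive, so open it read-only as one.
        $compressed = [IO.Compression.ZipFile]::OpenRead($fullPath)

        # The body text is stored in word/document.xml.
        $target = $compressed.Entries | Where-Object {$_.FullName -eq "word/document.xml"}
        $reader = New-Object IO.StreamReader($target.Open())
        $content = $reader.ReadToEnd()

        # One match per paragraph. "(?:\s[^>]*)?" keeps <w:pPr> from matching, and
        # "(?<!/)" skips self-closing empty paragraphs such as <w:p/>.
        $m = [regex]::Matches($content, "(?s)<w:p(?:\s[^>]*)?(?<!/)>.*?</w:p>")
        return [PSCustomObject]@{
            Status = "OK"
            # Strip every tag and keep only the text of each paragraph.
            Lines = @($m | ForEach-Object {$_.Value -replace "<[^>]+>", ""})
        }
    }
    catch {
        # Any failure ends up here. Typically the file is locked because it is open
        # in Word, but a missing path or a non-.docx file is reported the same way.
        return [PSCustomObject]@{
            Status = "FILEOPENED"
            Lines = @()
        }
    }
    finally {
        # Release the file handle even when an error occurred.
        if ($null -ne $reader) { $reader.Dispose() }
        if ($null -ne $compressed) { $compressed.Dispose() }
    }
}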

Usage

PS > (Get-TextOfDocxDocumant -path .\hogehoge.docx).Lines
ほげほげ
ふがふが
……

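What I learned

Extracting the text with regular expressions was easier than parsing the file as XML.
Below are my notes from wrestling with the XML structure.
- When the body text contains whitespace, the XML seems to be split at that point and the w:t element gets an xml:space="preserve" attribute, which PowerShell exposes as a space property with the value preserve. The code has to branch on whether space is present
- In the preserve case, the content can be retrieved with get_InnerText()
- Body text and tables are structured separately (tables are w:tbl elements that sit alongside the w:p paragraphs)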
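
For comparison, the XML-parsing route described in these notes can be sketched as follows. This is an illustration under assumptions, not the code I ended up using: the XML is a hand-written fragment (real files carry many more attributes) and the variable names are made up. It is meant to show where the branching comes from, and that reading InnerText on nodes selected through XPath avoids it.

```powershell
# Hand-written fragment for illustration; one w:t carries xml:space="preserve".
$xml = [xml]@'
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>hoge</w:t></w:r></w:p>
    <w:p><w:r><w:t xml:space="preserve">fuga </w:t></w:r><w:r><w:t>piyo</w:t></w:r></w:p>
  </w:body>
</w:document>
'@

# The w: prefix has to be registered before it can be used in XPath.
$ns = New-Object System.Xml.XmlNamespaceManager($xml.NameTable)
$ns.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main")

# Property-style access: as far as I can tell, a w:t without attributes comes back
# as a plain string, while one with xml:space="preserve" stays an XmlElement.
# That difference is what forces the branching mentioned in the notes.
$isPlainString = @($xml.SelectNodes("//w:r", $ns) | ForEach-Object { $_.t -is [string] })

# InnerText behaves the same in both cases, so no branching is needed here.
$lines = foreach ($p in $xml.SelectNodes("//w:body/w:p", $ns)) {
    $texts = $p.SelectNodes(".//w:t", $ns) | ForEach-Object { $_.InnerText }
    -join $texts
}

$isPlainString
$lines
```

Note that //w:body/w:p selects only top-level paragraphs. Paragraphs inside table cells sit under w:tbl, so covering them would need a broader query such as //w:p or a separate pass over the tables.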
