[PowerShell] Merging PDFs and extracting pages with iTextSharp

Last updated at 2021-04-11, posted at 2020-09-27

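I previously wrote an article about processing PDFs with pdftk. This time I reimplemented the merge and page-extraction operations from that article with iTextSharp. Because processing errors are surfaced properly as script errors, the behavior is easier to predict and safer.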
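As a rough illustration of what "surfaced as script errors" means: an exception thrown by a .NET method becomes a PowerShell error that try/catch can handle. The sketch below uses System.IO.File instead of iTextSharp so that it runs without the dll. The file path is made up and assumed not to exist.

```powershell
# Hypothetical path that is assumed not to exist.
$missing = Join-Path ([System.IO.Path]::GetTempPath()) ("no-such-file-{0}.pdf" -f [guid]::NewGuid())

$caught = $null
try {
    # A failing .NET call throws, and control jumps straight to catch.
    $stream = [System.IO.File]::OpenRead($missing)
    $stream.Close()
}
catch {
    # $_ is the ErrorRecord describing the failure.
    $caught = $_
}

if ($caught) {
    "Failed: {0}" -f $caught.Exception.Message | Write-Host -ForegroundColor Yellow
}
```

By contrast, an external command generally reports failure through its exit code ($LASTEXITCODE), which the script has to check explicitly.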

Environment:

PS> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      7.0.3
PSEdition                      Core
GitCommitId                    7.0.3
OS                             Microsoft Windows 10.0.18362
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

Preparation: obtaining iTextSharp.dll

Unzip the .nupkg available from "Download package" on the nuget page and use the dll found in its lib folder.

Here I place itextsharp.dll in a lib directory created in the same directory as $PROFILE and load it as follows.

Add-Type -Path ($PROFILE | Split-Path -Parent | Join-Path -ChildPath "lib\itextsharp.dll")
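The unzip step can also be scripted, since a .nupkg is a zip archive. This is a minimal sketch, assuming the package keeps its dlls somewhere under lib as described above. Expand-Nupkg and the paths in the usage comment are hypothetical names, not part of any library.

```powershell
Add-Type -AssemblyName System.IO.Compression.FileSystem

function Expand-Nupkg {
    param(
        [Parameter(Mandatory)][string]$NupkgPath,
        [Parameter(Mandatory)][string]$Destination
    )

    # .NET methods do not follow PowerShell's current location, so resolve both paths first.
    $source = (Resolve-Path -LiteralPath $NupkgPath).Path
    $target = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Destination)

    # Extract the whole archive; this throws if the target already contains the same files.
    [System.IO.Compression.ZipFile]::ExtractToDirectory($source, $target)

    # List the dlls found under lib so the right one can be copied by hand.
    Get-ChildItem -LiteralPath (Join-Path $target "lib") -Filter *.dll -Recurse -File
}

# Usage (hypothetical file and folder names):
# Expand-Nupkg -NupkgPath .\itextsharp.nupkg -Destination .\itextsharp_pkg
```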

Merging PDFs

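Files piped in, as in ls | Invoke-PdfConc hogehoge, are merged into a single file. Because you can pipe in a list that has already been sorted or otherwise arranged, the merge order is entirely up to you.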

function Invoke-PdfConc {
    param(
        [string]$outName = "concatenated_output"
    )

    $fullpath = Join-Path -Path $PWD.Path -ChildPath "$($outName).pdf"
    if (Test-Path $fullpath) {
        "'{0}.pdf' already exists!" -f $outName | Write-Error
        return
    }

    $pdfs = @($input | Where-Object Extension -eq ".pdf")
    if ($pdfs.Count -le 1) {
        return
    }

    $filestream = New-Object System.IO.FileStream($fullpath, [System.IO.FileMode]::Create)
    $document = New-Object iTextSharp.text.Document
    $pdfCopy = New-Object iTextSharp.text.pdf.PdfSmartCopy($document, $fileStream)
    $document.Open()

    "Concatenating as '{0}.pdf':" -f $outName | Write-Host -ForegroundColor Cyan
    $pdfs | ForEach-Object {
        " + {0}" -f $_.Name | Write-Host -ForegroundColor Cyan
        $reader = New-Object iTextSharp.text.pdf.PdfReader($_.Fullname)
        $pdfCopy.AddDocument($reader)
        $reader.Close()
    }

    $pdfCopy.Close()
    $document.Close()
    $filestream.Close()
}
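Since the merge order simply follows the pipeline order, sorting beforehand is where the control lies. Here is a small sketch of a natural sort, so that chapter2 comes before chapter10. The file names are made up, and plain strings stand in for the file objects that Get-ChildItem would return.

```powershell
# Hypothetical file names; in practice these would come from Get-ChildItem.
$names = "chapter10.pdf", "chapter2.pdf", "chapter1.pdf"

# Pad every run of digits to a fixed width so that 2 sorts before 10.
$sorted = $names | Sort-Object {
    [regex]::Replace($_, '\d+', { param($m) $m.Value.PadLeft(8, '0') })
}

$sorted   # chapter1.pdf, chapter2.pdf, chapter10.pdf

# With real files the same key can be built from the Name property:
# ls *.pdf | Sort-Object { [regex]::Replace($_.Name, '\d+', { param($m) $m.Value.PadLeft(8, '0') }) } | Invoke-PdfConc merged
```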

Extracting a page range from a PDF

This one extracts a page range from a PDF given the start and end pages, as in Invoke-PdfExtract -path hogehoge.pdf -from 3 -to 6 -outName out.

function Invoke-PdfExtract {
    <#
        .EXAMPLE
            Invoke-PdfExtract -path .\hugahuga.pdf -from 8 -to 20 # => hugahuga_08-20.pdf
    #>
    param (
        [parameter(Mandatory)][string]$path
        ,[int]$from = 1
        ,[int]$to
        ,[string]$outName
    )

    if ($to -and ($from -gt $to)) {
        Write-Error "Invalid range!"
        return
    }
    if (-not (Test-Path $path)) {
        Write-Error "Invalid file path!"
        return
    }
    $pdfItem = Get-Item $path
    if ($pdfItem.Extension -ne ".pdf") {
        Write-Error "Non-pdf file!"
        return
    }

    $pdfReader = New-Object iTextSharp.text.pdf.PdfReader($pdfItem.Fullname)
    if (-not $to) {
        $to = $pdfReader.NumberOfPages
    }
    elseif ($to -gt $pdfReader.NumberOfPages) {
        Write-Error "Range out of max page!"
        $pdfReader.Close()
        return
    }

    $outName = ($outName)?
        $outName + ".pdf" :
        "{0}_{1:d2}-{2:d2}.pdf" -f $pdfItem.Basename, $from, $to

    $outFullpath = Join-Path -Path $PWD.Path -ChildPath $outName
    if (Test-Path $outFullpath) {
        "'{0}' already exists!" -f $outName | Write-Error
        $pdfReader.Close()
        return
    }

    $document = New-Object iTextSharp.text.Document
    $filestream = New-Object System.IO.FileStream($outFullpath, [System.IO.FileMode]::Create)
    $pdfCopy = New-Object iTextSharp.text.pdf.PdfSmartCopy($document, $filestream)
    $document.Open()

    foreach ($page in $from..$to) {
        $pdfcopy.AddPage($pdfcopy.GetImportedPage($pdfReader, $page));
    }

    $document.Close()
    $pdfCopy.Close()
    $filestream.Close()
    $pdfReader.Close()

    "Extracted page {0}~{1} as '{2}'" -f $from, $to, $outName | Write-Host -ForegroundColor Cyan
}
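When these functions are called with $ErrorActionPreference = 'Stop' or from inside a try block, an exception partway through skips the Close calls at the end, and the output file handle can stay open. One way to guard against that is try/finally. The following is a sketch that is not part of the functions above: Write-WithCleanup is a hypothetical helper, the temp file name is made up, and a plain FileStream stands in for the iTextSharp objects.

```powershell
function Write-WithCleanup {
    param(
        [Parameter(Mandatory)][string]$Path,
        [Parameter(Mandatory)][scriptblock]$Body
    )

    $stream = New-Object System.IO.FileStream($Path, [System.IO.FileMode]::Create)
    try {
        # Hand the open stream to the caller's code.
        & $Body $stream
    }
    finally {
        # Runs on success and on failure, so the file handle is always released.
        $stream.Close()
    }
}

# Usage with a made-up temp file: write one byte, then remove the file.
$path = Join-Path ([System.IO.Path]::GetTempPath()) "cleanup-demo.bin"
Write-WithCleanup -Path $path -Body { param($s) $s.WriteByte(0x25) }
Remove-Item -LiteralPath $path
```

The same shape would apply to the PdfReader, Document and FileStream objects: create them before try and close them in finally.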

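I assumed that extracting the same range from multiple files is rarely needed, so pipeline input is not supported. If the need arises, it can be handled with ForEach-Object as follows.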

# Extract pages 2 to 10 of each file passed in as [filename]_02-10.pdf
PS > ls -file | % {Invoke-PdfExtract -path $_.fullname -from 2 -to 10}
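The default output name in that loop comes from the -f format operator with the d2 specifier, which zero-pads each page number to at least two digits. A short sketch with made-up values:

```powershell
# Same format string as in Invoke-PdfExtract; the base name and page numbers are made up.
$format = "{0}_{1:d2}-{2:d2}.pdf"

$short = $format -f "hugahuga", 2, 10    # hugahuga_02-10.pdf
$long  = $format -f "hugahuga", 8, 120   # hugahuga_08-120.pdf (d2 is a minimum width, not a limit)

$short
$long
```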

Complete code

Add-Type -Path ($PROFILE | Split-Path -Parent | Join-Path -ChildPath "lib\itextsharp.dll")

function Invoke-PdfConc {
    param(
        [string]$outName = "concatenated_output"
    )

    $fullpath = Join-Path -Path $PWD.Path -ChildPath "$($outName).pdf"
    if (Test-Path $fullpath) {
        "'{0}.pdf' already exists!" -f $outName | Write-Error
        return
    }

    $pdfs = @($input | Where-Object Extension -eq ".pdf")
    if ($pdfs.Count -le 1) {
        return
    }

    $filestream = New-Object System.IO.FileStream($fullpath, [System.IO.FileMode]::Create)
    $document = New-Object iTextSharp.text.Document
    $pdfCopy = New-Object iTextSharp.text.pdf.PdfSmartCopy($document, $fileStream)
    $document.Open()

    "Concatenating as '{0}.pdf':" -f $outName | Write-Host -ForegroundColor Cyan
    $pdfs | ForEach-Object {
        " + {0}" -f $_.Name | Write-Host -ForegroundColor Cyan
        $reader = New-Object iTextSharp.text.pdf.PdfReader($_.Fullname)
        $pdfCopy.AddDocument($reader)
        $reader.Close()
    }

    $pdfCopy.Close()
    $document.Close()
    $filestream.Close()
}

function Invoke-PdfExtract {
    <#
        .EXAMPLE
            Invoke-PdfExtract -path .\hugahuga.pdf -from 8 -to 20 # => hugahuga_08-20.pdf
    #>
    param (
        [parameter(Mandatory)][string]$path
        ,[int]$from = 1
        ,[int]$to
        ,[string]$outName
    )

    if ($to -and ($from -gt $to)) {
        Write-Error "Invalid range!"
        return
    }
    if (-not (Test-Path $path)) {
        Write-Error "Invalid file path!"
        return
    }
    $pdfItem = Get-Item $path
    if ($pdfItem.Extension -ne ".pdf") {
        Write-Error "Non-pdf file!"
        return
    }

    $pdfReader = New-Object iTextSharp.text.pdf.PdfReader($pdfItem.Fullname)
    if (-not $to) {
        $to = $pdfReader.NumberOfPages
    }
    elseif ($to -gt $pdfReader.NumberOfPages) {
        Write-Error "Range out of max page!"
        $pdfReader.Close()
        return
    }

    $outName = ($outName)?
        $outName + ".pdf" :
        "{0}_{1:d2}-{2:d2}.pdf" -f $pdfItem.Basename, $from, $to

    $outFullpath = Join-Path -Path $PWD.Path -ChildPath $outName
    if (Test-Path $outFullpath) {
        "'{0}' already exists!" -f $outName | Write-Error
        $pdfReader.Close()
        return
    }

    $document = New-Object iTextSharp.text.Document
    $filestream = New-Object System.IO.FileStream($outFullpath, [System.IO.FileMode]::Create)
    $pdfCopy = New-Object iTextSharp.text.pdf.PdfSmartCopy($document, $filestream)
    $document.Open()

    foreach ($page in $from..$to) {
        $pdfcopy.AddPage($pdfcopy.GetImportedPage($pdfReader, $page));
    }

    $document.Close()
    $pdfCopy.Close()
    $filestream.Close()
    $pdfReader.Close()

    "Extracted page {0}~{1} as '{2}'" -f $from, $to, $outName | Write-Host -ForegroundColor Cyan
}

iText7...? (an aside)

As the nuget page mentioned above also notes, development of iTextSharp has ended and iText7 now appears to be the latest version.

iText 7 was built on nearly a decade of lessons learned from iText 5 (iTextSharp) development. It is a simpler, more performant and extensible library that is ready to handle the increased challenges of today's document workflows, one add-on at a time…

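In short, it says that this is a modern library improved in light of iTextSharp's shortcomings. Its dependencies seem complicated, though, and for now I could not get it to work. I will keep at it (the notes below are a personal memo).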

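You also need to obtain libraries that cannot be downloaded from the nuget page, starting with BouncyCastle.Crypto.dll (reference site).
Even when I followed the code of PSWritePDF, which also uses iText7, the dll versions seem to differ and for some reason I get errors...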
