For PowerShell nerds: extracting text from PDFs with the PDF Services API from Adobe Developer Console

Last updated at 2024-08-09, posted at 2024-08-08

Purpose and background

I want to extract text from a PDF using PowerShell.
That's all.

Working environment

PS D:\hoge\powershell> $PSVersionTable

Name                           Value                   
----                           -----                   
PSVersion                      5.1.22621.3880          
PSEdition                      Desktop                 
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...} 
BuildVersion                   10.0.22621.3880         
CLRVersion                     4.0.30319.42000         
WSManStackVersion              3.0                     
PSRemotingProtocolVersion      2.3                     
SerializationVersion           1.1.0.1

Adobe Developer Console: https://developer.adobe.com/
Getting an access token: https://developer.adobe.com/document-services/docs/overview/pdf-services-api/gettingstarted/
General Adobe API reference: https://developer.adobe.com/document-services/docs/apis/

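What you need

CLIENT ID
CLIENT Secret
The PDF to analyze

To get the CLIENT ID and CLIENT Secret, register as an Adobe user, create a PDF Services API project in the Developer Console, and poke around the credentials section until they turn up.
For the detailed steps, this article by a battle-hardened volunteer should help.
https://qiita.com/ryos_adobe/items/7fb1041f908dd6139ca7

Also, no payment or credit card registration is required, and reportedly you can process up to 500 documents per month.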

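The quick-and-dirty result

Run the snippets one after another and it should probably work. I think.
First, get the access token needed to use the API.
Store the CLIENT ID and CLIENT Secret you obtained in variables.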

hoge.ps1

# Get the access token
$ClientId = "subarasikihittukararudo"
$ClientSecret = "shougekinoaruberuto"
$Header = @{}
$Header['Content-Type'] = 'application/x-www-form-urlencoded'
$bodys = @{}
$bodys['client_id'] = $ClientId
$bodys['client_secret'] = $ClientSecret

$url = 'https://pdf-services.adobe.io/token'
# The hashtable body is sent as form fields (client_id, client_secret)
$response = Invoke-RestMethod -Uri $url -Method POST -Headers $Header -Body $bodys
$AccessToken = $response.access_token

Next, get the URL for uploading the PDF you want to analyze.
Store the uploadUri and assetID from the response in variables.

hoge.ps1

# Get the upload URL for the target PDF
$Header = @{}
$Header['x-api-key'] = $ClientId
$Header['Content-Type'] = 'application/json'
$Header['Authorization'] = "Bearer " + $AccessToken

$bodys = @{}
$bodys["mediaType"] = "application/pdf"
$bodys = $bodys | ConvertTo-Json -Depth 100
$bodys = [Text.Encoding]::UTF8.GetBytes($bodys) # Without this line the text gets garbled

$url = 'https://pdf-services.adobe.io/assets'
$response = Invoke-WebRequest -Uri $url -Method POST -Headers $Header -Body $bodys -Verbose
$RESCON = $response.Content | ConvertFrom-Json
$UploadUrl = $RESCON.uploadUri
$assetID = $RESCON.assetID
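
The same headers get rebuilt by hand for almost every call in this walkthrough, so a small helper might cut the repetition. This is only a sketch: the function name `New-PdfServicesHeader` is made up for illustration and is not part of any Adobe module. It just packages the `x-api-key`, `Content-Type`, and `Authorization` values used above.

```powershell
# Hypothetical helper (the name is an assumption for this sketch):
# builds the header hashtable shared by the PDF Services calls in this article.
function New-PdfServicesHeader {
    param(
        [Parameter(Mandatory)] [string] $ClientId,
        [Parameter(Mandatory)] [string] $AccessToken,
        [string] $ContentType = 'application/json'
    )
    $header = @{
        'x-api-key'     = $ClientId
        'Authorization' = "Bearer $AccessToken"
    }
    # Pass -ContentType '' for calls that send no Content-Type header.
    if ($ContentType) { $header['Content-Type'] = $ContentType }
    return $header
}

# Usage (same values as the hand-built $Header):
# $Header = New-PdfServicesHeader -ClientId $ClientId -AccessToken $AccessToken
```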

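Read the PDF you want to analyze as binary data and upload it.
Send a PUT request to the uploadUri obtained above.
The PDF analyzed here is one that ships in Adobe's Program Files folder.
Its size is 182K, which I'll use later to check the upload.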

hoge.ps1

# Read the target PDF as binary data and PUT it to the upload URL.
# The request body is the raw file bytes; a multipart/form-data envelope
# would be stored as part of the asset, so none is built here.
$filePath = 'C:\hoge\powershell\Click on Change to select default PDF handler.pdf'
$ContentType = 'application/pdf'
$file = Get-Item -LiteralPath $filePath   # $file.Name is reused later for the output file name
$fileBinary = [IO.File]::ReadAllBytes($file.FullName)

$RESPONSE = Invoke-RestMethod -Uri $UploadUrl -Method PUT -Body $fileBinary -ContentType $ContentType -Verbose

Check whether the upload succeeded.

hoge.ps1

# Check the PDF upload
$Header = @{}
$Header['x-api-key'] = $ClientId
$Header['Content-Type'] = 'application/json'
$Header['Authorization'] = "Bearer " + $AccessToken

$url = 'https://pdf-services.adobe.io/assets/'
$url += $assetID
$url += '/metadata'
$response = Invoke-WebRequest -Uri $url -Method GET -Headers $Header -Verbose
$RESCON = $response.Content | ConvertFrom-Json

The response looks like this.
The reported size roughly matches the file.

PS D:\labo\powershell\ps1\AdobeAPI> $RESCON
entity                                      type              size
------                                      ----              ----
D21368CD6HOGEHOGE495E55@techacct.adobe.com application/pdf 187069
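
Rather than eyeballing the numbers, the size check can be made exact. The sketch below compares the local file length in bytes with the `size` field from the metadata response. The function name `Test-UploadedSize` is invented for this example.

```powershell
# Hypothetical helper (name is an assumption): compares the local file size
# with the size reported by the asset metadata call.
function Test-UploadedSize {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [long] $ReportedSize
    )
    $localSize = (Get-Item -LiteralPath $Path).Length
    [pscustomobject]@{
        LocalBytes    = $localSize
        ReportedBytes = $ReportedSize
        Difference    = $ReportedSize - $localSize
        Match         = ($localSize -eq $ReportedSize)
    }
}

# Usage with the variables from this article:
# Test-UploadedSize -Path $filePath -ReportedSize $RESCON.size
```

A non-zero `Difference` would suggest that something other than the file's own bytes was sent in the request body.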

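Using the assetID obtained together with the upload URL,
start the PDF analysis.
The official reference keeps mentioning something called a Job ID,
which I believe is the x-request-id response header (stored below in $x_request_id).
Sorry if I'm wrong. ~~Blame the reference, it's the reference's fault.~~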

hoge.ps1

# Start the PDF analysis
$Header = @{}
$Header['x-api-key'] = $ClientId
$Header['Content-Type'] = 'application/json'
$Header['Authorization'] = "Bearer " + $AccessToken

$bodys = @{}
$bodys['assetID'] = $assetID
$bodys = $bodys | ConvertTo-Json -Depth 100
$bodys = [Text.Encoding]::UTF8.GetBytes($bodys) # Without this line the text gets garbled

$url = 'https://pdf-services.adobe.io/operation/extractpdf'
$response = Invoke-WebRequest -Uri $url -Method POST -Headers $Header -Body $bodys -Verbose
$RESCON = $response.Content | ConvertFrom-Json
$x_request_id = $response.Headers.'x-request-id' # This is probably the job ID

A short break

hoge.ps1

# The analysis seems to take a little while, so pause briefly
Start-Sleep -Seconds 5

Use the status-check API to get the download URL for the analysis result.
This is where x_request_id is used.

hoge.ps1

# Read the PDF analysis job and get the download URL for the result
$Header = @{}
$Header['x-api-key'] = $ClientId
$Header['Authorization'] = "Bearer " + $AccessToken

$url = 'https://pdf-services.adobe.io/operation/extractpdf/'
$url += $x_request_id
$url += '/status'
$response = Invoke-WebRequest -Uri $url -Method GET -Headers $Header -Verbose
$RESCON = $response.Content | ConvertFrom-Json
$downloadUrl = $RESCON.content.downloadUri
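
A fixed five-second pause may not be enough for larger PDFs, so one alternative is to poll the status endpoint until the job finishes. The sketch below is an assumption-laden illustration. The function name `Wait-ExtractJob` is invented, and it assumes the status response carries a `status` field that becomes `done` on success and `failed` on error. The status call itself is passed in as a script block so the loop stays independent of the HTTP details.

```powershell
# Hypothetical polling helper (name and status values are assumptions).
# $GetStatus is a script block that returns the parsed status response.
function Wait-ExtractJob {
    param(
        [Parameter(Mandatory)] [scriptblock] $GetStatus,
        [int] $MaxAttempts = 20,
        [int] $DelaySeconds = 3
    )
    for ($i = 1; $i -le $MaxAttempts; $i++) {
        $state = & $GetStatus
        if ($state.status -eq 'done') { return $state }
        if ($state.status -eq 'failed') {
            throw "Extract job failed: $($state | ConvertTo-Json -Depth 10 -Compress)"
        }
        # Anything else is treated as "still running": wait and try again.
        Start-Sleep -Seconds $DelaySeconds
    }
    throw "Extract job did not finish after $MaxAttempts attempts."
}

# Usage with the $url and $Header built for the status check:
# $RESCON = Wait-ExtractJob -GetStatus {
#     (Invoke-WebRequest -Uri $url -Method GET -Headers $Header).Content | ConvertFrom-Json
# }
# $downloadUrl = $RESCON.content.downloadUri
```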

The finishing touch.
Send a request to the download URL and
the response contains the contents of the JSON file.
After that, save it however you like.

hoge.ps1

# Download the PDF analysis result
# The download URL is requested as-is; no extra headers are sent.
$url = $downloadUrl
$response = Invoke-WebRequest -Uri $url
$RESCON = $response.Content | ConvertFrom-Json

$yyyymmdd_hhmmss = Get-Date -Format 'yyyyMMdd_HHmmss'
# Swap only the extension; -replace '.PDF' is a regex and would also hit " PDF" inside the file name
$exportFilename = [IO.Path]::ChangeExtension($file.Name, '.json')
$data_path = "D:\hoge\powershell\result"
$data_path = Join-Path $data_path $yyyymmdd_hhmmss
New-Item -Force -Path $data_path -ItemType Directory | Out-Null
$data_path = Join-Path $data_path $exportFilename

# ConvertTo-Json serializes only 2 levels by default, so set -Depth to keep nested elements
$RESCON | ConvertTo-Json -Depth 100 | Out-File $data_path -Encoding utf8

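Let's compare the PDF used for this analysis with the JSON actually obtained.
You can see that the parts highlighted with a marker were extracted.
I think Bounds refers to coordinates within the PDF.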
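
To pull just the text out of the result, something like the following might work. It is a sketch run against a hand-written sample, not real API output: the sample assumes the JSON has an `elements` array whose items carry `Path`, `Text`, `Page`, and `Bounds`, and every value in it is invented for illustration.

```powershell
# Hypothetical sample mimicking the assumed shape of the extract result.
# The values are invented; replace $result with $RESCON to use real data.
$sampleJson = @'
{
  "elements": [
    { "Path": "//Document/H1", "Text": "Sample heading", "Page": 0, "Bounds": [72.0, 700.0, 300.0, 720.0] },
    { "Path": "//Document/Figure", "Page": 0, "Bounds": [72.0, 400.0, 300.0, 600.0] },
    { "Path": "//Document/P", "Text": "Sample paragraph.", "Page": 0, "Bounds": [72.0, 650.0, 500.0, 690.0] }
  ]
}
'@
$result = $sampleJson | ConvertFrom-Json

# Keep only the elements that have text and flatten them into rows.
$rows = $result.elements |
    Where-Object { $_.Text } |
    ForEach-Object {
        [pscustomobject]@{
            Page   = $_.Page
            Path   = $_.Path
            Text   = $_.Text
            Bounds = $_.Bounds -join ', '
        }
    }

$rows | Format-Table -AutoSize
```

Elements without a `Text` property (the figure in the sample) are skipped, so the table lists only the extracted strings alongside their page and coordinates.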

The number of requests you have made can apparently be checked on the screen below.

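Closing

This post covered the most orthodox usage.
There are plenty of other options, so if you're interested, do dig into them.
It would be nice to see more PowerShell articles.
~~You don't need Python for this; PowerShell can handle it just fine!~~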
