Reading text from a PDF file in Java

Posted at 2020-01-31

Overview

Extract text strings from a PDF file.

Note: Image data embedded in the PDF is out of scope for this article.

Environment

OS: Windows 7
Language: Java

Java setup

Create a Maven project and add the following to pom.xml:

<dependency>
	<groupId>org.apache.pdfbox</groupId>
	<artifactId>pdfbox</artifactId>
	<version>2.0.8</version>
</dependency>

Implementation

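import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextReader {
    public static void main(String[] args) {
        File file = new File("test.pdf");

        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(file)) {
            // Extracts text only
            PDFTextStripper pdfStripper = new PDFTextStripper();

            // Read text in visual order (top-left to bottom-right)
            pdfStripper.setSortByPosition(true);
            // Extract the text from the PDF
            String text = pdfStripper.getText(document);

            System.out.println(text);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}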

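The extracted text contains page numbers and whitespace (half-width spaces, full-width spaces, tabs), so running it through a cleansing step first makes later analysis easier.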
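
As a rough illustration of such a cleansing step, here is a minimal sketch using only the Java standard library. The class name TextCleanser and the method cleanse are hypothetical names introduced for this example, and the rule "a line containing only 1 to 4 digits is a page number" is an assumption that will not hold for every PDF, so adjust it to the documents you actually handle.

```java
import java.util.ArrayList;
import java.util.List;

class TextCleanser {

    // Hypothetical helper: normalizes whitespace and drops lines that look like page numbers.
    static String cleanse(String raw) {
        List<String> kept = new ArrayList<>();
        // \R matches any line break (\n, \r\n, or \r)
        for (String line : raw.split("\\R")) {
            // Collapse runs of half-width spaces, tabs, and full-width spaces (U+3000) into one space
            String normalized = line.replaceAll("[ \\t\\u3000]+", " ").trim();
            if (normalized.isEmpty()) {
                continue; // skip blank lines
            }
            if (normalized.matches("\\d{1,4}")) {
                continue; // assumption: a line of only 1 to 4 digits is a page number
            }
            kept.add(normalized);
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        String raw = "  Title\u3000of\tthe   document \r\n\r\n1\r\nBody text\u3000\u3000here\n";
        System.out.println(cleanse(raw));
    }
}
```

The string returned by pdfStripper.getText(document) could be passed to cleanse as-is. Note that the digit-only rule may also discard legitimate numeric lines, such as a table cell that holds only a number.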

Note: This approach does not work well for PDFs with vertical text.
