Extracting Text from a PDF in C#/VB.NET

Last updated at 2024-02-01. Posted at 2024-01-26.

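Introduction

In modern software development, processing PDF documents and extracting text from them are increasingly common requirements. Extracting the text is a key step whenever you need to retrieve data from PDF documents, analyze their content, or automate their processing. This article shows how to extract text from PDF documents using C# and VB.NET.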

Tools

Visual Studio 2022
Free Spire.PDF for .NET

This library lets you create, edit, and convert PDF files for free, but it limits the number of pages. Alternatively, you can request a free trial of the paid version.

Paid version: Spire.PDF for .NET

Installation

1. Download Free Spire.PDF for .NET.
2. Create a new project in Visual Studio.
3. In "Solution Explorer", right-click "References" and select "Add Reference" > "Browse".
4. Locate the dll file in the BIN folder and click "OK".

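Extracting All Text from a Page

Description

First, load the PDF file and get the page to extract from by its index. Next, create a "PdfTextExtractor" object and a "PdfTextExtractOptions" object. To extract all the text on the page, set the "IsExtractAllText" property of "PdfTextExtractOptions" to "true". Finally, call the "ExtractText()" method of "PdfTextExtractor" to extract the text.

Sample Code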

C#

using System.IO;
using Spire.Pdf;
using Spire.Pdf.Texts;

namespace ExtractTextFromPage
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a PdfDocument object
            PdfDocument doc = new PdfDocument();

            // Load the PDF file
            doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Sample.pdf");

            // Get the first page
            PdfPageBase page = doc.Pages[0];

            // Create a PdfTextExtractor object
            PdfTextExtractor textExtractor = new PdfTextExtractor(page);

            // Create a PdfTextExtractOptions object
            PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();

            // Set IsExtractAllText to true
            extractOptions.IsExtractAllText = true;

            // Extract the text from the page and write it to a file
            // (the verbatim string @"..." keeps the backslashes from being read as escape sequences)
            string text = textExtractor.ExtractText(extractOptions);
            File.WriteAllText(@"C:\Users\Administrator\Desktop\Result-1.txt", text);
        }
    }
}

VB.NET

Imports System
Imports System.IO
Imports Spire.Pdf
Imports Spire.Pdf.Texts
 
Namespace ExtractTextFromPage
    Class Program
        Shared Sub Main(ByVal args() As String)
            ' Create a PdfDocument object
            Dim doc As PdfDocument = New PdfDocument()

            ' Load the PDF file
            doc.LoadFromFile("C:\Users\Administrator\Desktop\Sample.pdf")

            ' Get the first page
            Dim page As PdfPageBase = doc.Pages(0)

            ' Create a PdfTextExtractor object
            Dim textExtractor As PdfTextExtractor = New PdfTextExtractor(page)

            ' Create a PdfTextExtractOptions object
            Dim extractOptions As PdfTextExtractOptions = New PdfTextExtractOptions()

            ' Set IsExtractAllText to True
            extractOptions.IsExtractAllText = True

            ' Extract the text from the page and write it to a file
            Dim text As String = textExtractor.ExtractText(extractOptions)
            File.WriteAllText("C:\Users\Administrator\Desktop\Result-1.txt", text)
        End Sub
    End Class
End Namespace
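
The sample above reads only the first page. As a rough sketch, the same calls could be repeated for every page by looping over doc.Pages. The namespace, the PageTextJoiner helper, the "--- Page N ---" separator, and the output file name are hypothetical choices made for this illustration. Keep in mind that the free version limits the number of pages, so a long document may not be processed in full.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using Spire.Pdf;
using Spire.Pdf.Texts;

namespace ExtractTextFromAllPages
{
    // Hypothetical helper (not part of Spire.PDF): joins per-page text
    // and puts a "--- Page N ---" header line before each page.
    static class PageTextJoiner
    {
        public static string Join(IList<string> pageTexts)
        {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < pageTexts.Count; i++)
            {
                sb.AppendLine("--- Page " + (i + 1) + " ---");
                sb.AppendLine(pageTexts[i]);
            }
            return sb.ToString();
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            // Load the PDF file (same sample path as above)
            PdfDocument doc = new PdfDocument();
            doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Sample.pdf");

            // One options object can be reused for every page
            PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();
            extractOptions.IsExtractAllText = true;

            // Extract the text of each page in order
            List<string> pageTexts = new List<string>();
            for (int i = 0; i < doc.Pages.Count; i++)
            {
                PdfTextExtractor textExtractor = new PdfTextExtractor(doc.Pages[i]);
                pageTexts.Add(textExtractor.ExtractText(extractOptions));
            }

            // Write all pages to one file (hypothetical output name)
            File.WriteAllText(@"C:\Users\Administrator\Desktop\Result-All.txt",
                              PageTextJoiner.Join(pageTexts));
            doc.Close();
        }
    }
}
```

Keeping the joining logic in a separate helper makes it possible to check the output layout without a PDF file at hand.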

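Extracting Part of the Text from a Page

Description

The steps are the same as in the method above. However, because only the text inside a specific region is needed rather than the full text, use the "ExtractArea" property of "PdfTextExtractOptions" to specify that region.

Sample Code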

C#

using Spire.Pdf;
using Spire.Pdf.Texts;
using System.IO;
using System.Drawing;

namespace ExtractTextFromRectangleArea
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a PdfDocument object
            PdfDocument doc = new PdfDocument();

            // Load the PDF file
            doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Sample.pdf");

            // Get the second page
            PdfPageBase page = doc.Pages[1];

            // Create a PdfTextExtractor object
            PdfTextExtractor textExtractor = new PdfTextExtractor(page);

            // Create a PdfTextExtractOptions object
            PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();

            // Set the rectangular area (x, y, width, height)
            extractOptions.ExtractArea = new RectangleF(0, 0, 900, 150);

            // Extract the text inside this rectangle and write it to a file
            string text = textExtractor.ExtractText(extractOptions);
            File.WriteAllText(@"C:\Users\Administrator\Desktop\Result-2.txt", text);
        }
    }
}

VB.NET

Imports Spire.Pdf
Imports Spire.Pdf.Texts
Imports System.IO
Imports System.Drawing
 
Namespace ExtractTextFromRectangleArea
    Class Program
        Shared Sub Main(ByVal args() As String)
            ' Create a PdfDocument object
            Dim doc As PdfDocument = New PdfDocument()

            ' Load the PDF file
            doc.LoadFromFile("C:\Users\Administrator\Desktop\Terms of Service.pdf")

            ' Get the second page
            Dim page As PdfPageBase = doc.Pages(1)

            ' Create a PdfTextExtractor object
            Dim textExtractor As PdfTextExtractor = New PdfTextExtractor(page)

            ' Create a PdfTextExtractOptions object
            Dim extractOptions As PdfTextExtractOptions = New PdfTextExtractOptions()

            ' Set the rectangular area (x, y, width, height)
            extractOptions.ExtractArea = New RectangleF(0, 0, 890, 170)

            ' Extract the text inside this rectangle and write it to a file
            Dim text As String = textExtractor.ExtractText(extractOptions)
            File.WriteAllText("Extracted.txt", text)
        End Sub
    End Class
End Namespace
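
The rectangles in the samples above use fixed numbers, which only fit pages of a similar size. As a minimal sketch, the region could instead be derived from the page's own dimensions. The AreaHelper class, the 0.25 fraction, and the output file name are hypothetical and exist only for this illustration. The sketch also assumes that the rectangle's origin is the top-left corner of the page, so it is worth checking the result against your own document.

```csharp
using System;
using System.Drawing;
using System.IO;
using Spire.Pdf;
using Spire.Pdf.Texts;

namespace ExtractTextFromTopOfPage
{
    // Hypothetical helper (not part of Spire.PDF): builds a rectangle that
    // spans the full page width and the top `fraction` of the page height.
    static class AreaHelper
    {
        public static RectangleF TopBand(SizeF pageSize, float fraction)
        {
            if (fraction <= 0f || fraction > 1f)
                throw new ArgumentOutOfRangeException(nameof(fraction));
            return new RectangleF(0f, 0f, pageSize.Width, pageSize.Height * fraction);
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            // Load the PDF file and get the second page, as in the sample above
            PdfDocument doc = new PdfDocument();
            doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Sample.pdf");
            PdfPageBase page = doc.Pages[1];

            // Assumption: (0, 0) is the top-left corner of the page,
            // so this covers the top quarter of the page
            PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();
            extractOptions.ExtractArea = AreaHelper.TopBand(page.Size, 0.25f);

            // Extract the text inside the computed rectangle
            PdfTextExtractor textExtractor = new PdfTextExtractor(page);
            string text = textExtractor.ExtractText(extractOptions);

            // Hypothetical output name
            File.WriteAllText(@"C:\Users\Administrator\Desktop\Result-TopBand.txt", text);
            doc.Close();
        }
    }
}
```

Because doc.Pages[1] is the second page, this sketch, like the samples above, expects a document with at least two pages.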
