
Extracting text from a PDF in VBA without iText or Acrobat

Last updated at 2023-08-30, posted at 2023-04-14

If you can use VBA, you can probably use Word too.

The idea: from Excel, launch Word in the background, let it convert the PDF, and bring the result into Excel.

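Introducing iTextSharp.dll turned out to require administrator privileges,

and Acrobat has to be installed and costs money.
So I use Word to convert the PDF to text and bring that text into Excel.
A reference to the Word library must be set:
Microsoft Word {ver} Object Library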

'******** Requires a reference to: Microsoft Word {ver} Object Library ********
Sub get_text_from_pdf()
    '**** Declarations and cleanup ****
    Dim fd As Office.FileDialog
    Dim wordApp As Word.Application
    Dim wordDoc As Word.Document
    Dim pdfFileName As String
    Dim outputFolder As String
    Dim extractedTextFileName As String
    Dim textLine As String
    Dim rowNum As Long
    Dim fileNo As Integer
    Cells.ClearContents    ' Clears every cell on the active sheet


    '**** File picker dialog ****
    Set fd = Application.FileDialog(msoFileDialogFilePicker)
    fd.AllowMultiSelect = False
    fd.Filters.Clear
    fd.Filters.Add "PDF files", "*.pdf"
    If fd.Show = -1 Then
        pdfFileName = fd.SelectedItems(1)
        ' The text file is written to the same folder as the PDF
        outputFolder = Left(pdfFileName, InStrRev(pdfFileName, "\"))
        extractedTextFileName = "ExtractedText" & Format(Now, "yyyymmddhhnnss") & ".txt"
    Else
        Exit Sub    ' Dialog was cancelled
    End If


    '**** Start Word hidden ****
    Set wordApp = CreateObject("Word.Application")
    wordApp.Visible = False


    '**** Extract the text and save it to a file ****
    ' Word converts the PDF when opening it; no blank document is needed first
    Set wordDoc = wordApp.Documents.Open(Filename:=pdfFileName, ReadOnly:=True)
    wordDoc.SaveAs2 Filename:=outputFolder & extractedTextFileName, FileFormat:=wdFormatText
    wordDoc.Close SaveChanges:=wdDoNotSaveChanges
    wordApp.Quit
    Set wordApp = Nothing


    '**** Open the text file, read it line by line, and write each line to a cell ****
    rowNum = 3
    fileNo = FreeFile    ' Use a free file number instead of hard-coding #1
    Open outputFolder & extractedTextFileName For Input As #fileNo
        Do While Not EOF(fileNo)
            Line Input #fileNo, textLine
            Cells(rowNum, 1).Value = textLine
            rowNum = rowNum + 1
        Loop
    Close #fileNo

End Sub
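
If setting the Word reference is not an option, late binding may work as an alternative. The following is only a sketch of the Word step under that assumption: the procedure name `save_pdf_as_text_late_bound` and its two parameters are hypothetical, and the two constants are written out as numbers because Word's named constants are not available without the reference.

```vba
' Hypothetical late-bound variant of the Word step.
' No reference to the Microsoft Word Object Library is needed.
Sub save_pdf_as_text_late_bound(ByVal pdfPath As String, ByVal textPath As String)
    Const WD_FORMAT_TEXT As Long = 2            ' Value of Word's wdFormatText
    Const WD_DO_NOT_SAVE_CHANGES As Long = 0    ' Value of Word's wdDoNotSaveChanges
    Dim wordApp As Object
    Dim wordDoc As Object

    ' Start Word hidden
    Set wordApp = CreateObject("Word.Application")
    wordApp.Visible = False

    ' Open the PDF, save it as plain text, then shut Word down
    Set wordDoc = wordApp.Documents.Open(Filename:=pdfPath, ReadOnly:=True)
    wordDoc.SaveAs2 Filename:=textPath, FileFormat:=WD_FORMAT_TEXT
    wordDoc.Close SaveChanges:=WD_DO_NOT_SAVE_CHANGES
    wordApp.Quit
    Set wordApp = Nothing
End Sub
```

It would be called with the full path of the PDF and the full path of the text file to create. The trade-off is that the editor offers no completion or compile-time checking for the Word objects.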

A word of caution

The macro leaves the extracted text file on disk.

If you do not need it, try adding logic to delete it.
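
A minimal sketch of such delete logic, assuming it runs after the text file has been closed. The helper name `delete_extracted_text` is hypothetical and not part of the original macro.

```vba
' Hypothetical helper: deletes the temporary text file if it exists.
Sub delete_extracted_text(ByVal textFilePath As String)
    ' Dir returns an empty string when no file matches the path
    If Len(Dir(textFilePath)) > 0 Then
        Kill textFilePath
    End If
End Sub
```

In get_text_from_pdf it could be called on the line after the file is closed, for example `delete_extracted_text outputFolder & extractedTextFileName`. Kill raises an error if the file is still open, so the call has to come after the Close statement.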
