More than 5 years have passed since last update.

Parsing a PDF into CSV-like rows with Ruby

This article summarizes one way to parse a PDF into CSV-like rows using Ruby.

pdf-reader

The gem used here is pdf-reader.

Install it with the following command.

gem install pdf-reader

For details on how to use pdf-reader, see the README.md on GitHub.

Implementation example

The source code for the example is shown below (it is also available on GitHub).

parser.rb
require 'pdf-reader'

File.open('example.pdf', 'rb') do |io|
  reader = PDF::Reader.new(io)
  pages = []

  # Parsing PDF
  reader.pages.each do |page|
    rows = []
    # Split the page text into individual lines
    lines = page.text.split("\n")

    lines.each do |line|
      # Split on runs of two or more spaces, trim each field,
      # and drop fields that end up blank
      fields = line.split(/ {2,}/).map(&:strip).reject(&:empty?)
      next if fields.empty?

      rows << fields
    end
    pages << rows
  end

  # Showing parsed data
  pages.each do |page|
    page.each { |row| p row }
  end
end

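Key points

  • pdf-reader's page.text method converts a PDF page into a single block of text.
  • Splitting that block on newlines breaks it into individual lines.
  • Splitting each line on two consecutive spaces then separates it into individual fields.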
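
The splitting steps listed above can be tried without a PDF at hand. The following is a minimal sketch, assuming the page text is already available as a string: `split_into_rows` is a hypothetical helper name, and `sample` is made-up text standing in for the return value of page.text. It uses `filter_map`, so it assumes Ruby 2.7 or later.

```ruby
# Hypothetical helper: turns the text of one page into an array of rows.
# `text` stands in for the string returned by page.text.
def split_into_rows(text)
  text.split("\n").filter_map do |line|
    # Split on runs of two or more spaces, trim each field, drop blank fields
    fields = line.split(/ {2,}/).map(&:strip).reject(&:empty?)
    # Returning nil for an empty line makes filter_map skip it
    fields unless fields.empty?
  end
end

# Made-up sample text, not taken from a real PDF
sample = "Invoice\n\nFROM:    [Company Name]      TO:   [Client Name]\nSubtotal        $ 83.00\n"

split_into_rows(sample).each { |row| p row }
```

Fields that contain a single space, such as "[Company Name]" or "$ 83.00", stay intact because only runs of two or more spaces are treated as separators.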

Parse result

Let's print the parsed result to the console.

ruby parser.rb
["Invoice"]
["DATE: [Enter date]"]
["INVOICE Invoice Number"]
["FROM:", "[Company Name]", "TO:", "[Client Name]"]
["[Email]", "[Client Email Address"]
["[Address 1]il Address", "[Address 1]"]
["[Address 2]", "[Address 2]ress 1"]
["[Phone]s 2", "[Phone] Address 2"]
["TERMS: T[Payment Terms]"]
["DUE:", "Due Datet Due Date]"]
["Item Description", "Quantity", "Price", "Amount"]
["Item1", "1.00"]
["$ 10.00", "$ 10.00"]
["Item2", "2.00"]
["$ 10.00", "$ 20.00"]
["Item3"]
["1.00", "$ 20.00", "$ 20.00"]
["Item4", "5.00"]
["$ 5.00", "$ 25.00"]
["Item5", "1.00"]
["$ 8.00", "$ 8.00"]
["$ 0.00"]
["$ 0.00"]
["$ 0.00"]
["$ 0.00"]
["$ 0.00"]
["Subtotal", "$ 83.00"]
["Tax"]
["BALANCE DUE", "$ 83.00"]
["Notes"]
["EClick here to add notes or terms of service.here"]

In a few places a row spills over onto the next line, but overall the PDF is parsed into CSV-like rows.
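
Because each parsed row is already an array of strings, one possible next step is to serialize the rows as actual CSV. The following is a rough sketch using Ruby's csv library, not part of the original parser.rb: `rows_to_csv` is a hypothetical helper name, and the sample rows are copied from the console output above.

```ruby
require 'csv'

# Hypothetical helper: serializes parsed rows (arrays of strings) as CSV text.
def rows_to_csv(rows)
  CSV.generate do |csv|
    rows.each { |row| csv << row }
  end
end

# A few rows copied from the console output above
rows = [
  ["Item Description", "Quantity", "Price", "Amount"],
  ["Item1", "1.00"],
  ["$ 10.00", "$ 10.00"]
]

puts rows_to_csv(rows)
```

The csv library quotes fields that contain commas, so no manual escaping is needed. Rows that spilled onto the next line during parsing remain separate CSV records and would still need to be merged by hand or by additional logic.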
