@kskinaba

Parsing a PDF into CSV-like rows with Ruby


This article summarizes how to use Ruby to parse a PDF into CSV-like rows.

pdf-reader

The gem used here is pdf-reader.

Install it with the following command.

gem install pdf-reader

For details on how to use pdf-reader, see the README.md on GitHub.

Implementation example

The source code of the example is shown below (it is also available on GitHub).

parser.rb
require 'pdf-reader'

File.open('example.pdf', 'rb') do |io|
  reader = PDF::Reader.new(io)
  pages = []

  # Parsing PDF
  reader.pages.each do |page|
    rows = []
    # Separating a whole text
    t = page.text.split("\n")

    t.each do |s|
      # Formatting: split on two spaces ("\s" is a space in a double-quoted
      # string), trim every field, then drop the fields that are left empty
      ary = s.split("\s\s").map(&:strip).reject(&:empty?)
      next if ary.empty?

      rows << ary
    end
    pages << rows
  end

  # Showing parsed data
  pages.each do |page|
    page.each { |rows| p rows }
  end
end

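Key points

  • The page.text method of pdf-reader converts each PDF page into a single block of text.
  • Splitting that block on newlines breaks it into individual lines.
  • Splitting each line on two consecutive spaces then breaks it into individual fields.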
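
The splitting steps above can be tried without a PDF at hand. The following is a minimal sketch, assuming the text that page.text would return is already held in a string. split_into_rows is a hypothetical helper name (it is not part of pdf-reader), and the sample text is made up for illustration.

```ruby
# Hypothetical helper (not part of pdf-reader): turns a block of text,
# such as the string returned by page.text, into an array of rows.
def split_into_rows(text)
  text.split("\n").filter_map do |line|
    # Split on two consecutive spaces, trim each field, drop blank fields.
    fields = line.split('  ').map(&:strip).reject(&:empty?)
    # Returning nil for a blank line makes filter_map skip it.
    fields unless fields.empty?
  end
end

# Made-up sample text standing in for the result of page.text
sample = "Invoice\n\nFROM:    [Company Name]      TO:   [Client Name]\nItem1      1.00\n"

rows = split_into_rows(sample)
rows.each { |row| p row }
```

Fields that contain only a single space, such as "[Company Name]", stay intact because the delimiter is two spaces. The same property means that two columns separated by just one space in the extracted text will not be split.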

Parse result

Let's print the parsed result to the console.

ruby parser.rb
["Invoice"]
["DATE: [Enter date]"]
["INVOICE Invoice Number"]
["FROM:", "[Company Name]", "TO:", "[Client Name]"]
["[Email]", "[Client Email Address"]
["[Address 1]il Address", "[Address 1]"]
["[Address 2]", "[Address 2]ress 1"]
["[Phone]s 2", "[Phone] Address 2"]
["TERMS: T[Payment Terms]"]
["DUE:", "Due Datet Due Date]"]
["Item Description", "Quantity", "Price", "Amount"]
["Item1", "1.00"]
["$ 10.00", "$ 10.00"]
["Item2", "2.00"]
["$ 10.00", "$ 20.00"]
["Item3"]
["1.00", "$ 20.00", "$ 20.00"]
["Item4", "5.00"]
["$ 5.00", "$ 25.00"]
["Item5", "1.00"]
["$ 8.00", "$ 8.00"]
["$ 0.00"]
["$ 0.00"]
["$ 0.00"]
["$ 0.00"]
["$ 0.00"]
["Subtotal", "$ 83.00"]
["Tax"]
["BALANCE DUE", "$ 83.00"]
["Notes"]
["EClick here to add notes or terms of service.here"]

In some places a single row ends up split across two lines, but overall the PDF is parsed into a CSV-like form.
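
One possible way to deal with the split rows is a post-processing step. The sketch below is not part of the original script. It assumes that every complete row has as many fields as the header row (four in this invoice) and that the fragments of one row appear consecutively. merge_split_rows is a hypothetical helper, and this heuristic may not hold for other PDFs. Ruby's csv library is then used to write the merged rows as actual CSV.

```ruby
require 'csv'

# Hypothetical helper: concatenates consecutive fragments until a row
# reaches the expected number of columns.
def merge_split_rows(rows, width)
  merged = []
  buffer = []
  rows.each do |row|
    buffer += row
    next if buffer.size < width

    merged << buffer
    buffer = []
  end
  merged << buffer unless buffer.empty? # keep any incomplete leftover
  merged
end

# Rows copied from the item table in the parse result
item_rows = [
  ['Item Description', 'Quantity', 'Price', 'Amount'],
  ['Item1', '1.00'],
  ['$ 10.00', '$ 10.00'],
  ['Item3'],
  ['1.00', '$ 20.00', '$ 20.00']
]

table = merge_split_rows(item_rows, item_rows.first.size)
csv_text = CSV.generate { |csv| table.each { |row| csv << row } }
puts csv_text
```

The helper only counts fields, so it cannot tell which column a fragment belongs to. Rows with genuinely empty cells, such as the "$ 0.00" lines in the result, would need a different rule.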


Comments

No comments
Sign up for free and join this conversation.
Sign Up
If you already have a Qiita account Login
2
Help us understand the problem. What is going on with this article?