
[Ruby] It's tax return season, so here are my notes on scraping my order history from Amazon

More than 3 years have passed since last update.

Introduction

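I wanted my Amazon purchase history for my tax return, but Amazon.co.jp does not seem to offer a CSV export or anything similar, so I wrote a script that fetches it automatically.
When a single order contains five(?) or more items, a link like "Show all N items" appears. The script does not follow that link, so those orders are only partially captured.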

I'm a Ruby beginner, so I split the work into two scripts: one that fetches the order history, and one that reads the saved HTML and converts it to CSV.

Setup

The scripts use mechanize and nokogiri, so install them first.

$ gem install mechanize
$ gem install nokogiri

Fetching the HTML

The following script fetches the HTML.

get.rb
require 'mechanize'

agent = Mechanize.new
agent.user_agent = 'Mac Safari'

url = 'https://www.amazon.co.jp/ap/signin?_encoding=UTF8&openid.assoc_handle=jpflex&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.mode=checkid_setup&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&openid.ns.pape=http%3A%2F%2Fspecs.openid.net%2Fextensions%2Fpape%2F1.0&openid.pape.max_auth_age=0&openid.return_to=https%3A%2F%2Fwww.amazon.co.jp%2F%3Ftag%3Dhydraamazonav-22%26hvadid%3D39595899217%26hvpos%3D1t1%26hvexid%3D%26hvnetw%3Dg%26hvrand%3D8895839300291415748%26hvpone%3D%26hvptwo%3D%26hvqmt%3De%26hvdev%3Dc%26ref%3Dnav_custrec_signin'

page = agent.get(url)

login_form = page.forms_with(:name => 'signIn').first
login_form.fields_with(:name => 'email').first.value = "your Amazon account email address"
login_form.fields_with(:name => 'password').first.value = "your Amazon password"
page2 = login_form.click_button

this_year = 2015 # year to collect orders for (change as needed)
index = 1

url = "https://www.amazon.co.jp/gp/your-account/order-history?opt=ab&digitalOrders=1&unifiedOrders=1&returnTo=&__mk_ja_JP=カタカナ&orderFilter=year-#{this_year}"
puts 'get', url
page = agent.get(url)
puts page.title
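# Warn when the page contains an emphasis link (such as "Show all N items"),
# because this script does not follow it.
has_emphasis = agent.page.search('//a[@class="a-size-medium a-link-emphasis"]')
if !has_emphasis.empty? then
  puts "page1 contains data that was not fetched."
end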

s = agent.page.body.to_s
File.open("#{this_year}-#{index}.html", "w"){ |f|
  f.puts s
}

while true do
  # Follow the "next page" link in the pagination bar; stop when there is none.
  href = agent.page.search('//ul[@class="a-pagination"]/li[@class="a-last"]/a/@href').text
  button = agent.page.link_with(:href => href)
  if button == nil then
    break
  end
  puts 'get', 'https://www.amazon.co.jp' + href
  page = button.click

  has_emphasis = agent.page.search('//a[@class="a-size-medium a-link-emphasis"]')
  if !has_emphasis.empty? then
    puts "page#{index+1} contains data that was not fetched."
  end

  index += 1
  s = agent.page.body.to_s
  File.open("#{this_year}-#{index}.html", "w"){ |f|
    f.puts s
  }
end

Running $ ruby get.rb creates one file per page of the order history, named 2015-1.html, 2015-2.html, and so on.
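To check what was saved before converting, the files can be listed using the same naming rule. The following is only a sketch: `saved_pages` is a hypothetical helper that is not part of get.rb, and it assumes the "<year>-<n>.html" names written by the script above, stopping at the first missing number just as the conversion script does.

```ruby
# Hypothetical helper (not part of get.rb): collect the saved page files for a
# year, assuming the "<year>-<n>.html" naming, and stop at the first gap.
def saved_pages(year, dir = Dir.pwd)
  files = []
  num = 1
  loop do
    path = File.join(dir, "#{year}-#{num}.html")
    break unless File.exist?(path)
    files << path
    num += 1
  end
  files
end
```

Calling saved_pages(2015) in the directory where get.rb ran returns the paths in page order, which makes it easy to spot a page that failed to save.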

Converting to CSV

csv.rb
require 'nokogiri'
require 'csv'
require 'date'

this_year = 2015 # year to convert
format = '%Y年%m月%d日'

current_dir = Dir.pwd
dataList = []
num = 0

while true
  num += 1
  filename = "#{this_year}-#{num}.html"
  if ! File.exist?(filename) then
    break
  end

  puts "Loading... #{filename}"
  html = Nokogiri::HTML(File.open(filename))

  orders = html.xpath('//div[@class="a-box-group a-spacing-base order"]')

  orders.each do |row|
    if row.respond_to?('xpath') then

      order_number = row.xpath('div[1]/div/div/div/div[2]/div[1]/span[2]').text.strip
      puts "order: #{order_number}"

      date_label = row.xpath('div[1]/div/div/div/div[1]/div/div[1]/div[1]/span[@class="a-color-secondary label"]').text.strip
      date_value = row.xpath('div[1]/div/div/div/div[1]/div/div[1]/div[2]/span[@class="a-color-secondary value"]').text.strip
      date = DateTime.strptime(date_value, format)

      sum_label = row.xpath('div[1]/div/div/div/div[1]/div/div[2]/div[1]/span[@class="a-color-secondary label"]').text.strip
      sum_value = row.xpath('div[1]/div/div/div/div[1]/div/div[2]/div[2]/span[@class="a-color-secondary value"]').text.strip.gsub(/¥ */, "").gsub(/,/, "").to_i

      to_label = row.xpath('div[1]/div/div/div/div[1]/div/div[3]/div[1]/span[@class="a-color-secondary label"]').text.strip
      to_value = row.xpath('div[1]/div/div/div/div[1]/div/div[3]/div[2]/span[@class="a-color-secondary value"]').text.strip

      lefts = row.xpath('.//div[@class="a-fixed-left-grid-inner"]')
      lefts.each do | node |
        title = node.xpath('div[2]/div/a[@class="a-link-normal"]').text.strip
        if title == nil then
          next
        end
        if title.length == 0 then
          title = node.xpath('div[2]/div[1]').text.strip # Android app
        end

        img = node.xpath('div[1]/div[1]/a/img/@src').text.strip
        author = node.xpath('div[2]/div/span[@class="a-size-small"]').text.strip
        seller = node.xpath('div[2]/div/span[@class="a-size-small a-color-secondary"]').text.strip.gsub(/([\s| | ]+)/," ")
        price = node.xpath('div[2]/div/span[@class="a-size-small a-color-price"]').text.strip.gsub(/¥ */, "").gsub(/,/, "").to_i
        if price == 0 then
          price = sum_value
        end
        ver = node.xpath('div[2]/div/span[@class="a-size-small a-color-secondary a-text-bold"]').text.strip  # Kindle?

        hash = Hash.new
        hash['order'] = order_number
        hash['date'] = date
        hash['title'] = title
        hash['author'] = author
        hash['price'] = price
        hash['ver'] = ver
        hash['img'] = img
        hash['seller'] = seller        

        if hash.size > 0 then
          dataList << hash
        end
      end

    end

  end

end

dataList = dataList.sort_by { |hash| hash['date'] }

CSV.open('amazon.csv','wb') do |csv|
  csv << ['date','order', 'title','author','price', 'kindle']

  dataList.each do |hash|
    date = hash['date'].strftime('%Y年%m月%d日')
    order = hash['order']
    title = hash['title']
    author = hash['author']
    kindle = hash['ver']
    price = hash['price']

    csv << [date, order, title, author, price, kindle]
  end

end

Running the script above with $ ruby csv.rb creates a file named amazon.csv.
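Once amazon.csv exists, it can be read back for a quick summary. The sketch below is not part of the original scripts: `monthly_totals` is a hypothetical helper, and it assumes the header row (date, order, title, author, price, kindle) and the '%Y年%m月%d日' date format that csv.rb writes.

```ruby
require 'csv'
require 'date'

# Hypothetical helper: sum the "price" column per month.
# Assumes the header row and the date format written by csv.rb.
def monthly_totals(csv_path)
  totals = Hash.new(0)
  CSV.foreach(csv_path, headers: true, encoding: 'UTF-8') do |row|
    date = Date.strptime(row['date'], '%Y年%m月%d日')
    totals[date.strftime('%Y-%m')] += row['price'].to_i
  end
  # Return the months in chronological order.
  totals.sort.to_h
end
```

monthly_totals('amazon.csv') returns a hash keyed by "YYYY-MM". The totals are only as accurate as the price column itself.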

Caveats

I did not need exact amounts for my own purposes, so some values are not extracted rigorously. Please keep that in mind.
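One place where the amounts can drift is the price handling. The sketch below pulls that logic out of csv.rb into two hypothetical helpers, `parse_yen` and `item_price`, which do not exist in the original script; they use the same gsub calls and the same fallback to the order total when an item has no price of its own.

```ruby
# Convert a price string such as "¥ 1,234" to an integer, using the same
# substitutions as csv.rb. Text without digits becomes 0.
def parse_yen(text)
  text.strip.gsub(/¥ */, '').gsub(/,/, '').to_i
end

# Hypothetical helper mirroring the fallback in csv.rb: when the item price is
# missing (parsed as 0), the order total is used instead.
def item_price(price_text, order_total)
  price = parse_yen(price_text)
  price == 0 ? order_total : price
end
```

Because of this fallback, an order with two items that both lack a price gets the order total on each row, so summing the price column counts that order twice.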

References

[Mechanize] Various ways to click a link
Scraping 読書メーター with Ruby's Mechanize and Nokogiri

Personal notes

To understand the structure of the HTML page you want to scrape, inspect it in the developer tools of a browser such as Chrome.

When you need the XPath of an element, right-click it and choose "Copy" -> "Copy XPath"; this is a relatively easy way to get one.

JunSuzukiJapan
I mostly write these as notes to my future self.