Building a JSON file of emoji Unicode code points and keywords with Nokogiri

Posted at 2022-07-04

Introduction

I found myself wanting a JSON file that pairs each emoji's Unicode code point with its keywords (the words used for search and predictive input).
Something like this:

[
  {
    "hex": "1F347",
    "annotations": [
      "ぶどう",
      "グレープ",
      "果物"
    ]
  },
  {
    "hex": "1F348",
    "annotations": [
      "メロン",
      "果物",
      "野菜"
    ]
  },
  {
    "hex": "1F349",
    "annotations": [
      "スイカ",
      "果物",
      "野菜"
    ]
  }
]

So I decided to scrape the data with Nokogiri and build the JSON myself.
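To make the intended use of this JSON concrete, here is a minimal sketch of a keyword lookup built on data in the same shape. The helper names `hex_to_emoji` and `build_keyword_index` are hypothetical and not part of the original script, and the handling of space-separated code points is an assumption about how multi-code-point emoji might be written.

```ruby
require 'json'

# Sample data in the same shape as the JSON shown above.
sample_json = <<~JSON
  [
    { "hex": "1F347", "annotations": ["ぶどう", "グレープ", "果物"] },
    { "hex": "1F348", "annotations": ["メロン", "果物", "野菜"] },
    { "hex": "1F349", "annotations": ["スイカ", "果物", "野菜"] }
  ]
JSON

# Convert a hex code point string such as "1F347" into the character itself.
# Assumption: sequences, if any, are written as space-separated code points.
def hex_to_emoji(hex)
  hex.split.map { |code_point| code_point.to_i(16) }.pack('U*')
end

# Build a keyword => [emoji, ...] index (hypothetical helper).
def build_keyword_index(entries)
  index = Hash.new { |hash, key| hash[key] = [] }
  entries.each do |entry|
    emoji = hex_to_emoji(entry['hex'])
    entry['annotations'].each { |word| index[word] << emoji }
  end
  index
end

index = build_keyword_index(JSON.parse(sample_json))
puts index['果物'].join(' ')
```

Typing a keyword such as 果物 would then surface every emoji annotated with it, which is roughly what a search box or predictive input needs.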

What we will do

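The scraping target is an HTML page published by the Unicode Consortium called CJK Annotations.
On this page, the Unicode Consortium defines which words should be used when searching for each emoji or offering it in predictive input.
The page is a table listing each emoji, its Unicode code point, and the words describing it in English, Chinese, Japanese, and Korean, so we take the Unicode column and the Japanese column and turn them into JSON.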

Code

require 'nokogiri'
require 'open-uri'
require 'json'

def main
  doc = Nokogiri::HTML(URI.open('https://unicode-org.github.io/cldr-staging/charts/latest/annotations/cjk.html'))

  data_json = []
  trs = doc.css('.body tr')

  # Loop over the table rows
  trs.each do |tr|
    hash = {}
    td_element_ary = tr.children

    # Loop over the cells in the row
    td_element_ary.each do |td|
      # The cell with the "source" class holds the hex code point, so take its value.
      if td.classes.include?("source")
        hash[:hex] = td.children.pop.content
      end
    end

    # The td class names are all the same, so pick the Japanese column by index.
    # Rows without a Chinese translation have two fewer child nodes.
    if td_element_ary.length == 7
      next if td_element_ary[5].children[1].nil?

      # The children look like [*, bold annotation, <br>, regular-weight annotations].
      # Take the trailing regular-weight annotations with pop.
      str = td_element_ary[5].children.pop.content
      annotation_ary = mold(str)

      hash[:annotations] = [td_element_ary[5].children[1].content] + annotation_ary # bold + trailing annotations
    elsif td_element_ary.length == 9
      next if td_element_ary[7].children[1].nil?

      # The trailing regular-weight annotations
      str = td_element_ary[7].children.pop.content
      annotation_ary = mold(str)

      hash[:annotations] = [td_element_ary[7].children[1].content] + annotation_ary # bold + trailing annotations
    else
      # Any other child count: store a placeholder value
      hash[:annotations] = ['hogehoge']
    end
    data_json.push(hash)
  end

  File.open("emoji-data.json", "w") do |file|
    JSON.dump(data_json, file)
  end
end

def mold(str)
  return [] if str.nil?

  # The input looks like "| ミクロ | 単位", so strip the unneeded parts
  str.split('| ').map(&:strip).reject(&:empty?)
end

main
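The `mold` helper in the script can be tried in isolation to see what it does to the raw cell text. The sketch below copies its logic and adds a hypothetical variant, `mold_loose`, that splits on the pipe alone; that variant is an illustration of one possible tweak, not something the original script uses.

```ruby
# Same logic as the mold helper in the script above.
def mold(str)
  return [] if str.nil?

  str.split('| ').map(&:strip).reject(&:empty?)
end

# Hypothetical variant: split on "|" alone so that input without a space
# after the pipe is still separated.
def mold_loose(str)
  return [] if str.nil?

  str.split('|').map(&:strip).reject(&:empty?)
end

p mold('| ミクロ | 単位')
p mold(nil)
p mold('|ミクロ|単位')       # not split, because there is no "| " in the input
p mold_loose('|ミクロ|単位')
```

Because `mold` splits on a pipe followed by a space, it relies on the page always emitting that exact separator. Splitting on the pipe and stripping afterwards is a little more forgiving if the spacing ever changes.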

Conclusion

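With this script I was able to produce the JSON in the shape I wanted.
After finishing, though, I discovered that a nicely formatted JSON file had already been published on GitHub.
I thought I had searched thoroughly...
Still, it was good scraping practice, so I am fine with that.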

The GitHub repository with the ready-made JSON
