More than 5 years have passed since last update.

READYFOR株式会社

Tweet + elasticsearch + kibana

Posted at 2016-01-10

環境構築

環境構築手順

データを準備

tweet取得

https://github.com/Tatzyr/alltweets
他人のtweetは、3000件まで取れるみたい
話題のbecky_bekikoを取得してみる

$ git clone https://github.com/Tatzyr/alltweets.git
$ cd alltweets
$ ./bin/setup
$ ./exe/alltweets becky_bekiko --retweets --json
$ ls alltweets_becky_bekiko.json
alltweets_becky_bekiko.json

データ整形

整形せずにそのまま突っ込みたかったが、geo_pointが配列を受け入れてくれなかった。ドキュメントには"location": [ -71.34, 41.12 ]はOKそうなのだが。
どうせ整形するなら扱いやすいデータ構造にjqコマンドで整形した。

$ cat alltweets_becky_bekiko.json | jq ' .[] |
{
  id:                     .id_str,
  created_at:             .created_at,
  text:                   .text,
  analyze_text:           .text,
  source:                 .source,
  in_reply: {
    status:               .in_reply_to_status_id_str,
    user_id:              .in_reply_to_user_id_str,
    screen_name:          .in_reply_to_screen_name
  },
  user: {
    id:                   .user.id_str,
    name:                 .user.name,
    screen_name:          .user.screen_name,
    location:             .user.location,
    description:          .user.description,
    url:                  .user.url,
    followers_count:      .user.followers_count,
    friends_count:        .user.friends_count,
    listed_count:         .user.listed_count,
    created_at:           .user.created_at,
    favourites_count:     .user.favourites_count,
    statuses_count:       .user.statuses_count,
    lang:                 .user.lang,
    profile_image_url:    .user.profile_image_url_https,
    profile_banner_url:   .user.profile_banner_url
  },
  geo: {
    lat:                  (.geo.coordinates[0] // 0),
    lon:                  (.geo.coordinates[1] // 0)
  },
  place: {
    id:                   .place.id,
    url:                  .place.url,
    place_type:           .place.place_type,
    name:                 .place.name,
    full_name:            .place.full_name,
    country_code:         .place.country_code,
    country:              .place.country
  },
  retweet_count:          .retweet_count,
  favorite_count:         .favorite_count,
  lang:                   .lang,
  hashtags:               .entities.hashtags,
  user_mentions:          .entities.user_mentions,
  media:                  .entities.media,
  retweeted_status: {
    created_at:           .retweeted_status.created_at,
    id:                   .retweeted_status.id_str,
    text:                 .retweeted_status.text,
    source:               .retweeted_status.source,
    in_reply: {
      status:             .retweeted_status.in_reply_to_status_id_str,
      user_id:            .retweeted_status.in_reply_to_user_id_str,
      screen_name:        .retweeted_status.in_reply_to_screen_name
    },
    user: {
      id:                 .retweeted_status.user.id_str,
      name:               .retweeted_status.user.name,
      screen_name:        .retweeted_status.user.screen_name,
      location:           .retweeted_status.user.location,
      description:        .retweeted_status.user.description,
      url:                .retweeted_status.user.url,
      followers_count:    .retweeted_status.user.followers_count,
      friends_count:      .retweeted_status.user.friends_count,
      listed_count:       .retweeted_status.user.listed_count,
      created_at:         .retweeted_status.user.created_at,
      favourites_count:   .retweeted_status.user.favourites_count,
      statuses_count:     .retweeted_status.user.statuses_count,
      lang:               .retweeted_status.user.lang,
      profile_image_url:  .retweeted_status.user.profile_image_url_https,
      profile_banner_url: .retweeted_status.user.profile_banner_url
    }
  }
}
' > becky_bekiko.json

bulkに整形

bulk用のフォーマットに整形する。
雑な整形プログラム

require 'json'

File.open('./becky_bekiko.json', 'w') do |f|
  open('./bulk_becky_bekiko.json').each do |line|
    json=JSON.load(line)
    index = { "index" => { "_index" => "twitter", "_type" => "tweet", "_id" => json["id"]} }
    f.puts index.to_json
    f.puts line
  end
end

$ ruby to_bulk.rb
$ ls bulk_becky_bekiko.json
bulk_becky_bekiko.json

stream2esを使えば整形せずにいけるとあったが、_idのmappingsで"reason": "Failed to parse mapping [tweet]: _id is not configurable"と怒られたのであきらめた。

IndexとMappingの作成

リファレンス、このサイトを参考にした

$ curl -XPOST 'http://192.168.33.10:9200/twitter' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "ngram_tokenizer",
          "filter": ["cjk_width", "lowercase"]
        },
        "facet_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "keyword",
          "filter": ["cjk_width", "lowercase"]
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    "tweet": {
      "properties": {
        "created_at": {
          "type": "date",
          "format" : "EE MMM d HH:mm:ss Z yyyy"
        },
        "text": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "ngram_analyzer"
        },
        "analyze_text": {
          "type": "string",
          "index": "not_analyzed"
        },
        "source": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "facet_analyzer"
        },
        "in_reply": {
          "type": "object",
          "properties": {
            "status": {
              "type": "long"
            },
            "user_id": {
              "type": "long"
            },
            "screen_name": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "user": {
          "type": "object",
          "properties": {
            "id": {
              "type": "long"
            },
            "name": {
              "type": "string",
              "index": "not_analyzed"
            },
            "screen_name": {
              "type": "string",
              "index": "not_analyzed"
            },
            "location": {
              "type": "string"
            },
            "description": {
              "type": "string",
              "index": "analyzed",
              "analyzer": "ngram_analyzer"
            },
            "url": {
              "type": "string"
            },
            "followers_count": {
              "type": "long"
            },
            "friends_count": {
              "type": "long"
            },
            "listed_count": {
              "type": "long"
            },
            "created_at": {
              "type": "date",
              "format" : "EE MMM d HH:mm:ss Z yyyy"
            },
            "favourites_count": {
              "type": "long"
            },
            "statuses_count": {
              "type": "long"
            },
            "lang": {
              "type": "string"
            },
            "profile_image_url": {
              "type": "string"
            },
            "profile_banner_url": {
              "type": "string"
            }
          }
        },
        "geo": {
          "type": "geo_point",
          "ignore_malformed": true
        },
        "place": {
          "type": "object",
          "properties": {
            "id": {
              "type": "string"
            },
            "url": {
              "type": "string"
            },
            "place_type": {
              "type": "string"
            },
            "name": {
              "type": "string"
            },
            "full_name": {
              "type": "string",
              "index": "analyzed",
              "analyzer": "ngram_analyzer"
            },
            "country_code": {
              "type": "string"
            },
            "country": {
              "type": "string"
            }
          }
        },
        "retweet_count": {
          "type": "long"
        },
        "favorite_count": {
          "type": "long"
        },
        "lang": {
          "type": "string"
        },
        "hashtags": {
          "properties": {
            "text": {
              "type": "string",
              "index": "analyzed",
              "analyzer": "facet_analyzer"
            }
          }
        },
        "user_mentions": {
          "properties": {
            "screen_name": {
              "type": "string",
              "index": "not_analyzed"
            },
            "name": {
              "type": "string",
              "index": "not_analyzed"
            },
            "id": {
              "type": "long"
            }
          }
        },
        "media": {
          "properties": {
            "id": {
              "type": "long"
            },
            "media_url_https": {
              "type": "string"
            },
            "expanded_url": {
              "type": "string"
            },
            "type": {
              "type": "string"
            }
          }
        },
        "retweeted_status": {
          "type": "object",
          "properties": {
            "created_at": {
              "type": "date",
              "format" : "EE MMM d HH:mm:ss Z yyyy"
            },
            "id": {
              "type": "long"
            },
            "text": {
              "type": "string",
              "index": "analyzed",
              "analyzer": "ngram_analyzer"
            },
            "source": {
              "type": "string",
              "index": "analyzed",
              "analyzer": "facet_analyzer"
            },
            "in_reply": {
              "type": "object",
              "properties": {
                "status": {
                  "type": "long"
                },
                "user_id": {
                  "type": "long"
                },
                "screen_name": {
                  "type": "string",
                  "index": "not_analyzed"
                }
              }
            },
            "user": {
              "type": "object",
              "properties": {
                "id": {
                  "type": "long"
                },
                "name": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "screen_name": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "location": {
                  "type": "string"
                },
                "description": {
                  "type": "string",
                  "index": "analyzed",
                  "analyzer": "ngram_analyzer"
                },
                "url": {
                  "type": "string"
                },
                "followers_count": {
                  "type": "long"
                },
                "friends_count": {
                  "type": "long"
                },
                "listed_count": {
                  "type": "long"
                },
                "created_at": {
                  "type": "date",
                  "format" : "EE MMM d HH:mm:ss Z yyyy"
                },
                "favourites_count": {
                  "type": "long"
                },
                "statuses_count": {
                  "type": "long"
                },
                "lang": {
                  "type": "string"
                },
                "profile_image_url": {
                  "type": "string"
                },
                "profile_banner_url": {
                  "type": "string"
                }
              }
            }
          }
        }
      }
    }
  }
}
'

データの登録

$ curl -s -XPOST 192.168.33.10:9200/_bulk --data-binary "@bulk_becky_bekiko.json"; echo

kibana

設定

「Index name or patter」に「twitter」
「Time-field name」に「created_at」

Discoverページ

「Discover」を選択

Geo

「Visualize」⇒「Tile map」
geo_pointにnull("ignore_malformed": true)は受け入れてもらえなかったので、適当な初期値のgeoです

円グラフ

「Visualize」⇒「Pie chart」

ハッシュタグ数

Field「hashtags.text」

メンションしたユーザの多い数

Field「in_reply.screen_name」

RTしたユーザの多い数

Field「retweeted_status.user.name」

縦棒グラフ

「Visualize」⇒「Vertical bar chart」

Fav数の多いtweet

ゲスの極み乙女。さんの新譜を聴く時のあのドキドキ感はなんていうんだろう。もう、恋のドキドキと同じなんですよね。胸がきゅーんって。そのあとにすきすきすきぃー！って。今年、ライブを拝見するのが目標です。はぁ。。。幸せなためいき。。。
— ベッキー♪♯ (@Becky_bekiko) June 26, 2015

まとめ

データ整形とmappingsを整理するのが大変だった
日本語検索の精度が適当なので、analysisの理解深めたい

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up