Posted at

Tweet + elasticsearch + kibana

More than 3 years have passed since last update.


環境構築


データを準備


tweet取得

$ git clone https://github.com/Tatzyr/alltweets.git

$ cd alltweets
$ ./bin/setup
$ ./exe/alltweets becky_bekiko --retweets --json
$ ls alltweets_becky_bekiko.json
alltweets_becky_bekiko.json


データ整形

整形せずにそのまま突っ込みたかったが、geo_pointが配列を受け入れてくれなかった。ドキュメントには"location": [ -71.34, 41.12 ]はOKそうなのだが。

どうせ整形するなら扱いやすいデータ構造にjqコマンドで整形した。

$ cat alltweets_becky_bekiko.json | jq ' .[] |

{
id: .id_str,
created_at: .created_at,
text: .text,
analyze_text: .text,
source: .source,
in_reply: {
status: .in_reply_to_status_id_str,
user_id: .in_reply_to_user_id_str,
screen_name: .in_reply_to_screen_name
},
user: {
id: .user.id_str,
name: .user.name,
screen_name: .user.screen_name,
location: .user.location,
description: .user.description,
url: .user.url,
followers_count: .user.followers_count,
friends_count: .user.friends_count,
listed_count: .user.listed_count,
created_at: .user.created_at,
favourites_count: .user.favourites_count,
statuses_count: .user.statuses_count,
lang: .user.lang,
profile_image_url: .user.profile_image_url_https,
profile_banner_url: .user.profile_banner_url
},
geo: {
lat: (.geo.coordinates[0] // 0),
lon: (.geo.coordinates[1] // 0)
},
place: {
id: .place.id,
url: .place.url,
place_type: .place.place_type,
name: .place.name,
full_name: .place.full_name,
country_code: .place.country_code,
country: .place.country
},
retweet_count: .retweet_count,
favorite_count: .favorite_count,
lang: .lang,
hashtags: .entities.hashtags,
user_mentions: .entities.user_mentions,
media: .entities.media,
retweeted_status: {
created_at: .retweeted_status.created_at,
id: .retweeted_status.id_str,
text: .retweeted_status.text,
source: .retweeted_status.source,
in_reply: {
status: .retweeted_status.in_reply_to_status_id_str,
user_id: .retweeted_status.in_reply_to_user_id_str,
screen_name: .retweeted_status.in_reply_to_screen_name
},
user: {
id: .retweeted_status.user.id_str,
name: .retweeted_status.user.name,
screen_name: .retweeted_status.user.screen_name,
location: .retweeted_status.user.location,
description: .retweeted_status.user.description,
url: .retweeted_status.user.url,
followers_count: .retweeted_status.user.followers_count,
friends_count: .retweeted_status.user.friends_count,
listed_count: .retweeted_status.user.listed_count,
created_at: .retweeted_status.user.created_at,
favourites_count: .retweeted_status.user.favourites_count,
statuses_count: .retweeted_status.user.statuses_count,
lang: .retweeted_status.user.lang,
profile_image_url: .retweeted_status.user.profile_image_url_https,
profile_banner_url: .retweeted_status.user.profile_banner_url
}
}
}
' > becky_bekiko.json


bulkに整形

require 'json'

File.open('./becky_bekiko.json', 'w') do |f|
open('./bulk_becky_bekiko.json').each do |line|
json=JSON.load(line)
index = { "index" => { "_index" => "twitter", "_type" => "tweet", "_id" => json["id"]} }
f.puts index.to_json
f.puts line
end
end

$ ruby to_bulk.rb

$ ls bulk_becky_bekiko.json
bulk_becky_bekiko.json



  • stream2esを使えば整形せずにいけるとあったが、_idのmappingsで"reason": "Failed to parse mapping [tweet]: _id is not configurable"と怒られたのであきらめた。


IndexとMappingの作成

$ curl -XPOST 'http://192.168.33.10:9200/twitter' -d '

{
"settings": {
"analysis": {
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"char_filter": ["html_strip"],
"tokenizer": "ngram_tokenizer",
"filter": ["cjk_width", "lowercase"]
},
"facet_analyzer": {
"type": "custom",
"char_filter": ["html_strip"],
"tokenizer": "keyword",
"filter": ["cjk_width", "lowercase"]
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "nGram",
"min_gram": 2,
"max_gram": 3,
"token_chars": ["letter", "digit"]
}
}
}
},
"mappings": {
"tweet": {
"properties": {
"created_at": {
"type": "date",
"format" : "EE MMM d HH:mm:ss Z yyyy"
},
"text": {
"type": "string",
"index": "analyzed",
"analyzer": "ngram_analyzer"
},
"analyze_text": {
"type": "string",
"index": "not_analyzed"
},
"source": {
"type": "string",
"index": "analyzed",
"analyzer": "facet_analyzer"
},
"in_reply": {
"type": "object",
"properties": {
"status": {
"type": "long"
},
"user_id": {
"type": "long"
},
"screen_name": {
"type": "string",
"index": "not_analyzed"
}
}
},
"user": {
"type": "object",
"properties": {
"id": {
"type": "long"
},
"name": {
"type": "string",
"index": "not_analyzed"
},
"screen_name": {
"type": "string",
"index": "not_analyzed"
},
"location": {
"type": "string"
},
"description": {
"type": "string",
"index": "analyzed",
"analyzer": "ngram_analyzer"
},
"url": {
"type": "string"
},
"followers_count": {
"type": "long"
},
"friends_count": {
"type": "long"
},
"listed_count": {
"type": "long"
},
"created_at": {
"type": "date",
"format" : "EE MMM d HH:mm:ss Z yyyy"
},
"favourites_count": {
"type": "long"
},
"statuses_count": {
"type": "long"
},
"lang": {
"type": "string"
},
"profile_image_url": {
"type": "string"
},
"profile_banner_url": {
"type": "string"
}
}
},
"geo": {
"type": "geo_point",
"ignore_malformed": true
},
"place": {
"type": "object",
"properties": {
"id": {
"type": "string"
},
"url": {
"type": "string"
},
"place_type": {
"type": "string"
},
"name": {
"type": "string"
},
"full_name": {
"type": "string",
"index": "analyzed",
"analyzer": "ngram_analyzer"
},
"country_code": {
"type": "string"
},
"country": {
"type": "string"
}
}
},
"retweet_count": {
"type": "long"
},
"favorite_count": {
"type": "long"
},
"lang": {
"type": "string"
},
"hashtags": {
"properties": {
"text": {
"type": "string",
"index": "analyzed",
"analyzer": "facet_analyzer"
}
}
},
"user_mentions": {
"properties": {
"screen_name": {
"type": "string",
"index": "not_analyzed"
},
"name": {
"type": "string",
"index": "not_analyzed"
},
"id": {
"type": "long"
}
}
},
"media": {
"properties": {
"id": {
"type": "long"
},
"media_url_https": {
"type": "string"
},
"expanded_url": {
"type": "string"
},
"type": {
"type": "string"
}
}
},
"retweeted_status": {
"type": "object",
"properties": {
"created_at": {
"type": "date",
"format" : "EE MMM d HH:mm:ss Z yyyy"
},
"id": {
"type": "long"
},
"text": {
"type": "string",
"index": "analyzed",
"analyzer": "ngram_analyzer"
},
"source": {
"type": "string",
"index": "analyzed",
"analyzer": "facet_analyzer"
},
"in_reply": {
"type": "object",
"properties": {
"status": {
"type": "long"
},
"user_id": {
"type": "long"
},
"screen_name": {
"type": "string",
"index": "not_analyzed"
}
}
},
"user": {
"type": "object",
"properties": {
"id": {
"type": "long"
},
"name": {
"type": "string",
"index": "not_analyzed"
},
"screen_name": {
"type": "string",
"index": "not_analyzed"
},
"location": {
"type": "string"
},
"description": {
"type": "string",
"index": "analyzed",
"analyzer": "ngram_analyzer"
},
"url": {
"type": "string"
},
"followers_count": {
"type": "long"
},
"friends_count": {
"type": "long"
},
"listed_count": {
"type": "long"
},
"created_at": {
"type": "date",
"format" : "EE MMM d HH:mm:ss Z yyyy"
},
"favourites_count": {
"type": "long"
},
"statuses_count": {
"type": "long"
},
"lang": {
"type": "string"
},
"profile_image_url": {
"type": "string"
},
"profile_banner_url": {
"type": "string"
}
}
}
}
}
}
}
}
}
'


データの登録

$ curl -s -XPOST 192.168.33.10:9200/_bulk --data-binary "@bulk_becky_bekiko.json"; echo


kibana


設定


  • 「Index name or patter」に「twitter」

  • 「Time-field name」に「created_at」

1.png


Discoverページ


  • 「Discover」を選択

2.png


Geo


  • 「Visualize」⇒「Tile map」

  • geo_pointにnull("ignore_malformed": true)は受け入れてもらえなかったので、適当な初期値のgeoです

3.png


円グラフ


  • 「Visualize」⇒「Pie chart」


ハッシュタグ数


  • Field「hashtags.text」

4.png


メンションしたユーザの多い数


  • Field「in_reply.screen_name」

5.png


RTしたユーザの多い数


  • Field「retweeted_status.user.name」

6.png


縦棒グラフ


  • 「Visualize」⇒「Vertical bar chart」


Fav数の多いtweet

7.png


まとめ


  • データ整形とmappingsを整理するのが大変だった

  • 日本語検索の精度が適当なので、analysisの理解深めたい