Computing user similarity on Twitter #Ruby

I built a system based on collaborative filtering for a class, so here are my notes.

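The language is Ruby, with the twitter gem and kuromoji, a Japanese morphological analyzer.
This time, similarity is computed from the words contained in each user's tweets and from the times at which the tweets were posted.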

Fetching data from Twitter requires OAuth authentication.
I'll leave the details to other resources; once a REST API client is authenticated, its methods become available.

The user_timeline method fetches a given user's tweets.

tweets = client.user_timeline(ARGV[1].to_s, count: 200)

This fetches up to 200 tweets.

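kuromoji is a morphological analyzer written in Java, so it cannot be used from Ruby directly.
Instead, I use rjb (Ruby Java Bridge), which bridges Ruby and Java.
I haven't used it much, so I'm not certain, but because Java and Ruby follow slightly different method naming conventions, capital letters and underscores appear to be converted automatically.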
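To make the naming difference concrete, here is a small sketch of the correspondence visible in this article's code, where Ruby's `has_next` is used for Java's `hasNext`. The helper `snake_to_camel` is a hypothetical name for illustration only; it is not rjb's implementation, and I have not checked how rjb resolves names internally.

```ruby
# Illustration only: map a Ruby-style snake_case name to a Java-style
# camelCase name. This is NOT how rjb is implemented.
def snake_to_camel(name)
  head, *rest = name.split('_')
  head + rest.map(&:capitalize).join
end

puts snake_to_camel('has_next')   # the Ruby-side name used in JavaIterator
```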

For this part I followed the article "rubyから形態素解析ライブラリkuromojiを使う" (Using the morphological analysis library kuromoji from Ruby).

Rjb::load('kuromoji-0.7.7/lib/kuromoji-0.7.7.jar')
Tokenizer = Rjb::import('org.atilika.kuromoji.Tokenizer')
@tknizer = Tokenizer.builder.build

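Next comes splitting the text and extracting the keywords to use.
First, only nouns are considered.
Symbols and English words also end up being treated as words, so those are excluded as well.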

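The key part is counting the extracted words.
Because the counts also have to be compared with another user's counts, I used a hash.
The key is the word and the value is the number of occurrences.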
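As a minimal sketch of this idea, assuming the words have already been extracted into an array, counting with a hash could look like the following. `count_words` and the sample words are made up for illustration; the `fetch`/`store` pattern is the one used in this article's tokenizer.

```ruby
# Count occurrences of each word: key = word, value = number of occurrences.
def count_words(words, counts = {})
  words.each do |word|
    # fetch returns 0 for a word seen for the first time
    counts.store(word, counts.fetch(word, 0) + 1)
  end
  counts
end

# Hypothetical word lists standing in for two users' tweets
user_counts   = count_words(%w[ruby twitter ruby gem])
friend_counts = count_words(%w[twitter java twitter])

# Because both are keyed by word, shared words are easy to find
shared = user_counts.keys & friend_counts.keys
puts shared.inspect
```

Keying both hashes by the word itself is what makes the later comparison a simple lookup per key.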

# Count the nouns in sentence into hash h (word => number of occurrences).
def tokenize2 sentence,h
  list = @tknizer.tokenize(sentence)
  list.extend JavaIterator
  list.each do |x|
    next if x.surface_form =~ /^\w$/                # skip a single ASCII letter, digit, or underscore
    next if x.surface_form =~ /^\d/                 # skip tokens that start with a digit
    next if x.surface_form =~ /\.|\@|\//            # skip tokens containing ".", "@", or "/"
    next if x.surface_form =~ /^([ぁ-ん]|[ァ-ヴ])$/  # skip a single hiragana or katakana character
    puts x.surface_form
    puts x.part_of_speech
    if /名詞/ =~ x.all_features then  # "名詞" means noun
      h.store("#{x.surface_form}",h.fetch("#{x.surface_form}",0)+1)
    end
  end
end
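The filters can be tried on their own without kuromoji. The sketch below pulls the same four regular expressions into a hypothetical helper, `skip_surface?`, and runs it over a few made-up surface forms. Note that Ruby's `\w` matches only ASCII letters, digits, and the underscore, so a single kanji is not removed by the first rule.

```ruby
# Returns true when a token's surface form should be skipped.
# The regular expressions are the ones used in tokenize2.
def skip_surface?(surface)
  return true if surface =~ /^\w$/                # single ASCII letter, digit, or underscore
  return true if surface =~ /^\d/                 # starts with a digit
  return true if surface =~ /\.|\@|\//            # contains ".", "@", or "/"
  return true if surface =~ /^([ぁ-ん]|[ァ-ヴ])$/  # single hiragana or katakana character
  false
end

# Made-up surface forms for illustration
samples = ['a', '2014年', 'http://t.co', '@user', 'の', 'ア', '授業', 'Ruby']
kept = samples.reject { |s| skip_surface?(s) }
puts kept.inspect
```

With these samples, only 授業 and Ruby pass the filters; whether they are then counted still depends on the noun check.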

After that, the posting hours and days of the week are counted in the same way and compared.
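A minimal sketch of that counting step, assuming each tweet's timestamp is available as a Ruby `Time` object (as `tweet.created_at` is in the script). `count_times` and the sample timestamps are hypothetical.

```ruby
# Count tweets per hour of day and per weekday from their timestamps.
def count_times(times)
  by_hour = Hash.new(0)   # hour (0-23) => count
  by_wday = Hash.new(0)   # weekday abbreviation ("Mon", ...) => count
  times.each do |t|
    by_hour[t.hour] += 1
    by_wday[t.strftime('%a')] += 1
  end
  [by_hour, by_wday]
end

# Hypothetical timestamps standing in for tweet.created_at
times = [
  Time.utc(2014, 7, 7, 9, 15),
  Time.utc(2014, 7, 7, 9, 40),
  Time.utc(2014, 7, 8, 23, 5)
]
by_hour, by_wday = count_times(times)
puts by_hour.inspect
puts by_wday.inspect
```

One thing worth checking is the time zone of `created_at`. If it comes back in UTC, converting with something like `t.getlocal('+09:00')` before taking the hour would make the hourly counts reflect Japan time.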

usermodering

require 'rjb'
require 'twitter'
require './access_token.rb'

module  JavaIterator
  def each
    i = self.iterator
    while i.has_next
      yield i.next
    end
  end
end

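client=Twitter::REST::Client.new do |config|
  config.consumer_key=CONSUMER_KEY
  config.consumer_secret=CONSUMER_SECRET
  # The twitter gem README (5.x and later) names these access_token and
  # access_token_secret; oauth_token is the older name.
  config.access_token=ACCESS_TOKEN
  config.access_token_secret=ACCESS_TOKEN_SECRET
end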

Rjb::load('kuromoji-0.7.7/lib/kuromoji-0.7.7.jar')

Tokenizer = Rjb::import('org.atilika.kuromoji.Tokenizer')

@tknizer = Tokenizer.builder.build

def tokenize sentence
  list = @tknizer.tokenize(sentence)
  list.extend JavaIterator
  list.each do |x|
    print x.surface_form
    print " : "
    puts x.all_features
  end
end

# Count the nouns in sentence into hash h (word => number of occurrences).
def tokenize2 sentence,h
  list = @tknizer.tokenize(sentence)
  list.extend JavaIterator
  list.each do |x|
    next if x.surface_form =~ /^\w$/                # skip a single ASCII letter, digit, or underscore
    next if x.surface_form =~ /^\d/                 # skip tokens that start with a digit
    next if x.surface_form =~ /\.|\@|\//            # skip tokens containing ".", "@", or "/"
    next if x.surface_form =~ /^([ぁ-ん]|[ァ-ヴ])$/  # skip a single hiragana or katakana character
    puts x.surface_form
    puts x.part_of_speech
    if /名詞/ =~ x.all_features then  # "名詞" means noun
      h.store("#{x.surface_form}",h.fetch("#{x.surface_form}",0)+1)
    end
  end
end

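# ARGV[0]: screen name of the user to compare against (required)
unless(ARGV[0]) then
    raise ArgumentError.new("Please specify at least one Twitter user name")
end

# ARGV[1]: screen name of the base user; defaults to "hoge" when omitted
unless(ARGV[1]) then
    ARGV[1]="hoge"
end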


tweets=client.user_timeline "#{ARGV[1].to_s}",:count=>200
ftweets=client.user_timeline "#{ARGV[0].to_s}",:count=>200


$userword=Hash.new 0
$usertime=Hash.new 0
$userwday=Hash.new 0

$friendword=Hash.new 0
$friendtime=Hash.new 0
$friendwday=Hash.new 0

tweets.each {|tweet|
    $usertime[tweet.created_at.hour]+=1
    $userwday[tweet.created_at.strftime("%a")]+=1

    if tweet.text =~ /^RT/ then
        next
    end
    puts tweet.text
    tokenize tweet.text
    tokenize2 tweet.text,$userword
  p "====="
}


ftweets.each {|tweet|
    $friendtime[tweet.created_at.hour]+=1
    $friendwday[tweet.created_at.strftime("%a")]+=1

    if tweet.text =~ /^RT/ then
        next
    end
    puts tweet.text
    tokenize tweet.text
    tokenize2 tweet.text,$friendword
  p "====="
}

p "============="

p $userword.to_a.sort{|a, b|
  (b[1] <=> a[1]) * 2 + (a[0] <=> b[0])
}
p $usertime
p $userwday

p "============="


p $friendword.to_a.sort{|a, b|
  (b[1] <=> a[1]) * 2 + (a[0] <=> b[0])
}
p $friendtime
p $friendwday

p "============="
# Normalization: convert word counts into relative frequencies
$usersum=0
$userword.values.each{|a|
    $usersum=$usersum+a
}
$userword.each{|key,value|
    $userword.store(key,value/$usersum.to_f)
}

$friendsum=0
$friendword.values.each{|a|
    $friendsum=$friendsum+a
}
$friendword.each{|key,value|
    $friendword.store(key,value/$friendsum.to_f)
}

# Score calculation: sum the products of values whose keys match
$wordscore=0.0
$userword.each{|key,value|
    $wordscore+=value*$friendword.fetch(key,0.0)*50000.0
}
$timescore=0.0
$usertime.each{|key,value|
    $timescore+=value*$friendtime.fetch(key,0.0)/40.0
}
$wdayscore=0.0
$userwday.each{|key,value|
    $wdayscore+=value*$friendwday.fetch(key,0.0)/400.0
}

puts "friendship score is " + ($wordscore+$timescore+$wdayscore).to_s
puts "word score " + ($wordscore).to_s
puts "time score " + ($timescore).to_s
puts "wday score "+ ($wdayscore).to_s
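The word score in the script boils down to two steps: turn each user's word counts into relative frequencies, then sum the products for the words both users share (an inner product), scaled by a constant. The sketch below shows just those two steps on made-up counts. `normalize` and `inner_product` are hypothetical helper names, and the scaling constants from the script (50000, 40, 400) are left out because they are weights chosen by hand.

```ruby
# Convert raw counts into relative frequencies that sum to 1.0.
def normalize(counts)
  total = counts.values.sum.to_f
  return {} if total.zero?
  counts.map { |key, count| [key, count / total] }.to_h
end

# Inner product of two hashes: keys missing from b contribute 0.
def inner_product(a, b)
  a.sum(0.0) { |key, value| value * b.fetch(key, 0.0) }
end

# Made-up word counts for two users
user   = normalize({ 'ruby' => 2, 'twitter' => 1, 'gem' => 1 })
friend = normalize({ 'twitter' => 2, 'java' => 2 })

score = inner_product(user, friend)
puts "word score (unscaled): #{score}"
```

Here only "twitter" is shared, so the unscaled score is 0.25 * 0.5 = 0.125. Returning a new hash from `normalize` instead of overwriting the counts in place keeps the raw counts available for printing or debugging.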