
Notes compiled for a small text-analysis study group, plus a few extras

Posted 2016-04-22 (last updated 2016-04-22)

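Reference sites and materials

First steps with "R" (slides)
http://www.slideshare.net/m884/japan-r-15432969

A site that does text mining with R
https://www.karada-good.net/analyticsr/r-10/

An excellent site that explains a wide range of topics, including text mining, R, and statistical analysis
https://www1.doshisha.ac.jp/~mjin/R/

A very detailed R function reference
https://cran.r-project.org/doc/contrib/manuals-jp/Mase-Rstatman.pdf

Books

"Rによるテキストマイニング入門" (Introduction to Text Mining with R), Motohiro Ishida (石田基広)
It looks quite easy to follow. It is a hands-on book that you work through as you read.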

Collecting data

A tool that can fetch roughly 3,000 tweets from a specific user's Twitter account
http://www.oshiete-kun.net/archives/2014/08/_twimem.html

Built into R: Fisher's iris data

Measurements of iris flowers

data(iris)
str(iris)
summary(iris)
hist(iris$Sepal.Length)  # histogram
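
As a small follow-up to the commands above, here is a minimal sketch that computes the mean sepal length per species with base R's `aggregate`. The variable name `means` is only illustrative and is not part of the original notes.

```r
# Load the built-in iris data and compute the mean sepal length per species
data(iris)
means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# One row per species, with columns Species and Sepal.Length
print(means)
```

The formula interface keeps the grouping column and the measured column readable side by side, which is handy when exploring an unfamiliar data set.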

Notes on working with R

Installing a library

install.packages("XLConnect", dependencies = TRUE)

RMeCab (required for morphological analysis, and therefore for text analysis)

# RMeCab Windows binary
# version 0.99991
install.packages("http://web.ias.tokushima-u.ac.jp/linguistik/RMeCab/RMeCab_0.99991.zip",
                 repos = NULL, method = "libcurl")
 
# RMeCab - OS X 10.9 Mavericks binary
# version 0.99991
install.packages("http://web.ias.tokushima-u.ac.jp/linguistik/RMeCab/RMeCab_0.99991.tgz",
                 repos = NULL, method = "libcurl")
 
# Install from GitHub
# version 0.99993
install.packages("devtools")
devtools::install_github("IshidaMotohiro/RMeCab")
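
Before running any of the install commands above, it can help to check whether the package is already available. The sketch below uses base R's `requireNamespace`; the helper name `is_installed` is hypothetical and introduced only for illustration.

```r
# Hypothetical helper: TRUE if the package's namespace can be loaded
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Only print a reminder when RMeCab is missing; nothing is installed here
if (!is_installed("RMeCab")) {
  message("RMeCab is not installed; run one of the install commands above.")
}
```

Keeping the check separate from the install call avoids reinstalling the package every time the script is run.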

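tcltk could not be loaded

Error : .onLoad failed in loadNamespace() for 'tcltk', details:
  call: fun(libname, pkgname) 
  error: X11 library is missing: install XQuartz from xquartz.macosforge.org 
 Error: package or namespace load failed for 'tcltk'

I don't really understand this, so I am ignoring it and simply not using tcltk, although the message itself suggests installing XQuartz.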

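I don't know R's functions, so I'm lost! Looking things up takes time!

A fate everyone goes through.

Pick up manual-like material from the web and from books.

Getting on with R

Just try things. Keep experimenting.

I want to process the data one row at a time and output the result
Reference: http://takenaka-akio.org/doc/r_auto/chapter_07_apply.html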

apply(hazuma,1,nchar)

Counts the number of characters in each row
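
To make the row-wise call above concrete, here is a minimal sketch on a hand-made data frame. The name `tweets` and its contents are assumptions standing in for `hazuma`, whose real structure is not shown in full.

```r
# Hypothetical stand-in for the tweet data: a single text column
tweets <- data.frame(text = c("hello", "R is fun", ""),
                     stringsAsFactors = FALSE)

# MARGIN = 1 applies nchar() to each row in turn
by_row <- apply(tweets, 1, nchar)

# For a single character column, nchar() is already vectorised
by_col <- nchar(tweets$text)

print(by_row)
print(by_col)
```

With only one text column the two approaches give the same counts. If the column is stored as a factor, converting it with `as.character()` before calling `nchar()` directly is the safer route.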

Writing the output to a file

write.csv(hazuma,file='write.csv')

Row 1, column 1

hazuma[1,1]
[1] RT @aiuemai: 映画を観るまでは我慢…と思っていたのに＜ゲンロンβ1＞の渡邉大輔【ポスト・シネマ・クリティーク#4 】「キャメラアイの複数化鈴木卓爾監督『ジョギング渡り鳥』」を読む。面白い。前回は『牡蠣工場』が取り上げられていたし、目が離せない連載。https://t… Wed Apr 20 05:36:25 +0000 2016
2784 Levels: ・・・・ https://t.co/OksFa7LaRG Thu Mar 03 06:23:01 +0000 2016 ...

RMeCab: morphological analysis of text

res <- docMatrix("R-scripts/data", pos = c("名詞","形容詞","助詞") )

RMeCabC(hazuma[1,1])

[[1]]
  名詞 
"Back" 

[[2]]
名詞 
"to" 

[[3]]
 名詞 
"Top" 

[[4]]
名詞 
 "^" 

[[5]]
名詞 
"RT" 

[[6]]
名詞 
 "@" 

[[7]]
      名詞 
"masumoto" 

[[8]]
名詞 
"_:" 

[[9]]
      名詞 
"ゲンロン" 

[[10]]
記号 
"β" 

[[11]]
名詞 
 "1" 

[[12]]
助詞 
"は" 

[[13]]
記号 
"、" 

[[14]]
  名詞 
"小松" 

[[15]]
名詞 
"理" 

[[16]]
名詞 
"虔" 

[[17]]
  名詞 
"さん" 

[[18]]
助詞 
"の" 

[[19]]
  名詞 
"日本" 

[[20]]
名詞 
"酒" 

[[21]]
  名詞 
"紹介" 

[[22]]
助詞 
"が" 

[[23]]
        形容詞 
"素晴らしかっ" 

[[24]]
助動詞 
  "た"

Extracting only the nouns (名詞)

res <- RMeCabC(hazuma[1,1])
res2 <- unlist(res)
res2[names(res2) == "名詞"]
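
The filtering step above works because `unlist()` yields a named character vector whose names are the part-of-speech tags. The sketch below reproduces that shape by hand so it runs without MeCab installed; the vectors `morphemes` and `tags` are illustrative stand-ins, not real RMeCabC output.

```r
# Hand-written stand-in for unlist(RMeCabC(...)):
# values are the morphemes, names are their part-of-speech tags
morphemes <- c("日本", "酒", "が", "素晴らしかっ", "た")
tags      <- c("名詞", "名詞", "助詞", "形容詞", "助動詞")
res2 <- setNames(morphemes, tags)

# Keep only the nouns, as in the original snippet
nouns <- res2[names(res2) == "名詞"]

# Keep several parts of speech at once with %in%
content_words <- res2[names(res2) %in% c("名詞", "形容詞")]

print(nouns)
print(content_words)
```

Using `%in%` is a small extension of the original comparison and is convenient when more than one tag is of interest.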

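RMeCabText

setwd("~/R-scripts")
library("RMeCab")
RMeCabText("hazuma.txt")  # R is case-sensitive: the function is spelled RMeCabText
This terminates with an error.
Perhaps because the text is too large?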

Frequency analysis: trying out RMeCabFreq

library("RMeCab")
setwd("~/R-scripts")
res <- RMeCabFreq("data/hazuma.txt")

res2 <- res[res$Info1=="名詞" & res$Freq > 10,]
# Note: the trailing "," is required. It selects rows and keeps all columns; without it the line raises an error.
write.csv(res,"freq.csv")
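
The role of the trailing comma can be seen on a small hand-made table. This is only a sketch: the data frame `freq` is invented, its `Info1` and `Freq` columns mirror the names used above, and the `Term` column name is an assumption about the frequency table's layout.

```r
# Hypothetical frequency table shaped like the one filtered above
freq <- data.frame(
  Term  = c("映画", "連載", "は", "面白い"),
  Info1 = c("名詞", "名詞", "助詞", "形容詞"),
  Freq  = c(25, 8, 120, 14),
  stringsAsFactors = FALSE
)

# The logical vector before the comma picks rows;
# the empty slot after the comma means "all columns"
frequent_nouns <- freq[freq$Info1 == "名詞" & freq$Freq > 10, ]

# subset() expresses the same filter without the comma syntax
same <- subset(freq, Info1 == "名詞" & Freq > 10)

print(frequent_nouns)
```

Without the comma, R treats the logical vector as a column selector, which is why the original line fails when the comma is dropped.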

Word cloud (wordcloud)

library("RColorBrewer")
library("wordcloud")
par(family = "HiraKakuProN-W3") # prevent garbled Japanese characters

Col <- brewer.pal(9, "BuGn") # choose the text colours
Col <- Col[-(1:3)] # drop the palest colours for readability

words<- read.csv("freq2-2.csv")

wordcloud(words[,1], words[, 2], scale=c(6,.2),
      random.order = TRUE, rot.per = .15, colors = Col)
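
The `wordcloud` call above reads the words from column 1 and the frequencies from column 2 of the CSV. How the author's `freq2-2.csv` was prepared is not stated; the sketch below shows one way, as an assumption, to write a file with that layout. The data frame `freq` and the temporary file are illustrative only.

```r
# Hypothetical frequency table: words in column 1, counts in column 2
freq <- data.frame(Term = c("movie", "series", "director"),
                   Freq = c(25, 14, 11),
                   stringsAsFactors = FALSE)

# write.csv() adds row names as a first column by default;
# row.names = FALSE keeps the words in column 1
path <- tempfile(fileext = ".csv")
write.csv(freq, path, row.names = FALSE)

# Reading the file back gives the layout wordcloud(words[, 1], words[, 2], ...) expects
words <- read.csv(path, stringsAsFactors = FALSE)
print(words)
```

If the row names were left in, column 1 would hold row numbers rather than words, so checking the first few rows with `head(words)` before plotting is worthwhile.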

Reference
https://www.karada-good.net/analyticsr/r-10/
