
[R] Text mining with textrecipes

Last updated at 2021-09-15, posted at 2020-07-27

0. Introduction

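I came across this library in the R Markdown source of an article that showed up on my Twitter feed.

The text processing is written in a dplyr-like style.
The code below is taken from that R Markdown file.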

library(textrecipes)
complaints_rec <-
  recipe(product ~ date_received + tags + text,
    data = complaints_train
  ) %>%
  step_date(date_received, features = c("month", "dow"), role = "dates") %>%
  step_rm(date_received) %>%
  step_dummy(has_role("dates")) %>%
  step_unknown(tags) %>%
  step_dummy(tags) %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_ngram(text, num_tokens = 2, min_num_tokens = 1) %>%
  step_tokenfilter(text, max_tokens = tune(), min_times = 5) %>%
  step_tfidf(text)
complaints_rec

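step_tokenize (tokenization?), step_ngram (n-gram creation?), step_stopwords (stopword removal?), step_tfidf (tf-idf calculation?), and so on.
I assumed functions this convenient could not possibly exist, but they really do!
The package appears to have been created fairly recently.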
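To check those guesses about what each step does, here is a minimal sketch on a toy English data frame. The data, the object names (`toy`, `toy_rec`, `toy_baked`) and the stopword list are made up for illustration, and `step_stopwords()` may additionally need the stopwords package to be installed.

```r
library(recipes)
library(textrecipes)

# Toy data, made up for illustration
toy <- data.frame(
  text = c("the cat sat on the mat",
           "the dog sat on the log",
           "a cat and a dog"),
  stringsAsFactors = FALSE
)

toy_rec <- recipe(~ text, data = toy) %>%
  step_tokenize(text) %>%                                   # split into words
  step_stopwords(text,
                 custom_stopword_source = c("the", "on", "a", "and")) %>%
  step_ngram(text, num_tokens = 2, min_num_tokens = 1) %>%  # unigrams + bigrams
  step_tfidf(text)                                          # one numeric column per token

# Estimate the recipe on the toy data and return the processed training set
toy_baked <- bake(prep(toy_rec), new_data = NULL)
names(toy_baked)
```

The column names follow the pattern `tfidf_<variable>_<token>`, with the two words of a bigram joined by an underscore.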

textrecipes

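For text mining in R, the usual choice is probably something like tm.
Personally, I have used RMeCab and then combined it with whatever other packages I needed, such as LDA or wordcloud.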

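1. Functions

The official documentation and the GitHub repository are the best references.
They helpfully lay out the input and output format of each step.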

Function	In	Out	Description	Notes
step_tokenize()	character	tokenlist()	Tokenizes the text.
step_untokenize()	tokenlist()	character	Joins the tokens back into a single string.
step_lemma()	tokenlist()	tokenlist()	Lemmatization
step_stem()	tokenlist()	tokenlist()	Stemming
step_stopwords()	tokenlist()	tokenlist()	Removes stopwords.	Defaults to the snowball list. To customize, pass a character vector via custom_stopword_source = c().
step_dummy()	a factor variable	binary dummy variables	Categorical to binary	Provided by recipes rather than textrecipes.
step_pos_filter()	tokenlist()	tokenlist()	Filters tokens by part of speech.
step_ngram()	tokenlist()	tokenlist()	Generates n-grams.
step_tokenfilter()	tokenlist()	tokenlist()	As the name says.
step_tokenmerge()	tokenlist()	tokenlist()
step_tfidf()	tokenlist()	numeric
step_tf()	tokenlist()	numeric
step_texthash()	tokenlist()	numeric
step_word_embeddings()	tokenlist()	numeric	Computes word embeddings.	Questions remain, such as where the embeddings were trained, so this needs checking.
step_textfeature()	character	numeric	Seems to return features according to the list of functions in the textfeatures package by default?	I do not really understand this one.
step_sequence_onehot()	character	numeric	Converts to one-hot vectors.	This is handy!
step_lda()	character	numeric	Computes LDA dimension estimates.	Not tried.
step_text_normalization()	character	character		Set normalization_form to one of "nfc" (default), "nfd", "nfkd", "nfkc", or "nfkc_casefold".

textrecipes includes a little departure in design from recipes, in the sense that it allows for some input and output to be in the form of list columns. To avoid confusion, here is a table of steps with their expected input and output respectively. Notice how you need to end with numeric for future analysis to work.

As the passage above says, textrecipes appears to be derived from the parent libraries tidymodels and recipes. I need to look into those as well.
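The point about list columns and ending with numeric output can be seen in a small sketch. The data frame `docs` and the object names are made up for illustration, and the class name in the comment is what I would expect from textrecipes rather than something confirmed in this article.

```r
library(recipes)
library(textrecipes)

# Toy data, made up for illustration
docs <- data.frame(text = c("one two two", "two three"),
                   stringsAsFactors = FALSE)

# Stopping after step_tokenize() leaves a tokenlist (list-like) column,
# which most modelling functions cannot use directly
tokens_only <- recipe(~ text, data = docs) %>%
  step_tokenize(text) %>%
  prep() %>%
  bake(new_data = NULL)
class(tokens_only$text)  # expected to include "textrecipes_tokenlist"

# Ending with step_tf() converts the tokenlist into numeric count columns
counts <- recipe(~ text, data = docs) %>%
  step_tokenize(text) %>%
  step_tf(text) %>%
  prep() %>%
  bake(new_data = NULL)
str(counts)
```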

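2. Trying it out

Japanese text has to be segmented into words first (wakati-gaki), so I used RMeCab for that.
Some people say R is slow compared with Python, but it does not seem that slow to me.
Note: I have not yet tried it on a large amount of text.
I ran this in RStudio Version 1.3.959. Getting RMeCab to run on 64-bit Windows took a bit of effort.
The part that builds jptext ended up looking redundant after I fiddled with it, so please do not take it as a model. All I want is each original sentence together with its space-separated version after applying RMeCab. I would like to clean it up at some point. It looks like this: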

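library(dplyr)       # %>%, group_by(), mutate(), distinct(), tibble()
library(RMeCab)      # RMeCabDF()
library(textrecipes) # also attaches recipes: recipe(), prep(), bake()

txttable <-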
  c("【請求項１】 内燃機関と、 フレームと、 前記内燃機関により消費されるべき燃料を保管するための燃料タンクと、 前記内燃機関と前記燃料タンクとに接続された容器と、 を備え、前記容器は、前記燃料タンクから蒸発した燃料を蓄えるように形成されている、オートバイであって、 前記容器は、使用位置では少なくとも部分的に、前記燃料タンクによって覆い隠されている、オートバイにおいて、 前記燃料タンクは、前記フレームに対して、整備位置と前記使用位置との間で旋回可能に支承されていることを特徴とする、オートバイ。",
... (about 100 sentences in total, each a separate quoted string)
  ) %>%
  as.data.frame()
colnames(txttable)<-"text"


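# Morphological analysis: one named character vector per row
# (values are the morphemes, names are their parts of speech)
txttable2 <- RMeCabDF(txttable)

# Pair each original sentence with its morphemes, then join the morphemes
# with spaces to get one space-separated (wakati) string per sentence
jptext <- purrr::pmap_df(list(nv = txttable2,
                              title = txttable$text),
                         function(nv, title) {
                           tibble(title = title,
                                  word = stringr::str_c(nv),
                                  hinshi = names(nv))  # hinshi = part of speech
                         }) %>%
  group_by(title) %>%
  mutate(wakati = paste(word, collapse = " ")) %>%
  distinct(wakati)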
  
test_rec <- recipe(~ wakati, data = jptext) %>%
  step_tokenize(wakati) %>% # Tokenizes to words by default
  step_stopwords(wakati, custom_stopword_source = c("か","おり","ある","および","え","いる","あり","あっ", "いずれ")) %>% # Custom list; without it the English snowball list is used
  step_tokenfilter(wakati, max_tokens = 300) %>%
  step_tfidf(wakati)

test_obj <- test_rec %>%
  prep()

str(bake(test_obj, jptext), list.len = 10)

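Results

The tf-idf calculation works properly on Japanese text too.
Also, on a closer look, textrecipes covers quite a lot of the functionality that RMeCab provides!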

str(bake(test_obj, jptext), list.len = 10)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	66 obs. of  300 variables:
 $ tfidf_wakati_ａ            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ tfidf_wakati_ｂｓ          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ tfidf_wakati_アーム        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ tfidf_wakati_アクチュエータ: num  0 0 0 0 0 0 0 0 0 0 ...
 $ tfidf_wakati_アクティブ    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ tfidf_wakati_エミッション  : num  0 0 0 0 0 ...
 $ tfidf_wakati_オートバイ    : num  0.0391 0.0442 0.0304 0.0218 0.0194 ...
 $ tfidf_wakati_オフ          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ tfidf_wakati_おり          : num  0 0.071 0 0 0.0313 ...
 $ tfidf_wakati_か            : num  0 0 0 0.0197 0.0176 ...
  [list output truncated]
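Because the wakati column is already space-separated, one option is to pass a plain whitespace splitter through the `custom_token` argument of `step_tokenize()` instead of relying on the default word tokenizer. This is only a sketch, not what was run above: the data frame `seg`, the helper `split_on_space` and the stopword list are hypothetical stand-ins, and `step_stopwords()` may additionally need the stopwords package to be installed.

```r
library(recipes)
library(textrecipes)

# Hypothetical pre-segmented text standing in for the wakati column
seg <- data.frame(
  wakati = c("前記 燃料 タンク は フレーム に 接続",
             "前記 容器 は 燃料 を 蓄える"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: takes a character vector and returns a list of
# character vectors, which is the shape custom_token expects
split_on_space <- function(x) strsplit(x, " ", fixed = TRUE)

seg_rec <- recipe(~ wakati, data = seg) %>%
  step_tokenize(wakati, custom_token = split_on_space) %>%
  step_stopwords(wakati, custom_stopword_source = c("は", "に", "を")) %>%
  step_tfidf(wakati)

seg_baked <- bake(prep(seg_rec), new_data = NULL)
names(seg_baked)
```

With this approach the tokens are exactly the units produced by the morphological analyzer, so the stopword list can be written in terms of those units.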

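References

Official documentation
GitHub
実践データサイエンス (Practical Data Science)
→ Includes an example that uses textrecipes.

Addendum (2021-09-15):
There is also an R package for sentencepiece, which looks like it could replace RMeCab.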

library(tokenizers.bpe)
library(sentencepiece)

dl <- sentencepiece_download_model("Japanese", vocab_size = 50000)
model <- sentencepiece_load_model(dl$file_model)

txt <- "すもももももももものうち"
sentencepiece_encode(model, txt, type = "subwords")
