Introducing {quanteda}

Posted at 2015-12-19

Introduction

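This is the day-20 entry in the R Advent Calendar.
I had hoped to write about something other than language processing, but my progress on various fronts did not allow it, so this post is about language processing again.
Today I introduce {quanteda}, an R package for text processing and analysis.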

What is {quanteda}?

A fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

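{quanteda} is an R package that makes language-processing tasks quick and easy, from building a corpus out of text files to tokenization, stemming, N-grams, similarity, and readability measures (although it is aimed mainly at English documents).
The package links are below: the first is CRAN and the second is GitHub.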
　quanteda: Quantitative Analysis of Textual Data
　kbenoit/quanteda

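Text processing relies on {stringr} and {stringi}; large document collections are indexed with {data.table} and stored as sparse matrices via {Matrix}. {tm} is a similar text-processing framework, and some objects created with {tm} can also be used in {quanteda}.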

What can {quanteda} do?

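Corpus management
- Text file handling (reading text from individual files or whole directories, and defining helper functions)
- Aggregation, extraction, and resampling (nonparametric bootstrapping) by linguistic unit (sentence, document, etc.)
- KeyWords In Context (KWIC)
Language processing tools
- Extraction of text features such as type counts and N-grams, and creation of user-defined dictionaries and thesauri
- Feature selection by stemming, random selection, document frequency, term frequency, and so on
- Preprocessing for English words
- Readability measures, lexical diversity measures, collocation analysis, and similarity computation
- Feature weighting beyond TF and TF-IDF, including correspondence analysis and the Wordfish model
- (Topic models, Naive Bayes, and k-nearest neighbour are not yet implemented)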

Running it in R

Let's try a few of the constants and functions the package defines.

quanteda-misc.R

library(dplyr)
library(quanteda)

# Built-in English stopword list (showing the first five entries)
> quanteda::stopwords(kind = "english") %>%
+   head(n = 5)
[1] "i"      "me"     "my"     "myself" "we"    

# English syllable counts
# ex・am・ple
> quanteda::syllables(x = "example")
[1] 3
# sta・tis・ti・cal
> quanteda::syllables(x = "statistical")
[1] 4

# Porter stemming via {SnowballC}
> quanteda::wordstem(x = c("win", "winning", "wins", "won", "winner"), language = "porter")
[1] "win"    "win"    "win"    "won"    "winner"

# N-gram
# Even with Japanese that is already segmented by spaces, the result can be wrong.
> quanteda::ngrams(
+   text = c("すもも も もも も もも の うち", "吾輩 は 猫 で ある 。", "Why are you using SJIS ?"),
+   n = 2, concatenator = "-"
+ )
[[1]]
[1] "すもも-も" "も-も"     "も-も"     "も-も"     "も-も"     "も-も"     "も-の"     "の-うち"  

[[2]]
[1] "吾輩-は" "は-猫"   "猫-で"   "で-ある" "ある-。"

[[3]]
[1] "Why-are"    "are-you"    "you-using"  "using-SJIS" "SJIS-?"
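
For comparison, here is a minimal base-R sketch of what word bigrams look like when the text is split on single spaces only. The helper `simple_ngrams()` is a hypothetical name introduced for illustration; it is not part of {quanteda} and does not reproduce its tokenizer.

```r
# Hypothetical helper, not part of {quanteda}: word n-grams from
# text that is already separated by single spaces.
simple_ngrams <- function(text, n = 2, concatenator = "-") {
  tokens <- strsplit(text, split = " ", fixed = TRUE)[[1]]
  # Fewer tokens than n means no n-gram can be formed.
  if (length(tokens) < n) {
    return(character(0))
  }
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = concatenator),
    character(1)
  )
}

simple_ngrams("Why are you using SJIS ?")
simple_ngrams("すもも も もも も もも の うち")
```

For the English sentence this gives the same five bigrams as the output above. For the first Japanese sentence, the seven space-separated tokens yield six bigrams (すもも-も, も-もも, もも-も, も-もも, もも-の, の-うち), whereas the {quanteda} output above contains eight entries with "も-も" repeated.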

Corpus management

quanteda-corpus.R

library(dplyr)
library(tm)
library(quanteda)

# Use the crude dataset from {tm}
data(crude)
crude_coupus <- quanteda::corpus(x = crude, encTo = "UTF-8")
> class(x = crude_coupus)
[1] "corpus" "list"  
> summary(object = crude_coupus)
Corpus consisting of 20 documents.

 Text Types Tokens Sentences
  127    58     92        12
  144   227    443        52
  191    43     55         9
  194    52     69        10
  211    64     93        12
  236   237    458        57
  237   236    431        60
  242   110    154        18
  246   181    324        44
  248   187    344        41
  273   192    373        44
  349    67     92        12
  352    70    105        12
  353    72     98        12
  368    72    109        14
  489    94    148        17
  502   121    198        22
  543    51     83        10
  704   142    281        33
  708    38     53         8

Source:  Converted from tm VCorpus 'crude'.
Created: Sat Dec 19 15:52:18 2015.
Notes:   .

# Extract part of the corpus
> stringr::str_split(string = quanteda::texts(x = crude_coupus)[1], pattern = "\n")
[[1]]
 [1] "Diamond Shamrock Corp said that"                                 
 [2] "effective today it had cut its contract prices for crude oil by" 
 [3] "1.50 dlrs a barrel."                                             
 [4] "    The reduction brings its posted price for West Texas"        
 [5] "Intermediate to 16.00 dlrs a barrel, the copany said."           
 [6] "    \"The price reduction today was made in the light of falling"
 [7] "oil product prices and a weak crude oil market,\" a company"     
 [8] "spokeswoman said."                                               
 [9] "    Diamond is the latest in a line of U.S. oil companies that"  
[10] "have cut its contract, or posted, prices over the last two days" 
[11] "citing weak oil markets."                                        
[12] " Reuter"                                                         

# Tokenization, which supports a variety of processing options
> tokenized_crude_coupus <- quanteda::tokenize(
+   x = crude_coupus, what = "sentence",
+   removeNumbers = FALSE, removePunct = FALSE, removeSeparators = TRUE, removeTwitter = FALSE,
+   ngrams = 1,
+   verbose = TRUE
+ )
Starting tokenization...
  ...preserving Twitter characters (#, @)...total elapsed: 0 seconds.
  ...tokenizing texts
   ...separating into sentences....total elapsed:  0.008 seconds.
  ...replacing Twitter characters (#, @)...total elapsed: 0 seconds.
  ...replacing names...total elapsed:  0.001 seconds.
Finished tokenizing and cleaning 20 texts.

> class(x = tokenized_crude_coupus)
[1] "tokenizedTexts" "list"          

# Readability measures for each document
> quanteda::readability(x = crude_coupus, measure = "all") %>% 
+   head(n = 2)
         ARI  ARI.NRI ARI.simple   Bormuth Bormuth.GP  Coleman Coleman.C2 Coleman.Liau
127 3.495942 2.536232   47.97101 -1.090237    2955495 41.47391   53.22391     59.82560
144 5.018712 4.274058   50.91878 -1.283181    5139826 35.80508   46.19447     53.42044
    Coleman.Liau.grade Coleman.Liau.short Dale.Chall Dale.Chall.old Dale.Chall.PSK Danielson.Bryan
127           6.671495           6.671304   25.66652      -30.60794      -30.94298        4.752439
144           8.426537           8.426546   22.30909      -33.48056      -33.80707        5.066302
    Danielson.Bryan.2 Dickes.Steiwer      DRP      ELF Farr.Jenkins.Paterson   Flesch Flesch.PSK
127          84.80234       199.5486 209.0237 2.583333             -40.28935 80.42942   4.773458
144          82.90171       211.3242 228.3181 3.384615             -41.08444 65.27241   5.608429
    Flesch.Kincaid      FOG  FOG.PSK   FOG.NRI  FORCAST FORCAST.RGL    Fucks Linsear.Write      LIW
127       3.945652 6.544928 2.918473  4.058333 10.70652    10.20717 34.33333   -0.41666667 23.97101
144       6.271552 9.096180 3.691583 33.514038 11.36569    10.93226 40.13462   -0.01923077 31.54406
         nWS    nWS.2    nWS.3    nWS.4      RIX     SMOG   SMOG.C SMOG.simple  SMOG.de   Spache
127 4.010888 4.434446 2.922622 2.729354 1.250000 7.793538 7.872500    7.472136 2.472136 5.686667
144 5.781312 6.126124 4.722261 4.472010 1.961538 9.417115 9.300595    9.028777 4.028777 5.336328
    Spache.old   Strain Traenkle.Bailer Traenkle.Bailer.2 Wheeler.Smith
127   6.220000 3.225000       -246.3363         -232.0354      25.83333
144   5.864591 4.015385       -277.8625         -258.7959      33.84615

# Collocations and their scores
> quanteda::collocations(x = crude_coupus, method = "all", size = 2)
        word1  word2 word3 count           G2           X2           pmi       dice
   1:     mln    bpd          14 1.213317e+02 1.270675e+03  4.5168446312 0.58333333
   2:       a barrel          14 1.030286e+02 5.989800e+02  3.7767238174 0.30434783
   3:    dlrs      a          13 8.190325e+01 4.497635e+02  3.5759259844 0.29545455
   4: billion riyals           6 7.976580e+01 2.700747e+03  6.1100805691 0.85714286
   5:   crude    oil          13 7.626258e+01 3.749103e+02  3.4020303680 0.26530612
  ---                                                                              
2589:      of      a           2 1.117296e-02 1.143404e-02  0.0738518082 0.02409639
2590:      in    its           1 2.194860e-03 2.220715e-03  0.0462953604 0.01587302
2591:    said      a           1 2.517813e-05 1.800434e-05  0.0041718876 0.01652893
2592:     was     to           1 1.370888e-05 6.505725e-06 -0.0024948038 0.01273885
2593:     and    the           4 1.070793e-05 3.510900e-06  0.0009024332 0.03030303

# KWIC (KeyWord In Context)
> quanteda::kwic(x = crude_coupus, word = "will", window = 3)
                             preword       word                postword
[144, 266]                     "They       will not meet now           
[144, 327]           next two months       will be critical for\nOPEC's
[144, 355] eight weeks\nsince buyers       will come back into         
  [191, 9]    the\ncontract price it       will pay for crude          
 [194, 10]         contract price it       will pay for all            
[236, 163]           the ability, it       will do so," the            
[236, 450]         such pressure. It       will continue through March 
 [242, 31]         over whether OPEC       will succeed in halting\nthe
 [248, 56]             Accord and it       will never sell its         
[273, 263]             a weak market will\ncome this month, when       
  [349, 8]     six Gulf\nArab states       will meet in Bahrain        
 [352, 56]             accord and it       will never sell its         
 [704, 19]  the energy\ncomplex that      will  increase the use       
 [704, 35]          April one, NYMEX      will  allow oil traders      
 [704, 69]                     "This      will  change the way         
 [704, 92]           Foreign traders      will  be able to             
[704, 121]      The expanded program      "will serve the industry     
[704, 257]     of the EFP\nprovision      will  add to globalization
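
The idea behind KWIC can be sketched in a few lines of base R: find each position of the keyword and paste together up to `window` tokens on either side. The helper `simple_kwic()` below is a hypothetical illustration that assumes whitespace tokenization and exact, case-sensitive matching; it is not how {quanteda} implements `kwic()`.

```r
# Hypothetical helper, not part of {quanteda}: keyword-in-context on
# whitespace-separated tokens, with exact, case-sensitive matching.
simple_kwic <- function(text, word, window = 3) {
  tokens <- strsplit(text, split = "[[:space:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  hits <- which(tokens == word)
  rows <- lapply(hits, function(i) {
    before <- tokens[seq_len(i - 1)]
    after <- tokens[seq_len(length(tokens) - i) + i]
    data.frame(
      position = i,
      preword = paste(tail(before, window), collapse = " "),
      word = tokens[i],
      postword = paste(head(after, window), collapse = " "),
      stringsAsFactors = FALSE
    )
  })
  # Returns NULL when the keyword does not occur.
  do.call(rbind, rows)
}

simple_kwic("contract price it will pay for crude oil", word = "will", window = 3)
```

Because matching is exact, a token such as `"will` (with a leading quotation mark) would not be found by this sketch.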

Language processing tools

quanteda-dfm.R

library(dplyr)
library(quanteda)

# Remove stopwords and "will", then convert to a dfm object, which {quanteda} functions work with most easily
> crude_coupus_ignore_stopwords <- quanteda::dfm(
+   x = crude_coupus,
+   ignoredFeatures = c("will", quanteda::stopwords(kind = "english")),
+   stem = TRUE, matrixType = "sparse"
+ )
Creating a dfm from a corpus ...
   ... lowercasing
   ... tokenizing
   ... indexing 20 documents
   ... indexing 1,052 feature types
   ... removed 89 features, from 175 supplied feature types
   ... stemming features (English), trimmed 169 feature variants
   ... created a 20 x 794 sparse dfm
   ... complete. 
Elapsed time: 0.056 seconds.

> class(x = crude_coupus_ignore_stopwords)
[1] "dfmSparse"
attr(,"package")
[1] "quanteda"

# Fewer than in summary(object = crude_coupus) above
> quanteda::ntype(x = crude_coupus_ignore_stopwords)
127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543 704 708 
 36 158  31  33  37 157 179  70 121 128 134  49  49  47  48  60  78  33 101  24 
> quanteda::ntoken(x = crude_coupus_ignore_stopwords)
127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543 704 708 
 58 262  38  44  54 263 268  86 188 209 242  65  68  60  66  86 113  54 172  35 

# dfm objects and type conversion
# dfmSparse
> str(object = crude_coupus_ignore_stopwords)
Formal class 'dfmSparse' [package "quanteda"] with 9 slots
  ..@ settings :List of 1
  .. ..$ : NULL
  ..@ weighting: chr "frequency"
  ..@ smooth   : num 0
  ..@ Dim      : int [1:2] 20 794
  ..@ Dimnames :List of 2
  .. ..$ docs    : chr [1:20] "127" "144" "191" "194" ...
  .. ..$ features: chr [1:794] "diamond" "shamrock" "corp" "said" ...
  ..@ i        : int [1:15880] 0 1 2 3 4 5 6 7 8 9 ...
  ..@ p        : int [1:795] 0 20 40 60 80 100 120 140 160 180 ...
  ..@ x        : num [1:15880] 2 0 0 0 0 0 0 0 0 0 ...
  ..@ factors  : list()

# austin's wfm format
> str(object = quanteda::as.wfm(x = crude_coupus_ignore_stopwords))
 wfm [1:20, 1:794] 2 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ docs : chr [1:20] "127" "144" "191" "194" ...
  ..$ words: chr [1:794] "diamond" "shamrock" "corp" "said" ...
 - attr(*, "class")= chr [1:2] "wfm" "matrix"
# tm's DocumentTermMatrix format
> str(object = quanteda::as.DocumentTermMatrix(x = crude_coupus_ignore_stopwords))
List of 6
 $ i       : int [1:15880] 1 2 3 4 5 6 7 8 9 10 ...
 $ j       : int [1:15880] 1 1 1 1 1 1 1 1 1 1 ...
 $ v       : num [1:15880] 2 0 0 0 0 0 0 0 0 0 ...
 $ nrow    : int 20
 $ ncol    : int 794
 $ dimnames:List of 2
  ..$ Docs : chr [1:20] "127" "144" "191" "194" ...
  ..$ Terms: chr [1:794] "diamond" "shamrock" "corp" "said" ...
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
# stm package format
> str(object = quanteda::convert(crude_coupus_ignore_stopwords, to = "stm"))
List of 3
 $ documents:List of 20
  ..$ 127: int [1:2, 1:36] 1 2 2 1 3 1 4 3 5 1 ...
  ..$ 144: int [1:2, 1:158] 4 11 7 1 9 6 11 12 24 6 ...
  ..$ 191: int [1:2, 1:31] 4 1 5 1 6 1 8 1 9 2 ...
  ..$ 194: int [1:2, 1:33] 4 1 5 1 6 1 8 1 9 2 ...
  ..$ 211: int [1:2, 1:37] 4 3 11 1 12 2 21 1 36 1 ...
  ..$ 236: int [1:2, 1:157] 4 10 6 1 9 8 10 2 11 7 ...
  ..$ 237: int [1:2, 1:179] 4 1 8 1 9 1 11 3 12 1 ...
  ..$ 242: int [1:2, 1:70] 4 3 9 2 11 3 26 2 36 1 ...
  ..$ 246: int [1:2, 1:121] 4 5 6 1 9 2 11 5 13 1 ...
  ..$ 248: int [1:2, 1:128] 4 7 9 10 11 9 12 4 13 3 ...
  ..$ 273: int [1:2, 1:134] 3 3 4 8 7 1 9 5 10 5 ...
  ..$ 349: int [1:2, 1:49] 4 1 6 1 9 1 10 2 11 4 ...
  ..$ 352: int [1:2, 1:49] 4 2 7 1 9 5 11 5 13 1 ...
  ..$ 353: int [1:2, 1:47] 4 1 9 2 10 2 11 4 13 1 ...
  ..$ 368: int [1:2, 1:48] 4 3 6 1 11 3 30 2 36 1 ...
  ..$ 489: int [1:2, 1:60] 4 2 9 3 11 4 12 1 13 3 ...
  ..$ 502: int [1:2, 1:78] 4 2 9 3 11 5 12 1 13 3 ...
  ..$ 543: int [1:2, 1:33] 3 1 4 4 5 1 7 1 9 3 ...
  ..$ 704: int [1:2, 1:101] 4 4 5 2 8 1 9 3 11 3 ...
  ..$ 708: int [1:2, 1:24] 4 1 10 1 11 1 13 2 24 1 ...
 $ vocab    : chr [1:794] "13-member" "13-nation" "200-foot" "20s" ...
 $ meta     : NULL
# topicmodels package format
> str(object = quanteda::quantedaformat2dtm(x = crude_coupus_ignore_stopwords))
List of 6
 $ i       : int [1:15880] 1 1 1 1 1 1 1 1 1 1 ...
 $ j       : int [1:15880] 1 2 3 4 5 6 7 8 9 10 ...
 $ v       : int [1:15880] 2 1 1 3 1 2 2 2 5 2 ...
 $ nrow    : int 20
 $ ncol    : int 794
 $ dimnames:List of 2
  ..$ Docs : chr [1:20] "127" "144" "191" "194" ...
  ..$ Terms: chr [1:794] "diamond" "shamrock" "corp" "said" ...
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
# lda package format
> str(object = quanteda::dfm2ldaformat(x = crude_coupus_ignore_stopwords))
List of 2
 $ documents:List of 20
  ..$ 127: int [1:2, 1:794] 0 2 1 1 2 1 3 3 4 1 ...
  ..$ 144: int [1:2, 1:794] 0 0 1 0 2 0 3 11 4 0 ...
  ..$ 191: int [1:2, 1:794] 0 0 1 0 2 0 3 1 4 1 ...
  ..$ 194: int [1:2, 1:794] 0 0 1 0 2 0 3 1 4 1 ...
  ..$ 211: int [1:2, 1:794] 0 0 1 0 2 0 3 3 4 0 ...
  ..$ 236: int [1:2, 1:794] 0 0 1 0 2 0 3 10 4 0 ...
  ..$ 237: int [1:2, 1:794] 0 0 1 0 2 0 3 1 4 0 ...
  ..$ 242: int [1:2, 1:794] 0 0 1 0 2 0 3 3 4 0 ...
  ..$ 246: int [1:2, 1:794] 0 0 1 0 2 0 3 5 4 0 ...
  ..$ 248: int [1:2, 1:794] 0 0 1 0 2 0 3 7 4 0 ...
  ..$ 273: int [1:2, 1:794] 0 0 1 0 2 3 3 8 4 0 ...
  ..$ 349: int [1:2, 1:794] 0 0 1 0 2 0 3 1 4 0 ...
  ..$ 352: int [1:2, 1:794] 0 0 1 0 2 0 3 2 4 0 ...
  ..$ 353: int [1:2, 1:794] 0 0 1 0 2 0 3 1 4 0 ...
  ..$ 368: int [1:2, 1:794] 0 0 1 0 2 0 3 3 4 0 ...
  ..$ 489: int [1:2, 1:794] 0 0 1 0 2 0 3 2 4 0 ...
  ..$ 502: int [1:2, 1:794] 0 0 1 0 2 0 3 2 4 0 ...
  ..$ 543: int [1:2, 1:794] 0 0 1 0 2 1 3 4 4 1 ...
  ..$ 704: int [1:2, 1:794] 0 0 1 0 2 0 3 4 4 2 ...
  ..$ 708: int [1:2, 1:794] 0 0 1 0 2 0 3 1 4 0 ...
 $ vocab    : chr [1:794] "diamond" "shamrock" "corp" "said" ...

# Sampling
> set.seed(seed = 71)
# corpus object
> summary(quanteda::sample(x = crude_coupus, size = 5))
Corpus consisting of 5 documents.

 Text Types Tokens Sentences
  237   236    431        60
  273   192    373        44
  236   237    458        57
  194    52     69        10
  543    51     83        10

Source:  Converted from tm VCorpus 'crude'.
Created: Sat Dec 19 15:52:18 2015.
Notes:   .

# For a dfm object, sample by feature
> summary(quanteda::sample(x = crude_coupus_ignore_stopwords, size = 5, what = "features"))
20 x 5 sparse Matrix of class "dfmSparse", with 100 entries 
     i j x
1    1 1 0
2    2 1 0
3    3 1 0
4    4 1 0
5    5 1 0
6    6 1 0
7    7 1 0
8    8 1 0
9    9 1 0
10  10 1 0
11  11 1 0
12  12 1 0
13  13 1 0
14  14 1 0
15  15 1 0
16  16 1 0
17  17 1 0
18  18 1 0
19  19 1 1
20  20 1 0
21   1 2 0
22   2 2 0
23   3 2 0
24   4 2 0
25   5 2 0
26   6 2 0
27   7 2 0
28   8 2 0
29   9 2 1
30  10 2 0
31  11 2 0
32  12 2 0
33  13 2 0
34  14 2 0
35  15 2 0
36  16 2 0
37  17 2 0
38  18 2 0
39  19 2 0
40  20 2 0
41   1 3 0
42   2 3 0
43   3 3 0
44   4 3 0
45   5 3 0
46   6 3 0
47   7 3 0
48   8 3 0
49   9 3 0
50  10 3 0
51  11 3 0
52  12 3 0
53  13 3 0
54  14 3 0
55  15 3 0
56  16 3 1
57  17 3 1
58  18 3 0
59  19 3 0
60  20 3 0
61   1 4 0
62   2 4 0
63   3 4 0
64   4 4 0
65   5 4 0
66   6 4 1
67   7 4 0
68   8 4 0
69   9 4 0
70  10 4 0
71  11 4 0
72  12 4 0
73  13 4 0
74  14 4 0
75  15 4 0
76  16 4 0
77  17 4 0
78  18 4 0
79  19 4 0
80  20 4 0
81   1 5 0
82   2 5 0
83   3 5 0
84   4 5 0
85   5 5 0
86   6 5 0
87   7 5 3
88   8 5 0
89   9 5 0
90  10 5 0
91  11 5 2
92  12 5 0
93  13 5 0
94  14 5 0
95  15 5 0
96  16 5 0
97  17 5 0
98  18 5 0
99  19 5 0
100 20 5 0

# Extract the top features by frequency (restricted to features that occur at least 10 times and in at least 2 documents)
> quanteda::topfeatures(
+   x = quanteda::trim(x = crude_coupus_ignore_stopwords, minCount = 10, minDoc = 2)
+ )
Features occurring less than 10 times: 758 
Features occurring in fewer than 2 documents: 489 
   oil   said  price   opec    mln market barrel   last   dlrs    bpd 
    85     73     63     47     31     30     26     24     23     23 
# Sample from the features that occur at least 10 times
> quanteda::trim(x = crude_coupus_ignore_stopwords, minCount = 10, nsample = 5)
Features occurring less than 10 times: 758 
Retaining a random sample of 5 words
Document-feature matrix of: 20 documents, 5 features.
20 x 5 sparse Matrix of class "dfmSparse"
     features
docs  dlrs offici accord s sheikh
  127    2      0      0 0      0
  144    0      0      0 0      0
  191    1      0      0 0      0
  194    2      0      0 0      0
  211    2      0      0 0      0
  236    2      5      0 5      3
  237    1      1      0 6      0
  242    0      1      0 0      0
  246    0      0      0 0      5
  248    4      1      5 0      2
  273    2      4      1 0      0
  349    0      3      0 0      0
  352    0      1      2 0      0
  353    0      0      0 0      1
  368    0      0      0 0      0
  489    1      0      0 0      0
  502    1      0      0 0      0
  543    5      0      0 0      0
  704    0      1      4 0      0
  708    0      0      0 0      0

# Extract the top 10 features by frequency
> quanteda::topfeatures(x = crude_coupus_ignore_stopwords, n = 10, decreasing = TRUE)
   oil   said  price   opec    mln market barrel   last   dlrs    bpd 
    85     73     63     47     31     30     26     24     23     23 
> plot(crude_coupus_ignore_stopwords, min.freq = 10, random.order = FALSE)

# Change the feature weighting
> log_tf <- quanteda::weight(x = crude_coupus_ignore_stopwords, type = "logFreq")
> wfidf <- quanteda::weight(x = log_tf, type = "tfidf", normalize = FALSE)
# The "weighting" attribute has changed
> str(object = wfidf)
 dfm [1:20, 1:794] 3.29 0 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ docs    : chr [1:20] "127" "144" "191" "194" ...
  ..$ features: chr [1:794] "diamond" "shamrock" "corp" "said" ...
 - attr(*, "class")= chr [1:2] "dfm" "matrix"
 - attr(*, "weighting")= chr "tfidf"

# Both the weights and the order have changed
> quanteda::topfeatures(x = log_tf, n = 10, decreasing = TRUE)
     oil     said    price     opec   reuter   barrel      mln   market     last     dlrs 
30.84361 27.17000 24.34084 15.40452 14.26841 13.51072 12.57072 12.04984 11.71896 11.66685 
> quanteda::topfeatures(x = wfidf, n = 10, decreasing = TRUE)
      bpd      opec       mln    market     saudi    kuwait         s    govern    accord     crude 
11.205873 10.677598 10.037813  9.621890  9.519591  8.690813  8.606302  8.606302  8.357742  8.354686 
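
To make the effect of these two weightings concrete, here is a minimal base-R sketch on a made-up count matrix. It assumes a common textbook definition (log-scaled term frequency `1 + log10(tf)` and inverse document frequency `log10(N / df)`); the exact formulas used by `quanteda::weight()` may differ, and the numbers below are not meant to reproduce the output above.

```r
# Toy document-feature count matrix (made-up numbers, not the crude data).
counts <- matrix(
  c(2, 0, 1,
    0, 3, 1,
    0, 0, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(docs = c("d1", "d2", "d3"), features = c("oil", "opec", "said"))
)

# Log-scaled term frequency: 1 + log10(tf) where tf > 0, otherwise 0.
log_tf_toy <- ifelse(counts > 0, 1 + log10(counts), 0)

# Inverse document frequency: log10(N / df), where df is the number of
# documents that contain the feature.
doc_freq <- colSums(counts > 0)
idf <- log10(nrow(counts) / doc_freq)

# Multiply each feature column by its idf.
wfidf_toy <- sweep(log_tf_toy, MARGIN = 2, STATS = idf, FUN = "*")
round(wfidf_toy, 3)
```

Under this definition a feature that appears in every document ("said" in the toy matrix) gets an idf of 0. That would be consistent with very common features such as oil and said dropping out of the tf-idf ranking above.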

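# Try every defined weighting type
# Note: the column labels come from the first result only. Each row holds the top-10 values under that weighting, and the features behind them can differ (row 4 matches the logFreq ranking above, whose fifth feature is reuter, not mln).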
> sapply(
+   X = mapply(
+     x = rep(x = list(crude_coupus_ignore_stopwords), each = 5),
+     type = c("frequency", "relFreq", "relMaxFreq", "logFreq", "tfidf"),
+     FUN = quanteda::weight,
+     smooth = 0
+   ),
+   FUN = quanteda::topfeatures,
+   n = 10, decreasing = TRUE
+ ) %>%
+   t
            oil       said      price       opec        mln     market     barrel       last       dlrs
[1,] 85.0000000 73.0000000 63.0000000 47.0000000 31.0000000 30.0000000 26.0000000 24.0000000 23.0000000
[2,]  0.8284160  0.6418656  0.6004227  0.3132729  0.3015126  0.2992712  0.2950993  0.2611242  0.2413641
[3,] 14.3948413 11.3892857 10.3662698  5.7662698  4.9095238  4.8055556  4.6305556  4.6083333  4.2900794
[4,] 30.8436127 27.1700007 24.3408417 15.4045172 14.2684087 13.5107230 12.5707158 12.0498398 11.7189635
[5,]  0.3027746  0.3012118  0.2501508  0.2173108  0.2089926  0.2064206  0.1968457  0.1927311  0.1922584
            bpd
[1,] 23.0000000
[2,]  0.2178314
[3,]  3.8777778
[4,] 11.6668475
[5,]  0.1892054

# Compute similarity
> quanteda::similarity(
+   x = crude_coupus_ignore_stopwords, 
+   selection = c("oil", "price"), n = 10, margin = "features",
+   method = "cosine"
+ )
$price
    oil    said  reuter compani     cut    last  barrel  market  effect   bring 
 0.8969  0.7757  0.7409  0.7337  0.7294  0.7152  0.7045  0.6904  0.6751  0.6717 

$oil
 price reuter   said barrel market   last   opec  crude    cut  today 
0.8969 0.8531 0.8507 0.7137 0.6845 0.6765 0.6443 0.6283 0.6163 0.6094 
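
With `margin = "features"`, each feature is treated as a vector of its counts across documents, and cosine similarity compares those vectors. The following base-R sketch shows the computation on a made-up matrix; it illustrates the measure itself and is not {quanteda}'s implementation.

```r
# Toy document-feature count matrix (made-up numbers, not the crude data).
counts <- matrix(
  c(5, 3, 0,
    4, 2, 0,
    0, 1, 6),
  nrow = 3, byrow = TRUE,
  dimnames = list(docs = c("d1", "d2", "d3"), features = c("oil", "price", "sheikh"))
)

# Cosine similarity between every pair of feature columns:
# dot product divided by the product of the vector lengths.
norms <- sqrt(colSums(counts^2))
sim <- crossprod(counts) / outer(norms, norms)

# Features most similar to "oil", excluding "oil" itself.
sort(sim["oil", colnames(sim) != "oil"], decreasing = TRUE)
```

Here "oil" and "price" occur in the same documents and score close to 1, while "oil" and "sheikh" never co-occur and score 0.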

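# Lexical diversity
# Computed per document
# Note: all seven columns below are identical and equal ntype / ntoken (e.g. 36 / 58 = 0.6207 for document 127), i.e. TTR. The `type` argument apparently does not select the measure in lexdiv() (the selector is probably `measure`), so the default seems to have been used for every call.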
> mapply(
+   x = rep(x = list(crude_coupus_ignore_stopwords), each = 7),
+   type = c("TTR", "C", "R", "CTTR", "U", "S", "Maas"),
+   FUN = quanteda::lexdiv,
+   log.base = 10
+ )
         [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
127 0.6206897 0.6206897 0.6206897 0.6206897 0.6206897 0.6206897 0.6206897
144 0.6030534 0.6030534 0.6030534 0.6030534 0.6030534 0.6030534 0.6030534
191 0.8157895 0.8157895 0.8157895 0.8157895 0.8157895 0.8157895 0.8157895
194 0.7500000 0.7500000 0.7500000 0.7500000 0.7500000 0.7500000 0.7500000
211 0.6851852 0.6851852 0.6851852 0.6851852 0.6851852 0.6851852 0.6851852
236 0.5969582 0.5969582 0.5969582 0.5969582 0.5969582 0.5969582 0.5969582
237 0.6679104 0.6679104 0.6679104 0.6679104 0.6679104 0.6679104 0.6679104
242 0.8139535 0.8139535 0.8139535 0.8139535 0.8139535 0.8139535 0.8139535
246 0.6436170 0.6436170 0.6436170 0.6436170 0.6436170 0.6436170 0.6436170
248 0.6124402 0.6124402 0.6124402 0.6124402 0.6124402 0.6124402 0.6124402
273 0.5537190 0.5537190 0.5537190 0.5537190 0.5537190 0.5537190 0.5537190
349 0.7538462 0.7538462 0.7538462 0.7538462 0.7538462 0.7538462 0.7538462
352 0.7205882 0.7205882 0.7205882 0.7205882 0.7205882 0.7205882 0.7205882
353 0.7833333 0.7833333 0.7833333 0.7833333 0.7833333 0.7833333 0.7833333
368 0.7272727 0.7272727 0.7272727 0.7272727 0.7272727 0.7272727 0.7272727
489 0.6976744 0.6976744 0.6976744 0.6976744 0.6976744 0.6976744 0.6976744
502 0.6902655 0.6902655 0.6902655 0.6902655 0.6902655 0.6902655 0.6902655
543 0.6111111 0.6111111 0.6111111 0.6111111 0.6111111 0.6111111 0.6111111
704 0.5872093 0.5872093 0.5872093 0.5872093 0.5872093 0.5872093 0.5872093
708 0.6857143 0.6857143 0.6857143 0.6857143 0.6857143 0.6857143 0.6857143

# Example from quanteda::textmodel()
> ie2010Corpus_dfm <- quanteda::dfm(x = ie2010Corpus, verbose = FALSE)
> ref_scores <- c(rep(x = NA, each = 4), -1, 1, rep(x = NA, each = 8))

# Wordscores model
> ws <- quanteda::textmodel(
+   x = ie2010Corpus_dfm, y = ref_scores, model = "wordscores", smooth = 1
+ )
# Wordscores model: scale = "logit"
> bs <- quanteda::textmodel(
+   x = ie2010Corpus_dfm, y = ref_scores, model = "wordscores", smooth = 1, scale = "logit"
+ )
> plot(
+   x = ws@Sw, y = bs@Sw, xlim = c(-1, 1),
+   xlab = "Linear word score", ylab = "Logit word score"
+ )

> wordfish <- quanteda::textmodel(
+   x = ie2010Corpus_dfm, y = NULL, model = "wordfish"
+ ) %>%
+   print
Fitted wordfish model:
Call:
	textmodel_wordfish(data = x)

Estimated document positions:

                               Documents       theta SE lower upper
1        2010_BUDGET_01_Brian_Lenihan_FF -1.77842508 NA    NA    NA
2       2010_BUDGET_02_Richard_Bruton_FG  0.58436869 NA    NA    NA
3         2010_BUDGET_03_Joan_Burton_LAB  1.14761695 NA    NA    NA
4        2010_BUDGET_04_Arthur_Morgan_SF  0.09400958 NA    NA    NA
5          2010_BUDGET_05_Brian_Cowen_FF -1.79211539 NA    NA    NA
6           2010_BUDGET_06_Enda_Kenny_FG  0.78894787 NA    NA    NA
7      2010_BUDGET_07_Kieran_ODonnell_FG  0.49306437 NA    NA    NA
8       2010_BUDGET_08_Eamon_Gilmore_LAB  0.58812988 NA    NA    NA
9     2010_BUDGET_09_Michael_Higgins_LAB  0.97901464 NA    NA    NA
10       2010_BUDGET_10_Ruairi_Quinn_LAB  0.92084329 NA    NA    NA
11     2010_BUDGET_11_John_Gormley_Green -1.12261547 NA    NA    NA
12       2010_BUDGET_12_Eamon_Ryan_Green -0.21004677 NA    NA    NA
13     2010_BUDGET_13_Ciaran_Cuffe_Green -0.79133534 NA    NA    NA
14 2010_BUDGET_14_Caoimhghin_OCaolain_SF  0.09854280 NA    NA    NA

Estimated feature scores: showing first 30 beta-hats for features

           when               i       presented             the   supplementary          budget 
     0.13824923     -0.33876765     -0.35600050     -0.21346691     -1.07842554     -0.05457924 
             to            this           house            last           april            said 
    -0.33005556     -0.26757662     -0.16145904     -0.25783983      0.15467183      0.82440376 
             we           could            work             our             way         through 
    -0.44459300      0.57604718     -0.54806584     -0.71558503     -0.29671938     -0.64350927 
         period              of          severe        economic        distress           today 
    -0.52533975     -0.29754960     -1.24430279     -0.44508315     -1.78752195     -0.11651167 
            can          report            that notwithstanding    difficulties            past 
    -0.33001557     -0.65601898     -0.04671622     -1.78752195     -1.18563671     -0.50345524

Conclusion

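In this post I ran a handful of functions from {quanteda}, an R package for text processing and analysis. It defines many more text-analysis functions than the ones shown here and is easy to use, so I have high hopes for it.
That said, quite a few things go wrong when working with Japanese. I will keep investigating whether implementing the Japanese processing myself solves these problems.
I plan to revise and extend this material and compile it somewhere at a later date.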

Tomorrow's entry will be written by gg_hatano.
