
O'Reilly books are mostly made up of about OOO words

This is a continuation of the previous post.
This time I use awk and sed to take the analysis further.

Simplifying the PDF text processing

pdftotext is slow, so from here on we start from the point where the previous post's result has already been written to oreilly_text_all.txt.

Convert multiple PDFs to text and store the result in one file
<oreilly.txt xargs -I{} pdftotext {} - > oreilly_text_all.txt

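Graphing

Assuming the input is a list of "word count" pairs sorted in descending order, we pipe it into an awk script (footnote 1) that draws a bar graph in which the largest count is represented by 50 "#" characters.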
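The scaling rule can be tried on a tiny input first. The following is a minimal sketch, not the script used in this article: the function name `bar_graph`, the `width` parameter, and the toy counts are all assumptions made for illustration. Like the one-liner in this article, it rounds partial bars up and does not stretch the bars when the largest count is smaller than the width.

```shell
# bar_graph (hypothetical helper): read "word count" lines sorted by count
# in descending order and append a bar of '#' characters, scaled so that
# the first (largest) count gets `width` marks.
bar_graph() {
  awk -v width="${1:-50}" '
    NR == 1 { max = $2 }                        # first line holds the largest count
    {
      n = (max > width) ? $2 * width / max : $2 # never stretch beyond the raw count
      len = int(n)
      if (len < n) len++                        # round partial bars up
      bar = ""
      for (i = 0; i < len; i++) bar = bar "#"
      printf "%s %d %s\n", $1, $2, bar
    }'
}

# Toy input with made-up counts, scaled to a width of 10
printf 'the 100\nto 50\nin 25\nof 4\n' | bar_graph 10
```

Because of the rounding up, even a word with a very small count still gets one "#", which is why the long tail of the real graph never shows an empty bar.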

Graph the top 300 words appearing in oreilly_text_all
cat oreilly_text_all.txt | tr '[A-Z]' '[a-z]' | grep -oE '[a-z]{2,}' | sort |
  uniq -c | sort -k1nr | awk '{printf "%16s %4d\n",$2,$1;}'|
  awk '!max{max=$2;}{f=50/max;if(f>1)f=1;i=$2*f;r="";while(i-->0)r=r"#";printf "%s %s\n",$0,r;}' |  # draw the bars
  head -300 |  # top 300
  cat -n       # line numbers
Top 300 most frequent English words in the O'Reilly books
     1               the 138844 ##################################################
     2                to 64866 ########################
     3                in 50969 ###################
     4                of 50764 ###################
     5               and 48537 ##################
     6                is 38182 ##############
     7               you 29710 ###########
     8               for 28060 ###########
     9              that 26829 ##########
    10                it 23518 #########
    11              this 21736 ########
    12                as 18763 #######
    13              with 18283 #######
    14                if 16465 ######
    15                or 15867 ######
    16                on 15841 ######
    17               are 14711 ######
    18               can 14594 ######
    19                be 14418 ######
    20                we 14357 ######
    21              file 13089 #####
    22                by 12027 #####
    23              from 11293 #####
    24               use 11250 #####
    25                an 11174 #####
    26               not 9980 ####
    27              perl 9366 ####
    28               but 8904 ####
    29              data 8895 ####
    30           command 8735 ####
    31               one 8286 ###
    32               all 8149 ###
    33           example 8041 ###
    34              line 8035 ###
    35                at 7848 ###
    36            python 7721 ###
    37              your 7681 ###
    38           chapter 7204 ###
    39              have 7073 ###
    40              more 6918 ###
    41              when 6779 ###
    42              name 6483 ###
    43          function 6394 ###
    44             which 6362 ###
    45              will 6304 ###
    46             files 6248 ###
    47              text 6128 ###
    48              like 6023 ###
    49                so 6009 ###
    50              also 5889 ###
    51           program 5853 ###
    52               see 5853 ###
    53              list 5783 ###
    54             using 5694 ###
    55             print 5610 ###
    56           section 5174 ##
    57             other 5110 ##
    58              self 5091 ##
    59               set 5084 ##
    60             shell 4949 ##
    61             first 4948 ##
    62              code 4913 ##
    63              time 4865 ##
    64               its 4829 ##
    65                do 4819 ##
    66             value 4778 ##
    67              each 4702 ##
    68              some 4650 ##
    69              only 4561 ##
    70            system 4540 ##
    71            module 4425 ##
    72            script 4413 ##
    73               may 4377 ##
    74            string 4346 ##
    75               any 4339 ##
    76             there 4320 ##
    77              used 4318 ##
    78             class 4245 ##
    79              here 4214 ##
    80            output 4163 ##
    81               has 4146 ##
    82              they 4142 ##
    83            number 4125 ##
    84               out 4073 ##
    85              just 3973 ##
    86               new 3954 ##
    87               two 3945 ##
    88           because 3894 ##
    89              same 3815 ##
    90             input 3792 ##
    91               get 3709 ##
    92              into 3695 ##
    93          variable 3692 ##
    94         directory 3664 ##
    95             these 3657 ##
    96                no 3551 ##
    97               run 3514 ##
    98                up 3495 ##
    99              what 3440 ##
   100              then 3419 ##
   101              than 3395 ##
   102              want 3379 ##
   103            return 3353 ##
   104              user 3341 ##
   105               how 3327 ##
   106             array 3273 ##
   107            object 3255 ##
   108              make 3254 ##
   109                my 3208 ##
   110              test 3196 ##
   111              next 3173 ##
   112              them 3155 ##
   113                re 3136 ##
   114               way 3104 ##
   115                ll 3102 ##
   116              such 3086 ##
   117              need 3081 ##
   118              book 3072 ##
   119              type 2992 ##
   120            method 2964 ##
   121            server 2918 ##
   122             about 2894 ##
   123              most 2875 ##
   124              read 2840 ##
   125             lines 2837 ##
   126          standard 2831 ##
   127               end 2827 ##
   128            values 2806 ##
   129           pattern 2757 #
   130              many 2736 #
   131                py 2732 #
   132              unix 2700 #
   133              find 2677 #
   134           process 2664 #
   135         functions 2646 #
   136              open 2571 #
   137               don 2544 #
   138               key 2537 #
   139          learning 2536 #
   140         character 2511 #
   141               was 2431 #
   142             would 2385 #
   143               def 2361 #
   144            import 2354 #
   145              call 2351 #
   146       programming 2347 #
   147         variables 2334 #
   148              both 2330 #
   149              does 2325 #
   150            should 2318 #
   151               now 2317 #
   152             match 2303 #
   153          instance 2298 #
   154          operator 2285 #
   155              even 2276 #
   156              last 2267 #
   157        characters 2265 #
   158             after 2255 #
   159            figure 2234 #
   160              work 2200 #
   161         different 2196 #
   162             their 2172 #
   163              html 2160 #
   164           another 2140 #
   165           regular 2120 #
   166            simple 2116 #
   167             while 2113 #
   168           default 2101 #
   169             names 2083 #
   170            before 2080 #
   171             might 2079 #
   172           instead 2076 #
   173               our 2070 #
   174             index 2061 #
   175          programs 2060 #
   176             could 2058 #
   177               awk 2050 #
   178          commands 2025 #
   179        expression 2022 #
   180             write 2016 #
   181            called 2014 #
   182              loop 2003 #
   183             start 1995 #
   184             model 1992 #
   185             where 1973 #
   186              true 1967 #
   187            option 1940 #
   188              part 1932 #
   189              case 1921 #
   190               bin 1920 #
   191             tools 1907 #
   192            window 1906 #
   193               gui 1856 #
   194            single 1855 #
   195           version 1837 #
   196             error 1835 #
   197          argument 1813 #
   198           running 1811 #
   199           current 1809 #
   200          training 1808 #
   201           message 1761 #
   202           scripts 1761 #
   203              word 1754 #
   204          examples 1737 #
   205              page 1729 #
   206           between 1728 #
   207              hash 1728 #
   208              much 1724 #
   209           without 1716 #
   210              look 1708 #
   211              well 1702 #
   212                pp 1687 #
   213             since 1687 #
   214         arguments 1665 #
   215              path 1656 #
   216              must 1651 #
   217            client 1645 #
   218         following 1645 #
   219             order 1645 #
   220             table 1645 #
   221            change 1638 #
   222               let 1633 #
   223          filename 1629 #
   224              main 1621 #
   225           package 1606 #
   226                os 1598 #
   227              else 1595 #
   228              form 1594 #
   229          language 1590 #
   230             doesn 1584 #
   231            though 1584 #
   232               web 1573 #
   233           returns 1569 #
   234            create 1568 #
   235             local 1568 #
   236              sort 1556 #
   237           strings 1554 #
   238          previous 1551 #
   239            access 1541 #
   240           windows 1537 #
   241           special 1533 #
   242           library 1532 #
   243            result 1519 #
   244             right 1516 #
   245             three 1515 #
   246               add 1511 #
   247           objects 1510 #
   248           through 1509 #
   249              uses 1507 #
   250              over 1505 #
   251           machine 1502 #
   252              mail 1490 #
   253             space 1483 #
   254           numbers 1482 #
   255          multiple 1475 #
   256           methods 1465 #
   257                ve 1463 #
   258             those 1455 #
   259              http 1443 #
   260             built 1441 #
   261            second 1433 #
   262              very 1433 #
   263           systems 1432 #
   264       information 1431 #
   265            search 1426 #
   266              home 1415 #
   267               own 1411 #
   268               try 1401 #
   269            format 1379 #
   270             means 1371 #
   271              mode 1371 #
   272             words 1371 #
   273               sed 1366 #
   274           problem 1355 #
   275             often 1350 #
   276              size 1338 #
   277              exit 1329 #
   278         operators 1329 #
   279            always 1318 #
   280               usr 1313 #
   281               cgi 1312 #
   282           however 1299 #
   283         available 1297 #
   284            common 1296 #
   285               say 1295 #
   286             later 1294 #
   287             point 1293 #
   288            random 1290 #
   289               com 1285 #
   290              once 1285 #
   291               top 1280 #
   292             field 1278 #
   293              left 1274 #
   294       expressions 1273 #
   295              know 1270 #
   296         something 1269 #
   297             still 1267 #
   298               sub 1267 #
   299             email 1265 #
   300             users 1264 #
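Entries such as ll, don, ve, re and doesn in the list above are not words in their own right. A likely explanation, given the pattern `[a-z]{2,}`, is that contractions are split at the apostrophe and any single-letter remainder is dropped. The sketch below reproduces the tokenizing step on one made-up sentence; the function name `count_words` and the sample text are assumptions for illustration only.

```shell
# count_words (hypothetical helper): lowercase the input, extract runs of
# two or more ASCII letters, and print "count word" lines sorted by
# descending count, then alphabetically.
count_words() {
  tr 'A-Z' 'a-z' | grep -oE '[a-z]{2,}' | LC_ALL=C sort | uniq -c |
    LC_ALL=C sort -k1,1nr -k2,2 |
    awk '{ print $1, $2 }'   # normalize the padding added by uniq -c
}

# "You'll" becomes "you" + "ll", "don't" becomes "don" (the lone "t" is dropped)
printf "You'll see: we don't use I/O. We've seen it.\n" | count_words
```

If whole contractions should be counted instead, the pattern would need to allow an apostrophe inside a word; that is a design choice this article does not make.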

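Ratio

Now that we have the count for every word, let's look at each word's share of the total.
The awk script below stores every line in arrays, computes the total, and then prints the array contents while dividing each count by that total.

Share of all words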
cat oreilly_text_all.txt | tr '[A-Z]' '[a-z]' |
  grep -oE '[a-z]{2,}' | sort | uniq -c | sort -k1nr |
  awk '{
    word[NR]=$2;
    count[NR]=$1;
    s+=$1;
  }
  END{
    for(i=1;i<=NR;i++)
    {printf "%16s %8d %6f\n", word[i],count[i],count[i]/s}
  }' |
  head

             the   138844 0.053903
              to    64866 0.025183
              in    50969 0.019788
              of    50764 0.019708
             and    48537 0.018843
              is    38182 0.014823
             you    29710 0.011534
             for    28060 0.010894
            that    26829 0.010416
              it    23518 0.009130

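From left to right, the columns are the word, the number of occurrences, and the share of the total.
For example, "the" at the top appears about 140,000 times across all the PDFs and accounts for about 5.4% of all words.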
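To make the arithmetic concrete, here is a minimal sketch on toy data that prints each share as a percentage. The function name `word_share` and the three input lines are made up for illustration; the input shape ("count word") is the one `uniq -c` produces.

```shell
# word_share (hypothetical helper): read "count word" lines and print
# word, count, and the share of the total as a percentage.
# The lines are buffered in arrays because the total is only known
# after the last line has been read.
word_share() {
  awk '
    { word[NR] = $2; count[NR] = $1; total += $1 }
    END {
      for (i = 1; i <= NR; i++)
        printf "%s %d %.1f%%\n", word[i], count[i], 100 * count[i] / total
    }'
}

# Toy counts: 5 + 3 + 2 = 10 occurrences in total
printf '5 the\n3 to\n2 in\n' | word_share
```

With a total of 10, the three lines come out as 50.0%, 30.0% and 20.0%, the same calculation that turns 0.053903 into about 5.4% for "the".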

Cumulative sum

In the same way, we use awk arrays to output the cumulative sum (cumsum), that is, the running total of the ratio column.

Cumulative sum of the ratios
cat oreilly_text_all.txt | tr '[A-Z]' '[a-z]' |  # lowercase everything
  grep -oE '[a-z]{2,}' | sort | uniq -c | sort -k1nr |  # word count
  awk '{
    word[NR]=$2;
    count[NR]=$1;
    s+=$1;
    cumsum[NR]=s
  }
  END{
    for(i=1;i<=NR;i++)
    {printf "%16s %8d %6f %6f\n", word[i],count[i],count[i]/s,cumsum[i]/s}
  }' |  # count / ratio / cumulative sum
  head -50 |
  sed -e '1s/^/            word    count    ratio   cumsum\n/'  # header

            word    count    ratio   cumsum
             the   138844 0.053903 0.053903
              to    64866 0.025183 0.079086
              in    50969 0.019788 0.098873
              of    50764 0.019708 0.118581
             and    48537 0.018843 0.137425
              is    38182 0.014823 0.152248
             you    29710 0.011534 0.163782
             for    28060 0.010894 0.174676
            that    26829 0.010416 0.185092
              it    23518 0.009130 0.194222
            this    21736 0.008439 0.202661
              as    18763 0.007284 0.209945
            with    18283 0.007098 0.217043
              if    16465 0.006392 0.223435
              or    15867 0.006160 0.229595
              on    15841 0.006150 0.235745
             are    14711 0.005711 0.241456
             can    14594 0.005666 0.247122
              be    14418 0.005597 0.252720
              we    14357 0.005574 0.258293
            file    13089 0.005082 0.263375
              by    12027 0.004669 0.268044
            from    11293 0.004384 0.272428
             use    11250 0.004368 0.276796
              an    11174 0.004338 0.281134
             not     9980 0.003875 0.285009
            perl     9366 0.003636 0.288645
             but     8904 0.003457 0.292101
            data     8895 0.003453 0.295555
         command     8735 0.003391 0.298946
             one     8286 0.003217 0.302163
             all     8149 0.003164 0.305326
         example     8041 0.003122 0.308448
            line     8035 0.003119 0.311568
              at     7848 0.003047 0.314614
          python     7721 0.002998 0.317612
            your     7681 0.002982 0.320594
         chapter     7204 0.002797 0.323391
            have     7073 0.002746 0.326137
            more     6918 0.002686 0.328822
            when     6779 0.002632 0.331454
            name     6483 0.002517 0.333971
        function     6394 0.002482 0.336453
           which     6362 0.002470 0.338923
            will     6304 0.002447 0.341371
           files     6248 0.002426 0.343796
            text     6128 0.002379 0.346175
            like     6023 0.002338 0.348514
              so     6009 0.002333 0.350847
            also     5889 0.002286 0.353133

Of course, the last cumsum value is 1.

The cumulative sum ends at 1
cat oreilly_text_all.txt | tr '[A-Z]' '[a-z]' |  # lowercase everything
  grep -oE '[a-z]{2,}' | sort | uniq -c | sort -k1nr |  # word count
  awk '{
    word[NR]=$2;
    count[NR]=$1;
    s+=$1;
    cumsum[NR]=s
  }
  END{
    for(i=1;i<=NR;i++)
    {printf "%16s %8d %6f %6f\n", word[i],count[i],count[i]/s,cumsum[i]/s}
  }' |  # count / ratio / cumulative sum
  tail

          zwemer        1 0.000000 0.999997
          zwicky        1 0.000000 0.999997
         zwspace        1 0.000000 0.999997
              zx        1 0.000000 0.999998
             zxf        1 0.000000 0.999998
          zygote        1 0.000000 0.999998
           zyzzy        1 0.000000 0.999999
       zzzzteana        1 0.000000 0.999999
         zzzzzzz        1 0.000000 1.000000
        zzzzzzzz        1 0.000000 1.000000
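The last row reaches 1 because the running total after the final line is the grand total itself, so the final division is total/total. A toy run makes this visible; the function name `word_cumsum` and the input lines are assumptions for illustration.

```shell
# word_cumsum (hypothetical helper): read "count word" lines and print
# word, count, ratio, and the cumulative ratio up to and including that line.
word_cumsum() {
  awk '
    {
      word[NR] = $2; count[NR] = $1
      total += $1
      running[NR] = total          # running total as of this line
    }
    END {
      for (i = 1; i <= NR; i++)
        printf "%s %d %.4f %.4f\n", word[i], count[i],
               count[i] / total, running[i] / total
    }'
}

# The cumulative column goes 0.5000, 0.8000, 1.0000
printf '5 the\n3 to\n2 in\n' | word_cumsum
```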

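I am not sure this reasoning is sound, but to put a number on "most of the words" I borrow from the normal distribution.
Using awk's if, we print words only while the cumulative share stays within the 1σ range of a normal distribution (68.27%). (footnote 2)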
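If only the headline number is wanted (how many distinct words it takes to reach a given share), the same idea can be reduced to a counter. This is a sketch under stated assumptions: the function name `words_to_cover` and the toy input are made up, and unlike the filter in this article it counts the word that crosses the threshold as well.

```shell
# words_to_cover (hypothetical helper): read "count word" lines sorted by
# descending count and print how many distinct words are needed before the
# cumulative share reaches `threshold` (default 0.6827, the value used here).
words_to_cover() {
  awk -v threshold="${1:-0.6827}" '
    { count[NR] = $1; total += $1 }
    END {
      for (i = 1; i <= NR; i++) {
        running += count[i]
        if (running / total >= threshold) { print i; exit }
      }
    }'
}

# Toy counts: shares are 0.5, 0.3, 0.1, 0.1, so two words cover 75%
printf '5 the\n3 to\n1 in\n1 of\n' | words_to_cover 0.75
```

On the real data the input would be the output of the `uniq -c | sort -k1nr` stage of the pipeline.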

Show the cumulative sum up to 1σ
cat oreilly_text_all.txt | tr '[A-Z]' '[a-z]' |  # lowercase everything
  grep -oE '[a-z]{2,}' | sort | uniq -c | sort -k1nr |  # word count
  awk '{
    word[NR]=$2;
    count[NR]=$1;
    s+=$1;
    cumsum[NR]=s
  }
  END{
    for(i=1;i<=NR;i++)
    {if (cumsum[i]/s<0.6827) printf "%16s %8d %6f %6f\n", word[i],count[i],count[i]/s,cumsum[i]/s}
  }' |  # count / ratio / cumulative sum; print while column 4 is below 1σ
  sed -e '1s/^/            word    count    ratio   cumsum\n/' |  # header
  cat -n  # line numbers

     1              word    count    ratio   cumsum
     2               the   138844 0.053903 0.053903
     3                to    64866 0.025183 0.079086
     4                in    50969 0.019788 0.098873
     5                of    50764 0.019708 0.118581
     6               and    48537 0.018843 0.137425
     7                is    38182 0.014823 0.152248
     8               you    29710 0.011534 0.163782
     9               for    28060 0.010894 0.174676
    10              that    26829 0.010416 0.185092
    11                it    23518 0.009130 0.194222
    12              this    21736 0.008439 0.202661
    13                as    18763 0.007284 0.209945
    14              with    18283 0.007098 0.217043
    15                if    16465 0.006392 0.223435
    16                or    15867 0.006160 0.229595
    17                on    15841 0.006150 0.235745
    18               are    14711 0.005711 0.241456
    19               can    14594 0.005666 0.247122
    20                be    14418 0.005597 0.252720
    21                we    14357 0.005574 0.258293
    22              file    13089 0.005082 0.263375
    23                by    12027 0.004669 0.268044
    24              from    11293 0.004384 0.272428
    25               use    11250 0.004368 0.276796
    26                an    11174 0.004338 0.281134
    27               not     9980 0.003875 0.285009
    28              perl     9366 0.003636 0.288645
    29               but     8904 0.003457 0.292101
    30              data     8895 0.003453 0.295555
    31           command     8735 0.003391 0.298946
    32               one     8286 0.003217 0.302163
    33               all     8149 0.003164 0.305326
    34           example     8041 0.003122 0.308448
    35              line     8035 0.003119 0.311568
    36                at     7848 0.003047 0.314614
    37            python     7721 0.002998 0.317612
    38              your     7681 0.002982 0.320594
    39           chapter     7204 0.002797 0.323391
    40              have     7073 0.002746 0.326137
    41              more     6918 0.002686 0.328822
    42              when     6779 0.002632 0.331454
    43              name     6483 0.002517 0.333971
    44          function     6394 0.002482 0.336453
    45             which     6362 0.002470 0.338923
    46              will     6304 0.002447 0.341371
    47             files     6248 0.002426 0.343796
    48              text     6128 0.002379 0.346175
    49              like     6023 0.002338 0.348514
    50                so     6009 0.002333 0.350847
    51              also     5889 0.002286 0.353133
    52           program     5853 0.002272 0.355405
    53               see     5853 0.002272 0.357677
    54              list     5783 0.002245 0.359923
    55             using     5694 0.002211 0.362133
    56             print     5610 0.002178 0.364311
    57           section     5174 0.002009 0.366320
    58             other     5110 0.001984 0.368304
    59              self     5091 0.001976 0.370280
    60               set     5084 0.001974 0.372254
    61             shell     4949 0.001921 0.374175
    62             first     4948 0.001921 0.376096
    63              code     4913 0.001907 0.378003
    64              time     4865 0.001889 0.379892
    65               its     4829 0.001875 0.381767
    66                do     4819 0.001871 0.383638
    67             value     4778 0.001855 0.385493
    68              each     4702 0.001825 0.387318
    69              some     4650 0.001805 0.389123
    70              only     4561 0.001771 0.390894
    71            system     4540 0.001763 0.392657
    72            module     4425 0.001718 0.394375
    73            script     4413 0.001713 0.396088
    74               may     4377 0.001699 0.397787
    75            string     4346 0.001687 0.399474
    76               any     4339 0.001685 0.401159
    77             there     4320 0.001677 0.402836
    78              used     4318 0.001676 0.404512
    79             class     4245 0.001648 0.406160
    80              here     4214 0.001636 0.407796
    81            output     4163 0.001616 0.409413
    82               has     4146 0.001610 0.411022
    83              they     4142 0.001608 0.412630
    84            number     4125 0.001601 0.414232
    85               out     4073 0.001581 0.415813
    86              just     3973 0.001542 0.417355
    87               new     3954 0.001535 0.418890
    88               two     3945 0.001532 0.420422
    89           because     3894 0.001512 0.421934
    90              same     3815 0.001481 0.423415
    91             input     3792 0.001472 0.424887
    92               get     3709 0.001440 0.426327
    93              into     3695 0.001435 0.427761
    94          variable     3692 0.001433 0.429195
    95         directory     3664 0.001422 0.430617
    96             these     3657 0.001420 0.432037
    97                no     3551 0.001379 0.433416
    98               run     3514 0.001364 0.434780
    99                up     3495 0.001357 0.436137
   100              what     3440 0.001336 0.437472
   101              then     3419 0.001327 0.438800
   102              than     3395 0.001318 0.440118
   103              want     3379 0.001312 0.441429
   104            return     3353 0.001302 0.442731
   105              user     3341 0.001297 0.444028
   106               how     3327 0.001292 0.445320
   107             array     3273 0.001271 0.446591
   108            object     3255 0.001264 0.447854
   109              make     3254 0.001263 0.449117
   110                my     3208 0.001245 0.450363
   111              test     3196 0.001241 0.451604
   112              next     3173 0.001232 0.452836
   113              them     3155 0.001225 0.454060
   114                re     3136 0.001217 0.455278
   115               way     3104 0.001205 0.456483
   116                ll     3102 0.001204 0.457687
   117              such     3086 0.001198 0.458885
   118              need     3081 0.001196 0.460081
   119              book     3072 0.001193 0.461274
   120              type     2992 0.001162 0.462436
   121            method     2964 0.001151 0.463586
   122            server     2918 0.001133 0.464719
   123             about     2894 0.001124 0.465843
   124              most     2875 0.001116 0.466959
   125              read     2840 0.001103 0.468061
   126             lines     2837 0.001101 0.469163
   127          standard     2831 0.001099 0.470262
   128               end     2827 0.001098 0.471359
   129            values     2806 0.001089 0.472449
   130           pattern     2757 0.001070 0.473519
   131              many     2736 0.001062 0.474581
   132                py     2732 0.001061 0.475642
   133              unix     2700 0.001048 0.476690
   134              find     2677 0.001039 0.477730
   135           process     2664 0.001034 0.478764
   136         functions     2646 0.001027 0.479791
   137              open     2571 0.000998 0.480789
   138               don     2544 0.000988 0.481777
   139               key     2537 0.000985 0.482762
   140          learning     2536 0.000985 0.483746
   141         character     2511 0.000975 0.484721
   142               was     2431 0.000944 0.485665
   143             would     2385 0.000926 0.486591
   144               def     2361 0.000917 0.487507
   145            import     2354 0.000914 0.488421
   146              call     2351 0.000913 0.489334
   147       programming     2347 0.000911 0.490245
   148         variables     2334 0.000906 0.491151
   149              both     2330 0.000905 0.492056
   150              does     2325 0.000903 0.492959
   151            should     2318 0.000900 0.493858
   152               now     2317 0.000900 0.494758
   153             match     2303 0.000894 0.495652
   154          instance     2298 0.000892 0.496544
   155          operator     2285 0.000887 0.497431
   156              even     2276 0.000884 0.498315
   157              last     2267 0.000880 0.499195
   158        characters     2265 0.000879 0.500074
   159             after     2255 0.000875 0.500950
   160            figure     2234 0.000867 0.501817
   161              work     2200 0.000854 0.502671
   162         different     2196 0.000853 0.503524
   163             their     2172 0.000843 0.504367
   164              html     2160 0.000839 0.505206
   165           another     2140 0.000831 0.506036
   166           regular     2120 0.000823 0.506859
   167            simple     2116 0.000821 0.507681
   168             while     2113 0.000820 0.508501
   169           default     2101 0.000816 0.509317
   170             names     2083 0.000809 0.510126
   171            before     2080 0.000808 0.510933
   172             might     2079 0.000807 0.511740
   173           instead     2076 0.000806 0.512546
   174               our     2070 0.000804 0.513350
   175             index     2061 0.000800 0.514150
   176          programs     2060 0.000800 0.514950
   177             could     2058 0.000799 0.515749
   178               awk     2050 0.000796 0.516545
   179          commands     2025 0.000786 0.517331
   180        expression     2022 0.000785 0.518116
   181             write     2016 0.000783 0.518898
   182            called     2014 0.000782 0.519680
   183              loop     2003 0.000778 0.520458
   184             start     1995 0.000775 0.521232
   185             model     1992 0.000773 0.522006
   186             where     1973 0.000766 0.522772
   187              true     1967 0.000764 0.523535
   188            option     1940 0.000753 0.524289
   189              part     1932 0.000750 0.525039
   190              case     1921 0.000746 0.525784
   191               bin     1920 0.000745 0.526530
   192             tools     1907 0.000740 0.527270
   193            window     1906 0.000740 0.528010
   194               gui     1856 0.000721 0.528731
   195            single     1855 0.000720 0.529451
   196           version     1837 0.000713 0.530164
   197             error     1835 0.000712 0.530876
   198          argument     1813 0.000704 0.531580
   199           running     1811 0.000703 0.532283
   200           current     1809 0.000702 0.532986
   201          training     1808 0.000702 0.533688
   202           message     1761 0.000684 0.534371
   203           scripts     1761 0.000684 0.535055
   204              word     1754 0.000681 0.535736
   205          examples     1737 0.000674 0.536410
   206              page     1729 0.000671 0.537081
   207           between     1728 0.000671 0.537752
   208              hash     1728 0.000671 0.538423
   209              much     1724 0.000669 0.539092
   210           without     1716 0.000666 0.539759
   211              look     1708 0.000663 0.540422
   212              well     1702 0.000661 0.541082
   213                pp     1687 0.000655 0.541737
   214             since     1687 0.000655 0.542392
   215         arguments     1665 0.000646 0.543039
   216              path     1656 0.000643 0.543682
   217              must     1651 0.000641 0.544323
   218            client     1645 0.000639 0.544961
   219         following     1645 0.000639 0.545600
   220             order     1645 0.000639 0.546239
   221             table     1645 0.000639 0.546877
   222            change     1638 0.000636 0.547513
   223               let     1633 0.000634 0.548147
   224          filename     1629 0.000632 0.548779
   225              main     1621 0.000629 0.549409
   226           package     1606 0.000623 0.550032
   227                os     1598 0.000620 0.550653
   228              else     1595 0.000619 0.551272
   229              form     1594 0.000619 0.551891
   230          language     1590 0.000617 0.552508
   231             doesn     1584 0.000615 0.553123
   232            though     1584 0.000615 0.553738
   233               web     1573 0.000611 0.554349
   234           returns     1569 0.000609 0.554958
   235            create     1568 0.000609 0.555566
   236             local     1568 0.000609 0.556175
   237              sort     1556 0.000604 0.556779
   238           strings     1554 0.000603 0.557383
   239          previous     1551 0.000602 0.557985
   240            access     1541 0.000598 0.558583
   241           windows     1537 0.000597 0.559180
   242           special     1533 0.000595 0.559775
   243           library     1532 0.000595 0.560370
   244            result     1519 0.000590 0.560959
   245             right     1516 0.000589 0.561548
   246             three     1515 0.000588 0.562136
   247               add     1511 0.000587 0.562723
   248           objects     1510 0.000586 0.563309
   249           through     1509 0.000586 0.563895
   250              uses     1507 0.000585 0.564480
   251              over     1505 0.000584 0.565064
   252           machine     1502 0.000583 0.565647
   253              mail     1490 0.000578 0.566226
   254             space     1483 0.000576 0.566801
   255           numbers     1482 0.000575 0.567377
   256          multiple     1475 0.000573 0.567949
   257           methods     1465 0.000569 0.568518
   258                ve     1463 0.000568 0.569086
   259             those     1455 0.000565 0.569651
   260              http     1443 0.000560 0.570211
   261             built     1441 0.000559 0.570771
   262            second     1433 0.000556 0.571327
   263              very     1433 0.000556 0.571883
   264           systems     1432 0.000556 0.572439
   265       information     1431 0.000556 0.572995
   266            search     1426 0.000554 0.573548
   267              home     1415 0.000549 0.574098
   268               own     1411 0.000548 0.574646
   269               try     1401 0.000544 0.575189
   270            format     1379 0.000535 0.575725
   271             means     1371 0.000532 0.576257
   272              mode     1371 0.000532 0.576789
   273             words     1371 0.000532 0.577322
   274               sed     1366 0.000530 0.577852
   275           problem     1355 0.000526 0.578378
   276             often     1350 0.000524 0.578902
   277              size     1338 0.000519 0.579422
   278              exit     1329 0.000516 0.579937
   279         operators     1329 0.000516 0.580453
   280            always     1318 0.000512 0.580965
   281               usr     1313 0.000510 0.581475
   282               cgi     1312 0.000509 0.581984
   283           however     1299 0.000504 0.582489
   284         available     1297 0.000504 0.582992
   285            common     1296 0.000503 0.583495
   286               say     1295 0.000503 0.583998
   287             later     1294 0.000502 0.584500
   288             point     1293 0.000502 0.585002
   289            random     1290 0.000501 0.585503
   290               com     1285 0.000499 0.586002
   291              once     1285 0.000499 0.586501
   292               top     1280 0.000497 0.586998
   293             field     1278 0.000496 0.587494
   294              left     1274 0.000495 0.587989
   295       expressions     1273 0.000494 0.588483
   296              know     1270 0.000493 0.588976
   297         something     1269 0.000493 0.589468
   298             still     1267 0.000492 0.589960
   299               sub     1267 0.000492 0.590452
   300             email     1265 0.000491 0.590943
   301             users     1264 0.000491 0.591434
   302          features     1260 0.000489 0.591923
   303            either     1256 0.000488 0.592411
   304             every     1250 0.000485 0.592896
   305           options     1248 0.000485 0.593381
   306           context     1244 0.000483 0.593864
   307             group     1244 0.000483 0.594347
   308              been     1241 0.000482 0.594828
   309               too     1241 0.000482 0.595310
   310              were     1236 0.000480 0.595790
   311         reference     1232 0.000478 0.596268
   312           unicode     1231 0.000478 0.596746
   313           classes     1225 0.000476 0.597222
   314             bytes     1216 0.000472 0.597694
   315              root     1211 0.000470 0.598164
   316             named     1201 0.000466 0.598630
   317             train     1200 0.000466 0.599096
   318             split     1194 0.000464 0.599560
   319           writing     1183 0.000459 0.600019
   320              echo     1178 0.000457 0.600476
   321            useful     1177 0.000457 0.600933
   322              back     1173 0.000455 0.601389
   323             close     1172 0.000455 0.601844
   324           control     1171 0.000455 0.602298
   325             count     1167 0.000453 0.602751
   326           modules     1166 0.000453 0.603204
   327             title     1165 0.000452 0.603656
   328          matching     1160 0.000450 0.604107
   329            binary     1159 0.000450 0.604557
   330              long     1159 0.000450 0.605007
   331             works     1158 0.000450 0.605456
   332             state     1152 0.000447 0.605903
   333             makes     1146 0.000445 0.606348
   334            source     1143 0.000444 0.606792
   335              good     1142 0.000443 0.607235
   336             level     1142 0.000443 0.607679
   337               die     1139 0.000442 0.608121
   338              side     1133 0.000440 0.608561
   339             calls     1127 0.000438 0.608998
   340         processes     1125 0.000437 0.609435
   341           support     1121 0.000435 0.609870
   342           feature     1120 0.000435 0.610305
   343            simply     1119 0.000434 0.610739
   344              note     1118 0.000434 0.611174
   345              keys     1113 0.000432 0.611606
   346               sys     1104 0.000429 0.612034
   347           reading     1103 0.000428 0.612462
   348             given     1095 0.000425 0.612888
   349           defined     1084 0.000421 0.613308
   350              done     1080 0.000419 0.613728
   351       environment     1073 0.000417 0.614144
   352           network     1067 0.000414 0.614558
   353            scalar     1064 0.000413 0.614972
   354             range     1062 0.000412 0.615384
   355           matches     1061 0.000412 0.615796
   356            socket     1056 0.000410 0.616206
   357         statement     1056 0.000410 0.616616
   358              none     1055 0.000410 0.617025
   359             block     1053 0.000409 0.617434
   360               put     1051 0.000408 0.617842
   361              take     1049 0.000407 0.618249
   362             false     1044 0.000405 0.618655
   363           tkinter     1042 0.000405 0.619059
   364             shows     1027 0.000399 0.619458
   365             based     1023 0.000397 0.619855
   366            better     1020 0.000396 0.620251
   367              tree     1019 0.000396 0.620647
   368             check     1013 0.000393 0.621040
   369            reilly     1012 0.000393 0.621433
   370               few     1006 0.000391 0.621823
   371           usually     1006 0.000391 0.622214
   372                go     1005 0.000390 0.622604
   373           whether     1000 0.000388 0.622992
   374          possible      998 0.000387 0.623380
   375           display      997 0.000387 0.623767
   376          original      997 0.000387 0.624154
   377            really      991 0.000385 0.624539
   378           already      989 0.000384 0.624923
   379            vector      989 0.000384 0.625307
   380                us      988 0.000384 0.625690
   381               bit      983 0.000382 0.626072
   382           edition      982 0.000381 0.626453
   383            itself      980 0.000380 0.626833
   384        processing      973 0.000378 0.627211
   385            fields      967 0.000375 0.627587
   386            syntax      965 0.000375 0.627961
   387             again      964 0.000374 0.628336
   388            thread      961 0.000373 0.628709
   389            within      956 0.000371 0.629080
   390            things      947 0.000368 0.629447
   391             found      945 0.000367 0.629814
   392            unless      945 0.000367 0.630181
   393         interface      943 0.000366 0.630547
   394            people      943 0.000366 0.630913
   395                id      940 0.000365 0.631278
   396               per      939 0.000365 0.631643
   397              copy      937 0.000364 0.632007
   398            record      937 0.000364 0.632370
   399              full      936 0.000363 0.632734
   400            parent      929 0.000361 0.633094
   401            delete      926 0.000359 0.633454
   402             event      926 0.000359 0.633813
   403               non      925 0.000359 0.634173
   404          sequence      925 0.000359 0.634532
   405              show      925 0.000359 0.634891
   406            arrays      923 0.000358 0.635249
   407             until      917 0.000356 0.635605
   408            memory      915 0.000355 0.635960
   409        particular      913 0.000354 0.636315
   410          provides      913 0.000354 0.636669
   411           changes      912 0.000354 0.637023
   412             types      907 0.000352 0.637375
   413            button      906 0.000352 0.637727
   414               job      903 0.000351 0.638078
   415          although      902 0.000350 0.638428
   416     automatically      902 0.000350 0.638778
   417           address      901 0.000350 0.639128
   418              pass      891 0.000346 0.639474
   419              send      890 0.000346 0.639819
   420              step      890 0.000346 0.640165
   421          contents      887 0.000344 0.640509
   422            action      883 0.000343 0.640852
   423        subroutine      883 0.000343 0.641195
   424              mean      881 0.000342 0.641537
   425             world      879 0.000341 0.641878
   426          internet      878 0.000341 0.642219
   427          encoding      876 0.000340 0.642559
   428                ls      875 0.000340 0.642899
   429       directories      873 0.000339 0.643238
   430          database      872 0.000339 0.643576
   431             large      871 0.000338 0.643914
   432              fact      870 0.000338 0.644252
   433              save      869 0.000337 0.644589
   434          probably      867 0.000337 0.644926
   435        operations      863 0.000335 0.645261
   436              date      861 0.000334 0.645595
   437               etc      861 0.000334 0.645930
   438              help      860 0.000334 0.646263
   439             empty      859 0.000333 0.646597
   440        parameters      858 0.000333 0.646930
   441              pack      855 0.000332 0.647262
   442            select      854 0.000332 0.647594
   443             shown      853 0.000331 0.647925
   444               pop      847 0.000329 0.648254
   445           general      846 0.000328 0.648582
   446              less      843 0.000327 0.648909
   447              bash      840 0.000326 0.649235
   448            length      840 0.000326 0.649561
   449             build      838 0.000325 0.649887
   450              argv      836 0.000325 0.650211
   451             being      828 0.000321 0.650533
   452              pipe      828 0.000321 0.650854
   453         attribute      823 0.000320 0.651174
   454              easy      823 0.000320 0.651493
   455            remote      821 0.000319 0.651812
   456          actually      820 0.000318 0.652130
   457            handle      820 0.000318 0.652449
   458             label      818 0.000318 0.652766
   459          contains      816 0.000317 0.653083
   460          cookbook      816 0.000317 0.653400
   461           earlier      816 0.000317 0.653717
   462          elements      816 0.000317 0.654033
   463             lists      816 0.000317 0.654350
   464           provide      816 0.000317 0.654667
   465           similar      815 0.000316 0.654983
   466                tf      815 0.000316 0.655300
   467            global      811 0.000315 0.655615
   468            widget      811 0.000315 0.655930
   469              give      810 0.000314 0.656244
   470         important      805 0.000313 0.656557
   471        references      804 0.000312 0.656869
   472               isn      803 0.000312 0.657180
   473               inc      802 0.000311 0.657492
   474               str      801 0.000311 0.657803
   475           element      800 0.000311 0.658113
   476              best      798 0.000310 0.658423
   477             times      798 0.000310 0.658733
   478              dict      793 0.000308 0.659041
   479               log      792 0.000307 0.659348
   480             learn      791 0.000307 0.659655
   481               len      790 0.000307 0.659962
   482             entry      789 0.000306 0.660268
   483         languages      786 0.000305 0.660574
   484           details      783 0.000304 0.660878
   485           threads      781 0.000303 0.661181
   486          terminal      780 0.000303 0.661484
   487             items      778 0.000302 0.661786
   488            except      771 0.000299 0.662085
   489          solution      771 0.000299 0.662384
   490              grep      768 0.000298 0.662682
   491             image      767 0.000298 0.662980
   492             parts      767 0.000298 0.663278
   493           results      767 0.000298 0.663576
   494           several      767 0.000298 0.663873
   495            allows      766 0.000297 0.664171
   496              runs      764 0.000297 0.664467
   497          patterns      763 0.000296 0.664764
   498               who      763 0.000296 0.665060
   499            making      761 0.000295 0.665355
   500             place      761 0.000295 0.665651
   501             layer      760 0.000295 0.665946
   502            prompt      758 0.000294 0.666240
   503               ftp      754 0.000293 0.666533
   504     documentation      751 0.000292 0.666824
   505               tar      750 0.000291 0.667116
   506              real      749 0.000291 0.667406
   507               row      748 0.000290 0.667697
   508            signal      738 0.000287 0.667983
   509               won      738 0.000287 0.668270
   510           written      738 0.000287 0.668556
   511               dir      737 0.000286 0.668842
   512               win      737 0.000286 0.669129
   513             hello      735 0.000285 0.669414
   514         sometimes      734 0.000285 0.669699
   515         generally      732 0.000284 0.669983
   516         including      732 0.000284 0.670267
   517            status      732 0.000284 0.670551
   518            column      727 0.000282 0.670834
   519            editor      727 0.000282 0.671116
   520           columns      725 0.000281 0.671397
   521            series      725 0.000281 0.671679
   522               off      724 0.000281 0.671960
   523            around      722 0.000280 0.672240
   524        filehandle      721 0.000280 0.672520
   525                vi      719 0.000279 0.672799
   526              spam      713 0.000277 0.673076
   527              ways      713 0.000277 0.673353
   528               why      711 0.000276 0.673629
   529              fred      710 0.000276 0.673905
   530          creating      709 0.000275 0.674180
   531              init      709 0.000275 0.674455
   532              zero      709 0.000275 0.674730
   533            passed      706 0.000274 0.675004
   534            shells      703 0.000273 0.675277
   535               cat      695 0.000270 0.675547
   536         filenames      695 0.000270 0.675817
   537           setting      695 0.000270 0.676087
   538              eval      694 0.000269 0.676356
   539             posix      694 0.000269 0.676626
   540             child      693 0.000269 0.676895
   541          reserved      691 0.000268 0.677163
   542             under      691 0.000268 0.677431
   543              sure      689 0.000267 0.677699
   544               url      689 0.000267 0.677966
   545                np      686 0.000266 0.678232
   546           specify      686 0.000266 0.678499
   547          directly      685 0.000266 0.678765
   548             ascii      684 0.000266 0.679030
   549               old      684 0.000266 0.679296
   550          advanced      683 0.000265 0.679561
   551        attributes      682 0.000265 0.679826
   552              item      681 0.000264 0.680090
   553          messages      679 0.000264 0.680354
   554              term      679 0.000264 0.680617
   555               map      678 0.000263 0.680881
   556         exception      677 0.000263 0.681143
   557              made      677 0.000263 0.681406
   558            hidden      676 0.000262 0.681669
   559             basic      675 0.000262 0.681931
   560           created      675 0.000262 0.682193
   561              sets      675 0.000262 0.682455

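So the result is: "O'Reilly books are mostly made of 561 words." The top 561 words account for roughly 68% of all word occurrences (cumulative ratio 0.682455 in the last row of the table above).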
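For readers who want to reproduce this kind of cutoff, here is a minimal sketch of how the ratio and cumulative-ratio columns can be computed with awk. It assumes input lines of the form `word count`, already sorted by descending count. The function name `cumulative_table` and its threshold argument are hypothetical, and the sample data is made up rather than taken from the O'Reilly corpus.

```shell
#!/bin/sh
# Hypothetical helper (not from the article): reads "word count" lines
# sorted by descending count and prints
#   rank  word  count  ratio  cumulative_ratio
# stopping at the first row whose cumulative ratio reaches the threshold $1.
cumulative_table() {
  awk -v limit="$1" '
    # Buffer every row and sum the counts; the total is needed before
    # any ratio can be printed.
    { word[NR] = $1; count[NR] = $2; total += $2 }
    END {
      for (i = 1; i <= NR; i++) {
        ratio = count[i] / total
        cum += ratio
        printf "%4d %16s %8d %f %f\n", i, word[i], count[i], ratio, cum
        if (cum >= limit + 0) break   # threshold covered: stop here
      }
    }'
}

# Made-up sample input, only to show the output shape.
printf '%s\n' 'the 50' 'to 30' 'of 15' 'perl 5' | cumulative_table 0.9
```

With the real data, the input would be the `uniq -c | sort -k1nr` output with its two columns swapped into `word count` order. The rank of the last printed row is then the "N words" figure for the chosen threshold.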
