
O'Reilly Books Are Made Up of Roughly OOO Words

Posted at 2019-02-08

This is a continuation of the previous post.
This time I use awk and sed to take the analysis further.

Simplifying the PDF text processing

pdftotext is slow, so from here on we start from the point where the previous post's result has been written to oreilly_text_all.txt.

Convert multiple PDFs to text and store the result in one file
# oreilly.txt is assumed to list one PDF path per line
xargs -I{} pdftotext {} - < oreilly.txt > oreilly_text_all.txt

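Graphing

Assuming the "word count" pairs are sorted in descending order, we pipe them into an awk script [1] that draws a bar graph in which the largest count is represented by 50 "#" characters.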

Display the top 300 words appearing in oreilly_text_all as a graph
cat oreilly_text_all.txt | tr '[A-Z]' '[a-z]' | grep -oE '[a-z]{2,}' | sort |
  uniq -c | sort -k1nr | awk '{printf "%16s %4d\n",$2,$1;}'|
  awk '!max{max=$2;}{f=50/max;if(f>1)f=1;i=$2*f;r="";while(i-->0)r=r"#";printf "%s %s\n",$0,r;}' |  # draw the bars
  head -300 |  # top 300
  cat -n       # line numbers
Top 300 most frequent English words in the O'Reilly books
     1	             the 138844 ##################################################
     2	              to 64866 ########################
     3	              in 50969 ###################
     4	              of 50764 ###################
     5	             and 48537 ##################
     6	              is 38182 ##############
     7	             you 29710 ###########
     8	             for 28060 ###########
     9	            that 26829 ##########
    10	              it 23518 #########
    11	            this 21736 ########
    12	              as 18763 #######
    13	            with 18283 #######
    14	              if 16465 ######
    15	              or 15867 ######
    16	              on 15841 ######
    17	             are 14711 ######
    18	             can 14594 ######
    19	              be 14418 ######
    20	              we 14357 ######
    21	            file 13089 #####
    22	              by 12027 #####
    23	            from 11293 #####
    24	             use 11250 #####
    25	              an 11174 #####
    26	             not 9980 ####
    27	            perl 9366 ####
    28	             but 8904 ####
    29	            data 8895 ####
    30	         command 8735 ####
    31	             one 8286 ###
    32	             all 8149 ###
    33	         example 8041 ###
    34	            line 8035 ###
    35	              at 7848 ###
    36	          python 7721 ###
    37	            your 7681 ###
    38	         chapter 7204 ###
    39	            have 7073 ###
    40	            more 6918 ###
    41	            when 6779 ###
    42	            name 6483 ###
    43	        function 6394 ###
    44	           which 6362 ###
    45	            will 6304 ###
    46	           files 6248 ###
    47	            text 6128 ###
    48	            like 6023 ###
    49	              so 6009 ###
    50	            also 5889 ###
    51	         program 5853 ###
    52	             see 5853 ###
    53	            list 5783 ###
    54	           using 5694 ###
    55	           print 5610 ###
    56	         section 5174 ##
    57	           other 5110 ##
    58	            self 5091 ##
    59	             set 5084 ##
    60	           shell 4949 ##
    61	           first 4948 ##
    62	            code 4913 ##
    63	            time 4865 ##
    64	             its 4829 ##
    65	              do 4819 ##
    66	           value 4778 ##
    67	            each 4702 ##
    68	            some 4650 ##
    69	            only 4561 ##
    70	          system 4540 ##
    71	          module 4425 ##
    72	          script 4413 ##
    73	             may 4377 ##
    74	          string 4346 ##
    75	             any 4339 ##
    76	           there 4320 ##
    77	            used 4318 ##
    78	           class 4245 ##
    79	            here 4214 ##
    80	          output 4163 ##
    81	             has 4146 ##
    82	            they 4142 ##
    83	          number 4125 ##
    84	             out 4073 ##
    85	            just 3973 ##
    86	             new 3954 ##
    87	             two 3945 ##
    88	         because 3894 ##
    89	            same 3815 ##
    90	           input 3792 ##
    91	             get 3709 ##
    92	            into 3695 ##
    93	        variable 3692 ##
    94	       directory 3664 ##
    95	           these 3657 ##
    96	              no 3551 ##
    97	             run 3514 ##
    98	              up 3495 ##
    99	            what 3440 ##
   100	            then 3419 ##
   101	            than 3395 ##
   102	            want 3379 ##
   103	          return 3353 ##
   104	            user 3341 ##
   105	             how 3327 ##
   106	           array 3273 ##
   107	          object 3255 ##
   108	            make 3254 ##
   109	              my 3208 ##
   110	            test 3196 ##
   111	            next 3173 ##
   112	            them 3155 ##
   113	              re 3136 ##
   114	             way 3104 ##
   115	              ll 3102 ##
   116	            such 3086 ##
   117	            need 3081 ##
   118	            book 3072 ##
   119	            type 2992 ##
   120	          method 2964 ##
   121	          server 2918 ##
   122	           about 2894 ##
   123	            most 2875 ##
   124	            read 2840 ##
   125	           lines 2837 ##
   126	        standard 2831 ##
   127	             end 2827 ##
   128	          values 2806 ##
   129	         pattern 2757 #
   130	            many 2736 #
   131	              py 2732 #
   132	            unix 2700 #
   133	            find 2677 #
   134	         process 2664 #
   135	       functions 2646 #
   136	            open 2571 #
   137	             don 2544 #
   138	             key 2537 #
   139	        learning 2536 #
   140	       character 2511 #
   141	             was 2431 #
   142	           would 2385 #
   143	             def 2361 #
   144	          import 2354 #
   145	            call 2351 #
   146	     programming 2347 #
   147	       variables 2334 #
   148	            both 2330 #
   149	            does 2325 #
   150	          should 2318 #
   151	             now 2317 #
   152	           match 2303 #
   153	        instance 2298 #
   154	        operator 2285 #
   155	            even 2276 #
   156	            last 2267 #
   157	      characters 2265 #
   158	           after 2255 #
   159	          figure 2234 #
   160	            work 2200 #
   161	       different 2196 #
   162	           their 2172 #
   163	            html 2160 #
   164	         another 2140 #
   165	         regular 2120 #
   166	          simple 2116 #
   167	           while 2113 #
   168	         default 2101 #
   169	           names 2083 #
   170	          before 2080 #
   171	           might 2079 #
   172	         instead 2076 #
   173	             our 2070 #
   174	           index 2061 #
   175	        programs 2060 #
   176	           could 2058 #
   177	             awk 2050 #
   178	        commands 2025 #
   179	      expression 2022 #
   180	           write 2016 #
   181	          called 2014 #
   182	            loop 2003 #
   183	           start 1995 #
   184	           model 1992 #
   185	           where 1973 #
   186	            true 1967 #
   187	          option 1940 #
   188	            part 1932 #
   189	            case 1921 #
   190	             bin 1920 #
   191	           tools 1907 #
   192	          window 1906 #
   193	             gui 1856 #
   194	          single 1855 #
   195	         version 1837 #
   196	           error 1835 #
   197	        argument 1813 #
   198	         running 1811 #
   199	         current 1809 #
   200	        training 1808 #
   201	         message 1761 #
   202	         scripts 1761 #
   203	            word 1754 #
   204	        examples 1737 #
   205	            page 1729 #
   206	         between 1728 #
   207	            hash 1728 #
   208	            much 1724 #
   209	         without 1716 #
   210	            look 1708 #
   211	            well 1702 #
   212	              pp 1687 #
   213	           since 1687 #
   214	       arguments 1665 #
   215	            path 1656 #
   216	            must 1651 #
   217	          client 1645 #
   218	       following 1645 #
   219	           order 1645 #
   220	           table 1645 #
   221	          change 1638 #
   222	             let 1633 #
   223	        filename 1629 #
   224	            main 1621 #
   225	         package 1606 #
   226	              os 1598 #
   227	            else 1595 #
   228	            form 1594 #
   229	        language 1590 #
   230	           doesn 1584 #
   231	          though 1584 #
   232	             web 1573 #
   233	         returns 1569 #
   234	          create 1568 #
   235	           local 1568 #
   236	            sort 1556 #
   237	         strings 1554 #
   238	        previous 1551 #
   239	          access 1541 #
   240	         windows 1537 #
   241	         special 1533 #
   242	         library 1532 #
   243	          result 1519 #
   244	           right 1516 #
   245	           three 1515 #
   246	             add 1511 #
   247	         objects 1510 #
   248	         through 1509 #
   249	            uses 1507 #
   250	            over 1505 #
   251	         machine 1502 #
   252	            mail 1490 #
   253	           space 1483 #
   254	         numbers 1482 #
   255	        multiple 1475 #
   256	         methods 1465 #
   257	              ve 1463 #
   258	           those 1455 #
   259	            http 1443 #
   260	           built 1441 #
   261	          second 1433 #
   262	            very 1433 #
   263	         systems 1432 #
   264	     information 1431 #
   265	          search 1426 #
   266	            home 1415 #
   267	             own 1411 #
   268	             try 1401 #
   269	          format 1379 #
   270	           means 1371 #
   271	            mode 1371 #
   272	           words 1371 #
   273	             sed 1366 #
   274	         problem 1355 #
   275	           often 1350 #
   276	            size 1338 #
   277	            exit 1329 #
   278	       operators 1329 #
   279	          always 1318 #
   280	             usr 1313 #
   281	             cgi 1312 #
   282	         however 1299 #
   283	       available 1297 #
   284	          common 1296 #
   285	             say 1295 #
   286	           later 1294 #
   287	           point 1293 #
   288	          random 1290 #
   289	             com 1285 #
   290	            once 1285 #
   291	             top 1280 #
   292	           field 1278 #
   293	            left 1274 #
   294	     expressions 1273 #
   295	            know 1270 #
   296	       something 1269 #
   297	           still 1267 #
   298	             sub 1267 #
   299	           email 1265 #
   300	           users 1264 #
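The bar-scaling step can also be tried on its own with a tiny input. The following is only a sketch: the function name `bar_graph`, its `width` argument, and the sample data are assumptions made for illustration. It also rounds each bar to the nearest whole "#", whereas the one-liner above rounds up.

```shell
# Hypothetical helper: reads "word count" lines sorted by count (descending)
# and appends a bar scaled so that the largest count gets `width` marks.
bar_graph() {
  awk -v width="${1:-50}" '
    NR == 1 { max = $2 }                # the first line holds the maximum
    {
      n = int($2 * width / max + 0.5)   # round to the nearest whole "#"
      bar = ""
      for (i = 0; i < n; i++) bar = bar "#"
      printf "%-8s %5d %s\n", $1, $2, bar
    }'
}

# Made-up sample data, scaled to a 10-character maximum
printf '%s\n' 'the 100' 'to 50' 'in 25' | bar_graph 10
```

Passing the width as an argument makes it easy to fit the graph to a narrower terminal.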

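Ratio

Now that we have a count for every word, let's look at each word's share of the total.
The awk script below stores every line in arrays while accumulating the total, then in the END block prints the array contents, dividing each count by the total.

Share of all words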
cat oreilly_text_all.txt | tr '[A-Z]' '[a-z]' |
  grep -oE '[a-z]{2,}' | sort | uniq -c | sort -k1nr |
  awk '{
    word[NR]=$2;
    count[NR]=$1;
    s+=$1;
  }
  END{
    for(i=1;i<=NR;i++)
    {printf "%16s %8d %6f\n", word[i],count[i],count[i]/s}
  }' |
  head

             the   138844 0.053903
              to    64866 0.025183
              in    50969 0.019788
              of    50764 0.019708
             and    48537 0.018843
              is    38182 0.014823
             you    29710 0.011534
             for    28060 0.010894
            that    26829 0.010416
              it    23518 0.009130

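From left to right, the columns are the word, its number of occurrences, and its share of all words.
For example, "the" at the top appears about 140,000 times across all the PDFs and accounts for about 5.4% of all words.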
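To see the same calculation on something small enough to check by hand, here is a minimal sketch. The function name `word_ratio` and the sample sentence are assumptions for illustration. It prints the share as a percentage and adds a secondary sort key so that ties come out in alphabetical order.

```shell
# Hypothetical helper: reads text on stdin, prints "word count share%".
word_ratio() {
  tr 'A-Z' 'a-z' |                       # normalize to lowercase
    grep -oE '[a-z]{2,}' |               # words of two or more letters
    sort | uniq -c | sort -k1,1nr -k2,2 |
    awk '{ word[NR] = $2; count[NR] = $1; total += $1 }
         END {
           for (i = 1; i <= NR; i++)
             printf "%s %d %.1f%%\n", word[i], count[i], 100 * count[i] / total
         }'
}

# Made-up sample: 8 words in total, "the" appears 3 times
echo 'The cat and the dog and the bird' | word_ratio
```

With this input "the" should come out at 3 of 8 words, i.e. 37.5%.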

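Cumulative sum

In the same way, we use awk arrays to output the cumulative sum (cumsum), i.e. the running total of the ratio column.

Cumulative sum of the ratios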
cat oreilly_text_all.txt | tr '[A-Z]' '[a-z]' |  # normalize to lowercase
  grep -oE '[a-z]{2,}' | sort | uniq -c | sort -k1nr |  # word count
  awk '{
    word[NR]=$2;
    count[NR]=$1;
    s+=$1;
    cumsum[NR]=s
  }
  END{
    for(i=1;i<=NR;i++)
    {printf "%16s %8d %6f %6f\n", word[i],count[i],count[i]/s,cumsum[i]/s}
  }' |  # count / ratio / cumulative sum
  head -50 |
  sed -e '1s/^/            word    count    ratio   cumsum\n/'  # header

            word    count    ratio   cumsum
             the   138844 0.053903 0.053903
              to    64866 0.025183 0.079086
              in    50969 0.019788 0.098873
              of    50764 0.019708 0.118581
             and    48537 0.018843 0.137425
              is    38182 0.014823 0.152248
             you    29710 0.011534 0.163782
             for    28060 0.010894 0.174676
            that    26829 0.010416 0.185092
              it    23518 0.009130 0.194222
            this    21736 0.008439 0.202661
              as    18763 0.007284 0.209945
            with    18283 0.007098 0.217043
              if    16465 0.006392 0.223435
              or    15867 0.006160 0.229595
              on    15841 0.006150 0.235745
             are    14711 0.005711 0.241456
             can    14594 0.005666 0.247122
              be    14418 0.005597 0.252720
              we    14357 0.005574 0.258293
            file    13089 0.005082 0.263375
              by    12027 0.004669 0.268044
            from    11293 0.004384 0.272428
             use    11250 0.004368 0.276796
              an    11174 0.004338 0.281134
             not     9980 0.003875 0.285009
            perl     9366 0.003636 0.288645
             but     8904 0.003457 0.292101
            data     8895 0.003453 0.295555
         command     8735 0.003391 0.298946
             one     8286 0.003217 0.302163
             all     8149 0.003164 0.305326
         example     8041 0.003122 0.308448
            line     8035 0.003119 0.311568
              at     7848 0.003047 0.314614
          python     7721 0.002998 0.317612
            your     7681 0.002982 0.320594
         chapter     7204 0.002797 0.323391
            have     7073 0.002746 0.326137
            more     6918 0.002686 0.328822
            when     6779 0.002632 0.331454
            name     6483 0.002517 0.333971
        function     6394 0.002482 0.336453
           which     6362 0.002470 0.338923
            will     6304 0.002447 0.341371
           files     6248 0.002426 0.343796
            text     6128 0.002379 0.346175
            like     6023 0.002338 0.348514
              so     6009 0.002333 0.350847
            also     5889 0.002286 0.353133

Naturally, the last cumsum value is 1.

The cumulative sum ends at 1
cat oreilly_text_all.txt | tr '[A-Z]' '[a-z]' |  # normalize to lowercase
  grep -oE '[a-z]{2,}' | sort | uniq -c | sort -k1nr |  # word count
  awk '{
    word[NR]=$2;
    count[NR]=$1;
    s+=$1;
    cumsum[NR]=s
  }
  END{
    for(i=1;i<=NR;i++)
    {printf "%16s %8d %6f %6f\n", word[i],count[i],count[i]/s,cumsum[i]/s}
  }' |  # count / ratio / cumulative sum
  tail

          zwemer        1 0.000000 0.999997
          zwicky        1 0.000000 0.999997
         zwspace        1 0.000000 0.999997
              zx        1 0.000000 0.999998
             zxf        1 0.000000 0.999998
          zygote        1 0.000000 0.999998
           zyzzy        1 0.000000 0.999999
       zzzzteana        1 0.000000 0.999999
         zzzzzzz        1 0.000000 1.000000
        zzzzzzzz        1 0.000000 1.000000
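If the sorted counts are saved to a file first, the same table can be produced without holding every line in awk arrays, by reading the file twice. This is only a sketch of that alternative, not the approach used in this article. The function name `cumsum_table` and the sample counts are assumptions.

```shell
# Hypothetical helper: $1 is a file of "count word" lines sorted by count
# (descending), as produced by `uniq -c | sort -k1nr`.
cumsum_table() {
  awk 'NR == FNR { total += $1; next }   # pass 1: sum all the counts
       {                                 # pass 2: print ratio and running total
         run += $1
         printf "%s %d %.2f %.2f\n", $2, $1, $1 / total, run / total
       }' "$1" "$1"
}

# Made-up sample counts written to a temporary file
counts=$(mktemp)
printf '%s\n' '5 the' '3 to' '2 in' > "$counts"
cumsum_table "$counts"
rm -f "$counts"
```

The last line of the output should again show a cumulative sum of 1.00, because the running total ends at the overall total.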

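I'm not sure this reasoning is sound, but to decide what counts as "most of the words" I borrow a figure from the normal distribution.
Using awk's if, we display only the words whose cumulative sum stays below the share covered by 1σ of a normal distribution (= 68.27%). [2]

Display the cumulative sum up to 1σ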
cat oreilly_text_all.txt | tr '[A-Z]' '[a-z]' |  # normalize to lowercase
  grep -oE '[a-z]{2,}' | sort | uniq -c | sort -k1nr |  # word count
  awk '{
    word[NR]=$2;
    count[NR]=$1;
    s+=$1;
    cumsum[NR]=s
  }
  END{
    for(i=1;i<=NR;i++)
    {if (cumsum[i]/s<0.6827) printf "%16s %8d %6f %6f\n", word[i],count[i],count[i]/s,cumsum[i]/s}
  }' |  # count / ratio / cumulative sum; print rows while column 4 is below 1σ
  sed -e '1s/^/            word    count    ratio   cumsum\n/' |  # header
  cat -n  # line numbers

     1	            word    count    ratio   cumsum
     2	             the   138844 0.053903 0.053903
     3	              to    64866 0.025183 0.079086
     4	              in    50969 0.019788 0.098873
     5	              of    50764 0.019708 0.118581
     6	             and    48537 0.018843 0.137425
     7	              is    38182 0.014823 0.152248
     8	             you    29710 0.011534 0.163782
     9	             for    28060 0.010894 0.174676
    10	            that    26829 0.010416 0.185092
    11	              it    23518 0.009130 0.194222
    12	            this    21736 0.008439 0.202661
    13	              as    18763 0.007284 0.209945
    14	            with    18283 0.007098 0.217043
    15	              if    16465 0.006392 0.223435
    16	              or    15867 0.006160 0.229595
    17	              on    15841 0.006150 0.235745
    18	             are    14711 0.005711 0.241456
    19	             can    14594 0.005666 0.247122
    20	              be    14418 0.005597 0.252720
    21	              we    14357 0.005574 0.258293
    22	            file    13089 0.005082 0.263375
    23	              by    12027 0.004669 0.268044
    24	            from    11293 0.004384 0.272428
    25	             use    11250 0.004368 0.276796
    26	              an    11174 0.004338 0.281134
    27	             not     9980 0.003875 0.285009
    28	            perl     9366 0.003636 0.288645
    29	             but     8904 0.003457 0.292101
    30	            data     8895 0.003453 0.295555
    31	         command     8735 0.003391 0.298946
    32	             one     8286 0.003217 0.302163
    33	             all     8149 0.003164 0.305326
    34	         example     8041 0.003122 0.308448
    35	            line     8035 0.003119 0.311568
    36	              at     7848 0.003047 0.314614
    37	          python     7721 0.002998 0.317612
    38	            your     7681 0.002982 0.320594
    39	         chapter     7204 0.002797 0.323391
    40	            have     7073 0.002746 0.326137
    41	            more     6918 0.002686 0.328822
    42	            when     6779 0.002632 0.331454
    43	            name     6483 0.002517 0.333971
    44	        function     6394 0.002482 0.336453
    45	           which     6362 0.002470 0.338923
    46	            will     6304 0.002447 0.341371
    47	           files     6248 0.002426 0.343796
    48	            text     6128 0.002379 0.346175
    49	            like     6023 0.002338 0.348514
    50	              so     6009 0.002333 0.350847
    51	            also     5889 0.002286 0.353133
    52	         program     5853 0.002272 0.355405
    53	             see     5853 0.002272 0.357677
    54	            list     5783 0.002245 0.359923
    55	           using     5694 0.002211 0.362133
    56	           print     5610 0.002178 0.364311
    57	         section     5174 0.002009 0.366320
    58	           other     5110 0.001984 0.368304
    59	            self     5091 0.001976 0.370280
    60	             set     5084 0.001974 0.372254
    61	           shell     4949 0.001921 0.374175
    62	           first     4948 0.001921 0.376096
    63	            code     4913 0.001907 0.378003
    64	            time     4865 0.001889 0.379892
    65	             its     4829 0.001875 0.381767
    66	              do     4819 0.001871 0.383638
    67	           value     4778 0.001855 0.385493
    68	            each     4702 0.001825 0.387318
    69	            some     4650 0.001805 0.389123
    70	            only     4561 0.001771 0.390894
    71	          system     4540 0.001763 0.392657
    72	          module     4425 0.001718 0.394375
    73	          script     4413 0.001713 0.396088
    74	             may     4377 0.001699 0.397787
    75	          string     4346 0.001687 0.399474
    76	             any     4339 0.001685 0.401159
    77	           there     4320 0.001677 0.402836
    78	            used     4318 0.001676 0.404512
    79	           class     4245 0.001648 0.406160
    80	            here     4214 0.001636 0.407796
    81	          output     4163 0.001616 0.409413
    82	             has     4146 0.001610 0.411022
    83	            they     4142 0.001608 0.412630
    84	          number     4125 0.001601 0.414232
    85	             out     4073 0.001581 0.415813
    86	            just     3973 0.001542 0.417355
    87	             new     3954 0.001535 0.418890
    88	             two     3945 0.001532 0.420422
    89	         because     3894 0.001512 0.421934
    90	            same     3815 0.001481 0.423415
    91	           input     3792 0.001472 0.424887
    92	             get     3709 0.001440 0.426327
    93	            into     3695 0.001435 0.427761
    94	        variable     3692 0.001433 0.429195
    95	       directory     3664 0.001422 0.430617
    96	           these     3657 0.001420 0.432037
    97	              no     3551 0.001379 0.433416
    98	             run     3514 0.001364 0.434780
    99	              up     3495 0.001357 0.436137
   100	            what     3440 0.001336 0.437472
   101	            then     3419 0.001327 0.438800
   102	            than     3395 0.001318 0.440118
   103	            want     3379 0.001312 0.441429
   104	          return     3353 0.001302 0.442731
   105	            user     3341 0.001297 0.444028
   106	             how     3327 0.001292 0.445320
   107	           array     3273 0.001271 0.446591
   108	          object     3255 0.001264 0.447854
   109	            make     3254 0.001263 0.449117
   110	              my     3208 0.001245 0.450363
   111	            test     3196 0.001241 0.451604
   112	            next     3173 0.001232 0.452836
   113	            them     3155 0.001225 0.454060
   114	              re     3136 0.001217 0.455278
   115	             way     3104 0.001205 0.456483
   116	              ll     3102 0.001204 0.457687
   117	            such     3086 0.001198 0.458885
   118	            need     3081 0.001196 0.460081
   119	            book     3072 0.001193 0.461274
   120	            type     2992 0.001162 0.462436
   121	          method     2964 0.001151 0.463586
   122	          server     2918 0.001133 0.464719
   123	           about     2894 0.001124 0.465843
   124	            most     2875 0.001116 0.466959
   125	            read     2840 0.001103 0.468061
   126	           lines     2837 0.001101 0.469163
   127	        standard     2831 0.001099 0.470262
   128	             end     2827 0.001098 0.471359
   129	          values     2806 0.001089 0.472449
   130	         pattern     2757 0.001070 0.473519
   131	            many     2736 0.001062 0.474581
   132	              py     2732 0.001061 0.475642
   133	            unix     2700 0.001048 0.476690
   134	            find     2677 0.001039 0.477730
   135	         process     2664 0.001034 0.478764
   136	       functions     2646 0.001027 0.479791
   137	            open     2571 0.000998 0.480789
   138	             don     2544 0.000988 0.481777
   139	             key     2537 0.000985 0.482762
   140	        learning     2536 0.000985 0.483746
   141	       character     2511 0.000975 0.484721
   142	             was     2431 0.000944 0.485665
   143	           would     2385 0.000926 0.486591
   144	             def     2361 0.000917 0.487507
   145	          import     2354 0.000914 0.488421
   146	            call     2351 0.000913 0.489334
   147	     programming     2347 0.000911 0.490245
   148	       variables     2334 0.000906 0.491151
   149	            both     2330 0.000905 0.492056
   150	            does     2325 0.000903 0.492959
   151	          should     2318 0.000900 0.493858
   152	             now     2317 0.000900 0.494758
   153	           match     2303 0.000894 0.495652
   154	        instance     2298 0.000892 0.496544
   155	        operator     2285 0.000887 0.497431
   156	            even     2276 0.000884 0.498315
   157	            last     2267 0.000880 0.499195
   158	      characters     2265 0.000879 0.500074
   159	           after     2255 0.000875 0.500950
   160	          figure     2234 0.000867 0.501817
   161	            work     2200 0.000854 0.502671
   162	       different     2196 0.000853 0.503524
   163	           their     2172 0.000843 0.504367
   164	            html     2160 0.000839 0.505206
   165	         another     2140 0.000831 0.506036
   166	         regular     2120 0.000823 0.506859
   167	          simple     2116 0.000821 0.507681
   168	           while     2113 0.000820 0.508501
   169	         default     2101 0.000816 0.509317
   170	           names     2083 0.000809 0.510126
   171	          before     2080 0.000808 0.510933
   172	           might     2079 0.000807 0.511740
   173	         instead     2076 0.000806 0.512546
   174	             our     2070 0.000804 0.513350
   175	           index     2061 0.000800 0.514150
   176	        programs     2060 0.000800 0.514950
   177	           could     2058 0.000799 0.515749
   178	             awk     2050 0.000796 0.516545
   179	        commands     2025 0.000786 0.517331
   180	      expression     2022 0.000785 0.518116
   181	           write     2016 0.000783 0.518898
   182	          called     2014 0.000782 0.519680
   183	            loop     2003 0.000778 0.520458
   184	           start     1995 0.000775 0.521232
   185	           model     1992 0.000773 0.522006
   186	           where     1973 0.000766 0.522772
   187	            true     1967 0.000764 0.523535
   188	          option     1940 0.000753 0.524289
   189	            part     1932 0.000750 0.525039
   190	            case     1921 0.000746 0.525784
   191	             bin     1920 0.000745 0.526530
   192	           tools     1907 0.000740 0.527270
   193	          window     1906 0.000740 0.528010
   194	             gui     1856 0.000721 0.528731
   195	          single     1855 0.000720 0.529451
   196	         version     1837 0.000713 0.530164
   197	           error     1835 0.000712 0.530876
   198	        argument     1813 0.000704 0.531580
   199	         running     1811 0.000703 0.532283
   200	         current     1809 0.000702 0.532986
   201	        training     1808 0.000702 0.533688
   202	         message     1761 0.000684 0.534371
   203	         scripts     1761 0.000684 0.535055
   204	            word     1754 0.000681 0.535736
   205	        examples     1737 0.000674 0.536410
   206	            page     1729 0.000671 0.537081
   207	         between     1728 0.000671 0.537752
   208	            hash     1728 0.000671 0.538423
   209	            much     1724 0.000669 0.539092
   210	         without     1716 0.000666 0.539759
   211	            look     1708 0.000663 0.540422
   212	            well     1702 0.000661 0.541082
   213	              pp     1687 0.000655 0.541737
   214	           since     1687 0.000655 0.542392
   215	       arguments     1665 0.000646 0.543039
   216	            path     1656 0.000643 0.543682
   217	            must     1651 0.000641 0.544323
   218	          client     1645 0.000639 0.544961
   219	       following     1645 0.000639 0.545600
   220	           order     1645 0.000639 0.546239
   221	           table     1645 0.000639 0.546877
   222	          change     1638 0.000636 0.547513
   223	             let     1633 0.000634 0.548147
   224	        filename     1629 0.000632 0.548779
   225	            main     1621 0.000629 0.549409
   226	         package     1606 0.000623 0.550032
   227	              os     1598 0.000620 0.550653
   228	            else     1595 0.000619 0.551272
   229	            form     1594 0.000619 0.551891
   230	        language     1590 0.000617 0.552508
   231	           doesn     1584 0.000615 0.553123
   232	          though     1584 0.000615 0.553738
   233	             web     1573 0.000611 0.554349
   234	         returns     1569 0.000609 0.554958
   235	          create     1568 0.000609 0.555566
   236	           local     1568 0.000609 0.556175
   237	            sort     1556 0.000604 0.556779
   238	         strings     1554 0.000603 0.557383
   239	        previous     1551 0.000602 0.557985
   240	          access     1541 0.000598 0.558583
   241	         windows     1537 0.000597 0.559180
   242	         special     1533 0.000595 0.559775
   243	         library     1532 0.000595 0.560370
   244	          result     1519 0.000590 0.560959
   245	           right     1516 0.000589 0.561548
   246	           three     1515 0.000588 0.562136
   247	             add     1511 0.000587 0.562723
   248	         objects     1510 0.000586 0.563309
   249	         through     1509 0.000586 0.563895
   250	            uses     1507 0.000585 0.564480
   251	            over     1505 0.000584 0.565064
   252	         machine     1502 0.000583 0.565647
   253	            mail     1490 0.000578 0.566226
   254	           space     1483 0.000576 0.566801
   255	         numbers     1482 0.000575 0.567377
   256	        multiple     1475 0.000573 0.567949
   257	         methods     1465 0.000569 0.568518
   258	              ve     1463 0.000568 0.569086
   259	           those     1455 0.000565 0.569651
   260	            http     1443 0.000560 0.570211
   261	           built     1441 0.000559 0.570771
   262	          second     1433 0.000556 0.571327
   263	            very     1433 0.000556 0.571883
   264	         systems     1432 0.000556 0.572439
   265	     information     1431 0.000556 0.572995
   266	          search     1426 0.000554 0.573548
   267	            home     1415 0.000549 0.574098
   268	             own     1411 0.000548 0.574646
   269	             try     1401 0.000544 0.575189
   270	          format     1379 0.000535 0.575725
   271	           means     1371 0.000532 0.576257
   272	            mode     1371 0.000532 0.576789
   273	           words     1371 0.000532 0.577322
   274	             sed     1366 0.000530 0.577852
   275	         problem     1355 0.000526 0.578378
   276	           often     1350 0.000524 0.578902
   277	            size     1338 0.000519 0.579422
   278	            exit     1329 0.000516 0.579937
   279	       operators     1329 0.000516 0.580453
   280	          always     1318 0.000512 0.580965
   281	             usr     1313 0.000510 0.581475
   282	             cgi     1312 0.000509 0.581984
   283	         however     1299 0.000504 0.582489
   284	       available     1297 0.000504 0.582992
   285	          common     1296 0.000503 0.583495
   286	             say     1295 0.000503 0.583998
   287	           later     1294 0.000502 0.584500
   288	           point     1293 0.000502 0.585002
   289	          random     1290 0.000501 0.585503
   290	             com     1285 0.000499 0.586002
   291	            once     1285 0.000499 0.586501
   292	             top     1280 0.000497 0.586998
   293	           field     1278 0.000496 0.587494
   294	            left     1274 0.000495 0.587989
   295	     expressions     1273 0.000494 0.588483
   296	            know     1270 0.000493 0.588976
   297	       something     1269 0.000493 0.589468
   298	           still     1267 0.000492 0.589960
   299	             sub     1267 0.000492 0.590452
   300	           email     1265 0.000491 0.590943
   301	           users     1264 0.000491 0.591434
   302	        features     1260 0.000489 0.591923
   303	          either     1256 0.000488 0.592411
   304	           every     1250 0.000485 0.592896
   305	         options     1248 0.000485 0.593381
   306	         context     1244 0.000483 0.593864
   307	           group     1244 0.000483 0.594347
   308	            been     1241 0.000482 0.594828
   309	             too     1241 0.000482 0.595310
   310	            were     1236 0.000480 0.595790
   311	       reference     1232 0.000478 0.596268
   312	         unicode     1231 0.000478 0.596746
   313	         classes     1225 0.000476 0.597222
   314	           bytes     1216 0.000472 0.597694
   315	            root     1211 0.000470 0.598164
   316	           named     1201 0.000466 0.598630
   317	           train     1200 0.000466 0.599096
   318	           split     1194 0.000464 0.599560
   319	         writing     1183 0.000459 0.600019
   320	            echo     1178 0.000457 0.600476
   321	          useful     1177 0.000457 0.600933
   322	            back     1173 0.000455 0.601389
   323	           close     1172 0.000455 0.601844
   324	         control     1171 0.000455 0.602298
   325	           count     1167 0.000453 0.602751
   326	         modules     1166 0.000453 0.603204
   327	           title     1165 0.000452 0.603656
   328	        matching     1160 0.000450 0.604107
   329	          binary     1159 0.000450 0.604557
   330	            long     1159 0.000450 0.605007
   331	           works     1158 0.000450 0.605456
   332	           state     1152 0.000447 0.605903
   333	           makes     1146 0.000445 0.606348
   334	          source     1143 0.000444 0.606792
   335	            good     1142 0.000443 0.607235
   336	           level     1142 0.000443 0.607679
   337	             die     1139 0.000442 0.608121
   338	            side     1133 0.000440 0.608561
   339	           calls     1127 0.000438 0.608998
   340	       processes     1125 0.000437 0.609435
   341	         support     1121 0.000435 0.609870
   342	         feature     1120 0.000435 0.610305
   343	          simply     1119 0.000434 0.610739
   344	            note     1118 0.000434 0.611174
   345	            keys     1113 0.000432 0.611606
   346	             sys     1104 0.000429 0.612034
   347	         reading     1103 0.000428 0.612462
   348	           given     1095 0.000425 0.612888
   349	         defined     1084 0.000421 0.613308
   350	            done     1080 0.000419 0.613728
   351	     environment     1073 0.000417 0.614144
   352	         network     1067 0.000414 0.614558
   353	          scalar     1064 0.000413 0.614972
   354	           range     1062 0.000412 0.615384
   355	         matches     1061 0.000412 0.615796
   356	          socket     1056 0.000410 0.616206
   357	       statement     1056 0.000410 0.616616
   358	            none     1055 0.000410 0.617025
   359	           block     1053 0.000409 0.617434
   360	             put     1051 0.000408 0.617842
   361	            take     1049 0.000407 0.618249
   362	           false     1044 0.000405 0.618655
   363	         tkinter     1042 0.000405 0.619059
   364	           shows     1027 0.000399 0.619458
   365	           based     1023 0.000397 0.619855
   366	          better     1020 0.000396 0.620251
   367	            tree     1019 0.000396 0.620647
   368	           check     1013 0.000393 0.621040
   369	          reilly     1012 0.000393 0.621433
   370	             few     1006 0.000391 0.621823
   371	         usually     1006 0.000391 0.622214
   372	              go     1005 0.000390 0.622604
   373	         whether     1000 0.000388 0.622992
   374	        possible      998 0.000387 0.623380
   375	         display      997 0.000387 0.623767
   376	        original      997 0.000387 0.624154
   377	          really      991 0.000385 0.624539
   378	         already      989 0.000384 0.624923
   379	          vector      989 0.000384 0.625307
   380	              us      988 0.000384 0.625690
   381	             bit      983 0.000382 0.626072
   382	         edition      982 0.000381 0.626453
   383	          itself      980 0.000380 0.626833
   384	      processing      973 0.000378 0.627211
   385	          fields      967 0.000375 0.627587
   386	          syntax      965 0.000375 0.627961
   387	           again      964 0.000374 0.628336
   388	          thread      961 0.000373 0.628709
   389	          within      956 0.000371 0.629080
   390	          things      947 0.000368 0.629447
   391	           found      945 0.000367 0.629814
   392	          unless      945 0.000367 0.630181
   393	       interface      943 0.000366 0.630547
   394	          people      943 0.000366 0.630913
   395	              id      940 0.000365 0.631278
   396	             per      939 0.000365 0.631643
   397	            copy      937 0.000364 0.632007
   398	          record      937 0.000364 0.632370
   399	            full      936 0.000363 0.632734
   400	          parent      929 0.000361 0.633094
   401	          delete      926 0.000359 0.633454
   402	           event      926 0.000359 0.633813
   403	             non      925 0.000359 0.634173
   404	        sequence      925 0.000359 0.634532
   405	            show      925 0.000359 0.634891
   406	          arrays      923 0.000358 0.635249
   407	           until      917 0.000356 0.635605
   408	          memory      915 0.000355 0.635960
   409	      particular      913 0.000354 0.636315
   410	        provides      913 0.000354 0.636669
   411	         changes      912 0.000354 0.637023
   412	           types      907 0.000352 0.637375
   413	          button      906 0.000352 0.637727
   414	             job      903 0.000351 0.638078
   415	        although      902 0.000350 0.638428
   416	   automatically      902 0.000350 0.638778
   417	         address      901 0.000350 0.639128
   418	            pass      891 0.000346 0.639474
   419	            send      890 0.000346 0.639819
   420	            step      890 0.000346 0.640165
   421	        contents      887 0.000344 0.640509
   422	          action      883 0.000343 0.640852
   423	      subroutine      883 0.000343 0.641195
   424	            mean      881 0.000342 0.641537
   425	           world      879 0.000341 0.641878
   426	        internet      878 0.000341 0.642219
   427	        encoding      876 0.000340 0.642559
   428	              ls      875 0.000340 0.642899
   429	     directories      873 0.000339 0.643238
   430	        database      872 0.000339 0.643576
   431	           large      871 0.000338 0.643914
   432	            fact      870 0.000338 0.644252
   433	            save      869 0.000337 0.644589
   434	        probably      867 0.000337 0.644926
   435	      operations      863 0.000335 0.645261
   436	            date      861 0.000334 0.645595
   437	             etc      861 0.000334 0.645930
   438	            help      860 0.000334 0.646263
   439	           empty      859 0.000333 0.646597
   440	      parameters      858 0.000333 0.646930
   441	            pack      855 0.000332 0.647262
   442	          select      854 0.000332 0.647594
   443	           shown      853 0.000331 0.647925
   444	             pop      847 0.000329 0.648254
   445	         general      846 0.000328 0.648582
   446	            less      843 0.000327 0.648909
   447	            bash      840 0.000326 0.649235
   448	          length      840 0.000326 0.649561
   449	           build      838 0.000325 0.649887
   450	            argv      836 0.000325 0.650211
   451	           being      828 0.000321 0.650533
   452	            pipe      828 0.000321 0.650854
   453	       attribute      823 0.000320 0.651174
   454	            easy      823 0.000320 0.651493
   455	          remote      821 0.000319 0.651812
   456	        actually      820 0.000318 0.652130
   457	          handle      820 0.000318 0.652449
   458	           label      818 0.000318 0.652766
   459	        contains      816 0.000317 0.653083
   460	        cookbook      816 0.000317 0.653400
   461	         earlier      816 0.000317 0.653717
   462	        elements      816 0.000317 0.654033
   463	           lists      816 0.000317 0.654350
   464	         provide      816 0.000317 0.654667
   465	         similar      815 0.000316 0.654983
   466	              tf      815 0.000316 0.655300
   467	          global      811 0.000315 0.655615
   468	          widget      811 0.000315 0.655930
   469	            give      810 0.000314 0.656244
   470	       important      805 0.000313 0.656557
   471	      references      804 0.000312 0.656869
   472	             isn      803 0.000312 0.657180
   473	             inc      802 0.000311 0.657492
   474	             str      801 0.000311 0.657803
   475	         element      800 0.000311 0.658113
   476	            best      798 0.000310 0.658423
   477	           times      798 0.000310 0.658733
   478	            dict      793 0.000308 0.659041
   479	             log      792 0.000307 0.659348
   480	           learn      791 0.000307 0.659655
   481	             len      790 0.000307 0.659962
   482	           entry      789 0.000306 0.660268
   483	       languages      786 0.000305 0.660574
   484	         details      783 0.000304 0.660878
   485	         threads      781 0.000303 0.661181
   486	        terminal      780 0.000303 0.661484
   487	           items      778 0.000302 0.661786
   488	          except      771 0.000299 0.662085
   489	        solution      771 0.000299 0.662384
   490	            grep      768 0.000298 0.662682
   491	           image      767 0.000298 0.662980
   492	           parts      767 0.000298 0.663278
   493	         results      767 0.000298 0.663576
   494	         several      767 0.000298 0.663873
   495	          allows      766 0.000297 0.664171
   496	            runs      764 0.000297 0.664467
   497	        patterns      763 0.000296 0.664764
   498	             who      763 0.000296 0.665060
   499	          making      761 0.000295 0.665355
   500	           place      761 0.000295 0.665651
   501	           layer      760 0.000295 0.665946
   502	          prompt      758 0.000294 0.666240
   503	             ftp      754 0.000293 0.666533
   504	   documentation      751 0.000292 0.666824
   505	             tar      750 0.000291 0.667116
   506	            real      749 0.000291 0.667406
   507	             row      748 0.000290 0.667697
   508	          signal      738 0.000287 0.667983
   509	             won      738 0.000287 0.668270
   510	         written      738 0.000287 0.668556
   511	             dir      737 0.000286 0.668842
   512	             win      737 0.000286 0.669129
   513	           hello      735 0.000285 0.669414
   514	       sometimes      734 0.000285 0.669699
   515	       generally      732 0.000284 0.669983
   516	       including      732 0.000284 0.670267
   517	          status      732 0.000284 0.670551
   518	          column      727 0.000282 0.670834
   519	          editor      727 0.000282 0.671116
   520	         columns      725 0.000281 0.671397
   521	          series      725 0.000281 0.671679
   522	             off      724 0.000281 0.671960
   523	          around      722 0.000280 0.672240
   524	      filehandle      721 0.000280 0.672520
   525	              vi      719 0.000279 0.672799
   526	            spam      713 0.000277 0.673076
   527	            ways      713 0.000277 0.673353
   528	             why      711 0.000276 0.673629
   529	            fred      710 0.000276 0.673905
   530	        creating      709 0.000275 0.674180
   531	            init      709 0.000275 0.674455
   532	            zero      709 0.000275 0.674730
   533	          passed      706 0.000274 0.675004
   534	          shells      703 0.000273 0.675277
   535	             cat      695 0.000270 0.675547
   536	       filenames      695 0.000270 0.675817
   537	         setting      695 0.000270 0.676087
   538	            eval      694 0.000269 0.676356
   539	           posix      694 0.000269 0.676626
   540	           child      693 0.000269 0.676895
   541	        reserved      691 0.000268 0.677163
   542	           under      691 0.000268 0.677431
   543	            sure      689 0.000267 0.677699
   544	             url      689 0.000267 0.677966
   545	              np      686 0.000266 0.678232
   546	         specify      686 0.000266 0.678499
   547	        directly      685 0.000266 0.678765
   548	           ascii      684 0.000266 0.679030
   549	             old      684 0.000266 0.679296
   550	        advanced      683 0.000265 0.679561
   551	      attributes      682 0.000265 0.679826
   552	            item      681 0.000264 0.680090
   553	        messages      679 0.000264 0.680354
   554	            term      679 0.000264 0.680617
   555	             map      678 0.000263 0.680881
   556	       exception      677 0.000263 0.681143
   557	            made      677 0.000263 0.681406
   558	          hidden      676 0.000262 0.681669
   559	           basic      675 0.000262 0.681931
   560	         created      675 0.000262 0.682193
   561	            sets      675 0.000262 0.682455

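So the result is that "O'Reilly books are mostly made of 561 words": the 561 most frequent words account for roughly 68% of all word occurrences in the corpus (cumulative ratio 0.682455 in the last row above).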
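As a rough sketch of how a table like the one above can be cut off at a cumulative ratio, the awk snippet below reads `word count` lines that are already sorted by count in descending order, then prints the word, count, ratio and cumulative ratio until the cumulative ratio would exceed a threshold. The function name `cumulative_top`, the default threshold of 0.6827 (an assumed one-sigma cutoff) and the sample counts are all hypothetical, chosen only for illustration; this is not necessarily the exact command used for the table.

```shell
# Hypothetical helper: reads "word count" lines sorted by count (descending)
# and prints word, count, ratio, cumulative ratio for every row whose
# cumulative ratio stays at or below the threshold.
# The default threshold 0.6827 is an assumption (roughly one sigma).
cumulative_top() {
  awk -v th="${1:-0.6827}" '
    # First pass over the input: remember every row and sum the counts.
    { word[NR] = $1; cnt[NR] = $2; total += $2 }
    END {
      for (i = 1; i <= NR; i++) {
        cum += cnt[i] / total
        if (cum > th) break   # stop once the threshold is exceeded
        printf "%16s %8d %f %f\n", word[i], cnt[i], cnt[i] / total, cum
      }
    }'
}

# Example with made-up counts (total = 100) and a threshold of 0.9
printf '%s\n' 'the 50' 'to 30' 'in 15' 'of 5' | cumulative_top 0.9
```

With real data, the output of `uniq -c | sort -k1nr` would first need its columns swapped (for example with `awk '{print $2, $1}'`) before being piped into this function, and `cat -n` can be appended to number the rows.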

  1. Computing histogram values (frequency counts) on the command line

  2. Normal distribution - Wikipedia
