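This is a continuation of the previous post.
This time I dig deeper into the analysis with awk and sed.
Simplifying the PDF text processing
Because pdftotext is slow, from here on I start from the point where the previous post's result has been written to oreilly_text_all.txt.
Store the text of multiple PDFs in a single file
# oreilly.txt: one PDF path per line
xargs -I{} pdftotext {} - < oreilly.txt > oreilly_text_all.txt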
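Graphing
Assuming the input is "word count" pairs sorted in descending order of count, I pipe it into an awk script [1] that draws a bar graph in which the largest count is represented by 50 "#" characters.
Show the top 300 words in oreilly_text_all.txt as a graph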
cat oreilly_text_all.txt | tr '[A-Z]' '[a-z]' | grep -oE '[a-z]{2,}' | sort |
uniq -c | sort -k1nr | awk '{printf "%16s %4d\n",$2,$1;}'|
awk '!max{max=$2;}{f=50/max;if(f>1)f=1;i=$2*f;r="";while(i-->0)r=r"#";printf "%s %s\n",$0,r;}' | # draw the bars
head -300 | # top 300
cat -n # line numbers
Top 300 most frequent English words in the O'Reilly books
1 the 138844 ##################################################
2 to 64866 ########################
3 in 50969 ###################
4 of 50764 ###################
5 and 48537 ##################
6 is 38182 ##############
7 you 29710 ###########
8 for 28060 ###########
9 that 26829 ##########
10 it 23518 #########
11 this 21736 ########
12 as 18763 #######
13 with 18283 #######
14 if 16465 ######
15 or 15867 ######
16 on 15841 ######
17 are 14711 ######
18 can 14594 ######
19 be 14418 ######
20 we 14357 ######
21 file 13089 #####
22 by 12027 #####
23 from 11293 #####
24 use 11250 #####
25 an 11174 #####
26 not 9980 ####
27 perl 9366 ####
28 but 8904 ####
29 data 8895 ####
30 command 8735 ####
31 one 8286 ###
32 all 8149 ###
33 example 8041 ###
34 line 8035 ###
35 at 7848 ###
36 python 7721 ###
37 your 7681 ###
38 chapter 7204 ###
39 have 7073 ###
40 more 6918 ###
41 when 6779 ###
42 name 6483 ###
43 function 6394 ###
44 which 6362 ###
45 will 6304 ###
46 files 6248 ###
47 text 6128 ###
48 like 6023 ###
49 so 6009 ###
50 also 5889 ###
51 program 5853 ###
52 see 5853 ###
53 list 5783 ###
54 using 5694 ###
55 print 5610 ###
56 section 5174 ##
57 other 5110 ##
58 self 5091 ##
59 set 5084 ##
60 shell 4949 ##
61 first 4948 ##
62 code 4913 ##
63 time 4865 ##
64 its 4829 ##
65 do 4819 ##
66 value 4778 ##
67 each 4702 ##
68 some 4650 ##
69 only 4561 ##
70 system 4540 ##
71 module 4425 ##
72 script 4413 ##
73 may 4377 ##
74 string 4346 ##
75 any 4339 ##
76 there 4320 ##
77 used 4318 ##
78 class 4245 ##
79 here 4214 ##
80 output 4163 ##
81 has 4146 ##
82 they 4142 ##
83 number 4125 ##
84 out 4073 ##
85 just 3973 ##
86 new 3954 ##
87 two 3945 ##
88 because 3894 ##
89 same 3815 ##
90 input 3792 ##
91 get 3709 ##
92 into 3695 ##
93 variable 3692 ##
94 directory 3664 ##
95 these 3657 ##
96 no 3551 ##
97 run 3514 ##
98 up 3495 ##
99 what 3440 ##
100 then 3419 ##
101 than 3395 ##
102 want 3379 ##
103 return 3353 ##
104 user 3341 ##
105 how 3327 ##
106 array 3273 ##
107 object 3255 ##
108 make 3254 ##
109 my 3208 ##
110 test 3196 ##
111 next 3173 ##
112 them 3155 ##
113 re 3136 ##
114 way 3104 ##
115 ll 3102 ##
116 such 3086 ##
117 need 3081 ##
118 book 3072 ##
119 type 2992 ##
120 method 2964 ##
121 server 2918 ##
122 about 2894 ##
123 most 2875 ##
124 read 2840 ##
125 lines 2837 ##
126 standard 2831 ##
127 end 2827 ##
128 values 2806 ##
129 pattern 2757 #
130 many 2736 #
131 py 2732 #
132 unix 2700 #
133 find 2677 #
134 process 2664 #
135 functions 2646 #
136 open 2571 #
137 don 2544 #
138 key 2537 #
139 learning 2536 #
140 character 2511 #
141 was 2431 #
142 would 2385 #
143 def 2361 #
144 import 2354 #
145 call 2351 #
146 programming 2347 #
147 variables 2334 #
148 both 2330 #
149 does 2325 #
150 should 2318 #
151 now 2317 #
152 match 2303 #
153 instance 2298 #
154 operator 2285 #
155 even 2276 #
156 last 2267 #
157 characters 2265 #
158 after 2255 #
159 figure 2234 #
160 work 2200 #
161 different 2196 #
162 their 2172 #
163 html 2160 #
164 another 2140 #
165 regular 2120 #
166 simple 2116 #
167 while 2113 #
168 default 2101 #
169 names 2083 #
170 before 2080 #
171 might 2079 #
172 instead 2076 #
173 our 2070 #
174 index 2061 #
175 programs 2060 #
176 could 2058 #
177 awk 2050 #
178 commands 2025 #
179 expression 2022 #
180 write 2016 #
181 called 2014 #
182 loop 2003 #
183 start 1995 #
184 model 1992 #
185 where 1973 #
186 true 1967 #
187 option 1940 #
188 part 1932 #
189 case 1921 #
190 bin 1920 #
191 tools 1907 #
192 window 1906 #
193 gui 1856 #
194 single 1855 #
195 version 1837 #
196 error 1835 #
197 argument 1813 #
198 running 1811 #
199 current 1809 #
200 training 1808 #
201 message 1761 #
202 scripts 1761 #
203 word 1754 #
204 examples 1737 #
205 page 1729 #
206 between 1728 #
207 hash 1728 #
208 much 1724 #
209 without 1716 #
210 look 1708 #
211 well 1702 #
212 pp 1687 #
213 since 1687 #
214 arguments 1665 #
215 path 1656 #
216 must 1651 #
217 client 1645 #
218 following 1645 #
219 order 1645 #
220 table 1645 #
221 change 1638 #
222 let 1633 #
223 filename 1629 #
224 main 1621 #
225 package 1606 #
226 os 1598 #
227 else 1595 #
228 form 1594 #
229 language 1590 #
230 doesn 1584 #
231 though 1584 #
232 web 1573 #
233 returns 1569 #
234 create 1568 #
235 local 1568 #
236 sort 1556 #
237 strings 1554 #
238 previous 1551 #
239 access 1541 #
240 windows 1537 #
241 special 1533 #
242 library 1532 #
243 result 1519 #
244 right 1516 #
245 three 1515 #
246 add 1511 #
247 objects 1510 #
248 through 1509 #
249 uses 1507 #
250 over 1505 #
251 machine 1502 #
252 mail 1490 #
253 space 1483 #
254 numbers 1482 #
255 multiple 1475 #
256 methods 1465 #
257 ve 1463 #
258 those 1455 #
259 http 1443 #
260 built 1441 #
261 second 1433 #
262 very 1433 #
263 systems 1432 #
264 information 1431 #
265 search 1426 #
266 home 1415 #
267 own 1411 #
268 try 1401 #
269 format 1379 #
270 means 1371 #
271 mode 1371 #
272 words 1371 #
273 sed 1366 #
274 problem 1355 #
275 often 1350 #
276 size 1338 #
277 exit 1329 #
278 operators 1329 #
279 always 1318 #
280 usr 1313 #
281 cgi 1312 #
282 however 1299 #
283 available 1297 #
284 common 1296 #
285 say 1295 #
286 later 1294 #
287 point 1293 #
288 random 1290 #
289 com 1285 #
290 once 1285 #
291 top 1280 #
292 field 1278 #
293 left 1274 #
294 expressions 1273 #
295 know 1270 #
296 something 1269 #
297 still 1267 #
298 sub 1267 #
299 email 1265 #
300 users 1264 #
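To make the scaling step easier to follow, here is a minimal sketch on a toy input. The function name bars and the sample words are made up for illustration. Note one deliberate difference: this sketch rounds each bar to the nearest whole "#", whereas the one-liner above effectively rounds up because its while loop keeps running on the fractional remainder.

```shell
# bars WIDTH: read "word count" lines (largest count first) from stdin and
# append a bar of "#" scaled so that the first row gets WIDTH characters.
# The function name and the sample words are hypothetical.
bars() {
  awk -v width="${1:-50}" '
    NR == 1 { max = $2 }                      # first row holds the maximum
    {
      n = int($2 * width / max + 0.5)         # round to the nearest integer
      bar = ""
      for (i = 0; i < n; i++) bar = bar "#"
      printf "%s %d %s\n", $1, $2, bar
    }'
}

printf '%s\n' 'the 100' 'to 50' 'awk 20' 'sed 4' | bars 50
```

With this input the four bars are 50, 25, 10 and 2 characters long, so the proportions are easy to check by eye.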
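Ratios
Now that we have the count for every word, let's look at what share of the total each word accounts for.
The awk script below stores every line in arrays, computes the total, and then prints each array entry together with its count divided by that total.
Share of all words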
cat oreilly_text_all.txt | tr '[A-Z]' '[a-z]' |
grep -oE '[a-z]{2,}' | sort | uniq -c | sort -k1nr |
awk '{
word[NR]=$2;
count[NR]=$1;
s+=$1;
}
END{
for(i=1;i<=NR;i++)
{printf "%16s %8d %6f\n", word[i],count[i],count[i]/s}
}' |
head
the 138844 0.053903
to 64866 0.025183
in 50969 0.019788
of 50764 0.019708
and 48537 0.018843
is 38182 0.014823
you 29710 0.011534
for 28060 0.010894
that 26829 0.010416
it 23518 0.009130
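From left to right, the columns are the word, its number of occurrences, and its share of the total.
For example, "the" at the top appears about 140,000 times across all the PDFs and accounts for about 5.4% of all words.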
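As an alternative to holding every line in arrays, the same ratios can be computed by letting awk read the counts twice. This is only a sketch under the assumption that the "count word" output has first been saved to a file; the function name ratios and the toy data are hypothetical.

```shell
# ratios FILE: FILE holds "count word" lines as produced by `uniq -c`.
# awk reads FILE twice: pass 1 sums the counts, pass 2 prints the shares.
# The function name and the sample data are hypothetical.
ratios() {
  awk 'NR == FNR { total += $1; next }        # pass 1: first reading of FILE
       { printf "%s %d %.6f\n", $2, $1, $1 / total }' "$1" "$1"
}

counts=$(mktemp)
printf '%s\n' '6 the' '3 to' '1 awk' > "$counts"
ratios "$counts"
rm -f "$counts"
```

The trade-off is an extra file and a second read in exchange for not keeping the whole word list in memory; for a corpus of this size either approach should be fine.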
Cumulative sum
In the same way, using awk arrays, I output the cumulative sum (cumsum), that is, the running total of the ratios.
Cumulative sum of the ratios
cat oreilly_text_all.txt | tr '[A-Z]' '[a-z]' | # normalize to lowercase
grep -oE '[a-z]{2,}' | sort | uniq -c | sort -k1nr | # word count
awk '{
word[NR]=$2;
count[NR]=$1;
s+=$1;
cumsum[NR]=s
}
END{
for(i=1;i<=NR;i++)
{printf "%16s %8d %6f %6f\n", word[i],count[i],count[i]/s,cumsum[i]/s}
}' | # count / ratio / cumulative sum
head -50 |
sed -e '1s/^/ word count ratio cumsum\n/' # header (\n in the replacement needs GNU sed)
word count ratio cumsum
the 138844 0.053903 0.053903
to 64866 0.025183 0.079086
in 50969 0.019788 0.098873
of 50764 0.019708 0.118581
and 48537 0.018843 0.137425
is 38182 0.014823 0.152248
you 29710 0.011534 0.163782
for 28060 0.010894 0.174676
that 26829 0.010416 0.185092
it 23518 0.009130 0.194222
this 21736 0.008439 0.202661
as 18763 0.007284 0.209945
with 18283 0.007098 0.217043
if 16465 0.006392 0.223435
or 15867 0.006160 0.229595
on 15841 0.006150 0.235745
are 14711 0.005711 0.241456
can 14594 0.005666 0.247122
be 14418 0.005597 0.252720
we 14357 0.005574 0.258293
file 13089 0.005082 0.263375
by 12027 0.004669 0.268044
from 11293 0.004384 0.272428
use 11250 0.004368 0.276796
an 11174 0.004338 0.281134
not 9980 0.003875 0.285009
perl 9366 0.003636 0.288645
but 8904 0.003457 0.292101
data 8895 0.003453 0.295555
command 8735 0.003391 0.298946
one 8286 0.003217 0.302163
all 8149 0.003164 0.305326
example 8041 0.003122 0.308448
line 8035 0.003119 0.311568
at 7848 0.003047 0.314614
python 7721 0.002998 0.317612
your 7681 0.002982 0.320594
chapter 7204 0.002797 0.323391
have 7073 0.002746 0.326137
more 6918 0.002686 0.328822
when 6779 0.002632 0.331454
name 6483 0.002517 0.333971
function 6394 0.002482 0.336453
which 6362 0.002470 0.338923
will 6304 0.002447 0.341371
files 6248 0.002426 0.343796
text 6128 0.002379 0.346175
like 6023 0.002338 0.348514
so 6009 0.002333 0.350847
also 5889 0.002286 0.353133
Naturally, the last cumsum value is 1.
The cumulative sum ends at 1
cat oreilly_text_all.txt | tr '[A-Z]' '[a-z]' | # normalize to lowercase
grep -oE '[a-z]{2,}' | sort | uniq -c | sort -k1nr | # word count
awk '{
word[NR]=$2;
count[NR]=$1;
s+=$1;
cumsum[NR]=s
}
END{
for(i=1;i<=NR;i++)
{printf "%16s %8d %6f %6f\n", word[i],count[i],count[i]/s,cumsum[i]/s}
}' | # count / ratio / cumulative sum
tail
zwemer 1 0.000000 0.999997
zwicky 1 0.000000 0.999997
zwspace 1 0.000000 0.999997
zx 1 0.000000 0.999998
zxf 1 0.000000 0.999998
zygote 1 0.000000 0.999998
zyzzy 1 0.000000 0.999999
zzzzteana 1 0.000000 0.999999
zzzzzzz 1 0.000000 1.000000
zzzzzzzz 1 0.000000 1.000000
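The same bookkeeping on a three-word toy input makes it easier to see why the last value is exactly 1: the running total stored for the final row is the grand total itself. In the real output above, the second-to-last row also shows 1.000000, but only because a single occurrence out of roughly 2.5 million disappears when rounding to six decimals. The function name cumsum_table and the sample data are hypothetical.

```shell
# cumsum_table: read "count word" lines (largest count first) from stdin and
# print word, count, ratio and cumulative ratio.
# The function name and the sample data are hypothetical.
cumsum_table() {
  awk '{ word[NR] = $2; count[NR] = $1; total += $1; running[NR] = total }
       END {
         for (i = 1; i <= NR; i++)
           printf "%s %d %.6f %.6f\n", word[i], count[i], count[i] / total, running[i] / total
       }'
}

printf '%s\n' '6 the' '3 to' '1 awk' | cumsum_table
```

Here the cumulative column reads 0.600000, 0.900000, 1.000000.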
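I'm not sure this line of reasoning is sound, but to capture "most of the words" I borrow a cutoff from the normal distribution.
Using awk's if, I print the words up to the point where the cumulative ratio reaches the 1σ range of a normal distribution (68.27%). [2]
Show the cumulative sum up to 1σ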
cat oreilly_text_all.txt | tr '[A-Z]' '[a-z]' | # normalize to lowercase
grep -oE '[a-z]{2,}' | sort | uniq -c | sort -k1nr | # word count
awk '{
word[NR]=$2;
count[NR]=$1;
s+=$1;
cumsum[NR]=s
}
END{
for(i=1;i<=NR;i++)
{if (cumsum[i]/s<0.6827) printf "%16s %8d %6f %6f\n", word[i],count[i],count[i]/s,cumsum[i]/s}
}' | # count / ratio / cumulative sum; print rows whose 4th column is below 1σ
sed -e '1s/^/ word count ratio cumsum\n/' | # header
cat -n # line numbers
1 word count ratio cumsum
2 the 138844 0.053903 0.053903
3 to 64866 0.025183 0.079086
4 in 50969 0.019788 0.098873
5 of 50764 0.019708 0.118581
6 and 48537 0.018843 0.137425
7 is 38182 0.014823 0.152248
8 you 29710 0.011534 0.163782
9 for 28060 0.010894 0.174676
10 that 26829 0.010416 0.185092
11 it 23518 0.009130 0.194222
12 this 21736 0.008439 0.202661
13 as 18763 0.007284 0.209945
14 with 18283 0.007098 0.217043
15 if 16465 0.006392 0.223435
16 or 15867 0.006160 0.229595
17 on 15841 0.006150 0.235745
18 are 14711 0.005711 0.241456
19 can 14594 0.005666 0.247122
20 be 14418 0.005597 0.252720
21 we 14357 0.005574 0.258293
22 file 13089 0.005082 0.263375
23 by 12027 0.004669 0.268044
24 from 11293 0.004384 0.272428
25 use 11250 0.004368 0.276796
26 an 11174 0.004338 0.281134
27 not 9980 0.003875 0.285009
28 perl 9366 0.003636 0.288645
29 but 8904 0.003457 0.292101
30 data 8895 0.003453 0.295555
31 command 8735 0.003391 0.298946
32 one 8286 0.003217 0.302163
33 all 8149 0.003164 0.305326
34 example 8041 0.003122 0.308448
35 line 8035 0.003119 0.311568
36 at 7848 0.003047 0.314614
37 python 7721 0.002998 0.317612
38 your 7681 0.002982 0.320594
39 chapter 7204 0.002797 0.323391
40 have 7073 0.002746 0.326137
41 more 6918 0.002686 0.328822
42 when 6779 0.002632 0.331454
43 name 6483 0.002517 0.333971
44 function 6394 0.002482 0.336453
45 which 6362 0.002470 0.338923
46 will 6304 0.002447 0.341371
47 files 6248 0.002426 0.343796
48 text 6128 0.002379 0.346175
49 like 6023 0.002338 0.348514
50 so 6009 0.002333 0.350847
51 also 5889 0.002286 0.353133
52 program 5853 0.002272 0.355405
53 see 5853 0.002272 0.357677
54 list 5783 0.002245 0.359923
55 using 5694 0.002211 0.362133
56 print 5610 0.002178 0.364311
57 section 5174 0.002009 0.366320
58 other 5110 0.001984 0.368304
59 self 5091 0.001976 0.370280
60 set 5084 0.001974 0.372254
61 shell 4949 0.001921 0.374175
62 first 4948 0.001921 0.376096
63 code 4913 0.001907 0.378003
64 time 4865 0.001889 0.379892
65 its 4829 0.001875 0.381767
66 do 4819 0.001871 0.383638
67 value 4778 0.001855 0.385493
68 each 4702 0.001825 0.387318
69 some 4650 0.001805 0.389123
70 only 4561 0.001771 0.390894
71 system 4540 0.001763 0.392657
72 module 4425 0.001718 0.394375
73 script 4413 0.001713 0.396088
74 may 4377 0.001699 0.397787
75 string 4346 0.001687 0.399474
76 any 4339 0.001685 0.401159
77 there 4320 0.001677 0.402836
78 used 4318 0.001676 0.404512
79 class 4245 0.001648 0.406160
80 here 4214 0.001636 0.407796
81 output 4163 0.001616 0.409413
82 has 4146 0.001610 0.411022
83 they 4142 0.001608 0.412630
84 number 4125 0.001601 0.414232
85 out 4073 0.001581 0.415813
86 just 3973 0.001542 0.417355
87 new 3954 0.001535 0.418890
88 two 3945 0.001532 0.420422
89 because 3894 0.001512 0.421934
90 same 3815 0.001481 0.423415
91 input 3792 0.001472 0.424887
92 get 3709 0.001440 0.426327
93 into 3695 0.001435 0.427761
94 variable 3692 0.001433 0.429195
95 directory 3664 0.001422 0.430617
96 these 3657 0.001420 0.432037
97 no 3551 0.001379 0.433416
98 run 3514 0.001364 0.434780
99 up 3495 0.001357 0.436137
100 what 3440 0.001336 0.437472
101 then 3419 0.001327 0.438800
102 than 3395 0.001318 0.440118
103 want 3379 0.001312 0.441429
104 return 3353 0.001302 0.442731
105 user 3341 0.001297 0.444028
106 how 3327 0.001292 0.445320
107 array 3273 0.001271 0.446591
108 object 3255 0.001264 0.447854
109 make 3254 0.001263 0.449117
110 my 3208 0.001245 0.450363
111 test 3196 0.001241 0.451604
112 next 3173 0.001232 0.452836
113 them 3155 0.001225 0.454060
114 re 3136 0.001217 0.455278
115 way 3104 0.001205 0.456483
116 ll 3102 0.001204 0.457687
117 such 3086 0.001198 0.458885
118 need 3081 0.001196 0.460081
119 book 3072 0.001193 0.461274
120 type 2992 0.001162 0.462436
121 method 2964 0.001151 0.463586
122 server 2918 0.001133 0.464719
123 about 2894 0.001124 0.465843
124 most 2875 0.001116 0.466959
125 read 2840 0.001103 0.468061
126 lines 2837 0.001101 0.469163
127 standard 2831 0.001099 0.470262
128 end 2827 0.001098 0.471359
129 values 2806 0.001089 0.472449
130 pattern 2757 0.001070 0.473519
131 many 2736 0.001062 0.474581
132 py 2732 0.001061 0.475642
133 unix 2700 0.001048 0.476690
134 find 2677 0.001039 0.477730
135 process 2664 0.001034 0.478764
136 functions 2646 0.001027 0.479791
137 open 2571 0.000998 0.480789
138 don 2544 0.000988 0.481777
139 key 2537 0.000985 0.482762
140 learning 2536 0.000985 0.483746
141 character 2511 0.000975 0.484721
142 was 2431 0.000944 0.485665
143 would 2385 0.000926 0.486591
144 def 2361 0.000917 0.487507
145 import 2354 0.000914 0.488421
146 call 2351 0.000913 0.489334
147 programming 2347 0.000911 0.490245
148 variables 2334 0.000906 0.491151
149 both 2330 0.000905 0.492056
150 does 2325 0.000903 0.492959
151 should 2318 0.000900 0.493858
152 now 2317 0.000900 0.494758
153 match 2303 0.000894 0.495652
154 instance 2298 0.000892 0.496544
155 operator 2285 0.000887 0.497431
156 even 2276 0.000884 0.498315
157 last 2267 0.000880 0.499195
158 characters 2265 0.000879 0.500074
159 after 2255 0.000875 0.500950
160 figure 2234 0.000867 0.501817
161 work 2200 0.000854 0.502671
162 different 2196 0.000853 0.503524
163 their 2172 0.000843 0.504367
164 html 2160 0.000839 0.505206
165 another 2140 0.000831 0.506036
166 regular 2120 0.000823 0.506859
167 simple 2116 0.000821 0.507681
168 while 2113 0.000820 0.508501
169 default 2101 0.000816 0.509317
170 names 2083 0.000809 0.510126
171 before 2080 0.000808 0.510933
172 might 2079 0.000807 0.511740
173 instead 2076 0.000806 0.512546
174 our 2070 0.000804 0.513350
175 index 2061 0.000800 0.514150
176 programs 2060 0.000800 0.514950
177 could 2058 0.000799 0.515749
178 awk 2050 0.000796 0.516545
179 commands 2025 0.000786 0.517331
180 expression 2022 0.000785 0.518116
181 write 2016 0.000783 0.518898
182 called 2014 0.000782 0.519680
183 loop 2003 0.000778 0.520458
184 start 1995 0.000775 0.521232
185 model 1992 0.000773 0.522006
186 where 1973 0.000766 0.522772
187 true 1967 0.000764 0.523535
188 option 1940 0.000753 0.524289
189 part 1932 0.000750 0.525039
190 case 1921 0.000746 0.525784
191 bin 1920 0.000745 0.526530
192 tools 1907 0.000740 0.527270
193 window 1906 0.000740 0.528010
194 gui 1856 0.000721 0.528731
195 single 1855 0.000720 0.529451
196 version 1837 0.000713 0.530164
197 error 1835 0.000712 0.530876
198 argument 1813 0.000704 0.531580
199 running 1811 0.000703 0.532283
200 current 1809 0.000702 0.532986
201 training 1808 0.000702 0.533688
202 message 1761 0.000684 0.534371
203 scripts 1761 0.000684 0.535055
204 word 1754 0.000681 0.535736
205 examples 1737 0.000674 0.536410
206 page 1729 0.000671 0.537081
207 between 1728 0.000671 0.537752
208 hash 1728 0.000671 0.538423
209 much 1724 0.000669 0.539092
210 without 1716 0.000666 0.539759
211 look 1708 0.000663 0.540422
212 well 1702 0.000661 0.541082
213 pp 1687 0.000655 0.541737
214 since 1687 0.000655 0.542392
215 arguments 1665 0.000646 0.543039
216 path 1656 0.000643 0.543682
217 must 1651 0.000641 0.544323
218 client 1645 0.000639 0.544961
219 following 1645 0.000639 0.545600
220 order 1645 0.000639 0.546239
221 table 1645 0.000639 0.546877
222 change 1638 0.000636 0.547513
223 let 1633 0.000634 0.548147
224 filename 1629 0.000632 0.548779
225 main 1621 0.000629 0.549409
226 package 1606 0.000623 0.550032
227 os 1598 0.000620 0.550653
228 else 1595 0.000619 0.551272
229 form 1594 0.000619 0.551891
230 language 1590 0.000617 0.552508
231 doesn 1584 0.000615 0.553123
232 though 1584 0.000615 0.553738
233 web 1573 0.000611 0.554349
234 returns 1569 0.000609 0.554958
235 create 1568 0.000609 0.555566
236 local 1568 0.000609 0.556175
237 sort 1556 0.000604 0.556779
238 strings 1554 0.000603 0.557383
239 previous 1551 0.000602 0.557985
240 access 1541 0.000598 0.558583
241 windows 1537 0.000597 0.559180
242 special 1533 0.000595 0.559775
243 library 1532 0.000595 0.560370
244 result 1519 0.000590 0.560959
245 right 1516 0.000589 0.561548
246 three 1515 0.000588 0.562136
247 add 1511 0.000587 0.562723
248 objects 1510 0.000586 0.563309
249 through 1509 0.000586 0.563895
250 uses 1507 0.000585 0.564480
251 over 1505 0.000584 0.565064
252 machine 1502 0.000583 0.565647
253 mail 1490 0.000578 0.566226
254 space 1483 0.000576 0.566801
255 numbers 1482 0.000575 0.567377
256 multiple 1475 0.000573 0.567949
257 methods 1465 0.000569 0.568518
258 ve 1463 0.000568 0.569086
259 those 1455 0.000565 0.569651
260 http 1443 0.000560 0.570211
261 built 1441 0.000559 0.570771
262 second 1433 0.000556 0.571327
263 very 1433 0.000556 0.571883
264 systems 1432 0.000556 0.572439
265 information 1431 0.000556 0.572995
266 search 1426 0.000554 0.573548
267 home 1415 0.000549 0.574098
268 own 1411 0.000548 0.574646
269 try 1401 0.000544 0.575189
270 format 1379 0.000535 0.575725
271 means 1371 0.000532 0.576257
272 mode 1371 0.000532 0.576789
273 words 1371 0.000532 0.577322
274 sed 1366 0.000530 0.577852
275 problem 1355 0.000526 0.578378
276 often 1350 0.000524 0.578902
277 size 1338 0.000519 0.579422
278 exit 1329 0.000516 0.579937
279 operators 1329 0.000516 0.580453
280 always 1318 0.000512 0.580965
281 usr 1313 0.000510 0.581475
282 cgi 1312 0.000509 0.581984
283 however 1299 0.000504 0.582489
284 available 1297 0.000504 0.582992
285 common 1296 0.000503 0.583495
286 say 1295 0.000503 0.583998
287 later 1294 0.000502 0.584500
288 point 1293 0.000502 0.585002
289 random 1290 0.000501 0.585503
290 com 1285 0.000499 0.586002
291 once 1285 0.000499 0.586501
292 top 1280 0.000497 0.586998
293 field 1278 0.000496 0.587494
294 left 1274 0.000495 0.587989
295 expressions 1273 0.000494 0.588483
296 know 1270 0.000493 0.588976
297 something 1269 0.000493 0.589468
298 still 1267 0.000492 0.589960
299 sub 1267 0.000492 0.590452
300 email 1265 0.000491 0.590943
301 users 1264 0.000491 0.591434
302 features 1260 0.000489 0.591923
303 either 1256 0.000488 0.592411
304 every 1250 0.000485 0.592896
305 options 1248 0.000485 0.593381
306 context 1244 0.000483 0.593864
307 group 1244 0.000483 0.594347
308 been 1241 0.000482 0.594828
309 too 1241 0.000482 0.595310
310 were 1236 0.000480 0.595790
311 reference 1232 0.000478 0.596268
312 unicode 1231 0.000478 0.596746
313 classes 1225 0.000476 0.597222
314 bytes 1216 0.000472 0.597694
315 root 1211 0.000470 0.598164
316 named 1201 0.000466 0.598630
317 train 1200 0.000466 0.599096
318 split 1194 0.000464 0.599560
319 writing 1183 0.000459 0.600019
320 echo 1178 0.000457 0.600476
321 useful 1177 0.000457 0.600933
322 back 1173 0.000455 0.601389
323 close 1172 0.000455 0.601844
324 control 1171 0.000455 0.602298
325 count 1167 0.000453 0.602751
326 modules 1166 0.000453 0.603204
327 title 1165 0.000452 0.603656
328 matching 1160 0.000450 0.604107
329 binary 1159 0.000450 0.604557
330 long 1159 0.000450 0.605007
331 works 1158 0.000450 0.605456
332 state 1152 0.000447 0.605903
333 makes 1146 0.000445 0.606348
334 source 1143 0.000444 0.606792
335 good 1142 0.000443 0.607235
336 level 1142 0.000443 0.607679
337 die 1139 0.000442 0.608121
338 side 1133 0.000440 0.608561
339 calls 1127 0.000438 0.608998
340 processes 1125 0.000437 0.609435
341 support 1121 0.000435 0.609870
342 feature 1120 0.000435 0.610305
343 simply 1119 0.000434 0.610739
344 note 1118 0.000434 0.611174
345 keys 1113 0.000432 0.611606
346 sys 1104 0.000429 0.612034
347 reading 1103 0.000428 0.612462
348 given 1095 0.000425 0.612888
349 defined 1084 0.000421 0.613308
350 done 1080 0.000419 0.613728
351 environment 1073 0.000417 0.614144
352 network 1067 0.000414 0.614558
353 scalar 1064 0.000413 0.614972
354 range 1062 0.000412 0.615384
355 matches 1061 0.000412 0.615796
356 socket 1056 0.000410 0.616206
357 statement 1056 0.000410 0.616616
358 none 1055 0.000410 0.617025
359 block 1053 0.000409 0.617434
360 put 1051 0.000408 0.617842
361 take 1049 0.000407 0.618249
362 false 1044 0.000405 0.618655
363 tkinter 1042 0.000405 0.619059
364 shows 1027 0.000399 0.619458
365 based 1023 0.000397 0.619855
366 better 1020 0.000396 0.620251
367 tree 1019 0.000396 0.620647
368 check 1013 0.000393 0.621040
369 reilly 1012 0.000393 0.621433
370 few 1006 0.000391 0.621823
371 usually 1006 0.000391 0.622214
372 go 1005 0.000390 0.622604
373 whether 1000 0.000388 0.622992
374 possible 998 0.000387 0.623380
375 display 997 0.000387 0.623767
376 original 997 0.000387 0.624154
377 really 991 0.000385 0.624539
378 already 989 0.000384 0.624923
379 vector 989 0.000384 0.625307
380 us 988 0.000384 0.625690
381 bit 983 0.000382 0.626072
382 edition 982 0.000381 0.626453
383 itself 980 0.000380 0.626833
384 processing 973 0.000378 0.627211
385 fields 967 0.000375 0.627587
386 syntax 965 0.000375 0.627961
387 again 964 0.000374 0.628336
388 thread 961 0.000373 0.628709
389 within 956 0.000371 0.629080
390 things 947 0.000368 0.629447
391 found 945 0.000367 0.629814
392 unless 945 0.000367 0.630181
393 interface 943 0.000366 0.630547
394 people 943 0.000366 0.630913
395 id 940 0.000365 0.631278
396 per 939 0.000365 0.631643
397 copy 937 0.000364 0.632007
398 record 937 0.000364 0.632370
399 full 936 0.000363 0.632734
400 parent 929 0.000361 0.633094
401 delete 926 0.000359 0.633454
402 event 926 0.000359 0.633813
403 non 925 0.000359 0.634173
404 sequence 925 0.000359 0.634532
405 show 925 0.000359 0.634891
406 arrays 923 0.000358 0.635249
407 until 917 0.000356 0.635605
408 memory 915 0.000355 0.635960
409 particular 913 0.000354 0.636315
410 provides 913 0.000354 0.636669
411 changes 912 0.000354 0.637023
412 types 907 0.000352 0.637375
413 button 906 0.000352 0.637727
414 job 903 0.000351 0.638078
415 although 902 0.000350 0.638428
416 automatically 902 0.000350 0.638778
417 address 901 0.000350 0.639128
418 pass 891 0.000346 0.639474
419 send 890 0.000346 0.639819
420 step 890 0.000346 0.640165
421 contents 887 0.000344 0.640509
422 action 883 0.000343 0.640852
423 subroutine 883 0.000343 0.641195
424 mean 881 0.000342 0.641537
425 world 879 0.000341 0.641878
426 internet 878 0.000341 0.642219
427 encoding 876 0.000340 0.642559
428 ls 875 0.000340 0.642899
429 directories 873 0.000339 0.643238
430 database 872 0.000339 0.643576
431 large 871 0.000338 0.643914
432 fact 870 0.000338 0.644252
433 save 869 0.000337 0.644589
434 probably 867 0.000337 0.644926
435 operations 863 0.000335 0.645261
436 date 861 0.000334 0.645595
437 etc 861 0.000334 0.645930
438 help 860 0.000334 0.646263
439 empty 859 0.000333 0.646597
440 parameters 858 0.000333 0.646930
441 pack 855 0.000332 0.647262
442 select 854 0.000332 0.647594
443 shown 853 0.000331 0.647925
444 pop 847 0.000329 0.648254
445 general 846 0.000328 0.648582
446 less 843 0.000327 0.648909
447 bash 840 0.000326 0.649235
448 length 840 0.000326 0.649561
449 build 838 0.000325 0.649887
450 argv 836 0.000325 0.650211
451 being 828 0.000321 0.650533
452 pipe 828 0.000321 0.650854
453 attribute 823 0.000320 0.651174
454 easy 823 0.000320 0.651493
455 remote 821 0.000319 0.651812
456 actually 820 0.000318 0.652130
457 handle 820 0.000318 0.652449
458 label 818 0.000318 0.652766
459 contains 816 0.000317 0.653083
460 cookbook 816 0.000317 0.653400
461 earlier 816 0.000317 0.653717
462 elements 816 0.000317 0.654033
463 lists 816 0.000317 0.654350
464 provide 816 0.000317 0.654667
465 similar 815 0.000316 0.654983
466 tf 815 0.000316 0.655300
467 global 811 0.000315 0.655615
468 widget 811 0.000315 0.655930
469 give 810 0.000314 0.656244
470 important 805 0.000313 0.656557
471 references 804 0.000312 0.656869
472 isn 803 0.000312 0.657180
473 inc 802 0.000311 0.657492
474 str 801 0.000311 0.657803
475 element 800 0.000311 0.658113
476 best 798 0.000310 0.658423
477 times 798 0.000310 0.658733
478 dict 793 0.000308 0.659041
479 log 792 0.000307 0.659348
480 learn 791 0.000307 0.659655
481 len 790 0.000307 0.659962
482 entry 789 0.000306 0.660268
483 languages 786 0.000305 0.660574
484 details 783 0.000304 0.660878
485 threads 781 0.000303 0.661181
486 terminal 780 0.000303 0.661484
487 items 778 0.000302 0.661786
488 except 771 0.000299 0.662085
489 solution 771 0.000299 0.662384
490 grep 768 0.000298 0.662682
491 image 767 0.000298 0.662980
492 parts 767 0.000298 0.663278
493 results 767 0.000298 0.663576
494 several 767 0.000298 0.663873
495 allows 766 0.000297 0.664171
496 runs 764 0.000297 0.664467
497 patterns 763 0.000296 0.664764
498 who 763 0.000296 0.665060
499 making 761 0.000295 0.665355
500 place 761 0.000295 0.665651
501 layer 760 0.000295 0.665946
502 prompt 758 0.000294 0.666240
503 ftp 754 0.000293 0.666533
504 documentation 751 0.000292 0.666824
505 tar 750 0.000291 0.667116
506 real 749 0.000291 0.667406
507 row 748 0.000290 0.667697
508 signal 738 0.000287 0.667983
509 won 738 0.000287 0.668270
510 written 738 0.000287 0.668556
511 dir 737 0.000286 0.668842
512 win 737 0.000286 0.669129
513 hello 735 0.000285 0.669414
514 sometimes 734 0.000285 0.669699
515 generally 732 0.000284 0.669983
516 including 732 0.000284 0.670267
517 status 732 0.000284 0.670551
518 column 727 0.000282 0.670834
519 editor 727 0.000282 0.671116
520 columns 725 0.000281 0.671397
521 series 725 0.000281 0.671679
522 off 724 0.000281 0.671960
523 around 722 0.000280 0.672240
524 filehandle 721 0.000280 0.672520
525 vi 719 0.000279 0.672799
526 spam 713 0.000277 0.673076
527 ways 713 0.000277 0.673353
528 why 711 0.000276 0.673629
529 fred 710 0.000276 0.673905
530 creating 709 0.000275 0.674180
531 init 709 0.000275 0.674455
532 zero 709 0.000275 0.674730
533 passed 706 0.000274 0.675004
534 shells 703 0.000273 0.675277
535 cat 695 0.000270 0.675547
536 filenames 695 0.000270 0.675817
537 setting 695 0.000270 0.676087
538 eval 694 0.000269 0.676356
539 posix 694 0.000269 0.676626
540 child 693 0.000269 0.676895
541 reserved 691 0.000268 0.677163
542 under 691 0.000268 0.677431
543 sure 689 0.000267 0.677699
544 url 689 0.000267 0.677966
545 np 686 0.000266 0.678232
546 specify 686 0.000266 0.678499
547 directly 685 0.000266 0.678765
548 ascii 684 0.000266 0.679030
549 old 684 0.000266 0.679296
550 advanced 683 0.000265 0.679561
551 attributes 682 0.000265 0.679826
552 item 681 0.000264 0.680090
553 messages 679 0.000264 0.680354
554 term 679 0.000264 0.680617
555 map 678 0.000263 0.680881
556 exception 677 0.000263 0.681143
557 made 677 0.000263 0.681406
558 hidden 676 0.000262 0.681669
559 basic 675 0.000262 0.681931
560 created 675 0.000262 0.682193
561 sets 675 0.000262 0.682455
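So the result is that "O'Reilly books are mostly made of about 560 words". The line numbers above include the header row, so the last row, numbered 561, is the 560th word, and together these words account for just under 68.27% of all word occurrences.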
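If only the final number is of interest, the rows can be counted directly instead of reading the count off the cat -n numbering, which also counts the header line. The following is a minimal sketch on toy data; the function name words_below and the sample input are hypothetical, and the threshold is passed in as an argument so other cutoffs are easy to try.

```shell
# words_below LIMIT: read "count word" lines (largest count first) from stdin
# and print how many top-ranked words have a cumulative ratio below LIMIT.
# The function name and the sample data are hypothetical.
words_below() {
  awk -v limit="$1" '
    { count[NR] = $1; total += $1 }
    END {
      for (i = 1; i <= NR; i++) {
        running += count[i]
        if (running / total >= limit) break   # first row at or past the limit
      }
      print i - 1                             # rows strictly below the limit
    }'
}

printf '%s\n' '6 the' '3 to' '1 awk' | words_below 0.6827
```

On this toy input only "the" (cumulative ratio 0.6) stays below 0.6827, so the function prints 1. To apply it to the real data, pipe the output of the sort -k1nr stage into it in place of the final awk, sed and cat -n steps.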