
O'Reilly books are made of OOO words

Last updated at 2019-02-08, posted at 2019-02-08

oreilly.txt

Hands.Machine.Learning.Scikit.Learn.Tensorflow.5225.pdf
the big data market.pdf
introduction-to-machine-learning-mueller(www.ebook-dl.com).pdf
integrated analytics.pdf
data-science-banking-and-fintech.pdf
fluent-python-2015-.pdf
janssens2014.pdf
Programming Perl, 4th Edition.pdf
Programming Python, 4th ed [O`Reilly].pdf
Python for Data Analysis.pdf
Sed & Awk 2nd Edition.pdf
Classic Shell Scripting.pdf
Unix Power Tools.pdf
[Joel_Grus]_Data_Science_from_Scratch_First_Princ.pdf
Learning Perl. 5th Edition.pdf
OReilly.Perl.Cookbook.pdf

Use pdftotext to convert every PDF file listed in oreilly.txt to text and print the result to standard output.

Convert all PDF files listed in oreilly.txt to text and print to standard output

<oreilly.txt xargs -I{} pdftotext {} -
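One caveat with the xargs form: xargs treats quotes and backslashes in its input specially, even with -I, so a file name containing a single quote would make it fail. The following is a minimal sketch of a quote-safe alternative using a while-read loop. The function name convert_all and the file name in the usage comment are hypothetical, not part of the original article.

```shell
# Hypothetical helper: run "CMD <line> -" once for each line read from stdin.
# Each line is passed verbatim as a single argument, so spaces, quotes,
# and backticks in file names need no special handling.
convert_all() {
  while IFS= read -r pdf; do
    [ -n "$pdf" ] || continue   # skip blank lines
    "$@" "$pdf" -               # e.g. pdftotext "Some Book.pdf" -
  done
}

# Usage (assumes pdftotext is installed and oreilly.txt exists):
#   convert_all pdftotext < oreilly.txt
```

The command is passed as arguments, so the same loop can be dry-run with something harmless such as `convert_all echo < oreilly.txt` before converting anything.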

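Parallel processing
Running the conversions in parallel with the parallel¹ command is faster than for + xargs.
In a trial converting just three files, it came out more than twice as fast (1.215 s versus 2.807 s of wall-clock time).

parallel comparison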

# with parallel
$ time \ls | parallel -a - "pdftotext {} -" > /dev/null 2>&1
\ls  0.00s user 0.00s system 81% cpu 0.004 total
parallel -a - "pdftotext {} -" > /dev/null 2>&1  2.88s user 0.18s system 252% cpu 1.215 total

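# without parallel (no trailing "-", so pdftotext writes a .txt file next to each PDF instead of printing to stdout)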
$ time \ls | xargs -I{} pdftotext {} > /dev/null 2>&1
\ls  0.00s user 0.00s system 73% cpu 0.001 total
xargs -I{} pdftotext {} > /dev/null 2>&1  2.64s user 0.11s system 97% cpu 2.807 total
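For comparison, xargs can also run several jobs at once with its -P option (available in GNU and BSD xargs, though not required by POSIX). The sketch below is an alternative under that assumption, not what was measured above. The helper name convert_parallel and the job count of 4 are hypothetical.

```shell
# Hypothetical helper: run "CMD <line> -" for each stdin line, up to N jobs at once.
# Newlines are converted to NULs and read with -0 so that quotes and spaces
# in file names reach the command unchanged.
convert_parallel() {
  njobs=$1
  shift
  tr '\n' '\0' | xargs -0 -P "$njobs" -I{} "$@" {} -
}

# Usage (assumes pdftotext is installed):
#   \ls | convert_parallel 4 pdftotext > /dev/null 2>&1
```

Unlike GNU parallel, which groups each job's output by default, xargs -P lets concurrent jobs write to the same stream at the same time, so their output can interleave. That does not matter when the output is discarded as in the timing test, but it could split words if the text were piped into a counter.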

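tr normalizes the text to a single case (lowercase), and grep strips out symbols: -o prints only the matched part (only), and -E enables extended regular expressions (extended regexp).
The result is piped into the usual frequency counter: sort, uniq, then sort again.
Finally, awk swaps the columns and pads them with spaces for alignment.

Counting word occurrences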

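<oreilly.txt xargs -I{} pdftotext {} - |
  tr '[A-Z]' '[a-z]' |                  # normalize to lowercase
  grep -oE '[a-z]{2,}' |                # runs of 2+ letters (drops symbols and digits)
  sort |
  uniq -c |                             # count occurrences of each word
  sort -k1nr |                          # sort on column 1 as a number, descending
  awk '{printf "%16s %4d\n",$2,$1;}' |  # swap columns and pad with spaces
  head                                  # show only the first 10 lines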

             the 138844
              to 64866
              in 50969
              of 50764
             and 48537
              is 38182
             you 29710
             for 28060
            that 26829
              it 23518

"the" appears more than 130,000 times, "to" more than 60,000 times, and "in" more than 50,000 times.
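The counting stage can be tried without any PDFs by feeding it a couple of sentences. The following is a minimal sketch: word_freq is a hypothetical wrapper name, the sample sentences are made up, and a second sort key is added (an assumption, not in the original pipeline) so that words with equal counts come out in alphabetical order.

```shell
# Hypothetical wrapper around the frequency-counting stage of the pipeline.
word_freq() {
  tr 'A-Z' 'a-z' |                        # normalize to lowercase
    grep -oE '[a-z]{2,}' |                # one word (2+ letters) per line
    sort |                                # group identical words together
    uniq -c |                             # prefix each word with its count
    sort -k1,1nr -k2,2 |                  # count descending, then word ascending
    awk '{printf "%16s %4d\n", $2, $1}'   # word first, count second, padded
}

# Made-up sample input; the one-letter word "a" is dropped by the {2,} pattern.
printf '%s\n' 'The cat and the dog.' 'The end: a cat!' | word_freq
```

With this input, "the" is listed first with a count of 3, followed by "cat" with 2, then "and", "dog", and "end" with 1 each.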

To get the number of distinct words, count the lines with wc -l.

Counting distinct words

<oreilly.txt xargs -I{} pdftotext {} - |
  tr '[A-Z]' '[a-z]' |                # normalize to lowercase
  grep -oE '[a-z]{2,}' |              # runs of 2+ letters (drops symbols and digits)
  sort |
  uniq -c |                           # collapse duplicates (one line per distinct word)
  wc -l                               # count lines

35002

So the result is: "O'Reilly books are made of 35,002 words."
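When only the number of distinct words is needed, the per-word counts are not used, so sort -u can stand in for sort | uniq -c. A minimal sketch under that observation follows. The name distinct_words and the sample sentences are hypothetical.

```shell
# Hypothetical helper: count distinct words (2+ letters, case-insensitive) on stdin.
distinct_words() {
  tr 'A-Z' 'a-z' |          # normalize to lowercase
    grep -oE '[a-z]{2,}' |  # one word per line
    sort -u |               # keep one line per distinct word
    wc -l                   # the line count is the number of distinct words
}

# Made-up sample input: the distinct words are the, cat, and, dog, end.
printf '%s\n' 'The cat and the dog.' 'The end: a cat!' | distinct_words
```

Some wc implementations pad the number with leading spaces, so strip whitespace before comparing the result as a string.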

Next

¹ 2014-08-10 コマンドを並列に実行するGNU parallelがとても便利 ("GNU parallel, which runs commands in parallel, is very handy")
