
CCC: coding crash course (5) Counting the frequency of words and letters in Steve Jobs's speech

Python

Last updated at 2020-05-28, posted at 2020-05-28

Steve Jobs's speech

First, get the text of the speech from the Stanford University page and save it as a text file with no paragraph breaks.
https://news.stanford.edu/2005/06/14/jobs-061505/


# Read the whole speech into one string; the with-block closes the file afterwards.
with open("stevejobs.txt", "r") as f:
    contents = f.read()

The following produces list_word, which is (almost) a list of words.


list_word = contents.split(" ")

With it, we can count how many times a particular word, for example "the", appears.


count = 0
for w in list_word:
    if (w == "the"):
        count = count + 1

print(count)
# 91
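
The loop above can also be written with the built-in list.count method. The sketch below is only an illustration: sample_text is a short made-up string, not the speech, and the variable names are hypothetical. It also shows that an exact comparison with "the" skips a capitalized "The" at the start of a sentence.

```python
# Hypothetical sample text; the article reads the real speech from stevejobs.txt.
sample_text = "The dog saw the cat and the bird"

sample_words = sample_text.split(" ")

# list.count performs the same exact-match count as the for-loop version.
count_exact = sample_words.count("the")

# Lowercasing each word first also counts the capitalized "The".
count_any_case = sum(1 for w in sample_words if w.lower() == "the")

print(count_exact)     # 2
print(count_any_case)  # 3
```

Whether "The" should be counted together with "the" depends on what you want to measure, so the figure 91 above should be read as the count of the lowercase form only.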

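I stopped short of calling this a "list of words" above because, with this approach, a "." or other punctuation mark stays attached to the last word of each sentence. Let's look at some examples.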


for w in list_word:
    if ('.' in w ):
        print(w)
        print('----')

# Output:
"""
world.
----
college.
----
graduation.
----
life.
----
(many more follow)
"""

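One possible way to deal with the attached punctuation is to strip it from both ends of each token before counting. This is a minimal sketch, not part of the original procedure: sample_text is made up, and the extra curly quotes are an assumption added because string.punctuation covers ASCII characters only, while the speech text uses “ ” and ’.

```python
import string
from collections import Counter

# Hypothetical sample text standing in for the speech.
sample_text = "The cat sat. The dog sat, too. Why not?"

# string.punctuation is ASCII-only, so curly quotes are added by hand (assumption).
strip_chars = string.punctuation + "“”‘’"

# Strip punctuation from both ends of each token and lowercase it.
clean_words = [w.strip(strip_chars).lower() for w in sample_text.split(" ")]

# Drop tokens that were nothing but punctuation.
clean_words = [w for w in clean_words if w]

word_counts = Counter(clean_words)
print(word_counts["sat"])  # 2
print(word_counts["the"])  # 2
```

Because strip only touches the ends of a token, punctuation inside a word (such as an apostrophe in a contraction) is left as it is.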
The following produces list_sentence, which is something like a list of sentences.


list_sentence = contents.split(". ")

It is not quite a "list of sentences" because this approach does not treat text ending in "!", "?", or ":" rather than "." as a sentence boundary. Let's look at some examples.


for l in list_sentence:
    if ("!" in l):
        print(l)
        print('----')

# Nothing is printed, so apparently no sentence contains "!".

for l in list_sentence:
    if ("?" in l):
        print(l)
        print('----')

# The output is as follows.
"""
So why did I drop out? It started before I was born
----
So my parents, who were on a waiting list, got a call in the middle of the night asking: “We have an unexpected baby boy; do you want him?” They said: “Of course.” My biological mother later found out that my mother had never graduated from college and that my father had never graduated from high school
----
How can you get fired from a company you started? Well, as Apple grew we hired someone who I thought was very talented to run the company with me, and for the first year or so things went well
----
When I was 17, I read a quote that went something like: “If you live each day as if it was your last, someday you’ll most certainly be right.” It made an impression on me, and since then, for the past 33 years, I have looked in the mirror every morning and asked myself: “If today were the last day of my life, would I want to do what I am about to do today?” And whenever the answer has been “No” for too many days in a row, I know I need to change something
----
"""

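To also split after "!" and "?", one option is re.split with a lookbehind. This is a rough sketch on a made-up sample_text, not a full sentence splitter: it leaves ":" alone by choice, and it will not split where a closing quotation mark follows the period, as in the “Of course.” example above.

```python
import re

# Hypothetical sample text standing in for the speech.
sample_text = "Is this a sentence? Yes it is. Great! Done."

# Split at whitespace that comes right after ".", "!" or "?".
# The lookbehind (?<=...) keeps the punctuation mark attached to its sentence.
sentences = re.split(r"(?<=[.!?])\s+", sample_text)

for s in sentences:
    print(s)
    print('----')
```

Unlike contents.split(". "), this keeps the final punctuation of each sentence, so it is easy to see how each one ended.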
Letter frequencies

Finally, let's look at letter frequencies.

First, convert every character to lowercase.


contents_lower = contents.lower()

def count_char(char_target):
    count = 0
    for i in range(len(contents_lower)):
        s = contents_lower[i]
        if (s==char_target):
            count = count+1
    
    return count

abc = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

list_abc = abc.split()
# list_abc = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

import numpy
list_count_letter = numpy.zeros(26) 

for i in range(26):
    letter = list_abc[i]
    N = count_char(letter)
    list_count_letter[i] = N
    print(letter, N)

# Output:
"""
a 772
b 132
c 219
d 417
e 1077
f 206
g 207
h 440
i 642
j 9
k 65
l 402
m 216
n 595
o 772
p 183
q 4
r 499
s 510
t 926
u 283
v 115
w 245
x 16
y 257
z 4
"""

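For reference, the same per-letter count can be sketched with the standard library alone: str.count does the work of count_char, and string.ascii_lowercase supplies the alphabet. The sample string and the name counts are made up for illustration.

```python
import string

# Hypothetical sample text, lowercased as in the article.
sample_lower = "Hello, World".lower()

# str.count returns the number of non-overlapping occurrences of the letter.
counts = {letter: sample_lower.count(letter) for letter in string.ascii_lowercase}

print(counts["l"])  # 3
print(counts["o"])  # 2
print(counts["z"])  # 0
```

A dictionary keeps the counts as integers, whereas an array created with numpy.zeros(26) stores floats by default.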
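Sorting the counts makes them easier to read. The ranking is:

(1st) "e"
(2nd) "t"
(3rd) "a"
(4th) "o"
(5th) "i"
(6th-10th) "nsrhd"
(11th-15th) "luywc"
(16th-20th) "mgfpb"
(21st-26th) "vkxjqz"

Surprisingly, "u" ranks only 12th. Note that "a" and "o" are tied at 772, as are "q" and "z" at 4, so the order within each pair is arbitrary. The code below prints the letters in ascending order of count.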


index_sort = numpy.argsort(list_count_letter)

for i in range(26):
    k = index_sort[i]
    letter = list_abc[k]
    N = list_count_letter[k]
    print(letter, N)

"""
z 4.0
q 4.0
j 9.0
x 16.0
k 65.0
v 115.0
b 132.0
p 183.0
f 206.0
g 207.0
m 216.0
c 219.0
w 245.0
y 257.0
u 283.0
l 402.0
d 417.0
h 440.0
r 499.0
s 510.0
n 595.0
i 642.0
o 772.0
a 772.0
t 926.0
e 1077.0
"""

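If a ranking from most to least frequent is wanted directly, collections.Counter is one alternative to numpy.argsort. The following is a small sketch on a made-up string (sample_text and letter_counts are hypothetical names), not the code that produced the table above.

```python
import string
from collections import Counter

# Hypothetical sample text standing in for the speech.
sample_text = "Banana bread"

# Count only the letters a-z, ignoring spaces and punctuation.
letter_counts = Counter(
    ch for ch in sample_text.lower() if ch in string.ascii_lowercase
)

# most_common() lists (letter, count) pairs from the highest count down.
for rank, (letter, n) in enumerate(letter_counts.most_common(), start=1):
    print(rank, letter, n)
```

Here "a" comes first with 4 occurrences. Letters with equal counts appear in the order they were first encountered, so ties such as "a" and "o" in the speech would still need a rule of your own if the order matters.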