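Motivation

I want to try Kaggle's Personalized Medicine: Redefining Cancer Treatment competition, so I am learning how to analyze text data.
The task is to classify genetic mutations into nine categories based on text taken from research papers.

Data

The dataset of medical papers provided as training_text in the competition above.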
https://www.kaggle.com/c/msk-redefining-cancer-treatment/download/training_text.zip

Loading the data

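import pandas as pd
# squeeze=True was removed in pandas 2.0; .squeeze("columns") returns the same Series
train_texts = pd.read_table("training_text", sep=r"\|\|", engine="python").squeeze("columns")
train_texts.tail()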

3316    Introduction  Myelodysplastic syndromes (MDS) ...
3317    Introduction  Myelodysplastic syndromes (MDS) ...
3318    The Runt-related transcription factor 1 gene (...
3319    The RUNX1/AML1 gene is the most frequent targe...
3320    The most frequent mutations associated with le...
Name: ID,Text, dtype: object

The dataset contains 3,321 papers (IDs 0 to 3320), including duplicates.
Let's look a little more closely at the contents of one paper.

train_texts[0][:200]

'Cyclin-dependent kinases (CDKs) regulate a variety of fundamental cellular processes. CDK10 stands out as one of the last orphan CDKs for which no activating cyclin has been identified and no kinase a'

As you can see, it is an ordinary research paper.
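For reference, the `||` separator used in the loading step can be illustrated without pandas. The following is a minimal sketch that assumes the file has a header line `ID,Text` followed by one `ID||Text` record per line (inferred from the Series name in the output above); the sample rows and the `read_texts` helper are hypothetical and exist only for this illustration.

```python
import io

# Hypothetical two-row sample in the assumed layout of training_text:
# a header line "ID,Text", then one "ID||Text" record per line.
sample = io.StringIO(
    "ID,Text\n"
    "0||Cyclin-dependent kinases (CDKs) regulate a variety of fundamental cellular processes.\n"
    "1||Second paper text, made up for this sketch.\n"
)

def read_texts(handle):
    """Return a dict mapping each integer ID to its text."""
    next(handle)  # skip the header line
    texts = {}
    for line in handle:
        # Split on the first "||" only, in case the text itself contains "||".
        doc_id, text = line.rstrip("\n").split("||", 1)
        texts[int(doc_id)] = text
    return texts

texts = read_texts(sample)
print(len(texts))     # number of documents read
print(texts[0][:30])  # first 30 characters of document 0
```

Splitting with a maximum of one split keeps the whole paper in a single field, which is the same effect the regular-expression separator has in `pd.read_table`.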

Tokenization and counting

NLTK or a hand-written function would also work, but to keep things simple I rely on scikit-learn this time.

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words="english", lowercase=True, min_df=1)
X = vectorizer.fit_transform(train_texts[:100])
X

<100x18296 sparse matrix of type '<class 'numpy.int64'>'
    with 133551 stored elements in Compressed Sparse Row format>

Processing every paper takes a little while, so I used only the first 100 papers.

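Setting stop_words="english" excludes words such as "that" and "and", which are so common that they carry little meaning for the analysis.
lowercase defaults to True, so all text is converted to lowercase before counting.
min_df, when given an int, excludes words that appear in fewer than that number of documents. The default is 1; with 2, for example, words that appear in only one document are excluded, no matter how often they occur in it.
When given a float in the range [0.0, 1.0], it is interpreted as a proportion of documents, and words that appear in a smaller fraction of the documents are excluded.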
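The effect of min_df is easiest to see on a tiny corpus. The following is a minimal sketch; the three documents and the `vocabulary_for` helper are made up for this illustration and are not part of the competition data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for this sketch):
# "kinase" occurs twice, but in only one document.
docs = [
    "kinase kinase mutation",
    "mutation tumor",
    "tumor cell mutation",
]

def vocabulary_for(min_df):
    """Fit a CountVectorizer with the given min_df and return its sorted vocabulary."""
    vectorizer = CountVectorizer(min_df=min_df)
    vectorizer.fit(docs)
    return sorted(vectorizer.vocabulary_)

print(vocabulary_for(1))    # every word is kept
print(vocabulary_for(2))    # words found in fewer than 2 documents are dropped
print(vocabulary_for(0.5))  # words found in fewer than 50% of the documents are dropped
```

Note that "kinase" is dropped with min_df=2 even though it occurs twice, because the threshold counts documents, not occurrences.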

The result of the vectorizer's transformation is returned as a sparse matrix.
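As a small sketch of what that sparse matrix holds, the example below builds one from two made-up documents and inspects it. It also shows one possible way to get per-word totals directly from the sparse matrix, without converting it to a dense array first; this is an alternative for illustration, not something the rest of this post depends on.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents for this sketch.
docs = ["cell cell tumor", "tumor mutation"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(sparse.issparse(X))  # the result is a SciPy sparse matrix
print(X.shape)             # (number of documents, vocabulary size)
print(X.nnz)               # number of non-zero entries actually stored

# Column sums give the total count of each word without building a dense array.
totals = np.asarray(X.sum(axis=0)).ravel()

# vocabulary_ maps each word to its column index; order the words by that index.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
counts = dict(zip(words, totals.tolist()))
print(counts)
```

For 100 papers the dense conversion is harmless, but on the full dataset summing the sparse matrix directly may save a noticeable amount of memory.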

I was curious about which words the stop word list contains, so let's check.

# sklearn.feature_extraction.stop_words is no longer importable in recent scikit-learn releases
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
eng_stop_words = ENGLISH_STOP_WORDS

print(len(eng_stop_words))
print(list(eng_stop_words)[:20])

318
['con', 'toward', 'anywhere', 'on', 'a', 'most', 'this', 'full', 'inc', 'whereafter', 'had', 'hers', 'call', 'interest', 'move', 'she', 'six', 'up', 'he', 'you']

Most entries, such as 'on', 'most', and 'this', make sense, but in some cases it may be better not to exclude words like 'call' and 'move'.
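If you do want to keep such words, one option is to pass your own list to stop_words instead of "english". The following is a minimal sketch; the sentence and the choice of words to keep are made up for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Made-up sentence for this sketch.
docs = ["researchers call this move a shift"]

# Built-in list: "call" and "move" are treated as stop words.
default_vocab = sorted(CountVectorizer(stop_words="english").fit(docs).vocabulary_)

# Custom list: the built-in words minus the ones we want to keep.
kept_words = {"call", "move"}
custom_stop_words = sorted(ENGLISH_STOP_WORDS - kept_words)
custom_vocab = sorted(CountVectorizer(stop_words=custom_stop_words).fit(docs).vocabulary_)

print(default_vocab)
print(custom_vocab)
```

The custom list is converted to a plain list with sorted() because stop_words expects either the string "english", a list, or None.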

In any case, let's count the words in the papers.

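import numpy as np

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocabulary = vectorizer.get_feature_names_out()
word_count = np.sum(X.toarray(), axis=0)  # total count of each word over all documents

D = pd.Series(word_count, index=vocabulary)
D.sort_values(ascending=False).head(20)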

mutations     7212
cbl           5660
cells         4912
cell          4558
cancer        2931
et            2775
figure        2755
al            2722
fig           2491
mutation      2181
protein       2043
activity      2034
type          1998
cyclin        1870
expression    1679
tert          1609
binding       1543
domain        1537
wild          1491
tumor         1418
dtype: int64

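Words such as mutation, cell, and cancer, which you would expect to be common in cancer papers, rank near the top. figure, fig, et, and al are also typical of papers, but they carry no useful information here, so it would be better to add them to the stop words.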
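One way to do that is to extend the built-in list and pass the result to stop_words. The following is a minimal sketch; the sentence is made up, and the set of extra words is simply the paper boilerplate noticed in the top-20 list above.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Made-up sentence for this sketch.
docs = ["Smith et al showed in Figure 2 and Fig 3 that the mutation is activating"]

# Paper boilerplate to exclude in addition to the built-in English stop words.
paper_words = {"figure", "fig", "et", "al"}
custom_stop_words = sorted(ENGLISH_STOP_WORDS | paper_words)

vectorizer = CountVectorizer(stop_words=custom_stop_words)
vectorizer.fit(docs)
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
```

The extra words are written in lowercase because, with lowercase=True, the text is lowercased before the stop words are filtered out.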

Next, I plan to explore things like which words are frequent in the papers of each category.
That's all for today.

Counting words in English text with scikit-learn
