What is TF-IDF?

More than 1 year has passed since last update.

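Background

I wanted to extract some kind of features from the papers posted on arXiv and use them to spot trends. While looking into this, I came across TF-IDF, a method for extracting characteristic terms.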

TF-IDF

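TF-IDF is built from two components, TF and IDF.
TF (Term Frequency) measures how often a word appears in each document.
IDF (Inverse Document Frequency) assigns higher values to rarer words, that is, words that appear in only a few documents.
TF-IDF is therefore a metric that takes into account both how frequently a word occurs in a document and how rare that word is across all documents.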
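To make the two components concrete, here is a minimal pure-Python sketch of the textbook definition, tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is the number of documents that contain term t. The function name `tf_idf`, the whitespace tokenization, and the toy corpus are assumptions made for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document, using the textbook
    definition tf(t, d) * log(N / df(t))."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy corpus, not taken from arXiv.
toy_docs = [
    "deep learning for image recognition",
    "deep learning for speech recognition",
    "quantum computing for chemistry",
]
toy_scores = tf_idf(toy_docs)
for doc, doc_scores in zip(toy_docs, toy_scores):
    best = max(doc_scores, key=doc_scores.get)
    print(f"{doc!r}: top term = {best!r}")
```

In this toy corpus the word "for" occurs in every document, so its IDF is log(3/3) = 0 and its TF-IDF score is zero regardless of how often it appears. Words that occur in only one document get the largest IDF.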

sample.py
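from sklearn.feature_extraction.text import TfidfVectorizer

# A list holding a single document (one string), so the result has exactly one row.
docs = ["I'm gonna make a change. For once in my life It's gonna feel real good, Gonna make a difference Gonna make it right"]

# An integer max_df drops terms that occur in more than that many documents;
# with only one document here, max_df=10 filters nothing.
vec = TfidfVectorizer(max_df=10)
term_doc = vec.fit_transform(docs)  # sparse matrix, shape (n_documents, n_terms)

# vocabulary_ maps each term to its column index in the matrix.
vocabulary_ = vec.vocabulary_
term_doc_array = term_doc.toarray()
print(term_doc_array)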

The result is:
[[0. 0. 0. ... 0. 23. 0.
0. 0. 0.57 ... 0.023. 0.
...
0. 0. 0.64 ... 0.13 0. 0.]]

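An array like the one above is printed, with one row per document and one column per vocabulary term. The words with the highest TF-IDF values are the most important ones.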
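The printed array contains only numbers, so to see which words score highest the column indices have to be mapped back to terms. The following is a rough sketch of one way to do that. The helper `top_terms` and the example values are hypothetical; in practice the inputs would be `vec.vocabulary_` and one row of `term_doc_array` from sample.py.

```python
def top_terms(vocabulary, row, k=3):
    """Return the k highest-scoring (term, score) pairs for one document.

    vocabulary: dict mapping term -> column index (the shape of vec.vocabulary_)
    row: sequence of TF-IDF scores for one document (e.g. term_doc_array[0])
    """
    # Invert the mapping so each column index points back to its term.
    index_to_term = {index: term for term, index in vocabulary.items()}
    # Sort (column index, score) pairs by score, highest first.
    ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
    return [(index_to_term[i], score) for i, score in ranked[:k]]


# Hypothetical values for illustration only; real ones come from
# vec.vocabulary_ and term_doc_array[0] in sample.py.
example_vocabulary = {"change": 0, "gonna": 1, "make": 2, "right": 3}
example_row = [0.2, 0.7, 0.5, 0.1]
example_top = top_terms(example_vocabulary, example_row, k=2)
print(example_top)
```

With the real objects, the call would look like `top_terms(vec.vocabulary_, term_doc_array[0])`. Recent scikit-learn versions also provide `vec.get_feature_names_out()`, which returns the terms in column order and avoids inverting the dictionary by hand.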
