
Topic Modeling with LDA Using Python and GridDB

Posted at 2022-08-09

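In natural language processing, topic modeling assigns topics to a given corpus based on the words it contains. Because the text data is unlabeled, it is an unsupervised technique. In a world flooded with data, classifying documents into topics matters more and more. For example, a company that receives hundreds of reviews needs to know which categories of review matter most and which matter least.

Like keywords, topics are used to describe documents. A topic about the economy, for example, brings to mind the stock market, the US dollar, inflation, and GDP. A topic model is a model that detects topics automatically from the words that appear in documents, and that is the problem we work on here.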

LDA - (Latent Dirichlet Allocation)

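"Latent" means hidden, or not yet discovered. As "Dirichlet" indicates, the model assumes that the distribution of topics in a document and the distribution of words in a topic follow Dirichlet distributions. "Allocation" here means assigning something, in this case topics.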
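The three words can be made concrete with a toy version of the generative story that LDA assumes. The sketch below is only an illustration: the two-topic vocabulary, the prior `alpha`, and the helper names `sample_dirichlet` and `generate_document` are all hypothetical, and it samples documents rather than fitting a model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, n_words, rng):
    """Generate one toy document following the LDA generative story."""
    # Dirichlet: draw this document's (latent) topic mixture from the prior.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Allocation: assign a topic to this word position, then draw a word from it.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        vocab, probs = zip(*topic_word[topic].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical two-topic vocabulary, for illustration only.
topic_word = [
    {"game": 0.5, "level": 0.3, "play": 0.2},
    {"pay": 0.4, "card": 0.4, "account": 0.2},
]
rng = random.Random(0)
theta, words = generate_document(topic_word, alpha=[0.5, 0.5], n_words=8, rng=rng)
print([round(t, 2) for t in theta], words)
```

Fitting LDA runs this story in reverse: only the words are observed, and the topic mixtures and topic-word distributions are inferred.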

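In this tutorial, we use the reviews in the dataset described below and generate topics from them. This shows what users are talking about, what they focus on, and where app developers may need to improve.

The tutorial is organized as follows.

Prerequisites and environment setup
Dataset overview
Importing the required libraries
Loading the dataset
Data cleaning and preprocessing
Building and training the machine learning model
Conclusion

1. Prerequisites and environment setup

This tutorial was run in Anaconda Navigator (Python version 3.8.3) on the Windows operating system. Before continuing, the following packages need to be installed.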

Pandas
NumPy
Sklearn
nltk
re
griddb_python
spacy
gensim

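These packages can be installed into a Conda virtual environment with conda install package-name. If you use Python directly from a terminal or command prompt, install them with pip install package-name. Note that re is part of the Python standard library and needs no installation, and that Sklearn is installed under the package name scikit-learn.

Installing GridDB

This tutorial covers two ways to load the dataset: using GridDB and using Pandas. To access GridDB from Python, the following packages also need to be installed beforehand.

GridDB C client
SWIG (Simplified Wrapper and Interface Generator)
GridDB Python client

2. Dataset overview

Google Play Store Apps dataset: web-scraped data on about 10,000 Play Store apps, collected for analyzing the Android market. This tutorial uses the googleplaystore_user_reviews.csv file from that dataset.

It can be downloaded here (https://www.kaggle.com/datasets/lava18/google-play-store-apps/download).

3. Importing the required libraries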

import griddb_python as griddb
import numpy as np
import pandas as pd
import re, nltk, spacy, gensim
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint

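4. Loading the dataset

Next, let's load the dataset into the notebook.

4.a Using GridDB

Toshiba GridDB™ is a highly scalable NoSQL database suited to IoT and big data. At the core of GridDB's philosophy are a versatile data store optimized for IoT, high scalability, tuning for high performance, and high reliability.

CSV files can become cumbersome when storing large amounts of data. As an open-source, scalable, in-memory NoSQL database, GridDB is a practical alternative that makes it easy to store large volumes of data. If you are new to GridDB, an introductory GridDB tutorial is a good place to start.

Assuming the database is already set up, let's write the SQL query in Python to load the dataset.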

sql_statement = ('SELECT * FROM googleplaystore_user_reviews')
dataset = pd.read_sql_query(sql_statement, cont)

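Note that the variable cont holds the information for the container in which the data is stored. If your container has a different name, replace googleplaystore_user_reviews in the query with it. See the GridDB tutorial for details.

For IoT and big data use cases, GridDB stands out among relational and NoSQL databases. Overall, it offers several reliability features for mission-critical applications that need high availability and data retention.

4.b Using pandas read_csv

pandas can also read the CSV file directly. read_csv() opens the file and parses it into a DataFrame, so there is no need to call open() and handle the file object yourself. Whichever of the two methods you use, the data ends up in a pandas dataframe, so the result is the same.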

df = pd.read_csv("googleplaystore_user_reviews.csv")
df = df.dropna(subset=["Translated_Review"])

df.head()

	App	Translated_Review	Sentiment	Sentiment_Polarity	Sentiment_Subjectivity
0	10 Best Foods for You	I like eat delicious food. That's I'm cooking ...	Positive	1.00	0.533333
1	10 Best Foods for You	This help eating healthy exercise regular basis	Positive	0.25	0.288462
3	10 Best Foods for You	Works great especially going grocery store	Positive	0.40	0.875000
4	10 Best Foods for You	Best idea us	Positive	1.00	0.300000
5	10 Best Foods for You	Best way	Positive	1.00	0.300000

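Once the dataset is loaded, inspect it. The head() call above displays the first five rows of the dataset. The index jumps from 1 to 3 because dropna removed the row without a translated review.

5. Data cleaning and preprocessing

We clean the data by removing email addresses, newline characters, and single quotes.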

# Convert to list
data = df.Translated_Review.values.tolist()
# Remove Emails
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]
# Remove new line characters
data = [re.sub(r'\s+', ' ', sent) for sent in data]
# Remove distracting single quotes
data = [re.sub(r"\'", "", sent) for sent in data]
print(data[:1])

['I like eat delicious food. Thats Im cooking food myself, case "10 Best '
 'Foods" helps lot, also "Best Before (Shelf Life)"']
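To see what each of the three substitutions above does, it can help to run them on one short string. The sketch below wraps the same regular expressions in a helper; the name `clean_review` and the sample text are hypothetical and not part of the dataset.

```python
import re

def clean_review(text):
    """Apply the same three substitutions, in the same order, to one string."""
    text = re.sub(r'\S*@\S*\s?', '', text)  # drop anything that looks like an email address
    text = re.sub(r'\s+', ' ', text)        # collapse newlines, tabs, and repeated spaces
    text = re.sub(r"\'", "", text)          # drop single quotes
    return text

# Hypothetical review text, for illustration only.
sample = "Contact me at user@example.com\nIt's   great"
print(clean_review(sample))  # prints: Contact me at Its great
```

Removing quotes turns "It's" into "Its", which is why tokens such as "thats" and "im" appear in the output above.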

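Next, we tokenize each sentence into a list of words, removing punctuation and unnecessary characters. We then lemmatize the words. Lemmatization reduces each word to its dictionary base form, called the lemma, for example "cooking" to "cook"; stemming, by contrast, simply strips prefixes and suffixes. The code below uses spaCy's lemmatizer. The benefit is a smaller number of unique words in the dictionary.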

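def sent_to_words(sentences):
    """Tokenize each sentence into a list of lowercase words."""
    for sentence in sentences:
        # simple_preprocess drops punctuation while tokenizing; deacc=True also strips accent marks
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each tokenized text, keeping only tokens whose part of speech is allowed."""
    # Uses the global spaCy pipeline `nlp`, which is loaded below before the first call.
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        # '-PRON-' is the placeholder lemma that spaCy v2 assigns to pronouns
        texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if token.pos_ in allowed_postags]))
    return texts_out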

data_words = list(sent_to_words(data))
print(data_words[:1])

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
data_lemmatized = lemmatization(data_words, allowed_postags=["NOUN", "VERB"]) #select noun and verb
print(data_lemmatized[:2])

[['like', 'eat', 'delicious', 'food', 'thats', 'im', 'cooking', 'food', 'myself', 'case', 'best', 'foods', 'helps', 'lot', 'also', 'best', 'before', 'shelf', 'life']]
['eat food s m cook food case food help lot shelf life', 'help eat exercise basis']
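The first line of output above comes from the tokenization step. As a rough stand-in for what gensim's simple_preprocess does (lowercasing, splitting into alphabetic tokens, and discarding tokens that are too short or too long), the following sketch uses only the standard library. The helper name `rough_tokenize` is hypothetical, and the length limits of 2 and 15 are assumed to match gensim's defaults; it is not gensim's actual implementation and ignores accents and non-ASCII letters.

```python
import re

def rough_tokenize(text, min_len=2, max_len=15):
    """Lowercase, keep alphabetic runs, and drop very short or very long tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

review = 'I like eat delicious food. Thats Im cooking food myself, case "10 Best Foods"'
print(rough_tokenize(review))
```

The single-letter "I" and the number "10" disappear, which matches the token list printed above.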

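The LDA topic model algorithm requires a document-word matrix as input, which CountVectorizer builds. In the configuration below, min_df=10 keeps only words that occur in at least 10 documents, stop_words='english' drops common English words, and token_pattern keeps only tokens of three or more alphanumeric characters.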

vectorizer = CountVectorizer(analyzer='word',       
                             min_df=10,
                             stop_words='english',             
                             lowercase=True,                   
                             token_pattern='[a-zA-Z0-9]{3,}') 
data_vectorized = vectorizer.fit_transform(data_lemmatized)
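For readers who have not seen a document-word matrix before, the sketch below builds a tiny one in plain Python. It is a simplified analogue of the CountVectorizer call above, not scikit-learn's implementation: it applies the same token pattern and a min_df threshold but skips stop-word removal, and the helper name `document_word_matrix` and the three sample documents are hypothetical.

```python
import re
from collections import Counter

def document_word_matrix(docs, min_df=1, token_pattern=r"[a-zA-Z0-9]{3,}"):
    """Build a dense count matrix: one row per document, one column per vocabulary word."""
    tokenized = [re.findall(token_pattern, doc.lower()) for doc in docs]
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vocab = sorted(word for word, df in doc_freq.items() if df >= min_df)
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[c[word] for word in vocab] for c in counts]
    return vocab, matrix

docs = ["eat food cook food", "help eat exercise basis", "food help lot"]
vocab, matrix = document_word_matrix(docs, min_df=2)
print(vocab)   # ['eat', 'food', 'help']
for row in matrix:
    print(row)  # [1, 2, 0] then [1, 0, 1] then [0, 1, 1]
```

Each row is the bag-of-words count vector that LDA receives for one review; word order is discarded.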

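6. Building the machine learning model

We now have everything needed to build the LDA (Latent Dirichlet Allocation) model. To build it, initialize the model and call fit_transform().

Based on prior knowledge of the dataset, this example sets n_components (the number of topics) to 20. The value is tuned later with a grid search.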

# Build LDA Model
lda_model = LatentDirichletAllocation(n_components=20, max_iter=10, learning_method='online',
                                      random_state=100, batch_size=128, evaluate_every=-1,
                                      n_jobs=-1)
lda_output = lda_model.fit_transform(data_vectorized)
print(lda_model)  # Model attributes

LatentDirichletAllocation(learning_method='online', n_components=20, n_jobs=-1,
                          random_state=100)

Diagnosing model performance with perplexity and log-likelihood

A good model has a high log-likelihood and a low perplexity (exp(-1. * log-likelihood per word)).

# Log Likelihood: Higher the better
print("Log Likelihood: ", lda_model.score(data_vectorized))
# Perplexity: Lower the better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", lda_model.perplexity(data_vectorized))
# See model parameters
pprint(lda_model.get_params())

Log Likelihood:  -2127623.32986425
Perplexity:  1065.3272644698702
{'batch_size': 128,
 'doc_topic_prior': None,
 'evaluate_every': -1,
 'learning_decay': 0.7,
 'learning_method': 'online',
 'learning_offset': 10.0,
 'max_doc_update_iter': 100,
 'max_iter': 10,
 'mean_change_tol': 0.001,
 'n_components': 20,
 'n_jobs': -1,
 'perp_tol': 0.1,
 'random_state': 100,
 'topic_word_prior': None,
 'total_samples': 1000000.0,
 'verbose': 0}
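The relationship between the two metrics follows directly from the formula quoted above. The sketch below is a minimal illustration with made-up numbers; the helper name `perplexity` and the inputs are hypothetical, and scikit-learn computes both values from an approximate bound rather than an exact likelihood.

```python
import math

def perplexity(total_log_likelihood, n_words):
    """Perplexity = exp(-1 * log-likelihood per word)."""
    return math.exp(-total_log_likelihood / n_words)

# Hypothetical corpus of 1,000 word occurrences.
worse = perplexity(-6900.0, 1000)
better = perplexity(-6500.0, 1000)
print(round(worse, 1), round(better, 1))
```

A higher (less negative) log-likelihood always gives a lower perplexity. Because perplexity is normalized per word, it is easier to compare across datasets of different sizes than the raw log-likelihood.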

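Using GridSearch to find the best LDA model

n_components (the number of topics) is the most important tuning parameter for an LDA model. We also search over learning_decay, which controls the learning rate. Beyond these, learning_offset (which downweights early iterations and should be greater than 1) and max_iter are further candidates for the search. Grid search can consume a lot of time and resources.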

# Define Search Param
search_params = {'n_components': [10, 20], 'learning_decay': [0.5, 0.9]}
# Init the Model
lda = LatentDirichletAllocation(max_iter=5, learning_method='online', learning_offset=50.,random_state=0)
# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params)
# Do the Grid Search
model.fit(data_vectorized)
GridSearchCV(cv=None, error_score='raise',
       estimator=LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=10, n_jobs=1, perp_tol=0.1, random_state=None,
             topic_word_prior=None, total_samples=1000000.0, verbose=0),
        n_jobs=1,
       param_grid={'n_components': [10, 20], 'learning_decay': [0.5, 0.9]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

# Best Model
best_lda_model = model.best_estimator_
# Model Parameters
print("Best Model's Params: ", model.best_params_)
# Log Likelihood Score
print("Best Log Likelihood Score: ", model.best_score_)
# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(data_vectorized))

Best Model's Params:  {'learning_decay': 0.9, 'n_components': 10}
Best Log Likelihood Score:  -432616.36669435585
Model Perplexity:  764.0439579711182
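To get a feel for why the grid search above is expensive, it helps to enumerate what it trains. The sketch below expands the same search_params dictionary into its combinations. The helper name `expand_grid` is hypothetical, and the fold count of 5 is an assumption: the default number of cross-validation folds depends on the scikit-learn version.

```python
from itertools import product

search_params = {'n_components': [10, 20], 'learning_decay': [0.5, 0.9]}

def expand_grid(param_grid):
    """List every parameter combination that an exhaustive grid search would try."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

candidates = expand_grid(search_params)
n_folds = 5  # assumption: 5-fold cross-validation
for params in candidates:
    print(params)
print("model fits:", len(candidates) * n_folds)
```

Every value added to a parameter list multiplies the number of fits, so adding learning_offset or max_iter to the grid grows the cost quickly. With refit enabled, the best combination is trained once more on the full data.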

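A logical way to decide which topic a document belongs to is to check which topic contributes most to that document and assign the document to it. The table below lists the weight of every topic for each document and puts the dominant topic in its own column.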

# Create Document-Topic Matrix
lda_output = best_lda_model.transform(data_vectorized)

topicnames = ["Topic" + str(i) for i in range(best_lda_model.n_components)]
docnames = ["Doc" + str(i) for i in range(len(data))]

# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)
# Get dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic["dominant_topic"] = dominant_topic
# Styling
def color_green(val):
    color = "green" if val > .1 else "black"
    return "color: {col}".format(col=color)
def make_bold(val):
    weight = 700 if val > .1 else 400
    return "font-weight: {weight}".format(weight=weight)
# Apply Style
df_document_topics = df_document_topic.head(15).style.applymap(color_green).applymap(make_bold)
df_document_topics

	Topic0	Topic1	Topic2	Topic3	Topic4	Topic5	Topic6	Topic7	Topic8	Topic9	dominant_topic
Doc0	0.010000	0.010000	0.010000	0.760000	0.010000	0.010000	0.010000	0.010000	0.010000	0.160000	3
Doc1	0.020000	0.020000	0.020000	0.820000	0.020000	0.020000	0.020000	0.020000	0.020000	0.020000	3
Doc2	0.030000	0.030000	0.030000	0.030000	0.030000	0.030000	0.770000	0.030000	0.030000	0.030000	6
Doc3	0.550000	0.050000	0.050000	0.050000	0.050000	0.050000	0.050000	0.050000	0.050000	0.050000	0
Doc4	0.050000	0.050000	0.050000	0.550000	0.050000	0.050000	0.050000	0.050000	0.050000	0.050000	3
Doc5	0.100000	0.100000	0.100000	0.100000	0.100000	0.100000	0.100000	0.100000	0.100000	0.100000	0
Doc6	0.030000	0.030000	0.700000	0.030000	0.030000	0.030000	0.030000	0.030000	0.030000	0.030000	2
Doc7	0.030000	0.030000	0.030000	0.030000	0.030000	0.030000	0.250000	0.030000	0.030000	0.550000	9
Doc8	0.100000	0.100000	0.100000	0.100000	0.100000	0.100000	0.100000	0.100000	0.100000	0.100000	0
Doc9	0.010000	0.010000	0.010000	0.010000	0.790000	0.120000	0.010000	0.010000	0.010000	0.010000	4
Doc10	0.850000	0.020000	0.020000	0.020000	0.020000	0.020000	0.020000	0.020000	0.020000	0.020000	0
Doc11	0.020000	0.020000	0.220000	0.620000	0.020000	0.020000	0.020000	0.020000	0.020000	0.020000	3
Doc12	0.030000	0.030000	0.030000	0.520000	0.030000	0.270000	0.030000	0.030000	0.030000	0.030000	3
Doc13	0.020000	0.020000	0.020000	0.380000	0.020000	0.020000	0.020000	0.020000	0.020000	0.460000	9
Doc14	0.850000	0.020000	0.020000	0.020000	0.020000	0.020000	0.020000	0.020000	0.020000	0.020000	0
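The dominant_topic column is simply the position of the largest weight in each row. The sketch below reproduces that step in plain Python on two rows modeled on the table above; the helper name `dominant_topic_of` is hypothetical.

```python
def dominant_topic_of(topic_weights):
    """Return the index of the largest weight; ties go to the lowest index."""
    return max(range(len(topic_weights)), key=topic_weights.__getitem__)

# Rows modeled on Doc0 and Doc5 in the table above.
doc0 = [0.01, 0.01, 0.01, 0.76, 0.01, 0.01, 0.01, 0.01, 0.01, 0.16]
doc5 = [0.10] * 10  # uniform weights: no topic stands out
print(dominant_topic_of(doc0), dominant_topic_of(doc5))  # prints: 3 0
```

Like np.argmax, this returns the first index when several weights tie. A uniform row such as Doc5 or Doc8 therefore lands in topic 0 even though the model expresses no preference, so it may be worth treating such rows as unassigned.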

# Topic-Keyword Matrix
df_topic_keywords = pd.DataFrame(best_lda_model.components_)
# Assign Column and Index
df_topic_keywords.columns = vectorizer.get_feature_names_out()
df_topic_keywords.index = topicnames
# View
df_topic_keywords.head()

	aap	abandon	ability	abuse	accept	access	accessory	accident	accommodation	accomplish	...	yardage	yay	year	yesterday	yoga	youtube	zip	zombie	zone	zoom
Topic0	0.102649	0.102871	56.001281	0.103583	0.107420	0.132561	12.712732	0.102863	0.102585	0.102685	...	8.642076	0.102612	153.232551	0.102522	0.496217	0.106992	0.211912	0.140018	0.177780	0.104975
Topic1	0.101828	0.102233	1.148602	0.102127	0.103543	558.310169	0.102997	2.594090	0.102651	0.110221	...	0.525860	0.102106	6.075186	20.135445	0.102284	0.106246	0.103076	0.108334	0.122234	0.102741
Topic2	0.103196	0.107593	0.107848	0.104019	0.103053	0.126004	0.106085	0.117876	9.979474	0.108507	...	0.366334	0.102367	5.066123	0.103931	31.039314	0.107878	0.102303	0.102200	0.128228	0.104907
Topic3	0.102564	0.107112	2.022397	12.968156	0.102692	0.130003	0.113959	1.838441	0.101579	8.345948	...	0.105286	0.103549	7.478397	0.104231	24.234774	0.118099	0.123212	0.128494	29.086953	0.103109
Topic4	0.102634	0.102345	76.332226	0.102486	41.139452	0.118419	0.115930	0.142032	0.103316	0.104292	...	0.409518	0.102979	737.692499	0.600751	0.116092	0.102262	0.108881	0.102011	0.115584	0.513135

5 rows × 2273 columns

# Show top n keywords for each topic
def show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=20):
    keywords = np.array(vectorizer.get_feature_names_out())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords
topic_keywords = show_topics(vectorizer=vectorizer, lda_model=best_lda_model, n_words=15)
# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords

	Word 0	Word 1	Word 2	Word 3	Word 4	Word 5	Word 6	Word 7	Word 8	Word 9	Word 10	Word 11	Word 12	Word 13	Word 14
Topic 0	phone	make	add	app	think	picture	version	month	work	minute	thing	look	list	home	number
Topic 1	email	send	news	check	price	bug	access	color	customer	order	make	message	service	app	camera
Topic 2	love	app	look	date	book	lose	guy	family	switch	music	recipe	information	quality	feel	change
Topic 3	fix	way	day	money	need	buy	star	make	lot	start	spend	help	rate	like	track
Topic 4	use	pay	want	account	user	year	fix	note	log	error	recommend	problem	app	star	option
Topic 5	feature	thank	hate	learn	photo	text	job	search	suck	help	tab	tool	weight	weather	group
Topic 6	work	screen	video	need	notification	device	wish	thing	option	set	store	choose	type	food	item
Topic 7	game	play	level	fun	player	watch	make	enjoy	start	graphic	thing	win	character	score	lose
Topic 8	time	update	try	review	crash	know	let	problem	page	load	waste	want	app	need	version
Topic 9	say	card	people	time	work	tell	download	help	datum	issue	happen	support	thing	know	want
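The ranking inside show_topics() sorts each topic's word weights in descending order and keeps the first n_words entries. A minimal sketch of that step without NumPy, using a hypothetical five-word vocabulary and made-up weights (the helper name `top_keywords` is also hypothetical):

```python
def top_keywords(weights, vocabulary, n_words=3):
    """Return the n_words vocabulary entries with the largest weights for one topic."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [vocabulary[i] for i in ranked[:n_words]]

# Hypothetical vocabulary and topic-word weights, for illustration only.
vocabulary = ["access", "game", "level", "pay", "play"]
topic_weights = [0.1, 42.0, 17.5, 0.3, 30.2]
print(top_keywords(topic_weights, vocabulary))  # prints: ['game', 'play', 'level']
```

Word 0 in the table above is therefore the highest-weighted word of each topic, and later columns carry progressively less weight.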

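In this step, we infer a topic label from each topic's keywords. Topic 3, for example, contains "money", "buy", and "spend", so we label it "Card Payment". The labels are manual guesses and need to be revisited whenever the model is retrained, because the keywords behind each topic index change from run to run. Next, we add the 10 inferred labels to the dataframe.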

Topics = ["Update Version/Fix Crash Problem","Download/Internet Access","Learn and Share","Card Payment","Notification/Support", 
          "Account Problem", "Device/Design/Password", "Language/Recommend/Screen Size", "Graphic/ Game Design/ Level and Coin", "Photo/Search"]
df_topic_keywords["Topics"]=Topics
df_topic_keywords

	Word 0	Word 1	Word 2	Word 3	Word 4	Word 5	Word 6	Word 7	Word 8	Word 9	Word 10	Word 11	Word 12	Word 13	Word 14	Topics
Topic 0	phone	make	add	app	think	picture	version	month	work	minute	thing	look	list	home	number	Update Version/Fix Crash Problem
Topic 1	email	send	news	check	price	bug	access	color	customer	order	make	message	service	app	camera	Download/Internet Access
Topic 2	love	app	look	date	book	lose	guy	family	switch	music	recipe	information	quality	feel	change	Learn and Share
Topic 3	fix	way	day	money	need	buy	star	make	lot	start	spend	help	rate	like	track	Card Payment
Topic 4	use	pay	want	account	user	year	fix	note	log	error	recommend	problem	app	star	option	Notification/Support
Topic 5	feature	thank	hate	learn	photo	text	job	search	suck	help	tab	tool	weight	weather	group	Account Problem
Topic 6	work	screen	video	need	notification	device	wish	thing	option	set	store	choose	type	food	item	Device/Design/Password
Topic 7	game	play	level	fun	player	watch	make	enjoy	start	graphic	thing	win	character	score	lose	Language/Recommend/Screen Size
Topic 8	time	update	try	review	crash	know	let	problem	page	load	waste	want	app	need	version	Graphic/ Game Design/ Level and Coin
Topic 9	say	card	people	time	work	tell	download	help	datum	issue	happen	support	thing	know	want	Photo/Search

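Once the topic model has been built, new text has to pass through the same sequence of transformations before its topic can be predicted. Here the order is sent_to_words() -> lemmatization() -> vectorizer.transform() -> best_lda_model.transform(), and the steps must be applied in that order. To keep this simple, we wrap the steps in a predict_topic() function.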

# Define function to predict topic for a given text document.
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
def predict_topic(text, nlp=nlp):
    # Step 1: Clean with simple_preprocess
    mytext_2 = list(sent_to_words(text))
    # Step 2: Lemmatize (note: training above used allowed_postags=["NOUN", "VERB"])
    mytext_3 = lemmatization(mytext_2, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
    # Step 3: Vectorize transform
    mytext_4 = vectorizer.transform(mytext_3)
    # Step 4: LDA Transform
    topic_probability_scores = best_lda_model.transform(mytext_4)
    # Keywords of the most probable topic (columns Word 1 to Word 13)
    topic = df_topic_keywords.iloc[np.argmax(topic_probability_scores), 1:14].values.tolist()
    # Step 5: Infer Topic (the last column holds the manual topic label)
    infer_topic = df_topic_keywords.iloc[np.argmax(topic_probability_scores), -1]
    return infer_topic, topic, topic_probability_scores
# Predict the topic
mytext = ["Very Useful in diabetes age 30. I need control sugar. thanks Good deal"]
infer_topic, topic, prob_scores = predict_topic(text = mytext)
print(topic)
print(infer_topic)

['way', 'day', 'money', 'need', 'buy', 'star', 'make', 'lot', 'start', 'spend', 'help', 'rate', 'like']
Card Payment

Finally, we predict the topic of every review in the original dataset.

def apply_predict_topic(text):
    text = [text]
    infer_topic, topic, prob_scores = predict_topic(text = text)
    return(infer_topic)
df["Topic_key_word"]= df['Translated_Review'].apply(apply_predict_topic)
df.head()

	App	Translated_Review	Sentiment	Sentiment_Polarity	Sentiment_Subjectivity	Topic_key_word
0	10 Best Foods for You	I like eat delicious food. That's I'm cooking ...	Positive	1.00	0.533333	Card Payment
1	10 Best Foods for You	This help eating healthy exercise regular basis	Positive	0.25	0.288462	Card Payment
3	10 Best Foods for You	Works great especially going grocery store	Positive	0.40	0.875000	Device/Design/Password
4	10 Best Foods for You	Best idea us	Positive	1.00	0.300000	Notification/Support
5	10 Best Foods for You	Best way	Positive	1.00	0.300000	Card Payment

7. Conclusion

In this tutorial, we generated topics from Google Play Store reviews using LDA. We looked at two ways to import the data: (1) GridDB and (2) pandas read_csv. Because GridDB is open source and scalable, it is a good alternative for importing large datasets into a notebook. Download GridDB and give it a try.
