[Python] Introduction to Japanese Natural Language Processing with spaCy

Posted at 2024-08-27 (last updated at 2024-08-27)

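Introduction

spaCy is a powerful Python library for natural language processing (NLP). It supports Japanese and makes tasks such as morphological analysis, named entity recognition, and dependency parsing straightforward. This article walks through analyzing Japanese text with spaCy in 15 chapters.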

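Chapter 1: Installing spaCy and Basic Setup

First, install spaCy and its Japanese model. The leading "!" is notebook syntax; drop it when running the commands in a terminal.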

!pip install spacy
!python -m spacy download ja_core_news_sm

Next, let's look at basic usage.

import spacy

# Load the Japanese model
nlp = spacy.load("ja_core_news_sm")

# Analyze the text
text = "私は東京で寿司を食べました。"
doc = nlp(text)

# Print each token with its part-of-speech tag
for token in doc:
    print(f"Text: {token.text}, POS: {token.pos_}")

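Chapter 2: Morphological Analysis

spaCy's Japanese pipeline performs morphological analysis, so each token's base form (lemma) and part of speech are available as attributes.

text = "私は美味しいラーメンを食べました。"
doc = nlp(text)

for token in doc:
    print(f"Text: {token.text}, Lemma: {token.lemma_}, POS: {token.pos_}")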

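Chapter 3: Named Entity Recognition

Named entities (person names, organization names, place names, and so on) can be extracted automatically.

text = "山田太郎は東京大学で勉強しています。"
doc = nlp(text)

for ent in doc.ents:
    print(f"Text: {ent.text}, Label: {ent.label_}")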

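Chapter 4: Dependency Parsing

Dependency parsing analyzes the structure of a sentence and exposes the relations between words. Each token has a head (token.head) and a relation label (token.dep_).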

text = "私は美味しい寿司を食べました。"
doc = nlp(text)

for token in doc:
    print(f"{token.text} <- {token.head.text} ({token.dep_})")
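To see how the head and label attributes add up to a tree, here is a small sketch in plain Python. The parse is hand-written for illustration only (the tokenization, head indices, and labels are assumptions, not actual spaCy output), and the render helper is hypothetical. The one convention borrowed from spaCy is that the root token is its own head.

```python
from collections import defaultdict

# Hypothetical hand-written parse as (text, head_index, dep) triples;
# real values come from token.head.i and token.dep_ and may differ.
tokens = [
    ("私", 5, "nsubj"),
    ("は", 0, "case"),
    ("美味しい", 3, "acl"),
    ("寿司", 5, "obj"),
    ("を", 3, "case"),
    ("食べ", 5, "ROOT"),  # the root points at itself, as in spaCy
    ("まし", 5, "aux"),
    ("た", 5, "aux"),
    ("。", 5, "punct"),
]

# Invert the head pointers into a parent -> children mapping
children = defaultdict(list)
root = None
for i, (_, head, _) in enumerate(tokens):
    if head == i:
        root = i
    else:
        children[head].append(i)

def render(index, depth=0):
    """Return indented lines for the subtree rooted at index."""
    text, _, dep = tokens[index]
    lines = [f"{'  ' * depth}{text} ({dep})"]
    for child in children[index]:
        lines.extend(render(child, depth + 1))
    return lines

print("\n".join(render(root)))
```

With a real Doc, the same triples could be built from (token.text, token.head.i, token.dep_).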

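Chapter 5: Recognizing Bunsetsu (Phrase Units)

With GiNZA, you can recognize bunsetsu, the phrase units specific to Japanese [1]. This requires the ginza and ja_ginza packages (pip install ginza ja_ginza).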

import spacy
import ginza

nlp = spacy.load("ja_ginza")
text = "私は美味しい寿司を食べました。"
doc = nlp(text)

print(ginza.bunsetu_spans(doc))

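Chapter 6: Word Vectors and Similarity

spaCy's medium and large models include word vectors, which enable similarity calculations [1]. The medium model has to be downloaded first (python -m spacy download ja_core_news_md).

nlp = spacy.load("ja_core_news_md")

text = "東京は雨が降りました。今週末は雪が降るようです。"
doc = nlp(text)

# Look tokens up by text; hard-coded indices depend on how the sentence is tokenized
rain = next(token for token in doc if token.text == "雨")
snow = next(token for token in doc if token.text == "雪")
print(rain.similarity(snow))  # similarity between "雨" (rain) and "雪" (snow)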
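For context, similarity() defaults to the cosine similarity of the two vectors. The sketch below shows that calculation in plain Python. The three-dimensional vectors and the cosine_similarity helper are made up for illustration and are not values from any spaCy model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors (real model vectors have hundreds of dimensions)
rain_vec = [1.0, 2.0, 0.0]
snow_vec = [2.0, 4.0, 0.0]
sun_vec = [0.0, 0.0, 3.0]

print(cosine_similarity(rain_vec, snow_vec))  # same direction: close to 1
print(cosine_similarity(rain_vec, sun_vec))   # orthogonal: 0
```

Because only the direction of the vectors matters, scaling a vector does not change its similarity to another.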

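Chapter 7: Text Classification

spaCy can also be used to build text classification models. The snippet here only sets up a blank Japanese pipeline with a textcat component and its labels.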

import spacy
from spacy.training import Example

nlp = spacy.blank("ja")
textcat = nlp.add_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

# Prepare the training data and train the model here
# (real training code is more involved)

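Chapter 8: Extending Named Entity Recognition

Adding custom named entities lets you adapt the pipeline to a specific domain. In spaCy v3, a custom function must be registered with @Language.component before it can be added to the pipeline by name.

import spacy
from spacy.language import Language
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.load("ja_core_news_sm")

# Register the function so that nlp.add_pipe can find it by name
@Language.component("add_custom_ents")
def add_custom_ents(doc):
    new_ents = []
    for token in doc:
        if token.text == "Python":
            new_ents.append(Span(doc, token.i, token.i + 1, label="PROGRAMMING_LANGUAGE"))
    # doc.ents must not overlap; filter_spans keeps the (first) longest span,
    # so the custom entities are listed first
    doc.ents = filter_spans(new_ents + list(doc.ents))
    return doc

nlp.add_pipe("add_custom_ents", last=True)

doc = nlp("私はPythonでプログラミングを学んでいます。")
for ent in doc.ents:
    print(f"Text: {ent.text}, Label: {ent.label_}")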

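Chapter 9: Rule-Based Matching

spaCy's Matcher detects phrases that follow a token-level pattern. The pattern here matches the token "こんにちは" followed by a punctuation token.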

from spacy.matcher import Matcher

nlp = spacy.load("ja_core_news_sm")
matcher = Matcher(nlp.vocab)

pattern = [{"LOWER": "こんにちは"}, {"IS_PUNCT": True}]
matcher.add("GREETING", [pattern])

doc = nlp("こんにちは！私の名前は太郎です。")
matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end])

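Chapter 10: Document Similarity

You can compute document similarity with gensim's Doc2Vec, using spaCy for tokenization. With only three short documents the resulting similarities are unlikely to be meaningful; the code only illustrates the workflow.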

import spacy
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

nlp = spacy.load("ja_core_news_sm")

documents = ["私は犬が好きです", "私は猫が好きです", "私は動物が好きです"]

def tokenize(text):
    # gensim expects a list of strings, not a spaCy Doc
    return [token.text for token in nlp(text)]

tagged_docs = [TaggedDocument(tokenize(doc), [i]) for i, doc in enumerate(documents)]

model = Doc2Vec(tagged_docs, vector_size=5, window=2, min_count=1, workers=4)

new_doc = "私はペットが好きです"
new_vector = model.infer_vector(tokenize(new_doc))

sims = model.dv.most_similar([new_vector])
print(sims)

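Chapter 11: Sentiment Analysis

A simple form of sentiment analysis sums the polarity of each word using a sentiment dictionary.

import spacy
from spacy.language import Language
from spacy.tokens import Doc

nlp = spacy.load("ja_core_news_sm")

# Tiny sentiment dictionary (a real one would be much larger)
sentiment_dict = {"嬉しい": 1, "悲しい": -1, "楽しい": 1, "辛い": -1}

# force=True avoids an error when this code is run more than once
Doc.set_extension("sentiment", default=None, force=True)

# Register the function so that nlp.add_pipe can find it by name
@Language.component("calculate_sentiment")
def calculate_sentiment(doc):
    # Match on the lemma so that inflected forms such as "楽しかった" are also counted
    sentiment = sum(sentiment_dict.get(token.lemma_, 0) for token in doc)
    doc._.sentiment = sentiment
    return doc

nlp.add_pipe("calculate_sentiment", last=True)

doc = nlp("今日は楽しい一日でした。でも少し疲れました。")
print(f"Sentiment score: {doc._.sentiment}")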

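Chapter 12: Keyword Extraction

TF-IDF can be used to extract important keywords from text. spaCy handles tokenization and lemmatization, and scikit-learn's TfidfVectorizer computes the scores.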

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("ja_core_news_sm")

texts = [
    "私は犬が好きです。犬は忠実な動物です。",
    "猫は独立心が強い動物です。私は猫も好きです。",
    "犬と猫はどちらもペットとして人気があります。"
]

def preprocess(text):
    # Join lemmas with spaces, dropping stop words and punctuation
    doc = nlp(text)
    return " ".join([token.lemma_ for token in doc if not token.is_stop and token.pos_ != "PUNCT"])

preprocessed_texts = [preprocess(text) for text in texts]

# The default token_pattern drops single-character tokens such as "犬" and "猫",
# so treat every whitespace-separated chunk as a token instead
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+")
tfidf_matrix = vectorizer.fit_transform(preprocessed_texts)

feature_names = vectorizer.get_feature_names_out()
for i, text in enumerate(texts):
    print(f"Keywords for text {i+1}:")
    feature_index = tfidf_matrix[i, :].nonzero()[1]
    for idx in feature_index:
        print(f"\t{feature_names[idx]}: {tfidf_matrix[i, idx]:.3f}")
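To make the scores less of a black box, here is a rough pure-Python sketch of the weighting that TfidfVectorizer applies with its default settings (smooth_idf=True, norm="l2"): raw term counts multiplied by ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document vector. The tfidf helper and the pre-tokenized toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    results = []
    for doc in docs:
        tf = Counter(doc)
        weights = {term: count * idf[term] for term, count in tf.items()}
        # L2-normalize so that documents of different lengths are comparable
        norm = math.sqrt(sum(w * w for w in weights.values()))
        results.append({t: w / norm for t, w in weights.items()} if norm else {})
    return results

# Hypothetical pre-tokenized documents (for example, lemmas produced by spaCy)
docs = [
    ["犬", "好き", "犬", "忠実", "動物"],
    ["猫", "独立心", "強い", "動物", "猫", "好き"],
    ["犬", "猫", "ペット", "人気"],
]

for i, weights in enumerate(tfidf(docs), start=1):
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(f"Text {i}:", [(term, round(w, 3)) for term, w in top])
```

A term scores high when it is frequent in one document and rare across the others, which is why "ペット" outranks "犬" in the third toy document.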

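Chapter 13: Document Summarization

The TextRank algorithm can be used for extractive summarization: sentences become nodes in a graph weighted by pairwise similarity, and PageRank scores pick out the most central ones.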

import networkx as nx
import numpy as np
import spacy

def textrank_summarize(text, num_sentences=3):
    # The md model ships with word vectors; the sm model does not,
    # so similarity scores computed with it are unreliable
    nlp = spacy.load("ja_core_news_md")
    doc = nlp(text)
    sentences = [sent for sent in doc.sents if sent.text.strip()]

    # Compute pairwise sentence similarity (parse once, compare spans)
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
    for i, sent1 in enumerate(sentences):
        for j, sent2 in enumerate(sentences):
            if i != j:
                # Clip negatives: PageRank expects non-negative edge weights
                similarity_matrix[i][j] = max(sent1.similarity(sent2), 0.0)

    # Build the graph and score each sentence with PageRank
    nx_graph = nx.from_numpy_array(similarity_matrix)
    scores = nx.pagerank(nx_graph)

    # Pick the top-scoring sentences, then restore document order
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_sentences]
    summary = " ".join(sentences[i].text.strip() for i in sorted(top))

    return summary

text = """
日本の四季は美しいです。春には桜が咲き、人々は花見を楽しみます。
夏は蝉の声が聞こえ、花火大会が各地で開催されます。
秋になると紅葉が山々を彩り、冬には雪景色を楽しむことができます。
このように、日本の自然は四季折々の魅力にあふれています。
"""

summary = textrank_summarize(text)
print("Summary:")
print(summary)
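nx.pagerank hides the ranking step, so the following sketch shows the underlying idea as a plain power iteration over a weighted similarity matrix. The pagerank helper and the 3x3 matrix are assumptions for illustration, and the damping factor of 0.85 mirrors the networkx default. The networkx implementation differs in details such as convergence checks and the handling of nodes without edges.

```python
def pagerank(matrix, damping=0.85, iterations=100):
    """Power iteration over a non-negative weight matrix (list of lists)."""
    n = len(matrix)
    # Total outgoing weight per node, used to normalize edge weights
    out_weight = [sum(row) for row in matrix]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Score flowing into node j, proportional to edge weight.
            # Nodes without outgoing weight are simply skipped in this sketch.
            incoming = sum(
                scores[i] * matrix[i][j] / out_weight[i]
                for i in range(n)
                if out_weight[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Hypothetical sentence-similarity matrix (diagonal left at 0)
similarity = [
    [0.0, 0.8, 0.6],
    [0.8, 0.0, 0.1],
    [0.6, 0.1, 0.0],
]

scores = pagerank(similarity)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)  # sentence indices, most central first
```

A sentence that is similar to many other sentences collects more score, which is the sense in which TextRank treats it as representative of the document.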

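Chapter 14: Text Generation

A Markov chain can be used for simple text generation: each n-gram of tokens maps to the tokens observed after it, and generation repeatedly samples from that mapping.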

import random

def generate_markov_chain(text, n=2):
    nlp = spacy.load("ja_core_news_sm")
    doc = nlp(text)
    words = [token.text for token in doc if not token.is_space]
    
    markov_dict = {}
    for i in range(len(words) - n):
        key = tuple(words[i:i+n])
        value = words[i+n]
        if key in markov_dict:
            markov_dict[key].append(value)
        else:
            markov_dict[key] = [value]
    
    return markov_dict

def generate_sentence(markov_dict, n=2, max_words=30):
    current = random.choice(list(markov_dict.keys()))
    result = list(current)
    for _ in range(max_words - n):
        if current in markov_dict:
            next_word = random.choice(markov_dict[current])
            result.append(next_word)
            current = tuple(result[-n:])
        else:
            break
    return "".join(result)

text = """
日本の四季は美しいです。春には桜が咲き、人々は花見を楽しみます。
夏は蝉の声が聞こえ、花火大会が各地で開催されます。
秋になると紅葉が山々を彩り、冬には雪景色を楽しむことができます。
このように、日本の自然は四季折々の魅力にあふれています。
"""

markov_dict = generate_markov_chain(text)
generated_sentence = generate_sentence(markov_dict)
print("Generated sentence:")
print(generated_sentence)

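Chapter 15: Multilingual Support

spaCy supports many languages, so Japanese and English processing can be combined. The English model must be downloaded as well (python -m spacy download en_core_web_sm).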

import spacy

# Load the Japanese and English models
nlp_ja = spacy.load("ja_core_news_sm")
nlp_en = spacy.load("en_core_web_sm")

def analyze_text(text, nlp):
    doc = nlp(text)
    return [(token.text, token.pos_) for token in doc]

def contains_japanese(text):
    # Hiragana/katakana (U+3040-U+30FF) or kanji (U+4E00-U+9FFF)
    return any("\u3040" <= char <= "\u30ff" or "\u4e00" <= char <= "\u9fff" for char in text)

# Analyze Japanese text
ja_text = "私は寿司が好きです。"
ja_analysis = analyze_text(ja_text, nlp_ja)
print("Japanese analysis result:")
print(ja_analysis)

# Analyze English text
en_text = "I like sushi."
en_analysis = analyze_text(en_text, nlp_en)
print("\nEnglish analysis result:")
print(en_analysis)

# Analyze mixed text (crude per-chunk language guess based on the script)
mixed_text = "私はsushiが好きです。I love Japanese food."
mixed_analysis = []
for word in mixed_text.split():
    nlp = nlp_ja if contains_japanese(word) else nlp_en
    mixed_analysis.extend(analyze_text(word, nlp))

print("\nMixed text analysis result:")
print(mixed_analysis)
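Splitting on whitespace only works when the languages happen to be separated by spaces. As a rough alternative, the sketch below uses a regular expression to cut the text into runs of Japanese script (hiragana, katakana, kanji, and Japanese punctuation) versus everything else, so that each run could be handed to the matching pipeline. The split_by_script helper and its character ranges are assumptions for illustration, not a real language detector.

```python
import re

# CJK punctuation such as "。" and "、" (U+3000-U+303F),
# hiragana/katakana (U+3040-U+30FF), and CJK ideographs (U+4E00-U+9FFF)
JA_RUN = re.compile(r"[\u3000-\u303f\u3040-\u30ff\u4e00-\u9fff]+")

def split_by_script(text):
    """Return (lang, chunk) pairs, where lang is "ja" or "other"."""
    chunks = []
    pos = 0
    for match in JA_RUN.finditer(text):
        # Anything between two Japanese runs is treated as "other"
        if match.start() > pos:
            other = text[pos:match.start()].strip()
            if other:
                chunks.append(("other", other))
        chunks.append(("ja", match.group()))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        chunks.append(("other", tail))
    return chunks

mixed_text = "私はsushiが好きです。I love Japanese food."
for lang, chunk in split_by_script(mixed_text):
    print(lang, chunk)
```

Note that a loanword such as "sushi" inside a Japanese sentence gets cut away from its context this way, so for text like this it may work better to pass the whole sentence to the Japanese pipeline.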

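That concludes this 15-chapter tour of Japanese natural language processing with spaCy, from the basics to applied techniques. Combining these techniques makes more advanced text analysis and NLP tasks possible. Try using spaCy's features to analyze and process Japanese text yourself.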
