
Trying Out the Basics of Natural Language Processing with Python

Posted at 2024-07-17

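Introduction

Hello, I'm 岩田史門 (@SI_Monxy), a web engineer!
While working with generative AI I became interested in natural language processing, so I studied the basics of NLP in Python and wrote this article.
If you spot anything that could be improved or corrected, I would appreciate your gentle guidance in the comments!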

Overview

Natural language processing (NLP) is the technology that lets computers understand and process human language. This article walks through basic NLP techniques in Python and how to implement them.

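Text Preprocessing

Before text data can be analyzed, it usually needs preprocessing. The example here uses NLTK to lowercase the text, strip digits and punctuation, and remove English stop words: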

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the tokenizer data and the stop-word list (only needed once)
nltk.download('punkt')
nltk.download('punkt_tab')  # recent NLTK releases look for 'punkt_tab' instead of 'punkt'
nltk.download('stopwords')

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove digits
    text = re.sub(r'\d+', '', text)
    # Remove punctuation and other special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text)
    filtered_text = [word for word in tokens if word not in stop_words]
    return ' '.join(filtered_text)

# Example of preprocessing a text
text = "Natural Language Processing with Python is great!"
processed_text = preprocess_text(text)
print(processed_text)
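
To make the effect of each step easier to see, here is a minimal sketch that runs the same pipeline using only the standard library and returns every intermediate result. It is not the NLTK implementation: `preprocess_steps` and `TINY_STOP_WORDS` are hypothetical names introduced for illustration, the stop-word set is a tiny hand-written stand-in for NLTK's English list, and a plain whitespace split stands in for `word_tokenize`.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only;
# NLTK's English stop-word list is much longer.
TINY_STOP_WORDS = {"a", "an", "the", "is", "are", "with", "and", "of", "to"}

def preprocess_steps(text):
    """Apply the preprocessing steps one by one and keep each intermediate result."""
    lowered = text.lower()                        # 1. lowercase
    no_digits = re.sub(r"\d+", "", lowered)       # 2. remove digits
    no_punct = re.sub(r"[^\w\s]", "", no_digits)  # 3. remove punctuation
    tokens = no_punct.split()                     # 4. whitespace split (stand-in for word_tokenize)
    kept = [t for t in tokens if t not in TINY_STOP_WORDS]  # 5. drop stop words
    return {
        "lowered": lowered,
        "no_digits": no_digits,
        "no_punct": no_punct,
        "tokens": tokens,
        "result": " ".join(kept),
    }

steps = preprocess_steps("Natural Language Processing with Python is great!")
for name, value in steps.items():
    print(f"{name}: {value}")
# The final "result" entry is: natural language processing python great
```

Printing the intermediate values is a handy way to check that the order of the steps matters: punctuation is removed before splitting, so "great!" becomes the token "great".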

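Word Tokenization and Vectorization

Next, the text is split into words and converted into numeric data. scikit-learn's CountVectorizer builds a vocabulary from the corpus and counts how often each word occurs in each document: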

from sklearn.feature_extraction.text import CountVectorizer

# Tokenize and vectorize the texts
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())
print(X.toarray())
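
To show what the count matrix actually contains, here is a minimal bag-of-words sketch written with the standard library only. It is an illustration under assumptions rather than scikit-learn's implementation: `tokenize` and `bag_of_words` are hypothetical helper names, and the tokenizer imitates CountVectorizer's default behavior (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary).

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

def bag_of_words(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [tokenize(d) for d in docs]
    # Sorted set of all distinct tokens across the corpus
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

vocabulary, matrix = bag_of_words(corpus)
print(vocabulary)
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
for row in matrix:
    print(row)
# [0, 1, 1, 1, 0, 0, 1, 0, 1]
# [0, 2, 0, 1, 0, 1, 1, 0, 1]
# [1, 0, 0, 1, 1, 0, 1, 1, 1]
# [0, 1, 1, 1, 0, 0, 1, 0, 1]
```

Each row is one document and each column is one vocabulary word, so the 2 in the second row records that "document" appears twice in the second sentence. Word order is discarded, which is why the first and fourth documents get identical rows.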

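Part-of-Speech Tagging and Named Entity Recognition (NER)

spaCy can assign a part-of-speech tag to each token and recognize named entities in a text: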

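import spacy

# Requires the English model: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

# Part-of-speech tagging and named entity recognition
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Print each token with its part-of-speech tag and dependency label
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Print each recognized entity with its label
for ent in doc.ents:
    print(ent.text, ent.label_)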
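
Once entities have been extracted, it is often convenient to group them by label. The sketch below shows one way to do that. `group_entities` is a hypothetical helper, and the sample entities are hand-written stand-in objects exposing the same `text` and `label_` attributes as spaCy entities, so the snippet runs without spaCy installed. The labels shown are illustrative; what a real model returns depends on the model and its version.

```python
from collections import defaultdict
from types import SimpleNamespace

def group_entities(ents):
    """Group entity texts by label; accepts any objects with .text and .label_."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)

# Stand-in entities for illustration only (assumed labels, not real model output)
sample_ents = [
    SimpleNamespace(text="Apple", label_="ORG"),
    SimpleNamespace(text="U.K.", label_="GPE"),
    SimpleNamespace(text="$1 billion", label_="MONEY"),
]

print(group_entities(sample_ents))
```

With spaCy, the same helper can be called as `group_entities(doc.ents)`.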

With these techniques and sample code, you are ready to start doing natural language processing in Python.

