[Machine Learning / Deep Learning] Comparing LSTM and RNN on review sentiment classification with Amazon Reviews

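Posted at 2026-01-06 (last updated at 2026-01-06)

In this article, I implement a binary classification task that uses Amazon review text to decide whether a review is positive or negative.
Two models, an RNN and an LSTM, are trained and used to produce predictions.
Finally, I build a comparison table summarizing each model's accuracy and per-class accuracy.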

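Prerequisites

The environment is Google Colab with a Python 3 runtime (T4 GPU).
※ See: 機械学習・深層学習を勉強する際の検証用環境について (on test environments for studying machine learning and deep learning)
The full code for this article can be downloaded here. It is an .ipynb file that can be uploaded to your own Google Drive and run as-is.
References and links for the mathematical background and terminology are listed at the bottom (this article does not explain them in detail, but aims to make the flow and what was done clear).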

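Overall flow and overview

Data preparation
Lowercase the review text, keep digits, and split it into words (tokenization). The 50,000 most frequent words form the vocabulary, and each word is converted to an integer ID (encoding).
Creating the DataLoader
Reviews vary in length, so they are padded into a shape that allows batched training.
Building the models
Two models are built: an RNN and an LSTM. The LSTM is good at retaining long-range context, while the plain RNN is simpler and tends to capture mainly short-range context.
Training and evaluation
Each model is trained and its accuracy is checked on the test data. Sample reviews are printed together with the ground truth, prediction, and confidence.
Comparing the results
Misclassified reviews are inspected, and a table of accuracy and per-class accuracy is produced. In this run, the LSTM classified more accurately than the RNN, plausibly because it can take more context into account.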

Implementation

1. Loading the data

Fetch Amazon Reviews (amazon_polarity) with the datasets library.
Reduce the sample size to 50,000 training and 10,000 test examples to keep things light.
Labels are binary: 0: negative / 1: positive.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from datasets import load_dataset
from torch.nn.utils.rnn import pad_sequence
import re
from collections import Counter

device = "cuda" if torch.cuda.is_available() else "cpu"

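# --- Load the data ---
dataset = load_dataset("amazon_polarity")
# Slice the rows first, then pick columns, so only the subset is materialized
train_subset = dataset["train"][:50000]  # reduced for speed
test_subset  = dataset["test"][:10000]
train_texts, train_labels = train_subset["content"], train_subset["label"]
test_texts,  test_labels  = test_subset["content"], test_subset["label"]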
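Because the subsets are taken as the first N rows rather than sampled at random, it may be worth checking that both classes are reasonably represented. The following is a minimal sketch of such a check. `sample_labels` and `label_distribution` are hypothetical names used only for illustration; in the notebook, `train_labels` would be passed instead.

```python
from collections import Counter

# Hypothetical stand-in for train_labels (0: negative, 1: positive)
sample_labels = [1, 0, 1, 1, 0, 0, 1, 0]

def label_distribution(labels):
    """Return the fraction of each label value in the list."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}

dist = label_distribution(sample_labels)
print(dist)  # {0: 0.5, 1: 0.5}
```

If one class turned out to dominate, overall Accuracy alone would be a less informative metric, which is one reason to also look at per-class accuracy.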

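2. Preprocessing and tokenization

Lowercase the text and remove every character other than letters, digits, and spaces.
Split into words (tokenization).
Digits are kept, so numeric information in the reviews is available for training.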

# --- Tokenization (digits are kept) ---
def tokenize(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9 ]", "", text)  # keep lowercase letters, digits, and spaces
    return text.split()

train_tokens = [tokenize(t) for t in train_texts]
test_tokens  = [tokenize(t) for t in test_texts]
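To make the effect of this preprocessing concrete, here is the same `tokenize` function applied to a made-up review sentence (the sentence is an illustrative assumption, not taken from the dataset).

```python
import re

def tokenize(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9 ]", "", text)  # delete everything except a-z, 0-9, and spaces
    return text.split()

tokens = tokenize("I LOVE this product!! 10/10, don't hesitate.")
print(tokens)
# ['i', 'love', 'this', 'product', '1010', 'dont', 'hesitate']
```

Note that removed characters are deleted rather than replaced by a space, so "10/10" becomes the single token "1010" and "don't" becomes "dont". This is simple and fast, at the cost of merging some tokens.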

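3. Building the vocabulary

Count word frequencies over the entire training set.
Use the 50,000 most frequent words as the vocabulary.
Define <pad> and <unk> to handle padding and unknown words.

# --- Build the vocabulary (top 50,000 words) ---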
counter = Counter()
for tokens in train_tokens:
    counter.update(tokens)

vocab = {"<pad>": 0, "<unk>": 1}
for word, _ in counter.most_common(50000):
    vocab[word] = len(vocab)

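def encode(tokens):
    # dtype=torch.long keeps the tensor integer-typed even when a review has no tokens
    return torch.tensor([vocab.get(t, vocab["<unk>"]) for t in tokens], dtype=torch.long)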

train_encoded = [encode(t) for t in train_tokens]
test_encoded  = [encode(t) for t in test_tokens]
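The vocabulary and encoding steps can be traced on a tiny corpus. This is a simplified sketch: `toy_tokens` is made-up data, the vocabulary is cut to the top 2 words instead of 50,000, and `encode_ids` is a hypothetical plain-list variant of `encode` so the example runs without PyTorch.

```python
from collections import Counter

# Made-up tokenized reviews
toy_tokens = [["good", "book"], ["good", "book"], ["bad", "book"]]

counter = Counter()
for tokens in toy_tokens:
    counter.update(tokens)

# Reserve 0 for padding and 1 for unknown words, then add the most frequent words
vocab = {"<pad>": 0, "<unk>": 1}
for word, _ in counter.most_common(2):  # the article uses 50000 here
    vocab[word] = len(vocab)

def encode_ids(tokens):
    """Map each token to its ID; words outside the vocabulary map to <unk> (1)."""
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

ids = encode_ids(["good", "book", "terrible"])
print(vocab)  # {'<pad>': 0, '<unk>': 1, 'book': 2, 'good': 3}
print(ids)    # [3, 2, 1]
```

"bad" appears in the corpus but falls outside the top 2 words, so it would also be encoded as 1, just like the unseen word "terrible".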

4. Encoding the data & DataLoader

Convert tokens to vocabulary indices.
Pad variable-length reviews so they can be processed in batches.
Use the PyTorch DataLoader for mini-batch training.

# --- DataLoader ---
def collate_fn(batch):
    texts, labels = zip(*batch)
    texts = pad_sequence(texts, batch_first=True)
    labels = torch.tensor(labels)
    return texts, labels

train_loader = DataLoader(list(zip(train_encoded, train_labels)), batch_size=64, shuffle=True, collate_fn=collate_fn)
test_loader  = DataLoader(list(zip(test_encoded, test_labels)), batch_size=64, collate_fn=collate_fn)
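As a small illustration of what the padding step inside `collate_fn` produces, the snippet below pads two made-up ID sequences of different lengths. The input values are assumptions chosen for the example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two encoded "reviews" of different lengths (made-up IDs)
seqs = [torch.tensor([5, 6, 7]), torch.tensor([8])]

# pad_sequence fills shorter sequences with 0 by default, which matches the <pad> ID
padded = pad_sequence(seqs, batch_first=True)

# Since 0 is reserved for <pad>, the true lengths can be recovered by counting non-zero IDs
lengths = (padded != 0).sum(dim=1)

print(padded.tolist())   # [[5, 6, 7], [8, 0, 0]]
print(lengths.tolist())  # [3, 1]
```

Each batch is padded only to the length of its own longest review, so batches containing short reviews stay small.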

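5. Model definition

The SentimentRNN class can use either an LSTM or an RNN.
Embedding → RNN (LSTM) → Linear → output logits (2 classes).
The LSTM learns long-term dependencies on past information more easily, while the plain RNN struggles with long-term dependencies and mainly captures short-term ones.

# --- Model definition (LSTM / RNN) ---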
class SentimentRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, rnn_type="LSTM"):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        if rnn_type == "LSTM":
            self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        else:
            self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 2)
        self.rnn_type = rnn_type

    def forward(self, x):
        x = self.embedding(x)
        if self.rnn_type == "LSTM":
            _, (h, _) = self.rnn(x)
        else:
            _, h = self.rnn(x)
        out = self.fc(h[-1])
        return out
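One detail of the model above is that `h[-1]` is the hidden state after the last time step of the padded batch, so for shorter reviews it is taken after the trailing `<pad>` positions have also been processed. Whether this matters in practice is not measured in this article. As a possible variation, the sketch below uses `pack_padded_sequence` so that the final hidden state corresponds to the last real token. `PackedSentimentLSTM` is a hypothetical class written for illustration, with small dimensions, and it assumes that ID 0 is used only for padding.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class PackedSentimentLSTM(nn.Module):
    """Hypothetical LSTM-only variant that skips padded positions."""
    def __init__(self, vocab_size, embed_dim=8, hidden_dim=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 2)

    def forward(self, x):
        # True length of each sequence = number of non-pad IDs (at least 1 to keep packing valid)
        lengths = (x != 0).sum(dim=1).clamp(min=1).cpu()
        packed = pack_padded_sequence(
            self.embedding(x), lengths, batch_first=True, enforce_sorted=False
        )
        _, (h, _) = self.rnn(packed)
        return self.fc(h[-1])

torch.manual_seed(0)
model = PackedSentimentLSTM(vocab_size=20).eval()
with torch.no_grad():
    out_short  = model(torch.tensor([[5, 6, 7]]))
    out_padded = model(torch.tensor([[5, 6, 7, 0, 0]]))

# The same review gives the same logits regardless of how much padding follows it
print(torch.allclose(out_short, out_padded, atol=1e-6))  # True
```

With this variant, a review's prediction does not depend on how long the other reviews in its batch happen to be.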

6. Training the models

Train with CrossEntropyLoss and the Adam optimizer.
Print the loss at every epoch.
After training, compute accuracy on the test data.

# --- Training function ---
def train_model(model, train_loader, test_loader, epochs=5, lr=1e-3):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device)

    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            out = model(x)
            loss = criterion(out, y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
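        # Report the mean loss per batch so the value does not depend on the number of batches
        print(f"{model.rnn_type} Epoch {epoch+1}, Avg Loss: {total_loss / len(train_loader):.3f}")

    # Evaluation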
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in test_loader:
            x, y = x.to(device), y.to(device)
            pred = model(x).argmax(1)
            correct += (pred == y).sum().item()
            total += y.size(0)
    acc = correct / total
    print(f"{model.rnn_type} Test Accuracy: {acc:.4f}")
    return model, acc

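# --- Labels & decoding IDs back to words ---
label_map = {0: "negative", 1: "positive"}
idx_to_word = {idx: word for word, idx in vocab.items()}

def decode_review(token_ids):
    # Convert to plain ints first: tensor elements hash by identity and would never match the dict keys
    ids = token_ids.tolist() if torch.is_tensor(token_ids) else token_ids
    words = [idx_to_word.get(i, "<unk>") for i in ids if i != 0]
    return " ".join(words)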

def visualize_review(token_ids, y_true, y_pred, confidence, orig_text="", title="Review"):
    print(f"{title}:")
    print("Original:", orig_text)  # also show the original text
    print("Tokenized:", decode_review(token_ids))  # token sequence fed to the model
    print(f"Ground Truth : {label_map[y_true]}")
    print(f"Prediction   : {label_map[y_pred]}")
    print(f"Confidence   : {confidence:.3f}")
    print("-" * 80)

# --- Build & train the models ---
lstm_model = SentimentRNN(len(vocab), rnn_type="LSTM")
rnn_model  = SentimentRNN(len(vocab), rnn_type="RNN")

lstm_model, lstm_acc = train_model(lstm_model, train_loader, test_loader)
rnn_model, rnn_acc   = train_model(rnn_model, train_loader, test_loader)

print(f"\nComparison -> LSTM Accuracy: {lstm_acc:.4f}, RNN Accuracy: {rnn_acc:.4f}\n")
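As a small worked example of the loss used in `train_model`, the snippet below checks that `nn.CrossEntropyLoss` on raw logits equals the negative log of the softmax probability assigned to the true class. The logits and target are made-up values for a single review.

```python
import math
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.0]])  # hypothetical model output: [negative, positive]
target = torch.tensor([0])           # true class: negative

# CrossEntropyLoss applies log-softmax internally, so the model outputs raw logits
loss = nn.CrossEntropyLoss()(logits, target).item()

# Manual computation: -log(softmax(logits)[true class])
manual = -math.log(math.exp(2.0) / (math.exp(2.0) + math.exp(0.0)))

print(round(loss, 4), round(manual, 4))  # 0.1269 0.1269
```

This is why the model's `forward` returns logits without a softmax layer, and softmax is applied only later when a confidence value is needed for display.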

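7. Comparing LSTM and RNN on samples

Compare both models' predictions on 5 samples.
Also show up to 5 misclassified reviews.
The LSTM can more easily take the context of the whole review into account, which plausibly explains why it was more accurate than the RNN here.

# --- Compare on sample reviews ---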
def compare_models_on_sample(idx):
    x = test_encoded[idx].unsqueeze(0).to(device)
    y_true = test_labels[idx]
    orig_text = test_texts[idx]
    with torch.no_grad():
        # Softmax once per model; max() returns the confidence and the predicted class together
        lstm_conf, lstm_pred = torch.softmax(lstm_model(x), dim=1).max(dim=1)
        rnn_conf, rnn_pred   = torch.softmax(rnn_model(x), dim=1).max(dim=1)
    lstm_conf, lstm_pred = lstm_conf.item(), lstm_pred.item()
    rnn_conf, rnn_pred   = rnn_conf.item(), rnn_pred.item()
    visualize_review(test_encoded[idx], y_true, lstm_pred, lstm_conf, orig_text, title=f"LSTM/RNN Sample {idx+1}")
    print(f"LSTM Prediction: {label_map[lstm_pred]} (conf: {lstm_conf:.3f})")
    print(f"RNN  Prediction: {label_map[rnn_pred]} (conf: {rnn_conf:.3f})")
    print("-"*80)

for i in range(5):
    compare_models_on_sample(i)

# --- Compare on misclassified reviews ---
def compare_misclassified(max_samples=5):
    misclassified_count = 0
    for i, (x_tokens, y_true) in enumerate(zip(test_encoded, test_labels)):
        x = x_tokens.unsqueeze(0).to(device)
        orig_text = test_texts[i]
        with torch.no_grad():
            # Run each model once and reuse its probabilities
            lstm_conf, lstm_pred = torch.softmax(lstm_model(x), dim=1).max(dim=1)
            rnn_conf, rnn_pred   = torch.softmax(rnn_model(x), dim=1).max(dim=1)
        lstm_conf, lstm_pred = lstm_conf.item(), lstm_pred.item()
        rnn_conf, rnn_pred   = rnn_conf.item(), rnn_pred.item()
        if lstm_pred != y_true or rnn_pred != y_true:
            misclassified_count += 1
            visualize_review(x_tokens, y_true, lstm_pred, lstm_conf, orig_text, title=f"Misclassified Review {misclassified_count}")
            print(f"LSTM Prediction: {label_map[lstm_pred]} (conf: {lstm_conf:.3f})")
            print(f"RNN  Prediction: {label_map[rnn_pred]} (conf: {rnn_conf:.3f})")
            print("-"*80)
            if misclassified_count >= max_samples:
                break

compare_misclassified(max_samples=5)

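※ Sample review output: for each review, the ground truth is shown alongside the LSTM and RNN predictions.

※ Misclassified review output: reviews for which at least one model predicted incorrectly. In most of them the LSTM is correct and the RNN is wrong.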
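The "confidence" printed for each review is the softmax probability of the predicted class. A minimal sketch with made-up logits shows how it is derived, and that softmax does not change which class is predicted.

```python
import torch

logits = torch.tensor([[1.0, 3.0]])  # hypothetical output: [negative, positive]

probs = torch.softmax(logits, dim=1)  # convert logits to probabilities that sum to 1
conf, pred = probs.max(dim=1)         # highest probability and its class index

print(pred.item(), round(conf.item(), 3))  # 1 0.881
print(logits.argmax(1).item())             # 1 (same class as with softmax)
```

A high value here only means the model is sure of itself, not that it is right; misclassified reviews can still show high confidence.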

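8. Building a detailed comparison table

compute_detailed_metrics calculates the following:
- Accuracy (overall accuracy)
- Misclassified Rate (error rate, 1 - Accuracy)
- Positive Accuracy (accuracy on positive examples only, i.e. recall for the positive class)
- Negative Accuracy (accuracy on negative examples only, i.e. recall for the negative class)
The LSTM and RNN metrics are collected into a DataFrame for comparison.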

import pandas as pd

# --- Build the detailed comparison table ---
def compute_detailed_metrics(model, test_encoded, test_labels):
    """Compute detailed evaluation metrics for one model."""
    correct, total = 0, 0
    tp, tn, fp, fn = 0, 0, 0, 0  # counts by true class and correctness
    for x_tokens, y_true in zip(test_encoded, test_labels):
        x = x_tokens.unsqueeze(0).to(device)
        with torch.no_grad():
            pred = torch.softmax(model(x), dim=1).argmax(1).item()
        total += 1
        if pred == y_true:
            correct += 1
            if y_true == 1:
                tp += 1
            else:
                tn += 1
        else:
            if y_true == 1:
                fn += 1
            else:
                fp += 1
    accuracy = correct / total
    pos_acc = tp / (tp + fn) if (tp + fn) > 0 else 0
    neg_acc = tn / (tn + fp) if (tn + fp) > 0 else 0
    misclassified_rate = 1 - accuracy
    return {
        "Accuracy": accuracy,
        "Misclassified Rate": misclassified_rate,
        "Positive Accuracy": pos_acc,
        "Negative Accuracy": neg_acc
    }

# Detailed evaluation of each model
lstm_metrics = compute_detailed_metrics(lstm_model, test_encoded, test_labels)
rnn_metrics  = compute_detailed_metrics(rnn_model, test_encoded, test_labels)

# Collect into a DataFrame
df_metrics = pd.DataFrame([lstm_metrics, rnn_metrics], index=["LSTM", "RNN"])
print("\nDetailed Comparison Table:")
print(df_metrics)

※ Output: a table summarizing Accuracy and the other metrics. The RNN has the lower Accuracy.
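To show how the four metrics relate to the TP / TN / FP / FN counts, here is a small worked example on made-up labels and predictions. `detailed_metrics` is a hypothetical helper that takes predictions directly instead of a model, and the numbers are not results from this article.

```python
def detailed_metrics(y_true, y_pred):
    """Compute the same four metrics from lists of true and predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = tp + tn + fp + fn
    return {
        "Accuracy": (tp + tn) / total,
        "Misclassified Rate": (fp + fn) / total,
        "Positive Accuracy": tp / (tp + fn) if (tp + fn) > 0 else 0,
        "Negative Accuracy": tn / (tn + fp) if (tn + fp) > 0 else 0,
    }

# 4 positive and 6 negative reviews (made-up)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

metrics = detailed_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# Accuracy: 0.700
# Misclassified Rate: 0.300
# Positive Accuracy: 0.750
# Negative Accuracy: 0.667
```

Here the overall Accuracy of 0.700 hides the fact that negatives are classified less reliably (0.667) than positives (0.750), which is the kind of difference the per-class columns are meant to expose.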

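Closing thoughts

Accuracy tended to be higher for the LSTM, which is consistent with its ability to retain long-term dependency information.
The misclassification rate was lower for the LSTM and higher for the RNN.
For Positive / Negative Accuracy, the LSTM was stable across both classes, while the RNN tended to misclassify short reviews and negative reviews.
Overall, the difference in the ability to capture context appears to show up directly as the difference in accuracy.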

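References and links

ゼロからつくるPython機械学習プログラミング入門
詳解ディープラーニング第２版
※ Despite the "詳解" (in-depth) in the title, it explains things carefully, starting from introductory material.
YouTube channel - 予備校のノリで学ぶ「大学の数学・物理」
※ For learning the mathematical background, this was the clearest resource I have found.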
