
Getting Started with NLP in PyTorch (Official Tutorial)

Posted at 2018-07-03 (last updated at 2018-07-03)

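Lately I have been studying natural language processing with neural networks.
I heard that PyTorch, the deep learning framework developed by Facebook, is convenient, so I am trying it out. The PyTorch tutorials include a section called "Deep Learning for NLP with PyTorch", so I summarize its key points here, partly as notes for myself.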

1. Introduction to PyTorch

Introduction to Torch’s tensor library

PyTorch represents tensors (multi-dimensional arrays) with torch.Tensor objects.

# Author: Robert Guthrie

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

# torch.tensor(data) creates a torch.Tensor object with the given data.
V_data = [1., 2., 3.]
V = torch.tensor(V_data)
print(V)

# Creates a matrix
M_data = [[1., 2., 3.], [4., 5., 6]]
M = torch.tensor(M_data)
print(M)

# Create a 3D tensor of size 2x2x2.
T_data = [[[1., 2.], [3., 4.]],
          [[5., 6.], [7., 8.]]]
T = torch.tensor(T_data)
print(T)

The default dtype for floating-point data is float (torch.float32), but integer tensors can also be created with torch.LongTensor().
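As a small sketch of the dtype behavior described above (the variable names f, a, and b are only for illustration, and the default dtype is assumed not to have been changed), the following creates a float tensor and two equivalent integer tensors:

```python
import torch

# Float data gets the default floating-point dtype (torch.float32 unless changed).
f = torch.tensor([1., 2., 3.])

# Two ways to build a 64-bit integer tensor.
a = torch.LongTensor([1, 2, 3])
b = torch.tensor([1, 2, 3], dtype=torch.long)

print(f.dtype, a.dtype, b.dtype)
```

Integer tensors matter later in the post, because class labels and embedding indices have to be passed as integer tensors.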

# By default, it concatenates along the first axis (concatenates rows)
x_1 = torch.randn(2, 5)
y_1 = torch.randn(3, 5)
z_1 = torch.cat([x_1, y_1])
print(z_1)

# Concatenate columns:
x_2 = torch.randn(2, 3)
y_2 = torch.randn(2, 5)
# second arg specifies which axis to concat along
z_2 = torch.cat([x_2, y_2], 1)
print(z_2)

# If your tensors are not compatible, torch will complain.  Uncomment to see the error
# torch.cat([x_1, x_2])

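torch.cat concatenates tensors. By default it concatenates along axis 0 (the first dimension, corresponding to the outermost brackets). To concatenate along another axis, pass that axis as the second argument.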
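Printing shapes rather than values makes the effect of the axis argument easier to see. This is a minimal sketch using the same sizes as the example above; the names rows and cols are only for illustration:

```python
import torch

x_1 = torch.randn(2, 5)
y_1 = torch.randn(3, 5)
rows = torch.cat([x_1, y_1])         # axis 0: (2 + 3, 5)

x_2 = torch.randn(2, 3)
y_2 = torch.randn(2, 5)
cols = torch.cat([x_2, y_2], dim=1)  # axis 1: (2, 3 + 5)

print(rows.shape, cols.shape)
```

All dimensions other than the one being concatenated must match, which is why torch.cat([x_1, x_2]) fails.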

x = torch.randn(2, 3, 4)
print(x)
print(x.view(2, 12))  # Reshape to 2 rows, 12 columns
# Same as above.  If one of the dimensions is -1, its size can be inferred
print(x.view(2, -1))

The view method reshapes a tensor. Passing -1 for one dimension lets PyTorch infer its size from the other dimensions and the total number of elements.

Computation Graphs and Automatic Differentiation

A computation graph records how the data were combined to produce an output. It holds the information needed to compute derivatives, so you do not have to implement backpropagation yourself.

# Tensor factory methods have a ``requires_grad`` flag
x = torch.tensor([1., 2., 3], requires_grad=True)

# With requires_grad=True, you can still do all the operations you previously
# could
y = torch.tensor([4., 5., 6], requires_grad=True)
z = x + y
print(z)

# BUT z knows something extra.
print(z.grad_fn)

# Lets sum up all the entries in z
s = z.sum()
print(s)
print(s.grad_fn)

In the example above, the graph records that s was computed as the sum of the entries of tensor z, and that z was computed as the sum of x and y.

$$s = \overbrace{x_0 + y_0}^\text{$z_0$} + \overbrace{x_1 + y_1}^\text{$z_1$} + \overbrace{x_2 + y_2}^\text{$z_2$}$$

From this we can see that, for example, the following derivative equals 1.

$$\frac{\partial s}{\partial x_0}$$

Let's actually compute the gradient.

# calling .backward() on any variable will run backprop, starting from it.
s.backward()
print(x.grad)

The requires_grad_ method updates the requires_grad flag in place. detach and torch.no_grad stop autograd from tracking the computation history.

x = torch.randn(2, 2)
y = torch.randn(2, 2)
# By default, user created Tensors have ``requires_grad=False``
print(x.requires_grad, y.requires_grad)
z = x + y
# So you can't backprop through z
print(z.grad_fn)

# ``.requires_grad_( ... )`` changes an existing Tensor's ``requires_grad``
# flag in-place. The input flag defaults to ``True`` if not given.
x = x.requires_grad_()
y = y.requires_grad_()
# z contains enough information to compute gradients, as we saw above
z = x + y
print(z.grad_fn)
# If any input to an operation has ``requires_grad=True``, so will the output
print(z.requires_grad)

# Now z has the computation history that relates itself to x and y
# Can we just take its values, and **detach** it from its history?
new_z = z.detach()

# ... does new_z have information to backprop to x and y?
# NO!
print(new_z.grad_fn)
# And how could it? ``z.detach()`` returns a tensor that shares the same storage
# as ``z``, but with the computation history forgotten. It doesn't know anything
# about how it was computed.
# In essence, we have broken the Tensor away from its past history


# another way to stop autograd from tracking history 
print(x.requires_grad)
print((x ** 2).requires_grad)

with torch.no_grad():
    print((x ** 2).requires_grad)

2. Deep Learning with PyTorch

This part of the tutorial is text-heavy, so only a summary is given here.

Affine Maps

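A function $f(x)$ that can be written as $f(x)=Ax+b$ is called an affine map.
In code, each $x$ is treated as a row vector and the inputs are stacked vertically into a matrix $X$, so the map is written as $f=XW+b$ (for nn.Linear, $W=A^\top$). That is, the $i$-th row of the output is the image of the $i$-th row of the input.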

# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

lin = nn.Linear(5, 3)  # maps from R^5 to R^3, parameters A, b
# data is 2x5.  A maps from 5 to 3... can we map "data" under A?
data = torch.randn(2, 5)
print(lin(data))  # yes
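To make the row-wise convention concrete, the output of nn.Linear can be reproduced by hand. This is a sketch only; manual, out, and row0 are names introduced here for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
lin = nn.Linear(5, 3)
data = torch.randn(2, 5)

# nn.Linear stores its weight with shape (out_features, in_features) = (3, 5),
# so applying it to row vectors means multiplying by the transpose.
manual = data @ lin.weight.t() + lin.bias
out = lin(data)
print(torch.allclose(out, manual, atol=1e-6))

# Each output row depends only on the matching input row.
row0 = lin(data[0:1])
print(torch.allclose(out[0:1], row0, atol=1e-6))
```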

Non-linearities

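Composing affine maps only yields another affine map, so stacking them adds no expressive power to the model. Introducing non-linearities between them is what makes more powerful models possible. Sigmoid, tanh, and ReLU are the common choices, but sigmoid is rarely used in practice because of the vanishing gradient problem.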

# In pytorch, most non-linearities are in torch.nn.functional (we have it imported as F)
# Note that non-linearities typically don't have parameters like affine maps do.
# That is, they don't have weights that are updated during training.
data = torch.randn(2, 2)
print(data)
print(F.relu(data))
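The claim that stacked affine maps collapse into a single affine map can be checked numerically. In this sketch the layer sizes and the names f1, f2, W, and b are arbitrary choices made for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
f1 = nn.Linear(4, 6)
f2 = nn.Linear(6, 3)
x = torch.randn(5, 4)

with torch.no_grad():
    stacked = f2(f1(x))
    # f2(f1(x)) = x W1^T W2^T + (W2 b1 + b2), which is again one affine map.
    W = f2.weight @ f1.weight            # shape (3, 4)
    b = f2.weight @ f1.bias + f2.bias    # shape (3,)
    single = x @ W.t() + b

print(torch.allclose(stacked, single, atol=1e-5))
```

Placing a non-linearity such as F.relu between f1 and f2 breaks this identity in general, which is what gives depth its value.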

Softmax and Probabilities

Softmax is also a non-linear function. It takes a real-valued vector and returns a probability distribution, so it is often used as the last layer of a network.

# Softmax is also in torch.nn.functional
data = torch.randn(5)
print(data)
print(F.softmax(data, dim=0))
print(F.softmax(data, dim=0).sum())  # Sums to 1 because it is a distribution!
print(F.log_softmax(data, dim=0))  # theres also log_softmax
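For reference, softmax is $\mathrm{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}$. The sketch below recomputes it from this definition and confirms that log_softmax matches the logarithm of softmax. The name manual is introduced here for illustration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1)
data = torch.randn(5)

# Softmax by definition: exponentiate, then normalize so the entries sum to 1.
manual = torch.exp(data) / torch.exp(data).sum()
probs = F.softmax(data, dim=0)
log_probs = F.log_softmax(data, dim=0)

print(torch.allclose(probs, manual, atol=1e-6))
print(torch.allclose(log_probs, torch.log(probs), atol=1e-6))
```

In practice log_softmax is preferred over taking the log of softmax separately because it is numerically more stable.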

Optimization and Training

torch.optim implements a variety of optimization algorithms.
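As a minimal sketch of how an optimizer from torch.optim is used, the following minimizes a toy quadratic. The problem itself (finding w that minimizes (w - 3)^2) is a hypothetical example, not taken from the tutorial:

```python
import torch
import torch.optim as optim

# Hypothetical toy problem: find w that minimizes (w - 3)^2.
w = torch.tensor([0.0], requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)

for step in range(100):
    optimizer.zero_grad()            # clear gradients accumulated so far
    loss = ((w - 3.0) ** 2).sum()
    loss.backward()                  # compute d(loss)/dw
    optimizer.step()                 # w <- w - lr * grad

print(w.item())  # close to 3.0
```

The same zero_grad / backward / step cycle appears in every training loop later in this post.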

Creating Network Components in PyTorch

Every network component must inherit from nn.Module and override the forward method.

Example: Logistic Regression Bag-of-Words classifier

Consider mapping a sparse Bag-of-Words representation to probabilities over two labels, "English" and "Spanish".

data = [("me gusta comer en la cafeteria".split(), "SPANISH"),
        ("Give it to me".split(), "ENGLISH"),
        ("No creo que sea una buena idea".split(), "SPANISH"),
        ("No it is not a good idea to get lost at sea".split(), "ENGLISH")]

test_data = [("Yo creo que si".split(), "SPANISH"),
             ("it is lost on me".split(), "ENGLISH")]

# word_to_ix maps each word in the vocab to a unique integer, which will be its
# index into the Bag of words vector
word_to_ix = {}
for sent, _ in data + test_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)

VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 2


class BoWClassifier(nn.Module):  # inheriting from nn.Module!

    def __init__(self, num_labels, vocab_size):
        # calls the init function of nn.Module.  Dont get confused by syntax,
        # just always do it in an nn.Module
        super(BoWClassifier, self).__init__()

        # Define the affine map with nn.Linear
        self.linear = nn.Linear(vocab_size, num_labels)

        # NOTE! The non-linearity log softmax does not have parameters! So we don't need
        # to worry about that here

    def forward(self, bow_vec):
        # Pass the input through the linear layer,
        # then pass that through log_softmax.
        # Many non-linearities and other functions are in torch.nn.functional
        return F.log_softmax(self.linear(bow_vec), dim=1)


def make_bow_vector(sentence, word_to_ix):
    vec = torch.zeros(len(word_to_ix))
    for word in sentence:
        vec[word_to_ix[word]] += 1
    # Add a leading dimension so the shape is (1, vocab_size)
    return vec.view(1, -1)


def make_target(label, label_to_ix):
    return torch.LongTensor([label_to_ix[label]])


model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)

# The parameters are initialized randomly.
# the model knows its parameters.  The first output below is A, the second is b.
# Whenever you assign a component to a class variable in the __init__ function
# of a module, which was done with the line
# self.linear = nn.Linear(...)
# Then through some Python magic from the PyTorch devs, your module
# (in this case, BoWClassifier) will store knowledge of the nn.Linear's parameters
for param in model.parameters():
    print(param)

# Run the model without any training.
# To run the model, pass in a BoW vector
# Here we don't need to train, so the code is wrapped in torch.no_grad()
with torch.no_grad():
    sample = data[0]
    bow_vector = make_bow_vector(sample[0], word_to_ix)
    log_probs = model(bow_vector)
    print(log_probs)

Before training, first decide which of the two label indices corresponds to Spanish and which to English.

label_to_ix = {"SPANISH": 0, "ENGLISH": 1}

Before training, check the predictions on the test data and the parameters.

# Run on test data before we train, just to see a before-and-after
with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs)

# Print the matrix column corresponding to "creo"
print(next(model.parameters())[:, word_to_ix["creo"]])

tensor([[-0.6250, -0.7662]])
tensor([[-0.5870, -0.8119]])
tensor([ 0.0544,  0.1722])

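As expected, the untrained model cannot tell the two languages apart: both sentences receive a slightly higher log probability for label 0 (Spanish), so the English sentence is misclassified.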

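Next, train the model on the data: define a loss function, compute gradients, and update the parameters. NLLLoss expects log probabilities as input, which is why the output layer uses log softmax. (nn.CrossEntropyLoss applies the log softmax internally, so it can be fed raw scores instead.)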

loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Usually you want to pass over the training data several times.
# 100 is much bigger than on a real data set, but real datasets have more than
# two instances.  Usually, somewhere between 5 and 30 epochs is reasonable.
for epoch in range(100):
    for instance, label in data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Make our BOW vector and also we must wrap the target in a
        # Tensor as an integer. For example, if the target is SPANISH, then
        # we wrap the integer 0. The loss function then knows that the 0th
        # element of the log probabilities is the log probability
        # corresponding to SPANISH
        bow_vec = make_bow_vector(instance, word_to_ix)
        target = make_target(label, label_to_ix)

        # Step 3. Run our forward pass.
        log_probs = model(bow_vec)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss = loss_function(log_probs, target)
        loss.backward()
        optimizer.step()

with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs)

# Index corresponding to Spanish goes up, English goes down!
print(next(model.parameters())[:, word_to_ix["creo"]])

tensor([[-0.1210, -2.1721]])
tensor([[-2.7767, -0.0643]])
tensor([ 0.5004, -0.2738])

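Both test sentences are now classified correctly. Also, since "creo" is a Spanish word, its weight for label 0 (Spanish) has grown (0.0544 to 0.5004) while its weight for label 1 (English) has dropped.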
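The relationship between NLLLoss and nn.CrossEntropyLoss mentioned earlier can be verified directly. The logits and targets below are hypothetical values made up for this sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1)
logits = torch.randn(4, 2)            # hypothetical raw scores: 4 examples, 2 labels
targets = torch.tensor([0, 1, 0, 1])  # hypothetical label indices

# NLLLoss on log-probabilities...
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
# ...equals CrossEntropyLoss on the raw scores.
ce = nn.CrossEntropyLoss()(logits, targets)

print(torch.allclose(nll, ce, atol=1e-6))
```

So a model that returns raw scores from its last linear layer can be trained with nn.CrossEntropyLoss and give the same loss values as the log_softmax plus NLLLoss combination used here.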

3. Word Embeddings: Encoding Lexical Semantics

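A word embedding represents a word as a dense real-valued vector and can capture the word's semantics. A one-hot representation, by contrast, makes the matrices huge and treats every word as independent, so it cannot express relationships in meaning.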

When using word embeddings in deep learning, each word is assigned an index. Call this mapping word_to_ix.

# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

An Example: N-Gram Language Modeling

An n-gram language model estimates the following probability, where $w_i$ denotes the $i$-th word of the text.
$$P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} )$$

CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# print the first 3, just so you can see what they look like
print(trigrams[:3])

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]

Next, build the mapping from words to indices.

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}

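PyTorch stores the embeddings in a matrix of size (vocabulary size) × (embedding dimension). The embedding of the word with index $i$ is stored in the $i$-th row of this matrix.
You retrieve embeddings by indexing nn.Embedding with a torch.LongTensor (the indices must be integers, not floats).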
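The row-lookup behavior can be confirmed by comparing the layer's output with direct indexing into its weight matrix. The sizes (4 words, 5 dimensions) and the indices are hypothetical values chosen for this sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
embeds = nn.Embedding(4, 5)   # hypothetical: 4 words in vocab, 5-dimensional embeddings
idxs = torch.tensor([2, 0], dtype=torch.long)

looked_up = embeds(idxs)
print(embeds.weight.shape)                           # (vocab size, embedding dim)
print(torch.equal(looked_up, embeds.weight[idxs]))   # lookup == selecting rows
```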

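Since CONTEXT_SIZE=2 here, the model input is the indices of two words.
For example, for the context ['When', 'forty'], my run had "When": 94 and "forty": 33, so context_idxs=[94, 33] (the model input). The indices come from enumerating a set, so they can differ between runs.
self.embeddings(inputs) fetches two vectors of length EMBEDDING_DIM, and view((1, -1)) reshapes them to size (1, CONTEXT_SIZE * EMBEDDING_DIM). linear1 then maps this to 128 dimensions, relu is applied, linear2 maps the result to vocab_size (the number of distinct words), and log_softmax yields the log probability of each word.
All of this corresponds to log_probs = model(context_idxs), labeled Step 3 in the code.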
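The shape changes in the forward pass can be traced with standalone layers. This is a sketch only: VOCAB_SIZE = 97 is an assumed value, and the layers here are freshly initialized rather than taken from the trained model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

CONTEXT_SIZE, EMBEDDING_DIM = 2, 10
VOCAB_SIZE = 97  # assumed vocabulary size, for illustration only

embeddings = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM)
linear1 = nn.Linear(CONTEXT_SIZE * EMBEDDING_DIM, 128)
linear2 = nn.Linear(128, VOCAB_SIZE)

context_idxs = torch.tensor([94, 33], dtype=torch.long)
e = embeddings(context_idxs)                        # (2, 10): one row per context word
flat = e.view((1, -1))                              # (1, 20): rows laid side by side
hidden = F.relu(linear1(flat))                      # (1, 128)
log_probs = F.log_softmax(linear2(hidden), dim=1)   # (1, 97): one score per vocab word

for name, t in [("e", e), ("flat", flat), ("hidden", hidden), ("log_probs", log_probs)]:
    print(name, tuple(t.shape))
```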

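Next, the word target that actually followed the two context words is used as the correct label to compute the loss; the gradients are then computed and the parameters updated.
This is repeated for every context, accumulating total_loss.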

class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the gradient
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!

As a result, total_loss decreases as training progresses.
The tutorial ends here, but I was curious whether the model can actually predict the next word, so I tried it.

pred_ix = model(torch.tensor([word_to_ix[w] for w in ['forty', 'winters']], dtype=torch.long)).argmax()
print([k for k, v in word_to_ix.items() if v == pred_ix.item()])

['thy']

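After ['forty', 'winters'] the next word should be 'shall', but the word with the highest probability was 'thy'.
'thy' occurs many times in the text to begin with, so it probably tends to get a high probability. I tried several contexts, and 'thy' was predicted in most of them. Rather than simply picking the word with the highest model output probability, it might be better to also take each word's frequency in the text into account.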
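One way to examine the frequency hypothesis is to count how often each word appears as a training target. The sketch below uses only the first four lines of the sonnet so that it stays self-contained; the names words, targets, and counts are introduced here for illustration, and the result says nothing definitive about the full model:

```python
from collections import Counter

# First four lines of the sonnet used above, split on whitespace like test_sentence.
words = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:""".split()

# With CONTEXT_SIZE = 2, every word from the third one onward is a target once.
targets = words[2:]
counts = Counter(targets)
total = sum(counts.values())

print(counts.most_common(3))
print(counts["thy"] / total)  # relative frequency of 'thy' among the targets
```

Even in this short excerpt 'thy' is the only target that repeats, so a weakly trained model has a reason to favor it regardless of context.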
That's all for now.
