
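I heard that PyTorch, the deep learning framework developed by Facebook, is convenient, so I am trying it out. The PyTorch tutorials include a section called "Deep Learning for NLP with PyTorch", so I summarize its key points here, partly as a memo for myself.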

1. Introduction to PyTorch

Introduction to Torch’s tensor library


# Author: Robert Guthrie

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


# torch.tensor(data) creates a torch.Tensor object with the given data.
V_data = [1., 2., 3.]
V = torch.tensor(V_data)

# Creates a matrix
M_data = [[1., 2., 3.], [4., 5., 6]]
M = torch.tensor(M_data)

# Create a 3D tensor of size 2x2x2.
T_data = [[[1., 2.], [3., 4.]],
          [[5., 6.], [7., 8.]]]
T = torch.tensor(T_data)


# By default, it concatenates along the first axis (concatenates rows)
x_1 = torch.randn(2, 5)
y_1 = torch.randn(3, 5)
z_1 = torch.cat([x_1, y_1])

# Concatenate columns:
x_2 = torch.randn(2, 3)
y_2 = torch.randn(2, 5)
# second arg specifies which axis to concat along
z_2 = torch.cat([x_2, y_2], 1)

# If your tensors are not compatible, torch will complain.  Uncomment to see the error
# torch.cat([x_1, x_2])


x = torch.randn(2, 3, 4)
print(x.view(2, 12))  # Reshape to 2 rows, 12 columns
# Same as above.  If one of the dimensions is -1, its size can be inferred
print(x.view(2, -1))


Computation Graphs and Automatic Differentiation

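The computation graph records how the data were combined to produce an output. Because it contains the information needed to compute derivatives, you do not have to implement backpropagation yourself.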

# Tensor factory methods have a ``requires_grad`` flag
x = torch.tensor([1., 2., 3], requires_grad=True)

# With requires_grad=True, you can still do all the operations you previously
# could
y = torch.tensor([4., 5., 6], requires_grad=True)
z = x + y

# BUT z knows something extra.

# Lets sum up all the entries in z
s = z.sum()

In the example above, the graph stores the facts that s was computed as the sum of the entries of tensor z, and that z was computed as the sum of x and y.
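The recorded history can be inspected through each tensor's `grad_fn` attribute. The following is a minimal sketch, assuming the same x, y, z, and s as in the example above; the exact node names that get printed depend on the installed PyTorch version.

```python
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
y = torch.tensor([4., 5., 6.], requires_grad=True)
z = x + y
s = z.sum()

# Leaf tensors created directly by the user have no grad_fn.
print(x.grad_fn)
# Tensors produced by an operation remember that operation.
print(z.grad_fn)  # node for the addition
print(s.grad_fn)  # node for the sum
```

This is the information autograd walks backwards through when `.backward()` is called.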

$$s = \overbrace{x_0 + y_0}^\text{$z_0$} + \overbrace{x_1 + y_1}^\text{$z_1$} + \overbrace{x_2 + y_2}^\text{$z_2$}$$


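Since $x_0$ appears in $s$ exactly once with coefficient 1, the derivative with respect to the first component of x is

$$\frac{\partial s}{\partial x_0} = 1$$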


# calling .backward() on any variable will run backprop, starting from it.
s.backward()
print(x.grad)  # ds/dx: every entry is 1


x = torch.randn(2, 2)
y = torch.randn(2, 2)
# By default, user created Tensors have ``requires_grad=False``
print(x.requires_grad, y.requires_grad)
z = x + y
# So you can't backprop through z

# ``.requires_grad_( ... )`` changes an existing Tensor's ``requires_grad``
# flag in-place. The input flag defaults to ``True`` if not given.
x = x.requires_grad_()
y = y.requires_grad_()
# z contains enough information to compute gradients, as we saw above
z = x + y
# If any input to an operation has ``requires_grad=True``, so will the output

# Now z has the computation history that relates itself to x and y
# Can we just take its values, and **detach** it from its history?
new_z = z.detach()

# ... does new_z have information to backprop to x and y?
# NO!
# And how could it? ``z.detach()`` returns a tensor that shares the same storage
# as ``z``, but with the computation history forgotten. It doesn't know anything
# about how it was computed.
# In essence, we have broken the Tensor away from its past history

# Another way to stop autograd from tracking history on tensors with
# requires_grad=True: wrap the code in torch.no_grad()
print((x ** 2).requires_grad)

with torch.no_grad():
    print((x ** 2).requires_grad)

2. Deep Learning with PyTorch


Affine Maps

A function $f(x)$ that can be written as $f(x) = Ax + b$, for a matrix $A$ and vectors $x$ and $b$, is called an affine map.

# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


lin = nn.Linear(5, 3)  # maps from R^5 to R^3, parameters A, b
# data is 2x5.  A maps from 5 to 3... can we map "data" under A?
data = torch.randn(2, 5)
print(lin(data))  # yes


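Applying affine maps repeatedly gains nothing, because the composition can always be written as a single affine map, so the model gets no extra expressive power. Introducing non-linearities between them makes it possible to build much more capable models. Sigmoid, tanh, and ReLU are the common choices, although sigmoid is seldom used in practice because of the vanishing gradient problem.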
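The claim that stacked affine maps collapse into one can be checked numerically. This is a small sketch with arbitrarily chosen layer sizes (5, 4, 3), not code from the tutorial: the two layers are merged by hand using $A = A_2 A_1$ and $b = A_2 b_1 + b_2$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lin1 = nn.Linear(5, 4)  # f(x) = A1 x + b1
lin2 = nn.Linear(4, 3)  # g(h) = A2 h + b2
x = torch.randn(2, 5)

with torch.no_grad():
    # Two affine maps applied one after the other
    stacked = lin2(lin1(x))

    # The same computation as a single affine map
    A = lin2.weight @ lin1.weight                # shape (3, 5)
    b = lin2.weight @ lin1.bias + lin2.bias      # shape (3,)
    collapsed = x @ A.T + b

    # With a non-linearity in between, the shortcut no longer applies
    with_relu = lin2(torch.relu(lin1(x)))

print(torch.allclose(stacked, collapsed, atol=1e-5))
print(torch.allclose(with_relu, collapsed, atol=1e-5))
```

The first comparison holds up to floating-point error, which is why depth only helps once a non-linearity sits between the layers.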

# In PyTorch, most non-linearities are in torch.nn.functional (imported here as F)
# Note that non-linearities typically don't have parameters like affine maps do.
# That is, they don't have weights that are updated during training.
data = torch.randn(2, 2)
print(data)
print(F.relu(data))  # negative entries are clamped to 0

Softmax and Probabilities


# Softmax is also in torch.nn.functional
data = torch.randn(5)
print(F.softmax(data, dim=0))
print(F.softmax(data, dim=0).sum())  # Sums to 1 because it is a distribution!
print(F.log_softmax(data, dim=0))  # theres also log_softmax

Optimization and Training


Creating Network Components in PyTorch


Example: Logistic Regression Bag-of-Words classifier

Consider mapping a sparse Bag-of-Words representation to probabilities over two labels, "English" and "Spanish".

data = [("me gusta comer en la cafeteria".split(), "SPANISH"),
        ("Give it to me".split(), "ENGLISH"),
        ("No creo que sea una buena idea".split(), "SPANISH"),
        ("No it is not a good idea to get lost at sea".split(), "ENGLISH")]

test_data = [("Yo creo que si".split(), "SPANISH"),
             ("it is lost on me".split(), "ENGLISH")]

# word_to_ix maps each word in the vocab to a unique integer, which will be its
# index into the Bag of words vector
word_to_ix = {}
for sent, _ in data + test_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 2  # SPANISH and ENGLISH

class BoWClassifier(nn.Module):  # inheriting from nn.Module!

    def __init__(self, num_labels, vocab_size):
        # calls the init function of nn.Module.  Dont get confused by syntax,
        # just always do it in an nn.Module
        super(BoWClassifier, self).__init__()

        # Define the affine map with nn.Linear
        self.linear = nn.Linear(vocab_size, num_labels)

        # NOTE! The non-linearity log softmax does not have parameters! So we don't need
        # to worry about that here

    def forward(self, bow_vec):
        # Pass the input through the linear layer,
        # then pass that through log_softmax.
        # Many non-linearities and other functions are in torch.nn.functional
        return F.log_softmax(self.linear(bow_vec), dim=1)

def make_bow_vector(sentence, word_to_ix):
    vec = torch.zeros(len(word_to_ix))
    for word in sentence:
        vec[word_to_ix[word]] += 1
    # Add a leading dimension so the shape becomes (1, vocab_size)
    return vec.view(1, -1)

def make_target(label, label_to_ix):
    return torch.LongTensor([label_to_ix[label]])

model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)

# The model starts out with randomly initialized parameters.
# The model knows its parameters.  The first output below is A, the second is b.
# Whenever you assign a component to a class variable in the __init__ function
# of a module, which was done with the line
# self.linear = nn.Linear(...)
# Then through some Python magic from the PyTorch devs, your module
# (in this case, BoWClassifier) will store knowledge of the nn.Linear's parameters
for param in model.parameters():
    print(param)

# Run the model before any training.
# To run the model, pass in a BoW vector
# Here we don't need to train, so the code is wrapped in torch.no_grad()
with torch.no_grad():
    sample = data[0]
    bow_vector = make_bow_vector(sample[0], word_to_ix)
    log_probs = model(bow_vector)
    print(log_probs)


label_to_ix = {"SPANISH": 0, "ENGLISH": 1}


# Run on test data before we train, just to see a before-and-after
with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs)

# Print the matrix column corresponding to "creo"
print(next(model.parameters())[:, word_to_ix["creo"]])
tensor([[-0.6250, -0.7662]])
tensor([[-0.5870, -0.8119]])
tensor([ 0.0544,  0.1722])


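Next, we train on the actual data: define a cost function, compute the gradients, and update the parameters. NLLLoss expects log probabilities as its input, which is why log softmax is used at the output layer. (nn.CrossEntropyLoss applies the log softmax itself, so with that loss the model would output raw scores instead.)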
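The relationship between the two losses can be confirmed on a single example. The scores below are random stand-ins for the output of a linear layer, assumed here only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

scores = torch.randn(1, 2)   # hypothetical raw scores for 2 labels
target = torch.tensor([0])   # index of the correct label, e.g. 0 = SPANISH

# Option 1: log softmax in the model, NLLLoss as the cost function
nll = nn.NLLLoss()(F.log_softmax(scores, dim=1), target)

# Option 2: raw scores in, CrossEntropyLoss does the log softmax internally
ce = nn.CrossEntropyLoss()(scores, target)

print(nll.item(), ce.item())
```

Both values agree up to floating-point error, so either pairing works as long as log softmax is applied exactly once.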

loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Usually you want to pass over the training data several times.
# 100 is much bigger than on a real data set, but real datasets have more than
# two instances.  Usually, somewhere between 5 and 30 epochs is reasonable.
for epoch in range(100):
    for instance, label in data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Make our BOW vector and also we must wrap the target in a
        # Tensor as an integer. For example, if the target is SPANISH, then
        # we wrap the integer 0. The loss function then knows that the 0th
        # element of the log probabilities is the log probability
        # corresponding to SPANISH
        bow_vec = make_bow_vector(instance, word_to_ix)
        target = make_target(label, label_to_ix)

        # Step 3. Run our forward pass.
        log_probs = model(bow_vec)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss = loss_function(log_probs, target)
        loss.backward()
        optimizer.step()

with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs)

# Index corresponding to Spanish goes up, English goes down!
print(next(model.parameters())[:, word_to_ix["creo"]])
tensor([[-0.1210, -2.1721]])
tensor([[-2.7767, -0.0643]])
tensor([ 0.5004, -0.2738])

Both test sentences are now assigned to the correct language. Also, because "creo" is a Spanish word, the weight corresponding to label 0 (SPANISH) has become larger.
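To read the predicted label off the log probabilities, take the index of the largest entry in each row. A short sketch using the two rows printed above after training; `ix_to_label` is a helper name introduced here, not part of the tutorial.

```python
import torch

label_to_ix = {"SPANISH": 0, "ENGLISH": 1}
# Hypothetical reverse lookup, built only for this illustration
ix_to_label = {ix: label for label, ix in label_to_ix.items()}

# Log probabilities printed after training for the two test sentences
log_probs = torch.tensor([[-0.1210, -2.1721],
                          [-2.7767, -0.0643]])

predicted = [ix_to_label[ix] for ix in log_probs.argmax(dim=1).tolist()]
probs = log_probs.exp()  # convert back to ordinary probabilities

print(predicted)
print(probs)
```

The first row favors SPANISH and the second ENGLISH, matching the labels in test_data.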

3. Word Embeddings: Encoding Lexical Semantics

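Word embeddings represent each word as a dense vector of real numbers and can capture the word's semantics. A one-hot representation, by contrast, makes the matrices enormous and treats every word as independent, so it cannot express relatedness of meaning.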
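The difference can be seen by comparing similarities. In this sketch the dense vectors are hand-picked illustrative values for three made-up words, not trained embeddings.

```python
import torch
import torch.nn.functional as F

# One-hot vectors for a hypothetical 3-word vocabulary
one_hot = F.one_hot(torch.tensor([0, 1, 2]), num_classes=3).float()
# Any two different words have dot product 0, so no word is "closer" to another
one_hot_sim = (one_hot[0] @ one_hot[1]).item()

# Hand-made dense vectors (illustrative values only)
cat = torch.tensor([0.9, 0.1, 0.3])
dog = torch.tensor([0.8, 0.2, 0.4])
car = torch.tensor([-0.5, 0.9, -0.2])

sim_cat_dog = F.cosine_similarity(cat, dog, dim=0).item()
sim_cat_car = F.cosine_similarity(cat, car, dim=0).item()

print(one_hot_sim, sim_cat_dog, sim_cat_car)
```

With dense vectors, related words can sit close together while unrelated ones point in different directions, which one-hot vectors cannot express.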

When working with word embeddings in deep learning, each word is assigned an index. This mapping is called word_to_ix.

# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)

An Example: N-Gram Language Modeling

$$P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} )$$
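For trigrams (n = 3) the expression becomes $P(w_i | w_{i-1}, w_{i-2})$. One simple, non-neural way to estimate it is by relative frequency of counts. The sketch below uses a hypothetical toy corpus and is only meant to make the formula concrete; the model in this section learns the distribution with a neural network instead.

```python
from collections import Counter

# Hypothetical toy corpus, not the sonnet used in this section
tokens = "the cat sat on the mat the cat ran".split()

context_counts = Counter()
trigram_counts = Counter()
for w2, w1, w in zip(tokens, tokens[1:], tokens[2:]):
    context_counts[(w2, w1)] += 1
    trigram_counts[(w2, w1, w)] += 1


def trigram_prob(w2, w1, w):
    """Relative-frequency estimate of P(w | w1, w2)."""
    if context_counts[(w2, w1)] == 0:
        return 0.0
    return trigram_counts[(w2, w1, w)] / context_counts[(w2, w1)]


print(trigram_prob("the", "cat", "sat"))  # 0.5
```

The context "the cat" occurs twice, followed once by "sat" and once by "ran", so each continuation gets probability 0.5.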

# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# print the first 3, just so you can see what they look like
print(trigrams[:3])
[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]


vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}

In PyTorch, the embeddings are stored in a matrix of size $|V| \times D$ (vocabulary size by embedding dimension). The embedding of the word with index $i$ is stored in the $i$-th row of this matrix.
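That row correspondence can be verified directly on the layer's weight matrix. The sizes here (4 words, 3 dimensions) are arbitrary values chosen for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embeds = nn.Embedding(4, 3)  # hypothetical sizes: 4 words, 3 dimensions

i = 2
looked_up = embeds(torch.tensor([i], dtype=torch.long))  # shape (1, 3)
row = embeds.weight[i]                                   # shape (3,)

print(tuple(embeds.weight.shape))
print(torch.equal(looked_up[0], row))
```

Looking up index i returns exactly row i of `embeds.weight`, so training the model updates these rows.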

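For example, when the context is ['When', 'forty'], word_to_ix gave "When": 94 and "forty": 33 in this run, so context_idxs = [94, 33] is the input to the model.
self.embeddings(inputs) looks up two vectors of length EMBEDDING_DIM, one per context word, and view((1, -1)) reshapes them to size (1, CONTEXT_SIZE * EMBEDDING_DIM). linear1 maps this to 128 dimensions, relu is applied, linear2 maps the result to vocab_size (the number of distinct words), and log_softmax yields the log probability of each word.
All of this corresponds to Step 3 in the code, log_probs = model(context_idxs).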
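The shape at each stage of that forward pass can be traced with standalone layers. This is a sketch under assumed sizes (a vocabulary of 97 words and 10-dimensional embeddings are hypothetical values for illustration), using untrained layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes chosen for illustration
VOCAB_SIZE, EMBEDDING_DIM, CONTEXT_SIZE = 97, 10, 2

embeddings = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM)
linear1 = nn.Linear(CONTEXT_SIZE * EMBEDDING_DIM, 128)
linear2 = nn.Linear(128, VOCAB_SIZE)

context_idxs = torch.tensor([94, 33], dtype=torch.long)

with torch.no_grad():
    e = embeddings(context_idxs)                       # (2, 10): one row per context word
    flat = e.view((1, -1))                             # (1, 20): rows concatenated
    hidden = F.relu(linear1(flat))                     # (1, 128)
    log_probs = F.log_softmax(linear2(hidden), dim=1)  # (1, 97): one entry per word

for name, t in [("embeddings", e), ("flattened", flat),
                ("hidden", hidden), ("log_probs", log_probs)]:
    print(name, tuple(t.shape))
```

Exponentiating the final row gives a distribution over the whole vocabulary that sums to 1.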


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

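# Hyperparameters as in the original tutorial:
# 2 context words, 10-dimensional embeddings
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10

losses = []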
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!


pred_ix = model(torch.tensor([word_to_ix[w] for w in ['forty', 'winters']], dtype=torch.long)).argmax()
[k for k, v in word_to_ix.items() if v == pred_ix.item()]

The word following ['forty', 'winters'] should be 'shall', but the word with the highest predicted probability was 'thy'.
