An unofficial collection of pointers that may be useful when reading Naoaki Okazaki's article "単語の意味をコンピュータに教える" (Teaching Computers the Meaning of Words), published in Iwanami Data Science (岩波データサイエンス) Vol. 2.

# Support Pages and Launch Event

Iwanami Data Science series support page: https://sites.google.com/site/iwanamidatascience/

"Distributed Representations (Word Embeddings)" page on the support site: https://sites.google.com/site/iwanamidatascience/vol2/word-embedding

Event page: http://connpass.com/event/27135/

Video: https://www.youtube.com/watch?v=EyL_TC17MkQ (the relevant part starts around 19:20)

Togetter summary: http://togetter.com/li/950232

# References, in Roughly Chronological Order

The papers cited as references in the article, together with related material, arranged roughly in chronological order.

## Distributional Structure (Z. Harris, 1954)

The oldest reference, and the origin of the distributional hypothesis.

### Paper PDF (excerpt?)

- http://www.tandfonline.com/doi/abs/10.1080/00437956.1954.11659520
- http://www.tandfonline.com/doi/pdf/10.1080/00437956.1954.11659520

### Related

Distributional semantics https://en.wikipedia.org/wiki/Distributional_semantics

Slides explaining distributed representations: http://www.slideshare.net/unnonouno/20140206-statistical-semantics

Slides from the "Distributional Semantic Models" tutorial: http://wordspace.collocations.de/lib/exe/fetch.php/course:acl2010:naacl2010_part1.slides.pdf
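
As a minimal illustration of the distributional hypothesis ("words occurring in similar contexts have similar meanings"), here is a toy sketch of my own (corpus and window size invented, not from any of the linked materials) that builds count-based context vectors and compares them with cosine similarity:

```python
import numpy as np

# Toy corpus: words that appear in similar contexts should get similar
# vectors -- the distributional hypothesis in its simplest count-based form.
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the dog".split(),
    "stocks fell on friday".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts within a +/-1 word window.
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if i != j:
                C[idx[w], idx[sent[j]]] += 1

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

# "cat" and "dog" share contexts ("the", "drinks"), so their count vectors
# are far more similar than those of "cat" and "friday".
print(cosine(C[idx["cat"]], C[idx["dog"]]))
print(cosine(C[idx["cat"]], C[idx["friday"]]))
```

Everything after this (word2vec, GloVe, PMI factorization) can be seen as refining or compressing vectors of this kind.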

## Distributed Representations of Words and Phrases and their Compositionality (T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean, 2013)

One of Mikolov's word2vec papers. It introduces speedups for skip-gram training and the (still somewhat mysterious) additivity of the learned vectors. Reportedly the starting point of modern distributed representations.

### Paper PDF

- https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
- http://arxiv.org/pdf/1310.4546.pdf

### Abstract

The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling.

An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.

### Video

### Commentary

- http://www.slideshare.net/unnonouno/nips2013-distributed-representations-of-words-and-phrases-and-their-compositionality
- http://hytae.hatenablog.com/entry/2015/05/15/Mikolov%E3%81%AEword2vec%E8%AB%96%E6%96%873%E6%9C%AC%E3%81%BE%E3%81%A8%E3%82%81
- http://qiita.com/nishio/items/3860fe198d65d173af6b

## GloVe: Global Vectors for Word Representation (J. Pennington, R. Socher, C. Manning, 2014)

Something similar to word2vec (the difference is in the method, not just the implementation; per the abstract below, GloVe fits a global log-bilinear regression to co-occurrence counts rather than training on individual context windows).

### Paper PDF

### Abstract

Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
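
To make the "global log-bilinear regression" concrete, here is a toy sketch of a GloVe-style objective: weighted least squares between w_i·w̃_j + biases and log X_ij, trained only on nonzero co-occurrence cells. The counts and hyperparameters are my own inventions (the factor 2 from the squared-error derivative is absorbed into the learning rate):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy symmetric word-word co-occurrence counts X.
X = np.array([
    [0, 4, 3, 0, 0],
    [4, 0, 2, 1, 0],
    [3, 2, 0, 0, 1],
    [0, 1, 0, 0, 5],
    [0, 0, 1, 5, 0],
], dtype=float)
V, dim, lr = X.shape[0], 4, 0.05

w = rng.normal(0, 0.1, (V, dim))    # word vectors
wc = rng.normal(0, 0.1, (V, dim))   # separate context vectors
b, bc = np.zeros(V), np.zeros(V)    # word / context biases

def f(x, x_max=10.0, alpha=0.75):
    # GloVe's weighting: down-weights rare pairs, caps frequent ones.
    return (x / x_max) ** alpha if x < x_max else 1.0

# Train only on nonzero cells, as the abstract emphasizes.
pairs = [(i, j) for i in range(V) for j in range(V) if X[i, j] > 0]

def loss():
    return sum(f(X[i, j]) * (w[i] @ wc[j] + b[i] + bc[j] - np.log(X[i, j])) ** 2
               for i, j in pairs)

before = loss()
for _ in range(300):
    for i, j in pairs:
        err = f(X[i, j]) * (w[i] @ wc[j] + b[i] + bc[j] - np.log(X[i, j]))
        gw, gc = err * wc[j], err * w[i]
        w[i] -= lr * gw
        wc[j] -= lr * gc
        b[i] -= lr * err
        bc[j] -= lr * err
after = loss()
print(before, after)
```

The real implementation uses AdaGrad and adds w and w̃ at the end; this sketch just shows the objective being minimized.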

### Implementation

### Related sites

### Video

### Commentary

- http://nzw.hatenablog.jp/entry/2015/06/07/223658
- http://nonbiri-tereka.hatenablog.com/entry/2015/10/25/223430

## Neural Word Embedding as Implicit Matrix Factorization (O. Levy and Y. Goldberg, 2014)

The paper showing that skip-gram with negative sampling is (under certain conditions?) equivalent to factorizing the (shifted) PMI word-context matrix.

### Paper PDF

### Abstract

We analyze skip-gram with negative-sampling (SGNS), a word embedding method introduced by Mikolov et al., and show that it is implicitly factorizing a word-context matrix, whose cells are the pointwise mutual information (PMI) of the respective word and context pairs, shifted by a global constant. We find that another embedding method, NCE, is implicitly factorizing a similar matrix, where each cell is the (shifted) log conditional probability of a word given its context.

We show that using a sparse Shifted Positive PMI word-context matrix to represent words improves results on two word similarity tasks and one of two analogy tasks.

When dense low-dimensional vectors are preferred, exact factorization with SVD can achieve solutions that are at least as good as SGNS’s solutions for word similarity tasks. On analogy questions SGNS remains superior to SVD. We conjecture that this stems from the weighted nature of SGNS’s factorization.
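
The count-based side of the paper can be sketched directly: build the Shifted Positive PMI matrix max(PMI − log k, 0) and take a truncated SVD for dense vectors. The toy counts are mine, and I use k = 2 rather than a typical SGNS value like 5 so that some cells survive the shift in this tiny example:

```python
import numpy as np

# Toy word-context co-occurrence counts (rows: words, columns: contexts).
C = np.array([
    [0, 12, 1, 0],
    [12, 0, 0, 1],
    [1, 0, 0, 9],
    [0, 1, 9, 0],
], dtype=float)

total = C.sum()
pw = C.sum(axis=1, keepdims=True) / total
pc = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log((C / total) / (pw * pc))  # -inf where the count is zero

k = 2  # the SGNS negative-sampling constant; the shift is log k
sppmi = np.maximum(pmi - np.log(k), 0.0)   # Shifted Positive PMI

# Dense low-dimensional vectors via truncated SVD of the SPPMI matrix.
U, S, Vt = np.linalg.svd(sppmi)
d = 2
W = U[:, :d] * np.sqrt(S[:d])  # word embeddings
print(W)
```

Splitting the singular values symmetrically between the two factors (sqrt(S) on each side) is one of the choices the paper discusses.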

### Related

- http://hytae.hatenablog.com/entry/2015/05/15/%E9%9B%91%E8%AA%AD%E8%AB%96%E6%96%87%E3%81%BE%E3%81%A8%E3%82%81
- http://www.slideshare.net/nttdata-msi/skip-gram-shirakawa20141121-41833306
- http://www.logos.t.u-tokyo.ac.jp/~hassy/deep_learning/word2vec_pmi/

## A Linear Dynamical System Model for Text (D. Belanger and S. Kakade, 2015)

### Paper PDF

### Abstract

Low dimensional representations of words allow accurate NLP models to be trained on limited annotated data. While most representations ignore words’ local context, a natural way to induce context-dependent representations is to perform inference in a probabilistic latent-variable sequence model. Given the recent success of continuous vector space word representations, we provide such an inference procedure for continuous states, where words’ representations are given by the posterior mean of a linear dynamical system. Here, efficient inference can be performed using Kalman filtering. Our learning algorithm is extremely scalable, operating on simple cooccurrence counts for both parameter initialization using the method of moments and subsequent iterations of EM. In our experiments, we employ our inferred word embeddings as features in standard tagging tasks, obtaining significant accuracy improvements. Finally, the Kalman filter updates can be seen as a linear recurrent neural network. We demonstrate that using the parameters of our model to initialize a non-linear recurrent neural network language model reduces its training time by a day and yields lower perplexity.
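
The "posterior mean via Kalman filtering" idea can be illustrated generically (a standard Kalman filter on made-up parameters, not the paper's model or its method-of-moments initialization). Because the gain depends only on covariances, the posterior mean is a linear function of the observations so far, which is the "linear recurrent neural network" view mentioned in the abstract:

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear dynamical system: h_t = A h_{t-1} + noise, y_t = C h_t + noise.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
C = np.array([[1.0, 0.0]])
Q = 0.1 * np.eye(2)    # state-noise covariance
R = np.array([[0.5]])  # observation-noise covariance

def kalman_means(ys):
    """Posterior means of the latent state given the observations so far."""
    m, P = np.zeros(2), np.eye(2)
    means = []
    for y in ys:
        # Predict step.
        m_pred = A @ m
        P_pred = A @ P @ A.T + Q
        # Update step. The gain K depends only on covariances, so the new
        # mean is a *linear* function of (previous mean, observation):
        # m = (I - K C) A m_prev + K y  -- one step of a linear RNN.
        K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)
        m = m_pred + K @ (y - C @ m_pred)
        P = (np.eye(2) - K @ C) @ P_pred
        means.append(m.copy())
    return np.array(means)

ys = rng.normal(size=(20, 1))
M = kalman_means(ys)
```

In the paper the observations come from word co-occurrence statistics and each posterior mean serves as a context-dependent word representation; here they are just random numbers to show the recursion.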

### Implementation

### Commentary

## Improving Distributional Similarity with Lessons Learned from Word Embeddings (O. Levy, Y. Goldberg and I. Dagan, 2015)

### Paper PDF

### Abstract

Recent trends suggest that neural-network-inspired word embedding models outperform traditional count-based distributional models on word similarity and analogy detection tasks. We reveal that much of the performance gains of word embeddings are due to certain system design choices and hyperparameter optimizations, rather than the embedding algorithms themselves. Furthermore, we show that these modifications can be transferred to traditional distributional models, yielding similar gains. In contrast to prior reports, we observe mostly local or insignificant performance differences between the methods, with no global advantage to any single approach over the others.
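
One example of a hyperparameter that transfers from word2vec to count-based models is context-distribution smoothing: raising context counts to a power (0.75, as in word2vec's negative-sampling distribution) before computing PPMI, which dampens the bias toward rare contexts. A toy sketch with made-up counts:

```python
import numpy as np

# Toy word-context counts; context 2 (last column) is rare.
C = np.array([
    [8, 1, 0],
    [6, 0, 1],
    [1, 5, 0],
], dtype=float)

def ppmi(C, cds=1.0):
    # cds < 1 smooths the context distribution: p(c) ~ count(c)**cds,
    # which shrinks the PMI of rare contexts.
    total = C.sum()
    pw = C.sum(axis=1, keepdims=True) / total
    ctx = C.sum(axis=0) ** cds
    pc = (ctx / ctx.sum())[None, :]
    with np.errstate(divide="ignore"):
        pmi = np.log((C / total) / (pw * pc))
    return np.maximum(pmi, 0.0)

plain = ppmi(C)
smoothed = ppmi(C, cds=0.75)
# The rare context's association with word 1 is dampened by smoothing.
print(plain[1, 2], smoothed[1, 2])
```

The paper evaluates several such transferable knobs (dynamic windows, subsampling, vector normalization); this shows just one of them in isolation.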

### Implementation

### Talk video

### Related

## A Generative Word Embedding Model and its Low Rank Positive Semidefinite Solution (S. Li, J. Zhu and C. Miao, 2015)

### Paper PDF

### Abstract

Most existing word embedding methods can be categorized into Neural Embedding Models and Matrix Factorization (MF)-based methods. However some models are opaque to probabilistic interpretation, and MF-based methods, typically solved using Singular Value Decomposition (SVD), may incur loss of corpus information.

In addition, it is desirable to incorporate global latent factors, such as topics, sentiments or writing styles, into the word embedding model. Since generative models provide a principled way to incorporate latent factors, we propose a generative word embedding model, which is easy to interpret, and can serve as a basis of more sophisticated latent factor models. The model inference reduces to a low rank weighted positive semidefinite approximation problem. Its optimization is approached by eigendecomposition on a submatrix, followed by online blockwise regression, which is scalable and avoids the information loss in SVD. In experiments on 7 common benchmark datasets, our vectors are competitive to word2vec, and better than other MF-based methods.
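
The "low rank positive semidefinite approximation" step can be illustrated in the simple unweighted case (my own sketch, not the paper's blockwise-regression algorithm): eigendecompose a symmetric target matrix and keep only the largest nonnegative eigenvalues, which yields both the PSD approximation and an embedding factor:

```python
import numpy as np

rng = np.random.default_rng(0)

# A symmetric (but not necessarily PSD) matrix standing in for the target.
B = rng.normal(size=(6, 6))
M = (B + B.T) / 2

def low_rank_psd(M, r):
    # Eigendecompose and keep the r largest eigenvalues clipped at zero:
    # the closest rank-<=r PSD matrix in Frobenius norm (unweighted case).
    vals, vecs = np.linalg.eigh(M)          # ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]  # descending
    keep = np.clip(vals[:r], 0.0, None)
    E = vecs[:, :r] * np.sqrt(keep)         # embedding factor: M2 = E @ E.T
    return E @ E.T, E

M2, E = low_rank_psd(M, 2)
```

The rows of E then serve as low-dimensional vectors whose inner products reproduce the approximated matrix; the paper's weighted variant needs the more elaborate optimization described in the abstract.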

### Implementation

## Model-based Word Embeddings from Decompositions of Count Matrices (K. Stratos, M. Collins, D. Hsu, 2015)

### Paper PDF

- http://www.cs.columbia.edu/~djhsu/papers/count_words.pdf
- http://aclweb.org/anthology/P/P15/P15-1124.pdf

### Abstract

This work develops a new statistical understanding of word embeddings induced from transformed count data. Using the class of hidden Markov models (HMMs) underlying Brown clustering as a generative model, we demonstrate how canonical correlation analysis (CCA) and certain count transformations permit efficient and effective recovery of model parameters with lexical semantics. We further show in experiments that these techniques empirically outperform existing spectral methods on word similarity and analogy tasks, and are also competitive with other popular methods such as WORD2VEC and GLOVE.
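
The spirit of the spectral recipe, transforming the counts (for example a square root) and scaling the co-occurrence matrix CCA-style before an SVD, can be sketched as follows. The toy counts and the exact scaling are my own simplification; see the paper and the `singular` implementation below for the actual recipe:

```python
import numpy as np

# Toy word-context counts.
C = np.array([
    [0, 12, 1, 0],
    [12, 0, 0, 1],
    [1, 0, 0, 9],
    [0, 1, 9, 0],
], dtype=float)

# Square-root count transformation, then scale by the (transformed) marginals,
# in the spirit of CCA between words and their contexts.
Ct = np.sqrt(C)
row = Ct.sum(axis=1)
col = Ct.sum(axis=0)
Omega = Ct / np.sqrt(row)[:, None] / np.sqrt(col)[None, :]

U, S, Vt = np.linalg.svd(Omega, full_matrices=False)
d = 2
E = U[:, :d]  # low-dimensional word representations
print(E)
```

The left singular vectors give the word representations; the paper's point is that under a Brown-clustering-style HMM this kind of procedure provably recovers the model parameters.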

### Slides

### Implementation

- Singular (C++) https://github.com/karlstratos/singular