From Word Embedding to Document Embedding, a pure-text introduction

To better understand word embedding, especially for readers who do not have the necessary background, it is best to start from an idea I call conceptualization.

Conceptualization is simply how we think of a word intuitively. For example, when you look up the word 'apple' in a dictionary, you find a nicely organized term and its definition. Such a definition treats the word as a symbol for an abstract concept, hence the name conceptualization. The concept behind a word, however, is sometimes not so simply defined. Take the word 'apple' again. A few decades ago, it mostly referred to a round, red fruit with a sweet and sour flavor. Since Apple Inc. used this symbol to build a mega tech company, 'apple' has been re-conceptualized with an additional, overloaded meaning. We humans understand multiple definitions of the same word with ease. When we think of apple, we immediately know it can mean a fruit but also the name of the tech giant. Even if we do not know a new meaning yet, we pick it up quickly without trouble.

In mathematics, however, we have to do extra work to describe such duality. First, consider every word as a vector (or simply a point) in a space, like stars in the sky. Then assume there is a plane on which each point refers to a kind of fruit, such as banana, apple, or orange. The word 'apple' thus lies on this fruit plane. When the tech-company meaning is added to 'apple', additional dimensions must be added to accommodate it. Without losing any generality, we can add a z dimension, turning the world from 2D into 3D. We can then add a plane of tech companies, which contains the words that refer to them. Now 'apple' lies not only on the fruit plane but also on the intersection of the two planes.
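
To make this picture concrete, here is a tiny sketch in Python. All the coordinates are invented purely for illustration; the point is only that appending a dimension lets one fixed point carry both the fruit meaning and the company meaning at once.

```python
import numpy as np

# Hypothetical 2D coordinates on the "fruit plane" (values invented for illustration).
apple_fruit = np.array([0.8, 0.6])   # 'apple' as a fruit
banana      = np.array([0.9, 0.4])
orange      = np.array([0.7, 0.7])

# Add a z dimension so the space can also hold a "tech company plane".
# 'apple' now has non-zero coordinates on both planes: it sits at their intersection.
apple_3d  = np.array([0.8, 0.6, 0.9])   # fruit meaning + company meaning
google_3d = np.array([0.0, 0.1, 0.95])  # lies mostly on the company plane

print(apple_3d.shape)  # (3,) -- one fixed point carrying both meanings at once
```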

In general, conceptualization can add, lose, or alter the meanings of words by moving these points inside the mathematical space, whose dimension can be much larger than 3. In Natural Language Processing, for example, we use 300 or even thousands of dimensions. However, such conceptualization, which describes multiple dimensions of meaning simultaneously, often does not fit how words are actually used.

To see why, let's go back to the example of apple. Humans almost always understand what 'apple' means when the context is clear. For example, 'Apple is going to release its next phone next weekend.' and 'Apples are good for health.' are clearly distinguishable. The problem is that if we only use the conceptualization introduced above, where the word 'apple' is embedded into a single high-dimensional space, we make an implicit assumption that 'apple' always holds both meanings simultaneously. In reality, the duality of apple appears only in a kind of virtual time, where the word travels between sentences and switches to the correct meaning in each context. When it means the fruit, it seldom means the tech brand, and vice versa.

Thus, we should not impose this duality on the word directly. We need to bring context into the equations, which is hard to define mathematically. We do so by first reading the whole sentence to identify the context; a representation of the context is then constructed. According to that context, the word 'apple' is given a unique point, a conditional embedding. This is called dynamic embedding. To make it concrete, consider the earlier example where 'apple' lies on a fruit plane. Dynamic embedding first determines what this plane is. If the context is tech brands, the whole fruit plane is replaced and 'apple' is relocated to the correct position on the new plane. For first-time readers who are new to AI, this may sound particularly easy and straightforward: you first look at the shelf, then look up the book, very reasonable. But it took mathematicians a long time to arrive at a workable model with formulas, especially when misled by the illusion of conceptualization. This chapter describes the main methods based on conceptualization, and Chapter 4 will continue with the methods based on duality.
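
As a rough illustration of the difference (a toy sketch, not any particular published model), a dynamic embedding can be thought of as a function of both the word and its sentence, whereas a static embedding is a plain table lookup. The vectors and the crude keyword check below are invented for demonstration; real models compute the context with neural networks.

```python
import numpy as np

# Static embedding: one fixed vector per word, whatever the sentence says.
static_table = {"apple": np.array([0.8, 0.6, 0.9])}

def static_embed(word):
    return static_table[word]

# Dynamic embedding (toy version): the vector for 'apple' depends on the context.
sense_table = {
    ("apple", "fruit"):   np.array([0.8, 0.6, 0.0]),
    ("apple", "company"): np.array([0.1, 0.0, 0.9]),
}

def dynamic_embed(word, sentence):
    # Crude stand-in for reading the whole sentence to identify the context.
    context = "company" if ("phone" in sentence or "release" in sentence) else "fruit"
    return sense_table[(word, context)]

print(dynamic_embed("apple", "Apple is going to release its next phone next weekend."))
print(dynamic_embed("apple", "Apples are good for health."))
```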

Although Word2Vec (Mikolov et al. [2013b]) is considered the first modern word-embedding model, the idea has actually been studied for decades. As early as 1996, Lund and Burgess proposed that the meaning of a word should be reflected by its contexts. The idea received little attention at the time, mostly due to the lack of computing resources, corpus data, and the theory of Deep Learning. After Mikolov, many similar ideas were proposed, and they largely divide into two subgroups: static representations and dynamic representations.

Static representation is similar to the conceptualization in our first example. These methods assign each word a single vector in the semantic space. Context is used to train the model, but the word keeps a fixed representation afterwards, hence the name static representation. They range from the earliest methods, such as sparse, high-dimensional one-hot vectors, all the way to Mikolov's word2vec.
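
To see a static representation in practice, here is a minimal sketch, assuming the gensim library (not part of the original text) is installed. It trains a tiny word2vec model on a toy corpus; the key observation is that 'apple' ends up with exactly one vector, no matter which sentence it appeared in.

```python
# A minimal word2vec sketch, assuming gensim >= 4.0 is installed.
from gensim.models import Word2Vec

toy_corpus = [
    ["apple", "is", "a", "sweet", "fruit"],
    ["apple", "released", "a", "new", "phone"],
    ["banana", "is", "a", "yellow", "fruit"],
]

model = Word2Vec(toy_corpus, vector_size=50, window=2, min_count=1, epochs=50)

# One fixed vector per word: the same 50 numbers, regardless of the sentence.
print(model.wv["apple"].shape)  # (50,)
```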

Dynamic representation is similar to the second example. These methods look for a representation of a word relative to its context. They are more stable and more advanced, but also harder to train and harder to understand.

In the following sections, I will briefly go through the methods of static representation and then those of dynamic representation.

In the very early era of NLP, words were represented in a very naive form: high-dimensional zero-one vectors. Such a vector has all entries equal to zero except for the single entry representing the word. For example, say we have a corpus (a large collection of sentences) with a vocabulary (the set of distinct words) of size |V| = 300. We first sort the vocabulary and give each word a unique index from 1 to 300. Say the word 'apple' gets index 169. To build its one-hot vector, we create a zero vector of length 300 and set the 169th entry to 1.
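
The construction takes only a few lines. The sketch below follows the numbers in the example (|V| = 300, 'apple' at index 169), remembering that Python arrays are 0-indexed while the text counts from 1.

```python
import numpy as np

V = 300            # vocabulary size from the example
apple_index = 169  # 1-based index assigned to 'apple' after sorting the vocabulary

one_hot_apple = np.zeros(V)
one_hot_apple[apple_index - 1] = 1.0  # shift to 0-based indexing

print(one_hot_apple.sum())  # 1.0 -- exactly one non-zero entry
print(one_hot_apple[:5])    # [0. 0. 0. 0. 0.]
```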

Representing words in this way is straightforward and especially easy: it merely enumerates the vocabulary and gives each word an id number. However, each representation uses only one position in the vector, and it is impossible to measure meaningful distances between two words, since any two distinct one-hot vectors are orthogonal. When we want to find similarity between words, these drawbacks become severe.
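
The drawback is easy to demonstrate: the dot product (and hence the cosine similarity) between any two different one-hot vectors is 0, no matter how related the words are. A short check, reusing the setup above (the index 42 for 'orange' is made up for the example):

```python
import numpy as np

def one_hot(index, size=300):
    v = np.zeros(size)
    v[index - 1] = 1.0
    return v

apple  = one_hot(169)
orange = one_hot(42)   # hypothetical index for 'orange'

# Any two different words are orthogonal, so one-hot vectors carry no
# information about how similar the words are.
print(apple @ orange)  # 0.0
print(apple @ apple)   # 1.0 -- a word is only "similar" to itself
```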

To address this problem, scientists tried adding linguistic priors of many kinds. For example, they added morphology, part-of-speech tags, and dictionary features to the model, along with an early model of context: distributional representations.
