Tomas Mikolov - Efficient Estimation of Word Representations in Vector Space (2013)

Created: July 14, 2017 / Updated: March 22, 2020 / Status: finished / 4 min read (~632 words)

  • We expect that not only will similar words tend to be close to each other (in vector space), but that words can have multiple degrees of similarity

  • The training complexity is proportional to

    $$ O = E \times T \times Q$$

    where $E$ is the number of training epochs, $T$ is the number of words in the training set, and $Q$ is defined separately for each model architecture
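
    For reference, the paper works out $Q$ for each architecture (summarized here from memory, so treat the exact terms as a hedged recollection), with $N$ the context size, $D$ the word-vector dimensionality, $H$ the hidden-layer size, $V$ the vocabulary size and $C$ the maximum skip-gram window:

    $$ Q_{\text{NNLM}} = N \times D + N \times D \times H + H \times V $$
    $$ Q_{\text{CBOW}} = N \times D + D \times \log_2 V $$
    $$ Q_{\text{skip-gram}} = C \times (D + D \times \log_2 V) $$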

  • At the input layer, $N$ previous words are encoded using 1-of-$V$ coding, where $V$ is the size of the vocabulary
  • The input layer is then projected to a projection layer $P$ that has dimensionality $N \times D$, using a shared projection matrix
  • As only $N$ inputs are active at any given time, composing the projection layer is a relatively cheap operation
  • The NNLM architecture becomes computationally expensive between the projection and the hidden layer, since the values in the projection layer are dense
  • In our model, we use hierarchical softmax where the vocabulary is represented as a Huffman binary tree
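
A minimal NumPy sketch (my own illustration, not code from the paper) of why the projection is cheap: with 1-of-$V$ coding each context word is just an index, so projecting $N$ words through the shared matrix is $N$ row lookups costing about $N \times D$ operations, independent of $V$.

```python
import numpy as np

V, D, N = 10_000, 100, 4                     # vocabulary size, vector dimensionality, context size
rng = np.random.default_rng(0)

# Shared projection matrix: one D-dimensional row per vocabulary word.
projection = rng.normal(scale=0.1, size=(V, D))

# 1-of-V coding means each context word is just an index into the vocabulary,
# so multiplying a one-hot vector by the matrix reduces to a row lookup.
context_ids = np.array([12, 7, 431, 9981])   # the N previous words (hypothetical ids)

# Projection layer of shape N x D, built by copying the N selected rows;
# the cost is N * D regardless of how large V is.
P = projection[context_ids]
print(P.shape)                               # (4, 100)
```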

  • Most of the complexity is caused by the non-linear hidden layer in the model
  • The new architectures directly follow those proposed in our earlier work, where it was found that a neural network language model can be successfully trained in two steps:
    • continuous word vectors are learned using a simple model
    • the N-gram NNLM is trained on top of these distributed representations of words

  • The first proposed architecture is similar to the feedforward NNLM, where the non-linear hidden layer is removed and the projection layer is shared for all words; thus, all words get projected into the same position (their vectors are averaged)
  • We call this architecture a bag-of-words model as the order of words in the history does not influence the projection
  • We also use words from the future; we have obtained the best performance on the task introduced [in the next section] by building a log-linear classifier with four future and four history words at the input, where the training criterion is to correctly classify the current (middle) word
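
A toy NumPy sketch of the CBOW idea (my own illustration; it uses a plain softmax over the vocabulary rather than the hierarchical softmax the paper relies on for efficiency): context vectors are looked up in a shared projection matrix, averaged, and passed to a log-linear classifier whose training criterion is to predict the middle word.

```python
import numpy as np

def cbow_step(W_in, W_out, context_ids, target_id, lr=0.05):
    """One SGD step: predict the middle word from the averaged context vectors."""
    h = W_in[context_ids].mean(axis=0)            # average the context rows -> (D,)
    scores = h @ W_out                            # log-linear classifier -> (V,)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                          # softmax over the vocabulary

    grad_scores = probs.copy()                    # cross-entropy gradient w.r.t. scores
    grad_scores[target_id] -= 1.0
    grad_h = W_out @ grad_scores                  # gradient flowing back to the averaged context

    W_out -= lr * np.outer(h, grad_scores)        # update the classifier weights in place
    W_in[context_ids] -= lr * grad_h / len(context_ids)   # update only the touched projection rows
    return -np.log(probs[target_id])              # loss, for monitoring

rng = np.random.default_rng(0)
V, D = 50, 16                                     # toy vocabulary and vector sizes
W_in = rng.normal(scale=0.1, size=(V, D))         # shared projection matrix
W_out = rng.normal(scale=0.1, size=(D, V))        # log-linear output layer

# Four history + four future word ids (hypothetical) predicting the middle word.
print(cbow_step(W_in, W_out, context_ids=[3, 7, 9, 11, 15, 21, 30, 42], target_id=5))
```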

  • The second architecture is similar to CBOW, but instead of predicting the current word based on the context, it tries to maximize classification of a word based on another word in the same sentence
  • We use each current word as an input to a log-linear classifier with continuous projection layer, and predict words within a certain range before and after the current word
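
A small sketch of how skip-gram training pairs can be generated (a hypothetical helper, not the paper's code). The per-word random window follows the paper's trick of sampling a range $R$ in $[1, C]$, so that more distant words, which are usually less related to the current word, are used less often.

```python
import random

def skipgram_pairs(sentence, max_window=5, seed=0):
    """Return (current word, nearby word) pairs for training the skip-gram classifier."""
    rng = random.Random(seed)
    pairs = []
    for i, center in enumerate(sentence):
        window = rng.randint(1, max_window)                  # R in [1, C]
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))          # predict the nearby word from the current one
    return pairs

print(skipgram_pairs("the quick brown fox jumps over the lazy dog".split(), max_window=2))
```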

  • We follow the previous observation that there can be many different types of similarities between words; for example, the word big is similar to bigger in the same sense that small is similar to smaller
  • We further denote two pairs of words with the same relationship as a question, as we can ask: "What is the word that is similar to small in the same sense as biggest is similar to big?"
  • These questions can be answered by performing simple algebraic operations with the vector representation of words
    • To find a word that is similar to small in the same sense as biggest is similar to big, we can simply compute the vector $X = vector("biggest") - vector("big") + vector("small")$
    • Then, we search in the vector space for the word closest to $X$ measured by cosine distance, and use it as the answer to the question
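
A toy sketch of that analogy search (random placeholder vectors, just to show the mechanics; with real trained embeddings the expected answer to big : biggest :: small : ? is smallest).

```python
import numpy as np

def answer_analogy(a, b, c, vectors):
    """Return the word whose vector is closest, by cosine similarity, to
    vector(b) - vector(a) + vector(c), excluding the three query words."""
    x = vectors[b] - vectors[a] + vectors[c]
    x = x / np.linalg.norm(x)
    best_word, best_sim = None, -np.inf
    for word, v in vectors.items():
        if word in (a, b, c):
            continue
        sim = float(np.dot(x, v) / np.linalg.norm(v))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Hypothetical embeddings: random vectors stand in for trained ones.
rng = np.random.default_rng(0)
vocab = ["big", "biggest", "small", "smallest", "large", "larger"]
vectors = {w: rng.normal(size=50) for w in vocab}
print(answer_analogy("big", "biggest", "small", vectors))
```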

  • It can be seen that after some point, adding more dimensions or adding more training data provides diminishing improvements
  • We have to increase both the vector dimensionality and the amount of training data together

  • Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).