Creating Word Embedding using Word2Vec

Gagan Talreja
5 min read · Mar 12, 2020

Introduction

Word embedding is one of the most popular methods of representing words in a document. Word embeddings provide information about the context of a word and its semantic and syntactic similarity to other words in a document. Word embeddings can also be viewed as vector representations of the words in a particular document. According to Wikipedia,

Word embedding is the collective name for a set of language modeling and feature learning techniques in NLP where words or phrases from the vocabulary are mapped to vectors of real numbers.

Need

Word embeddings give us a much deeper insight into a document, which helps obtain better results on any NLP task. For example, consider two sentences: “The dinner was good.” and “The dinner was amazing.” If we create a vocabulary V for these sentences, we get V = {The, dinner, was, good, amazing}. If we represent this vocabulary using one-hot vectors, we can represent the words as follows:

The = [1, 0, 0, 0, 0]; dinner= [0, 1, 0, 0, 0]; was= [0, 0, 1, 0, 0];

good = [0, 0, 0, 1, 0]; amazing= [0, 0, 0, 0, 1];

If we analyze the one-hot encoded vectors, we can see that the vocabulary has been represented using 5 dimensions, with each word occupying a different dimension. If we then try to compute the similarity between the words “good” and “amazing” using vector algebra, we get 0, meaning these words are not similar at all, which we know is not true. Word embeddings are the solution to this problem, which is why they are of such great use when solving NLP tasks.
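
As a quick sketch in NumPy, we can verify this: the one-hot vectors for “good” and “amazing” are orthogonal, so their cosine similarity is 0.

```python
import numpy as np

# One-hot vectors for the vocabulary V = {The, dinner, was, good, amazing}
good    = np.array([0, 0, 0, 1, 0])
amazing = np.array([0, 0, 0, 0, 1])

# The vectors are orthogonal, so their cosine similarity is 0,
# even though the two words have very similar meanings.
cosine = good @ amazing / (np.linalg.norm(good) * np.linalg.norm(amazing))
print(cosine)  # 0.0
```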

How are word embeddings created?

Methods to generate this mapping include neural networks, dimensionality reduction on the word co-occurrence matrix, probabilistic models, explainable knowledge base methods, and explicit representation in terms of the context in which words appear. Neural networks are the preferred way to generate word embeddings, as they give more accurate results. Word2Vec is a popular neural-network-based algorithm for creating word embeddings. It was developed by Tomas Mikolov and his team at Google in 2013.

Word2Vec

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. A word2vec embedding can be obtained using one of two methods: the Continuous Bag of Words (CBOW) model and the Skip-gram model.

The Continuous Bag of Words Model

The CBOW model architecture tries to predict the center word given the context words within a fixed window size.

CBOW Model Architecture

The input to the CBOW model consists of one-hot encoded vectors of size V. In the diagram above, the model takes C context words as input. When the weight matrix W (V x N) is used to compute the hidden layer, we take an average over all C context word inputs. The hidden layer simply copies this weighted sum of the inputs to the next layer; no activation function is used here. The only non-linearity is the softmax calculation in the output layer.
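
As a rough sketch of this data flow (toy dimensions and random, untrained weights, not the actual Word2Vec implementation):

```python
import numpy as np

V, N, C = 5, 3, 2              # vocabulary size, embedding size, number of context words
W_in  = np.random.rand(V, N)   # input weight matrix  W  (V x N)
W_out = np.random.rand(N, V)   # output weight matrix W' (N x V)

# One-hot encoded context words, e.g. "The" and "was"
context = np.zeros((C, V))
context[0, 0] = 1
context[1, 2] = 1

# Hidden layer: average of the context word embeddings (no activation function)
h = context @ W_in             # (C, N) -> one embedding per context word
h = h.mean(axis=0)             # (N,)   average over the C context words

# Output layer: scores over the vocabulary, turned into probabilities via softmax
scores = h @ W_out             # (V,)
probs = np.exp(scores) / np.exp(scores).sum()
print(probs)                   # predicted distribution over the center word
```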

The Skip-Gram Model

We have now seen how to predict the target word given the context words. The skip-gram model architecture does just the opposite: it takes a target word as input and predicts C possible context words.

The Skip-gram Model

We input the target word into the network. The model outputs C probability distributions, one for each context position, each consisting of V probabilities (one per word in the vocabulary).
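
A matching rough sketch of the skip-gram forward pass (again toy dimensions and untrained weights): the target word’s embedding produces one softmax distribution over the vocabulary, which serves as the prediction for each of the C context positions.

```python
import numpy as np

V, N, C = 5, 3, 2              # vocabulary size, embedding size, context positions
W_in  = np.random.rand(V, N)   # input weight matrix  W  (V x N)
W_out = np.random.rand(N, V)   # output weight matrix W' (N x V)

# One-hot encoded target word, e.g. "dinner"
target = np.zeros(V)
target[1] = 1

# Hidden layer: simply the embedding (row of W_in) of the target word
h = target @ W_in              # (N,)

# Output: a probability distribution over V, used for every context position
scores = h @ W_out
probs = np.exp(scores) / np.exp(scores).sum()
for c in range(C):
    print(f"context position {c}: {probs}")
```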

Negative Sampling

Training a neural network means taking a training sample and adjusting all of the neuron weights slightly so that the network predicts that training sample more accurately. In other words, each training sample tweaks all of the weights in the neural network.

As discussed above, the size of our word vocabulary means that our skip-gram neural network has an enormous number of weights, all of which would be updated slightly by every one of our billions of training samples!

Negative sampling addresses this by having each training sample modify only a small percentage of the weights, rather than all of them.

Negative Sampling Example

As we can see in the image above, the vanilla skip-gram model updates the complete weight matrix, whereas skip-gram with negative sampling updates only a few of the weights, which makes the process of creating an embedding much faster and more robust.
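
As a rough sketch of the idea (not the exact Word2Vec update rule), the loss for one (target, context) pair with k negative samples only touches the input embedding of the target word and the output rows of the positive word and the k sampled negatives:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V, N, k, lr = 5, 3, 2, 0.05        # vocab size, embedding size, negatives, learning rate
W_in  = np.random.rand(V, N)       # input (target) embeddings
W_out = np.random.rand(V, N)       # output (context) embeddings

target, positive = 1, 3            # indices of the target word and the true context word
negatives = np.random.choice([i for i in range(V) if i != positive], size=k, replace=False)

h = W_in[target]                   # embedding of the target word

# Only the rows for the target, the positive word, and the k negatives are updated.
grad_h = np.zeros(N)
for word, label in [(positive, 1.0)] + [(n, 0.0) for n in negatives]:
    score = sigmoid(h @ W_out[word])
    grad = score - label           # gradient of the binary cross-entropy loss
    grad_h += grad * W_out[word]
    W_out[word] -= lr * grad * h
W_in[target] -= lr * grad_h
```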

Creation of word embeddings in Python

I tried to create embeddings for the Hindi language using all the aforementioned techniques. I was able to create a fairly good embedding for the data I used.
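
The original code was shared as a gist; below is a minimal sketch of the kind of gensim call it refers to (the corpus and parameter values here are placeholders rather than the exact ones I used for the Hindi data):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenised sentences (my actual data was Hindi text).
sentences = [
    ["the", "dinner", "was", "good"],
    ["the", "dinner", "was", "amazing"],
]

# sg=1 selects the skip-gram architecture (sg=0 would select CBOW),
# negative=15 enables negative sampling with 15 negative samples.
# Note: vector_size is called `size` in gensim versions before 4.0.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                 sg=1, negative=15)

# Words that appear in similar contexts end up with similar vectors.
print(model.wv.most_similar("good"))
```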

Here sg=0 means we use the CBOW architecture and sg=1 means we use the skip-gram model. By default, sg=0, i.e. CBOW. negative=15 means that we use 15 negative samples.

Conclusion

The explanation above is only a basic one. It just gives you a high-level idea of what word embeddings are and how Word2Vec works. There is a lot more to it. For instance, to make the algorithm more computationally efficient, tricks like hierarchical softmax can also be used.

Thanks for reading!
