
Bag of Words Model (BOW) from Zero


In machine learning, deep learning, and statistical modeling, data is usually represented as numbers. In NLP in particular, we need to represent text numerically, and there are numerous ways to do so. One well-known technique is the Bag of Words model, usually abbreviated as BOW. We can quickly build a BOW model using libraries such as Scikit-Learn or NLTK.

For a beginner, though, I think it's better to understand the concept behind BOW first. So I will implement the BOW model from zero and explain what Bag of Words is.

What is 'Bag of Words'?

I think it's easier to grasp any concept with an example, so let's use one to demonstrate the BOW model.

I have chosen three sentences. Let's imagine that they form our corpus.

  1. I love to eat sushi.
  2. Omalka and I went to eat yakiniku.
  3. Do you eat sushi in the restaurant?

OK, now we have our corpus. But how can we create a model from letters? We cannot, because statistical and ML models can only deal with numeric data. So we have to find a way to convert the text above into numbers.

1) Tokenizing.

The first step in building a BOW model is to split the text into individual words; this is called tokenizing. Let's tokenize the sentences above and build a table (a short code sketch follows the table).

Sentence 1: I, love, to, eat, sushi
Sentence 2: Omalka, and, I, went, to, eat, yakiniku
Sentence 3: Do, you, eat, sushi, in, the, restaurant
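
Just as a rough sketch (not part of the original article's code), the same tokenization could be done with NLTK's word_tokenize. The sentence list below is only the toy corpus from above:

import nltk
nltk.download('punkt')  # tokenizer models used by word_tokenize
from nltk.tokenize import word_tokenize

# the three toy sentences from above
sentences = [
    "I love to eat sushi.",
    "Omalka and I went to eat yakiniku.",
    "Do you eat sushi in the restaurant?",
]

for sent in sentences:
    print(word_tokenize(sent))
# e.g. ['I', 'love', 'to', 'eat', 'sushi', '.'] for the first sentence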

2) Word frequency.

The next step is to create a table/dictionary that contains the words of our corpus together with their number of occurrences. We can take the words as dictionary keys and the frequencies as values.

Word Frequency
eat 3
I 2
to 2
sushi 2
love 1
omalka 1
and 1
went 1
yakiniku 1
do 1
you 1
in 1
the 1
restaurant 1

The table is sorted by frequency. We can see that every word of the corpus is included along with its frequency; for example, 'sushi' appears two times in the corpus, so its frequency is set to 2.

It seems easy, right? It only seems easy because we are doing it manually on three sentences. In real-life scenarios we have to deal with millions of sentences, and that is where programming comes in handy (a small counting sketch follows).
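
As a minimal sketch (again, not from the original walkthrough), the frequency dictionary for the toy corpus can be built with a plain Python dict:

# toy corpus from the earlier section
sentences = [
    "I love to eat sushi.",
    "Omalka and I went to eat yakiniku.",
    "Do you eat sushi in the restaurant?",
]

word_freq = {}
for sent in sentences:
    # crude cleanup: lowercase and strip the punctuation used in the toy sentences
    for word in sent.lower().replace('.', '').replace('?', '').split():
        word_freq[word] = word_freq.get(word, 0) + 1

# sort by frequency, highest first
print(sorted(word_freq.items(), key=lambda kv: kv[1], reverse=True))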

3) Sentence vectors.

Now it's time to create a matrix corresponding to the most frequent words. In a real situation, less frequent words would be dropped using some threshold value, since they are not very helpful.
Because our frequency dictionary here is tiny, we will not apply a threshold.

eat I to sushi love omalka and went yakiniku do you in the restaurant
Sentence 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
Sentence 2 1 1 1 0 0 1 1 1 1 0 0 0 0 0
Sentence 3 1 0 0 1 0 0 0 0 0 1 1 1 1 1

The first row of the table above is created from the first sentence. The word 'eat' occurs one time, so 1 is inserted. The word 'went' does not appear in the first sentence, so 0 is inserted. If a word occurred twice in the same sentence, we could insert 2. This is the concept behind the BOW model; a small sketch of building these vectors follows, and then we move on to the implementation part.
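
Here is a small self-contained sketch of that idea, assuming the toy sentences above and simple whitespace tokenization (the real implementation below uses NLTK instead):

# toy corpus from above
sentences = [
    "I love to eat sushi.",
    "Omalka and I went to eat yakiniku.",
    "Do you eat sushi in the restaurant?",
]

def tokens(sent):
    # lowercase and strip the punctuation used in the toy sentences
    return sent.lower().replace('.', '').replace('?', '').split()

# build the vocabulary, ordered by frequency (highest first)
word_freq = {}
for sent in sentences:
    for word in tokens(sent):
        word_freq[word] = word_freq.get(word, 0) + 1
vocab = sorted(word_freq, key=word_freq.get, reverse=True)

# one row per sentence, one column per vocabulary word, each cell an occurrence count
matrix = [[tokens(sent).count(word) for word in vocab] for sent in sentences]

print(vocab)
for row in matrix:
    print(row)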

Implementation.

First, we need a corpus. For this one, let's use the same paragraph as in our last article.

bow.py
import nltk
nltk.download('punkt')
import re
import numpy as np
from nltk.probability import FreqDist
from nltk.tokenize import sent_tokenize, word_tokenize
from heapq import nlargest
import pandas as pd

text_ ='''It is bordered on the west by the Sea of Japan, and extends from the Sea of Okhotsk in the north toward the East China Sea and Taiwan in the south. Part of the Ring of Fire, Japan spans an archipelago of 6852 islands covering 377,975 square kilometers (145,937 sq mi); the five main islands are Hokkaido, Honshu, Shikoku, Kyushu, and Okinawa. Tokyo is Japan's capital and largest city; other major cities include Yokohama, Osaka, Nagoya, Sapporo, Fukuoka, Kobe, and Kyoto.
Japan is the eleventh-most populous country in the world, as well as one of the most densely populated and urbanized. About three-fourths of the country's terrain is mountainous, concentrating its population of 125.57 million on narrow coastal plains. Japan is divided into 47 administrative prefectures and eight traditional regions. The Greater Tokyo Area is the most populous metropolitan area in the world, with more than 37.4 million residents.
Japan has been inhabited since the Upper Paleolithic period (30,000 BC), though the first mentions of the archipelago appear in Chinese chronicles from the 1st century AD. Between the 4th and 9th centuries, the kingdoms of Japan became unified under an emperor and his imperial court based in Heian-kyō. Beginning in the 12th century, political power was held by a series of military dictators (shōgun) and feudal lords (daimyō), and enforced by a class of warrior nobility (samurai). After a century-long period of civil war, the country was reunified in 1603 under the Tokugawa shogunate, which enacted an isolationist foreign policy. In 1854, a United States fleet forced Japan to open trade to the West, which led to the end of the shogunate and the restoration of imperial power in 1868. In the Meiji period, the Empire of Japan adopted a Western-styled constitution and pursued a program of industrialization and modernization. In 1937, Japan invaded China; in 1941, it entered World War II as an Axis power. After suffering defeat in the Pacific War and two atomic bombings, Japan surrendered in 1945 and came under a seven-year Allied occupation, during which it adopted a new constitution. Since 1947, Japan has maintained a unitary parliamentary constitutional monarchy with a bicameral legislature, the National Diet.
Japan is a great power and a member of numerous international organizations, including the United Nations (since 1956), the OECD, and the Group of Seven. Although it has renounced its right to declare war, the country maintains Self-Defense Forces that are ranked as the world's fourth-most powerful military. After World War II, Japan experienced high economic growth, becoming the second-largest economy in the world by 1990 before being surpassed by China in 2010. Despite stagnant growth since the Lost Decade, the country's economy remains the third-largest by nominal GDP and the fourth-largest by PPP. A leader in the automotive and electronics industries, Japan has made significant contributions to science and technology. Ranked the second-highest country on the Human Development Index in Asia after Singapore, Japan has the world's second-highest life expectancy, though it is experiencing a decline in population. The culture of Japan is well known around the world, including its art, cuisine, music, and popular culture, which encompasses prominent animation and video game industries.'''

Then we can use the NLTK library to split the text into sentences.

bow.py
corpus = nltk.sent_tokenize(text_)
print(corpus)

[Output screenshot: corpus.PNG]

Then we have to remove unwanted symbols. The amount of text cleaning needed depends on the corpus data.
I'm only doing simple cleaning here. I don't want any symbols in my dictionary, so I will remove them first. Then, before creating the dictionary, it's important to convert the text to lower case. For example, the words 'Eat' and 'eat' are the same, so we don't want them as separate keys.

bow.py
for i in range(len(corpus)):
    corpus[i] = re.sub(r'\W', ' ', corpus[i])   # replace non-word characters with spaces
    corpus[i] = re.sub(r'\s+', ' ', corpus[i])  # collapse multiple spaces into one
    corpus[i] = corpus[i].lower()               # lowercase

print(corpus)

[Output screenshot: corpus_sym.PNG]

Now it's time to build a frequency dictionary. We can use NLTK's FreqDist to do this.

bow.py
fdist = FreqDist()
for sent in corpus:
    for word in word_tokenize(sent):
        fdist[word] += 1

# just to show the output
pd.DataFrame(fdist, index=[0]).T.sort_values( by=0, ascending=False)

[Output screenshot: frq.PNG]

I mentioned earlier that less frequent words are not very helpful, so we can filter some of them out of our dictionary. We have 550 words; I will keep the 250 most frequent ones. We can achieve this with heapq's nlargest function.

bow.py
most_freq_words = nlargest(250, fdist, key=fdist.get)

print(most_freq_words)

[Output screenshot: mostfreq.PNG]

So now we can build our matrix, whose rows are also known as sentence vectors.

bow.py
sent_vecs = []
for sent in corpus:
    sent_tokens = word_tokenize(sent)
    sent_vec = []
    for token in most_freq_words:
        # 1 if the word appears in the sentence, 0 otherwise
        if token in sent_tokens:
            sent_vec.append(1)
        else:
            sent_vec.append(0)
    sent_vecs.append(sent_vec)

sentence_vectors = np.asarray(sent_vecs)

# just to show the output
pd.DataFrame(sentence_vectors)

[Output screenshot: sentvek.PNG]

This is the bag of words model. You can see there are 23 sentences as rows.

As I mentioned earlier, there are libraries that let us build this model with two or three lines of code; a quick sketch using scikit-learn follows.
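
For example, here is a minimal sketch using scikit-learn's CountVectorizer (not used in the walkthrough above) on the toy sentences. Note that its default tokenizer lowercases the text and drops single-character tokens such as 'I':

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love to eat sushi.",
    "Omalka and I went to eat yakiniku.",
    "Do you eat sushi in the restaurant?",
]

vectorizer = CountVectorizer()           # default settings: lowercase, word-level tokens
X = vectorizer.fit_transform(corpus)     # sparse matrix of word counts

print(vectorizer.get_feature_names_out())  # the vocabulary (get_feature_names() on older versions)
print(X.toarray())                         # one row per sentence, one column per word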

Let's meet again, with another language model.

*This article was written by @nuwan, a member of @qualitia_cdev.
