Here I’m going to build a very simple text summarizer based on statistical modeling.
Text Summarization
First, let’s see what text summarization is.
"Can you email me the summary of your report?"
Sounds familiar?
I faced this question frequently while doing my master's in Tsukuba.
Many people, students and professionals alike, have to deal with it.
In short, summarization means converting the original content into a shorter version while preserving its essential information and overall meaning.
We can divide summarization into two main categories.
1. Extractive Summarization
Identify the important sentences or phrases in the original text, then extract them to create a shorter version. In this process, we don’t create new sentences. We solely use the sentences from our original text.
2. Abstractive Summarization
Here we write new sentences based on the original text. First, we need to recognize the essential phrases and meanings. Then we express them in our own words to generate a shorter version of the original text.
I'm going to build a simple extractive summarization model. I'm hoping to continue writing about abstractive summarization models in the future.
The model is based on the sentence scoring method, and it is a very basic one.
First, we need to find an article to summarize. I will use a few paragraphs from Wikipedia that describe Japan.
It is bordered on the west by the Sea of Japan, and extends from the Sea of Okhotsk in the north toward the East China Sea and Taiwan in the south. Part of the Ring of Fire, Japan spans an archipelago of 6852 islands covering 377,975 square kilometers (145,937 sq mi); the five main islands are Hokkaido, Honshu, Shikoku, Kyushu, and Okinawa. Tokyo is Japan's capital and largest city; other major cities include Yokohama, Osaka, Nagoya, Sapporo, Fukuoka, Kobe, and Kyoto.
Japan is the eleventh-most populous country in the world, as well as one of the most densely populated and urbanized. About three-fourths of the country's terrain is mountainous, concentrating its population of 125.57 million on narrow coastal plains. Japan is divided into 47 administrative prefectures and eight traditional regions. The Greater Tokyo Area is the most populous metropolitan area in the world, with more than 37.4 million residents.
Japan has been inhabited since the Upper Paleolithic period (30,000 BC), though the first mentions of the archipelago appear in Chinese chronicles from the 1st century AD. Between the 4th and 9th centuries, the kingdoms of Japan became unified under an emperor and his imperial court based in Heian-kyō. Beginning in the 12th century, political power was held by a series of military dictators (shōgun) and feudal lords (daimyō), and enforced by a class of warrior nobility (samurai). After a century-long period of civil war, the country was reunified in 1603 under the Tokugawa shogunate, which enacted an isolationist foreign policy. In 1854, a United States fleet forced Japan to open trade to the West, which led to the end of the shogunate and the restoration of imperial power in 1868. In the Meiji period, the Empire of Japan adopted a Western-styled constitution and pursued a program of industrialization and modernization. In 1937, Japan invaded China; in 1941, it entered World War II as an Axis power. After suffering defeat in the Pacific War and two atomic bombings, Japan surrendered in 1945 and came under a seven-year Allied occupation, during which it adopted a new constitution. Since 1947, Japan has maintained a unitary parliamentary constitutional monarchy with a bicameral legislature, the National Diet.
Japan is a great power and a member of numerous international organizations, including the United Nations (since 1956), the OECD, and the Group of Seven. Although it has renounced its right to declare war, the country maintains Self-Defense Forces that are ranked as the world's fourth-most powerful military. After World War II, Japan experienced high economic growth, becoming the second-largest economy in the world by 1990 before being surpassed by China in 2010. Despite stagnant growth since the Lost Decade, the country's economy remains the third-largest by nominal GDP and the fourth-largest by PPP. A leader in the automotive and electronics industries, Japan has made significant contributions to science and technology. Ranked the second-highest country on the Human Development Index in Asia after Singapore, Japan has the world's second-highest life expectancy, though it is experiencing a decline in population. The culture of Japan is well known around the world, including its art, cuisine, music, and popular culture, which encompasses prominent animation and video game industries.
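The next thing is to set up the development environment. First, install the NLTK library (for example with pip install nltk). Then we can import the necessary libraries and download the NLTK data they rely on.
import re
from heapq import nlargest
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
# Tokenizer models for sent_tokenize and word_tokenize
# (recent NLTK releases may ask for 'punkt_tab' instead).
nltk.download('punkt')
# Stop-word lists for stopwords.words('english').
nltk.download('stopwords')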
The next step is to define the text. I will use the passage about Japan shown above.
text_ = ''' It is bordered on the west by the Sea of Japan, and extends from the Sea of Okhotsk in the north toward the East China Sea and Taiwan in the south. Part of the Ring of Fire, Japan spans an archipelago of 6852 islands covering 377,975 square kilometers (145,937 sq mi); the five main islands are Hokkaido, Honshu, Shikoku, Kyushu, and Okinawa. Tokyo is Japan's capital and largest city; other major cities include Yokohama, Osaka, Nagoya, Sapporo, Fukuoka, Kobe, and Kyoto.
Japan is the eleventh-most populous country in the world, as well as one of the most densely populated and urbanized. About three-fourths of the country's terrain is mountainous, concentrating its population of 125.57 million on narrow coastal plains. Japan is divided into 47 administrative prefectures and eight traditional regions. The Greater Tokyo Area is the most populous metropolitan area in the world, with more than 37.4 million residents.
Japan has been inhabited since the Upper Paleolithic period (30,000 BC), though the first mentions of the archipelago appear in Chinese chronicles from the 1st century AD. Between the 4th and 9th centuries, the kingdoms of Japan became unified under an emperor and his imperial court based in Heian-kyō. Beginning in the 12th century, political power was held by a series of military dictators (shōgun) and feudal lords (daimyō), and enforced by a class of warrior nobility (samurai). After a century-long period of civil war, the country was reunified in 1603 under the Tokugawa shogunate, which enacted an isolationist foreign policy. In 1854, a United States fleet forced Japan to open trade to the West, which led to the end of the shogunate and the restoration of imperial power in 1868. In the Meiji period, the Empire of Japan adopted a Western-styled constitution and pursued a program of industrialization and modernization. In 1937, Japan invaded China; in 1941, it entered World War II as an Axis power. After suffering defeat in the Pacific War and two atomic bombings, Japan surrendered in 1945 and came under a seven-year Allied occupation, during which it adopted a new constitution. Since 1947, Japan has maintained a unitary parliamentary constitutional monarchy with a bicameral legislature, the National Diet.
Japan is a great power and a member of numerous international organizations, including the United Nations (since 1956), the OECD, and the Group of Seven. Although it has renounced its right to declare war, the country maintains Self-Defense Forces that are ranked as the world's fourth-most powerful military. After World War II, Japan experienced high economic growth, becoming the second-largest economy in the world by 1990 before being surpassed by China in 2010. Despite stagnant growth since the Lost Decade, the country's economy remains the third-largest by nominal GDP and the fourth-largest by PPP. A leader in the automotive and electronics industries, Japan has made significant contributions to science and technology. Ranked the second-highest country on the Human Development Index in Asia after Singapore, Japan has the world's second-highest life expectancy, though it is experiencing a decline in population. The culture of Japan is well known around the world, including its art, cuisine, music, and popular culture, which encompasses prominent animation and video game industries.
'''
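Now we have to do some preprocessing. I’m keeping it very simple here: every character that is not an ASCII letter (digits, punctuation, special characters) is replaced with a space, and runs of whitespace are collapsed. How much preprocessing you need depends on the content.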
post_text_ = re.sub('[^a-zA-Z]', ' ', text_ )
post_text_ = re.sub(r'\s+', ' ', post_text_ )
print(post_text_)
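To see what these two substitutions do, here is a small sketch on a made-up sample string. The helper name `clean_text` and the sample are illustrative assumptions, and the final `.strip()` is an addition that the code above does not have. Note that the `[^a-zA-Z]` pattern also drops accented letters such as ō, so `Heian-kyō` ends up as `Heian ky`.

```python
import re


def clean_text(raw):
    """Replace every character that is not an ASCII letter with a space,
    then collapse runs of whitespace into a single space."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', raw)
    return re.sub(r'\s+', ' ', letters_only).strip()


# Made-up sample containing digits, punctuation, and an accented letter.
sample = "Japan spans 6852 islands (145,937 sq mi); its court was in Heian-kyō."
print(clean_text(sample))
```

If accented letters matter for your text, a wider character class would be needed; the simple pattern is kept here because it matches the preprocessing used in this article.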
The next step is to extract the sentences from the text. We can use the sent_tokenize function from the NLTK library here. Note that since post_text_ no longer contains periods, we have to use the original text_.
sentences = sent_tokenize(text_)
print(sentences)
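Why the raw text is needed for this step can be shown with a rough sketch. The regex splitter below is only a naive stand-in for `sent_tokenize` (an assumption for illustration; NLTK's tokenizer handles abbreviations and decimals far better), and `naive_split` is a hypothetical name. It still makes the point: once the periods are stripped, there are no sentence boundaries left to find.

```python
import re


def naive_split(text):
    """Split after '.', '!' or '?' followed by whitespace.
    Naive stand-in for sent_tokenize, for illustration only."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


raw = "Tokyo is the capital. Japan has 47 prefectures."
# Same cleaning as in the preprocessing step: keep ASCII letters only.
cleaned = re.sub(r'\s+', ' ', re.sub(r'[^a-zA-Z]', ' ', raw))

print(len(naive_split(raw)))      # two sentences in the raw text
print(len(naive_split(cleaned)))  # one run-on "sentence" after cleaning
```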
Now it’s time to remove stop words and build a word frequency dictionary. We can use NLTK's word_tokenize function to split the text into words. Our algorithm is a simple one based on word frequency. Note that we must use the preprocessed text here, lowercased so that it matches NLTK's lowercase stop-word list.
stop_words = set(stopwords.words('english'))
word_freq = {}
# Lowercase so words match the stop-word list and the lowercased sentences used later.
for word in word_tokenize(post_text_.lower()):
    if word not in stop_words:
        if word not in word_freq:
            word_freq[word] = 1
        else:
            word_freq[word] += 1
print(word_freq)
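The same counting can be written more compactly with `collections.Counter` from the standard library. This is a sketch on a toy token list: the tiny `toy_stop_words` set is a made-up stand-in for NLTK's English stop-word list, and `count_words` is a hypothetical helper name.

```python
from collections import Counter

# Made-up stand-in for stopwords.words('english'); the real list is much longer.
toy_stop_words = {"is", "the", "of", "and", "a"}


def count_words(tokens, stop_words):
    """Count lowercased tokens that are not stop words."""
    lowered = (t.lower() for t in tokens)
    return Counter(t for t in lowered if t not in stop_words)


tokens = "Tokyo is the capital of Japan and Japan is a country".split()
freq = count_words(tokens, toy_stop_words)
print(freq.most_common(1))  # [('japan', 2)]
```

Lowercasing before the stop-word check matters because the stop-word list is lowercase: without it, a capitalized "The" would slip through and be counted.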
We can turn the word frequencies into scores by dividing each frequency by the highest frequency.
max_freq = max(word_freq.values())
word_score = {}
for word in word_freq:
    word_score[word] = word_freq[word] / max_freq
print(word_score)
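As a worked example of this ratio, assume a toy frequency table (the numbers are made up). Each count is divided by the largest count, so the most frequent word gets 1.0 and every other word falls between 0 and 1.

```python
# Made-up counts for illustration only.
toy_freq = {"japan": 4, "island": 2, "city": 1}

max_count = max(toy_freq.values())
toy_score = {word: count / max_count for word, count in toy_freq.items()}

print(toy_score)  # {'japan': 1.0, 'island': 0.5, 'city': 0.25}
```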
In the next step, we rank the sentences using the word scores calculated in the previous step: a sentence's score is the sum of the scores of its words. We don’t want long sentences in a summary, so I skip sentences longer than 30 words.
sent_scores = {}
for sent in sentences:
    # Skip sentences longer than 30 words.
    if len(sent.split(' ')) > 30:
        continue
    for word in word_tokenize(sent.lower()):
        if word in word_score:
            # Add up the scores of the known words in the sentence.
            sent_scores[sent] = sent_scores.get(sent, 0) + word_score[word]
print(sent_scores)
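The scoring loop can also be wrapped in a small function. The following is a sketch under assumptions: `score_sentences` is a hypothetical name, the word scores are made up, and a simple regex replaces `word_tokenize` so that the snippet runs without NLTK data.

```python
import re


def score_sentences(sentences, word_score, max_words=30):
    """Sum the scores of known words in each sentence, skipping long sentences."""
    scores = {}
    for sent in sentences:
        if len(sent.split()) > max_words:
            continue  # too long for a summary
        # Regex tokenizer: an assumption standing in for word_tokenize.
        tokens = re.findall(r"[a-z]+", sent.lower())
        total = sum(word_score.get(tok, 0.0) for tok in tokens)
        if total > 0:
            scores[sent] = total
    return scores


# Made-up word scores and sentences for illustration.
toy_scores = {"japan": 1.0, "island": 0.5, "city": 0.25}
toy_sentences = [
    "Japan is an island country.",
    "Tokyo is the largest city.",
    "It rains a lot.",
]
print(score_sentences(toy_sentences, toy_scores))
# {'Japan is an island country.': 1.5, 'Tokyo is the largest city.': 0.25}
```

Sentences that contain no scored word get no entry at all, which mirrors the loop above.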
Now everything is in place; the only thing left is to produce the summary. The sent_scores dictionary maps each sentence to its score, so we can get the top 5 (or any N) sentences with the highest scores by using the heapq.nlargest() function.
summary_ = nlargest(5, sent_scores, key=sent_scores.get)
summary = ' '.join(summary_)
print(summary)
This is our summarized paragraph.
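One detail worth knowing: `nlargest` returns the sentences ordered by score, not by their position in the article, so the summary can read out of order. A possible refinement, shown here as a sketch with made-up sentences and scores, is to pick the top sentences first and then sort them back into document order.

```python
from heapq import nlargest

# Made-up sentences and scores for illustration.
toy_sentences = ["First sentence.", "Second sentence.", "Third sentence."]
toy_sent_scores = {
    "First sentence.": 0.4,
    "Second sentence.": 0.1,
    "Third sentence.": 0.9,
}

# Iterating a dict yields its keys, so nlargest ranks sentences by their score.
top = nlargest(2, toy_sent_scores, key=toy_sent_scores.get)
print(top)  # ['Third sentence.', 'First sentence.']

# Optional refinement: restore the original order before joining.
ordered = sorted(top, key=toy_sentences.index)
print(' '.join(ordered))  # First sentence. Third sentence.
```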
*This article was written by @nuwan, a member of @qualitia_cdev.