Hello again. Last time I wrote about how to score well in the Kaggle Titanic challenge.
In this article, let's look at the basic operations we need to know when dealing with NLP.
I assume everyone has a data science dev environment such as JupyterLab.
Here I'm going to use the nltk library. We can use it to perform NLP tasks like tokenization, stemming, and so on.
In this article we are going to cover the following operations:
1. Tokenization
2. Dealing with stop words
3. Stemming
4. Lemmatization
5. WordNet (finding synonyms and antonyms)
6. POS tagging
So the first step is to install the package.
!pip install nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')  # used later for POS tagging
1. Tokenization
Dividing text into tokens is called tokenization. Sentences can be tokenized into words, and paragraphs can be tokenized into sentences.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
text_ex = "Hello again, it's very dark here in evening due to pre winter season. After work I'm going to play league of legends for fun."
print(word_tokenize(text_ex))
# output
# ['Hello', 'again', ',', 'it', "'s", 'very', 'dark', 'here', 'in', 'evening', 'due', 'to', 'pre', 'winter', 'season', '.', 'After', 'work', 'I', "'m", 'going', 'to', 'play', 'league', 'of', 'legends', 'for', 'fun', '.']
print(sent_tokenize(text_ex))
# output
# ["Hello again, it's very dark here in evening due to pre winter season.", "After work I'm going to play league of legends for fun."]
2. Dealing with Stop Words
In most NLP projects, stop words are considered noise because they are too common and do not carry much information. But in some cases stop words can be useful.
So, let's take a look at how we can remove stop words from a corpus. NLTK provides 179 stop words for the English language.
from nltk.corpus import stopwords
print('Num of stop words: ', len(stopwords.words('english')))
print(stopwords.words('english'))
# output
# Num of stop words: 179
# ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
And this is how we can remove stop words from a text:
from nltk.corpus import stopwords
text = 'The COVID-19 pandemic has completely changed our lives'
text = word_tokenize(text)
text_with_no_stopwords = [word for word in text if word not in stopwords.words('english')]
print(text_with_no_stopwords)
# output
# ['The', 'COVID-19', 'pandemic', 'completely', 'changed', 'lives']
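Note that this comparison is case-sensitive, which is why the capitalized 'The' survives even though 'the' is in the stop-word list. A minimal sketch (my own variation, not from the original example) that lowercases tokens before filtering:
stop_words = set(stopwords.words('english'))  # a set makes membership checks faster
text = 'The COVID-19 pandemic has completely changed our lives'
tokens = word_tokenize(text)
text_with_no_stopwords = [w for w in tokens if w.lower() not in stop_words]
print(text_with_no_stopwords)
# output
# ['COVID-19', 'pandemic', 'completely', 'changed', 'lives']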
3. Stemming
In English there are lots of word variations created by adding affixes (suffixes and prefixes).
Stemming is the process that removes those affixes from the words and reduces them to word stems.
Example:
training → train
kids → kid
There are several stemming algorithms available in the nltk package; here we are going to use PorterStemmer.
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
text = "Programers program with programming languages"
words = word_tokenize(text)
for w in words:
    print(w, " : ", ps.stem(w))
# output
# Programers : program
# program : program
# with : with
# programming : program
# languages : languag
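Since the nltk package ships several stemmers, here is a small comparison sketch (not part of the original walkthrough) that runs SnowballStemmer and LancasterStemmer alongside PorterStemmer; the stems differ because each algorithm applies more or less aggressive rules:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
stemmers = {
    'porter': PorterStemmer(),
    'snowball': SnowballStemmer('english'),
    'lancaster': LancasterStemmer(),
}
words = ['programming', 'languages', 'training', 'kids']
for name, stemmer in stemmers.items():
    # each stemmer applies its own suffix-stripping rules
    print(name, [stemmer.stem(w) for w in words])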
4. Lemmatizing
In short, we can say this is a more advanced version of stemming. While stemming focuses on stripping affixes, lemmatization goes further and reduces a word to its dictionary form (its lemma). For example, the lemma of ‘was’ is ‘be’ and the lemma of ‘mice’ is ‘mouse’. Lemmatization uses a word’s part of speech (and, in full pipelines, the surrounding text) to determine the correct lemma; it does not categorize phrases.
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer = WordNetLemmatizer()
text = "Programers program with programming languages"
words = word_tokenize(text)
for w in words:
    print(w, " : ", lemmatizer.lemmatize(w))
# output
# Programers : Programers
# program : program
# with : with
# programming : programming
# languages : language
Notice the difference between the output of stemming and lemmatizing.
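One detail worth knowing (not shown in the original example): WordNetLemmatizer treats every word as a noun unless you pass the pos argument, which is why 'programming' was left unchanged above. A minimal sketch using the 'mice' and 'was' examples from earlier:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# by default every word is looked up as a noun
print(lemmatizer.lemmatize('mice'))          # mouse
# pass pos='v' so the verb lemma is returned
print(lemmatizer.lemmatize('was', pos='v'))  # be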
5. Finding Synonyms and Antonyms
WordNet is a kind of special dictionary built for natural language processing; in technical terms, it is a lexical database. We can use WordNet to get synonyms and antonyms for given words.
from nltk.corpus import wordnet
synonyms = []
antonyms = []
for syns in wordnet.synsets("good"):
    for i in syns.lemmas():
        synonyms.append(i.name())
        if i.antonyms():
            antonyms.append(i.antonyms()[0].name())
print(set(synonyms))
# output
# {'thoroughly', 'just', 'goodness', 'secure', 'safe', 'serious', 'expert', 'sound', 'undecomposed', 'in_force', 'unspoilt', 'dependable', 'adept', 'unspoiled', 'good', 'full', 'ripe', 'near', 'in_effect', 'practiced', 'effective', 'salutary', 'skilful', 'dear', 'trade_good', 'well', 'right', 'skillful', 'respectable', 'soundly', 'proficient', 'estimable', 'commodity', 'honorable', 'beneficial', 'upright', 'honest'}
print(set(antonyms))
# output
# {'ill', 'evil', 'bad', 'badness', 'evilness'}
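Since WordNet works like a dictionary, each synset also carries a short gloss and (sometimes) usage examples. This is a small illustrative sketch, not part of the original walkthrough:
from nltk.corpus import wordnet
syn = wordnet.synsets("good")[0]
print(syn.name())        # the sense identifier, e.g. 'good.n.01'
print(syn.definition())  # the dictionary-style gloss for this sense
print(syn.examples())    # example sentences, if any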
6. POS Tagging
Part-of-speech tagging is very useful in most NLP projects. It is the process of converting a sentence into a list of words and then into a list of tuples, where each tuple has the form (word, tag). The tag in this case is a part-of-speech tag, and it signifies whether the word is a noun, adjective, verb, and so on.
import nltk
from nltk.tokenize import word_tokenize
text = '''
Programers are program with programming languages
'''
words = word_tokenize(text)
print(nltk.pos_tag(words))
# output
# [('Programers', 'NNS'), ('are', 'VBP'), ('program', 'NN'), ('with', 'IN'), ('programming', 'NN'), ('languages', 'NNS')]
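If you are not sure what a tag like 'NNS' or 'VBP' means, nltk can print the Penn Treebank tag descriptions. This needs the 'tagsets' resource, so treat the extra download below as an assumption about your environment:
nltk.download('tagsets')       # tag documentation used by nltk.help
nltk.help.upenn_tagset('NNS')  # noun, common, plural
nltk.help.upenn_tagset('VBP')  # verb, present tense, not 3rd person singular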
Cheers....
*This article was written by @nuwan, a member of @qualitia_cdev.