Python — Introduction to Natural Language Processing Part 2

Hello! This is part 2 of Introduction to Natural Language Processing. In this post, I will explain smoothing, a method for estimating the probability of a sentence that contains words not found in the training text.

You can get the whole code here:

What is Smoothing?

To keep a language model from assigning zero probability to unseen events, we take some probability mass away from the more frequent events and give it to events we have never seen. This is called smoothing.
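To see why this matters, here is a minimal sketch of an unsmoothed bigram estimate. The toy corpus and the helper name `mle_bigram_prob` are made up for illustration and are not part of the original code.

```python
from collections import Counter

# Toy corpus, invented purely for this illustration
tokens = "the cat sat on the mat".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def mle_bigram_prob(w1, w2):
    """Unsmoothed estimate: count(w1, w2) / count(w1)."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

seen = mle_bigram_prob("the", "cat")    # this bigram occurs in the corpus
unseen = mle_bigram_prob("the", "dog")  # this bigram never occurs

print(seen, unseen)  # 0.5 0.0
```

Because a sentence probability is a product of bigram probabilities, a single zero like this makes the whole sentence probability zero.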

For more information about smoothing, you can check here:

To do smoothing in Python, we first replace the three least frequent words with the token “UNK”:

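def replaceLeastWordsWithUNK(text, word1, word2, word3):
    # Map each of the three least frequent words to the 'UNK' token
    replacements = {
        word1: 'UNK',
        word2: 'UNK',
        word3: 'UNK',
    }
    # text is assumed to be a list of word tokens; other words pass through unchanged
    newText = ' '.join(map(lambda w: replacements.get(w, w), text))
    return newText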
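The function above expects the three rare words as arguments. One possible way to find them is sketched below with `collections.Counter`; the helper `least_frequent_words` and the sample tokens are hypothetical and not part of the original code.

```python
from collections import Counter

def least_frequent_words(tokens, n=3):
    """Return n words with the lowest counts (hypothetical helper)."""
    counts = Counter(tokens)
    # most_common() sorts by descending count, so the tail holds the rarest words
    return [word for word, _ in counts.most_common()[-n:]]

# Sample tokens, invented for this illustration
tokens = "a a a b b b c c d e f".split()

rare = least_frequent_words(tokens)
replaced = ['UNK' if w in rare else w for w in tokens]

print(' '.join(replaced))  # a a a b b b c c UNK UNK UNK
```

When several words share the same low count, which of them end up among the three is arbitrary, so results on a real text may vary with token order.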

Then, we will create smoothed bigrams:

import re

def createSmoothingBigram(data):
    listOfSmoothedBigrams = []
    smoothedBigramCounts = {}
    smoothedUnigramCounts = {}
    # Split the text into word tokens and sentence-ending punctuation marks
    words = re.findall(r"[\w']+|[.!?]", data)

    for i in range(len(words)):
        # Every word except the last one starts a bigram
        if i < len(words) - 1:
            listOfSmoothedBigrams.append((words[i], words[i + 1]))
            if (words[i], words[i + 1]) in smoothedBigramCounts:
                smoothedBigramCounts[(words[i], words[i + 1])] += 1
            else:
                smoothedBigramCounts[(words[i], words[i + 1])] = 1
        # Count every word, including the last one
        if words[i] in smoothedUnigramCounts:
            smoothedUnigramCounts[words[i]] += 1
        else:
            smoothedUnigramCounts[words[i]] = 1

    calcSmoothingProb(data, listOfSmoothedBigrams, smoothedUnigramCounts, smoothedBigramCounts)
    return listOfSmoothedBigrams, smoothedUnigramCounts, smoothedBigramCounts
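The same counting can be written more compactly. The sketch below uses the same tokenizing pattern with `collections.Counter` and `zip`; the function name `count_ngrams` and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter

def count_ngrams(data):
    """Return the bigram list, unigram counts and bigram counts (hypothetical helper)."""
    words = re.findall(r"[\w']+|[.!?]", data)
    # Pair each word with its successor to form the bigrams
    bigrams = list(zip(words, words[1:]))
    return bigrams, Counter(words), Counter(bigrams)

bigrams, unigram_counts, bigram_counts = count_ngrams("I am Sam. Sam I am.")

print(len(bigrams))                 # 7
print(bigram_counts[("I", "am")])   # 2
print(unigram_counts["Sam"])        # 2
```

A `Counter` returns 0 for keys it has never seen, which removes the need for the explicit membership checks.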

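Now we will calculate the probabilities of the smoothed bigrams using add-k smoothing, which adds a constant k (smoothedValue, 0.5 here) to every bigram count and k times v to the unigram count in the denominator: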

def calcSmoothingProb(data, listOfSmoothedBigrams, smoothedUnigramCounts, smoothedBigramCounts):
    listOfProbSmoothedBigram = {}
    smoothedValue = 0.5  # the constant k added to every bigram count
    v = 10  # multiplier for k in the denominator (fixed at 10 here)

    for bigram in listOfSmoothedBigrams:
        word1 = bigram[0]
        word2 = bigram[1]
        # Add-k estimate: (count(word1, word2) + k) / (count(word1) + k * v)
        listOfProbSmoothedBigram[bigram] = (smoothedBigramCounts.get(bigram) + smoothedValue) / (
            smoothedUnigramCounts.get(word1) + smoothedValue * v)

    return listOfProbSmoothedBigram
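As a quick check of the arithmetic, here is a small worked example of the add-k formula with the same constants (k = 0.5 and 10 in the denominator). The function name `add_k_prob` and the counts are made up for illustration.

```python
def add_k_prob(bigram_count, unigram_count, k=0.5, vocab_size=10):
    """Add-k estimate: (count(w1, w2) + k) / (count(w1) + k * vocab_size)."""
    return (bigram_count + k) / (unigram_count + k * vocab_size)

# A bigram seen twice whose first word occurs three times
seen = add_k_prob(2, 3)    # (2 + 0.5) / (3 + 5)

# A bigram never seen, with the same first word
unseen = add_k_prob(0, 3)  # (0 + 0.5) / (3 + 5)

print(seen, unseen)  # 0.3125 0.0625
```

The unseen bigram now gets a small non-zero probability, while the seen one gives up a little of its mass. In add-k smoothing the value multiplied by k is normally the vocabulary size, in which case the probabilities of all possible next words sum to 1.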

Now, when a sentence is entered from the keyboard, any word in it that does not appear in the text is replaced with “UNK”:

def replaceWords(wordsChangeToBe, sentence):
    # sentence is a list of tokens and is modified in place
    for i in range(len(wordsChangeToBe)):
        for j in range(len(sentence)):
            if wordsChangeToBe[i] == sentence[j]:
                sentence[j] = 'UNK'
    return sentence
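The function above needs the list of words to replace. A possible alternative, sketched here under the assumption that the vocabulary of the text is available as a set, checks each token directly; the name `replace_unknown` and the sample vocabulary are hypothetical.

```python
def replace_unknown(sentence_tokens, vocabulary):
    """Return a new list where out-of-vocabulary tokens become 'UNK'."""
    return [w if w in vocabulary else 'UNK' for w in sentence_tokens]

# Sample vocabulary, invented for this illustration
vocabulary = {"i", "like", "tea", "UNK"}

result = replace_unknown("i like green tea".split(), vocabulary)

print(result)  # ['i', 'like', 'UNK', 'tea']
```

Set membership is a constant-time check on average, so this avoids the nested loop, and it returns a new list instead of changing the input.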

Now we are ready to calculate the probability of a sentence entered from the keyboard:

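probability = 1
smoothedProbabilityOfZeroBigrams = 0  # count used for bigrams never seen in the text
smoothedValue = 0.5
v = 10
sentence = input('Enter a sentence')

# listOfSmoothedBigramsofSentence is assumed to hold the bigrams of the entered
# sentence after the UNK replacement; the code that builds it is not shown here.
for i in listOfSmoothedBigramsofSentence:
    if i in listOfProbSmoothedBigram:
        # Seen bigram: use its precomputed add-k probability
        a = listOfProbSmoothedBigram[i]
        probability = probability * a
    else:
        # Unseen bigram: apply add-k with a bigram count of zero
        smoothedProbability = (smoothedProbabilityOfZeroBigrams + smoothedValue) / (
            smoothedUnigramCounts.get(i[0]) + smoothedValue * v)
        probability = probability * smoothedProbability
print('Probability of the sentence : ', probability)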
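For long sentences, multiplying many small probabilities can underflow to zero. A common workaround, sketched below, is to add log probabilities instead. The function name `sentence_log_prob` and the sample values are assumptions for illustration, not part of the original code.

```python
import math

def sentence_log_prob(bigrams, bigram_probs, unigram_counts, k=0.5, vocab_size=10):
    """Sum log probabilities, using add-k with a zero count for unseen bigrams."""
    total = 0.0
    for bigram in bigrams:
        if bigram in bigram_probs:
            p = bigram_probs[bigram]
        else:
            # Unseen bigram: (0 + k) / (count(first word) + k * vocab_size)
            p = k / (unigram_counts.get(bigram[0], 0) + k * vocab_size)
        total += math.log(p)
    return total

# Sample values, invented for this illustration
bigram_probs = {("i", "like"): 0.3125}
unigram_counts = {"i": 3, "like": 3}
sentence_bigrams = [("i", "like"), ("like", "UNK")]

log_p = sentence_log_prob(sentence_bigrams, bigram_probs, unigram_counts)
print(log_p, math.exp(log_p))
```

Exponentiating the result recovers the ordinary probability, here the product 0.3125 * 0.0625 up to floating-point rounding.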

Thank you for reading :)

Computer Engineer & Industrial Engineer. Passionate about software. Always eager to learn.
