Python — Introduction to Natural Language Processing Part 2

minelaydin
2 min read · Nov 18, 2020

Hello! This is part 2 of Introduction to Natural Language Processing. In this post, I will explain smoothing, a method used to estimate the probability of a sentence that may contain words that never appear in the training text.

You can get the whole code here:

https://github.com/minnela/IntroductionNaturalDataProcessing

What is Smoothing?

To prevent a language model from assigning zero probability to unseen events, we need to shave some probability mass from the more frequent events and give it to events we have never seen. This is called smoothing.
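To see why this matters, here is a small sketch on a made-up toy corpus (the corpus, the function names, and the counts are all illustrative, not from the post). The unsmoothed maximum-likelihood estimate assigns probability zero to any bigram that never occurred, which zeroes out the whole sentence; add-k smoothing keeps it small but nonzero:

```python
from collections import Counter

# Toy corpus: the bigram "cat ran" never occurs in it.
words = "the cat sat . the dog sat . the cat slept .".split()
bigram_counts = Counter(zip(words, words[1:]))
unigram_counts = Counter(words)

def mle_prob(w1, w2):
    # Maximum-likelihood estimate: count(w1 w2) / count(w1)
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

def add_k_prob(w1, w2, k=0.5):
    # Add-k smoothing: (count(w1 w2) + k) / (count(w1) + k * V)
    v = len(unigram_counts)  # vocabulary size
    return (bigram_counts[(w1, w2)] + k) / (unigram_counts[w1] + k * v)

print(mle_prob("cat", "ran"))    # 0.0 -- the unseen bigram kills the sentence
print(add_k_prob("cat", "ran"))  # 0.1 here (k=0.5, V=6): small but nonzero
```

Every bigram loses a little mass under add-k, and that mass is what the unseen bigrams receive.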

For more information about smoothing, you can check here: https://www.marekrei.com/pub/Machine_Learning_for_Language_Modelling_-_lecture2.pd

To do smoothing in Python, first we will replace the three least frequent words with the token “UNK”:

def replaceLeastWordsWithUNK(text, word1, word2, word3):
    # Map each of the three rarest words to the 'UNK' token.
    replacements = {
        word1: 'UNK',
        word2: 'UNK',
        word3: 'UNK',
    }
    newText = ' '.join(map(lambda w: replacements.get(w, w), text))
    createSmoothingBigram(newText)
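The function above expects the three rarest words to be passed in. One way to find them (a sketch with a hypothetical helper name, `threeLeastFrequentWords`, not from the post) is `collections.Counter`:

```python
from collections import Counter

def threeLeastFrequentWords(text):
    # Hypothetical helper: returns the three rarest words in a tokenized
    # text, ready to pass to replaceLeastWordsWithUNK.
    counts = Counter(text)
    # most_common() sorts from most to least frequent; take the tail.
    return [word for word, _ in counts.most_common()[-3:]]

tokens = "the cat sat on the mat near the old red door".split()
print(threeLeastFrequentWords(tokens))
```

Note that ties between equally rare words are broken arbitrarily, which is usually fine for this purpose.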

Then, we will create smoothed bigrams:

import re

def createSmoothingBigram(data):
    listOfSmoothedBigrams = []
    smoothedBigramCounts = {}
    smoothedUnigramCounts = {}
    # Tokenize into words (keeping apostrophes) and sentence punctuation.
    words = re.findall(r"[\w']+|[.!?]", data)

    for i in range(len(words) - 1):
        listOfSmoothedBigrams.append((words[i], words[i + 1]))
        if (words[i], words[i + 1]) in smoothedBigramCounts:
            smoothedBigramCounts[(words[i], words[i + 1])] += 1
        else:
            smoothedBigramCounts[(words[i], words[i + 1])] = 1
        if words[i] in smoothedUnigramCounts:
            smoothedUnigramCounts[words[i]] += 1
        else:
            smoothedUnigramCounts[words[i]] = 1

    calcSmoothingProb(data, listOfSmoothedBigrams, smoothedUnigramCounts, smoothedBigramCounts)
    return listOfSmoothedBigrams, smoothedUnigramCounts, smoothedBigramCounts
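As an aside, the counting loop above can be written more compactly with `zip` and `collections.Counter`; this sketch (with a made-up input string) produces the same counts:

```python
from collections import Counter
import re

data = "the cat sat . UNK ran away !"
words = re.findall(r"[\w']+|[.!?]", data)

# zip(words, words[1:]) yields each adjacent pair, i.e. the bigrams.
listOfSmoothedBigrams = list(zip(words, words[1:]))
smoothedBigramCounts = Counter(listOfSmoothedBigrams)
# The loop above only counts words[i] for i < len(words) - 1,
# so the last token is excluded from the unigram counts.
smoothedUnigramCounts = Counter(words[:-1])
```

Counter behaves like a dict whose missing keys default to 0, which is convenient when looking up unseen bigrams later.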

Now we will calculate the probabilities of the smoothed bigrams, using the add-k smoothing method:

def calcSmoothingProb(data, listOfSmoothedBigrams, smoothedUnigramCounts, smoothedBigramCounts):
    listOfProbSmoothedBigram = {}
    smoothedValue = 0.5  # k, the pseudo-count added to every bigram
    v = 10  # new sample set to be added

    for bigram in listOfSmoothedBigrams:
        word1 = bigram[0]
        # Add-k estimate: (count(bigram) + k) / (count(word1) + k * v)
        listOfProbSmoothedBigram[bigram] = (smoothedBigramCounts.get(bigram) + smoothedValue) / (
            smoothedUnigramCounts.get(word1) + smoothedValue * v)

    maxSmoothedBigram(data, listOfProbSmoothedBigram, listOfSmoothedBigrams, smoothedUnigramCounts)
    return listOfProbSmoothedBigram
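To make the formula concrete, here is the arithmetic with some made-up counts (the counts are hypothetical; `k` and `v` match the values used above):

```python
# Worked example of the add-k formula:
# P(w2 | w1) = (count(w1, w2) + k) / (count(w1) + k * v)
k = 0.5   # smoothedValue
v = 10    # size of the new sample set

bigram_count = 3    # hypothetical count of a bigram such as ("the", "cat")
unigram_count = 12  # hypothetical count of its first word

prob = (bigram_count + k) / (unigram_count + k * v)
print(prob)  # (3 + 0.5) / (12 + 5) = 0.2058...
```

Setting `bigram_count = 0` shows what an unseen bigram receives: `0.5 / 17`, which is small but never zero.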

Now, when we enter a sentence from the keyboard, if the sentence contains a word that does not appear in the text, we will replace that word with “UNK”:

def replaceWords(wordsChangeToBe, sentence):
    # Replace every out-of-vocabulary word in the sentence with 'UNK'.
    for i in range(len(wordsChangeToBe)):
        for j in range(len(sentence)):
            if wordsChangeToBe[i] == sentence[j]:
                sentence[j] = 'UNK'
    return sentence

Now we are ready to calculate the probability of a sentence entered from the keyboard:


probability = 1
smoothedProbabilityOfZeroBigrams = 0  # count of a bigram we never saw
smoothedValue = 0.5
v = 10
sentence = input('Enter a sentence: ')


for bigram in listOfSmoothedBigramsofSentence:
    try:
        # Bigram was seen in the training text: use its smoothed probability.
        probability = probability * listOfProbSmoothedBigram[bigram]
    except KeyError:
        # Unseen bigram: its count is zero, so the add-k estimate becomes
        # (0 + k) / (count(word1) + k * v).
        smoothedProbability = (smoothedProbabilityOfZeroBigrams + smoothedValue) / (
            smoothedUnigramCounts.get(bigram[0]) + smoothedValue * v)
        probability = probability * smoothedProbability
print('Probability of the sentence:', probability)

Thank you for reading :)
