Some analysts predict that in the next few years chatbots will be handling as much as 90 percent of customer service interactions, and digital technologies are helping us build interactive solutions to day-to-day problems.
You have probably seen chatbots in mobile and desktop applications, answering your queries interactively.
What is a chatbot?
A chatbot is an artificial intelligence application that tries to understand a customer's needs and then assists them with a particular task, such as a transaction, form submission, or booking.
What are the types of chatbots?
There are two types of chatbots:
1.) Rule-Based: The bot answers questions based on the rules it was trained on, but it cannot handle complex queries.
2.) Self-Learning: The bot uses machine learning approaches, which are more efficient than the rule-based approach.
Self-learning bots are classified further as retrieval bots and cognitive bots.
Retrieval bots have no cognitive ability; based on the context and messages of the conversation, they select the best response from a set of predefined responses. The response can be selected by anything from rule-based if-else conditions to machine learning clusterers and classifiers.
Cognitive bots have cognitive ability: they can generate answers rather than select them, and can provide different answers to the same question. These are the more intelligent bots.
Chatbot Architecture
Input can come from mobile, IoT, web, and voice applications, and is processed by the chatbot server using NLP, NLU, and NLG components.
In this article, we will build a simple retrieval chatbot in Python to understand how a chatbot works.
Installation
1.) Python 3.6+
2.) NLTK
3.) scikit-learn
Let's briefly look at NLP and the NLTK library; they will help us perform the language processing tasks.
NLP (Natural Language Processing)
NLP helps us structure knowledge from human language so that we can perform tasks such as tokenization, stemming, lemmatization, relationship extraction, and topic segmentation.
NLTK
NLTK (Natural Language Toolkit) is a platform for building applications that work with human language. It provides easy-to-use interfaces to resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
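As a quick taste of the toolkit, here is a minimal sketch (the sample sentence is made up) that downloads the pre-trained tokenizer models and splits a sentence into word tokens:
import nltk
nltk.download('punkt')  # pre-trained tokenizer models

from nltk.tokenize import word_tokenize
# Split a sample sentence into word tokens
print(word_tokenize("NLTK makes text processing easy."))
# ['NLTK', 'makes', 'text', 'processing', 'easy', '.']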
Downloading and installing NLTK
Install Anaconda with Python 3.7.
Create an environment named chatbox.
Install the NLTK and scikit-learn libraries in the chatbox environment.
Download the source code from here
or clone the repository (git clone https://github.com/Sourabhsethi/Building-a-simple-chatbot-in-python.git).
Open the project in Spyder.
Run nltk.download('wordnet') in the console to download the WordNet resource.
Run the chatbot.py file to run the project.
Before moving to the code, let's go through the text processing concepts and terms that will help us understand the code better.
Text Pre-Processing
1.) Case conversion: Convert text to lower case (or upper case) so the algorithm does not treat the same word in different cases as different words.
2.) Tokenization: Convert the raw text into a list of tokens.
3.) The NLTK data package includes a pre-trained Punkt tokenizer for English. We also remove noise, i.e. everything that is not a standard letter or number.
4.) Stop word removal: Stop words are the most common words in a language and usually carry little information, so we remove them.
5.) Stemming: The process of reducing inflected (or sometimes derived) words to their stem. For example, "stems", "stemming", and "stemmed" all reduce to the single stem "stem".
6.) Lemmatization: The process of mapping different forms of a word to a common base form. For example, "run" is the base form of "running" and "ran", and "better" and "good" share the same lemma, so they are treated as the same word. A short sketch tying these steps together follows this list.
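Here is a minimal sketch of these steps using NLTK (the sample sentence is made up, and exact outputs may vary slightly between NLTK versions):
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # common-word lists
nltk.download('wordnet')    # lemmatizer data

text = "The bots were running and ran better than expected"
tokens = nltk.word_tokenize(text.lower())                            # lowercase + tokenize
tokens = [t for t in tokens if t not in stopwords.words('english')]  # remove stop words
print([PorterStemmer().stem(t) for t in tokens])                     # stemming
print([WordNetLemmatizer().lemmatize(t, pos='v') for t in tokens])   # lemmatization (as verbs)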
Bag of Word Model
To apply machine learning, we first need to transform the text into a meaningful vector of numbers. Bag of Words is a representation of text that describes the occurrence of words within a document. The main idea behind Bag of Words is that documents are similar if they have similar content.
For example, if our dictionary contains the words {Chatterbot, is, the, great, application}, and we want to vectorise the text “Chatterbot is great”, we would have the following vector: (1, 1, 0, 1, 0).
Dictionary: {Chatterbot, is, the, great, application}
Query: "Chatterbot is great"
Vector: (1, 1, 0, 1, 0)
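As a quick illustration, scikit-learn's CountVectorizer builds such vectors for us. The two-document corpus below is made up; note that CountVectorizer lowercases text and drops single-character tokens by default, and get_feature_names_out requires scikit-learn 1.0 or later:
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Chatterbot is a great application", "Chatterbot is great"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                       # word counts per document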
TF-IDF Model
The problem with the Bag of Words approach is that highly frequent words start to dominate the document representation but may not carry much "informational content". Another approach is to rescale the frequency of a word by how often it appears across all documents, so that the scores of words like "an" that are frequent in every document are penalized. This approach is called Term Frequency-Inverse Document Frequency (TF-IDF).
Term Frequency: is a scoring of the frequency of the word in the current document.
TF = (Number of times term t appears in a document)/(Number of terms in the document)
Inverse Document Frequency: is a scoring of how rare the word is across documents.
IDF = 1+log(N/n), where, N is the number of documents and n is the number of documents a term t has appeared in.
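For example (an illustrative calculation), if a term appears 3 times in a 100-term document, TF = 3/100 = 0.03, and if it appears in 10 out of 100 documents, IDF = 1 + log(100/10) = 1 + log(10). The TF-IDF weight of the term is simply the product of the two.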
The TF-IDF weight is often used in information retrieval and text mining. It is a statistical measure of how important a word is to a document in a collection or corpus.
TF-IDF is implemented in scikit-learn as: from sklearn.feature_extraction.text import TfidfVectorizer
You can check the syntax and usage here.
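Here is a minimal sketch on a made-up three-sentence corpus (scikit-learn's built-in IDF formula differs slightly from the one above, but the idea is the same):
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["chatbots answer questions",
          "chatbots book tickets",
          "humans answer questions too"]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # the vocabulary
print(tfidf_matrix.toarray().round(2))     # words shared across documents get lower weights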
Cosine Similarity
TF-IDF is a transformation applied to texts to get real-valued vectors in a vector space. We can then obtain the cosine similarity of any pair of vectors by taking their dot product and dividing it by the product of their norms. Using this formula we can find the similarity between any two documents d1 and d2.
Cosine Similarity(d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||)
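A minimal sketch of this formula with NumPy, reusing the Bag of Words vector from earlier (the second vector is made up):
import numpy as np

def cosine_sim(d1, d2):
    # dot product divided by the product of the vector norms
    return np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))

v1 = np.array([1, 1, 0, 1, 0])  # "Chatterbot is great"
v2 = np.array([1, 1, 1, 0, 1])  # a hypothetical second document
print(cosine_sim(v1, v2))       # about 0.577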
Hurray! Now that we have some idea of the NLP process, let's dig into the code.
Import Libraries
import nltk
import warnings
warnings.filterwarnings("ignore")
# nltk.download() # for downloading packages
import numpy as np
import random
import io
import string # to process standard python strings
Corpus
We will use some information about chatbots, which is available in the code in a text file named 'data.txt'. However, you can use any corpus of your choice.
Read from file
We will read in the corpus file and convert the entire corpus into a list of sentences and a list of words for further pre-processing.
readFile = io.open('data.txt', 'r', errors='ignore')
data = readFile.read()
lowerData = data.lower()  # convert to lowercase
#nltk.download('punkt')   # first-time use only # for downloading packages
#nltk.download('wordnet') # first-time use only # for downloading packages
sentences_tokens = nltk.sent_tokenize(lowerData)  # convert to a list of sentences
word_tokens = nltk.word_tokenize(lowerData)       # convert to a list of words
sentences_tokens[:2]
word_tokens[:5]
Pre-processing the raw text
We shall now define a function called LemTokens which will take as input the tokens and return normalized tokens.
lemmer = nltk.stem.WordNetLemmatizer()

def LemTokens(tokens):
    # Lemmatize each token (e.g. "users" -> "user")
    return [lemmer.lemmatize(token) for token in tokens]

# Map every punctuation character to None so translate() strips it
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    # Lowercase, strip punctuation, tokenize, then lemmatize
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
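A quick (hypothetical) check of what LemNormalize produces:
print(LemNormalize("The bots are helping users!"))
# punctuation is stripped and plural nouns are lemmatized, e.g. "users" -> "user"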
Keyword Matching
Next, let's look at a greeting function for the chatterbot: if a user's input is a greeting, the bot returns a greeting response.
INPUTS = ("hello", "hi", "greetings", "good morning", "good evening","good afternoon",)
RESPONSES = ["hi", "hello",]
# Checking for greetings
def greeting(sentence):
"""If user's input is a greeting, return a greeting response"""
for word in sentence.split():
if word.lower() in INPUTS:
return random.choice(RESPONSES)
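A quick sanity check (hypothetical inputs):
print(greeting("hi there"))      # returns "hi" or "hello" at random
print(greeting("good morning"))  # whole-sentence match for multi-word greetings
print(greeting("weather?"))      # returns None, so the bot falls back to the corpus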
Response
The response will be generated using the concept of document similarity, so we begin by importing the necessary modules.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
This will be used to find the similarity between words entered by the user and the words in the corpus.
We define a function response which compares the user's utterance with the sentences in the corpus and returns the most similar one as the reply.
# Generating response
def response(user_response):
    chatterBox_response = ''
    # Temporarily add the user's input so it is vectorized along with the corpus
    sentences_tokens.append(user_response)
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sentences_tokens)
    # Similarity of the user's input (the last row) against every sentence
    vals = cosine_similarity(tfidf[-1], tfidf)
    # Index of the second-highest score (the highest is the input itself)
    idx = vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if req_tfidf == 0:
        chatterBox_response = chatterBox_response + "I am sorry! I am not able to get you"
        return chatterBox_response
    else:
        chatterBox_response = chatterBox_response + sentences_tokens[idx]
        return chatterBox_response
Finally, we feed in the lines that the bot says when starting and ending a conversation, depending on the user's input.
flag = True
print("ChatterBox: My name is ChatterBox. I will answer your queries about ChatterBox. If you want to exit, say bye or thanks!")
while flag == True:
    user_response = input()
    user_response = user_response.lower()
    if user_response != 'bye':
        if user_response == 'thanks' or user_response == 'thank you':
            flag = False
            print("ChatterBox: You are welcome..")
        else:
            if greeting(user_response) != None:
                print("ChatterBox: " + greeting(user_response))
            else:
                print("ChatterBox: ", end="")
                print(response(user_response))
                # Remove the user's input so the corpus stays clean for the next turn
                sentences_tokens.remove(user_response)
    else:
        flag = False
        print("ChatterBox: Bye! have a nice day..")
Conclusion
This is a simple bot without cognitive ability, but it is a good starting point for NLP and for understanding how chatbots work. You can extend this example to build your own bots.
Happy learning and coding!