#### Naive Bayes Demystified

__What is Naive Bayes:__ It is a probabilistic classifier based on Bayes’ theorem. In simpler terms, it uses the probabilities of word occurrences, learned from training data, to decide which class a piece of text belongs to. For example, it can check whether a new email is spam, whether a new document contains adult content, or what sentiment a statement expresses.

**Facts about Naïve Bayes**

– It is not considered one of the best classification methods, but it is still widely used because it is easy to implement and its results are often not far from those of other methods.

– It assumes that the features (words) used in classification are independent of each other.

– Its probability estimates are of low quality; however, its classification decisions are quite good.

**Types of Naive Bayes and Use**

**Implementation**

Implementing Naive Bayes requires two main parts.

First is **Feature Selection**, which tells you what features (words) to use when calculating the probability that your new document is of a certain type.

Second is the **Classifier**, which calculates the probability of each class for your new document.

For example, suppose you trained on 30 IT, 30 Bank and 40 Manufacturing documents. Feature selection finds a subset of words from the documents that helps determine whether a document belongs to the IT, Bank or Manufacturing category. When you feed in the next document, the algorithm looks up the document's words among these features (the subset of words) and calculates how likely each category is, for example 40% IT, 10% Manufacturing and 1% Bank.

**Classifier (Detailed for Nerds)**

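The Naive Bayes classifier assumes that the features used in the classification are independent.

It uses the words of the document to assign it to the appropriate class, using the decision rule

$$\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} \log P(t_i \mid c) \right]$$

where $c$ ranges over the classes and $t_1, \dots, t_n$ are the tokens (words) of the document.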

The first term is the log of the prior probability of the class within the set of classes. For example, if you have 50 training documents spread over IT, Manufacturing, Bank etc., the prior for IT is the number of IT documents divided by 50. To this it adds, for each word in the document, the log of the probability of that word given the class (for example, how likely the word ‘computer’ is to appear when we know the document is IT).
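As a rough illustration of the decision rule, here is a minimal Python sketch. The function name `classify` and the probability tables are hypothetical and the numbers are made up; only the priors (0.3, 0.3, 0.4) follow the 30/30/40 training split used as the running example. It assumes every token already has a non-zero probability in every class.

```python
import math

def classify(tokens, priors, likelihoods):
    """Return (best_class, scores) using log prior + sum of log likelihoods.

    priors:      {class: P(class)}
    likelihoods: {class: {token: P(token | class)}}
    Both tables are assumed to be estimated already and to contain no zeros.
    """
    scores = {}
    for cls, prior in priors.items():
        score = math.log(prior)
        for tok in tokens:
            score += math.log(likelihoods[cls][tok])
        scores[cls] = score
    # The class with the largest log-score wins.
    best = max(scores, key=scores.get)
    return best, scores

# Illustrative numbers only; they do not come from real data.
priors = {"IT": 0.3, "Bank": 0.3, "Manufacturing": 0.4}
likelihoods = {
    "IT": {"computer": 0.20, "loan": 0.01},
    "Bank": {"computer": 0.02, "loan": 0.25},
    "Manufacturing": {"computer": 0.03, "loan": 0.02},
}

best, scores = classify(["computer", "computer", "loan"], priors, likelihoods)
print(best, scores)
```

Working in logs turns the product of many small probabilities into a sum, which avoids numerical underflow on long documents.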

For example, to decide whether a new document belongs to the IT, Banking or Manufacturing sector, the algorithm goes through the words in the document and works out, for each category, how probable the document is under that category. But when a word never occurs in the training documents of a class, its estimated probability for that class is zero, and in the formula above Log(0) is undefined. To handle this, we add 1 to each word count before computing Prob(token|class) (add-one, or Laplace, smoothing), so that no probability is ever zero.
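The add-one idea can be sketched as follows. This is a minimal illustration and not a production implementation: the helper names `train` and `predict` and the toy documents are hypothetical, and skipping words that never appeared in training is just one possible choice.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (tokens, label) pairs. Returns log priors, log likelihoods, vocabulary."""
    class_docs = Counter(label for _, label in docs)
    token_counts = defaultdict(Counter)
    for tokens, label in docs:
        token_counts[label].update(tokens)
    vocab = {tok for tokens, _ in docs for tok in tokens}

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        # Add-one smoothing: each vocabulary word gets count + 1,
        # so the denominator grows by the vocabulary size.
        denom = sum(token_counts[c].values()) + len(vocab)
        log_like[c] = {t: math.log((token_counts[c][t] + 1) / denom) for t in vocab}
    return log_prior, log_like, vocab

def predict(tokens, log_prior, log_like, vocab):
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are skipped (an assumption of this sketch).
        scores[c] = log_prior[c] + sum(log_like[c][t] for t in tokens if t in vocab)
    return max(scores, key=scores.get)

# Toy training set, invented for illustration.
docs = [
    (["computer", "software", "network"], "IT"),
    (["computer", "server"], "IT"),
    (["loan", "interest", "account"], "Bank"),
    (["machine", "factory", "assembly"], "Manufacturing"),
]
log_prior, log_like, vocab = train(docs)
label = predict(["computer", "loan", "server"], log_prior, log_like, vocab)
print(label)
```

Note that ‘loan’ never appears in an IT training document, yet its smoothed probability for IT is 1/15 rather than 0, so the log stays finite.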

**Feature Selection (Detailed for Nerds)**

It’s a way to select a specific subset of the terms in the training data and use only those in classification.

__Chi Square:__ It uses the chi-square test (the score Y below) to determine whether two events are independent. In our case, the two events are the occurrence of a specific term and membership of a specific class. For example, it checks how strongly the presence of the word ‘computer’ and the document being an IT document depend on each other. It calculates a score; a high score means high dependence.

**X = total number of documents**

**X1 = number of documents of the class that contain the term (e.g. IT documents with the word ‘computer’)**

**X2 = number of documents of other classes that do not contain the term (non-IT documents without the word ‘computer’)**

**X3 = number of documents of the class that do not contain the term (IT documents without the word ‘computer’)**

**X4 = number of documents of other classes that contain the term (non-IT documents that have the word ‘computer’)**

For example, if you have 100 documents (30 IT, 30 Bank and 40 Manufacturing), it checks for the word ‘computer’, which, say, exists in 23 out of 30 IT documents, 5 out of 30 Bank documents and 5 out of 40 Manufacturing documents. So,

X = 100 (total documents)

X1 = 23 (IT documents with the word)

X2 = 60 (70 non-IT documents minus the 10 that have the word)

X3 = 7 (IT documents without the word, 30 − 23)

X4 = 10 (5 Bank + 5 Manufacturing)

**Formula to calculate:**

$$Y = \frac{X \,(X_1 X_2 - X_3 X_4)^2}{(X_1 + X_3)(X_1 + X_4)(X_2 + X_3)(X_2 + X_4)}$$

The four factors in the denominator are the row and column totals: 30 IT documents, 33 documents with the word, 67 documents without the word and 70 non-IT documents. With these numbers, Y = 100 × (23 × 60 − 7 × 10)² / (30 × 33 × 67 × 70) ≈ 36.96.

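Y is the chi-square score: the higher it is, the more strongly the term and the class depend on each other, so the terms with the highest scores are the ones selected as features.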
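The chi-square score can be sketched in a few lines of Python. The function name `chi_square` and its parameter names are hypothetical; the inputs are the four cells of the term/class table, which for the ‘computer’/IT example are 23 IT documents with the word, 10 non-IT documents with the word, 7 IT documents without it and 60 non-IT documents without it (100 − 23 − 10 − 7).

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square score for one term and one class.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class that do not contain the term
    n00: documents outside the class that do not contain the term
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    # An empty row or column makes the score undefined; return 0 in that case.
    return numerator / denominator if denominator else 0.0

score = chi_square(n11=23, n10=10, n01=7, n00=60)
print(round(score, 2))
```

For feature selection you would compute this score for every term and class and keep the terms with the highest values; where to put the cut-off is a design choice this sketch does not make.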

__Mutual Information:__ It is one of the most common methods. It measures how much a term contributes to making the correct classification.

For example, if we know that a document contains a particular word such as ‘computer’, it measures how much that word contributes when we classify a new (input) document as an IT document.

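**X = total number of documents**

**X1 = number of documents of the class that contain the term (e.g. IT documents with the word ‘computer’)**

**X2 = number of documents of other classes that do not contain the term (non-IT documents without the word ‘computer’)**

**X3 = number of documents of the class that do not contain the term (IT documents without the word ‘computer’)**

**X4 = number of documents of other classes that contain the term (non-IT documents that have the word ‘computer’)**

For example, if you have 100 documents (30 IT, 30 Bank and 40 Manufacturing), it checks for the word ‘computer’, which, say, exists in 23 out of 30 IT documents, 5 out of 30 Bank documents and 5 out of 40 Manufacturing documents. So,

X = 100 (total documents)

X1 = 23 (IT documents with the word) (N11)

X2 = 60 (non-IT documents without the word, 70 − 10) (N00)

X3 = 7 (IT documents without the word, 30 − 23) (N01)

X4 = 10 (5 Bank + 5 Manufacturing documents with the word) (N10)

X5 = X1 + X3 = 30 (IT documents)

X6 = X2 + X4 = 70 (non-IT documents)

X7 = X1 + X4 = 33 (documents with the word)

X8 = X2 + X3 = 67 (documents without the word)

**Formula to calculate:**

$$MI = \frac{X_1}{X}\log\frac{X \cdot X_1}{X_7 X_5} + \frac{X_4}{X}\log\frac{X \cdot X_4}{X_7 X_6} + \frac{X_3}{X}\log\frac{X \cdot X_3}{X_8 X_5} + \frac{X_2}{X}\log\frac{X \cdot X_2}{X_8 X_6}$$

Each term pairs one cell of the table with its two totals (with or without the word, IT or non-IT). The base of the logarithm only changes the units; base 2 gives the result in bits.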
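A minimal Python sketch of the mutual information score is given here, under the same assumptions as the example: the function name `mutual_information` is hypothetical, the inputs are the four cells of the term/class table (23, 10, 7 and 60 for ‘computer’ and IT), and base-2 logarithms are an assumed choice of units.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between a term and a class.

    n11: in class, term present      n10: not in class, term present
    n01: in class, term absent       n00: not in class, term absent
    """
    n = n11 + n10 + n01 + n00
    # Each entry: (cell count, total for that term value, total for that class value)
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for count, term_total, class_total in cells:
        if count > 0:  # an empty cell contributes nothing (0 * log 0 is taken as 0)
            mi += (count / n) * math.log2(n * count / (term_total * class_total))
    return mi

mi = mutual_information(n11=23, n10=10, n01=7, n00=60)
print(mi)
```

When the term and the class are independent, every ratio inside the logarithm is 1 and the score is 0; the more the term tells you about the class, the larger the score.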

Using these formulas, you can code the Naive Bayes algorithm to determine the classification of new documents.

**References**

http://blog.datumbox.com/developing-a-naive-bayes-text-classifier-in-java/

https://en.wikipedia.org/wiki/Naive_Bayes_classifier