Using the Naïve Bayes Machine Learning Algorithm to design a smart filter that automates removal of irrelevant responses and spam
TheBlueyed is a talent acquisition platform focused on high-potential candidates. Candidates go through a six-stage verification process before their dossier is generated and they are pitched to employers. In Stage 3, candidates are asked questions about their work and are expected to write 500 words for each field. These fields are prone to spam and irrelevant responses. We employed the Naïve Bayes machine learning algorithm to filter out these irrelevant responses. This system resulted in 95% spam elimination.
Naïve Bayes Supervised Learning Algorithm
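Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of independence between every pair of features. Given a class variable $y$ and a dependent feature vector $x_1$ through $x_n$, Bayes’ theorem states the following relationship:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y) \, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$$

Using the naive independence assumption that

$$P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y)$$

for all $i$, this relationship is simplified to

$$P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}$$

Since $P(x_1, \dots, x_n)$ is a constant given the input, we can use the following classification rule:

$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y) \quad \Rightarrow \quad \hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

and we can use Maximum A Posteriori (MAP) estimation to estimate $P(y)$ and $P(x_i \mid y)$; the former is then the relative frequency of class $y$ in the training set.

The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of $P(x_i \mid y)$.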
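To make the classification rule concrete, the following sketch scores each class by its prior multiplied by the per-feature likelihoods and picks the class with the higher score. It works in log space to avoid underflow. The class names, the three binary features, and every probability are made-up values for illustration, not figures from our system.

```python
import math

# Hypothetical priors P(y): relative frequency of each class in a training set.
priors = {"relevant": 0.7, "irrelevant": 0.3}

# Hypothetical likelihoods P(x_i = 1 | y) for three binary features.
likelihoods = {
    "relevant": {"mentions_project": 0.8, "has_link": 0.1, "repeated_text": 0.05},
    "irrelevant": {"mentions_project": 0.2, "has_link": 0.6, "repeated_text": 0.5},
}


def log_score(features, label):
    """Return log P(y) + sum_i log P(x_i | y) for one class."""
    score = math.log(priors[label])
    for name, present in features.items():
        p = likelihoods[label][name]
        score += math.log(p if present else 1.0 - p)
    return score


def classify(features):
    """Pick the class with the highest unnormalized log posterior."""
    return max(priors, key=lambda label: log_score(features, label))


response = {"mentions_project": False, "has_link": True, "repeated_text": True}
print(classify(response))
```

The denominator of Bayes’ theorem never appears in the code because it is the same for both classes and does not affect which score is larger.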
Why we chose Naïve Bayes Supervised Learning Algorithm
Naive Bayes classifiers work quite well in many real-world situations, notably document classification and spam filtering. They require only a small amount of training data to estimate the necessary parameters.
Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.
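One way to see the benefit of estimating each feature separately is to count parameters. The sketch below compares, per class, a naive Gaussian model (one mean and one variance per feature) with a Gaussian that uses an unrestricted covariance matrix. The full-covariance model is our own point of comparison, chosen for illustration.

```python
def naive_gaussian_params(n_features):
    """One mean and one variance per feature, each estimated independently."""
    return 2 * n_features


def full_gaussian_params(n_features):
    """A mean vector plus the distinct entries of a symmetric covariance matrix."""
    return n_features + n_features * (n_features + 1) // 2


for n in (10, 100, 1000):
    print(n, naive_gaussian_params(n), full_gaussian_params(n))
```

The naive count grows linearly with the number of features while the full-covariance count grows quadratically, which is why the naive model can be fitted from far fewer training examples.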
Using Naïve Bayes Supervised Algorithm to detect irrelevant responses
In Stage 3, after every response is recorded, the response appears on a dashboard where the admin marks it as either relevant or irrelevant.
Over time, this produces a growing set of recorded responses labeled as relevant or irrelevant.
When a new response is recorded, a Gaussian Naïve Bayes classifier trained on these labeled responses marks it as either relevant or irrelevant.
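The labeling and classification flow described above can be sketched end to end in plain Python. This is a minimal from-scratch Gaussian naive Bayes, not our production code: the two features (word count and share of distinct words), the sample responses, and the function names are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def extract_features(text):
    """Hypothetical features: word count and share of distinct words."""
    words = text.lower().split()
    if not words:
        return [0.0, 0.0]
    return [float(len(words)), len(set(words)) / len(words)]


def fit(samples, labels, eps=1e-9):
    """Estimate the prior and a per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    model = {}
    for y, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps keeps the variance strictly positive when a feature is constant.
        variances = [
            sum((v - m) ** 2 for v in col) / n + eps
            for col, m in zip(columns, means)
        ]
        model[y] = (n / len(samples), means, variances)
    return model


def predict(model, x):
    """Return the class maximizing log P(y) + sum_i log N(x_i; mean, var)."""
    def log_posterior(y):
        prior, means, variances = model[y]
        total = math.log(prior)
        for v, m, var in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)


# Made-up responses, as if already marked by the admin on the dashboard.
labeled = [
    ("I led the database migration and cut query latency", "relevant"),
    ("I led the migration of the billing service and I cut the query latency", "relevant"),
    ("My team built a recommendation engine and I owned the evaluation of "
     "the ranking model for the search product", "relevant"),
    ("buy now buy now buy now buy now", "irrelevant"),
    ("asdf asdf asdf asdf", "irrelevant"),
    ("click here click here click here", "irrelevant"),
]
model = fit([extract_features(t) for t, _ in labeled], [y for _, y in labeled])

new_response = "free money free money free money"
print(predict(model, extract_features(new_response)))
```

Each newly labeled response can be appended to `labeled` and the model refitted, since fitting only requires recomputing counts, means, and variances.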
Accuracy of the Algorithm
On average, the accuracy score is 0.97.
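For reference, an accuracy score is the fraction of test responses whose predicted label matches the label assigned by the admin, and the misclassification rate is its complement. The sketch below uses made-up label lists, and the helper name `fraction_correct` is hypothetical.

```python
def fraction_correct(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# 100 made-up test responses: 97 predicted correctly, 3 misclassified.
true_labels = ["relevant"] * 60 + ["irrelevant"] * 40
predicted = (
    ["relevant"] * 58 + ["irrelevant"] * 2      # 2 relevant responses missed
    + ["irrelevant"] * 39 + ["relevant"] * 1    # 1 irrelevant response let through
)

accuracy = fraction_correct(true_labels, predicted)
error_rate = 1.0 - accuracy
print(f"accuracy={accuracy:.2f}, misclassified={error_rate:.2%}")
```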
Combating Information Loss
With an accuracy of 0.97, roughly 3% of responses are misclassified, which amounts to a 3% information loss. This loss reduces with more entries, as the algorithm is able to predict better with more labeled data.
Meanwhile, to combat information loss, we do not delete any response, and we carry out manual verification regularly.
After employing the Naïve Bayes supervised learning algorithm to detect irrelevant responses, TheBlueyed has saved 600 hours of work per month, the equivalent of freeing 4 employees to focus on other important work.
import sys
from time import time

sys.path.append("../tools/")
from email_preprocess import preprocess

# features_train and features_test are the features for the training and
# testing datasets, respectively; labels_train and labels_test are the
# corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()

# Train the model using the training data and sort authentic stories in the test data
# Predicting
pred = clf.fit(features_train, labels_train).predict(features_test)

# Calculating accuracy
accuracy = clf.score(features_test, labels_test)