There are three branches of machine learning; one of them is called "classification".
What is classification?
Classification is a supervised learning technique that learns from existing categorised documents (i.e. the training data set) and tries to predict the category of previously unseen data.
Examples include predicting diseases, filtering spam email and detecting fraudulent bank transactions.
What is supervised learning?
Supervised learning is a machine learning technique in which the system is given a training data set together with the correct results (labels) and builds its concepts from them. The Naive Bayes classifier is one example.
As humans, we have probably been doing supervised learning without realising it. We do not open mails with the subject line "YOU WON THE LOTTERY" or "CHEAP MEDICINES". From prior experience, we know that these words in the subject line indicate that the email is spam. The words need not appear in the same sequence every time; the sequence keeps changing, but we have seen enough emails with similar wording to recognise them.
Supervised learning functions in a similar manner. To build a classifier, for example an email spam classifier, we train it using data that has already been labelled as "Spam" or "Non-Spam", and then use that classifier to make predictions on unseen emails.
The following steps are involved in building a classifier:
1) Get/build the training set
To build a classifier, we need training data that is similar to the actual data to be classified. A point to note here is that the classifier can only be as good as its training data. For example, for an email spam classifier we will require the subject lines and their labels (spam/non-spam).
2) Select the features/dimensions
Once we have the data set, we need to select the features/dimensions that will be used to build the classification model. For example: a) for an email spam filter, it could be the words in the subject line; b) for bank transactions, it could be the amount, account number, location of the transaction, etc.
3) Dimension reduction/data preparation
Once we have identified the dimensions, we need to bring the data into a format that can be used with the algorithm. We can then further split the input data set into training and test data sets.
4) Build and train the classifier
Build a classifier and train it using the training data set.
5) Validate
Once we have the classifier ready, run it on the test data set and verify that it works well. If not, we might have to change the selected model or features.
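The five steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the Mahout implementation discussed in this post: the subject lines in LABELLED are made up for the example, the function names (tokenize, split, train, predict, accuracy) are hypothetical, and the model is a simple multinomial Naive Bayes with Laplace smoothing written with the standard library only.

```python
import math
import random
from collections import Counter, defaultdict

# Step 1: hypothetical, hand-made training data: (subject line, label).
LABELLED = [
    ("you won the lottery", "spam"),
    ("cheap medicines online", "spam"),
    ("win cheap lottery tickets", "spam"),
    ("claim your lottery prize now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project status report", "ham"),
    ("lunch on monday", "ham"),
    ("status of the quarterly report", "ham"),
]

def tokenize(subject):
    """Step 2: use the lower-cased words of the subject line as features."""
    return subject.lower().split()

def split(data, test_fraction=0.25, seed=0):
    """Step 3: shuffle the rows and split them into training and test sets."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def train(rows):
    """Step 4: count words per label for a multinomial Naive Bayes model."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for subject, label in rows:
        label_counts[label] += 1
        for word in tokenize(subject):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def predict(model, subject):
    """Return the label with the highest log-probability (Laplace smoothing)."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior of the label
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in tokenize(subject):
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, rows):
    """Step 5: fraction of rows whose predicted label matches the true label."""
    correct = sum(predict(model, subject) == label for subject, label in rows)
    return correct / len(rows)

if __name__ == "__main__":
    train_rows, test_rows = split(LABELLED)
    model = train(train_rows)
    print("accuracy on test rows:", accuracy(model, test_rows))
    print("prediction:", predict(model, "cheap lottery tickets"))
```

With so few rows the measured accuracy means very little; the point is only to show where each step fits. A real classifier would need far more labelled data, which is the reason for the remark that a classifier can only be as good as its training data.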
Here is an email classifier built on Mahout. It uses a free email data set whose emails are already classified into spam and non-spam (termed ‘ham’).
We can use Mahout to build the Naive Bayes classifier to classify the emails.
On the test data set, with ham treated as the positive class:
382 records were ham and were correctly classified as ham. (True positive)
69 records were spam and were correctly classified as spam. (True negative)
1 record was spam, but it was classified as ham. (False positive)
1 record was ham, but it was classified as spam. (False negative)
This confusion matrix shows that the classifier classified the test data set with 99.5585% accuracy (451 of 453 records correct).
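The accuracy figure can be checked directly from the four counts. The short sketch below redoes the arithmetic and also computes the per-class recall, which the original results do not report; the variable names are chosen for this example only.

```python
# Counts taken from the results reported above.
ham_as_ham = 382    # ham correctly classified as ham
spam_as_spam = 69   # spam correctly classified as spam
spam_as_ham = 1     # spam wrongly classified as ham
ham_as_spam = 1     # ham wrongly classified as spam

correct = ham_as_ham + spam_as_spam
total = correct + spam_as_ham + ham_as_spam
accuracy = correct / total  # 451 / 453

# Per-class recall: share of each actual class that was labelled correctly.
ham_recall = ham_as_ham / (ham_as_ham + ham_as_spam)
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)

print(f"accuracy    = {accuracy:.4%}")
print(f"ham recall  = {ham_recall:.4%}")
print(f"spam recall = {spam_recall:.4%}")
```

Only 70 of the 453 test emails are spam, so a single misclassified spam email affects spam recall more than it affects overall accuracy. For unbalanced data sets like this one, it is worth looking at the per-class figures as well as the accuracy.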