Machine Learning – K Nearest Neighbours Algorithm
In this tutorial, we’ll explore an algorithm called K Nearest Neighbours. This is a widely used machine learning algorithm.
Remember some of the common tasks in ML include:
- Dimensionality Reduction
In classification, the algorithm must learn to predict outcomes which are classes based on one or more features.
Whether a movie will be a hit or flop
Deciding the category of a hurricane
Rating employees as High Performer, Average performer in performance appraisals
Whether a person will say YES to a date
Example of a classification algorithm is Naive Bayes
In regression, the algorithm must learn to predict the values of a response variable based on one or more features.
How much money a movie will make at the box office
The expected wind speed of a hurricane
Products sold by an employee
How many glasses of wine were consumed on a date
Example of regression algorithm is Simple Linear Regression
Both classification and regression are supervised learning tasks. This means we already have labeled data and our model learns from the training set based on the label and predicts outcomes for unseen data.
K-Nearest Neighbors (KNN) algorithm can be used for both classification and regression.
KNN is widely used in the real world in a variety of applications, including search and recommender systems
The K in KNN is a number. Example 3 Nearest neighbours. This number is a representation of training instances in a metric space.
Example, Singapore is most similar to which of the below countries. Brazil, United States, Hong Kong, Slovenia, Japan, Syria, Australia, Italy, Malaysia
You can start to evaluate this on various features. Population size, geographical mass, GDP, housing price etc.,
If we used just the population size and per capita income, which country would Singapore be most similar to? Maybe Hong Kong?
Well, we are data people, so we never speculate. We use data. K Nearest neighbours could help in this case.
We need a way to measure the distance on the features. Example, if we used population size, then we would have
From this, we can see that distance is the lowest for Hong Kong on the population parameter.
What if we add another feature to compare making it a 2-dimensional space. For example, per capita income.
For classification tasks, a set of tuples of feature vectors and class labels comprise the training set. KNN is a capable of binary, multi-class, and multi-label classification. Lets look at an example of binary classification.
The simplest KNN classifiers use the mode of the KNN labels to classify test instances, but other strategies can be used. Mode is the most frequently occurring instance.
Lets say we have a situation wherein we need to predict gender based on number of selfies. We have a 1 feature and corresponding labels.
Lets say we want to predict the gender and the only data we have is that a person took 155 selfies.
Lets look at the distance from each of the instances to the new instance.
If we were to set k as 3, then we see that out of 3 closest instances, 2 are male and one is female. So the new person with 155 selfies is likely to be a male based on the mode.
The k is often set to an odd number to prevent ties.
Lets say we have one more feature, number of friends. Now we have a 2 dimensional space
Plotting it on a graph gives us
We are now using features from two explanatory variables to predict the value of the response variable. KNN is not limited to two features; the algorithm can use an arbitrary number of features, but more than three features cannot be visualized.
From the plot we can see that women, denoted by the orange markers, tend to take more selfies and have more friends than men. This observation is probably consistent with your experience. Now let’s use KNN to predict whether a person with a given number of selfies and number of friends is a man or a woman.
Let’s assume that we want to predict the sex of a person who has 155 selfies and who has 70 friends. First, we must define our distance measure.
In this case, we will use Euclidean distance, the straight line distance between points in a Euclidean space. Euclidean distance in a two-dimensional space is given by the following formula:
The distance between two points on a graph can also be calculated using the pythogorus theorem. Lets look at an example.
In algebraic terms, a² + b² = c² where c is the hypotenuse while a and b are the legs of the triangle.
The horizontal distance between the points is 4 and the vertical distance is 3. Let’s introduce one more point (-2, -1). With this small addition we get a right-angled triangle with legs 3 and 4. By the Pythagorean theorem, the square of the hypotenuse is (hypotenuse)² = 3² + 4². Which gives the length of the hypotenuse as 5, same as the distance between the two points according to the distance formula.
Applying the distance formula on the table gives us:
We will set k to 5 and select the three nearest training instances. Which are the 5 nearest neighbours. We see that out of the 5, 3 are classified as male. So we predict that the person is male