
Category : Data Science


Technology Singularity


In machine learning, we provide data with features and labels to the algorithms. For example, a set of records about individuals containing features such as age, gender, marital status, height, weight and nationality, mapped to a label of income. An algorithm loops through the data and learns the combinations of factors that influence a person's income. The machine is capable of factoring in numerous features and processing data at massive scales. The objective is that the model can predict the incomes of people based on a set of features.

This underlying principle of features and labels applies to all supervised machine learning algorithms. It enables us to build systems such as autonomous cars, medical diagnosis, facial recognition, chat bots and so on. Because these systems are built on data that we already have, the machines are essentially mimicking human behaviour to begin with.

As a result, existing systems are very efficient at doing a narrow task well. A machine can beat the world's top chess player, and the same goes for the game of Go. A machine can detect faces in a crowd in an instant, or answer a specific query satisfactorily.

At the rate progress is being made, it is only a matter of time before machines go beyond these narrow tasks. True AI is when machines can start doing broader tasks through decision making. For example, a camera detects a face, checks it against a database of criminal records, and then sends a message to the nearest police station. The possibilities for what the machine can do become impossible to predict. This is the stage of Technology Singularity, a term used to describe the point at which AI becomes smarter than human intelligence.

If machines get smarter than humans, a lot of our world's problems may be solved instantly. On the other side of the coin, if machines determine that humans are the root cause of all problems, then we really need to worry.

Machine Learning – K Nearest Neighbours Algorithm

In this tutorial, we’ll explore an algorithm called K Nearest Neighbours. This is a widely used machine learning algorithm.

Remember some of the common tasks in ML include:

  • Classification
  • Regression
  • Clustering
  • Dimensionality Reduction



In classification, the algorithm must learn to predict outcomes which are classes based on one or more features.


  • Whether a movie will be a hit or a flop
  • Deciding the category of a hurricane
  • Rating employees as High Performer or Average Performer in performance appraisals
  • Whether a person will say YES to a date

An example of a classification algorithm is Naive Bayes.

In regression, the algorithm must learn to predict the values of a response variable based on one or more features.



  • How much money a movie will make at the box office
  • The expected wind speed of a hurricane
  • Products sold by an employee
  • How many glasses of wine will be consumed on a date

An example of a regression algorithm is Simple Linear Regression.

Both classification and regression are supervised learning tasks. This means we already have labeled data and our model learns from the training set based on the label and predicts outcomes for unseen data.

K-Nearest Neighbors (KNN) algorithm can be used for both classification and regression.

KNN is widely used in the real world in a variety of applications, including search and recommender systems.

The K in KNN is a number, for example 3 nearest neighbours. It is the number of nearest training instances in a metric space that the algorithm considers.

For example: Singapore is most similar to which of the following countries? Brazil, United States, Hong Kong, Slovenia, Japan, Syria, Australia, Italy, Malaysia.


You can start to evaluate this on various features: population size, geographical area, GDP, housing prices and so on.

If we used just the population size and per capita income, which country would Singapore be most similar to? Maybe Hong Kong?


Well, we are data people, so we never speculate; we use data. K Nearest Neighbours can help in this case.

We need a way to measure the distance on the features. Example, if we used population size, then we would have


From this, we can see that distance is the lowest for Hong Kong on the population parameter.


What if we add another feature to compare, making it a 2-dimensional space? For example, per capita income.


For classification tasks, the training set comprises a set of tuples of feature vectors and class labels. KNN is capable of binary, multi-class, and multi-label classification. Let's look at an example of binary classification.


The simplest KNN classifiers use the mode of the K nearest labels to classify test instances, but other strategies can be used. The mode is the most frequently occurring value.

Let's say we have a situation wherein we need to predict gender based on the number of selfies taken. We have one feature and corresponding labels.




Let's say we want to predict the gender, and the only data we have is that a person took 155 selfies.



Let's look at the distance from each of the instances to the new instance.



If we set k to 3, then we see that of the 3 closest instances, 2 are male and one is female. So the new person with 155 selfies is likely to be male, based on the mode.

The k is often set to an odd number to prevent ties.
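The mode-of-neighbours rule above can be sketched in a few lines of Python. The training values below are invented for illustration, since the original table did not survive; the post only tells us that the query is 155 selfies and that 2 of the 3 nearest instances are male.

```python
def knn_predict(training, query, k=3):
    """Classify `query` by the mode of the k nearest labels.

    `training` is a list of (feature_value, label) tuples.
    """
    # Sort training instances by absolute distance to the query, keep k.
    nearest = sorted(training, key=lambda t: abs(t[0] - query))[:k]
    labels = [label for _, label in nearest]
    # Mode: the most frequently occurring label among the neighbours.
    return max(set(labels), key=labels.count)

# Hypothetical training data: (number of selfies, gender).
data = [(100, "male"), (120, "male"), (160, "male"),
        (210, "female"), (240, "female"), (300, "female")]

print(knn_predict(data, 155, k=3))  # -> male: the 3 nearest are 160, 120, 100
```

With k = 3 the three nearest values to 155 are all labelled male, so the mode, and therefore the prediction, is male.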

Let's say we have one more feature: number of friends. Now we have a 2-dimensional space.



Plotting it on a graph gives us



We are now using features from two explanatory variables to predict the value of the response variable. KNN is not limited to two features; the algorithm can use an arbitrary number of features, but more than three features cannot be visualized.

From the plot we can see that women, denoted by the orange markers, tend to take more selfies and have more friends than men. This observation is probably consistent with your experience. Now let’s use KNN to predict whether a person with a given number of selfies and number of friends is a man or a woman.

Let’s assume that we want to predict the sex of a person who has 155 selfies  and who has 70 friends. First, we must define our distance measure.



In this case, we will use Euclidean distance, the straight-line distance between points in a Euclidean space. Euclidean distance in a two-dimensional space is given by the following formula:

d = √((x₂ − x₁)² + (y₂ − y₁)²)




The distance between two points on a graph can also be calculated using the Pythagorean theorem. Let's look at an example.

In algebraic terms, a² + b² = c² where c is the hypotenuse while a and b are the legs of the triangle.



The horizontal distance between the points is 4 and the vertical distance is 3. Let's introduce one more point, (−2, −1). With this small addition we get a right-angled triangle with legs 3 and 4. By the Pythagorean theorem, (hypotenuse)² = 3² + 4², which gives the length of the hypotenuse as 5, the same as the distance between the two points according to the distance formula.
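A quick check of this example in Python. The two endpoints are taken here to be (−2, 2) and (2, −1), an assumption consistent with the stated horizontal distance of 4, vertical distance of 3, and right-angle corner at (−2, −1):

```python
import math

def euclidean(p, q):
    """Straight-line distance between two 2-D points."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

a, b = (-2, 2), (2, -1)
legs = (abs(b[0] - a[0]), abs(b[1] - a[1]))       # (4, 3)
hypotenuse = math.sqrt(legs[0] ** 2 + legs[1] ** 2)

print(legs, hypotenuse, euclidean(a, b))          # (4, 3) 5.0 5.0
```

The Pythagorean hypotenuse and the distance formula agree, as they must: the distance formula is the Pythagorean theorem applied to the horizontal and vertical legs.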


Applying the distance formula on the table gives us:


We will set k to 5 and select the five nearest training instances. We see that of the 5, 3 are classified as male. So we predict that the person is male.
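The two-feature version can be sketched the same way. The rows below are hypothetical, since the post's actual table did not survive; they are chosen so that 3 of the 5 nearest neighbours of (155 selfies, 70 friends) are male, matching the conclusion above.

```python
import math

def predict_gender(training, query, k=5):
    """Classify `query` by the mode of the labels of the k nearest rows."""
    nearest = sorted(
        training,
        key=lambda row: math.dist(row[0], query),  # Euclidean distance
    )[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

# Hypothetical rows: ((selfies, friends), gender).
rows = [((120, 40), "male"), ((140, 55), "male"), ((150, 60), "male"),
        ((180, 90), "female"), ((200, 100), "female"), ((230, 120), "female")]

print(predict_gender(rows, (155, 70), k=5))  # -> male: 3 of the 5 nearest rows
```

Nothing changes in the algorithm as features are added; only the distance computation now runs over more dimensions.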

Data Science is not just for those with PhDs

It is a common misconception that data science is a field only for PhDs. The field of data science is a vast and welcoming playground for anyone with an aptitude for solving problems using data.

Data Science is the intersection of three key areas:


  • Domain knowledge
  • Computer programming
  • Math and statistics

Domain knowledge is just the knowledge of the industry or field one is working in. A financial analyst would have knowledge about the stock market. A hotel manager would have knowledge about the hospitality industry. A sales manager would have knowledge about what factors influence buying.


Computer programming is the ability to use code to solve various types of problems. Many of today’s programming languages have a syntax closely resembling general English. Picking up programming is not as difficult as it used to be.


Math and statistics is the use of equations and formulas to perform analysis. Many of these concepts are from school or college days. Of course, one can dive deep into the various theorems and derivations, but many of today's freely available tools handle these complexities behind the scenes.


In order to gain knowledge from data, one must be able to utilize computer programming to access the data, understand the mathematics behind the models we derive, and above all, understand our analyses’ place in the domain we are in.


When starting off in the field of data science, one would rely on their core strengths depending on which area they come from. Someone with a background in programming would be exploring the application of code to derive mathematical models. Someone with a background in math would be trying out coding. Someone with domain knowledge would be creating hypotheses to test using code.


The intersection of math and programming leads to what is referred to as machine learning. However, one needs to be able to explicitly generalize models to a domain so that a business problem can be addressed. Any business problem can be represented as a mathematical equation. Math and coding are vehicles that allow data scientists to step back and apply their skills virtually anywhere.

Hadoop and MapReduce – Simplified explanation

The most common use case for Hadoop is when you need to process large volumes of data with many machines. For example, let's say we end up with 500 million transactions. Each transaction contains a bunch of information such as date and time, item purchased, location, customer information, payment method, weather, humidity, the top headlines that day and so on.

The first obvious thought you may have is: why not just use a relational database (RDB)? Conventionally, relational databases have been used to store data. However, you are going to run into many problems. The most obvious one is cost: as the volume of data increases, the cost rises exponentially.

The second problem you will need to address is dealing with unstructured data. How do we store images, videos and the like in an RDB?

This is where technologies such as Hadoop and NoSQL come in.

What is Hadoop all about?

At its core, Hadoop is about processing large volumes of data with many machines. You would start by splitting your data so that every machine gets a subset. Imagine we have a data set of 500 million transactions. We would split the data into subsets of 100 million records each and pass one subset to each of 5 machines.


Question: from the 500 million transactions, what is the average sales per day?

To answer this question, the data is split into subsets and sent to individual machines. We now need to take the records available on each machine, identify the specific column to evaluate, and do some computations. In technical jargon, we call this Extract, Transform and Load (ETL).

Sometimes it becomes necessary to move data across different machines to arrange it in some order, for example moving the sales column before the tax column. This is called Shuffling: moving intermediate results between machines.

The next step is to aggregate the intermediate results. Results from the different machines are aggregated and stored.

This overall approach allows us to process more data than can be done on a single machine. In a way, this is similar to the divide-and-conquer mechanism of processing data.
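The split-process-aggregate idea can be sketched in plain Python. Real Hadoop distributes the subsets across machines; here each "machine" is just a function call over one chunk of an in-memory list, and the sales figures are made-up stand-ins for the 500 million transactions.

```python
def split(records, n_machines):
    """Deal records into n_machines roughly equal subsets."""
    return [records[i::n_machines] for i in range(n_machines)]

def map_phase(subset):
    """Each machine returns a partial (sum, count) for its subset."""
    return (sum(subset), len(subset))

def reduce_phase(partials):
    """Aggregate the partial results into the overall average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

sales = [10, 20, 30, 40, 50, 60]            # toy stand-in for the transactions
partials = [map_phase(s) for s in split(sales, 3)]
print(reduce_phase(partials))               # 35.0, same as sum(sales)/len(sales)
```

Note that the map phase returns (sum, count) pairs rather than local averages: averaging the averages would be wrong when subsets have different sizes, so each machine passes up just enough information for the reducer to finish the job.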


MapReduce Algorithm

At the heart of Hadoop technology is the MapReduce algorithm. This is a combination of two commonly used functions, map and reduce. These two functions are common to most modern high level languages.

For example, in JavaScript (ES6), the map function traverses an array from left to right, invoking a callback function on each element. The value returned by each callback becomes the corresponding element in the new array. After all elements have been traversed, map() returns the new array with all the translated elements.

The reduce function in ES6 is used when you want to find a cumulative or concatenated value based on the elements across the array. Like map(), it traverses the array from left to right, invoking a callback function on each element. The value returned is the cumulative value passed from callback to callback. After all elements have been traversed, reduce() returns the cumulative value.
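The same pair of functions exists in Python, which we'll use for a quick sketch: the built-in map plays the role of the ES6 method described above, and functools.reduce folds the sequence into a single cumulative value.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# map: transform each element; the callback's return values form the new list.
squared = list(map(lambda x: x * x, numbers))       # [1, 4, 9, 16]

# reduce: fold left to right; the accumulator is passed from call to call.
total = reduce(lambda acc, x: acc + x, numbers, 0)  # 10

print(squared, total)
```

The third argument to reduce (here 0) is the initial accumulator value, just like the optional initialValue in the ES6 version.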

In our example of processing 500 million transactions, the extract layer addresses the "Map" portion of the algorithm, and the aggregate layer addresses the "Reduce" portion.


Illustration of how MapReduce on Hadoop works

Imagine we have a very large file that contains records of items sold in a super market. Each line in the text file represents a given day.

What we would like to find out is how many GE bulbs, how many Hershey's bars, how many perfumes and so on were sold overall.


The first thing to do is break the file into lines. Then we extract each of the items and generate key-value pairs, the value being the number of times the item appears in a line. This is the "Map" element of the algorithm, wherein we step through each item.


The next stage of processing is called the Shuffle. We move all the entries so that similar items end up in one place: all of the Hershey's bars in one place, all of Steph's perfume in one place and so on.


The final step is to aggregate each set. This is called Reduce. We find that we have 3 Hershey's bars, 1 Colgate brush, 3 Cleo's food, 3 Steph's perfume and 2 Samsung TVs.


Generally speaking, the Map and Reduce parts need to be programmed by us. The Shuffle part is handled by the Hadoop system, by associating a key with each record and moving all items with the same key to one place.

What this translates to in the Hadoop ecosystem is that all of the Hershey's bars will be processed on one computer, all of Cleo's food on another computer, and so on.

In the Reduce step, the results from the different machines are collected and stored in a separate file. MapReduce is usually considered a brute-force approach working across multiple machines.
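The whole Map, Shuffle, Reduce pipeline for this item count can be mimicked in a few lines of Python. The item names and lines below are invented stand-ins for the screenshots that accompanied the original post, chosen so the totals match the counts given above.

```python
from collections import defaultdict

lines = [
    "hersheys_bar cleos_food samsung_tv hersheys_bar",
    "steph_perfume cleos_food colgate_brush",
    "hersheys_bar steph_perfume cleos_food samsung_tv steph_perfume",
]

# Map: emit a (item, 1) pair for every item on every line.
mapped = [(item, 1) for line in lines for item in line.split()]

# Shuffle: group all pairs with the same key together.
groups = defaultdict(list)
for item, count in mapped:
    groups[item].append(count)

# Reduce: aggregate each group into a total.
totals = {item: sum(counts) for item, counts in groups.items()}
print(totals)  # 3 Hershey's bars, 3 Cleo's food, 2 Samsung TVs, ...
```

In real Hadoop the mapped pairs would be emitted on many machines, the framework would route each key to one reducer node, and each reducer would run the final sum for its keys; the structure of the computation is exactly this.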

Let's summarize the key takeaways of the MapReduce algorithm.

Map Steps:

  • Data is split into pieces
  • Individual machines (worker nodes) process the pieces in parallel
  • Each individual machine stores its result locally

Reduce Steps:

  • Data is aggregated by individual machines
  • Aggregation is parallelized


With Hadoop, programmers can focus on just the Map and Reduce parts through an API or a high-level programming language. The Hadoop ecosystem takes care of things like fault tolerance, assigning machines to map and reduce, moving processes, shuffling and dealing with errors.

Logistic Regression Use Case – Classification Problems

Logistic regression is used when the outcome is a discrete variable: for example, trying to figure out who will win an election, whether a student will pass or fail an exam, whether a customer will come back, or whether an email is spam. This is commonly called a classification problem, because we are trying to determine which class the data best fits.

Take, for example, health records of patients with data such as gender, age, income, activity level, marital status, number of kids and heart disease status. Logistic regression allows us to build a model that predicts the probability of someone having heart disease based on the features.

What distinguishes a logistic regression model from the linear regression model is that the outcome variable in logistic regression is binary or dichotomous. This difference between logistic and linear regression is reflected both in the form of the model and in its assumptions. Once this difference is accounted for, the methods employed in an analysis using logistic regression follow, more or less, the same general principles used in linear regression. Thus, the techniques used in linear regression analysis motivate our approach to logistic regression.

Unlike linear regression, logistic regression does not directly model the values of the dependent variable. Instead, it models the probability of membership in a particular group or category. If the dependent variable consists of only two categories, logistic regression estimates the odds of the outcome of the dependent variable given a set of quantitative and/or categorical independent variables.

Logistic regression analysis starts with calculating the "odds" of the dependent variable, which is the ratio of the probability that an individual (case) is a member of a particular group or category, p(y), divided by the probability that an individual is not a member of the group or category, [1 − p(y)]. It is represented as follows:

Odds = p(y) / [1 – p(y)]

It is important to note that unlike the probability values, which range from 0 to 1, the values of the odds can theoretically range from 0 to infinity.

In order to establish a linear relationship between the odds and the independent variables in the logistic regression model, the odds need to be transformed to the logit (log-odds) by taking the natural logarithm (ln) of the odds. The logarithmic transformation creates a continuous dependent variable out of the categorical dependent variable.
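The odds and logit transforms just described can be sketched with Python's standard library alone; the probabilities below are arbitrary example values.

```python
import math

def odds(p):
    """Odds = p / (1 - p); ranges from 0 to infinity as p goes from 0 to 1."""
    return p / (1 - p)

def logit(p):
    """Natural log of the odds; maps (0, 1) onto the whole real line."""
    return math.log(odds(p))

for p in (0.1, 0.5, 0.9):
    print(p, odds(p), logit(p))
```

Notice the symmetry: p = 0.5 gives odds of 1 and a logit of 0, while p = 0.1 and p = 0.9 give logits that are equal in magnitude and opposite in sign. That symmetric, unbounded scale is exactly what makes the logit suitable as the left-hand side of a linear model.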

The goal of logistic regression is to determine the probability of an outcome based on a set of inputs.


The key property of the logistic function, from the classification perspective, is that irrespective of the value of its argument, z, the logistic function always returns a value between 0 and 1. So this function compresses the whole real line into the interval (0, 1). It is also known as the sigmoid function, because of its characteristic S-shape.

In simple terms, the formula to estimate probability from logistic regression is:

P(i) = 1 / (1 + e^(−Z))

where Z = α + βXᵢ.

Note that Z is something we can determine through linear regression.
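As a sketch, here is the logistic function in Python. The coefficients α and β below are made-up values rather than estimates fitted from any data, and the inputs are just illustrative.

```python
import math

def sigmoid(z):
    """The logistic function: P = 1 / (1 + e^(-z)), always in (0, 1)."""
    return 1 / (1 + math.exp(-z))

alpha, beta = -4.0, 0.05           # hypothetical model coefficients
for x in (20, 80, 140):            # hypothetical feature values
    z = alpha + beta * x           # the linear part, Z = alpha + beta * x
    print(x, round(sigmoid(z), 3))
```

As x grows, z crosses zero and the predicted probability sweeps through 0.5; no matter how extreme z becomes, the output stays strictly between 0 and 1.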


Use case of Logistic Regression

Let's say we are helping a local hypermarket increase sales. We need to determine what products to suggest to customers who have already purchased some items in the store.

Through logistic regression on past data, we can determine the probability of a customer purchasing jewellery if they have already purchased items from the women's accessories section.

Look at the table below and try to figure out which products you would suggest to which customers.

Someone who just purchased kitchen stuff has a 72% chance of buying women’s accessories. Someone who purchased infant goods is very unlikely to purchase automobile parts.

[Table: probability of purchasing each product category, given categories already purchased]

Parametric Versus Non-Parametric Estimating methods

A common hurdle that every data scientist will at some point need to address is: which machine learning model to use? At one end of the spectrum, you have simple models which are easy to interpret but less accurate; at the other end, you have models that are complex and difficult to read but provide higher accuracy.

Models that are easy to interpret are generally grouped as parametric methods. These include models that assume the relationship between the features and the output has a functional form. Regression models are generally considered parametric methods; linear regression is the canonical example. Other commonly used parametric models include logistic regression, polynomial regression, linear discriminant analysis, quadratic discriminant analysis, (parametric) mixture models, and naïve Bayes (when parametric density estimation is used). Approaches often used in conjunction with parametric models for model selection purposes include ridge regression, lasso, and principal components regression.

Then there are models which seem like complex black boxes, which provide high accuracy and don't make assumptions about the functional relationship between the features and the response. These are called non-parametric methods. A simple example of a non-parametric model is a classification tree. A classification tree is a series of recursive binary decisions on the input features. The classification tree learning algorithm uses the target variable to learn the optimal series of splits such that the terminal leaf nodes of the tree contain instances with similar values of the target.

Other examples of nonparametric approaches to machine learning include k-nearest neighbors, splines, basis expansion methods, kernel smoothing, generalized additive models, neural nets, bagging, boosting, random forests, and support vector machines.

Most machine learning applications tend to use non-parametric methods to reflect the underlying complexity of relationships.


Machine learning: What do you do with Missing Data?

As a data scientist, a big portion of your time will be spent cleaning data. One of the areas of data 'munging' that you will need to address is that of missing data. Data could be missing for any number of reasons. A bank that is trying to target a section of its customers with the most potential for subscribing to a new product may find that the income field of some customers is missing. Since income is a self-reported measure, some customers may not have bothered to fill it in. Sometimes data is missing because of formatting issues. For example, log data with date and time from a particular server may have been corrupted, leading to a section of data with missing fields.

When dealing with missing data, here are some questions that you need to answer before doing any machine learning:

1. Does the missing data have meaning?

2. Is the data set large and the proportion of missing data small? (If so, it may make sense to remove the instances with missing data.)

3. Does the data set follow any distribution? (If not, you can use ML regression analysis to predict values for the missing data.)

4. Does the data follow a simple distribution? (If so, you can substitute the missing data with the mean or median.)

5. Can the data set be ordered? (If so, you can replace the missing data with the preceding value.)
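Three of these strategies can be sketched with the standard library alone. The toy list below is invented, with None marking a missing value:

```python
from statistics import mean

values = [10, None, 30, None, 50, 60]
observed = [v for v in values if v is not None]

# Strategy 2: drop the instances with missing data.
dropped = [v for v in values if v is not None]

# Strategy 4: substitute the mean (or median) of the observed values.
mean_filled = [v if v is not None else mean(observed) for v in values]

# Strategy 5: ordered data, so carry the preceding value forward.
forward_filled = []
last = None
for v in values:
    last = v if v is not None else last
    forward_filled.append(last)

print(dropped)         # [10, 30, 50, 60]
print(mean_filled)     # [10, 37.5, 30, 37.5, 50, 60]
print(forward_filled)  # [10, 10, 30, 30, 50, 60]
```

In practice a library such as pandas wraps each of these in a one-liner, but the logic is no more than what is shown here, and the hard part remains choosing which question above applies to your data.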

Each of the above points requires a different set of tools to address the missing data problem. We’ll cover the specific tools with examples in a different blog.