
Blogs


Machine Learning – K Nearest Neighbours Algorithm

In this tutorial, we’ll explore K Nearest Neighbours (KNN), a widely used machine learning algorithm.

Recall that some of the common tasks in machine learning include:

Classification
Regression
Clustering
Dimensionality Reduction


In classification, the algorithm must learn to predict outcomes that are classes, based on one or more features.

Examples:

Whether a movie will be a hit or a flop

Deciding the category of a hurricane

Rating employees as High Performer or Average Performer in performance appraisals

Whether a person will say YES to a date

An example of a classification algorithm is Naive Bayes.
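To make this concrete, here is a minimal classification sketch in Python using scikit-learn's GaussianNB; the "hit or flop" movie data and the two features (budget, star power) are made up purely for illustration.

```python
# A minimal, illustrative Naive Bayes classifier.
# The movie data and features below are invented.
from sklearn.naive_bayes import GaussianNB

# Features: [budget in $M, star-power score 1-10]; label: 1 = hit, 0 = flop
X = [[150, 9], [90, 7], [10, 2], [5, 3], [120, 8], [20, 4]]
y = [1, 1, 0, 0, 1, 0]

model = GaussianNB().fit(X, y)
print(model.predict([[100, 6]]))  # predicted class for a new movie
```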

In regression, the algorithm must learn to predict the values of a response variable based on one or more features.

Examples:

How much money a movie will make at the box office

The expected wind speed of a hurricane

The number of products sold by an employee

How many glasses of wine were consumed on a date

An example of a regression algorithm is simple linear regression.
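As a quick illustration, here is a minimal simple linear regression sketch in Python using scikit-learn; the budget and box-office figures are invented for illustration only.

```python
# A minimal simple linear regression: predict box-office revenue from budget.
# All figures are made up.
from sklearn.linear_model import LinearRegression

X = [[10], [40], [80], [120], [160]]   # budget in $M (single feature)
y = [25, 90, 170, 260, 330]            # box-office revenue in $M

model = LinearRegression().fit(X, y)
print(model.predict([[100]]))          # expected revenue for a $100M budget
```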

Both classification and regression are supervised learning tasks. This means we already have labeled data: our model learns from the training set using the labels and predicts outcomes for unseen data.

The K-Nearest Neighbours (KNN) algorithm can be used for both classification and regression.

KNN is widely used in the real world in a variety of applications, including search and recommender systems.

The K in KNN is a number, for example 3 nearest neighbours. It is the number of training instances in a metric space that are considered when making a prediction.

For example: Singapore is most similar to which of the following countries? Brazil, United States, Hong Kong, Slovenia, Japan, Syria, Australia, Italy, Malaysia.

You can start to evaluate this on various features: population size, geographical area, GDP, housing prices and so on.

If we used just the population size and per capita income, which country would Singapore be most similar to? Maybe Hong Kong?

Well, we are data people, so we never speculate. We use data. K Nearest Neighbours can help in this case.

We need a way to measure the distance between instances; a common choice is the Euclidean distance in feature space.
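As a rough sketch of how this could be done, the Python snippet below scales two features (population and per capita income) and uses scikit-learn's NearestNeighbors to find the three countries closest to Singapore. The figures are approximate, illustrative values only, not authoritative statistics.

```python
# Which countries are "nearest" to Singapore on two illustrative features?
# The population and income figures are rough, made-up approximations.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

countries = ["Brazil", "United States", "Hong Kong", "Slovenia", "Japan",
             "Syria", "Australia", "Italy", "Malaysia"]
# Features: [population in millions, per capita income in USD thousands]
X = np.array([[214, 9], [332, 70], [7.5, 49], [2.1, 29], [125, 40],
              [21, 1], [26, 60], [59, 35], [33, 11]])
singapore = np.array([[5.9, 72]])

# Put both features on the same scale before measuring Euclidean distance
scaler = StandardScaler().fit(X)
nn = NearestNeighbors(n_neighbors=3).fit(scaler.transform(X))
distances, indices = nn.kneighbors(scaler.transform(singapore))
print([countries[i] for i in indices[0]])  # the 3 most similar countries
```

Scaling matters here: without it, the population feature (in the hundreds of millions) would dominate the distance calculation.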

Data Science is not just for those with PhDs

It is a common misconception that data science is a field only for PhDs. The field of data science is a vast and welcoming playground for anyone with an aptitude for problem solving using data.

Data Science is the intersection of three key areas:

Domain knowledge
Computer programming
Math and statistics

Domain knowledge is just the knowledge of the industry or field one is working in. A financial analyst would have knowledge about the stock market. A hotel manager would have knowledge about the hospitality industry. A sales manager would have knowledge about what factors influence buying.


Computer programming is the ability to use code to solve various types of problems. Many of today’s programming languages have a syntax closely resembling general English. Picking up programming is not as difficult as it used to be.


Math and statistics is the use of equations and formulas to perform analysis. Many of these concepts are from school or college days. Of course, one can dive deep into the various theorems and derivations, but many of today’s freely available tools handle these complexities behind the scenes.


In order to gain knowledge from data, one must be able to utilize computer programming to access the data, understand the mathematics behind the models we derive, and above all, understand our analyses’ place in the domain we are in.


When starting off in the field of data science, one would rely on their core strengths depending on which area they come from. Someone with a background in programming would be exploring the application of code to derive mathematical models. Someone with a background in math would be trying out coding. Someone with domain knowledge would be creating hypotheses to test using code.


The intersection of math and programming leads to what is referred to as machine learning. However, one needs to be able to explicitly generalize models

Hadoop and MapReduce – Simplified explanation

The most common use case for Hadoop is when you need to process large volumes of data with many machines. For example, let’s say we end up with 500 million transactions. Each transaction contains information such as date and time, item purchased, location, customer information, payment method, weather, humidity, the top headlines that day and so on.

The first obvious thought you may have is: why not just use a relational database (RDB)? Conventionally, relational databases have been used to store data. However, you are going to run into many problems, the most obvious one being cost: as the volume of data increases, the cost rises steeply.

The second problem you will need to address is dealing with unstructured data: how do we store images, videos and the like in an RDB?

This is where technologies such as Hadoop and NoSQL come in.

What is Hadoop all about?

At its core, Hadoop is about processing large volumes of data with many machines. You start by splitting your data so that every machine gets a subset of the data. Imagine we have a data set of 500 million transactions and five machines: we would split the data into subsets of 100 million records each and pass one subset to each machine.

Question: from the 500 million transactions, what is the average sales per day?

To answer this question, the data is split into subsets and sent to individual machines. We now need to take the records available on each machine, identify the specific field to evaluate and do some computations. In technical jargon, this is called Extract, Transform and Load (ETL).
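To illustrate the idea, here is a toy, single-process Python sketch of the map, shuffle and reduce steps for the average-sales-per-day question. The field names and values are hypothetical; a real Hadoop job would express the same steps as Mapper and Reducer tasks running over data splits on many machines.

```python
# A toy, single-process illustration of map, shuffle and reduce for
# "average sales per day". Field names and values are hypothetical.
from collections import defaultdict

transactions = [
    {"date": "2013-12-01", "sales": 120.0},
    {"date": "2013-12-01", "sales": 80.0},
    {"date": "2013-12-02", "sales": 200.0},
]

# Map: extract a (day, amount) pair from every record
mapped = [(t["date"], t["sales"]) for t in transactions]

# Shuffle: group the intermediate pairs so each day's amounts sit together
grouped = defaultdict(list)
for day, amount in mapped:
    grouped[day].append(amount)

# Reduce: compute the average sales for each day
averages = {day: sum(vals) / len(vals) for day, vals in grouped.items()}
print(averages)  # {'2013-12-01': 100.0, '2013-12-02': 200.0}
```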

Sometimes, it becomes necessary to shuffle, or move, data across different machines to arrange it in some order, for example so that all the intermediate results for the same key (such as the same day) end up on the same machine. This is called shuffling, where you move the results between intermediate

Logistic Regression Use Case – Classification Problems

Logistic regression is used when the outcome is a discrete variable: for example, trying to figure out who will win an election, whether a student will pass or fail an exam, whether a customer will come back, or whether an email is spam. This is commonly called a classification problem because we are trying to determine which class each observation best fits.

Take, for example, health records of patients with data such as gender, age, income, activity level, marital status, number of kids and heart disease status. Logistic regression allows us to build a model that predicts the probability of someone having heart disease based on these features.

What distinguishes a logistic regression model from the linear regression model is that the outcome variable in logistic regression is binary or dichotomous. This difference between logistic and linear regression is reflected both in the form of the model and in its assumptions. Once this difference is accounted for, the methods employed in an analysis using logistic regression follow, more or less, the same general principles used in linear regression. Thus, the techniques used in linear regression analysis motivate our approach to logistic regression.

Unlike linear regression analysis, logistic regression does not directly model the values of the dependent variable. Instead, it models the probability of membership of a particular group or category. If the dependent variable consists of only two categories, logistic regression estimates the odds of the outcome of the dependent variable given a set of quantitative and/or categorical independent variables.

Logistic regression analysis starts with calculating the “odds” of the dependent variable, which is the probability that an individual (case) is a member of a particular group or category, p(y), divided by the probability that the individual is not a member of the group or category, [1 – p(y)]. It is represented as follows:

Odds = p(y) / [1 – p(y)]
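The short Python sketch below shows this odds calculation and, for illustration, fits scikit-learn's LogisticRegression to a tiny, entirely made-up heart-disease data set to estimate such a probability.

```python
# Odds, log-odds, and a tiny logistic regression fit.
# The heart-disease data is made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

p = 0.8
odds = p / (1 - p)         # Odds = p(y) / [1 - p(y)] -> 4.0
log_odds = np.log(odds)    # the quantity logistic regression models linearly

# Features: [age, weekly exercise hours]; label: 1 = heart disease, 0 = none
X = [[60, 1], [55, 2], [40, 6], [35, 7], [65, 0], [30, 8]]
y = [1, 1, 0, 0, 1, 0]

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[50, 3]])[0, 1])  # estimated probability of disease
```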

It is important to note that unlike the probability values, which range from 0 to 1, the odds can range from 0 to infinity.

Parametric Versus Non-Parametric Estimating Methods

A common hurdle that every data scientist will at some point need to address is which machine learning model to use. At one end of the spectrum, you have simple models that are easy to interpret but less accurate; at the other end, you have models that are complex and difficult to interpret but provide higher accuracy.

Models that are easy to interpret are generally grouped as parametric methods. These are models that assume the relationship between the features and the output has a particular functional form. Regression models are generally considered parametric methods; linear regression is one example. Other commonly used parametric models include logistic regression, polynomial regression, linear discriminant analysis, quadratic discriminant analysis, (parametric) mixture models, and naïve Bayes (when parametric density estimation is used). Approaches often used in conjunction with parametric models for model selection purposes include ridge regression, the lasso, and principal components regression.

There are also models that seem like complex black boxes: they provide high accuracy but do not make assumptions about the functional form of the relationship between the features and the output. These are called non-parametric methods. A simple example of a non-parametric model is a classification tree. A classification tree is a series of recursive binary decisions on the input features. The classification tree learning algorithm uses the target variable to learn the optimal series of splits such that the terminal leaf nodes of the tree contain instances with similar values of the target.

Other examples of nonparametric approaches to machine learning include k-nearest neighbors, splines, basis expansion methods, kernel smoothing, generalized additive models, neural nets, bagging, boosting, random forests, and support vector machines.

Most machine learning applications tend to use non-parametric methods to reflect the underlying complexity of relationships.
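The sketch below contrasts the two approaches on the same synthetic, non-linear data: a linear regression (parametric, assumes a straight-line form) versus a decision tree regressor (non-parametric, learns the shape from the data). The data is generated purely for illustration.

```python
# Parametric (linear regression) versus non-parametric (decision tree)
# on the same synthetic, curved data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)   # a curved relationship

linear = LinearRegression().fit(X, y)                  # assumes a straight line
tree = DecisionTreeRegressor(max_depth=4).fit(X, y)    # learns the shape from data

x_new = [[2.5]]
print(linear.predict(x_new), tree.predict(x_new))      # the tree tracks the curve better
```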


Machine learning: What do you do with Missing Data?

As a data scientist, a big portion of your time will be spent cleaning data. One of the areas of data ‘munging’ that you will need to address is missing data. Data can be missing for any number of reasons. A bank that is trying to target a section of its customers with the most potential for subscribing to a new product may find that the income field of some customers is missing; since income is a self-reported measure, some customers may not have bothered to fill it in. Sometimes data is missing because of formatting issues: for example, log data with date-time stamps from a particular server may have been corrupted, leading to a section of data with missing fields.

When dealing with missing data, here are some questions you need to answer before doing any machine learning:

1. Does the missing data have meaning?

2. Is the data set large while the amount of missing data is small? (If so, it may make sense to remove instances with missing data)

3. Does the data set follow any distributions? (If not, you can use ML regression analysis to predict values for the missing data)

4. Does the data follow simple distributions? (If so, you can substitute the missing data with the mean or median)

5. Can the data set be ordered? (If so, you can replace the missing data with the preceding value)

Each of the above points requires a different set of tools to address the missing data problem. We’ll cover the specific tools with examples in a different blog.
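As a preview, here is a minimal pandas sketch of three of the options above: removing rows with missing values (point 2), filling with the median (point 4), and carrying the preceding value forward in an ordered data set (point 5). The customer data is hypothetical.

```python
# Three common ways of handling missing values in pandas.
# The customer records are invented.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "income": [52000, np.nan, 61000, np.nan, 48000],
    "balance": [1000, 1500, np.nan, 2000, 2500],
})

dropped = df.dropna()                                  # point 2: remove incomplete rows
filled = df.fillna({"income": df["income"].median()})  # point 4: fill with the median
ordered = df.sort_values("customer_id").ffill()        # point 5: carry the previous value forward
print(filled)
```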


Emotional Intelligence Versus IQ

There was a time when the only factor that was considered while labeling someone ‘smart’ or not was his or her IQ or Intelligence Quotient. This is a score derived from one of several standardized tests designed to assess intelligence. IQ scores are used as predictors of educational achievement, special needs, job performance and income.

One of the challenges of using IQ scores as a sole indicator is that a person’s overall intellectual abilities can hardly be summarized in a single score. Also, IQ tests may assess logical thinking skills and memory, but they fail to assess interpersonal skills or creativity, which are equally, if not more, important in leading a full life. This is where EQ, or Emotional Intelligence, steps in.

EQ is the ability to identify, understand, and manage emotions in positive ways to relieve stress, communicate effectively, empathize with others, overcome challenges, and defuse conflict. A high EQ enables people to be aware of their own emotional state and that of others, and to use that awareness to enrich communication and enhance relationships.

Humans are social beings. Interaction with others is crucial, whether at work or in one’s personal relationships. For example, which would be more effective: appealing to reason and emotions to convince someone, or trying to convince someone by facts alone?

It’s not the smartest people who are the most successful or the most fulfilled in life. There are a number of people who are academically brilliant and yet are socially inept and unsuccessful at work or in their personal relationships. IQ isn’t enough on its own to be successful in life. From an education perspective, IQ can help you get into college, but it’s EQ that will help you manage the stress and emotions when facing final exams, or manage relationships with other

Do Professional Certifications Matter? A Case Study from Singapore

Posted on December 3, 2013

In this study commissioned by iKompass, correlations between certifications and salary are explored. The study group was made up of 127 people who had taken up different certifications such as the PMP (Project Management Professional) certification, cloud certifications, CCNA and others.
For a small country such as Singapore that has no natural resources, people are its biggest asset. Just as a country that has oil nurtures its petroleum industry with utmost care, Singapore nurtures its people by creating an environment conducive to talent. The certifications or credentials one holds presumably play a big role in career-related decision making.
The study involved establishing a correlation between the number of credentials people held and the corresponding salaries they earned. The sample was drawn entirely from Singapore and randomly selected based on similar demographic criteria such as university education, age, and so on.
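For readers curious about the mechanics, a correlation of this kind can be computed in a couple of lines of Python; the numbers below are invented and are not the study’s data.

```python
# Pearson correlation between number of credentials and salary,
# on made-up numbers (not the study's data).
import numpy as np

credentials = np.array([0, 1, 1, 2, 3, 3, 4, 5])      # certifications held
salary = np.array([45, 52, 50, 58, 63, 61, 70, 76])   # annual salary, SGD thousands

r = np.corrcoef(credentials, salary)[0, 1]
print(round(r, 2))  # a value close to +1 would indicate a strong positive correlation
```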

Leaders of organizations constantly gripe about not having enough talent. With constant change looming on the horizon, it is imperative that one stays ahead of the talent game. This is achieved by upgrading oneself through education. Beyond the industry need for certified professionals, people in Singapore seem to be intrinsically motivated to learn: 83% of the participants in the iKompass study said that they enjoyed the process of learning beyond just landing a job.

Companies spend millions on upgrading their employees’ skills at all levels. 90% of our participants attended at least one in-house training course required by their company, and 68% of our participants attended a workshop of their own choice. The study also focused on the tangible value that resulted from people undergoing training.

In real terms, the study tracked the career progress of 40 credential holders who attended a PMP training in Singapore conducted by. As a control group, the study included 40 non-credential holders from the same organizations as that of the

Forget Privacy. This is what the Big Boys want you to believe

Posted on December 2, 2013

“Privacy is over. Get used to it” is often attributed to Scott McNealy, former CEO of Sun. Is he right? McNealy’s statement, when interpreted as saying that individuals have no choice but to assign very little value to personal privacy, lends itself to the notion that individuals do not value privacy anymore. In this paper, we put forth a contrary view: people care about privacy, and online privacy can and should be protected.

The two main concerns about losing one’s privacy are centered on online behavior being tracked and on entities monitoring an individual for unscrupulous reasons. “Online predators use information divulged in online profiles and social networking sites to identify potential targets” [1]. These concerns generally do not matter until one is racially profiled, considered a threat to security, or a victim of cyber bullying.

A nonchalant attitude towards privacy issues, or valuing the benefits of information sharing over privacy, tends to encourage companies to increase their monitoring activities and aggressively sell personal information for commercial reasons. “Letting the guard down on privacy could also cause harm to the most vulnerable section of the online demographic, children and teenagers who share the most information.” [2]

Parents and job counselors have been warning for years that teenagers and young adults must not post unflattering images to their Facebook pages because, even if deleted, they will persist somewhere on the internet and may be found by prospective colleges and employers [3]. One of the problems around private information being misused centers on how companies such as Google and Facebook use posted information.

Instagram, shortly after being acquired by Facebook, issued new terms of service that gave the company the right to use uploaded images without permission and without compensation. Imagine the damage it could do to a teenager who posted an impulsive “dirty” picture