391B Orchard Road #23-01 Ngee Ann City Tower B, Singapore 238874
+ 65 66381203



Data Science is not just for those with PhDs

It is a common misconception that data science is a field only for PhDs. The field of data science is a vast and welcoming playground for anyone with an aptitude for solving problems with data.

Data Science is the intersection of three key areas:

Domain knowledge
Computer programming
Math and statistics

Domain knowledge is just the knowledge of the industry or field one is working in. A financial analyst would have knowledge about the stock market. A hotel manager would have knowledge about the hospitality industry. A sales manager would have knowledge about what factors influence buying.


Computer programming is the ability to use code to solve various types of problems. Many of today’s programming languages have a syntax closely resembling general English. Picking up programming is not as difficult as it used to be.


Math and statistics is the use of equations and formulas to perform analysis. Many of these concepts are from school or college days. Of course, one can dive deep into the various theorems and derivations. But many of today’s freely available tools handle these complexities behind the scenes.


In order to gain knowledge from data, one must be able to use computer programming to access the data, understand the mathematics behind the models derived from it, and, above all, understand where the analysis fits in the domain at hand.


When starting off in the field of data science, one would rely on their core strengths depending on which area they come from. Someone with a background in programming would be exploring the application of code to derive mathematical models. Someone with a background in math would be trying out coding. Someone with domain knowledge would be creating hypotheses to test using code.


The intersection of math and programming leads to what is referred to as machine learning. However, one needs to be able to explicitly generalize models

Hadoop and MapReduce – Simplified explanation

The most common use case for Hadoop is when you need to process large volumes of data with many machines. For example, let’s say we end up with 500 million transactions. Each transaction contains a bunch of information such as date and time, item purchased, location, customer information, payment method, weather, humidity, top headlines that day, etc.

The first obvious thought you may have is: why not just use a relational database (RDB)? Conventionally, relational databases have been used to store data. However, you are going to run into many problems. The most obvious one is cost. As the volume of data increases, the cost rises steeply.

The second problem you will need to address is dealing with unstructured data. How do we store images, videos, etc. in an RDB?

This is where technologies such as Hadoop and NoSQL come in.

What is Hadoop all about?

At its core, Hadoop is about processing large volumes of data with many machines. You would start by splitting your data so that every machine gets a subset of it. Imagine we have a data set of 500 million transactions. We would split the data into five subsets of 100 million records each and pass one subset to each of 5 machines.

Question: From the 500 million transactions, what is the average sales per day?

To answer this question, the data is split into subsets and sent to individual machines. We now need to take the records available on each machine, identify the specific field to evaluate, and do some computations. In technical jargon, this is similar to Extract, Transform and Load (ETL); in MapReduce terms, it is the map step.
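The split-and-compute idea above can be sketched in plain Python. This is only an illustration, not Hadoop itself: the six sample transactions, the three simulated "machines", and the function names (`split`, `map_partial_totals`, `reduce_totals`) are all made up for the example, and "average sales per day" is assumed to mean total sales divided by the number of days.

```python
from collections import defaultdict

# Hypothetical transactions: (date, sale amount). Real records would carry many more fields.
transactions = [
    ("2013-12-01", 10.0),
    ("2013-12-01", 30.0),
    ("2013-12-02", 20.0),
    ("2013-12-02", 40.0),
    ("2013-12-03", 50.0),
    ("2013-12-03", 30.0),
]

def split(records, n_machines):
    """Deal the records out into n_machines roughly equal subsets."""
    return [records[i::n_machines] for i in range(n_machines)]

def map_partial_totals(subset):
    """Work done on one machine: total sales per day for its own subset."""
    totals = defaultdict(float)
    for date, amount in subset:
        totals[date] += amount
    return dict(totals)

def reduce_totals(partials):
    """Combine the per-machine totals into overall totals per day."""
    combined = defaultdict(float)
    for partial in partials:
        for date, total in partial.items():
            combined[date] += total
    return dict(combined)

subsets = split(transactions, n_machines=3)
daily_totals = reduce_totals(map_partial_totals(s) for s in subsets)

# Assumption: average sales per day = total sales / number of distinct days.
average_sales_per_day = sum(daily_totals.values()) / len(daily_totals)
print(daily_totals)
print(average_sales_per_day)
```

Each machine only ever sees its own subset, and only the small per-day totals need to be sent on and combined. That is what makes the approach scale to many machines.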

Sometimes it becomes necessary to shuffle, or move, data across different machines to arrange it in some order. This may have to do with moving the sales column before the tax column, etc. This is called shuffling, where you move the results between intermediate

Logistic Regression Use Case – Classification Problems

Logistic regression is used when the outcome is a discrete variable. Examples include trying to figure out who will win an election, whether a student will pass or fail an exam, whether a customer will come back, or whether an email is spam. This is commonly called a classification problem because we are trying to determine which class each observation best fits.

Take, for example, health records of patients with data such as gender, age, income, activity level, marital status, number of kids, and heart disease status. Logistic regression allows us to build a model that will predict the probability of someone having heart disease based on the other features.

What distinguishes a logistic regression model from a linear regression model is that the outcome variable in logistic regression is binary or dichotomous. This difference between logistic and linear regression is reflected both in the form of the model and its assumptions. Once this difference is accounted for, the methods employed in an analysis using logistic regression follow, more or less, the same general principles used in linear regression. Thus, the techniques used in linear regression analysis motivate our approach to logistic regression.

Unlike linear regression, logistic regression does not directly model the values of the dependent variable. Instead, it models the probability of membership in a particular group or category. If the dependent variable consists of only two categories, logistic regression estimates the odds of the outcome given a set of quantitative and/or categorical independent variables.

Logistic regression analysis starts with calculating the “odds” of the dependent variable: the probability that an individual (case) is a member of a particular group or category, p(y), divided by the probability that the individual is not a member of that group or category, [1 – p(y)]. It is represented as follows:

Odds = p(y) / [1 – p(y)]
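
As a quick numeric illustration of the odds formula, here is a minimal Python sketch. The function names and the sample probabilities are chosen only for this example. The log of the odds (often called the logit) is included because it is the quantity a logistic regression model is usually written in terms of.

```python
import math

def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural log of the odds, also known as the logit."""
    return math.log(odds(p))

def probability_from_odds(o):
    """Invert the odds formula: p = odds / (1 + odds)."""
    return o / (1.0 + o)

# Illustrative probabilities only.
for p in (0.2, 0.5, 0.8):
    print(p, odds(p), log_odds(p))
```

A probability of 0.5 corresponds to odds of 1 (an even chance), and a probability of 0.8 corresponds to odds of about 4, i.e. four to one in favour.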

It is important to note that unlike the probability values, which range from 0 to

Parametric Versus Non-Parametric Estimating methods

A common hurdle that every data scientist will need to address at some point is which machine learning model to use. At one end of the spectrum, you have simple models that are easy to interpret but less accurate; at the other end, you have models that are complex and difficult to read but provide higher accuracy.

Models that are easy to interpret are generally grouped as parametric methods. These are models that assume the relationship between the features and the output has a particular functional form. Regression models, such as linear regression, are generally considered parametric methods. Other examples of commonly used parametric models include logistic regression, polynomial regression, linear discriminant analysis, quadratic discriminant analysis, (parametric) mixture models, and naïve Bayes (when parametric density estimation is used). Approaches often used in conjunction with parametric models for model selection purposes include ridge regression, lasso, and principal components regression.

There are also models that seem like complex black boxes: they provide high accuracy and don’t make assumptions about the functional relationship between the features and the output. These are called non-parametric methods. A simple example of a nonparametric model is a classification tree. A classification tree is a series of recursive binary decisions on the input features. The classification tree learning algorithm uses the target variable to learn the optimal series of splits such that the terminal leaf nodes of the tree contain instances with similar values of the target.

Other examples of nonparametric approaches to machine learning include k-nearest neighbors, splines, basis expansion methods, kernel smoothing, generalized additive models, neural nets, bagging, boosting, random forests, and support vector machines.

Most machine learning applications tend to use non-parametric methods to reflect the underlying complexity of relationships.
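
To make the contrast concrete, here is a minimal sketch in plain Python. It compares a straight-line fit (parametric: it assumes the form y = slope * x + intercept) with a k-nearest-neighbors prediction (nonparametric: it assumes no functional form). The toy data, the function names, and the choice of k = 2 are all assumptions made for this illustration.

```python
# Toy data with a curved relationship: y = x squared.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x * x for x in xs]

def fit_line(xs, ys):
    """Parametric: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def knn_predict(x_new, xs, ys, k=2):
    """Nonparametric: average the targets of the k closest training points."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k

slope, intercept = fit_line(xs, ys)
x_new = 2.5
true_value = x_new * x_new
linear_prediction = slope * x_new + intercept
knn_prediction = knn_predict(x_new, xs, ys)

print("true value:", true_value)
print("linear prediction:", linear_prediction)
print("k-NN prediction:", knn_prediction)
```

On this curved data the straight line is easy to read (two numbers describe it) but misses the shape. The nearest-neighbor prediction tracks the curve more closely but offers no compact formula to interpret.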


Machine learning: What do you do with Missing Data?

As a data scientist, you will spend a big portion of your time cleaning data. One of the areas of data ‘munging’ that you will need to address is missing data. Data could be missing for any number of reasons. A bank that is trying to target the customers with the most potential for subscribing to a new product may find that the income field of some customers is missing. Since income is a self-reported measure, some customers may not have bothered to fill it in. Sometimes, data is missing because of formatting issues. For example, log data with date and time from a particular server may have been corrupted, leading to a section of data with missing fields.

When dealing with missing data, here are some questions that you need to answer before doing any machine learning:

1. Does the missing data have meaning?

2. Is the data set large and the amount of missing data small? (If so, it may make sense to remove the instances with missing data)

3. Does the data set follow any distributions? (If not, you can use ML regression analysis to predict values for the missing data)

4. Does the data follow simple distributions? (If so, you can substitute the missing data with the mean or median)

5. Can the data set be ordered? (If so, you can replace the missing data with the preceding value)

Each of the above points requires a different set of tools to address the missing data problem. We’ll cover the specific tools with examples in a different blog.
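
In the meantime, here is a minimal sketch in plain Python of three of the simpler options from the list: removing missing instances (point 2), substituting the mean or median (point 4), and replacing a missing value with the preceding one (point 5). The income values and function names are hypothetical and exist only for illustration.

```python
from statistics import mean, median

# Hypothetical self-reported incomes; None marks a missing value.
incomes = [52000, None, 61000, 58000, None, 75000]

def drop_missing(values):
    """Point 2: remove the instances with missing data."""
    return [v for v in values if v is not None]

def fill_with(values, fill_value):
    """Point 4: substitute a single value, such as the mean or median."""
    return [fill_value if v is None else v for v in values]

def forward_fill(values):
    """Point 5: replace each missing value with the preceding observed value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None if nothing has been observed yet
    return filled

observed = drop_missing(incomes)
print(observed)
print(fill_with(incomes, mean(observed)))
print(fill_with(incomes, median(observed)))
print(forward_fill(incomes))
```

Forward filling only makes sense when the order of the records carries meaning, for example in time-stamped log data.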


Emotional Intelligence Versus IQ

There was a time when the only factor that was considered while labeling someone ‘smart’ or not was his or her IQ or Intelligence Quotient. This is a score derived from one of several standardized tests designed to assess intelligence. IQ scores are used as predictors of educational achievement, special needs, job performance and income.

One of the challenges of using IQ scores as a sole indicator is that a person’s overall intellectual abilities can hardly be summarized in one score. Also, IQ tests may assess logical thinking skills and memory, but they fail to assess interpersonal skills or creativity, which are equally, if not more, important in order to lead a full life. This is where EQ, or Emotional Intelligence, steps in.

EQ is the ability to identify, understand, and manage emotions in positive ways to relieve stress, communicate effectively, empathize with others, overcome challenges, and defuse conflict. A high EQ enables people to be aware of their own emotional state and that of others, and to use that awareness to enrich communication and enhance relationships.

Humans are social beings. Interaction with others is crucial, whether at work or in one’s personal relationships. For example, which would be more effective: appealing to both reason and emotions to convince someone, or trying to convince someone with facts alone?

It’s not the smartest people that are the most successful or the most fulfilled in life. There are a number of people who are academically brilliant and yet are socially inept and unsuccessful at work or in their personal relationships. IQ isn’t enough on its own to be successful in life. From an education perspective, IQ can help get into college, but it’s the EQ that will help manage the stress and emotions when facing final exams, or managing relationships with other

Quitting your Job to Start your Own Company

  • Posted On December 6, 2013
  • Categorized In Blogs

Quitting your job and starting your own company

This is a thought that crosses most people’s minds at some point in their lives. Some act on it, others don’t. So what is the barrier that seems to stand between the thought and action? Mostly, fear. Fear of failure, fear of losing stability, fear of losing that regular paycheck. It is a definite risk to leave the comfort of the job in hand for the uncertainty of venturing into unknown territory.

The question is: is the expected benefit worth the risk? Analyzing the status quo, and then comparing it against the perceived benefit of starting one’s own business, can answer that question. There are two kinds of perceived benefits: tangible and intangible. The tangible, measurable benefits are of course the monetary ones, for example, expected returns.

The more challenging ones to identify and weigh are the intangible ones, like satisfaction. How about the skills required to start a company? Bringing a business idea to life involves a lot of different aspects. There is the business angle, like registration, taxes, bank accounts and such. Then there is the operational side, for example, the location of the business. Is it going to be leased or purchased? Another area is staffing. The product/service itself is a very important area. Then there is the sales and marketing of the product/service.

While making a decision, all these areas should be considered and, of course, the skills required to manage all aspects of the business. What are the sources of support available? Evaluating one’s monthly financial commitments may also be a good idea. It would help if one had financial security to buffer some of the risk. Being aware of the all-important factor of luck also matters. Quitting a job and starting one’s own business may be a scary thought, but

What makes a training interesting

  • Posted On December 4, 2013
  • Categorized In Blogs

What makes a workshop, training or seminar interesting?

Too often people walk out of a workshop/training feeling underwhelmed. They don’t feel that they’ve learnt something new or useful; they feel that it was a waste of their time. The key word to take away is the ‘feeling’. Participants in any seminar remember the experience, how they felt. In reality, the workshop may have been very educational; the facilitators may have been very knowledgeable.

Then why is there a disconnect between the delivery of the seminar and the experience of the participants? The key is to enhance the experience without taking the focus away from the subject. So how can that be achieved? There are many elements that go into creating an interesting seminar.

An analogy may be that of a gift. The gift itself may be very valuable, like a diamond ring. However, if the ring is given in a paper bag, the person giving the gift should not be surprised if the response is not a very enthusiastic one! Similarly, the subject of a seminar may be a very important one, but without the right packaging, the effect may be less than spectacular. Yes, focus should be given to building the content, such as the slides, handouts, etc.

But equal importance should be given to the delivery of this content: the environment, the involvement of the participants, the facilitator’s skills and so on. The environment, or the venue of the seminar, should be one that is conducive to learning and participation: brightly lit, with comfortable seating, a collaborative seating layout, etc.

For example, a dimly lit, warm classroom may induce drowsiness. Another area that should be given importance is the structure of the seminar. It should be designed in a fashion that increases participation from the attendees. A monologue-type seminar would lose the participants’ interest very

Self study or attend class?

  • Posted On December 4, 2013
  • Categorized In Blogs

Self study versus attending a workshop

It’s a question a lot of learners ask themselves and training providers. It is a difficult one to answer with a blanket response. While the content may be the same in both offerings, the answer really lies in the individual’s learning style. Workshops are structured in nature. The instructor takes the participants through the concepts in detail, bundled with exercises and discussions.

The participants can benefit from the facilitator’s knowledge and can also learn from the experiences of other learners. There is constant sharing of information, both formally (during class) and informally (during breaks). According to our PMP certification instructors in Singapore, the interactive nature of the classroom allows for immediate clarification of questions and concerns.

On the downside, the classroom will move at a determined pace, catering to the average level of the class. Due to the constraint of time, there is limited flexibility of catering to the different levels of the participants. Some learners may feel the class is moving too fast while others may feel it is moving too slow.

Also, the learners need to take time out of their busy schedules to participate in the workshop, which is usually a full day at the very least and could go up to many more days, depending on the subject. Self-paced learning, on the other hand, allows for a lot of flexibility. The learner can go through the material at their convenience, and at the pace they feel comfortable with.

On the downside, such learning requires that the students exercise a lot of self-discipline. Unlike the classroom training, where during a specified time period, they are a captive audience, in the self-study mode, they need to make the time to study. There arises a question of support as well. They need to establish a source of support in terms of clarification

Exam Jitters

  • Posted On December 4, 2013
  • Categorized In Blogs

What to do about Exam Jitters, especially when you are over 30?

After interacting with hundreds of test takers for exams such as PMP certification in Singapore, wherein the average age is 35, we have collected a set of best practices that anyone planning to take an exam should be aware of.

Exams are stressful for people of any age. Young students who are used to taking exams on a regular basis form coping mechanisms that then become built-in for the longer term. However, adults take fewer exams, and hence don’t have a built-in coping mechanism.

Therefore, candidates must be aware of their anxiety and prepare to take steps to recognize, prevent, and reduce it, as the case may be. A moderate level of anxiety is healthy. It can sharpen performance and give candidates that extra boost they need for their preparation.

However, there is a thin line between anxiety and panic. Managing anxiety is much easier than panic. There are two broad areas that must be kept in mind.

Mental preparation
Physical preparation

Mental preparation includes planning, executing, and monitoring the tasks required to prepare for the exam.

Accepting that a long journey comprises many steps is a crucial part. One cannot reach the goal in a day. Setting realistic goals and staying on the path with determination will get the traveller to their destination and ease much of that dreaded anxiety. The biggest trap to watch out for is waiting until the last moment to do the studying.

Such cramming will only magnify the stress levels, sending candidates over that thin line into panic territory, which is much harder to manage. The mind can only absorb so much information at any given time. Studies show that a healthy adult has an attention span of no longer than 40 minutes on one task.

Taking breaks while studying