Machine learning: What do you do with Missing Data?
As a data scientist, you will spend a large portion of your time cleaning data. One area of data ‘munging’ that you will need to address is missing data. Data can be missing for any number of reasons. A bank trying to target the customers with the most potential for subscribing to a new product may find that the income field is missing for some of them. Since income is a self-reported measure, some customers may not have bothered to fill it in. Sometimes data is missing because of formatting issues. For example, log data with date-time stamps from a particular server may have been corrupted, leaving a section of the data with missing fields.
When dealing with missing data, here are some questions you need to answer before doing any machine learning:
1. Does the missing data have meaning? (If so, the fact that a value is missing may be worth keeping as information in its own right)
2. Is the data set large and the amount of missing data small? (If so, it may make sense to remove the instances with missing data)
3. Does the data follow any recognizable distribution? (If not, you can use regression analysis to predict values for the missing data from the other fields)
4. Does the data follow simple distributions? (If so, you can substitute the missing data with the mean or median)
5. Can the data set be ordered? (If so, you can replace the missing data with the preceding value)
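To make a few of these questions concrete, here is a minimal sketch in plain Python of what questions 1, 2, 4 and 5 might translate to. It assumes missing values are represented as `None`; the function names and the sample `incomes` list are made up for illustration and are not from any particular library.

```python
from statistics import median


def missing_flags(values):
    """Question 1: record where values are missing (1 = missing, 0 = present)."""
    return [1 if v is None else 0 for v in values]


def drop_missing(values):
    """Question 2: remove the instances with missing data."""
    return [v for v in values if v is not None]


def fill_with_median(values):
    """Question 4: substitute missing values with the median of the observed ones."""
    observed = drop_missing(values)
    if not observed:
        raise ValueError("cannot compute a median when every value is missing")
    mid = median(observed)
    return [mid if v is None else v for v in values]


def forward_fill(values):
    """Question 5: replace each missing value with the preceding observed value.

    Values missing at the very start have no predecessor and stay None.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Hypothetical self-reported incomes, with two customers who left the field blank.
incomes = [52000, None, 61000, None, 48000]

flags = missing_flags(incomes)
dropped = drop_missing(incomes)
median_filled = fill_with_median(incomes)
forward_filled = forward_fill(incomes)
```

In practice you would more likely reach for a library; pandas, for instance, offers comparable operations through `dropna`, `fillna` and `ffill`. Swapping `median` for `mean` in the sketch gives the mean-based variant of question 4.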
Each of the above points requires a different set of tools to address the missing data problem. We’ll cover the specific tools with examples in a separate blog post.
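In the meantime, the regression idea from question 3 can be sketched in a few lines. This is only an illustration under simplifying assumptions: a single predictor, an ordinary least-squares line fitted by hand, and made-up `ages` and `incomes` values. A real data set would usually call for more predictors and a proper modelling library.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_by_regression(xs, ys):
    """Predict each missing y (None) from its x, using a line fitted on complete rows."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([x for x, _ in complete], [y for _, y in complete])
    return [slope * x + intercept if y is None else y for x, y in zip(xs, ys)]


# Hypothetical data: age is always known, income is sometimes missing.
ages = [25, 30, 35, 40, 45]
incomes = [30000, None, 50000, None, 70000]

imputed = impute_by_regression(ages, incomes)
```

The line is fitted only on rows where income is present, and the observed incomes are left untouched; only the gaps are replaced with predictions.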