Logistic Regression Use Case – Classification Problems
Logistic regression is used when the outcome is a discrete variable. Example, trying to figure out who will win the election, whether a student will pass or fail an exam, whether a customer will come back, whether an email is a spam. This is commonly called as a classification problem because we are trying to determine which class the data set best fits.
Take for example, we have health records of patients with data such as gender, age, income, activity level, marriage status, number of kids, heart disease status. Logistic regression allows us to build a model that will predict the probability of someone having a heart disease based on the features.
What distinguishes a logistic regression model from the linear regression model is that the outcome variable in logistic regression is binary or dichotomous. This difference between logistic and linear regression is reflected both in the form of the model and its assumptions. Once this difference is accounted for, the methods employed in an analysis using logistic regression follow, more or less, the same general principles used in linear regression. Thus, the techniques used in linear regression analysis motivate our approach to logisticregression.
Unlike regression analysis, logistic regression does not directly model the values of the dependent variable. However, it does model the probability of the membership to a particular group or category. If the dependent variable consists of only two categories, logistic regression estimates the odds outcome of the dependent variable given a set of quantitative and/or categorical independent variables.
Logistics regression analysis starts with calculating the “Odds” of the dependent variable, which is the ratio of the probability that an individual (case) is a member of a particular group or category, p(y) divided by the probability that an individual is not a member of the group or category [1- p(y)]. It is represented as follows:
Odds = p(y) / [1 – p(y)]
It is important to note that unlike the probability values, which range from 0 to 1, the values of the odds can theoretically range from 0 to infinity.
In order to establish a linear relationship between the odds and the independent variables in the logistic regression model, the odds need to be transformed to logit (log-odds) by taking the natural logarithm (ln) of odds. The logarithmic transformation creates a continuous dependent variable of the categorical dependent variable
The goal of logistic regression is to detemine the probablity of an outcome based on a set of inputs.
The key property of the logistic function, from the classification perspective, is that irrespective of the values of its argument, z, the logistic function always returns a value between 0 and 1. So this function compresses the whole real line into the interval [0, 1]. This function is also known as the sigmoid function, because of its characteristic S-shaped aspect
In simple terms, The formula to estimate probability from logistic regression is:
P(i) = 1 / 1 + e–Z
where Z = α + βXi.
Note that Z is something we can determine through linea regression.
Use case of Logistic Rgression
Lets say we are helping a local hyper market increase sales. We need to determine what products to suggest for customers who have already purchased some items in the store.
Through logistic regression on past data, we can determine the probability of a customer purchasing jewellery if they have already purchased items from the women’s accessories.
Look at the table below and try to figure out which products would you suggest to which customers.
Someone who just purchased kitchen stuff has a 72% chance of buying women’s accessories. Someone who purchased infant goods is very unlikely to purchase automobile parts.