Introduction:
Logistic regression is a classification technique. It is a statistical method used to predict the probability of a binary response based on one or more independent variables. Logistic regression finds a function that relates the dependent variable (which must be categorical) to the independent variable(s) (which can be categorical or continuous). The dependent variable takes only two possible values, i.e. 1 or 0, true or false, yes or no, default or no default.
Examples of Logistic Regression:
- Suppose we are interested in the factors that influence whether RCB will win the IPL in 2020 or not.
- A bank typically receives thousands of applications for a new credit card. Each application contains data such as age, gender, annual salary and past credits and debits. We need to categorize the applicants into two types: people with good credit and people with bad credit.
Table of contents:
- Understanding how Logistic Regression works
- Case study
- Implementation in python
- Understanding the python code
- Logistic Regression applications
Understanding how Logistic Regression works
In order to understand how logistic regression works, let us first examine a hypothetical prediction model.
We have historical data about students in our class from past years. The data contains each student's maths score, physics score, activity score and final board exam score. Students come back to school for a reunion 6 years after graduating, so we also know how they fared. Say we have about 12 years' worth of such data. Using this data, we want to predict whether a student will be successful in life or not.
So the question is where the students graduating this year will be 6 years from now (we only consider whether they are successful in life or not). In reality, high school data alone cannot tell us whether a student will be successful in life, but for the time being let's take this as an example to see how logistic regression actually works.
Consider the below fig 2 as our sample data
Sample Data Set
To the sample data (Ref: Fig 2) we will add Rohan. Say Rohan scored 82 in maths, 80 in physics and 70 in activities, and his total board exam score is 500. From this data we will predict whether Rohan will be successful in 6 years.
This type of problem is called a “classification problem”, where we have to classify a data point, i.e. say whether it belongs to the successful group or not. Logistic regression is well suited to this type of problem.
How logistic regression works
As mentioned before, logistic regression is a statistical technique that makes predictions using probabilities.
Understanding exactly what probability means (an overview )
0 = you are absolutely sure that this person is not going to be successful in the future.
1 = you are absolutely sure that this person is going to be successful in the future.
If the value is above 0.5, we predict that the person will succeed. For example, a predicted value of 0.8 means we are 80% confident that the person will succeed in the future.
If the value is below 0.5, we predict that the person will not succeed in the future.
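To make the 0.5 cut-off concrete, here is a minimal sketch. The sigmoid function is the standard way logistic regression squashes a raw score into a number between 0 and 1; the function names and the raw scores below are illustrative assumptions and are not taken from the student example.

```python
import math

def sigmoid(z):
    """Squash any real-valued score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(probability, threshold=0.5):
    """Return 1 (successful) when the probability is above the threshold, else 0."""
    return 1 if probability > threshold else 0

# Hypothetical raw scores, chosen only for illustration.
for z in (-2.0, 0.0, 1.386):
    p = sigmoid(z)
    print(f"score={z:+.3f}  probability={p:.2f}  class={classify(p)}")
```

A raw score of about 1.386 corresponds to a probability of roughly 0.8, which is the "80% confident" case described above.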
How does it make this prediction?
By developing a model using training data.
We have the scores (independent variables) and we also know whether each person succeeded or not (the dependent variable). The model produces a prediction for each person, and we check how well these predictions align with our recorded data.
Suppose Bhuvan and Sagar are both recorded as successful. If the model predicts 0.9 for Bhuvan, the prediction is close to the recorded outcome and the model is doing well for him. If it also predicts 0.4 for Sagar, the model is way off for Sagar. We then look at various candidate models (not chosen randomly) and find the one that fits our recorded data most closely. The step-by-step process by which we arrive at a model is called “model selection”.
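One common way to put a number on "how far off" a prediction is uses the log loss. The article does not say which measure it has in mind, so treat the sketch below as an assumption. It also assumes that both Bhuvan and Sagar are recorded as successful (actual = 1).

```python
import math

def log_loss(actual, predicted):
    """Penalty for one prediction: small when predicted is close to actual."""
    return -(actual * math.log(predicted) + (1 - actual) * math.log(1 - predicted))

# Assumption: both students are recorded as successful (actual = 1).
predictions = {"Bhuvan": 0.9, "Sagar": 0.4}
for name, p in predictions.items():
    print(f"{name}: predicted={p}, log loss={log_loss(1, p):.3f}")
```

The penalty for the 0.4 prediction is several times larger than for the 0.9 prediction. A model with a lower average penalty over all recorded students fits the data more closely.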
After all these steps, we plug Rohan’s scores (or those of the whole class) into the model, which returns a number between 0 and 1. If this probability is greater than 0.5, we predict that Rohan will be successful in life; otherwise we predict that he will not.
Case study
Data Set Information:
The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').
Data set download : https://github.com/heroorkrishna/Bank-ADDITIONAL
Bank data with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]
Source:
[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
Attributes Information:
Bank data details
Input variables
1. age - (numeric)
2. job - type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')
3. marital - marital status (categorical: 'divorced', 'married', 'single', 'unknown'; note: 'divorced' means divorced or widowed)
4. education - (categorical: 'basic.4y', 'basic.6y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown')
5. default - has credit in default? (categorical: 'yes', 'no', 'unknown')
6. housing - has a housing loan? (categorical: 'yes', 'no', 'unknown')
7. loan - has a personal loan? (categorical: 'yes', 'no', 'unknown')
Related to the last contact of the current campaign:
8. contact : contact communication type (categorical: 'cellular', 'telephone')
9. month : last contact month of the year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10. day_of_week : last contact day of the week (categorical: 'mon', 'tue', 'wed', 'thu', 'fri')
11. duration : last contact duration, in seconds (numeric).
Important note : This attribute highly affects the output target (for example, if duration is zero then the output variable y = 'no').
Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should be included only for benchmark purposes and should be discarded if the intention is to have a realistic predictive or classification model.
Other Attributes:
12. campaign : number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays : number of days that passed after the client was last contacted from a previous campaign (numeric; 999 means the client was not previously contacted)
14. previous : number of contacts performed before this campaign and for this client (numeric)
15. poutcome : outcome of the previous marketing campaign (categorical: 'failure', 'nonexistent', 'success')
Social and economic context attributes :
16. emp.var.rate : employment variation rate - quarterly indicator (numeric)
17. cons.price.idx : consumer price index - monthly indicator (numeric)
18. cons.conf.idx : consumer confidence index - monthly indicator (numeric)
19. euribor3m : euribor 3 months rate - daily indicator (numeric)
20. nr.employed : number of employees - quarterly indicator (numeric)
Output variable (desired target) :
21. y : has the client subscribed to a term deposit? (binary: 'yes', 'no')
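Two of the notes above translate directly into preprocessing steps: duration should be left out of a realistic model, and the value 999 in pdays is a marker rather than a real number of days. The sketch below shows one way to handle both. The three rows and the column name previously_contacted are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical rows that mimic three of the columns described above.
sample = pd.DataFrame({
    "duration": [210, 0, 95],
    "pdays": [999, 6, 999],
    "previous": [0, 2, 0],
})

# duration is only known after the call, so leave it out of a realistic model.
sample = sample.drop(columns=["duration"])

# 999 is a sentinel, not a real day count; expose it as an explicit 0/1 flag.
sample["previously_contacted"] = (sample["pdays"] != 999).astype(int)

print(sample)
```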
Problem Statement
The data is from the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit.
Implementation in python
This section includes the code for building a straightforward logistic regression model. We will go through the code in more detail in the next section.
Code:
Understanding the python code
Now, let us go through the code to understand it and how it actually works.
Importing required libraries and tools
import pandas as pd
import numpy as np
Read the csv file and store it in a “bank” data frame
bank = pd.read_csv("bank-additional_full.csv")
bank.head()
Output:
age job marital ... euribor3m nr.employed y
0 30 blue-collar married ... 1.313 5099.1 no
1 39 services single ... 4.855 5191.0 no
2 25 services married ... 4.962 5228.1 no
3 38 services married ... 4.959 5228.1 no
4 47 admin. married ... 4.191 5195.8 no
List all columns (for reference)
bank.columns
Output : Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays' ,
'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',
'cons.conf.idx' , 'euribor3m', 'nr.employed', 'y'],
dtype='object')
Recoding the y (output variable) response:
bank['outcome'] = bank.y.map({'no':0, 'yes':1})
In the above code, the output variable is categorical, so we convert the response into numeric values and store them in a new column named outcome.
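One thing to watch with map is that any label missing from the dictionary silently becomes NaN. The sketch below shows this on a small hypothetical series, where the label "maybe" stands in for any unexpected value.

```python
import pandas as pd

# Hypothetical responses; "maybe" stands in for any unexpected label.
y = pd.Series(["no", "yes", "no", "maybe"])

outcome = y.map({"no": 0, "yes": 1})

# Labels missing from the dictionary become NaN, so count them before modelling.
print(outcome.isna().sum())
```

Running the same isna().sum() check on bank['outcome'] is a cheap way to confirm that every response was recoded.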
Comment On Features :
Importing the library
import matplotlib.pyplot as plt
1. age
bank.boxplot(column='age', by='outcome')
Output
For this feature we can clearly see that there are a lot of outliers, so we can say this is not a great feature.
2. job
bank.groupby('job').outcome.mean()
Output
In this output, the mean outcome differs from one job category to another, so we can say this is a useful feature.
Next, we create dummy variables for job and store them in ‘job_dummies’. Later we will add them to our bank data frame.
job_dummies = pd.get_dummies(bank.job, prefix='job')
job_dummies.drop(job_dummies.columns[0], axis=1, inplace=True)
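To see what these two lines do, here is a small sketch on a hypothetical series of job values. It shows the columns that get_dummies creates and what dropping the first one means.

```python
import pandas as pd

# Hypothetical job values, for illustration only.
jobs = pd.Series(["admin.", "services", "student", "services"])

dummies = pd.get_dummies(jobs, prefix="job")
print(list(dummies.columns))

# Dropping the first column makes 'job_admin.' the baseline category:
# a row with all remaining dummy columns equal to 0 is an 'admin.' row.
dummies = dummies.drop(columns=dummies.columns[0])
print(list(dummies.columns))
```

Dropping one dummy column avoids carrying redundant information, since the dropped category is implied when all the other columns are 0.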
3. default :
bank.groupby('default').outcome.mean()
Output:
default
no 0.121267
unknown 0.061021
yes 0.000000
Name: outcome, dtype: float64
The subscription rate differs between the categories, so this is also a useful feature.
Let's treat this as a 2-class feature rather than a 3-class feature by merging 'unknown' and 'yes':
bank['default'] = bank.default.map({'no':0, 'unknown':1, 'yes':1})
4. contact
bank['contact'] = bank.contact.map({'cellular':0, 'telephone':1})
As described above, the contact variable is categorical. The line above converts it into numeric values.
5. Month
bank.groupby('month').outcome.mean()
Output :
This also looks like a useful feature at first glance.
But when we look at the success rate per month, it is actually correlated with the number of calls made in that month. The month feature is therefore unlikely to generalize.
bank.groupby('month').outcome.agg(['count', 'mean']).sort_values('count')
Output :
6. duration
bank.boxplot(column='duration', by='outcome')
Output :
From the figure, it looks like an excellent feature. But you cannot know the call duration beforehand, so it can’t be used in the model.
7.1 previous
bank.groupby('previous').outcome.mean()
Output:
previous
0 0.082884
1 0.208421
2 0.410256
3 0.600000
4 0.714286
5 1.000000
6 0.500000
Name: outcome, dtype: float64
This looks like an excellent feature too.
7.2 poutcome
bank.groupby('poutcome').outcome.mean()
Output:
poutcome
failure 0.147577
nonexistent 0.082884
success 0.647887
Name: outcome, dtype: float64
Now, creating the poutcome_dummies
poutcome_dummies = pd.get_dummies(bank.poutcome, prefix='poutcome')
poutcome_dummies.drop(poutcome_dummies.columns[0], axis=1, inplace=True)
After creating poutcome_dummies, the next step is to concatenate the bank data frame with job_dummies and poutcome_dummies.
bank = pd.concat([bank, job_dummies, poutcome_dummies], axis=1)
7.3 euribor3m
bank.boxplot(column='euribor3m', by='outcome')
Output :
We prepare a boxplot of euribor3m by outcome in order to comment on the ‘euribor3m’ feature. From the figure we can say that this is an excellent feature.
Having commented on the features, we now move on to model building.
Model building
feature_cols = ['default', 'contact', 'previous', 'euribor3m'] + list(bank.columns[-13:])
X = bank[feature_cols]
Creating y
y = bank.outcome
X.head()
Output :
default contact .. . poutcome_nonexistent poutcome_success
0 0 0 ... 1 0
1 0 1 ... 1 0
2 0 1 ... 1 0
3 0 1 ... 1 0
4 0 0 ... 1 0
In the target y, 1 means that the client has subscribed to a term deposit and 0 means that the client has not. Similarly, in the last column of the output above, poutcome_success is 1 when the previous marketing campaign for that client ended in success and 0 otherwise.
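With X and y in place, the remaining step is to fit a model and read off probabilities. The walkthrough above does not show this step, so the following is only a minimal sketch under stated assumptions. It uses scikit-learn's LogisticRegression, which is an assumption about tooling, and a synthetic stand-in (X_demo, y_demo) so that it runs on its own. In the case study, the X and y built above would take their place.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y so the sketch runs on its own;
# in the case study, the X and y built above would be used instead.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
noise = rng.normal(scale=0.5, size=400)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

# Hold out a quarter of the rows to check the model on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is P(outcome = 1).
probabilities = model.predict_proba(X_test)[:, 1]
predicted = (probabilities > 0.5).astype(int)

print("accuracy on held-out rows:", (predicted == y_test).mean())
```

The 0.5 cut-off applied to the probabilities is the same rule described in the student example earlier. Because the data here is synthetic, the printed accuracy says nothing about how well the bank features perform.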
Logistic Regression applications:
Logistic regression is used for predicting categorical outcomes with two or more levels. Some examples are the gender of a person, the outcome of a football match, etc.
In real world applications logistic regression can be used for
1. Credit scoring.
2. Measuring the success rate of marketing campaigns
3. Predicting whether a product will reach a revenue target
4. Is there going to be an end of the earth on a particular day?