Logistic Regression¶
Import Libraries¶
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
Read and Confirm Data¶
df = pd.read_csv('data/ccdefault.csv').round(1)
df.head()
| | default | student | balance | income |
|---|---|---|---|---|
| 0 | No | No | 729.5 | 44361.6 |
| 1 | No | Yes | 817.2 | 12106.1 |
| 2 | No | No | 1073.5 | 31767.1 |
| 3 | No | No | 529.3 | 35704.5 |
| 4 | No | No | 785.7 | 38463.5 |
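Beyond head(), a quick structural check confirms the dimensions, dtypes, and that nothing is missing (a routine sketch; its printed output is not part of the original notebook):
# confirm dimensions, dtypes, and missing values
print(df.shape)
print(df.dtypes)
print(df.isna().sum())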
# remap default and student: 'Yes' to 1, 'No' to 0
df['default'] = np.where(df['default'] == "Yes", 1, 0)
df['student'] = np.where(df['student'] == "Yes", 1, 0)
# rescale balance and income to hundreds of dollars
df['balance'] = np.round(df['balance']/100, 0)
df['income'] = np.round(df['income']/100, 0)
df.head()
| | default | student | balance | income |
|---|---|---|---|---|
| 0 | 0 | 0 | 7.0 | 444.0 |
| 1 | 0 | 1 | 8.0 | 121.0 |
| 2 | 0 | 0 | 11.0 | 318.0 |
| 3 | 0 | 0 | 5.0 | 357.0 |
| 4 | 0 | 0 | 8.0 | 385.0 |
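Because the evaluation metrics below are dominated by class imbalance, it helps to count the classes first; per the classification report further down, the split is 9667 non-defaulters to 333 defaulters:
# check class balance of the response
print(df['default'].value_counts())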
Fit Logistic Regression¶
lr = smf.logit(formula='default ~ balance + C(student)', data=df).fit()
Optimization terminated successfully.
Current function value: 0.078644
Iterations 10
lr.summary()
| Dep. Variable: | default | No. Observations: | 10000 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 9997 |
| Method: | MLE | Df Model: | 2 |
| Date: | Mon, 27 Dec 2021 | Pseudo R-squ.: | 0.4615 |
| Time: | 16:35:43 | Log-Likelihood: | -786.44 |
| converged: | True | LL-Null: | -1460.3 |
| Covariance Type: | nonrobust | LLR p-value: | 2.172e-293 |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -10.7703 | 0.371 | -29.019 | 0.000 | -11.498 | -10.043 |
| C(student)[T.1] | -0.7004 | 0.147 | -4.761 | 0.000 | -0.989 | -0.412 |
| balance | 0.5746 | 0.023 | 24.680 | 0.000 | 0.529 | 0.620 |
Possibly complete quasi-separation: A fraction 0.14 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
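The quasi-separation warning flags that for about 14% of observations the predictors separate the classes almost perfectly; the fit still converged here. As a sanity check on the summary itself, the Pseudo R-squ. is McFadden's 1 - LL/LL-Null and can be recomputed from the reported log-likelihoods (-786.44 and -1460.3):
# McFadden pseudo R-squared: 1 - LL / LL-Null
print(1 - lr.llf/lr.llnull)   # 0.4615, matching the summary
print(lr.prsquared)           # same value via the results attribute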
Evaluate Model¶
X = df[['balance', 'student']]
y = df['default']
# predicted default probabilities from the fitted model
y_probabilities = lr.predict(X)
# classify at the default 0.5 probability threshold
y_hat = (y_probabilities >= 0.5).astype(int)
print(accuracy_score(y,y_hat))
0.9735
print(confusion_matrix(y,y_hat))
[[9618 49]
[ 216 117]]
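sklearn's convention puts actual classes on rows and predicted classes on columns, so the 216 above are actual defaulters the model missed. Labeling the matrix makes that explicit; a small sketch reusing the same y and y_hat:
# label the confusion matrix: rows = actual, columns = predicted
cm = pd.DataFrame(confusion_matrix(y, y_hat),
                  index=['actual 0', 'actual 1'],
                  columns=['pred 0', 'pred 1'])
print(cm)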
print(classification_report(y,y_hat))
precision recall f1-score support
0 0.98 0.99 0.99 9667
1 0.70 0.35 0.47 333
accuracy 0.97 10000
macro avg 0.84 0.67 0.73 10000
weighted avg 0.97 0.97 0.97 10000
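The 0.97 accuracy is flattering: with only 333 defaulters in 10000 rows, always predicting "no default" already scores 0.9667. The weak spot is recall on class 1 (0.35). If missed defaulters are costlier than false alarms, the cutoff can be lowered; the 0.2 below is an illustrative value, not from the original analysis:
# trade precision for recall by lowering the classification threshold
y_hat_low = (y_probabilities >= 0.2).astype(int)
print(confusion_matrix(y, y_hat_low))
print(classification_report(y, y_hat_low))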
Predictions¶
# predict default probability for new observations (balance in hundreds of $)
data_new = {'balance': [5.2, 10.1, 12.3, 20.1, 22.6],
            'student': [1, 0, 1, 0, 1]}
df_new = pd.DataFrame(data_new)
df_new['probability'] = lr.predict(df_new).round(2)
df_new
| | balance | student | probability |
|---|---|---|---|
| 0 | 5.2 | 1 | 0.00 |
| 1 | 10.1 | 0 | 0.01 |
| 2 | 12.3 | 1 | 0.01 |
| 3 | 20.1 | 0 | 0.69 |
| 4 | 22.6 | 1 | 0.82 |
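These values can be reproduced by hand: the model outputs log-odds b0 + b_student*student + b_balance*balance, and the logistic function 1/(1 + exp(-log_odds)) converts them to a probability. For the last row (balance = 22.6 hundred dollars, student = 1):
# reproduce the last prediction from the fitted coefficients
b0, b_student, b_balance = lr.params   # order matches lr.summary()
log_odds = b0 + b_student*1 + b_balance*22.6
print(1/(1 + np.exp(-log_odds)))       # ~0.82, matching the table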
# odds ratios: exponentiate the log-odds coefficients
np.exp(lr.params)
Intercept 0.000021
C(student)[T.1] 0.496375
balance 1.776450
dtype: float64
# percent change in the odds for a one-unit increase in each predictor
(np.exp(lr.params)-1)*100
Intercept -99.997898
C(student)[T.1] -50.362525
balance 77.644961
dtype: float64
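Read together: holding balance fixed, students have roughly 50% lower odds of default, and each additional $100 of balance multiplies the odds by about 1.78, a 78% increase. Exponentiating the coefficient confidence intervals gives interval estimates for these odds ratios (conf_int() is the standard statsmodels results method):
# 95% confidence intervals for the odds ratios
print(np.exp(lr.conf_int()))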