Logistic Regression¶
Import Libraries¶
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
Read and Confirm Data¶
df = pd.read_csv('data/ccdefault.csv').round(1)
df.head()
| | default | student | balance | income |
|---|---|---|---|---|
| 0 | No | No | 729.5 | 44361.6 |
| 1 | No | Yes | 817.2 | 12106.1 |
| 2 | No | No | 1073.5 | 31767.1 |
| 3 | No | No | 529.3 | 35704.5 |
| 4 | No | No | 785.7 | 38463.5 |
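Beyond head(), a quick structural check confirms the dimensions, dtypes, and that nothing is missing (a routine sketch; its printed output is not part of the original notebook):
# confirm dimensions, dtypes, and missing values
print(df.shape)
print(df.dtypes)
print(df.isna().sum())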
# remap default and student: 'Yes' to 1, 'No' to 0
df['default'] = np.where(df['default'] == "Yes", 1, 0)
df['student'] = np.where(df['student'] == "Yes", 1, 0)
# rescale balance and income to hundreds of dollars
df['balance'] = np.round(df['balance']/100, 0)
df['income'] = np.round(df['income']/100, 0)
df.head()
| | default | student | balance | income |
|---|---|---|---|---|
| 0 | 0 | 0 | 7.0 | 444.0 |
| 1 | 0 | 1 | 8.0 | 121.0 |
| 2 | 0 | 0 | 11.0 | 318.0 |
| 3 | 0 | 0 | 5.0 | 357.0 |
| 4 | 0 | 0 | 8.0 | 385.0 |
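Because the evaluation metrics below are dominated by class imbalance, it helps to count the classes first; per the classification report further down, the split is 9667 non-defaulters to 333 defaulters:
# check class balance of the response
print(df['default'].value_counts())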
Fit Logistic Regression¶
lr = smf.logit(formula='default ~ balance + C(student)', data=df).fit()
Optimization terminated successfully.
Current function value: 0.078644
Iterations 10
lr.summary()
| Dep. Variable: | default | No. Observations: | 10000 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 9997 |
| Method: | MLE | Df Model: | 2 |
| Date: | Mon, 27 Dec 2021 | Pseudo R-squ.: | 0.4615 |
| Time: | 16:35:43 | Log-Likelihood: | -786.44 |
| converged: | True | LL-Null: | -1460.3 |
| Covariance Type: | nonrobust | LLR p-value: | 2.172e-293 |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -10.7703 | 0.371 | -29.019 | 0.000 | -11.498 | -10.043 |
| C(student)[T.1] | -0.7004 | 0.147 | -4.761 | 0.000 | -0.989 | -0.412 |
| balance | 0.5746 | 0.023 | 24.680 | 0.000 | 0.529 | 0.620 |
Possibly complete quasi-separation: A fraction 0.14 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
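The quasi-separation warning flags that for about 14% of observations the predictors separate the classes almost perfectly; the fit still converged here. As a sanity check on the summary itself, the Pseudo R-squ. is McFadden's 1 - LL/LL-Null and can be recomputed from the reported log-likelihoods (-786.44 and -1460.3):
# McFadden pseudo R-squared: 1 - LL / LL-Null
print(1 - lr.llf/lr.llnull)   # 0.4615, matching the summary
print(lr.prsquared)           # same value via the results attribute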
Evaluate Model¶
X = df[['balance', 'student']]
y = df['default']
# predicted default probabilities from the fitted model
y_probabilities = lr.predict(X)
# classify at the default 0.5 probability threshold
y_hat = (y_probabilities >= 0.5).astype(int)
print(accuracy_score(y,y_hat))
0.9735
print(confusion_matrix(y,y_hat))
[[9618 49]
[ 216 117]]
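sklearn's convention puts actual classes on rows and predicted classes on columns, so the 216 above are actual defaulters the model missed. Labeling the matrix makes that explicit; a small sketch reusing the same y and y_hat:
# label the confusion matrix: rows = actual, columns = predicted
cm = pd.DataFrame(confusion_matrix(y, y_hat),
                  index=['actual 0', 'actual 1'],
                  columns=['pred 0', 'pred 1'])
print(cm)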
print(classification_report(y,y_hat))
precision recall f1-score support
0 0.98 0.99 0.99 9667
1 0.70 0.35 0.47 333
accuracy 0.97 10000
macro avg 0.84 0.67 0.73 10000
weighted avg 0.97 0.97 0.97 10000
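The 0.97 accuracy is flattering: with only 333 defaulters in 10000 rows, always predicting "no default" already scores 0.9667. The weak spot is recall on class 1 (0.35). If missed defaulters are costlier than false alarms, the cutoff can be lowered; the 0.2 below is an illustrative value, not from the original analysis:
# trade precision for recall by lowering the classification threshold
y_hat_low = (y_probabilities >= 0.2).astype(int)
print(confusion_matrix(y, y_hat_low))
print(classification_report(y, y_hat_low))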
Predictions¶
# predict default probability for new observations (balance in hundreds of $)
data_new = {'balance': [5.2, 10.1, 12.3, 20.1, 22.6],
            'student': [1, 0, 1, 0, 1]}
df_new = pd.DataFrame(data_new)
df_new['probability'] = lr.predict(df_new).round(2)
df_new
| | balance | student | probability |
|---|---|---|---|
| 0 | 5.2 | 1 | 0.00 |
| 1 | 10.1 | 0 | 0.01 |
| 2 | 12.3 | 1 | 0.01 |
| 3 | 20.1 | 0 | 0.69 |
| 4 | 22.6 | 1 | 0.82 |
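These values can be reproduced by hand: the model outputs log-odds b0 + b_student*student + b_balance*balance, and the logistic function 1/(1 + exp(-log_odds)) converts them to a probability. For the last row (balance = 22.6 hundred dollars, student = 1):
# reproduce the last prediction from the fitted coefficients
b0, b_student, b_balance = lr.params   # order matches lr.summary()
log_odds = b0 + b_student*1 + b_balance*22.6
print(1/(1 + np.exp(-log_odds)))       # ~0.82, matching the table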
# odds ratios: exponentiate the log-odds coefficients
np.exp(lr.params)
Intercept 0.000021
C(student)[T.1] 0.496375
balance 1.776450
dtype: float64
# percent change in the odds for a one-unit increase in each predictor
(np.exp(lr.params)-1)*100
Intercept -99.997898
C(student)[T.1] -50.362525
balance 77.644961
dtype: float64
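Read together: holding balance fixed, students have roughly 50% lower odds of default, and each additional $100 of balance multiplies the odds by about 1.78, a 78% increase. Exponentiating the coefficient confidence intervals gives interval estimates for these odds ratios (conf_int() is the standard statsmodels results method):
# 95% confidence intervals for the odds ratios
print(np.exp(lr.conf_int()))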