Logistic Regression

Import Libraries

import pandas as pd
import numpy as np

import statsmodels.api as sm
import statsmodels.formula.api as smf


from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

Read and Confirm Data

df = pd.read_csv('data/ccdefault.csv').round(1)
df.head()
  default student  balance   income
0      No      No    729.5  44361.6
1      No     Yes    817.2  12106.1
2      No      No   1073.5  31767.1
3      No      No    529.3  35704.5
4      No      No    785.7  38463.5
# remap default = 'Yes' to 1; 'No' to 0
df['default'] = np.where(df['default'] == "Yes", 1, 0)
df['student'] = np.where(df['student'] == "Yes", 1, 0)
# rescale balance and income to hundreds of dollars
df['balance'] = np.round(df['balance'] / 100, 0)
df['income'] = np.round(df['income'] / 100, 0)
df.head()
   default  student  balance  income
0        0        0      7.0   444.0
1        0        1      8.0   121.0
2        0        0     11.0   318.0
3        0        0      5.0   357.0
4        0        0      8.0   385.0
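As a side note, `Series.map` is an alternative to `np.where` for this recoding: unexpected labels become `NaN` instead of silently mapping to 0, which makes data problems visible. A minimal sketch on a hypothetical mini-frame standing in for `ccdefault.csv`:

```python
import pandas as pd

# hypothetical mini-frame with the same Yes/No columns as ccdefault.csv
mini = pd.DataFrame({'default': ['No', 'Yes', 'No'],
                     'student': ['Yes', 'No', 'No']})

# map Yes/No to 1/0; any label outside the dict becomes NaN
mini['default'] = mini['default'].map({'Yes': 1, 'No': 0})
mini['student'] = mini['student'].map({'Yes': 1, 'No': 0})
print(mini['default'].tolist())  # [0, 1, 0]
```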

Logistic Regression

# fit logit of default on balance and student (C() marks student as categorical)
lr = smf.logit(formula='default ~ balance + C(student)', data=df).fit()
Optimization terminated successfully.
         Current function value: 0.078644
         Iterations 10
lr.summary()
Logit Regression Results
Dep. Variable:     default            No. Observations:  10000
Model:             Logit              Df Residuals:      9997
Method:            MLE                Df Model:          2
Date:              Mon, 27 Dec 2021   Pseudo R-squ.:     0.4615
Time:              16:35:43           Log-Likelihood:    -786.44
converged:         True               LL-Null:           -1460.3
Covariance Type:   nonrobust          LLR p-value:       2.172e-293

                     coef   std err         z    P>|z|   [0.025   0.975]
Intercept        -10.7703     0.371   -29.019    0.000  -11.498  -10.043
C(student)[T.1]   -0.7004     0.147    -4.761    0.000   -0.989   -0.412
balance            0.5746     0.023    24.680    0.000    0.529    0.620


Possibly complete quasi-separation: A fraction 0.14 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
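One way to read the summary: the predicted default probability crosses 0.5 exactly where the linear predictor (log-odds) is zero. Using the reported estimates for a non-student, that decision boundary in balance can be computed directly (a quick check with the coefficients above, nothing refit):

```python
# intercept and balance coefficients reported in lr.summary() above
b0, b_balance = -10.7703, 0.5746

# log-odds = b0 + b_balance * balance = 0  =>  probability = 0.5
boundary = -b0 / b_balance
print(round(boundary, 1))  # ~18.7 units, i.e. about $1,874 of balance
```

For students, the extra -0.7004 shifts this boundary higher, consistent with the negative student coefficient.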

Evaluate Model

# predict on the training data; predict() matches DataFrame columns by name
X = df[['balance', 'student']]
y = df['default']
y_probabilities = lr.predict(X)
# threshold at 0.5 (what rounding the probabilities effectively does)
y_hat = (y_probabilities >= 0.5).astype(int)
print(accuracy_score(y,y_hat))
0.9735
print(confusion_matrix(y,y_hat))
[[9618   49]
 [ 216  117]]
print(classification_report(y,y_hat))
              precision    recall  f1-score   support

           0       0.98      0.99      0.99      9667
           1       0.70      0.35      0.47       333

    accuracy                           0.97     10000
   macro avg       0.84      0.67      0.73     10000
weighted avg       0.97      0.97      0.97     10000
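The class-1 numbers in the report can be reproduced directly from the confusion matrix above; the low recall of 0.35 reflects both the rarity of defaults (333 of 10,000) and the implicit 0.5 threshold:

```python
# counts from the confusion matrix printed above
tn, fp, fn, tp = 9618, 49, 216, 117

accuracy  = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # class-1 precision
recall    = tp / (tp + fn)   # class-1 recall
print(round(accuracy, 4), round(precision, 2), round(recall, 2))
```

Lowering the threshold below 0.5 when forming `y_hat` would raise recall at the cost of precision, which may be the better trade when missed defaults are costly.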

Predictions

# predict new points
data_new = {'balance': [5.2, 10.1, 12.3, 20.1, 22.6],
            'student': [1, 0, 1, 0, 1]}
df_new = pd.DataFrame(data_new)
df_new['probability'] = lr.predict(df_new).round(2)
df_new
   balance  student  probability
0      5.2        1         0.00
1     10.1        0         0.01
2     12.3        1         0.01
3     20.1        0         0.69
4     22.6        1         0.82
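These fitted probabilities can be verified by hand: `lr.predict` applies the inverse logit (sigmoid) to the linear predictor. A sketch using the coefficients reported by `lr.summary()` above:

```python
import numpy as np

# coefficients reported by lr.summary() above
b0, b_student, b_balance = -10.7703, -0.7004, 0.5746

def p_default(balance, student):
    """Inverse logit of the linear predictor: P(default = 1)."""
    z = b0 + b_student * student + b_balance * balance
    return 1.0 / (1.0 + np.exp(-z))

print(round(p_default(20.1, 0), 2))  # matches the 0.69 above
print(round(p_default(22.6, 1), 2))  # matches the 0.82 above
```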
np.exp(lr.params)  # odds ratios
Intercept          0.000021
C(student)[T.1]    0.496375
balance            1.776450
dtype: float64
(np.exp(lr.params) - 1) * 100  # percent change in odds per one-unit increase
Intercept         -99.997898
C(student)[T.1]   -50.362525
balance            77.644961
dtype: float64
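`np.exp(lr.params)` gives odds ratios: each additional unit of balance ($100) multiplies the odds of default by about 1.78, while being a student roughly halves them. For a k-unit change the factor is exp(k·β); a sketch for a hypothetical $500 increase in balance:

```python
import numpy as np

b_balance = 0.5746   # balance coefficient from lr.params above

k = 5                # $500 = 5 units, since balance is in hundreds of $
odds_factor = np.exp(k * b_balance)
print(round(float(odds_factor), 2))
```

This is the multiplicative view of the same fact printed above: each extra $100 of balance raises the odds of default by roughly 78%.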