Problem Statement

A consumer finance company specialises in lending various types of loans to urban customers. When the company receives a loan application, it has to decide whether to approve the loan based on the applicant's profile. Two types of risk are associated with the company's decision:

  • If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company

  • If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company

The aim is to identify patterns which indicate if a person is likely to default, which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc.

When a person applies for a loan, there are two types of decisions that could be taken by the company:

Loan accepted: If the company approves the loan, there are 3 possible scenarios described below:

Fully paid: Applicant has fully paid the loan (the principal and the interest)

Current: Applicant is in the process of paying the instalments, i.e. the tenure of the loan is not yet completed. These candidates are not labelled as 'defaulted'.

Charged-off: Applicant has not paid the instalments in due time for a long period of time, i.e. he/she has defaulted on the loan

Loan rejected: The company rejected the loan (e.g. because the applicant did not meet its requirements). Since the loan was rejected, there is no transactional history for those applicants and hence no such data in this dataset.
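
The accepted-loan scenarios above map directly onto the loan_status column used later in the notebook. As a bridge between the narrative and the analysis, here is a minimal, hypothetical sketch of that labelling scheme (assuming the standardised status values fully_paid, current and charged_off that appear after the cleaning steps below):

In [ ]:
import pandas as pd

# Toy example of the labelling scheme: only 'charged_off' counts as a default.
statuses = pd.Series(['fully_paid', 'current', 'charged_off', 'fully_paid'])
defaulted = statuses.map({'fully_paid': 0, 'current': 0, 'charged_off': 1})
print(defaulted.tolist())  # [0, 0, 1, 0]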

In [93]:
# importing required modules
import pandas as pd
import numpy as np
import datetime as dt

import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

plt.style.use('ggplot')
In [2]:
# importing dataset
df = pd.read_csv("D:\\DataSets\\UPGRAD\\Assignments\\Loan Group Project\\loan.csv", low_memory=False)
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39717 entries, 0 to 39716
Columns: 111 entries, id to total_il_high_credit_limit
dtypes: float64(74), int64(13), object(24)
memory usage: 33.6+ MB
In [4]:
df.describe()
Out[4]:
[Output truncated: df.describe() produces a summary-statistics table over all 87 numeric columns, far too wide to render here. Notable points: 39,717 rows; loan_amnt ranges 500 to 35,000 (mean ≈ 11,219); annual_inc ranges 4,000 to 6,000,000 (mean ≈ 68,969); and a large block of credit-bureau columns (tot_cur_bal, the num_* and mo_sin_* families, etc.) has a count of zero, i.e. those columns are entirely NaN.]
In [5]:
# Display percentage of NA Values for every column (columns with 100% NA values will be discarded)
print("Percentage of NA Values for every column")
print(round(df.isna().sum() / len(df) * 100, 2))
Percentage of NA Values for every column
id                                  0.00
member_id                           0.00
loan_amnt                           0.00
funded_amnt                         0.00
funded_amnt_inv                     0.00
term                                0.00
int_rate                            0.00
installment                         0.00
grade                               0.00
sub_grade                           0.00
emp_title                           6.19
emp_length                          2.71
home_ownership                      0.00
annual_inc                          0.00
verification_status                 0.00
issue_d                             0.00
loan_status                         0.00
pymnt_plan                          0.00
url                                 0.00
desc                               32.58
purpose                             0.00
title                               0.03
zip_code                            0.00
addr_state                          0.00
dti                                 0.00
delinq_2yrs                         0.00
earliest_cr_line                    0.00
inq_last_6mths                      0.00
mths_since_last_delinq             64.66
mths_since_last_record             92.99
open_acc                            0.00
pub_rec                             0.00
revol_bal                           0.00
revol_util                          0.13
total_acc                           0.00
initial_list_status                 0.00
out_prncp                           0.00
out_prncp_inv                       0.00
total_pymnt                         0.00
total_pymnt_inv                     0.00
total_rec_prncp                     0.00
total_rec_int                       0.00
total_rec_late_fee                  0.00
recoveries                          0.00
collection_recovery_fee             0.00
last_pymnt_d                        0.18
last_pymnt_amnt                     0.00
next_pymnt_d                       97.13
last_credit_pull_d                  0.01
collections_12_mths_ex_med          0.14
mths_since_last_major_derog       100.00
policy_code                         0.00
application_type                    0.00
annual_inc_joint                  100.00
dti_joint                         100.00
verification_status_joint         100.00
acc_now_delinq                      0.00
tot_coll_amt                      100.00
tot_cur_bal                       100.00
open_acc_6m                       100.00
open_il_6m                        100.00
open_il_12m                       100.00
open_il_24m                       100.00
mths_since_rcnt_il                100.00
total_bal_il                      100.00
il_util                           100.00
open_rv_12m                       100.00
open_rv_24m                       100.00
max_bal_bc                        100.00
all_util                          100.00
total_rev_hi_lim                  100.00
inq_fi                            100.00
total_cu_tl                       100.00
inq_last_12m                      100.00
acc_open_past_24mths              100.00
avg_cur_bal                       100.00
bc_open_to_buy                    100.00
bc_util                           100.00
chargeoff_within_12_mths            0.14
delinq_amnt                         0.00
mo_sin_old_il_acct                100.00
mo_sin_old_rev_tl_op              100.00
mo_sin_rcnt_rev_tl_op             100.00
mo_sin_rcnt_tl                    100.00
mort_acc                          100.00
mths_since_recent_bc              100.00
mths_since_recent_bc_dlq          100.00
mths_since_recent_inq             100.00
mths_since_recent_revol_delinq    100.00
num_accts_ever_120_pd             100.00
num_actv_bc_tl                    100.00
num_actv_rev_tl                   100.00
num_bc_sats                       100.00
num_bc_tl                         100.00
num_il_tl                         100.00
num_op_rev_tl                     100.00
num_rev_accts                     100.00
num_rev_tl_bal_gt_0               100.00
num_sats                          100.00
num_tl_120dpd_2m                  100.00
num_tl_30dpd                      100.00
num_tl_90g_dpd_24m                100.00
num_tl_op_past_12m                100.00
pct_tl_nvr_dlq                    100.00
percent_bc_gt_75                  100.00
pub_rec_bankruptcies                1.75
tax_liens                           0.10
tot_hi_cred_lim                   100.00
total_bal_ex_mort                 100.00
total_bc_limit                    100.00
total_il_high_credit_limit        100.00
dtype: float64
In [6]:
# Finding columns with a single occurring value
for col in df.columns:
    if df[col].nunique() == 1:
        print(col, df[col].nunique())
pymnt_plan 1
initial_list_status 1
collections_12_mths_ex_med 1
policy_code 1
application_type 1
acc_now_delinq 1
chargeoff_within_12_mths 1
delinq_amnt 1
tax_liens 1

Data Cleaning

In [7]:
# Remove duplicate rows
df.drop_duplicates(inplace=True)
In [8]:
# Dropping columns where ALL values are NA
all_na_cols = [col for col in df.columns if df[col].isna().all()]
df.drop(all_na_cols, axis=1, inplace=True)
In [9]:
# Removing columns with a single occurring value (no variance, so no analytical use)
single_value_cols = [col for col in df.columns if df[col].nunique() < 2]
df.drop(single_value_cols, axis=1, inplace=True)
In [10]:
# Drop other unnecessary columns 
drop_cols = ['member_id', 'url', 'title', 'desc', 'zip_code']
df.drop(drop_cols, axis=1, inplace=True)
In [11]:
# Removing % sign and converting to float for 'int_rate', 'revol_util'
for col in ['int_rate', 'revol_util']:
    df[col] = df[col].str.replace('%', '').astype(float)
In [12]:
# changing strings to standard format
cols = ['term', 'emp_title', 'home_ownership', 'verification_status', 'loan_status', 'purpose']
for col in cols:
    df[col] = df[col].str.strip().str.lower().str.replace(' ', '_')
In [13]:
# If pub_rec_bankruptcies is not known, we assume the borrower has no recorded bankruptcies.
df['pub_rec_bankruptcies'].fillna(0, inplace=True)
df['pub_rec_bankruptcies'] = df['pub_rec_bankruptcies'].astype(int)
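
A side note: on newer pandas releases (2.1 and later), calling fillna(..., inplace=True) on a single selected column raises a chained-assignment FutureWarning and, under copy-on-write, may not modify df at all. A minimal equivalent using plain assignment (same column and default value assumed):

In [ ]:
# Equivalent assignment form that avoids chained inplace modification
df['pub_rec_bankruptcies'] = df['pub_rec_bankruptcies'].fillna(0).astype(int)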
In [14]:
# Change data type from string to datetime for all columns with dates
cols = ['issue_d', 'earliest_cr_line', 'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d']
for col in cols:
    df[col] = pd.to_datetime(df[col], format='%b-%y')
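
One caveat with the %b-%y format: Python's two-digit-year rule maps 00-68 to 2000-2068, so a value such as Jan-62 in earliest_cr_line would parse as 2062. Whether such dates exist in this file is an assumption worth checking; if they do, a small correction like the sketch below shifts future-dated credit lines back a century:

In [ ]:
# Guard against %y century ambiguity: a credit line cannot start after the loan was issued,
# so any earliest_cr_line parsed into the future is shifted back 100 years.
future_mask = df['earliest_cr_line'] > df['issue_d']
df.loc[future_mask, 'earliest_cr_line'] -= pd.DateOffset(years=100)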
In [15]:
df.head()
Out[15]:
[Output truncated: first five rows of the cleaned dataframe (43 columns), too wide to render here. Dates are now datetime64 and string columns are lower-cased with underscores, e.g. term = '36_months' and loan_status = 'fully_paid' / 'charged_off' / 'current'.]

Feature Extraction

Creating 'defaulted' column

In [16]:
# Creating binary column "defaulted" -> States whether that borrower defaulted on particular loan
df['defaulted'] = 0
df.loc[df['loan_status'] == 'charged_off', 'defaulted'] = 1
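
An equivalent vectorised one-liner, plus a quick sanity check of the overall default rate (no particular figure is assumed here; the print simply reports whatever the data contains):

In [ ]:
# Vectorised equivalent of the two lines above
df['defaulted'] = (df['loan_status'] == 'charged_off').astype(int)
print(f"Overall default rate: {df['defaulted'].mean() * 100:.2f}%")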

Creating 'issue_year', 'issue_month'

In [17]:
# Creating columns for "Issue Year", "Issue Month" -> (Year and month of issue of loan)
df['issue_year'] = df['issue_d'].dt.year
df['issue_month'] = df['issue_d'].dt.month

Placing borrower's income into categories

Annual_inc: This is the annual salary of the borrower in question.

In [94]:
plt.figure(figsize = (12, 6))
plt.subplot(121)

df['annual_inc'].hist(color='k')
plt.xlabel("Income", fontsize=12)
plt.ylabel("Count", fontsize=12)

plt.subplot(122)
df['annual_inc'].plot(kind='box', logy=True)

plt.tight_layout()
plt.show()

print(df['annual_inc'].describe())
count    3.971700e+04
mean     6.896893e+04
std      6.379377e+04
min      4.000000e+03
25%      4.040400e+04
50%      5.900000e+04
75%      8.230000e+04
max      6.000000e+06
Name: annual_inc, dtype: float64

The histogram above is hard to read because of outliers. Excluding them from the visualisation gives a clearer picture of how income is distributed.

In [95]:
plt.figure(figsize = (12, 6))
plt.subplot(121)

df[df['annual_inc'] < 400000]['annual_inc'].hist(color='k')
plt.xlabel("Income", fontsize=12)
plt.ylabel("Count", fontsize=12)

plt.subplot(122)
df[df['annual_inc'] < 400000]['annual_inc'].plot(kind='box')

plt.tight_layout()
plt.show()

# Low income: 100,000 or less; mid income: between 100,000 and 200,000; high income: above 200,000
df['inc_cat'] = 'mid_inc'
df.loc[df['annual_inc'] <= 100000, 'inc_cat'] = 'low_inc'
df.loc[df['annual_inc'] > 200000, 'inc_cat'] = 'high_inc'
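
The same binning can also be expressed with pd.cut, which keeps the three thresholds in one place; a minimal equivalent sketch (note that pd.cut returns a categorical column rather than plain strings):

In [ ]:
# Equivalent binning: (0, 100000] -> low_inc, (100000, 200000] -> mid_inc, (200000, inf) -> high_inc
df['inc_cat'] = pd.cut(df['annual_inc'],
                       bins=[0, 100000, 200000, np.inf],
                       labels=['low_inc', 'mid_inc', 'high_inc'])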

Grouping the states into regions

The US states are grouped into five major regions based on: https://en.wikipedia.org/wiki/List_of_regions_of_the_United_States

In [20]:
# Grouping the states into regions -> North_east, South_East, South_West, Mid_West, West
def finding_regions(state):
    north_east = ['CT', 'NY', 'PA', 'NJ', 'RI','MA', 'MD', 'VT', 'NH', 'ME']
    south_west = ['AZ', 'TX', 'NM', 'OK']
    south_east = ['GA', 'NC', 'VA', 'FL', 'KY', 'SC', 'LA', 'AL', 'WV', 'DC', 'AR', 'DE', 'MS', 'TN' ]
    west = ['CA', 'OR', 'UT','WA', 'CO', 'NV', 'AK', 'MT', 'HI', 'WY', 'ID']
    mid_west = ['IL', 'MO', 'MN', 'OH', 'WI', 'KS', 'MI', 'SD', 'IA', 'NE', 'IN', 'ND']
    
    if state in west:
        return 'west'
    elif state in south_west:
        return 'south_west'
    elif state in south_east:
        return 'south_east'
    elif state in mid_west:
        return 'mid_west'
    elif state in north_east:
        return 'north_east'
    

df['region'] = df['addr_state'].apply(finding_regions)
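
A more compact alternative is to invert these lists into a single state-to-region dictionary and use Series.map; a sketch with the same groupings:

In [ ]:
# Build a state -> region lookup once, then map it in a single vectorised call
region_lists = {
    'north_east': ['CT', 'NY', 'PA', 'NJ', 'RI', 'MA', 'MD', 'VT', 'NH', 'ME'],
    'south_west': ['AZ', 'TX', 'NM', 'OK'],
    'south_east': ['GA', 'NC', 'VA', 'FL', 'KY', 'SC', 'LA', 'AL', 'WV', 'DC', 'AR', 'DE', 'MS', 'TN'],
    'west': ['CA', 'OR', 'UT', 'WA', 'CO', 'NV', 'AK', 'MT', 'HI', 'WY', 'ID'],
    'mid_west': ['IL', 'MO', 'MN', 'OH', 'WI', 'KS', 'MI', 'SD', 'IA', 'NE', 'IN', 'ND'],
}
state_to_region = {state: region for region, states in region_lists.items() for state in states}
df['region'] = df['addr_state'].map(state_to_region)  # unmapped states become NaN, as with the function above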
In [21]:
df.head()
Out[21]:
[Output truncated: first five rows of the dataframe with the new feature columns appended (defaulted, issue_year, issue_month, inc_cat, region), 48 columns in total, too wide to render here.]

EDA

Loan_amnt: The loan amount requested by the borrower/applicant.

In [122]:
# Distribution Of Loan Amount

plt.figure(figsize=(12,6))

plt.subplot(121)
g = sns.distplot(df["loan_amnt"], bins=12)
g.set_xlabel("", fontsize=12)
g.set_ylabel("Frequency Dist", fontsize=12)
g.set_title("Frequency Distribuition", fontsize=20)

plt.subplot(122)
g1 = sns.violinplot(y="loan_amnt", data=df, 
               inner="quartile", palette="pastel")
g1.set_xlabel("", fontsize=12)
g1.set_ylabel("Amount Dist", fontsize=12)
g1.set_title("Amount Distribuition", fontsize=20)

plt.show()

Loan Status: This tells us whether the loan has been successfully paid off, has been written off (borrower has defaulted), or is currently in process.

In [86]:
# Count of Loan Statuses
with plt.style.context('ggplot'):
    plt.figure(figsize = (14, 8))

    plt.subplot(121)
    x = df['loan_status'].unique()
    y = [len(df[df.loan_status == cat]) for cat in x]
    plt.title("Count of Loan Status", fontsize=16)
    plt.ylabel("Count", fontsize=12)
    plt.bar(x, y, color='xkcd:blue green')

    plt.subplot(122)
    g = sns.countplot(x='issue_year', data=df,
                      hue='loan_status', palette='Set2')
    g.set_xticklabels(g.get_xticklabels(),rotation=90)
    g.set_xlabel("Year", fontsize=12)
    g.set_ylabel("Count", fontsize=12)
    g.legend(loc='upper left')
    g.set_title("Loan Status by Year", fontsize=16)

    plt.tight_layout()
    plt.show()
In [96]:
# Count of 'income by category'
with plt.style.context('ggplot'):
    plt.figure(figsize = (10,12))

    x = df['inc_cat'].unique()
    y = np.array([len(df[df.inc_cat == cat]) for cat in x])
    y2 = np.array([len(df[(df.inc_cat == cat) & (df.defaulted == 1)]) for cat in x])

    plt.ylabel("Count", fontsize=12)
    plt.bar(x, y, width=0.4, color='k')
    plt.bar(x, y2, width=0.4, color='xkcd:blue green')
    plt.legend(labels=["Total", "Defaulted"])
    plt.show()

    print("Ratio of defaulted to total for each income category:")
    for inc_cat, pct in zip(x, y2/y):
        print(f"{inc_cat}: {round(pct*100, 2)}", end='\t')
Ratio of defaulted to total for each income category:
low_inc: 14.8	mid_inc: 10.43	high_inc: 10.24	

Most borrowers fall in the low income category; there are so many of them that the low income defaulters alone are comparable in number to all mid income borrowers. Low income borrowers also default at the highest rate (14.8%, versus roughly 10% for the mid and high income categories).
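
The same default ratios can be read off directly with a groupby, which avoids the manual list comprehensions used in the plotting cells; a minimal sketch whose output should match the figures printed above:

In [ ]:
# Default rate per income category, in percent
default_rate = df.groupby('inc_cat')['defaulted'].mean().mul(100).round(2)
print(default_rate.sort_values(ascending=False))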

Grade: This is a grade assigned to a borrower based on their credit history and likelihood of repaying the loan. The better the grade, the lower the interest rate on the loan.

Sub Grade: Grades can further be divided into subgrades based on finer margins.

In [99]:
# Count of Grade and sub Grade

plt.figure(figsize = (14,8))

x = sorted(df['grade'].unique())
y = np.array([len(df[df.grade == cat]) for cat in x])
y2 = np.array([len(df[(df.grade == cat) & (df.defaulted == 1)]) for cat in x])

plt.bar(x, y, color='k')
plt.bar(x, y2, color='xkcd:blue green')
plt.ylabel("Count", fontsize=12)
plt.legend(labels=["Total", "Defaulted"])
plt.show()

print("Ratio of defaulted to total for each grade:")
for grade, pct in zip(x, y2/y):
    print(f"{grade}: {round(pct*100, 2)}", end='\t')

plt.figure(figsize = (14,8))
x = sorted(df['sub_grade'].unique())
y = [len(df[df.sub_grade == cat]) for cat in x]
y2 = [len(df[(df.sub_grade == cat) & (df.defaulted == 1)]) for cat in x]
plt.xticks(rotation='vertical', fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.bar(x, y, color='k')
plt.bar(x, y2, color='xkcd:blue green')
plt.legend(labels=["Total", "Defaulted"])
plt.show()
Ratio of defaulted to total for each grade:
A: 5.97	B: 11.86	C: 16.63	D: 21.07	E: 25.16	F: 30.41	G: 31.96	

Although grade B and C loans account for the highest number of defaults, the highest default ratios belong to grades F and G.

Below, we check how income is distributed among borrowers in the bottom three grades (E, F and G).

In [102]:
# how income is distributed among borrowers in grade E, F, G
sub_df = df[(df['grade'] == 'E') | (df['grade'] == 'F') | (df['grade'] == 'G')]

plt.figure(figsize = (10,12))

x = sub_df['inc_cat'].unique()
y = np.array([len(sub_df[sub_df.inc_cat == cat]) for cat in x])
y2 = np.array([len(sub_df[(sub_df.inc_cat == cat) & (sub_df.defaulted == 1)]) for cat in x])

plt.ylabel("Count", fontsize=12)
plt.bar(x, y, width=0.4, color='k')
plt.bar(x, y2, width=0.4, color='xkcd:blue green')
plt.legend(labels=["Total", "Defaulted"])
plt.show()

print("Ratio of defaulted to total for each income category:")
for inc_cat, pct in zip(x, y2/y):
    print(f"{inc_cat}: {round(pct*100, 2)}", end='\t')
Ratio of defaulted to total for each income category:
low_inc: 28.47	mid_inc: 21.58	high_inc: 22.95	

High income individuals with a bad grade are uncommon. 28.5% of all low income borrowers in grades E, F and G default.

Verification Status: This indicates at what level the income of the borrower was verified.

In [106]:
# Count of verification Status
plt.figure(figsize = (10,8))

x = df['verification_status'].unique()
y = [len(df[df.verification_status == cat]) for cat in x]
y2 = [len(df[(df.verification_status == cat) & (df.defaulted == 1)]) for cat in x]

plt.bar(x, y, color='k')
plt.bar(x, y2, color='xkcd:blue green')
plt.legend(labels=["Total", "Defaulted"])
plt.show()

No clear difference in default behaviour is visible across the verification status categories.

Term: The number of monthly payments the loan is divided into; either 36 or 60.

In [109]:
# Count of Term
plt.figure(figsize = (10,8))

x = df['term'].unique()
y = [len(df[df.term == cat]) for cat in x]
y2 = [len(df[(df.term == cat) & (df.defaulted == 1)]) for cat in x]
plt.bar(x, y, width=0.4, color='k')
plt.bar(x, y2, width=0.4, color='xkcd:blue green')
plt.legend(labels=["Total", "Defaulted"])
plt.show()

pub_rec_bankruptcies: The number of public-record bankruptcies the borrower has filed.

In [110]:
# Bankruptcies
plt.figure(figsize = (12,10))

x = ['0', '1', '2']
y = [len(df[df.pub_rec_bankruptcies == int(cat)]) for cat in x]
y2 = [len(df[(df.pub_rec_bankruptcies == int(cat)) & (df.defaulted == 1)]) for cat in x]
plt.bar(x, y, width=0.4, color='k')
plt.bar(x, y2, width=0.4, color='xkcd:blue green')
plt.legend(labels=["Total", "Defaulted"])
plt.show()

emp_title: Indicates who the employer of the borrower is.

In [113]:
# Count of top 20 employment titles
emp_df = df.groupby('emp_title', as_index=False).agg({'id': 'count', 'defaulted': 'sum'}).sort_values('id', ascending=False).head(20)
plt.figure(figsize=(12,6))

x = emp_df['emp_title']
y = emp_df['id']
y2 = emp_df['defaulted']

plt.xticks(rotation='vertical', fontsize=12)
plt.bar(x, y, color='k')
plt.bar(x, y2, color='xkcd:blue green')
plt.legend(labels=["Total", "Defaulted"])
plt.show()

print("Ratio of defaulted to total seggregated by employment:")
for emp, pct in zip(x, y2/y):
    print(f"{emp}: {round(pct*100, 2)}")
Ratio of defaulted to total segregated by employment:
us_army: 14.15
bank_of_america: 17.39
at&t: 19.28
walmart: 24.39
wells_fargo: 11.27
kaiser_permanente: 11.59
ibm: 13.04
verizon_wireless: 20.0
ups: 26.98
self: 13.33
usaf: 3.45
usps: 18.97
us_air_force: 10.53
self_employed: 17.54
walgreens: 15.09
us_navy: 6.12
lockheed_martin: 6.12
state_of_california: 10.2
home_depot: 17.02
us_postal_service: 22.22

The largest number of borrowers are employed by the US Army, followed by Bank of America, but the highest default ratios are among employees of:

UPS: 27%

WALMART: 24.4%

US POSTAL: 22.22%

emp_length: Number of years the borrower has spent in the work force.

In [114]:
# Count of borrowers segregated by length of employment
emp_df = df.groupby('emp_length', as_index=False).agg({'id': 'count', 'defaulted': 'sum'}).sort_values('id', ascending=False)
plt.figure(figsize=(12,10))

x = emp_df['emp_length']
y = emp_df['id']
y2 = emp_df['defaulted']

plt.xticks(rotation='vertical', fontsize=12)
plt.bar(x, y, color='k')
plt.bar(x, y2,color='xkcd:blue green')
plt.legend(labels=["Total", "Defaulted"])
plt.show()

print("Ratio of defaulted to total seggregated by years of employment:")
for emp, pct in zip(x, y2/y):
    print(f"{emp}: {round(pct*100, 2)}")
Ratio of defaulted to total segregated by years of employment:
10+ years: 14.99
< 1 year: 13.94
2 years: 12.92
3 years: 13.55
4 years: 13.45
5 years: 13.95
1 year: 14.07
6 years: 13.77
7 years: 14.83
8 years: 13.73
9 years: 12.56

Purpose: The borrower's stated purpose for taking the loan.

In [115]:
# Most frequently stated reasons
emp_df = df.groupby('purpose', as_index=False).agg({'id': 'count', 'defaulted': 'sum'}).sort_values('id', ascending=False).head(20)
plt.figure(figsize=(12,10))

x = emp_df['purpose']
y = emp_df['id']
y2 = emp_df['defaulted']

plt.xticks(rotation='vertical', fontsize=12)
plt.bar(x, y, color='k')
plt.bar(x, y2, color='xkcd:blue green')
plt.legend(labels=["Total", "Defaulted"])
plt.show()

print("Ratio of defaulted to total seggregated by purpose:")
for purpose, pct in zip(x, y2/y):
    print(f"{purpose}: {round(pct*100, 2)}")
    
print(f"\n\nDebt consolidation is the most common reason to take a loan and makes up for {round(len(df[df['purpose'] == 'debt_consolidation'])/len(df)*100, 2)}% of client base")
Ratio of defaulted to total segregated by purpose:
debt_consolidation: 14.84
credit_card: 10.57
other: 15.85
home_improvement: 11.66
major_purchase: 10.15
small_business: 25.98
car: 10.33
wedding: 10.14
medical: 15.3
moving: 15.78
house: 15.49
vacation: 13.91
educational: 17.23
renewable_energy: 18.45


Debt consolidation is the most common reason to take a loan and makes up for 46.93% of client base

The highest default ratio exists for small_business (26%), followed by renewable_energy (18.5%).

In [118]:
# Count of loans per state and region

state_df = df.groupby('addr_state', as_index=False).agg({'id': 'count', 'defaulted': 'sum'}).sort_values('id', ascending=False)
plt.figure(figsize=(16,8))

x = state_df['addr_state']
y = state_df['id']
y2 = state_df['defaulted']
plt.xticks(rotation='vertical', fontsize=12)
plt.bar(x, y, color='k')
plt.bar(x, y2,color='xkcd:blue green')
plt.legend(labels=["Total", "Defaulted"])
plt.show()

print("California (CA) is the state with highest number of borrowers at 7000 followed by New York (NY) at around 4000")

region_df = df.groupby('region', as_index=False).agg({'id': 'count', 'defaulted': 'sum'}).sort_values('id', ascending=False)
plt.figure(figsize=(16,6))

x = region_df['region']
y = region_df['id']
y2 = region_df['defaulted']
plt.xticks(fontsize=12)
plt.bar(x, y, color='k')
plt.bar(x, y2,color='xkcd:blue green')
plt.legend(labels=["Total", "Defaulted"])
plt.show()

print("Ratio of defaulted to total seggregated by geographical region:")
for region, pct in zip(x, y2/y):
    print(f"{region}: {round(pct*100, 2)}")
California (CA) is the state with highest number of borrowers at 7000 followed by New York (NY) at around 4000
Ratio of defaulted to total segregated by geographical region:
north_east: 13.25
west: 15.71
south_east: 14.64
mid_west: 13.59
south_west: 12.43

Installments: Monthly payment owed by the borrower.

In [119]:
# distribution of installments 
plt.figure(figsize=(14,10))

sns.distplot(df['installment'], color='k')
sns.distplot(df[df['defaulted'] == 1]['installment'],color='xkcd:blue green')
plt.legend(['Total', 'defaulted'])
plt.show()

Int_rate: The interest rate on the loan issued

In [120]:
#distribution of int_rate
plt.figure(figsize=(14,10))

sns.distplot(df['int_rate'])
sns.distplot(df[df['defaulted'] == 1]['int_rate'])
plt.legend(['Total', 'defaulted'])
plt.show()
In [121]:
# avg interest rate per year
by_interest = df.groupby(['issue_year', 'defaulted']).int_rate.mean()

by_interest.unstack().plot()
plt.title('Average Interest rate by Loan Condition', fontsize=14)
plt.ylabel('Interest Rate (%)', fontsize=12)
plt.show()

On average, defaulters tend to have higher interest rates on their loans, most likely because their grade is lower and lower grades carry higher interest rates.

total_rec_prncp: Principal received from the borrower to date.

In [129]:
# total_rec_prncp vs installment
plt.figure(figsize=(12, 8))
sns.scatterplot(data=df, x='total_rec_prncp', y='installment', hue='loan_status', palette='Set1')
plt.show()

total_pymnt: Total payment received from the borrower to date.

In [124]:
# total_rec_prncp vs total_pymnt
plt.figure(figsize=(12, 8))
sns.scatterplot(data=df, x='total_rec_prncp', y='total_pymnt', hue='loan_status', palette='Set1')
plt.show()

total_rec_int: Interest received from the borrower to date.

In [125]:
# total_rec_int vs total_pymnt
plt.figure(figsize=(12, 8))
sns.scatterplot(data=df, x='total_rec_int', y='total_pymnt', hue='loan_status', palette='Set1')
plt.show()

Heatmap

The heatmap below visualises how the relevant numeric features are correlated.

In [55]:
numeric_variables = df[['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'int_rate', 'installment',
                        'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec',
                        'revol_bal', 'revol_util', 'total_acc', 'out_prncp', 'out_prncp_inv',
                        'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
                        'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
                        'last_pymnt_amnt', 'defaulted']]

plt.figure(figsize=(14, 10))
sns.heatmap(numeric_variables.corr(), cmap='RdYlGn', annot=True)
plt.show()
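
Since the summary below quotes individual correlations with 'defaulted' (for example for int_rate and total_rec_prncp), it can help to pull that column out of the matrix and sort it; a minimal sketch:

In [ ]:
# Correlation of each numeric feature with the 'defaulted' flag, from most negative to most positive
corr_with_default = numeric_variables.corr()['defaulted'].drop('defaulted').sort_values()
print(corr_with_default.round(2))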

Summary

Five significant predictors of a borrower defaulting are:

Income:

Borrowers in the 'low income' category make up 85.68% of the client base, an overwhelming majority, and about 15% of them default on their loans. The probability of default is roughly 4 percentage points higher for low income borrowers than for mid and high income borrowers. Among low income borrowers with a bad grade (E, F or G), 28.5% default.

Grade:

Grade and default rate are strongly inversely related. The highest default ratios occur for the three worst grades: G (31.96%), F (30.41%) and E (25.16%).

Interest Rate:

There is a 20% positive correlation between interest rate and tendency to default. Interest rate is strongly tied to grade: the better the grade, the lower the interest rate. Borrowers without a good grade (A, B) therefore receive higher interest rates, which reduce their chance of paying off the loan and thus increase the chance of default.

Total Received Principal (total_rec_prncp):

There is a strong negative correlation (34%) between the principal received to date and tendency to default. This shows up in the scatter plots: for a given amount of principal received, defaulters tend to have higher installment amounts and higher total payments.

Purpose of Loan:

47% of borrowers stated 'debt consolidation' as the purpose of the loan, making it by far the most common purpose. The purpose with the highest default ratio is 'small business' (26%), followed by 'renewable energy' (18%).