An education company named X Education sells online courses to industry professionals. The company markets its courses on several websites and search engines like Google. Once people land on the website, they might browse the courses, fill out a form for a course, or watch some videos. When they fill out a form providing their email address or phone number, they are classified as a lead.
Once these leads are acquired, employees from the sales team start making calls, writing emails, and so on. Through this process some of the leads get converted, while most do not. The typical lead conversion rate at X Education is around 30%.
To make this process more efficient, the company wishes to identify the most promising leads, also known as 'Hot Leads'. If this set of leads is identified successfully, the lead conversion rate should go up, as the sales team will focus on communicating with the potential leads rather than calling everyone. A typical lead conversion process can be represented using the following funnel:
"The company requires a model that will assign a lead score to each of the leads such that the customers with higher lead score have a higher conversion chance and the customers with lower lead score have a lower conversion chance."
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
style.use('ggplot')
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score, precision_score, recall_score, precision_recall_curve, f1_score, accuracy_score
import warnings
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", 50)
# Importing data
df = pd.read_csv("D:\\DataSets\\UPGRAD\\Assignments\\Lead+Scoring+Case+Study\\Leads.csv")
df.head()
df.info()
df.describe()
# Standardising column names (lower-case, underscores, keep only the first three words)
df.columns = ['_'.join(name.lower().split()[:3]) for name in df.columns]
# Standardising categorical values (lower-case, underscores)
cat_cols = df.select_dtypes(include=['object']).columns
df[cat_cols] = df[cat_cols].apply(lambda x: x.str.strip().str.lower().str.replace(' ', '_'))
# Taking a look at all unique values for every column to check for abnormalities/inconsistencies
for col in df.columns[2:]:
    print(col.upper(), end=': ')
    print(df[col].unique())
    print()
By going through the above unique entries of every column, we have observed the following inconsistencies:

lead_source
-> "welearn" and "welearnblog_home" are one and the same. "facebook" and "social_media" are listed as two different categories.

country
-> "australia", the Asian countries and "asia/pacific_region" are listed as separate categories.

tags
-> "invalid_number" and "wrong_number_given" are one and the same.

last_activity and last_notable_activity
-> These two columns are highly correlated, so one of them should be dropped to avoid multicollinearity (a quick check is sketched right after this list).

select
-> "select" is equivalent to NaN (it is the default placeholder of a drop-down field that was never filled in).
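Before dropping one of the two activity columns, a quick sanity check (a sketch, assuming the standardised column names produced above) shows how strongly last_activity and last_notable_activity agree:
# Sketch: quantify the agreement between the two activity columns before one is dropped
overlap = (df.last_activity == df.last_notable_activity).mean()
print(f"Rows where both columns carry the same label: {round(overlap * 100, 2)}%")
# A crosstab shows how the categories of one column map onto the other
pd.crosstab(df.last_activity, df.last_notable_activity)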
# Removing redundant columns
df.drop(['prospect_id', 'lead_number', 'receive_more_updates', 'update_me_on', 'get_updates_on', 'i_agree_to', 'last_activity'], axis=1, inplace=True)
# Replacing all 'select' values with NaN
df = df.replace('select', np.nan)
# replacing "wrong number given" with "invalid number"
df.tags = df.tags.replace("wrong_number_given", "invalid_number")
# Cleaning "lead_source column"
value_counts = df.lead_source.value_counts()
# Values with frequency less than 30 are classified under "others"
df.lead_source = df.lead_source.replace(value_counts[value_counts < 30].index, "others")
# Cleaning "country"
value_counts = df.country.value_counts()
# We will categorise country as binary: 'india' and 'others'
df.country = df.country.replace(df.country[df.country != 'india'].dropna().unique(), "others")
# Percentage of NA values in every column
round(df.isna().sum().sort_values(ascending=False)/len(df)*100, 2)
# Column "how_did_you" has too many NA values that can not be imputed. It is better to drop the column.
df.drop('how_did_you', axis=1, inplace=True)
# Column "lead_profile" has too many NA values that can not be imputed.
print(df.lead_profile.value_counts())
print("\n\nWe could convert the column to binary by defining whether it is a 'potential_lead' or not.")
print(f"\nIf made binary variable:\nis_potential_lead: {df.lead_profile.value_counts()[0]}\nis_not_lead: {len(df) - df.lead_profile.value_counts()[0]}\n")
print("But given the high frequency of NA values, There is a lot of uncertainty whether those values are leads or not. This forces us to drop the column altogether.")
df.drop('lead_profile', axis=1, inplace=True)
# Column "lead_quality" has 52% NA values. We'll classify them as "not_sure".
df.lead_quality.fillna("not_sure", inplace=True)
print(df.lead_quality.value_counts())
# Asymmetrique scores
for col in ['profile_score', 'activity_score']:
    sns.distplot(df['asymmetrique_' + col].dropna())
    plt.show()
print(f"Both scores roughly follow standard distribution, therefore imputation can be done by median value.\nThough instead of using scores, we shall use the categorical index columns derived from these scores.\nMissing values will be filled by most frequently occuring category.")
# Dropping Scores (Only Indexes will be used)
df.drop(['asymmetrique_profile_score', 'asymmetrique_activity_score'], axis=1, inplace=True)
# Removing the numeric prefixes (e.g. '01.') from the category names
df['asymmetrique_profile_index'] = df['asymmetrique_profile_index'].str[3:]
df['asymmetrique_activity_index'] = df['asymmetrique_activity_index'].str[3:]
# Replacing NaN with the most commonly occurring category
df['asymmetrique_activity_index'].fillna(df['asymmetrique_activity_index'].mode().values[0], inplace=True)
df['asymmetrique_profile_index'].fillna(df['asymmetrique_profile_index'].mode().values[0], inplace=True)
# Imputing for 'Country' and 'City'
print("\nWe'll first try understanding how cities are labelled for different country categories\n")
for country in df.country.dropna().unique():
    print(country.upper())
    print(df[df.country == country]['city'].value_counts())
    print()
print("It looks like customers outside India are wrongly marked as being from domestic cities.\n")
print("\nLets look at how countries are labeled w.r.t city categories\n")
for city in df.city.dropna().unique():
    print(city.upper())
    print(df[df.city == city]['country'].value_counts())
    print()
# We will make a new city label ('international') for countries labelled as 'others'
df.loc[df.country == "others", 'city'] = 'international'
# We'll also assume that rows with a domestic city label must have the country label 'india'.
df.loc[(df.city != "international") & ~df.city.isna(), 'country'] = 'india'
# We'll assume that rows with country 'india' and a missing city take the most frequently occurring city label.
df.loc[(df.country == "india") & (df.city.isna()), 'city'] = 'mumbai'
print(f"There are {len(df[(df.city.isna()) & (df.country.isna())])} entries where both country and city are NA")
print("We'll mark these as 'unknown'")
df[['city', 'country']] = df[['city', 'country']].fillna('unknown')
# A few columns have less than 2% NA values. We can afford to drop their respective rows altogether.
df.dropna(subset=['totalvisits', 'page_views_per', 'lead_source'], inplace=True)
# These columns have too many NA values and no logical way to impute them, so we will drop them entirely
df.drop(['specialization', 'tags', 'what_matters_most', 'what_is_your'] , axis=1, inplace=True)
round(df.isna().sum().sort_values(ascending=False), 2)
Some features have hierarchical categories, which would call for ordinal encoding. But the resulting integer codes impose an arbitrary scale, so the features would not be scale invariant. It is therefore better to one-hot encode all categorical variables, as illustrated in the sketch below.
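For illustration, a hierarchical column such as asymmetrique_activity_index could be encoded ordinally with a hand-made mapping, but the resulting integers impose an arbitrary scale, whereas one-hot encoding keeps the categories scale-free. The snippet below is a sketch only and is not applied to the final dataset:
# Sketch only: contrasting ordinal encoding with one-hot encoding for a hierarchical feature
ordinal_map = {'low': 0, 'medium': 1, 'high': 2}
print(df['asymmetrique_activity_index'].map(ordinal_map).head())
print(pd.get_dummies(df['asymmetrique_activity_index'], prefix='activity_index', drop_first=True).head())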
# dummy encoding for the categorical variables
dummies = pd.get_dummies(df.select_dtypes(include=['object']), drop_first=True)
# getting the cleaned df
clean_df = df.drop(df.select_dtypes(include=['object']).columns, axis=1)
clean_df = pd.concat([clean_df, dummies], axis=1)
clean_df.head()
print(f"{round(len(clean_df)/9239*100,2)}% data has been retained after data cleaning.")
# visualizing correlation by heatmap
plt.figure(figsize=(14, 10))
sns.heatmap(clean_df.corr())
plt.show()
# heatmap of only continuous variables
sns.heatmap(clean_df[clean_df.columns[:4]].corr())
plt.show()
# Column pairs in order of highest absolute correlation
clean_df.corr().abs().unstack().sort_values(ascending=False).drop_duplicates().head(12)
# Dropping variables with high multi-collinearity
clean_df.drop(['lead_source_facebook', 'lead_origin_lead_add_form', 'lead_source_olark_chat'], axis=1, inplace=True)
# Top 5 features correlated with target variable
clean_df.corr()['converted'].abs().sort_values(ascending=False).head(6)[1:]
# Histograms of the continuous variables
def plot_bars():
    plt.figure(figsize=(12, 4))
    plt.subplot(121)
    sns.distplot(clean_df['total_time_spent'])
    plt.subplot(122)
    sns.distplot(clean_df['totalvisits'])
    plt.tight_layout()
    plt.show()
plot_bars()
# Boxplots before outlier removal
num_df = clean_df[['converted', 'totalvisits', 'page_views_per', 'total_time_spent']]
def plot_boxes():
    plt.figure(figsize=(12, 4))
    plt.subplot(121)
    sns.boxplot(data=clean_df, x='total_time_spent')
    plt.subplot(122)
    sns.boxplot(data=clean_df, x='totalvisits')
    plt.tight_layout()
    plt.show()
    plt.figure(figsize=(6, 6))
    box_long = pd.melt(num_df.drop('total_time_spent', axis=1), id_vars='converted')
    sns.boxplot(x='converted', y='value', hue='variable', data=box_long)
    plt.show()
plot_boxes()
# Removing outliers: keep only rows below the 97th percentile of each continuous variable
out_df = num_df.drop('converted', axis=1)
q = out_df.quantile(0.97)
out_df = out_df[out_df < q]
out_df = out_df.dropna()
clean_df = clean_df.loc[out_df.index].reset_index(drop=True)
num_df = num_df.loc[out_df.index].reset_index(drop=True)
print(f"{round(len(clean_df)/9239*100,2)}% data has been retained after outlier removal.")
# After Outlier removal
plot_bars()
plot_boxes()
sns.pairplot(data=clean_df, vars=clean_df.columns[1:4], hue='converted', kind='reg', height=3,
plot_kws={'scatter_kws': {'alpha': 0.1}})
plt.show()
We will perform a stratified train-test split so that the class imbalance of the target variable is reproduced in both sets, preventing bias from an uneven split.
# Stratified Train Test Split
X = clean_df.drop('converted', axis=1)
y = clean_df['converted']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)
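As a quick check (a sketch), the stratified split should leave the proportion of converted leads nearly identical in both sets:
# Sanity check: stratification keeps the class proportions consistent across train and test
print(y_train.value_counts(normalize=True).round(3))
print(y_test.value_counts(normalize=True).round(3))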
We will perform Min-Max scaling on the continuous numerical variables.
# Min Max Scaling
scaler = MinMaxScaler()
cols = X_train.columns
scaled_X_train = pd.DataFrame(scaler.fit_transform(X_train[cols[:3]]), columns=cols[:3])
scaled_X_train = pd.concat([scaled_X_train, X_train.drop(cols[:3], axis=1).reset_index(drop=True)], axis=1)
scaled_X_test = pd.DataFrame(scaler.transform(X_test[cols[:3]]), columns=cols[:3])
scaled_X_test = pd.concat([scaled_X_test, X_test.drop(cols[:3], axis=1).reset_index(drop=True)], axis=1)
# Generating principal components of continuous variables
pca = PCA(random_state=42)
decomp_df = pca.fit_transform(scaled_X_train[cols[:3]])
pc_frame = pd.DataFrame({'Features': cols[:3], 'PC1': pca.components_[0], 'PC2': pca.components_[1], 'PC3': pca.components_[2]})
pc_frame
# Plotting 2 principal components for the numeric variables
plt.figure(figsize=(12, 6))
sns.scatterplot(data=pc_frame, x='PC1', y='PC2')
for i, text in enumerate(pc_frame.Features):
    plt.annotate(text, (pc_frame.PC1[i], pc_frame.PC2[i]))
plt.show()
# No correlation exists between the PCs
decomp_df = pd.DataFrame(decomp_df, columns=['PC1', 'PC2', 'PC3'])
sns.heatmap(decomp_df.corr())
plt.show()
# Pairplot Of all 3 PCs
sns.pairplot(data=pd.concat([decomp_df, y_train.reset_index(drop=True)], axis=1),
vars=['PC1', 'PC2', 'PC3'], hue='converted', kind='reg', height=3,
plot_kws={'scatter_kws': {'alpha': 0.1}})
plt.show()
# We'll use all 3 principal components
scaled_X_train = pd.concat([scaled_X_train.reset_index(drop=True), decomp_df], axis=1)
# applying to test set
decomp_test = pca.transform(scaled_X_test[cols[:3]])
decomp_test = pd.DataFrame(decomp_test, columns=['PC1', 'PC2', 'PC3'])
scaled_X_test = pd.concat([scaled_X_test.reset_index(drop=True), decomp_test], axis=1)
By iterating over the number of features selected by RFE, we will see which feature count gives the highest cross-validated accuracy.
def optimal_features(min_feats, max_feats):
    opt = list()
    for n_features in range(min_feats, max_feats):
        log_reg = LogisticRegression(C=2, random_state=42)
        rfe = RFE(log_reg, n_features_to_select=n_features)
        rfe.fit(scaled_X_train, y_train)
        cols = scaled_X_train.columns[rfe.support_]
        # Cross Validation
        scores = cross_validate(log_reg, scaled_X_train[cols], y_train, return_train_score=True, cv=5, scoring=['accuracy'])
        opt.append((n_features, scores['test_accuracy'].mean()))
    opt = np.array(opt)
    return opt, opt[opt[:, 1].argmax()]
feat_array, opt_features = optimal_features(10, 50)
Here we'll plot how the cross-validated accuracy changes with the number of features considered.
plt.figure(figsize=(12, 6))
plt.plot(feat_array[:, 0], feat_array[:, 1])
plt.xlabel("Number of Features")
plt.ylabel("Cross Validated Mean Accuracy")
plt.show()
print(f"Optimal number of features to use is {opt_features[0]} which gives {opt_features[1]} accuracy.")
# RFE
log_reg = LogisticRegression(C=2, random_state=42)
rfe = RFE(log_reg, n_features_to_select=int(opt_features[0]))
rfe.fit(scaled_X_train, y_train)
cols = scaled_X_train.columns[rfe.support_]
print(f"The columns we'll be using are:\n\n{cols}")
# Manual Elimination by VIF analysis
def get_vif():
    vif = pd.DataFrame()
    vif['Features'] = scaled_X_train[cols].columns
    vif['VIF'] = [variance_inflation_factor(scaled_X_train[cols].values, i) for i in range(scaled_X_train[cols].shape[1])]
    vif['VIF'] = round(vif['VIF'], 2)
    vif = vif.sort_values(by="VIF", ascending=False)
    return vif
get_vif()
The VIF of total_time_spent is far too high (VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on all the other features; values much above roughly 5 indicate problematic multicollinearity). We will eliminate the feature and check the VIFs again.
cols = cols.drop('total_time_spent')
get_vif().head()
The VIFs of the remaining columns are now in check, so we can proceed with the analysis.
to_drop = ['last_notable_activity_resubscribed_to_emails', 'lead_source_welingak_website']
X_train_sm = sm.add_constant(scaled_X_train[cols].drop(to_drop, axis=1))
logm2 = sm.GLM(list(y_train), X_train_sm, family=sm.families.Binomial())
res = logm2.fit()
res.summary()
# Cross Validation
scores = cross_validate(log_reg, scaled_X_train[cols], y_train, return_train_score=True, cv=5, scoring=['accuracy'])
print(f"Cross validated mean accuracy: {round(scores['test_accuracy'].mean(), 3)}")
log_reg.fit(scaled_X_train[cols], y_train)
pred = log_reg.predict(scaled_X_train[cols])
prob_est = log_reg.predict_proba(scaled_X_train[cols])
The data at hand has a class imbalance, so accuracy alone is not a sufficient metric for judging the model.

Sensitivity (recall) tells us what percentage of the leads that actually converted were correctly identified as converted.

Specificity tells us what percentage of the leads that did NOT convert were correctly identified as such.

Precision is the probability that, given a positive prediction, the lead actually converted.

If correctly identifying positives is important to us, we should choose a model with higher sensitivity. If correctly identifying negatives is more important, we should use specificity as the measurement metric.

The F1 score is the harmonic mean of precision and recall, and is a good single metric to hold the model against.
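For reference, all four metrics follow directly from the confusion matrix. The sketch below uses a small hypothetical label vector (not the model output) purely to spell out the definitions:
# Sketch with toy labels to spell out the metric definitions used below
toy_actual = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(toy_actual, toy_pred).ravel()
sensitivity = tp / (tp + fn)   # recall: converted leads correctly identified
specificity = tn / (tn + fp)   # non-converted leads correctly identified
precision = tp / (tp + fp)     # of the leads predicted as converted, the share that actually converted
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # harmonic mean of precision and recall
print(sensitivity, specificity, precision, f1)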
def draw_roc(actual_values, probability_estimates):
    fpr, tpr, thresholds = roc_curve(actual_values, probability_estimates, drop_intermediate=False)
    auc_score = roc_auc_score(actual_values, probability_estimates)
    plt.figure(figsize=(5, 5))
    plt.plot(fpr, tpr, label=f"ROC curve (area = {round(auc_score, 2)})")
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
def draw_prec_recall(actual_values, probability_estimates):
    p, r, thresholds = precision_recall_curve(actual_values, probability_estimates)
    plt.plot(thresholds, p[:-1], "b-", label="Precision")
    plt.plot(thresholds, r[:-1], "r-", label="Recall")
    plt.title("Precision - Recall Trade-off")
    plt.legend(loc="lower right")
    plt.show()
def get_metrics(y, pred, prob_est):
    # Confusion Matrix
    tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
    precision = precision_score(y, pred)
    recall = recall_score(y, pred)
    f = f1_score(y, pred)
    # Sensitivity, Specificity, Precision, F-score
    print(f"Sensitivity (Recall): {recall}\nSpecificity: {tn/(tn+fp)}\nPrecision: {precision}\nF-Score: {f}")
    # Receiver Operating Characteristic curve
    draw_roc(y, prob_est[:, 1])
    # Precision-Recall curve
    draw_prec_recall(y, prob_est[:, 1])
get_metrics(y_train, pred, prob_est)
# Making y_train_pred_final
y_train_pred_final = pd.DataFrame({'Converted':y_train, 'probability': prob_est[:, 1]})
numbers = np.round(np.linspace(0, 1, 40, endpoint=False), 2)
for i in numbers:
    y_train_pred_final[i] = y_train_pred_final.probability.map(lambda x: 1 if x > i else 0)
# Making cutoffs
cutoff_df = pd.DataFrame(columns=['prob', 'accuracy', 'sensi', 'speci'])
num = np.round(np.linspace(0, 1, 40, endpoint=False), 2)
for i in num:
    cm1 = confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i])
    total1 = sum(sum(cm1))
    accuracy = (cm1[0, 0] + cm1[1, 1]) / total1
    speci = cm1[0, 0] / (cm1[0, 0] + cm1[0, 1])
    sensi = cm1[1, 1] / (cm1[1, 0] + cm1[1, 1])
    cutoff_df.loc[i] = [i, accuracy, sensi, speci]
# Variance between accuracy, sensitivity and specificity at each cut-off (lowest variance is roughly where the three curves meet)
cutoff_df['var'] = np.var([cutoff_df.accuracy, cutoff_df.sensi, cutoff_df.speci], axis=0)
# plotting accuracy sensitivity and specificity for various probabilities.
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.show()
cutoff_value = cutoff_df.sort_values('var').head(1)['prob'].values[0]
print(f"Optimum cut-off value is: {cutoff_value}")
# new predicted values based on cut-off
pred = (log_reg.predict_proba(scaled_X_train[cols])[:, 1] >= cutoff_value).astype(int)
print(f"Accuracy: {accuracy_score(y_train, pred)}")
get_metrics(y_train, pred, prob_est)
# Scoring against Test Set
log_reg.fit(scaled_X_train[cols], y_train)
pred = (log_reg.predict_proba(scaled_X_test[cols])[:, 1] >= cutoff_value).astype(int)
prob_est = log_reg.predict_proba(scaled_X_test[cols])
print(f"accuracy: {accuracy_score(y_test, pred)}")
get_metrics(y_test, pred, prob_est)
Finally, we have an overall accuracy of about 0.85 for our logistic regression model on the test set, i.e. about 85% of the leads are classified correctly. This meets the CEO's target of at least 80% lead conversion.
We will now perform lead scoring on the test set, assigning each lead a score from 0 to 100 based on its predicted conversion probability.
scores = pd.DataFrame({'lead_score':(prob_est[:, 1]*100).astype('int'), 'predicted_outcome': pred, 'actual_outcome': y_test}).reset_index(drop=True)
scores.head()