Ryan L. Buchanan
Student ID: 001826691
Masters Data Analytics (12/01/2020)
Program Mentor: Dan Estes
(385) 432-9281 (MST)
rbuch49@wgu.edu
</span>
Which customers are at high risk of churn, and which customer features/variables are most significantly associated with churn?
Stakeholders in the company will benefit from knowing, with a stated measure of confidence, which customers are at highest risk of churn, because this knowledge lends weight to decisions about marketing improved services to customers with these characteristics and past user experiences.
Most relevant to our decision-making process is the dependent variable "Churn", which is binary categorical with only two values, "Yes" or "No". In cleaning the data, we found three continuous numerical columns to be relevant: "Tenure" (the number of months the customer has stayed with the provider), "MonthlyCharge" (the average monthly charge to the customer) & "Bandwidth_GB_Year" (the average yearly amount of data used, in GB, per customer). Finally, the discrete numerical data from the customer survey responses regarding various customer service features are relevant to the decision-making process. In the surveys, customers provided ordinal numerical data by rating 8 customer service factors ("timely response", "timely fixes", "timely replacements", "reliability", "options", "respectful response", "courteous exchange" & "evidence of active listening") on a scale of 1 to 8 (1 = most important, 8 = least important).
Chi-square testing will be used.
# Standard data science imports
import numpy as np
import pandas as pd
from pandas import DataFrame
# Visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Statistics packages
import pylab
import statsmodels.api as sm
import statistics
from scipy import stats
# Import chisquare from SciPy.stats
from scipy.stats import chisquare
from scipy.stats import chi2_contingency
# Load data set into Pandas dataframe
df = pd.read_csv('Data/churn_clean.csv')
# Rename last 8 survey columns for better description of variables
df.rename(columns = {'Item1':'TimelyResponse',
'Item2':'Fixes',
'Item3':'Replacements',
'Item4':'Reliability',
'Item5':'Options',
'Item6':'Respectfulness',
'Item7':'Courteous',
'Item8':'Listening'},
inplace=True)
contingency = pd.crosstab(df['Churn'], df['TimelyResponse'])
contingency
contingency_pct = pd.crosstab(df['Churn'], df['TimelyResponse'], normalize='index')
contingency_pct
plt.figure(figsize=(12,8))
sns.heatmap(contingency, annot=True, cmap="YlGnBu")
# Chi-square test of independence
c, p, dof, expected = chi2_contingency(contingency)
print('p-value = ' + str(p))
In this analysis, we are looking at churn from a telecom company ("Did customers stay with or leave the company?"). "Churn" is a binary, categorical dependent variable. Therefore, we use the chi-square test of independence, a non-parametric test suited to comparing this "yes/no" target variable with another categorical variable. Our other categorical variable, "TimelyResponse", is at the ordinal level.
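To make the mechanics of the test concrete, the following is a minimal sketch that computes the chi-square statistic by hand and cross-checks it against SciPy. The counts in `observed` are hypothetical and are not taken from churn_clean.csv; they only illustrate how expected counts, the statistic and the degrees of freedom are derived.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical counts (NOT from churn_clean.csv):
# rows = Churn No/Yes, columns = three illustrative survey ratings.
observed = np.array([[30, 50, 20],
                     [20, 40, 40]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / grand_total

# Pearson chi-square statistic, degrees of freedom and upper-tail p-value
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(chi2_stat, dof)

# Cross-check against SciPy (no continuity correction is applied when dof > 1)
c, p, dof_scipy, expected_scipy = chi2_contingency(observed)
print(f"manual: chi2={chi2_stat:.4f}, dof={dof}, p={p_value:.4f}")
print(f"scipy:  chi2={c:.4f}, dof={dof_scipy}, p={p:.4f}")
```

The null hypothesis in both calculations is that the row variable and the column variable are independent; a small p-value counts as evidence against that hypothesis.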
Two continuous variables:
1. MonthlyCharge
2. Bandwidth_GB_Year
Two categorical (ordinal) variables:
1. Item1 (Timely response) - relabeled "TimelyResponse"
2. Item7 (Courteous exchange) - relabeled "Courteous"
df.describe()
# Create histograms of continuous & categorical variables
df[['MonthlyCharge', 'Bandwidth_GB_Year', 'TimelyResponse', 'Courteous']].hist()
# Adjust the layout before saving so the saved image is not clipped
plt.tight_layout()
plt.savefig('churn_pyplot.jpg')
# Create Seaborn boxplots for continuous & categorical variables
# (pass the column as x= because newer Seaborn versions require keyword arguments)
sns.boxplot(x='MonthlyCharge', data=df)
plt.show()
sns.boxplot(x='Bandwidth_GB_Year', data=df)
plt.show()
sns.boxplot(x='TimelyResponse', data=df)
plt.show()
sns.boxplot(x='Courteous', data=df)
plt.show()
Two continuous variables:
1. MonthlyCharge
2. Bandwidth_GB_Year
Two categorical (binomial & ordinal, respectively) variables:
1. Churn
2. Item7 (Courteous exchange) - relabeled "Courteous"
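Because "Churn" is listed among the bivariate variables, one way to pair it with the other columns is sketched below: a grouped summary for a continuous variable and a row-normalized crosstab for an ordinal one. The `demo` dataframe is synthetic (randomly generated, not the churn data), and its distribution parameters are assumptions chosen only so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the churn data; all values here are assumptions.
rng = np.random.default_rng(1)
n = 500
demo = pd.DataFrame({
    'Churn': rng.choice(['No', 'Yes'], size=n, p=[0.73, 0.27]),
    'MonthlyCharge': rng.normal(170, 40, size=n).round(2),
    'Courteous': rng.integers(1, 8, size=n),  # ratings 1..7 (upper bound exclusive)
})

# Continuous vs. binary: compare the distribution of charges within each churn group
charge_by_churn = demo.groupby('Churn')['MonthlyCharge'].agg(['count', 'mean', 'median'])
print(charge_by_churn)

# Ordinal vs. binary: share of each rating within each churn group (rows sum to 1)
courteous_by_churn = pd.crosstab(demo['Churn'], demo['Courteous'], normalize='index')
print(courteous_by_churn.round(3))
```

With the real data, `demo` would be replaced by `df` after the survey columns have been renamed.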
# Create dataframe for heatmap bivariate analysis of correlation
churn_bivariate = df[['MonthlyCharge', 'Bandwidth_GB_Year', 'TimelyResponse', 'Courteous']]
sns.heatmap(churn_bivariate.corr(), annot=True)
plt.show()
# Create a scatter plot of continuous variables MonthlyCharge & Bandwidth_GB_Year
churn_bivariate[churn_bivariate['MonthlyCharge'] < 300].sample(100).plot.scatter(x='MonthlyCharge',
y='Bandwidth_GB_Year')
# Create a scatter plot of categorical variables TimelyResponse & Courteous
churn_bivariate[churn_bivariate['TimelyResponse'] < 7].sample(100).plot.scatter(x='TimelyResponse',
y='Courteous')
churn_bivariate[churn_bivariate['MonthlyCharge'] < 300].plot.hexbin(x='MonthlyCharge', y='Bandwidth_GB_Year', gridsize=15)
With a p-value as large as the output of our chi-square test of independence, p-value = 0.6318335816054494, we cannot reject the null hypothesis at the standard significance level of alpha = 0.05. Given the cleaned data available, there is no evidence of a statistically significant association between the "TimelyResponse" survey ratings (essentially, "How well did we, the telecom company, take care of you as a customer?") & whether or not customers left the company. Note that the chi-square test measures association only; it cannot show that the survey factors caused customers to leave.
Clearly, with a p-value this high, p-value = 0.6318335816054494, we need to investigate further & perhaps gather more & better data. It is troubling that this dataset has yielded so little meaningful & actionable information.
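One possible direction for further investigation is to repeat the chi-square test for each survey item rather than for a single column. The sketch below assumes a helper named `chi_square_by_column` (hypothetical, not part of the original notebook) and runs it on synthetic data, so the printed p-values say nothing about the real churn data.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

ALPHA = 0.05  # same significance level as used in the analysis

def chi_square_by_column(data, target, columns, alpha=ALPHA):
    """Run a chi-square test of independence between target and each column."""
    rows = []
    for col in columns:
        table = pd.crosstab(data[target], data[col])
        stat, p, dof, _ = chi2_contingency(table)
        rows.append({'variable': col, 'chi2': stat, 'dof': dof,
                     'p_value': p, 'reject_null': p < alpha})
    return pd.DataFrame(rows)

# Synthetic stand-in for the churn data; all values here are assumptions.
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    'Churn': rng.choice(['No', 'Yes'], size=n, p=[0.73, 0.27]),
    'TimelyResponse': rng.integers(1, 8, size=n),  # ratings 1..7
    'Courteous': rng.integers(1, 8, size=n),       # ratings 1..7
})

results = chi_square_by_column(demo, 'Churn', ['TimelyResponse', 'Courteous'])
print(results)
```

When several columns are tested this way, the chance of at least one false positive grows with the number of tests, so a stricter threshold than 0.05 per test may be appropriate.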
While the tests show very little correlation, & perhaps no linear relationship, among the variables involved in timely action with regard to customer satisfaction (TimelyResponse, Fixes, Replacements & Respectfulness), we believe that these elements should be given greater emphasis, in the hope of reducing the high churn rate of 27% & helping to "increase the retention period of customers" by directing more resources toward prompt customer service (Ahmad, 2019, p. 1). Again, this seems an intuitive result, but decision-makers in the company now have reasonable verification of what might have been a "hunch".
https://wgu.hosted.panopto.com/Panopto/Pages/Sessions/List.aspx#folderID=%2237a1d719-eece-4cea-949f-ac7201896b42%22
Kaggle. (2018, May 01). Bivariate plotting with pandas. Kaggle. https://www.kaggle.com/residentmario/bivariate-plotting-with-pandas#
Sree. (2020, October 26). Predict Customer Churn in Python. Towards Data Science. https://towardsdatascience.com/predict-customer-churn-in-python-e8cd6d3aaa7
Wikipedia. (2021, May 31). Bivariate Analysis. https://en.wikipedia.org/wiki/Bivariate_analysis#:~:text=Bivariate%20analysis%20is%20one%20of,the%20empirical%20relationship%20between%20them.&text=Like%20univariate%20analysis%2C%20bivariate%20analysis%20can%20be%20descriptive%20or%20inferential.
Ahmad, A. K., Jafar, A., & Aljoumaa, K. (2019, March 20). Customer churn prediction in telecom using machine learning in big data platform. Journal of Big Data. https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0191-6
Altexsoft. (2019, March 27). Customer Churn Prediction Using Machine Learning: Main Approaches and Models. Altexsoft. https://www.altexsoft.com/blog/business/customer-churn-prediction-for-subscription-businesses-using-machine-learning-main-approaches-and-models/
Bruce, P., Bruce, A., & Gedeck, P. (2020). Practical Statistics for Data Scientists. O'Reilly.
Freedman, D., Pisani, R., & Purves, R. (2018). Statistics. W. W. Norton & Company, Inc.
Frohbose, F. (2020, November 24). Machine Learning Case Study: Telco Customer Churn Prediction. Towards Data Science. https://towardsdatascience.com/machine-learning-case-study-telco-customer-churn-prediction-bc4be03c9e1d
Griffiths, D. (2009). A Brain-Friendly Guide: Head First Statistics. O'Reilly.
NIH. (2020). National Library of Medicine. https://www.nlm.nih.gov/nichsr/stats_tutorial/section2/mod11_significance.html#:~:text=In%20statistical%20tests%2C%20statistical%20significance,set%20to%200.05%20(5%25).
P-Values. (2020). StatsDirect Limited. https://www.statsdirect.com/help/basics/p_values.htm