Performance Assessment | D207 Exploratory Data Analysis

Ryan L. Buchanan
Student ID: 001826691
Masters Data Analytics (12/01/2020)
Program Mentor: Dan Estes
(385) 432-9281 (MST)
rbuch49@wgu.edu </span>

A1. Question for Analysis:¶

Which customers are at high risk of churn? And, which customer features/variables are most significant to churn?

A2. Benefit from Analysis:¶

Stakeholders in the company will benefit by knowing, with some measure of confidence, which customers are at highest risk of churn because this will provide weight for decisions in marketing improved services to customers with these characteristics and past user experiences.

A3. Data Identification:¶

Most relevant to our decision making process is the dependent variable of "Churn" which is binary categorical with only two values, "Yes" or "No". In cleaning the data, we discovered relevance of the continuous numerical data columns "Tenure" (the number of months the customer has stayed with the provider), "MonthlyCharge" (the average monthly charge to the customer) & "Bandwidth_GB_Year" (the average yearly amount of data used, in GB, per customer). Finally, the discrete numerical data from the survey responses from customers regarding various customer service features is relevant in the decision-making process. In the surveys, customers provided ordinal numerical data by rating 8 customer service factors ("timely response", "timely fixes", "timely replacements", "reliability", "options", "respectful response", "courteous exchange" & "evidence of active listening") on a scale of 1 to 8 (1 = most important, 8 = least important).

B1. Code:¶

Chi-square testing will be used.

Standard imports¶

# Standard data science imports
import numpy as np
import pandas as pd
from pandas import DataFrame

# Visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Statistics packages
import pylab
import statsmodels.api as sm
import statistics
from scipy import stats

# Import chisquare from SciPy.stats
from scipy.stats import chisquare
from scipy.stats import chi2_contingency

# Load data set into Pandas dataframe
df = pd.read_csv('Data/churn_clean.csv')

# Rename last 8 survey columns for better description of variables
df.rename(columns = {'Item1':'TimelyResponse', 
                    'Item2':'Fixes', 
                     'Item3':'Replacements', 
                     'Item4':'Reliability', 
                     'Item5':'Options', 
                     'Item6':'Respectfulness', 
                     'Item7':'Courteous', 
                     'Item8':'Listening'}, 
          inplace=True)

contingency = pd.crosstab(df['Churn'], df['TimelyResponse'])
contingency

contingency_pct = pd.crosstab(df['Churn'], df['TimelyResponse'], normalize='index')
contingency_pct

plt.figure(figsize=(12,8))
sns.heatmap(contingency, annot=True, cmap="YlGnBu")

findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans.

<matplotlib.axes._subplots.AxesSubplot at 0x113ef590ec8>

B2. Output:¶

# Chi-square test of independence
c, p, dof, expected = chi2_contingency(contingency)
print('p-value = ' + str(p))

p-value = 0.6318335816054494

B3. Justification:¶

In this analysis, we are looking at churn from a telecom company ("Did customers stay with or leave the company?"). "Churn" is a binomial, categorical dependent variable. Therefore, we will use chi-square testing as it is a non-parametric test for this "yes/no" target variable. Our other categorical variable, "TimelyResponse", is at the ordinal level.

C. Univariate Statistics:¶

Two continuous variables:

1. MonthlyCharge
2. Bandwidth_GB_Year

Two categorical (ordinal) variables:

1. Item1 (Timely response) - relabeled "TimelyResponse"
2. Item7 (Courteous exchange) - relabeled "Courteous"

df.describe()

C1. Visual of Findings:¶

# Create histograms of contiuous & categorical variables
df[['MonthlyCharge', 'Bandwidth_GB_Year', 'TimelyResponse', 'Courteous']].hist()
plt.savefig('churn_pyplot.jpg')
plt.tight_layout()

findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans.

# Create Seaborn boxplots for continuous & categorical variables
sns.boxplot('MonthlyCharge', data = df)
plt.show()

sns.boxplot('Bandwidth_GB_Year', data = df)
plt.show()

sns.boxplot('TimelyResponse', data = df)
plt.show()

sns.boxplot('Courteous', data = df)
plt.show()

D. Bivariate Statistics¶

Two continuous variables:

1. MonthlyCharge
2. Bandwidth_GB_Year

Two categorical (binomial & ordinal, respectively) variables:

1. Churn
2. Item7 (Courteous exchange) - relabeled "Courteous"

D1. Visual of Findings:¶

# Create dataframe for heatmap bivariate analysis of correlation
churn_bivariate = df[['MonthlyCharge', 'Bandwidth_GB_Year', 'TimelyResponse', 'Courteous']]

sns.heatmap(churn_bivariate.corr(), annot=True)
plt.show()

# Create a scatter plot of continuous variables MonthlyCharge & Bandwidth_GB_Year
churn_bivariate[churn_bivariate['MonthlyCharge'] < 300].sample(100).plot.scatter(x='MonthlyCharge', 
                                                                                 y='Bandwidth_GB_Year')

# Create a scatter plot of categorical variables TimelyResponse & Courteous
churn_bivariate[churn_bivariate['TimelyResponse'] < 7].sample(100).plot.scatter(x='TimelyResponse', 
                                                                                 y='Courteous')

<matplotlib.axes._subplots.AxesSubplot at 0x113f0677d08>

churn_bivariate[churn_bivariate['MonthlyCharge'] < 300].plot.hexbin(x='MonthlyCharge', y='Bandwidth_GB_Year', gridsize=15)

<matplotlib.axes._subplots.AxesSubplot at 0x113f065ccc8>

E1. Results of Analysis¶

With a p-value as large as our output from our chi-square significance testing, p-value = 0.6318335816054494, we cannot reject the null hypothesis at a standard significance level of alpha = 0.05. It is unclear given the cleaned data available whether there is a statistically significant relationship between the survey responses (essentially, "How well did we the telecom company take care of you as a customer?") & whether or not this caused customers to leave the company.

E2. Limitations of Analysis:¶

Clearly, with a p-value that is so high, p-value = 0.6318335816054494, we need to investigate further & perhaps gather more & better data. It is troubling that this dataset has been so limited in our ability to gather meaningful & actionable information.

E3. Recommended Course of Action:¶

While tests show very little correlation & perhaps no linear relations between the variables involved in timely action with regard to customer satisfaction (TimelyResponses, Fixes, Replacements & Respectfulness), we believe that these elements should be given greater emphasis and hopefully help reduce the churn rate from the large number of 27% & "increase the retention period of customers" by targeting more resources in the direction prompt customer service (Ahmad, 2019, p. 1). Again, this seems an intuitive result but now decision-makers in the company of reasonable verification of what might have been a "hunch".

F. Video¶

https://wgu.hosted.panopto.com/Panopto/Pages/Sessions/List.aspx#folderID=%2237a1d719-eece-4cea-949f-ac7201896b42%22

G. Sources for Third-Party Code¶

Kaggle. (2018, May 01). Bivariate plotting with pandas. Kaggle. https://www.kaggle.com/residentmario/bivariate-plotting-with-pandas#

Sree. (2020, October 26). Predict Customer Churn in Python. Towards Data Science. https://towardsdatascience.com/predict-customer-churn-in-python-e8cd6d3aaa7

Wikipedia. (2021, May 31). Bivariate Analysis. https://en.wikipedia.org/wiki/Bivariate_analysis#:~:text=Bivariate%20analysis%20is%20one%20of,the%20empirical%20relationship%20between%20them.&text=Like%20univariate%20analysis%2C%20bivariate%20analysis%20can%20be%20descriptive%20or%20inferential.

H. Sources¶

Ahmad, A. K., Jafar, A & Aljoumaa, K. (2019, March 20). Customer churn prediction in telecom using machine learning in big data platform. Journal of Big Data. https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0191-6

Altexsoft. (2019, March 27). Customer Churn Prediction Using Machine Learning: Main Approaches and Models. Altexsoft. https://www.altexsoft.com/blog/business/customer-churn-prediction-for-subscription-businesses-using-machine-learning-main-approaches-and-models/

Bruce, P., Bruce A. & Gedeck P. (2020). Practical Statistics for Data Scientists. O'Reilly.

Freedman, D. Pisani, R. & Purves, R. (2018). Statistics. W. W. Norton & Company, Inc.

Frohbose, F. (2020, November 24). Machine Learning Case Study: Telco Customer Churn Prediction. Towards Data Science. https://towardsdatascience.com/machine-learning-case-study-telco-customer-churn-prediction-bc4be03c9e1d

Griffiths, D. (2009). A Brain-Friendly Guide: Head First Statistics. O'Reilly.

NIH. (2020). National Library of Medicine. https://www.nlm.nih.gov/nichsr/stats_tutorial/section2/mod11_significance.html#:~:text=In%20statistical%20tests%2C%20statistical%20significance,set%20to%200.05%20(5%25).

P-Values. (2020). StatsDirect Limited. https://www.statsdirect.com/help/basics/p_values.htm

!wget -nc https://raw.githubusercontent.com/brpy/colab-pdf/master/colab_pdf.py
from colab_pdf import colab_pdf
colab_pdf('D207_Performance_Assessment.ipynb')

TimelyResponse	1	2	3	4	5	6	7
Churn
No	158	1002	2562	2473	994	146	15
Yes	66	391	886	885	365	53	4

TimelyResponse	1	2	3	4	5	6	7
Churn
No	0.021497	0.136327	0.348571	0.336463	0.135238	0.019864	0.002041
Yes	0.024906	0.147547	0.334340	0.333962	0.137736	0.020000	0.001509

	CaseOrder	Zip	Lat	Lng	Population	Children	Age	Income	Outage_sec_perweek	Email	...	MonthlyCharge	Bandwidth_GB_Year	TimelyResponse	Fixes	Replacements	Reliability	Options	Respectfulness	Courteous	Listening
count	10000.00000	10000.000000	10000.000000	10000.000000	10000.000000	10000.0000	10000.000000	10000.000000	10000.000000	10000.000000	...	10000.000000	10000.000000	10000.000000	10000.000000	10000.000000	10000.000000	10000.000000	10000.000000	10000.000000	10000.000000
mean	5000.50000	49153.319600	38.757567	-90.782536	9756.562400	2.0877	53.078400	39806.926771	10.001848	12.016000	...	172.624816	3392.341550	3.490800	3.505100	3.487000	3.497500	3.492900	3.497300	3.509500	3.495600
std	2886.89568	27532.196108	5.437389	15.156142	14432.698671	2.1472	20.698882	28199.916702	2.976019	3.025898	...	42.943094	2185.294852	1.037797	1.034641	1.027977	1.025816	1.024819	1.033586	1.028502	1.028633
min	1.00000	601.000000	17.966120	-171.688150	0.000000	0.0000	18.000000	348.670000	0.099747	1.000000	...	79.978860	155.506715	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
25%	2500.75000	26292.500000	35.341828	-97.082813	738.000000	0.0000	35.000000	19224.717500	8.018214	10.000000	...	139.979239	1236.470827	3.000000	3.000000	3.000000	3.000000	3.000000	3.000000	3.000000	3.000000
50%	5000.50000	48869.500000	39.395800	-87.918800	2910.500000	1.0000	53.000000	33170.605000	10.018560	12.000000	...	167.484700	3279.536903	3.000000	4.000000	3.000000	3.000000	3.000000	3.000000	4.000000	3.000000
75%	7500.25000	71866.500000	42.106908	-80.088745	13168.000000	3.0000	71.000000	53246.170000	11.969485	14.000000	...	200.734725	5586.141369	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000
max	10000.00000	99929.000000	70.640660	-65.667850	111850.000000	10.0000	89.000000	258900.700000	21.207230	23.000000	...	290.160419	7158.981530	7.000000	7.000000	8.000000	7.000000	7.000000	8.000000	7.000000	8.000000