Performance Assessment | D207 Exploratory Data Analysis

 Ryan L. Buchanan
 Student ID: 001826691
 Masters Data Analytics (12/01/2020)
 Program Mentor: Dan Estes
 (385) 432-9281 (MST)
 rbuch49@wgu.edu </span>

A1. Question for Analysis:

Which customers are at high risk of churn? And, which customer features/variables are most significant to churn?

A2. Benefit from Analysis:

Stakeholders in the company will benefit by knowing, with some measure of confidence, which customers are at highest risk of churn because this will provide weight for decisions in marketing improved services to customers with these characteristics and past user experiences.

A3. Data Identification:

Most relevant to our decision making process is the dependent variable of "Churn" which is binary categorical with only two values, "Yes" or "No". In cleaning the data, we discovered relevance of the continuous numerical data columns "Tenure" (the number of months the customer has stayed with the provider), "MonthlyCharge" (the average monthly charge to the customer) & "Bandwidth_GB_Year" (the average yearly amount of data used, in GB, per customer). Finally, the discrete numerical data from the survey responses from customers regarding various customer service features is relevant in the decision-making process. In the surveys, customers provided ordinal numerical data by rating 8 customer service factors ("timely response", "timely fixes", "timely replacements", "reliability", "options", "respectful response", "courteous exchange" & "evidence of active listening") on a scale of 1 to 8 (1 = most important, 8 = least important).

B1. Code:

Chi-square testing will be used.

Standard imports

In [1]:
# Standard data science imports
import numpy as np
import pandas as pd
from pandas import DataFrame

# Visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Statistics packages
import pylab
import statsmodels.api as sm
import statistics
from scipy import stats

# Import chisquare from SciPy.stats
from scipy.stats import chisquare
from scipy.stats import chi2_contingency
In [2]:
# Load data set into Pandas dataframe
df = pd.read_csv('Data/churn_clean.csv')
In [3]:
# Rename last 8 survey columns for better description of variables
df.rename(columns = {'Item1':'TimelyResponse', 
                    'Item2':'Fixes', 
                     'Item3':'Replacements', 
                     'Item4':'Reliability', 
                     'Item5':'Options', 
                     'Item6':'Respectfulness', 
                     'Item7':'Courteous', 
                     'Item8':'Listening'}, 
          inplace=True)
In [4]:
contingency = pd.crosstab(df['Churn'], df['TimelyResponse'])
contingency
Out[4]:
TimelyResponse 1 2 3 4 5 6 7
Churn
No 158 1002 2562 2473 994 146 15
Yes 66 391 886 885 365 53 4
In [5]:
contingency_pct = pd.crosstab(df['Churn'], df['TimelyResponse'], normalize='index')
contingency_pct
Out[5]:
TimelyResponse 1 2 3 4 5 6 7
Churn
No 0.021497 0.136327 0.348571 0.336463 0.135238 0.019864 0.002041
Yes 0.024906 0.147547 0.334340 0.333962 0.137736 0.020000 0.001509
In [6]:
plt.figure(figsize=(12,8))
sns.heatmap(contingency, annot=True, cmap="YlGnBu")
findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans.
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x113ef590ec8>

B2. Output:

In [7]:
# Chi-square test of independence
c, p, dof, expected = chi2_contingency(contingency)
print('p-value = ' + str(p))
p-value = 0.6318335816054494

B3. Justification:

In this analysis, we are looking at churn from a telecom company ("Did customers stay with or leave the company?"). "Churn" is a binomial, categorical dependent variable. Therefore, we will use chi-square testing as it is a non-parametric test for this "yes/no" target variable. Our other categorical variable, "TimelyResponse", is at the ordinal level.

C. Univariate Statistics:

Two continuous variables:

1. MonthlyCharge
2. Bandwidth_GB_Year

Two categorical (ordinal) variables:

1. Item1 (Timely response) - relabeled "TimelyResponse"
2. Item7 (Courteous exchange) - relabeled "Courteous" 
In [8]:
df.describe()
Out[8]:
CaseOrder Zip Lat Lng Population Children Age Income Outage_sec_perweek Email ... MonthlyCharge Bandwidth_GB_Year TimelyResponse Fixes Replacements Reliability Options Respectfulness Courteous Listening
count 10000.00000 10000.000000 10000.000000 10000.000000 10000.000000 10000.0000 10000.000000 10000.000000 10000.000000 10000.000000 ... 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000
mean 5000.50000 49153.319600 38.757567 -90.782536 9756.562400 2.0877 53.078400 39806.926771 10.001848 12.016000 ... 172.624816 3392.341550 3.490800 3.505100 3.487000 3.497500 3.492900 3.497300 3.509500 3.495600
std 2886.89568 27532.196108 5.437389 15.156142 14432.698671 2.1472 20.698882 28199.916702 2.976019 3.025898 ... 42.943094 2185.294852 1.037797 1.034641 1.027977 1.025816 1.024819 1.033586 1.028502 1.028633
min 1.00000 601.000000 17.966120 -171.688150 0.000000 0.0000 18.000000 348.670000 0.099747 1.000000 ... 79.978860 155.506715 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 2500.75000 26292.500000 35.341828 -97.082813 738.000000 0.0000 35.000000 19224.717500 8.018214 10.000000 ... 139.979239 1236.470827 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000
50% 5000.50000 48869.500000 39.395800 -87.918800 2910.500000 1.0000 53.000000 33170.605000 10.018560 12.000000 ... 167.484700 3279.536903 3.000000 4.000000 3.000000 3.000000 3.000000 3.000000 4.000000 3.000000
75% 7500.25000 71866.500000 42.106908 -80.088745 13168.000000 3.0000 71.000000 53246.170000 11.969485 14.000000 ... 200.734725 5586.141369 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000
max 10000.00000 99929.000000 70.640660 -65.667850 111850.000000 10.0000 89.000000 258900.700000 21.207230 23.000000 ... 290.160419 7158.981530 7.000000 7.000000 8.000000 7.000000 7.000000 8.000000 7.000000 8.000000

8 rows × 23 columns

C1. Visual of Findings:

In [9]:
# Create histograms of contiuous & categorical variables
df[['MonthlyCharge', 'Bandwidth_GB_Year', 'TimelyResponse', 'Courteous']].hist()
plt.savefig('churn_pyplot.jpg')
plt.tight_layout()
findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans.
In [10]:
# Create Seaborn boxplots for continuous & categorical variables
sns.boxplot('MonthlyCharge', data = df)
plt.show()
In [11]:
sns.boxplot('Bandwidth_GB_Year', data = df)
plt.show()
In [12]:
sns.boxplot('TimelyResponse', data = df)
plt.show()
In [13]:
sns.boxplot('Courteous', data = df)
plt.show()

D. Bivariate Statistics

Two continuous variables:

1. MonthlyCharge
2. Bandwidth_GB_Year

Two categorical (binomial & ordinal, respectively) variables:

1. Churn
2. Item7 (Courteous exchange) - relabeled "Courteous" 

D1. Visual of Findings:

In [14]:
# Create dataframe for heatmap bivariate analysis of correlation
churn_bivariate = df[['MonthlyCharge', 'Bandwidth_GB_Year', 'TimelyResponse', 'Courteous']]
In [15]:
sns.heatmap(churn_bivariate.corr(), annot=True)
plt.show()
In [16]:
# Create a scatter plot of continuous variables MonthlyCharge & Bandwidth_GB_Year
churn_bivariate[churn_bivariate['MonthlyCharge'] < 300].sample(100).plot.scatter(x='MonthlyCharge', 
                                                                                 y='Bandwidth_GB_Year')

# Create a scatter plot of categorical variables TimelyResponse & Courteous
churn_bivariate[churn_bivariate['TimelyResponse'] < 7].sample(100).plot.scatter(x='TimelyResponse', 
                                                                                 y='Courteous')
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x113f0677d08>
In [17]:
churn_bivariate[churn_bivariate['MonthlyCharge'] < 300].plot.hexbin(x='MonthlyCharge', y='Bandwidth_GB_Year', gridsize=15)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x113f065ccc8>

E1. Results of Analysis

With a p-value as large as our output from our chi-square significance testing, p-value = 0.6318335816054494, we cannot reject the null hypothesis at a standard significance level of alpha = 0.05. It is unclear given the cleaned data available whether there is a statistically significant relationship between the survey responses (essentially, "How well did we the telecom company take care of you as a customer?") & whether or not this caused customers to leave the company.

E2. Limitations of Analysis:

Clearly, with a p-value that is so high, p-value = 0.6318335816054494, we need to investigate further & perhaps gather more & better data. It is troubling that this dataset has been so limited in our ability to gather meaningful & actionable information.

While tests show very little correlation & perhaps no linear relations between the variables involved in timely action with regard to customer satisfaction (TimelyResponses, Fixes, Replacements & Respectfulness), we believe that these elements should be given greater emphasis and hopefully help reduce the churn rate from the large number of 27% & "increase the retention period of customers" by targeting more resources in the direction prompt customer service (Ahmad, 2019, p. 1). Again, this seems an intuitive result but now decision-makers in the company of reasonable verification of what might have been a "hunch".

F. Video

https://wgu.hosted.panopto.com/Panopto/Pages/Sessions/List.aspx#folderID=%2237a1d719-eece-4cea-949f-ac7201896b42%22

H. Sources

Ahmad, A. K., Jafar, A & Aljoumaa, K.   (2019, March 20).   Customer churn prediction in telecom using machine learning in big data platform.   Journal of Big Data. https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0191-6


Altexsoft.   (2019, March 27).   Customer Churn Prediction Using Machine Learning: Main Approaches and Models. Altexsoft. https://www.altexsoft.com/blog/business/customer-churn-prediction-for-subscription-businesses-using-machine-learning-main-approaches-and-models/


Bruce, P., Bruce A. & Gedeck P. (2020). Practical Statistics for Data Scientists. O'Reilly.


Freedman, D. Pisani, R. & Purves, R. (2018). Statistics. W. W. Norton & Company, Inc.


Frohbose, F.   (2020, November 24).   Machine Learning Case Study: Telco Customer Churn Prediction. Towards Data Science. https://towardsdatascience.com/machine-learning-case-study-telco-customer-churn-prediction-bc4be03c9e1d


Griffiths, D. (2009). A Brain-Friendly Guide: Head First Statistics. O'Reilly.


NIH. (2020). National Library of Medicine. https://www.nlm.nih.gov/nichsr/stats_tutorial/section2/mod11_significance.html#:~:text=In%20statistical%20tests%2C%20statistical%20significance,set%20to%200.05%20(5%25).


P-Values. (2020). StatsDirect Limited. https://www.statsdirect.com/help/basics/p_values.htm

In [ ]:
!wget -nc https://raw.githubusercontent.com/brpy/colab-pdf/master/colab_pdf.py
from colab_pdf import colab_pdf
colab_pdf('D207_Performance_Assessment.ipynb')