This is a step-by-step guide on how to implement a Chi-Square test for A/B testing in Python using the SciPy, NumPy and Pandas libraries.
Check out this post for an introduction to A/B testing, test statistic, significance level, statistical power and p-values.
If you are already familiar with two-sample Chi-square tests, feel free to jump to Section 3 where I explain how to implement such a test in Python.
1. Two-sample Chi-square test
In this post, we will use the chi-square statistic, which is well suited for categorical data.
A two-sample chi-square test can be used to test whether two populations have the same proportions. The chi-square statistic is defined as follows [1]:

$$\chi^2 = \sum \frac{(O - E)^2}{E} \tag{1}$$

where O stands for the observed count and E for the expected count. The sum runs over groups and types of outcomes.
Once we have calculated our chi-square statistic, we can obtain a critical value: the value of the chi-square distribution (for the appropriate degrees of freedom) corresponding to the chosen significance level. If the observed chi-square statistic is higher than this critical value, we can reject the null hypothesis. This is equivalent to checking the p-value, although the p-value also tells us the probability of obtaining data at least as extreme as ours if the null hypothesis is true.
Please note:
If the expected frequencies in the contingency table are very small, unreliable results can be obtained. For such cases, you can use Fisher's exact test.
The degrees of freedom represent the number of values in the final calculation of the chi-square statistic that are free to vary (for a 2x2 contingency table, there is 1 degree of freedom).
A contingency table is used to summarize the data. It shows the number of successes and failures (for example, conversions and not conversions) for each group.
Example of a contingency table:

|   | Success | Failure |
|---|---------|---------|
| A | 124     | 524     |
| B | 145     | 503     |
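As a quick preview of what we will do in Section 3, here is a minimal sketch of how such a table could be built and tested with pandas and SciPy (the counts are the hypothetical ones from the table above):

import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table from the example above (hypothetical counts)
table = pd.DataFrame(
    {'Success': [124, 145], 'Failure': [524, 503]},
    index=['A', 'B']
)
stat, pvalue, dof, expected = chi2_contingency(table, correction=False)
print(pvalue)  # if below the chosen significance level, reject the null hypothesis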
Before continuing, let's briefly introduce some concepts that will be needed to understand my implementation of a chi-square test in Python.
Binomial distribution:
The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent experiments. Each experiment has two mutually exclusive outcomes (success/failure or yes/no). Its parameters are the number of experiments and the probability of success for a single experiment. When sampling from a binomial distribution, we obtain the number of successes.
Normal distribution:
Normal (also called Gaussian) distributions are symmetrically distributed. They are shaped as a bell and are characterized by a mean (mu) and a standard deviation (sigma).
Cumulative distribution function:
The cumulative distribution function (CDF) of a random variable X evaluated at x is the probability that X will take a value less than or equal to x.
Z-scores:
A z-score measures the distance from the mean in terms of the standard deviation.
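To make these definitions concrete, here is a minimal sketch (the numbers are illustrative only and are not part of the example that follows):

import numpy as np
from scipy.stats import norm

# Binomial distribution: number of successes in 1000 experiments with p = 0.2
successes = np.random.binomial(n=1000, p=0.2)

# Cumulative distribution function: P(X <= 1.96) for a standard normal (~0.975)
print(norm.cdf(1.96))

# Z-score: distance of x from the mean in units of the standard deviation
x, mu, sigma = 1.3, 1.0, 0.2
z = (x - mu) / sigma  # z = 1.5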
2. Introducing the case example: conversion rates
Let's imagine we have a hotel booking website and wish to study whether a given change to our website can boost our conversion rates (at the final stage of the booking process). We then decide to run an A/B test to help us determine whether to release such a change. For this example, let's set the significance level (alpha) to 0.05 and the statistical power (1-beta) to 0.8 (the statistical power will be used to define the minimum sample size in Section 3.1).
In this example, our null hypothesis states that there is no significant difference between the conversion rates with or without such a change in the website.
3. Implementing a Chi-square test in Python
Let's start by importing all the libraries and functions that we will need:
from scipy.stats import chi2, norm, chi2_contingency
import numpy as np
import pandas as pd
import math
3.1. Estimating the minimum sample size
Before running an A/B test, we need to estimate the minimum sample size required to observe a difference as large as our desired effect size (or larger) with the chosen significance level and statistical power. If the sample size of our data is below this minimum, even if we see a difference larger than the desired effect size, we might not be able to reject the null hypothesis since the difference would not be statistically significant. For the example defined above, the minimum sample size corresponds to the number of users per group.
If n is the minimum sample size per group, and p1 and p2 are respectively the conversion rates for groups A and B (assuming that p2 is p1 plus the desired effect size, and p1 is obtained using historical data), then here is the equation [2] to calculate it:

$$n = \frac{2\,\bar{p}\,(1-\bar{p})\,\left(Z_{1-\alpha/2} + Z_{1-\beta}\right)^2}{(p_2 - p_1)^2} \tag{2}$$

where $\bar{p} = (p_1 + p_2)/2$ is the pooled proportion.
Note: the above Z-score values are calculated using a Normal distribution with mean zero and standard deviation equal to 1.
Here is an example of how we can implement Equation 2 in Python:
def get_min_sample_size(
    p1,           # conversion rate for group A
    des,          # desired effect size
    alpha=0.05,   # significance level
    power=0.8     # statistical power
):
    """
    Estimate the minimum sample size for a chi-square test.
    Assumption: sigma_A = sigma_B (using the pooled probability)
    """
    # Find Z_beta (ppf = percent point function, the inverse of the cumulative distribution function)
    Z_beta = norm.ppf(power)
    # Find Z_alpha
    Z_alpha = norm.ppf(1 - alpha / 2)
    # Estimate the minimum sample size (Equation 2)
    p2 = p1 + des
    avgp = 0.5 * (p1 + p2)   # pooled proportion
    var = avgp * (1 - avgp)  # variance
    return math.ceil(2 * var * (Z_alpha + Z_beta)**2 / des**2)
Let's calculate the minimum sample size using the function defined above for p1 = 0.2, and using the values chosen for our example for alpha (0.05) and power (0.8), and let's set the desired effect size to 0.006:
min_sample_size = get_min_sample_size(
    p1=0.2,
    des=0.006,
    alpha=0.05,
    power=0.8
)
The above gives 70549, which is the number of users needed per group.
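To see where this number comes from, we can plug the values into Equation 2: Z_{1-alpha/2} ≈ 1.96, Z_{1-beta} ≈ 0.8416, p2 = 0.206 and the pooled proportion is 0.203, so n = 2 × 0.203 × 0.797 × (1.96 + 0.8416)² / 0.006² ≈ 70549.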
3.2. Generating simulated data
Let's create a function to generate data for groups A and B, each with the same sample size. First, we will set a seed to get reproducible results (so we all get the same results). Then, we will generate data using a binomial distribution. Next, we will collect the data into a DataFrame, and finally we'll build a contingency table out of the DataFrame.
Here is the function that does all the above:
def generate_data(
    sample_size,
    conversion_rate_A,  # conversion rate for group A
    conversion_rate_B   # conversion rate for group B
):
    """Generate fake data to perform a two-sample chi-square test"""
    # Set a random seed for reproducibility
    np.random.seed(42)
    # Generate data for groups A and B (1 = converted, 0 = not converted)
    group_A_converted = np.random.binomial(1, conversion_rate_A, sample_size)
    group_A_not_converted = 1 - group_A_converted
    group_B_converted = np.random.binomial(1, conversion_rate_B, sample_size)
    group_B_not_converted = 1 - group_B_converted
    # Create a DataFrame to store the data
    data = pd.DataFrame({
        'Group': ['A'] * sample_size + ['B'] * sample_size,
        'Converted': np.concatenate([group_A_converted, group_B_converted]),
        'Not Converted': np.concatenate([group_A_not_converted, group_B_not_converted])
    })
    # Create a contingency table (counts of 0s and 1s per group)
    contingency_table = pd.crosstab(data['Group'], data['Converted'])
    return contingency_table
Let's now use that function to generate data incompatible with the null hypothesis, i.e. two samples with different conversion rates. I will choose a difference large enough to be detected by our chi-square test (i.e. difference > minimum desired effect size).
data = generate_data(
    sample_size=min_sample_size,
    conversion_rate_A=0.2,
    conversion_rate_B=0.21
)
Let's now take a look at the generated data:
Converted      0      1
Group
A          56503  14046
B          55692  14857
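As a quick sanity check (my own addition, not strictly needed for the test), we can read the observed conversion rates off the contingency table; they should be close to the rates we fed into generate_data():

# Observed conversion rate per group: conversions / (conversions + non-conversions)
observed_rates = data[1] / data.sum(axis=1)
print(observed_rates)  # A ~ 0.199, B ~ 0.211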
3.3. Running a Chi-square test
3.3.1. Rejecting the null hypothesis
We will use the chi2_contingency() function from SciPy to retrieve the chi-square statistic and the p-value:
stat, pvalue, _, _ = chi2_contingency(data, correction = False)
Note:
The correction argument is set to False since none of the expected counts is smaller than 5. For more information see scipy.stats.chi2_contingency.
The last two returned values that we are not using are the degrees of freedom and expected frequencies.
The above will return a p-value of 8.813828436841988e-08, well below the cutoff of 0.05, meaning we can reject the null hypothesis. In other words, there is a statistically significant difference between the two groups of users.
If this were a real-life experiment, this result suggests it might be a good idea to roll out the A -> B change to all users (if the difference is in the positive direction). That said, we might want to further support this decision with additional studies, for example by running the experiment a second time or by looking at daily conversion rates. Since the conversion rate of a group on a given day represents a single data point, the sample size would then be the number of days. In that case the data is continuous, so a chi-square test is no longer suitable and a t-test should be used instead. I will shortly make a post showing how to implement a t-test in Python, so stay tuned!
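For a rough idea of what that could look like, here is a hedged sketch using simulated daily conversion rates (the number of days, the rates and their spread are all made up for illustration; scipy.stats.ttest_ind does the actual test):

import numpy as np
from scipy.stats import ttest_ind

# Simulated daily conversion rates for 30 days (illustrative values only)
rng = np.random.default_rng(0)
daily_rates_A = rng.normal(loc=0.20, scale=0.01, size=30)
daily_rates_B = rng.normal(loc=0.21, scale=0.01, size=30)
t_stat, p_value = ttest_ind(daily_rates_A, daily_rates_B)
print(p_value)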
Let's convince ourselves that we are doing things correctly and calculate the chi-square statistic by hand following Equation 1. Here is how we can implement it in Python:
dof = 1  # degrees of freedom (valid for 2x2 contingency tables)
# Expected proportion of conversions under the null hypothesis (pooled over both groups)
total_conversions = data[1]["A"] + data[1]["B"]
total_users = total_conversions + data[0]["A"] + data[0]["B"]
expected_proportion = total_conversions / total_users
expected_1 = expected_proportion * min_sample_size  # expected conversions per group
expected_0 = min_sample_size - expected_1           # expected non-conversions per group
obs = np.asarray(data)  # observed values
exp = np.array(
    [[expected_0, expected_1],
     [expected_0, expected_1]]
)  # expected values (identical for both groups under the null hypothesis)
terms = (obs - exp) ** 2 / exp  # individual terms of Equation 1
my_stat = terms.sum(axis=None)
print(f'chi2-statistic calculated by hand = {round(my_stat, 2)}')
Note:
The expected values are calculated assuming the null hypothesis is true, i.e. that there is no difference between groups A and B. Under this assumption, we can pool the conversions and non-conversions from both groups to calculate the expected conversion ratio.
The above code gives the following:
chi2-statistic calculated by hand = 28.62
which agrees with the chi-square statistic (stat) retrieved with the chi2_contingency() function.
Let's now calculate the critical chi-square statistic value using the percent point function, which is the inverse of the cumulative distribution function. With this, we obtain the value of the chi-square distribution (for the given degrees of freedom) corresponding to the chosen value of alpha. In other words, the critical value is chosen such that 1 minus the cumulative distribution function evaluated at it equals alpha (0.05). Here is how we can calculate the critical value in Python:
alpha = 0.05  # significance level (the value chosen in Section 2)
critical_chi2_stat = round(chi2.ppf(1 - alpha, dof), 2)  # ppf = percent point function (inverse of the cumulative distribution function)
print(f'{critical_chi2_stat = }')
Which prints the following:
critical_chi2_stat = 3.84
Since the chi-square statistic (28.62) is higher than the critical chi-square statistic (3.84), we can then reject the null hypothesis. Furthermore, we can also calculate the p-value by hand and validate the value we obtained before:
my_pvalue = chi2.sf(my_stat, dof) # 1 - cdf, where cdf = cumulative distribution function = P(X <= x) = probability that X will have a value <= x
print(f'p-value calculated by hand = {my_pvalue}')
Which prints the following (which agrees with the value obtained above):
p-value calculated by hand = 8.813828436841988e-08
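As a side note (my own cross-check, not taken from the references): for a 2x2 contingency table, the chi-square statistic is the square of the pooled two-proportion z-statistic, which gives us one more way to validate the result:

import math
from scipy.stats import norm

# Pooled two-proportion z-test built from the same contingency table
n = min_sample_size
p_A = data[1]['A'] / n
p_B = data[1]['B'] / n
p_pool = (data[1]['A'] + data[1]['B']) / (2 * n)
se = math.sqrt(p_pool * (1 - p_pool) * 2 / n)
z = (p_B - p_A) / se
print(round(z**2, 2))       # ~28.62, matches the chi-square statistic
print(2 * norm.sf(abs(z)))  # two-sided p-value, matches chi2.sf(my_stat, dof)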
3.3.2. Failing to reject the null hypothesis
Let's generate new data that presents a difference below the desired effect size and re-run the chi-square test.
data = generate_data(
    sample_size=min_sample_size,
    conversion_rate_A=0.2,
    conversion_rate_B=0.202
)
Here is what the newly generated data looks like:
Converted      0      1
Group
A          56503  14046
B          56235  14314
Let's now run the chi-square test:
stat, pvalue, _, _ = chi2_contingency(data, correction=False)
if pvalue < alpha:
    print(f"Decision: There is a significant difference between the groups (p-value = {round(pvalue, 3)}, chi2-statistic = {round(stat, 2)}).")
else:
    print(f"Decision: There is no significant difference between the groups (p-value = {round(pvalue, 3)}, chi2-statistic = {round(stat, 2)}).")
This is the result:
Decision: There is no significant difference between the groups (p-value = 0.075, chi2-statistic = 3.17).
Since the obtained p-value is larger than 0.05 (our choice for alpha), we fail to reject the null hypothesis.
3.3.3. Repository with full code
The full code can be found in the following repository: https://github.com/jbossios/two-sample-chi-square-test-in-python
Do you wish to learn how to implement a t-test in Python? Check out this post.
Do you wish to learn all the technical skills needed to perform a data analysis in Python? Check out my free Python course for data analysis: https://github.com/jbossios/python-tutorial
References
[1] Introduction to the Practice of Statistics (Sixth Edition) by Moore, McCabe and Craig
[2] Fundamentals of Biostatistics (Seventh Edition) by Bernard Rosner