I’m neither an employee nor paid-promoter of Optimizely. The views expressed are my own.
Optimizely is a popular testing platform that makes understanding the results of A/B tests easier for anyone who isn’t well-versed in statistical testing methodologies (and even for those who are!). In the work below, we will intentionally leave out statistical theory and attempt to develop an intuitive sense of what is described in Optimizely’s “Always Valid Inference” and “Peeking at A/B Tests” papers.
One of the problems plaguing the analysis of A/B tests is what Optimizely describes as “peeking”. To better understand what “peeking” is, it helps to first understand how to properly run a test. We will focus on the case of testing whether there is a difference between the conversion rates cr_a and cr_b for groups A and B. We define conversion rate as the total number of conversions in a group divided by the total number of subjects in that group.
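As a quick sketch of that definition (the conversion vector here is made up for illustration), a conversion rate is simply the mean of a vector of 0/1 outcomes:

```r
# hypothetical outcomes for 8 subjects: 1 = converted, 0 = did not convert
conversions <- c(1, 0, 0, 1, 0, 1, 0, 0)

# conversion rate = total conversions / total subjects
cr <- sum(conversions) / length(conversions)
cr # 0.375, equivalently mean(conversions)
```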
In order to run a test correctly, one should first calculate the required sample size by doing a power calculation. This is easily done in R using the pwr library and requires a few parameters: the desired significance level (the false positive rate), the desired statistical power (1 - false negative rate), the minimum detectable effect, and the baseline conversion rate cr_a.
library(pwr)

mde   <- 0.1   # minimum detectable effect (relative lift)
cr_a  <- 0.25  # baseline conversion rate
alpha <- 0.05  # significance level (false positive rate)
power <- 0.80  # statistical power (1 - false negative rate)

ptpt <- pwr.2p.test(h = ES.h(p1 = cr_a, p2 = (1 + mde) * cr_a),
                    sig.level = alpha,
                    power = power)
n_obs <- ceiling(ptpt$n)
This result tells us that we need to observe about 2,430 subjects in each of the A and B test groups (pwr.2p.test reports the sample size per group, so roughly 4,860 subjects in total). Once we have observed that many subjects, we can test whether there is a statistically significant difference between the two sets of observations via a t-test.
Given the parameters we included in our power calculation, there are two things to be aware of:
- There is a 5% chance that our t-test will report a statistically significant difference when in fact there is none. That is a result of our alpha parameter, which sets the false positive rate to 5%.
- There is a 20% chance that the t-test will report no difference when there actually is one. This is the false negative rate, or 1 - power.
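To make the power calculation less of a black box, here is a sketch of the same computation done by hand (this is an aside, not part of the analysis that follows). pwr.2p.test works with Cohen’s effect size h, the difference of arcsine-transformed proportions, and the per-group sample size for a two-sided test is (z_{1-alpha/2} + z_{power})^2 / h^2:

```r
cr_a  <- 0.25
mde   <- 0.1
alpha <- 0.05
power <- 0.80

# Cohen's effect size h, the same quantity pwr::ES.h() computes
h <- 2 * asin(sqrt((1 + mde) * cr_a)) - 2 * asin(sqrt(cr_a))

# per-group sample size for a two-sided test at the given alpha and power
n <- (qnorm(1 - alpha / 2) + qnorm(power))^2 / h^2
ceiling(n)
```

The result should agree with ceiling(pwr.2p.test(...)$n).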
Illustrating alpha and beta parameters
Let’s try to illustrate this through a simulation. We’ll simulate two sequences of conversions with conversion rates such that cr_b = (1+mde)*cr_a. We’ll then run a t-test comparing all of the available observations of the two groups each time we have a new pair of observations. If the p-value of the t-test is below 5%, we will reject the null hypothesis that there is no difference between the distributions; hence, p.value < 0.05 implies a statistically significant difference between the conversion rates. Finally, we’ll plot 1 - p.value to represent “confidence” as a function of where we are in the sequence. This essentially simulates the p-values we would calculate if we were to continually monitor conversions as subjects choose to convert or abandon. We add a horizontal line at 95% confidence, as well as a vertical line at n_obs, the number of observations at which our power calculation says to conduct the t-test.
library(ggplot2)

set.seed(2)

# simulate a "true" effect equal to the mde
effect <- mde
cr_b <- (1 + effect) * cr_a
observations <- 2 * n_obs

# a sequence of {0,1} conversions for each group
conversions_a <- rbinom(observations, 1, cr_a)
conversions_b <- rbinom(observations, 1, cr_b)

# calculate the p-value at each simultaneous observation of the A and B groups
tt <- sapply(10:observations, function(x) {
  t.test(conversions_a[1:x], conversions_b[1:x])$p.value
})
tt <- data.frame(p.value = tt)

# reference lines for the plot
conf_95 <- data.frame(x = c(-Inf, Inf), y = 0.95)
obs_limit_line <- data.frame(x = n_obs, y = c(-Inf, Inf))

# plot the evolution of the p-value over time, as if "peeking"
ggplot(tt, aes(x = seq_along(p.value), y = 1 - p.value)) +
  geom_line() +
  geom_line(data = conf_95, aes(x, y, color = "alpha=5%"), linetype = 3) +
  geom_line(data = obs_limit_line, aes(x, y, color = "end of test"), linetype = 4) +
  xlab("Observation (day)") +
  scale_color_discrete(name = "Legend") +
  ylim(c(0, 1))
We observe that in the above example we would correctly have measured a difference in conversion rates for this particular simulation. However, if we ran this experiment 100 times, we should expect about 20 of those runs to incorrectly report no difference, because our false negative rate is 20% (power = 80%).
To test this, and many other concepts in this article, we are going to create a utility function that will run a simulation repeatedly:
#
# monte_carlo runs n_simulations and calls the callback function each time with the ... optional args
#
monte_carlo <- function(n_simulations, callback, ...) {
  sapply(1:n_simulations, function(x) {
    callback(...)
  })
}
Now, we’ll use the monte_carlo utility function to run 1000 experiments, measuring whether the p.value is less than alpha after n_obs observations. If it is, we reject the null hypothesis. We expect about 800 rejections and about 200 non-rejections, since 200/1000 would reflect our expected 20% false negative rate.
reject_at_i <- function(observations, i) {
  conversions_a <- rbinom(observations, 1, cr_a)
  conversions_b <- rbinom(observations, 1, cr_b)
  t.test(conversions_a[1:i], conversions_b[1:i])$p.value < alpha
}
# run the simulation
rejected.H0 <- monte_carlo(1000,
                           callback = reject_at_i,
                           observations = n_obs,
                           i = n_obs)
# output the rejection table
table(rejected.H0)
## rejected.H0
## FALSE TRUE
## 190 810
We will use the same functions to test the false positive rate. In this case, we set both conversion rates to the same value, cr_a, and confirm that out of 1000 experiments, about 50 reject the null hypothesis (i.e. incorrectly predict a difference in the conversion rates).
reject_at_i <- function(observations, i) {
  conversions_a <- rbinom(observations, 1, cr_a)
  conversions_b <- rbinom(observations, 1, cr_a) # now the same conversion rate as group A
  t.test(conversions_a[1:i], conversions_b[1:i])$p.value < alpha
}
# run the simulation
rejected.H0 <- monte_carlo(1000,
                           callback = reject_at_i,
                           observations = n_obs,
                           i = n_obs)
# output the rejection table
table(rejected.H0)
## rejected.H0
## FALSE TRUE
## 962 38
Indeed, the results are as expected. We have shown that if we measure the results of our experiment only when our power calculation tells us to, we can expect the false positive and false negative rates to reflect the values we set via alpha and power.
Peeking
What happens if we don’t do a power calculation and instead monitor the conversions as they come in? This is what Optimizely calls “peeking”. In other words, to run a test correctly you should observe the results at exactly one moment: when the power calculation has told you the test is complete. If you look at the results at any moment before then, you are peeking.
Peeking is widespread when it comes to analyzing experiments. In fact, some popular testing services peek continuously and send a notification as soon as a p-value drops below alpha. What you may notice, however, is that p-values can fluctuate around alpha multiple times before “choosing” a side. Optimizely claims that continuously monitoring p-values can inflate false positive rates to 30% or more. Let’s see what effect continuous peeking has on our false positive rate:
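A back-of-the-envelope way to see why this happens (an approximation, since successive looks reuse earlier observations and are therefore not independent): if each of k looks were an independent test at level alpha, the chance of at least one false positive across the k looks would be 1 - (1 - alpha)^k:

```r
alpha <- 0.05
k <- c(1, 5, 10, 20)

# probability of at least one false positive across k independent looks
p_any <- 1 - (1 - alpha)^k
round(p_any, 2) # 0.05 0.23 0.40 0.64
```

With even 10 such looks, the nominal 5% rate has already ballooned to roughly 40%. The real inflation per look is smaller because the successive tests are correlated, but it still grows as peeking continues.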
peeking_method <- function(observations, by = 1) {
  # create the conversions; both groups share the same rate, so there is no true effect
  conversions_a <- rbinom(observations, 1, cr_a)
  conversions_b <- rbinom(observations, 1, cr_a)
  reject <- FALSE
  # peek every `by` observations, stopping the first time we reject the null hypothesis
  for (i in seq(from = by, to = observations, by = by)) {
    tryCatch({
      reject <- t.test(conversions_a[1:i], conversions_b[1:i])$p.value < alpha
      if (reject) {
        break
      }
    }, error = function(e) {
      print(e)
    })
  }
  reject
}
# run the simulation, peeking every 100 observations
rejected.H0 <- monte_carlo(1000,
                           callback = peeking_method,
                           observations = n_obs,
                           by = 100)
# output the rejection table
table(rejected.H0)
## rejected.H0
## FALSE TRUE
## 653 347
Indeed, even peeking only once every 100 observations inflates the false positive rate to roughly 35%, far above the 5% we asked for! Peeking at test results to make decisions, and especially automating peeking to make decisions, is a big no-no.
In a subsequent article, we’ll explore what Optimizely proposes to resolve the peeking issue.