A constant challenge in product development is knowing whether a new feature actually benefits the user. One of the best ways to answer that question is to run an A/B test every time a new feature is launched: the feature is released only to a random subset of the users, and we investigate how those users perform compared to the rest of the user base.
When evaluating an A/B test, you essentially measure one or a few metrics - for example conversion rate or customer revenue - for both of your groups to see how big the difference between them is. The challenge is to understand whether the difference is due to chance or can be attributed to the new feature you're testing.
There are two main philosophies among statisticians about how to answer that question: the frequentist and the Bayesian approach. The frequentist approach is arguably the most widely known - and applied - and it provides an estimate of how big the observed difference between the two groups is, together with a confidence interval. The confidence interval is constructed in such a way that if you were to repeat the experiment many times, the "true" difference between the two groups would fall within the interval in a specified fraction of those repetitions. That fraction is the confidence level (95% is commonly used): if you were to repeat the same experiment 10,000 times, the true difference between the two groups would fall within the confidence interval in roughly 9,500 of those experiments. Many consider this a strange definition, since the experiment is only run once but the interval is constructed under the assumption that it could be repeated indefinitely. On top of that, confidence intervals are often misinterpreted - a typical misreading goes: "There is a 95% probability that the difference between the two groups is within this confidence interval", which is not true.
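To make the frequentist procedure concrete, here is a minimal sketch of how a confidence interval for the difference between two conversion rates is typically computed, using the standard normal approximation for two proportions (the counts and sample sizes are made up for illustration):

```python
import numpy as np
from scipy import stats

def conversion_diff_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Normal-approximation confidence interval for the difference in
    conversion rate between group B and group A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Standard error of the difference between two independent proportions.
    se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = stats.norm.ppf(1 - (1 - confidence) / 2)  # ~1.96 for 95%
    return diff, (diff - z * se, diff + z * se)

# Made-up example: 400/1000 conversions in A, 450/1000 in B.
diff, (lo, hi) = conversion_diff_ci(400, 1000, 450, 1000)
```

Here the interval excludes zero, which is what a frequentist analysis would call a significant difference at the 95% confidence level.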
Despite these drawbacks, the frequentist approach has many advantages as long as it is used correctly. It is robust against biases and can fairly easily be applied to data with many different distributions, and at iZettle we have been using frequentist statistics when evaluating the A/B tests we run. However, since we're a curious company that always wants to try new things, we couldn't keep ourselves from learning more about Bayesian statistics as well! Hence, during the latest Hack-week (essentially, a full-time week to try out cool new stuff that might in some way benefit iZettle), some of us investigated how we could use the pymc3 library in Python to evaluate A/B tests using Bayesian statistics.
While frequentist statistics assumes that there is a "true" value of the investigated metric, Bayesian statistics treats the metric as a random variable with a certain probability density function. In other words, where a frequentist analysis gives you a confidence interval that contains the true parameter in, say, 95% of repeated experiments, a Bayesian analysis leaves you with a probability density function - the posterior probability density function - which tells you which values are most probable given the data you have collected during the experiment. From this posterior density function you can construct a credible interval - often computed as the highest posterior density (HPD) interval - which tells you the probability that the metric of your population lies within the interval, given the experiment data. In other words, the credible interval means what most people - erroneously - think a frequentist confidence interval means!
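In practice one usually obtains the posterior by sampling, for example with pymc3, but for a binomial metric with a Beta prior the posterior happens to be available in closed form, which makes for a compact illustration. A minimal sketch with made-up numbers (using an equal-tailed credible interval, which is close to but not identical to a true HPD interval for skewed posteriors):

```python
from scipy import stats

# Made-up data: 450 conversions out of 1000 users.
conversions, n = 450, 1000

# Beta(1, 1) is a flat prior; Beta-Binomial conjugacy then gives the
# posterior for the conversion rate directly.
posterior = stats.beta(1 + conversions, 1 + n - conversions)

# 95% equal-tailed credible interval: there is a 95% posterior
# probability that the conversion rate lies between lo and hi.
lo, hi = posterior.ppf(0.025), posterior.ppf(0.975)
```

Unlike the frequentist interval, this statement really is a probability statement about the parameter, conditional on the observed data and the chosen prior.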
To do a Bayesian evaluation of an experiment you are required to specify what prior assumptions you have about the distribution of the metric you're investigating. This is, however, a double-edged sword, as the result depends on which prior you choose. Consequently, two different analysts will probably choose different priors and get different posteriors. This introduces some bias into the analysis (even though the impact of the bias decreases as the sample size of your experiment increases), but it can also be an advantage: if you have good prior knowledge about the metric you're investigating, you can leverage it to get more reliable results. Say, for example, that we know that the conversion rate is around 40% (a made-up number). We can then specify a prior distribution centred around 40%, which allows us to get a meaningful posterior distribution using less data.
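The effect of such an informative prior is easy to demonstrate. In this sketch the prior strength (the number of pseudo-observations, here 100) is an assumption chosen for illustration; with the same small data set, the informative prior centred on 40% yields a noticeably narrower credible interval than a flat prior:

```python
from scipy import stats

# Informative Beta prior centred on 40%: mean = a / (a + b) = 0.4.
# The strength a + b = 100 acts like 100 pseudo-observations (assumption).
a, b = 40, 60

# A small made-up experiment: 30 conversions out of 60 users.
conversions, n = 30, 60

flat_post = stats.beta(1 + conversions, 1 + n - conversions)
informed_post = stats.beta(a + conversions, b + n - conversions)

# Width of the 95% equal-tailed credible interval under each prior.
flat_width = flat_post.ppf(0.975) - flat_post.ppf(0.025)
informed_width = informed_post.ppf(0.975) - informed_post.ppf(0.025)
```

The price, of course, is that the informed posterior is pulled towards the prior mean of 40%, which is exactly the bias-versus-reliability trade-off described above.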
During the Hack-week we built a handy toolkit for evaluating two common types of A/B tests that we run at iZettle, where the metrics are either binomially or exponentially distributed. The binomial case arises whenever we experiment with metrics that have a true/false outcome with some probability, and the exponential case arises in many situations where we measure continuous variables. The toolkit, which is available here, only requires two data sets and some information about the prior distribution. It then plots the outcome of the experiment together with a 95% credible interval.
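The toolkit itself does the heavy lifting with pymc3, but the exponential case can also be sketched in closed form: the rate parameter of an exponential likelihood has a conjugate Gamma prior. The following illustration uses simulated data with made-up true rates, and a Monte Carlo draw from the two posteriors to estimate the probability that one group's rate exceeds the other's:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated continuous metric (e.g. time between purchases),
# exponentially distributed with made-up true rates 0.5 (A) and 0.6 (B).
data_a = rng.exponential(scale=1 / 0.5, size=2000)
data_b = rng.exponential(scale=1 / 0.6, size=2000)

def rate_posterior(data, prior_shape=1.0, prior_rate=1.0):
    """Gamma posterior for the rate of an exponential likelihood:
    Gamma(prior_shape + n, prior_rate + sum(data))."""
    return stats.gamma(a=prior_shape + len(data),
                       scale=1 / (prior_rate + data.sum()))

post_a = rate_posterior(data_a)
post_b = rate_posterior(data_b)

# Monte Carlo estimate of P(rate_B > rate_A) from posterior samples -
# the kind of direct probability statement a Bayesian evaluation allows.
n_samples = 10_000
prob_b_higher = np.mean(post_b.rvs(n_samples, random_state=1)
                        > post_a.rvs(n_samples, random_state=2))
```

With two thousand observations per group, the posteriors separate cleanly and the probability that B's rate is higher approaches one.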
There is a lot of theory to learn about Bayesian statistics before you can start applying it to business decisions, and if you're looking for a good start, you can check out the book Doing Bayesian Data Analysis by John K. Kruschke, or watch some of the great lectures found on YouTube, for example.
Does this sound like fun? Then join us, we are hiring!