How to conduct A/B Testing?

The idea of A/B testing is to present different variants of content to different user groups, gather their reactions and behaviour, and use the results to inform future product or marketing strategy.

A/B testing is a methodology for comparing multiple versions of something (a feature, page, button, headline, page structure, form, landing page, navigation, pricing, and so on) by showing the different versions to customers or prospective customers and assessing the quality of interaction with some metric (click-through rate, purchases, following a call to action, etc.).

This is becoming increasingly important in a data-driven world where business decisions need to be backed by facts and numbers.

How to conduct a standard A/B test

  1. Formulate your Hypothesis
  2. Deciding on Splitting and Evaluation Metrics
  3. Create your Control group and Test group
  4. Length of the A/B Test
  5. Conduct the Test
  6. Draw Conclusions

1. Formulate your hypothesis

Before conducting an A/B test, you want to state your null hypothesis and alternative hypothesis:

The null hypothesis states that there is no difference between the control and variant groups. The alternative hypothesis states that there is a difference between the control and variant groups.

Imagine a software company that is looking for ways to increase the number of people who pay for its software. As the software is currently set up, users can download and use it free of charge for a 7-day trial. The company wants to change the homepage layout, using a red logo instead of a blue one, to emphasise that a 7-day trial of the software is available.

Here is an example of a hypothesis test:
Default action: Keep the blue logo.
Alternative action: Switch to the red logo.
Null hypothesis: The red logo does not cause at least 10% more license purchases than the blue logo.
Alternative hypothesis: The red logo does cause at least 10% more license purchases than the blue logo.

It’s important to note that all other variables need to be held constant when performing an A/B test.

2. Deciding on Splitting and Evaluation Metrics

We should consider two things: where and how we should split users into experiment groups when entering the website, and what metrics we will use to track the success or failure of the experimental manipulation. The choice of unit of diversion (the point at which we divide observations into groups) may affect what evaluation metrics we can use.

The control, or ‘A’ group, will see the old homepage, while the experimental, or ‘B’ group, will see the new homepage that emphasises the 7-day trial.

There are three common ways of splitting users (units of diversion):

a) Event-based diversion
b) Cookie-based diversion
c) Account-based diversion

Event-based diversion (like a pageview) can provide many observations to draw conclusions from, but if the condition changes on each pageview, a visitor might get a different experience on each homepage visit. Event-based diversion is better suited to changes that aren't easily visible to users, so the experience isn't disrupted.

In addition, event-based diversion would let us know how many times the download page was accessed from each condition, but can’t go any further in tracking how many actual downloads were generated from each condition.

Account-based diversion can be stable, but it is not suitable in this case. Since visitors only register after reaching the download page, that would be too late to introduce the new homepage to people who should be assigned to the experimental condition.

So this leaves cookie-based diversion, which feels like the right choice. Cookies also allow us to track each visitor hitting each page. The downside of cookie-based diversion is that counts become inconsistent if users enter the site via an incognito window or a different browser, or if their cookies expire or are deleted before they download. As a simplification, however, we'll assume that this kind of assignment dilution is small and ignore its potential effects.
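As an illustration, here is a minimal sketch of cookie-based diversion, assuming each visitor carries a persistent cookie ID. The experiment name, cookie ID, and 50/50 split are assumptions for the example.

```python
# Sketch of cookie-based diversion: hash the visitor's cookie ID so the same
# visitor always lands in the same group (names and IDs are illustrative).
import hashlib

def assign_group(cookie_id: str, experiment: str = "homepage_red_logo") -> str:
    """Deterministically map a cookie ID to 'control' or 'experiment'."""
    digest = hashlib.sha256(f"{experiment}:{cookie_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100                      # bucket in 0..99
    return "experiment" if bucket < 50 else "control"   # 50/50 split

print(assign_group("a1b2c3d4-example-cookie"))  # same cookie always gets the same group
```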

In terms of evaluation metrics, we should use rates relative to the number of cookies: the download rate (# downloads / # cookies) and the license purchase rate (# licenses / # cookies).
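As a sketch, these two evaluation metrics could be computed from an event log like the one below, assuming one row per cookie with flags for download and purchase (the column names and values are made up).

```python
# Compute download rate and purchase rate per group from a toy event log.
import pandas as pd

events = pd.DataFrame({
    "cookie_id":  ["c1", "c2", "c3", "c4", "c5", "c6"],
    "group":      ["control", "control", "control",
                   "experiment", "experiment", "experiment"],
    "downloaded": [1, 0, 1, 1, 1, 0],
    "purchased":  [0, 0, 1, 1, 0, 0],
})

metrics = events.groupby("group").agg(
    cookies=("cookie_id", "nunique"),
    download_rate=("downloaded", "mean"),   # downloads / cookies
    purchase_rate=("purchased", "mean"),    # licenses / cookies
)
print(metrics)
```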

Product usage statistics like the average time the software was used in the trial period are potentially interesting features, but aren’t directly related to our experiment. Certainly, these statistics might help us dig deeper into the reasons for observed effects after an experiment is complete. But in terms of experiment success, product usage shouldn’t be considered as an evaluation metric.

3. Create your control group and test group

Once you have determined your null and alternative hypotheses, the next step is to create your control and test (variant) groups. There are two important concepts to consider in this step: sampling and sample size.

Sampling
Random sampling is one of the most common sampling techniques. Each member of the population has an equal chance of being chosen. Random sampling is important in hypothesis testing because it eliminates sampling bias, and eliminating bias matters because you want the results of your A/B test to be representative of the entire population rather than of the sample itself.

A problem with A/B tests is that if you haven't defined your target group properly, or you're in the early stages of your product, you may not know a lot about your customers. If you're not sure who they are (try creating some user personas to get started!), you might end up with misleading results. It's important to understand which sampling method suits your use case.
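For illustration, here is a tiny sketch of simple random sampling and a random 50/50 split with NumPy; the visitor IDs and sizes are made up for the example.

```python
# Simple random sample from a "population" of visitor IDs, then a 50/50 split.
import numpy as np

rng = np.random.default_rng(seed=42)
visitor_ids = np.arange(10_000)                              # the population
sample = rng.choice(visitor_ids, size=2_000, replace=False)  # each visitor equally likely

shuffled = rng.permutation(sample)
control, variant = shuffled[:1_000], shuffled[1_000:]        # random 50/50 assignment
print(len(control), len(variant))
```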

Sample Size
It’s essential that you determine the minimum sample size for your A/B test prior to conducting it so that you can eliminate under coverage bias, bias from sampling too few observations.

4. Length of the A/B test

A calculator like this one can help you determine how long you need to run your A/B test to reach statistical significance.

Historical data show about 3250 unique visitors per day, about 520 software downloads per day (a 0.16 rate), and about 65 licenses purchased each day (a 0.02 rate). Ideally, both the download rate and the license purchase rate would increase with the new homepage; a statistically significant negative change in either should be a sign not to deploy the change. However, if only one of the metrics shows a statistically significant positive change, we should be happy enough to deploy the new homepage.

Performing both individual tests at a .05 error rate would risk making too many Type I errors, so we apply the Bonferroni correction and run each test at a .025 error rate. With an overall 5% Type I error rate and 80% power, we need about 6 days to reliably detect an increase of 50 downloads per day and about 21 days to detect an increase of 10 license purchases per day.

Use the link above for the test-day calculations.

Download rate test:
Estimated existing conversion rate: 16%
Minimum improvement in conversion rate to detect: 50/520 * 100 ≈ 9.6%
Number of variations/combinations (including control): 2
Average number of daily visitors: 3250
Percent of visitors included in the test: 100%
Total number of days to run the test: 6 days

License purchase rate test:
Estimated existing conversion rate: 2%
Minimum improvement in conversion rate to detect: 10/65 * 100 ≈ 15.4%
Number of variations/combinations (including control): 2
Average number of daily visitors: 3250
Percent of visitors included in the test: 100%
Total number of days to run the test: 21 days
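For reference, a rough version of the same duration estimate can be scripted with a power analysis, as sketched below. Calculators differ in their exact assumptions (one- vs two-sided tests, corrections), so the day counts this prints may not match the linked calculator's output exactly.

```python
# Rough duration estimate from a two-proportion power analysis (statsmodels).
import math
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

DAILY_VISITORS = 3250
ALPHA = 0.025   # per-test error rate after the Bonferroni correction
POWER = 0.80

def days_needed(base_rate, target_rate):
    """Days of traffic needed to detect base_rate -> target_rate with a 50/50 split."""
    effect = proportion_effectsize(target_rate, base_rate)   # Cohen's h
    n_per_group = NormalIndPower().solve_power(
        effect_size=effect, alpha=ALPHA, power=POWER, alternative="larger")
    return math.ceil(2 * n_per_group / DAILY_VISITORS)

# Download rate: 520/3250 today, aiming to detect +50 downloads per day
print(days_needed(520 / 3250, 570 / 3250))
# License purchase rate: 65/3250 today, aiming to detect +10 purchases per day
print(days_needed(65 / 3250, 75 / 3250))
```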