Type 1 and Type 2 Errors
A/B testing results can be inconclusive. But did you know they can also be wrong? Design flaws can lead to inaccurate results – false positives or false negatives, also known as Type 1 and Type 2 errors. These errors can lead your optimization efforts astray, but you can prevent them with close attention to how you formulate your hypothesis, choose your audience, and set your sample size.
What is a Type 1 Error?
When you set the parameters of an eCommerce test, the null hypothesis states that there’s no statistically significant difference between the control and variable versions of the web experience you’re testing. By contrast, the alternative hypothesis (often just called the hypothesis) predicts that there will be a difference and that the null hypothesis should be rejected.
When your test produces a Type 1 error, the results indicate that the null hypothesis is not true, even when it is. Put another way, a Type 1 error is a “false positive” indicating that the variable version produced significantly different results than the control, when in fact it didn’t.
What is a Type 2 Error?
A Type 2 error is the opposite of a Type 1 error. When your test produces a Type 2 error, the results fail to reject the null hypothesis even though it is false and you should actually reject it. That is, the test indicates that your proposed changes have no effect on performance when they actually do. A Type 2 error is also called a “false negative.”
Type 1 Vs. Type 2 Errors
Both Type 1 and Type 2 errors describe inaccuracies in test results, and both relate to the null hypothesis. A test with a Type 1 error produces a “false positive,” while a test with a Type 2 error results in a “false negative.” The table below summarizes the differences:
| | Reality: Null hypothesis is true | Reality: Null hypothesis is false |
| --- | --- | --- |
| Test result: Null hypothesis is not rejected | ✅ Correct result | ❌ Type 2 error |
| Test result: Null hypothesis is rejected | ❌ Type 1 error | ✅ Correct result |
As an example, let’s say you decide to test whether highlighting the availability of free shipping on the product detail page improves the “add to cart” rate.
- The null hypothesis predicts the add to cart rate won’t be significantly different between the original version of the product detail page and the variable, which adds the free shipping offer.
- The hypothesis predicts the version of the page with the free shipping offer will produce statistically different results than the original.
After running the test, the results indicate that there is a significant improvement in the add-to-cart rate for the page with the added free shipping offer. But after implementing the change across the site, the add-to-cart rate doesn’t improve as significantly as in the test – or, worse, it drops. Your test produced a false positive, or Type 1 error.
On the other hand, if the results show that there’s no difference between the two page designs, but in fact the addition of the free shipping offer would have boosted the add-to-cart rate if implemented, the test produced a false negative, or Type 2 error.
In this example, the Type 1 error is problematic, because the faulty test has led to widespread implementation of a change that may in fact perform worse than the original version. The Type 2 error represents a missed opportunity; while the fallout is potentially less drastic than the Type 1 error, failure to implement a change that could improve performance means leaving money on the table.
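To make the decision step in this example concrete, here is a minimal sketch of one common way to compare two add-to-cart rates: a two-sided, two-proportion z-test. The visitor and conversion counts are hypothetical, and your testing tool may use a different statistical method.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both versions share one underlying rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 10,000 visitors per version.
# Control: 1,000 add-to-cart clicks. Free shipping version: 1,100 clicks.
z, p_value = two_proportion_z_test(conv_a=1000, n_a=10_000, conv_b=1100, n_b=10_000)

alpha = 0.05  # significance level for a 95% confidence level
reject_null = p_value < alpha
print(f"z = {z:.2f}, p-value = {p_value:.3f}, reject null: {reject_null}")
```

Rejecting the null hypothesis here does not rule out a Type 1 error, and failing to reject it would not rule out a Type 2 error. The test only controls how often each mistake happens.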
How to Calculate Type 1 and Type 2 Errors?
In a statistics class, calculating Type 1 and Type 2 errors involves long and detailed formulas. While you don’t need to be able to grind through the equations in order to design an A/B test, it’s helpful to understand what impacts the probability of producing each type of error and how you can avoid the pitfalls.
How to Calculate the Probability of a Type 1 Error
The probability of making a Type 1 error is related to the confidence level you set at the outset of your test.
If your goal is to reach the standard confidence level of 95%, then there is a 5% probability that the test reports a significant difference when none actually exists. That 5% is called the significance level. The significance level and the Type 1 error probability are the same – in this case 5%. We can express this relationship in a formula as
𝝰 = 1 – C
Where C is the confidence level and 𝝰 is the significance level, or the probability of a Type 1 error.
For example, if you want to test whether changing the color of your “add to cart” button from red to green impacts conversions, and you set a confidence level of 95% for the test, then the risk of a Type 1 error is 5%.
If the test concludes and you discover that there is a significant improvement in conversions with the green button, the result could still be a false positive: when there is truly no difference between the two versions, a test run at this confidence level will report a significant one about 5% of the time. If you implement the change to green “add to cart” buttons sitewide and discover there’s no marked difference in conversion after all, then despite your best efforts, your test suffered from a Type 1 error.
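One way to see the 5% figure in action is to simulate many tests in which both versions are truly identical and count how often a significant result appears anyway. The sketch below assumes a two-proportion z-test, a hypothetical 10% conversion rate, and 500 visitors per version. All of these numbers are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_error == 0:
        return 1.0  # no conversions at all (or all converted): no evidence
    z = (conv_b / n_b - conv_a / n_a) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_conversions(rng, visitors, rate):
    """Count how many simulated visitors convert at the given true rate."""
    return sum(rng.random() < rate for _ in range(visitors))


def false_positive_rate(alpha, rate=0.10, visitors=500, runs=2000, seed=1):
    """Share of identical-version tests that still look significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(runs):
        red = simulate_conversions(rng, visitors, rate)
        green = simulate_conversions(rng, visitors, rate)  # same true rate
        if p_value(red, visitors, green, visitors) < alpha:
            false_positives += 1
    return false_positives / runs


confidence = 0.95
alpha = 1 - confidence  # significance level, the Type 1 error probability
observed = false_positive_rate(alpha)
print(f"alpha = {alpha:.2f}, simulated false positive rate = {observed:.3f}")
```

The simulated rate should land close to the significance level, with some run-to-run noise.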
How to Calculate the Probability of a Type 2 Error
It’s a little trickier to calculate the probability of making a Type 2 error, which occurs when you conclude there is no difference between the control and test version, but in fact a difference does exist.
Type 2 errors are dependent on the power of the test – the probability that the test correctly rejects the null hypothesis when a real difference exists. The power of a test is affected by:
- The sample size – the larger the sample size, the higher the power of your test
- The variability of the sample – the larger the variability of your sample, the lower the power of the test.
- The significance level – a higher significance level raises the power of the test.
- The size of the potential effect – Increasing the predicted performance difference between the control and test versions makes for a higher power.
𝝱 = 1 – P
Where P is the power of the test and 𝝱 is the probability of a Type 2 error.
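The relationship between power and the factors listed above can be sketched with a standard normal approximation for a two-sided, two-proportion test. The function name, the 10% baseline rate, and the sample sizes below are assumptions chosen for illustration, not values from a real test.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control, p_variant, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    p_bar = (p_control + p_variant) / 2
    # Standard error if the null hypothesis is true (one shared rate).
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    # Standard error if the two rates really differ.
    se_alt = sqrt(
        (p_control * (1 - p_control) + p_variant * (1 - p_variant)) / n_per_group
    )
    effect = abs(p_variant - p_control)
    # Ignores the negligible chance of rejecting in the wrong direction.
    return normal.cdf((effect - z_crit * se_null) / se_alt)


power = power_two_proportions(0.10, 0.11, n_per_group=10_000)
beta = 1 - power  # probability of a Type 2 error
print(f"power = {power:.2f}, beta = {beta:.2f}")

# Each factor moved on its own, everything else held fixed:
more_visitors = power_two_proportions(0.10, 0.11, n_per_group=20_000)
higher_alpha = power_two_proportions(0.10, 0.11, n_per_group=10_000, alpha=0.10)
bigger_effect = power_two_proportions(0.10, 0.12, n_per_group=10_000)
print(f"{more_visitors:.2f} {higher_alpha:.2f} {bigger_effect:.2f}")
```

In this sketch, a larger sample, a higher significance level, and a larger effect each raise the power, which lowers 𝝱.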
For example, if your goal is to increase “add to cart” clicks from the search results page, and you include all site visitors in the test, your potential sample size is large, but the power of your test may be higher if you target only those who’ve previously used the search tool to sort and filter results. Assuming that sample size is still large enough to produce results with a high confidence level, the targeted test would have a higher power, making the likelihood of a Type 2 error lower. You’re less likely to produce results erroneously concluding the test version has no impact on performance, because using a targeted audience reduces the likelihood of outlier behavior.
How to Reduce Type 1 and Type 2 Errors?
As the calculations above show, there’s a relationship between Type 1 and Type 2 error probability.
- If the significance level of a test is low, so is the probability of a Type 1 error. Setting a higher confidence threshold at the outset of the test lowers the significance and the probability of a Type 1 error.
- The probability of a Type 2 error is related to the power of the test; the higher the power, the lower the probability of a Type 2 error. One way to increase the power is to raise the significance level.
In short, lowering the significance level can prevent Type 1 errors, while raising the significance level can prevent Type 2 errors. To determine the significance level for your test, evaluate the risk of committing each type of error. If you erroneously conclude that the test version of a web page outperforms the control and implement it site-wide – a Type 1 error – is that worse than committing a Type 2 error, incorrectly concluding there’s no performance difference and leaving the page as-is?
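The trade-off can be illustrated with a short sketch that holds the test design fixed and varies only the significance level. It assumes a two-sided z-test in which the true effect would produce an average z-score of 2.5. That figure is hypothetical and stands in for a particular combination of effect size and sample size.

```python
from statistics import NormalDist


def error_rates(alpha, expected_z):
    """Return (alpha, beta) for a two-sided z-test with a fixed true effect."""
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    # If the effect is real, the z-score is centred on expected_z, not 0.
    power = (1 - normal.cdf(z_crit - expected_z)) + normal.cdf(-z_crit - expected_z)
    return alpha, 1 - power


rows = [error_rates(alpha, expected_z=2.5) for alpha in (0.01, 0.05, 0.10)]
for alpha, beta in rows:
    print(f"Type 1 probability = {alpha:.2f}  ->  Type 2 probability = {beta:.2f}")
```

With everything else held constant, each step down in the Type 1 probability comes with a step up in the Type 2 probability.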
Most of the time, leaving a page as-is is less risky than implementing a change that fails to produce results – or actually worsens the experience for users. The good news is that other facets of the test can impact the probability of Type 2 errors, so it’s possible to maintain a low significance level and probability of Type 1 errors, while also preventing false negatives. Here are the factors to consider:
How to Reduce Type 1 Errors
Because Type 1 errors are tied to the confidence level you set at the beginning of the test, it’s crucial to set that threshold high enough and execute according to the plan. The guidelines:
- Ensure your sample size is adequate. Use a sample size calculator to determine how large your test audience needs to be at the significance level you’ve chosen. A 95% confidence level/5% significance level is standard.
- Run the test all the way to its conclusion. If you’ve determined that it will take a month to reach the sample size you’ve targeted, do not stop the test early.
For example, if you launch the red vs. green button test and peek at the results early, you may see a dramatic difference in performance and erroneously conclude that the green button is the better option. But if you run the test all the way to its conclusion, the difference between the two options ends up being within the margin of error, suggesting that making the change sitewide is more trouble than it’s worth.
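The effect of peeking can be sketched with a simulation in which the red and green buttons convert at exactly the same rate, so every significant result is a false positive. The 10% conversion rate, the 1,000 visitors per version, and the check after every 100 visitors are all hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, conv_b, n):
    """Two-sided p-value for two equal-sized groups of n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    std_error = sqrt(pooled * (1 - pooled) * 2 / n)
    if std_error == 0:
        return 1.0  # nothing to compare yet
    z = (conv_b - conv_a) / n / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def run_identical_test(rng, rate, n_total, look_every, alpha):
    """Return (significant at any look, significant at the planned end)."""
    conv_red = conv_green = 0
    significant_at_any_look = False
    for visitor in range(1, n_total + 1):
        conv_red += rng.random() < rate
        conv_green += rng.random() < rate  # same true rate as red
        if visitor % look_every == 0:
            if p_value(conv_red, conv_green, visitor) < alpha:
                significant_at_any_look = True
    significant_at_end = p_value(conv_red, conv_green, n_total) < alpha
    return significant_at_any_look, significant_at_end


rng = random.Random(7)
runs = 1000
results = [run_identical_test(rng, 0.10, 1000, 100, 0.05) for _ in range(runs)]

peeking_rate = sum(any_look for any_look, _ in results) / runs
planned_rate = sum(at_end for _, at_end in results) / runs
print(f"false positives when stopping at the first significant peek: {peeking_rate:.2f}")
print(f"false positives when waiting for the planned end: {planned_rate:.2f}")
```

In this sketch, acting on the first significant peek produces false positives noticeably more often than the 5% the test was designed for, while waiting for the planned end stays near that level.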
How to Reduce Type 2 Errors
To reduce the risk of Type 2 errors, raise the power of the test. Setting aside adjustment of the significance level, you can boost test power by:
- Raising the sample size and sticking to the schedule. As it turns out, a large sample size is beneficial for preventing Type 2 errors as well as Type 1 errors. By setting a target audience size and resisting the temptation to end the test early, you raise the power of the test and reduce the likelihood of a Type 2 error.
- Targeting the test audience. If you zero in on relevant users, their behavior is more likely to reflect the impact of the change you’re testing, not other random factors.
- Designing a strong hypothesis. Study tests you’ve conducted previously and review industry standards to determine how large a performance change to expect. The larger the change you’re testing for, the lower the chance that you’ll incorrectly conclude there’s no difference between control and test versions.
For example, at a given sample size, a test of whether adding reviews and ratings to the product detail page increases the add-to-cart rate by 2% is more likely to generate a false negative than a test of a change expected to increase the add-to-cart rate by 10%.
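To put rough numbers on this comparison, the sketch below estimates how many visitors each version would need in order to detect a 2% versus a 10% lift. It uses the standard normal-approximation formula for comparing two proportions. It assumes the lifts are relative to a hypothetical 10% baseline add-to-cart rate, with a 5% significance level and 80% power.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per version for a two-sided test."""
    normal = NormalDist()
    p1 = baseline
    p2 = baseline * (1 + relative_lift)  # rate if the lift is real
    p_bar = (p1 + p2) / 2
    z_alpha = normal.inv_cdf(1 - alpha / 2)
    z_power = normal.inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


n_small_lift = sample_size_per_group(baseline=0.10, relative_lift=0.02)
n_large_lift = sample_size_per_group(baseline=0.10, relative_lift=0.10)
print(f"visitors per version to detect a 2% lift:  {n_small_lift:,}")
print(f"visitors per version to detect a 10% lift: {n_large_lift:,}")
```

Under these assumptions the smaller lift needs a far larger audience. That is why a test sized for a large effect is prone to false negatives when the real effect is small.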
How are Type 1 and Type 2 Errors Used in A/B Testing?
By understanding the potential for Type 1 and Type 2 errors, you can design A/B tests that are statistically sound. By formulating a sound hypothesis, setting a high confidence threshold for the test, and targeting your audience, you can trust that the test results accurately reflect the impact of your proposed changes. With solid results as your guide, you can optimize your site to improve performance and drive business growth.