A/B testing can transform your business — if you do it right



This article was contributed by Dmytro Voloshyn, cofounder and CTO of Preply.

I’m a firm believer that no one — including founders, executives, and product managers — is ever certain about what customers want. The best business decisions are supported by data.

In the digital universe, A/B testing is a tried and true method to take UX concepts for a test drive. These online controlled experiments allow for rigorously validating your strongest hypotheses before turning an idea into reality. But when it’s not done properly, A/B testing can actually hurt — rather than help — your business.

When it comes to this research methodology, here are four key factors to keep in mind.

The power of holdout groups

The best way to tell if your A/B testing is successful is to create a holdout group, which is a small percentage of customers who do not participate. For every new feature on the website, you need to preserve a group of users who see the product frozen in its current state.

The goal of creating a holdout group is to measure the combined effect of all product development and to establish a causal relationship between those changes and your company’s performance. In doing so, you will be able to spot weak points and highlight the strong ones more easily. The holdout group is also a powerful tool to detect issues within your A/B testing system, even after an experiment is scaled.

So, how can you effectively manage a holdout group?

1. Implement the technical capability to maintain a frozen version of your product.

One option is to keep 5-10% of your servers on the old codebase, then use canary deployment techniques to route holdout traffic to them and monitor the effect. Another option is to isolate user flows in the codebase itself. In this case, it’s critical to delete the old holdout group flows before launching a new one.

2. Ensure that the holdout group is statistically independent from the rest of the product.

You don’t want your users to communicate with each other and potentially share information about the feature set you’re testing.

3. Monitor the performance of the holdout group on a regular basis. 

It’s common to initially see holdout groups performing better than the rest of the product, since there is a natural cost of running A/B tests. But after some time, you’ll most likely see holdout groups performing worse than the group exposed to A/B testing.

4. Understand that the methodology behind implementing holdout groups is not formulaic. 

It simply means dividing your audience into one additional segment; a holdout group typically covers 1% to 10% of the entire audience. You might end up with a split like the following (a sketch of the assignment logic follows the list):

Control: 47.5%

Variation A: 47.5%

Holdout: 5%
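To make this concrete, here is a minimal sketch of deterministic, hash-based bucketing that reproduces the split above. The experiment name, bucket shares, and function names are illustrative assumptions, not a prescribed implementation.

```python
import hashlib

# Hypothetical bucket layout matching the split above:
# 5% holdout, 47.5% control, 47.5% variation A.
BUCKETS = [
    ("holdout", 0.05),
    ("control", 0.475),
    ("variation_a", 0.475),
]

def assign_bucket(user_id: str, experiment: str = "homepage_redesign") -> str:
    """Deterministically map a user to a bucket for a given experiment.

    Hashing user_id together with the experiment name keeps the assignment
    stable across sessions while decorrelating it from other experiments.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    # Convert the first 8 hex characters into a float in [0, 1].
    point = int(digest[:8], 16) / 0xFFFFFFFF
    cumulative = 0.0
    for name, share in BUCKETS:
        cumulative += share
        if point < cumulative:
            return name
    return BUCKETS[-1][0]  # guard against floating-point edge cases

# Example: the same user always lands in the same bucket.
print(assign_bucket("user-12345"))
```

Because the assignment is a pure function of the user ID and experiment name, the holdout group stays frozen for as long as the experiment name is unchanged.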

Quality over quantity

It’s tempting to get into the habit of testing every possible new feature before it’s rolled out. That’s a great data-driven mindset, but the first question you must ask is: “How can this feature help grow the business?”

Here are a few suggested ways to ensure you’re focused on what’s important.

Remove any incentives to develop and scale non-significant experiments.

The biggest companies in the world only scale experiments that are positive and statistically significant, and their scale rates reflect that selectivity: Slack (30%), Microsoft (33%), Bing and Google (10-20%), and Netflix (10%). Being equally selective ensures that your product does not become unnecessarily complex over time, because you only scale features that bring real value to customers.

Calculate the minimum detectable effect before developing a feature.

Estimate the minimum detectable effect (MDE) and the required sample size before any code is written, and make sure your scale rate stays well above your p-value: if you scale experiments at roughly the rate of your significance threshold, many of your "wins" are likely just false positives. This discipline helps prevent your product team from developing dozens of unnecessary features.
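As one illustration, here is a rough pre-test calculation for a conversion-rate (binomial) metric using the standard two-proportion normal approximation. The baseline rate, relative MDE, significance level, and power below are made-up inputs you would replace with your own.

```python
from scipy.stats import norm

def sample_size_per_variant(baseline_rate: float,
                            mde_relative: float,
                            alpha: float = 0.05,
                            power: float = 0.8) -> int:
    """Approximate users needed per variant to detect a relative lift
    of `mde_relative` on a binomial (conversion) metric.

    Uses the standard two-proportion normal approximation.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_relative)
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test
    z_beta = norm.ppf(power)
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p2 - p1) ** 2) + 1

# Example: 4% baseline conversion, detect a 10% relative lift.
print(sample_size_per_variant(0.04, 0.10))
```

If the required sample size is far beyond your traffic, or the lift you can realistically detect is smaller than the lift you care about, that is a signal to rethink the feature before building it.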

Build a direct connection between A/B testing and the company’s performance. 

Most companies will focus A/B testing on improving their North Star set of business metrics. Once you’ve outlined what those are, work with your product team to set goals, expectations, and benchmarks for success.

Understand the difference between UI and UX Design.

Your team needs to clearly distinguish between UI (user interface) and UX (user experience). Sometimes these two terms are used interchangeably, which can mean that you’re testing for the wrong reasons.

Focus on conversion rate optimization (CRO).

This marketing technique helps generate quality web traffic and increase the number of conversions. CRO can also help uncover why customers fail to act once they visit your site, which is critical research that can inform your A/B testing program. Conversion rate itself is innately a binomial metric, bounded between zero and one, and that type of metric has convenient mathematical properties that help you get faster, more reliable results.
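For instance, here is a sketch of a two-proportion z-test that leans on that binomial property: the variance follows directly from the rate itself, so no extra variance estimation is needed. The conversion counts below are invented for illustration.

```python
from scipy.stats import norm

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for the difference between two conversion rates.

    Works because conversions are binomial: the variance is fully
    determined by the rate, so the standard error has a closed form.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))
    return z, p_value

# Example: control converts 400/10,000 visitors, variation 460/10,000.
z, p = two_proportion_ztest(400, 10_000, 460, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```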

Develop customer journey mapping.

Your team needs to have a deep understanding of the customer journey from start to finish. This lays the foundation for prioritizing what those key moments are and where enhancements and improvements will take place.

A/B testing: Beware of bias

It’s human nature: people are biased. And when it comes to A/B testing, this is something to be cognizant of. For example, if a product manager sees negative results from a test that is close to their heart, they may consider tricking the system. This can come in the form of changing the minimum detectable effect (MDE), extending the running time, or changing the p-value, all to sway the results into the insignificant zone. Because people are subjective, good data-driven organizations must provide frameworks that remove the possibility of bias.

Here’s how bias can show up in A/B testing:

1. When a minimum detectable effect isn’t set.

What’s the smallest improvement your team is willing to detect? You need to go into the project with a strong hypothesis and clear metrics for improvement, which will help protect against preconceived ideas.

2. When exceptions are being made.

The easiest way to identify biases is to look for exceptions. Are there cases in the organization when product managers break rules or frameworks? Follow up with “Why?” questions from there.

3. When testing collisions happen.

Bias can creep in when multiple tests run at once and aren’t managed correctly. For example, there can be individual tests (the easiest to run); multivariate tests (testing combinations of multiple changes at once); or mutually exclusive tests (run simultaneously but on separate groups of users). All three setups require proper configuration to avoid bias.
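Here is one possible way, sketched in Python, to keep two simultaneous tests mutually exclusive by first hashing users into non-overlapping layers. The experiment names and the 50/50 layer split are hypothetical.

```python
import hashlib

def _hash_fraction(key: str) -> float:
    """Map a string to a stable float in [0, 1)."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000

def assign_exclusive_experiment(user_id: str) -> str:
    """Partition users so the two hypothetical tests never overlap:
    one half of users is eligible only for 'checkout_test',
    the other half only for 'search_test'.
    """
    slot = _hash_fraction(f"exclusion-layer:{user_id}")
    return "checkout_test" if slot < 0.5 else "search_test"

def assign_variant(user_id: str, experiment: str) -> str:
    """Within an experiment, flip an independent coin for control vs. treatment."""
    return "treatment" if _hash_fraction(f"{experiment}:{user_id}") < 0.5 else "control"

user = "user-42"
experiment = assign_exclusive_experiment(user)
print(experiment, assign_variant(user, experiment))
```

Using a different hash key for the exclusion layer than for the within-experiment coin flip keeps the two decisions statistically independent.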

Identify broken functionality in A/B testing

There are numerous points where A/B testing can go wrong.

1. Sample ratio mismatch (SRM)

This happens when your traffic split is incorrect. Splitting participants should be simple: the system flips a coin and distributes participants into two equal groups, and the more people in each group, the closer the observed split should be to the intended one. The most common causes of a mismatch are a technical mistake or splitting at the wrong point in the flow. The rule here is to always flip the coin as close as possible to the change being tested.
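A common way to catch SRM, shown here as a sketch, is a chi-square goodness-of-fit test that compares the observed allocation against the intended split. The counts and alert threshold below are illustrative.

```python
from scipy.stats import chisquare

def check_srm(observed_counts, expected_shares, threshold: float = 0.001):
    """Flag a sample ratio mismatch with a chi-square goodness-of-fit test.

    observed_counts: users actually assigned to each group, e.g. [50_900, 49_100]
    expected_shares: intended split, e.g. [0.5, 0.5]
    A very small p-value means the observed split is unlikely under the
    intended allocation, i.e. the randomization is probably broken.
    """
    total = sum(observed_counts)
    expected = [total * share for share in expected_shares]
    stat, p_value = chisquare(observed_counts, f_exp=expected)
    return p_value, p_value < threshold

# Example: a 50/50 test that actually delivered 50,900 vs. 49,100 users.
p, is_srm = check_srm([50_900, 49_100], [0.5, 0.5])
print(f"p = {p:.5f}, SRM detected: {is_srm}")
```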

2. Incorrect data

Another source of error is incorrect data. Do you track experiments for search crawlers? How do you maintain unique user IDs across devices? Do you remove outliers? These are important questions. You must also re-run some of your A/B tests regularly to see whether the results replicate. If they do not, uncover the issues within your system or frameworks that are preventing you from being data-driven.

3. False positives

When validating your systems, beware that some results will be false positives. If you’re doing everything right, the percentage of false positive A/B tests should be approximately equal to your p-value. Best-in-class companies will also add error instrumentation to A/B testing systems so that engineers can notice if the number of errors spikes when launching a new experiment.
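One way to validate this, sketched below, is to simulate A/A tests (both groups drawn from the same distribution) and confirm that roughly a p-value's worth of them come out "significant." The traffic numbers and random seed are arbitrary.

```python
import numpy as np
from scipy.stats import norm

def simulate_aa_tests(n_tests: int = 5_000, users_per_group: int = 10_000,
                      conversion_rate: float = 0.05, alpha: float = 0.05) -> float:
    """Run simulated A/A tests and return the share that comes out
    'statistically significant'. A healthy testing pipeline should
    report roughly `alpha` of them as false positives.
    """
    rng = np.random.default_rng(seed=7)
    conv_a = rng.binomial(users_per_group, conversion_rate, size=n_tests)
    conv_b = rng.binomial(users_per_group, conversion_rate, size=n_tests)
    p_a, p_b = conv_a / users_per_group, conv_b / users_per_group
    pooled = (conv_a + conv_b) / (2 * users_per_group)
    se = np.sqrt(pooled * (1 - pooled) * (2 / users_per_group))
    z = (p_b - p_a) / se
    p_values = 2 * norm.sf(np.abs(z))
    return float(np.mean(p_values < alpha))

print(f"Observed false positive rate: {simulate_aa_tests():.3f}")  # expect ~0.05
```

A rate far above your p-value points to a broken pipeline; a rate far below it suggests your test is less sensitive than you think.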

4. Poor traffic & conversions

When you do not have enough traffic, split A/B testing is not the appropriate choice. It will take a significant amount of time to collect the data needed to pull off A/B testing the proper way. Hold off on conducting this research until you have enough web visitors to justify it.
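As a rough sanity check before committing to a test, you can estimate how long it would take to reach the required sample size at your current traffic. The visitor counts below are hypothetical, and the required sample size is taken from the earlier MDE sketch.

```python
import math

def weeks_to_significance(users_per_variant_needed: int,
                          daily_visitors: int,
                          traffic_share_per_variant: float = 0.5) -> float:
    """Rough estimate of how long a test must run before it reaches
    the required sample size, given the site's daily traffic."""
    daily_per_variant = daily_visitors * traffic_share_per_variant
    return math.ceil(users_per_variant_needed / daily_per_variant) / 7

# Example: ~39,500 users per variant needed (from the earlier MDE sketch)
# on a site with 1,500 daily visitors split 50/50.
print(f"{weeks_to_significance(39_500, 1_500):.1f} weeks")
```

If the answer is measured in quarters rather than weeks, the traffic simply isn't there yet.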

Dmytro Voloshyn is the cofounder and CTO of Preply.
