Why Fresh Egg Welcomes Optimizely’s New Stats Engine With Open Arms
Last week website optimisation company Optimizely released a new data methodology behind its split testing platform, called Stats Engine.
Fresh Egg, an accredited Optimizely solutions partner, wholeheartedly welcomes this development because it addresses a critical mismatch: fixed-horizon statistics assume results are analysed once, whereas CRO work is typically continuous, something many testers struggle with. It will also help raise standards in the conversion optimisation industry.
The problem with continuous testing as it’s used now
The traditional, fixed-horizon approach to statistical testing always puts data collection before data analysis. Sample sizes are calculated in advance from the expected effect size and the desired statistical power, and the sample is fixed by the time the results are calculated. This does not reflect the way conversion rate optimisation (CRO) work is conducted.
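As a rough illustration of how a fixed-horizon sample size is derived from effect size and power, here is a minimal sketch using the standard two-proportion z-test formula. The function name, the defaults (95% significance, 80% power) and the example rates are assumptions for illustration, not figures from Fresh Egg or Optimizely.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Visitors needed in EACH variation for a two-sided two-proportion z-test.

    baseline_rate: current conversion rate, e.g. 0.05 for 5%
    relative_lift: smallest relative improvement worth detecting, e.g. 0.2 for +20%
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # statistical power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Hypothetical example: 5% baseline, looking for a 20% relative lift.
    print(sample_size_per_variation(0.05, 0.2))
```

The smaller the effect you want to detect, the larger the sample you must commit to before looking at the results, which is exactly the constraint that sits awkwardly with continuous testing.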
Let’s step back for a moment and consider two important factors which we base conclusions on: is the data showing signs of an effect and, secondly, is the sample a good representation of the population it was taken from?
Having applied randomisation to minimise sampling errors, and working to a 95% significance threshold, you may be satisfied that if no effect exists in the underlying population, your sample will wrongly show one only 5% of the time. Now suppose the effect you were trying to measure doesn’t actually exist.
You’ve analysed your data and the results are correctly telling you that you can’t be confident there is any significant effect being measured. However, if costs were ignored and the experiment was conducted over and over again, it would become increasingly likely that one of the samples would show an effect that isn’t really there and doesn’t represent the underlying population.
Something similar to this occurs when carrying out continuous testing because samples are always changing as new website visitors enter the experiment. If you frequently look at these fluctuating results then eventually you’ll see a statistically significant effect being reported. Because the stated error rate assumes a fixed horizon, where you look at your data only once, the actual chance of error will be much higher than expected.
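The inflation caused by peeking can be demonstrated with a small simulation. The sketch below runs A/A tests (both arms share the same true conversion rate, so any "winner" is a false positive) and compares checking once at the end with checking after every batch of visitors. All parameters (number of tests, visitors per arm, peek interval, conversion rate) are arbitrary assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

# Critical z value for a two-sided test at 95% significance.
Z_CRIT = NormalDist().inv_cdf(0.975)


def is_significant(conv_a, conv_b, n):
    """Two-sided pooled z-test for two arms with n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False
    z = ((conv_b - conv_a) / n) / se
    return abs(z) > Z_CRIT


def simulate_aa_tests(n_tests=500, visitors=2000, peek_every=100, rate=0.1, seed=1):
    """Return (false positive rate with peeking, false positive rate at a fixed horizon)."""
    rng = random.Random(seed)
    peeking_fp = 0
    fixed_fp = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        flagged = False
        for i in range(1, visitors + 1):
            # Both arms convert at the same true rate: there is no real effect.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            # A "peek": the tester stops as soon as any look is significant.
            if i % peek_every == 0 and is_significant(conv_a, conv_b, i):
                flagged = True
        peeking_fp += flagged
        fixed_fp += is_significant(conv_a, conv_b, visitors)
    return peeking_fp / n_tests, fixed_fp / n_tests


if __name__ == "__main__":
    peeking, fixed = simulate_aa_tests()
    print(f"false positives with peeking: {peeking:.1%}")
    print(f"false positives at fixed horizon: {fixed:.1%}")
```

With these settings the fixed-horizon rate should sit close to the nominal 5%, while the peeking rate should come out several times higher, because each test gets 20 chances to cross the threshold.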
Optimizely’s Stats Engine
The new Stats Engine from Optimizely works to maintain the true chance of error close to the expected error rate. It takes a more cautious approach to reporting statistically significant results and thus reduces the number of false positives reported. But it does so without greatly inflating the false negative rate.
The Stats Engine doesn’t just do this for testers who like to peek at results, but also for experiments with multiple variations and/or multiple goals, by controlling the False Discovery Rate (FDR). This compensates for what would otherwise be an increasing false positive risk with each additional variation or goal.
The downside is that you may need to wait longer before declaring a winning variation, but a tester following good statistical practice would already have been doing this anyway. Stats Engine simply applies best practice for you.
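To show what controlling the false discovery rate means in practice, here is a minimal sketch of the classic Benjamini-Hochberg procedure. This is a textbook technique used purely as an illustration; the article does not describe how Stats Engine implements FDR control internally, and the p-values in the example are made up.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans, True where a result is declared significant.

    p_values: one p-value per variation/goal comparison
    fdr: the accepted proportion of false discoveries among declared winners
    """
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        # The threshold grows with rank: the k-th smallest p-value
        # is compared against (k / m) * fdr.
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    significant = [False] * m
    # Everything up to the largest passing rank is declared significant.
    for idx in order[:max_rank]:
        significant[idx] = True
    return significant


if __name__ == "__main__":
    # Hypothetical p-values for four variation/goal combinations.
    p_values = [0.001, 0.04, 0.03, 0.2]
    print(benjamini_hochberg(p_values))
    # → [True, False, False, False]
```

Judged one at a time against a 0.05 threshold, three of these four comparisons would be called winners; once the number of comparisons is taken into account, only the strongest result survives.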
The Fresh Egg approach to A/B testing
Savvy testers understand how fixed-horizon statistics should be used and apply a range of methods to help keep errors down, such as calculating required sample sizes and setting appropriate statistical significance thresholds. Fresh Egg also puts additional measures in place during experimentation to protect against error:
In the real world it’s not just about sample sizes and chance – consideration needs to be given to people’s decision-making processes before they commit to buying a product or using a service. For some types of business – food retailers for example – customers will typically make decisions quickly. The costs are low and the risks minimal. For other businesses, such as vehicle leasers which require a customer to sign up to a contract over a long period, considerable time may be spent weighing up the pros and cons before customers make a commitment.
There’s little point, then, in testing changes to a website if you don’t wait long enough to see whether the visitors who saw the variation go on to convert.
It’s clear that experiments should be run for at least one business cycle, and wherever possible Fresh Egg tests for two.
Having run an experiment for the minimum period dictated by the business cycle, Fresh Egg will then look at the stability of the results. Have the variations dropped in and out of significance recently? Are the conversion trend lines still fluctuating widely? Is a variation’s conversion line still on a steep trajectory?
Fresh Egg will collect more data in such cases before making a decision.
Setting up test variations should never be a numbers game: just as it is necessary to have a solid hypothesis behind every experiment, each variation should also be able to justify its existence.
For example, an experiment to find the best colour for a CTA button might be based on the hypothesis that colours which stand out lead to more conversions. But instead of experimenting with 20 different variations and colours, Fresh Egg would test with just two or three that meet the criteria of standing out.
Focus on what’s important
In most cases, the decision whether or not to implement a change will depend on the primary business goal. If the primary goal has not shown any significant improvement, then careful consideration must be given to whether or not the secondary goal(s) are worth pursuing. Secondary goals can be useful in explaining how a primary goal was achieved: for example, perhaps an increase in newsletter subscriptions led to more orders, but having more subscriptions without the order increase is less valuable.
While end-users will be glad to have an experienced conversion strategist running their testing, the intricacies of biases and error rates are anything but intuitive, and key stakeholders can feel left in the dark. Optimizely’s Stats Engine perfectly complements existing best practice and does so in a consistent, repeatable way which makes communicating results a whole lot more transparent.