How Understanding Particle Physics beats Conversion Rate Optimization
What, really?
Last year we were working on launching Nestoria in Mexico, and as usual when we're launching a new country we sought out some local knowledge in the form of a student who had grown up in Mexico. Carlos helped us with localization, geocoding, and finding a good hosting provider in that part of the world; but it just so happened that Carlos also has a PhD in particle physics and, as such, an extremely robust understanding of statistical models. We took the opportunity to get him involved in improving our A/B testing platform - in between modelling the Probability Density Function of our user sessions he was moonlighting as a researcher at CERN! (Ok, it could have been the other way round...)
Our existing approach to A/B testing had been undermined by our lack of trust in the conclusions it was drawing. In the world of Conversion Rate Optimisation, there are a few popular tools already (Optimizely or "How Optimizely (Almost) Got me Fired") and a bunch of common best practices scattered around the web.
These have been discussed at length already, mainly: don't stop your tests early, figure out your success criteria, interpret p-values correctly, and a general feeling that everyone (bar blog author X) is doing it wrong.
So, given that there are so many reported pitfalls to A/B testing, let's answer this question: How would a Particle Physicist solve the problem?
Lay it all down
First thing to note is you have to understand your data. We collect all our own metrics internally, which means we have access to our traffic data in its most unadulterated form. This makes it easier for us to look at the data from different views and figure out the best way to draw meaningful conclusions from our experiments.
For example, if you plot total conversions per day, you might not see anything but seasonality. But if you plot conversions per session (aka visit), you get a better-defined view of user behaviour. When we actually looked at our traffic with the correct statistical units, it was like a cloud had lifted over the Nestoria office (metaphorically - we are still in London.)
What we found was that our users behaved in a way that was similar to the Average Energy Loss of Ionizing Particles travelling through matter. Our physicist had recognised that our traffic closely resembled the Landau Distribution.
Ahah! The key characteristics of our distribution are a large peak between 0 and 1, followed by a long tail. That is, a lot of users who arrive will not click-out at all, but there will be a small number of users who click-out a lot. Makes sense!
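To make the "conversions per session" view concrete, here is a minimal sketch of how such a histogram could be built. The event format, the `"clickout"` event name and both function names are assumptions made for illustration, not a description of Nestoria's internal metrics.

```python
from collections import Counter


def clickouts_per_session(events, session_ids):
    """Count click-outs for every session, including sessions with none.

    events:      iterable of (session_id, event_type) tuples (hypothetical log format)
    session_ids: every session seen in the period, so zero-click sessions are kept
    """
    counts = {sid: 0 for sid in session_ids}
    for sid, event_type in events:
        if event_type == "clickout":
            counts[sid] = counts.get(sid, 0) + 1
    return counts


def histogram(counts):
    """Map 'click-outs in a session' -> 'number of sessions with that many'."""
    return dict(sorted(Counter(counts.values()).items()))


if __name__ == "__main__":
    # Tiny made-up sample: most sessions never click out, one clicks out a lot.
    sessions = ["s1", "s2", "s3", "s4", "s5", "s6"]
    events = [("s1", "search"), ("s2", "clickout"), ("s3", "search")]
    events += [("s6", "clickout")] * 7
    for clicks, n_sessions in histogram(clickouts_per_session(events, sessions)).items():
        print(f"{clicks:>3} click-outs | {'#' * n_sessions}")
```

Passing the full list of sessions matters: if you only count sessions that appear in the click-out log, the big peak at zero disappears and the shape of the distribution is lost.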
I'll leave it to the reader to come up with analogies between the behaviour of users in vertical search and the energy loss of particles fired through a layer of some density. I wonder if this type of distribution is limited to Nestoria? Or is it common to all search-based traffic, or even to any web-based traffic?
In trying to understand the best view of your numbers, you can come away with learnings that become key to product decision-making.
One interesting thing about the Landau Distribution is that its long tail falls away so slowly (roughly as 1/x^2) that the mean of the distribution is mathematically undefined. If you were thinking of just using the 'average' click-outs over your experiment length, that won't work: the average ends up at the mercy of a handful of very heavy sessions.
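A short sketch of why the mean is undefined, assuming the standard large-x behaviour of the Landau density p(x), which falls off approximately as 1/x^2. The lower cut-off x_0 is just an arbitrary point far enough into the tail for that approximation to hold:

```latex
\mathbb{E}[X] = \int_{-\infty}^{\infty} x \, p(x) \, dx ,
\qquad
p(x) \sim \frac{1}{x^{2}} \quad (x \to \infty)

\int_{x_0}^{\infty} x \cdot \frac{1}{x^{2}} \, dx
  = \int_{x_0}^{\infty} \frac{dx}{x}
  = \lim_{b \to \infty} \ln\frac{b}{x_0}
  = \infty
```

An unbounded tail alone is not the problem (plenty of distributions with no upper limit have a perfectly good mean); it is the slow 1/x^2 decay that makes the integral diverge.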
It also told us that the common assumption of Normality (that your data follows an approximately Normal Distribution), which techniques such as a Student's t-test rely on, did not hold for our raw data. What we needed was pre-processing before using such statistical methods.
Keep it simple
Ok, so we know our data follows a Landau Distribution. How would you then compare the performance of your A/B groups if you can't use a mean, or Student's t-test?
It was here that another great statistical technique was introduced to us. If you take all the events in your sample set and randomly assign them to smaller sub-sample sets (of equal size), the means of those sub-samples will, by the Central Limit Theorem, be approximately Normally Distributed themselves (provided each sub-sample is large enough and you can take enough of them).
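The sub-sampling step described above can be sketched in a few lines of standard-library Python. The function name `subsample_means`, the `group_size` parameter and the Pareto-based demo data are assumptions for illustration only; the demo data is a generic skewed stand-in, not a Landau fit and not real traffic.

```python
import random
import statistics


def subsample_means(values, group_size, seed=None):
    """Shuffle values, cut them into equal-sized groups, return each group's mean.

    Values left over after the last full group are dropped so that every
    sub-sample has exactly group_size members.
    """
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    n_groups = len(shuffled) // group_size
    return [
        statistics.fmean(shuffled[i * group_size:(i + 1) * group_size])
        for i in range(n_groups)
    ]


if __name__ == "__main__":
    # Made-up skewed data: mostly zeros with an occasional large value.
    rng = random.Random(42)
    clickouts = [int(rng.paretovariate(1.5)) - 1 for _ in range(10_000)]
    means = subsample_means(clickouts, group_size=100, seed=1)
    print("number of sub-samples:", len(means))
    print("mean of sub-sample means:", statistics.fmean(means))
    print("stdev of sub-sample means:", statistics.stdev(means))
```

The choice of `group_size` is a trade-off: larger groups make each mean better behaved, but leave you with fewer data points to feed into the test.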
Cool, this problem existed before the web! We can quite simply transform our data set to make it Normally Distributed, and then use all the common methods that are employed in A/B testing.
We thought about using the Landau parameters to draw conclusions, but they are not commonly known, so interpreting them seemed an unnecessary burden. If we can keep it simple, that leaves fewer opportunities for misinterpretation.
Because statistics are notorious for misinterpretation
Great, we have our datasets and all that's left is to run a t-test to reject (or fail to reject) the null hypothesis, right?
To borrow a great quote from our particle physicist: "In theory, yes."
But after this we have to consider possible confounders, work out when to stop the test, and revisit the basic assumptions of the statistical methods to make sure none are being contradicted. There are many resources to support you at this point, some of which I'll list at the bottom of this post.
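For reference, the t-test step itself is small. The sketch below uses Welch's unequal-variance variant, which is a choice made here rather than something the post specifies, and the inputs are assumed to be the sub-sample means of the A and B groups. The p-value helper uses a Normal approximation, which is only reasonable when the degrees of freedom are large; for a small number of sub-samples you would want a proper t-distribution from a statistics library.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples.

    Each sample needs at least two values, and the two samples must not
    both have zero variance (that would mean dividing by zero).
    """
    # Variance of each sample mean (sample variance divided by sample size).
    var_a = statistics.variance(a) / len(a)
    var_b = statistics.variance(b) / len(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    )
    return t, df


def approx_two_sided_p(t):
    """Two-sided p-value from a Normal approximation (large df only)."""
    return 2 * (1 - statistics.NormalDist().cdf(abs(t)))


if __name__ == "__main__":
    group_a = [1, 2, 3, 4, 5]  # made-up sub-sample means for variant A
    group_b = [2, 3, 4, 5, 6]  # made-up sub-sample means for variant B
    t, df = welch_t(group_a, group_b)
    print(f"t = {t:.3f}, df = {df:.1f}, approx p = {approx_two_sided_p(t):.3f}")
```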
The most certain thing is that any analysis will be closely scrutinised before it is accepted. One method of gaining credibility was conducting an A/A test, where both groups get exactly the same experience and so the analysis should clearly fail to reject the null hypothesis. Another was to build a segmenter for our dataset in order to see where our conversion was suffering. For example, is our new feature bad overall because there was a bug affecting only Android 4.4.2? Segmenting your dataset and running t-tests for each segment will identify this and help to convert the developers who suggest the analysis is flawed.
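A segmenter along those lines might look something like the sketch below. The row format `(segment, group, value)`, the group labels `"A"` and `"B"` and the function names are all assumptions for illustration; the values would be whatever normalised metric you are testing, such as sub-sample means.

```python
import math
import statistics
from collections import defaultdict


def welch_t_stat(a, b):
    """Welch's t statistic for two samples (each needs at least two values)."""
    se = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.fmean(a) - statistics.fmean(b)) / se


def t_by_segment(rows):
    """Group rows by segment and compare A against B within each segment.

    rows: iterable of (segment, group, value) with group "A" or "B"
          (hypothetical format). Segments with fewer than two values in
          either group are skipped because no variance can be estimated.
    """
    buckets = defaultdict(lambda: {"A": [], "B": []})
    for segment, group, value in rows:
        buckets[segment][group].append(value)
    return {
        segment: welch_t_stat(groups["A"], groups["B"])
        for segment, groups in buckets.items()
        if len(groups["A"]) >= 2 and len(groups["B"]) >= 2
    }


if __name__ == "__main__":
    # Made-up data in which only the android segment differs between A and B.
    rows = [("ios", g, v) for g in "AB" for v in (1, 2, 3)]
    rows += [("android", "A", v) for v in (4, 5, 6)]
    rows += [("android", "B", v) for v in (1, 2, 3)]
    for segment, t in t_by_segment(rows).items():
        print(f"{segment:<8} t = {t:.2f}")
```

One caveat worth keeping in mind: the more segments you test, the more likely it is that one of them looks significant by chance, which is where the False Discovery Rates link below comes in.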
Was it worth it?
In a word: yes!
The time spent working with our own metrics data has led to a far broader understanding of how our key business metrics respond to product changes. We are now able to automate the set-up and analysis of A/B tests for all feature changes, additions and removals. We are constantly iterating to improve Nestoria for our users, faster and more correctly than we were before.
That warm fuzzy feeling of a test that resulted in 10% more clickouts is awesome.
Useful links:
Airbnb Experiments
False Discovery Rates
Analyzing A/B test results which are not normally distributed, using independent t-test
Analytics Toolkit Blog
Tim Man











