Two weeks ago, I wrote a post on PPC ad copy testing that ended up being my most popular post for April. One of the recommendations I made was to write a lot of ads, but only test 2 ads at a time, so you can get to statistical significance faster.
But Kirk Williams had another reason not to test multiple ad variations: profitability.
I’ll admit it had occurred to me that running too many ads could hurt profitability, but I’d never run the numbers. And Kirk’s numbers in the table above were made up. So I decided to dig through historical data to see if I had any actual figures to analyze.
We inherited a large account that had up to 12 ad variations running in some ad groups. It’s a high volume account, so that many ads made some sense – except for the fact that most of this client’s conversions come in over the phone, and phone calls can’t be tracked back to ad variations. So looking at just online form fills, each variation often had only 1-2 conversions, and some had none.
I decided to use the actual data to create hypothetical scenarios, where we assume that only the best 2 ads in the ad group ran at the same time.
Scenario 1, Actual Data
In this scenario, there are 6 ads with wildly varying statistics. I should note here that the previous agency also used “optimize for clicks” in some campaigns, but not others. Anyway, there’s one version, Version 4, with a high conversion rate, but every variation had fewer than 10 conversions.
Scenario 1, Hypothetical
Here I took the total number of impressions for the ad group and split them evenly, and then calculated the rest of the metrics based on actual CTR and conversion rate. It’s pretty clear which ad is the winner here – and it’s also clear, based on the actual statistics, that about $1,600 was wasted on ads that weren’t converting as well as the top 2.
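The projection described above can be sketched roughly as follows. This is a minimal illustration, not the spreadsheet behind the post: the `AdStats` class, the `project_even_split` function, the use of an average CPC to estimate cost, and every number in it are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class AdStats:
    name: str
    ctr: float        # clicks / impressions, taken from the actual data
    conv_rate: float  # conversions / clicks, taken from the actual data
    cpc: float        # average cost per click (assumed input for cost)


def project_even_split(ads, total_impressions):
    """Split impressions evenly across ads and project the other metrics."""
    share = total_impressions / len(ads)
    rows = []
    for ad in ads:
        clicks = share * ad.ctr
        conversions = clicks * ad.conv_rate
        rows.append({
            "ad": ad.name,
            "impressions": share,
            "clicks": clicks,
            "conversions": conversions,
            "cost": clicks * ad.cpc,
        })
    return rows


# Made-up figures for illustration only; these are NOT the client's numbers.
top_two = [
    AdStats("Version A", ctr=0.04, conv_rate=0.10, cpc=2.00),
    AdStats("Version B", ctr=0.05, conv_rate=0.05, cpc=2.00),
]

for row in project_even_split(top_two, total_impressions=20_000):
    print(row)
```

Note that the projection simply carries each ad's observed CTR and conversion rate forward, so an ad with very little real data gets the same weight as one with a lot.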
But was this ad group a fluke? I looked at a second example to be sure.
Scenario 2, Actual Data
Here we had 5 different ads. Version 1 had the most conversions, but also the lowest conversion rate. The ad that converted the best didn’t have many impressions. There’s no clear winner here either.
Scenario 2, Hypothetical
The winning ad wins by a landslide here. Cost for the 2 ads was similar, but the winner converted at more than twice the rate of the 2nd-best ad.
The caveat with Scenario 2 is that, in the actual scenario, the winning ad had so few impressions that I hesitate to extrapolate its performance over more impressions and clicks. Often I see ads have “beginner’s luck” where they do very well initially, and then settle into a more average performance. But even if the winner didn’t convert quite as well, it likely would have beaten the contenders in this instance. And in this case, about 80% of the budget was spent on losing ads. I’d hate to have to tell that to the client.
Conclusion
Based on these examples, it’s pretty clear that, at least hypothetically, running 5-6 ads wastes more money than running 2 ads. I’m willing to hear examples to the contrary, though. I know at least a few of my readers know a lot more about statistical theory than I do – what say you? Is this a legit analysis, or are there holes? Share in the comments!
Mel,
Running more ads has a bigger problem.
Here is a talk by Jason Cohen that sheds light on how you could arrive at a wrong conclusion when doing multiple A/B tests. He refers to the “41 Shades of Blue” test at Google and says that there is a non-trivial chance that the winner was a false positive, even if tested at a very high confidence level.
Here is a link to his talk.
http://businessofsoftware.org/2013/06/jason-cohen-ceo-wp-engine-why-data-can-make-you-do-the-wrong-thing/
Interestingly, Richard Fergie (@RichardFergie) says this can be solved. https://twitter.com/kshashi/status/580038800866316288
Maybe he can shed more light on this.
On the same note, Brad Geddes said on a podcast that ideally, you should have a minimum of 2 desktop and 2 mobile-preferred ads in your ad group. If you are offering a service where mobile users can be shown different ad copy, we are talking about 4-6 ads. Though this number may make us frown in aggregate, it’s really 2-3 ads per device. I prefer to follow Brad’s advice here.
Very interesting, Shashi. I’ll have to go check out that talk. I’ve heard Brad talk about numbers of ads and I do agree with the 2-3 ads per device rule. Mobile can perform very differently from desktop, and to be honest I didn’t take that into account in the examples in this post. Fodder for a future post I guess!
Hey, interesting discussion!
Wouldn’t the problem of false positives occur regardless of whether you test everything at once or not? In a five-ad test, don’t you compare ads one by one just like you would in a series of tests with two ads?
You are right, Martin. Then, the goal would be to arrive at a false positive slowly 😉
Now that I think more on it, false positives in ad testing are probably far more prevalent than we would like to believe. The results of the test are valid only if no other parameter changes during the entire duration of the test. If you modify the bid, add a keyword, update the landing page, block certain traffic with negative keywords, etc., I would like to believe the test is no longer the same. How often can we guarantee a cleanroom environment for ad testing?
There is no clean test environment IMHO. Things are always changing: we’re adding/subtracting keywords, changing bids, etc. – and even if we weren’t doing those things, the external environment is always changing, with different people searching, news events, etc. We just have to do the best we can with what we have.
There are fixes for the “pairwise comparison” thing. The best known is the Bonferroni correction, which basically calculates how much more “significant” your results need to be when you make multiple comparisons.
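As a rough sketch of what that correction does: with `m` comparisons, the Bonferroni correction tests each one at `alpha / m` instead of `alpha`. The function names below are made up for illustration, and the family-wise error formula assumes the comparisons are independent, which pairwise comparisons among the same ads are not exactly, so treat it as a ballpark figure.

```python
from math import comb


def family_wise_error_rate(alpha, comparisons):
    """Chance of at least one false positive, assuming independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


def bonferroni_alpha(alpha, comparisons):
    """Per-comparison significance threshold after the Bonferroni correction."""
    return alpha / comparisons


# Comparing 5 ads pairwise means 5 * 4 / 2 = 10 comparisons.
m = comb(5, 2)
alpha = 0.05

print("comparisons:", m)
print("uncorrected chance of a false positive:", family_wise_error_rate(alpha, m))
print("corrected per-comparison alpha:", bonferroni_alpha(alpha, m))
```

With 10 comparisons, each one would need to clear a 99.5% confidence bar rather than 95%, which is part of why tests with many variations take so much longer to call.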
Gelman recommends using hierarchical models to dodge this problem (but then he recommends hierarchical models for everything I think)
http://andrewgelman.com/2015/04/22/instead-worrying-multiple-hypothesis-correction-just-fit-hierarchical-model/
http://www.stat.columbia.edu/~gelman/research/published/multiple2f.pdf
Great resources. Thanks Richard!
Overall, I totally agree with the advice here – be careful of testing too many variations at the same time. My preference has always been 2-3 ads by device by ad group (assuming you’re on mobile) and 2-3 ads per ad group included in any multi-ad group tests.
For multi-ad group tests: if I want ‘big picture’ info, like should I use DKI or not, should I use geos or not, etc., then since it’s an a/b question, it’s 2 ads per ad group. If I want to know something like what goes better (prices, discounts, or neither), then I’ll use 3 ads per ad group since there are 3 options. I rarely go to 4 or 5 unless there’s a *lot* of traffic.
However (and I don’t necessarily agree with this, but let me play devil’s advocate): if we added the 3rd element of time/opportunity cost, we might arrive at a different conclusion. If we were to test 5 variations (assuming lots of traffic) and one variation is much better, then that variation might be found in 4-6 weeks. Now, if we were only doing a/b testing (2 ads per ad group) and we were doing monthly testing, then we might not find that ad until month 5, so there’s roughly 4 months of sub-optimal ads being displayed.
I think there’s a lot of ways to do this math to prove either viewpoint. The randomness of large sample sets is important; however, that can be overcome with a larger number of impressions. The only constant minimum viable data in ad testing is time – the other variables are overcome by impressions/conversions/etc.
So when I’m being ‘semi-scientific’ I generally use a rule of a minimum of 2 ads per device by ad group and you can add another ad for every 3,000 impressions/month for an ad group (so you need 9000 to test 3 ads); but never to eclipse 5 even if you’re doing 100,000 impressions/month due to the randomness factors.
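One way to read that rule of thumb as code is sketched below. The function name and parameter names are made up, and the interpretation (one ad per 3,000 monthly impressions, never fewer than 2, never more than 5) is an assumption based on the “9000 to test 3 ads” example.

```python
def max_ads_to_test(monthly_impressions, per_ad=3000, floor=2, cap=5):
    """Rule of thumb: one ad per `per_ad` monthly impressions, clamped to [floor, cap]."""
    by_volume = monthly_impressions // per_ad
    return max(floor, min(cap, by_volume))


for impressions in (1_000, 6_000, 9_000, 12_000, 100_000):
    print(impressions, "impressions/month ->", max_ads_to_test(impressions), "ads")
```

This would apply per device and per ad group, so an ad group with separate mobile ads would run the check once for each device's impressions.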
I’m not sure of all the math; but that sort of semi-math has done well for me. As most ad groups don’t have that many impressions – the overall rule of 2-3 ads per ad group applies to 95%+ of all ad groups and is just good overall advice to follow since it applies to the majority of accounts.
Anyway – my $0.02 🙂
See, I knew someone with a better stats background than I have would comment. 🙂 Huge thanks for your thoughts, Brad – and it totally makes sense. Love the “never more than 5” rule.
Huh, thanks Brad. Fascinating and your “devil’s advocate” point makes a lot of sense.
Melissa, thanks for taking my thoughts and putting feet to them!
Another thought – how would this methodology differ if you were using optimise for clicks/conversions?
In this case, the penalty for having a lot of ads running at once is lower because the better adverts will get more impressions.
I’ve run into several cases where Google doesn’t show the top CTR/CR ad even when using optimize since Google makes very fast decisions and is slow to change their mind.
For instance if this were the stats:
Test a: Imp 30 – Clicks 1
Test b: Imp 30 – Clicks 9
Test c: Imp 10 – clicks 0
There’s a 98.8% confidence that test b is the winner and theoretically, you can make a decision. With confidence info, there’s the assumption that the data that comes later is similar to the data that came before. However, if those impressions all occurred during the lunch hour on a Monday, the data will quickly become different when users come back from lunch and are working again.
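For anyone who wants to check a confidence figure like this by hand, here is a minimal sketch using a pooled two-proportion z-test on tests a and b. The comment does not say which calculator or method produced the 98.8%, so this is an assumed method and may not give the same number; the function name is also made up. With samples this small, the normal approximation is rough at best.

```python
from math import erf, sqrt


def two_proportion_confidence(clicks_a, imps_a, clicks_b, imps_b):
    """Return (z, two-sided confidence) for the difference in click-through rate."""
    p_a = clicks_a / imps_a
    p_b = clicks_b / imps_b
    pooled = (clicks_a + clicks_b) / (imps_a + imps_b)
    se = sqrt(pooled * (1 - pooled) * (1 / imps_a + 1 / imps_b))
    z = (p_b - p_a) / se
    # For a standard normal Z, P(|Z| < |z|) = erf(|z| / sqrt(2)).
    confidence = erf(abs(z) / sqrt(2))
    return z, confidence


z, confidence = two_proportion_confidence(clicks_a=1, imps_a=30, clicks_b=9, imps_b=30)
print(f"z = {z:.2f}, two-sided confidence = {confidence:.1%}")
```

Whatever the exact figure, the calculation only describes the impressions already collected. It says nothing about whether the next batch of searchers will behave the same way.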
I don’t see this a lot in Google, but I’ve seen it enough (and it has messed up both testing and simply showing the ad that best meets your goals, regardless of the metric) that I always use rotate indefinitely as my ad rotation so that the test is more equal, hits minimum viable data faster, and lets me make decisions based upon better overall data quality.
Can’t argue with the fact that optimise for clicks/conversions isn’t perfect but I don’t think our alternative methodologies are perfect either.
As you say, it is a trade off between quicker results and more accurate results.
I run simulations to test out different strategies with this and even the simplest bandit leads to better results than waiting for statistically significant results. But (and it is an absolutely massive but), the model I’ve been using is too simplistic to take into account things like different adverts performing better on different days.
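To make the idea concrete, here is a minimal sketch of that kind of simulation, comparing even rotation with an epsilon-greedy bandit. It is not the simulation mentioned in the comment: the `simulate` function, the conversion rates, and the `epsilon` value are all made up, and it shares the limitation described above in that each ad's true rate is fixed and never varies by day.

```python
import random


def simulate(true_rates, impressions, epsilon=None, seed=0):
    """Serve ads and count successes.

    epsilon=None rotates evenly; otherwise use an epsilon-greedy bandit.
    """
    rng = random.Random(seed)
    n_ads = len(true_rates)
    shows = [0] * n_ads
    wins = [0] * n_ads
    for i in range(impressions):
        if epsilon is None:
            ad = i % n_ads                      # strict even rotation
        elif rng.random() < epsilon or 0 in shows:
            ad = rng.randrange(n_ads)           # explore a random ad
        else:
            # Exploit: show the ad with the best observed rate so far.
            ad = max(range(n_ads), key=lambda a: wins[a] / shows[a])
        shows[ad] += 1
        if rng.random() < true_rates[ad]:
            wins[ad] += 1
    return shows, wins


rates = [0.02, 0.03, 0.05]  # hypothetical true conversion rates
even_shows, even_wins = simulate(rates, 30_000)
bandit_shows, bandit_wins = simulate(rates, 30_000, epsilon=0.1)

print("even rotation:", even_shows, "total conversions:", sum(even_wins))
print("epsilon-greedy:", bandit_shows, "total conversions:", sum(bandit_wins))
```

Running it with several seeds and comparing total conversions gives a feel for the trade-off. Letting the rates change partway through would be the next step toward modelling ads that perform differently on different days.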
One day I will improve the simulation…
And how can Google’s algorithm be so awful? How much money must this cost them?
My view – it probably isn’t that awful, I just notice and remember the times it appears to be wrong
Oh, it’s awful. 🙂
I agree with Richard: Google’s ad rotation is extremely important for their business. Systematically making the wrong decision would hurt them badly. At the same time, stuff like this is where Google is brilliant.
On the other side we have gut feelings and simple math, both based on superficial data. If those approaches lead to different results, I’d put my money on the algorithm guys.
Define “better,” though. I’ve found with the optimize options that Google picks a winner too soon. I’ve seen one ad get 85% of impressions after only a day or 2. That’s not a fair test IMHO. Would love to hear Brad’s thoughts on this, I know he has some 🙂
I once asked my account manager what they considered the best way to test ads, and a senior account strategist replied: create 3 ads and let them run unoptimized until they have generated enough data (1-2 weeks). Then switch to either optimization setting, clicks or conversions.
Sounds solid to me. However, I still leave it running unoptimized and check for significant performance differences based on conversion/impression.
What – no one is going to bring up the fallacy of “rotate evenly” into this discussion?
Much food for thought. Great comments too!
You mean how “rotate evenly” isn’t really even? I almost put a caveat about that in my hypothetical examples, b/c real impression data is never split 50/50 and sometimes it’s more like 80/20. Sadly, that’s the best we have at this point!
True. But it does skew testing results. Granted, we all have to figure out how to effectively test within the system we’ve got. Just one more thing to figure around, right?!?
Hey Mel,
I’d like to add a few things.
First of all, I don’t think the hypothetical scenarios make for a ‘fair’ comparison. It’s basically hindsight: what if we had known the best two ads in advance and never tried any other? Sure, the results would be better – they’d be even better if we had picked the best ad from the start without ever testing anything.
To get a proper picture you’d probably have to go through all possible A/B test combinations and orders (120?) and then average the results.
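The 120 is consistent with the number of orders in which 5 ads (as in Scenario 2) could be lined up for testing, since 5! = 120. That reading is an assumption, and the ad names below are placeholders. A quick way to enumerate the orders:

```python
from itertools import permutations
from math import factorial

ads = ["Version 1", "Version 2", "Version 3", "Version 4", "Version 5"]

# Every order in which the 5 ads could enter a series of A/B tests.
orders = list(permutations(ads))

print(len(orders))  # 120
print(orders[0])
```

Averaging the outcome of a sequential A/B testing process over all of these orders would give the fairer baseline described here, rather than assuming the best two ads were known from the start.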
By the way, you mention that 80% spent on losing ads was bad. But let’s say you find a new best ad after a year, wouldn’t that also mean you’ve spent a year with losing ads?
Then I noticed that you used conversion rates from the ads to calculate outcomes. I don’t know about the nature of the ads tested here, but I don’t think this should be done without an explicit hypothesis. Here, the implicit hypothesis is that these ads have different conversion probabilities — but do they? Of course, they always produce different conversion rates, but, for example, if the only difference in the copy is an exclamation mark then I wouldn’t take conversions into account.
Two more cents 🙂
Martin
See, I knew there were holes in my logic. I suppose there is no perfect test scenario other than to always be testing.
I agree with most of what Martin says here.
There are a couple of things about the final point (“Then I noticed that you used conversion rates…”) that don’t quite ring true to me.
1. For the big picture in this discussion I don’t think it matters what metric we use to define advert success. Just looking at conversion rate isn’t perfect in most business applications, but for the high level view we are taking here it doesn’t matter what success metric we use as long as we all accept that success can be measured. We might be able to have an interesting conversation about how to use “optimise for…” rotation settings when you are optimising for something that isn’t clicks or conversions but this is about a million steps further on from this. (And if you know the answer to this then I would love to hear it!)
2. One of my (many) problems with using traditional null hypothesis testing with adverts is that the null hypothesis is something like “these two adverts have the same conversion rate”. My view is that this hypothesis is always false and that if you wait long enough you will have enough data to reject the null hypothesis. So the question isn’t about whether or not you reject the null hypothesis but *when* you reject the hypothesis. But your comments suggest that you think it is possible for two different adverts to have the same conversion rate. You might mean “it is possible for the difference in conversion rate between two adverts to be meaningless from a business point of view” in which case I 100% agree with you. But if you mean something else I would love for some elaboration.
One more thought. Back when I worked in-house in direct marketing we had a long discussion as a marketing team about statistical significance and what was acceptable. Our analysts were arguing for 95% or 99% confidence – and we rarely reached that level in our testing. Then our VP of Marketing said, “This isn’t brain surgery or medical testing. No one is going to die if we pick the wrong winner. We need to iterate faster.” And the decision was made to go with a 90% confidence level.
I think sometimes we get hung up on finding the “perfect” ad test when it doesn’t exist. At what point is our testing “good enough”?
I think this is a *great* point Melissa.
If you look at all possibilities and flaws in any testing philosophy, you will find holes in the logic, holes in the math, or just differing opinions. That can stall the entire process.
As a perfectionist – I’m sometimes not great at testing as I have to be told: It’s good enough for now – just test it, see the results, and keep going.
I always have to remember: You don’t have to be perfect. To improve, you just have to be better than you are today.
I think this is where a lot of very sophisticated stats people get hung up on testing. It’s easy to poke holes in any philosophy, but in the end, all you’re trying to do is get a little bit better every week – that will lead to real improvements.
Hey Richard,
My point about conversion rates is a practical one: Even though we may have ways and tools to take those into account, I don’t think it’s always a good idea. This is especially true for other kinds of success that are further down the road.
I’d leave the math behind and turn to common sense first. An ad has a lot of influence on whether a user clicks. But it has less to do with whether someone converts. It has even less to do with repeat purchases and customer lifetime value.
I know that it’s not hard to argue against this. An ad attracts certain kinds of users and those might have different conversion rates, spending power, etc. So if you test ads like “Cheap stuff, buy now!” against “Fine things for rich people”, then sure, take conversion rates into account. But if you test “Cheap stuff” against “Cheap things”, don’t.
Yes, I know, even those may attract different kinds of customers with different conversion rates. Our industry’s best practice approach dictates we need to test this. We need to know.
At the same time, the commonly used A/B testing technique is basically to never take no for an answer. “Statistical significance calculator – are these two different with 95% significance? No? Then we’ll test some more until they are”. With this approach you’ll always find significant results, regardless of whether there’s really a difference and regardless of whether the difference actually makes a difference.
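The effect is easy to see in a small A/A simulation, where both “ads” have exactly the same true rate, so any significant result is a false positive. This is a minimal sketch with made-up names and parameters: it compares checking for significance after every batch of impressions (and stopping at the first “yes”) against checking only once at the end.

```python
import random
from math import sqrt


def z_stat(c_a, n_a, c_b, n_b):
    """Pooled two-proportion z statistic; 0.0 when there is no variation yet."""
    pooled = (c_a + c_b) / (n_a + n_b)
    if pooled in (0, 1):
        return 0.0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (c_b / n_b - c_a / n_a) / se


def aa_false_positive_rates(rate=0.05, checks=10, step=200, trials=300,
                            z_crit=1.96, seed=1):
    """Return (rate with peeking at every check, rate with one final check)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for trial in range(trials):
        c_a = c_b = n = 0
        ever_significant = False
        significant = False
        for check in range(checks):
            # Both ads share the same true rate: there is no real difference.
            c_a += sum(rng.random() < rate for _ in range(step))
            c_b += sum(rng.random() < rate for _ in range(step))
            n += step
            significant = abs(z_stat(c_a, n, c_b, n)) > z_crit
            ever_significant = ever_significant or significant
        peeking_hits += ever_significant
        final_hits += significant
    return peeking_hits / trials, final_hits / trials


peeking, final = aa_false_positive_rates()
print(f"false positives when peeking at every check: {peeking:.1%}")
print(f"false positives with a single final check:   {final:.1%}")
```

By construction the peeking rate can never be lower than the single-check rate, and adding more checks gives an identical pair of ads more chances to look “different.”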
In my opinion, the better way is to decide whether you really need to look at more than just clicks. If it’s just about minor variations in ad copy, I’ll test for clicks and that’s it. In the time others take to go after the absolute truth, I can iterate through many more tests and take things much further.
Sorry, this was supposed to go under Richard’s comment…
I think we are in agreement.
Or if we are not, the difference between our opinions is like the difference in conversion rate between “Cheap stuff” and “Cheap things” 🙂