
Being Mean To The Average

I hate the average.

Specifically, I hate the use of the average as the predominant, sometimes only bit of information given when talking about software performance test results.  I will grant that it’s easy to understand, it’s easy to calculate, and I know that it’s one of the few mathematical concepts that can be passed along to management without much explanation.  However, it’s often misleading, presents an incomplete picture of performance, and is sometimes just plain wrong.  What’s worse is that most people don’t understand just how dangerously inaccurate it can be, and so they happily report the number and don’t understand when other things go wrong. 

Calculating the average is simple.  You take the sum of a set of numbers and divide it by the number of numbers in that set.  But what exactly does that give you?  Most people will say that you get a number in the middle of the original set or something like that.   That’s where the faith in the number begins and, more importantly, where the mistakes begin.  The average will not necessarily be a number in the middle of your original set and it won’t necessarily be anywhere near any of the numbers in your original set.  In fact, even if the average turns out to be dead-center in the middle of your data, it doesn’t tell you anything about what that data looks like.

Consider, for a moment, the annual income of someone we’ll call Mr. Lucky.  Mr. Lucky’s salary starts at $50,000.  Every year, Mr. Lucky gets a $2,500 raise.  So, for five years, here’s his income:  $50,000, $52,500, $55,000, $57,500, $60,000.  Over that period, his average annual income is $55,000.  Great, smack in the middle.  Now, in the sixth year, Mr. Lucky wins a $300 million lottery jackpot.  What’s his average income over all six years?  Over $50 million a year.  However, it’s clearly wrong to claim that Mr. Lucky made $50 million a year over six years, because once you look at the data, it is obvious that the $300 million is skewing the average well away from what he was actually making at the time.
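
Mr. Lucky's numbers are easy to check for yourself.  Here's a minimal Python sketch that compares the average with the median, which is the value that really does sit in the middle of the sorted data.  One assumption on my part: the sixth year is counted as just the $300 million jackpot.

```python
from statistics import mean, median

# Mr. Lucky's income: five salaried years, then the lottery year.
# Assumption: the sixth year counts only the $300 million jackpot.
salary_years = [50_000, 52_500, 55_000, 57_500, 60_000]
all_years = salary_years + [300_000_000]

print(mean(salary_years))    # 55000, smack in the middle
print(median(salary_years))  # 55000
print(mean(all_years))       # a bit over 50 million
print(median(all_years))     # 56250.0, still close to what he actually earned
```

One outlier drags the average off by three orders of magnitude, while the median barely moves.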

Let’s take another example, one closer to home.  Every month, you get an electric bill.  Since heating and cooling are often the largest chunks of power consumption, the bill will have the average temperature for the month, in order to help you make sense of the fluctuating charges.  This year, you get a bill for $200.  Shocked, you pull out last year’s bill to compare, and discover that you paid only $50 then.  Last year, according to the bill, the average temperature was 54.3 degrees, and this year it was 53.8 degrees.  The average temperature was roughly the same, so your heating/cooling shouldn’t have changed that much.  You didn’t buy a new TV or an electric car, you turn off the lights when you’re not in the room, you shut down the computer at night, you’ve got CFL bulbs everywhere, and as far as you know, the neighbors aren’t tapped into your breaker box to power the grow op in their basement.  So…  What happened?  Let’s take a closer look at that weather…

Daily Temperature Last Year:


Daily Temperature This Year:


Once you look at the actual daily temperature, it becomes clear what happened to your power bill.  Last year, the temperature was fairly constant, but this year, there were wild temperature swings.  You had your AC cranking full blast for the first part of the month, then you kept nudging up the thermostat during the end of the month.  However, since the temperature extremes offset one another, the average temperature makes it seem like both months had the same weather.
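
The same effect shows up in a few lines of Python.  The temperatures below are invented for illustration (they are not the readings from the bills above); the only point is that two series can share an average while looking nothing alike.

```python
from statistics import mean, pstdev

# Invented daily temperatures, chosen so that both series average 54 degrees.
steady = [52, 54, 56, 54, 53, 55, 54, 54]
swingy = [85, 88, 90, 84, 22, 20, 25, 18]

for name, temps in (("steady", steady), ("swingy", swingy)):
    # The spread (standard deviation) and the range expose what the mean hides.
    print(f"{name}: mean={mean(temps)} spread={pstdev(temps):.1f} "
          f"min={min(temps)} max={max(temps)}")
```

Both series report a mean of 54, but the spread and the min/max make it obvious which month had the heater and the AC fighting each other.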

That’s the key:  Once you look at the data, it’s often clear that the average doesn’t tell the whole story, or worse, tells the wrong story.  And the problem is that people typically don’t look at the data.

Let me pull this back to the world of software performance testing and tell a story about a product I’ve worked on and how I first came to realize that typical performance testing was dead wrong.  The product was a keyword processor at the heart of a larger system.  Our customer had demanded a 1.5 second response time from our overall system, and that requirement got chipped down and used up by other components on its way to the keyword processor I was involved with.  By the time it got to us, we only had a response time cap of 250 ms in which to return our data, otherwise the larger system would give up on us and continue on its way.  So, great, I thought.  I’ll just load up the system with an ever increasing number of concurrent requests and find out how many requests we can process at the same time before our average hits 250 ms.  I did that and came up with 20 requests at once.

So we set up enough machines to handle the anticipated load with a maximum of 15 requests per box at one time, so we’d have some room to grow before we needed to add capacity.  All was well.

Until, that is, the day we launched and our keyword processor kept getting dumped on the floor by the system above us.

Something was obviously wrong.  We knew what our performance was.  At 20 concurrent requests, we had a 250 ms response time, at 15 requests, we had seen an average 200 ms response time.  We’re fast enough, and we’d proved that we were fast enough.  Statistics said so!

That right there was the problem.  We trusted the wrong information.  Sure, the average response time was 200 ms at the load we were seeing, but that said absolutely nothing about the something like 30% of the requests that were hitting the 250 ms timeout.  We frantically reran the performance tests.  The results were stunning.  While the average response time did not hit 250 ms until we reached the 20 concurrent request level, we saw a significant (and SLA violating) number of requests that took more than 250 ms by the time we reached the 10 concurrent request level.
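
Here's a rough sketch of the check I should have run in the first place: count how many requests blow through the timeout, instead of only averaging them.  The sample timings are made up to mirror the situation (a 200 ms average with 30% of requests over 250 ms), and `fraction_over` is just a name I picked for the helper.

```python
from statistics import mean

def fraction_over(times_ms, limit_ms):
    """Return the share of response times that exceed limit_ms."""
    if not times_ms:
        raise ValueError("need at least one response time")
    return sum(1 for t in times_ms if t > limit_ms) / len(times_ms)

# Invented sample: 7 fast responses and 3 slow ones.
times_ms = [170] * 7 + [270] * 3

print(mean(times_ms))                # 200, comfortably under the cap
print(fraction_over(times_ms, 250))  # 0.3, nearly a third of requests time out
```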

People aren’t very happy when you tell them that a cluster size has to double…

At the time, I thought I might have just made a rookie mistake.  It was the first major system I’d done the performance testing for, and I’d had no training or mentoring.  I did what I thought was right and ended up getting burned.  Surely, I thought, everyone else knows what they’re doing.  Real performance testers using real performance testing tools will get it right.  Trouble is, I’ve since discovered that’s not the case.  Everyone else makes these mistakes and they don’t even realize that they’re making any mistakes.  And performance testing tools actively encourage testers to make these mistakes by not giving the tester the information that they really need and, in some cases, giving testers information that is just plain invalid. ((I’ve seen several cases of odd numbers coming out of VS Perf Tests, but the one I’m specifically thinking of here is the fact that VS will still report an average response time, even when you’re ramping up the number of users.  The performance of your system when you have a single user is vastly different than when you have 100 users, so the single “Average Request Time” number that it will report is just plain useless.))

So…  What do you do about it?  For starters, don’t use the average in isolation.  Pull in some other measurement.  I like using the 95th percentile for a second opinion. ((You can get the 95th percentile in Visual Studio, if you tweak a setting.  Set “Timing Details Storage” to “Statistics Only” or “All Individual Details” and it’ll start being recorded.))  The 95th percentile means that 95% of all requests take less than that amount of time.  That’s really more what you care about, anyway.  You’re probably not really concerned with the 5% that lie beyond that point, since they’re usually outliers or aberrations in your performance anyway.  This will get rid of things like Mr. Lucky’s lottery winnings.  Additionally, you’re probably not really concerned with where the average response time lies.  People often use the results of performance testing to feed into capacity planning or SLAs.  When there are dollars on the line tied to the Five Nines, why do you care about a number that you think of as the middle of the road?  You care about the worst case, not the average case.  If we’d used the 95th or 99th percentile in our initial performance tests of that keyword processor, we would not have had the problem that we did.
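
If your tool won't hand you a percentile, it isn't hard to compute one from the raw timings.  This is a minimal sketch using the nearest-rank method, which is one of several common definitions (other tools may interpolate between values and give slightly different answers).  The `percentile` helper and the sample data are my own illustration.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value that has at least
    pct percent of the data at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    # Multiply before dividing so whole-number percentiles stay exact.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Invented sample: 20 response times from 100 ms to 290 ms in 10 ms steps.
times_ms = list(range(100, 300, 10))

print(percentile(times_ms, 50))  # 190
print(percentile(times_ms, 95))  # 280
```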

But even so, the 95th percentile has its own set of issues and should also not be used in isolation.  It, too, does not tell you the complete story, and can easily hide important trends of the data.

There I go again with “the data”.  You have to look at the data.  In order to get the full picture of your system performance, you actually have to look at the full picture of your system performance.  You can’t go to single aggregate numbers here or there and call it good.

Of course, that leads to the obvious problem.  When you run a performance test, you’ll often end up running thousands upon thousands of iterations, collecting thousands upon thousands of data points.  You cannot wrap your head around that kind of data in the same way that you can see Mr. Lucky’s income or the daily chart of the temperature.

Well, not without help…

Whenever I do performance testing now, I rely on a tool that I built that will produce several graphs from the results of a performance test run.  The simplest graph to produce is a histogram of response times.


A histogram will show you the distribution of the response times for the performance test.  At a glance, you can see where the fastest and slowest responses lie, and get a sense for how the service behaves.  Are the response times consistent, with most of them around 150 ms, or are they spread out between 100 and 200 ms?  A histogram can also show you when things are acting strangely.  One piece of software I tested had most of the response times centered below 400 ms, but there was a secondary bump up between 500-900 ms.  That secondary bump indicated something was strange:  perhaps there was a code path that took three times as long to execute that only some inputs would trigger, or there were random slowdowns due to garbage collection or page swapping or network hiccups.
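
You don't need a charting library to get a first look.  The sketch below is not the tool I use, just a bare-bones stand-in: it drops response times into fixed-width buckets and prints a text bar for each one.  The 50 ms bucket width and the sample timings are arbitrary choices for the example.

```python
from collections import Counter

def histogram(times_ms, bucket_ms=50):
    """Count responses per bucket, keyed by each bucket's lower bound in ms."""
    counts = Counter((int(t) // bucket_ms) * bucket_ms for t in times_ms)
    return dict(sorted(counts.items()))

def render(hist, width=40):
    """Draw one text bar per non-empty bucket, scaled to the tallest bucket."""
    peak = max(hist.values())
    lines = []
    for start, count in hist.items():
        bar = "#" * max(1, round(count / peak * width))
        lines.append(f"{start:>5} ms | {bar} {count}")
    return "\n".join(lines)

# Invented sample: a main cluster under 200 ms and a small secondary bump.
times_ms = [120, 130, 140, 145, 155, 160, 610, 640]
hist = histogram(times_ms)
print(render(hist))
```

Even in this tiny sample, the two stragglers in the 600 ms bucket stand apart from the main cluster, which is exactly the kind of thing a single average buries.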


As you can see, there are a sizeable number of responses in that bump, enough to make you want to investigate the cause.  This potential problem would have been completely invisible if you were only concerned with the average, and the full extent would not have been known if you’d been looking at the 95th percentile. ((Likely, the 95th percentile would lie in the middle of the bump, leading you to believe that the system performance is slow overall, not that there’s an anomaly.))  However, it’s plainly visible that something strange is going on when you see the graph like that.

While helpful, simple histograms like this are not enough.  In particular, they fall down if anything changes during the test run.  If something like network latency slows down your test for a brief period of time, that will be invisible in this graph.  If you’re changing the number of users or the number of concurrent requests, then the times from all of them get squeezed together, rendering the graph invalid.  What you’re missing here is the time dimension.

One way to bring in the time dimension is to animate the histogram.  You produce multiple histograms, each representing a slice of time in your test.  That way, you can watch the behavior of your service change over the length of the performance test.  The problem I have with an animated histogram is that you can’t just glance at the information.  You have to watch the whole thing, which can be time consuming for a long running performance test.

Instead of animation, I prefer to visualize the time in this way:


Going up the side, you have divisions for buckets of response times.  Going across, you have time slices. ((Time isn’t marked in these examples because the tool I use to generate them will usually generate other graphs, as well, making it possible to correlate the time on that graph with the time on this graph.  For these examples, it’s not really necessary to know how much time each column is or how many requests are represented by each block.))   Essentially, this is what you’d see if you stacked a bunch of histograms together side by side, then looked at them from the top.  It’s basically like a heat map or a graph of the density of the response times.  The red zones have the most responses, while the green areas have the least. ((And the white zones are for loading and unloading only.  There is no parking in the white zones.))  In the example above, you can see that there are a lot of responses in the 100 and 150 ms buckets and then it quickly trails off to zero.  There’s a lot of noise up to about 600 ms, and sporadic outliers above that.  The performance remains fairly stable throughout the test run.  All in all, this is a fairly standard graph for a well behaved piece of software. ((For comparison, this graph is from the same test run that produced the “Average Response Time: 149 ms” histogram that I showed earlier.))
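
The data behind a graph like this is just a two-dimensional count.  Here's a hedged sketch of the idea, not my actual tool: each sample is assumed to be a `(seconds_since_start, response_ms)` pair, the slice and bucket sizes are arbitrary, and text characters stand in for the colors.

```python
from collections import Counter

def density_grid(samples, slice_s=10, bucket_ms=50):
    """Count responses per (time slice, response-time bucket) cell."""
    grid = Counter()
    for seconds, ms in samples:
        cell = (int(seconds // slice_s), int(ms // bucket_ms) * bucket_ms)
        grid[cell] += 1
    return grid

def render(grid, bucket_ms=50):
    """Rows are response-time buckets (slowest on top), columns are time slices."""
    if not grid:
        return ""
    shades = " .:*#"  # blank means no responses, '#' is the densest cell
    peak = max(grid.values())
    last_slice = max(s for s, _ in grid)
    top_bucket = max(b for _, b in grid)
    rows = []
    for bucket in range(top_bucket, -1, -bucket_ms):
        cells = []
        for s in range(last_slice + 1):
            count = grid.get((s, bucket), 0)
            # Any non-zero count gets at least '.', and the peak gets '#'.
            level = 0 if count == 0 else 1 + (count * (len(shades) - 2)) // peak
            cells.append(shades[level])
        rows.append(f"{bucket:>5} | " + "".join(cells))
    return "\n".join(rows)

# Invented samples: (seconds since test start, response time in ms).
samples = [(0, 120), (1, 130), (12, 140), (15, 610)]
grid = density_grid(samples)
print(render(grid))
```

Use the same `bucket_ms` for both calls, otherwise the rows won't line up with the cells that were counted.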

These density graphs aren’t terribly interesting when things are well behaved, though.  Here’s another graph I saw:


First, there are bands of slowness between about 210 and 280 and between 380 and 450.  These bands would appear in a histogram like the secondary hill shown above.  But what a histogram isn’t going to show you is the apparent pattern in the 380-450 band.  It appears that there are groups of slow responses in a slice of time, then none for the next couple of slices, then another group of slow responses, then none, and so on.  Seeing this kind of behavior can help you find the problem faster.  In this case, the slow responses may be caused by something else running on the box that’s scheduled to run at a regular interval, like an anti-virus scanner or a file indexer, or they can be caused by something like the garbage collector waking up to do a sweep on a somewhat regular basis.

Another benefit of density graphs is that they’re still useful, even if you change the parameters of the test during the test run.  For instance, a common practice in performance testing is to increase the load on a system during the run, in order to see how performance changes.


In this example, the number of concurrent users was steadily increased over the run of the test.  In the first part of the test, you can see that increasing the number of users will directly influence the response time.  It also increases the variation of those response times:  For the first part, all of the responses came in within 100 ms of one another, but pretty quickly, they’re spread over a 300-400 ms range.  And then, during the final third of the test, everything went all kerflooey.  I don’t know what happened, exactly, but I know that something bad happened there.

I think this graph is one of my favorites:


As you can see, this graph is distinctly trimodal.  There’s a steep, but well defined band shooting off the top, then a wide and expanding band in the middle, followed by a sharply narrow band with a shallow slope.  I like this graph because it doesn’t actually show anything that’s wrong.  What it illustrates is the huge impact that your test inputs can have on the results.  This test was run against a keyword index system.  The input I used was a bunch of random words or phrases.  Different words caused the keyword index system to do different things.  The shallow band at the bottom was created by the keywords for which the index system found no results.  When it didn’t find anything, the system simply returned immediately, making it very fast.  The middle band was filled with keywords that found a single result.  The top band was the set of keywords that found multiple results, which required extra processing to combine the results into one before returning.
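
If you can tag each request with what kind of input it was, you can confirm a split like this directly instead of eyeballing it.  A small sketch, assuming each sample is a `(category, response_ms)` pair; the category labels and timings are invented for the example and are not measurements from that test.

```python
from collections import defaultdict
from statistics import median

def summarize_by_category(samples):
    """Group response times by input category and summarize each group."""
    groups = defaultdict(list)
    for category, ms in samples:
        groups[category].append(ms)
    return {
        category: {
            "count": len(times),
            "min": min(times),
            "median": median(times),
            "max": max(times),
        }
        for category, times in groups.items()
    }

# Invented samples, labeled by how many results the keyword found.
samples = [
    ("no results", 20), ("no results", 25), ("no results", 30),
    ("one result", 150), ("one result", 210), ("one result", 180),
    ("multiple results", 400), ("multiple results", 520),
]
summary = summarize_by_category(samples)
for category, stats in summary.items():
    print(category, stats)
```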


Performance tests are the same as any other test:  your goal is to find problems with the software.  You’re not going to find them if you’re only looking at the average.  So, the next time you’re involved with a performance test, remember: Look deeper.  There are more problems to be found under the surface.

