Meaningful Performance Measures

My first experience of performance-based acceptance testing was on a conference call discussing whether a particular piece of software was ready to go live. At the time, I was walking along a busy road, taking the call on my company mobile handset. A good friend of mine, Nick, was the head of the third-party company explaining the performance of the software. I was one of the technical representatives of the company planning to use that software, and one of our internal business people was the end client of the project.

We had some numbers from measuring the system. It could handle “50 v-users”, with some qualifying points about how the performance test performed a very rare but slow operation for each “v-user”.

I listened as the business person’s doubts grew into a “NO GO” decision. We didn’t have enough information to justify going live.

Why is 50 V-Users A Bad Metric?

The simple answer is: how could it be a good metric?

What did the business person want to know? They wanted to know whether the system could tolerate normal and peak usage. The “v-user” was a metric invented by the test automation designer, and didn’t correlate to anything real-world, especially with the added complexity of the test script using the rare operation far more often than real life would.

Establishing Acceptance

  • Measure or estimate real-life traffic
  • Convert it into a target speed that the system must at least reach
  • Express this in tangible units
  • Demonstrate the extent to which the system can handle it – ideally using the Go Down The Pub Method, where the demonstrated speed is so far above what’s needed that nobody needs to talk about it any more! (A small sketch of this conversion follows below.)
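
As a rough sketch of the first three steps, here’s one way a measured traffic sample might be turned into a target rate. This is a minimal, illustrative Python sketch; the timestamps, one-minute window and 2x headroom figure are my own assumptions, not numbers from any real project.

```python
# Minimal sketch: turn a measured traffic sample into a target rate.
# The sample timestamps, window size and headroom multiplier are illustrative
# assumptions, not figures from the project described in this article.
from collections import Counter
from datetime import datetime

def required_rate(timestamps, window_seconds=60, headroom=2):
    """Return the rate (events per window) the system must at least reach."""
    # Bucket events into fixed windows and find the busiest one.
    buckets = Counter(int(ts.timestamp()) // window_seconds for ts in timestamps)
    observed_peak = max(buckets.values())
    # The acceptance target is the observed peak plus room for growth.
    return observed_peak * headroom

# Example: three requests in the same minute -> a target of 6 per minute at 2x headroom.
sample = [datetime(2020, 1, 1, 9, 0, s) for s in (5, 20, 40)]
print(required_rate(sample), "requests per minute")
```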

When your stakeholders understand the units, then they can say yes to the solution. If you discuss the metrics with them long before producing a solution, then it’s possible to set an expectation.

Synchronous and Asynchronous Operations

With synchronous systems, the requirement is that the highest peak can be safely tolerated without bad things happening. This is easier to calculate – the system either works at peak or it doesn’t – but it is harder to achieve, often requiring tuning and similar effort.

Conversely, asynchronous processing is easier to achieve – at peak loads, the process just falls behind a little – but it’s harder to calculate what the real speed needed actually is.

For the “Go Down The Pub” method, an async process can, within a sensible window, achieve an average speed higher than any real-world peak. Thus it works.
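
To make “falls behind a little” concrete, here’s a minimal sketch of a backlog calculation. The arrival pattern and processing rate below are invented numbers purely for illustration.

```python
# Minimal sketch of why asynchronous processing only needs to win "on average":
# the backlog grows during a peak minute but drains again soon afterwards.
# The arrival pattern and processing rate are invented, illustrative numbers.
def backlog_over_time(arrivals_per_minute, processing_rate_per_minute):
    """Return the queue depth at the end of each minute."""
    backlog = 0
    history = []
    for arrivals in arrivals_per_minute:
        backlog = max(0, backlog + arrivals - processing_rate_per_minute)
        history.append(backlog)
    return history

# A quiet period with one sharp peak; the system processes 100 items a minute.
arrivals = [50] * 10 + [300] + [50] * 10
print(backlog_over_time(arrivals, 100))
# The backlog jumps to 200 at the peak, then drains by 50 a minute and is gone
# four minutes later: the process falls behind a little, then recovers.
```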

In practice, though, it is a good idea to measure a few things (a small analysis sketch follows the list):

  • Average load over a long window – this is the “survival rate” of the system: even if it can’t tolerate a peak immediately, it won’t fall behind too much
  • Common peaks – this is the OK rate of the system – achieving this rate means you’ll handle 80% of the events within a short time
  • Occasional peaks – this is the ideal rate to stay ahead of the traffic
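
As an illustration of those three measurements, here’s a minimal sketch that derives them from per-minute traffic counts. The percentile cut-offs (80th for common peaks, 99th for occasional ones) and the sample data are my own assumptions, chosen just to show the shape of the calculation.

```python
# Minimal sketch: derive the three rates from per-minute traffic counts.
# The 80th/99th percentile cut-offs and the sample data are assumptions for
# illustration, not figures from the article.
import statistics

def traffic_profile(per_minute_counts):
    ordered = sorted(per_minute_counts)

    def percentile(p):
        return ordered[min(len(ordered) - 1, int(p * len(ordered)))]

    return {
        "average": statistics.mean(ordered),   # the "survival rate"
        "common_peak": percentile(0.80),       # covers 80% of the minutes
        "occasional_peak": percentile(0.99),   # stays ahead of nearly all traffic
    }

# Example: mostly quiet minutes, a fair few busy ones, and the odd burst.
counts = [40] * 70 + [120] * 28 + [400] * 2
print(traffic_profile(counts))
# {'average': 69.6, 'common_peak': 120, 'occasional_peak': 400}
```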

What are the Units?

On the whole, we’re talking about requests or units of work per time window. For a very busy website we may talk about hits/second. For a long-running batch process, we may be talking about batches per hour, or messages per minute.

The key thing is to choose a time scale that’s a slice of the SLA time. For example, if the user SLA is 4 seconds, then you need a per-second timescale for the metric. If the SLA is in minutes, it makes sense to set the metric in minutes or seconds.

Another consideration is whether the rate ends up at a tangible human scale in that unit. For example, I’m working on a system which should be able to process 4,000 records a minute. That’s also 240,000 records an hour. The latter is harder to grasp than the former. We might have chosen the even more tangible figure of roughly 67/second, but the SLA is in minutes, so that swung it to minutes.
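
As a small illustration, the same target can be expressed in the obvious units and compared by eye; only the 4,000 records a minute figure is from the example above, the helper itself is just a sketch.

```python
# Minimal sketch: express one throughput target at several time scales so the
# most tangible unit can be chosen. Only the 4,000/minute figure comes from
# the article; the helper is illustrative.
def express_rate(per_minute):
    return {
        "per second": round(per_minute / 60, 1),
        "per minute": per_minute,
        "per hour": per_minute * 60,
    }

print(express_rate(4000))
# {'per second': 66.7, 'per minute': 4000, 'per hour': 240000}
# 4,000 a minute is easy to grasp; 240,000 an hour is not, and the SLA was in
# minutes, so the per-minute form was the one that stuck.
```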

I had a document processing system where we were trying to reach 30,000 an hour, which was a good number as we knew the real-world process delivered 20,000 documents in a big batch every day.

So, the unit is chosen at a relevant order of magnitude where the numbers seem familiar, and the time window has some real-world element.

Units and Exceeding Minimum Performance

A general rule of thumb, when discussing whether there’s enough headroom in the system, is to look at the achievable speed relative to the peaks in the traffic. You want to be able to show a multiplier: the capacity of the system allows X amount of room for growth.

In the original example from this article, which related to car rental, the actual number of users and contracts and operations was not tangible to the business. They just wanted to know: will it scale?

So, we measured all the operations the system performed, took a blended average-case scenario of the system as a whole, and came up with the rate at which everything was used at its peaks. The answer was 60, 30, 20, 10. That was the number of operations of each type that would have to be conducted in a minute to reproduce peak behavior, as evidenced by graphs of system load.

Nobody wanted to think about those numbers, so we just called that “Peak load x 1”. Then we could load test and show performance at “Peak load x 2”. If that was OK, then we could say the system had at least twice the required capacity, giving the business room to scale.
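
As a sketch of how that blended mix scales for a load test, something like the following works; the operation names are invented placeholders, and only the 60/30/20/10 per-minute rates come from the example above.

```python
# Minimal sketch: scale a blended "Peak load x 1" mix up for a load test.
# The operation names are invented placeholders; only the 60/30/20/10
# per-minute rates come from the example in this article.
PEAK_X1 = {"search": 60, "quote": 30, "book": 20, "amend": 10}  # ops per minute

def scale_peak(mix, multiplier):
    """Return the per-minute operation counts for 'Peak load x multiplier'."""
    return {op: rate * multiplier for op, rate in mix.items()}

print(scale_peak(PEAK_X1, 2))
# {'search': 120, 'quote': 60, 'book': 40, 'amend': 20} - the "Peak load x 2" mix
```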

In general, to achieve acceptance, getting to 2x the required NFR is good.

To achieve “Going Down The Pub”, going from one tangible unit to the next is the trick. For example, when we had 20,000 documents a day, and we could demonstrate the system could do a day’s work in an hour, it was convincing enough to stop people worrying about catastrophic peaks.

Similarly, if the system is meant to perform 1,000 operations an hour and we reach 1,000 a minute, everyone will stop talking about performance.

This is more psychological than mathematical. Day to hour is a 24x improvement, but hour to minute, or minute to second, is 60-fold. The point is that we’re trying to prove the headroom we have in a tangible way.

Final Thoughts

Once you’ve chosen your units, stick with them. So choose them wisely. Messing around with the details of acceptance criteria causes more uncertainty. Repeated conversations around performance tuning show real benefit when the units are consistent and people can see how things change from one run to the next.
