February 4, 2015

Enthused About: Test Suite Reduction

Software testing can be a bit like drinking from a fire hydrant. For every "degree of freedom" in an application, the number of potential test cases for that application increases exponentially. To cope with this explosion, one of the most challenging (and consequently, the most interesting) activities carried out by a tester is to reduce the number of test cases required for "proper" verification.

All of this boils down to a rather classic question from software testing: How much testing is "enough"? As with most interesting questions, the answer is some form of "it depends." At a high level, though, I like to define an application-specific notion of "enough" in some terms of coverage.

Let's imagine that we are trying to test a simple calculator. It has three simple input fields:

  • Subtotal
  • Zip Code
  • Shipping Method
and a button "Calculate". Upon clicking "Calculate" it displays a field "Total", which is a computed total cost of Subtotal + Tax + Shipping, based on choices of Zip Code and Shipping Method. We need to test this calculator to make sure that the calculations produced are actually correct. So where's the fire hydrant?

Let's assume Subtotal is a standard text input field on a website. Let's assume that Zip Code is a type of "Drop Down List", allowing the user to select from known, valid Zip Codes. Shipping method is a similar drop down with 3 choices. So we have any possible string of 0 or more characters as Subtotal input, some 1000s of possible Zip Codes, and 3 shipping methods. As far as we know, this results in an Infinite number of possible test cases (1000s * 3 * unlimited combinations of characters for Subtotal):

  • 00001 + Method 1 + a
  • 00001 + Method 1 + aa
  • 00001 + Method 1 + aaa
  • .
  • .
  • 00001 + Method 1 + aaaaaaaaaaa.....

To test each of these possibilities is literally impossible! The first thing we might want to do is to ask our application developer (or check the code ourselves) to see whether there is actually some limit imposed on the length of Subtotal. At least this would get us into the realm of working with a finite number of test cases: (3*1000s * number of characters supported * max length supported). But we're still talking a LOT of test cases here!

Probably the most popular type of reduction to be applied here would be the introduction of equivalence class partitioning. We can introduce groups of values that we consider to be equivalent. This is most readily applied to Subtotal in our current problem, which may lead to test cases like:

  • 00001 + Method 1 + Valid Amount
  • 00001 + Method 1 + Empty Amount
  • 00001 + Method 1 + Invalid Amount
  • 00001 + Method 1 + Valid Amount with Fixable Format (e.g., missing digits after decimal)
  • ...
Actually, we are introducing a notion of coverage here by only covering one value per equivalence class. For this simple problem, we might keep going with equivalence classes by looking into shipping rate categories, etc. But let's look a more general notion of coverage here.

When we use coverage to reduce our number of test cases, what we are really doing is saying that we consider two or more test cases to be "equivalent because the cover the same ______." When we reduce via equivalence classes, we are saying that test cases using values from the same equivalence class are covering the same equivalence class. If all of the values in each class are really equivalent, this is redundant and unnecessary.
  • (00001 + Method 1 + 1.25) == (00001 + Method 1 + 1.26)
Here's where the fun begins. First, we can actually define coverage on all kinds of domains - not just input values. For example, the development of many applications is driven by requirements. We may say that two test cases are equivalent if they cover the same requirement. We may look at execution artifacts to consider redundant coverage of lines of code, or HTTP calls, or Database queries. All of these could lead to some reduction in the number of test cases that we consider essential to our testing of an application.

We could also introduce some outside artifacts into our consideration. For example, analytics can tell us the actions which users tend to perform. Any test cases that come from outside of known use cases are at least less important, if not completely unnecessary.

Finally, we can go one level deeper beyond coverage of single variable values. Let's revisit our test cases, assuming that we have introduced the following equivalence classes:

  • Subtotal: Valid, Empty, Invalid, Valid But Correctable
  • Zip Code: Zip Codes in each US State
  • Shipping Method: Method 1, Method 2, Method 3

If we want to exhaustively test as before, we would need only 4 * 50 * 3 test cases thanks to our equivalence classes; but maybe 600 test cases is still more than we've bargained for. We can further reduce our total number of test cases by considering the coverage of values ACROSS variables. It's only important to test EVERY type of Subtotal with EVERY type of Zip Code and EVERY type of Shipping Method if we believe that bugs can happen which only show up for specific combinations of ALL THREE variables. Three is not actually very many, but on larger systems the problem quickly gets worse. How many inputs are there, for example, on a modern web application like Twitter or GMail? (See below for some more thoughts on "coverage" and these modern web applications!)

The ACTS project at NIST has collected empirical data on just this topic: How many "interacting variables" must I consider when constructing test cases for software? Using data from across multiple types of software, they found rather consistent results. It turns out that 90% of all failures reported on these various systems could be found by considering interactions only 3 or fewer variables, and that a consideration of pairs of variable values can lead to gains of 2-3x in terms of failures (bugs) found.

Let's see the types of gains this can give us on our toy example. What if we only cared about variables in isolation? We would only need 50 test cases:
  • Alabama Zip Code + Method 1 + Valid
  • Alaska Zip Code + Method 2 + Empty
  • Arizona Zip Code + Method 3 + Invalid
  • Arkansas Zip Code + Any Method + Valid But Correctable
  • California Zip Code + Any Method + Any Subtotal
  • Colorado Zip Code + Any Method + Any Subtotal
  • ... (one more test case for each remaining state)
After executing these 50 test cases, we could claim that we have tested every equivalence class of every input into our calculator - but the ACTS data suggests that we can do much better if we consider the interactions of pairs of variables. What would those test cases look like?

  • Alabama + Method 1 + Valid
  • Alabama + Method 2 + Valid
  • Alabama + Method 3 + Valid
  • Alaska + Method 1 + Valid
  • ...
Let's think about how many unique test cases are actually necessary here. We need every State's zip codes combined with each class of Subtotal, at a minimum. This would give us 50 * 4 test cases at a minimum. As we saw in our example of isolated variable values, we will have a number of test cases for which ANY value of variable could be used. We can take advantage of this in order for test cases to cover unique combinations of (State, Subtotal), (State, Method), and (Method, Subtotal). Since there are many more (State, Subtotal) combinations than (State, Method) and (Method, Subtotal), this max of 200 should be all that we would require for "pairwise testing" of these variables.

In this post, I've discussed the notion of Test Suite Reduction by Coverage. We talked about reduction by equivalence class, reduction on arbitrary domains, and pairwise (and really, any "n"-wise) reduction. I also pointed to some empirical evidence of the effectiveness and efficiency of n-wise reduction.

If you found any of that interesting, you may also be interested to look at my advisor's work on the testing of event-driven systems like modern GUIs and Web applications. In these systems, the input of a test cases is not just variables and values, but also a sequence of events (which can be modeled as variables and values). Covering n-length sequences of events for test purposes is also a very interesting problem, and one I'm attacking in my forthcoming dissertation.

With that, farewell and Happy Testing!

1 comment: