This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.
Whether we are writing an individual unit test or designing a product’s entire testing process, it is important to take a step back and think about how effective are our tests at detecting and reporting bugs in our code. To be effective, there are three important qualities that every test should try to maximize:
When the code under test is broken, the test fails. A high-fidelity test is one which is very sensitive to defects in the code under test, helping to prevent bugs from creeping into the code.
Maximize fidelity by ensuring that your tests cover all the paths through your code and include all relevant assertions on the expected state.
A test shouldn’t fail if the code under test isn’t defective. A resilient test is one that only fails when a breaking change is made to the code under test. Refactorings and other non-breaking changes to the code under test can be made without needing to modify the test, reducing the cost of maintaining the tests.
Maximize resilience by only testing the exposed API of the code under test; avoid reaching into internals. Favor stubs and fakes over mocks; don't verify interactions with dependencies unless it is that interaction that you are explicitly validating. A flaky test obviously has very low resilience.
When a test fails, a high-precision test tells you exactly where the defect lies. A well-written unit test can tell you exactly which line of code is at fault. Poorly written tests (especially large end-to-end tests) often exhibit very low precision, telling you that something is broken but not where.
Maximize precision by keeping your tests small and tightly focused. Choose descriptive method names that convey exactly what the test is validating. For system integration tests, validate state at every boundary.
These three qualities are often in tension with each other. It's easy to write a highly resilient test (the empty test, for example), but writing a test that is both highly resilient and high-fidelity is hard. As you design and write tests, use these qualities as a framework to guide your implementation.
It can be tempting to add a ton of scenarios and test cases to acceptance criteria for a story, and look at all possible variations for the sake of completeness. Teams who automate most of their acceptance testing frequently end up doing that. Although it might seem counter-intuitive, this is a sure way to destroy a good user story.
The core of the problem is that a large number of complex scenarios is OK for a machine to process, but not for humans to understand easily. Fast iterative work does not allow time for unnecessary documentation, so acceptance criteria in user stories often ends up being the only detailed specification most teams produce. If that specification is complex and difficult to understand, it is unlikely to lead to good results. Complex specifications don’t invite discussion – people tend to read them in isolation and selectively ignore parts they feel are less important. This can lead to an illusion of precision and completeness. The overwhelming amount of data makes everyone feel secure that someone has the entire picture, but in fact people can easily understand things differently.
Here is a typical example:Feature: Payment routing In order to execute payments efficiently As a shop owner I want the payments to be routed using the best gateway Scenario: Visa Electron, Austria Given the card is 4568 7197 3938 2020 When the payment is made The selected gateway is Enterpayments-V2 Scenario: Visa Electron, Germany Given the card is 4468 7197 3939 2928 When the payment is made The selected gateway is Enterpayments-V1 Scenario: Visa Electron, UK Given the card is 4218 9303 0309 3990 When the payment is made The selected gateway is Enterpayments-V1 Scenario: Visa Electron, UK, over 50 Given the card is 4218 9303 0309 3990 And the amount is 100 When the payment is made The selected gateway is RBS Scenario: Visa, Austria Given the card is 4991 7197 3938 2020 When the payment is made The selected gateway is Enterpayments-V1 ....
…. followed by ten more pages of similar stuff.
The team that implemented the related story suffered from a ton of bugs and difficult maintenance, in my opinion largely caused by the way they captured their examples. This long, monolithic list of examples was difficult to break up, so it was only possible for one pair of developers to work on it instead of splitting work. Because of that, it took several weeks for the implementation to be initially available for business users to try out. The large chunk of work needed to be properly tested, so another week or so passed until it was ready to go live, and then the fireworks started. The false sense of completeness prevented the delivery team from discussing important boundary conditions with business stakeholders, and several important cases were obvious to different people in different ways. This surfaced only after a few weeks of running in production, when someone spotted increased transaction costs.
Although each individual scenario might seem understandable, pages and pages of this make it hard to see the forest for the trees. Of course, this wasn’t implemented as one huge If-Else statement, but some critical variations came in only on page six or seven. Complemented by hotfixing under pressure to resolve production issues, the result was pretty much a patchwork of special cases.
These examples tried to show that, in cases when several processors could potentially execute a card transaction, high risk transactions should be sent to a more expensive processor with better fraud controls, and low risk transactions need to go to a cheaper processor even if it has worse fraud controls. Important business concepts such as transaction risk score, processor cost or fraud capabilities were not captured in the examples – the technical model differed from the business model. Changing small thresholds for risk scoring required huge changes to a complex network of special cases in software, and a ton of unexpected consequences. When a cheaper payment gateway suddenly had better fraud control than a more expensive one, several weeks of testing were needed to adjust the system to benefit from it.
An overly complex specification is often a sign that the technical model is misaligned with the business model, or that the specifications are described at the wrong level of abstraction. Even when correctly understood, such specifications lead to software that is hard to maintain, because small changes in business can lead to disproportionately huge changes in software.
It is far better to focus on illustrating user stories with key examples – a small number of relatively simple scenarios that will be easy to understand, evaluate for completeness and criticise. This doesn’t mean throwing away precision – quite the opposite – it means finding the right level of abstraction and the right mental model that would allow a complex situation to be described better.
The payment routing case could be broken down into several groups of smaller examples. One group would describe transaction risk based on country of residence – something that business users intuitively had in their minds without ever expressing it directly. Another group of examples would describe how to score a transaction based on payment amount and country of purchase. Several more groups of examples would describe individual transaction scoring rules, focused only on the relevant characteristics of a purchase. Then one overall group of examples would describe how to combine different scores – regardless of how they were calculated. A final group of examples would describe how to match the transaction risk score with the compatible payment processors, based on processing cost and fraud capabilities. Each of these groups might have five to 10 important examples, and would be much easier to understand. Taken together, these key examples would allow the team to describe the same set of rules, much more precisely but with far fewer examples than before.Key benefits
Describing stories with several simple groups of key, focused examples leads to specifications which are easier to understand and implement. Smaller numbers of examples make it easier to evaluate completeness and argue about boundary conditions, so they make it easier to discover and resolve inconsistencies and differences in understanding. Stakeholders can have a better, more focused discussion with several smaller groups of focused examples than a with huge spreadsheet of endless possibilities. As a result, the outcome will be much more likely to satisfy the original business needs faster, and it will be less likely that unexpected consequences will be discovered only in production.
Breaking down complex examples into several smaller and focused groups leads to more modularised software, which reduces future maintenance costs. Continuing with the transaction processing example, modelling examples that explain individual scoring rules would give enough hints to the delivery team to capture those rules as separate functions, so that changes to an individual scoring threshold do not impact all the other rules. This would avoid unexpected consequences when rules change. Adding another processor, or changing the preferred processor after a cheaper one gets better fraud control, would require small localised changes instead of causing weeks of confusion.
Describing different aspects of a story with smaller and focused groups of key examples allows the team to divide work better. Two people can take the country-based scoring rules, two other people could take the routing based on final score. When everything is described as a combinatorial explosion of examples, it’s difficult to divide work. Smaller groups also become a natural way of slicing the story – for example some more complex rules could be postponed for a future iteration, but a basic set of rules could go live in a week and provide some nice business value.
Finally, focusing on key examples significantly reduces the sheer volume of scenarios which need to be checked. Assuming that there are six or seven different scoring rules in combination, and that each has five key examples, the entire problem space can be described with roughly eighty thousand examples which combine all variations (five to the power of seven). Breaking it down into groups would allow us to describe the same problem space with forty or so examples (five times seven, plus a few overall examples to show that everything holds together). This significantly reduces the time required for describing and discussing the examples, but also for checking the implementation, and provides a much better starting point for any further exploratory testing.How to make this work
The most important thing to remember is that if the examples are too complex, your work on refining a story isn’t done. There are many good strategies for dealing with complexity. Here are four that I often use as a good starting point:
- Look for missing concepts
- Group by commonality and focus only on variations
- Split validation and processing
- Summarise and explore important boundaries
Overly complex examples, or too many examples, are often a sign that some important business concepts are not explicitly described. Risk score is a good example of that in the payment routing case. Discovering these concepts will allow you to offer alternative models and break down both the specification and the overall story into more manageable chunks. You can use one set of examples to describe how to calculate the risk score, and another to describe how to use the score once it is calculated.
A very common case of hidden business concepts is mixing validation with usage – for example, if the same set of examples describes how to process a transaction and all the ways to reject a transaction without processing (card number in incorrect format, invalid card type based on first set of digits, incomplete user information etc). The hidden business concept in that case is ‘valid transaction’, and it would be much better to split a single large set of complex examples into two groups – determining if a transaction is valid, and working with a valid transaction. Those groups can then be broken down further based on structure.
Long lists often contain groups of examples similar in structure or values. For example, in the payment routing story there were several pages of scenarios with card numbers and country of purchase, a group of examples involving two countries (registration and purchase), and some examples where the actual transaction value was important. Identifying commonalities in structure is often a good first step to discovering meaningful groups. Each group can then be restructured to show only the important differences between examples, reducing the cognitive load.
The fourth good strategy is to identify important boundary conditions and focus on them – ignoring examples that do not increase our understanding. For example, if the risk threshold for low-risk countries is 50 USD, and 25 USD for high risk countries, then the important boundaries are:
- 24.99 USD from a high risk country
- 25 USD from a high risk country
- 25 USD from a low risk country
- 49.99 USD from a low risk country
- 50 USD from a high risk country
Discussing and documenting these five examples will probably give the team 90% of the value they could get from twenty pages of similar examples. Don’t aim to replace testing fully with examples in user stories – aim to create a good shared understanding, and give people the context to do a good job. Five examples that are easy to understand and at the right level of abstraction will do a much better job at that than hundreds of very complex test cases.
I read this article over the weekend about five emerging trends in software testing – Test Automation; Rise of mobile and cloud; Emphasis on security; Context-driven testing; and More business involvement.
I fully acknowledge that I work in a software development environment that isn’t like many others, but while reading the article, I really didn’t feel like any of those areas are “emerging” – all are fully emerged already. Sure, the trends are interesting to testers, but emerging? I could waste some space rebutting or commenting on the areas above, but instead, let me offer some alternate trends that I see inside of MS and from some of my colleagues who work elsewhere.
Fuzzier Role Definitions. I don’t really like the terms “whole team approach” or “combined engineering”, but I do see software teams really figuring out how to work better together and leverage every team members strengths effectively. Great testers are working as Test Specialists and working much more broadly across the team. I expect the “lines” between software disciplines to fade even more in the future.
Developers Own More Testing. You can call them “checks” if you wish (I call them “short tests”, but software developers are beginning to own much bigger portions of traditional software testing. This is a good thing – it ensures that daily code quality is high, and gives test specialists a high quality product to work with.
Testing Live Sites. Mock-test environments typically do a poor job representing production environments. Other than brief sanity checks for the most critical components, many web service teams just roll their new bits straight to production, and then run their tests against the live system. With a good monitoring system (including the ability to stage rollouts and automatically roll back if needed), this is a safe, efficient, and frankly, practical method for testing services.
Data is HUGE. Many software teams have figured out that the best way to get an accurate representation of how customers use software is collect and analyze data from those same customers. A whole lot of traditional test activities can be replaced by product instrumentation, and an efficient method for getting product instrumentation back to the team for analysis. On a lot of teams, last year’s testers are this year’s data analysts and data scientists. While not every tester is cut out for this role, this move to data analysis is a strong trend on a lot of software teams.
To critique myself for a moment, I think a lot of readers could say that none of these points are emerging either. That’s a fair point, since I know teams that have been doing everything above for years…but I’m just now seeing some of these trends “emerge” on multiple teams (and not just those testing web services or sites).
What trends do you see? Did I miss anything huge? Have the above four points already reached the tipping point of emergence?(potentially) related posts: