A tester asked me an interesting question this morning:
“How can I find old test documentation for a completed feature so I can re-use those tests on a similar new feature?”
The answer is easy. But that’s not what this post is about.
It seems to me, a skilled tester can usually come up with better tests…today, from scratch. Test documentation gets stale fast. These are some reasons I can think of:
- A skilled tester knows more about testing today than they did last month.
- A skilled tester knows more about the product-under-test today than they did last month.
- The product-under-test is different today than it was last month. It might have new code, refactored code, more users, more data, a different reputation, a different platform, a different time of the year, etc.
- The available time to perform tests might be different.
- The test environment might be different.
- The product coder might be different.
- The stakeholders might be different.
- The automated regression check suite may be different.
If we agree with the above, we’ll probably get better testing when we tailor it to today’s context. It’s also way more fun to design new tests and probably quicker (unless we are talking about automation, which I am not).
So I think digging up old test documentation as the basis for determing which tests to run today, might be a wrong reason to dig up old test documentation. A good reason is to answer questions about the testing that was performed last month.
In Connecticut, some exploratory testing types found and exploited a software flaw in lottery terminals:
An investigator for the Connecticut Lottery determined that terminal operators could slow down their lottery machines by requesting a number of database reports or by entering several requests for lottery game tickets. While those reports were being processed, the operator could enter sales for 5 Card Cash tickets. Before the tickets would print, however, the operator could see on a screen if the tickets were instant winners. If tickets were not winners, the operator could cancel the sale before the tickets printed.
It’s a condition that only occurred while the system was under processing load.
Which is why, whenever I get to do some load testing, I also like to call up the application under test and run through some basic smoke tests with it. You can find different places where resources are not available or where the load times can lead to unintended consequences–like allowing the user to click a button that renders but is hidden when the page fully loads. Or to act on data that the user should not be able to act on, as the lottery terminal displays.
Of course, you can do something like this through some network-throttling tools, but that will only really handle client-side slowdowns and problems, not necessarily issues with the server and infrastructure.
Also, it’s a way to get one more user’s worth of load on the system, and given our load testing budget most of the time, that can be a 5% increase over the 20 virtual users we have licenses for.
Finally, stringing APIs together and calling it programming doesn’t make it programming. It’s some crazy form of dependency hacking that involves the cloud, over-engineering things, and complexity far beyond what’s actually needed.
What’s worse is that if any of your code (or the 3rd party library code) has a bug or breaks, you won’t know how to debug or fix it if you don’t know how to program.
Events of the last week should make developers wary of third-party stuff, but they won’t.
In Google’s early days, a small handful of software engineers built, tested, and released software. But as the user-base grew and products proliferated, engineers started specializing in roles, creating more scale in the development process:
- Test Engineers (TEs) -- tested new products and systems integration
- Release Engineers (REs) -- pushed bits into production
- Site Reliability Engineers (SREs) -- managed systems and data centers 24x7.
This story focuses on the evolution of quality assurance and the roles of the engineers behind it at Google. The REs and SREs also evolved, but we’ll leave that for another day.
Initially, teams relied heavily on manual operations. When we attempted to automate testing, we largely focused on the frontends, which worked, because Google was small and our products had fewer integrations. However, as Google grew, longer and longer manual test cycles bogged down iterations and delayed feature launches. Also, since we identified bugs later in the development cycle, it took us longer and longer to fix them. We determined that pushing testing upstream via automation would help address these issues and accelerate velocity.
As manual testing transitioned to automated processes, two separate testing roles began to emerge at Google:
- Test Engineers (TEs) -- With their deep product knowledge and test/quality domain expertise, TEs focused on what should be tested.
- Software Engineers in Test (SETs) -- Originally software engineers with deep infrastructure and tooling expertise, SETs built the frameworks and packages required to implement automation.
The impact was significant:
- Automated tests became more efficient and deterministic (e.g. by improving runtimes, eliminating sources of flakiness, etc.)
- Metrics driven engineering proliferated (e.g. improving code and feature coverage led to higher quality products).
Manual operations were reduced to manual verification on new features, and typically only in end-to-end, cross product integration boundaries. TEs developed extreme depth of knowledge for the products they supported. They became go-to engineers for product teams that needed expertise in test automation and integration. Their role evolved into a broad spectrum of responsibilities: writing scripts to automate testing, creating tools so developers could test their own code, and constantly designing better and more creative ways to identify weak spots and break software.
SETs (in collaboration with TEs and other engineers) built a wide array of test automation tools and developed best practices that were applicable across many products. Release velocity accelerated for products. All was good, and there was much rejoicing!
SETs initially focused on building tools for reducing the testing cycle time, since that was the most manually intensive and time consuming phase of getting product code into production. We made some of these tools available to the software development community: webdriver improvements, protractor, espresso, EarlGrey, martian proxy, karma, and GoogleTest. SETs were interested in sharing and collaborating with others in the industry and established conferences. The industry has also embraced the Test Engineering discipline, as other companies hired software engineers into similar roles, published articles, and drove Test-Driven Development into mainstream practices.
Through these efforts, the testing cycle time decreased dramatically, but interestingly the overall velocity did not increase proportionately, since other phases in the development cycle became the bottleneck. SETs started building tools to accelerate all other aspects of product development, including:
- Extending IDEs to make writing and reviewing code easier, shortening the “write code” cycle
- Automating release verification, shortening the “release code” cycle.
- Automating real time production system log verification and anomaly detection, helping automate production monitoring.
- Automating measurement of developer productivity, helping understand what’s working and what isn’t.
In summary, the work done by the SETs naturally progressed from supporting only product testing efforts to include supporting product development efforts as well. Their role now encompassed a much broader Engineering Productivity agenda.
Given the expanded SET charter, we wanted the title of the role to reflect the work. But what should the new title be? We empowered the SETs to choose a new title, and they overwhelmingly (91%) selected Software Engineer, Tools & Infrastructure (abbreviated to SETI).
Today, SETIs and TEs still collaborate very closely on optimizing the entire development life cycle with a goal of eliminating all friction from getting features into production. Interested in building next generation tools and infrastructure? Join us!
While reading Paul Bloom’s The Baby In The Well article in The New Yorker, I noted the Willie Horton effect’s parallel to software testing:
In 1987, Willie Horton, a convicted murderer who had been released on furlough from the Northeastern Correctional Center, in Massachusetts, raped a woman after beating and tying up her fiancé. The furlough program came to be seen as a humiliating mistake on the part of Governor Michael Dukakis, and was used against him by his opponents during his run for President, the following year. Yet the program may have reduced the likelihood of such incidents. In fact, a 1987 report found that the recidivism rate in Massachusetts dropped in the eleven years after the program was introduced, and that convicts who were furloughed before being released were less likely to go on to commit a crime than those who were not. The trouble is that you can’t point to individuals who weren’t raped, assaulted, or killed as a result of the program, just as you can’t point to a specific person whose life was spared because of vaccination.
How well was a given application tested? Users don’t know what problems the testers saved them from. The quality may be celebrated to some extent, but one production bug will get all the press.
I found this interesting article on the St. Louis Post-Dispatch Web site:
There are a lot of test articles floating through the Internet in production systems. Why people don’t bother to turn them off after the testing is done, I don’t know.
Bonus points to you if you can spot the issue with the test article itself.
If you want to be a successful consultant, you might learn something:
What defects will you log? Whatever defects you like. What is the best methodology for testing? Whatever methodology you like. What’s the best time to start automated testing? Whatever time you like.
Well, “succeed” might be a misnomer. But you’ll certainly be employable.
If you find an escape (i.e., a bug for something marked “Done”), you may want to develop an automated check for it. In a meeting today, there was a discussion about when the automated check needed to be developed? Someone asked, “Should we put a task on the product backlog?”. IMO:
The automated check should be developed when the bug fix is developed. It should be part of the “Done” criteria for the bug.
Apply the above heuristically. If your bug gets deffered to a future Sprint, deffer the automated check to that future Sprint. If your bug gets fixed in the current Sprint, develop your automated check in the current Sprint.
In This Moment, “Big Bad Wolf”
If you need more Monday morning wolfery, see also this.
Richard Tattersall of the parish of St. George-in-the-Fields, liberty of Westminster, gentleman, as he liked to be described, had a lot of interesting claims to his name. In his youth, in mid-18th century Lancashire, he wanted to join the jacobite rebels. After a family intervention, he ran away from home, and ended up becoming a stud-groom in the service of the Duke of Kingston-upon-Hull. A unique business sense led Tattersall to start his own race-horse farm, then establish an auctioning house. Tattersalls Auctioneers’ ended up at Hyde Park Corner, and their clients included the key British nobility and even the King of France. Two and a half centuries later, France is no longer a kingdom, the dukedom of Kingston is extinct, but Tattersalls auctioneers still runs. It is the oldest such institution currently operating in the world, and Europe’s largest. Tattersall was not just a good businessman, he also knew how to entertain. Even the future king George IV was known to frequently visit Tattersall at Highflyer Hall, to enjoy ‘the best port in the land’. In his late years, Tattersall was so universally known and respected throughout the realm, that Charles Dickens wrote about Tattersall ‘no highwayman would molest him, and even a pickpocket returned his handkerchief, with compliments’. It would, then, come to great surprise to the gentleman how often software developers curse his name today.
In an epic twist of irony, people often don’t even know that Tattersall, the person, had no influence on the cause of all that pain. Tattersall had just the right mix of business sense and social skills. At his Hyde Park Corner venue, he reserved two ‘subscription rooms’ for the members of the Jockey Club. The two rooms quickly became the centre of all horse betting in the UK. The Tattersalls Committee, an successor organisation to the informal clubs from the two subscription rooms, had the legal ability to exclude people from the sport or access to racecourses in Britain all the way up to 2008. The rules and regulations they used to settle disputes still more or less govern British horse racing today. And among those rules, there is the famed 4(c), the destroyer of software models. For all the pain it caused me over the years, it’s also responsible for one of the most important lessons in improving software delivery I’ve ever come across.
For teams working on horse racing software, ‘Rule 4’ is the one thing you can always mention to disrupt a meeting, cause everyone to tell you to go to hell, and then spend the rest of the afternoon playing Desktop Tower Defence. It’s the edge case a tester can easily invoke to break almost anything. It’s the reason why people phone in the next day and insist to stay home because of some unforeseen illness. That’s because Rule 4 messes with the one thing developers hate to touch — time.
Horse racing bets are priced either at the time when a bet is placed, or at the time when the race starts. Punters generally prefer the first, called ‘fixed odds’ because they know what they’re expecting. The price and the potential winnings are fixed, hence the name, and that’s the rock solid agreement between a punter and a bookie. Such a premise allows developers to design some nice and elegant settlement models, and then deal with the difficult stuff they like to solve, such as latency, throughput, and performance optimisations. But then Tattersall’s Rule 4 kicks in. It controls what happens in a case when one of the horses doesn’t show up for a race. If, for example, the favourite ends up in a ditch somewhere ten minutes before the race, all the people betting on the second favourite now stand to win a lot more than the bookies expected. Rule 4 allows a bookie to pay out the bets on the other horses as if the favourite never existed. This means going back in time, recalculating the odds for all the other runners, and applying the new price. With weird math, that isn’t exactly completely logical. For each single bet, at the time when that bet was placed. This is where the stomach problems start. Developers realise they need to start saving a lot more information at a time when the bet is placed, mess up with the nice elegant settlement models, break system performance, and a ton of other things. That’s why, generally, once Rule 4 is finally implemented correctly, that piece of code is off-limits. Nobody gets to even view it on a screen, to prevent an accidental Heisenbug.
Rule 4 makes a lot of sense when one of the favourites misses the race, but it’s also applied to Limpy Joe, the horse that almost died in the previous race of old age and boredom. It can lead to weird and pointless complaints about why someone got 5 pence less than expected, and it can cost more in wasted customer support than it protects against fraud. I once worked with a company that tried to save a bit of money on customer servicing, and do something nice for its punters at the same time, by not taking advantage of Rule 4 below a certain threshold. Switching rules on and off wasn’t such a big deal, it turned out that they could actually do it themselves, and everyone was happy. No need for anyone to stay home playing Desktop Tower Defence.
But then, one day, they asked us to change the monthly customer statements, and print out the bet results as if Rule 4 was still applied, then add the deduction back. Not only did this mess with time, but it messed with it twice. It required us to record something that didn’t happen as if it happened, combine it with a ton of other rules that did actually happen, and then flip back the whole thing again. And they asked us to do that not at the point when the bets were recorded or settled, but when the monthly statements were produced. This required keeping a lot more information so we could settle all the other rules backwards and forwards in time, break system performance, and change the one part of the system that nobody wanted to touch. Some of the additional rules were implemented by third parties, so we’d have to chase them to change their code as well. Of course, the whole thing needed to be done as quickly as possible, ideally yesterday. Lots of towers were successfully defended that day.
The next week, on a visit to the call centre, I sat next to an operator who was one of the people insisting we ‘fix the statements’, and watched him deal with a disgruntled punter. The person on the other end of the line heard about the Rule 4 promotion, and called in to complain that he was short-changed. In fact, some other rules produced an odd value for the final payout. The operator was stuck explaining that the amount on the statement is OK, and he had to take the punter through the whole weird math required to settle a bet according to all the other on-going promotions. A few minutes after that call, another one came in. Instead of saving money on customer support, the Rule 4 voiding idea made it worse. It was clear that the interaction of rules was causing the confusion, not Rule 4 itself. I asked about why they singled out that one, and the operator said that they were going to ask for all the other promotions to be pulled out into separate line items as well, just later. They didn’t want to overload us with work, so that Rule 4 could get done quickly.
Observing the problem first-hand, I could see the whole mess, and why they wanted something done about it. But just thinking about the implications on our nice, clean, elegant software models made my stomach turn. Luckily, being next to the one person who suffered the most, and finally understanding what’s going on, I could propose an alternative. What if, instead of actually redoing the numbers to list individual calculation components, we just listed the names of the special promotions applied to a bet? So punters would immediately see that there was more than one thing going on, and that their Rule 4 promotion still applied. That information was already available in the database. The fact that the solution could be deployed in a few days sold it easily. The trick with labels reduced the confusion, and not just for Rule 4, but for all the other promotions they were going to ask about later as well.
That day, I learnt that how important it is to spend time observing people actually using the software that we were building. Sitting next to a user allowed us to together flush out all sorts of weird and wonderful assumptions, and work together on ideas that solve real problems, not just plaster over a huge crack. Together, we discovered insights that nobody could predict. That’s just common sense, so surely everyone is doing it by now, right? Not so much.
With the previous post, I asked people to fill in a quick questionnaire about how frequently they observe people using their software. With slightly more than 700 responses, I can’t really claim any kind of universal statistical relevance for the whole industry. On the other hand, given that you’ve self-selected into a group by reading this, the data should be relevant for teams similar to yours.
Roughly fifty percent of the survey participants said that had no direct interaction with end-users, ever. They’ve never seen an actual user work with their software. Only about 10% teams, across the whole group, actually engage with their users every week.
I asked separately about observing end-users testing future software ideas, pretty much the key aspect of getting any sort of user experience research executed. The numbers are even worse. Only about 6.4% of the respondents said that they do that on a weekly basis. In a fast moving industry such as software today, where bad assumptions and communication problems can cause serious damage, that’s just depressive.
This is both good and bad news. For most people reading this, the bad news is that you’re not benefitting from all the insight that you could easily collect. The good news is that the bar is so low at the moment, that you can easily change your delivery process a bit and be far better than the competition.
If you’re a developer or a tester, figure out an excuse to sit with the actual users for a few hours every week. If you’re managing a team, think about sending the group to observe actual users periodically. I guarantee this will significantly improve the software you deliver. It doesn’t matter if your company already uses a UX specialist agency to deliver the decisions, or if you have someone else already tracking the usage patterns and results. Go and see things for yourself, you will learn stuff you could have never predicted before.
For example, we recently added text notes to MindMup 2.0. This was one of the most requested features on the user forums, so we had a ton of data to start with. Instead of just charging ahead with the ideas collected through initial requests, we spent a week building a rough version, and then invited actual users to try things out. The results were surprising. We got a bunch of assumptions about ordering and exporting wrong, and our users wanted something significantly simpler than what we planned to develop. As a result, we were able to launch that probably a month ahead of the schedule, and make it a lot more intuitive.
Of course, there will be plenty of excuses why spending time with users isn’t possible. Especially if they are not easily available. The survey results broken by type of software delivery clearly show that. Reduced only to software delivered internally, roughly 27% of the participants said that they observe users working with their software at least once a month. This is significantly better than just 16.5% of those working on consumer software. Enterprise B2B, of course, ends up in the last spot with roughly 14%.
Don’t let the fact that your users are remote, or numerous, stop you from talking to them. Even for consumer-oriented products, getting this kind of feedback is easier than it seems. I go to lots of software conferences, and I always try to get a few people to try out MindMup between conference sessions. It doesn’t cost us anything, and most of the people I approach are willing to spend a few minutes helping us out, especially during long boring lunch breaks. If such direct contact isn’t possible for your team, think about remote screen-sharing sessions. It’s not the best way to do user research, and I’d love to have one of those hi-tech rigs that tracks eye movement and blood pressure that ad agencies use, but that’s far beyond my budget. However, even a simple screen share session opens up an incredible amount of insight. The text notes research we did for MindMup 2.0 was done 100% over remote screen sharing, and we had people participating from all over the world. Just get people to think out loud while they are clicking around the screen, and be ready to get surprised.
This is a login screen before you can use the application, with an account name, password field, and a Log In! button.
There is bubbly copy and a licensed stock image of a bearded man holding a small boy.
> check copy
The copy is cheery, but not particularly informative. In a stunning turn of events, the words are all spelled correctly, AND they've remembered the serial comma.
> mouseover image
The title and alt text are set for the image and read "Welcome back!"
> type </html> into account name field.
The value displays in the edit box.
> type </html> into password edit box.
The value displays in the edit box.
> click Log In!
A Potentially Malicious Request warning displays! Oh, woe and agony! The site is eaten by a grue.
Well, suddenly, that’s me.
So I log a bug to indicate that the page should display a message in this case. That’s the first part of the two-fer.
Oh, and I look forward to the first through one hundredth times I have to log a bug about capitalizing CrossFit correctly.