Thanks for attending the webinar today and for all the great conversation. Here are the slides:
The video will be posted very shortly.
The last app monitoring tool review on our tour is AppDynamics. There’s more than meets the eye with AppDynamics. In terms of features, AppDynamics is huge in both depth and breadth. The app has potential for use in large firms that want to keep track of everything going on in and with their application, because each account can have multiple users monitoring multiple web apps.

Setup
I noticed something strange when I began to install the agent: I didn’t have the option to select the PHP agent! This was unexpected because AppDynamics says they support PHP applications (Magento in our case). I had to get the PHP agent package from another AppDynamics account to get the agent working, and the issue made us late for the demonstration web meeting. Monitoring agents should be easy to install, and their support team was not as helpful as we had hoped.

Application
Once the agent was installed, we were on our way to exploring AppDynamics. The first page shows all the different applications being monitored. In our case, MAGE_TEST is the group of servers housing the applications we’re testing.
After delving deeper into the MAGE_TEST application, we are greeted with the application dashboard. In the dashboard, I uncovered the Application Flow Map. While it may be superficial, it is one of my favorite features. It lays out your application network’s topology and basic metrics in a way that is easy to understand.
The Application Flow Map for our environment

Drill Down into a Web Transaction
Past all the eye candy, there is another tab that piques our interest. Clicking on Top Business Transactions neatly sorts the transactions it sees as most important. They include:
- Transactions by Load
- Transactions by Response Time
- Transactions by Errors
- Transactions by Slow Transactions
- Transactions by Stalls
- Transactions by Health Rule Violations
We notice the business transaction catalog : product : view is taking 17.9 seconds. When we double-click it, we are shown the Application Flow Map associated with just that transaction; in this case, it’s just NODE_3 and the shared RDS database. From here, we select a transaction snapshot to “Drill Down” into. These show the slowest pages related to our transaction. Finally, we get to the call graph, which is similar to NewRelic’s transaction trace. The call graph shows us the slow code in the app, which can help developers pinpoint performance bottlenecks in the application layer. The Hot Spots section shows the slowest methods of that transaction. In addition, the SQL Calls tab shows queries sent to the database. This feature didn’t really exist in NewRelic or AppFirst, so it was nice to see it for the first time in AppDynamics.

Other Features and Impressions
We’ve only gone over a few of the many features in AppDynamics. There are other features we liked but aren’t elaborating on. They include the following:
- The ability to save time windows
- The all-encompassing Metric Browser
- Scalability Analysis reports
- Following transactions through the network topology (Inception of app monitoring)
- SMS notification alerts
In general, AppDynamics is still a thorough and impressive app monitoring service. Despite the rocky start, the application remains a useful tool to anyone who wishes to optimize the performance of their web application.
I attended John Stevenson’s great talk and workshop at Monday night’s Software Testing Club Atlanta. I’m happy to report the meeting had about 15 in-person attendees and zero virtual attendees. Maybe someone read my post.
John is a thoughtful and passionate tester. He managed to hold our attention for 3 hours! Here are the highlights from my notes:
- The human brain can store 3 TB of information; this is only one millionth of the new information released on the internet every day.
- Overstimulation leads to mental illness.
- John showed us a picture and asked what we saw. We saw a tree, flowers, the sun, etc. Then John told us the picture was randomly generated. The point? People see patterns even when they don’t exist. Presumably to make sense out of information overload.
- Don’t tell your testing stories with numbers. “A statistician drowned while crossing a river with an average depth of 3 feet.” Isn’t that like saying, “99 percent of my tests passed”?
- Don’t be a tester who waits until testing “is done” to communicate the results. Communicate the test results you collected today. I love this and plan to blog about it.
- Testers, stop following the same routines. Try doing something different. You might end up discovering new information.
- Testers, stop hiding what you do. Get better at transparency and explaining your testing. Put your tests on a public wiki.
- Critical thinking takes practice. It is a skill.
- “The Pause”. Huh? Really? So? Great critical thinking model explained in brief here.
- A model for skepticism. FiLCHeRS.
- If you challenge someone’s view, be careful to respect it.
- Ways to deal with information overload:
- Slow down.
- Don’t over commit.
- Don’t fear mistakes. But do learn from them. This is how children learn. Play.
- (Testing specific) Make your testing commitments short so you can throw them away without losing much. Don’t write some elaborate test that takes a week to write because it just might turn out to be the wrong test.
- You spend a third of your life at work. Figure out how to enjoy work.
- John led us through a series of group activities including the following:
- Playing Disruptus to practice creative thinking. (i.e., playing Scamper.)
- Playing Story War to practice bug advocacy.
- Determining if the 5 test phases (Documentation, Planning, Execution, Analysis, Reporting) each use creative thinking or critical thinking.
- Books John referenced that I would like to read:
- The Signal and the Noise – Nate Silver
- Thinking, Fast and Slow – Daniel Kahneman
- You are Not So Smart – David McRaney
In this case, the development of software automobiles:
Once upon a time, software was written by people who knew what they were doing, like Mel and his descendants. They were generally solitary, socially awkward fellows with strong awareness of TSR gaming. They were hugely effective at doing things like getting an Atari 2600 to run Pac-Man or writing operating system kernels that never crashed, but they weren’t terribly manageable and they could be real pricks when you got in their way. I once worked with a fellow who had been at the company in question for twenty-three years and had personally written a nontrivial percentage of the nine million lines of code that, when compiled, became our primary product. He was un-fire-able and everybody knew it. There were things that only he knew.
This kind of situation might work out well for designing bridges or building guitars (not that Paul Reed Smith appears to miss Joe Knaggs all that much, to use an inside-baseball example) but it’s hell on your average dipshit thirty-five-year-old middle manager, who has effectively zero leverage on the wizard in the basement. Therefore, a movement started in the software business about fifteen years ago to ensure that no more wizards were ever created. It works like this: Instead of hiring five guys who really know their job at seventy bucks an hour each, you hire a team of fifty drooling morons at seven bucks an hour each. You make them program in pairs, with one typing and the other watching him type (yes! This is a real thing! It’s called “extreme programming”!) or you use a piece of software to give them each a tiny bit of the big project.
This is what you get from a management perspective: fifty reports who are all pathetically grateful for the work instead of five arrogant wizards, the ability to fire anybody you like at any time without consequence, the ability to demand outrageous work hours and/or conditions (I was just told that a major American corporation is introducing “bench seating” for its programmers, to save space), and a product that nominally fulfills the spec. This is what you get from a user perspective: the kind of crapware that requires updates twice a week to fix bugs introduced with the previous updates. Remember the days when you could buy software that simply worked, on a floppy disk or cartridge, with no updates required? Those were the wizards at work. Today, you get diverse teams of interchangeable, agile, open-office, skill-compatible resources that produce steaming piles of garbage.
He doesn’t mention xth-generation languages, which allow software to be written badly, far away from the hardware it runs on and by developers with little knowledge of it, who are forgiven by advances in that underlying (and now virtual) hardware that can cover up poor design and coding practices. Nor does he mention third-party components of dubious provenance relied on for core processing, or Internet cut-and-paste.
But, other than that, he explains very well why my next car is going to be a Mercedes. A Mercedes 35 hp.
I’ve been thinking a lot less about testing activities lately, and much, much more about how to make higher-quality software in general. The theme is evident from my last several blog posts, but I’m still figuring out exactly what that means for me. What it boils down to is a few principles that reflect how I approach making great software.
- I believe in teams made up of generalizing specialists – I dislike the notion of strong walls between software disciplines (and in a perfect world, I wouldn’t have separate engineering disciplines).
- It is inefficient (and wasteful) to have a separate team conduct confirmatory testing (i.e., “checks,” as many like to call them). This responsibility should lie with the author of the functionality.
- The (largely untapped) key to improving software quality is the analysis and investigation of data (usage patterns, reliability trends, error path execution, etc.).
I haven’t written much about point #3 – that will come soon. Software teams have wasted huge amounts of time and money equating test cases and test pass rates to software quality, and have ignored trying to figure out if the software is actually useful.
We can do better.
A few links:
Firefox Attempting to Replace Google as Main Source of Revenue
Mozilla Firefox is well known for providing users with unique customizable extensions and plugins. Mozilla announced this week that it will start selling ads on its directory tiles to gain a new revenue stream. The company said in a blog post on Tuesday that it is currently reaching out to potential corporate sponsors about the directory tiles program. The project will be aimed at first-time Firefox users.
Prior to this announcement, new Firefox users would see nine blank tiles when they fired up the browser for the first time. As new users explored more websites, their directory tiles would fill with their most-visited or recently visited websites. Mozilla’s directory tiles will display the most popular sites by location, as well as sponsored websites that will be clearly labeled as promoted to first-time users.
90% of Mozilla’s yearly revenue comes from Google through the Firefox search box. As Firefox’s market share continues to decline, the nonprofit foundation needs a new revenue stream. Firefox was once the most popular alternative to Microsoft’s Internet Explorer; these days, Google Chrome is chipping away at Firefox’s market share. Even though Mozilla is currently one of Google’s major partners, Google will hold the stronger hand when their contract is next negotiated. Unless Bing steps in, Mozilla won’t have much leverage with Google.
Mozilla is hardly the first nonprofit faced with the problem of raising funds. A possible alternative to the directory tiles would be donation icons in Firefox and Thunderbird.
The chart above shows a complete history of browser usage. Here are some stats:
In 2002, Internet Explorer had 83.4% of the market.
Firefox joined the market in 2003 with 7.2%.
In 2008, Google Chrome came into the market with 3.6%. IE had 46% of the share, while Firefox held onto 44.4%.
Twitch.TV Detecting Adblock to Block Viewers from Streaming
Unless it’s re-watching Super Bowl commercials on YouTube, viewers despise advertising on websites. Some of us have developed banner blindness while others have installed Adblock to filter out ads. Although we may enjoy the content that certain websites provide, we know ad blocking is cutting into the pockets of websites like Twitch.tv and perhaps, in the future, even browsers like Firefox. Twitch.tv is figuring out how to bypass Adblock and show ads to its viewers. Some viewers who used Adblock have received messages like this during a live stream:
Assuming Twitch is trying to force viewers not to use Adblock when watching streams, this will either drive viewers away from Twitch or tempt more of them to purchase Twitch’s Turbo package, which is ad-free: no pre-rolls, no mid-rolls, no companions, and no display ads.
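For context, a common way sites detect ad blockers is with a “bait” element: render an element that blocker filter lists are known to hide, then check whether it is still visible. Here is a minimal sketch of that general technique (not Twitch’s actual implementation; the class name is an assumption about common filter rules):

```typescript
// Bait-element ad blocker detection: a generic sketch, not Twitch's code.
// Filter lists typically hide elements with ad-like class names such as "adsbox".
function detectAdblock(onResult: (blocked: boolean) => void): void {
  const bait = document.createElement("div");
  bait.className = "adsbox"; // assumed to be on common blocklists
  bait.innerHTML = "&nbsp;";
  document.body.appendChild(bait);
  // Give the blocker a moment to apply its element-hiding rules.
  window.setTimeout(() => {
    const blocked = bait.offsetHeight === 0; // hidden bait implies a blocker
    bait.remove();
    onResult(blocked);
  }, 100);
}

// Usage: nudge viewers toward ad-supported or Turbo viewing.
detectAdblock((blocked) => {
  if (blocked) console.log("Ad blocker detected");
});
```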
Twitch.tv is the leading video broadcasting platform for gamers. The website attracts 45 million unique viewers per month worldwide. The Wall Street Journal published data on the percentage of U.S. peak Internet traffic produced by each company’s network:
Never heard of Twitch? Surprised that it’s ranked higher than Facebook? Twitch is a video streaming service, so it uses much more bandwidth than sites like Facebook. Nonetheless, Twitch is still beating popular streaming services like Hulu and Amazon. Attracting millions of users on a global scale is a challenge. Twitch is currently upgrading its servers and renting servers in different locations to deliver more reliable video streaming to its global audience. This is important because Twitch.tv viewers tend to complain about buffering and lag when streaming videos.
Previously, the team checked out NewRelic and all its capabilities. Even though we were impressed with the app monitoring and UI, we missed having detailed server resource monitoring. We looked forward to the next player in app and server monitoring services: AppFirst. At first glance, AppFirst is simpler and monitors some of the same app stacks as NewRelic, such as Java, PHP, and Ruby. In this post, I will go through the team’s experiences with AppFirst as a monitoring tool.
Like last time, the first thing to do was get the agent. AppFirst calls them “collectors” but they are essentially the same thing as agents. The AppFirst support team reached out to us beforehand to make sure the collectors were configured properly. This was helpful to me because I had not set up the collector to monitor MySQL properly. All in all, installing and configuring the collector was not difficult; later there were some confusing aspects about how servers were organized that I didn’t like.
My first glimpse of the user interface was the Dashboard. Upon initially logging in, the Dashboard was empty. This should imply two things:
- There is a lot of customization needed.
- This is not for users who want data right away.
This is in contrast to NewRelic, which had graphs specialized for app performance out of the box. AppFirst, however, makes you choose those settings on your own.
The next tab over is the Workbench, which gives in-depth data of things like servers, alert statuses, and a summary table. You can dig deeper into a selected server. In our case, we are monitoring three Magento app servers. You can then check one of those servers and see alerts, running processes, and historical CPU and memory usage.
The Dataflow tab gives us insight into how AppFirst perceives our network. It does so in an interesting way. Just take a look at the screenshot of how the dataflow is represented for our 3-server Magento store.
You can hover your mouse over the nodes and get a glimpse of data transfer. I find this presentation awkward, especially compared to how other app monitoring tools handle it. The Browse tab mostly shows running processes, which is not as useful for us at the moment. The Correlate tab, on the other hand, lets you select any two data types present in AppFirst and compare them. For example:
- Polled data
These are the kind of potentially useful features that the Web Performance Lab really looks for. A downside of this page is that it doesn’t give you the option to filter out unavailable data, which means you have to find usable data manually; so it’s important to make sure your collectors are getting valid data!
In Logs, you can diagnose problems with the server from the AppFirst web app. You can actually specify monitoring for any plaintext file, not just logs. The functionality here is similar to running tail -f on a log file, which is nice to have.
Overall, the monitoring software is satisfying. Most pages give a hyperlink that allows you to share your monitoring data with 3rd party sources. An AppFirst account is required though, which I think defeats the purpose of having a shareable link in the first place. In addition, there are no transaction traces in AppFirst like with NewRelic. The server scope is confusing at times and there are annoying feedback forms that keep popping up. On the other hand, the user interface is straightforward and highly usable. One metric AppFirst has that other monitoring services lack is the server cost over time. With AppFirst, you know if you’re meeting Service Level Agreements to the dollar amount.
We’ve got one more application monitoring service to review. Coincidentally, their name is AppDynamics (App just seems to be a popular prefix). We’ll be checking in with them next time!
"What user acceptance testing metrics are most crucial to a business?"
Here is an expanded version of my answer, with some caveats.
The leading caveat is that you have to be very careful with metrics because they can drive the wrong behavior and decisions. It's like the unemployment rate. The government actually publishes several rates, each with different meanings and assumptions. The one we see on TV is usually the lowest one, which doesn't factor in the people who have given up looking for work. So the impression might be that the unemployment situation is getting better, while the reality is that a lot of people have left the work force or may be under-employed.
Anyway, back to testing...
If we see metrics as items on a dashboard to help us drive the car (of testing and of projects), that's fine as long as we understand that WE have to drive the car and things happen that are not shown on the dashboard.
Since UAT is often an end-of-project activity, all eyes are on the numbers to know if the project can be deployed on time. So there may be an effort by some stakeholders to make the numbers look as good as possible, as opposed to reflecting reality.
With that said...
One metric I find very telling is how many defects are being found per day or week. You might think of this as the defect discovery velocity. These must be analyzed in terms of severity. So, 10 new minor defects may be more acceptable than 1 critical defect. As the deadline nears, the number of new, critical, defects gains even more importance.
Another important metric is the number of resolved/unresolved defects. These must also be balanced by severity and should be reflected in the acceptance criteria. Be aware, though, that it is common (and not good) practice to reclassify critical defects as "moderate" to release the system on time. Also, keep in mind that you can "die the death of a thousand paper cuts." In other words, it's possible to have no critical issues, but many small issues that render the application useless.
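As a rough illustration of both metrics, here is a small sketch that computes defect discovery velocity by severity and a resolved/unresolved breakdown. The record shape and severity labels are hypothetical; adapt them to whatever your defect tracker exports:

```typescript
// Hypothetical defect record shape; adjust fields to your tracker's export.
interface Defect {
  severity: "critical" | "major" | "minor";
  foundOn: string; // ISO date, e.g. "2014-02-10"
  resolved: boolean;
}

// Defect discovery velocity: defects found in a period, broken down by severity.
function discoveryVelocity(defects: Defect[], start: string, end: string) {
  const found = defects.filter((d) => d.foundOn >= start && d.foundOn <= end);
  const bySeverity: Record<string, number> = {};
  for (const d of found) {
    bySeverity[d.severity] = (bySeverity[d.severity] ?? 0) + 1;
  }
  return bySeverity; // e.g. { minor: 10, critical: 1 }
}

// Resolved vs. unresolved counts for the acceptance criteria discussion.
function resolutionStatus(defects: Defect[]) {
  const resolved = defects.filter((d) => d.resolved).length;
  return { resolved, unresolved: defects.length - resolved };
}
```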
Acceptance criteria coverage is another key metric, identifying which criteria have and have not been tested. Of course, proceed with great care on this metric as well. Just because a criterion has been tested doesn't mean it was tested well, or even that it passed the test. In my Structured User Acceptance Testing course, we place a lot of focus on testing the business processes, not just a list of acceptance criteria. That gives a much better idea of validation and whether or not the system will meet user needs in the real world.
Finally, stakeholder acceptance is the ultimate metric: how many of the original acceptance criteria have been formally accepted vs. not accepted? It may be the case that just one key issue holds up the entire project.
As far as business value is concerned, a business must see the value in UAT and the system to be released. Here is an article I wrote that addresses the value of software quality: The Cost of Software Quality - A Powerful Tool to Show the Value of Software Quality.
I hope this helps and I would love to hear about any metrics for UAT you have found helpful.
This is an excerpt from my upcoming book 50 Quick Ideas to Improve your User Stories. If you want to try this idea in practice, I’ll be running a workshop on improving user stories at the Booster conference in Bergen, NO next month. I’m also participating in the Product Owner Survival Camp in Zurich in March and we’ll be playing around with hierarchical backlogs and behaviour changes then as well.
Bill Wake’s INVEST set of user story characteristics has two conflicting forces: Independent and Valuable are often difficult to reconcile with Small. The value of software is a vague and esoteric concept in the domain of business users, but task size is under the control of a delivery team, so many teams end up choosing size over value. The result is “technical stories” that don’t really produce any outcome, and a disconnect between what the team is pushing out and what the business sponsors really care about.
Many delivery teams also implicitly assume that something has value just because business users asked for it, so it’s difficult to argue about that. Robert Brinkerhoff, in Systems Thinking in Human Resource Development, argues that valuable initiatives produce an observable change in someone’s way of working. This principle is a great way to start a conversation on the value of stories or to unblock a sticky situation. In essence, translating Brinkerhoff’s idea to software means that it’s not enough to describe just someone’s behaviour; we should aim to describe a change in that behaviour instead. This trick is particularly useful with user stories that have an overly generic value statement, or where the value statement is missing.
I recently worked with a team that struggled to describe acceptance criteria for a user story that was mostly about splitting a background process into two. The story was perceived to be of value because the business stakeholders asked for it. It was a strange situation, because the implication of the story was purely technical – a division of a background task. The success criterion was deceptively simple – check that we have two jobs instead of one – so the team was worried that there was more to this than meets the eye.
The value statement was “being able to import contacts”. The problem was that the users were able to import contacts already, and they would still be able to import contacts after the story was done – there were no real success criteria. We tried to capture the value not just as a behaviour, but as a change in that behaviour, and the discussion suddenly took a much more productive turn.
Some people argued that splitting the background process would allow users to import contacts faster, but the total time for a split task would be the same. So either the solution was wrong, or the assumed value was incorrect. Digging deeper into what would be different after the story was delivered, we discovered that users were not able to import large contact files easily. Imported data was going directly into the database, where it got processed in several steps synchronously. For large files, this process took longer than the allowed time for an HTTP request, so the users would see an error on the screen. They would have to re-upload the file and wait to see if it would be processed.
We finally identified the change as “being able to upload larger sets of contacts faster”, and this opened a discussion on several potential solutions. One was to just store the uploaded file on the server and complete the HTTP request, letting the user go and do other things, while the same job as before picks up the file in the background and processes it. It was a better solution than the original request because it did not depend on the speed of the background process, and it also was easier and faster to implement.
In addition, understanding the expected change in behaviour of business users allowed the team to set good acceptance criteria for the user story. They could test that a large file upload completes within the HTTP request timeout limit, instead of just checking for the number of background tasks.
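That kind of check is straightforward to automate. A minimal sketch, assuming a hypothetical /contacts/import endpoint and a 30-second request timeout (both invented for illustration):

```typescript
// Acceptance check sketch: a large contact file upload must complete
// within the HTTP request timeout instead of erroring out.
const ENDPOINT = "https://example.com/contacts/import"; // hypothetical URL
const REQUEST_TIMEOUT_MS = 30_000; // hypothetical timeout limit

async function largeUploadCompletesInTime(largeCsv: Blob): Promise<boolean> {
  const form = new FormData();
  form.append("contacts", largeCsv, "contacts.csv");

  const started = Date.now();
  const response = await fetch(ENDPOINT, { method: "POST", body: form });
  const elapsed = Date.now() - started;

  // Pass only if the server accepted the file and answered before the timeout.
  return response.ok && elapsed < REQUEST_TIMEOUT_MS;
}
```

Key benefits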
Capturing a behaviour change makes a story measurable from a business perspective, and this always opens up a good discussion. For example, once we know that a change is about uploading larger groups faster, two questions immediately pop up: how much larger, and how much faster? The right solution completely depends on these two factors. Are we talking about megabytes or gigabytes? Are we talking about speeding something up by a small percentage, or by an order of magnitude?
This will help to determine if the proposed solution is appropriate, inadequate, or over the top. Describing the change often sets the context, which allows a delivery team to propose better solutions.
Describing expected changes also allows teams to measure if a story succeeded from a business perspective once it is delivered. Even if the story passes all technical and functional tests, but fails to produce the expected behaviour change, it is not complete. This might lead the business sponsors to suggest more changes and different stories. The opposite is also true – if there are several stories aimed at the same behaviour change but the first one achieves more than planned, then the other stories can be thrown out of the plan – they are not needed any more.
A measurable behaviour change makes stories easier to split, because there is one more potential dimension to discuss. For example, if the behaviour change is “import contacts 20% faster”, offering a small subset of functionality that speeds up importing by 5% is still valuable.

How to make this work
Try to quantify expected changes – the good thing about a change is that it should be visible and measurable. Even if you do not end up measuring it at the end, capturing the expectation of how much something should change will help you discuss the proposed solutions.
If discrete values are difficult to set, aim for ranges. For example, instead of “10% faster”, ask about the minimum that would make a behaviour change valuable, and what would make it over the top. Then set the range somewhere in between.
Teams sometimes struggle to do this for new capabilities, or when replacing a legacy system. If the capability is not there yet, then “Start to” or “Stop doing” are valid behaviour changes. This will allow you to discuss what exactly “Start to” means. For example, a team I worked with had several weeks of work planned to enable traders to sell a new category of products, but it turned out that they could start to trade by logging purchase orders in Excel. The Excel solution did not deliver the final speed or capacity they needed, but traders started selling several months sooner than if they had waited for the full big-bang deployment to production, and this had immense value for the company.
Recently someone asked me if it was possible to measure the performance of localStorage. While this was difficult a few years ago, we can do it now thanks to Navigation Timing. This post explains how to measure localStorage performance, as well as the results of my tests showing the maximum size of localStorage, when the penalty of reading localStorage happens, and how localStorage behavior varies across browsers.

Past Challenges
In 2011 & 2012, Nicholas Zakas wrote three blog posts about the performance of localStorage. In the last one, The performance of localStorage revisited, he shared this insight from Jonas Sicking (who worked on localStorage in Firefox) explaining why attempts to measure localStorage up to that point were not accurate:
Firefox starts out by reading all of the data from localStorage into memory for the page’s origin. Once the data is in memory, reads and writes should be relatively fast (…), so our measuring of reads and writes doesn’t capture the full picture.
Nicholas (and others) had tried measuring the performance of localStorage by placing timers around calls to localStorage.getItem(). The fact that Firefox precaches localStorage means a different approach is needed, at least for Firefox.
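For reference, the naive approach amounts to something like this sketch, which is exactly what misses Firefox’s precaching cost:

```typescript
// Naive measurement: time a single localStorage read.
// In Firefox this understates the real cost, because the data may already
// have been precached into memory before this code runs.
const before = performance.now();
const value = localStorage.getItem("some-key");
const after = performance.now();
console.log(`getItem took ${(after - before).toFixed(2)} ms`);
```

Measuring localStorage with Navigation Timing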
Early attempts to measure localStorage performance didn’t capture the true cost when localStorage is precached. As Jonas described, Firefox precaches localStorage; in other words, it starts reading a domain’s localStorage data from disk when the browser first navigates to a page on that domain. When this happens, the true performance of localStorage isn’t captured, because the data might already be in memory before the call to getItem().
My hypothesis is that we should be able to use Navigation Timing to measure localStorage performance, even in browsers that precache it. Proving this hypothesis would let us measure localStorage performance in Firefox, and determine if any other browsers have similar precaching behavior.
The Navigation Timing timeline begins with navigationStart - the time at which the browser begins loading a page. Reading localStorage from disk must happen AFTER navigationStart. Even with this knowledge, it’s still tricky to design an experiment that measures localStorage performance. My experiment includes the following considerations:
- Fill localStorage to its maximum so that any delays are more noticeable.
- Use two different domains. In my case I use st2468.com and stevesouders.com. The first domain is used for storing and measuring localStorage. The second domain is for landing pages that have links to the first domain. This provides a way to restart the browser, go to a landing page on stevesouders.com, and measure the first visit to a page on st2468.com.
- Restart the browser and clear the operating system’s disk cache between measurements.
- In the measurement page, wrap the getItem() calls with timers and record the Navigation Timing metrics, in order to see when the precache occurs (see the sketch after this list). We know it’s sometime after navigationStart, but we don’t know what marker it’s before.
- Make the measurement page cacheable. This removes any variability due to network activity.
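Here is a sketch of the core of such a measurement page. It uses the original performance.timing interface (what browsers supported at the time); the key name is arbitrary, and this is an outline of the approach rather than my exact test code:

```typescript
// Manual timers around getItem() plus Navigation Timing markers,
// all reported in milliseconds relative to navigationStart.
const t = performance.timing;

const beforeGetItem = Date.now();
localStorage.getItem("no-such-key"); // blocks on any pending precache/disk read
const afterGetItem = Date.now();

const rel = (epochMs: number) => epochMs - t.navigationStart;
console.table({
  responseStart: rel(t.responseStart),
  responseEnd: rel(t.responseEnd),
  domLoading: rel(t.domLoading),
  beforeGetItem: rel(beforeGetItem),
  afterGetItem: rel(afterGetItem),
  deltaGetItem: afterGetItem - beforeGetItem,
});
```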
The step-by-step instructions can be seen on my test page. I use Browserscope to record the results but otherwise this test is very manual, especially since the only reliable way to clear the OS disk cache on Windows, iOS, and Android is to do a power cycle (AFAIK). On the Macbook I used the purge command.

The Results: maximum localStorage
The results of filling localStorage to the maximum are shown in Table 1. Each browser was tested nine times and the median value is shown in Table 1. (You can also see the raw results in Browserscope.) Before determining if we were able to capture Firefox’s precaching behavior, let me describe the table:
- The User Agent column shows the browser and version. Chrome, Firefox, Opera, and Safari were tested on a Macbook Air running 10.9.1. IE was tested on a Thinkpad running Windows 7. Chrome Mobile was tested on a Samsung Galaxy Nexus running Android 4.3. Mobile Safari was tested on an iPhone 5 running iOS 7.
- The size column shows how many characters localStorage accepted before throwing a quota exceeded error. The actual amount of space depends on how the browser stores the strings – single byte, double byte, mixed. The number of characters is more relevant to most developers since everything saved to localStorage is converted to a string. (People using FTLab’s ftdatasquasher might care more about the actual storage mechanism underneath the covers.)
- The delta getItem column shows how long the call to getItem() took. It’s the median of the difference between “AFTER getItem time” and “BEFORE getItem time”. (In other words, it’s possible that the difference of the medians in the table don’t equal the “delta getItem” median exactly. This is an artifact of how Browserscope displays results. Reviewing the raw results shows that if the math isn’t exact it’s very close.)
- The remaining columns are markers from Navigation Timing, plus the manual markers before and after the call to getItem(). The value is the number of milliseconds at which that marker took place relative to navigationStart. For example, in the first row responseStart took place 3 ms after navigationStart. Notice how responseEnd takes place just 2 ms later because this page was read from cache (as mentioned above).
One thing to notice is that there are no Navigation Timing metrics for Safari and Mobile Safari. These are the only major browsers that have yet to adopt the W3C Navigation Timing specification. I encourage you to add your name to this petition encouraging Apple to support the Navigation Timing API. For these browsers, the before and after times are relative to a marker in an inline script at the top of the HEAD.

Table 1. Maximum localStorage (times in ms relative to navigationStart)

| User Agent | size (K-chars) | delta getItem (ms) | responseStart | responseEnd | domLoading | BEFORE getItem | AFTER getItem | domInteractive |
|---|---|---|---|---|---|---|---|---|
| Chrome 33 | 5,120 | 1,038 | 3 | 5 | 21 | 26 | 1,064 | 1,065 |
| Chrome Mobile 32 | 5,120 | 1,114 | 63 | 69 | 128 | 163 | 1,314 | 1,315 |
| Firefox 27 | 5,120 | 143 | 2 | 158 | 4 | 15 | 158 | 160 |
| IE 11 | 4,883 | 759 | 3 | 3 | 3 | 15 | 774 | 777 |
| Opera 19 | 5,120 | 930 | 2 | 4 | 14 | 20 | 950 | 950 |
| Mobile Safari 7 | 2,560 | 453 | – | – | – | 1 | 454 | – |
| Safari 7 | 2,560 | 520 | – | – | – | 0 | 520 | – |

Did we capture it?
The results from Table 1 show that Firefox’s localStorage precaching behavior is captured using Navigation Timing. The delta of responseStart and responseEnd (the time to read the HTML document) is 156 ms for Firefox. This doesn’t make sense since the HTML was read from cache. This should only take a few milliseconds, which is exactly what we see for all the other browsers that support Navigation Timing (Chrome, IE, and Opera).
Something else is happening in Firefox during the loading of the HTML document that is taking 156 ms. The likely suspect is Firefox precaching localStorage. To determine if this is the cause, we reduce the amount of localStorage data to 10K. These results are shown in Table 2 (raw results in Browserscope). With only 10K in localStorage, we see that Firefox reads the HTML document from cache in 13 ms (responseEnd minus responseStart). The only variable that changed between these two tests was the amount of data in localStorage: 10K vs 5M. Thus, we can conclude that the increase from 13 ms to 156 ms is due to Firefox precaching taking longer when there is more localStorage data.

Table 2. 10K localStorage (times in ms relative to navigationStart)

| User Agent | size (K-chars) | delta getItem (ms) | responseStart | responseEnd | domLoading | BEFORE getItem | AFTER getItem | domInteractive |
|---|---|---|---|---|---|---|---|---|
| Chrome 33 | 10 | 3 | 5 | 7 | 18 | 28 | 29 | 29 |
| Chrome Mobile 32 | 10 | 28 | 73 | 76 | 179 | 229 | 248 | 250 |
| Firefox 27 | 10 | 1 | 3 | 16 | 4 | 15 | 16 | 16 |
| IE 11 | 10 | 15 | 6 | 6 | 6 | 48 | 60 | 57 |
| Opera 19 | 10 | 7 | 2 | 4 | 15 | 23 | 33 | 33 |
| Mobile Safari 7 | 10 | 16 | – | – | – | 1 | 17 | – |
| Safari 7 | 10 | 11 | – | – | – | 0 | 11 | – |
Using Navigation Timing we’re able to measure Firefox’s precaching behavior. We can’t guarantee when it starts, but presumably it’s after navigationStart. In this experiment it ends with responseEnd, but that’s likely due to the page blocking on this synchronous disk read when the call to getItem() is reached. In the next section we’ll see what happens when the call to getItem() is delayed so there is no race condition.

Does anyone else precache localStorage?
We discovered Firefox’s precaching behavior by comparing timings for localStorage with 10K versus the maximum of 5M. Using the same comparisons, it appears that none of the other browsers are precaching localStorage; the delta of responseStart and responseEnd for all other browsers is just a few milliseconds. We can investigate further by delaying the call to getItem() until one second after the window onload event. The results of this variation are shown in Table 3 (raw results in Browserscope).

Table 3. Maximum localStorage, delayed getItem (times in ms relative to navigationStart)

| User Agent | size (K-chars) | delta getItem (ms) | responseStart | responseEnd | domLoading | BEFORE getItem | AFTER getItem | domInteractive |
|---|---|---|---|---|---|---|---|---|
| Chrome 33 | 5,120 | 1,026 | 3 | 5 | 21 | 1,112 | 2,139 | 85 |
| Chrome Mobile 32 | 5,120 | 1,066 | 83 | 87 | 188 | 1,240 | 2,294 | 234 |
| Firefox 27 | 5,120 | 0 | 3 | 17 | 4 | 1,038 | 1,039 | 20 |
| IE 11 | 4,883 | 872 | 5 | 5 | 5 | 1,075 | 1,967 | 49 |
| Opera 19 | 5,120 | 313 | 2 | 4 | 15 | 1,025 | 1,336 | 23 |
| Mobile Safari 7 | 2,560 | 104 | – | – | – | 1,003 | 1,106 | – |
| Safari 7 | 2,560 | 177 | – | – | – | 1,004 | 1,181 | – |
Table 3 confirms that Firefox is precaching localStorage – “delta getItem” is 0 ms because there was plenty of time for Firefox to finish precaching before the call to getItem(). All the other browsers, however, have positive values for “delta getItem”. The values for Chrome, Chrome Mobile, and IE are comparable between Table 1 and Table 3: 1038 vs 1026, 1114 vs 1066, and 759 vs 872.
The values for Opera, Mobile Safari, and Safari are slower in Table 1 compared to Table 3: 930 vs 313, 453 vs 104, and 520 vs 177. I don’t have an explanation for this. I don’t think these browsers are precaching localStorage (the values from Table 3 would be closer to zero). Perhaps the call to getItem() took longer in Table 1 because the page was actively loading and there was contention for memory and CPU resources, whereas for Table 3 the page had already finished loading.

500K localStorage
So far we’ve measured maximum localStorage (Table 1) and 10K of localStorage (Table 2). Table 4 shows the results with 500K of localStorage. All of the “delta getItem” values fall between the 10K and maximum values. No real surprises here.

Table 4. 500K localStorage (times in ms relative to navigationStart)

| User Agent | size (K-chars) | delta getItem (ms) | responseStart | responseEnd | domLoading | BEFORE getItem | AFTER getItem | domInteractive |
|---|---|---|---|---|---|---|---|---|
| Chrome 33 | 500 | 20 | 3 | 4 | 19 | 25 | 43 | 43 |
| Chrome Mobile 32 | 500 | 164 | 78 | 85 | 144 | 183 | 368 | 368 |
| Firefox 27 | 500 | 14 | 2 | 30 | 3 | 15 | 30 | 31 |
| IE 11 | 500 | 32 | 5 | 5 | 5 | 48 | 89 | 83 |
| Opera 19 | 500 | 36 | 2 | 4 | 14 | 23 | 57 | 58 |
| Mobile Safari 7 | 500 | 37 | – | – | – | 1 | 38 | – |
| Safari 7 | 500 | 44 | – | – | – | 0 | 44 | – |

Conclusions
The goal of this blog post was to see if Firefox’s localStorage precaching behavior was measurable with Navigation Timing. We succeeded in doing that in this contrived example. For real world pages it might be harder to capture Firefox’s behavior. If localStorage is accessed early in the page then it may recreate the condition found in this test where responseEnd is blocked waiting for precaching to complete.
Another finding from these tests is that Firefox is the only browser doing precaching. This means that the simple approach of wrapping the first access to localStorage with timers accurately captures localStorage performance in all browsers except Firefox.
It’s hard not to focus on the time values from these tests but keep in mind that this is a small sample size. I did nine tests per browser for Table 1 and dropped to five tests per browser for Tables 2-4 to save time. Another important factor is that the structure of my test page is very simple and unlike almost any real world website. Rather than focus on these time values, it would be better to use the conclusions about how localStorage performs to collect real user metrics.
There are takeaways here for browser developers. There’s quite a variance in results across browsers, and Firefox’s precaching behavior appears to improve performance. If browser teams do more extensive testing coupled with their knowledge of current implementation it’s likely that localStorage performance will improve.
A smaller takeaway is the variance in storage size. Whether you measure by number of characters or bytes, Safari holds half as much as the other major browsers.

Notes, Next Steps, and Caveats
As mentioned above, the time values shown in these results are based on a small sample size and aren’t the focus of this post. A good next step would be for website owners to use these techniques to measure the performance of localStorage for their real users.
In constructing my test page I tried various techniques for filling localStorage. I settled on writing as many strings as possible of length 1M, then 100K, then 10K, then 1K – this resulted in a small number of keys with some really long strings. I also tried starting with strings of length 100K then dropping to 1K – this resulted in more keys and shorter strings. I found that the first approach (with some 1M strings) produced slower read times. A good follow-on experiment would be to measure the performance of localStorage with various numbers of keys and string lengths.
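A sketch of that filling approach (quota limits vary by browser, so the loop simply writes until the browser throws a quota error):

```typescript
// Fill localStorage to its quota using progressively smaller strings:
// as many 1M-char strings as fit, then 100K, 10K, and finally 1K.
// The key prefix is arbitrary; the non-ASCII character is deliberate
// (see the note on character encoding below).
function fillLocalStorage(): number {
  const chunkSizes = [1_000_000, 100_000, 10_000, 1_000];
  let totalChars = 0;
  let key = 0;
  for (const size of chunkSizes) {
    const chunk = "é".repeat(size);
    for (;;) {
      try {
        localStorage.setItem(`fill-${key++}`, chunk);
        totalChars += size;
      } catch (e) {
        break; // quota exceeded at this size; drop to the next smaller chunk
      }
    }
  }
  return totalChars; // characters stored before hitting the quota
}
```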
With regard to how browsers encode characters when saving to localStorage, I chose to use string values that contained some non-ASCII characters in an attempt to force browsers to use the same character encoding.
In my call to getItem() I referenced a key that did not exist. This was to eliminate any variability in reading (potentially large) strings into memory, since my main focus was on reading all of localStorage into memory. Another follow-on experiment would be to test the performance of reading keys of different states and lengths to see if browsers performed differently. For example, one possible browser optimization would be to precache the keys without the values – this would allow for more efficient handling of the case of referencing a nonexistent key.
A downside of Firefox’s precaching behavior would seem to be that all pages on the domain would suffer the penalty of reading localStorage, regardless of whether they actually used it. However, in my testing it seemed like Firefox learned which pages used localStorage and avoided precaching on pages that didn’t. Further testing is needed to confirm this behavior. Regardless, it seems like a good optimization.

Thanks
Thanks to Nicholas Zakas, Jonas Sicking, Honza Bambas, Andrew Betts, and Tony Gentilcore for providing advice and information for this post.
As the name implies, the Web Performance Lab is all about performance optimization, so we felt it was our duty to investigate server-side monitoring. We also needed a monitor that could serve us accurate, reliable data. Read on for a review of our experiences working with New Relic.

What is Server-side Monitoring?
In a nutshell, it is a way for you to watch your web and app servers for performance issues using a monitoring service. The lab’s first go-to was New Relic, a big player in the application performance monitoring arena. The goal was to monitor a smaller version of my scaling Magento project’s test environment.
Monitoring agents are programs that sit on the target server and collect data about the system to send to the monitor controller (New Relic). Once the agent is set up, we will execute a 5,000 VUser test and see how the system performs. This is important because we want to have a Proof of Concept that the load tests are indeed hitting our server.
After logging in, the user interface was pretty helpful in pointing me in the right direction. On the first page, I could see a red “add more” button, which I used to add test agents. There were a number of steps required to get the agents fully functioning: we had to use the right package file, install it, edit the php.ini file, and restart some services. Even so, within minutes I had an application agent up and running. Setting up the server monitoring agent was a breeze after installing the app agent. There were two setup processes for the two agents, which made me ponder: why are there even two? It would have been nice to have an all-in-one agent.
I began exploring the application navigation pane after the agents were successfully registered. My first impression: Wow! How about those charts? Two vital pieces of data were the Apdex chart and Web Transactions. Apdex is a simplified service-level agreement that gives a single quantitative rating of how customers might be reacting to the site’s performance, based on response time (a quick sketch of the calculation follows the list below). Find out more about Apdex here. Web Transactions let users go through the PHP transaction traces and pinpoint methods causing poor performance. Web transactions can be sorted by the following four conditions:
- Most time consuming
- Slowest average response time
- Apdex most dissatisfying
- Highest throughput.
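For the curious, here is a quick sketch of the standard Apdex calculation from raw response times: samples at or under the threshold T count as satisfied, samples between T and 4T count half as tolerating, and anything slower counts as frustrated.

```typescript
// Apdex = (satisfied + tolerating / 2) / total, for a target threshold T.
function apdex(responseTimesMs: number[], thresholdMs: number): number {
  let satisfied = 0;
  let tolerating = 0;
  for (const rt of responseTimesMs) {
    if (rt <= thresholdMs) satisfied++;
    else if (rt <= 4 * thresholdMs) tolerating++;
    // anything slower is "frustrated" and contributes nothing
  }
  return (satisfied + tolerating / 2) / responseTimesMs.length;
}

// Example: with T = 500 ms, these samples score (2 + 1/2) / 4 = 0.625.
console.log(apdex([120, 450, 900, 2500], 500));
```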
Now that the app/server agents and test environment were ready to go, it was time to begin testing. While the test was running, the New Relic team gave us a comprehensive demo of their monitoring application.
Some of the most time consuming transactions in our test environment
If you’re not interested in drilling into the details of the application, you can always just inspect the response-versus-time graph. This helped the team see the effects of a load test on a server over long periods of time. In fact, there is some correlation between the load test results from LoadStorm and the response time graphs from New Relic.
It’s not perfect because the scaling is off, but there is definitely correlation between the two data sets. This is an indicator that our system is being successfully load tested and monitored.
Overall, the team was impressed with the features and ease of use that New Relic offered. Their monitoring services cover both the application and the server. A further benefit is the quick-filtered transactions, which are useful for web developers and performance engineers alike. The support team was helpful in getting us well acquainted with the application. I am happy to give New Relic a positive review, and I would recommend it to anyone looking to optimize their web app because of the advanced detail provided. Subscribe to our feed and find out what we think of the next monitoring service provider: AppFirst.
In my automated test scripts, I always put one-line comments at the end of loops and methods that show what is closing. For example, in Ruby, it looks like this:
puts("Waiting for delete confirmation") sleep(1) end #until end #click_delete_yes
It makes it just a little easier to figure out where I am in the code, and it also makes sure I close the loops and functions appropriately.
That’s so simple and obvious it can’t possibly be a helpful tip, could it?
Since 2000, I have been researching the question, "What is the recommended ratio of software testers to developers?" I have written two articles on that topic, with the original article, "The Elusive Tester to Developer Ratio" getting over 30,000 hits on my web site and being cited in many other articles and books. This is an important metric, but it also raises other important questions, such as:
- What if your needs are different from "average"?
- Is this metric really the best way to plan the staffing of a test organization?
- What are other, perhaps better, ways to balance your workload?
- How can small test teams be successful, even in large development organizations?
To sign up, just go to http://www.anymeeting.com/PIID=EA52DA86864F3C (you will get automatic reminders beforehand).
There are limited slots available, so be sure to sign up and show up early to reserve your place. (Last time we had a completely full session.) We will be recording the session and will post it a little later on my YouTube channel.
Feel free to pass this invitation along to a friend!
I hope to see you there!
At this week’s metric-themed Atlanta Scrum User’s Group meetup, I asked the audience if they knew of any metrics (that could not be gamed) that could trigger rewards for development teams. The reaction was as if I had just praised Planned Parenthood at a pro-life rally… everyone talking over each other to convince me I was wrong to even ask.
The facilitator later rewarded me with a door prize for the most controversial question. What?
Maybe my development team and I are on a different planet than the Agile-istas I encountered last night, because we are currently doing what I proposed, and it doesn’t appear to be causing any harm.
Currently, if 135 story points are delivered in the prior month AND no showstopper production bugs were discovered, everyone on the team gets a free half-day off to use as they see fit. We’ve achieved it twice in the past year. The most enthusiastic part of each retrospective is observing the prior month’s metrics and determining if we reached our “stretch goal”. It’s… fun. Let me repeat that. It’s actually fun to reward yourself for extraordinary work.
Last night’s question was part of a quest I’ve been on to find a better reward trigger. Throughput and quality are what we were aiming for, and I think we’ve gotten close. I would like to find a better metric than velocity, however, because story point estimation is fuzzy. If I could easily measure “customer delight”, I would.
At the meeting, I learned about the Class of Service metric. And I’m mulling over the idea of suggesting a “Dev Forward” % stretch goal for a given time period.
But what is this nerve I keep touching about rewards for good work?
On weekends, when I perform an extraordinary task around the house like getting up on the roof to repair a leak, fixing an electrical issue, constructing built-in furniture to solve a space problem, finishing a particularly large batch of “Thank You” cards, or whatever…I like to reward myself with a beer, buying a new power tool, relaxing in front of the TV, taking a long hot shower, etc.
Rewards rock. What’s wrong with treating ourselves at work too?
National Signing Day Takes Rivals Website Down
National Signing Day for high school athletes is like Christmas and birthdays all wrapped into one. Websites like 247Sports.com and Scout.com had their writers covering stories and posting updates every minute. On one of the biggest days for the recruiting industry, the top-dog website Rivals lagged behind. Of all 365 days, it was National Signing Day that overloaded their message boards, causing a bottleneck that brought down the site. The Rivals.com team diagnosed the problem and decided that the best way to stabilize the site was to temporarily shut down the message boards while making Premium content available to all paid subscribers.
While some users were still able to access content, the head of Rivals.com, Eric Winter, immediately addressed the problem and directed readers to their mobile website.
Rivals.com has an estimated 200,000 subscribers, twice the subscribers of second- and third-place Scout.com and 247Sports.com. While Rivals was still fixing its website, 247Sports drew 1 million unique visitors and close to 15 million pageviews on National Signing Day. It’s safe to assume some of that traffic came from Rivals users. Any big event like this will bring down a website that is left unprepared; Eric Winter told AL.com before National Signing Day that the technology they were running on was out of date. As a frequent visitor to Rivals.com, I’m optimistic that there will be some changes in the future because of this event.
There’s a saying that goes, “On the internet nobody knows you’re a dog.” How true is this? Can dogs really surf the internet? New data shows that humans account for only 38.5% of all web traffic. This can only mean your dog is, in fact, surfing the net. Right!? Sadly, no: the other 61.5% of web traffic actually comes from bots. In 2012, 49% of web traffic came from humans, while 51% came from bots.
You might be thinking that the majority of these bots are bad, but most of them are actually good bots. The good bots are out there indexing websites to help users find more accurate and relevant information. The bad bots, like spam bots, are actually decreasing; spam bots now account for only 1% of internet activity. Next time you’re on Google looking for an answer, think of bots bringing back information that’s high quality and up to date.
Nonetheless, it’s still important to play it safe. Besides malicious spam bots, there are scrapers, hacking tools, and impersonators. Scrapers steal and duplicate content; they will also harvest email addresses for spam purposes. Hacking tools are used to hijack servers and steal credit cards. Finally, impersonators drive up a website’s bandwidth consumption in hopes of causing downtime.
Just like humans, the actions that bots take can be good or bad. With statistics showing that humans account for a shrinking share of web traffic, the future will have some interesting consequences if the pattern continues.
Here are the slides and the references from my talk at JFokus this week, on surviving and thriving with flexible scope:
Here are some additional resources and reading materials:
- Make Impacts, not Software – this presentation has some overlapping ideas, but explores business metrics and impact maps in more detail
- The Ducati Story that I mentioned is from HBR April 2011, Why Leaders Don’t Learn from Success by Francesca Gino and Gary P. Pisano.
- The Palchinsky principles come from Adapt: Why Success Always Starts with Failure by Tim Harford
- For more on impact mapping, check out my book Impact Mapping and the community resources at ImpactMapping.org
- For story mapping, see Jeff Patton’s presentation and Christian Hassa’s Story Maps in Practice
- A nice intro book on design thinking is Change by Design: How Design Thinking Creates New Alternatives for Business and Society by Tim Brown
- How to Measure Anything: Finding the Value of Intangibles in Business by Douglas Hubbard, a pretty good book on business metrics
- I’ll be writing about the topic of this session and more ideas on improving user stories in my upcoming book 50 Quick Ideas to Improve your User Stories
If this is a topic of interest, I strongly suggest attending one of the upcoming Product Owner Survival Camps. David Evans is running a fantastic session on user stories, and as a bonus you can practice Story Maps with Christian Hassa and impact mapping with me.
Let’s say you are a web developer and have now come around to load testing your website. It’s important to make sure the load test results are reliable and useful, so how do you guarantee this? Take a look at these three useful load testing tips you can use with LoadStorm 2.0 to get more effective results and save time.

Tip one: Scale Gradually
If you’re a first-time web performance tester, it is a good idea to underestimate rather than overestimate the amount of load your server can handle. If your initial influx of VUsers is too high from the beginning, you will immediately see a huge spike in errors, as well as corresponding drops in throughput and requests per second. Nobody wants to go through that, because it wastes time and money. Begin by scaling gradually: choose a long test duration so the increase in users over time (the ramp-up) is lessened. For example, ramping to 1,000 VUsers over 50 minutes adds only 20 VUsers per minute, while reaching the same target in 10 minutes adds 100 per minute. This will produce more meaningful results. Compare the two images below: the first has a higher ramp-up and initial VUser count than the second, but also a large increase in error rates and response time.
Tip two: Eliminate Errors Early

You’ve finished generating a HAR recording which simulates VUser activity for your site, then you upload it to LoadStorm and run a test. You then discover that your 404 error rates are through the roof! These errors are easier to handle in the recording and scripting stages. Some errors might be unavoidable during the load test, like timeouts and HTTP 500 status returns, but ideally there should be no errors before the load test. One of the easiest errors to eliminate is a 404. It’s just a matter of doing one of the following:
- Adding the resource that is returning a 404, even a blank one, so the request returns 200.
- Removing the request for that resource from the HAR file altogether.
Tip three: Take Advantage of User Data

Now let’s say you want the VUsers to hit a bunch of web pages, but you don’t want to go through the process of making HAR recordings that hit every page in your site. That is the perfect time to take advantage of the User Data feature in LoadStorm! Simply generate a comma-separated values (CSV) file of all the web pages you wish to hit, upload that file, and use URL replacement on a script made from a small HAR recording. Additionally, you can use query string replacement; for example, if you’ve got a search feature that uses query strings, you can simulate users entering popular search terms. This data can be read from a User Data CSV file.
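If you need to produce such a file programmatically, a small sketch like this works; the column names, paths, and search terms are made-up examples, so match them to whatever your script’s replacement tokens expect:

```typescript
// Generate a User Data CSV of target pages and search terms (Node.js).
import { writeFileSync } from "fs";

const pages = ["/", "/catalog", "/catalog/widgets", "/checkout"]; // example paths
const searchTerms = ["red widget", "blue widget", "gadget"]; // example terms

const rows: string[][] = [["path", "search_term"]];
for (let i = 0; i < pages.length; i++) {
  rows.push([pages[i], searchTerms[i % searchTerms.length]]);
}

// Quote each field so commas inside search terms stay intact.
const csv = rows.map((r) => r.map((f) => `"${f}"`).join(",")).join("\n");
writeFileSync("user-data.csv", csv);
```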
Take advantage of these tips. As load testing becomes more complicated, these can help you save time and produce more reliable results. Are there any other tips you have come across while load testing in general?