- Top 3 Performance Problems in Custom Microsoft CRM Applications
- Top 10 Client-Side Performance Problems in Web 2.0
- How to Automate Google Analytics Analysis
- Ajax Best Practices: Reduce and Aggregate similar XHR calls
- dynaTrace Continuously Monitors ShowSlow URLs
- Performance as Key to Success! How Online News Portals could do better
- Week 9 – How to Measure Application Performance
- Video of Business Transaction Management in Action: In 6 minutes from Slow Search Request to identify Impacted Users and Offending SQL
- IE Compatibility View: How to identify performance problems between IE versions
- dynaTrace at Web Performance Meetups in Boston and New York City
- Too Much Cache is Like a Krispy Kreme Burger
- Debugging SAP scripts using SAPGUI Spy in LoadRunner
- Monitoring Maintenance Windows
- How to Monitor Oracle Database Performance
- Stressing Out Your Access Management System
- Running remote Unix commands from LoadRunner
- Web Performance Tuning Never Ends
- Running command-line programs from LoadRunner
- IIS Connections Affect Web Performance
- Load Testing Quote for August 19, 2010
Load & Perf Testing
Performance Blame Game
Behind the scenes of ASP.NET MVC 2 – Understand the internals to build better apps
Best Practice Webinar on Proactive Application Performance with Smith Micro on April 28th
Week 14 – Building Your Own Amazon CloudWatch Monitor in 5 Steps
dynaTrace shows how to track AJAX performance with Selenium
dynaTrace just posted a really nice tutorial showing how to track client-side performance (JavaScript, DOM, render times, etc.) automatically using Selenium. They do this with the super-cool free AJAX edition of their product.
Solving problems with SSODB during BizTalk 2009 Configuration
How better Caching helps Frankfurt’s Airport Website to handle additional load caused by the Volcano
17 Performance Testing Articles
Monitoring Maintenance Windows
Update dynaTrace AJAX Edition to get Rendering Times on ALL versions of Internet Explorer
Tracking a response time problem
Recently, an engineer came to me puzzled that the response times of a performance benchmark she was running were increasing. She had already looked at the usual trouble spots – resource utilization on the app and database systems, database statistics, application server tunables, the network stack, etc. I asked her about the CPU metrics on the load driver systems (the machines which drive the load). Usually, when I ask this question, the answer is “I don’t know. Let me find out and get back to you”. But this engineer had looked at that as well. “It isn’t a problem. There is plenty of CPU left – I have 30% idle”.
Aha – I had spotted the problem. When we run benchmarks, we tend to squeeze every bit of performance we can out of the systems. This means running the servers as close to 100% utilized as possible. This mantra is sometimes carried over to the load driver systems as well. Unfortunately, that can result in severe performance degradation. Here’s why.
The load driver systems emulate users and generate requests to the system under test. They receive the responses and measure and record response times. A typical driver emulates hundreds to thousands of users. Each emulated user is then competing for system resources. Now suppose an emulated user has issued a read request to read the response from the server. It is very likely that this thread will be context switched out by the operating system, as there are so many additional users it needs to serve. Depending on the number of CPUs on the system and the load, the original emulated user thread may only get to execute again after a considerable delay, and consequently record a much larger response time than the server actually delivered. My rule of thumb is never to run the load generator systems more than 50% busy if the application is latency sensitive. In this particular case, the system was already 70% utilized.
Sure enough – when a new load driver system was added and the performance tests re-run, all the response time criteria passed and the engineer could continue scaling the benchmark.
Moral of the story – don’t forget to monitor your load driver systems and don’t be complacent if their utilization starts climbing above 50%.
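To make the rule of thumb concrete, here is a minimal sketch that flags load drivers running above 50% busy. It assumes you already collect CPU idle percentages per driver (from vmstat, sar, or similar); the driver names, sample values, and function names are made up for illustration.

```python
from statistics import mean

# Rule of thumb from the story: keep load drivers at or below 50% busy
# when the application under test is latency sensitive.
BUSY_LIMIT_PERCENT = 50.0


def driver_busy_percent(idle_samples):
    """Convert a list of CPU idle percentages into an average busy percentage."""
    return 100.0 - mean(idle_samples)


def drivers_over_limit(idle_by_driver, limit=BUSY_LIMIT_PERCENT):
    """Return {driver name: busy %} for every driver busier than the limit."""
    over = {}
    for name, samples in idle_by_driver.items():
        busy = driver_busy_percent(samples)
        if busy > limit:
            over[name] = busy
    return over


# Hypothetical readings; driver1 mirrors the "30% idle" case above.
idle_by_driver = {
    "driver1": [31.0, 29.0, 30.0],
    "driver2": [62.0, 58.0, 60.0],
}
for name, busy in drivers_over_limit(idle_by_driver).items():
    print(f"{name}: {busy:.0f}% busy, consider adding another load driver")
```

A check like this can run alongside the benchmark, so a climbing driver utilization is noticed before it distorts the recorded response times.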
Correlating Events to Recognize Problems
Every engineer and manager who receives alerts from automated monitoring systems can relate to both the critical need they fill and to their often annoying shortcomings. I’m not just referring to a situation where the monitoring has recently been installed, and you haven’t tuned the default thresholds to limit notification to actionable events. The nature of monitoring and alerting, even with the most sophisticated programs, is that problem events are based on very narrow criteria, like a server’s CPU load, or a router’s bandwidth consumption, or some application’s .NET errors. It’s good to know about these specific problems. But if they are related parts of a complex problem, that diagnosis can easily be missed when these separate events are surrounded by hundreds or thousands of other random problem events. Recognizing the complex problem is further complicated because the events come from different sources: servers, network appliances, and applications. Of course most real world applications depend on distributed environments for reliable service delivery, and problems occur whose symptoms span multiple devices and programs. If you want to get ahead of the curve to immediately recognize and fix complex problems, then you need to start correlating multiple events so that you can send intelligent notifications that describe the conditions and fixes for complex problems.
What’s The Problem?
Events can be misleading. Consider an example where several servers are behind a switch. We’ll further assume that we are monitoring the availability of the switch and the servers. When the switch goes down, what happens? A ton of notifications is sent alerting everyone that all the servers are down, which is effectively true, but isn’t really the problem. Of course eventually the switch down alert comes in with all the server down messages. This is a simple example, where most good engineers will immediately diagnose the problem when they read the switch down alert, but a lot of messages were sent to notify you of the true problem. I always cringe when I know my boss is getting flooded with email saying that the sky is falling. Now, what if we use some logic in our notification that only sends out server down messages when the switch is OK, and suppresses all the server down messages when the switch goes down? That would be useful. Even better, let’s configure the switch down message to give recipients the list of servers that are unavailable due to the switch being down.
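The suppression logic for this example could look something like the sketch below. It assumes one switch with a known set of servers behind it and availability results already gathered by your monitoring; the function name and message wording are illustrative only.

```python
def build_alerts(switch_up, server_up):
    """Return the alert messages for one switch and the servers behind it.

    switch_up: bool, result of the switch availability check
    server_up: dict mapping server name -> bool availability result
    """
    if not switch_up:
        # Suppress the per-server alerts and report the root cause once,
        # listing every server that is unreachable behind the switch.
        unreachable = ", ".join(sorted(server_up))
        return [f"SWITCH DOWN: servers unreachable behind it: {unreachable}"]

    # The switch is OK, so each server-down event is a real problem on its own.
    down_servers = sorted(name for name, up in server_up.items() if not up)
    return [f"SERVER DOWN: {name}" for name in down_servers]


# Switch failure: one message instead of one per server.
print(build_alerts(False, {"web1": False, "db1": False}))
# Switch healthy: only the server that is really down is reported.
print(build_alerts(True, {"web1": False, "db1": True}))
```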
The switch example is easy to understand. Think about how useful correlating much more complex events can be, especially when critical information is included in the notification. Fixing a problem is a lot easier if the problem email or text message includes a concise description of the multiple conditions, what the root cause is, how to fix it, and who to communicate with for help. I’ve even worked with customers to include links to their own online SOP or Help Desk documentation. In my experience with problem email alerts, less is always better, as long as you always get all the notification you need. Correlation of events is the only way to simultaneously reduce the email count and dramatically improve the quality of information in alerts.
Logically Speaking
A Correlated Event is going to have multiple conditions. Sometimes all conditions must be true. We also want to be able to recognize when some conditions are true, while others are definitely not true. We may even want to specify that some conditions must be true, while others may be true, and some others must not be true (a really complex event…). We’re really only using three Boolean operators, AND, OR, and NOT, and grouping the logically similar conditions under each. The logical order of listing the conditions should be:
- Conditions that Must Be True
- Conditions that May Be True
- Conditions that Must Not Be True
The number of conditions can vary, but most complex problems can be recognized based on between 2 and 10 conditions. More may be needed when deducing problems along an extensive network path or across multiple application servers, for example.
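As a rough sketch, the three groups map onto AND, OR, and NOT as follows. One assumption is made here that the text does not spell out: the "may be true" group is treated as an OR, meaning at least one of its conditions must hold when the group is non-empty. The condition names are hypothetical.

```python
def correlated_event_fires(state, must=(), may=(), must_not=()):
    """Evaluate a correlated event against the current condition states.

    state:    dict mapping condition name -> bool (missing names count as False)
    must:     every listed condition must be true (AND)
    may:      at least one listed condition must be true, if any are listed
              (OR; this reading of "may be true" is an assumption)
    must_not: no listed condition may be true (NOT)
    """
    all_required = all(state.get(name, False) for name in must)
    any_optional = any(state.get(name, False) for name in may) if may else True
    none_forbidden = not any(state.get(name, False) for name in must_not)
    return all_required and any_optional and none_forbidden


# Hypothetical snapshot of individual problem events.
state = {
    "web_response_slow": True,
    "db_cpu_high": True,
    "db_disk_queue_high": False,
    "switch_down": False,
}
fires = correlated_event_fires(
    state,
    must=["web_response_slow"],
    may=["db_cpu_high", "db_disk_queue_high"],
    must_not=["switch_down"],
)
print(fires)
```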
I find it useful to include some synchronization and an awareness of persistence when correlating events. First, all conditions might not be based on measurements with the same interval. For example, disk statistics are typically collected hourly, whereas availability is tested every 1 to 5 minutes. Many other conditions will have intervals between these two extremes. I may want to specify that all conditions must happen within a specified interval. It’s also really valuable to be able to specify that X number of events must have happened in the interval. Picture network latency that gets flaky sometimes, but if it persists for X amount of time then it’s a problem.
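The persistence test described above can be sketched as a simple count of events inside a time window. The function name, timestamps, and thresholds below are illustrative assumptions, not taken from any particular monitoring product.

```python
from datetime import datetime, timedelta


def persisted(event_times, now, interval, min_count):
    """True if at least min_count events occurred within `interval` before `now`."""
    window_start = now - interval
    recent = [t for t in event_times if window_start <= t <= now]
    return len(recent) >= min_count


# Hypothetical latency events reported by a network check.
latency_events = [
    datetime(2010, 8, 19, 12, 0),
    datetime(2010, 8, 19, 12, 4),
    datetime(2010, 8, 19, 12, 8),
]
now = datetime(2010, 8, 19, 12, 10)

# Three events in the last 15 minutes: treat it as a persistent problem.
print(persisted(latency_events, now, timedelta(minutes=15), min_count=3))
# Only one event in the last 5 minutes: just a flaky moment.
print(persisted(latency_events, now, timedelta(minutes=5), min_count=3))
```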
Intelligent Notification
Once we recognize a complex problem based on the presence or lack of specific conditions we’re in a position to provide effective notification that will maximize the probability that a problem is fixed as quickly as possible. You’re setting up yourself and those around you for success. Here’s how I recommend configuring Correlated Event notification:
- Describe the condition set
  - what it means
  - what’s the root cause
- Describe the procedure to fix the problem
  - Links to documentation
- List the interested parties to contact for help
  - ISP Contacts
  - Network Admins
  - Server Admins
  - Application Support
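A notification that follows the outline above could be assembled along these lines. This is only a sketch: the field names, the example problem, the URL, and the contact addresses are all invented for illustration.

```python
def format_notification(title, meaning, root_cause, fix_steps, doc_links, contacts):
    """Render a correlated-event alert as plain text, following the outline above."""
    lines = [
        f"PROBLEM: {title}",
        f"What it means: {meaning}",
        f"Root cause: {root_cause}",
        "How to fix:",
    ]
    lines += [f"  {number}. {step}" for number, step in enumerate(fix_steps, start=1)]
    lines.append("Documentation:")
    lines += [f"  {link}" for link in doc_links]
    lines.append("Contacts:")
    lines += [f"  {role}: {who}" for role, who in contacts.items()]
    return "\n".join(lines)


# Hypothetical example values.
message = format_notification(
    title="Switch sw-01 down, 2 servers unreachable",
    meaning="web1 and db1 cannot be reached by users",
    root_cause="Switch sw-01 is not responding",
    fix_steps=["Check power and uplink on sw-01", "Fail over to the standby switch"],
    doc_links=["https://example.com/sop/switch-failure"],
    contacts={"Network Admins": "netops@example.com", "Server Admins": "sysops@example.com"},
)
print(message)
```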
How To Do It
You can build a BAT file or a VB, Shell, or Perl script that applies a CASE test with the Boolean operators described above, but you’ll have to build an interface to the database of events. You can even use a well-crafted query to select for the conditions of interest. If you use Longitude, then you can just use the Correlated Event actions to define the multiple conditions, interval, and persistence, and build your notification. Please email me if you have questions about using Correlated Events.
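For the do-it-yourself route, a single query can test several conditions at once. The sketch below uses an in-memory SQLite database with a made-up `events` table; the table, its columns, and the event names are assumptions for illustration and do not reflect the schema of Longitude or any other product.

```python
import sqlite3

# Hypothetical event store: one row per problem event, timestamped in minutes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (source TEXT, event_type TEXT, minute INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("web1", "response_slow", 100),
        ("db1", "cpu_high", 102),
        ("db1", "cpu_high", 104),
        ("router1", "bandwidth_high", 40),
    ],
)

# Must be true: a slow response was seen, and cpu_high persisted (2+ events).
# Must not be true: a switch_down event in the same interval.
QUERY = """
SELECT
    COALESCE(SUM(event_type = 'response_slow'), 0) > 0 AS slow_seen,
    COALESCE(SUM(event_type = 'cpu_high'), 0) >= 2     AS cpu_persisted,
    COALESCE(SUM(event_type = 'switch_down'), 0) = 0   AS switch_ok
FROM events
WHERE minute BETWEEN ? AND ?
"""


def correlated(conn, start_minute, end_minute):
    """True when every condition of the correlated event holds in the interval."""
    row = conn.execute(QUERY, (start_minute, end_minute)).fetchone()
    return all(row)


print(correlated(conn, 95, 105))   # all three conditions hold in this window
print(correlated(conn, 103, 105))  # no slow response in this narrower window
```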
The Keys to Effective SLAs
Service Level Agreements are usually the object of desire, fear, and uncertainty all at the same time. They can be such useful tools that it’s important to demystify them. SLAs are desirable because they provide accountability and timely feedback to managers. They are to be feared when they include factors beyond control or that are poorly aligned with reality. SLAs are commonly approached with a high degree of uncertainty about what to measure and how to report results as an effective tool for all parties. While the ingredients in SLAs are as varied as applications and service providers, all effective SLAs share a few critical characteristics.
Good and Bad SLAs
Let’s start by poking fun at what will be the worst example of an SLA you’ve ever heard of or that I’ve been a party to implementing. I should point out this happened long before I became part of the Heroix team. I was brought in to design and implement a monitoring and reporting regime that supported the SLA between a web hosting provider and a Wall Street firm seeking its first web presence. Considering the big-time clients and huge capital expenditure (100 servers in two data centers), I was expecting a challenging assignment with a highly sophisticated and complex set of monitoring requirements. When I received my copy of the SLA, a single paragraph appendix to a large contract, it had one condition:
- No server shall experience greater than 30% average CPU usage during any rolling hour
You could have knocked me over with a feather. After disbelief, hilarity, and confusion, came concern and agitation. I actually suggested that we provide much more, which would have been included in any basic monitoring regime, and was rebuffed. What’s obviously wrong with this SLA is that the measure of success has no direct connection with actual service delivery to consumers. It was, however, very easy to measure. So the primary rule in creating effective SLAs is:
- Measure things that directly impact service delivery or user experience
Our first rule provides the guiding principle in answering the questions, “What should I measure and why should I measure it?” Of course, the WHY part of the answer should always be “Because it directly impacts service.” Some good examples of WHAT to measure would be:
- Availability of systems and applications
- Success of sessions and transactions
  - Web Pages, DB Queries, etc.
- Response Times where applicable
- Loss of resources critical to service delivery
  - Disk or DB space, Connection or Session limits
When selecting SLA measures it’s important to choose things that you have control over and that can be measured objectively, even if the statistic is as simple as 0 for True and 1 for False, as in the case of whether a required TCP port is accepting connections. Either it is (0) or it isn’t (1). A valuable planning exercise is to picture the data or transaction path, and reserve slots in your SLA for appropriate tests of each potential break point in a service. Using a typical web application example, a consumer connects to a web server, which creates a session on a back end application server, which in turn queries a DB server, ultimately sending a response back to the consumer. In our model the break points are the web, application, and DB servers, plus the network connecting them. By constructing a map of break points to monitor, you place yourself in a position to go beyond simply reporting a service failure by localizing where the service is breaking.
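The TCP port test and the walk along the transaction path can be sketched as below. The 0/1 convention follows the text (0 means the port is accepting connections, 1 means it is not). The function names are illustrative, and the hosts and ports in the usage comment are placeholders.

```python
import socket


def port_status(host, port, timeout=2.0):
    """Return 0 if host:port accepts a TCP connection, 1 if it does not."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 0
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return 1


def first_break_point(path):
    """Test each tier of the transaction path in order.

    path: list of (tier name, host, port) tuples, from the consumer inward.
    Returns the name of the first tier that is not accepting connections,
    or None when every tier responds.
    """
    for tier, host, port in path:
        if port_status(host, port) != 0:
            return tier
    return None


# Usage with placeholder hosts (not executed here):
# path = [("web", "web1.example.com", 443),
#         ("application", "app1.example.com", 8080),
#         ("database", "db1.example.com", 1521)]
# print(first_break_point(path))
```

Because the tiers are tested in path order, the result points at the first failing component, which is the localization described above rather than a bare "service is down" report.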
I’m sure you get why it’s important to localize the point of failure for a service, but it is worth examining the answer. Recall that one of our principal goals is to achieve accountability. That doesn’t just apply to apportioning blame afterwards. It means knowing who owns the component that has failed and should immediately be given the lead to find a solution. A process of discovery always happens as soon as a service failure is detected. In my experience, quickly identifying who in a group of varied specialists responsible for different technologies should own a problem can be unnecessarily time consuming, if you know what I mean… This is especially true if an SLA is poorly designed and the data are ambiguous as to the cause of the failure. In a well designed SLA with data from each break point and each team member seeing the same picture, it’s usually immediately clear who “owns” the problem. You actually facilitate taking ownership of the problem and effecting a solution.
How to Report SLA Data
Designing and implementing the best SLA will be for naught if you fail to build accessible views of the data that can easily be assimilated into a concept of operations. In other words, you have to build that intuitive picture of your transaction path that everyone’s going to share, and put it somewhere everyone can see it quickly and easily. You may have noted that we are discussing using SLA data in the present tense, as in live presentations. Don’t be confused if you expected an SLA to be some tabular historical report to be compared to contract terms and conditions. An effective SLA is all of these things. What’s the point of identifying what can impact service delivery if we don’t use it as an intensive monitor of the health of our critical application? So let’s use the same data to create live presentations of the application’s current state, while also generating historical reports of compliance with key standards.
There are some types of data that do not lend themselves to live reporting. For example, log data that’s collected nightly. Any type of daily or weekly aggregated data should be relegated to historical reporting. That can include both daily detail reports and long term summaries. Any data that is measured at least hourly can be represented effectively in live presentations or dashboards. Remember we want live data to be fresh (last 5-60 minutes) and not have to wait long periods for one component to refresh the picture again. For historical reporting, we’re really using the same data, just querying for longer periods, like weeks, months, and years (ok, daily if you’re feeling nervous or obsessive…).
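The split between live and historical reporting comes down to the collection interval, which can be expressed in a few lines. This sketch assumes each metric has a known collection interval; the function name and labels are illustrative.

```python
from datetime import timedelta


def report_targets(collection_interval):
    """Decide where a metric belongs, based on how often it is measured.

    Data measured at least hourly is fresh enough for a live dashboard.
    All data, whatever its interval, can feed historical reports.
    """
    targets = ["historical"]
    if collection_interval <= timedelta(hours=1):
        targets.insert(0, "live")
    return targets


print(report_targets(timedelta(minutes=5)))  # availability test
print(report_targets(timedelta(hours=1)))    # hourly disk statistics
print(report_targets(timedelta(days=1)))     # nightly log collection
```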
Putting It All Together
A well designed SLA can be a critical tool for managers and technicians. Hopefully the process of automating it in live SLA dashboards and historical reports will actually reduce the workload on administrators. It should dramatically reduce the time normally spent in discovery when service problems arise. The SLA will provide accountability, timely assistance, and a unified picture. It will enable service providers to report proactively how well they are providing service.