
Load & Perf Testing

Performance Blame Game

Performance & Load Testing - Sat, 05/29/2010 - 08:57
Poor Performance - Your Fault?

"Who is to blame for bad application performance?" (http://blog.dynatrace.com/2010/05/28/week-16-who-is-to-blame-for-bad-application-performance) by Alois Reitbauer is an informative look at how developers, system architects, testers, R&D managers, and operations leaders can each play a role in poor software performance.

While pointing the finger is a common way for employees to spend their time, it rarely has much ROI. My experience is that development teams, IT departments, and company executives usually don't play together very well. They don't communicate clearly or frequently with each other. That is only natural: they have their own jobs to do, and their job evaluation (i.e. bonus or raise) isn't measured by collaboration.

To reduce or eliminate the blame game, cultures must change. Everyone has to accept that they play a key role in the performance of the end product. When companies get smarter and create a culture where everyone wants high-performance software applications, much of the blame game will disappear. A bad habit is replaced with a good habit.

Working together throughout the entire product lifecycle is the best way to make the software faster, smoother, more reliable, and more scalable.

My recommendation is to skip the blame game. Go out of your way to drop by a colleague's cubicle and ask how their piece is performing. Offer to help. Encouraging the people around you to think about performance may feel uncomfortable at first, and some may treat you with a bit of skepticism, but keep at it. Soon you will be viewed as the person in the company who is sincerely concerned about speed and scalability.

Would it be so bad to be nicknamed The Performance Czar? I would gladly take that geek name and wear it with pride.
Categories: Load & Perf Testing

Behind the scenes of ASP.NET MVC 2 – Understand the internals to build better apps

With Visual Studio 2010, Microsoft is shipping the next version of the popular ASP.NET MVC Framework with its IDE. A year ago I blogged about my findings when getting my hands on the first version of ASP.NET MVC. The MVC Framework provides really nice features that make it very easy to build web applications on [...]
Categories: Load & Perf Testing

Best Practice Webinar on Proactive Application Performance with Smith Micro on April 28th

Besides blogging and speaking at conferences I often get the chance to co-host a Webinar with one of our customers. This week I am co-hosting with Bill Mar from SmithMicro talking about their Best Practices on Proactive Application Performance. After giving an introduction from the dynaTrace perspective we will hear from Bill why it is [...]
Categories: Load & Perf Testing

Week 14 – Building Your Own Amazon CloudWatch Monitor in 5 Steps

Amazon EC2 offers the CloudWatch service to monitor cloud instances as well as load balancers. While this service comes at some cost ($0.015/hour/instance) it offers useful infrastructure metrics about the performance of your EC2 infrastructure. While there are commercial and free tools out there which provide this service, you might not want to invest in [...]
Categories: Load & Perf Testing

dynaTrace shows how to track AJAX performance with Selenium

BrowserMob - Tue, 05/25/2010 - 11:26
dynaTrace just posted a really nice tutorial showing how to track client-side performance (JavaScript, DOM, render times, etc) automatically using Selenium. They do this by using their super-cool free AJAX edition of their product.
Categories: Load & Perf Testing

Solving problems with SSODB during BizTalk 2009 Configuration

I just spent an hour figuring out why I couldn’t get BizTalk 2009 installed on my test machine. I finally solved it and want to share this information so that you don’t have to waste more time on this problem than necessary. After installing BizTalk you run through the BizTalk Configuration Wizard. Either in [...]
Categories: Load & Perf Testing

How better Caching helps Frankfurt’s Airport Website to handle additional load caused by the Volcano

Along with so many others I am stranded in Europe waiting for my flight back to the United States right now. The Volcano not only impacts flights across Europe but also impacts web sites of airports, airlines and travel agencies around the world. Checking my flight status on Sunday was almost impossible. The website of [...]
Categories: Load & Perf Testing

17 Performance Testing Articles

Performance & Load Testing - Mon, 05/17/2010 - 14:19
Of course there are hundreds or thousands of posts out on the web about performance testing. I thought I would share 17 good sources focused on web application performance testing. Some of these are lists that lead to other excellent posts.

  • Ensuring Web Site Performance – Why, What and How to Measure Automated and Accurately (http://blog.dynatrace.com/2010/01/13/ensuring-web-site-performance-why-what-and-how-to-measure-automated-and-accurately/)
  • Performance Testing Articles (http://performance-testing.org/content/performance-testing-articles)
  • Difference in perf and load testing - part 1 (http://agiletesting.blogspot.com/2005/02/performance-vs-load-vs-stress-testing.html)
  • Difference in perf and load testing - part 2 (http://agiletesting.blogspot.com/2005/04/more-on-performance-vs-load-testing.html)
  • FAQ - Performance & Load Testing on Software QA Forums (http://www.sqaforums.com/showflat.php?Cat=0&Number=41861&an=0&page=0)
  • Performance testing: Does your Web site make the grade? on TechRepublic (http://articles.techrepublic.com.com/5100-10878_11-1037784.html)
  • How to Test and Improve Website Performance by Bob Rankin (http://askbobrankin.com/website_performance_testing.html)
  • How to Make Developers Write Performance Tests (http://blog.dynatrace.com/2010/03/11/week-6-how-to-make-developers-write-performance-tests/)
  • LogiGear links and articles (http://www.logigear.com/resource-center/software-testing-articles-by-others/157-loadperformance-articles.html)
  • Nuts and bolts of performance testing by Abhinav Vaid on StickyMinds.com (http://www.stickyminds.com/sitewide.asp?ObjectId=15493&Function=DETAILBROWSE&ObjectType=ART&sqry=*Z%28SM%29*J%28MIXED%29*R%28relevance%29*K%28simplesite%29*F%28%22performance+testing%22%29*&sidx=0&sopp=10&sitewide.asp?sid=1&sqry=*Z%28SM%29*J%28MIXED%29*R%28relevance%29*K%28simplesite%29*F%28%22performance+testing%22%29*&sidx=0&sopp=10)
  • Web performance test using Visual Studio (http://www.dotnetfunda.com/articles/article901-web-performance-test-using-visual-studio-part-i-.aspx)
  • Performance Testing In Agile Environments by HP (http://www.bitpipe.com/detail/RES/1265721149_745.html)
  • Rapid Bottleneck Identification - A Better Way to do Load Testing by Oracle (http://www.bitpipe.com/detail/RES/1255388747_681.html)
  • Serverside.com looks at performance testing of Java apps (http://www.theserverside.com/news/1364725/Tips-on-Performance-Testing-and-Optimization)
  • Types of performance testing by TestingGeek.com (http://www.testinggeek.com/index.php/testing-articles/132-performance-testing-types)
  • How to Test and Improve Website Performance by Bob Rankin (http://askbobrankin.com/website_performance_testing.html)
  • Intro on IBM's site (http://www.ibm.com/developerworks/rational/library/4169.html)
Categories: Load & Perf Testing

Monitoring Maintenance Windows

BrowserMob - Mon, 05/17/2010 - 12:17
When setting up monitoring jobs, there are often predictable time periods in which you want to change the behavior of a script or prevent it from running at all, without having to manually stop/start the monitoring job each time. For instance, you might want to prevent errors and alert emails during routine maintenance windows, or [...]
Categories: Load & Perf Testing

Update dynaTrace AJAX Edition to get Rendering Times on ALL versions of Internet Explorer

Analyzing Rendering Activity is one of the many features of the FREE dynaTrace AJAX Edition. Alois wrote a nice blog article that explains the internals of IE’s Rendering Engine and how rendering is analyzed with the AJAX Edition. Different patch levels of IE may cause problems In order for the AJAX Edition to capture Drawing, Layouting and [...]
Categories: Load & Perf Testing

Tracking a response time problem

Performance & Open Source - Thu, 04/08/2010 - 17:56

Recently, an engineer came to me puzzled that the response times of a performance benchmark she was running were increasing. She had already looked at the usual trouble spots: resource utilization on the app and database systems, database statistics, application server tunables, the network stack, etc. I asked her about the CPU metrics on the load driver systems (the machines which drive the load). Usually, when I ask this question, the answer is “I don’t know. Let me find out and get back to you”. But this engineer had looked at that as well. “It isn’t a problem. There is plenty of CPU left – I have 30% idle”.

Aha! I had spotted the problem. When we run benchmarks, we tend to squeeze every bit of performance we can out of the systems. This means running the servers as close to 100% utilized as possible. This mantra is sometimes carried over to the load driver systems as well. Unfortunately, that can result in severe performance degradation. Here’s why.

The load driver systems emulate users and generate requests to the system under test. They receive the responses and measure and record response times. A typical driver emulates hundreds to thousands of users, and each emulated user competes for system resources. Now suppose an emulated user has issued a read request to read the response from the server. It is very likely that this thread will be context switched out by the operating system, as there are so many other users it needs to serve. Depending on the number of CPUs on the system and the load, the original emulated user thread may get to execute only after a considerable delay and consequently record a much larger response time than the server actually delivered. My rule of thumb is never to run the load generator systems more than 50% busy if the application is latency sensitive. In this particular case, the system was already 70% utilized.

Sure enough – when a new load driver system was added and the performance tests re-run, all the response time criteria passed and the engineer could continue scaling the benchmark.

Moral of the story – don’t forget to monitor your load driver systems and don’t be complacent if their utilization starts climbing above 50%.
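
To make the rule of thumb concrete, here is a minimal Python sketch. It assumes you already collect CPU idle percentages from the load driver (for example, the idle column reported by a tool such as vmstat or sar); the function names and the sample values are made up for illustration.

```python
from statistics import mean

# Rule of thumb from the post: keep load drivers at or below 50% busy
# when the application under test is latency sensitive.
MAX_DRIVER_BUSY_PCT = 50.0


def driver_busy_pct(idle_samples):
    """Average CPU utilization (percent busy) from a list of idle percentages."""
    return 100.0 - mean(idle_samples)


def driver_is_overloaded(idle_samples, limit=MAX_DRIVER_BUSY_PCT):
    """True when the load driver is busier than the limit allows."""
    return driver_busy_pct(idle_samples) > limit


# Hypothetical samples matching the story: about 30% idle, so about 70% busy.
samples = [31.0, 29.0, 30.0]
print(driver_busy_pct(samples))       # 70.0
print(driver_is_overloaded(samples))  # True
```

A check like this could run alongside the benchmark so that a busy driver is flagged before anyone starts tuning the servers.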


Categories: Load & Perf Testing

Correlating Events to Recognize Problems

Heroix Perf Monitoring - Sun, 11/15/2009 - 18:55

Every engineer and manager who receives alerts from automated monitoring systems can relate to both the critical need they fill and to their often annoying shortcomings. I’m not just referring to a situation where the monitoring has recently been installed, and you haven’t tuned the default thresholds to limit notification to actionable events. The nature of monitoring and alerting, even with the most sophisticated programs, is that problem events are based on very narrow criteria, like a server’s CPU Load, or a router’s bandwidth consumption, or some application’s .NET errors. It’s good to know about these specific problems. But if they are related in a complex problem, that diagnosis can easily be missed when these separate events are surrounded by hundreds or thousands of other random problem events. Recognizing the complex problem is further complicated because the events come from different sources: servers, network appliances, and applications. Of course most real world applications depend on distributed environments for reliable service delivery, and problems occur whose symptoms span multiple devices and programs. If you want to get ahead of the curve and immediately recognize and fix complex problems, then you need to start correlating multiple events so that you can send intelligent notifications that describe the conditions and fixes for complex problems.

What’s The Problem?

Events can be misleading. Consider an example where several servers are behind a switch. We’ll further assume that we are monitoring the availability of the switch and the servers. When the switch goes down, what happens? A ton of notifications are sent alerting everyone that all the servers are down, which is effectively true, but isn’t really the problem. Of course eventually the switch down alert comes in with all the server down messages. This is a simple example, where most good engineers will immediately diagnose the problem when they read the switch down alert, but a lot of messages were sent to notify you of the true problem. I always cringe when I know my boss is getting flooded with email that the sky is falling. Now, what if we use some logic in our notification that only sends out server down messages when the switch is OK, and suppresses all the server down messages when the switch goes down? That would be useful. Even better, let’s configure the switch down message to inform recipients with the list of servers that are unavailable due to the switch being down.
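
The suppression logic for the switch example might look like the following Python sketch. The function name, the message wording, and the device names are assumptions for illustration, not the behavior of any particular monitoring product.

```python
def build_notifications(switch_name, switch_up, server_status):
    """Return alert messages, suppressing server alerts when the switch is down.

    server_status maps each server behind the switch to True (reachable)
    or False (unreachable).
    """
    if not switch_up:
        # One message for the real problem, listing every server behind the switch.
        affected = ", ".join(sorted(server_status))
        return [f"Switch {switch_name} is DOWN. Unavailable behind it: {affected}"]

    # Switch is OK, so any server that is down is a problem in its own right.
    return [
        f"Server {name} is DOWN"
        for name, up in sorted(server_status.items())
        if not up
    ]


# Hypothetical devices: the switch is down, so both servers look down too.
print(build_notifications("sw1", False, {"web1": False, "db1": False}))
```

With the switch down, the recipient gets a single message instead of one per server; with the switch up, only the servers that are really down are reported.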

The switch example is easy to understand. Think about how useful correlating much more complex events can be, especially when critical information is included in the notification. Fixing a problem is a lot easier if the problem email or text message includes a concise description of the multiple conditions, what the root cause is, how to fix it, and who to communicate with for help. I’ve even worked with customers to include links to their own online SOP or Help Desk documentation. In my experience with problem email alerts, less is always better, as long as you always get all the notification you need. Correlation of events is the only way to simultaneously reduce the email count and dramatically improve the quality of information in alerts.

Logically Speaking

A Correlated Event is going to have multiple conditions. Sometimes all conditions must be true. We also want to be able to recognize when some conditions are true, while others are definitely not true. We may even want to specify that some conditions must be true, while others may be true, and some others must not be true (a really complex event…). We’re really only using three Booleans, AND, OR, and NOT, where we group the logically similar conditions. The logical order of listing the conditions should be:

  1. Conditions that Must Be True
  2. Conditions that May Be True
  3. Conditions that Must Not Be True

The number of conditions can vary, but most complex problems can be recognized based on between 2 and 10 conditions. More may be needed when deducing problems along an extensive network path or across multiple application servers, for example.
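
The three groups of conditions can be sketched in a few lines of Python. This is only one reading of the logic: it treats the "may be true" group as an OR (at least one of them holds, when any are listed), and the condition names are invented for the example.

```python
def correlated_event(active, must=(), may=(), must_not=()):
    """Evaluate a correlated event against the currently active conditions.

    must     -> AND: every listed condition is active
    may      -> OR:  at least one listed condition is active (skipped if empty)
    must_not -> NOT: no listed condition is active
    """
    active = set(active)
    all_required = all(c in active for c in must)
    any_optional = not may or any(c in active for c in may)
    none_forbidden = not any(c in active for c in must_not)
    return all_required and any_optional and none_forbidden


# Hypothetical condition names for a slow web application.
active_now = {"web_response_slow", "db_cpu_high"}
print(correlated_event(
    active_now,
    must=["web_response_slow"],
    may=["db_cpu_high", "db_disk_full"],
    must_not=["switch_down"],
))  # True
```

If your definition of "may be true" is purely informational (the condition is reported but never required), drop the `any_optional` term.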

I find it useful to include some synchronization and the awareness of persistence when correlating events. First, all conditions might not be based on measurements with the same interval. For example, disk statistics are typically collected hourly, whereas availability is tested every 1 to 5 minutes. Many other conditions will have intervals between these two extremes. I may want to specify that all conditions must happen within a specified interval. It’s also really valuable to be able to specify that X number of events must have happened in the interval. Picture network latency that gets flaky sometimes, but if it persists for X amount of time then it’s a problem.
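
A persistence test of the "X events in the interval" kind can be as small as the Python sketch below. It assumes event times are available as plain numbers (seconds here); the function name and the sample timestamps are made up.

```python
def persisted(event_times, now, window_seconds, min_events):
    """True when at least min_events occurred in the last window_seconds."""
    recent = [t for t in event_times if now - window_seconds <= t <= now]
    return len(recent) >= min_events


# Hypothetical latency spikes, in seconds since the start of monitoring.
spikes = [0, 200, 250, 290]

# Three spikes fall inside the last 120 seconds, so this counts as persistent.
print(persisted(spikes, now=300, window_seconds=120, min_events=3))  # True
# A single flaky moment long ago does not.
print(persisted([0], now=300, window_seconds=120, min_events=1))     # False
```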

Intelligent Notification

Once we recognize a complex problem based on the presence or lack of specific conditions we’re in a position to provide effective notification that will maximize the probability that a problem is fixed as quickly as possible. You’re setting up yourself and those around you for success. Here’s how I recommend configuring Correlated Event notification:

  1. Describe the condition set
    1. what it means
    2. what’s the root cause
  2. Describe the procedure to fix the problem
    1. Links to documentation
  3. List the interested parties to contact for help
    1. ISP Contacts
    2. Network Admins
    3. Server Admins
    4. Application Support
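
One way to keep every notification in that shape is to build it from a small template. The Python sketch below is an illustration only: the class, its field names, and the example.com link and address are placeholders, not part of any product.

```python
from dataclasses import dataclass, field


@dataclass
class CorrelatedAlert:
    meaning: str        # what the condition set means
    root_cause: str     # what the root cause is
    fix: str            # procedure to fix the problem
    doc_links: list = field(default_factory=list)  # links to documentation
    contacts: dict = field(default_factory=dict)   # role -> who to contact

    def render(self):
        """Return the notification body as plain text, one item per line."""
        lines = [
            "What it means: " + self.meaning,
            "Root cause: " + self.root_cause,
            "How to fix: " + self.fix,
        ]
        lines += ["Documentation: " + link for link in self.doc_links]
        lines += [f"Contact ({role}): {who}" for role, who in self.contacts.items()]
        return "\n".join(lines)


# Placeholder content for the switch example.
alert = CorrelatedAlert(
    meaning="All servers behind switch sw1 are unreachable",
    root_cause="Switch sw1 is down",
    fix="Check power and uplink on sw1, then fail over to the spare",
    doc_links=["http://example.com/sop/switch-down"],
    contacts={"Network Admins": "netops@example.com"},
)
print(alert.render())
```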

How To Do It

You can build a BAT file or a VB, Shell, or Perl script that applies a CASE test using the Booleans described above, but you’ll have to build an interface to the database of events. You can even use a well crafted query to select for the conditions of interest. If you use Longitude, then you can just use the Correlated Event actions to define the multiple conditions, interval, and persistence, and build your notification. Please email me if you have questions about using Correlated Events.
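
As a rough illustration of the query approach, here is a Python sketch against an in-memory SQLite database. The `events` table, its columns, and the condition names are all hypothetical; a real monitoring database will have its own schema.

```python
import sqlite3

# Hypothetical event store: where the event came from, what kind, and when.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (source TEXT, kind TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("web1", "response_slow", 100),
        ("db1", "cpu_high", 130),
        ("web1", "response_slow", 900),
    ],
)


def correlated(conn, since, until):
    """True when response_slow AND cpu_high occurred in the window
    and switch_down did NOT."""
    row = conn.execute(
        """
        SELECT SUM(kind = 'response_slow') > 0
           AND SUM(kind = 'cpu_high') > 0
           AND SUM(kind = 'switch_down') = 0
        FROM events
        WHERE ts BETWEEN ? AND ?
        """,
        (since, until),
    ).fetchone()
    # An empty window yields NULL, which is treated as "not correlated".
    return bool(row[0])


print(correlated(conn, 0, 200))     # True: both conditions, no switch_down
print(correlated(conn, 800, 1000))  # False: cpu_high is missing
```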

 

Categories: Load & Perf Testing

The keys to Effective SLAs

Heroix Perf Monitoring - Sun, 11/08/2009 - 22:50

Service Level Agreements are usually the object of desire, fear, and uncertainty all at the same time. They can be such useful tools that it’s important to demystify them. SLAs are desirable because they provide accountability and timely feedback to managers. They are to be feared when they include factors beyond control or that are poorly aligned with reality. SLAs are commonly approached with a high degree of uncertainty about what to measure and how to report results as an effective tool for all parties. While the ingredients in SLAs are as varied as applications and service providers, all effective SLAs share a few critical characteristics.

Good and Bad SLAs

Let’s start by poking fun at what will be the worst example of an SLA you’ve ever heard of or that I’ve been a party to implementing. I should point out this happened long before I became part of the Heroix team. I was brought in to design and implement a monitoring and reporting regime that supported the SLA between a web hosting provider and a Wall Street firm seeking its first web presence. Considering the big-time clients and huge capital expenditure (100 servers in two data centers), I was expecting a challenging assignment with a highly sophisticated and complex set of monitoring requirements. When I received my copy of the SLA, a single paragraph appendix to a large contract, it had one condition:

  • No server shall experience greater than 30% average CPU usage during any rolling hour

You could have knocked me over with a feather. After disbelief, hilarity, and confusion, came concern and agitation. I actually suggested that we provide much more, which would have been included in any basic monitoring regime, and was rebuffed. What’s obviously wrong with this SLA is that the measure of success has no direct connection with actual service delivery to consumers. It was, however, very easy to measure. So the primary rule in creating effective SLAs is:

  1. Measure things that directly impact service delivery or user experience

Our first rule provides the guiding principle in answering the questions, “What should I measure and why should I measure it?” Of course, the WHY part of the answer should always be “Because it directly impacts service.” Some good examples of WHAT to measure would be:

  • Availability of systems and applications
  • Success of sessions and transactions
    • Web Pages, DB Queries, etc.
  • Response Times where applicable
  • Loss of resources critical to service delivery
    • Disk or DB space, Connection or Session limits

When selecting SLA measures it’s important to choose things that you have control over and that can be measured objectively, even if the statistic is as simple as 0 for True and 1 for False, as in the case of whether a required TCP port is accepting connections. Either it is (0) or it isn’t (1). A valuable planning exercise is to picture the data or transaction path, and reserve slots in your SLA for appropriate tests of each potential break point in a service. Using a typical web application example, a consumer connects to a web server, which creates a session on a back end application server, which in turn queries a DB server, ultimately sending a response back to the consumer. In our model the break points are the web, application, and DB servers, plus the network connecting them. By constructing a map of break points to monitor, you place yourself in a position to go beyond simply reporting a service failure by localizing where the service is breaking.
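
A break point map for the web application example could be probed with something like the Python sketch below. It keeps the 0 / 1 convention used above; the host names and ports are placeholders, and walking the path in transaction order is just one simple way to localize a failure.

```python
import socket


def tcp_port_status(host, port, timeout=3.0):
    """Return 0 if the TCP port accepts a connection, 1 if it does not."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 0
    except OSError:
        return 1


def first_failure(path, probe=tcp_port_status):
    """Walk the break points in transaction order and return the first that fails.

    path is a list of (name, host, port) tuples. Returns the failing name,
    or None when every break point accepts connections.
    """
    for name, host, port in path:
        if probe(host, port) != 0:
            return name
    return None


# Placeholder hosts for the web -> application -> DB path described above.
transaction_path = [
    ("web server", "web1.example.com", 80),
    ("application server", "app1.example.com", 8080),
    ("DB server", "db1.example.com", 5432),
]
# first_failure(transaction_path) would name the first tier that is not listening.
```

The `probe` argument is there so that a different test (an HTTP request, a DB query) can be swapped in for tiers where an open port alone does not prove the service is healthy.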

Although I’m sure you get why it’s important to localize the point of failure for a service, it is worth examining the answer. Recall that one of our principal goals is to achieve accountability. That doesn’t just apply to apportioning blame afterwards. It means knowing who owns the component that has failed and should immediately be given the lead to find a solution. A process of discovery always happens as soon as a service failure is detected. In my experience, quickly identifying who in a group of varied specialists responsible for different technologies should own a problem can be unnecessarily time consuming, if you know what I mean… This is especially true if an SLA is poorly designed and the data are ambiguous as to the cause of the failure. In a well designed SLA with data from each break point and each team member seeing the same picture, it’s usually immediately clear who “owns” the problem. You actually facilitate taking ownership of the problem and effecting a solution.

How to Report SLA Data

Designing and implementing the best SLA will be for naught if you fail to build accessible views of the data that can easily be assimilated into a concept of operations. In other words, you have to build that intuitive picture of your transaction path that everyone’s going to share, and put it somewhere everyone can see it quickly and easily. You may have noted that we are discussing using SLA data in the present tense, as in live presentations. Don’t be confused if you expected an SLA to be some tabular historical report to be compared to contract terms and conditions. An effective SLA is all of these things. What’s the point of identifying what can impact service delivery if we don’t use it as an intensive monitor of the health of our critical application? So let’s use the same data to create live presentations of the application’s current state, while also generating historical reports of compliance with key standards.

There are some types of data that do not lend themselves to live reporting, for example, log data that’s collected nightly. Any type of daily or weekly aggregated data should be relegated to historical reporting. That can include both daily detail reports and long term summaries. Any data that is measured at least hourly can be represented effectively in live presentations or dashboards. Remember we want live data to be fresh (last 5-60 minutes) and not have to wait long periods for one component to refresh the picture again. For historical reporting, we’re really using the same data, just querying for longer periods, like weeks, months, and years (ok, daily if you’re feeling nervous or obsessive…).

Putting It All Together

A well designed SLA can be a critical tool for managers and technicians. Hopefully the process of automating it in live SLA dashboards and historical reports will actually reduce the workload on administrators. It should dramatically reduce the time normally spent in discovery when service problems arise. The SLA will provide accountability, timely assistance, and a unified picture. It will enable service providers to report proactively how well they are providing service.

 

Categories: Load & Perf Testing