Perf Planet

News and views from the web performance blogosphere

Network App Performance: Application Deceleration Controller & DC RUM

Thu, 10/08/2015 - 12:06

Not too long ago I had an opportunity to work with a customer who was experiencing performance problems with their web-based HR application. Users at the headquarters location – about 30 milliseconds away from the data center – would occasionally experience page load times of 10 or 15 seconds – instead of the normal 2 […]

The post Network App Performance: Application Deceleration Controller & DC RUM appeared first on Dynatrace APM Blog.

Stronger JavaScript?

Thu, 10/08/2015 - 09:23

The V8 team (V8 being the JavaScript engine that powers Chrome, Opera, Node.js, MongoDB, and more) is moving forward with an experiment in defining a stronger version of JavaScript: one that ensures the code being run is behaving well, and that introduces run-time typing based on TypeScript’s type annotations. V8’s motivation is always performance, and a more stringent subset of ECMAScript would obviously allow the team to tune the engine for better performance, but are there other benefits?

We don’t need no stinking prologs!

Part of the proposal adds another prolog, "use strong";, in addition to the current "use strict";.

Many developers, myself included, are confused about the benefits of strict mode in JavaScript. "use strict"; has been around for a while, and I incorrectly assumed that it helped improve performance, but modern JavaScript engines don’t actually need you to assert "use strict"; to streamline your code. Essentially, it just keeps you from doing things you shouldn’t be doing anyway, which should really be the job of a linting tool such as JSHint, JSCS, or ESLint. That said, some of the things blocked by strict mode would make it impossible for the JavaScript engine to fully optimise your code, so there is a potential performance benefit to using strict mode, though “should I ship with it?” is a debate that has been had ad nauseam. Either way, strict mode doesn’t address some of the most common ways of making your code unoptimisable.

Build Stronger and Smarter

Strong mode would automatically imply strict mode plus:

  • Removing support for var, instead using only let or const
  • No use before declaration
  • Accessing missing properties throws
  • Strong Objects are not extensible
  • class prototypes are frozen
  • instances are sealed
  • Arrays are non-sparse
  • arguments is eliminated
  • calling with too few arguments throws
  • No implicit coercion

These are potentially fundamental changes to a lot of source code, but there is nothing in this list that I personally disagree with, given the full benefits of ES6 and TypeScript. In many ways, strong mode forces you to move your code towards the ES6/ES2015 standard. For example, if you need a sparse array, you should be using a Map instead. If you want to mutate your classes after you have declared them, you are being illogical, and you should refactor your code instead of plastering over the cracks.
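
As a small sketch of that migration (the names here are illustrative), compare a pre-ES6 sparse-array habit with the Map-based approach strong mode would push you towards:

```javascript
// Pre-ES6 habits that strong mode would reject:
var scores = [];    // `var` is disallowed in strong mode
scores[10000] = 99; // sparse arrays are disallowed too

// The ES6/ES2015 equivalents strong mode nudges you towards:
const scoreByIndex = new Map(); // sparse data belongs in a Map
scoreByIndex.set(10000, 99);
console.log(scoreByIndex.get(10000)); // 99

let total = 0; // block-scoped `let` replaces `var`
for (const value of scoreByIndex.values()) {
  total += value;
}
console.log(total); // 99
```

The Map never allocates or iterates the ten thousand holes the sparse array implies, which is exactly the kind of guarantee an optimising compiler can exploit.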

The second part of the strong mode proposal is run-time types based on TypeScript’s type annotations. This is incrementally opt-in, in that if you assert a type, the compiler will enforce it at run-time. The TypeScript team has started looking at what it might take for the compiler to emit your TypeScript code in strong mode. There are some challenges, in that what the V8 team proposes and the semantics of TypeScript don’t quite align, but hey, we all know how to iterate, don’t we?

Yes, but what is in it for me?

Ultimately, like strict mode, strong mode just keeps you from doing silly things at run-time. Is there benefit in that? Well, yes: we all solve problems the wrong way now and again, and when we do that unintentionally in fundamental pieces of our code, we can end up breaking things in strange ways that are hard to trace, causing unintended consequences for other developers, or killing the performance of our code. The other thing about strong mode is that it is a contract with the compiler about the intent of our code. JavaScript is littered with flexibility, and keeping all those possible options open makes it very challenging for the compiler to optimise. With strong mode specifically, we are agreeing to give up a lot of the bad parts of pre-ES6/ES2015 JavaScript that force the compiler to hold options open, just in case we do something silly.

Many of the things that strong mode proposes are already patterns we should be (or could be) following. So why have the browser enforce them at run-time? Why not just have a linting tool check our code, given that code throwing an error at run-time is never a good thing, since it is usually being run by someone who didn’t write it? Well, ultimately the "use strong" prolog is a contract between the developer and the JIT compiler that says “I promise not to do anything silly”, and in exchange the JIT compiler can run your code through an optimised code path that performs faster. Consider not having to manage the arguments variable on every function call, or not being entirely sure whether someone is going to redeclare a var on you. These are very tricky things for a run-time compiler to manage, and tricky means slower.

So what is ultimately in it for the developer is better, faster, stable and more maintainable code. That can’t be a bad thing, can it?

Ok, you sold me, I want it NOW

You can experiment with Chrome Canary, Traceur, and Node.js (io.js 2.3+, Node.js 4+). In Canary, you will need to start it with --js-flags="--strong-mode"; with Traceur, you will need to use the strong mode branch; and with Node.js, you will need to use the --use-strong flag.

Canary Chrome running in strong mode.

The modern web is cool, but what do I do now?

If you’re interested in learning more about ES6, TypeScript, and building strong code, we are offering a four-day online ES6 and TypeScript workshop, and we also help organizations with their approach to adopting ES6 and TypeScript through our expert JavaScript support plans. If you’re not sure where to start with ES6 adoption, whether TypeScript is right for you, or when it is right to adopt other experimental web technologies, SitePen can help! Contact us for a free 30-minute consultation.

Software Test Professionals – Conference Highlights from STPCon 2015

Wed, 10/07/2015 - 06:25

This week the Software Testing World moved to the US East Coast – just outside of Boston, MA a group of testers is discussing the latest and greatest at STPCon 2015. I was lucky enough to get 3 speaking slots this time (1 workshop, 2 breakouts) to share my thoughts on Performance, DevOps & Agile […]

The post Software Test Professionals – Conference Highlights from STPCon 2015 appeared first on Dynatrace APM Blog.

Performance Testing and APM part 2 – New Relic

Fri, 10/02/2015 - 09:32
In this episode, Ian Molyneaux discusses New Relic APM as it pertains to performance testing.

Network Virtualization and the Software Defined Data Center: Part 2

Thu, 10/01/2015 - 07:19

In last week’s post about Network Virtualization and the Software Defined Data Center, I pointed out that whatever approach you choose to take towards your modern data center network – SDN, NFV, a blend – an element of network virtualization will become fundamental to its architecture. I concluded by pointing out that the requirement to […]

The post Network Virtualization and the Software Defined Data Center: Part 2 appeared first on Dynatrace APM Blog.

TrafficDefender helps Poundworld Plus website stay online during BBC One programme

Thu, 10/01/2015 - 04:00
BBC One's "Pound Shop Wars" drove so much traffic to the ecommerce sites of the featured shops that it risked overloading the servers and bringing the sites down. Thankfully for Poundworld Plus, they had TrafficDefender at the ready.

Born to Test – Conference Highlights from StarWest 2015

Wed, 09/30/2015 - 08:39

This week the Software Testing World is gathering at StarWest 2015 in Anaheim, CA (yeah – that’s where Disneyland is) to share the latest best practices when it comes to Software Quality, Testing, Continuous Testing, DevOps, … and all the other cool words floating around. I am a big fan of testers – that’s actually […]

The post Born to Test – Conference Highlights from StarWest 2015 appeared first on Dynatrace APM Blog.

Another Facebook Outage Causes Ripple Effects Around the Web

Tue, 09/29/2015 - 09:09

Yesterday afternoon, beginning at around 3 pm EST, Facebook started experiencing intermittent outages around the globe for the second time this week due to what the company called a “configuration issue.” And over the course of the next 90 minutes, as the outages became increasingly prevalent, users flocked to a variety of other social networks to complain about the problem.

But as we’ve seen many times before, the effects were not limited to people who were unable to log directly onto Facebook. Due to the social media giant’s increasingly large footprint across the web, the availability issues sent ripple effects far and wide, felt by many different sites that use Facebook for login capabilities, chat features, and ad serving.

As you can see, the webpage load times for a wide range of retail sites were directly impacted by Facebook elements failing to load. In some cases this was due to a tracking pixel that experienced connection failures; others had the problem compounded by a second connect.facebook.net request later in the rendering. Most of the sites that we inspected behaved that way, while others had Facebook pixels completely blocking the onload event from occurring, potentially resulting in a poor user experience: end users saw the spinning wheel (or other signs that the page was still loading), and the loading of page assets was potentially delayed. And from a business standpoint, third-party tags like analytics and profiling might have been delayed, or not loaded at all.

As such, the degree of impact that the Facebook outage had on other sites is a direct result of how those sites load the Facebook tags. Any third-party widget, such as social sharing, commenting, etc., should always be loaded asynchronously to avoid it becoming a single point of failure (SPOF) that blocks the other elements on the page from loading as well. However, even that may not always be enough to protect you, as a tag that’s loaded asynchronously can sometimes still block the onload event from occurring, depending on the tag’s vendor.
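
For illustration, a common way to load a third-party script asynchronously looks something like the snippet below (the helper name is mine; connect.facebook.net/en_US/sdk.js is Facebook's SDK URL, but any vendor script fits the pattern). Note that, as mentioned above, even this does not guarantee the tag cannot delay the onload event:

```javascript
// Generic async injector for a third-party widget script.
function loadThirdPartyAsync(src, onError) {
  var script = document.createElement("script");
  script.src = src;
  script.async = true;      // don't block HTML parsing while downloading
  script.onerror = onError; // degrade gracefully if the vendor is down
  var first = document.getElementsByTagName("script")[0];
  first.parentNode.insertBefore(script, first);
}

if (typeof document !== "undefined") {
  loadThirdPartyAsync("//connect.facebook.net/en_US/sdk.js", function () {
    console.warn("Facebook SDK failed to load; continuing without it");
  });
}
```

The onerror callback is the important part here: when the third party has an outage, the page keeps working instead of hanging on the missing resource.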

That’s why the most important step to take is protecting yourself; any site that contains elements hosted by Facebook (or any other third party provider for that matter) should have strict Service Level Agreements in place to protect any revenue streams that might be negatively affected by a poor-performing element on the page. All third parties – whether they’re data centers, SaaS tools, ad serving networks, or social media platforms – have to be held accountable for their performance through legally binding contracts.

Adding another wrinkle to this story is that Facebook’s rough week comes at the same time that it’s positioning itself to be a major player in the online advertising industry. With the rise of adblock tools, Facebook is uniquely positioned to grab a major share of the ad serving market through its news feed ads as well as the expansion of Instant Articles. And while a few hours of downtime in a week is highly unlikely to have a major impact on Facebook’s reputation, it’s still an unfortunate piece of timing for the social site to experience significant performance problems while simultaneously trying to show that it can deliver a user experience consistently superior to that of its competitors in these industries.

The post Another Facebook Outage Causes Ripple Effects Around the Web appeared first on Catchpoint's Blog.

On Our Way to the Fall Classic with Digital Performance Management

Tue, 09/29/2015 - 06:53

Several weeks ago we had a look at Fantasy Football site performance as football season kicked off.  Autumn is also the time to watch the Major League Baseball pennant races. Baseball is particularly exciting for me this year as the Blue Jays are leading the AL East just a few games up on NY. It reminds me […]

The post On Our Way to the Fall Classic with Digital Performance Management appeared first on Dynatrace APM Blog.

I’m not worried about you. It’s the other guy.

Mon, 09/28/2015 - 07:45


My teenagers are getting that look in their eyes. No, not that look. The look that says, “Dad, I really want to learn how to drive!” So, I’ve started teaching them, just a little at a time, so that when that day comes, they’ll be absolutely as prepared as possible to survive a simple drive to the grocery store with their white-knuckled mother in the passenger seat.

Part of the preparation for this event is my expression of trust in my kids. I am confident that I will do everything in my power to prepare them to drive safely, effectively, and properly, and I tell them so: “I’m not worried about you. I’m going to prepare you. I’m worried about the other drivers who won’t perform as expected.”

It’s not you. It’s the other guy.

Image courtesy of Sarah Joy/Flickr

That, unfortunately, is how web performance optimization professionals are now forced to approach their own performance when they know that there will be third-party ads served up on their carefully crafted pages. I know how to prepare my kids to drive, and I know what to do the first time they drive out of the driveway without a white-knuckled parent at their side (pray & cry!), but what can you do to prevent third-party ads from ruining your webperf? Here are 5 suggestions.

  1. Get a regular Zoompf Free Scan or use webpagetest.org to determine exactly how each element on your pages is performing, or not performing as the case may be. You can easily make this task a big CYA, but it should really be about total performance, including third party ads. There’s a reason that third party ads are being served up on your web site. Hint: it sounds like “honey”.
  2. Measure the actual performance of those ads to determine if having fewer but better-performing (e.g., faster-loading) ads would help the page’s overall performance. Here, we don’t mean only the loading speed of the ads, but also the click-through rate of each ad. If nobody is clicking on an ad, should it even be there, slowing down your web page?
  3. Involve company leadership to determine what the goals of the page or site are. If the #1 goal is absolutely 100% ad revenue, then you may be fighting a tough battle. However, you have the knowledge that, if a page loads really, really slowly, it doesn’t matter what the goal is because it won’t be achieved. Visitors will simply leave. If the top goal is user experience, which always includes great performance, then you may have the ability to get rid of some of those ads, or at least optimize their performance as much as possible.
  4. Test using a least common denominator approach, rather than testing in a vacuum. What we mean by this is that you should test using the worst possible functional devices you can get your hands on, and see how your pages/sites perform on those old, slow devices. That means mobile devices, too! Check Google Analytics (or whatever web analytics package you use) and determine what percentage of your visitors are using a mobile device, and then test for webperf on old versions of those devices.
  5. Experiment with fewer ads, bigger ads, more ads, different ad types, different third party services, etc. The digital marketing experience is about iteration towards the best possible solution. Web performance can be approached in much the same way.

Your kids – your website is your baby, right? – can be really well prepared to perform optimally, but you cannot predict what that other driver, the third-party ad, will do. You can, however, test those third-party ads and make sure they are doing their best.

If you are interested in making sure you are protected and prepared against all types of performance issues, then you will love Zoompf. We have a number of free tools that help you detect and correct front-end performance issues on your website: check out our free report to analyze your website for common performance problems, and if you like the results consider signing up for our free alerts to get notified when changes to your site slow down your performance.

The post I’m not worried about you. It’s the other guy. appeared first on Zoompf Web Performance.

Kick-Start Continuous Monitoring with Chef, Ansible and Puppet

Mon, 09/28/2015 - 05:55

It’s been more than six months since I wrote about Top DevOps Tools We Love, and many great things have happened on the topic since then. Today, I am proud to announce the immediate availability of Chef, Ansible and Puppet scripts for automated deployments of our Dynatrace Application Monitoring solution into development, test and production […]

The post Kick-Start Continuous Monitoring with Chef, Ansible and Puppet appeared first on Dynatrace APM Blog.

Test Your Site Faster with Bulk Performance Tests

Fri, 09/25/2015 - 07:43


Recently we deployed a major update to the Zoompf scanning system that uses queue-based performance jobs to scale the number of performance scans our system can handle. As a benefit of this new architecture, I’m excited to announce two great new features that can greatly simplify your performance testing workflow.

Bulk Run Performance Tests

If you log in to your Zoompf account and select View Results, you’ll notice a new button on the toolbar called Run Test.

As you can guess, this new option allows you to multi-select existing performance tests to run in bulk. It doesn’t sound like much, but this button represents a pretty significant update to the Zoompf platform. Where before you could only run one test at a time manually (waiting 30 seconds for each to complete), now you can queue up multiple tests to run in the background, then come back later to view the results when they complete. This should be a huge time saver when running multiple scenarios at once: just fire up those tests, switch over to your other work, then come back when they’re all complete.

Currently, accounts are limited to running up to 10 tests at once, and in the future we’ll raise this limit as the platform grows.

Bulk Test Upload

As the name implies, the bulk performance test uploader allows you to upload a CSV text file of new or existing performance tests to run. There are two primary use cases for the bulk uploader:

  1. You have several new test scenarios you want to create rapidly.
  2. You have a suite of existing performance tests you want to re-run repeatedly, say after each update to your website.

The bulk uploader now lets you do this in just a few simple clicks.

To use the bulk uploader, log in to your Zoompf account and click the New Test option. Scroll to the bottom and you’ll see a new Bulk Performance Testing section like this:

The next page provides a prompt to upload a text file, and instructions about the file format and your upload limits.

To start, the system will allow you to upload up to 10 tests at a time, and they must be single-page scans only. In the future we’ll increase this limit as our new queuing architecture expands.

Note: if your Zoompf license is for a fixed # of scans per month, these uploaded tests will count against your monthly total, so use wisely!

The upload file can either be a flat file of Start URLs (one per line), or can use the template linked on the page. If you want to see a good set of sample values for that template, use the export of existing tests link on the uploader page.

Some more information on the template fields:

  • Start URL – The URL to run the performance test against. If an existing single-page performance test uses this URL (and the Test Name field is blank), a new snapshot will be created for that existing test. If not, a new test with a default name will be created and used. This field is ignored if an existing Test Name is provided.
  • Test Name – Uses the existing performance test with this name (if found), else creates a new test with this name. If this field is blank, a default test name will be created.
  • Device Type – If a new test is created, this device type will be used. You can see the possible device type values from the Device Type dropdown when you create a new performance test via the UI. This field is ignored if using an existing test.
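
Putting those fields together, a hypothetical upload file based on the template might look like this (the URLs, names, and device values are invented examples; per the field notes above, a blank Test Name falls back to a default, and Device Type is ignored for existing tests):

```csv
Start URL,Test Name,Device Type
http://www.example.com/,Homepage,Desktop
http://www.example.com/checkout,,iPhone
http://www.example.com/search,Search Test,
```

Alternatively, a flat file of Start URLs, one per line with no header row, works as well.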

After you upload your file, the uploader will first scan the contents and provide you a preview of all actions before processing.

For example:

This is a great way for you to test your file format.

Once you confirm, all tests will be queued for processing, so it may take minutes or even hours to complete depending on how many tests you uploaded and how much work is currently in the queue. You can, however, view results as they appear by navigating into the View Results page.

Both of these features were designed to allow you to scale your workflow with Zoompf more efficiently. We hope you find them helpful.

If we can improve these features, or any others, please feel free to contact us with your ideas!

The post Test Your Site Faster with Bulk Performance Tests appeared first on Zoompf Web Performance.

TCP over IP Anycast – Pipe Dream or Reality?

Thu, 09/24/2015 - 08:15

The following article by Ritesh Maheshwari details how LinkedIn, a Catchpoint customer, switched www.linkedin.com from unicast to anycast; it was originally posted on LinkedIn’s Engineering blog.

LinkedIn is committed to improving the performance and speed of our products for our members. One of the major ways we are doing that is through our Site Speed initiative. While our application engineers work to improve the speed of mobile and web apps, our performance and infrastructure teams have been working hard to deliver bytes faster to our members. To this end, LinkedIn heavily utilizes PoPs for dynamic content delivery. To route end-users to the closest PoP, we recently moved our major domain, www.linkedin.com, to an anycast IP address. This post talks about why we did this, what challenges we faced, and how we overcame them.

Unicast and DNS-based PoP assignment problems

In the unicast world, each endpoint on the internet has a unique IP address assigned to it. This means that each of LinkedIn’s many PoPs (Points of Presence) has a unique IP address associated with it. DNS is then used to assign users to PoPs based on geographical location.

In an earlier blog post, we talked about the inefficiencies of using DNS for assigning users to PoPs. We observed that about 30% of users in the United States were being routed to a suboptimal PoP when using DNS. There were two major reasons for a wrong assignment:

  • DNS assignment is based on the IP address of the user’s DNS resolver and not the user’s device. So, if a user in New York is using a California DNS resolver, they are assigned to our West Coast PoP instead of the East Coast PoP.
  • The database used by DNS providers for converting an IP address to a location might not be completely accurate. Their country-level targeting is generally much better than their city-based targeting.
Anycast: What and Why

With anycast, the same IP address can be assigned to n servers, potentially distributed across the world. The internet’s core routing protocol, BGP, would then automatically route packets from the source to the closest (in hops) server. For example, in the following illustration, if a user Bob wants to connect to an anycast IP 1.1.1.1, its packets would be routed to PoP A because A is only three hops away from Bob, while all other PoPs are four hops or more.

Anycast’s property of automatically routing to the closest PoP gives us an easy solution to our PoP assignment problem. If LinkedIn assigned the same IP to all its PoPs, then:

  • We would not need to rely on DNS-based geographical assignments (DNS would just hand out that one IP)
  • None of the problems associated with DNS-based PoP assignments would arise
  • Our users would get routed to the closest PoP automatically
Anycast promise: Too good to be true?

Given the great properties of IP anycast, why does most of the internet still use unicast? Why have other major web companies not used anycast more extensively? Asking these questions turned up some disheartening answers.

Internet Instability

Anycast has historically not proven to be useful for stateful protocols in the internet because of the inherent instability of the internet. We can explain this with an example. In the following figure, user Alice is three hops away from both server X and server Y. If Alice’s router does per-packet load balancing, it might end up sending Alice’s packets to both servers X and Y in a round-robin fashion. This means that Alice’s TCP SYN packet might go to server X, but her HTTP GET request might go to server Y. Because server Y doesn’t have an active TCP connection with Alice, it will send a TCP reset (RST) and Alice’s computer will drop the connection and will have to restart. Most routers now do a per-flow load balancing, meaning packets on a TCP connection are always sent over the same path, but even a small percentage of routers with per-packet load balancing can cause the website to be unreachable for users behind that router.

Even with per-flow load balancing, another problem is that if a link on the route to server X goes down, packets on any ongoing TCP connection between Bob and X will now be sent from Bob to Y, which will cause Y to send a TCP RST again.

This is why anycast has historically been popular for stateless protocols like DNS (which is based on UDP). But recently, some popular CDNs have also started using anycast for HTTP traffic. This gave us hope, because their TCP connections with end users would last about as long as LinkedIn’s TCP connections with our end users. A NANOG presentation in 2006 claimed that anycast works. So, to validate the assumption that TCP over anycast is no longer a problem on the modern internet, we ran a few synthetic tests. We configured our U.S. PoPs to announce an anycast IP address and then configured multiple agents in Catchpoint, a synthetic monitoring service, to download an object from that IP address. Our web servers were configured to deliberately send the response back slowly, taking over a minute for the complete data transfer. If the internet were unstable for TCP over anycast, we would observe continuous or intermittent failures when downloading the object, and we would also observe TCP RSTs at the PoPs. But even after running these tests for a week, we did not notice any substantial instability problems! This gave us confidence to proceed further.

Fewer hops != Lower latency

There were two problems with moving forward and actually using anycast. First, our tests ran on synthetic monitoring agents, not real users, so we couldn’t say with high confidence that real users would not face problems over anycast. Second, and more importantly, anycast’s PoP assignment might not be any better than DNS’s. This is because with anycast, users are routed to the PoP that is closest in number of hops, not in latency. With anycast, a one-hop cross-continental route with 100ms latency might be preferred over a three-hop in-country route with 10ms latency. But is this a theoretical problem, or does it really happen on the internet? We ran additional Catchpoint tests, but they were inconclusive, so we decided to brainstorm a real-user-based solution.

RUM to the rescue

In a previous blog we talked about how we instrumented Real User Monitoring, or RUM, for identifying which PoP a user connects to (see the section “Which PoP served my page view?”). Taking inspiration from that solution, we devised a RUM-based technique to run experiments to find out if anycast would work for our users.

Specifically, we did the following:

  1. We configured all our PoPs to announce one IP address (a global anycast IP address) and configured a domain “ac.perf.linkedin.com” to point to that IP address.
  2. After a page is loaded (load event fired), an AJAX request is fired to download a tiny object on ac.perf.linkedin.com.
  3. Because the IP address is anycast, the request would be served by the PoP that is closest to the user in terms of number of hops.
  4. While responding, PoP adds a response header that uniquely identifies it.
  5. RUM reads that header and uses it to identify which PoP served the object over the anycast IP. Thus, we know which PoP would be closest to this particular end-user IP address over anycast.
  6. RUM appends PoP information to the rest of the performance data and sends it to our servers.

Through offline processing, we aggregate this data to find out, for a given geography, what percentage of users would be routed to the closest PoP (in terms of latency) over anycast. Note that we know which is the closest PoP in latency through the work explained in the earlier blog (see the section “PoP Beacons in RUM”).

Global Anycast Results

Region/Country   DNS % Optimal Assignment   Anycast % Optimal Assignment
Illinois                  70                           90
Florida                   73                           95
Georgia                   75                           93
Pennsylvania              85                           95
New York                  77                           74
Arizona                   60                           39
Brazil                    88                           33

While good for many U.S. states, the global anycast experiment also showed worse results for a few regions (for example, Brazil, Arizona, and New York). It looks as if either our peers or some transit providers had routing policies that made users in Brazil see Singapore as a closer PoP in terms of hops. Clearly, this would not work.

One solution was to discover the problematic ISPs and ask them to fix their routing. This would have been a complex, arduous process without guaranteed results. So we devised a different solution. We noticed that:

  • DNS-based geographical assignments are fairly accurate at the continent level. For example, a user in North America would usually be assigned a PoP in North America (though not always the optimal within North America).
  • Our global anycast results showed cross-continent problems. But within the continent, PoP assignments were fairly good.
Regional Anycast

We then decided to try a “Regional Anycast” solution. The regional anycast solution would work as follows:

  • All PoPs in the same continent would get the same anycast IP address.
  • PoPs in different continents would get different anycast IP addresses.
  • We would use DNS-based geographical load balancing to hand out the continent-specific anycast IP for acpc.perf.linkedin.com.
  • We would repeat the previous experiment, but with RUM downloading the object over the acpc.perf.linkedin.com domain.

Specifically, we had three anycast IPs, one for each of the following regions: the Americas, Europe/Africa, and Asia.

Upon running a similar RUM experiment, we found that the regional anycast variant didn’t have the problems seen with global anycast. Based on the results, we decided to start using the regional anycast solution.

Ramp

We first did a pilot test in which all of the U.S. was slowly ramped onto regional anycast over the course of a few days and monitored for any anomalies. The pilot test results are shown in the following graph, where the Y-axis is the percentage of optimal PoP assignment. As we slowly ramped up anycast, many U.S. states clearly saw improvement in the percentage of traffic going to the optimal PoP.

We ramped the U.S. onto regional anycast earlier this year, and overall suboptimal PoP assignment dropped from 31% to 10%. While this is a significant gain, we are still investigating why the remaining 10% is not optimally assigned.

Final Thoughts

Currently, we have ramped North America and Europe on regional anycast and are carefully evaluating anycast for the rest of the world.

We do want to emphasize that anycast is not a silver bullet solution for this problem. It seems to resolve, to some extent, the inefficiencies with DNS-based PoP assignment. But it also has suboptimal assignment (in most geographies, assignment is <100% optimal), albeit for a smaller set of users. Similarly, while anycast simplifies our GLB pools, it introduces more complexity when we need to shed load from a PoP.

Acknowledgements

This project has had major contributions from many people across many teams (Performance, Network Operations, Traffic SRE, Edge Perf SRE, GIAS and more). Thanks to everyone involved in our Anycast Working Group including Shawn Zandi for his help in the initial design; Weilu Jia and Jim Ockers for working hard to drive the project to its completion; Stephanie Schuller and Fabio Parodi for helping run the project smoothly; Sanjay Dubey for the inspiration; Naufal Jamal, Thomas JacksonMichael Laursen, Charanraj Prakash, Paul Zugnoni, and many others.

The post TCP over IP Anycast – Pipe Dream or Reality? appeared first on Catchpoint's Blog.

Network Virtualization and the Software Defined Data Center

Thu, 09/24/2015 - 07:30

I was recently having a discussion with one of my clients about the concept of network virtualization. During the conversation, in attempting to explain the business case, I referred to data center virtualization. This was met with some confusion, with the customer explaining that when he heard that term, his immediate thought was of server […]

The post Network Virtualization and the Software Defined Data Center appeared first on Dynatrace APM Blog.

Fix performance issues fast with PRIORITAH!

Thu, 09/24/2015 - 06:42

You may not have the authoritah that Cartman had, but when those nasty website performance issues hit, you have to have “prioritah!”: the ability to accurately prioritize those issues so they get addressed and fixed in short order, and in the right order.

Most web property owners don’t know which problems are affecting their website’s performance, and when they do, they don’t know how to prioritize them or how to solve them. That’s ok, because that’s our job. We have prioritah! Here’s how it works.

Zoompf identifies, and tells you how to solve, more than 425 potential website performance problems. When we send you your report, we’ve already sorted all those problems into buckets by the following criteria.

Role:

  • IT – server, hosting, infrastructure
  • Designer – images, content
  • F/E developer – rendering, order of loading
  • B/E developer – APIs, database calls

Impact:

  • Severity of issue – are customers affected?
  • Difficulty of fix – an image that needs optimization is different from an API that has changed
  • Number of pages affected

Once you know where the issue ranks within these criteria, we assign a recommended priority to each issue that we find, allowing you to assign the proper resource at the proper time. We also offer you the recommended solution for how to address the issue, based on thousands of website performance issues that we’ve encountered, studied, and repaired.
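The ranking described above can be sketched as a simple scoring function. This is a hypothetical illustration, not Zoompf’s actual formula: the weights, issue names, and 1–5 scales are all made up for the example.

```python
# Hypothetical priority score combining the criteria above; the
# weights are illustrative only. Higher score = fix sooner.
def priority(severity, fix_difficulty, pages_affected):
    """severity and fix_difficulty on a 1-5 scale."""
    return severity * 3 + pages_affected * 0.1 - fix_difficulty

# Two made-up issues: a widespread, easy image fix outranks a
# severe but hard, narrowly-scoped API fix.
issues = [
    ("unoptimized hero image", priority(4, 1, 120)),
    ("changed API contract", priority(5, 4, 8)),
]
for name, score in sorted(issues, key=lambda i: -i[1]):
    print(f"{score:5.1f}  {name}")
```

The point of the sketch is that severity, difficulty, and breadth pull in different directions, so a single ranked list is what lets you assign the right resource at the right time.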

Now let’s review. You have website performance issues. You don’t know what exactly they are, what causes them, how to fix them, how to prioritize the work, or who to assign to fix each issue. That’s where we come in, to alert you to each issue as it arises, identify the cause, demonstrate the severity of each issue, prioritize the issues, and help you assign the role of the resource to address each issue.

We’re like your Cartman for website performance issues, only more mature.

If you are interested in front-end performance issues, then you will love Zoompf. We have a number of free tools that help you detect and correct front-end performance issues on your website: check out our free report to analyze your website for common performance problems, and if you like the results consider signing up for our free alerts to get notified when changes to your site slow down your performance.

The post Fix performance issues fast with PRIORITAH! appeared first on Zoompf Web Performance.

TCP over IP Anycast – Pipe Dream or Reality?

Wed, 09/23/2015 - 17:42

The following article by Ritesh Maheshwari details how LinkedIn, a Catchpoint customer, switched their DNS solution from Unicast to Anycast, and was originally posted on LinkedIn’s Engineering blog.

LinkedIn is committed to improving the performance and speed of our products for our members. One of the major ways we are doing that is through our Site Speed initiative. While our application engineers work to improve the speed of mobile and web apps, our performance and infrastructure teams have been working hard to deliver bytes faster to our members. To this end, LinkedIn heavily utilizes PoPs for dynamic content delivery. To route end-users to the closest PoP, we recently moved our major domain, www.linkedin.com, to an anycast IP address. This post talks about why we did this, what challenges we faced, and how we overcame them.

Unicast and DNS-based PoP assignment problems

In the unicast world, each endpoint on the internet has a unique IP address assigned to it. This means that each of LinkedIn’s many PoPs (Points of Presence) has a unique IP address associated with it. DNS is then used to assign users to PoPs based on geographical location.

In an earlier blog post, we talked about the inefficiencies of using DNS for assigning users to PoPs. We observed that about 30% of users in the United States were being routed to a suboptimal PoP when using DNS. There were two major reasons for a wrong assignment:

  • DNS assignment is based on the IP address of the user’s DNS resolver and not the user’s device. So, if a user in New York is using a California DNS resolver, they are assigned to our West Coast PoP instead of the East Coast PoP.
  • The database used by DNS providers for converting an IP address to a location might not be completely accurate. Their country-level targeting is generally much better than their city-level targeting.

Anycast: What and Why

With anycast, the same IP address can be assigned to n servers, potentially distributed across the world. The internet’s core routing protocol, BGP, then automatically routes packets from the source to the closest (in hops) server. For example, in the following illustration, if a user Bob wants to connect to the anycast IP 1.1.1.1, his packets will be routed to PoP A, because A is only three hops away from Bob while all other PoPs are four hops or more.
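The selection BGP performs here boils down to a minimum over path lengths. The following toy sketch makes that concrete; the PoP names and hop counts are hypothetical, and real BGP path selection of course involves more than hop count alone.

```python
# Toy illustration of anycast path selection: traffic goes to the PoP
# reachable in the fewest hops, regardless of actual latency.
def closest_pop(hop_counts):
    """Return the PoP with the smallest hop count from the user."""
    return min(hop_counts, key=hop_counts.get)

# Hypothetical hop counts from user "Bob" to each PoP announcing 1.1.1.1
hops_from_bob = {"PoP-A": 3, "PoP-B": 4, "PoP-C": 5}
print(closest_pop(hops_from_bob))  # PoP-A wins with 3 hops
```

Note that nothing in this selection looks at latency, which is exactly the weakness the “Fewer hops != Lower latency” section below runs into.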

Anycast’s property of automatically routing to the closest PoP gives us an easy solution to our PoP assignment problem. If LinkedIn assigned the same IP to all its PoPs, then:

  • We would not need to rely on DNS-based geographical assignments (DNS would just hand out that one IP)
  • None of the problems associated with DNS-based PoP assignments would arise
  • Our users would get routed to the closest PoP automatically

Anycast promise: Too good to be true?

Given the great properties of IP anycast, why does most of the internet still use unicast? Why haven’t other major web companies used anycast more extensively? Asking these questions turned up some disheartening news.

Internet Instability

Anycast has historically not proven useful for stateful protocols because of the inherent instability of internet routing. We can explain this with an example. In the following figure, user Alice is three hops away from both server X and server Y. If a router on Alice’s path does per-packet load balancing, it might end up sending her packets to servers X and Y in round-robin fashion. This means that Alice’s TCP SYN packet might go to server X, but her HTTP GET request might go to server Y. Because server Y doesn’t have an active TCP connection with Alice, it will send a TCP reset (RST), and Alice’s computer will drop the connection and have to restart. Most routers now do per-flow load balancing, meaning packets on a TCP connection are always sent over the same path, but even a small percentage of routers with per-packet load balancing can make the website unreachable for users behind them.
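The difference between per-packet and per-flow load balancing can be sketched in a few lines: a per-flow router hashes the connection’s 5-tuple, so every packet of one TCP connection maps to the same next hop. The path names and addresses below are hypothetical, and real routers use hardware ECMP hashing rather than MD5.

```python
import hashlib

# Equal-cost anycast paths a router could choose between.
PATHS = ["server-X", "server-Y"]

def per_flow_next_hop(src_ip, src_port, dst_ip, dst_port, proto="tcp"):
    """Pick a path by hashing the flow's 5-tuple: deterministic per flow,
    so a TCP connection never straddles two anycast servers."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}-{proto}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return PATHS[digest % len(PATHS)]

# Every packet of Alice's connection chooses the same path,
# unlike per-packet round-robin, which would alternate X and Y.
flow = ("10.0.0.5", 50123, "1.1.1.1", 443)
assert all(per_flow_next_hop(*flow) == per_flow_next_hop(*flow)
           for _ in range(5))
```

This determinism is why per-flow routers are safe for TCP over anycast; the residual risk comes from the minority of per-packet routers, and from route changes mid-connection.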

Even with per-flow load balancing, another problem remains: if a link on the route to server X goes down, packets on any ongoing TCP connection between Alice and X will be rerouted to Y, which will again respond with a TCP RST.

This is why anycast has mostly been used for stateless protocols like DNS (which is based on UDP). But recently, some popular CDNs have also started using anycast for HTTP traffic. This gave us hope, because their TCP connections with end users would last about as long as LinkedIn’s. A NANOG presentation in 2006 also claimed that anycast works. So, to validate the assumption that TCP over anycast is no longer a problem on the modern internet, we ran a few synthetic tests. We configured our U.S. PoPs to announce an anycast IP address and then configured multiple agents in Catchpoint, a synthetic monitoring service, to download an object from that IP address. Our web servers were configured to deliberately send the response back slowly, taking over a minute for the complete data transfer. If the internet were unstable for TCP over anycast, we would observe continuous or intermittent failures when downloading the object, as well as TCP RSTs at the PoPs. But even after running these tests for a week, we did not notice any substantial instability problems! This gave us confidence to proceed further.
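A deliberately slow endpoint like the one used in these tests could be sketched as follows. This is a hypothetical reconstruction, not LinkedIn’s actual server configuration: the port, object size, and chunking are made up for illustration.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical slow-transfer endpoint: streams a small object over
    ~60 seconds, so an unstable anycast route would surface as a
    mid-transfer failure (e.g., a TCP RST) at the client."""
    def do_GET(self):
        body = b"x" * 6000
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        for i in range(0, len(body), 100):  # 60 chunks, one per second
            self.wfile.write(body[i:i + 100])
            time.sleep(1)

# To serve the test object (blocks forever):
# HTTPServer(("", 8080), SlowHandler).serve_forever()
```

The long transfer is the point: a connection that survives a full minute has held one anycast path stable for far longer than a typical page-load connection needs.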

Fewer hops != Lower latency

There were still two problems with actually using anycast. First, our tests were on synthetic monitoring agents and not real users. So, we couldn’t say with high confidence that real users would not face problems over anycast. Second, and more importantly, anycast’s PoP assignment might not be any better than DNS. This is because with anycast, users are routed to the closest PoP in number of hops, and not in latency. With anycast, a one-hop cross-continental route with 100ms latency might be preferred over a three-hop in-country route with 10ms latency. But is this a theoretical problem, or does it really happen on the internet? We ran additional Catchpoint tests, but they were inconclusive, so we decided to devise a real-user-based solution.

RUM to the rescue

In a previous blog we talked about how we instrumented Real User Monitoring, or RUM, for identifying which PoP a user connects to (see the section “Which PoP served my page view?”). Taking inspiration from that solution, we devised a RUM-based technique to run experiments to find out if anycast would work for our users.

Specifically, we did the following:

  1. We configured all our PoPs to announce one IP address (a global anycast IP address) and configured a domain “ac.perf.linkedin.com” to point to that IP address.
  2. After a page is loaded (load event fired), an AJAX request is fired to download a tiny object on ac.perf.linkedin.com.
  3. Because the IP address is anycast, the request would be served by the PoP that is closest to the user in terms of number of hops.
  4. While responding, the PoP adds a response header that uniquely identifies it.
  5. RUM reads that header and uses it to identify which PoP served the object over the anycast IP. Thus, we know which PoP would be closest to this particular end-user IP address over anycast.
  6. RUM appends PoP information to the rest of the performance data and sends it to our servers.

Through offline processing, we aggregate this data to find out, for a given geography, what percentage of users would be routed to the closest PoP (in terms of latency) over anycast. Note that we know which PoP is closest in latency through the work explained in the earlier blog (see the section “PoP Beacons in RUM”).
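The offline aggregation described above can be sketched like this. The beacon field names and PoP codes are hypothetical; the real pipeline presumably runs over far larger datasets, but the per-geography computation is the same.

```python
from collections import defaultdict

def pct_optimal(beacons):
    """For each geography, compute the percentage of RUM beacons whose
    anycast-served PoP matched the PoP known to be closest in latency."""
    totals = defaultdict(lambda: [0, 0])  # geo -> [optimal, total]
    for b in beacons:
        counts = totals[b["geo"]]
        counts[1] += 1
        if b["anycast_pop"] == b["closest_pop_by_latency"]:
            counts[0] += 1
    return {geo: 100.0 * opt / tot for geo, (opt, tot) in totals.items()}

# Two hypothetical beacons: one optimally assigned, one not.
beacons = [
    {"geo": "Illinois", "anycast_pop": "ORD", "closest_pop_by_latency": "ORD"},
    {"geo": "Illinois", "anycast_pop": "LAX", "closest_pop_by_latency": "ORD"},
]
print(pct_optimal(beacons))  # {'Illinois': 50.0}
```

Numbers produced this way are what populate the per-region results table below.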

Global Anycast Results

Region/Country | DNS % Optimal Assignment | Anycast % Optimal Assignment
Illinois       | 70                       | 90
Florida        | 73                       | 95
Georgia        | 75                       | 93
Pennsylvania   | 85                       | 95
New York       | 77                       | 74
Arizona        | 60                       | 39
Brazil         | 88                       | 33

While the global anycast experiment was good for many U.S. states, it also showed worse results for a few regions (for example, Brazil, Arizona, and New York). It looks as if either our peers or some transit providers had routing policies that made users in Brazil see Singapore as a closer PoP in terms of hops. Clearly, this would not work.

One solution was to discover the problematic ISPs and ask them to fix their routing. This would have been a complex, arduous process without guaranteed results. So we devised a different solution. We noticed that:

  • DNS-based geographical assignments are fairly accurate at the continent level. For example, a user in North America would usually be assigned a PoP in North America (though not always the optimal within North America).
  • Our global anycast results showed cross-continent problems. But within the continent, PoP assignments were fairly good.

Regional Anycast

We then decided to try a “Regional Anycast” solution. The regional anycast solution would work as follows:

  • All PoPs in the same continent would get the same anycast IP address.
  • PoPs in different continents would get different anycast IP addresses.
  • We would use DNS-based geographical load balancing to hand out the continent-specific anycast IP for acpc.perf.linkedin.com.
  • We would repeat the previous experiment, but with RUM downloading the object over the acpc.perf.linkedin.com domain.

Specifically, we had three anycast IPs, one for each of the following regions: the Americas, Europe/Africa, and Asia.
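With regional anycast, the DNS side of the system reduces to a continent lookup. The sketch below is purely illustrative: the IP addresses are documentation-range placeholders, not LinkedIn’s real anycast IPs, and the continent mapping is truncated to a few example countries.

```python
# Sketch of regional-anycast DNS handout. All IPs are hypothetical
# placeholders from the 192.0.2.0/24 documentation range.
REGIONAL_ANYCAST_IP = {
    "americas": "192.0.2.1",       # shared by every PoP in the Americas
    "europe_africa": "192.0.2.2",  # shared by every PoP in Europe/Africa
    "asia": "192.0.2.3",           # shared by every PoP in Asia
}

# Illustrative country -> continent mapping (a real GeoIP database
# would back this in practice).
CONTINENT_OF = {"US": "americas", "BR": "americas",
                "DE": "europe_africa", "IN": "asia"}

def resolve_acpc(country_code):
    """Return the anycast IP handed out for acpc.perf.linkedin.com."""
    return REGIONAL_ANYCAST_IP[CONTINENT_OF[country_code]]

print(resolve_acpc("BR"))  # 192.0.2.1 -- Brazil gets the Americas IP
```

Within a continent, BGP still picks the closest PoP in hops; DNS only has to get the continent right, which is the part it is already good at.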

Upon running a similar RUM experiment, we found that the regional anycast variant didn’t have the problems seen with global anycast. Based on the results, we decided to start using the regional anycast solution.

Ramp

We first ran a pilot test in which all of the U.S. was slowly ramped onto regional anycast over the course of a few days and monitored for any anomalies. The pilot results are shown in the following graph, where the Y-axis is the percentage of optimal PoP assignment. As we slowly ramped anycast, many U.S. states clearly saw improvement in the percentage of traffic going to the optimal PoP.

We ramped the U.S. onto regional anycast earlier this year, and overall suboptimal PoP assignment dropped from 31% to 10%. While this is a significant gain, we are still investigating why the remaining 10% of users are not optimally assigned.

Final Thoughts

Currently, we have ramped North America and Europe on regional anycast and are carefully evaluating anycast for the rest of the world.

We do want to emphasize that anycast is not a silver bullet solution for this problem. It seems to resolve, to some extent, the inefficiencies with DNS-based PoP assignment. But it also has suboptimal assignment (in most geographies, assignment is <100% optimal), albeit for a smaller set of users. Similarly, while anycast simplifies our GLB pools, it introduces more complexity when we need to shed load from a PoP.

Acknowledgements

This project has had major contributions from many people across many teams (Performance, Network Operations, Traffic SRE, Edge Perf SRE, GIAS and more). Thanks to everyone involved in our Anycast Working Group including Shawn Zandi for his help in the initial design; Weilu Jia and Jim Ockers for working hard to drive the project to its completion; Stephanie Schuller and Fabio Parodi for helping run the project smoothly; Sanjay Dubey for the inspiration; Naufal Jamal, Thomas Jackson, Michael Laursen, Charanraj Prakash, Paul Zugnoni, and many others.

The post TCP over IP Anycast – Pipe Dream or Reality? appeared first on Catchpoint's Blog.

7 Reasons why APM is a No-Brainer for all Organizations

Wed, 09/23/2015 - 07:43

Why bother? That is the question many IT professionals face when trying to sell the value of application performance management internally to their organizations. As a working IT manager, for a Fortune 500 company, I like to save my company money, work more efficiently and ensure that my system users are happy. It sometimes feels […]

The post 7 Reasons why APM is a No-Brainer for all Organizations appeared first on Dynatrace APM Blog.

Politics and Performance: Trump Still Winning in Both Areas

Fri, 09/18/2015 - 10:23

The last time we looked in on the Republican Primary candidates’ websites, the initial goal was to see if the first debate resulted in any performance snafus due to increased traffic. Obviously every candidate wants to drive potential donors to their websites, and there’s no better time for them to do it than when they’re in front of a national television audience. That didn’t turn out to be the case, as there were no significant performance spikes while the candidates were on stage.

What we saw instead were several mistakes in page construction and optimization among the different candidates, most notably a glut of huge elements like videos and high-res images. These dramatically increase the size of the pages and, not surprisingly, cause the sites to load much slower than they could.

After conducting the tests under the same conditions after Wednesday night’s debate, some candidates have learned their lessons, but others seem content to keep making the same mistakes.

Most notably, every single candidate with the exception of Ben Carson had their website get heavier, some of them absurdly so. For instance, Mike Huckabee’s site more than quadrupled in size, going from an average of 1.50 MB in August to 6.02 MB on Wednesday. Another notable increase came from Rand Paul, whose page size went from 3.30 MB to 5.46 MB.

But the real performance faux pas once again belonged to Scott Walker and Jeb Bush. The last time, both governors had huge videos on their homepages, which drove load times way up, and this time was no different. To be fair, Bush’s web team again anticipated problems with traffic increases and removed the video from the page while the debate was on air before putting it back up after it was over. But the best thing from a performance standpoint would be to compress it so that it doesn’t send users’ browsers into overdrive.

As for the load times themselves, Donald Trump, the leader in the polls, once again came in with the fastest site and Marco Rubio with the second fastest (despite the latter once again having an extremely heavy homepage). And Carly Fiorina, who was not included in the first primetime debate and therefore not in our study, had a strong debut with the lightest site that was also the third-fastest. Meanwhile, Scott Walker’s site was painfully slow, taking just under six seconds before a user could interact with it.

This time around, we also made a point to look at the number of objects on each homepage, as well as the number of hosts that the site has to connect to. Here too, Trump came in with the best numbers by a wide margin, and Walker last.

The next debate is scheduled for just before Halloween on Wednesday, October 28, so check back then to see how scary the performance issues have gotten.

The post Politics and Performance: Trump Still Winning in Both Areas appeared first on Catchpoint's Blog.

Ensure Availability & Performance in SAP’s Digital Economy

Thu, 09/17/2015 - 05:55

SAP applications play a key role in fulfilling business processes in today’s digital enterprises. Availability problems, even those that impact single users, result in efficiency loss and in a worst case scenario may even stop the process. From an IT operation perspective it is a challenging task to isolate and identify the root-cause of intermittent availability […]

The post Ensure Availability & Performance in SAP’s Digital Economy appeared first on Dynatrace APM Blog.

Top 4 Digital Performance Metrics you should monitor 24/7!

Wed, 09/16/2015 - 05:35

Today’s websites are no longer solely marketing channels, they are critical production factors. If a website fails to deliver a satisfactory customer experience the entire value delivery chain is broken, and a company will not generate revenue regardless of product quality or value proposition. Mastering digital performance is one of the leading challenges of the web […]

The post Top 4 Digital Performance Metrics you should monitor 24/7! appeared first on Dynatrace APM Blog.
