Bits and a Pint: Webpage performance

There is a lot of talk about which metrics you should look at to determine whether your page is performing optimally. With today's front-end architectures, it is generally accepted that several indicators, e.g. domComplete and onLoad, are no longer reliable. We also know that full page time is not a true indicator, given that the functionality critical for page interaction may be loaded well before the page is fully loaded. With all this in mind, what are the most critical indicators we should be looking at?

First, let's clarify that in this article I'm only talking about the page's initial load/render performance. While I believe every indicator has its place in the world of performance, there are a few key indicators that we use at StubHub to track our pages' performance. We have two main indicators that we track against company KPI goals: the first is Time to Interactive (TTI) and the second is Full Page Time (FPT). I know what you're thinking: "But wait, you just said that Full Page Time is not a good indicator of performance." While this is generally true, FPT is a good reference for keeping 3rd party and non-critical functionality in check. The other indicators we use, which are not tied to company KPI goals, cover time to first byte (TTFB) and the loading of ads.

You might be asking, "Why don't you look at several of the other indicators, given that you've stated that each indicator is important?" The reality is that we do use many of the other factors (page weight, request count, etc.); however, we use them to help us set our KPI goals and tune our pages. I've found these other indicators especially useful when giving the design teams a rough set of guidelines around page design and its performance impact, which I might discuss in another article.

Our Testing System

Before we jump into how we use each of the key indicators I mentioned above, I want to spend a moment on how we measure and collect them. For collection, we've set up a custom monitoring system that uses Webpagetest (http://www.webpagetest.org/), Jenkins (https://jenkins-ci.org/) and HARStorage (https://github.com/pavel-paulau/harstorage).

Jenkins

We use Jenkins as the glue that wires the whole system together. This is our controller that will execute the tests, upload the results to our collector and broadcast periodic test reports.

Webpagetest

For Webpagetest, we have a public API key that we use to run tests on the public nodes, and we have multiple private instances installed in several of our data centers across the country so that we can run tests against both our test and production servers.
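For readers who haven't scripted Webpagetest before, a test is started by requesting `runtest.php` on either the public instance or a private one. The sketch below only builds that request URL, for example from a Jenkins job. The helper name `buildRunTestUrl`, the private host `wpt.internal.example.com` and the page URL are hypothetical; `url`, `k`, `f`, `runs`, `location` and `fvonly` are standard Webpagetest API parameters, and valid `location` values depend on the instance.

```javascript
// Minimal sketch: build a Webpagetest runtest.php request URL.
// buildRunTestUrl and the private host name below are hypothetical examples.
function buildRunTestUrl(host, pageUrl, options) {
  var opts = options || {};
  var endpoint = new URL('/runtest.php', host);
  endpoint.searchParams.set('url', pageUrl); // page under test
  endpoint.searchParams.set('f', 'json');    // ask for a JSON response
  endpoint.searchParams.set('runs', String(opts.runs || 3));
  if (opts.apiKey) {
    endpoint.searchParams.set('k', opts.apiKey); // API key for the public instance
  }
  if (opts.location) {
    endpoint.searchParams.set('location', opts.location); // e.g. 'Dulles:Chrome'
  }
  if (opts.firstViewOnly) {
    endpoint.searchParams.set('fvonly', '1'); // skip the repeat (cached) view
  }
  return endpoint.toString();
}

// Public instance, authenticated with an API key
var publicRun = buildRunTestUrl('http://www.webpagetest.org', 'https://www.example.com/', {
  apiKey: 'YOUR_API_KEY',
  location: 'Dulles:Chrome',
  firstViewOnly: true
});

// Private instance in a data center (hypothetical host name)
var privateRun = buildRunTestUrl('http://wpt.internal.example.com', 'https://www.example.com/', {
  runs: 5
});

console.log(publicRun);
console.log(privateRun);
```

Keeping the host as a parameter lets the same job target the public nodes or any private instance without changing the rest of the test definition.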

HARStorage

HARStorage is an open-source Python/Pylons/MongoDB application for parsing HAR files and creating graphs and trends from that data. We've cloned the repo and modified the source for our own needs.
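As an illustration of the kind of aggregation a HAR-based tool performs (this is not HARStorage's actual code), the sketch below derives request count and transferred bytes from a HAR 1.2 object. The helper `summarizeHar` and the sample entries are hypothetical; `log.entries`, `response.headersSize` and `response.bodySize` come from the HAR 1.2 format, where `-1` means a size is unknown.

```javascript
// Minimal sketch: summarize page weight and request count from a HAR 1.2 object.
// summarizeHar is a hypothetical helper, not part of HARStorage.
function summarizeHar(har) {
  var entries = (har && har.log && har.log.entries) || [];
  var totalBytes = 0;
  entries.forEach(function(entry) {
    var res = entry.response || {};
    // HAR uses -1 when a size is not available, so only count known sizes
    if (res.headersSize > 0) totalBytes += res.headersSize;
    if (res.bodySize > 0) totalBytes += res.bodySize;
  });
  return {
    requestCount: entries.length,
    totalBytes: totalBytes,
    totalKB: Math.round(totalBytes / 1024)
  };
}

// Hypothetical three-request page
var sampleHar = {
  log: {
    version: '1.2',
    entries: [
      { request: { url: 'https://www.example.com/' }, response: { status: 200, headersSize: 512, bodySize: 20480 } },
      { request: { url: 'https://cdn.example.com/app.js' }, response: { status: 200, headersSize: 512, bodySize: 102400 } },
      { request: { url: 'https://ads.example.net/pixel.gif' }, response: { status: 200, headersSize: -1, bodySize: 43 } }
    ]
  }
};

var summary = summarizeHar(sampleHar);
console.log(summary); // { requestCount: 3, totalBytes: 123947, totalKB: 121 }
```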

Our KPIs

Time To First Byte (TTFB)

This is a great indicator of overall environment health as well as internet routing health. Since our primary request is not cached at the edge today, we can use it to track how well we respond to requests from across the network. If we see this number climbing, we know we have to review our back end for potential stalls, as well as the network traversal paths for any issues within the backbones and/or the CDN nodes that hand off to the origin. Ideally, if nothing in the primary request is dynamically generated, it should be cached and served from a CDN, which is something we are working towards; until then, this indicator helps us alert on potential issues early. TTFB also includes the cost of the DNS lookup and SSL negotiation, so an alarm here lets us make sure these vital components are behaving and that the paths are tuned properly.
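Because TTFB bundles several phases together, it helps to split it when an alarm fires. A minimal sketch, assuming a `performance.timing`-style object from the Navigation Timing API (in a browser you would pass `window.performance.timing`): the helper `breakdownTtfb` and the sample numbers are hypothetical, and `secureConnectionStart` is 0 when no TLS handshake took place.

```javascript
// Minimal sketch: split TTFB into phases using Navigation Timing (Level 1) fields.
// breakdownTtfb is a hypothetical helper; in a browser pass window.performance.timing.
function breakdownTtfb(t) {
  var hasTls = t.secureConnectionStart > 0; // 0 means no TLS handshake happened
  return {
    dns: t.domainLookupEnd - t.domainLookupStart,
    tcp: (hasTls ? t.secureConnectionStart : t.connectEnd) - t.connectStart,
    ssl: hasTls ? t.connectEnd - t.secureConnectionStart : 0,
    serverWait: t.responseStart - t.requestStart, // request sent -> first byte back
    ttfb: t.responseStart - t.navigationStart     // everything up to the first byte
  };
}

// Hypothetical sample in epoch milliseconds, written as offsets for readability
var base = 1443100000000;
var sampleTiming = {
  navigationStart: base,
  fetchStart: base + 5,
  domainLookupStart: base + 5,
  domainLookupEnd: base + 45,
  connectStart: base + 45,
  secureConnectionStart: base + 85,
  connectEnd: base + 245,
  requestStart: base + 246,
  responseStart: base + 446
};

console.log(breakdownTtfb(sampleTiming));
// { dns: 40, tcp: 40, ssl: 160, serverWait: 200, ttfb: 446 }
```

A jump in `ssl` while `serverWait` stays flat points at the handshake path rather than the back end, which is the kind of split that narrows an investigation quickly.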

Use Case: We added a new server to our pool to handle traffic for specific subdomains, and we started to see the 90th-percentile TTFB along this path increase, prompting us to investigate. We found that for some requests coming in over SSL, the negotiation was taking a long time. The cause was that the paths for these new subdomains had not been added to our CDN's DNS/SSL negotiation service, so any calls from the East Coast were forced to travel to our West Coast server locations to do the necessary lookups and negotiations.

Time To Interactive (TTI)

This is one of the most important indicators, as it speaks to when the page is actually ready for the user to interact with it and conduct the business the page is intended for. If you use Webpagetest as your testing platform, you can get a sense of this by reviewing the XXX. However, to get an accurate number for our app, we created a custom timing object that uses the Performance User Timing spec. This spec isn't supported by all browsers yet, so the object needs to handle non-supporting browsers; for our synthetic tests, however, we use Chrome, which does support the spec, allowing us to capture, display, graph and alert on the data.
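Handling non-supporting browsers usually comes down to a feature check before any User Timing call; the object below guards every method with an `isSupported` flag whose definition is elided from the listing. A minimal sketch of such a check, with hypothetical helper names `isUserTimingSupported` and `recordMark`, returning `null` rather than throwing when the API is missing:

```javascript
// Minimal sketch: feature-detect the User Timing API before using it.
// isUserTimingSupported and recordMark are hypothetical helpers.
function isUserTimingSupported(perf) {
  return !!perf &&
    typeof perf.mark === 'function' &&
    typeof perf.measure === 'function' &&
    typeof perf.clearMarks === 'function' &&
    typeof perf.getEntriesByName === 'function';
}

// Records a mark when supported and returns its time in ms since the time origin;
// returns null (instead of throwing) when the API is missing.
function recordMark(perf, name) {
  if (!isUserTimingSupported(perf)) {
    return null;
  }
  perf.clearMarks(name);
  perf.mark(name);
  var entries = perf.getEntriesByName(name, 'mark');
  return entries.length ? entries[entries.length - 1].startTime : null;
}

// `performance` is a global in modern browsers and in Node.js 18+
var perfApi = typeof performance !== 'undefined' ? performance : undefined;
var supportedTime = recordMark(perfApi, 'mark-user-ready');
var unsupportedTime = recordMark({}, 'mark-user-ready'); // object without User Timing methods

console.log(supportedTime, unsupportedTime);
```

Checking for each method individually, rather than only for `window.performance`, covers older browsers that shipped Navigation Timing before User Timing.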

Our Custom Object

    
SHCustomLoadTiming = window.SHCustomLoadTiming || {};
....
....
....
    SHCustomLoadTiming.uxLoadTime; // ms from navigationStart to 'user-ready', set by setUxLoadTimeToNow()

    SHCustomLoadTiming.setUxLoadTimeToNow = function() {
        if (!SHCustomLoadTiming.isSupported) {
            return;
        }

        var now = new Date().getTime();
        SHCustomLoadTiming.uxLoadTime = now - performance.timing.navigationStart;
        SHCustomLoadTiming.setPerformanceUserTimeMark('user-ready');
        SHCustomLoadTiming.setPerformanceUserTimeMeasure('user-ready', 'navigationStart');
        console.debug('SHCustomLoadTiming: New uxTimer: ' + SHCustomLoadTiming.uxLoadTime);
    };

    SHCustomLoadTiming.setPerformanceUserTimeMark = function(name) {
        if (!SHCustomLoadTiming.isSupported) {
            return;
        }

        // Clear any previous mark with this name so the new mark replaces it
        var markName = 'mark-' + name;
        window.performance.clearMarks(markName);

        // Create the new mark
        window.performance.mark(markName);
    };

    SHCustomLoadTiming.setPerformanceUserTimeMeasure = function(name, startMark) {
        if (!SHCustomLoadTiming.isSupported) {
            return;
        }
        // Clear any previous measure for this mark
        var markName = 'mark-' + name;
        var measureName = 'measure-' + name;
        window.performance.clearMeasures(measureName);

        // Create the new measure, from startMark to this mark
        window.performance.measure(measureName, startMark, markName);

        //Track custom SH measurements
        if (SHCustomLoadTiming.measures.indexOf(name) < 0) {
            SHCustomLoadTiming.measures.push(name);
        }
    };
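Once the marks and measures are recorded, a synthetic test or a beacon still has to read them back. A minimal sketch, assuming the `'measure-' + name` naming used above and a list of names like the object's `measures` array: the helper `collectMeasures` is hypothetical, and the demo measures between two explicit marks rather than from `navigationStart`, which only exists in browsers.

```javascript
// Minimal sketch: read recorded custom measures back into a plain report object.
// collectMeasures is a hypothetical helper that assumes the 'measure-' + name convention.
function collectMeasures(perf, names) {
  var report = {};
  names.forEach(function(name) {
    var entries = perf.getEntriesByName('measure-' + name, 'measure');
    if (entries.length) {
      // Keep the most recent measure if the same name was recorded more than once
      report[name] = Math.round(entries[entries.length - 1].duration);
    }
  });
  return report;
}

// Demo (modern browsers or Node.js 18+): record one measure between two marks.
performance.mark('mark-demo-start');
performance.mark('mark-demo');
performance.measure('measure-demo', 'mark-demo-start', 'mark-demo');

var report = collectMeasures(performance, ['demo', 'not-recorded']);
console.log(JSON.stringify(report)); // names without a recorded measure are left out
```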

Calling the custom object

    
// The application publishes an event once it finishes rendering its critical-path elements
this.publishEvent('app:render-ready');

//The core captures the event and triggers the custom object timer
   delayLoader.once('app:render-ready', function() {

        SHCustomLoadTiming.setUxLoadTimeToNow();

        if (location.search.indexOf('commonDelay=false') === -1) {
            require(['common-delay'],
                function(CommonDelay) {
                    console.log('common delay is loaded');
                }
            );
        }
    });

While this is a slower-than-normal test, you can see from the WPT screenshots that FPT is roughly 2x TTI.

Given the importance of this indicator to a user's perception of a page's performance, we have set it as an engineering KPI. The nice thing about using this indicator within engineering is that the engineers don't feel helpless when trying to achieve the goals. This is mainly because we prioritize all critical-path content to render first, which pushes all 3rd party content, typically outside the developers' ability to tune, to load after this indicator fires.
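The pattern that keeps 3rd party content out of TTI is simple to express: anything non-critical is queued and only started once the render-ready event fires, much like the `delayLoader.once` call above. A minimal sketch, with a hypothetical `createDeferredLoader` standing in for the post's `delayLoader`, whose implementation isn't shown:

```javascript
// Minimal sketch: queue non-critical loaders until the critical path reports ready.
// createDeferredLoader is hypothetical; it is not the post's delayLoader implementation.
function createDeferredLoader() {
  var ready = false;
  var queue = [];
  return {
    // Run fn now if the page is already interactive, otherwise after markReady()
    defer: function(fn) {
      if (ready) {
        fn();
      } else {
        queue.push(fn);
      }
    },
    // Call once critical-path rendering is done (e.g. on 'app:render-ready')
    markReady: function() {
      if (ready) return;
      ready = true;
      while (queue.length) {
        queue.shift()();
      }
    }
  };
}

var loader = createDeferredLoader();
var loadOrder = [];
loader.defer(function() { loadOrder.push('ads'); });
loader.defer(function() { loadOrder.push('tracking-pixels'); });
loadOrder.push('critical-path-render'); // critical content renders first
loader.markReady();                     // the TTI mark would be recorded here
loader.defer(function() { loadOrder.push('late-widget'); });

console.log(loadOrder.join(' -> '));
// critical-path-render -> ads -> tracking-pixels -> late-widget
```

Because everything 3rd party sits behind `markReady()`, a slow vendor can only move FPT, not TTI, which is what lets engineers own the TTI goal.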

Full Page Time (FPT)

I mentioned above that FPT is not a "true" indicator of a page's performance. This is because we generally load a number of 3rd party, non-critical items that affect the full page load time. However, it is still an important indicator to track, if only to give the business a sense of how the additional non-critical items affect the page as a whole. At StubHub we track it as a KPI with a company goal set against it. By tracking it and comparing it to TTI, we can tell immediately if we have badly behaving 3rd party content on the site. Once we dive in, we can see whether it's a badly behaving vendor or whether our business teams have added too much page bloat in the form of managed content, tracking pixels, ads, etc.

Here is an example report showing the delta between TTI and FPT at the 90th percentile for several pages.
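A report like this boils down to a small calculation: take the 90th percentile of TTI and of FPT per page and look at the gap. The sketch uses the nearest-rank percentile method; `percentile`, `p90Gap`, the page names and the sample values are hypothetical, and a real reporting tool may compute percentiles differently (for example by interpolating).

```javascript
// Minimal sketch: 90th-percentile TTI vs FPT gap per page (nearest-rank method).
// percentile, p90Gap and the sample data are hypothetical illustrations.
function percentile(values, p) {
  var sorted = values.slice().sort(function(a, b) { return a - b; });
  var rank = Math.ceil((p * sorted.length) / 100); // 1-based nearest rank
  return sorted[Math.max(rank, 1) - 1];
}

function p90Gap(samples) {
  var tti = percentile(samples.tti, 90);
  var fpt = percentile(samples.fpt, 90);
  return { tti: tti, fpt: fpt, delta: fpt - tti };
}

// Hypothetical test runs per page, in milliseconds
var runs = {
  home:  { tti: [1800, 1900, 2100, 2000, 2500, 1950, 2050, 2200, 2300, 2400],
           fpt: [4200, 4400, 4100, 4600, 5200, 4300, 4500, 4800, 4900, 5000] },
  event: { tti: [2600, 2700, 2900, 2800, 3100, 2750, 2850, 3000, 2950, 3050],
           fpt: [9000, 9500, 8800, 9900, 12000, 9100, 9300, 9700, 10100, 10500] }
};

Object.keys(runs).forEach(function(page) {
  var r = p90Gap(runs[page]);
  console.log(page + ': TTI p90=' + r.tti + 'ms, FPT p90=' + r.fpt + 'ms, delta=' + r.delta + 'ms');
});
// home: TTI p90=2400ms, FPT p90=5000ms, delta=2600ms
// event: TTI p90=3050ms, FPT p90=10500ms, delta=7450ms
```

A page whose delta grows while its TTI stays flat is the signal described above: something loading after the critical path, such as a misbehaving vendor or extra managed content, is inflating FPT.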

Ads Fully Loaded Time (AFLT)

Finally, we are tracking the total time it takes for ads to display and render on the page. We do this because we use ad exchanges and have observed that some ads get through the exchange rules we've set up and cause performance issues that affect our full page time. By recording this as a custom indicator, we can determine at a glance whether it's our ads or some other 3rd party service that is impacting our FPT. So far we've been able to use this indicator with our Ads business team to quickly find violators that slip through the rules engine and block them from future appearances. The typical violations that shoot FPT through the roof also cause major page bloat, which can ultimately yield poor performance for the user, especially on a mobile device. The bloat we've seen from this includes request counts of 1000+ and page sizes of 8 MB; imagine that on a mobile page. Again, to capture this indicator we used our custom timing object.
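One way to record an ads-fully-loaded indicator with User Timing is to mark each ad slot as it renders and emit a measure when the last expected slot completes. The sketch assumes the number of slots is known up front and that the ad integration exposes some per-slot "rendered" callback; `createAdLoadTracker`, the slot IDs and the `ads-fully-loaded` mark and measure names are hypothetical.

```javascript
// Minimal sketch: record an "ads fully loaded" time with the User Timing API.
// createAdLoadTracker and the mark/measure names are hypothetical.
function createAdLoadTracker(perf, slotIds) {
  var pending = {};
  slotIds.forEach(function(id) { pending[id] = true; });
  var remaining = slotIds.length;
  var adsFullyLoaded = null;

  return {
    // Call from the ad integration's per-slot "rendered" callback
    slotRendered: function(id) {
      if (!pending[id]) return adsFullyLoaded; // unknown or already counted
      pending[id] = false;
      perf.mark('mark-ad-' + id);
      remaining -= 1;
      if (remaining === 0) {
        perf.mark('mark-ads-fully-loaded');
        // With no start mark, the measure starts at the time origin (navigation start in a browser)
        perf.measure('measure-ads-fully-loaded', undefined, 'mark-ads-fully-loaded');
        var entries = perf.getEntriesByName('measure-ads-fully-loaded', 'measure');
        adsFullyLoaded = entries[entries.length - 1].duration;
      }
      return adsFullyLoaded;
    }
  };
}

// `performance` is a global in modern browsers and in Node.js 18+
var tracker = createAdLoadTracker(performance, ['top-banner', 'sidebar', 'footer']);
tracker.slotRendered('top-banner');
tracker.slotRendered('sidebar');
tracker.slotRendered('sidebar'); // duplicate callbacks are ignored
var aflt = tracker.slotRendered('footer');
console.log('AFLT: ' + Math.round(aflt) + 'ms');
```

Recording a mark per slot as well as the overall measure makes it possible to see which slot finished last when AFLT spikes.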

Conclusion

There are a lot of performance indicators out there; your goal should be to find the right ones for your specific application and use cases. If nothing fits your needs completely, take advantage of the Performance User Timing spec and create your own. At the end of the day, it's all about tracking the data you need to improve your site's speed.


Thursday, September 24, 2015

Webpage performance - Key Indicators