[DRAFT] Performance Team Goals July - September 2015 (Q1 2015/2016)
Closed, Resolved · Public

Assigned To

Authored By

	ori
	Jun 4 2015, 12:33 AM

Description

~30% improvement in site performance in 3 months.

Backend

Sub-second median page save time.
- Currently: ~1.4s (about 40% over target; reaching the target requires a ~29% reduction).
- Bottommost metric on https://performance.wikimedia.org/

Sub-80ms backend response time for GET requests to application servers.
- Currently: ~110ms (about 38% over target; reaching the target requires a ~27% reduction).
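
A gap to a target can be stated in two ways that give different percentages: how far the current value sits above the target, or how much the current value has to shrink. The sketch below works through both for the two backend figures quoted above. The function names are illustrative and not part of any existing tooling.

```python
def percent_over_target(current: float, target: float) -> float:
    """Distance above the target, as a percentage of the target."""
    return (current - target) / target * 100.0


def reduction_needed(current: float, target: float) -> float:
    """Percentage by which the current value must shrink to reach the target."""
    return (current - target) / current * 100.0


# (current, target) pairs taken from the task description.
goals = {
    "median page save time (s)": (1.4, 1.0),
    "backend GET response time (ms)": (110.0, 80.0),
}

for name, (current, target) in goals.items():
    over = percent_over_target(current, target)
    cut = reduction_needed(current, target)
    print(f"{name}: {over:.1f}% over target, {cut:.1f}% reduction needed")
```

The second figure (reduction needed) is the one comparable to the "~30% improvement" headline.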

Frontend

TBD.

Related Objects

Status	Assigned	Task
Resolved	• Gilles	T101345 [DRAFT] Performance Team Goals July - September 2015 (Q1 2015/2016)
Resolved	• Gilles	T101819 Track state (region)
Resolved	ori	T103720 parsing legacy GeoIP cookies fails (no regex match), enwiki geonotice broken for users with those legacy cookies

Event Timeline

ori created this task. · Jun 4 2015, 12:33 AM

ori raised the priority of this task from to Medium.

ori updated the task description.

ori added a project: Performance-Team.

ori added subscribers: ori, • Gilles, Krinkle, aaron.

Restricted Application added a subscriber: Aklapper. · Jun 4 2015, 12:33 AM

• MZMcBride subscribed. · Jun 5 2015, 10:02 PM

• Gilles claimed this task. · Jun 9 2015, 9:35 AM

In our last team call we discussed what our reference frontend metrics should be, considering that a large portion of what we track at the moment is outside of our control (inherent latency due to physical distance to data centers, end users' bandwidth, etc.). It's important that we track metrics that aren't subject to drastic changes the team isn't responsible for, since those would make our own contribution to metrics improvements difficult to evaluate.

For this reason, we've settled on trying to isolate end users living in Virginia. They are likely to have fast connections to our Ashburn datacenter, and they are very unlikely to be routed to a different DC when we roll out new DCs or new cache PoPs. Because they are physically close to the DC, latency should also account for a much smaller portion of the metrics.

The first step before goal-setting is to look at data about users in VA and figure out which portion of their current performance is our responsibility. Then we should be able to determine a percentage of improvement we hope to achieve on specific metrics like firstPaint in the coming quarter.

While we wait for VA-specific data, I'm investigating where the biggest wins might lie on the desktop site and what portion of frontend performance they could represent. I will be using the top 10 most popular enwiki articles this month as the sample (they're conveniently very varied in structure). I will be using the GPRS (very slow) and WiFi (very fast) network throttling profiles in Chrome DevTools to compare the effect at both extremes.

How large is the effect of images in the article competing for bandwidth with essential top-loaded CSS/JS?

Looking at thumbnails only (ignoring icons), if we were to postpone loading of thumbnails until firstPaint, we would reach firstPaint 17% sooner on average on a very slow connection. On a very fast connection, however, the difference would be within the margin of error.

This approach is difficult without splitting noJS support into its own site anyway. But it's an interesting confirmation that asset parallelization penalizes very slow connections (loading essentials is delayed by images loading in parallel) but benefits very fast connections (thumbs are fetched for free while the essentials are loaded). Therefore, efforts to change the order or mix of parallelized requests would affect both overall performance and its distribution in non-obvious ways.

Looking at overall averages is particularly dangerous. For this particular change, the performance worsening for very fast connections would be dwarfed by the absolute gains, in seconds, seen by very slow connections. The site would appear faster overall, and the average would hide the fact that it got slower for people with fast connectivity.
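
To make the averaging problem concrete, here is a minimal sketch with made-up firstPaint samples. The numbers, group names, and helper functions are all hypothetical and chosen only to illustrate the effect; they are not measurements from our site.

```python
from statistics import mean

# Hypothetical firstPaint samples in milliseconds, before and after a change
# that helps slow connections and slightly hurts fast ones.
before = {"slow": [9000, 10000, 11000], "fast": [400, 500, 600]}
after = {"slow": [7500, 8300, 9100], "fast": [450, 550, 650]}


def overall_mean(groups):
    """Mean over every sample, ignoring which connection class it came from."""
    return mean([v for samples in groups.values() for v in samples])


def per_group_change(before, after):
    """Relative change of each group's mean, in percent (negative = faster)."""
    return {
        g: (mean(after[g]) - mean(before[g])) / mean(before[g]) * 100.0
        for g in before
    }


print("overall mean before:", overall_mean(before))
print("overall mean after: ", overall_mean(after))
for group, change in per_group_change(before, after).items():
    print(f"{group}: {change:+.1f}%")
```

With these invented samples the overall mean drops, which reads as a win, while the per-group breakdown shows the fast group getting slower. Reporting per-class figures or percentiles alongside the mean avoids that blind spot.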

How much would we gain on the desktop site by only loading the lead section on pageload?

This is a topic we've explored for mobile web as an experimental changeset, but it applies to the desktop site just as much. Here, obviously, firstPaint performance would be improved for both slow and fast connections. firstPaint would be reached between 25 and 30% faster on a mix of articles like the top 10.

This is not a straightforward win, though. On desktop, the large screen real estate means that the lead section is unlikely to fill the whole above-the-fold area, so people will see the rest of the article appear. Moving firstPaint sooner therefore becomes a perceived-speed trade-off that needs to be validated with user testing. I've run into the situation before where showing part of the content plus a spinner felt slower to users than waiting on a blank page and getting all the content at once.

Now, this 25-30% figure is interesting, because we're essentially tossing most of the content. It gives us a sense of scale for how much we stand to gain on frontend performance. I don't think we're likely to explore either of those ideas for desktop in the coming quarter, as we're better off working on issues that won't be a UX trade-off. Therefore I recommend that we aim for a modest 10% improvement on firstPaint as the frontend quarterly goal. Even that might be too ambitious, as we don't know yet where the biggest gains in our stack are to be had. Even trimming down the <head> might not be enough, as the cleanup that has already happened on mobile head CSS hasn't shown any obvious improvement in firstPaint yet. But maybe that's because we also need to reduce head JS, as those requests are parallelized.

Krinkle moved this task from Inbox, needs triage to Backlog: Maintenance, non-prioritized on the Performance-Team board. · Jun 18 2015, 12:13 AM

• Gilles moved this task from Backlog: Maintenance, non-prioritized to Doing (old) on the Performance-Team board. · Jun 18 2015, 7:19 PM

We can now make our client-measured performance metrics more stable with regard to external forces by filtering them down to Virginia traffic only.
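
As a rough illustration of that filtering, the sketch below keeps only samples tagged with a given region and takes the median of firstPaint. The event shape (`country`, `region`, `firstPaint` keys) is an assumption made for this example and may not match the actual schema of our client-side timing events.

```python
from statistics import median


def median_first_paint(events, country="US", region="VA"):
    """Median firstPaint (ms) for events from one region, or None if no samples.

    `events` is an iterable of dicts; the key names are hypothetical.
    """
    samples = [
        e["firstPaint"]
        for e in events
        if e.get("country") == country
        and e.get("region") == region
        and e.get("firstPaint") is not None
    ]
    return median(samples) if samples else None


# Made-up events, for illustration only.
events = [
    {"country": "US", "region": "VA", "firstPaint": 620},
    {"country": "US", "region": "VA", "firstPaint": 480},
    {"country": "US", "region": "VA", "firstPaint": 550},
    {"country": "US", "region": "CA", "firstPaint": 900},
    {"country": "FR", "region": "IDF", "firstPaint": 1300},
    {"country": "US", "region": "VA", "firstPaint": None},
]

print(median_first_paint(events))
```

Using the median rather than the mean keeps a handful of very slow page loads from dominating the regional figure.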

He7d3r subscribed. · Jul 9 2015, 4:29 PM

• Gilles moved this task from Doing (old) to Backlog: Maintenance, non-prioritized on the Performance-Team board. · Jul 21 2015, 8:47 PM

• Gilles closed this task as Resolved. · Aug 5 2015, 8:23 AM

Thanks. It would be nice to know what sort of tasks are caused by this goal: for instance, is T92357 one of them?
