Page MenuHomePhabricator

Investigate change in seasonality of RL backend metrics after 3 Oct 2019
Closed, ResolvedPublic


Following the deployment of 1.34.0-wmf25 to Wikipedias on 3 Oct 2019, there is the usual temporary change in various backend metrics as result of new code deployments.

However, in the days that followed the graphs did not follow the usual patterns we saw after a deployment in the months/years prior.

There is no negative impact that I can see, it's just different. This task is to understand those differences and to potentially catch unintended side effects.

Event Timeline

Krinkle triaged this task as Medium priority.
Average num of minify calls per req
Alert dashboard / problem
capture.png (1×2 px, 162 KB)

Almost doubled directly after the deploy (strong correlation), and while the usual season pattern did pick up, one the whole it remains clearly elevated.

In case the seasonal pattern takes multiple days to recover (like with INM ratio), here's a comparison to previous weeks and how it behaved after major deployments there:

Alert dashboard / normal previous weeks
capture2.png (1×2 px, 85 KB)

Looks like it's definitely not normal to be elevated in this way, except the first few minutes after a major deployment.

Time to look at the main dashboard to see if we can learn more about what might be causing it based on related metrics.

Detailed dashboard / normal previous week
cap2.png (814×2 px, 146 KB)

This tells me that normally around a deploy, there is a brief increase in backend traffic (http requests for new urls) which naturally will perform minification for the first time for the new scripts. The average doesn't really change much given it's just more urls being backfilled, the requests them selves don't do significantly more work.

The absolute number of minify calls in total across the data centre raises with the number of backend requests, but the average remains mostly unchanged, and recovers within a few minutes – normally.

Detailed dashboard / problem
cap.png (812×2 px, 164 KB)

... This week, the average appears to remain elevated. But strangely enough, the absolute total looks entirely normal.

This suggests one of two things:

  1. Maybe we made a change that A) caused half our backend requests to perform double the number of minifications they did before and B) are somehow not cacheable thus re-doing it all the time and C) have caused the other half of the requests to perform significantly fewer minifications, and D) not significantly changed the overall amount of backend traffic. Where C and D would explain the backend req count and total minify count not having changed.
  2. Maybe we have backend requests that perform little to no minifications at all, and we have made a change that made these requests less common. Thus making the aggregated average appear higher because we have fewer of the kind of requests that need no minification.

Option 1 seems improbable to me as we made no changes last week that could cause that as far as I know.

Option 2 also seemed improbable because I couldn't think of (popular) load.php requests that serve neither JS nor CSS code, or that serve CSS/JS but don't attempt to minify it at all.

I do know of the empty response to matching these qualifications, but one wouldn't generally make a request to that url without any query parameters. We do also have cases in OutputPage where we embed a previewed JS script as module (possibly without minification), but those would't show up as load.php backend responses as they go through index.php?action=edit instead.

Then I realised at T121730 we made a change to "ResourceLoaderImageModule" which is a way to for load.php to serve PNG or SVG files (generated as variants from other SVGs). That is indeed a load.php request that serves no CSS/JS. And the improved caching of that module in T121730 would've made backend requests far less common.

Krinkle closed this task as Resolved.EditedOct 8 2019, 8:08 PM
Krinkle moved this task from Untriaged to Consider write up on the Performance-Team-publish board.
Krinkle updated the task description. (Show Details)

TLDR: I've raised the alert threshold for "Avg minify calls per req" from 40 to 55 to accomodate the shift in traffic; And I've added an annotation to the graphs linking to this task.

While the average went up, this is only because the number of responses that perform 0 minifications (e.g. icon images) went down following T121730 .