
Special:Homepage is rendered much more slowly (from <1 sec to 2+ sec)
Open, High, Public

Description

Looking at our performance metrics as part of our chores, I noticed that about 7 days ago the rendering times for the homepage went up and stayed high: https://grafana.wikimedia.org/d/vGq7hbnMz/special-homepage-and-suggested-edits?orgId=1&from=1718582400000&to=now

image.png (305×779 px, 71 KB)

Acceptance Criteria:

  • figure out why homepage performance got much worse on June 18th / 19th.

Note:
This seems to have happened between the Group 0 and Group 1 deploy, so I somewhat doubt that it is related to the train per se. But maybe some caching config change or something similar?

Excimer profiles:

XHGui profiles:

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Tue, Jun 25, 1:44 PM

The degradation seems to have started around midnight between June 18th and June 19th.

The closest relevant entries in SAL are:

Hi, I just checked on a bare-metal debug server (mwdebug1001), and it takes 1.9s, so I doubt it's k8s-related

In a different conversation, @kostajh pointed out that the requests to the AnalyticsQueryService, added in T235810 to show the site edits in the last day, might be a contributing factor. They are in fact still in use as a fallback in \GrowthExperiments\HomepageModules\SuggestedEdits::getMobileSummaryBody. That else-branch also logs a warning if we do not get something useful back from the AnalyticsQueryService, and that warning is seeing a suspicious uptick as well:

image.png (395×847 px, 40 KB)
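The general shape of that fallback path can be sketched as follows. This is an illustrative Python sketch with hypothetical function names, not the actual getMobileSummaryBody code: the key point is that a cache miss falls through to a synchronous request against an external service, so any slowdown in that service shows up directly in page render time, and the warning volume tracks how often the fallback fails.

```python
import logging

logger = logging.getLogger("GrowthExperiments")

def get_site_edits_last_day(fetch_cached, fetch_from_aqs):
    """Prefer a cheap cached value; fall back to a slow external
    AQS request; warn when even the fallback yields nothing usable."""
    value = fetch_cached()
    if value is not None:
        return value
    # Fallback: synchronous request to the external service. If AQS
    # gets slower, every render that reaches this branch gets slower.
    value = fetch_from_aqs()
    if value is None:
        # This warning's volume rising together with render time is
        # what pointed at AQS in the investigation above.
        logger.warning("AnalyticsQueryService returned no usable data")
        return 0
    return value
```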


So I think a good next place to look would be to ask AQS folks if anything there changed on June 18.

I would still consider removing the edit count call in getActionData for the desktop version of the module, since it is not used there (AFAICT).

Yep, that's the plan. Also, the mobile version needs some work. We rarely show that counter there, but if we do and it is not available, then it turns into:

image.png (453×1 px, 49 KB)

which is not great either. (I'll probably make a separate task for that.)

> So I think a good next place to look would be to ask AQS folks if anything there changed on June 18.

I'm sure the developers would be best positioned to say whether anything has changed, but as far as the AQS services themselves are concerned, there don't seem to have been any significant increases in latency: service-level view; REST gateway-level view (easiest to read if you filter out the proton metrics).

Adding the Data-Platform team tag given that maybe something on AQS changed. (And if we can rule that out, that would also be useful to know.)

Change #1049973 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@master] Homepage: log rendering time for each module and each wiki

https://gerrit.wikimedia.org/r/1049973

Change #1049974 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@master] Homepage: don't load yesterdays edits on desktop

https://gerrit.wikimedia.org/r/1049974

Change #1049973 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Homepage: log rendering time for each module and each wiki

https://gerrit.wikimedia.org/r/1049973

Change #1049974 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Homepage: don't load yesterdays edits on desktop

https://gerrit.wikimedia.org/r/1049974

Change #1050002 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@wmf/1.43.0-wmf.11] Homepage: log rendering time for each module and each wiki

https://gerrit.wikimedia.org/r/1050002

Change #1050005 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@wmf/1.43.0-wmf.11] Homepage: don't load yesterdays edits on desktop

https://gerrit.wikimedia.org/r/1050005

Change #1050002 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.43.0-wmf.11] Homepage: log rendering time for each module and each wiki

https://gerrit.wikimedia.org/r/1050002

Mentioned in SAL (#wikimedia-operations) [2024-06-26T20:51:50Z] <cjming@deploy1002> Started scap: Backport for [[gerrit:1050002|Homepage: log rendering time for each module and each wiki (T368405)]]

Mentioned in SAL (#wikimedia-operations) [2024-06-26T20:55:16Z] <cjming@deploy1002> cjming, migr: Backport for [[gerrit:1050002|Homepage: log rendering time for each module and each wiki (T368405)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-06-26T21:05:51Z] <cjming@deploy1002> Finished scap: Backport for [[gerrit:1050002|Homepage: log rendering time for each module and each wiki (T368405)]] (duration: 14m 01s)

Change #1050005 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.43.0-wmf.11] Homepage: don't load yesterdays edits on desktop

https://gerrit.wikimedia.org/r/1050005

Mentioned in SAL (#wikimedia-operations) [2024-06-26T21:29:40Z] <cjming@deploy1002> Started scap: Backport for [[gerrit:1050005|Homepage: don't load yesterdays edits on desktop (T368405)]]

Mentioned in SAL (#wikimedia-operations) [2024-06-26T21:32:21Z] <cjming@deploy1002> cjming, migr: Backport for [[gerrit:1050005|Homepage: don't load yesterdays edits on desktop (T368405)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-06-26T21:38:29Z] <cjming@deploy1002> Finished scap: Backport for [[gerrit:1050005|Homepage: don't load yesterdays edits on desktop (T368405)]] (duration: 08m 48s)

Looking at my own early tracking (https://grafana.wikimedia.org/d/ff15559c-b4a2-4363-94c8-190a086b3315/michael-s-playground?from=now-24h&orgId=1&to=now&viewPanel=1), it does not look like the change backported last night had the desired effect. If it had, I would expect basically no events above 1 second. But let's see if the picture changes when the train rolls forward to Group 2.
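The kind of per-module timing that the "log rendering time for each module and each wiki" patch adds can be sketched as follows. This is illustrative Python with hypothetical helper names, not the actual PHP implementation: wrap each module render in a timer keyed by (wiki, module), so a dashboard can show which modules exceed a threshold like 1 second.

```python
import time
from collections import defaultdict

# (wiki, module) -> list of render durations in seconds
render_timings = defaultdict(list)

def timed_render(wiki, module_name, render_fn):
    """Run a module's render function and record its wall time,
    keyed by wiki and module so slow modules stand out per wiki."""
    start = time.monotonic()
    try:
        return render_fn()
    finally:
        render_timings[(wiki, module_name)].append(time.monotonic() - start)

def slow_fraction(wiki, module_name, threshold=1.0):
    """Fraction of recorded renders slower than the threshold."""
    samples = render_timings[(wiki, module_name)]
    if not samples:
        return 0.0
    return sum(1 for s in samples if s > threshold) / len(samples)
```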

Looking into this further, I was able to trigger a run actually taking a long time (3.57 seconds) in SpecialHomepage::execute: https://performance.wikimedia.org/xhgui/run/view?id=667d521c1e076ad57073b11e. This happened after I unselected a topic from a splatter of selected ones.

Other runs, which were just simple reloads, spend only a little time (0.115 seconds) in SpecialHomepage::execute; instead, almost all of the nominal wall time (4.1 seconds) comes from the deferred updates (resetting the cache for tasks): https://performance.wikimedia.org/xhgui/run/view?id=667d51e229fe83b0bc29881b

So this suggests a few possible lines of inquiry to me:

  1. Why do we only regenerate the cache when Special:Homepage is reloaded, rather than when changed filters are saved?
  2. Did the search get slower somehow?

I don't think the first item changed last week Tuesday, so checking in with the Search team about something changing on their end might be worthwhile.
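The idea behind the first question can be sketched as follows. This is a minimal illustrative Python sketch with hypothetical names (the real code uses MediaWiki's caching and deferred updates): regenerate the task cache eagerly when the filters are saved, so that the subsequent Special:Homepage view is a cheap cache hit instead of paying for the expensive search in a deferred update at render time.

```python
class TaskCache:
    def __init__(self, search_fn):
        self.search_fn = search_fn  # expensive search backend call
        self._store = {}

    def on_filters_saved(self, user, filters):
        # Regenerate eagerly at save time, so the next page view
        # does not pay the search cost during render.
        self._store[user] = self.search_fn(filters)

    def get_tasks(self, user, filters):
        if user not in self._store:
            # Cold cache: pay the search cost once.
            self._store[user] = self.search_fn(filters)
        return self._store[user]
```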

That aside, I have created T368616: Special:Homepage performance issues to collect the opportunities to improve Special:Homepage performance as I notice them.

Now that the train has rolled forward to Group 2, we can tell with high confidence that it was those requests to AQS after all. With "Homepage: don't load yesterdays edits on desktop" deployed, rendering times went down on desktop (to what looks like consistently lower levels than before):

image.png (309×778 px, 70 KB)

And stayed high on mobile:

image.png (310×779 px, 75 KB)

(or maybe even got a little bit worse there)

So tackling the "structured tasks mobile preview"-fallback seems like the main priority to fix this now.

Pinging @KStoller-WMF because that requires a product decision for how and when exactly to do that.

(Still, I continue to be unsure about what caused this in the first place. I inquired, and apparently nothing changed on the side of AQS, so something in the transport layer in between must have changed since last week Tuesday.)

> So tackling the "structured tasks mobile preview"-fallback seems like the main priority to fix this now.
> Pinging @KStoller-WMF because that requires a product decision for how and when exactly to do that.

Thanks! I've created T368750: Newcomer Homepage: Suggested Edits (mobile preview) empty state when there are no suggested edits to document a decision and design.

@Michael, there was a change on AQS, explained in T366851: gocql startup times have increased between v1.2.0 and v1.6.0. After upgrading the gocql library, it seems that startup times for all services increased.

Are you still experiencing delays?

CC @Sfaci and @Milimetric

Regarding what @VirginiaPoundstone mentioned, I wanted to add that the delay was confirmed only when the AQS service starts.

Taking a look at the exact dates around the issue you mentioned on this ticket, we merged something related to "cross-DC Cassandra client connection" on June 17 (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1043195) but the delay was already there since May 29 (it was added with this change where we had to set the initialDelaySecond property to 30 secs to have enough time for the service to start -> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1037033) but we never saw a performance degradation due to that. So I would say that delay itself is not related to this issue.