Page MenuHomePhabricator

P75 Edit response time has almost doubled since Jan 11, 2022
Closed, ResolvedPublic

Description

I was doing chores today and noticed that our MediaWiki PHP: Backend time (eg. logged-in or cache-miss) graph has almost doubled since January 11 for the p75 edit response time.

Screen Shot 2022-01-25 at 2.28.24 PM.png (540×3 px, 184 KB)

Acceptance Criteria

  • p75 edit response time decreases back to pre-Jan 11 level or an explanation is found

Event Timeline

Jdlrobson added a subscriber: Jdlrobson.

Hi editing and platform engineering team, have there been any obvious backend changes that may have caused this?

No obvious changes that I know of. We'll have to review the non-obvious ones… Start with:

I started reviewing the SAL, this deployment lined up suspiciously well:

19:27 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 8f5ca9a: Enable TheWikipediaLibrary on most wikis (T288070) (duration: 01m 00s)

So I started reading TheWikipediaLibrary code, and if you follow through here:

…This looks like on every edit, it does a database query for each wiki on which the current user has an attached account. That could be a lot of queries in the worst case (my own account is attached to 886, but I'm definitely an outlier; it's probably more like 3-5 for the average user).

I'm not sure if that could be bad enough to cause this effect? There is a comment in queryAttached() about trying to "avoid expensive query whenever possible"…

We could probably try to change the order of the various checks in TheWikipediaLibraryHooks.php to do this expensive one last, and see if that helps.

Jdlrobson raised the priority of this task from Medium to Needs Triage.Jan 27 2022, 2:18 AM

Possibly high priority/ubn, Not sure who should be responsible for prioritizing this.

Hi there, I'm in the library team and did work on the extension. We could certainly look at using another hook to fire the notification. In the mean time, disabling the extension seems reasonable to me, though I'm not sure what the impact would be for the display of the already sent notifications.

  • We're not sure if TheWikipediaLibrary is responsible, it's just the first suspicious thing I saw
  • Even if it is, it's been like this for 2 weeks and the wikis are not down, so it's probably not severe enough to run around disabling things

How is that time calculated? Is it an average of all edit saves, or individual timed test edits, or something else?

I ask because we've been rolling out the notification to successive groups of users week-by-week, so if those times are reflective of actual save times from editors, I might have expected to see some pattern which related to this deployment cadence, but as far as I can see it's remained steady (or even slowly decreasing) since the initial increase.

EDIT: Chatted to the engineers, my assumption is likely wrong :)

Just curious, if the hooks are doing expensive database operations couldn't those be added to the job queue?

The graph is coming from real world data, calculated here: https://gerrit.wikimedia.org/g/mediawiki/extensions/WikimediaEvents/+/2915af7b57670252dcb2d44ca566c6fe3b12708f/includes/WikimediaEventsHooks.php#173

I believe it is telling us that for the 75th percentile of editors an edit is taking double the time.

Even if it is, it's been like this for 2 weeks and the wikis are not down, so it's probably not severe enough to run around disabling things

The wikis are not down sure, but I'm more concerned about the impact on editing, as longer edit responses could be putting people off editing particularly for users on slow connections, and speed matters (reminds me of Ori's 2014 blog post https://wikimediafoundation.org/news/2014/12/29/how-we-made-editing-wikipedia-twice-as-fast/ ) so I think priority would depend on whether we are seeing any negative trends in edits.

@Samwalton9 as Bartosz says it's just a hypothesis that TheWikipediaLibrary is responsible. We should test that hypothesis and explore other ones.

@Jdlrobson The job queue sounds like like exactly what we should be doing

You can probably make it a DeferredUpdate rather than a job, which should have the same effect here (it happens after saving the edit and responding to the user, rather than before), but it's easier to implement. For example, Echo notifications for things that happened in this edit (mentions, page links, etc.) are generated in a DeferredUpdate.

Change 757690 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/TheWikipediaLibrary@master] Re-order some eligibility checks to calculate the most expensive ones last

https://gerrit.wikimedia.org/r/757690

Before we try to rewrite the entire extension, I'd like to see if that simple patch above ^ has a positive effect. Can you review?

TWL calls EchoEvent::create(), which already uses a post-send deferred updates to perform quick checks. This is outside the critical path and yet still enjoys cheap access to contextual information from the web request without overburdening either job queue or database. It then queues a job for the real side-effect and other propagation logic. This is the recommended pattern and should be applied here as well.

Putting it all on the job queue immediately would result in lots of load on job queues and databases with avoidable queries that result in no-ops.

The recommended pattern was also applied to TWL internally already. The state of the extension when we reviewed it prior to deployment (TheWikipediaLibrary.git@2021-11-01:/TheWikipediaLibraryHooks.php):

public static function onPageSaveComplete( .. ) {
   self::maybeSendNotification();
}
private static function maybeSendNotification( .. ) {
   DeferredUpdates::addCallableUpdate( .. );
}

I've looked at the code where the notification is created. Since you're correctly doing this in a DeferredUpdate, there shouldn't be any measurable impact on Save Timing. […]

This means the maintainers are already aware and familiar with deferred updates and job queue, even prior to our perf review. I don't think they need to be re-introduced as such :-).

It looks like this critical detail was accidentally lost in change 738476.

We really need a CentralAuth patch to make getGlobalEditCount() fast. Sweeping it under the DeferredUpdates rug is satisfying but the queries will still be done.

Change 758120 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/CentralAuth@master] Track global user edit counts in a DB table

https://gerrit.wikimedia.org/r/758120

Change 758119 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Add hook UserEditCountUpdate

https://gerrit.wikimedia.org/r/758119

Change 757737 had a related patch set uploaded (by Krinkle; author: Jsn.sherman):

[mediawiki/extensions/TheWikipediaLibrary@master] Restore DeferredUpdate in maybeSendNotification

https://gerrit.wikimedia.org/r/757737

dpifke moved this task from Inbox to Doing: Prio Interrupt on the Performance-Team board.
dpifke added a subscriber: dpifke.
Krinkle triaged this task as High priority.Jan 31 2022, 7:30 PM

Change 757737 merged by jenkins-bot:

[mediawiki/extensions/TheWikipediaLibrary@master] Restore DeferredUpdate in maybeSendNotification

https://gerrit.wikimedia.org/r/757737

Change 757690 merged by jenkins-bot:

[mediawiki/extensions/TheWikipediaLibrary@master] Re-order some eligibility checks to calculate the most expensive ones last

https://gerrit.wikimedia.org/r/757690

Change 758119 merged by jenkins-bot:

[mediawiki/core@master] Add hook UserEditCountUpdate

https://gerrit.wikimedia.org/r/758119

We appear to be back at January levels: https://grafana.wikimedia.org/d/000000205/reading-web-performance?viewPanel=43&orgId=1&from=now-90d&to=now
I think this can now be resolved. Thanks for prioritizing this work!

Change 769561 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/CentralAuth@wmf/1.38.0-wmf.25] Track global user edit counts in a DB table

https://gerrit.wikimedia.org/r/769561

Change 758120 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@master] Track global user edit counts in a DB table

https://gerrit.wikimedia.org/r/758120

Change 769561 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@wmf/1.38.0-wmf.25] Track global user edit counts in a DB table

https://gerrit.wikimedia.org/r/769561