Page MenuHomePhabricator

4x increase in database queries after deploy of 1.38.0-wmf.9 to all wikis
Closed, ResolvedPublic

Description

Splitting this out from T293950, where @Marostegui writes:

This has caused a huge increase on queries, this is an example of an enwiki replica:

Captura de pantalla 2021-11-19 a las 8.21.15.png (696×1 px, 178 KB)

https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=16&orgId=1&from=1637139798447&to=1637306390674&var-job=All&var-server=db1163&var-port=9104

This really needs to be investigated and/or reverted, enwiki is having 4x times the amount of queries it used to have.

Event Timeline

brennen triaged this task as Unbreak Now! priority.Nov 19 2021, 7:40 AM
brennen added a subscriber: jeena.
Ladsgroup subscribed.

I'm doing the revert of the risky patch

Pasting what I posted on IRC:

[07:34:41]  <marostegui> Lots of SELECT /* MediaWiki\Page\PageStore::loadPageFromConditions  */  page_id,page_namespace,page_title,page_is_redirect,page_is_new,page_touched,page_links_updated,page_latest,page_len,page_content_model,page_restrictions  FROM `page`
[07:34:41]  <marostegui>   WHERE page_id = XXXX  LIMIT 1
[07:35:37]  <marostegui> 224723 of those queries in 10 seconds
[07:36:07]  <marostegui> I don't know if that is a normal rate for that query, so just throwing the info I am finding 
[07:38:59]  <marostegui> So from what I can tell it is limited to reads, writes seem to be stable

Purging (parsing) article of Alan Turing in enwiki causes 700 queries like that (half of total number of queries)

https://logstash.wikimedia.org/goto/4673d2be300c28c3876de308a6650512

Investigating more.

getPageByNameViaLinkCache doesn't seem to be working or maybe there are too many direct calls bypassing the link cache or even better, both.

Ladsgroup added subscribers: Pchelolo, daniel.

DB-wise we are good but it's not reverted on master but since we have two weeks, I'm suggesting to fix the problem instead of full revert. I did some digging and it seems Title::getFieldFromPageStore is at fault here but I leave it to @daniel and @Pchelolo to finish it. If you need some way to debug queries being made. I can show you some tricks ;) (hopefully planning to document all of these)

Ladsgroup lowered the priority of this task from Unbreak Now! to High.Nov 19 2021, 8:36 AM

Purging (parsing) article of Alan Turing in enwiki causes 700 queries like that (half of total number of queries)

Looks like LinkBatch/LinkCache no longer work properly, and we end up fetching info for each link separately. Not good...

Marostegui renamed this task from 4x increase in queries after deploy of 1.38.0-wmf.9 to all wikis to 4x increase in database queries after deploy of 1.38.0-wmf.9 to all wikis.Nov 19 2021, 2:06 PM

Presumably for wmf.9 by this revert: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/739841.
Still a blocker for wmf.11.

Analysis:

Parser::statelessFetchRevisionRecord() calls RevisionStore::getKnownCurrentRevision(), which calls Title::getArticleID() and Title::getLatestRevID. Both, Title::getArticleID() and Title::getLatestRevID, rely on Title::getFieldFromPageStore().

getArticleID() calls getFieldFromPageStore( 'page_id' ), which in turn calls PageStore::getPageByName() via the special case for page_id.

Next, getLatestRevID() calls getFieldFromPageStore( 'page_latest' ), which finds that mArticleId is known and thus calls PageStore::getPageById(). The idea here is that lookups by page ID are faster than lookups by namespace and name.

However, PageStore::getPageById() bypasses LinkCache, since LinkCache does not support caching by ID! This causes an extra database query for every template.

Possible fixes:

  1. remove the "optimized" special case from getFieldFromPageStore() which causes getPageById() to be used. Beware that the same optimization exists in PageStore::getPageByReference
  2. make getFieldFromPageStore() remember the PageRecord returned by PageStore in a member variable.
  3. make getPageById() caching, probably by adding by-id caching to LinkCache. This cache needs to be reset by Title::resetArticleID, though.

Change 740583 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] PageStore: avoif getPageById, since it bypasses caching.

https://gerrit.wikimedia.org/r/740583

Change 740583 merged by jenkins-bot:

[mediawiki/core@master] PageStore: avoid getPageById, since it bypasses caching.

https://gerrit.wikimedia.org/r/740583

We updated translatewiki.net to master branch yesterday and we started seeing timeouts and memory limits:

Maximum execution time of 15 seconds exceeded at PPNode_Hash_Tree.php:219
Maximum execution time of 15 seconds exceeded at Title.php:246
Maximum execution time of 15 seconds exceeded at Title.php:3038
Maximum execution time of 15 seconds exceeded at Title.php:3041
Maximum execution time of 15 seconds exceeded at Title.php:701
Maximum execution time of 15 seconds exceeded at Title.php:225
Maximum execution time of 15 seconds exceeded at Title.php:242
Maximum execution time of 15 seconds exceeded at Title.php:698
Maximum execution time of 15 seconds exceeded at Title.php:230
Maximum execution time of 15 seconds exceeded at Title.php:242
Maximum execution time of 15 seconds exceeded at Title.php:242
Maximum execution time of 15 seconds exceeded at Title.php:242
Maximum execution time of 15 seconds exceeded at Title.php:3038
Maximum execution time of 15 seconds exceeded at MessageGroupCache.php:72
Allowed memory size of 314572800 bytes exhausted (tried to allocate 548864 bytes) in DatabaseMysqli.php:49
Allowed memory size of 314572800 bytes exhausted (tried to allocate 4096 bytes) in MysqliResultWrapper.php:47
Allowed memory size of 314572800 bytes exhausted (tried to allocate 2621440 bytes) in MessageCollection.php:371
Allowed memory size of 314572800 bytes exhausted (tried to allocate 2621440 bytes) in MessageCollection.php:371

This leads me to believe the above patch doesn't fix the issue fully.

At least some of the translatewiki.net issues seem to be caused by rMW45c692eeeff3: Stop accessing members of the Title class publicly making Title object construction slower (it's used a lot, so it is very performance sensitive), so it's possible that our issues are unrelated to this task.

Using LinkBatch also not give a performance boost on pages with many links, because there are calls inside LinkRenderer which ending up in PageStore, which itself not reusing the LinkCache entries added by LinkBatch - I have try to fix that with https://gerrit.wikimedia.org/r/c/mediawiki/core/+/742228 - not sure if that is the same issue as mention here.

Ladsgroup raised the priority of this task from High to Unbreak Now!.Dec 7 2021, 3:37 PM

The branch has been cut so we need to make sure either this is a proper train blocker and fix the memory/slowness (maybe revert gerrit:739549?) or remove it as a train blocker.

Hi. I'm on train duty this week and I'm looking for status on this ticket which is currently marked as a train blocker.

Hi. I'm on train duty this week and I'm looking for status on this ticket which is currently marked as a train blocker.

The database query issue should be fixed on master (and wmf.12), so I'm merking this as resolved. Per the title of this ticket, I think it can be closed.

However, the status of the memory leak is unknown.

The branch has been cut so we need to make sure either this is a proper train blocker and fix the memory/slowness (maybe revert gerrit:739549?) or remove it as a train blocker.

I don't see the connection to https://gerrit.wikimedia.org/r/c/mediawiki/core/+/739549, can you explain?

The branch has been cut so we need to make sure either this is a proper train blocker and fix the memory/slowness (maybe revert gerrit:739549?) or remove it as a train blocker.

I don't see the connection to https://gerrit.wikimedia.org/r/c/mediawiki/core/+/739549, can you explain?

I didn't check it in depth but it was mentioned here:

At least some of the translatewiki.net issues seem to be caused by rMW45c692eeeff3: Stop accessing members of the Title class publicly making Title object construction slower (it's used a lot, so it is very performance sensitive), so it's possible that our issues are unrelated to this task.

At least some of the translatewiki.net issues seem to be caused by rMW45c692eeeff3: Stop accessing members of the Title class publicly making Title object construction slower (it's used a lot, so it is very performance sensitive), so it's possible that our issues are unrelated to this task.

I'll file a separate ticket and work on a fix.