Page MenuHomePhabricator

Limit concurrency of DPL queries
Open, MediumPublic

Description

Follow-up for T262391 and it's non-public counterpart T262240.

In T262391#6449590 et seq, @Bawolff proposed to limit concurrency of DPL queries by two ways: a query cache (which makes DPL to not do the same work XY times) and a pool counter that makes DPL to not work at more than X things at once.

Both should make DPL somehow safer (but not faster), which is always a bonus.

Event Timeline

Change 626796 had a related patch set uploaded (by Urbanecm; owner: Brian Wolff):
[mediawiki/extensions/intersection@master] Add an option to limit DPL concurrency with pool counter

https://gerrit.wikimedia.org/r/626796

Change 626268 had a related patch set uploaded (by Urbanecm; owner: Brian Wolff):
[mediawiki/extensions/intersection@master] Attempt to add a query cache to DPL

https://gerrit.wikimedia.org/r/626268

More specificly its meant to prevent query pileups. The common case of a query pileup in DPL being a template with a dpl on it being used on tens of thousands of pages, all being refreshed at once, which the short cache will eliminate. The pool counter solution as a fall back should eliminate any other situation of DPL query pileup.

I have closed the non public task that tracked the initial impact of DPL on ruwikinews (T262240).
As far as I know the patches that @Bawolff put together are still pending to be merged (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/intersection/+/626268/ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/intersection/+/626796/)
Can we get someone to review and merge them? I guess that'd be Platform Engineering or Performance-Team?

jcrespo raised the priority of this task from High to Unbreak Now!.Oct 29 2020, 12:00 PM
jcrespo added a subscriber: jcrespo.

Because this is actively causing outages for 800+ wikis on s3.

Should we (again) disable DPL somewhere?

Should we (again) disable DPL somewhere?

Maybe we can double check if reviewing, merging and deploying https://gerrit.wikimedia.org/r/626796 is realistically possible before the weekend? And maybe also https://gerrit.wikimedia.org/r/c/mediawiki/extensions/intersection/+/626268/ which I guess is more complex?

Platform Engineering can you bump this on your list of reviews priorities if possible?

Note. The former one depends on the latter one, so we would need to review 626268 first anyway :/.

To help with reviewing 626268, I enabled memcached caching as MainCacheType and set wgDLPQueryCacheTime to 30. Then, I created Category:A with handful of pages via for i in {0..45}; do echo -e "Testing page $i\n\n[[Category:A]]" | php /home/urbanecm/unsynced/gerrit/mediawiki/core/maintenance/edit.php --wiki=bwiki Test_page_$i --user Martin_Urbanec; done. Then, I set $wgDebugLogFile = "/var/www/wikis/logs/debug-{$wgDBname}.log"; to see all possible MediaWiki logs.

Then, I opened tail -f /var/www/wikis/logs/debug-bwiki.log | grep -i DBQuery | grep -i DynamicPageListHooks in my terminal, and opened Template:DPL_query for editing in my browser. Then, I pasted the following to the template:

<DynamicPageList>
category             = A
count                = 5
order                = descending
addfirstcategorydate = true
</DynamicPageList>

Then, I previewed it once, and confirmed that [DBQuery] DynamicPageListHooks::processQuery - Template:DPL_query [0s] localhost: SELECT page_namespace,page_title,c1.cl_timestamp FROM page` INNER JOIN categorylinks c1 ON ((page_id = c1.cl_from) AND (c1.cl_to='A')) WHERE page_is_redirect = 0 ORDER BY c1.cl_timestamp DESC LIMIT 5` appears only once in the logs I opened previously in my terminal. Then, I previewed it several times, and it didn't generate any more queries. Then, I waited 30 seconds (wgDLPQueryCacheTime on my machine), and it did query the DB again.

Marked with V+1 on Gerrit, as it seems to do what it is advertised to :).

Change 637499 had a related patch set uploaded (by Urbanecm; owner: Urbanecm):
[operations/mediawiki-config@master] Set wgDLPQueryCacheTime to 120 at all wikis

https://gerrit.wikimedia.org/r/637499

Change 626268 merged by jenkins-bot:
[mediawiki/extensions/intersection@master] Attempt to add a query cache to DPL

https://gerrit.wikimedia.org/r/626268

Change 637059 had a related patch set uploaded (by Urbanecm; owner: Brian Wolff):
[mediawiki/extensions/intersection@wmf/1.36.0-wmf.14] Attempt to add a query cache to DPL

https://gerrit.wikimedia.org/r/637059

Change 637499 merged by jenkins-bot:
[operations/mediawiki-config@master] Set wgDLPQueryCacheTime to 120 at all wikis

https://gerrit.wikimedia.org/r/637499

Mentioned in SAL (#wikimedia-operations) [2020-10-29T15:15:02Z] <urbanecm@deploy1001> Synchronized wmf-config/InitialiseSettings.php: 19c5aff02c20812c56b8abdcc0ed530393010193: Set wgDLPQueryCacheTime to 120 at all wikis (T263220) (duration: 00m 59s)

Change 637059 merged by jenkins-bot:
[mediawiki/extensions/intersection@wmf/1.36.0-wmf.14] Attempt to add a query cache to DPL

https://gerrit.wikimedia.org/r/637059

Mentioned in SAL (#wikimedia-operations) [2020-10-29T15:22:42Z] <urbanecm@deploy1001> Synchronized php-1.36.0-wmf.14/extensions/intersection/: 483c3bceb926ac6a2cfc40112fb9b4f0671fef72: Attempt to add a query cache to DPL (T263220) (duration: 00m 58s)

jcrespo lowered the priority of this task from Unbreak Now! to Medium.Oct 29 2020, 3:47 PM

This is no longer UBN, please feel free to consider it resolved/declined based on my comments on the subtask.

Change 626796 merged by jenkins-bot:
[mediawiki/extensions/intersection@master] Add an option to limit DPL concurrency with pool counter

https://gerrit.wikimedia.org/r/626796

Note: A config change is needed in wmf-config before the patch will take affect.

Note: A config change is needed in wmf-config before the patch will take affect.

Good point. Do you mind uploading a patch please? I can deploy in the nearest window :).

Change 645994 had a related patch set uploaded (by Brian Wolff; owner: Brian Wolff):
[operations/mediawiki-config@master] Add PoolCounter settings for DPL

https://gerrit.wikimedia.org/r/645994

Config patch for review & deployment, bumping to clinic duty.
Note that @Bawolff used to be staff - that doesn't make him immune to mistakes, but it's safe to assume that he generally knows what he's doing in the config :)

Change 645994 merged by jenkins-bot:
[operations/mediawiki-config@master] Add PoolCounter settings for DPL

https://gerrit.wikimedia.org/r/645994

Mentioned in SAL (#wikimedia-operations) [2020-12-10T19:15:22Z] <catrope@deploy1001> Synchronized wmf-config/PoolCounterSettings.php: Add PoolCounter settings for DPL (T263220) (duration: 01m 05s)

Change 647637 had a related patch set uploaded (by Catrope; owner: Catrope):
[operations/mediawiki-config@master] Revert "Add PoolCounter settings for DPL"

https://gerrit.wikimedia.org/r/647637

Change 647637 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "Add PoolCounter settings for DPL"

https://gerrit.wikimedia.org/r/647637

Mentioned in SAL (#wikimedia-operations) [2020-12-10T19:24:43Z] <catrope@deploy1001> Synchronized wmf-config/PoolCounterSettings.php: Revert PoolCounter settings for DPL (T263220) (duration: 01m 03s)

@Urbanecm noticed errors in the logs that looked like Pool key 'nowait:dpl-query:enwikinews' (DPL): Error reading from pool counter server 10.64.0.151., and asked me to revert

I've tried to get this live with Catrope in today Morning B&C window. However, this triggered about 1300 of error messages that looked like Pool key 'nowait:dpl-query:enwikinews' (DPL): Error reading from pool counter server 10.64.0.151. I asked Catrope to revert, as I didn't see a similar error message in the logs:

[urbanecm@mwlog1001 /srv/mw-log]$ grep 'reading from pool' poolcounter.log | head -n 3
2020-12-10 19:10:35 [X9JyqwpAAC4AACHyDIkAAABJ] mwdebug1002 ruwikinews 1.36.0-wmf.21 poolcounter INFO: Pool key 'nowait:dpl-query:ruwikinews' (DPL): Error reading from pool counter server 10.64.0.151.
2020-12-10 19:11:02 [X9JyxgpAAC4AAHW57dMAAABF] mwdebug1002 ruwikinews 1.36.0-wmf.21 poolcounter INFO: Pool key 'nowait:dpl-query:ruwikinews' (DPL): Error reading from pool counter server 10.64.0.151.
2020-12-10 19:11:09 [X9JyzQpAAC4AAHW57fUAAABI] mwdebug1002 ruwikinews 1.36.0-wmf.21 poolcounter INFO: Pool key 'nowait:dpl-query:ruwikinews' (DPL): Error reading from pool counter server 10.64.0.151.
[urbanecm@mwlog1001 /srv/mw-log]$ grep 'reading from pool' poolcounter.log | wc -l                                                                                                                                   1056
[urbanecm@mwlog1001 /srv/mw-log]$ grep 'reading from pool' poolcounter.log | grep -v dpl | head
[urbanecm@mwlog1001 /srv/mw-log]$

@Bawolff Could you have a look and advice, please?

Huh. That's not the error i was expecting if the limit was too low. That sounds like something was wrong with poolcounter generally.

I can't even see where that error is reported. Searching for "Error reading" doesn't bring up anything useful, nor can I find anything in the relevant code path calling info() on a logger.

@daniel Grep for poolcounter-read-error, you'll find https://gerrit.wikimedia.org/g/mediawiki/extensions/PoolCounter/+/4f4851e5f5c9ecf497848f50a89d16ee1f0b18f0/includes/PoolCounter_Client.php#73. It is possible that PoolCounter client somehow logs the status's exit code, if fatal. I can't find where that gets called, but that's where I found that error message.

Oh right, in the extension. I was looking in core.

But anyway... that's an odd failure mode. This basically means that the service killed the connection, right? Or could it be a read timeout?

It's odd indeed, odd enough to rollback it and pause to debate about "why this happens".

https://www.php.net/manual/en/function.fgets.php says " If an error occurs, false is returned.", so I think it can be any kind of an error, which obviously makes it harder to debug.

https://www.php.net/manual/en/function.fgets.php says " If an error occurs, false is returned.", so I think it can be any kind of an error, which obviously makes it harder to debug.

Do we have logs of what is happening on the other side? I'm not familiar with the code of the poolcounter daemon. Time to ping @tstarling I guess...

https://www.php.net/manual/en/function.fgets.php says " If an error occurs, false is returned.", so I think it can be any kind of an error, which obviously makes it harder to debug.

Do we have logs of what is happening on the other side? I'm not familiar with the code of the poolcounter daemon. Time to ping @tstarling I guess...

I think this is a question for serviceops. Tagging them as well.

Wrong tag, but keeping it, as it's related too.