
MWException: Error contacting the Parsoid/RESTBase server (HTTP 404) from DiscussionTools (on closed wikis)
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error
normalized_message
[{reqId}] {exception_url}   MWException: Error contacting the Parsoid/RESTBase server (HTTP 404)
exception.trace
from /srv/mediawiki/php-1.39.0-wmf.25/extensions/DiscussionTools/includes/Hooks/HookUtils.php(98)
#0 /srv/mediawiki/php-1.39.0-wmf.25/extensions/DiscussionTools/includes/Hooks/DataUpdatesHooks.php(47): MediaWiki\Extension\DiscussionTools\Hooks\HookUtils::parseRevisionParsoidHtml(MediaWiki\Revision\RevisionStoreRecord)
#1 /srv/mediawiki/php-1.39.0-wmf.25/includes/deferred/MWCallableUpdate.php(38): MediaWiki\Extension\DiscussionTools\Hooks\DataUpdatesHooks->MediaWiki\Extension\DiscussionTools\Hooks\{closure}()
#2 /srv/mediawiki/php-1.39.0-wmf.25/includes/deferred/DeferredUpdates.php(474): MWCallableUpdate->doUpdate()
#3 /srv/mediawiki/php-1.39.0-wmf.25/includes/deferred/RefreshSecondaryDataUpdate.php(103): DeferredUpdates::attemptUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#4 /srv/mediawiki/php-1.39.0-wmf.25/includes/deferred/DeferredUpdates.php(474): RefreshSecondaryDataUpdate->doUpdate()
#5 /srv/mediawiki/php-1.39.0-wmf.25/includes/Storage/DerivedPageDataUpdater.php(1801): DeferredUpdates::attemptUpdate(RefreshSecondaryDataUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#6 /srv/mediawiki/php-1.39.0-wmf.25/includes/page/WikiPage.php(2136): MediaWiki\Storage\DerivedPageDataUpdater->doSecondaryDataUpdates(array)
#7 /srv/mediawiki/php-1.39.0-wmf.25/includes/jobqueue/jobs/RefreshLinksJob.php(242): WikiPage->doSecondaryDataUpdates(array)
#8 /srv/mediawiki/php-1.39.0-wmf.25/includes/jobqueue/jobs/RefreshLinksJob.php(163): RefreshLinksJob->runForTitle(Title)
#9 /srv/mediawiki/php-1.39.0-wmf.25/extensions/EventBus/includes/JobExecutor.php(79): RefreshLinksJob->run()
#10 /srv/mediawiki/rpc/RunSingleJob.php(77): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#11 {main}
Impact
Notes

> 200 in the last 15 minutes, multiple projects

Event Timeline

The code in DiscussionTools is a new feature (T296801). If all else fails, it can be reverted; it's currently an elaborate no-op. (It's supposed to compute some stuff after a page is edited, and save it to some database tables, but the tables don't exist yet, so it just computes and discards it.)

Since this code is trying to read a revision that was just saved from RESTBase, my first guess is that RESTBase is reading from an outdated replica, and responding 404 because the revision (or page) doesn't exist there yet.

If my guess is right, then I don't know how to resolve that, other than by not using RESTBase. I can put in a sleep(5)? (ew). I was hoping RESTBase would do something smart in this case.

Or it might be a completely different problem. If that doesn't sound right, I can have a closer look tomorrow.

> my first guess is that RESTBase is reading from an outdated replica

My understanding is that RESTBase is just going to pass on this request to the Parsoid cluster, to fetch the content for the oldid and parse it.
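
For reference, the request involved has roughly this shape (a sketch using the public REST API equivalent; the extension actually goes through MediaWiki's virtual REST service, and the title and revision ID here are made up):

```php
// Sketch: the public equivalent of the lookup that RESTBase forwards to
// the Parsoid cluster. A 404 here means RESTBase/Parsoid doesn't know the
// page or revision — the failure mode this task is about.
$title = rawurlencode( 'Example' ); // hypothetical page title
$revId = 12345;                     // hypothetical revision ID
$url = "https://en.wikipedia.org/api/rest_v1/page/html/{$title}/{$revId}";
$html = @file_get_contents( $url );
if ( $html === false ) {
	echo "Fetch failed (possibly HTTP 404)\n";
}
```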

Change 823768 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@master] Add try…catch in failing deferred update

https://gerrit.wikimedia.org/r/823768

The exception may be causing other updates to not be executed (e.g. Echo notifications); I think we should put a try…catch around this while we investigate. Feel free to deploy whenever you can.
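
For context, the change has roughly this shape (a sketch of the pattern, not the actual diff — see the Gerrit link above; $rev and the surrounding structure are assumptions):

```php
// Sketch of the pattern in DataUpdatesHooks.php: wrap the Parsoid fetch
// so a RESTBase failure can't break the surrounding update chain.
$updates[] = new MWCallableUpdate( static function () use ( $rev ) {
	try {
		$html = HookUtils::parseRevisionParsoidHtml( $rev );
		// ...compute thread item data from the Parsoid HTML...
	} catch ( MWException $e ) {
		// Log and swallow while the root cause is investigated.
		MWExceptionHandler::logException( $e );
	}
}, __METHOD__ );
```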

Change 823768 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Add try…catch in failing deferred update

https://gerrit.wikimedia.org/r/823768

> The exception may be causing other updates to not be executed (e.g. Echo notifications); I think we should put a try…catch around this while we investigate. Feel free to deploy whenever you can.

It is my understanding that a failing deferred update does not prevent other updates from running. They should be (and I believe are, but feel free to verify locally with a hardcoded throw) try-catch wrapped in core already. That try-catch is what produces the production error log entry this task is filed from. We catch and log these errors from top-level db transaction scopes, such as entire web requests, individual jobs, and individual deferred updates.
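
(For anyone wanting to reproduce the "hardcoded throw" check locally, a minimal sketch:)

```php
// Sketch: schedule two deferred updates on a development wiki; the first
// one throws. Expected: core's DeferredUpdates machinery catches and logs
// the exception, and the second update still runs.
DeferredUpdates::addCallableUpdate( static function () {
	throw new RuntimeException( 'hardcoded throw for testing' );
} );
DeferredUpdates::addCallableUpdate( static function () {
	wfDebugLog( 'test', 'this update still ran' );
} );
```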

JJMC89 triaged this task as Unbreak Now! priority. Aug 17 2022, 1:29 AM
JJMC89 subscribed.

As a train blocker

Change 823640 had a related patch set uploaded (by Jforrester; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@wmf/1.39.0-wmf.25] Add try…catch in failing deferred update

https://gerrit.wikimedia.org/r/823640

> It is my understanding that a failing deferred update does not prevent other updates from running. They should be (and I believe are, but feel free to verify locally with a hardcoded throw) try-catch wrapped in core already. That try-catch is what produces the production error log entry this task is filed from. We catch and log these errors from top-level db transaction scopes, such as entire web requests, individual jobs, and individual deferred updates.

I had a closer look, and you're right about the deferred updates, but in this case our update runs from the RefreshLinksJob, which AFAICS doesn't have any try…catch. I should have written "revision data update" rather than "deferred update" in the commit message.

I'm not sure if I actually understand how it works, but I'm going to go with the try…catch until someone proves that it's not needed.

Change 823640 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@wmf/1.39.0-wmf.25] Add try…catch in failing deferred update

https://gerrit.wikimedia.org/r/823640

Mentioned in SAL (#wikimedia-operations) [2022-08-17T13:32:40Z] <taavi@deploy1002> Synchronized php-1.39.0-wmf.25/extensions/DiscussionTools/includes/Hooks/DataUpdatesHooks.php: Backport: [[gerrit:823640|Add try…catch in failing deferred update (T315383)]] (duration: 03m 18s)

I was browsing through the logs, trying to see if we logged any more details, and I noticed something…

In the last 15 minutes, the exceptions occurred on the following wikis: aawikibooks kjwiki pswikibooks chowiki xhwiktionary nawikibooks biwikibooks aswiktionary nawikiquote towiktionary aswikibooks

Every one of those is a closed/locked wiki. This seems too consistent to be a coincidence?

Why are there even RefreshLinksJobs running on wikis that (mostly) can't be edited?
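
For reference, the cross-check against closed.dblist can be scripted (a sketch; the path assumes an operations/mediawiki-config checkout):

```php
// Sketch: which of the affected wikis are listed in closed.dblist?
$affected = [
	'aawikibooks', 'kjwiki', 'pswikibooks', 'chowiki', 'xhwiktionary',
	'nawikibooks', 'biwikibooks', 'aswiktionary', 'nawikiquote',
	'towiktionary', 'aswikibooks',
];
$closed = array_map( 'trim', file( 'dblists/closed.dblist' ) );
foreach ( $affected as $wiki ) {
	echo "$wiki: " . ( in_array( $wiki, $closed, true ) ? 'closed' : 'open' ) . "\n";
}
```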

Apparently RESTBase is not set up at all on those wikis, and it simply responds 404, as if the page/revision didn't exist. You also get funny error messages trying to open VisualEditor on those wikis (random page: https://kj.wikipedia.org/wiki/Omafiku_oko_shivike?veaction=edit).

> You also get funny error messages trying to open VisualEditor on those wikis (random page: https://kj.wikipedia.org/wiki/Omafiku_oko_shivike?veaction=edit).

Might it be worth disabling VE etc. on these wikis? (would be as simple as setting 'closed' => false for wmgUseVisualEditor, wmgUseParsoid and/or wmgUseRestbaseVRS I think?)

Should we roll back group 0 wikis to 1.39.0-wmf.23, or is that unrelated to the train?

No rollback is needed with the patch we just backported.

We could even try rolling out to group 1, actually, if the problem only affects closed wikis.

Change 824224 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[operations/mediawiki-config@master] RESTBase is not enabled on closed wikis

https://gerrit.wikimedia.org/r/824224

> You also get funny error messages trying to open VisualEditor on those wikis (random page: https://kj.wikipedia.org/wiki/Omafiku_oko_shivike?veaction=edit).
>
> Might it be worth disabling VE etc. on these wikis? (would be as simple as setting 'closed' => false for wmgUseVisualEditor, wmgUseParsoid and/or wmgUseRestbaseVRS I think?)

In the above mediawiki-config patch I didn't disable VE, but I did disable RESTBase for those wikis. So they will contact Parsoid directly, which should work.
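
The change is roughly of this shape (a sketch; the real diff is in the Gerrit change above):

```php
// Sketch of the pattern in wmf-config/InitialiseSettings.php: turn off
// the RESTBase virtual REST service on wikis in closed.dblist, so they
// fall back to contacting Parsoid directly.
'wmgUseRestbaseVRS' => [
	'default' => true,
	'closed' => false,
],
```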

> In the above mediawiki-config patch I didn't disable VE, but I did disable RESTBase for those wikis. So they will contact Parsoid directly, which should work.

lgtm, thank you 🙂

Mentioned in SAL (#wikimedia-operations) [2022-08-17T15:24:04Z] <TheresNoTime> deploying [[gerrit:824224|RESTBase is not enabled on closed wikis (T315383)]] out of window

Change 824224 merged by jenkins-bot:

[operations/mediawiki-config@master] RESTBase is not enabled on closed wikis

https://gerrit.wikimedia.org/r/824224

Mentioned in SAL (#wikimedia-operations) [2022-08-17T15:42:10Z] <samtar@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:824224|RESTBase is not enabled on closed wikis (T315383)]] (duration: 03m 27s)

Mentioned in SAL (#wikimedia-operations) [2022-08-17T15:42:59Z] <TheresNoTime> finished deploying [[gerrit:824224|RESTBase is not enabled on closed wikis (T315383)]]

As a side effect VE now works on closed wikis, for those stewards and global interface maintainers who might want to use it. :)

I still see a couple of recent (in the last 15 minutes) hits for this error message. One with HTTP 403 (accessing officewiki) and one with HTTP 404 (accessing mediawikiwiki).

I will investigate, but if the volume is low, the errors are completely harmless now (they're ignored in a try…catch).

Just for reference, here's a link to the relevant logs and a chart: Logstash search for "Error contacting the Parsoid/RESTBase server (HTTP 404)", last 24 hours

(screenshot: Logstash chart of error volume, last 24 hours)

There were no errors in the last 2 hours. It's a little weird to me that there's a slow drop-off after @cscott's config change was deployed, rather than the issue disappearing instantly… but I guess it's fixed now.


If I exclude all of the closed wikis (closed.dblist), there is exactly one result: Logstash search

(screenshot: Logstash chart, closed wikis excluded)

This is the HTTP 404 accessing mediawikiwiki that @dancy mentions. I don't know what's up with it, but it is certainly a different issue. Maybe we should think some more about reading from outdated replicas (my first debugging idea from yesterday, T315383#8159951), or maybe the job tried updating a page that was deleted in the meantime. Either way, in my opinion this doesn't block the train.


There are also some errors with HTTP 403 in them, per @dancy: Logstash search

(screenshot: Logstash chart of HTTP 403 errors)

We didn't notice these before in the flood of 404s; they also seem new since the last deployment, but they are certainly a different issue as well. All of them appear to be on officewiki, which probably means it's a problem with private wikis. That kind of makes sense: private wikis require auth to read, and we usually solve this by forwarding the user's cookies, but in this case we're (potentially) in a job where we don't have access to the request context. We should think about fixing this, since right now the feature won't work on private wikis, but in my opinion this also doesn't block the train. I can silence the errors if they are annoying.

@matmarex Thanks for the analysis. I will remove this as a train blocker task.

(re slow dropoff -- how long do the refresh links jobs take to run to completion? You probably wouldn't pick up the config change until your refresh links finished, so a refresh links job that started just before the config change was synced might still generate a bunch of 404s before it finally finished.)

FWIW, I also see this happening today on commonswiki (that is, not just closed wikis)

I'm going to close this task, and file separate ones about the HTTP 403 on private wikis and the rare HTTP 404 on non-closed wikis. These errors are now wrapped in try…catch, so they're not causing any problems, and there are just a couple hundred of them per day so the volume of logs shouldn't be a problem either (and I'd prefer to keep the logging in case something else breaks with the same error message again).

matmarex renamed this task from MWException: Error contacting the Parsoid/RESTBase server (HTTP 404) to MWException: Error contacting the Parsoid/RESTBase server (HTTP 404) from DiscussionTools (on closed wikis). Aug 19 2022, 5:07 PM

Hi @matmarex. In case it is relevant, I am also regularly seeing HTTP 504 errors.