
MWException: Error contacting the Parsoid/RESTBase server (HTTP 404) from DiscussionTools (on closed wikis)
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error
normalized_message
[{reqId}] {exception_url}   MWException: Error contacting the Parsoid/RESTBase server (HTTP 404)
exception.trace
from /srv/mediawiki/php-1.39.0-wmf.25/extensions/DiscussionTools/includes/Hooks/HookUtils.php(98)
#0 /srv/mediawiki/php-1.39.0-wmf.25/extensions/DiscussionTools/includes/Hooks/DataUpdatesHooks.php(47): MediaWiki\Extension\DiscussionTools\Hooks\HookUtils::parseRevisionParsoidHtml(MediaWiki\Revision\RevisionStoreRecord)
#1 /srv/mediawiki/php-1.39.0-wmf.25/includes/deferred/MWCallableUpdate.php(38): MediaWiki\Extension\DiscussionTools\Hooks\DataUpdatesHooks->MediaWiki\Extension\DiscussionTools\Hooks\{closure}()
#2 /srv/mediawiki/php-1.39.0-wmf.25/includes/deferred/DeferredUpdates.php(474): MWCallableUpdate->doUpdate()
#3 /srv/mediawiki/php-1.39.0-wmf.25/includes/deferred/RefreshSecondaryDataUpdate.php(103): DeferredUpdates::attemptUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#4 /srv/mediawiki/php-1.39.0-wmf.25/includes/deferred/DeferredUpdates.php(474): RefreshSecondaryDataUpdate->doUpdate()
#5 /srv/mediawiki/php-1.39.0-wmf.25/includes/Storage/DerivedPageDataUpdater.php(1801): DeferredUpdates::attemptUpdate(RefreshSecondaryDataUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#6 /srv/mediawiki/php-1.39.0-wmf.25/includes/page/WikiPage.php(2136): MediaWiki\Storage\DerivedPageDataUpdater->doSecondaryDataUpdates(array)
#7 /srv/mediawiki/php-1.39.0-wmf.25/includes/jobqueue/jobs/RefreshLinksJob.php(242): WikiPage->doSecondaryDataUpdates(array)
#8 /srv/mediawiki/php-1.39.0-wmf.25/includes/jobqueue/jobs/RefreshLinksJob.php(163): RefreshLinksJob->runForTitle(Title)
#9 /srv/mediawiki/php-1.39.0-wmf.25/extensions/EventBus/includes/JobExecutor.php(79): RefreshLinksJob->run()
#10 /srv/mediawiki/rpc/RunSingleJob.php(77): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#11 {main}
Impact
Notes

> 200 in the last 15 minutes, multiple projects

Event Timeline

The code in DiscussionTools is a new feature (T296801). If all else fails, it can be reverted; it's currently an elaborate no-op. (It's supposed to compute some stuff after a page is edited, and save it to some database tables, but the tables don't exist yet, so it just computes and discards it.)

Since this code is trying to read a revision that was just saved from RESTBase, my first guess is that RESTBase is reading from an outdated replica, and responding 404 because the revision (or page) doesn't exist there yet.

If my guess is right, then I don't know how to resolve that, other than by not using RESTBase. I can put in a sleep(5)? (ew). I was hoping RESTBase would do something smart in this case.

Or it might be a completely different problem. If that doesn't sound right, I can have a closer look tomorrow.

> my first guess is that RESTBase is reading from an outdated replica

My understanding is that RESTBase is just going to pass on this request to the Parsoid cluster, to fetch the content for the oldid and parse it.
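
For reference, the request involved has roughly this shape (a sketch using the public REST API equivalent; the extension actually goes through MediaWiki's virtual REST service, and the title and revision ID here are made up):

```php
// Sketch: the public equivalent of the lookup that RESTBase forwards to
// the Parsoid cluster. A 404 here means RESTBase/Parsoid doesn't know the
// page or revision — the failure mode this task is about.
$title = rawurlencode( 'Example' ); // hypothetical page title
$revId = 12345;                     // hypothetical revision ID
$url = "https://en.wikipedia.org/api/rest_v1/page/html/{$title}/{$revId}";
$html = @file_get_contents( $url );
if ( $html === false ) {
	echo "Fetch failed (possibly HTTP 404)\n";
}
```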

Change 823768 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@master] Add try…catch in failing deferred update

https://gerrit.wikimedia.org/r/823768

The exception may be causing other updates to not be executed (e.g. Echo notifications); I think we should put a try…catch around this while we investigate. Feel free to deploy whenever you can.
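
For context, the change has roughly this shape (a sketch of the pattern, not the actual diff — see the Gerrit link above; $rev and the surrounding structure are assumptions):

```php
// Sketch of the pattern in DataUpdatesHooks.php: wrap the Parsoid fetch
// so a RESTBase failure can't break the surrounding update chain.
$updates[] = new MWCallableUpdate( static function () use ( $rev ) {
	try {
		$html = HookUtils::parseRevisionParsoidHtml( $rev );
		// ...compute thread item data from the Parsoid HTML...
	} catch ( MWException $e ) {
		// Log and swallow while the root cause is investigated.
		MWExceptionHandler::logException( $e );
	}
}, __METHOD__ );
```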

Change 823768 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Add try…catch in failing deferred update

https://gerrit.wikimedia.org/r/823768

> The exception may be causing other updates to not be executed (e.g. Echo notifications); I think we should put a try…catch around this while we investigate. Feel free to deploy whenever you can.

It is my understanding that a failing deferred update does not prevent other updates from running. They should be (and I believe are, but feel free to verify locally with a hardcoded throw) try-catch wrapped in core already. That try-catch is what produces the production error log entry this task is filed from. We catch and log these errors from top-level db transaction scopes, such as entire web requests, individual jobs, and individual deferred updates.
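
(For anyone wanting to reproduce the "hardcoded throw" check locally, a minimal sketch:)

```php
// Sketch: schedule two deferred updates on a development wiki; the first
// one throws. Expected: core's DeferredUpdates machinery catches and logs
// the exception, and the second update still runs.
DeferredUpdates::addCallableUpdate( static function () {
	throw new RuntimeException( 'hardcoded throw for testing' );
} );
DeferredUpdates::addCallableUpdate( static function () {
	wfDebugLog( 'test', 'this update still ran' );
} );
```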

JJMC89 triaged this task as Unbreak Now! priority. Aug 17 2022, 1:29 AM
JJMC89 subscribed.

As a train blocker

Change 823640 had a related patch set uploaded (by Jforrester; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@wmf/1.39.0-wmf.25] Add try…catch in failing deferred update

https://gerrit.wikimedia.org/r/823640

> It is my understanding that a failing deferred update does not prevent other updates from running. They should be (and I believe are, but feel free to verify locally with a hardcoded throw) try-catch wrapped in core already. That try-catch is what produces the production error log entry this task is filed from. We catch and log these errors from top-level db transaction scopes, such as entire web requests, individual jobs, and individual deferred updates.

I had a closer look, and you're right about the deferred updates, but in this case our update runs from the RefreshLinksJob, which AFAICS doesn't have any try…catch. I should have written "revision data update" rather than "deferred update" in the commit message.

I'm not sure if I actually understand how it works, but I'm going to go with the try…catch until someone proves that it's not needed.

Change 823640 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@wmf/1.39.0-wmf.25] Add try…catch in failing deferred update

https://gerrit.wikimedia.org/r/823640

Mentioned in SAL (#wikimedia-operations) [2022-08-17T13:32:40Z] <taavi@deploy1002> Synchronized php-1.39.0-wmf.25/extensions/DiscussionTools/includes/Hooks/DataUpdatesHooks.php: Backport: [[gerrit:823640|Add try…catch in failing deferred update (T315383)]] (duration: 03m 18s)

I was browsing through the logs, trying to see if we logged any more details, and I noticed something…

In the last 15 minutes, the exceptions occurred on the following wikis: aawikibooks kjwiki pswikibooks chowiki xhwiktionary nawikibooks biwikibooks aswiktionary nawikiquote towiktionary aswikibooks

Every one of those is a closed/locked wiki. This seems too consistent to be a coincidence?

Why are there even RefreshLinksJobs running on wikis that (mostly) can't be edited?
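
For reference, the cross-check against closed.dblist can be scripted (a sketch; the path assumes an operations/mediawiki-config checkout):

```php
// Sketch: which of the affected wikis are listed in closed.dblist?
$affected = [
	'aawikibooks', 'kjwiki', 'pswikibooks', 'chowiki', 'xhwiktionary',
	'nawikibooks', 'biwikibooks', 'aswiktionary', 'nawikiquote',
	'towiktionary', 'aswikibooks',
];
$closed = array_map( 'trim', file( 'dblists/closed.dblist' ) );
foreach ( $affected as $wiki ) {
	echo "$wiki: " . ( in_array( $wiki, $closed, true ) ? 'closed' : 'open' ) . "\n";
}
```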

Apparently RESTBase is not set up at all on those wikis, and it simply responds 404, as if the page/revision didn't exist. You also get funny error messages trying to open VisualEditor on those wikis (random page: https://kj.wikipedia.org/wiki/Omafiku_oko_shivike?veaction=edit).

> You also get funny error messages trying to open VisualEditor on those wikis (random page: https://kj.wikipedia.org/wiki/Omafiku_oko_shivike?veaction=edit).

Might it be worth disabling VE etc. on these wikis? (would be as simple as setting 'closed' => false for wmgUseVisualEditor, wmgUseParsoid and/or wmgUseRestbaseVRS I think?)

Should we roll back group 0 wikis to 1.39.0-wmf.23, or is that unrelated to the train?

No rollback is needed with the patch we just backported.

We could even try rolling out to group 1, actually, if the problem only affects closed wikis.

Change 824224 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[operations/mediawiki-config@master] RESTBase is not enabled on closed wikis

https://gerrit.wikimedia.org/r/824224

> You also get funny error messages trying to open VisualEditor on those wikis (random page: https://kj.wikipedia.org/wiki/Omafiku_oko_shivike?veaction=edit).
>
> Might it be worth disabling VE etc. on these wikis? (would be as simple as setting 'closed' => false for wmgUseVisualEditor, wmgUseParsoid and/or wmgUseRestbaseVRS I think?)

In the above mediawiki-config patch I didn't disable VE, but I did disable RESTBase for those wikis. So they will contact Parsoid directly, which should work.
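
The change is roughly of this shape (a sketch; the real diff is in the Gerrit change above):

```php
// Sketch of the pattern in wmf-config/InitialiseSettings.php: turn off
// the RESTBase virtual REST service on wikis in closed.dblist, so they
// fall back to contacting Parsoid directly.
'wmgUseRestbaseVRS' => [
	'default' => true,
	'closed' => false,
],
```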

> In the above mediawiki-config patch I didn't disable VE, but I did disable RESTBase for those wikis. So they will contact Parsoid directly, which should work.

lgtm, thank you 🙂

Mentioned in SAL (#wikimedia-operations) [2022-08-17T15:24:04Z] <TheresNoTime> deploying [[gerrit:824224|RESTBase is not enabled on closed wikis (T315383)]] out of window

Change 824224 merged by jenkins-bot:

[operations/mediawiki-config@master] RESTBase is not enabled on closed wikis

https://gerrit.wikimedia.org/r/824224

Mentioned in SAL (#wikimedia-operations) [2022-08-17T15:42:10Z] <samtar@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:824224|RESTBase is not enabled on closed wikis (T315383)]] (duration: 03m 27s)

Mentioned in SAL (#wikimedia-operations) [2022-08-17T15:42:59Z] <TheresNoTime> finished deploying [[gerrit:824224|RESTBase is not enabled on closed wikis (T315383)]]

As a side effect VE now works on closed wikis, for those stewards and global interface maintainers who might want to use it. :)

I still see a couple of recent (in the last 15 minutes) hits for this error message. One with HTTP 403 (accessing officewiki) and one with HTTP 404 (accessing mediawikiwiki).

I will investigate, but if the volume is low, the errors are completely harmless now (they're ignored in a try…catch).

Just for reference, here's a link to the relevant logs and a chart: Logstash search for "Error contacting the Parsoid/RESTBase server (HTTP 404)", last 24 hours

(screenshot: Logstash chart of error volume, last 24 hours)

There were no errors in the last 2 hours. It's a little weird to me that there's a slow drop-off after @cscott's config change was deployed, rather than the issue disappearing instantly… but I guess it's fixed now.


If I exclude all of the closed wikis (closed.dblist), there is exactly one result: Logstash search

(screenshot: Logstash chart, closed wikis excluded)

This is the HTTP 404 accessing mediawikiwiki that @dancy mentions. I don't know what's up with it, but it is certainly a different issue. Maybe we should think some more about reading from outdated replicas (my first debugging idea from yesterday, T315383#8159951), or maybe the job tried updating a page that was deleted in the meantime. Either way, in my opinion this doesn't block the train.


There are also some errors with HTTP 403 in them, per @dancy: Logstash search

(screenshot: Logstash chart of HTTP 403 errors)

We didn't notice these before in the flood of 404s; they also seem new since the last deployment, but they are certainly a different issue as well. All of them appear to be on officewiki, which probably means it's a problem with private wikis. That kind of makes sense: private wikis require auth to read, and we usually solve this by forwarding the user's cookies, but in this case we're (potentially) in a job where we don't have access to the request context. We should think about fixing this, since right now the feature won't work on private wikis, but in my opinion this also doesn't block the train. I can silence the errors if they are annoying.

@matmarex Thanks for the analysis. I will remove this as a train blocker task.

(re slow dropoff -- how long do the refresh links jobs take to run to completion? You probably wouldn't pick up the config change until your refresh links finished, so a refresh links job that started just before the config change was synced might still generate a bunch of 404s before it finally finished.)

FWIW, I also see this happening today on commonswiki (that is, not just closed wikis)

I'm going to close this task, and file separate ones about the HTTP 403 on private wikis and the rare HTTP 404 on non-closed wikis. These errors are now wrapped in try…catch, so they're not causing any problems, and there are just a couple hundred of them per day so the volume of logs shouldn't be a problem either (and I'd prefer to keep the logging in case something else breaks with the same error message again).

matmarex renamed this task from MWException: Error contacting the Parsoid/RESTBase server (HTTP 404) to MWException: Error contacting the Parsoid/RESTBase server (HTTP 404) from DiscussionTools (on closed wikis). Aug 19 2022, 5:07 PM

Hi @matmarex. In case it is relevant, I am also regularly seeing HTTP 504 errors.