
UnexpectedValueException: Parsoid does not support content model proofread-index
Closed, Resolved · Public · PRODUCTION ERROR

Description

Error
normalized_message
[{reqId}] {exception_url}   UnexpectedValueException: Parsoid does not support content model proofread-index
exception.trace
from /srv/mediawiki/php-1.40.0-wmf.13/includes/parser/Parsoid/ParsoidOutputAccess.php(196)
#0 /srv/mediawiki/php-1.40.0-wmf.13/includes/Rest/Handler/HtmlOutputRendererHelper.php(535): MediaWiki\Parser\Parsoid\ParsoidOutputAccess->getParserOutput(MediaWiki\Page\PageStoreRecord, ParserOptions, MediaWiki\Revision\RevisionStoreRecord, integer)
#1 /srv/mediawiki/php-1.40.0-wmf.13/includes/Rest/Handler/HtmlOutputRendererHelper.php(635): MediaWiki\Rest\Handler\HtmlOutputRendererHelper->getParserOutput()
#2 /srv/mediawiki/php-1.40.0-wmf.13/includes/Rest/Handler/ParsoidHandler.php(881): MediaWiki\Rest\Handler\HtmlOutputRendererHelper->getPageBundle()
#3 /srv/mediawiki/php-1.40.0-wmf.13/vendor/wikimedia/parsoid/extension/src/Rest/Handler/PageHandler.php(92): MediaWiki\Rest\Handler\ParsoidHandler->wt2html(MediaWiki\Parser\Parsoid\Config\PageConfig, array)
#4 /srv/mediawiki/php-1.40.0-wmf.13/includes/Rest/Router.php(487): MWParsoid\Rest\Handler\PageHandler->execute()
#5 /srv/mediawiki/php-1.40.0-wmf.13/includes/Rest/Router.php(406): MediaWiki\Rest\Router->executeHandler(MWParsoid\Rest\Handler\PageHandler)
#6 /srv/mediawiki/php-1.40.0-wmf.13/includes/Rest/EntryPoint.php(191): MediaWiki\Rest\Router->execute(MediaWiki\Rest\RequestFromGlobals)
#7 /srv/mediawiki/php-1.40.0-wmf.13/includes/Rest/EntryPoint.php(131): MediaWiki\Rest\EntryPoint->execute()
#8 /srv/mediawiki/php-1.40.0-wmf.13/rest.php(31): MediaWiki\Rest\EntryPoint::main()
#9 /srv/mediawiki/w/rest.php(3): require(string)
#10 {main}
Impact
Notes
  • > 4000 occurrences in the last hour

Event Timeline

Started about an hour ago, around the time the train rolled.

TheresNoTime triaged this task as Unbreak Now! priority. Dec 7 2022, 9:19 PM

UBN! I think this needs some attention regarding rolling back the train, if that was indeed the cause.

Ugh... this may be my fault... HtmlOutputRendererHelper is now on this code path. It may not know how to handle proofread-index.

So, HtmlOutputRendererHelper throws because ParsoidOutputAccess::supportsContentModel() returns false for "proofread-index". Is this correct? I don't know...

The code is:

	/**
	 * @param string $model
	 *
	 * @return bool
	 */
	public function supportsContentModel( string $model ): bool {
		if ( $model === CONTENT_MODEL_WIKITEXT ) {
			return true;
		}

		return $this->siteConfig->getContentModelHandler( $model ) !== null;
	}

SiteConfig::getContentModelHandler() in turn calls $env->getContentHandler( $contentmodel ), which apparently doesn't know about "proofread-index". I can't find any mention of it in the Parsoid repo. Parsoid has an extension mechanism for registering handlers for different models, but I can't find any mention of that in the ProofreadPage repo either.
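For reference, Parsoid's extension mechanism maps a content model to a handler via the module config (this is how Parsoid's bundled JSON extension registers the "json" model). A hypothetical registration for "proofread-index" would look roughly like this; ProofreadIndexModule and ProofreadIndexHandler are made-up names, and the point is precisely that ProofreadPage ships nothing like this:

	use Wikimedia\Parsoid\Ext\ExtensionModule;

	class ProofreadIndexModule implements ExtensionModule {
		/** @inheritDoc */
		public function getConfig(): array {
			return [
				'name' => 'ProofreadIndex',
				// Maps a content model name to a
				// Wikimedia\Parsoid\Ext\ContentModelHandler subclass.
				'contentModels' => [
					'proofread-index' => ProofreadIndexHandler::class,
				],
			];
		}
	}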

So, how did this use to work? Did it ever work? Did we just pretend it's wikitext?

So one bug here is: asking for an unsupported content model shouldn't cause a 500. It should cause a 400 or 406. I'll get to work fixing that.
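The shape of that fix, sketched (not the literal patch; this just illustrates checking the model up front and mapping it to a client error, rather than letting the UnexpectedValueException escape as a 500 — HttpException is a real MediaWiki\Rest class):

	use MediaWiki\Rest\HttpException;

	// Inside the REST handler, before asking Parsoid for a rendering:
	if ( !$parsoidOutputAccess->supportsContentModel( $model ) ) {
		// 400 Bad Request instead of an uncaught exception (500)
		throw new HttpException(
			"Parsoid does not support content model $model", 400
		);
	}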

I'm pretty sure this was caused by https://gerrit.wikimedia.org/r/c/mediawiki/core/+/864723. I'd much rather fix it than revert, though.

My current thinking is that the Parsoid endpoints should return a 400 for unsupported content models. That would at least remove the log spam, unless RESTBase starts logging the 400 responses.

@daniel I was about to roll back the train but I can wait if you think you can resolve the issue in short time. It's generating a ton of errors.

I have put together a quick and dirty fix that changes this from a 500 to a 400, so we don't spam the log with it.
The fact remains that parsoid can't render pages of this type. And when we ask it to do so, the result is either an error or garbage.
We should find out what asks for the rendering, and why.
It's probably changeprop/restbase blindly asking for a rendering of everything, regardless of content model.

I will have a patch up in a minute

Great! I will leave things in your capable hands.

Change 865784 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] Throw a 400 when asking parsoid to render an unknown content model.

https://gerrit.wikimedia.org/r/865784

This is not just proofread-index, but basically all non-wikitext content models: "proofread-index", "flow-board", "javascript", "wikibase-item", "sanitized-css", etc. So, RB (or whichever upstream service it is) shouldn't be forwarding these to Parsoid. But for now, we should just fix ParsoidOutputAccess to pretend these are all wikitext content models.

HTTP 400 is reasonable, but I am not sure what RESTBase will do with all these 400s, hence my conservative caution, since we don't have any RB folks here.

Change 865785 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] Make parsoid accept all content models.

https://gerrit.wikimedia.org/r/865785

This is the important fix. It will restore the previous behavior: let parsoid blindly render anything, even if the result is garbage.

The patch for turning the 500 into a 400 isn't really relevant, just nice to have.
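For clarity, the "accept all content models" fix effectively relaxes the supportsContentModel() check quoted above. A rough sketch of the idea (not the literal diff from change 865785):

	public function supportsContentModel( string $model ): bool {
		// Treat every model as renderable: models Parsoid doesn't know
		// natively fall through to wikitext-style handling, restoring
		// the pre-wmf.13 behavior. The output may be garbage for
		// non-wikitext content, but requests no longer fail with a 500.
		return true;
	}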

Change 865785 merged by jenkins-bot:

[mediawiki/core@master] Make parsoid accept all content models.

https://gerrit.wikimedia.org/r/865785

Change 865749 had a related patch set uploaded (by Samtar; author: Daniel Kinzler):

[mediawiki/core@wmf/1.40.0-wmf.13] Make parsoid accept all content models.

https://gerrit.wikimedia.org/r/865749

Change 865784 merged by jenkins-bot:

[mediawiki/core@master] Throw a 400 when asking parsoid to render an unknown content model.

https://gerrit.wikimedia.org/r/865784

Mentioned in SAL (#wikimedia-operations) [2022-12-07T22:48:31Z] <TheresNoTime> Going to backport [[gerrit:865749]] to wmf/1.40.0-wmf.13 for T324711

Change 865749 merged by jenkins-bot:

[mediawiki/core@wmf/1.40.0-wmf.13] Make parsoid accept all content models.

https://gerrit.wikimedia.org/r/865749

Mentioned in SAL (#wikimedia-operations) [2022-12-07T23:00:48Z] <samtar@deploy1002> Started scap: Backport for [[gerrit:865749|Make parsoid accept all content models. (T324711)]]

Mentioned in SAL (#wikimedia-operations) [2022-12-07T23:02:45Z] <samtar@deploy1002> samtar and samtar: Backport for [[gerrit:865749|Make parsoid accept all content models. (T324711)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-12-07T23:14:46Z] <samtar@deploy1002> Finished scap: Backport for [[gerrit:865749|Make parsoid accept all content models. (T324711)]] (duration: 13m 57s)

TheresNoTime lowered the priority of this task from Unbreak Now! to Needs Triage.

The deployment of this patch has resolved the exceptions.

(Attachment: Screenshot 2022-12-07 at 18.18.58.png, 589 KB)

Noting this comment, this patch should be undone(?) once T311728: Gracefully handle non-wikitext pages when returning HTML from the page endpoints is completed.

Change 866146 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] ParsoidOutputAccess: only cache output for wikitext

https://gerrit.wikimedia.org/r/866146

Change 866250 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] ParsoidOutputAccess: generate dummy output for unsupported models.

https://gerrit.wikimedia.org/r/866250

daniel triaged this task as High priority. Dec 8 2022, 7:59 AM
daniel added a project: RESTBase Sunsetting.

Change 866146 merged by jenkins-bot:

[mediawiki/core@master] ParsoidOutputAccess: only cache output for wikitext

https://gerrit.wikimedia.org/r/866146

Change 866250 merged by jenkins-bot:

[mediawiki/core@master] ParsoidOutputAccess: generate dummy output for unsupported models.

https://gerrit.wikimedia.org/r/866250

daniel moved this task from Unsorted to PCS Service Pile on the RESTBase Sunsetting board.

@cscott @ssastry Are you really sure this is fixed? Assuming this has been deployed, this is still failing: https://id.wikipedia.org/api/rest_v1/page/html/Story%3AWisata_Wajib_di_Toraja. Actually, this bug seems to be the reason why we have not been able to generate a ZIM offline snapshot of id.wikipedia.org for two quarters. More details about the bug at the MWoffliner level at https://github.com/openzim/mwoffliner/issues/1853.

@Kelson Yeah, the resolution from T246403 was that Parsoid

should support all models that serialize to wikitext

https://github.com/wikimedia/mediawiki/commit/e1c3af91774818f512cfa3acf594242624c2f778

You're seeing,

Parsoid does not support content model story

That page has "contentmodel": "story",
https://id.wikipedia.org/w/api.php?action=query&titles=Story:Wisata_Wajib_di_Toraja&prop=info

In the Wikistories extension, story is an alias for the json content model, which is not something that needs wikitext parsing.

$wgContentHandlers[ 'story' ] = JsonContentHandler::class;

https://github.com/wikimedia/mediawiki-extensions-Wikistories/blob/master/README#L14
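The "serialize to wikitext" criterion from T246403 could plausibly be checked along these lines. This is a hedged sketch, not the actual implementation: isWikitextModel() is a hypothetical helper, though ContentHandlerFactory, getSupportedFormats(), and CONTENT_FORMAT_WIKITEXT are real MediaWiki APIs. Under this test, 'story' (whose handler is a JsonContentHandler subclass) would be rejected, matching the error Kelson is seeing:

	use MediaWiki\MediaWikiServices;

	/**
	 * Hypothetical helper: does this content model serialize to wikitext?
	 */
	function isWikitextModel( string $model ): bool {
		$handler = MediaWikiServices::getInstance()
			->getContentHandlerFactory()
			->getContentHandler( $model );

		return in_array(
			CONTENT_FORMAT_WIKITEXT,
			$handler->getSupportedFormats(),
			true
		);
	}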

@Arlolra Thank you for your technical explanation, but I have a hard time using it to answer my questions, which are:

  • Is the "Story" namespace really only JSON, so that the JsonContentHandler is appropriate?

Not quite - StoryContentHandler is a subclass of JsonContentHandler. So it "is a" JsonContentHandler, but more than "just" a JsonContentHandler.

Sure, why not? Wikidata for example has no wikitext in the main namespace at all. "Content" doesn't have to mean "text".

It's a missing feature, not a bug as such. It's being worked on, see T330772 and T317018. It will be fixed in a couple of months I hope.

The tickets I mentioned above are probably more informative.

@daniel Thank you! Things are clearer for me now. At the core of my problem is the fact that we have always used $wgContentNamespaces to determine which namespaces should be scraped... and that no longer works so well here.