Page MenuHomePhabricator

Parsoid no longer used for non-wikitext content models from external REST API
Open, Needs TriagePublic

Description

Parsoid CI broken by https://gerrit.wikimedia.org/r/c/mediawiki/core/+/953342

The TLDR is:

  • Before gerrit 953342, ParsoidOutputAccess was calling Parsoid (library) directly to handle JSON content model pages
  • After gerrit 953342, ParsoidOutputAccess calls ParsoidOutputAccess to handle JSON content model pages and these requests are handled by core's content handlers and don't hit Parsoid.

Because of the change above, Parsoid's pagebundle headers are no longer present in the output. This causes Parsoid's CI tests to fail.

EDIT: the CI test were fixed (by disabling them) but there are some follow up tasks discussed below. Primarily ensuring that we have support for some other non-wikitext content models like used for proofread page, but also that we maintain coverage of the contentmodel dispatch code in Parsoid so it doesn't coderot, which may mean replacing some of the now-disabled API tests with unit tests of some sort.

Event Timeline

We have determined after a bunch of discussion that Parsoid shouldn't be handling JSON content models (or other non-wikitext content models) for generating HTML output. Metadata generation will be tackled in the future.

One concern was whether RESTBase will fail if it receives the new response from core which doen't have all these headers. Since gerrit 953342 has gone rolled out on the train to testwiki. I created a json test page https://test.wikipedia.org/wiki/Json_page.

Accessing https://test.wikipedia.org/api/rest_v1/page/html/Json_page (which is the RESTBase API) returns a http 200 response that looks right. The only difference from say a page on a different wiki (ex: this itwiki page) is that the HTML (if you inspect page source) is no longer Parsoid HTML. So, RESTBase seems to be handling this just fine.

So, it looks like the only thing we now need to fix is Parsoid's API tests to not query the core REST API and verify the expectations that will no longer be fulfilled. And, it looks like RESTBase doesn't care about these expectations either (as per this test above).

Summary of status:

  • Parsoid's API tests call an API which is no longer part of the Parsoid library, but instead is provided by core.
  • Currently a RESTBase request for JSON content calls that core HTML API, which then uses ParsoidOutputAccess/ParserOutputAccess to generate the HTML
  • After the patch, ParserOutputAccess uses the ContentHandler system to render the HTML, which ultimately does *not* use Parsoid.
  • This is fine -- eventually core should use Parsoid to handle the metadata extraction, in the same way it uses the legacy parser, but we don't need to/meant to replace the core JSON renderer.
  • But Parsoid's API tests should: (a) probably move to core, (b) test JSON handling consistent with how the *core* API handles JSON (aka via core's own renderer)

Action items:

  • Disable Parsoid JSON-related API tests in the short term to fix CI
  • Move the API tests to core, and either remove the JSON-related ones or change them to test that the legacy HTML is returned
  • Rewrite the JSON-related API tests to instead call Parsoid's internal APIs to render that contentmodel, to ensure that the contentmodel dispatch mechanism in Parsoid is still covered by tests and doesn't rot for any future use of it (aka, for "wikitext 2.0" or "markdown" or some such).

Change 966558 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] Disable json content http api tests

https://gerrit.wikimedia.org/r/966558

Change 966558 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Disable json content http api tests

https://gerrit.wikimedia.org/r/966558

Change 966608 had a related patch set uploaded (by C. Scott Ananian; author: Arlolra):

[mediawiki/services/parsoid@REL1_41] Disable json content http api tests

https://gerrit.wikimedia.org/r/966608

Change 966608 merged by jenkins-bot:

[mediawiki/services/parsoid@REL1_41] Disable json content http api tests

https://gerrit.wikimedia.org/r/966608

Parsoid's API tests call an API which is no longer part of the Parsoid library, but instead is provided by core.

These rest endpoints are still defined in the Parsoid "extension" though,
https://github.com/wikimedia/mediawiki-services-parsoid/blob/master/extension/restRoutes.json

One concern was whether RESTBase will fail if it receives the new response from core which doen't have all these headers

Note that the two tests were failing for different reasons. The (html, json content) test was causing an internal error to return a 500,

AssertionError: {"message":"Error: exception of type LogicException: Failed to find content language in page bundle data","exception":{"id":"b0a3dab395a949d8a45f3f02","type":"LogicException","file":"/workspace/src/includes/Rest/Handler/Helper/HtmlOutputRendererHelper.php","line":619,"message":"Failed to find content language in page bundle data","code":0,"url":"/rest.php/127.0.0.1/v3/page/html/JSON%20Page/4"

but RESTBase doesn't hit that endpoint. The relevant test was (pagebundle, json content) which returned a 200, as observed when accessing the test page https://test.wikipedia.org/api/rest_v1/page/html/Json_page, but failed an assertion,

AssertionError: expected 'text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/0.0.0"' to satisfy [Function]

So, it looks like the only thing we now need to fix is Parsoid's API tests

There's also the need to verify that other contentmodels for which`ParsoidOutputAccess::supportsContentModel` returns true are still being parsed by Parsoid so that they continue to be linted.

There's also the need to verify that other contentmodels for which`ParsoidOutputAccess::supportsContentModel` returns true are still being parsed by Parsoid so that they continue to be linted.

<slight-tangent>
As discussed today, there are 3 different issues for linting
(a) Page content models other than wikitext are no longer being handled by Parsoid after gerrit 953342 -- so wikitext-adjacent content models (like Proofreadpage) are impacted by the change.
(b) We need to ensure that pregeneration of Parsoid HTML on non-template wikitext source edits via ParsoidCachePrewarmJob doesn't break
(c) HTML pregeneration of Parsoid HTML (for all dependent pages) on *template* edits is unlikely to be turned on (for storage and cpu resource usage reasons).

@MSantos FYI. It might be useful to explore a linting-only solution that addresses all these uniformly. Not sure if the solution is to create a linting-only job that run with low-priority and/or on a different cluster and can be safely dropped under resource load conditions.
</slight-tangent>

But, for (a) that arose because of gerrit 953342, we may need to add the 'useparsoid' code path to core content model handlers and force a Parsoid parse (that triggers linting as side effect).

cscott renamed this task from Parsoid CI broken by https://gerrit.wikimedia.org/r/c/mediawiki/core/+/953342 to Parsoid no longer used for non-wikitext content models from external REST API.Oct 19 2023, 2:12 PM
cscott updated the task description. (Show Details)

Change 967972 had a related patch set uploaded (by Sbailey; author: Sbailey):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a3

https://gerrit.wikimedia.org/r/967972

Change 967972 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a3

https://gerrit.wikimedia.org/r/967972