Page MenuHomePhabricator

"Properly" address missing srcText issues in PageConfigFrame / missing revID in REST API
Open, Needs TriagePublic

Description

Currently Parsoid/PHP crashes if there's not a valid srcText specified in the top-level Frame (which is PageConfigFrame). This is used in both serialize and parse (see T234548 -- technically we use SelserData instead of Frame right now).

We sort of kludge around this by using the empty string '' as a srcText when we create the PageConfig/Env in a number of paths, but this masks cases where this could be an actual bug (ie, where srcText is required but we're not providing it).

One solution might be to add a boolean parameter to the PageConfigFactory (or something like that) which would come all the way from the API and indicate when we expect this to be a new page (in which case we won't have a srcText) vs where we expect there to be a previous revision and we should error out if we can't fetch the text of that.

UPDATE 2020-08-24: T259855: DiscussionTools touched unrelated parts of the page showed a related version of this bug: the REST API itself doesn't clearly distinguish between "new page creation" (where revid is omitted because there is none yet) and "edit of existing page" (where revid should not be omitted, on pain of corruption).

Core has recently introduced RevisionRecord to address this to some degree: even unsaved pages have a RevisionRecord, although they may not yet have a revisionid.

So in addition to clearly distinguishing between new-page-creation and editing-existing-page in the Frame API, we should also address this at the REST API and ParserCache/RestBase level in order to catch potential corruption bugs and avoid issues where "missing revision id" or "failed fetch of revision id" becomes a page-corruption bug by falling back into the "new page creation" codepaths.

Event Timeline

cscott created this task.Oct 3 2019, 4:46 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 3 2019, 4:46 PM

Change 539428 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Ensure we have page content for the Api configs

https://gerrit.wikimedia.org/r/539428

Change 539428 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Ensure we have page content for the Api configs

https://gerrit.wikimedia.org/r/539428

Change 547013 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Return a clean 404 error if a non-existant title is requested via REST

https://gerrit.wikimedia.org/r/547013

Change 547013 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Return a clean 404 error if a non-existant title is requested via REST

https://gerrit.wikimedia.org/r/547013

Change 550863 had a related patch set uploaded (by C. Scott Ananian; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Add handling for new / missing pages in html2wt direction

https://gerrit.wikimedia.org/r/550863

Change 550863 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Add handling for new / missing pages in html2wt direction

https://gerrit.wikimedia.org/r/550863

I'd like to keep the clarity of SelserData, that accessing the page frame while serializing should throw, unless we're doing selser.

I'd like to keep the clarity of SelserData, that accessing the page frame while serializing should throw, unless we're doing selser.

You and Scott disagree on that. I think Scott is advocating for all page source access to go through the page frame and that that is the way to properly handle VE-edited extensions with HTML updates, nested documents, etc. So, I think that should be resolved before additional changes in this code.

I'm not saying we shouldn't use the page frame (I think I've missed the boat on that), I'm saying that when we do, we add some guardrails around it.

LGoto moved this task from Needs Triage to Backlog on the Parsoid board.Feb 15 2020, 12:05 AM

Change 612991 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Unify revision content checks when starting from wikitext

https://gerrit.wikimedia.org/r/612991

Change 612992 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Add a test for serializing when a revision isn't found

https://gerrit.wikimedia.org/r/612992

Change 612991 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Unify revision content checks when starting from wikitext

https://gerrit.wikimedia.org/r/612991

Change 612992 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Add a test for serializing when a revision isn't found

https://gerrit.wikimedia.org/r/612992

Change 614872 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/vendor@master] Bump parsoid to 0.13.0-a1

https://gerrit.wikimedia.org/r/614872

Change 614872 merged by jenkins-bot:
[mediawiki/vendor@master] Bump parsoid to 0.13.0-a1

https://gerrit.wikimedia.org/r/614872

cscott renamed this task from "Properly" address missing srcText issues in PageConfigFrame to "Properly" address missing srcText issues in PageConfigFrame / missing revID in REST API.Aug 24 2020, 4:53 PM
cscott updated the task description. (Show Details)