Page MenuHomePhabricator

Parser Use of RevisionRecord for a page that can't exist: (eg Special:MyLanguage/Main Page)
Open, HighPublicPRODUCTION ERROR

Description

This was an issue that arose during refactoring of the Title class. The error below is for historic interest. Since we had a good discussion below about how and why the Parser uses RevisionRecord and Title for "non-existable pages" (pages that not only aren't currently in the database, but *can't* be saved in the database), I'm repurposing this bug to reflect the refactoring tasks discussed below.

Error

MediaWiki version: 1.36.0-wmf.36

message
Constructing RevisionRecord for a page that can't exist: Special:MyLanguage/Main Page [Called from MediaWiki\Revision\MutableRevisionRecord::__construct]

Impact

Notes

A few of them immediately after pushing 1.36.0-wmf.36 to group1

They all came from metawiki / wtp* servers and the URL /w/rest.php/meta.wikimedia.org/v3/transform/wikitext/to/lint

Details

Request URL
https://meta.wikimedia.org/w/rest.php/meta.wikimedia.org/v3/transform/wikitext/to/lint
Stack Trace
exception.trace
from /srv/mediawiki/php-1.36.0-wmf.36/includes/Revision/MutableRevisionRecord.php(116)
#0 [internal function]: MWExceptionHandler::handleError(integer, string, string, string, array)
#1 /srv/mediawiki/php-1.36.0-wmf.36/includes/debug/MWDebug.php(376): trigger_error(string, integer)
#2 /srv/mediawiki/php-1.36.0-wmf.36/includes/debug/MWDebug.php(352): MWDebug::sendRawDeprecated(string, boolean, string)
#3 /srv/mediawiki/php-1.36.0-wmf.36/includes/GlobalFunctions.php(1068): MWDebug::deprecatedMsg(string, string, string, integer)
#4 /srv/mediawiki/php-1.36.0-wmf.36/includes/Revision/RevisionRecord.php(123): wfDeprecatedMsg(string, string)
#5 /srv/mediawiki/php-1.36.0-wmf.36/includes/Revision/MutableRevisionRecord.php(116): MediaWiki\Revision\RevisionRecord->__construct(Title, MediaWiki\Revision\MutableRevisionSlots, boolean)
#6 /srv/mediawiki/php-1.36.0-wmf.36/vendor/wikimedia/parsoid/extension/src/Config/PageConfigFactory.php(153): MediaWiki\Revision\MutableRevisionRecord->__construct(Title)
#7 /srv/mediawiki/php-1.36.0-wmf.36/vendor/wikimedia/parsoid/extension/src/Rest/Handler/ParsoidHandler.php(364): MWParsoid\Config\PageConfigFactory->create(Title, User, NULL, string, NULL, array)
#8 /srv/mediawiki/php-1.36.0-wmf.36/vendor/wikimedia/parsoid/extension/src/Rest/Handler/ParsoidHandler.php(416): MWParsoid\Rest\Handler\ParsoidHandler->createPageConfig(Title, NULL, string, NULL)
#9 /srv/mediawiki/php-1.36.0-wmf.36/vendor/wikimedia/parsoid/extension/src/Rest/Handler/TransformHandler.php(118): MWParsoid\Rest\Handler\ParsoidHandler->tryToCreatePageConfig(array, string)
#10 /srv/mediawiki/php-1.36.0-wmf.36/includes/Rest/Router.php(395): MWParsoid\Rest\Handler\TransformHandler->execute()
#11 /srv/mediawiki/php-1.36.0-wmf.36/includes/Rest/Router.php(322): MediaWiki\Rest\Router->executeHandler(MWParsoid\Rest\Handler\TransformHandler)
#12 /srv/mediawiki/php-1.36.0-wmf.36/includes/Rest/EntryPoint.php(165): MediaWiki\Rest\Router->execute(MediaWiki\Rest\RequestFromGlobals)
#13 /srv/mediawiki/php-1.36.0-wmf.36/includes/Rest/EntryPoint.php(130): MediaWiki\Rest\EntryPoint->execute()
#14 /srv/mediawiki/php-1.36.0-wmf.36/rest.php(31): MediaWiki\Rest\EntryPoint::main()
#15 /srv/mediawiki/w/rest.php(3): require(string)
#16 {main}

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Similar to T205908: Unable to change visibility of log entries when MediaWiki:Mainpage uses Special:MyLanguage (CVE-2020-35477) where Special:MyLanguage is treated like a special page that can't exist, separate from the quick fix for this UBN, we should probably add logic somewhere to separate Special:MyLanguage and other redirecting special pages from other normal special pages, eg Special:MyPage, etc.

hashar triaged this task as Unbreak Now! priority.Mar 24 2021, 7:35 PM

This doesn't immediately look like a new bug, just a request to the REST API that happened to come in after train deploy. The Parsoid-side code hasn't changed in months. It's possible the RevisionRecord code changed in wmf.36? I'll look into it.

(And who's linting a special page?)

RequestID for one of the 8 is 7d9dd8df-569a-4e24-9fcb-625fb3efa68c
No referrer.
The assumption is that this comes via changeprop and is not user visible. Probably not actually UBN priority.

It's possible the RevisionRecord code changed in wmf.36? I'll look into it.

Yes, it's not only possible, it is the case. This is https://gerrit.wikimedia.org/r/c/mediawiki/core/+/672819 again.

To unblock the train we can simply revert it. Again.

In this case Parsoid PageConfigFactory is trying really hard to find a revision for the provided title, gives up in the end, and just does new MutableRevisionRecord( $title ). Which is now prohibited with pages that can not possibly exist.

Parsoid could refuse to process a request where title->canExist is false.

Also, just a note, this is a deprecation warning, not an exception, it's not affecting the users at all, so I'm not sure a UBN train blocker status is fair in this case.

I don't know anything about the code base or what kind of impact it might have. I am happy to lower the priority and drop it from the list of blockers.

Pchelolo lowered the priority of this task from Unbreak Now! to Medium.Mar 24 2021, 8:18 PM

@hashar it's a deprecation warning, so it only shows in the logs, no user impact whatsoever. I propose to drop it from the list of blockers, we'll have a look at it this week and try to get into the next week train.

The strange thing is that $wikitextOverride is non-null. That is, someone seems to be providing wikitext which would be a "future" revision of Special:MyLanguage/Main Page and asking us to lint it. I don't think that's changeprop, but now I'm wondering who it would be? It doesn't help that they are not setting a referrer.

@hashar it's a deprecation warning, so it only shows in the logs, no user impact whatsoever. I propose to drop it from the list of blockers, we'll have a look at it this week and try to get into the next week train.

If you want to make double sure that we don't dawdle, make it a blocker for the next train.

@cscott Will you work on this, or shall we? I suppose it should be caught in the parameter validation of TransformHandler.

Yeah, I'm looking for the right place to put that $title->canExist check. The odd thing is that wikitext is provided in the request, which doesn't exactly match how I expect the linter to work. I think the right place (ie, safest) is actually down in PageConfigFactory, when we're trying to look up a revision ID by title.

Change 674699 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):
[mediawiki/services/parsoid@master] Don't try to create a MutableRevisionRecord for a Special page

https://gerrit.wikimedia.org/r/674699

^ that's my proposal. It still allows *some* crazy stuff, ie you could potentially convert html2wikitext while using a special page as the page title, but catches it "at the last moment" before it creates a Revision or MutableRevision for a page that cannot be edited. This seems to be fairly conservative insofar as avoiding breaking anything else inadvertently.

Dropped this from 1.36.0-wmf.36 list of blockers. Thank for the quick analysis, and surely I should have noticed it was just a deprecation warning.

Dropped this from 1.36.0-wmf.36 list of blockers. Thank for the quick analysis, and surely I should have noticed it was just a deprecation warning.

You removed .37, .36 was already dropped.

(And who's linting a special page?)

Something calls this codepath with Title::newMainPage(), assuming that the title itself doesn't matter for that particular usecase. Since metawiki's main page is set to Special:MyLanguage/Main_Page, it's,....effectively linting a special page.

This practice of passing main page as a "dummy title" when the code author thought "it can't matter" shoot us into foot several times (last week I helped to fix T278429, a train blocker, caused exactly by this practice).

Also, just a note, this is a deprecation warning, not an exception, it's not affecting the users at all, so I'm not sure a UBN train blocker status is fair in this case.

I'm sure releng folks are much better positioned to say this, but I'll offer my 2c on this as well: all log spam is bad. People triaging logs often have no idea about what the given codepath does, and it's really really helpful to not have any messages that "are logged as warnings [or above], don't cause any user impact, and are ignored". The more ignorable messages, the harder it is to spot real issues.

Also, per SIP, "Hard deprecated code MAY act as no-ops instead of actually functioning", so without carefully examining the code, it's impossible to say that it is a no-impact case.

Urbanecm raised the priority of this task from Medium to Unbreak Now!.Mar 30 2021, 1:33 PM

Promoting back to UBN, this blocks current train (wmf.37) from rolling to group0. By (unwritten?) policy, all train blockers are UBN.

This is log spam, but again, this does not lead to any error in production, it's just a deprecation warning. I would suggest the deprecation warning be reverted. As you note, the practice of passing a "dummy title" into the Parser is very very old in the codebase and won't be removed quickly. It occurs in countless extensions as well. And there's nothing in the Parser which breaks when this happens: the Parser just uses the title to resolve some odd parser functions and as a base for subpages; often the content is used for UX or other purposes. There's no reason the title "has" to be real.

In particular, Title::newMainPage() is pretty much the expected default here. And again, there's nothing wrong with linting wikitext in the context of a page title that couldn't be saved into the database -- in particular wikitext is often parsed on special pages as part of rendering the UX. The issue is with the MutableRevisionRecord and PageIdentity abstractions, which appear to be a poor fit for the Parser (or at least with uses of the parser on special pages, for messages, and as part of non-POST aka-no-database-mutation APIs). The Parser's title should be a LinkTarget, not a PageIdentity. If we need to use another abstraction to tie together a LinkTarget, an option revision ID, and a not-saved-into-the-database wikitext, then we should remove MutableRevisionRecord from the parser entirely and use that new abstraction instead.

Note that I had this issue with 0676256cccb40341b6546c7f7b3aa7e0ba067bac (https://gerrit.wikimedia.org/r/c/mediawiki/core/+/617294) as well, introducing a new Parser hook. PageIdentity/ProperPageIdentity wasn't appropriate for the context title there either, for similar reasons: invoking the parser with a context title representing a SpecialPage is actually pretty common.

I think we probably need to tackle the structure issue with RevisionRecord/MutableRevisionRecord and try to represent "parser output which is not going to be saved into the database" with some other class. Perhaps we need an interface which both RevisionRecord and "this new thing" can implement, and which exposes just the parts of the revision record that Parser needs.

Ok, created https://gerrit.wikimedia.org/r/c/mediawiki/core/+/675737 to revert the deprecation.

Re deprecation warnings being a UBN: yes, I agree all log spam is bad. There's however varying decrees of bad, and log spam is definitely not on the top of the list. UBN as a temporary priority to attract attention makes sense, but once developers are aware of the bug and are working on it, there's no reason really to revert, fix uncovered problem and reapply again. For years we've had hard-deprecations logged into a separate logstash channel and nobody was worrying about them, maybe we should return to that practice.

As for Parser LinkTarget/PageIdentity: MutableRevisionRecord is doing the right thing here - revision can only exist on a page that can exist - there can never be a RevisionRecord for a SpecialPage. Also, it's very easy and very possible to add a revision record to the database, which would mean data corruption.

I however can agree with @cscott - Parser should probably not really care about the details of the context it's parsing that much - you can just supply it wikitext with no context title at all and it should probably just parse it no problem. However, parsoid endpoints where we need to look up page content in the DB and then parse it - those still need to use PageIdentity.

I see a few solutions here:

  1. Make up a hierarchy of fakes - FakeRevisionRecord, FakePageIdentity etc and construct those instead of MutableRevisionRecord - the latter is designed to be added to the database, not to look up guaranteed non-existent content.
  2. Make Parser take either proper page, or null - we supply 'Main_Page' anyway in case page can't exist or doesn't matter - what's the benefit of supplying Main_Page over null?
  3. Make Parser take a superclass of LinkTarget/PageIdentity (doesn't exist yet). That will make it a bit easier to transition, but implementation in the parser itself will be come complex, since we're removing some guarantees.

I personally prefer option 2 the most, but Parsing team are the ultimate judge here.

I'll note that option 2 "used to" be the default -- the Parser was constructed with a null title, and various paths through would either initialize it or forget to. This caused a long stream of low-level issues where things would crash on the unexpected null. It was changed rather recently (I want to say 1.34) so that getTitle would always return a non-null title. I need to do some code archeology, but a number of extensions were explicitly initializing title to things like Bogus_Title and I think the *recommendation* at the time was to use the Main Page title, which is supposed to always be guaranteed to exist. I was not aware until T278376#6956605 that metawiki sets the main page to a title which not only doesn't exist, but can't exist.

I'm afraid we're on the verge of revisiting that decades-long running issue, by introducing a new "RevisionRecord" which isn't. Perhaps the way forward is to tighten this down so that the Parser's title is guaranteed to exist -- but then we need something like Title::newRealMainPage() which extensions can use instead of Bogus_Title or Title::newMainPage() which will return an actually-guaranteed-to-exist page. My personal feeling is to *weaken* the guarantee inside Parser, so all the Parser sees is a <LinkTarget $title, string $wikitext> tuple with no other assumed properties, and that we carefully document and mark the places where that weak tuple has to be cast into a real RevisionRecord for database reasons / to handle the fetch of "prior" wikitext in the selective serialization path / to evaluate a number of sketchy low-level parser functions like {{REVISIONID}}. Note that the first two would be justified throwing exceptions to prevent database corruption, while the latter case is usually handled by expanding {{REVISIONID}} to 0 or -1 or some other nonfatal mechanism. It's probably reasonable to force callers to cast the "tuple thing" to a RevisionRecord in these cases and explicitly handle the failure.

So my option (4) is introduce ParserRecord which is just <LinkTarget $title, string $wikitextToParse> (no revision id!) with a method ParserRecord::castToRevisionRecord(), have RevisionRecord implement ParserRecord, and then slowly replace all uses of RevisionRecord with ParserRecord. Then the few places which call castToRevisionRecord can be studied closely to figure out how/whether to handle the special cases where the parser title can't be saved into the DB. Things like FlaggedRevisions and {{REVISIONID}} will have to use ParserRecord::castToRevisionRecord(), but I suspect there won't be very many calls to that in the codebase. I probably need to do some experimental hacking to see if my assumptions here are correct.

My reasoning here is that 99% of the folks who use the Parser barely need the context title, much less a specific revision id for that title, and so we're better off trying to wean folks away from assuming they have access to this info. A RevisionRecord is a powerful class with a lot of information, and I'd rather not give API users direct access unnecessarily -- or at least force them to clearly mark when they do need that info by calling a castToRevisionRecord method that can be easily code-searched.

Change 675875 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[mediawiki/core@wmf/1.36.0-wmf.37] Revert "Re-apply "Deprecate constructing revision with non-proper page""

https://gerrit.wikimedia.org/r/675875

With the revert above, what is the status of this task? This is still blocking the train which hasn't even reached group 0 yet.

It should not block the train anymore.

Change 675875 merged by jenkins-bot:

[mediawiki/core@wmf/1.36.0-wmf.37] Revert "Re-apply "Deprecate constructing revision with non-proper page""

https://gerrit.wikimedia.org/r/675875

Mentioned in SAL (#wikimedia-operations) [2021-03-31T19:03:52Z] <twentyafterfour@deploy1002> Synchronized php-1.36.0-wmf.37/includes/Revision/RevisionRecord.php: sync https://gerrit.wikimedia.org/r/c/mediawiki/core/+/675875 to unblock train refs T278376 T278343 (duration: 00m 58s)

The Parser's title should be a LinkTarget, not a PageIdentity. If we need to use another abstraction to tie together a LinkTarget, an option revision ID, and a not-saved-into-the-database wikitext, then we should remove MutableRevisionRecord from the parser entirely and use that new abstraction instead.

LinkTarget isn't a good fit either - a LinkTarget can be an interwiki link, or a relative section link. Neither can be the context for parsing.

I'm thinking of introducing a PageReference abstraction: this can be a special page, but not an interwiki link or relative section link. It could act as the context for parsing.

But apart from that, I believe that the context page and revision needs to always be optional for the parser. It can be useful to have this context to do some things, but parsing should work without it. If page and revision are optional, the parser can detect when it's called with a non-proper Title and ignore it.

In any case, we need to find a way to guarantee that a RevisionRecord actually represents a revision. Dummy titles are fine, dummy pages... not so much. But fake revisions should really not be a thing.

Pchelolo lowered the priority of this task from Unbreak Now! to Medium.Apr 1 2021, 2:07 PM

Rollbacks were deployed, it's not a UBN anymore

I'm thinking of introducing a PageReference abstraction: this can be a special page, but not an interwiki link or relative section link. It could act as the context for parsing.

I wouldn't object to that, although I'll note that the number of "title-like" things is quickly increasing. Maybe that's a natural consequence of Title being abused as a catch-all for so long.

But apart from that, I believe that the context page and revision needs to always be optional for the parser. It can be useful to have this context to do some things, but parsing should work without it. If page and revision are optional, the parser can detect when it's called with a non-proper Title and ignore it.

The problem is that *quite a lot* of legacy code takes a parser object and fetches a Title from it and does non-parsing things. It's not the parser which requires the title, it's *all the stuff which holds a Parser object* -- including a large number of legacy hooks and thus third party extensions in the wild. I'm all for deprecating and removing those hooks, but I've been working on that for ~2 years now and there's still a long way to go.

The Parser does need the context Title to resolve links in the presence of subpages.

In any case, we need to find a way to guarantee that a RevisionRecord actually represents a revision. Dummy titles are fine, dummy pages... not so much. But fake revisions should really not be a thing.

The problem is, that they *are*. We create the new revision record *before* it is stored in the database, in order to support {{REVISIONID}} on a newly-saved page, etc. And VE stashing/etc also deal with revisions-that-will-be-but-aren't-quite-yet. That's actually a very useful abstraction I'd be very reluctant to give up.

The issue here is with "future revisions" which happen to have a title such that they couldn't actually be stored in the database if you tried. That's not actually a problem -- the time to throw the exception is probably when you try to store it in the database, not when you're just holding a temporary not-in-the-database revision. But the recent patch (reverted twice) tried to be proactively strict about this. Since there are legitimate use cases to use "not a real title" when parsing, we "just" need to have a standard way to create a temporary-stashed-revision-without-a-real-title. Until recently, the recommended way to do this was to use Title::newMainPage() as your title.

The problem is, that they *are*. We create the new revision record *before* it is stored in the database, in order to support {{REVISIONID}} on a newly-saved page, etc. And VE stashing/etc also deal with revisions-that-will-be-but-aren't-quite-yet. That's actually a very useful abstraction I'd be very reluctant to give up.

We already have an abstraction for non-existing revision, it's MutableRevisionRecord, and it's not going anywhere. We do not want however to have an abstraction for non-existable revision.

IMHO Parser doesn't really need a RevisionRecord either. What it needs is a bag of data about the revision under which the content it's parsing is stored, that's what Parsoid demonstrates with PageConfig class. In the perfect world, it would need a triplet ( ContextTitle|null, ContextRevision|null, Content)

Old Parser though is very tied up with RevisionRecord. If we do create a ParserContextRevisionRecord, extending RevisionRecord with looser guarantees and inability to be stored, we could change Parser to use that instead of proper RevisionRecord. Even better if we switched old Parser to 'ParserRevisionContext', completely unrelated to RevisionRecord, but that's much harder to achieve with b/c. Given that we hope to remove old parser, I am not sure it worths the investment. Unless a change like that will help Parsoid transition.

With title - same thing - it doesn't need either Title or LinkTarget - it needs a 'ParserPageContext' - a collection of page-related methods, just like PageConfig in Parsoid.

I'll note that the number of "title-like" things is quickly increasing. Maybe that's a natural consequence of Title being abused as a catch-all for so long.

I have been reluctant to introduce PageReference for that reason, yes.

The problem is that *quite a lot* of legacy code takes a parser object and fetches a Title from it and does non-parsing things.

How many of these non-parsing things will work properly on a Title that points to a special page?

It's not the parser which requires the title, it's *all the stuff which holds a Parser object* -- including a large number of legacy hooks and thus third party extensions in the wild. I'm all for deprecating and removing those hooks, but I've been working on that for ~2 years now and there's still a long way to go.

We can loop the Title object through to the hooks as before. With the same loose guarantees. My main concern is with a "pretend" RevisionRecord. As Petr pointed out, a RevisionRecord that hasn't been saved yet is fine (e.g. in a preview for a page creation), but a RevisionRecord of a special page (or interwiki link) is not.

The Parser does need the context Title to resolve links in the presence of subpages.

The parser should know the page to which the content belongs, so it can be used as context, yes. I'd think it should be possible to produce sensible output when such context is missing. In some situations, that context doesn't make sense anyway - e.g. when parsing system messages.

The problem is, that they *are*. We create the new revision record *before* it is stored in the database, in order to support {{REVISIONID}} on a newly-saved page, etc. And VE stashing/etc also deal with revisions-that-will-be-but-aren't-quite-yet. That's actually a very useful abstraction I'd be very reluctant to give up.

Unsaved revisions are fine. Unsavable revisions are not.

The issue here is with "future revisions" which happen to have a title such that they couldn't actually be stored in the database if you tried. That's not actually a problem -- the time to throw the exception is probably when you try to store it in the database, not when you're just holding a temporary not-in-the-database revision.

The idea is that RevisionRecord::getPage() should return a ProperPageIdentity, so we have a guarantee that it's a page that can actually have revisions.
If we follow your proposal, getPage() itself would have to throw if the RevisionRecord was constructed with a "bad" page. At that point, the offending code would not be on the stack, and the caller has bo way to fix the situation. All code calling RevisionRecord::getPage() will need to handle this error.

That's why I want to handle it in the constructor. Do you have a better solution?

But the recent patch (reverted twice) tried to be proactively strict about this. Since there are legitimate use cases to use "not a real title" when parsing, we "just" need to have a standard way to create a temporary-stashed-revision-without-a-real-title. Until recently, the recommended way to do this was to use Title::newMainPage() as your title.

Title::newMainPage() is still fine - except when the main page isn't a real page but a special page. I didn't even think about the fact that this is possible, before this issue came up.

I don't see a legitimate use case for a RevisionRecord of not-a-real-page. A not-yet-existing-page, sure. But not a special page.

I can also see a legitimate use case for parsing without a revision or page in the context. Maybe even with a special page as the context. But not with a revision of a special page as context, that makes no sense.

We're off on a couple of different points here.

  • See T235392 and all the related patches for some context on 'null' as a context title; in particular to all the null checks in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/533002/8/includes/parser/Parser.php that were removed when we could guarantee that a valid Title was always returned. That's all complexity you'd have to reintroduce if you want Title to be null; I don't think that's a good idea.
  • Note that Parser::setTitle() explicitly sets the title to Special:Badtitle/Parser in some paths. So yes, the Title can be a special page. But I think we agree on that?
  • The fundamental issue seems to be the Parser::fetchCurrentRevisionRecordOfTitle, Parser::statelessFetchTemplate, and similar methods, and the associated hooks. These have very recently (since 1.34 or so) been converted to using RevisionRecord more-or-less consistently.

The valid use case here is that template fetch isn't required to come from the database at all, so there's no reason why the "revision record" returns by these fetches is "database-able". The hooks explicitly allow an extension to redirect the fetch to a different title and/or revision. So you could write an extension that treats a template fetch for '[[Foo]]' as a fetch of [[Special:MyPage/Foo]] and then proceeds to rewrite that to Foo/en. Again, the abstraction here is "a title" or "a revision (which will also have a title)", but only the latter need exist in a database. We've been using MutableRevisionRecord for the former, but perhaps we need to use something else.

It seems like we've got a number of possible solutions here:

a) Introduce a new Title::newBogusPage() method to use instead of Special:Badtitle/Parser, Title::newMainPage, etc, which is guaranteed to return a "database-able" page. Probably it should also be guaranteed to return a page *not actually* in the database, and it would be nice if there were a method to interrogate a Title and determine if it was a bogus page -- although probably checking to see whether it exists is enough for 99.9% of the cases where that might be relevant. Missing titles in the action and rest API should be defined to Title::newBogusPage() instead of [[Main Page]] which has historically been the default.

b) We can insist that Title::newMainPage always return a PageIdentity and figure out some different way to support what metawiki wants to do.

c) We can replace the revisit the various Parser::fetch* and Parser::stateless* APIs and go back to returning <Title $title, int? $revisionID> tuples instead of using RevisionRecord for this, or introduce ParserContextRevisionRecord for this purpose as @Pchelolo suggested.

d) Orthogonally, tighten up Parser::getTitle to LinkTarget or tighter (PageReference, ParserPageContext, etc).

(a) or (b) would solve the immediate issue. They would still restrict the ability of extensions like T47096 to redirect through synthetic titles and substitute "real" revisions for them. That would be fixed by (c). (d) would be interesting as part of the deprecation of the full Title class where it isn't necessary, but probably doesn't actually solve any immediate issues.

While making the use of Title in Parser optional and/or stricter would be nice, it's not needed right now. So let's not get hung up on that.

The issue is really the use of RevisionRecord. If we have a legitimate need for "virtualized" template resolution (beyond special page transclusion, which is handled on a separate code path), I think the correct solution would be (c): revisit the various Parser::fetch* and Parser::stateless* methods. I believe this is the right thing to do, because it provides an oportuity to clean up some conceptual issues, in particular with dependency tracking in the templatelinks table.

It seems to me like this isn't as messy as it might seem:

  • Parser::fetchTemplateAndTitle() does not return a RevisionRecord.
  • Parser::fetchFileAndTitle() does not return a RevisionRecord.
  • Parser::fetchFileNoRegister does not return a RevisionRecord.
  • Parser::statelessFetchRevision() is hard-deprecated since 1.35 and can be deleted.
  • Parser::fetchCurrentRevisionOfTitle() is hard-deprecated since 1.35 and can be deleted.
  • Parser::fetchTemplate() does not return a RevisionRecord. Also, it is hard-deprecated since 1.35 and should be deleted. Looks like DPL still uses it, though.
  • Parser::statelessFetchTemplate() does not return a RevisionRecord. It does however use the Title for tracking tracking, causing non-page Titles to end up in the templatelinks table. Not sure this is intentional, or used for anything. I'm seeing 6 rows with tl_namesapace = -1 on enwiki.
  • Parser::statelessFetchRevisionRecord() appears to be unused outside core. It can be deprecated, and return null in the critical case.
  • Parser::fetchCurrentRevisionRecordOfTitle() would need to be deprecated, and return null for the critical case. A suitable alternative would have to be created.
  • ParserOptions::getCurrentRevisionCallback() is hard-deprecated since 1.35 and can be deleted.
  • ParserOptions::setCurrentRevisionCallback() is hard-deprecated since 1.35 and can be deleted.
  • ParserOptions::getCurrentRevisionRecordCallback() needs to be replaced. It's tagged @internal. It's used by TemplateSandbox though.
  • ParserOptions::setCurrentRevisionRecordCallback() needs to be replaced. It's marked `@internal. It's used by FlaggedRevs, TemplateSandbox, and TemplateStyles. Seems fixable.

We'd have to introduce methods for getting the current content of a page for transclusion, probably returning a touple of [ Content $content, int $revId, int $pageId, PageReference $page ]. $revId and $pageId can be 0 or null or omitted for non-existing or non-exitable revisions. $page is for backling tracking. If we allow it to be null for non-page content, we can make it a ProperPageIdentity. We'd just not write anything into templatelinks in that case.

We'll need the following new methods:

  • Parser::fetchCurrentContentOfTitle, to replace fetchCurrentRevisionRecordOfTitle.
  • Parser::statelessFetchContent, to replace statelessFetchRevisionRecord.
  • ParserOptions::getCurrentContentCallback, to replace getCurrentRevisionRecordCallback.
  • ParserOptions::setCurrentContentCallback, to replace setCurrentRevisionRecordCallback.

I have *NOT* followed all this discussion closely, but can someone answer as to what extent do these issues translate over to Parsoid? And, to what extent is the solution to simply work towards eliminating the use of the old parser rather than keep tweaking it?

Edit: I had inadvertently left out the *NOT* there, a critical qualifier. :)

I have *NOT* followed all this discussion closely, but can someone answer as to what extent do these issues translate over to Parsoid?

Parsoid internally doesn't depend on anything from core, so it already somewhat has the correct approach (PageConfig is something like ParserTitleContext/ParserRevisionContext) I've proposed. The glue code however might be using incorrect abstractions (as evident by this ticket being in parsoid), but it should be relatively easy to solve.

And, to what extent is the solution to simply work towards eliminating the use of the old parser rather than keep tweaking it?

This is blocking some of the immediate work we're doing right now. I don't think we can wait for parsoid in core. We should however design the new interfaces with parsoid in mind

I have *NOT* followed all this discussion closely, but can someone answer as to what extent do these issues translate over to Parsoid? And, to what extent is the solution to simply work towards eliminating the use of the old parser rather than keep tweaking it?

It does translate in the sense that there are conceptual edge cases that we should answer for Parsoid:

  • The first question is whether wikitext can be parsed without having a page as context - and if not, whether a special page is ok as context or not.
  • The second question is whether parsing should require a RevisionRecord as context (potentially an unsaved one). This is where it gets tricky, since RevisionRecords cannot really exist for a SpecialPage. But the old parser insists on having a RevisionRecord.
  • The third question is how the answer to the above question translates to things transcluded on a page - does transcluded content have to come from a revision? How does special page transclusion factor into this? How can extensions inject fake content for transclusion? How does that impact what we track in the templatelinks table?

I think a page (LinkTarget?) for context is needed for resolving relative links, but it should not be a hard requirement. Parsing should be possible without it. I think the use of RevisionRecord should be avoided in the parser. Ideally, the wikitext parser is isolated from the storage infrastructure. If it has a need to load things, it should define an interface that returns just what the parser needs, in terms of what the parser understands. Calling code can provide suitable implementations.

Presumably @cscott has weighed in or will weigh in but my thoughts below. My answers have a Parsoid / ideal-situation lens and so may not translate over as well for the core parser.

It does translate in the sense that there are conceptual edge cases that we should answer for Parsoid:

  • The first question is whether wikitext can be parsed without having a page as context - and if not, whether a special page is ok as context or not.

Yes. I/we like to think of wikitext -> HTML as a content transformation step and there is no reason to depend on existence or otherwise of link targets, page context, how templates resolve, what content they return, etc.

  • The second question is whether parsing should require a RevisionRecord as context (potentially an unsaved one).

It shouldn't.

This is where it gets tricky, since RevisionRecords cannot really exist for a SpecialPage. But the old parser insists on having a RevisionRecord.

<speculation>I shouldn't speculate but instead go do code archeaology or read code .. but just in case it is useful: one reason I can ostensibly think of why that requirement might have arisen is because of PSTs. But, strictly speaking, PSTs are a wikitext -> wikitext transform for real page saves (but which could be decoupled from saves if deemed necessary) and so they are just confusing matters if that is the real reason why RevisionRecord entered the parse picture.</speculation>

  • The third question is how the answer to the above question translates to things transcluded on a page - does transcluded content have to come from a revision? How does special page transclusion factor into this? How can extensions inject fake content for transclusion?

No, transcluded content doesn't need to come from a revision. The same idea as above applies here. When a particular step in the transformation needs content, we query an API (library / internal call) to get that content and that should be it. This abstraction was clear in the Parsoid/JS case, but even in the Parsoid/PHP case, that conceptual abstraction should ideally be preserved.

How does that impact what we track in the templatelinks table?

Change tracking shouldn't be impacted by this. Change tracking should be hooked to change events. And, while page saves are one source of change events, special pages might have other mechanisms for generating change events. So, whether something is a special page or not should not be relevant to change tracking.

I think a page (LinkTarget?) for context is needed for resolving relative links, but it should not be a hard requirement. Parsing should be possible without it. I think the use of RevisionRecord should be avoided in the parser. Ideally, the wikitext parser is isolated from the storage infrastructure. If it has a need to load things, it should define an interface that returns just what the parser needs, in terms of what the parser understands. Calling code can provide suitable implementations.

I agree. A canonical structure for wiki links which can be generated without querying db state. The specific form of a wikilink could then be tweaked in post-processing steps (ex redlinks, other title resolution & link normalization steps). In Parsoid, redlink tagging is modelled as a post-processing step and while they currently integrated into the default flow, they can be removed out of the step that generates canonical content for storage / caching, if appropriate.

But, to prevent proliferation of post-processing steps (something that we are pondering separately), there might be an efficiency argument to be made for introduce a conceptual LinkTarget structure. Right now, Title processing for link targets (titleToString, resolveTitle, etc.) is baked into Parsoid code as well, but that can be refactored to reference something else if necessary.

@daniel Lots of stuff here. Just a few things for now:

  • https://en.wikipedia.org/wiki/Help:Special_page#Linking_and_transcluding explicitly discusses transcluding special pages. That should explain its inclusion in templatelinks.
  • Be sure you're considering the use case in T47096, which might not be fully implemented yet so might not show up in codesearch.
  • As subbu said, from the parser's perspective the contents of {{....}} is just an arbitrary string. There are very few token-level restrictions on the content, other than | (and to some degree # and /). As an architectural issue, it would be best if we didn't add unnecessary restrictions on this at the parser or hook level, to allow interesting future uses. (Because Title is tightly-coupled to core and Parsoid is a library, Parsoid generally uses strings to represent titles.)
  • I'm all in favor of cleaning up the API in Parser, especially if your team is willing to help with that work. This resolution code doesn't really belong in the Parser (or in ParserOptions) at all. As above, the parser should have a clean API where it provides the "template name" (again, as a LinkTarget or a string) and there's a "TemplateResolver" service/class/whatever whose job is to resolve/fetch this, including invoking the hooks used by Translate, FlaggedRevs, etc. We can work out the details, but I would very much like this *not* to be methods of Parser or ParserOptions.

Change 677070 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[mediawiki/core@master] Clean up hard-deprecate Parser methods returning Revision

https://gerrit.wikimedia.org/r/677070

@daniel Lots of stuff here. Just a few things for now:

It shouldn't: Special page transclusion is not handled by fetchTemplateAndTitle(), it has special handling in braceSubstitution(). So normally, it's not recorded in templatelinks. Also, I'm special pages in there that return false from isIncludable(), e.g. Special:Random. I suspect that this is the result of something fishy happening... the link to Special:Random seems to be the result of someone having fun with template redirect resolution, see User:SPUI/random2.

  • Be sure you're considering the use case in T47096, which might not be fully implemented yet so might not show up in codesearch.

I think this would be easier with what I propose - if we didn't return fake revisions, but just strings or Content objects, it would become much more straight forward to return just part of a page, or the content of a subpage, or a transformed version of the content, or an HTML rendering of the content, etc. It would give us a lot more flexibility to change what "transclusion" means.

  • As subbu said, from the parser's perspective the contents of {{....}} is just an arbitrary string. There are very few token-level restrictions on the content, other than | (and to some degree # and /). As an architectural issue, it would be best if we didn't add unnecessary restrictions on this at the parser or hook level, to allow interesting future uses. (Because Title is tightly-coupled to core and Parsoid is a library, Parsoid generally uses strings to represent titles.)

Yes, one more reason not to represent it as a RevisionRecord. I'd prefer a little bit of extra context to come with the string, e.g. the content model. Content objects are a bit more bloated than I would like, but I think they are the correct abstraction, at least in theory.

  • I'm all in favor of cleaning up the API in Parser, especially if your team is willing to help with that work. This resolution code doesn't really belong in the Parser (or in ParserOptions) at all. As above, the parser should have a clean API where it provides the "template name" (again, as a LinkTarget or a string) and there's a "TemplateResolver" service/class/whatever whose job is to resolve/fetch this, including invoking the hooks used by Translate, FlaggedRevs, etc. We can work out the details, but I would very much like this *not* to be methods of Parser or ParserOptions.

Perhaps T274177: PageContentAccess service as a replacement for WikiPage::getContent would fit the bill?

As to PET doing the work - we'll have to look into what needs doing, how it fits with our priorities, and how much we want to invest considering that we want to replace the old parser anyway. But I'll bring it up for discussion.

Change 677070 merged by jenkins-bot:

[mediawiki/core@master] Clean up hard-deprecated Parser methods returning Revision

https://gerrit.wikimedia.org/r/677070

Krinkle added a subscriber: Krinkle.

Untagging from Prod errors. Phrase Constructing RevisionRecord for a page no longer found in Codesearch, and exception.trace:"PageConfigFactory.php" no longer matches any deprecation errors (Logstash: mediawiki-deprecated dashboard, 90 days).

This came up incidentally in a conversation with @tstarling w/r/t use of Revision* in the APIs for rendering UX messages with Parsoid. UX messages can use Title, for example:

	"noarticletext": "There is currently no text in this page.\nYou can [[Special:Search/{{PAGENAME}}|search for this page title]] in other pages,\n<span class=\"plainlinks\">[{{fullurl:{{#Special:Log}}|page={{FULLPAGENAMEE}}}} search the related logs],\nor [{{fullurl:{{FULLPAGENAME}}|action=edit}} create this page]</span>.",

but obviously when rendering UX messages for (eg) special pages there are presumably cases where the Title is not a "existable" page. I think @daniel's suggestion is probably the best to-do list here to eliminate the use of Revision* methods in the Parsoid interface methods:

The issue is really the use of RevisionRecord. If we have a legitimate need for "virtualized" template resolution (beyond special page transclusion, which is handled on a separate code path), I think the correct solution would be (c): revisit the various Parser::fetch* and Parser::stateless* methods. I believe this is the right thing to do, because it provides an oportuity to clean up some conceptual issues, in particular with dependency tracking in the templatelinks table.

[...]

  • Parser::statelessFetchRevisionRecord() appears to be unused outside core. It can be deprecated, and return null in the critical case.
  • Parser::fetchCurrentRevisionRecordOfTitle() would need to be deprecated, and return null for the critical case. A suitable alternative would have to be created.
  • ParserOptions::getCurrentRevisionRecordCallback() needs to be replaced. It's tagged @internal. It's used by TemplateSandbox though.
  • ParserOptions::setCurrentRevisionRecordCallback() needs to be replaced. It's marked `@internal. It's used by FlaggedRevs, TemplateSandbox, and TemplateStyles. Seems fixable.

We'd have to introduce methods for getting the current content of a page for transclusion, probably returning a touple of [ Content $content, int $revId, int $pageId, PageReference $page ]. $revId and $pageId can be 0 or null or omitted for non-existing or non-exitable revisions. $page is for backling tracking. If we allow it to be null for non-page content, we can make it a ProperPageIdentity. We'd just not write anything into templatelinks in that case.

We'll need the following new methods:

  • Parser::fetchCurrentContentOfTitle, to replace fetchCurrentRevisionRecordOfTitle.
  • Parser::statelessFetchContent, to replace statelessFetchRevisionRecord.
  • ParserOptions::getCurrentContentCallback, to replace getCurrentRevisionRecordCallback.
  • ParserOptions::setCurrentContentCallback, to replace setCurrentRevisionRecordCallback.

with a new clear API for template expansion, something like:

  • This resolution code doesn't really belong in the Parser (or in ParserOptions) at all. As above, the parser should have a clean API where it provides the "template name" (again, as a LinkTarget or a string) and there's a "TemplateResolver" service/class/whatever whose job is to resolve/fetch this, including invoking the hooks used by Translate, FlaggedRevs, etc. We can work out the details, but I would very much like this *not* to be methods of Parser or ParserOptions.
cscott renamed this task from Constructing RevisionRecord for a page that can't exist: Special:MyLanguage/Main Page [Called from MediaWiki\Revision\MutableRevisionRecord::__construct] to Parser Use of RevisionRecord for a page that can't exist: (eg Special:MyLanguage/Main Page).Sep 15 2021, 9:13 PM
cscott updated the task description. (Show Details)
daniel raised the priority of this task from Medium to High.May 24 2022, 8:10 AM