Page MenuHomePhabricator

Find out the reason and potentially eliminate ParserCache split on action:render
Closed, ResolvedPublic

Description

Currently ParserCache splits on whether the parse is an immediate result of Action=render or not. The reason for this split is unclear, but removing it most likely will allow us to cut the size of the ParserCache by a factor of 2.

We should investigate why the split exist and strive for removing the split.

Event Timeline

18:46 Pchelolo: TimStarling: I wonder, maybe you know - why is entire ParserCache split on action=render vs everything else? https://github.com/wikimedia/mediawiki/blob/master/includes/parser/ParserCache.php#L114 We've been looking at it with Daniel today and could find the reason, thought maybe you could know right away. if not I'll keep figuring it out
18:47 TimStarling: because action=render is (was?) used for commons file description transclusion and so contained fully qualified URLs instead of domain-relative URLs
18:48 TimStarling: Ls
18:48 TimStarling: xchat sucks so much now, I cannot see the "Ls" at the end of that line unless I select it
18:49 TimStarling: should switch to irccloud or something like everyone else
18:50 TimStarling: yeah it is still the same, see FileRepo::getDescriptionRenderUrl() which is fetched in ForeignDBFile::getDescriptionText()
18:54 Pchelolo: hm.. but how does it affect the result of the parse? like, when Action=render is called, it's not setting any special parser options or calling any special methods on a parser?
18:55 Pchelolo: it only sets viewIsRenderAction which doesn't seem to propagate into the parser
18:56 TimStarling: you're not going to like this Pchelolo, are you ready?
18:56 Pchelolo: I'm sitting
18:56 TimStarling: it is the exact same hack that we implemented when commons first started, on line 2300 of Title.php:
18:56 TimStarling: @todo FIXME: This causes breakage in various places when we
18:56 TimStarling:
actually expected a local URL and end up with dupe prefixes.
18:56 TimStarling: if ( $wgRequest->getVal( 'action' ) == 'render' ) {
18:56 TimStarling: $url = $wgServer . $url;
18:56 TimStarling: }
18:56 TimStarling: Title::getLocalUrl() is hacked up to look at the query string
18:59 TimStarling: it's actually had a fixme comment since it was introduced by Brion in 2006 https://www.mediawiki.org/wiki/Special:Code/MediaWiki/12621
19:01 Pchelolo: it even indicates that the correct fix would be to move prepending elsewhere :)
19:01 TimStarling: sorry actually the hack was on the LHS of this change too, but that is where the fixme was introduced
19:03 TimStarling: without the fixme comment it goes back to July 2005, this change has both the parser cache hack and the URL hack: https://www.mediawiki.org/wiki/Special:Code/MediaWiki/9864

Yuck. Can we at least fold this into ParserOptions somehow? Or we convert local to absolute URLs in ParserOutput::getText(), or in the render action...

I was thinking introduce 'useAbsoluteUrls' ParserOption should be the easiest. Will let us drop this hack from both Title and ParserCache, but will proliferate the cache split.

Converting local to absolute URLs in a post-processing step... There is $options array there, and we already pass some special options for render action, but I don't know how easy would it be to replace it on the fly with a regex.. I mean, it would be easy, but I'm not sure how fragile it would be. Don't know if there's a lot of corner cases. Will investigate.

Change 629454 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/core@master] Remove old hack for splitting ParserCache on action=render

https://gerrit.wikimedia.org/r/629454

I'm going to unroot it and put into backlog for now. This doesn't at all block the project, but it's a great opportunity to eliminate a horrific piece of technical debt.

Change 654487 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/ImageMap@master] Prepare for Title::getLocalURL to stop producing full URLs

https://gerrit.wikimedia.org/r/654487

Change 654503 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/core@master] Temporary debug log if Title::getLocalURL is used from render action

https://gerrit.wikimedia.org/r/654503

Change 654503 merged by jenkins-bot:
[mediawiki/core@master] Temporary debug log if Title::getLocalURL is used from render action

https://gerrit.wikimedia.org/r/654503

I did an analysis of ParserCache entries and I confirm this work will have a massive impact on ParserCache storage.

Basically get a sample and try to see how many keys have been fragmented with "-0" and "-1" and then multiply it twice to get the fragmentation. My rough analysis of 1/256 keys yielded 100M duplication which is roughly 20% of all of ParserCache.

I will spend some time to make sure it happens. cc @LSobanski @Kormat

I worked on this a few years ago as part of GlobalUserPage/shadow namespaces but didn't finish. I don't think action=render is worth fixing, rather features like InstantCommons should be updated to use the API's action=parse (which didn't exist at the time). LinkRenderer has options to set if URLs should be expanded absolutely or not, the missing parts were 1) getting files to do that as well, because refactoring that was a mess 2) exposing it as an API parameter.

There's some useful CR on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/191533 for working toward that, specifically Anomie suggested:

This doesn't seem very comprehensive, see this test link for some examples:

/w/api.php?action=parse&fullyqualifyurls=1&prop=text&disablepp=1&contentmodel=wikitext&text===Look+at+my+edit+link==%0A*+Awful+book:+ISBN+1466938897%0A*+Gallery:+<gallery>File:Example.jpg</gallery>%0A*+Normal+file:+[[File:Example.jpg]]%0A*+File+with+link:+[[File:Example.jpg|link=Test]]%0A*+Thumb:+[[File:Example.jpg|right]]%0A*+Nonexistent:+[[File:ThisFileShouldHopefullyNotExist.jpg]]

There's probably more, that's just what I found searching core for calls to Title::getLocalURL() and Title::getLinkURL() and quickly looking through the results. If you really want to expand all links, you'll probably need to resort to running something like ApiHelp::fixHelpLinks() over the output HTML.

...which is less elegant but likely the most complete solution without blocking on fixing *everything* (where I got stalled).

Change 715757 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Move adding full url for action=render to post render transform

https://gerrit.wikimedia.org/r/715757

As a DBE (Database Enthusiast) I looked at this, specially given that it triggers extra 5 million renders every day and single-handedly is responsible for twenty percent of ParserCache.

Fixing the links internally is quite a mess as @Legoktm has pointed out.
Using action=parse is not bad and I think it should be done but having rough edges pointed out, I want to mention that lots of third party systems also use InstantCommons and it means changing to action=parse is going to take at least years to be fully adapted.

The post-ParserCache transform sounds like a good idea and rather easily doable. It's not super great but works well. Tried it locally. Here's the patch.

I fully support doing it in a post-processing step, if that's possible.

A complication here is that this hack with checking action=render exist both in ParserCache and in Title. If we just eliminate the PC split without removing the hack from Title, we might cache invalid stuff renders when action=render is called and ParserCache is empty - we will cache a render with full URLs. So we need to drop this hack from both ParserCache and Title at the same time.

To do this we need to find the callers of Title::getLocalURL and see if removing the hack will break anything - thus the debug logging I've added. I guess next step here is to enable the debug log channel, analyze the results and make sure extension code that calls Title::getLocalURL is not expecting to get full URLs

I'm not sure I understand your concern!
Let me explain: The current patch is on RenderAction and takes html output (no matter where they are from) and turns any sort of internal link to external link. It's a very top-down approach which is much easier here than doing bottom-up because it would require a lot of investigation. Am I missing something obvious here?

Hm, actually thinking about it again you're right, top-down will work indeed. My concern indeed was about my bottom-up approach.

I was wondering if middle approach is better - to add an option to ParserOutput::getText and do it there? Then we have a rather clean way of passing if view is a render action in without using request - in Article::view we just set this text option in generateContentOutput to $this->viewIsRenderAction and it's all gets pretty clean..

Middle approach sounds better to me because I couldn't add a proper WAN cache for it (for a day/hour for example), it'll be cached in edges but still I don't want to hand over a DDoS vector to attackers. Doing it from the top, it would be really hard to build a key and have some cache busting but at the middle is much easier.

Change 722632 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[mediawiki/core@master] DNM: expand local URLs to absolute URLs in ParserOutput

https://gerrit.wikimedia.org/r/722632

Change 654487 abandoned by Ppchelko:

[mediawiki/extensions/ImageMap@master] Prepare for Title::getLocalURL to stop producing full URLs

Reason:

not needed

https://gerrit.wikimedia.org/r/654487

Change 722632 merged by jenkins-bot:

[mediawiki/core@master] Expand local URLs to absolute URLs in ParserOutput

https://gerrit.wikimedia.org/r/722632

Change 724109 had a related patch set uploaded (by Ladsgroup; author: Ppchelko):

[mediawiki/core@wmf/1.38.0-wmf.1] Expand local URLs to absolute URLs in ParserOutput

https://gerrit.wikimedia.org/r/724109

xhgui is quite happy about it: ParserOutput::getText took 3.5% of the whole request, around 10ms.

Change 724109 merged by jenkins-bot:

[mediawiki/core@wmf/1.38.0-wmf.1] Expand local URLs to absolute URLs in ParserOutput

https://gerrit.wikimedia.org/r/724109

Mentioned in SAL (#wikimedia-operations) [2021-09-27T17:42:28Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.1/includes/page/Article.php: Backport: [[gerrit:724109|Expand local URLs to absolute URLs in ParserOutput (T263581)]], Part I (duration: 00m 59s)

Mentioned in SAL (#wikimedia-operations) [2021-09-27T17:43:37Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.1/includes/parser/ParserOutput.php: Backport: [[gerrit:724109|Expand local URLs to absolute URLs in ParserOutput (T263581)]], Part II (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2021-09-27T17:44:46Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.1/includes/parser/ParserCache.php: Backport: [[gerrit:724109|Expand local URLs to absolute URLs in ParserOutput (T263581)]], Part III (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2021-09-27T17:46:04Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.1/includes/Title.php: Backport: [[gerrit:724109|Expand local URLs to absolute URLs in ParserOutput (T263581)]], Part IV (duration: 00m 56s)

Change 724186 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/core@master] Follow-up d334de9: Add makeurlsabsolute option to action=parse

https://gerrit.wikimedia.org/r/724186

Using moving average of a day show it better:

image.png (349×1 px, 42 KB)

Change 629454 abandoned by Ppchelko:

[mediawiki/core@master] Remove old hack for splitting ParserCache on action=render

Reason:

already done.

https://gerrit.wikimedia.org/r/629454

Change 715757 abandoned by Ppchelko:

[mediawiki/core@master] Move adding full url for action=render to post render transform

Reason:

Done in a different way.

https://gerrit.wikimedia.org/r/715757

I believe this change may have caused a regression, which was reported to me by users of the RTRC gadget. Specifically, RTRC expected links from action=render diffs to contain absolute URLs, but as of this week, it no longer does this, which caused some processing logic to cause some of the features to stop working as it was relying on these responses to only contain fully formed URLs and thus not need to expand them client-side against some kind of base URL.

I'll be working around this for the RTRC gadget, which is fairly easy since it runs inside the wiki context, but this may be less trivial for the more traditional use case of action=render where the HTML would normally be rendered in context of a different website, or in a blank iframe or embeddd WebView in a native app, and thus none the URLs would work.

https://nl.wikipedia.org/w/index.php?diff=60045535&oldid=60045530&action=render&diffonly=1

<a href="/w/index.php?title=…&amp;diff=prev&amp;oldid=60045530" title="…" id="differences-prevlink">← Older edit</a>

<a href="/wiki/Speciaal:Bijdragen" class="mw-userlink mw-anonuserlink"  title="Speciaal:Bijdragen"></a>

etc.

Change 725498 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Fix local rendering of link in diff view with action=render

https://gerrit.wikimedia.org/r/725498

Change 725498 merged by jenkins-bot:

[mediawiki/core@master] Fix local rendering of link in diff view with action=render

https://gerrit.wikimedia.org/r/725498

How can I make it return relative URLs like before? expand-url=0 doesn't do it.

How can I make it return relative URLs like before? expand-url=0 doesn't do it.

It never sent relative URL AFAIK (and from the code). You probably need another action.

Ah I see. I was just confused by the new expand-url=1 in adjacent diff links (which I still don't get what difference its presence makes...).