Page MenuHomePhabricator

DiscussionTools doesn't recognize talk threads with Parsoid HTML
Closed, ResolvedPublic

Description

See output of https://en.wikipedia.org/wiki/Talk:Hospet?useparsoid=1 . .. there are no reply links and the message at the bottom indicates that DT doesn't recognize talk page threads.

This needs to be fixed.

Related Objects

Event Timeline

DiscussionTools can recognize talk threads in Parsoid HTML, but the problem is that Parsoid doesn't run the ParserAfterTidy hook, so important parts of our code don't run, and there's no obvious equivalent we could use. It was suggested that we should use OutputPageParserOutput instead, but…

We used ParserAfterTidy, because we wanted the output of our processing to be cached, and this allowed us to "piggy-back" on the parser cache, only causing a slight increase in disk space used by discussion pages instead of doubling it.

If we use OutputPageParserOutput, then it is no longer cached, which will slow down page load times. We don't really know the impact. From some quick tests it seems noticeable but not terrible. On this large page I see "DiscussionTools time usage: 0.375 seconds" (it's in the parser limit report). [I remembered it being slower… Not sure if my memory is going or if our code magically got faster.]

Also, if we use OutputPageParserOutput, then the processing might no longer be applied in contexts that don't use OutputPage, like the API. We will need to test this carefully, since we load the page content from the API after saving. We had several bugs related to this in the past before we used ParserAfterTidy.

I think we'd prefer to continue piggy-backing on the parser cache, and if we need to change that, it would take a bit more work than it seems.

[ Adding @daniel, @Ladsgroup, @Jgiannelos for visibility. I see Ed already added Scott. ]

This is basically a version of a bigger problem we have identified which broadly goes as follows.

Parsoid generates canonical HTML with all information that is cached (right now in ParserCache, previously in RESTBase). This HTML was originally focused on VisualEditor (and such clients) that need all the information so that it can be roundtripped back to canonical wikitext without dirty diffs.

Now, there are a number of other targets that use a modified form of this canonical HTML (read views will strip information, PCS does transforms, DT wants some hooks to run, MFE may do some other mangling, etc.). So, we can think of this as a html2html transformation.

So, the question becomes (a) what are performance SLOs for these target products (b) how expensive / performant are these html2html transformations (c) should the html2html output be itself cached?

Obviously, the two extremes for (c) are: (1) every target variant (or flavor or pick-your-favorite-noun) gets cached. (2) no target variant gets cached -- they are always generated on demand.

PCS is right now going through this as part of RESTBase deprecation. The goal there is to set SLOs for mobile apps, use those to optimize performance of PCS to see if the PCS transforms meet perf targets (and hence don't need any other caching beyond existing edge caching in Varnish), and if not, figure out what caching solution / strategy is needed. There are conditional caching strategies for ex. Only cache html variants for pages where the generation time exceeds the SLO targets (or some other such policy). If caching, do we use PC (with different keys) for all these Parsoid HTML variants? Do we use different PC instances? We obviously do not have PC capacity for storing all html2html variants right now. But, if we need to expand that capacity, that would be a conversation with SRE informed by some data collected here for different use cases.

What are your thoughts here for DiscussionTools? I think that will help inform what hooks should be used / added and where / when they run. Anything I missed here?

MSantos triaged this task as High priority.Sep 19 2023, 3:35 PM

Change 969390 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/extensions/DiscussionTools@master] WIP: Add ParserOutputPostTransformHook handler for Parsoid HTML

https://gerrit.wikimedia.org/r/969390

Change 969225 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/core@master] ParsoidParser: Record page title in ParserCache entries

https://gerrit.wikimedia.org/r/969225

Change 969225 merged by jenkins-bot:

[mediawiki/core@master] ParsoidParser: Record page title in ParserCache entries

https://gerrit.wikimedia.org/r/969225

There are three possible strategies. I am outlining them here along with links to patches. ( FYI:

  1. Add a ParserOutputPostCacheTransformHook handler to DiscussionTools: DT patch here - the commit summary and gerrit discussion has the details, but we also need this core patch for this to actually work properly. The TLDR is that the hook handler now mutates the ParserOutput object (and sets additional state there) which is then used by hook handlers in DT (called from OutputPage methods). But, without the core patch, those handlers run before the ParserOutputPostCacheTransformHook handler in DT runs. Timo made the pertinent observation that the issue here is that post-cache transforms shouldn't be mutating ParserOutput anymore. That makes sense in the current regime but with Parsoid HTML, we are switching to a post-cache transform pipeline to do all kinds of things (even before Parsoid, getText() had accumulated a whole bunch of transforms). That pipeline (work in progress by Isabelle) will create a copy of ParserOutput before mutating it. This also opens up the possibility of caching this OutputTransform pipeline output for scenarios where the pipeline does expensive work. But, in the interim, we may have to pinch our nose and let the post-cache ParserOutput object be mutated to let us run DT with Parsoid HTML. But, no matter what, it requires the core patch to be merged as well.
  2. Have DiscussionTools register a DOM post processor transformer with Parsoid and store the modified HTML in ParserOutput (so, effectively PO will have both original html and DT html cached): DT patch here - this is more in a quick POC. If we want to go this route, we will need a new DOM post processor created in Parsoid which will be called after Parsoid is fully done (this will be the Parsoid equivalent of ParserAfterTidy). It will also require us to actually implement the ParsoidExtensionAPI::isPreview() method properly (vs a stub as is now). This could work and addresses Timo's concerns from #1 in a different way.
  3. Create a new hook in core that runs in OutputPage::addParserOutput (first thing). core patch here -- while I called this hook ParsoidHtmlPreRenderTransform, it could also have been called OutputPageBeforeParserOutput (to go with the naming convention currently in vogue). This is effectively a different solution to the core patch in solution #1. Doing it this way, DiscussionTools could do its entire work in a single handler (since it has all the state it needs) instead of in 3 different hook handlers - I haven't written this patch, but it will be a variant of patches from #1 & #2. But, it introduces caching challenges because there is currently no infrastructure in place that lets us cache OutputPage content (we'll need this if we want to enable this for DiscussionTools in production with Parsoid HTML in this approach).

So, there you go, three different approaches with patches and some discussion. I think after having explored all this, I prefer #1. It fits with the OutputTransform path we are exploring. #2 has the advantage in that it needs no special plumbing (beyond the new transform in Parsoid to be added) and implicitly handles performance concerns. #3 requires a lot more plumbing for handling performance issues, but that plumbing has the advantage that it can make it possible to cache a fully rendered page (skin, ui, and all) if there are no concerns going that route, but also one that is probably most risky.

I also prefer #1. I wish that I could figure out how to do it without moving the location of the OutputPageParserOutputHook, but I couldn't immediately see a way to do that.

I agree that #1 seems like the most future proof approach.

Only two extensions implement a ParserOutputPostCacheTransformHandler.

WikidataPageBanner has a handler for this and OutputPageParserOutputHook and my core patch switches the order in which those hooks are invoked ... and afaict, looking at the code, it looks like it shouldn't matter.

WikibaseMediaInfo has a handler for this and BeforePageDisplayDisplayHook and afaict, looking at the code, it looks like it shouldn't matter.

But, post-deploy, this is something we should look at. WikidataPageBanner is used on wikivoyage. Not yet sure where WikibaseMediaInfo is used on wikis.

Change 969390 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Add ParserOutputPostCacheTransformHook handler for Parsoid HTML

https://gerrit.wikimedia.org/r/969390

Change 974275 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/extensions/DiscussionTools@master] Tweak CSS to deal with Parsoid's <section> tags

https://gerrit.wikimedia.org/r/974275

On itwiki which is now on wmf.5, many pages now render DiscussionTools with "?useparsoid=1" ( this, this, this ).

*But*, it doesn't render for the Main Page.

That is to be investigated - so leaving this task open for a bit.

Now that wmf.5 has rolled out to group 2, I am seeing another issue on enwiki talk pages.

For example, see . https://en.wikipedia.org/wiki/Talk%3APaul%20McAuliffe?useparsoid=1 (not the only one) clearly, the HTML is being processed since the metadata is present on top and in the TOC but the actual comment sections don't have the DT html ... this isn't a CSS issue.

To be investigated.

With the fix for T351461 having rolled out to itwiki, this is now fixed.

Let me know once it's fully deployed everywhere so I can renable parsoid for myself, I had parsoid enabled in my home wiki but had to disable it due to this bug :(

Looks good now. There will be separate tasks for any bugs found and followups needed.

Let me know once it's fully deployed everywhere so I can renable parsoid for myself, I had parsoid enabled in my home wiki but had to disable it due to this bug :(

It's live as of today!

Change 974275 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Tweak CSS to deal with Parsoid's <section> tags

https://gerrit.wikimedia.org/r/974275

Change 989982 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Introduce ParserOutput:setFromParserOptions() and use for preview flag

https://gerrit.wikimedia.org/r/989982

Change 989982 merged by jenkins-bot:

[mediawiki/core@master] Introduce ParserOutput:setFromParserOptions() and use for preview flag

https://gerrit.wikimedia.org/r/989982