Page MenuHomePhabricator

Linter updates in a Parsoid-in-ParserCache world
Open, MediumPublic


Forking from T320534: Put Parsoid output into the ParserCache on every edit.

Thanks for explaining - I should've just led with my real question sorry, which is whether this will change how Linter updates will work (see related T159512#8627272). Lint errors are effectively derived/secondary data from the Parsoid parsing process, so in theory it should happen as part of RefreshLinks. That would require us to do a full Parsoid parse, just like, AIUI, we do with the legacy parser right now. I assume that eventually we will be extracting links tables, etc. from Parsoid, so at some point we'll need to do a Parsoid parse in RefreshLinks regardless.

Event Timeline

Lint errors are effectively derived/secondary data from the Parsoid parsing process, so in theory it should happen as part of RefreshLinks.

I see... this is very expensive, though.... Will template changes even affect linter errors? And should they? They indicate whetehr there is something wron with the page's wikitext. That doesn't change as long as the page itself isn't edited, right?

Yes, lint errors operate on the parsed HTML, so if a template changes something to be unbalanced or uses obsolete HTML, it will trigger a new lint error - and vice versa, templates do get updated to fix lint errors.

daniel triaged this task as Medium priority.Jun 5 2023, 6:14 PM
daniel moved this task from Unsorted to Parsoid pile on the RESTBase Sunsetting board.

Some thoughts after meeting with Aaron, Derick, Timo, Subbu, and Scott just now:

  • Eventually, we want linter data to be handled as Extension Data in ParserOutput. But we don't want to block on that.
  • The best place to trigger a parsoid parse (so linter data to be written as a side effect) is in a DataUpdate. The Linter extension can use the RevisionDataUpdates hook to achieve that. When we switch to paroid page views, the same hook can be used to pull the lint data out of the ParserOutput, instead of doing a separate parse.
  • RevisionDataUpdates is triggered through LinksUpdater, which gets triggered through RefreshLinksJob, or a synchronously during an edit, or (very rarely) diring a page view via WikiPage::triggerOpportunisticLinksUpdate.
  • When triggering a parsoid parse from inside DataUpdate, we don't want to write to the ParserCache. We could when the update is happening due to an edit. But keeping the parsoid cache warming logic separate seems preferrable.
  • Doing the parsoid parse directly in the data updater, rather than through a prewarm job, keeps the code simple and makes debugging easier. Slow parses seem rare enough.
  • Since prewarming the parser cache is done primarily for the benefit of VisualEditor, perhaps that logic should be moved out of core into the VE extension. Anyway, in a parsoid-pageviews world, we don't need the separate jobs anymore.

Miro doodle: