Page MenuHomePhabricator

Parser should batch-request files (and links in files) instead of one-at-a-time
Open, Needs TriagePublic

Description

Related: T56033

Parser::makeImage makes a separate call to findFile for each file, and similarly, file links using the |link= syntax do not benefit from any LinkBatch-y optimization. This has been known to sometimes cause perf issues on Wikipedia (T66056). On some non-WMF wikis (especially those that use a lot of icons), these singleton calls to makeImage can be responsible for ~25% of global parse time.

There are constructs (findFiles, LinkBatch) that do these sorts of bulk queries already. They are hard to apply here because makeImage is called in the middle of a big loop, and the parser doesn't know the filename or link it needs to look up until fairly far into the loop.

I see three possible options:

  1. Split the loop - Split up the handleInternalLinks2 loop (and the code of makeImage) into two loops: one loop that goes just far enough to get the file and link, then the batch queries, and then another loop that does the rest. This is tricky because there is a lot of state inside that loop (and inside makeImage) that essentially needs to be carried around and resumed after the batch. I have a proof-of-concept of this approach, but it doesn't handle all cases.
  2. Do the batching at the end - Basically the LinkHolderArray approach. Have the parser add the "to be filled in later" files/links as UNIQ-ish identifiers, then do the batch, and then and regex-replace them. I think this would be extremely hard to do for makeImage because of all the ways that the output conditionally depends on the file.
  3. Rely on the link tables from the previous parse - Before the parse, pre-warm the LinkCache (and any equivalent file cache) by grabbing all of the entries currently in filelinks/pagelinks database for that page, and using findFiles/LinkBatch on those. Note that we would only be relying on the database to get the list of *keys* to batch request, and wouldn't be relying on the old values (in case a file changes or a page is deleted). This is by far the simplest approach to write (a somewhat hacky version is ~50 lines of code), but I am unsure if relying on the previous parse (even just for efficiency reasons that would otherwise be a no-op) is an unacceptable line to cross.

Is it worth pursuing this? Is there a different approach I'm missing? I will happily own it if there's agreement that it will still be worthwhile even with Parsoid coming soon(?), and if I can get feedback on which implementation strategy you all prefer.

Event Timeline

3 is basically the approach i took in the QuickInstantCommons extension (if you are loading images over network all these problems get so much worse). It was really hacky but the result was very effective. However im not sure it would be acceptable in core (Or at least, not the way i did it in QuickInstantCommons... perhaps if you do it more eloquently than i did it would be fine).

Spent some more time investigating this and I think option #2 (basically a "FileHolderArray") is not as hard as I initially thought, and is the solution that would be least-disruptive to the current parser code.

I'm happy to implement this and PR it to core, but I would love some guidance on what the process/timeline has been for testing/releasing similarly large changes to the parser. More directly, with Parsoid coming relatively soon, are somewhat-disruptive perf improvements to the "legacy" parser likely to be approved?

If I can get an unambiguous "yes" on this from the performance or parser team (maybe @cscott ?) then I'll clean this code up, add some tests and PR it