Related: T56033
Parser::makeImage makes a separate call to findFile for each file, and similarly, file links using the |link= syntax do not benefit from any LinkBatch-y optimization. This has been known to sometimes cause perf issues on Wikipedia (T66056). On some non-WMF wikis (especially those that use a lot of icons), these singleton calls to makeImage can be responsible for ~25% of global parse time.
There are constructs (findFiles, LinkBatch) that do these sorts of bulk queries already. They are hard to apply here because makeImage is called in the middle of a big loop, and the parser doesn't know the filename or link it needs to look up until fairly far into the loop.
I see three possible options:
- Split the loop - Split up the handleInternalLinks2 loop (and the code of makeImage) into two loops: one loop that goes just far enough to get the file and link, then the batch queries, and then another loop that does the rest. This is tricky because there is a lot of state inside that loop (and inside makeImage) that essentially needs to be carried around and resumed after the batch. I have a proof-of-concept of this approach, but it doesn't handle all cases.
- Do the batching at the end - Basically the LinkHolderArray approach. Have the parser add the "to be filled in later" files/links as UNIQ-ish identifiers, then do the batch, and then and regex-replace them. I think this would be extremely hard to do for makeImage because of all the ways that the output conditionally depends on the file.
- Rely on the link tables from the previous parse - Before the parse, pre-warm the LinkCache (and any equivalent file cache) by grabbing all of the entries currently in filelinks/pagelinks database for that page, and using findFiles/LinkBatch on those. Note that we would only be relying on the database to get the list of *keys* to batch request, and wouldn't be relying on the old values (in case a file changes or a page is deleted). This is by far the simplest approach to write (a somewhat hacky version is ~50 lines of code), but I am unsure if relying on the previous parse (even just for efficiency reasons that would otherwise be a no-op) is an unacceptable line to cross.
Is it worth pursuing this? Is there a different approach I'm missing? I will happily own it if there's agreement that it will still be worthwhile even with Parsoid coming soon(?), and if I can get feedback on which implementation strategy you all prefer.