Eliminate unnecessary duplicate parses (2021-2022)
Closed, Resolved · Public

Description

Since the duplicate parse detection was improved in T288707, we now have a Logstash dashboard for duplicated parses.

We shouldn't be parsing the same content twice in the same request.

Event Timeline

Is this actually a tracking task? It seems to me that we're going to find and eliminate all issues, and then this will be done and marked Resolved, which makes this just a task…

Pchelolo renamed this task from Tracking: duplicate parses to Eliminate unnecessary duplicate parses. Oct 4 2021, 6:33 PM
Pchelolo updated the task description.

Now it's not tracking :)

The ProofreadPage ones are actually false positives. The way it works is that the ProofreadPage content handler's fillParserOutput basically prepends some wikitext to the page, creates a wikitext content handler and calls getParserOutput on that, which triggers the duplicate-parse error.
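To illustrate the false positive described above, here is a minimal Python sketch. It is a toy model, not MediaWiki or ProofreadPage code: `DuplicateParseDetector`, `WikitextContent`, `PageContent` and the `use_internal` flag are hypothetical names, and the detector is assumed to be keyed on the page title within one request. It shows a wrapper that delegates to the instrumented public entry point being counted twice even though the text is parsed only once, and the warning disappearing when the wrapper calls the internal fill method directly.

```python
from dataclasses import dataclass, field


@dataclass
class DuplicateParseDetector:
    """Toy request-scoped detector (assumption: keyed on page title)."""
    seen: set = field(default_factory=set)
    warnings: list = field(default_factory=list)

    def record(self, page: str) -> None:
        # Flag every call after the first one for the same page.
        if page in self.seen:
            self.warnings.append(f"duplicate parse: {page}")
        self.seen.add(page)


class WikitextContent:
    """Hypothetical stand-in for plain wikitext content."""

    def __init__(self, text: str) -> None:
        self.text = text

    def fill_parser_output_internal(self, page: str) -> str:
        # Stand-in for the real parse; not instrumented.
        return f"<p>{self.text}</p>"

    def get_parser_output(self, page: str, detector: DuplicateParseDetector) -> str:
        # Public entry point: instrumented, then does the real work.
        detector.record(page)
        return self.fill_parser_output_internal(page)


class PageContent:
    """Hypothetical wrapper that prepends a header before parsing."""

    def __init__(self, header: str, body: str) -> None:
        self.header = header
        self.body = body

    def get_parser_output(
        self, page: str, detector: DuplicateParseDetector, use_internal: bool
    ) -> str:
        detector.record(page)
        inner = WikitextContent(self.header + self.body)
        if use_internal:
            # Bypass the instrumented entry point: one record, one parse.
            return inner.fill_parser_output_internal(page)
        # Goes through the instrumented entry point again: two records,
        # still only one parse, hence a false positive.
        return inner.get_parser_output(page, detector)


if __name__ == "__main__":
    for use_internal in (False, True):
        detector = DuplicateParseDetector()
        PageContent("header ", "body").get_parser_output(
            "Page:Example", detector, use_internal
        )
        print(use_internal, detector.warnings)
```

In this model the text is parsed exactly once in both modes; only the number of times the detector is notified changes.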

Change 753696 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/ProofreadPage@master] Use fillParserOutput instead of getParserOutput.

https://gerrit.wikimedia.org/r/753696

Change 753696 merged by jenkins-bot:

[mediawiki/extensions/ProofreadPage@master] Use fillParserOutputInternal instead of getParserOutput.

https://gerrit.wikimedia.org/r/753696

Change 754598 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/ProofreadPage@wmf/1.38.0-wmf.17] Use fillParserOutputInternal instead of getParserOutput.

https://gerrit.wikimedia.org/r/754598

Change 754598 merged by jenkins-bot:

[mediawiki/extensions/ProofreadPage@wmf/1.38.0-wmf.17] Use fillParserOutputInternal instead of getParserOutput.

https://gerrit.wikimedia.org/r/754598

Mentioned in SAL (#wikimedia-operations) [2022-01-18T08:37:45Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.17/extensions/ProofreadPage/includes/Page/PageContentHandler.php: Backport: [[gerrit:754598|Use fillParserOutputInternal instead of getParserOutput. (T292300)]] (duration: 00m 51s)

Change 754868 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/FlaggedRevs@master] Avoid double parsing

https://gerrit.wikimedia.org/r/754868

Change 754868 merged by jenkins-bot:

[mediawiki/extensions/FlaggedRevs@master] Avoid double parsing

https://gerrit.wikimedia.org/r/754868

Change 755406 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/FlaggedRevs@wmf/1.38.0-wmf.18] Avoid double parsing

https://gerrit.wikimedia.org/r/755406

Change 755406 abandoned by Ladsgroup:

[mediawiki/extensions/FlaggedRevs@wmf/1.38.0-wmf.18] Avoid double parsing

Reason:

I don't have time to babysit the deployment :( I'll let it go with the train.

https://gerrit.wikimedia.org/r/755406

Even after a lot of cleanup, you still see a lot of duplicate parses, but a good look at them (which I did today) basically says most of them are either useless or not important. Take for example the jobrunners, which account for 80% of all duplicate parses and 1K logs per minute.

They come from three jobs:

  • CirrusSearch job
    • Only happens on Wikidata and, while there are a lot of them, it doesn't produce HTML and doesn't call the term store, so it's cheap.
  • CategoryMembershipChangeJob
    • Happens to a small degree, 10 per minute, mostly on Commons.
  • RefreshLinksJob
    • Rather smaller in size, 300 per minute. Almost exclusively on Commons, probably because of MCR, so not really a big issue.

We can chase the long tail, but it's going to be hard, with too little gain.

Krinkle renamed this task from Eliminate unnecessary duplicate parses to Eliminate unnecessary duplicate parses (2021-2022). Sep 16 2023, 7:07 AM
Krinkle closed this task as Resolved.
Krinkle assigned this task to Ladsgroup.
Krinkle added a subscriber: Krinkle.

Closing, as many improvements have been made. Meanwhile, we're several reorgs and staff losses onwards, with this untouched for 1.5 years. Best to track future work in a new task as part of a goal separate from the now-closed T277788: Save Timing improvements (2021-2022).