Since the duplicate parse detection was improved in T288707 we now have a Logstash dashboard for duplicated parses.
We shouldn't be parsing the same content twice in the same request.
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T255502 Goal: Save Timing median back under 1 second
Resolved | | Krinkle | T277788 Save Timing improvements (2021-2022)
Resolved | | Ladsgroup | T292300 Eliminate unnecessary duplicate parses (2021-2022)
Resolved | | Ladsgroup | T288639 SpamBlacklistHooks::onEditFilterMergedContent causes every edit to be rendered twice
Resolved | | Pchelolo | T292302 CommonsMetadata extension causes every page on commons to be always parsed twice
Open | | matej_suchanek | T264104 Verify AbuseFilter code that claims to share and re-use ParserOutput from core
Resolved | | matmarex | T301309 Refreshlinks job is parsing pages twice
Resolved | | Ladsgroup | T301310 CommonsMetadata extension is triggering a duplicate parse in commons
Is this actually a tracking task? It seems to me that we're going to find and eliminate all issues and then this will be done and marked Resolved, which makes this just a task…
Proofread seems to account for half of the duplicate parses now: https://logstash.wikimedia.org/goto/9d7fdc40cff6c8cec5b0453bd121d78c
The Proofread ones are actually false positives. The way it works is that the ProofreadPage content handler's fillParserOutput basically prepends some wikitext to the page, creates a wikitext content handler and calls getParserOutput on that, which triggers the error.
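To illustrate why this shows up as a duplicate, here is a minimal toy model in Python. It is not MediaWiki code: `RequestLog`, `WikitextHandler`, `ProofreadHandler` and the `delegate_via_public_entry` flag are hypothetical names, and keying the detection on the page title is an assumption made only for this sketch. The idea is that the public `get_parser_output` entry point records a parse, so delegating to it from inside another handler's fill step records the same page twice even though the wikitext is only rendered once.

```python
from dataclasses import dataclass, field


@dataclass
class RequestLog:
    """Toy per-request record of parsed pages (hypothetical, not MediaWiki)."""
    seen: set = field(default_factory=set)
    duplicates: list = field(default_factory=list)

    def record(self, page_key: str) -> None:
        # Flag the page if it was already recorded in this request.
        if page_key in self.seen:
            self.duplicates.append(page_key)
        self.seen.add(page_key)


class WikitextHandler:
    def fill_parser_output(self, text: str) -> str:
        # Stand-in for the actual (expensive) wikitext parse.
        return f"<p>{text}</p>"

    def get_parser_output(self, log: RequestLog, page_key: str, text: str) -> str:
        # Public entry point: records the parse, then does the work.
        log.record(page_key)
        return self.fill_parser_output(text)


class ProofreadHandler:
    def __init__(self, delegate_via_public_entry: bool):
        self.delegate_via_public_entry = delegate_via_public_entry

    def fill_parser_output(self, log: RequestLog, page_key: str, text: str) -> str:
        # Prepend some wikitext, then hand the result to the wikitext handler.
        wrapped = "HEADER " + text
        inner = WikitextHandler()
        if self.delegate_via_public_entry:
            # Records the same page key a second time: a false positive.
            return inner.get_parser_output(log, page_key, wrapped)
        # Calling the fill step directly does the same parse without re-recording.
        return inner.fill_parser_output(wrapped)

    def get_parser_output(self, log: RequestLog, page_key: str, text: str) -> str:
        log.record(page_key)
        return self.fill_parser_output(log, page_key, text)


def render(delegate_via_public_entry: bool):
    """Render one page once and return (html, flagged duplicates)."""
    log = RequestLog()
    handler = ProofreadHandler(delegate_via_public_entry)
    html = handler.get_parser_output(log, "Page:Example", "body")
    return html, log.duplicates
```

In this model both variants produce the same output; only the one that goes through the public entry point gets flagged, which is why switching to the internal fill step removes the log noise without changing what is rendered.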
Change 753696 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[mediawiki/extensions/ProofreadPage@master] Use fillParserOutput instead of getParserOutput.
While checking this I found T299124 (ProofreadPage frontend makes a request to the page before and after in every page view), which possibly means we reduce parses anyway :D
Change 753696 merged by jenkins-bot:
[mediawiki/extensions/ProofreadPage@master] Use fillParserOutputInternal instead of getParserOutput.
Change 754598 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[mediawiki/extensions/ProofreadPage@wmf/1.38.0-wmf.17] Use fillParserOutputInternal instead of getParserOutput.
Change 754598 merged by jenkins-bot:
[mediawiki/extensions/ProofreadPage@wmf/1.38.0-wmf.17] Use fillParserOutputInternal instead of getParserOutput.
Mentioned in SAL (#wikimedia-operations) [2022-01-18T08:37:45Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.17/extensions/ProofreadPage/includes/Page/PageContentHandler.php: Backport: [[gerrit:754598|Use fillParserOutputInternal instead of getParserOutput. (T292300)]] (duration: 00m 51s)
Change 754868 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[mediawiki/extensions/FlaggedRevs@master] Avoid double parsing
Change 754868 merged by jenkins-bot:
[mediawiki/extensions/FlaggedRevs@master] Avoid double parsing
Change 755406 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[mediawiki/extensions/FlaggedRevs@wmf/1.38.0-wmf.18] Avoid double parsing
Change 755406 abandoned by Ladsgroup:
[mediawiki/extensions/FlaggedRevs@wmf/1.38.0-wmf.18] Avoid double parsing
Reason:
I don't have time to babysit the deployment :( I'll let it go with the train
After a lot of cleanup you still see a lot of duplicate parses, but a good look at them (which I did today) basically says most of them are either useless or not important. Take for example the jobrunners, which account for 80% of all duplicate parses and 1K logs per minute.
It's coming from three jobs:
We can chase the long tail, but it's going to be hard work for too little gain.
Closing as many improvements have been made. Meanwhile we're several reorgs and staff losses onwards, with this untouched for 1.5 years. Best to track future work in a new task as part of a goal separate from the now-closed T277788: Save Timing improvements (2021-2022).