Page MenuHomePhabricator

Re-evaluate Linter update mechanism
Closed, DeclinedPublic

Description

I think the current system of relying on changeprop to hit Parsoid to make API requests is suboptimal and doesn't work well:

Instead I propose we use the normal MediaWiki links update system. Whenever a page is edited, or a page it transcludes is updated, linksupdates jobs will be triggered, and Linter will make an API request to Parsoid to get the linterrors for the current page and store it. We could use the standard refreshLinks.php script to reparse pages. This is similar to the system the I implemented in the ORES extension to get data out of the ORES service.

Event Timeline

Since Linting requires a full parse, this would double the # of parse requests to Parsoid. We are right now piggybacking onto the existing reparse requests from changeprop to generate these issues. Unless we move away entirely from changeprop for updating RESTBase, this proposal is not desirable from the cluster load POV.

anecdotal evidence that it's not working properly

Can you say more here? There is anecdotal evidence for the other 3 points you make, but, is there anything else apart from that?

The custom script will only need to be run one-off vs. on a regular basis. For that, I am okay with repurposing existing scripts including one of node.js html dumper script or the php refreshLinks script.

Since Linting requires a full parse, this would double the # of parse requests to Parsoid. We are right now piggybacking onto the existing reparse requests from changeprop to generate these issues. Unless we move away entirely from changeprop for updating RESTBase, this proposal is not desirable from the cluster load POV.

Based on https://wikipulse.herokuapp.com/ I think we're looking at 300 wikitext edits per minute (I just subtracted all the Wikidata edits). Does that match the amount of requests parsoid is getting? And is that many extra requests too much?

If doubling the load is too much, I think we should still rely on changeprop and then have a delayed job that checks if the lint errors were updated recently and if not, make the API request to parsoid...but whether that is useful brings us to:

anecdotal evidence that it's not working properly

Can you say more here? There is anecdotal evidence for the other 3 points you make, but, is there anything else apart from that?

Mostly me clicking on random links on a Special:LintErrors subpage and then looking at the history and seeing that someone had already fixed it. Yesterday I noticed a few from a bot on en.wp.

Ultimately I think we need hard stats on whether the current changeprop system is working or not. I'm wondering how we can measure this...

I've done a bit of investigation. All of the errors I could find from that bot contributions @Legoktm posted that are fixed but still showing up on the LitnErrors page are for html5-misnesting. According to Parsoid docs this check provides false positives (correct me if I'm wrong @ssastry)

In order to prove the theory I've checked one page: right now if I go on enwiki Special:LintErrors for HTML5 misnesting, item number 3 is Talk:Francis_J._Beckwith page. However, the error had to be fixed by the bot on October 1 by this edit. Posting both the wiki text of the page before and after the bot edit to Parsoid for re-linting gives the exact same error - html5-misnesting. So in this particular case we're probably looking at an example of a false-positive.

To further prove that it has nothing to do with changes not being propagated I went to Kafka and the event for that particular page edit that should've fixed a linter error is in there. Then I've pulled the record from Cassandra and the TID of the record matches an edit that have been made and the content matches the edit as well. This proves that a request to Parsoid have been made by ChangeProp after this particular edit.

Do you have other examples in mind @Legoktm so that I could continue investigating this? I could try to find them, but it's a waste of time if you already have some links.

Based on https://wikipulse.herokuapp.com/ I think we're looking at 300 wikitext edits per minute (I just subtracted all the Wikidata edits). Does that match the amount of requests parsoid is getting?

We've only got metrics for per-second rates, right now when I'm looking to it the edit rate is ~400 edits/minute and the per-second rate is ~7.5 so the numbers match. See the grafana dashboard and select page-edit_exec for ChangeProp rates.

I've done a bit of investigation. All of the errors I could find from that bot contributions @Legoktm posted that are fixed but still showing up on the LitnErrors page are for html5-misnesting. According to Parsoid docs this check provides false positives (correct me if I'm wrong @ssastry)

In order to prove the theory I've checked one page: right now if I go on enwiki Special:LintErrors for HTML5 misnesting, item number 3 is Talk:Francis_J._Beckwith page. However, the error had to be fixed by the bot on October 1 by this edit. Posting both the wiki text of the page before and after the bot edit to Parsoid for re-linting gives the exact same error - html5-misnesting. So in this particular case we're probably looking at an example of a false-positive.

Most of the false positives have been fixed. That particular page has a real html5-misnesting error (besides the false positives that Nihlus bot fixed). So, all good there.

I've done a bit of investigation. All of the errors I could find from that bot contributions @Legoktm posted that are fixed but still showing up on the LitnErrors page are for html5-misnesting. According to Parsoid docs this check provides false positives (correct me if I'm wrong @ssastry)

In order to prove the theory I've checked one page: right now if I go on enwiki Special:LintErrors for HTML5 misnesting, item number 3 is Talk:Francis_J._Beckwith page. However, the error had to be fixed by the bot on October 1 by this edit. Posting both the wiki text of the page before and after the bot edit to Parsoid for re-linting gives the exact same error - html5-misnesting. So in this particular case we're probably looking at an example of a false-positive.

Most of the false positives have been fixed. That particular page has a real html5-misnesting error (besides the false positives that Nihlus bot fixed). So, all good there.

<font face="Antiqua, serif">''[[User:Hrafn|Hrafn]]<sup>[[User talk:Hrafn|Talk]]</sup><sub>[[Special:Contributions/Hrafn|Stalk]]</sub><sup>''('''[[M:Precisionism|P]]''')</sup></font>

The <sup> .. </sup> straddles a ''-pair which Tidy fixes up differently compared to HTML5 (remex & parsoid). ?action=parsermigration-edit confirms this - the change is subtle, but you can see it if you ctrl+f the 13:01, 11 December 2010 string. the (P) link is in superscript in the Tidy case and not in Remex.

@Legoktm anything to do here? I inclined to close this as declined.

I don't think there is anything to do here.