Implement a way to have linter reprocess all pages
Open, Normal, Public


Added project: Services, for advice/help.

Currently Linter updates after a page is edited, causing changeprop to request the new version of it from parsoid. However, sometimes we want to reprocess all pages due to code changes (e.g. T160599) or other errors (e.g. T160573#3135189).

In MediaWiki we have a script that reparses all pages sequentially for this purpose (refreshLinks.php). I was thinking of writing a similar script that would just make requests to a single parsoid instance in order for each page to be reparsed and sent back to linter.
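Such a script could be quite small. The following is a hedged sketch, assuming the endpoint layout of Parsoid's v3 REST API (`GET /{domain}/v3/page/html/{title}`); the base URL, domain, and the `fetch` hook are placeholders for illustration, not the production values.

```python
from urllib.parse import quote
from urllib.request import urlopen

def page_html_url(base_url, domain, title):
    """Build the Parsoid v3 URL whose GET triggers a reparse of `title`.
    Endpoint shape is an assumption based on Parsoid's documented API."""
    return "%s/%s/v3/page/html/%s" % (base_url.rstrip("/"), domain,
                                      quote(title, safe=""))

def reparse_all(base_url, domain, titles,
                fetch=lambda url: urlopen(url).read()):
    """Sequentially request each page so Parsoid reparses it and sends
    fresh lint data back to the Linter extension (refreshLinks.php-style)."""
    for title in titles:
        try:
            fetch(page_html_url(base_url, domain, title))
        except Exception as exc:  # keep going past individual failures
            print("failed: %s (%s)" % (title, exc))
```

The `fetch` parameter is injected only so the loop can be exercised without network access; in practice the default `urlopen` call would hit a single internal Parsoid instance.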

Does that sound like a good plan? Other suggestions?

Legoktm created this task.Mar 27 2017, 8:26 PM
Restricted Application added a subscriber: Aklapper. Mar 27 2017, 8:26 PM

The other use case (besides bug fixes) is when we implement code to surface other errors / warnings -- which there are plans for.

mobrovac added a subscriber: mobrovac.

Does the Parsoid HTML change as a result of code changes made to the linter? If so, you might consider bumping the content type patch version. We have a filter in RB that checks the version and re-renders the page if the versions don't match.
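To illustrate the kind of check described above: Parsoid advertises its HTML spec version in the content-type's `profile` URI, and a stored render whose version differs from the expected one can be re-rendered. This is only a sketch; the comparison policy shown (any version difference, including patch, invalidates) is an assumption, not RESTBase's actual filter code.

```python
import re

# Parsoid content-type headers carry a profile URI such as:
#   text/html; charset=utf-8;
#   profile="https://www.mediawiki.org/wiki/Specs/HTML/1.4.0"
VERSION_RE = re.compile(
    r'profile="https://www\.mediawiki\.org/wiki/Specs/HTML/(\d+)\.(\d+)\.(\d+)"')

def spec_version(content_type):
    """Extract (major, minor, patch) from a Parsoid content-type, or None."""
    m = VERSION_RE.search(content_type)
    return tuple(int(x) for x in m.groups()) if m else None

def needs_rerender(stored_ct, expected_ct):
    """Even a patch-version bump is enough to invalidate stored content."""
    return spec_version(stored_ct) != spec_version(expected_ct)
```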

No, the HTML is independent and doesn't change when linter stuff does.

GWicke added a subscriber: GWicke.Mar 27 2017, 10:50 PM

So, if I understand your description right, you basically want to trigger a parse of each page's current revision in Parsoid, but there is no need to update stored or cached content in RESTBase or Varnish. If so, then the htmldumper script should already be very close to what you are looking for. It supports requesting each title in a wiki from an API, and you can run it without storing any of the outputs. The exact API call might need some customization to hit the internal parsoid instance.

Arlolra triaged this task as Normal priority.Mar 28 2017, 2:21 PM

I agree that HTMLDumper might be the way to go. Regarding the URL, I have put a PR up that allows you to specify it from the command line.

Sorry I forgot to comment here - I ended up writing a small python script to do this for now:

Is htmldumper already deployed somewhere?

It is not, but it's easy to set it up on a host with npm, like ruthenium.

ssastry moved this task from Backlog to Non-Parsoid Tasks on the Parsoid board.Apr 27 2017, 2:35 PM
Elitre added a subscriber: Elitre.Apr 27 2017, 2:57 PM

This has nothing to do with the links tables, as far as I can see.

Maybe it would be useful to remove all known false positives (modules, CSS, JS) from the database, because wiki authors cannot do anything to fix them. By the way, at the German Wikivoyage more than 90% of all linter messages are false positives!

Why do you say this? I fixed a lot of such files.

We cannot fix these modules, CSS, and JS files because they are free of errors. We cannot fix false positives. And we cannot fix modules because they use a different content model.
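A hypothetical filter illustrating the suggestion above: before reporting lint results, drop pages whose content model is not wikitext (Lua modules, CSS, JS), since their "errors" cannot be fixed by on-wiki authors. The page records and model names here are illustrative, not Linter's actual schema.

```python
# Content models whose pages should never surface lint errors
# (names follow MediaWiki's usual content-model identifiers).
NON_WIKITEXT_MODELS = {"Scribunto", "css", "javascript", "json"}

def actionable_lint_errors(pages):
    """Keep only lint entries for ordinary wikitext pages."""
    return [p for p in pages if p.get("contentmodel") == "wikitext"]
```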

IKhitron added a comment.EditedJul 26 2017, 6:53 AM

Please give an example of something that is wrong in wikitext but becomes a false positive in a .js file.

Please take a look at

Most of the files listed here are modules, including MediaWiki:Common.css, and they are not wikitext articles. The modules are used more than 10,000 times and do not show any errors in the articles where they are used.

I see. Yes, it would be better without them, but I just fix those.

Btw, if this gets fixed, the problem of anonymous users adding "(25 years old)" after a birthday in an article, when all that is needed is a purge, will be almost solved...

Mentioned in SAL (#wikimedia-operations) [2017-09-04T22:07:35Z] <legoktm> starting script to reparse all pages in parsoid for Linter (python2 http://parsoid.discovery.wmnet:8000 --sitematrix --linter-only --skip-closed) - T161556

Mentioned in SAL (#wikimedia-operations) [2017-09-07T18:39:19Z] <legoktm> restarted script to reparse all pages in parsoid for Linter (python3 http://parsoid.discovery.wmnet:8000 --sitematrix --linter-only --skip-closed) - T161556