
Add option to refreshLinks.php to only update pages that haven't been updated since a timestamp
Closed, Resolved, Public

Description

We can use the page_links_updated field to find pages that haven't been updated in a while. This makes it easier for us to run the refreshLinks.php script across large wikis without updating pages that were recently updated.
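A minimal sketch of the selection described above, not the actual patch. It assumes `page_links_updated` holds a 14-digit `YYYYMMDDHHMMSS` string and is NULL for pages whose links were never refreshed. An in-memory SQLite table stands in for the real `page` table, and `mw_timestamp`, `stale_page_ids`, and `demo` are hypothetical names used only for illustration.

```python
import sqlite3
from datetime import datetime, timezone


def mw_timestamp(dt: datetime) -> str:
    """Format an aware datetime as a 14-digit UTC timestamp (YYYYMMDDHHMMSS)."""
    return dt.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


def stale_page_ids(conn: sqlite3.Connection, before: str) -> list[int]:
    """Return IDs of pages never refreshed or last refreshed before `before`."""
    rows = conn.execute(
        "SELECT page_id FROM page "
        "WHERE page_links_updated IS NULL OR page_links_updated < ? "
        "ORDER BY page_id",
        (before,),
    )
    return [row[0] for row in rows]


def demo() -> list[int]:
    # Stand-in table with made-up rows; the real schema has many more columns.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_links_updated TEXT)"
    )
    conn.executemany(
        "INSERT INTO page VALUES (?, ?)",
        [
            (1, "20170101000000"),  # refreshed long ago: selected
            (2, "20230215120000"),  # refreshed after the cutoff: skipped
            (3, None),              # never refreshed: selected
        ],
    )
    cutoff = mw_timestamp(datetime(2022, 11, 1, tzinfo=timezone.utc))
    return stale_page_ids(conn, cutoff)


if __name__ == "__main__":
    print(demo())
```

Because the timestamps are fixed-width digit strings, a plain string comparison orders them chronologically, so the filter needs no date parsing.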

Event Timeline

Hi, @Legoktm. What do you mean by "recently"? Last year? Last month? An hour since the previous script run? Half a time until today?
Also, a null edit is a very quick action, and I run null-edit bots a lot. Do you think null-editing a page takes longer than checking the timestamp difference, including retrieving the timestamp itself?
Thanks.

> Hi, @Legoktm. What do you mean by "recently"? Last year? Last month? An hour since the previous script run? Half a time until today?

The point is that it would be configurable by the person running the script. For the purposes of T157670, we'd probably use a timestamp a few months in the past.

> Also, a null edit is a very quick action, and I run null-edit bots a lot. Do you think null-editing a page takes longer than checking the timestamp difference, including retrieving the timestamp itself?

Checking whether the page was updated recently is way faster than just running the updates again. Making null edits may seem fast because the server defers some processing until later and tries to give you output as soon as it can, but when we're talking about millions of pages across all wikis, it quickly adds up.

Thank you, @Legoktm.

> The point is that it would be configurable by the person running the script. For the purposes of T157670, we'd probably use a timestamp a few months in the past.

And who is this person? It should run automatically, at regular intervals, shouldn't it?

> Checking whether the page was updated recently is way faster than just running the updates again. Making null edits may seem fast because the server defers some processing until later and tries to give you output as soon as it can, but when we're talking about millions of pages across all wikis, it quickly adds up.

Thanks, I see.

>> The point is that it would be configurable by the person running the script. For the purposes of T157670, we'd probably use a timestamp a few months in the past.

> And who is this person? It should run automatically, at regular intervals, shouldn't it?

For now it's me doing it manually, but in the future it should run automatically and regularly on some predetermined schedule.

> For now it's me doing it manually, but in the future it should run automatically and regularly on some predetermined schedule.

Very well, @Legoktm. So will there be an option to run it whenever I want and to set the "recently" threshold as I like? And, one last thing, what about when the script runs automatically on a schedule?

Just to be clear on what this is about, here's a link to the manual for refreshLinks.php
https://www.mediawiki.org/wiki/Manual:RefreshLinks.php

Can a developer please implement this task, which has been waiting for over five years? It is blocking the detection of Linter errors on Commons (and probably other MW sites), and it makes it so that tracking categories related to MW software changes take many months, sometimes years, to fill.

It seems like someone needs to set up some cron jobs.

Change 890139 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/core@master] Add --before-timestamp option to refreshLinks.php

https://gerrit.wikimedia.org/r/890139

> It is blocking the detection of Linter errors on Commons (and probably other MW sites), and it makes it so that tracking categories related to MW software changes take many months, sometimes years, to fill.

Just to set expectations, my understanding is that this will not immediately fix detection/updating of Linter errors, but it will take care of other tracking categories, etc. My understanding is that Parsoid and therefore Linter updates still happen outside of the normal refreshLinks process. This is being worked on in T320534: Put Parsoid output into the ParserCache on every edit, though it's not super clear to me if that task itself will take care of linter being updated on purge/refreshLinks or whether changes in Linter will be needed too (see my latest comment there).

Change 890139 merged by jenkins-bot:

[mediawiki/core@master] Add --before-timestamp option to refreshLinks.php

https://gerrit.wikimedia.org/r/890139