
Periodically run refreshLinks.php on production sites.
Open, Low, Public

Description

Forked from T132467#2674866:

Changes to MediaWiki code related to parsing can leave links tables out of date

This is a long-standing problem. See T69419 and this discussion from 2014.

Here's another example: I created "Category:Pages using invalid self-closed HTML tags" on 14 July 2016 on en.WP after a change to MW started adding a hidden category to articles with a specific kind of HTML error. As of today, 28 Sep 2016, there are pages on en.WP such as "Portal:East Frisia/Region" that have the error in their code and will properly appear in the category after a null edit, but that have not yet shown up in the category on their own.

That means that fundamentally, categories are not being applied to articles in a timely fashion until those articles are edited. By any measure, taking more than two months to properly apply maintenance categories to pages is a bug that needs to be fixed.

Is there some way that we can force all pages to be null-edited (or the equivalent) with a reasonable frequency? It is not happening right now.

[edited to add:] I have been informed that changes to MediaWiki code that result in maintenance categories and changes to templates/modules that result in category changes are processed differently. This may be two different feature requests.

@tstarling, @ssastry, and I discussed this a few days ago in #mediawiki-parsoid. Tim proposed modifying the refreshLinks.php script to support queuing jobs to update pages based on when they had last been parsed (using page_links_updated). After an initial run, we should set this up as a cron job.

In summary: Sometimes code changes will add new tracking categories or something. But until the page is edited, null edited, or purged with links update, the page will not show up in the category. The proposed solution for this is to run refreshLinks.php on a regular basis.
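As a rough sketch (assuming the page_links_updated column discussed below, and an arbitrary example cutoff), the selection such a periodic run would make is essentially:

-- Sketch only: pages whose links tables have not been refreshed since a cutoff.
-- refreshLinks.php would queue RefreshLinks jobs for these pages in batches.
select page_id, page_namespace, page_title
from page
where page_links_updated is null
   or page_links_updated < '20200101000000'
order by page_links_updated
limit 1000;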


Event Timeline


This is not really a task you make progress on, I'm afraid. It's more a "Rain may cause flooding" task, with some issue-amelioration sub-tasks proposed underneath. It certainly won't ever be "done", unless we radically re-think what MediaWiki is and re-write it.

Pleeeeeeeeeeeese, find a way. I ran a null-edit script a couple of weeks ago on hewiki. The number of Linter problems grew by 50,000.

Even a one-time update might be useful with the Linter changes. Or perhaps we could run a global null-edit bot?

Can I be wild? Or stupid? Is it possible to have the dump bot do a force-update on the fly for each page it copies, twice a month?

Even a one-time update might be useful with the Linter changes.

This is in the works. @Legoktm and I talked about this at wikimania hackathon and we might kick this off in the coming week.

The implementation plan I propose is:

This sounds great. One month seems like a reasonable update time.

Bumping this. Can we get this running, please?

We still need this. Pages are still showing up every day in the Linter error page lists (for en.WP), even though the Linter lists were created many months ago. We can't fix the errors if they are not on a list or a category somewhere.

@ssastry: Are there any updated plans? Asking as you commented on T157670#3531831 - thanks!

With @Legoktm no longer on the team, we haven't been able to make progress on this. Added Platform Engineering to see if they have any thoughts about this task or if they are able to tackle it. Otherwise, this will have to wait for us to get back into this code again -- which we will do in the next 12+ months as we upgrade Parsoid and better integrate it with core functionality to ensure it does everything the current core parser does.

daniel renamed this task from "Changes to MediaWiki code related to parsing can leave links tables out of date" to "Periodically run refreshLinks.php on production sites". Jan 28 2021, 9:34 AM
daniel updated the task description. (Show Details)

Doing this for ALL pages isn't feasible; a rough estimate comes to about 30 years of rendering time (a billion pages at one per second). We'll have to filter at least by namespace and by category. Which means we can't just run it periodically. It'll be a manual thing every time.

Doing this for ALL pages isn't feasible; a rough estimate comes to about 30 years of rendering time (a billion pages at one per second). We'll have to filter at least by namespace and by category. Which means we can't just run it periodically. It'll be a manual thing every time.

This hypothetical math is unhelpful. Way back in 2017, four years ago, there were 12 million pages in the "NULL" group on en.WP. Null-editing them at one page per second would have taken 139 days. That never happened; here we are four years later, and the problem is no closer to being solved.

Doing this for ALL pages isn't feasible; a rough estimate comes to about 30 years of rendering time (a billion pages at one per second). We'll have to filter at least by namespace and by category. Which means we can't just run it periodically. It'll be a manual thing every time.

The point isn't to run it for all pages anyways. We already track the last time a page was purged, so we only need to refresh pages that haven't been touched for a certain amount of time (if we set the threshold at years that would be a significant start). The main technical work to be done here is T159512: Add option to refreshLinks.php to only update pages that haven't been updated since a timestamp, then we just set up a cron job and monitor.

In the PET tech planning meeting, Tim just had the idea that we could filter by page_links_updated, so we'd only reparse things that haven't been re-parsed since a given date. That may make it more feasible.

EDIT: Oh, I suppose what @Legoktm meant by "tracking when the page was purged" above.

In the PET tech planning meeting, Tim just had the idea that we could filter by page_links_updated, so we'd only reparse things that haven't been re-parsed since a given date. That may make it more feasible.

EDIT: Oh, I suppose what @Legoktm meant by "tracking when the page was purged" above.

This is also in the task description, where Tim said the exact same thing in 2017.... :)

Here's what page_links_updated looks like on enwiki now. I added page lengths to give an idea of the work involved in parsing them.

MariaDB [enwiki]> select left(page_links_updated,4) as year,count(*),avg(page_len),sum(page_len) from page where page_random between 0.001 and 0.002 group by year;
+------+----------+---------------+---------------+
| year | count(*) | avg(page_len) | sum(page_len) |
+------+----------+---------------+---------------+
| NULL |     2821 |      719.9596 |       2031006 |
| 2010 |        5 |      145.8000 |           729 |
| 2011 |        9 |      116.0000 |          1044 |
| 2012 |       11 |      111.1818 |          1223 |
| 2013 |       16 |       85.2500 |          1364 |
| 2014 |     1139 |     1619.6558 |       1844788 |
| 2015 |     1528 |     1383.2808 |       2113653 |
| 2016 |     1149 |     1116.9634 |       1283391 |
| 2017 |     1846 |     1817.3922 |       3354906 |
| 2018 |     1507 |     2056.5448 |       3099213 |
| 2019 |     6557 |      938.2196 |       6151906 |
| 2020 |    23928 |     2034.6179 |      48684337 |
| 2021 |    12018 |     5909.8995 |      71025172 |
+------+----------+---------------+---------------+
13 rows in set (0.78 sec)

So if we refreshed up to 2020-01-01, including nulls, that would be about 16.6M pages and 18.5 GB of wikitext.
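(For reference, a single query over the same 0.1% sample gives roughly that estimate once scaled up by about 1000; the cutoff below is just the 2020-01-01 boundary:)

select count(*) as sample_pages, sum(page_len) as sample_bytes
from page
where page_random between 0.001 and 0.002
  and (page_links_updated is null or page_links_updated < '20200101000000');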

How does 16.6M pages and 18.5 GB of wikitext compare to how many pages and GB are edited or otherwise refreshed per day or month on enwiki?

Edits alone are something like 156 GB per month. The average revision size is much larger than the average page size, because larger pages are edited more frequently.

MariaDB [enwiki]> select count(*),avg(rev_len),sum(rev_len) from revision where rev_id between 991616250 and 997529717;
+----------+--------------+--------------+
| count(*) | avg(rev_len) | sum(rev_len) |
+----------+--------------+--------------+
|  5789860 |   28859.2656 | 167091107733 |
+----------+--------------+--------------+
1 row in set (5.31 sec)

I used rev_id values corresponding to the start and end of December 2020, because the rev_timestamp index is probably cold and slow.

So that looks like only 10% more edit volume (on en.WP) over the course of a month to get caught up, and then something similar to that each month to keep pages current (articles could be kept "more current" than other namespaces, if necessary to control loads). Can someone please kick off this process? Thanks!

@daniel I don't see why the Tech Decision process needs to be used for this.

Can someone please set up a cron job or whatever for this process, at least on en.WP? It is still needed very much. We wait weeks or months for categories to fill up after changes are made that affect pages with transclusions, and even longer when the change is done in MediaWiki code.

Just to make sure this is documented for WMF engineers: apparently there is a community bot running to slowly null-edit 2.7 million enwiki pages that have never had their page_links_updated value set.

https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&oldid=1077595069#A_page_is_populating_a_hidden_maintenance_category,_but_the_category_is_empty!

Just to make sure this is documented for WMF engineers: apparently there is a community bot running to slowly null-edit 2.7 million enwiki pages that have never had their page_links_updated value set.

https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&oldid=1077662724#A_page_is_populating_a_hidden_maintenance_category,_but_the_category_is_empty!

I created an anchor so that section link would work! Exclamation point broke it! My copy of the link above works.

Tee hee. It's down to 1.9 million left to go. Let me know if I'm causing the servers any pain. I'm not aware of any way I can tell, other than slow response times. It certainly isn't stressing out my decade-old Wintel desktop in the least.

In just over ten days I've cleared the NULL members of page_links_updated on enwiki, but for 16 pages that can't be purged, possibly due to random sunspot activity.

Can a system administrator clean these up? Thanks

Those 16 pages have in common that they do not have a revision attached to them (empty page history).

They each have page_latest = 0 set in the database (https://quarry.wmcloud.org/query/10296). I wouldn't worry about these in the context of this task. Being re-rendered is the least of those pages' problems, as they have no renderable content right now in the first place. For more information about these, see T115081 and T261797.

LOL! I found the Sweet Sixteen! American college basketball fans, the round to determine the Elite Eight kicks off in just a few hours!

Hey if there's still just the same 16 there were over six years ago, I'd assume that whatever caused that isn't going to happen again. You need a script to fix them? I'd say all you should need is the suitcase with the secret access codes that allows you to go inside and just tweak a few bits. Use a hex editor if necessary. But OK, right, I'm not worrying about this any more.

I have a suspect for the cause of "the 16". Noting that the T115081 report was filed on Oct 8 2015: that is just when someone first noticed and reported it. There is a bulge of User and especially User talk pages to update with page_links_updated = 201505, which is about five months before the bad data was reported. The bulge is caused by a ton of users with names ending in ~enwiki, so my suspect is the project that implemented unified login, or SUL (single user login).

And now I present the Super Six! These six pages have a special power that makes them highly resistant to conventional null-edit purging. All six have in common that they are the same 6 bytes long: {{OW}}, i.e. they transclude Template:OW. This template has over 615,000 transclusions on English Wikipedia, which are generally purge-compliant, except for these six. HERE we see an editor's work-around for the page's purge resistance: she actually edited the page to make a redirect to the template, in order to force the purge, and then self-reverted back to {{OW}}.

I presume this problem has been previously reported, in which case someone may reply with a link to the relevant Phab task.

What is the difference between refreshLinks.php (manual) and purgePage.php (manual)?

purgePage.php is the equivalent to action=purge (PurgeAction.php) so when I run my bot that uses API:Purge I presume I'm using equivalent functionality. This seems to adequately get the job done. The only thing I'm missing to fully automate my bot is an API for generating the list of page IDs to work on. I currently only know how to do that manually by making Quarry queries and downloading the results to a file on my PC which my bot then reads.
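(For illustration, a Quarry query of roughly this shape can generate such a list; the namespace filter, cutoff, and limit are arbitrary examples rather than the exact query used:)

-- Oldest-first batch of article page IDs that have not been refreshed since the cutoff.
select page_id
from page
where page_namespace = 0
  and (page_links_updated is null or page_links_updated < '20200101000000')
order by page_links_updated
limit 5000;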

So what does refreshLinks.php do that purgePage.php doesn't? What does purgePage.php do that refreshLinks.php doesn't? Which uses more system resources? It's not clear to me from reading the code.

(edit) I just found ApiPurge.php. So, extending my questions above... what's the difference in what ApiPurge.php does or doesn't do vs. what the two maintenance scripts do, and how does each compare in system resources used?

P.S. I see that you're logging my activity "to better see expensive usage patterns" ;)

refreshLinks is much finer grained, and (in theory) just updates the links tables, which are the ones that are out of date per the bug description. Purging, on the other hand, aims to reset everything about the page's state in various caches, like the parser cache, and anything that varies on page_touched. When you use &forcelinkupdate=1, that just runs the same refreshLinks code on top of the purge. So in terms of system usage, just doing a normal refreshLinks is all that's really needed. There's no need for a script to regularly purge old pages, because the things it invalidates would have naturally expired out of the cache by then. I hope that clarifies things; let me know if it doesn't.

Thanks. That does clarify things at a high level; I'm still foggy on the details.

How hard would it be to:

(1) Add forcerecursivelinkupdateonly to API:Purge or create API:RefreshLinks

(2) Create Special:AncientLinks which would be akin to Special:AncientPages and could be used as a generator parameter

Then this problem can be better mitigated by community-run bots.
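(For item (2), a Special:AncientLinks-style report would presumably boil down to a query of this shape over the page table, analogous to Special:AncientPages but keyed on page_links_updated; a sketch only:)

-- Pages with the stalest links tables; rows with NULL page_links_updated
-- (never refreshed) would sort first in ascending order.
select page_namespace, page_title, page_links_updated
from page
order by page_links_updated asc
limit 50;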

What's up? I just tried querying the database, which has been working for me for months, and see a query failed error:

View 'enwiki_p.page' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them

page_links_updated by date, mainspace

I'm getting this error from Quarry, MySQL Workbench and my PHP script.

Just as I was finishing up the final touches of my automated solution for this issue.

Hopefully this is just a transient problem and not something more serious.

I'm also experiencing problems with two bots on en.wiki, which haven't issued their regularly scheduled reports, and I was receiving "overflow" messages when I tried to look at page histories earlier today. Are these connected?

It would be great to make some progress on this task. On en.WP, Category:Pages using ISBN magic links is still populating, seven months after the MediaWiki code change that created it. This bug is blocking implementation of things like the Tidy conversion and removal of magic links.

This is not really a task you make progress on, I'm afraid. It's more a "Rain may cause flooding" task, with some issue-amelioration sub-tasks proposed underneath. It certainly won't ever be "done", unless we radically re-think what MediaWiki is and re-write it. Can you link to the tasks you think are blocked by this so we can look for a way around this?

Now it's raining more often than not on enwiki. I've noticed that my bot may occasionally cause local flooding, though it's hard to tell because I lack good flood-monitoring tools. It's just a matter of waiting for the flooding to recede and then progress resumes. If my bot's link purges that actually change links can be identified and reported, then we would have the data needed to develop new issue-amelioration sub-tasks.

I have recently started working on Linter errors on Commons, and some template and module updates are causing pages to appear in formerly empty error reports. This tells me that the links table is out of date and needs to be refreshed by MediaWiki. In one case, https://commons.wikimedia.org/wiki/User:Maarten_Sepp/sandbox appeared in https://commons.wikimedia.org/wiki/Special:LintErrors/unclosed-quotes-in-heading?namespace=2 which had previously been empty. That sandbox page had not been edited since 2014. Linter error tracking has existed since 2018 or earlier.

We have hacked together a bot to keep pages up to date at en.WP, but the volunteer community should not have to work around this software problem. Please fix this problem at a system level.

Here's what page_links_updated looks like on enwiki now. I added page lengths to give an idea of the work involved in parsing them.

MariaDB [enwiki]> select left(page_links_updated,4) as year,count(*),avg(page_len),sum(page_len) from page where page_random between 0.001 and 0.002 group by year;
+------+----------+---------------+---------------+
| year | count(*) | avg(page_len) | sum(page_len) |
+------+----------+---------------+---------------+
| NULL |     2821 |      719.9596 |       2031006 |
| 2010 |        5 |      145.8000 |           729 |
| 2011 |        9 |      116.0000 |          1044 |
| 2012 |       11 |      111.1818 |          1223 |
| 2013 |       16 |       85.2500 |          1364 |
| 2014 |     1139 |     1619.6558 |       1844788 |
| 2015 |     1528 |     1383.2808 |       2113653 |
| 2016 |     1149 |     1116.9634 |       1283391 |
| 2017 |     1846 |     1817.3922 |       3354906 |
| 2018 |     1507 |     2056.5448 |       3099213 |
| 2019 |     6557 |      938.2196 |       6151906 |
| 2020 |    23928 |     2034.6179 |      48684337 |
| 2021 |    12018 |     5909.8995 |      71025172 |
+------+----------+---------------+---------------+
13 rows in set (0.78 sec)

So if we refreshed up to 2020-01-01, including nulls, that would be about 16.6M pages and 18.5 GB of wikitext.

Here's what page_links_updated looks like on commonswiki now, per this Quarry query

+------+----------+---------------+---------------+
| year | count(*) | avg(page_len) | sum(page_len) |
+------+----------+---------------+---------------+
| NULL |      133 |      156.2256 |         20778 |
| 2014 |      603 |       82.5721 |         49791 |
| 2015 |      466 |      497.2062 |        231698 |
| 2016 |      524 |      230.8131 |        120946 |
| 2017 |      598 |      330.2324 |        197479 |
| 2018 |     4624 |      254.5774 |       1177166 |
| 2019 |     2257 |      361.4196 |        815724 |
| 2020 |    22137 |      698.3867 |      15460187 |
| 2021 |    11692 |      797.1270 |       9320009 |
| 2022 |    73284 |      950.8747 |      69683903 |
+------+----------+---------------+---------------+

For better or worse @Wbm1058's bot has proven that nothing will fall over if we refresh pages frequently (better I think).

MariaDB [enwiki_p]> select left(page_links_updated,6) as month,count(*),avg(page_len),sum(page_len) from page where page_random between 0.001 and 0.002 group by month;
+--------+----------+---------------+---------------+
| month  | count(*) | avg(page_len) | sum(page_len) |
+--------+----------+---------------+---------------+
| 202211 |      267 |     1021.1273 |        272641 |
| 202212 |    12320 |      939.4024 |      11573438 |
| 202301 |    18036 |     1709.0089 |      30823685 |
| 202302 |    27044 |     3780.0407 |     102227421 |
+--------+----------+---------------+---------------+
4 rows in set (0.569 sec)

(I'm assuming that all pages since Oct. 2022 have been updated since and it's not just a bad random sample.)

So once T159512 is reviewed+merged I guess we could do something like foreachwiki refreshLinks.php --before-timestamp $(date --utc --date '4 months ago' '+%Y%m%d%H%M%S').

We'll probably want to do some manual runs with larger before timestamps on the very large wikis like Commons and Wikidata first though.
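(A simple sanity check after such a run would be to count what remains older than the cutoff; the fixed timestamp below is only an illustration standing in for the computed --before-timestamp value:)

select count(*) as stale_pages
from page
where page_links_updated is null
   or page_links_updated < '20221101000000';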

(I'm assuming that all pages since Oct. 2022 have been updated since and it's not just a bad random sample.)

As I've pointed out previously in this Phab, except for the NULL Sweet 16 and the purge-resistant Super 6.

Today, all but 22 pages before November 30, 2022 have been refreshed, and tomorrow likely all but 22 pages before December 1, 2022 will have been refreshed, as my bot, which is now running as a continuous job on Toolforge Kubernetes, is currently refreshing November 30. For better or worse. Thanks.

Page links updated by date, all namespaces

OH MY. I just refreshed the NULL list and it's grown to 30 pages.

Krinkle added subscribers: Marostegui, Ladsgroup.

As a maintenance script, its main cost is local CPU, isolated to the mwmaint server, which seems easy to reason about. So long as it remains serial, it should also naturally throttle its read load onto memcached and MySQL, especially given it'll have notable CPU work between I/O.

The rate of writes to MySQL is harder to reason about, and might not be comparable to a purge bot given different levels of efficiency there. It'll be a lot faster. That's something to monitor on the MySQL aggregate dashboards in Grafana. E.g. if we're talking more than a 10% continuous increase on the baseline of some clusters, that's something to think about.

Lastly, and the main reason for being in this ticket, is impact on ParserCache. We decided a few years ago to have RefreshLinks no longer save its ParserOutput to the ParserCache for reuse by web traffic as it added pressure to its limited storage capacity for what was a low hit rate. I believe this is likely to change again in the future. I suspect there will likely be at least an order of magnitude difference between storing PO for 30 days from all edits and their cascading refreshLinks jobs vs storing it also for all other pages in existence when triggered by this maintenance script. As such, we may want a --no-parsercache-write option or something (cc @Ladsgroup, @Marostegui), even though it wouldn't do anything yet today.

The rate of writes to MySQL is harder to reason about, and might not be comparable to a purge bot given different levels of efficiency there. It'll be a lot faster. That's something to monitor on the MySQL aggregate dashboards in Grafana. E.g. if we're talking more than a 10% continuous increase on the baseline of some clusters, that's something to think about.

Ack.

Lastly, and the main reason for being in this ticket, is impact on ParserCache. We decided a few years ago to have RefreshLinks no longer save its ParserOutput to the ParserCache for reuse by web traffic as it added pressure to its limited storage capacity for what was a low hit rate. I believe this is likely to change again in the future. I suspect there will likely be at least an order of magnitude difference between storing PO for 30 days from all edits and their cascading refreshLinks jobs vs storing it also for all other pages in existence when triggered by this maintenance script. As such, we may want a --no-parsercache-write option or something (cc @Ladsgroup, @Marostegui), even though it wouldn't do anything yet today.

That makes sense, though I'm not sure we need a --no-parser-cache-write option, I think that should just be the default behavior of refreshLinks.php whenever said functionality is reimplemented. Given that during normal script operation the point is to refresh the entire wiki, or with the new flags, just old pages, I can't think of a case we'd actually want it to populate ParserCache.

As a maintenance script, its main cost is local CPU, isolated to the mwmaint server, which seems easy to reason about. So long as it remains serial, it should also naturally throttle its read load onto memcached and MySQL, especially given it'll have notable CPU work between I/O.

This is also not super great. mwmaint is a SPOF and can easily be overwhelmed by running too many resource-intensive scripts. Some of the scripts we run on mwmaint are extremely important (like deleting old PII data to keep us compliant with the privacy policy). My ideal future is that a maint script would trigger a container in wikikube with the main process of the php cli, and once done, it is destroyed. This is similar to what Google is doing (according to their SRE book).

The rate of writes to MySQL is harder to reason about, and might not be comparable to a purge bot given different levels of efficiency there. It'll be a lot faster. That's something to monitor on the MySQL aggregate dashboards in Grafana. E.g. if we're talking more than a 10% continuous increase on the baseline of some clusters, that's something to think about.

Lastly, and the main reason for being in this ticket, is impact on ParserCache. We decided a few years ago to have RefreshLinks no longer save its ParserOutput to the ParserCache for reuse by web traffic as it added pressure to its limited storage capacity for what was a low hit rate.

Honestly, given the complexity in MW of dealing with parsed output, I'm suspicious that it is still storing it in PC unintentionally. It uses the ParserOutputAccess object, and POA stores it without even asking or telling. I need to double check.

I believe this is likely to change again in the future. I suspect there will likely be at least an order of magnitude difference between storing PO for 30 days from all edits and their cascading refreshLinks jobs vs storing it also for all other pages in existence when triggered by this maintenance script. As such, we may want a --no-parsercache-write option or something (cc @Ladsgroup, @Marostegui), even though it wouldn't do anything yet today.

In terms of storage, if Wikidata and Commons are explicitly excluded, we should be okay (each of these wikis has north of 100M pages; currently those two alone are responsible for the majority of entries in PC, especially given the language fragmentation of Commons). I would also be okay with an option to write to PC in case of a slow parse (>1s).

Another point to consider is that there are currently many changes happening on PC: fixing the fragmentation of mobile (T326147), the RESTBase shutdown leading to entries being stored in PC (T320534, etc.), ramping the expiry back up to 30 days (T280604), and replacing the old parser with Parsoid for reads, which needs a migration period of storing both; longer term I'm also hoping to expand PC to four or five shards. So I highly recommend waiting, to be able to isolate changes on PC and measure the impact.


Here is my alternative proposal: work on the refreshLinks job to make it more efficient. In years of dealing with it, I have come to the conclusion that it's terribly broken:

It's really hard to debug (and to see inside it using something like xhgui) given the async nature of jobs, so I'm not even 100% sure the issues I mentioned are valid, and I can't write solutions for them easily.

One thing that would make refreshLinks much faster: once the new Parsoid is deployed, we could explore the concept of fragmented/partial reparsing of pages. So for example, if Template:Infobox is edited, only the lead section would get reparsed (instead of the status quo: reparse the whole page, which for a large wiki can take weeks), making it a lot more efficient and drastically increasing its throughput. That's at least two years away though :(

This is also not super great. mwmaint is a SPOF and can easily be overwhelmed by running too many resource-intensive scripts. Some of the scripts we run on mwmaint are extremely important (like deleting old PII data to keep us compliant with the privacy policy). My ideal future is that a maint script would trigger a container in wikikube with the main process of the php cli, and once done, it is destroyed. This is similar to what Google is doing (according to their SRE book).

We can run it at the lowest nice level or with some other limiter. If mwmaint is getting overloaded, that seems like a reason to set up a second mwmaint server, not to block other scripts that we need to fix consistency issues...

Honestly, given the complexity in MW of dealing with parsed output, I'm suspicious that it is still storing it in PC unintentionally. It uses the ParserOutputAccess object, and POA stores it without even asking or telling. I need to double check.

I'll test this but maybe we can have the refreshLinks.php script disable the ParserCacheFactory service or something for more confidence? Or maybe a null implementation?

Another point to consider is that there are currently many changes happening on PC: fixing the fragmentation of mobile (T326147), the RESTBase shutdown leading to entries being stored in PC (T320534, etc.), ramping the expiry back up to 30 days (T280604), and replacing the old parser with Parsoid for reads, which needs a migration period of storing both; longer term I'm also hoping to expand PC to four or five shards. So I highly recommend waiting, to be able to isolate changes on PC and measure the impact.

This script really should have no impact on ParserCache, if it does, I'd consider that a bug to fix before we start running it.

Here is my alternative proposal: work on the refreshLinks job to make it more efficient. In years of dealing with it, I have come to the conclusion that it's terribly broken:
<snip>

I don't disagree with anything you've proposed but I also don't think it should block this.