
Periodically run refreshLinks.php on production sites.
Open, Low, Public

Description

Forked from T132467#2674866:

Changes to MediaWiki code related to parsing can leave links tables out of date

This is a long-standing problem. See T69419 and this discussion from 2014.

Here's another example: I created "Category:Pages using invalid self-closed HTML tags" on 14 July 2016 on en.WP after a change to MW started adding a hidden category to articles with a specific kind of HTML error. As of today, 28 Sep 2016, there are pages on en.WP such as "Portal:East Frisia/Region" that have the error in their code and will properly appear in the category after a null edit, but that have not yet shown up in the category on their own.

That means that fundamentally, categories are not being applied to articles in a timely fashion until those articles are edited. By any measure, taking more than two months to properly apply maintenance categories to pages is a bug that needs to be fixed.

Is there some way that we can force all pages to be null-edited (or the equivalent) with a reasonable frequency? It is not happening right now.

[edited to add:] I have been informed that changes to MediaWiki code that result in maintenance categories and changes to templates/modules that result in category changes are processed differently. This may be two different feature requests.
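
As a concrete illustration of the lag described above, a query along the following lines can list pages that cannot yet appear in the self-closed-tags tracking category simply because their links tables have not been refreshed since the category was introduced. This is only a sketch: the category key ('Pages_using_invalid_self-closed_HTML_tags') and the 14 July 2016 cutoff are taken from the example above, and the query finds candidates that have not been reparsed, not pages that necessarily contain the error.

-- Sketch: pages last reparsed (or never reparsed) before the tracking category existed,
-- and therefore not yet listed in it, whether or not they actually contain the error.
SELECT p.page_id, p.page_title
FROM page p
LEFT JOIN categorylinks c
  ON c.cl_from = p.page_id
 AND c.cl_to = 'Pages_using_invalid_self-closed_HTML_tags'
WHERE (p.page_links_updated IS NULL OR p.page_links_updated < '20160714000000')
  AND c.cl_from IS NULL
LIMIT 20;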

@tstarling, @ssastry, and I discussed this a few days ago in #mediawiki-parsoid. Tim proposed modifying the refreshLinks.php script to support queuing jobs to update pages based on when they were last parsed (using page_links_updated). After an initial run, we should set this up as a cron job.

In summary: sometimes code changes add new tracking categories (or otherwise change what the parser writes to the links tables), but until a page is edited, null-edited, or purged with a links update, it will not show up in the category. The proposed solution is to run refreshLinks.php on a regular basis.
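
The selection such a periodic run needs is essentially "pages whose links tables were last refreshed before some cutoff". A rough sketch of that selection in SQL (the 90-day cutoff and the batch size are illustrative examples, not decided values):

-- Sketch: the batch a periodic refreshLinks run would enqueue, oldest first.
SELECT page_id
FROM page
WHERE page_links_updated IS NULL
   OR page_links_updated < DATE_FORMAT(NOW() - INTERVAL 90 DAY, '%Y%m%d%H%i%S')
ORDER BY page_links_updated
LIMIT 1000;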

Event Timeline

https://quarry.wmflabs.org/query/16287 shows when pages on enwiki last had their link tables refreshed (the NULL rows are brand-new pages or pages that haven't been touched since before roughly 2015, though there are some oddities in the table).

Hi. I have sometimes seen years pass between the creation of a new tracking category and pages actually appearing in it, and only after a null edit. Last year I spent a month making null edits by bot to the pages on hewiki that were not heavily protected, and then did the rest manually. I would be happy if this could be done automatically at least once a month. Thank you.
And @Jonesey95, please take note.

Thanks for forking this into its own bug. An update: I am still finding pages with self-closed HTML tag errors that have not made it into the error category. A null edit puts them in the category. We are now at eight months after the MW code change that created this error category.

Lego, do you suppose that the proportion of un-updated pages at enwiki is better (e.g., because they have some null-edit bots), worse (e.g., because it's huge), or about average compared to the smaller/mostly inactive wikis?

I believe that the proportion of pages missing from the error category at en.WP is much lower because a few of us have done systematic insource searches for a long list of regex patterns and fixed errors that we found. I searched for many of these patterns a few months ago and found a large number of pages that were not in the error category. Now that the searches have been done, there should be very few pages left to add to the error category.

I have recently been searching for a few of the most common patterns on other wikis and have found many pages not yet in the error category for those wikis. Viewing the page shows the error category, but a null edit to the page is required to get the category page to show that page with the error.

mysql:wikiadmin@db1082 [dewiki]> select count(*), SUBSTR(page_links_updated, 1,6) from page group by SUBSTR(page_links_updated, 1,6) order by SUBSTR(page_links_updated, 1,6) desc;
+----------+---------------------------------+
| count(*) | SUBSTR(page_links_updated, 1,6) |
+----------+---------------------------------+
|   677131 | 201702                          |
|   983139 | 201701                          |
|   979130 | 201612                          |
|   203688 | 201611                          |
|    62137 | 201610                          |
|    40059 | 201609                          |
|    75238 | 201608                          |
|  2794596 | 201607                          |
|       15 | 201606                          |
|        7 | 201605                          |
|        6 | 201604                          |
|        4 | 201603                          |
|        9 | 201602                          |
|      107 | 201601                          |
|       18 | 201512                          |
|       12 | 201511                          |
|       23 | 201510                          |
|       30 | 201509                          |
|        5 | 201508                          |
|       18 | 201507                          |
|        5 | 201506                          |
|        6 | 201505                          |
|       16 | 201504                          |
|       11 | 201503                          |
|       16 | 201502                          |
|        8 | 201501                          |
|       15 | 201412                          |
|       13 | 201411                          |
|       21 | 201410                          |
|       16 | 201409                          |
|        8 | 201408                          |
|       17 | 201407                          |
|       41 | 201406                          |
|       14 | 201405                          |
|        6 | 201404                          |
|       21 | 201403                          |
|        9 | 201402                          |
|        9 | 201401                          |
|      994 | NULL                            |
+----------+---------------------------------+
39 rows in set (2.90 sec)

mysql:wikiadmin@db1082 [dewiki]> Bye
legoktm@terbium:~$ sql enwiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 310553585
Server version: 5.5.5-10.0.28-MariaDB MariaDB Server

Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:wikiadmin@db1083 [enwiki]> select count(*), SUBSTR(page_links_updated, 1,6) from page group by SUBSTR(page_links_updated, 1,6) order by SUBSTR(page_links_updated, 1,6) desc;
+----------+---------------------------------+
| count(*) | SUBSTR(page_links_updated, 1,6) |
+----------+---------------------------------+
|  9131803 | 201702                          |
|  4823578 | 201701                          |
|  3586662 | 201612                          |
|  1854227 | 201611                          |
|  1022701 | 201610                          |
|   825076 | 201609                          |
|  1025250 | 201608                          |
|   716416 | 201607                          |
|   306360 | 201606                          |
|   173934 | 201605                          |
|   194121 | 201604                          |
|   149071 | 201603                          |
|   191523 | 201602                          |
|   154113 | 201601                          |
|   132802 | 201512                          |
|   329800 | 201511                          |
|   154395 | 201510                          |
|   169362 | 201509                          |
|   140473 | 201508                          |
|   147591 | 201507                          |
|   195598 | 201506                          |
|   384558 | 201505                          |
|   636823 | 201504                          |
|   124076 | 201503                          |
|   116068 | 201502                          |
|   122729 | 201501                          |
|   124068 | 201412                          |
|   113882 | 201411                          |
|   136086 | 201410                          |
|   128164 | 201409                          |
|   828887 | 201408                          |
|   266600 | 201407                          |
|   220104 | 201406                          |
|   193860 | 201405                          |
|   330776 | 201404                          |
|   316919 | 201403                          |
|    96540 | 201402                          |
|    82552 | 201401                          |
|        1 | 201307                          |
|        1 | 201302                          |
|        1 | 201207                          |
|        2 | 201206                          |
|        1 | 201205                          |
|        6 | 201204                          |
|        1 | 201203                          |
|        1 | 201112                          |
|        1 | 201109                          |
|        1 | 201108                          |
|        1 | 201105                          |
|        1 | 200912                          |
|        2 | 200910                          |
|        2 | 200908                          |
|        1 | 200905                          |
|        1 | 200811                          |
|        2 | 200810                          |
|        1 | 200803                          |
|        1 | 200802                          |
|        1 | 200712                          |
|        1 | 200711                          |
|        4 | 200710                          |
|        6 | 200708                          |
|        5 | 200707                          |
|        5 | 200706                          |
|        6 | 200705                          |
|        6 | 200704                          |
|        5 | 200703                          |
|        4 | 200702                          |
|        6 | 200701                          |
|       13 | 200612                          |
|        6 | 200611                          |
|       13 | 200610                          |
|        3 | 200609                          |
|        8 | 200608                          |
|        8 | 200607                          |
|        7 | 200606                          |
|        9 | 200605                          |
|        4 | 200604                          |
|       36 | 200603                          |
|        1 | 200403                          |
|        1 | 200307                          |
|     1795 | 196912                          |
| 11902661 | NULL                            |
+----------+---------------------------------+
82 rows in set (19.96 sec)

mysql:wikiadmin@db1083 [enwiki]> Bye
legoktm@terbium:~$ sql cebwiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 3481092268
Server version: 5.5.5-10.0.23-MariaDB-log MariaDB Server

Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:wikiadmin@db1078 [cebwiki]> select count(*), SUBSTR(page_links_updated, 1,6) from page group by SUBSTR(page_links_updated, 1,6) order by SUBSTR(page_links_updated, 1,6) desc;
+----------+---------------------------------+
| count(*) | SUBSTR(page_links_updated, 1,6) |
+----------+---------------------------------+
|  2565556 | 201702                          |
|   753395 | 201701                          |
|  1270082 | 201612                          |
|    82876 | 201611                          |
|    34647 | 201610                          |
|   170590 | 201609                          |
|   181793 | 201608                          |
|    13883 | 201607                          |
|    81862 | 201606                          |
|    30110 | 201605                          |
|    53880 | 201604                          |
|    28122 | 201603                          |
|     8549 | 201602                          |
|    79968 | 201601                          |
|   168174 | 201512                          |
|   116387 | 201511                          |
|    10428 | 201510                          |
|     2345 | 201509                          |
|     5766 | 201508                          |
|      332 | 201507                          |
|     1813 | 201506                          |
|     4897 | 201505                          |
|     4936 | 201504                          |
|      307 | 201503                          |
|       56 | 201502                          |
|      371 | 201501                          |
|      357 | 201412                          |
|     3129 | 201411                          |
|    81091 | 201410                          |
|    64253 | 201409                          |
|   124386 | 201408                          |
|   181274 | 201407                          |
|   159993 | 201406                          |
|     1185 | 201405                          |
|      988 | 201404                          |
|      538 | 201403                          |
|      434 | 201402                          |
|      394 | 201401                          |
|        5 | 196912                          |
|   923899 | NULL                            |
+----------+---------------------------------+
40 rows in set (3.38 sec)

mysql:wikiadmin@db1078 [cebwiki]> Bye
legoktm@terbium:~$ sql svwiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 2879545082
Server version: 5.5.5-10.0.23-MariaDB-log MariaDB Server

Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:wikiadmin@db1090 [svwiki]> select count(*), SUBSTR(page_links_updated, 1,6) from page group by SUBSTR(page_links_updated, 1,6) order by SUBSTR(page_links_updated, 1,6) desc;
+----------+---------------------------------+
| count(*) | SUBSTR(page_links_updated, 1,6) |
+----------+---------------------------------+
|  1947267 | 201702                          |
|  1698551 | 201701                          |
|   425565 | 201612                          |
|    58575 | 201611                          |
|    90118 | 201610                          |
|   144255 | 201609                          |
|    30793 | 201608                          |
|    46417 | 201607                          |
|   108031 | 201606                          |
|   106669 | 201605                          |
|   157047 | 201604                          |
|    35975 | 201603                          |
|    49793 | 201602                          |
|    65498 | 201601                          |
|    95184 | 201512                          |
|   158946 | 201511                          |
|    12623 | 201510                          |
|    51765 | 201509                          |
|    14116 | 201508                          |
|    10554 | 201507                          |
|     8144 | 201506                          |
|    12809 | 201505                          |
|    42067 | 201504                          |
|    12939 | 201503                          |
|     5714 | 201502                          |
|     8778 | 201501                          |
|    10396 | 201412                          |
|    11521 | 201411                          |
|    26481 | 201410                          |
|    85875 | 201409                          |
|   121330 | 201408                          |
|   264153 | 201407                          |
|    87000 | 201406                          |
|     9954 | 201405                          |
|     6608 | 201404                          |
|     6264 | 201403                          |
|    10429 | 201402                          |
|     5020 | 201401                          |
|        1 | 201212                          |
|        8 | 200911                          |
|        1 | 200707                          |
|       64 | 196912                          |
|  1515259 | NULL                            |
+----------+---------------------------------+
43 rows in set (3.61 sec)

Those are the four largest Wikipedias. Someone on de.wp probably ran a null-edit bot, but otherwise they all look relatively similar. Someone could do more analysis on these numbers, but I think it would be trivial to force millions of pages to re-render just by editing a few highly used templates (though I would NOT recommend that).

Legoktm, thanks for tracking down this information. I agree with your judgement that editing a few highly used templates would achieve similar results to null-editing pages, but it would still take a long time (see the bug from which this task was forked for more information), it would clog up the job queue, and it would not systematically null-edit every page.

I wonder if someone on de.WP ran a null edit bot through all of the templates in July 2016, which is when the self-closing tag error category was added. That might explain the results above, where there are a couple thousand stragglers.

Lego, do you think that refreshLinks.php could be run soon? (I'm personally more concerned about the smaller wikis, since they're less likely to have access to a null-edit bot than the big ones.)

I'm still waiting on https://gerrit.wikimedia.org/r/#/c/337539/1 to get merged, then I can start running it for smaller wikis while we improve the script to be a little more conservative on larger wikis.

So, if I understand the schedule correctly, the best-case scenario is that it rides the deployment train next week, so almost exactly one week/168 hours from now, it should be done everywhere. As the deployment train reaches each of the smaller wikis, you could start running it. After it's had enough time to run (days? weeks?), then we'll have more accurate numbers at https://tools.wmflabs.org/wikitext-deprecation/

But it won't run at the larger wikis until further revisions, so while master wikignomes such as @Jonesey95 can plan for having new data on the smaller wikis in approximately the middle of March, the bigger wikis (and especially the English Wikipedia) will need to wait a bit longer. (Lego, if I've gotten any of that wrong, and especially if I'm over-optimistic about the speed at which we'll get results, then please correct me.)

Since it's a maintenance script, I just backported and deployed it. I did a quick run on aawiki (a closed wiki with barely any pages), and the results look good:

mysql:wikiadmin@db1078 [aawiki]> select count(*), SUBSTR(page_links_updated, 1,6) from page group by SUBSTR(page_links_updated, 1,6) order by SUBSTR(page_links_updated, 1,6) desc;
+----------+---------------------------------+
| count(*) | SUBSTR(page_links_updated, 1,6) |
+----------+---------------------------------+
|      510 | 201703                          |
|        1 | NULL                            |
+----------+---------------------------------+
2 rows in set (0.00 sec)

(The page with NULL is entirely broken; that's another issue.)

The only thing I didn't realize beforehand is that this also triggers updates to the search index, and I don't know how to monitor that / how much extra load is acceptable. I'll talk to someone about that tomorrow so I can start runs for the smaller wikis over the weekend.

For the record, I started running this on all small wikis last week until I ran into T159618 and had to stop. I'll probably resume it on small wikis again tomorrow.

Any progress on this one? Can you run it with the automatic rate-limiter turned on? (I forget what it's called.)

@Legoktm, can you tell me how this work on the link tables is going? I assume that this isn't the week to do anything complicated, but T159618 is resolved, so I hope that it will be possible to start this again at some point.

It would be great to make some progress on this task. On en.WP, Category:Pages using ISBN magic links is still populating, seven months after the MediaWiki code change that created it. This bug is blocking implementation of things like the Tidy conversion and removal of magic links.

This is not really a task you make progress on, I'm afraid. It's more a "Rain may cause flooding" task, with some issue-amelioration sub-tasks proposed underneath. It certainly won't ever be "done", unless we radically re-think what MediaWiki is and re-write it. Can you link to the tasks you think are blocked by this so we can look for a way around this?

Pleeeeeeeeeeeese, find a way. I ran a null-edit script on hewiki a couple of weeks ago, and the number of Linter problems grew by 50,000.

Even a one-time update might be useful with the Linter changes. Or perhaps we could run a global null-edit bot?

Can I be wild? Or stupid? Would it be possible for the dump bot to do an on-the-fly forced links update on each page it copies, twice a month?

Even a one-time update might be useful with the Linter changes.

This is in the works. @Legoktm and I talked about this at the Wikimania hackathon, and we might kick this off in the coming week.

The implementation plan I propose is:

This sounds great. One month seems like a reasonable update time.

Bumping this. Can we get this running, please?

We still need this. Pages are still showing up every day in the Linter error page lists (for en.WP), even though the Linter lists were created many months ago. We can't fix the errors if they are not on a list or a category somewhere.

@ssastry: Are there any updated plans? Asking as you commented on T157670#3531831 - thanks!

With @Legoktm no longer on the team, we haven't been able to make progress on this. Added Platform Engineering to see if they have any thoughts about this task or if they are able to tackle this. Otherwise, this will have to wait for us to get back into this code again -- which we will in the next 12+ months as we upgrade Parsoid and better integrate with core functionality to ensure it does everything the current core parser does.

daniel renamed this task from "Changes to MediaWiki code related to parsing can leave links tables out of date" to "Periodically run refreshLinks.php on production sites." (Jan 28 2021, 9:34 AM)
daniel updated the task description.

Doing this for ALL pages isn't feasible; a rough estimate comes to about 30 years of rendering time (a billion pages at one per second). We'll have to filter at least by namespace and by category, which means we can't just run it periodically. It'll be a manual thing every time.

This hypothetical math is unhelpful. Way back in 2017, four years ago, there were 12 million pages in the "NULL" group on en.WP. Null-editing them at one page per second would have taken 139 days. That never happened; here we are four years later, and the problem is no closer to being solved.

The point isn't to run it for all pages anyway. We already track the last time a page was purged, so we only need to refresh pages that haven't been touched for a certain amount of time (even setting the threshold at several years would be a significant start). The main technical work to be done here is T159512: Add option to refreshLinks.php to only update pages that haven't been updated since a timestamp; then we just set up a cron job and monitor it.
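
To gauge the backlog such a job would face on a given wiki, a count along these lines (a sketch; the one-year threshold is only an example) shows how many pages would still need a refresh:

-- Sketch: pages a cron job with a one-year threshold would still need to refresh.
SELECT COUNT(*) AS stale_pages
FROM page
WHERE page_links_updated IS NULL
   OR page_links_updated < DATE_FORMAT(NOW() - INTERVAL 1 YEAR, '%Y%m%d%H%i%S');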

In the PET tech planning meeting, Tim just had the idea that we could filter by page_links_updated, so we'd only reparse things that haven't been re-parsed since a given date. That may make it more feasible.

EDIT: Oh, I suppose that's what @Legoktm meant by "tracking when the page was purged" above.

This is also in the task description, where Tim said the exact same thing in 2017.... :)

Here's what page_links_updated looks like on enwiki now. I added page lengths to give an idea of the work involved in parsing them.

MariaDB [enwiki]> select left(page_links_updated,4) as year,count(*),avg(page_len),sum(page_len) from page where page_random between 0.001 and 0.002 group by year;
+------+----------+---------------+---------------+
| year | count(*) | avg(page_len) | sum(page_len) |
+------+----------+---------------+---------------+
| NULL |     2821 |      719.9596 |       2031006 |
| 2010 |        5 |      145.8000 |           729 |
| 2011 |        9 |      116.0000 |          1044 |
| 2012 |       11 |      111.1818 |          1223 |
| 2013 |       16 |       85.2500 |          1364 |
| 2014 |     1139 |     1619.6558 |       1844788 |
| 2015 |     1528 |     1383.2808 |       2113653 |
| 2016 |     1149 |     1116.9634 |       1283391 |
| 2017 |     1846 |     1817.3922 |       3354906 |
| 2018 |     1507 |     2056.5448 |       3099213 |
| 2019 |     6557 |      938.2196 |       6151906 |
| 2020 |    23928 |     2034.6179 |      48684337 |
| 2021 |    12018 |     5909.8995 |      71025172 |
+------+----------+---------------+---------------+
13 rows in set (0.78 sec)

Since page_random is uniform, the 0.001 to 0.002 band is a 0.1% sample, so the totals above scale up by a factor of 1000. If we refreshed everything last updated before 2020-01-01, including the NULLs, that would be about 16.6M pages and 18.5 GB of wikitext.
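
That estimate can also be computed directly from the sample; a sketch (using the same 0.1% band and the 2020-01-01 cutoff):

-- Sketch: extrapolate the 0.1% page_random sample (0.001 to 0.002) to full-wiki totals
-- for pages last refreshed before 2020-01-01 or never refreshed.
SELECT COUNT(*) * 1000      AS est_pages,
       SUM(page_len) * 1000 AS est_bytes
FROM page
WHERE page_random BETWEEN 0.001 AND 0.002
  AND (page_links_updated IS NULL OR page_links_updated < '20200101000000');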

How does 16.6M pages and 18.5 GB of wikitext compare to how many pages and GB are edited or otherwise refreshed per day or month on enwiki?

Edits alone are something like 156 GB per month. The average revision size is much larger than the average page size, because larger pages are edited more frequently.

MariaDB [enwiki]> select count(*),avg(rev_len),sum(rev_len) from revision where rev_id between 991616250 and 997529717;
+----------+--------------+--------------+
| count(*) | avg(rev_len) | sum(rev_len) |
+----------+--------------+--------------+
|  5789860 |   28859.2656 | 167091107733 |
+----------+--------------+--------------+
1 row in set (5.31 sec)

I used rev_id values corresponding to the start and end of December 2020, because the rev_timestamp index is probably cold and slow.
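
If one did want to derive those boundary rev_id values from timestamps rather than by hand, a narrowly bounded lookup keeps the touched index range small. This is only a sketch; it still uses the rev_timestamp index, just over a single day at each end of the month:

-- Sketch: find rev_id bounds for December 2020 using narrow timestamp windows.
SELECT MIN(rev_id) AS first_rev
FROM revision
WHERE rev_timestamp BETWEEN '20201201000000' AND '20201201235959';

SELECT MAX(rev_id) AS last_rev
FROM revision
WHERE rev_timestamp BETWEEN '20201231000000' AND '20201231235959';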

So that looks like only 10% more edit volume (on en.WP) over the course of a month to get caught up, and then something similar to that each month to keep pages current (articles could be kept "more current" than other namespaces, if necessary to control loads). Can someone please kick off this process? Thanks!

@daniel I don't see why the Tech Decision process needs to be used for this.