Changes to MediaWiki code related to parsing can leave links tables out of date
Open, LowPublic

Description

Forked from T132467#2674866:

This is a long-standing problem. See T69419 and this discussion from 2014.

Here's another example: I created "Category:Pages using invalid self-closed HTML tags" on 14 July 2016 on en.WP after a change to MW started adding a hidden category to articles with a specific kind of HTML error. As of today, 28 Sep 2016, there are pages on en.WP such as "Portal:East Frisia/Region" that have the error in their code and will properly appear in the category after a null edit, but that have not yet shown up in the category on their own.

That means that fundamentally, categories are not being applied to articles in a timely fashion until those articles are edited. By any measure, taking more than two months to properly apply maintenance categories to pages is a bug that needs to be fixed.

Is there some way that we can force all pages to be null-edited (or the equivalent) with a reasonable frequency? It is not happening right now.

[edited to add:] I have been informed that changes to MediaWiki code that result in maintenance categories and changes to templates/modules that result in category changes are processed differently. This may be two different feature requests.

@tstarling, @ssastry, and I discussed this a few days ago in #mediawiki-parsoid. Tim proposed modifying the refreshLinks.php script to support queuing jobs to update pages based on when they had last been parsed (using page_links_touched). After an initial run, we should set this up as a cron job.

In summary: Sometimes code changes will add new tracking categories or something. But until the page is edited, null edited, or purged with links update, the page will not show up in the category. The proposed solution for this is to run refreshLinks.php on a regular basis.

Legoktm created this task.Feb 9 2017, 10:39 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 9 2017, 10:39 AM

https://quarry.wmflabs.org/query/16287 shows when pages on enwp last had their link tables refreshed (the NULL column is for brand new pages or pages that haven't been touched pre-2015ish., though there are some oddities in the table)

Hi. I saw sometimes that there are years between the new tracking category and the page included in them, only after null edit. Last year I took a month and made null edits by bot to not "very" protected pages in hewiki and then manually for the rest. I'll be happy if it could be done automatically at least every month. Thank you.
And @Jonesey95, pay attention.

Thanks for forking this into its own bug. An update: I am still finding pages with self-closed HTML tag errors that have not made it into the error category. A null edit puts them in the category. We are now at eight months after the MW code change that created this error category.

JJMC89 added a subscriber: JJMC89.Feb 9 2017, 3:01 PM

Lego, do you suppose that the proportion of un-updated pages at enwiki is better (e.g., because they have some null-edit bots), worse (e.g., because it's huge), or about average compared to the smaller/mostly inactive wikis?

Jonesey95 added a comment.EditedFeb 13 2017, 2:24 AM

I believe that the proportion of pages missing from the error category at en.WP is much lower because a few of us have done systematic insource searches for a long list of regex patterns and fixed errors that we found. I searched for many of these patterns a few months ago and found a large number of pages that were not in the error category. Now that the searches have been done, there should be very few pages left to add to the error category.

I have recently been searching for a few of the most common patterns on other wikis and have found many pages not yet in the error category for those wikis. Viewing the page shows the error category, but a null edit to the page is required to get the category page to show that page with the error.

1mysql:wikiadmin@db1082 [dewiki]> select count(*), SUBSTR(page_links_updated, 1,6) from page group by SUBSTR(page_links_updated, 1,6) order by SUBSTR(page_links_updated, 1,6) desc;
2+----------+---------------------------------+
3| count(*) | SUBSTR(page_links_updated, 1,6) |
4+----------+---------------------------------+
5| 677131 | 201702 |
6| 983139 | 201701 |
7| 979130 | 201612 |
8| 203688 | 201611 |
9| 62137 | 201610 |
10| 40059 | 201609 |
11| 75238 | 201608 |
12| 2794596 | 201607 |
13| 15 | 201606 |
14| 7 | 201605 |
15| 6 | 201604 |
16| 4 | 201603 |
17| 9 | 201602 |
18| 107 | 201601 |
19| 18 | 201512 |
20| 12 | 201511 |
21| 23 | 201510 |
22| 30 | 201509 |
23| 5 | 201508 |
24| 18 | 201507 |
25| 5 | 201506 |
26| 6 | 201505 |
27| 16 | 201504 |
28| 11 | 201503 |
29| 16 | 201502 |
30| 8 | 201501 |
31| 15 | 201412 |
32| 13 | 201411 |
33| 21 | 201410 |
34| 16 | 201409 |
35| 8 | 201408 |
36| 17 | 201407 |
37| 41 | 201406 |
38| 14 | 201405 |
39| 6 | 201404 |
40| 21 | 201403 |
41| 9 | 201402 |
42| 9 | 201401 |
43| 994 | NULL |
44+----------+---------------------------------+
4539 rows in set (2.90 sec)
46
47mysql:wikiadmin@db1082 [dewiki]> Bye
48legoktm@terbium:~$ sql enwiki
49Reading table information for completion of table and column names
50You can turn off this feature to get a quicker startup with -A
51
52Welcome to the MySQL monitor. Commands end with ; or \g.
53Your MySQL connection id is 310553585
54Server version: 5.5.5-10.0.28-MariaDB MariaDB Server
55
56Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.
57
58Oracle is a registered trademark of Oracle Corporation and/or its
59affiliates. Other names may be trademarks of their respective
60owners.
61
62Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
63
64mysql:wikiadmin@db1083 [enwiki]> select count(*), SUBSTR(page_links_updated, 1,6) from page group by SUBSTR(page_links_updated, 1,6) order by SUBSTR(page_links_updated, 1,6) desc;
65+----------+---------------------------------+
66| count(*) | SUBSTR(page_links_updated, 1,6) |
67+----------+---------------------------------+
68| 9131803 | 201702 |
69| 4823578 | 201701 |
70| 3586662 | 201612 |
71| 1854227 | 201611 |
72| 1022701 | 201610 |
73| 825076 | 201609 |
74| 1025250 | 201608 |
75| 716416 | 201607 |
76| 306360 | 201606 |
77| 173934 | 201605 |
78| 194121 | 201604 |
79| 149071 | 201603 |
80| 191523 | 201602 |
81| 154113 | 201601 |
82| 132802 | 201512 |
83| 329800 | 201511 |
84| 154395 | 201510 |
85| 169362 | 201509 |
86| 140473 | 201508 |
87| 147591 | 201507 |
88| 195598 | 201506 |
89| 384558 | 201505 |
90| 636823 | 201504 |
91| 124076 | 201503 |
92| 116068 | 201502 |
93| 122729 | 201501 |
94| 124068 | 201412 |
95| 113882 | 201411 |
96| 136086 | 201410 |
97| 128164 | 201409 |
98| 828887 | 201408 |
99| 266600 | 201407 |
100| 220104 | 201406 |
101| 193860 | 201405 |
102| 330776 | 201404 |
103| 316919 | 201403 |
104| 96540 | 201402 |
105| 82552 | 201401 |
106| 1 | 201307 |
107| 1 | 201302 |
108| 1 | 201207 |
109| 2 | 201206 |
110| 1 | 201205 |
111| 6 | 201204 |
112| 1 | 201203 |
113| 1 | 201112 |
114| 1 | 201109 |
115| 1 | 201108 |
116| 1 | 201105 |
117| 1 | 200912 |
118| 2 | 200910 |
119| 2 | 200908 |
120| 1 | 200905 |
121| 1 | 200811 |
122| 2 | 200810 |
123| 1 | 200803 |
124| 1 | 200802 |
125| 1 | 200712 |
126| 1 | 200711 |
127| 4 | 200710 |
128| 6 | 200708 |
129| 5 | 200707 |
130| 5 | 200706 |
131| 6 | 200705 |
132| 6 | 200704 |
133| 5 | 200703 |
134| 4 | 200702 |
135| 6 | 200701 |
136| 13 | 200612 |
137| 6 | 200611 |
138| 13 | 200610 |
139| 3 | 200609 |
140| 8 | 200608 |
141| 8 | 200607 |
142| 7 | 200606 |
143| 9 | 200605 |
144| 4 | 200604 |
145| 36 | 200603 |
146| 1 | 200403 |
147| 1 | 200307 |
148| 1795 | 196912 |
149| 11902661 | NULL |
150+----------+---------------------------------+
15182 rows in set (19.96 sec)
152
153mysql:wikiadmin@db1083 [enwiki]> Bye
154legoktm@terbium:~$ sql cebwiki
155Reading table information for completion of table and column names
156You can turn off this feature to get a quicker startup with -A
157
158Welcome to the MySQL monitor. Commands end with ; or \g.
159Your MySQL connection id is 3481092268
160Server version: 5.5.5-10.0.23-MariaDB-log MariaDB Server
161
162Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.
163
164Oracle is a registered trademark of Oracle Corporation and/or its
165affiliates. Other names may be trademarks of their respective
166owners.
167
168Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
169
170mysql:wikiadmin@db1078 [cebwiki]> select count(*), SUBSTR(page_links_updated, 1,6) from page group by SUBSTR(page_links_updated, 1,6) order by SUBSTR(page_links_updated, 1,6) desc;
171+----------+---------------------------------+
172| count(*) | SUBSTR(page_links_updated, 1,6) |
173+----------+---------------------------------+
174| 2565556 | 201702 |
175| 753395 | 201701 |
176| 1270082 | 201612 |
177| 82876 | 201611 |
178| 34647 | 201610 |
179| 170590 | 201609 |
180| 181793 | 201608 |
181| 13883 | 201607 |
182| 81862 | 201606 |
183| 30110 | 201605 |
184| 53880 | 201604 |
185| 28122 | 201603 |
186| 8549 | 201602 |
187| 79968 | 201601 |
188| 168174 | 201512 |
189| 116387 | 201511 |
190| 10428 | 201510 |
191| 2345 | 201509 |
192| 5766 | 201508 |
193| 332 | 201507 |
194| 1813 | 201506 |
195| 4897 | 201505 |
196| 4936 | 201504 |
197| 307 | 201503 |
198| 56 | 201502 |
199| 371 | 201501 |
200| 357 | 201412 |
201| 3129 | 201411 |
202| 81091 | 201410 |
203| 64253 | 201409 |
204| 124386 | 201408 |
205| 181274 | 201407 |
206| 159993 | 201406 |
207| 1185 | 201405 |
208| 988 | 201404 |
209| 538 | 201403 |
210| 434 | 201402 |
211| 394 | 201401 |
212| 5 | 196912 |
213| 923899 | NULL |
214+----------+---------------------------------+
21540 rows in set (3.38 sec)
216
217mysql:wikiadmin@db1078 [cebwiki]> Bye
218legoktm@terbium:~$ sql svwiki
219Reading table information for completion of table and column names
220You can turn off this feature to get a quicker startup with -A
221
222Welcome to the MySQL monitor. Commands end with ; or \g.
223Your MySQL connection id is 2879545082
224Server version: 5.5.5-10.0.23-MariaDB-log MariaDB Server
225
226Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.
227
228Oracle is a registered trademark of Oracle Corporation and/or its
229affiliates. Other names may be trademarks of their respective
230owners.
231
232Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
233
234mysql:wikiadmin@db1090 [svwiki]> select count(*), SUBSTR(page_links_updated, 1,6) from page group by SUBSTR(page_links_updated, 1,6) order by SUBSTR(page_links_updated, 1,6) desc;
235+----------+---------------------------------+
236| count(*) | SUBSTR(page_links_updated, 1,6) |
237+----------+---------------------------------+
238| 1947267 | 201702 |
239| 1698551 | 201701 |
240| 425565 | 201612 |
241| 58575 | 201611 |
242| 90118 | 201610 |
243| 144255 | 201609 |
244| 30793 | 201608 |
245| 46417 | 201607 |
246| 108031 | 201606 |
247| 106669 | 201605 |
248| 157047 | 201604 |
249| 35975 | 201603 |
250| 49793 | 201602 |
251| 65498 | 201601 |
252| 95184 | 201512 |
253| 158946 | 201511 |
254| 12623 | 201510 |
255| 51765 | 201509 |
256| 14116 | 201508 |
257| 10554 | 201507 |
258| 8144 | 201506 |
259| 12809 | 201505 |
260| 42067 | 201504 |
261| 12939 | 201503 |
262| 5714 | 201502 |
263| 8778 | 201501 |
264| 10396 | 201412 |
265| 11521 | 201411 |
266| 26481 | 201410 |
267| 85875 | 201409 |
268| 121330 | 201408 |
269| 264153 | 201407 |
270| 87000 | 201406 |
271| 9954 | 201405 |
272| 6608 | 201404 |
273| 6264 | 201403 |
274| 10429 | 201402 |
275| 5020 | 201401 |
276| 1 | 201212 |
277| 8 | 200911 |
278| 1 | 200707 |
279| 64 | 196912 |
280| 1515259 | NULL |
281+----------+---------------------------------+
28243 rows in set (3.61 sec)

That's the top 4 largest Wikipedias. Someone on de.wp probably ran a null-edit bot, but otherwise it all looks relatively the same. Someone can do some more analysis on those, but I think it would be trivial to force millions of pages to re-render by just editing a few highly used templates (though I would NOT recommend that)

First step: Make refreshLinks.php actually work: https://gerrit.wikimedia.org/r/#/c/337539/

Legoktm, thanks for tracking down this information. I agree with your judgement that editing a few highly used templates would achieve similar results to null-editing pages, but it would still take a long time (see the bug from which this task was forked for more information), it would clog up the job queue, and it would not systematically null-edit every page.

I wonder if someone on de.WP ran a null edit bot through all of the templates in July 2016, which is when the self-closing tag error category was added. That might explain the results above, where there are a couple thousand stragglers.

Lego, do you think that refreshLinks.php could be run soon? (I'm personally more concerned about the smaller wikis, since they're less likely to have access to a null-edit bot than the big ones.)

I'm still waiting on https://gerrit.wikimedia.org/r/#/c/337539/1 to get merged, then I can start running it for smaller wikis while we improve the script to be a little more conservative on larger wikis.

So, if I understand the schedule correctly, the best-case scenario is that it rides the deployment train next week, so almost exactly one week/168 hours from now, it should be done everywhere. As the deployment train reaches each of the smaller wikis, you could start running it. After it's had enough time to run (days? weeks?), then we'll have more accurate numbers at https://tools.wmflabs.org/wikitext-deprecation/

But it won't run at the larger wikis until further revisions, so while master wikignomes such as @Jonesey95 can plan for having new data on the smaller wikis in approximately the middle of March, the bigger wikis (and especially the English Wikipedia) will need to wait a bit longer. (Lego, if I've gotten any of that wrong, and especially if I'm over-optimistic about the speed at which we'll get results, then please correct me.)

So, if I understand the schedule correctly, the best-case scenario is that it rides the deployment train next week, so almost exactly one week/168 hours from now, it should be done everywhere. As the deployment train reaches each of the smaller wikis, you could start running it. After it's had enough time to run (days? weeks?), then we'll have more accurate numbers at https://tools.wmflabs.org/wikitext-deprecation/

Since it's a maint script I just backported it and deployed it. I did a quick run on aawiki (closed wiki with barely any pages), and the results look good:

mysql:wikiadmin@db1078 [aawiki]> select count(*), SUBSTR(page_links_updated, 1,6) from page group by SUBSTR(page_links_updated, 1,6) order by SUBSTR(page_links_updated, 1,6) desc;
+----------+---------------------------------+
| count(*) | SUBSTR(page_links_updated, 1,6) |
+----------+---------------------------------+
|      510 | 201703                          |
|        1 | NULL                            |
+----------+---------------------------------+
2 rows in set (0.00 sec)

(the page with NULL is broken entirely, thats another issue)

The only thing I didn't realize beforehand is that this also triggers updates to the search index, and I don't know how to monitor that / how much extra load is acceptable. I'll talk to someone about that tomorrow so I can start runs for the smaller wikis over the weekend.

For the record, I started running this on all small wikis last week until I ran into T159618 and had to stop. I'll probably resume it on small wikis again tomorrow.

Any progress on this one? Can you run it with the automatic rate-limiter turned on? (I forget what it's called.)

@Legoktm, can you tell me how this work on the link tables is going? I assume that this isn't the week to do anything complicated, but T159618 is resolved, so I hope that it will be possible to start this again at some point.

It would be great to make some progress on this task. On en.WP, Category:Pages using ISBN magic links is still populating, seven months after the MediaWiki code change that created it. This bug is blocking implementation of things like the Tidy conversion and removal of magic links.

It would be great to make some progress on this task. On en.WP, Category:Pages using ISBN magic links is still populating, seven months after the MediaWiki code change that created it. This bug is blocking implementation of things like the Tidy conversion and removal of magic links.

This is not really a task you make progress on, I'm afraid. It's more a "Rain may cause flooding" task, with some issue-amelioration sub-tasks proposed underneath. It certainly won't ever be "done", unless we radically re-think what MediaWiki is and re-write it. Can you link to the tasks you think are blocked by this so we can look for a way around this?

This is not really a task you make progress on, I'm afraid. It's more a "Rain may cause flooding" task, with some issue-amelioration sub-tasks proposed underneath. It certainly won't ever be "done", unless we radically re-think what MediaWiki is and re-write it.

Pleeeeeeeeeeeese, find a way. I run nulledit script a couple of weeks ago on hewiki. The number of Linter problems grown by 50,000.

The main task I was supposed to work on is T159512: Add option to refreshLinks.php to only update pages that haven't been updated since a timestamp, I'll try and find some time to work on it.

Even a one-time update might be useful with the Linter changes. Or perhaps we could run a global null-edit bot?

Can I be wild? Or stupid? Is it possible that the dumping bot will make on the fly a forceupdate on a page it copies, twice a month?

Even a one-time update might be useful with the Linter changes.

This is in the works. @Legoktm and I talked about this at wikimania hackathon and we might kick this off in the coming week.

The implementation plan I propose is:

This sounds great. One month seems like a reasonable update time.

Kipod removed a subscriber: Kipod.Nov 12 2017, 7:18 AM

Bumping this. Can we get this running, please?

We still need this. Pages are still showing up every day in the Linter error page lists (for en.WP), even though the Linter lists were created many months ago. We can't fix the errors if they are not on a list or a category somewhere.

Izno added a subscriber: Izno.Sep 3 2018, 10:07 PM