Page MenuHomePhabricator

Duplicate ar_rev_id values in several wikis
Closed, ResolvedPublic

Description

Some wikis have archive tables with duplicate ar_rev_id values. This causes duplicate key errors when running populateContentTables.php.

Analysis

populateArchiveRevId.php allocated rev_id values by inserting a dummy row into the revision table and then deleting it in the same transaction. It's evident from the MariaDB documentation that this is unsafe in MariaDB 10.1. In fact even ordinary article deletion is unsafe in MariaDB 10.1: if the latest revision is deleted, and then the master DB server is restarted, the autoincrement value will be reset to MAX(rev_id)+1 and so new revision creations will reuse the ID supposedly reserved by ar_rev_id.

According to the MySQL documentation, the autoincrement value will be reset to MAX(...)+1 if a row is inserted or updated with a non-null value specified for the autoincrement column. It is unclear whether this is also true for MariaDB.

The affected wikis are: aawikibooks, cawiki, gotwikibooks, kswikiquote, lvwikibooks, nostalgiawiki, wawikibooks and wikimania2005wiki.

Taking aawikibooks as an example: The deletion log shows that 1749 pages in namespace 8 were deleted in 2007. Another 28 pages were also deleted in 2007. Then one was deleted in 2008. In 2015, 5 pages were deleted, with 9 revisions.. This is a closed, read-only wiki which has not been edited since 2015. ar_rev_id was apparently introduced in 2005, so none of these archive rows should have had ar_rev_id=NULL, and so none should have been affected by populateContentTables.php. But I can't see how it could have been caused by the original deletion in October 2007, since there are conflicting ar_rev_id values for revisions that would have been in the revision table at the same time.

A list of affected archive rows in aawikibooks:

1wikiadmin@10.64.0.205(aawikibooks)> select * from (select ar_rev_id, count(*) as n, group_concat(ar_id) as ar_ids from archive group by ar_rev_id) as counts where n>1;
2+-----------+---+-----------+
3| ar_rev_id | n | ar_ids |
4+-----------+---+-----------+
5| 3003 | 2 | 29,987 |
6| 3004 | 2 | 36,988 |
7| 3005 | 2 | 37,989 |
8| 3006 | 2 | 47,990 |
9| 3007 | 2 | 55,991 |
10| 3008 | 2 | 71,1029 |
11| 3009 | 2 | 76,1030 |
12| 3010 | 2 | 80,1031 |
13| 3011 | 2 | 84,1032 |
14| 3012 | 2 | 131,1073 |
15| 3013 | 2 | 139,1105 |
16| 3014 | 2 | 172,1107 |
17| 3015 | 2 | 207,1108 |
18| 3016 | 2 | 208,1109 |
19| 3017 | 2 | 213,1110 |
20| 3018 | 2 | 224,1111 |
21| 3019 | 2 | 227,1113 |
22| 3020 | 2 | 238,1114 |
23| 3021 | 2 | 253,1116 |
24| 3022 | 2 | 287,1117 |
25| 3023 | 2 | 294,1124 |
26| 3024 | 2 | 305,1125 |
27| 3025 | 2 | 309,1127 |
28| 3026 | 2 | 316,1131 |
29| 3027 | 2 | 320,1136 |
30| 3028 | 2 | 337,1219 |
31| 3029 | 2 | 354,1230 |
32| 3030 | 2 | 368,1293 |
33| 3031 | 2 | 426,1294 |
34| 3032 | 2 | 498,1302 |
35| 3033 | 2 | 524,1357 |
36| 3034 | 2 | 557,1439 |
37| 3035 | 2 | 564,1446 |
38| 3036 | 2 | 612,1537 |
39| 3037 | 2 | 813,1572 |
40| 3038 | 2 | 890,1574 |
41| 3039 | 2 | 894,1581 |
42| 3040 | 2 | 902,1629 |
43| 3041 | 2 | 903,1630 |
44| 3042 | 2 | 919,1636 |
45| 3043 | 2 | 950,1663 |
46| 3044 | 2 | 967,1668 |
47| 3045 | 2 | 969,1728 |
48| 3046 | 2 | 972,1741 |
49| 3047 | 2 | 975,1757 |
50| 3048 | 2 | 976,1781 |
51| 3049 | 2 | 977,1793 |
52| 3050 | 2 | 979,1815 |
53| 3051 | 2 | 983,1823 |
54| 3052 | 2 | 985,1824 |
55| 3053 | 2 | 1018,1849 |
56| 3054 | 2 | 1355,1858 |
57| 3055 | 2 | 1356,1875 |
58| 3056 | 2 | 1360,1876 |
59| 3057 | 2 | 1361,1877 |
60| 3058 | 2 | 1396,1892 |
61| 3059 | 2 | 1397,1913 |
62| 3060 | 2 | 1427,1921 |
63| 3061 | 2 | 1476,1930 |
64| 3062 | 2 | 1486,1936 |
65| 3063 | 2 | 1492,1939 |
66| 3064 | 2 | 1544,1942 |
67| 3065 | 2 | 1606,1945 |
68| 3066 | 2 | 1672,1992 |
69| 3067 | 2 | 1721,2011 |
70| 3068 | 2 | 1920,2018 |
71| 3069 | 2 | 1924,2019 |
72| 3070 | 2 | 1995,2020 |
73| 3071 | 2 | 1996,2021 |
74| 3072 | 2 | 2022,2074 |
75| 3073 | 2 | 2023,2075 |
76| 3074 | 2 | 2062,2093 |
77| 3075 | 2 | 2071,2094 |
78| 3076 | 2 | 2096,2117 |
79| 3077 | 2 | 2097,2135 |
80| 3078 | 2 | 2098,2136 |
81| 3079 | 2 | 2099,2138 |
82| 3080 | 2 | 2100,2189 |
83| 3081 | 2 | 2101,2221 |
84| 3082 | 2 | 2103,2375 |
85| 3083 | 2 | 2104,2376 |
86| 3084 | 2 | 2107,2377 |
87| 3085 | 2 | 2109,2378 |
88| 3086 | 2 | 2124,2379 |
89| 3087 | 2 | 2174,2380 |
90| 3088 | 2 | 2179,2422 |
91| 3089 | 2 | 2203,2526 |
92| 3090 | 2 | 2206,2527 |
93| 3091 | 2 | 2208,2575 |
94| 3092 | 2 | 2220,2576 |
95| 3093 | 2 | 2249,2582 |
96| 3094 | 2 | 2257,2588 |
97| 3095 | 2 | 2350,2643 |
98| 3096 | 2 | 2351,2656 |
99| 3097 | 2 | 2381,2666 |
100| 3098 | 2 | 2382,2672 |
101| 3099 | 2 | 2383,2673 |
102| 3100 | 2 | 2395,2677 |
103| 3101 | 2 | 2397,2678 |
104| 3102 | 2 | 2457,2681 |
105| 3103 | 2 | 2479,2685 |
106| 3104 | 2 | 2506,2694 |
107| 3105 | 2 | 2531,2737 |
108| 3106 | 2 | 2567,2739 |
109| 3107 | 2 | 2571,2740 |
110| 3108 | 2 | 2574,2779 |
111| 3109 | 2 | 2689,2783 |
112| 3110 | 2 | 2762,2804 |
113| 3111 | 2 | 2763,2997 |
114| 3112 | 2 | 2790,3007 |
115| 3113 | 2 | 2819,3029 |
116| 3114 | 2 | 2855,3038 |
117| 3115 | 2 | 2976,3039 |
118| 3116 | 2 | 2977,3040 |
119| 3117 | 2 | 2979,3042 |
120| 3118 | 2 | 3045,3121 |
121| 3119 | 2 | 3047,3174 |
122| 3120 | 2 | 3083,3183 |
123| 3121 | 2 | 3091,3184 |
124| 3122 | 2 | 3096,3185 |
125| 3123 | 2 | 3102,3188 |
126| 3124 | 2 | 3104,3202 |
127| 3125 | 2 | 1505,3203 |
128| 3126 | 2 | 92,3204 |
129| 3127 | 2 | 133,3222 |
130| 3128 | 2 | 147,3231 |
131| 3129 | 2 | 157,3238 |
132| 3130 | 2 | 168,3247 |
133| 3131 | 2 | 175,3258 |
134| 3132 | 2 | 176,3267 |
135| 3133 | 2 | 177,3281 |
136| 3134 | 2 | 178,3284 |
137| 3135 | 2 | 214,6569 |
138| 3136 | 2 | 288,6570 |
139| 3137 | 2 | 289,6798 |
140| 3138 | 2 | 308,6799 |
141| 3139 | 2 | 355,6801 |
142| 3140 | 2 | 406,6802 |
143| 3141 | 2 | 431,6804 |
144| 3142 | 2 | 457,6805 |
145| 3143 | 2 | 501,6806 |
146| 3144 | 2 | 917,6807 |
147| 3145 | 2 | 953,6808 |
148+-----------+---+-----------+

Note that ar_id was only introduced in 2012, it's not exactly monotonic with deletion time. Some deletion times:

wikiadmin@10.64.16.191(aawikibooks)> select ar_id,log_timestamp from logging,archive where log_namespace=ar_namespace and log_title=ar_title and ar_id in (29,985,987,1018,1849,92,3204,6569) order by log_timestamp;
+-------+----------------+
| ar_id | log_timestamp  |
+-------+----------------+
|    29 | 20070108212726 |
|    29 | 20070108212726 |
|    92 | 20070108212729 |
|    92 | 20070108212729 |
|   985 | 20070108212804 |
|   985 | 20070108212804 |
|   987 | 20070108212804 |
|   987 | 20070108212804 |
|  1018 | 20070108212805 |
|  1018 | 20070108212805 |
|  1849 | 20070108212826 |
|  1849 | 20070108212826 |
|  6569 | 20070918192042 |
|  6569 | 20070918192042 |
|  3204 | 20070918192049 |
|  3204 | 20070918192049 |
|  6569 | 20080328185334 |
+-------+----------------+

In summary, I don't know what happened but it's obviously broken.

The MCR write-both mode, with its unique index on slot_revision_id, should at least prevent conflicting rows from being inserted, although this potentially comes at the cost of throwing an exception on every edit until someone manually advances the autoincrement value for rev_id is advanced beyond MAX(ar_rev_id).

That is to say, if the most recent revision is deleted, and then a master switch is done, all edits would fail until manual action is taken.

Fixing it

In the long term, we will have to stop using ar_rev_id, since evidently we made incorrect assumptions about how DBMSs work. The whole concept of ar_rev_id appears to be invalid.

A quick hack would be to permanently insert a row into the revision table of the affected wikis, with a rev_id specified so as to make room for the extra ar_rev_id values. This wouldn't fix the master switch issue.

When inserting a row into revision, we could do SELECT MAX(ar_rev_id) FROM archive. If the rev_id returned when inserting is less than MAX(ar_rev_id), then we could update the rev_id in the newly-inserted row to some larger value.

Event Timeline

tstarling triaged this task as Medium priority.Aug 16 2018, 1:32 AM
tstarling created this task.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

The MCR write-both mode, with its unique index on slot_revision_id, should at least prevent conflicting rows from being inserted, although this potentially comes at the cost of throwing an exception on every edit until someone manually advances the autoincrement value for rev_id is advanced beyond MAX(ar_rev_id).

There already is a unique index on rev_id, and during normal operation, new slot rows are only created when new revision rows are created. The only exception is the populateContentTables.php script. As Tim noted, that script did throw an exception when encountering the duplicate ar_rev_id values.

That is to say, if the most recent revision is deleted, and then a master switch is done, all edits would fail until manual action is taken.

If the master switch causes the auto-increment counter for rev_id to be reset, this could indeed happen. It's very unlikely, but if it did happen, it would be very annoying.

Is there a good way to see whether a DBError is caused by a primary key conflict? If we had a way to detect this, we could fall back to determining the new rev_id from SELECT MAX(slot_revision_id)+1. Maybe we could even just always do this, instead of relying on rev_id's auto-increment... but I'm not 100% sure that would be safer and performant. Would it?

In the long term, we will have to stop using ar_rev_id, since evidently we made incorrect assumptions about how DBMSs work. The whole concept of ar_rev_id appears to be invalid.

I don't see any alternative to ar_rev_id, unless we stop using the archive table altogether.

A quick hack would be to permanently insert a row into the revision table of the affected wikis, with a rev_id specified so as to make room for the extra ar_rev_id values. This wouldn't fix the master switch issue.

...and of course, we'd have to actually assign the extra ar_rev_id values.

When inserting a row into revision, we could do SELECT MAX(ar_rev_id) FROM archive. If the rev_id returned when inserting is less than MAX(ar_rev_id), then we could update the rev_id in the newly-inserted row to some larger value.

I'd prefer to use MAX(slot_revision_id), since that's the actual primary index. But I suppose that amounts to the same, does it not?

I'll make a patch for this. Please confirm that the additional query on master isn't a performance problem.

Change 453117 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Make sure we don't try to use a deleted rev ID.

https://gerrit.wikimedia.org/r/453117

The patch I made should protect against the edge case that @jcrespo mentioned. Not sure that it's worth the overhead given how unlikely we are to hit that situation, but that can be discussed on the ticket.

I just realized, we already fixed duplicate ar_rev_id entries before... And then the problem re-appeared? That doesn't sound good.

I re-opened T193180: Clean up archive rows with duplicate revision IDs and made it a blocker for this task. @Anomie, can you look into this?

populateArchiveRevId.php allocated rev_id values by inserting a dummy row into the revision table and then deleting it in the same transaction. It's evident from the MariaDB documentation that this is unsafe in MariaDB 10.1. In fact even ordinary article deletion is unsafe in MariaDB 10.1: if the latest revision is deleted, and then the master DB server is restarted, the autoincrement value will be reset to MAX(rev_id)+1 and so new revision creations will reuse the ID supposedly reserved by ar_rev_id.

It's safe enough in itself. The problem is the one you identify that's shared with normal deletions too.

deduplicateArchiveRevId.php does the same thing, BTW.

In summary, I don't know what happened but it's obviously broken.

Possibly what happened for aawikibooks was that the database was already in the inconsistent state with respect to the rev_id auto-increment value when deduplicateArchiveRevId.php was run the first time, so the inserts into revision that were supposed to assign new IDs were instead duplicating other existing archive rows.

That is to say, if the most recent revision is deleted, and then a master switch is done, all edits would fail until manual action is taken.

I don't think so. The failed insert to slots would still have had a successful-but-rolled-back insert into revision, which would have incremented the auto-increment counter. Once enough edits failed to increment the counter past where it should be, edits would start working again.

In the long term, we will have to stop using ar_rev_id, since evidently we made incorrect assumptions about how DBMSs work. The whole concept of ar_rev_id appears to be invalid.

I don't see any alternative to ar_rev_id, unless we stop using the archive table altogether.

T20493: RFC: Unify the various deletion systems seems like it might go in that direction.

I just realized, we already fixed duplicate ar_rev_id entries before... And then the problem re-appeared? That doesn't sound good.

I re-opened T193180: Clean up archive rows with duplicate revision IDs and made it a blocker for this task. @Anomie, can you look into this?

Once we somehow fix the bug where the auto-increment value is inconsistent with ar_rev_id, we should be able to just re-run deduplicateArchiveRevId.php.

We'd then want to check in case any of the reassigned rows are the ones that populateContentTables.php had populated (versus being the rows that it errored out on). Or we could just blank the slots and content tables.

We'd then want to check in case any of the reassigned rows are the ones that populateContentTables.php had populated (versus being the rows that it errored out on). Or we could just blank the slots and content tables.

populateContentTables.php failed when hitting the first duplicate, so no "bad" population happened. But now we have one row in the slots table (per wiki) that corresponds to two archive rows. One of these needs to get a new value assigned to ar_rev_id, but which one? We should be able to find out by checking ar_text_id.

When hitting the first duplicate is the point. If rows A and B had the same ar_rev_id, the slots table would have been populated for A and then B would be the "first duplicate". If the deduplication decided to reassign A rather than B, we'd have the problem.

Offhand I think populateContentTables.php goes in order by ar_id while deduplicateArchiveRevId.php will wind up reassigning the row with the higher ar_id, so things will work out right anyway, but I haven't actually checked that.

A third option would be to delete the slots and content rows for just the duplicated slot_revision_ids, then deduplicate.

Notes for self: MySQL (since 5.5, maybe earlier) will update the auto-increment value on insert with an explicit value for the ID column (but not update). So we'll do that to fix the value when the problem is detected, and have the deduplicate script do it too. Then re-run the deduplicator and then the populator.

I just got hit by `Error: 1062 Duplicate entry '2871-1' for key 'PRIMARY' (localhost)` on my local machine, very likely caused by me deleting some pages, and then rebooting, causing mysql to be re-started. While this situation may be rare enough on a production system, it's going to be quite common on tests systems. It seems me do need some protection against that in regular production code. I'll try to touch up the patch I made to use insert instead of update. (why the hell doesn't update the auto-increment counter?)

@Anomie I outlined an idea on the patch, can you tell me if that would work? I'm not sure I got the locking behavior right.

The affected wikis are: aawikibooks, cawiki, gotwikibooks, kswikiquote, lvwikibooks, nostalgiawiki, wawikibooks and wikimania2005wiki.

cawiki does not seem to have any affected rows. Can you double check that that isn't a typo?

From T183488#4500564, it looks like cawiki failed because of a MySQL restart rather than this issue.

Change 453117 merged by jenkins-bot:
[mediawiki/core@master] Make sure we don't try to use a deleted rev ID.

https://gerrit.wikimedia.org/r/453117

Change 455550 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@wmf/1.32.0-wmf.18] Make sure we don't try to use a deleted rev ID.

https://gerrit.wikimedia.org/r/455550

Change 455550 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.18] Make sure we don't try to use a deleted rev ID.

https://gerrit.wikimedia.org/r/455550

Mentioned in SAL (#wikimedia-operations) [2018-08-27T13:25:34Z] <anomie@deploy1001> Synchronized php-1.32.0-wmf.18/includes/Storage/RevisionStore.php: Backport for T202032 (duration: 00m 49s)

Note the above backport should wind up having almost no effect on live Wikimedia wikis at this time, since the only ones that seem affected by the MySQL bug are closed wikis where there's no ability to edit. The only effect should be two more SELECT MAX($field) FROM $table-style queries during the creation of each revision, and a few microseconds of PHP code.

The backport was mainly so I can start re-running the maintenance scripts today instead of waiting until the end of the week.

Mentioned in SAL (#wikimedia-operations) [2018-08-27T14:42:44Z] <anomie> Running deduplicateArchiveRevId.php on aawikibooks for T202032

Mentioned in SAL (#wikimedia-operations) [2018-08-27T14:43:51Z] <anomie> Running deduplicateArchiveRevId.php on gotwikibooks, kswikiquote, lvwikibooks, nostalgiawiki, wawikibooks and wikimania2005wiki for T202032

Confirmed that no wikis have duplicate ar_rev_id values now.