
Unbreak replication in beta cluster
Closed, Resolved · Public

Description

Because of the combination of T183242: DB handles obtained with DB_REPLICA should not allow writes, T183245: Ensure replica DB in labs is read-only, and some broken code in MediaWiki core that was recently merged into master, replication broke in the beta cluster:

wikiadmin@deployment-db04[deploymentwiki]> show slave status \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: deployment-db03.eqiad.wmflabs
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: deployment-db03-bin.000086
          Read_Master_Log_Pos: 489306779
               Relay_Log_File: deployment-db04-relay-bin.000283
                Relay_Log_Pos: 312519
        Relay_Master_Log_File: deployment-db03-bin.000086
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1062
                   Last_Error: Error 'Duplicate entry '20972' for key 'PRIMARY'' on query. Default database: 'deploymentwiki'. Query: 'INSERT /* AbuseFilter::storeVarDump  */  INTO `text` (old_id,old_text,old_flags) VALUES (NULL,'DB://cluster1/8224','nativeDataArray,gzip,external')'
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 489296618
              Relay_Log_Space: 323284
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1062
               Last_SQL_Error: Error 'Duplicate entry '20972' for key 'PRIMARY'' on query. Default database: 'deploymentwiki'. Query: 'INSERT /* AbuseFilter::storeVarDump  */  INTO `text` (old_id,old_text,old_flags) VALUES (NULL,'DB://cluster1/8224','nativeDataArray,gzip,external')'
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 172234526
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: No
                  Gtid_IO_Pos: 
1 row in set (0.00 sec)

The text table has diverged:

wikiadmin@deployment-db03[deploymentwiki]> select * from text where old_id > 20965;
+--------+--------------------+-------------------------------+
| old_id | old_text           | old_flags                     |
+--------+--------------------+-------------------------------+
|  20966 | DB://cluster1/8207 | nativeDataArray,gzip,external |
|  20967 | DB://cluster1/8208 | nativeDataArray,gzip,external |
|  20968 | DB://cluster1/8209 | nativeDataArray,gzip,external |
|  20969 | DB://cluster1/8210 | nativeDataArray,gzip,external |
|  20970 | DB://cluster1/8211 | utf-8,gzip,external           |
|  20971 | DB://cluster1/8218 | nativeDataArray,gzip,external | (DIFFERENT)
|  20972 | DB://cluster1/8224 | nativeDataArray,gzip,external | (DIFFERENT)
+--------+--------------------+-------------------------------+
7 rows in set (0.00 sec)

wikiadmin@deployment-db04[deploymentwiki]> select * from text where old_id > 20965;
+--------+--------------------+-------------------------------+
| old_id | old_text           | old_flags                     |
+--------+--------------------+-------------------------------+
|  20966 | DB://cluster1/8207 | nativeDataArray,gzip,external |
|  20967 | DB://cluster1/8208 | nativeDataArray,gzip,external |
|  20968 | DB://cluster1/8209 | nativeDataArray,gzip,external |
|  20969 | DB://cluster1/8210 | nativeDataArray,gzip,external |
|  20970 | DB://cluster1/8211 | utf-8,gzip,external           |
|  20971 | DB://cluster1/8212 | utf-8,gzip,external           | (DIFFERENT)
|  20972 | DB://cluster1/8213 | utf-8,gzip,external           | (DIFFERENT)
|  20973 | DB://cluster1/8214 | utf-8,gzip,external           | (DIFFERENT)
|  20974 | DB://cluster1/8215 | utf-8,gzip,external           | (DIFFERENT)
|  20975 | DB://cluster1/8216 | utf-8,gzip,external           | (DIFFERENT)
|  20976 | DB://cluster1/8217 | utf-8,gzip,external           | (DIFFERENT)
|  20977 | DB://cluster1/8219 | utf-8,gzip,external           | (DIFFERENT)
|  20978 | DB://cluster1/8220 | utf-8,gzip,external           | (DIFFERENT)
|  20979 | DB://cluster1/8221 | utf-8,gzip,external           | (DIFFERENT)
|  20980 | DB://cluster1/8222 | utf-8,gzip,external           | (DIFFERENT)
|  20981 | DB://cluster1/8223 | utf-8,gzip,external           | (DIFFERENT)
|  20982 | DB://cluster1/8225 | utf-8,gzip,external           | (DIFFERENT)
+--------+--------------------+-------------------------------+
17 rows in set (0.01 sec)

The revision table is in better shape: the master and replica agree on all rows except one, which is present on the master and missing from the replica. That row's rev_text_id (20982) refers to a text row that exists only on the replica, and the text row mentioned in the error that broke replication (old_id 20972) is one of the rows the master and replica disagree on (the replica's version looks like the right one).

wikiadmin@deployment-db03[deploymentwiki]> select rev_id, rev_text_id, rev_timestamp from revision order by rev_id desc limit 15;
+--------+-------------+----------------+
| rev_id | rev_text_id | rev_timestamp  |
+--------+-------------+----------------+
|   7944 |       20982 | 20171219010811 | (MISSING ON REPLICA)
|   7943 |       20981 | 20171219002242 |
|   7942 |       20980 | 20171218193859 |
|   7941 |       20979 | 20171218193849 |
|   7940 |       20978 | 20171218193735 |
|   7939 |       20977 | 20171218193138 |
|   7938 |       20976 | 20171218183128 |
|   7937 |       20975 | 20171218182715 |
|   7936 |       20974 | 20171218180348 |
|   7935 |       20973 | 20171218174138 |
|   7934 |       20972 | 20171218152013 |
|   7917 |       20900 | 20171211164804 |
|   7892 |       20783 | 20171202182607 |
|   7891 |       20782 | 20171202154042 |
|   7890 |       20336 | 20171202153840 |
+--------+-------------+----------------+
15 rows in set (0.00 sec)

wikiadmin@deployment-db04[deploymentwiki]> select rev_id, rev_text_id, rev_timestamp from revision order by rev_id desc limit 15;
+--------+-------------+----------------+
| rev_id | rev_text_id | rev_timestamp  |
+--------+-------------+----------------+
|   7943 |       20981 | 20171219002242 |
|   7942 |       20980 | 20171218193859 |
|   7941 |       20979 | 20171218193849 |
|   7940 |       20978 | 20171218193735 |
|   7939 |       20977 | 20171218193138 |
|   7938 |       20976 | 20171218183128 |
|   7937 |       20975 | 20171218182715 |
|   7936 |       20974 | 20171218180348 |
|   7935 |       20973 | 20171218174138 |
|   7934 |       20972 | 20171218152013 |
|   7917 |       20900 | 20171211164804 |
|   7892 |       20783 | 20171202182607 |
|   7891 |       20782 | 20171202154042 |
|   7890 |       20336 | 20171202153840 |
|   7848 |       20696 | 20171126213200 |
+--------+-------------+----------------+
15 rows in set (0.00 sec)

1 RoanKattouw: Aaron tried to help with investigating the chronology issue yesterday, but didn't get anywhere, it seems. That stuff is a bit esoteric to me... do you have an idea how to investigate? Basically, new revisions fail to load seconds after an edit. It *looks* like the page table is out of sync with the revision table. But nobody knows how that can be.
2[12:36:40] <DanielK_WMDE> ah, addshore merged the revert
3[12:36:50] <DanielK_WMDE> I'll file a revert of the revert in a minute
4[12:36:54] <addshore> not yet, just fixing the commit message
5[12:37:00] <addshore> but gerrit has died for me
6[12:37:17] <DanielK_WMDE> wffm
7[12:37:18] <addshore> RoanKattouw: yes, the christmas time for merging and getting it on beta was the plan :)
8[12:37:41] <DanielK_WMDE> addshore: we failed to consider that christmas is a bad time to get people to investigate arcane issues, though :)
9[12:37:47] <DanielK_WMDE> aaron is gone for the holidays, i guess
10[12:37:49] <addshore> indeed
11[12:38:08] <DanielK_WMDE> i was expecting issues, but i was hoping they would be a bit more clear cut...
12[12:39:01] <RoanKattouw> haha yeah that's a problem
13
14[12:39:53] <RoanKattouw> I guess you could try to figure out what the state of the DB is that the broken requests see (revision table entry referring to a nonexistent page table entry? or the reverse?) and try to figure out if the order of insertions might possibly be wrong
15[12:40:21] <addshore> DanielK_WMDE: Just finished fixing the commit message issues, anything else we need to add to that revert?
16[12:40:34] <DanielK_WMDE> RoanKattouw: no, the page table references a revision that then fails to load (it may or may not be in the db, we don't know)
17[12:40:54] <DanielK_WMDE> addshore: i don't think so
18[12:41:15] <RoanKattouw> Hmm that's weird
19[12:41:31] <RoanKattouw> One thing I wonder is how the page table row was able to be written then
20[12:41:41] <RoanKattouw> Because you don't know the revision ID until after you've written the revision row
21[12:41:53] <addshore> hmm, beta db is currently locked?
22[12:42:08] <RoanKattouw> Have you isolated a broken case? Do broken cases stay broken for a long-ish period of time, or only briefly?
23[12:42:21] <addshore> RoanKattouw: I was about to say you can try it on beta, but the db is locked
24[12:42:34] <DanielK_WMDE> RoanKattouw: it does get written. the revision is found once replicas catch up. it just fails to load right after the edit.
25[12:43:23] <DanielK_WMDE> RoanKattouw: while investigating, aaron found that replication was broken on beta. we first thought that that was all there was to it. but fixing replication didn't fix the issue. it's all a bit... obscure
26[12:44:29] <DanielK_WMDE> RoanKattouw: to replicate, just edit. you get redirected to the "plain" url, and get an error (revision not found). you reload, and it works fine.
27[12:45:02] <DanielK_WMDE> RoanKattouw, addshore: the relevant code in Article pre-fills the page with that error message. so that's the error you'll see if *anything* goes wrong. which means we don't actually know what goes wrong.
28[12:46:19] <addshore> DanielK_WMDE: I dont think the beta replication issue is fixed
29[12:46:19] <addshore> "The database has been automatically locked while the replica database servers catch up to the master."
30[12:46:50] <RoanKattouw> "You reload, and it works fine" --> so it's not a long-lasting issue, it's brief
31[12:46:55] <RoanKattouw> That makes investigating it harder
32[12:47:34] <DanielK_WMDE> RoanKattouw: indeed. and MAYBE it's actually an issue with replication on beta, not with the code, as addshore just said...
33[12:47:39] <RoanKattouw> addshore: Hmm if replication is *completely* broken, then maybe that explains it, but I feel like this should be working even when replication is broken (or at least when it's very slow)
34[12:47:52] <RoanKattouw> Well, let's see if the revert fixes it
35[12:47:57] <addshore> https://deployment.wikimedia.beta.wmflabs.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb= is the same as when aaron looked yesterday
36[12:48:03] <RoanKattouw> If it does, then your code is doing at least something wrong
37[12:48:06] <addshore> wtf is up with beta
38[12:48:15] <DanielK_WMDE> RoanKattouw: that indeed is a meaningful experiment :)
39[12:48:51] <RoanKattouw> addshore: Hmm at that URL I see 0 lag right now
40[12:49:07] <addshore> RoanKattouw: try creating a page though, it says the db is locked due to lag
41[12:49:38] <RoanKattouw> I see
42[12:49:57] <RoanKattouw> Very strange
43[12:50:30] <RoanKattouw> The DB does appear to be read-only
44[12:50:35] <RoanKattouw> Marking notifications as read doesn't stick
45[12:51:06] <Zppix> Would jenkins doing a scap cause that?
46[12:51:15] <addshore> Zppix: the locking? no
47[12:51:22] <DanielK_WMDE> RoanKattouw: maybe it's r/o for a reason that has nothing to do with replag?
48[12:51:23] <RoanKattouw> No, something weird is going on
49[12:51:36] <RoanKattouw> I checked $wgReadOnly and it's false, so that isn't it
50[12:51:43] <addshore> DanielK_WMDE: when editing though the reason it shows says it is due to replag
51[12:51:52] <RoanKattouw> wfReadOnly() returns true, so let's see why
52[12:51:55] <DanielK_WMDE> addshore: wtf?
53[12:52:07] <addshore> https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/lastFailedBuild/console
54[12:52:26] <addshore> Wikimedia\\Rdbms\\DBReplicationWaitError from line 369 of /srv/mediawiki-staging/php-master/includes/libs/rdbms/lbfactory/LBFactory.php: Could not wait for replica DBs to catch up to deployment-db03
55[12:53:02] <addshore> first failure is Build #22122 (18-Dec-2017 19:20:00)
56[12:55:47] <RoanKattouw> OK I can confirm that replication is broken on deployment-db04
57[12:55:58] <addshore> RoanKattouw: any idea how we can fix that?
58[12:56:08] <RoanKattouw> https://www.irccloud.com/pastebin/tREOTlIr/
59[12:56:53] <addshore> ooooooh, well, it has things to do with externalstore and revisions / old text, could be related to the patch
60[12:56:58] <RoanKattouw> addshore: I know like 1/3 of the theory and I haven't done it before
61[12:57:26] <RoanKattouw> Yeah the error that broke replication is a primary key collision in the text table, I hadn't realized that that could be related to MCR but you're right
62[12:57:42] <RoanKattouw> The query itself specifies NULL as the old_id so that should have been fine
63[12:57:55] <addshore> That actually could be a very big hint as to what is wrong
64[12:58:27] <addshore> patch got merged, code started being run, replication broke, lag wasn't bad enough to cause a db lock for a while though, we keep seeing the same issue, and eventually the db locks up
65[12:58:52] <RoanKattouw> If replication breaks completely, it should only take 5 minutes for the lag to shut everything down
66[12:59:03] <RoanKattouw> At most; I'm not sure if MW detects this broken state specially and goes r/o immediately
67[12:59:28] <RoanKattouw> But obviously if replication is broken but lag measurements still work, the lag will increase at a rate of 1 second per second, and the 300s lag threshold will be hit pretty quickly
68[13:00:21] <RoanKattouw> OK so I've looked at the text table on deployment-db04 which interestingly has IDs going all the way up to 20982, and has the auto_increment value correctly set to 20983
69[13:00:24] <RoanKattouw> So it doesn't look broken
70[13:00:28] <RoanKattouw> Now going to look at that table on the master
71[13:01:18] <RoanKattouw> Whoa, wtf
72[13:01:24] <RoanKattouw> The text table on the master has FEWER entries
73[13:01:33] <addshore> interesting....... urm :P
74[13:01:42] <RoanKattouw> The highest old_id on the master is 20972
75[13:02:09] <addshore> 10 entries less
76[13:02:31] <addshore> and 10 edits have been made on deployment wiki since the code landed
77[13:02:38] <RoanKattouw> OK something weird happened
78[13:02:48] <RoanKattouw> The master and slave disagree on some rows:
79[13:03:07] <RoanKattouw> Master:
80[13:03:08] <RoanKattouw> https://www.irccloud.com/pastebin/mJeq37Zx/
81[13:03:18] <RoanKattouw> Slave:
82[13:03:22] <RoanKattouw> https://www.irccloud.com/pastebin/sRtUJazL/
83[13:03:33] <DanielK_WMDE> RoanKattouw: it'S possible that my patch broke that. I tried to keep that logic untouched while moving it to a different class, but it's possible I broke that.
84[13:03:49] <DanielK_WMDE> But it's also very possible that it's completely unrelated. I sure hope it is :)
85[13:03:58] <RoanKattouw> Any idea whether the difference between nativeDataArray and utf-8 is related to your patch?
86[13:04:18] <addshore> we definitely touched that area, so most likely
87[13:04:21] <RoanKattouw> Also note that there is perfect non-overlap here
88[13:04:32] <addshore> right, revert time it is
89[13:04:45] <RoanKattouw> The master and replica agree on everything up to oldid 20970 which has text address 8211
90[13:05:02] <Zppix> What is oldid 20970?
91[13:05:08] <DanielK_WMDE> RoanKattouw: nativeDataArray? What is that? Never heard of it. I moved the code that handles these flags. If the logic changed in the meantime, it's possible i accidentally reverted that, in an inconsistent way
92[13:05:12] <RoanKattouw> Then the replica picked up 8212-8217, 8219-8223 and 8225
93[13:05:16] <Zppix> Perhaps we could get a timeline?
94[13:05:22] <RoanKattouw> And the master picked up 8218 and 8224
95[13:05:30] <DanielK_WMDE> RoanKattouw: what's the timestamp of that revision?
96[13:05:37] <RoanKattouw> Which one, 20970?
97[13:05:41] <RoanKattouw> The last one they agree on?
98[13:05:43] <DanielK_WMDE> The patch was merged yesterday... I think around this time of day
99[13:06:22] <DanielK_WMDE> RoanKattouw: merged yesterday at 16:10 CET.
100[13:06:43] <RoanKattouw> There is no row in the revision table with either rev_id=20970 or rev_text_id=20970
101[13:07:41] <addshore> Merged @ Dec 18 15:10 PM UTC
102[13:07:52] <RoanKattouw> Wait, on the MASTER there isn't, on the replica there is
103[13:08:00] <DanielK_WMDE> ...what?!
104[13:08:02] <RoanKattouw> The DBs also disagree on what's in the revision table
105[13:08:22] <RoanKattouw> The 15 most recent revision table entries on the master:
106[13:08:24] <addshore> 15:18 it is on the beta cluster and scap has run
107[13:08:24] <RoanKattouw> https://www.irccloud.com/pastebin/lnRPQI4Z/
108[13:08:25] <DanielK_WMDE> that sounds fubar...
109[13:08:28] <RoanKattouw> And on the replica:
110[13:08:37] <RoanKattouw> https://www.irccloud.com/pastebin/r7AxQfDT/
111[13:08:57] <addshore> 15:19 I have already spotted the issue https://usercontent.irccloud-cdn.com/file/Mc9x8ROA/image.png
112[13:09:08] <addshore> that will be the first edit made with the new code
113[13:09:44] <RoanKattouw> Oh wait that's not a disagreement, my bad
114[13:09:49] <addshore> RoanKattouw: what is up with rev 7934 where does it appear & look like?
115[13:10:05] <DanielK_WMDE> RoanKattouw: nativeDataArray seems to be an AbuseFilter thing. Poor man's content model from many years ago, it seems
116[13:10:13] <RoanKattouw> There are strange gaps in the rev_id sequence, but the only real difference is the existence of rev_id 7944 on the master which seems reasonable
117[13:10:15] <RoanKattouw> Aah OK
118[13:10:16] <DanielK_WMDE> I didn't touch that. should still work as before.
119[13:10:26] <RoanKattouw> Right and the query that broke replication had an AbuseFilter method in its comment
120[13:10:36] <DanielK_WMDE> \O/
121[13:10:42] <RoanKattouw> | 7934 | 20972 | 20171218152013 |
122[13:10:50] <addshore> Right
123[13:10:52] <RoanKattouw> The two DBs agree on that
124[13:11:00] <DanielK_WMDE> Buuut.... that actually may indicate an issue with my patch. Maybe AbuseFilter is relying on internals that i now changed
125[13:11:06] <DanielK_WMDE> maybe I broke compat with AF
126[13:11:07] <RoanKattouw> That is actually exactly the same rev_text_id as in the replication error!
127[13:11:07] <addshore> DanielK_WMDE: indeed
128[13:11:22] <RoanKattouw> Last_SQL_Error: Error 'Duplicate entry '20972' for key 'PRIMARY'' on query. Default database: 'deploymentwiki'. Query: 'INSERT /* AbuseFilter::storeVarDump */ INTO `text` (old_id,old_text,old_flags) VALUES (NULL,'DB://cluster1/8224','nativeDataArray,gzip,external')'
129[13:11:30] <addshore> RoanKattouw: right, and I believe that is the first time i spotted the issue / the first edit i made after the code landed
130[13:11:48] <addshore> Welll...
131[13:11:48] <addshore> 15:19, 18 December 2017 Addshore (talk | contribs | block) deleted page TXTTN (view/restore)
132[13:12:01] <addshore> (diff | hist) . . N TXTTN‎; 15:20 . . (+7)‎ . . ‎Addshore (talk | contribs | block)‎ (Created page with "slakfa;")
133[13:12:01] <DanielK_WMDE> the timestamp matches
134[13:12:26] <RoanKattouw> addshore: Oh you deleted stuff, that could explain the rev_id gap
135[13:12:26] <DanielK_WMDE> addshore: but why is there an edit by abusefilter?... you didn't edit a filter, did you?
136[13:12:27] <addshore> and that rev i created at 15:20 is https://deployment.wikimedia.beta.wmflabs.org/w/index.php?title=TXTTN&oldid=7934
137[13:12:53] * DanielK_WMDE feels an abuse filter rewrite coming on
138[13:13:04] <addshore> RoanKattouw: yes, although in the UI i cant see what i deleted, or what timestamp or revid as the db is locked https://deployment.wikimedia.beta.wmflabs.org/wiki/Special:Undelete/TXTTN
139[13:13:22] <RoanKattouw> I have it
140[13:13:30] <RoanKattouw> https://www.irccloud.com/pastebin/Lh4gkYeA/
141[13:13:36] <RoanKattouw> It helps that it was the most recently deleted thing
142[13:14:29] <DanielK_WMDE> so, what was broken? AF or deletion?
143[13:14:31] <addshore> so i created a page, and it worked, i deleted it, it worked, i recreated it and started seeing the issue
144[13:14:45] <addshore> and that was due to abusefilter exploding during the first creation or deletion?
145[13:14:47] <DanielK_WMDE> how can *any* of this break replication, btw?
146[13:15:16] <DanielK_WMDE> addshore: why would abuse filter write revisions?
147[13:15:22] <DanielK_WMDE> unless you edit a filter?
148[13:15:24] <addshore> DanielK_WMDE: no idea
149[13:15:31] <addshore> So, "nativeDataArray" is ONLY used in AbuseFilter
150[13:15:32] <DanielK_WMDE> maybe someone else was editing a filter?
151[13:15:49] <DanielK_WMDE> (and maybe this is a red herring)?
152[13:16:00] <addshore> https://github.com/wikimedia/mediawiki-extensions-AbuseFilter/blob/master/includes/AbuseFilter.class.php#L1286
153[13:16:45] <addshore> *scrolls up to look at the flag mess we already discovered*
154[13:17:37] <addshore> 20970 has no nativeDataArray flag on either slave or master, 20971 and 20972 have it on master but not on the slave
155[13:17:43] <DanielK_WMDE> addshore: can you find the user & commit message associated with the nativeDataArray entries?
156[13:20:12] <RoanKattouw> Yeah old_id 20972 points to completely different things on the master and the replica
157[13:20:20] <RoanKattouw> Different flags but also different external store addresses
158[13:20:29] <DanielK_WMDE> how the hell does that happen?
159[13:20:34] <RoanKattouw> No idea
160[13:20:46] <addshore> its almost like some code is somehow managing to write to the slave...
161[13:20:47] <RoanKattouw> I'm digging through the binlogs right now trying to find out what other queries were run on the text table
162[13:20:53] <RoanKattouw> So far I've only found the one that killed replication
163[13:21:00] <RoanKattouw> Hmm yeah maybe
164[13:21:12] <addshore> it looks like abusefilter stuff manage to write to master only
165[13:21:19] <RoanKattouw> I thought that maybe the master and replica had switched roles, but maybe the replica is writeable when it shouldn't be? That would break replication pretty quickly too
166[13:21:22] <DanielK_WMDE> addshore: oooohhhh.... what would happen if you tried that? it should fail HARD! But does it?
167[13:21:44] <addshore> DanielK_WMDE: I would have thought it would fail really hard, but perhaps not?
168[13:21:46] <DanielK_WMDE> RoanKattouw: maybe aaron switched them yesterday when trying to fix things?
169[13:21:56] <Zppix> Have you guys tried writing to the replicas directly?
170[13:22:01] <RoanKattouw> That's possible
171[13:22:09] <Zppix> It should fail no?
172[13:22:10] <RoanKattouw> I'll check if the replica is read-only
173[13:22:39] <RoanKattouw> Welp, the replica is not read-only
174[13:23:11] <Zppix> Hmmm
175[13:23:16] <Zppix> That cant be good
176[13:23:41] <addshore> so, what is the first insert / update / delete we see on the replica that is not on the master?
177[13:23:44] <addshore> (if there is one) ?
178[13:24:12] <RoanKattouw> I'm trying to figure that out now
179[13:25:09] <RoanKattouw> The master's binlog file is 489MB
180[13:25:12] <addshore> I'll filter the messages out of this IRC channel and dump the chat log somewhere nice to look at once we have finished here, I think this is all rather valuable info we don't want to have to go hunting for again
181[13:25:24] <RoanKattouw> So analyzing it is a bit slow
182[13:26:24] <DanielK_WMDE> I'll look into making unit tests use a r/o wrapper for the DB REPLICA connection. That may not be easy, though
183[13:29:08] <RoanKattouw> OK so I'm going through the query history on both servers to try to prove that they received contradictory insertions
184[13:29:13] <RoanKattouw> But it's slow going
185[13:29:21] <addshore> I'm looking through the usages of DB SLAVE / MASTER in abusefilter and our new revstore stuff now
186[13:30:10] <addshore> $dbw = $this->getDBConnection( DB_REPLICA );
187[13:30:11] <RoanKattouw> There is not much in the beta SAL so I can't tell if db04 was legitimately the master at some point
188[13:30:12] <addshore> DanielK_WMDE: RoanKattouw ^^
189[13:30:14] <RoanKattouw> LOL
190[13:30:16] <RoanKattouw> Wel...
191[13:30:22] <RoanKattouw> That would do it
192[13:30:22] <addshore> Think I might have found it....
193[13:30:34] <RoanKattouw> Do we not have a mechanism in MW to prevent writes to DB_REPLICA connections?!
194[13:30:41] <addshore> RoanKattouw: apparently not
195[13:30:45] <addshore> it just does ->insert
196[13:31:13] <addshore> https://github.com/wikimedia/mediawiki/blob/master/includes/Storage/SqlBlobStore.php#L231
197[13:31:15] <DanielK_WMDE> RoanKattouw: looking into fixing that at least for unit tests
198[13:31:22] <DanielK_WMDE> ideally for everything.
199[13:31:50] <addshore> there should be a NoWrite implementation of the db interface
200[13:31:56] <RoanKattouw> Well that would do it
201[13:31:58] <addshore> that refuses to accept calls to any write methods
202[13:32:06] <RoanKattouw> That's exactly the place where we would expect this, too
203[13:32:11] <addshore> or in fact... 2 separate interfaces..
204[13:32:14] <RoanKattouw> OK I'm going to file a couple tasks here
205[13:32:20] <addshore> RoanKattouw: agreed
206[13:32:25] <RoanKattouw> 1) We need all replica DBs to have writes disabled, in labs as well as production
207[13:32:31] <RoanKattouw> 2)
208[13:32:45] <RoanKattouw> 2) Trying to write to a DB_REPLICA handle needs to throw an exception
209[13:32:46] <Zppix> If we disable writes how could masters write to replicas?
210[13:32:51] <apergos> O.o you can write to replicas??
211[13:32:58] <RoanKattouw> apergos: Yuuup
212[13:32:59] <addshore> apergos: on the beta cluster yes
213[13:33:02] <apergos> owie
214[13:33:05] <RoanKattouw> Zppix: Replication is exempt from the no-writes flag
215[13:33:06] <addshore> and within mediawiki yes,
216[13:33:11] <Zppix> Ah ok
217[13:33:20] <RoanKattouw> 3) deployment-db04 is hosed and needs to become un-hosed
218[13:33:29] <Zppix> (I am not too smart when it comes to db)
219[13:33:55] <RoanKattouw> (I wondered the same thing actually, so I looked up the documentation and it said that replication and admins were exempt)
220[13:34:33] <RoanKattouw> I'm going to begin with depooling deployment-db04 so as to unbreak the beta wikis
221[13:34:37] <apergos> what the heck is getting a db replica handle and trying to write to it anyways?
222[13:34:43] <RoanKattouw> Then we probably need to reprovision it or something
223[13:34:48] <RoanKattouw> apergos: The new MCR code :)
224[13:34:54] <RoanKattouw> https://github.com/wikimedia/mediawiki/blob/master/includes/Storage/SqlBlobStore.php#L231
225[13:34:56] <apergos> oh dear...
226[13:35:03] <RoanKattouw> Yeah...
227[13:35:13] <RoanKattouw> So the text table's IDs diverged on the master and replica DBs
228[13:35:21] <apergos> I know it's not really funny but I am lmao here
229[13:35:24] <addshore> there could perhaps be a very simple phpcs sniff for that too
230[13:35:26] <RoanKattouw> Then when the master tried to replicate its writes, replication died
231[13:35:30] <apergos> of course
232[13:35:33] <DanielK_WMDE> apergos: my bad. whoops. we need more eyes on this code...
233[13:35:52] <apergos> well the good thing is it gets discovered in labs
234[13:35:54] <RoanKattouw> apergos: If you're in #wikimedia-dev already, see my pastes in the backscroll there, you'll see the two servers disagreeing on the contents of the text and revision tables
235[13:36:03] <addshore> apergos: RoanKattouw the great news is this issue is much simpler to fix than some of the other possible reasons for breaking that we were thinking about.
236[13:36:07] <RoanKattouw> We also had revision IDs with different meanings on the two servers
237[13:36:11] <RoanKattouw> Yes
238[13:36:12] <apergos> eewwww
239[13:36:20] <RoanKattouw> And it's not a chronology protection issue
240[13:36:20] <DanielK_WMDE> RoanKattouw, addshore: i'm introducing allowWrite into the database base class. I'm also introducing a wrapper so writing to a replica connection will fail even if there is only one db and one connection.
241[13:36:22] <apergos> yes I suppose it would all go pear-shaped pretty soon after
242[13:36:33] <addshore> DanielK_WMDE: sounds good
243[13:36:34] <apergos> doing the backread in the other channel
244[13:36:49] <RoanKattouw> DanielK_WMDE: Yes that's good, and necessary to catch these things in dev. I'll file that task first so you can claim it
245[13:36:53] <addshore> DanielK_WMDE: shall I fix the line or you?
246[13:37:15] <DanielK_WMDE> RoanKattouw, addshore: thank you for helping with this!
247[13:37:18] <DanielK_WMDE> addshore: i'll fix it.
248[13:37:52] <Zppix> Im glad i mentioned directly writing to replicas :P
249[13:38:15] <DanielK_WMDE> Zppix: thank you, indeed :)+
250[13:38:39] <DanielK_WMDE> Zppix: I would have expected that to just fail HARD, not silently corrupt the database
251[13:38:51] <Zppix> I wonder if this possibly in prod as well
252[13:38:55] <Zppix> Possible*
253[13:39:06] <DanielK_WMDE> RoanKattouw: will you also file a ticket to make sure replicates are not writable in production? This kind of thing can SERIOUSLY mess up the database :(
254[13:39:15] <DanielK_WMDE> Zppix: yea, scary
255[13:39:33] <RoanKattouw> Yes
256[13:39:38] <RoanKattouw> See my 1-2-3 above
257[13:41:53] <apergos> that was a fun little read
258[13:42:08] <apergos> not easy to track down either
259[13:42:29] <DanielK_WMDE> yep... several wrong leads

1[16:35] <RoanKattouw> addshore: DanielK_WMDE_: (Moving here because both -tech and -dev are noisy) During my walk I figured out WHY the "revision does not exist" message happened. It was related to ChronologyProtector, but not in the way we thought: it happened BECAUSE it was doing its job
2[16:36] <addshore> :D
3[16:36] <RoanKattouw> For newly created pages/revisions, the text table rows were written to the replica (and so weren't on the master), but the revision table rows were written to the master (and replicated to the replica)
4[16:36] <RoanKattouw> When you first save the page, the replica hasn't caught up yet, so CP ensures that your next page view reads from the master (which never happens otherwise), because it's the only up-to-date server
5[16:37] <RoanKattouw> The master doesn't have the text row, so trying to get the text fails
6[16:37] <RoanKattouw> Then when you refresh, the replica has caught up, so your page view reads from the replica, and it has both rows so it works fine
7[16:37] <addshore> aaah, and then the second refresh reads from the replica
8[16:37] <addshore> Sounds like a thoughtful walk :)
9[16:37] <RoanKattouw> The real problems began when an AbuseFilter rule was hit, because AF was still writing to the master
10[16:38] <RoanKattouw> So the master assigns that text row an old_id which it thinks is the next available old_id, but the replica has already used that ID for something else
11[16:38] <addshore> and then that is the point the replication exploded
12[16:38] <RoanKattouw> Then when the replica tries to replicate that insertion, it fails because of an ID collision, and replication stops
13[16:39] <RoanKattouw> Leaving both the replica and the master in a broken state: the master has revision rows pointing to old_ids that don't exist, or if they do, point to AbuseFilter data
14[16:39] <addshore> RoanKattouw: so wikidatawiki on beta is also broken
15[16:40] <RoanKattouw> And the replica has one AbuseFilter log entry that points to an old_id that points to revision text (and no more, because replication stops at this point)
16[16:40] <RoanKattouw> Yeah I can imagine
17[16:40] <RoanKattouw> I'm just about to check the others
18[16:40] <addshore> https://phabricator.wikimedia.org/T183232
19[16:40] <addshore> well that ticket actually talks about enwiki
20[16:40] <RoanKattouw> I fixed enwiki and deploymentwiki by transferring the text rows from the replica to the master and updating the references to them in the revision table for their new IDs
21[16:40] <addshore> you can probably write a query to find all wikis that have edits between now and the time the patch was first merged / landed on beta
22[16:41] <RoanKattouw> I could do that but it's easier to just compare the text tables on the master and replica
23[16:41] <RoanKattouw> If there are rows they disagree on, that means I need to fix things
24[16:42] <RoanKattouw> Going to start doing that with wikidatawiki now
25[16:42] <addshore> well, sorry for this fallout, and thanks for helping!
26[16:43] <RoanKattouw> No worries!
27[16:43] <RoanKattouw> 4 layers failed here
28[16:43] <RoanKattouw> (Author, reviewer, MW DB abstraction, read only flag on the DB server)
29[16:43] <RoanKattouw> So I can hardly blame any individual one

Event Timeline

Ugh, I just realized that there are more DBs on these servers than just deploymentwiki, and that some of the others also have divergences and are likely to have a master DB that's not in a consistent state.

Right now my plan is:

  1. Manually correct/delete rows to put the master in a consistent state, for each of these databases (only a handful of rows are affected on the two databases I looked at, so hopefully this will be quick; a sketch of how the divergence can be spotted on each database follows this list)
  2. Depool the replica; this should bring the beta wikis out of read-only mode
  3. Get someone else to fix or reprovision the replica, because I don't know how to do that
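
For reference, a quick way to spot which databases have diverged (a sketch, assuming direct MySQL access to both deployment-db03 and deployment-db04; the exact commands used were not recorded here) is to compare the text table on both servers:

-- Run against the same wiki database on the master and on the replica;
-- differing results indicate divergence that needs manual repair.
SELECT COUNT(*) AS n_rows, MAX(old_id) AS max_old_id FROM text;
CHECKSUM TABLE text;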

Change 399182 had a related patch set uploaded (by Catrope; owner: Catrope):
[operations/mediawiki-config@master] Depool deployment-db04

https://gerrit.wikimedia.org/r/399182

Change 399182 merged by jenkins-bot:
[operations/mediawiki-config@master] Depool deployment-db04

https://gerrit.wikimedia.org/r/399182

Change 399188 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/core@master] [MCR] Fix SqlBlobStore using DB_REPLICA for writes

https://gerrit.wikimedia.org/r/399188

Change 399188 merged by jenkins-bot:
[mediawiki/core@master] [MCR] Fix SqlBlobStore using DB_REPLICA for writes

https://gerrit.wikimedia.org/r/399188

I depooled db04 but kept the labs wikis in read-only mode. Then I manually fixed the corruption on each wiki where I found some, by copying text table rows from db04 to db03 and updating rev_text_id to account for ID drift where necessary. Now that this is done, db03 is in a consistent state and I'm going to disable read-only mode. However, we have to keep db04 depooled because replication is still stopped, and I don't know how to restart it given that some rows in the text table differ between the two servers.
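
For illustration, the per-wiki repair amounts to something like the following (a sketch only; the IDs and the external store address are placeholders, and the real statements were worked out per affected row):

-- On deployment-db03 (master): re-insert a text row that only existed on
-- the replica, letting auto_increment assign a fresh old_id.
INSERT INTO text (old_text, old_flags)
VALUES ('DB://cluster1/NNNN', 'utf-8,gzip,external');
-- Suppose the copy was assigned old_id 20990 (hypothetical); repoint the
-- revision row that still references the replica-only ID (20973, hypothetical).
UPDATE revision SET rev_text_id = 20990 WHERE rev_text_id = 20973;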

Change 399207 had a related patch set uploaded (by Catrope; owner: Catrope):
[operations/mediawiki-config@master] Remove temporary read-only setting for beta labs

https://gerrit.wikimedia.org/r/399207

Change 399207 merged by jenkins-bot:
[operations/mediawiki-config@master] Remove temporary read-only setting for beta labs

https://gerrit.wikimedia.org/r/399207

greg renamed this task from Unbreak replication in beta labs to Unbreak replication in beta cluster. Dec 19 2017, 8:06 PM
greg updated the task description.

(Sorry, I was using the DBA project as a request for help, thanks for subscribing!)

From -releng:

20:16 < RoanKatto> greg-g: So https://phabricator.wikimedia.org/T183252#3848591 is basically my "I've 
                   done what I can and I need a professional to take over now" comment
20:17 < RoanKatto> I got it to a working state with only a master, and the replica is depooled, but it's 
                   hosed and I don't know how to fix it
20:19 < RoanKatto> I don't know enough about DBs to know how to get the replica into the same state as 
                   the master and then restart replication, and I don't know enough about 
                   cloudlabstoolbetaforge to know how to provision a fresh replica DB and throw away the 
                   old one

Sorry - I was playing with the subscribers and removed the DBA tag by mistake.

I don't think I will be able to help with this, as I am focused on fixing data drifts on core, which caused a production breakage a few days ago, and we want to make sure that doesn't happen again, especially during the holidays.
We, the DBAs, are a bit overwhelmed with tasks before the holiday break.

jcrespo triaged this task as Lowest priority. Dec 20 2017, 6:32 AM

No one is probably going to work on this any time soon, despite the non-trivial amount of time we spent training the releng team on how to back up and restore MySQL.

@daniel I would mark this as blocking one of the MCR-related tickets, but I'm not really sure which one, or whether to bother.

Change 399449 had a related patch set uploaded (by Catrope; owner: Catrope):
[operations/mediawiki-config@master] Revert "Depool deployment-db04"

https://gerrit.wikimedia.org/r/399449

Catrope claimed this task.

I exported the data on db03, imported it on db04, restarted replication, and am repooling it now.
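
For the record, the reprovisioning roughly corresponds to the following on deployment-db04 (a sketch; the binlog file/position and password are placeholders, and the dump/import itself happens with mysqldump outside of SQL):

-- After importing a fresh dump of deployment-db03 (e.g. taken with
-- mysqldump --master-data, which records the matching binlog coordinates),
-- point the replica back at the master and restart replication.
STOP SLAVE;
RESET SLAVE ALL;
CHANGE MASTER TO
  MASTER_HOST='deployment-db03.eqiad.wmflabs',
  MASTER_USER='repl',
  MASTER_PASSWORD='********',
  MASTER_LOG_FILE='deployment-db03-bin.000090',
  MASTER_LOG_POS=4;
START SLAVE;
-- Then check SHOW SLAVE STATUS \G: both Slave_IO_Running and
-- Slave_SQL_Running should be Yes before repooling.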

Change 399449 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "Depool deployment-db04"

https://gerrit.wikimedia.org/r/399449

@Addshore "blocking one of the MCR related tickets"

If things like this are highly blocking for MCR, please consider hiring a person with strong backend background, as this will be the least difficult thing that MCR will face in the future, with way more production breakages and performance issues. As we transmitted on our meeting with Daniel, we are too busy keeping regular wikipedias up (trying) to provide more resources to the project that aside from the regular maintenance/schema changes we provide to all others.

I exported the data on db03, imported it on db04, restarted replication, and am repooling it now.

Thanks @Catrope !

@Addshore "blocking one of the MCR related tickets"

If things like this are highly blocking for MCR, please consider hiring a person with strong backend background

If it was up to me we would already have a larger DBA team ;)

@Addshore "blocking one of the MCR related tickets"

If things like this are highly blocking for MCR, please consider hiring a person with strong backend background, as this will be the least difficult thing that MCR will face in the future, with way more production breakages and performance issues. As we transmitted on our meeting with Daniel, we are too busy keeping regular wikipedias up (trying) to provide more resources to the project that aside from the regular maintenance/schema changes we provide to all others.

I think the background here is not that this broke MCR, but that bad MCR code caused this breakage. MCR was actually working fine otherwise, or as well as it could have worked given how badly broken the DB state was.

The issue developed as follows: (this is basically a summary of the IRC logs in the task description)

  • A big MCR patch was merged. Buried in this patch was code that looked like $dbw = $this->getDBConnection( DB_REPLICA ); $dbw->insert( 'text', [ ... ] );
  • To our surprise, MediaWiki's database abstraction layer contained no logic preventing a replica DB connection from attempting write queries. I filed T183242: DB handles obtained with DB_REPLICA should not allow writes for this, and while Daniel is still working on a full solution, a basic patch preventing write queries on replica connections is already merged.
  • To my further surprise, the replica DB server in labs (deployment-db04) was not set to read-only, so the writes MediaWiki attempted actually went through.
  • As you can imagine, things went downhill from here pretty quickly. Replication broke and both of the DBs ended up with incomplete/inconsistent data:
    • The MCR code used a replica connection to insert rows into the text table. These rows were inserted on the replica but not on the master.
    • Rows in the revision table referring to these text rows (rev_text_id is a FK to old_id) did correctly get inserted on the master, and were replicated.
    • The first page view of a newly created page (by the creator) uses the master for reads because of ChronologyProtector, so these post-creation page views failed with an error: what MW saw when querying the master was a revision row pointing to a missing text row. Subsequent page views used the replica, which did have the right text row. This bug was spotted soon after the MCR code was deployed on beta, and initially it was suspected that ChronologyProtector somehow wasn't doing its job and allowed a lagged replica to be read from.
    • Some other code (in AbuseFilter) inserted rows into the text table using a master connection. However, the replica's auto_increment counter had gotten ahead of the master's at this point, so the master chose an old_id that was already in use on the replica. When the replica tried to replicate this insertion, it ran into a "duplicate value for primary key" error. Replication stopped. (A minimal SQL illustration of this collision follows the list.)
    • With replication broken on the sole replica, MediaWiki went into read-only mode (either immediately or because the maximum replication lag is exceeded soon after, not sure). People investigating the issue got more confused, and reverting the MCR patch didn't (appear to) help, because the DB was broken and needed manual repair.
  • I figured out that there were rows on the replica that weren't on the master, but wasn't sure how this might have happened. @Addshore correctly guessed that code might have tried to write to the replica, found the offending code, and fixed it.
  • Meanwhile Zppix asked what would happen if a write was sent to the replica; I checked the read-only status on the replica and found that it wasn't read-only.
  • I depooled the replica. This would have brought MW out of read-only mode, so I manually enabled read-only mode to make my job recovering the broken state easier.
  • At this point, the master had a revision table that was pointing to text rows that only existed on the replica, but other text rows with the same IDs but different contents had been inserted on the master instead. I copied the missing text rows from the replica to the master. They were inserted with different IDs, so I also adjusted the rev_text_id pointers for that. After doing this on all affected DBs (about half a dozen), I had put the master in a consistent state.
  • With the master fixed, I disabled read-only mode, but kept the replica depooled.
  • The next day, I used mysqldump to snapshot the master, and imported this snapshot on the replica. I then restarted replication and repooled the replica.
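
To make the ID collision concrete, here is a minimal reconstruction in SQL, using the old_id and external store addresses from the query output in the description (the exact write order is inferred, not logged):

-- On the replica (which should have been read-only), the misdirected
-- MediaWiki write consumed old_id 20972 for ordinary revision text:
INSERT INTO text (old_text, old_flags)
VALUES ('DB://cluster1/8213', 'utf-8,gzip,external');
-- On the master, AbuseFilter later inserted its own row and was handed the
-- same old_id 20972. That succeeded on the master, but when replication
-- replayed it on the replica (with the master's assigned ID) it collided
-- with the existing row and stopped with error 1062:
INSERT /* AbuseFilter::storeVarDump */ INTO text (old_id, old_text, old_flags)
VALUES (NULL, 'DB://cluster1/8224', 'nativeDataArray,gzip,external');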

Several things failed here, and we filed tasks for everything we could think of:

  • Both the author and the reviewer failed to notice the mistake (no task filed, because telling humans to make fewer mistakes doesn't work)
  • No errors were thrown when testing this code locally and in unit tests. On most local installs, and in all unit test runs, there is only one DB server, so getDBConnection( DB_REPLICA ) just returns the master connection. Daniel is working on a patch to make these connections (requested with DB_REPLICA but pointing to the master) refuse writes. This is T183242: DB handles obtained with DB_REPLICA should not allow writes.
  • When the code ran in beta, MW's database abstraction layer should have refused to send a write query to a server it knew to be a replica, but didn't. This is now fixed. The task for this is T183265: Write operations should fail on connections to database replicas. Not sure why it's still open, but there's an unmerged patch for it, so I'll go and review it.
  • The replica server in beta was somehow not set to read-only. I set it to read-only, and filed T183245: Ensure replica DB in labs is read-only for making sure that this doesn't happen in the future. (I also verified that all replicas in production are read-only; a sketch of that check follows below.)
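
For completeness, verifying and enforcing the read-only setting on a replica looks like this (a sketch; note that the replication threads and accounts with the SUPER privilege are exempt from read_only, so replication itself keeps working):

-- On each replica:
SHOW GLOBAL VARIABLES LIKE 'read_only';  -- should be ON
SET GLOBAL read_only = ON;               -- requires SUPER; also persist it in my.cnf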

@Catrope thank you for the detailed post mortem!

I think the background here is not that this broke MCR

To clarify, in my comment T183252#3853489 I didn't say or mean that MCR was broken or that MCR broke something, nor was it a comment about the breakage (of which I do not know much). Catrope's postmortem should be added to T183232, as that seems to be the real issue, not here.

My comment was specifically about the unbreakage: more people should know how to clone a server and "suffer" it, and, as I commented above, we have trained other people on it and are glad to train even more people, too. I only mentioned MCR because it seems to be a highly impacting change and could definitely use some extra backend (practical) knowledge. Everyone could benefit from being able to handle an environment such as beta, where there are not so many restrictions regarding permissions, or their own docker containers with full installations. That doesn't mean DBAs should not exist or that we won't help, but being "too vertical" worries me, especially given the small resources some (most?) people have, which leads to blockers.

My comment was specifically about the unbreakage: more people should know how to clone a server and "suffer" it, and, as I commented above, we have trained other people on it and are glad to train even more people, too.

Are there any written docs on wikitech for the process that you are aware of?
I couldn't find any docs for the beta cluster specifically, but perhaps there are docs for prod?

BTW, let's move discussion to a separate ticket, this is technically closed.