Any attempt to perform a write operation on a database connection that was obtained for the DB_REPLICA index should fail hard. Two different cases need to be covered:
- Connections to actual database replicas must not allow write operations (T183265)
- When the connection to the master database is selected in response to a DB_REPLICA request, that connection needs to be wrapped to disallow write operations (see the sketch below). This is particularly relevant for catching bugs early in unit tests and development environments, which typically have a single-database setup.
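A minimal sketch of what such a wrapper could look like, assuming a hypothetical `ReadOnlyDatabaseWrapper` proxy; the class name, the method list, and the exception type are illustrative, not the actual rdbms API:

```php
use Wikimedia\Rdbms\IDatabase;

/**
 * Hypothetical proxy that forwards all calls to the wrapped connection
 * but fails hard on anything that would modify data.
 */
class ReadOnlyDatabaseWrapper {
	/** @var IDatabase */
	private $conn;

	/** Mutating methods that must never run on a DB_REPLICA handle. */
	private static $writeMethods = [
		'insert', 'update', 'delete', 'replace', 'upsert', 'deleteJoin',
	];

	public function __construct( IDatabase $conn ) {
		$this->conn = $conn;
	}

	public function __call( $name, array $arguments ) {
		if ( in_array( $name, self::$writeMethods, true ) ) {
			throw new RuntimeException(
				"$name() called on a connection obtained for DB_REPLICA"
			);
		}
		// Reads (select(), selectRow(), SELECT-only query(), ...) pass through.
		return $this->conn->$name( ...$arguments );
	}
}
```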
Background:
This code, introduced by a recent MCR change, obtains a DB handle using DB_REPLICA and then writes to it:
```php
$dbw = $this->getDBConnection( DB_REPLICA );

$old_id = $dbw->nextSequenceValue( 'text_old_id_seq' );
$dbw->insert(
	'text',
	[
		'old_id' => $old_id,
		'old_text' => $data,
		'old_flags' => $flags,
	],
	__METHOD__
);
$textId = $dbw->insertId();
```
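For comparison, the immediate fix at this call site is presumably just to request a master connection for the write (a sketch; `DB_MASTER` is the counterpart constant to `DB_REPLICA`):

```php
// Blob writes must go to the master; DB_REPLICA handles are for reads only.
$dbw = $this->getDBConnection( DB_MASTER );

$old_id = $dbw->nextSequenceValue( 'text_old_id_seq' );
$dbw->insert(
	'text',
	[
		'old_id' => $old_id,
		'old_text' => $data,
		'old_flags' => $flags,
	],
	__METHOD__
);
$textId = $dbw->insertId();
```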
Shockingly, nothing in MediaWiki prevented a write on a replica connection. Because the replica DB server in beta labs was not set to read-only (see T183245: Ensure replica DB in labs is read-only), the write actually went through on the replica server, which of course broke replication once the auto-increment counters on the text table drifted out of sync.
A DB connection handle obtained with DB_REPLICA should not allow writes. Any attempt to call ->insert(), ->delete(), ->replace(), etc. should throw an exception, even if the underlying connection is to the master (development environments typically only have one DB server, but we need to catch these errors early; thanks to @daniel for realizing this).
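From the caller's perspective the expected behavior is roughly the following; whether the exception is `DBReadOnlyError` or some other class is an implementation detail:

```php
$dbr = $lb->getConnection( DB_REPLICA );

try {
	// Must throw, even on a single-server setup where $dbr is
	// physically the same server as the master.
	$dbr->insert( 'text', [ 'old_text' => 'x' ], __METHOD__ );
} catch ( Wikimedia\Rdbms\DBReadOnlyError $e ) {
	// Fails hard instead of silently corrupting the replica,
	// as happened on the beta cluster.
}
```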
Chat log:
| # | Message |
|---|---|
| 1 | RoanKattouw: Aaron tried to help with investigating the chronology issue yesterday, but didn't get anywhere, it seems. That stuff is a bit esoterik to me... do you have an idea how to investigate? Basically, new revisions fail to load seconds after an edit. it *looks* like the page table is out iof sync with the revision table. But nobody knows how that can be. |
| 2 | [12:36:40] <DanielK_WMDE> ah, addshore merged the revert |
| 3 | [12:36:50] <DanielK_WMDE> I'll file a revert of the revert in a minute |
| 4 | [12:36:54] <addshore> not yet, just fixing the commit message |
| 5 | [12:37:00] <addshore> but gerrit has died for me |
| 6 | [12:37:17] <DanielK_WMDE> wffm |
| 7 | [12:37:18] <addshore> RoanKattouw: yes, the christmas time for merging and getting it on beta was the plan :) |
| 8 | [12:37:41] <DanielK_WMDE> addshore: we failed to consider that christmas is a bad time to get people to investigate arcane issues, though :) |
| 9 | [12:37:47] <DanielK_WMDE> aaron is gone for the holidays, i guess |
| 10 | [12:37:49] <addshore> indeed |
| 11 | [12:38:08] <DanielK_WMDE> i was expecting issues, but i was hoping they would be a bit more clear cut... |
| 12 | [12:39:01] <RoanKattouw> haha yeah that's a problem |
| 14 | [12:39:53] <RoanKattouw> I guess you could try to figure out what the state of the DB is that the broken requests see (revision table entry referring to a nonexistent page table entry? or the reverse?) and try to figure out if the order of insertions might possibly be wrong |
| 15 | [12:40:21] <addshore> DanielK_WMDE: Just finished fixing the commit message issues, anything else we need to add to that revert? |
| 16 | [12:40:34] <DanielK_WMDE> RoanKattouw: no, thje page table references a revision that then fails to load (it may or may not be in the db, we don't know) |
| 17 | [12:40:54] <DanielK_WMDE> addshore: i don't think so |
| 18 | [12:41:15] <RoanKattouw> Hmm that's weird |
| 19 | [12:41:31] <RoanKattouw> One thing I wonder is how the page table row was able to be written then |
| 20 | [12:41:41] <RoanKattouw> Because you don't know the revision ID until after you've written the revision row |
| 21 | [12:41:53] <addshore> hmm, beta db is currently locked? |
| 22 | [12:42:08] <RoanKattouw> Have you isolated a broken case? Do broken cases stay broken for a long-ish period of time, or only briefly? |
| 23 | [12:42:21] <addshore> RoanKattouw: I was about to say you can try it on beta, but the db is locked |
| 24 | [12:42:34] <DanielK_WMDE> RoanKattouw: it does get written. the revision is found once replicas catch up. it just fails to load right after the edit. |
| 25 | [12:43:23] <DanielK_WMDE> RoanKattouw: while investigating, aaron found that replication weas broken on beta. we firt thought that that was all there was to it. but fixing replication didn't fix the issue. it's all a bit... obscure |
| 26 | [12:44:29] <DanielK_WMDE> RoanKattouw: to replicate, just edit. you get redirected to the "plain" url, and get an error (revision not found). you reload, and it works fine. |
| 27 | [12:45:02] <DanielK_WMDE> RoanKattouw, addshore: the relevant code in Article pre-fills the page with that error message. so that's the error you'll see if *anything* goes wrong. which means we don't actually know what goes wrong. |
| 28 | [12:46:19] <addshore> DanielK_WMDE: I dont think the beta replication issue is fixed |
| 29 | [12:46:19] <addshore> "The database has been automatically locked while the replica database servers catch up to the master." |
| 30 | [12:46:50] <RoanKattouw> "You reload, and it works fine" --> so it's not a long-lasting issue, it's brief |
| 31 | [12:46:55] <RoanKattouw> That makes investigating it harder |
| 32 | [12:47:34] <DanielK_WMDE> RoanKattouw: indeed. and MAYBE it's actually an issue with replication on beta, not with the code, as addshore just said... |
| 33 | [12:47:39] <RoanKattouw> addshore: Hmm if replication is *completely* broken, then maybe that explains it, but I feel like this should be working even when replication is broken (or at least when it's very slow) |
| 34 | [12:47:52] <RoanKattouw> Well, let's see if the revert fixes it |
| 35 | [12:47:57] <addshore> https://deployment.wikimedia.beta.wmflabs.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb= is the same as when arron looked yesterday |
| 36 | [12:48:03] <RoanKattouw> If it does, then your code is doing at least something wrong |
| 37 | [12:48:06] <addshore> wtf is up with beta |
| 38 | [12:48:15] <DanielK_WMDE> RoanKattouw: that indeed is a meaningful experiment :) |
| 39 | [12:48:51] <RoanKattouw> addshore: Hmm at that URL I see 0 lag right now |
| 40 | [12:49:07] <addshore> RoanKattouw: try creating a page though, it says the db is locked due to lag |
| 41 | [12:49:38] <RoanKattouw> I see |
| 42 | [12:49:57] <RoanKattouw> Very strange |
| 43 | [12:50:30] <RoanKattouw> The DB does appear to be read-only |
| 44 | [12:50:35] <RoanKattouw> Marking notifications as read doesn't stick |
| 45 | [12:51:06] <Zppix> Would jenkins doing a scap cause that? |
| 46 | [12:51:15] <addshore> Zppix: the locking? no |
| 47 | [12:51:22] <DanielK_WMDE> RoanKattouw: maybe it's r/o for a reason that has nothing to do with replag? |
| 48 | [12:51:23] <RoanKattouw> No, something weird is going on |
| 49 | [12:51:36] <RoanKattouw> I checked $wgReadOnly and it's false, so that isn't it |
| 50 | [12:51:43] <addshore> DanielK_WMDE: when editing though the reason it shows says it is due to replag |
| 51 | [12:51:52] <RoanKattouw> wfReadOnly() returns true, so let's see why |
| 52 | [12:51:55] <DanielK_WMDE> addshore: wtf? |
| 53 | [12:52:07] <addshore> https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/lastFailedBuild/console |
| 54 | [12:52:26] <addshore> Wikimedia\\Rdbms\\DBReplicationWaitError from line 369 of /srv/mediawiki-staging/php-master/includes/libs/rdbms/lbfactory/LBFactory.php: Could not wait for replica DBs to catch up to deployment-db03 |
| 55 | [12:53:02] <addshore> first failure is Build #22122 (18-Dec-2017 19:20:00) |
| 56 | [12:55:47] <RoanKattouw> OK I can confirm that replication is broken on deployment-db04 |
| 57 | [12:55:58] <addshore> RoanKattouw: any idea how we can fix that? |
| 58 | [12:56:08] <RoanKattouw> https://www.irccloud.com/pastebin/tREOTlIr/ |
| 59 | [12:56:53] <addshore> ooooooh, well, it has things to do with externalstore and revisions / old text, could be related too the patch |
| 60 | [12:56:58] <RoanKattouw> addshore: I know like 1/3 of the theory and I haven't done it before |
| 61 | [12:57:26] <RoanKattouw> Yeah the error that broke replication is a primary key collision in the text table, I hadn't realized that that could be related to MCR but you're right |
| 62 | [12:57:42] <RoanKattouw> The query itself specifies NULL as the old_id so that should have been fine |
| 63 | [12:57:55] <addshore> That actually could be a very big hint as to what is wrong |
| 64 | [12:58:27] <addshore> patch got merged, code started being run, replication broke, lag wasn't bad enough to cause a db lock for a while though, we keep seeing the same issue, and eventually the db locks up |
| 65 | [12:58:52] <RoanKattouw> If replication breaks completely, it should only take 5 minutes for the lag to shut everything down |
| 66 | [12:59:03] <RoanKattouw> At most; I'm not sure if MW detects this broken state specially and goes r/o immediately |
| 67 | [12:59:28] <RoanKattouw> But obviously if replication is broken but lag measurements still work, the lag will increase at a rate of 1 second per second, and the 300s lag threshold will be hit pretty quickly |
| 68 | [13:00:21] <RoanKattouw> OK so I've looked at the text table on deployment-db04 which interestingly has IDs going all the way up to 20982, and has the auto_increment value correctly set to 20983 |
| 69 | [13:00:24] <RoanKattouw> So it doesn't look broken |
| 70 | [13:00:28] <RoanKattouw> Now going to look at that table on the master |
| 71 | [13:01:18] <RoanKattouw> Whoa, wtf |
| 72 | [13:01:24] <RoanKattouw> The text table on the master has FEWER entries |
| 73 | [13:01:33] <addshore> interesting....... urm :P |
| 74 | [13:01:42] <RoanKattouw> The highest old_id on the master is 20972 |
| 75 | [13:02:09] <addshore> 10 entries less |
| 76 | [13:02:31] <addshore> and 10 edits have been made on deployment wiki since the code landed |
| 77 | [13:02:38] <RoanKattouw> OK something weird happened |
| 78 | [13:02:48] <RoanKattouw> The master and slave disagree on some rows: |
| 79 | [13:03:07] <RoanKattouw> Master: |
| 80 | [13:03:08] <RoanKattouw> https://www.irccloud.com/pastebin/mJeq37Zx/ |
| 81 | [13:03:18] <RoanKattouw> Slave: |
| 82 | [13:03:22] <RoanKattouw> https://www.irccloud.com/pastebin/sRtUJazL/ |
| 83 | [13:03:33] <DanielK_WMDE> RoanKattouw: it'S possible that my patch broke that. I tried to keep that logic untouched while moving it to a different class, but it's possible I broke that. |
| 84 | [13:03:49] <DanielK_WMDE> But it's also very possible that it's completely unrelated. I sure hope it is :) |
| 85 | [13:03:58] <RoanKattouw> Any idea whether the difference between nativeDataArray and utf-8 is related to your patch? |
| 86 | [13:04:18] <addshore> we defintly touched that area, so most likely |
| 87 | [13:04:21] <RoanKattouw> Also note that there is perfect non-overlap here |
| 88 | [13:04:32] <addshore> right, revert time it is |
| 89 | [13:04:45] <RoanKattouw> The master and replica agree on everything up to oldid 20970 which has text address 8211 |
| 90 | [13:05:02] <Zppix> What is oldid 20970? |
| 91 | [13:05:08] <DanielK_WMDE> RoanKattouw: nativeDataArray? What is that? Never heard of it. I moved the code that handles these flags. If the logic changed in the meantime, it's possible i accidentally reverted that, in an inconsistent way |
| 92 | [13:05:12] <RoanKattouw> Then the replica picked up 8212-8217, 8219-8223 and 8225 |
| 93 | [13:05:16] <Zppix> Perhaps we could get a timeline? |
| 94 | [13:05:22] <RoanKattouw> And the master picked up 8218 and 8224 |
| 95 | [13:05:30] <DanielK_WMDE> RoanKattouw: what's the timestamp of that revision? |
| 96 | [13:05:37] <RoanKattouw> Which one, 20970? |
| 97 | [13:05:41] <RoanKattouw> The last one they agree on? |
| 98 | [13:05:43] <DanielK_WMDE> The patch was merged yesterday... I think around this time of day |
| 99 | [13:06:22] <DanielK_WMDE> RoanKattouw: merged yesterday at 16:10 CET. |
| 100 | [13:06:43] <RoanKattouw> There is no row in the revision table with either rev_id=20970 or rev_text_id=20970 |
| 101 | [13:07:41] <addshore> Merged @ Dec 18 15:10 PM UTC |
| 102 | [13:07:52] <RoanKattouw> Wait, on the MASTER there isn't, on the replica there is |
| 103 | [13:08:00] <DanielK_WMDE> ...what?! |
| 104 | [13:08:02] <RoanKattouw> The DBs also disagree on what's in the revision table |
| 105 | [13:08:22] <RoanKattouw> The 15 most recent revision table entries on the master: |
| 106 | [13:08:24] <addshore> 15:18 it is on the beta clster and scap has run |
| 107 | [13:08:24] <RoanKattouw> https://www.irccloud.com/pastebin/lnRPQI4Z/ |
| 108 | [13:08:25] <DanielK_WMDE> that sounds fubar... |
| 109 | [13:08:28] <RoanKattouw> And on the replica: |
| 110 | [13:08:37] <RoanKattouw> https://www.irccloud.com/pastebin/r7AxQfDT/ |
| 111 | [13:08:57] <addshore> 15:19 I have already spotted the issue https://usercontent.irccloud-cdn.com/file/Mc9x8ROA/image.png |
| 112 | [13:09:08] <addshore> that will be the first edit made with the new code |
| 113 | [13:09:44] <RoanKattouw> Oh wait that's not a disagreement, my bad |
| 114 | [13:09:49] <addshore> RoanKattouw: what is up with rev 7934 where does it appear & look like? |
| 115 | [13:10:05] <DanielK_WMDE> RoanKattouw: nativeDataArray seems to be an AbuseFilter thing. Poor man's content model from many years ago, it seems |
| 116 | [13:10:13] <RoanKattouw> There are strange gaps in the rev_id sequence, but the only real difference is the existence of rev_id 7944 on the master which seems reasonable |
| 117 | [13:10:15] <RoanKattouw> Aah OK |
| 118 | [13:10:16] <DanielK_WMDE> I didn't touch that. should still work as before. |
| 119 | [13:10:26] <RoanKattouw> Right and the query that broke replication had an AbuseFilter method in its comment |
| 120 | [13:10:36] <DanielK_WMDE> \O/ |
| 121 | [13:10:42] <RoanKattouw> \| 7934 \| 20972 \| 20171218152013 \| |
| 122 | [13:10:50] <addshore> Right |
| 123 | [13:10:52] <RoanKattouw> The two DBs agree on that |
| 124 | [13:11:00] <DanielK_WMDE> Buuut.... that actually may indicate an issue with my patch. Maybe AbuseFilter is relying on internals that i now changed |
| 125 | [13:11:06] <DanielK_WMDE> maybe I broke compat with AF |
| 126 | [13:11:07] <RoanKattouw> That is actually exactly the same rev_text_id as in the replication error! |
| 127 | [13:11:07] <addshore> DanielK_WMDE: indeed |
| 128 | [13:11:22] <RoanKattouw> Last_SQL_Error: Error 'Duplicate entry '20972' for key 'PRIMARY'' on query. Default database: 'deploymentwiki'. Query: 'INSERT /* AbuseFilter::storeVarDump */ INTO `text` (old_id,old_text,old_flags) VALUES (NULL,'DB://cluster1/8224','nativeDataArray,gzip,external')' |
| 129 | [13:11:30] <addshore> RoanKattouw: right, and I believe that is the first time i spotted the issue / the first edit i made after the code landed |
| 130 | [13:11:48] <addshore> Welll... |
| 131 | [13:11:48] <addshore> 15:19, 18 December 2017 Addshore (talk \| contribs \| block) deleted page TXTTN (view/restore) |
| 132 | [13:12:01] <addshore> (diff \| hist) . . N TXTTN; 15:20 . . (+7) . . Addshore (talk \| contribs \| block) (Created page with "slakfa;") |
| 133 | [13:12:01] <DanielK_WMDE> the timestamp matches |
| 134 | [13:12:26] <RoanKattouw> addshore: Oh you deleted stuff, that could explain the rev_id gap |
| 135 | [13:12:26] <DanielK_WMDE> addshore: but why is there an edit by abusefilter?... you didn't edit a filter, did you? |
| 136 | [13:12:27] <addshore> and that rev i created at 15:20 is https://deployment.wikimedia.beta.wmflabs.org/w/index.php?title=TXTTN&oldid=7934 |
| 137 | [13:12:53] * DanielK_WMDE feels an abuse filter rewrite coming on |
| 138 | [13:13:04] <addshore> RoanKattouw: yes, although in the UI i cant see what i deleted, or what timestamp or revid as the db is locked https://deployment.wikimedia.beta.wmflabs.org/wiki/Special:Undelete/TXTTN |
| 139 | [13:13:22] <RoanKattouw> I have it |
| 140 | [13:13:30] <RoanKattouw> https://www.irccloud.com/pastebin/Lh4gkYeA/ |
| 141 | [13:13:36] <RoanKattouw> It helps that it was the most recently deleted thing |
| 142 | [13:14:29] <DanielK_WMDE> so, what was broken? AF or deletion? |
| 143 | [13:14:31] <addshore> so i created a page, and it worked, iu deleted it, it wokred, i recreated it and started seeing the issue |
| 144 | [13:14:45] <addshore> and that was due to abusefilter exploding during the first creation or deletion? |
| 145 | [13:14:47] <DanielK_WMDE> how can *any* of this break replication, btw? |
| 146 | [13:15:16] <DanielK_WMDE> addshore: why would abuse filter write revisions? |
| 147 | [13:15:22] <DanielK_WMDE> unless youedit a filter? |
| 148 | [13:15:24] <addshore> DanielK_WMDE: no idea |
| 149 | [13:15:31] <addshore> So, "nativeDataArray" is ONLY used in AbuseFilter |
| 150 | [13:15:32] <DanielK_WMDE> maybe someone else was editing a filter? |
| 151 | [13:15:49] <DanielK_WMDE> (and maybe this is a red herring)? |
| 152 | [13:16:00] <addshore> https://github.com/wikimedia/mediawiki-extensions-AbuseFilter/blob/master/includes/AbuseFilter.class.php#L1286 |
| 153 | [13:16:45] <addshore> *scrolls up to look at the flag mess we already discovered* |
| 154 | [13:17:37] <addshore> 20970 has no nativeDataArray flag on either slave or master, 20971 and 20972 have it on master but not on the slave |
| 155 | [13:17:43] <DanielK_WMDE> addshore: can you find the user & commit message associated with the nativeArrayData entries? |
| 156 | [13:20:12] <RoanKattouw> Yeah old_id 20972 points to completely different things on the master and the replica |
| 157 | [13:20:20] <RoanKattouw> Different flags but also different external store addresses |
| 158 | [13:20:29] <DanielK_WMDE> how the hell does that happen? |
| 159 | [13:20:34] <RoanKattouw> No idea |
| 160 | [13:20:46] <addshore> its almost like some code is somehow manging to write to the slave... |
| 161 | [13:20:47] <RoanKattouw> I'm digging though the binlogs right now trying to find out what other queries were run on the tex ttable |
| 162 | [13:20:53] <RoanKattouw> So far I've only found the one that killed replication |
| 163 | [13:21:00] <RoanKattouw> Hmm yeah maybe |
| 164 | [13:21:12] <addshore> it looks like abusefilter stuff manage to write to master only |
| 165 | [13:21:19] <RoanKattouw> I thought that maybe the master and replica had switched roles, but maybe the replica is writeable when it shouldn't be? That would break replication pretty quickly too |
| 166 | [13:21:22] <DanielK_WMDE> addshore: oooohhhh.... what would happen if you tried that? it should fail HARD! But does it? |
| 167 | [13:21:44] <addshore> DanielK_WMDE: I would have thought it would fail really hard, but perhaps not?> |
| 168 | [13:21:46] <DanielK_WMDE> RoanKattouw: maybe aaron switched them yesterday when trying to fix things? |
| 169 | [13:21:56] <Zppix> Have you guys tried writing to the replicas directly? |
| 170 | [13:22:01] <RoanKattouw> That's possible |
| 171 | [13:22:09] <Zppix> It should fail no? |
| 172 | [13:22:10] <RoanKattouw> I'll check if the replica is read-only |
| 173 | [13:22:39] <RoanKattouw> Welp, the replica is not read-only |
| 174 | [13:23:11] <Zppix> Hmmm |
| 175 | [13:23:16] <Zppix> That cant be good |
| 176 | [13:23:41] <addshore> so, what is the first insert / update / delete we see on the replica that is not on the master? |
| 177 | [13:23:44] <addshore> (if there is one) ? |
| 178 | [13:24:12] <RoanKattouw> I'm trying to figure that out now |
| 179 | [13:25:09] <RoanKattouw> The master's binlog file is 489MB |
| 180 | [13:25:12] <addshore> I'll filter out messages out of this IRC channel and dump the chat log somewhere nice to look at once we have finished here, I think this is all rather valuable info we don't want to have to go hunting for agian |
| 181 | [13:25:24] <RoanKattouw> So analyzing it is a bit slow |
| 182 | [13:26:24] <DanielK_WMDE> I'll look into making unit tests use a r/o wrapper for the DB REPLICA connection. That may not be easy, though |
| 183 | [13:29:08] <RoanKattouw> OK so I'm going through the query history on both servers to try to prove that they received contradictory insertions |
| 184 | [13:29:13] <RoanKattouw> But it's slow going |
| 185 | [13:29:21] <addshore> I'm looking through the usages of DB SLAVE / MASTER in abusefilter and out new revstore stuff now |
| 186 | [13:30:10] <addshore> $dbw = $this->getDBConnection( DB_REPLICA ); |
| 187 | [13:30:11] <RoanKattouw> There is not much in the beta SAL so I can't tell if db04 was legitimately the master at some point |
| 188 | [13:30:12] <addshore> DanielK_WMDE: RoanKattouw ^^ |
| 189 | [13:30:14] <RoanKattouw> LOL |
| 190 | [13:30:16] <RoanKattouw> Wel... |
| 191 | [13:30:22] <RoanKattouw> That would do it |
| 192 | [13:30:22] <addshore> Think I might have found it.... |
| 193 | [13:30:34] <RoanKattouw> Do we not have a mechanism in MW to prevent writes to DB_REPLICA connections?! |
| 194 | [13:30:41] <addshore> RoanKattouw: apparently not |
| 195 | [13:30:45] <addshore> it just does ->insert |
| 196 | [13:31:13] <addshore> https://github.com/wikimedia/mediawiki/blob/master/includes/Storage/SqlBlobStore.php#L231 |
| 197 | [13:31:15] <DanielK_WMDE> RoanKattouw: looking into fixing that at least for unit tests |
| 198 | [13:31:22] <DanielK_WMDE> ideally for everything. |
| 199 | [13:31:50] <addshore> there should be a NoWrite implementation of the db interface |
| 200 | [13:31:56] <RoanKattouw> Well that would do it |
| 201 | [13:31:58] <addshore> that refuses to accept calls to any writ methods |
| 202 | [13:32:06] <RoanKattouw> That's exactly the place where we would expect this, too |
| 203 | [13:32:11] <addshore> or infact... 2 separate interfaces.. |
| 204 | [13:32:14] <RoanKattouw> OK I'm going to file a couple tasks here |
| 205 | [13:32:20] <addshore> RoanKattouw: agreed |
| 206 | [13:32:25] <RoanKattouw> 1) We need all replica DBs to have writes disabled, in labs as well as production |
| 207 | [13:32:31] <RoanKattouw> 2) |
| 208 | [13:32:45] <RoanKattouw> 2) Trying to write to a DB_REPLICA handle needs to throw an exception |
| 209 | [13:32:46] <Zppix> If we disable writes how could masters write to replicas? |
| 210 | [13:32:51] <apergos> O.o you can write to replicas?? |
| 211 | [13:32:58] <RoanKattouw> apergos: Yuuup |
| 212 | [13:32:59] <addshore> apergos: on the beta cluster yes |
| 213 | [13:33:02] <apergos> owie |
| 214 | [13:33:05] <RoanKattouw> Zppix: Replication is exempt from the no-writes flag |
| 215 | [13:33:06] <addshore> and within mediawiki yes, |
| 216 | [13:33:11] <Zppix> Ah ok |
| 217 | [13:33:20] <RoanKattouw> 3) deployment-db04 is hosed and needs to become un-hosed |
| 218 | [13:33:29] <Zppix> (I am not to smart when it comes to db) |
| 219 | [13:33:55] <RoanKattouw> (I wondered the same thing actually, so I looked up the documentation and it said that replication and admins were exempt) |
| 220 | [13:34:33] <RoanKattouw> I'm going to begin with depooling deployment-db04 so as to unbreak the beta wikis |
| 221 | [13:34:37] <apergos> what the heck is getting a db replica handle and trying to write to it anyways? |
| 222 | [13:34:43] <RoanKattouw> Then we probably need to reprovision it or something |
| 223 | [13:34:48] <RoanKattouw> apergos: The new MCR code :) |
| 224 | [13:34:54] <RoanKattouw> https://github.com/wikimedia/mediawiki/blob/master/includes/Storage/SqlBlobStore.php#L231 |
| 225 | [13:34:56] <apergos> oh dear... |
| 226 | [13:35:03] <RoanKattouw> Yeah... |
| 227 | [13:35:13] <RoanKattouw> So the text table's IDs diverged on the master and replica DBs |
| 228 | [13:35:21] <apergos> I know it's not really funny but I am lmao here |
| 229 | [13:35:24] <addshore> there could perhaps be a very simply phpcs sniff for that too |
| 230 | [13:35:26] <RoanKattouw> Then when the master tried to replicate its writes, replication died |
| 231 | [13:35:30] <apergos> of course |
| 232 | [13:35:33] <DanielK_WMDE> apergos: my bad. whoops. we need more eyes on this code... |
| 233 | [13:35:52] <apergos> well the good thing is it gets discovered in labs |
| 234 | [13:35:54] <RoanKattouw> apergos: If you're in #wikimedia-dev already, see my pastes in the backscroll there, you'll see the two servers disagreeing on the contents of the text and revision tables |
| 235 | [13:36:03] <addshore> apergos: RoanKattouw the great news is this issue is much simpler to fix than some of the other possible reasons for breaking that we were thinking about. |
| 236 | [13:36:07] <RoanKattouw> We also had revision IDs with different meanings on the two servers |
| 237 | [13:36:11] <RoanKattouw> Yes |
| 238 | [13:36:12] <apergos> eewwww |
| 239 | [13:36:20] <RoanKattouw> And it's not a chronology protection issue |
| 240 | [13:36:20] <DanielK_WMDE> RoanKattouw, addshore: i'm introducing allowWrite into the datbase base class. I'm also introducing a wrapper so writing to a replica connection will fail even if there is only one db and one connection. |
| 241 | [13:36:22] <apergos> yes I suppose it would all go pear-shaped pretty soon after |
| 242 | [13:36:33] <addshore> DanielK_WMDE: sounds good |
| 243 | [13:36:34] <apergos> doing the backread in the other channel |
| 244 | [13:36:49] <RoanKattouw> DanielK_WMDE: Yes that's good, and necessary to catch these things in dev. I'll file that task first so you can claim it |
| 245 | [13:36:53] <addshore> DanielK_WMDE: shall I fix the line or you? |
| 246 | [13:37:15] <DanielK_WMDE> RoanKattouw, addshore: thank you for helping with this! |
| 247 | [13:37:18] <DanielK_WMDE> addshore: i'll fix it. |
| 248 | [13:37:52] <Zppix> Im glad i mentioned directly writing to replicas :P |
| 249 | [13:38:15] <DanielK_WMDE> Zppix: thank you, indeed :)+ |
| 250 | [13:38:39] <DanielK_WMDE> Zppix: I would have expected that to just fail HARD, not silently corrupt the database |
| 251 | [13:38:51] <Zppix> I wonder if this possibly in prod as well |
| 252 | [13:38:55] <Zppix> Possible* |
| 253 | [13:39:06] <DanielK_WMDE> RoanKattouw: will you also file a ticket to make sure replicates are not writable in production? This kind of thing can SERIOUSLY mess up the database :( |
| 254 | [13:39:15] <DanielK_WMDE> Zppix: yea, scary |
| 255 | [13:39:33] <RoanKattouw> Yes |
| 256 | [13:39:38] <RoanKattouw> See my 1-2-3 above |
| 257 | [13:41:53] <apergos> that was a fun little read |
| 258 | [13:42:08] <apergos> not easy to track down either |
| 259 | [13:42:29] <DanielK_WMDE> yep... several wrong leads |
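For reference, the `allowWrite` flag Daniel mentions in the log above (13:36:20) might look roughly like this; the field and method names are sketch-level guesses, not the merged implementation:

```php
abstract class Database implements IDatabase {
	/** @var bool Whether this handle is allowed to perform writes */
	protected $allowWrite = true;

	/**
	 * The load balancer would pass false whenever it hands out a
	 * connection for DB_REPLICA, including the single-server case
	 * where the "replica" is physically the master.
	 */
	public function setAllowWrite( $allowWrite ) {
		$this->allowWrite = (bool)$allowWrite;
	}

	/** To be called at the top of insert(), update(), delete(), replace(), ... */
	protected function assertIsWritable( $fname ) {
		if ( !$this->allowWrite ) {
			// Exception class is illustrative.
			throw new DBReadOnlyError(
				$this,
				"$fname: write attempted on a connection obtained for DB_REPLICA"
			);
		}
	}

	// ...
}
```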