
DB handles obtained with DB_REPLICA should not allow writes
Closed, Resolved · Public

Description

Any attempt to perform a write operation on a database connection that was obtained for the DB_REPLICA index should fail hard. Two different cases need to be covered:

  1. Connections to actual database replicas must not allow write operations (T183265)
  2. When selecting the connection to the master database when asked for DB_REPLICA, that connection needs to be wrapped to disallow write operations. This is particularly relevant for catching bugs early, in unit tests and development environments which typically have a single database setup.

Background:

This code introduced by a recent MCR change obtains a DB handle using DB_REPLICA, then writes to it:

	$dbw = $this->getDBConnection( DB_REPLICA );
	$old_id = $dbw->nextSequenceValue( 'text_old_id_seq' );
	$dbw->insert(
		'text',
		[
			'old_id' => $old_id,
			'old_text' => $data,
			'old_flags' => $flags,
		],
		__METHOD__
	);
	$textId = $dbw->insertId();
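
For contrast, the write path presumably intended here requests the master connection instead (this is what the follow-up fix in change 399188 addresses). A minimal sketch, not the exact patch:

	// Sketch only: request the master connection for the write path.
	// DB_MASTER was the constant used for write connections at the time of this task.
	$dbw = $this->getDBConnection( DB_MASTER );
	$old_id = $dbw->nextSequenceValue( 'text_old_id_seq' );
	$dbw->insert(
		'text',
		[
			'old_id' => $old_id,
			'old_text' => $data,
			'old_flags' => $flags,
		],
		__METHOD__
	);
	$textId = $dbw->insertId();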

Shockingly, nothing in MediaWiki prevented a write on a replica connection. Because the replica DB server in beta labs was not set to read-only (see T183245: Ensure replica DB in labs is read-only), this caused an actual write to happen on the replica server, which of course broke replication because the auto-increment counters on the text table drifted out of sync.

A DB connection handle obtained with DB_REPLICA should not allow writes. Any attempt to call ->insert(), ->delete(), ->replace(), etc. should throw an exception, even if the underlying connection is to the master (development environments typically only have one DB server, but we need to catch these errors early; thanks to @daniel for realizing this).
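
To make the second case concrete, here is a minimal sketch of the kind of write-refusing wrapper described above. The class name and plain delegation are hypothetical; the real Wikimedia\Rdbms implementation may structure this differently:

	// Minimal sketch, not the actual Wikimedia\Rdbms API: a hypothetical proxy
	// that refuses write calls on a handle requested with DB_REPLICA, even if
	// the underlying connection happens to point at the master (as in a
	// single-server development setup).
	class ReplicaConnectionGuard {
		/** @var \Wikimedia\Rdbms\IDatabase */
		private $conn;

		public function __construct( \Wikimedia\Rdbms\IDatabase $conn ) {
			$this->conn = $conn;
		}

		// Read methods are forwarded unchanged.
		public function select( ...$args ) {
			return $this->conn->select( ...$args );
		}

		// Write methods fail hard instead of silently writing to a replica.
		public function insert( ...$args ) {
			throw new \RuntimeException( 'Write attempted on a DB_REPLICA handle' );
		}

		public function delete( ...$args ) {
			throw new \RuntimeException( 'Write attempted on a DB_REPLICA handle' );
		}
	}

In a single-database setup the load balancer would hand out the master connection wrapped in such a guard whenever DB_REPLICA is requested, so a bug like the one above fails immediately in unit tests instead of silently corrupting a replica.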

Chat log:

RoanKattouw: Aaron tried to help with investigating the chronology issue yesterday, but didn't get anywhere, it seems. That stuff is a bit esoterik to me... do you have an idea how to investigate? Basically, new revisions fail to load seconds after an edit. it *looks* like the page table is out iof sync with the revision table. But nobody knows how that can be.
[12:36:40] <DanielK_WMDE> ah, addshore merged the revert
[12:36:50] <DanielK_WMDE> I'll file a revert of the revert in a minute
[12:36:54] <addshore> not yet, just fixing the commit message
[12:37:00] <addshore> but gerrit has died for me
[12:37:17] <DanielK_WMDE> wffm
[12:37:18] <addshore> RoanKattouw: yes, the christmas time for merging and getting it on beta was the plan :)
[12:37:41] <DanielK_WMDE> addshore: we failed to consider that christmas is a bad time to get people to investigate arcane issues, though :)
[12:37:47] <DanielK_WMDE> aaron is gone for the holidays, i guess
[12:37:49] <addshore> indeed
[12:38:08] <DanielK_WMDE> i was expecting issues, but i was hoping they would be a bit more clear cut...
[12:39:01] <RoanKattouw> haha yeah that's a problem

[12:39:53] <RoanKattouw> I guess you could try to figure out what the state of the DB is that the broken requests see (revision table entry referring to a nonexistent page table entry? or the reverse?) and try to figure out if the order of insertions might possibly be wrong
[12:40:21] <addshore> DanielK_WMDE: Just finished fixing the commit message issues, anything else we need to add to that revert?
[12:40:34] <DanielK_WMDE> RoanKattouw: no, thje page table references a revision that then fails to load (it may or may not be in the db, we don't know)
[12:40:54] <DanielK_WMDE> addshore: i don't think so
[12:41:15] <RoanKattouw> Hmm that's weird
[12:41:31] <RoanKattouw> One thing I wonder is how the page table row was able to be written then
[12:41:41] <RoanKattouw> Because you don't know the revision ID until after you've written the revision row
[12:41:53] <addshore> hmm, beta db is currently locked?
[12:42:08] <RoanKattouw> Have you isolated a broken case? Do broken cases stay broken for a long-ish period of time, or only briefly?
[12:42:21] <addshore> RoanKattouw: I was about to say you can try it on beta, but the db is locked
[12:42:34] <DanielK_WMDE> RoanKattouw: it does get written. the revision is found once replicas catch up. it just fails to load right after the edit.
[12:43:23] <DanielK_WMDE> RoanKattouw: while investigating, aaron found that replication weas broken on beta. we firt thought that that was all there was to it. but fixing replication didn't fix the issue. it's all a bit... obscure
[12:44:29] <DanielK_WMDE> RoanKattouw: to replicate, just edit. you get redirected to the "plain" url, and get an error (revision not found). you reload, and it works fine.
[12:45:02] <DanielK_WMDE> RoanKattouw, addshore: the relevant code in Article pre-fills the page with that error message. so that's the error you'll see if *anything* goes wrong. which means we don't actually know what goes wrong.
[12:46:19] <addshore> DanielK_WMDE: I dont think the beta replication issue is fixed
[12:46:19] <addshore> "The database has been automatically locked while the replica database servers catch up to the master."
[12:46:50] <RoanKattouw> "You reload, and it works fine" --> so it's not a long-lasting issue, it's brief
[12:46:55] <RoanKattouw> That makes investigating it harder
[12:47:34] <DanielK_WMDE> RoanKattouw: indeed. and MAYBE it's actually an issue with replication on beta, not with the code, as addshore just said...
[12:47:39] <RoanKattouw> addshore: Hmm if replication is *completely* broken, then maybe that explains it, but I feel like this should be working even when replication is broken (or at least when it's very slow)
[12:47:52] <RoanKattouw> Well, let's see if the revert fixes it
[12:47:57] <addshore> https://deployment.wikimedia.beta.wmflabs.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb= is the same as when arron looked yesterday
[12:48:03] <RoanKattouw> If it does, then your code is doing at least something wrong
[12:48:06] <addshore> wtf is up with beta
[12:48:15] <DanielK_WMDE> RoanKattouw: that indeed is a meaningful experiment :)
[12:48:51] <RoanKattouw> addshore: Hmm at that URL I see 0 lag right now
[12:49:07] <addshore> RoanKattouw: try creating a page though, it says the db is locked due to lag
[12:49:38] <RoanKattouw> I see
[12:49:57] <RoanKattouw> Very strange
[12:50:30] <RoanKattouw> The DB does appear to be read-only
[12:50:35] <RoanKattouw> Marking notifications as read doesn't stick
[12:51:06] <Zppix> Would jenkins doing a scap cause that?
[12:51:15] <addshore> Zppix: the locking? no
[12:51:22] <DanielK_WMDE> RoanKattouw: maybe it's r/o for a reason that has nothing to do with replag?
[12:51:23] <RoanKattouw> No, something weird is going on
[12:51:36] <RoanKattouw> I checked $wgReadOnly and it's false, so that isn't it
[12:51:43] <addshore> DanielK_WMDE: when editing though the reason it shows says it is due to replag
[12:51:52] <RoanKattouw> wfReadOnly() returns true, so let's see why
[12:51:55] <DanielK_WMDE> addshore: wtf?
[12:52:07] <addshore> https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/lastFailedBuild/console
[12:52:26] <addshore> Wikimedia\\Rdbms\\DBReplicationWaitError from line 369 of /srv/mediawiki-staging/php-master/includes/libs/rdbms/lbfactory/LBFactory.php: Could not wait for replica DBs to catch up to deployment-db03
[12:53:02] <addshore> first failure is Build #22122 (18-Dec-2017 19:20:00)
[12:55:47] <RoanKattouw> OK I can confirm that replication is broken on deployment-db04
[12:55:58] <addshore> RoanKattouw: any idea how we can fix that?
[12:56:08] <RoanKattouw> https://www.irccloud.com/pastebin/tREOTlIr/
[12:56:53] <addshore> ooooooh, well, it has things to do with externalstore and revisions / old text, could be related too the patch
[12:56:58] <RoanKattouw> addshore: I know like 1/3 of the theory and I haven't done it before
[12:57:26] <RoanKattouw> Yeah the error that broke replication is a primary key collision in the text table, I hadn't realized that that could be related to MCR but you're right
[12:57:42] <RoanKattouw> The query itself specifies NULL as the old_id so that should have been fine
[12:57:55] <addshore> That actually could be a very big hint as to what is wrong
[12:58:27] <addshore> patch got merged, code started being run, replication broke, lag wasn't bad enough to cause a db lock for a while though, we keep seeing the same issue, and eventually the db locks up
[12:58:52] <RoanKattouw> If replication breaks completely, it should only take 5 minutes for the lag to shut everything down
[12:59:03] <RoanKattouw> At most; I'm not sure if MW detects this broken state specially and goes r/o immediately
[12:59:28] <RoanKattouw> But obviously if replication is broken but lag measurements still work, the lag will increase at a rate of 1 second per second, and the 300s lag threshold will be hit pretty quickly
[13:00:21] <RoanKattouw> OK so I've looked at the text table on deployment-db04 which interestingly has IDs going all the way up to 20982, and has the auto_increment value correctly set to 20983
[13:00:24] <RoanKattouw> So it doesn't look broken
[13:00:28] <RoanKattouw> Now going to look at that table on the master
[13:01:18] <RoanKattouw> Whoa, wtf
[13:01:24] <RoanKattouw> The text table on the master has FEWER entries
[13:01:33] <addshore> interesting....... urm :P
[13:01:42] <RoanKattouw> The highest old_id on the master is 20972
[13:02:09] <addshore> 10 entries less
[13:02:31] <addshore> and 10 edits have been made on deployment wiki since the code landed
[13:02:38] <RoanKattouw> OK something weird happened
[13:02:48] <RoanKattouw> The master and slave disagree on some rows:
[13:03:07] <RoanKattouw> Master:
[13:03:08] <RoanKattouw> https://www.irccloud.com/pastebin/mJeq37Zx/
[13:03:18] <RoanKattouw> Slave:
[13:03:22] <RoanKattouw> https://www.irccloud.com/pastebin/sRtUJazL/
[13:03:33] <DanielK_WMDE> RoanKattouw: it'S possible that my patch broke that. I tried to keep that logic untouched while moving it to a different class, but it's possible I broke that.
[13:03:49] <DanielK_WMDE> But it's also very possible that it's completely unrelated. I sure hope it is :)
[13:03:58] <RoanKattouw> Any idea whether the difference between nativeDataArray and utf-8 is related to your patch?
[13:04:18] <addshore> we defintly touched that area, so most likely
[13:04:21] <RoanKattouw> Also note that there is perfect non-overlap here
[13:04:32] <addshore> right, revert time it is
[13:04:45] <RoanKattouw> The master and replica agree on everything up to oldid 20970 which has text address 8211
[13:05:02] <Zppix> What is oldid 20970?
[13:05:08] <DanielK_WMDE> RoanKattouw: nativeDataArray? What is that? Never heard of it. I moved the code that handles these flags. If the logic changed in the meantime, it's possible i accidentally reverted that, in an inconsistent way
[13:05:12] <RoanKattouw> Then the replica picked up 8212-8217, 8219-8223 and 8225
[13:05:16] <Zppix> Perhaps we could get a timeline?
[13:05:22] <RoanKattouw> And the master picked up 8218 and 8224
[13:05:30] <DanielK_WMDE> RoanKattouw: what's the timestamp of that revision?
[13:05:37] <RoanKattouw> Which one, 20970?
[13:05:41] <RoanKattouw> The last one they agree on?
[13:05:43] <DanielK_WMDE> The patch was merged yesterday... I think around this time of day
[13:06:22] <DanielK_WMDE> RoanKattouw: merged yesterday at 16:10 CET.
[13:06:43] <RoanKattouw> There is no row in the revision table with either rev_id=20970 or rev_text_id=20970
[13:07:41] <addshore> Merged @ Dec 18 15:10 PM UTC
[13:07:52] <RoanKattouw> Wait, on the MASTER there isn't, on the replica there is
[13:08:00] <DanielK_WMDE> ...what?!
[13:08:02] <RoanKattouw> The DBs also disagree on what's in the revision table
[13:08:22] <RoanKattouw> The 15 most recent revision table entries on the master:
[13:08:24] <addshore> 15:18 it is on the beta clster and scap has run
[13:08:24] <RoanKattouw> https://www.irccloud.com/pastebin/lnRPQI4Z/
[13:08:25] <DanielK_WMDE> that sounds fubar...
[13:08:28] <RoanKattouw> And on the replica:
[13:08:37] <RoanKattouw> https://www.irccloud.com/pastebin/r7AxQfDT/
[13:08:57] <addshore> 15:19 I have already spotted the issue https://usercontent.irccloud-cdn.com/file/Mc9x8ROA/image.png
[13:09:08] <addshore> that will be the first edit made with the new code
[13:09:44] <RoanKattouw> Oh wait that's not a disagreement, my bad
[13:09:49] <addshore> RoanKattouw: what is up with rev 7934 where does it appear & look like?
[13:10:05] <DanielK_WMDE> RoanKattouw: nativeDataArray seems to be an AbuseFilter thing. Poor man's content model from many years ago, it seems
[13:10:13] <RoanKattouw> There are strange gaps in the rev_id sequence, but the only real difference is the existence of rev_id 7944 on the master which seems reasonable
[13:10:15] <RoanKattouw> Aah OK
[13:10:16] <DanielK_WMDE> I didn't touch that. should still work as before.
[13:10:26] <RoanKattouw> Right and the query that broke replication had an AbuseFilter method in its comment
[13:10:36] <DanielK_WMDE> \O/
[13:10:42] <RoanKattouw> | 7934 | 20972 | 20171218152013 |
[13:10:50] <addshore> Right
[13:10:52] <RoanKattouw> The two DBs agree on that
[13:11:00] <DanielK_WMDE> Buuut.... that actually may indicate an issue with my patch. Maybe AbuseFilter is relying on internals that i now changed
[13:11:06] <DanielK_WMDE> maybe I broke compat with AF
[13:11:07] <RoanKattouw> That is actually exactly the same rev_text_id as in the replication error!
[13:11:07] <addshore> DanielK_WMDE: indeed
[13:11:22] <RoanKattouw> Last_SQL_Error: Error 'Duplicate entry '20972' for key 'PRIMARY'' on query. Default database: 'deploymentwiki'. Query: 'INSERT /* AbuseFilter::storeVarDump */ INTO `text` (old_id,old_text,old_flags) VALUES (NULL,'DB://cluster1/8224','nativeDataArray,gzip,external')'
[13:11:30] <addshore> RoanKattouw: right, and I believe that is the first time i spotted the issue / the first edit i made after the code landed
[13:11:48] <addshore> Welll...
[13:11:48] <addshore> 15:19, 18 December 2017 Addshore (talk | contribs | block) deleted page TXTTN (view/restore)
[13:12:01] <addshore> (diff | hist) . . N TXTTN; 15:20 . . (+7) . . Addshore (talk | contribs | block) (Created page with "slakfa;")
[13:12:01] <DanielK_WMDE> the timestamp matches
[13:12:26] <RoanKattouw> addshore: Oh you deleted stuff, that could explain the rev_id gap
[13:12:26] <DanielK_WMDE> addshore: but why is there an edit by abusefilter?... you didn't edit a filter, did you?
[13:12:27] <addshore> and that rev i created at 15:20 is https://deployment.wikimedia.beta.wmflabs.org/w/index.php?title=TXTTN&oldid=7934
[13:12:53] * DanielK_WMDE feels an abuse filter rewrite coming on
[13:13:04] <addshore> RoanKattouw: yes, although in the UI i cant see what i deleted, or what timestamp or revid as the db is locked https://deployment.wikimedia.beta.wmflabs.org/wiki/Special:Undelete/TXTTN
[13:13:22] <RoanKattouw> I have it
[13:13:30] <RoanKattouw> https://www.irccloud.com/pastebin/Lh4gkYeA/
[13:13:36] <RoanKattouw> It helps that it was the most recently deleted thing
[13:14:29] <DanielK_WMDE> so, what was broken? AF or deletion?
[13:14:31] <addshore> so i created a page, and it worked, iu deleted it, it wokred, i recreated it and started seeing the issue
[13:14:45] <addshore> and that was due to abusefilter exploding during the first creation or deletion?
[13:14:47] <DanielK_WMDE> how can *any* of this break replication, btw?
[13:15:16] <DanielK_WMDE> addshore: why would abuse filter write revisions?
[13:15:22] <DanielK_WMDE> unless youedit a filter?
[13:15:24] <addshore> DanielK_WMDE: no idea
[13:15:31] <addshore> So, "nativeDataArray" is ONLY used in AbuseFilter
[13:15:32] <DanielK_WMDE> maybe someone else was editing a filter?
[13:15:49] <DanielK_WMDE> (and maybe this is a red herring)?
[13:16:00] <addshore> https://github.com/wikimedia/mediawiki-extensions-AbuseFilter/blob/master/includes/AbuseFilter.class.php#L1286
[13:16:45] <addshore> *scrolls up to look at the flag mess we already discovered*
[13:17:37] <addshore> 20970 has no nativeDataArray flag on either slave or master, 20971 and 20972 have it on master but not on the slave
[13:17:43] <DanielK_WMDE> addshore: can you find the user & commit message associated with the nativeArrayData entries?
[13:20:12] <RoanKattouw> Yeah old_id 20972 points to completely different things on the master and the replica
[13:20:20] <RoanKattouw> Different flags but also different external store addresses
[13:20:29] <DanielK_WMDE> how the hell does that happen?
[13:20:34] <RoanKattouw> No idea
[13:20:46] <addshore> its almost like some code is somehow manging to write to the slave...
[13:20:47] <RoanKattouw> I'm digging though the binlogs right now trying to find out what other queries were run on the tex ttable
[13:20:53] <RoanKattouw> So far I've only found the one that killed replication
[13:21:00] <RoanKattouw> Hmm yeah maybe
[13:21:12] <addshore> it looks like abusefilter stuff manage to write to master only
[13:21:19] <RoanKattouw> I thought that maybe the master and replica had switched roles, but maybe the replica is writeable when it shouldn't be? That would break replication pretty quickly too
[13:21:22] <DanielK_WMDE> addshore: oooohhhh.... what would happen if you tried that? it should fail HARD! But does it?
[13:21:44] <addshore> DanielK_WMDE: I would have thought it would fail really hard, but perhaps not?>
[13:21:46] <DanielK_WMDE> RoanKattouw: maybe aaron switched them yesterday when trying to fix things?
[13:21:56] <Zppix> Have you guys tried writing to the replicas directly?
[13:22:01] <RoanKattouw> That's possible
[13:22:09] <Zppix> It should fail no?
[13:22:10] <RoanKattouw> I'll check if the replica is read-only
[13:22:39] <RoanKattouw> Welp, the replica is not read-only
[13:23:11] <Zppix> Hmmm
[13:23:16] <Zppix> That cant be good
[13:23:41] <addshore> so, what is the first insert / update / delete we see on the replica that is not on the master?
[13:23:44] <addshore> (if there is one) ?
[13:24:12] <RoanKattouw> I'm trying to figure that out now
[13:25:09] <RoanKattouw> The master's binlog file is 489MB
[13:25:12] <addshore> I'll filter out messages out of this IRC channel and dump the chat log somewhere nice to look at once we have finished here, I think this is all rather valuable info we don't want to have to go hunting for agian
[13:25:24] <RoanKattouw> So analyzing it is a bit slow
[13:26:24] <DanielK_WMDE> I'll look into making unit tests use a r/o wrapper for the DB REPLICA connection. That may not be easy, though
[13:29:08] <RoanKattouw> OK so I'm going through the query history on both servers to try to prove that they received contradictory insertions
[13:29:13] <RoanKattouw> But it's slow going
[13:29:21] <addshore> I'm looking through the usages of DB SLAVE / MASTER in abusefilter and out new revstore stuff now
[13:30:10] <addshore> $dbw = $this->getDBConnection( DB_REPLICA );
[13:30:11] <RoanKattouw> There is not much in the beta SAL so I can't tell if db04 was legitimately the master at some point
[13:30:12] <addshore> DanielK_WMDE: RoanKattouw ^^
[13:30:14] <RoanKattouw> LOL
[13:30:16] <RoanKattouw> Wel...
[13:30:22] <RoanKattouw> That would do it
[13:30:22] <addshore> Think I might have found it....
[13:30:34] <RoanKattouw> Do we not have a mechanism in MW to prevent writes to DB_REPLICA connections?!
[13:30:41] <addshore> RoanKattouw: apparently not
[13:30:45] <addshore> it just does ->insert
[13:31:13] <addshore> https://github.com/wikimedia/mediawiki/blob/master/includes/Storage/SqlBlobStore.php#L231
[13:31:15] <DanielK_WMDE> RoanKattouw: looking into fixing that at least for unit tests
[13:31:22] <DanielK_WMDE> ideally for everything.
[13:31:50] <addshore> there should be a NoWrite implementation of the db interface
[13:31:56] <RoanKattouw> Well that would do it
[13:31:58] <addshore> that refuses to accept calls to any writ methods
[13:32:06] <RoanKattouw> That's exactly the place where we would expect this, too
[13:32:11] <addshore> or infact... 2 separate interfaces..
[13:32:14] <RoanKattouw> OK I'm going to file a couple tasks here
[13:32:20] <addshore> RoanKattouw: agreed
[13:32:25] <RoanKattouw> 1) We need all replica DBs to have writes disabled, in labs as well as production
[13:32:31] <RoanKattouw> 2)
[13:32:45] <RoanKattouw> 2) Trying to write to a DB_REPLICA handle needs to throw an exception
[13:32:46] <Zppix> If we disable writes how could masters write to replicas?
[13:32:51] <apergos> O.o you can write to replicas??
[13:32:58] <RoanKattouw> apergos: Yuuup
[13:32:59] <addshore> apergos: on the beta cluster yes
[13:33:02] <apergos> owie
[13:33:05] <RoanKattouw> Zppix: Replication is exempt from the no-writes flag
[13:33:06] <addshore> and within mediawiki yes,
[13:33:11] <Zppix> Ah ok
[13:33:20] <RoanKattouw> 3) deployment-db04 is hosed and needs to become un-hosed
[13:33:29] <Zppix> (I am not to smart when it comes to db)
[13:33:55] <RoanKattouw> (I wondered the same thing actually, so I looked up the documentation and it said that replication and admins were exempt)
[13:34:33] <RoanKattouw> I'm going to begin with depooling deployment-db04 so as to unbreak the beta wikis
[13:34:37] <apergos> what the heck is getting a db replica handle and trying to write to it anyways?
[13:34:43] <RoanKattouw> Then we probably need to reprovision it or something
[13:34:48] <RoanKattouw> apergos: The new MCR code :)
[13:34:54] <RoanKattouw> https://github.com/wikimedia/mediawiki/blob/master/includes/Storage/SqlBlobStore.php#L231
[13:34:56] <apergos> oh dear...
[13:35:03] <RoanKattouw> Yeah...
[13:35:13] <RoanKattouw> So the text table's IDs diverged on the master and replica DBs
[13:35:21] <apergos> I know it's not really funny but I am lmao here
[13:35:24] <addshore> there could perhaps be a very simply phpcs sniff for that too
[13:35:26] <RoanKattouw> Then when the master tried to replicate its writes, replication died
[13:35:30] <apergos> of course
[13:35:33] <DanielK_WMDE> apergos: my bad. whoops. we need more eyes on this code...
[13:35:52] <apergos> well the good thing is it gets discovered in labs
[13:35:54] <RoanKattouw> apergos: If you're in #wikimedia-dev already, see my pastes in the backscroll there, you'll see the two servers disagreeing on the contents of the text and revision tables
[13:36:03] <addshore> apergos: RoanKattouw the great news is this issue is much simpler to fix than some of the other possible reasons for breaking that we were thinking about.
[13:36:07] <RoanKattouw> We also had revision IDs with different meanings on the two servers
[13:36:11] <RoanKattouw> Yes
[13:36:12] <apergos> eewwww
[13:36:20] <RoanKattouw> And it's not a chronology protection issue
[13:36:20] <DanielK_WMDE> RoanKattouw, addshore: i'm introducing allowWrite into the datbase base class. I'm also introducing a wrapper so writing to a replica connection will fail even if there is only one db and one connection.
[13:36:22] <apergos> yes I suppose it would all go pear-shaped pretty soon after
[13:36:33] <addshore> DanielK_WMDE: sounds good
[13:36:34] <apergos> doing the backread in the other channel
[13:36:49] <RoanKattouw> DanielK_WMDE: Yes that's good, and necessary to catch these things in dev. I'll file that task first so you can claim it
[13:36:53] <addshore> DanielK_WMDE: shall I fix the line or you?
[13:37:15] <DanielK_WMDE> RoanKattouw, addshore: thank you for helping with this!
[13:37:18] <DanielK_WMDE> addshore: i'll fix it.
[13:37:52] <Zppix> Im glad i mentioned directly writing to replicas :P
[13:38:15] <DanielK_WMDE> Zppix: thank you, indeed :)+
[13:38:39] <DanielK_WMDE> Zppix: I would have expected that to just fail HARD, not silently corrupt the database
[13:38:51] <Zppix> I wonder if this possibly in prod as well
[13:38:55] <Zppix> Possible*
[13:39:06] <DanielK_WMDE> RoanKattouw: will you also file a ticket to make sure replicates are not writable in production? This kind of thing can SERIOUSLY mess up the database :(
[13:39:15] <DanielK_WMDE> Zppix: yea, scary
[13:39:33] <RoanKattouw> Yes
[13:39:38] <RoanKattouw> See my 1-2-3 above
[13:41:53] <apergos> that was a fun little read
[13:42:08] <apergos> not easy to track down either
[13:42:29] <DanielK_WMDE> yep... several wrong leads

Event Timeline

Catrope created this task. Dec 19 2017, 1:42 PM
Restricted Application added a subscriber: Aklapper. Dec 19 2017, 1:42 PM
Zppix added a subscriber: Zppix. Dec 19 2017, 1:42 PM
Addshore added a subscriber: Addshore.
daniel claimed this task. Dec 19 2017, 1:46 PM
daniel triaged this task as High priority.
Addshore moved this task from Inbox to Done on the Multi-Content-Revisions board. Dec 19 2017, 1:46 PM
Zppix updated the task description. Dec 19 2017, 2:07 PM

Change 399188 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/core@master] [MCR] Fix SqlBlobStore using DB_REPLICA for writes

https://gerrit.wikimedia.org/r/399188

Change 399188 merged by jenkins-bot:
[mediawiki/core@master] [MCR] Fix SqlBlobStore using DB_REPLICA for writes

https://gerrit.wikimedia.org/r/399188

daniel lowered the priority of this task from High to Medium. Dec 19 2017, 3:51 PM
daniel updated the task description.

Reducing to normal prio, bumping T183265: Write operations should fail on connections to database replicas. to high prio instead.

Change 399445 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] [WIP] Prevent writes on connections obtained with DB_REPLICA.

https://gerrit.wikimedia.org/r/399445

daniel moved this task from Inbox to Push on the User-Daniel board.

Effectively resolved by:

(and many commits before it in preparation, and after it to make it work even better).

It is limited in that direct use of getConnection(DB_REPLICA), which bypasses the internal connection management, also bypasses this detection. In general, use of that method is discouraged, though.

The only exception is legacy code using wfGetDB(), which is currently tied to getConnection(); I'm not sure whether we can swap that or not.
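
As a rough illustration of the distinction (service wiring and method availability vary by MediaWiki version, so treat this as a sketch rather than canonical usage):

	// Sketch only; service access details vary by MediaWiki version.
	$lb = \MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancer();

	// Preferred: a connection reference, which goes through the load balancer's
	// internal connection management and can therefore enforce read-only
	// behaviour on DB_REPLICA handles.
	$dbr = $lb->getConnectionRef( DB_REPLICA );
	$row = $dbr->selectRow( 'page', [ 'page_id' ], [ 'page_title' => 'Main_Page' ], __METHOD__ );

	// Discouraged: a raw connection object, which bypasses that management and
	// therefore also bypasses the write protection described above.
	$raw = $lb->getConnection( DB_REPLICA );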

See also:

I see uses of both getConnection and wfGetDB in maintenance/includes/BackupDumper.php and maintenance/includes/TextPassDumper.php; should these be changed to $this->getDB?

Krinkle closed this task as Resolved. Jul 23 2019, 4:14 PM

@ArielGlenn The idiomatic pattern for maintenance scripts is indeed to use Maintenance::getDB, which also honours setDB, and is compatible with utility methods such as beginTransaction, commitTransaction (which automatically waits for replication, useful when processing batches), and more.

Using the LBFactory service object and using getConnectionRef etc. directly is fine as well.

Using wfGetDB is discouraged in new code, and is slowly being phased out.