Any attempt to perform a write operation on a database connection that was obtained for the DB_REPLICA index should fail hard. Two different cases need to be covered:
- Connections to actual database replicas must not allow write operations (T183265)
- When the connection to the master database is selected in response to a DB_REPLICA request, that connection needs to be wrapped to disallow write operations (see the sketch below). This is particularly relevant for catching bugs early in unit tests and development environments, which typically have a single-database setup.
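A minimal sketch of what such a wrapper could look like, assuming a hypothetical `ReadOnlyDatabaseWrapper` proxy; the class name, the method list, and the exception type are illustrative, not the actual rdbms API:

```php
use Wikimedia\Rdbms\IDatabase;

/**
 * Hypothetical proxy that forwards all calls to the wrapped connection
 * but fails hard on anything that would modify data.
 */
class ReadOnlyDatabaseWrapper {
	/** @var IDatabase */
	private $conn;

	/** Mutating methods that must never run on a DB_REPLICA handle. */
	private static $writeMethods = [
		'insert', 'update', 'delete', 'replace', 'upsert', 'deleteJoin',
	];

	public function __construct( IDatabase $conn ) {
		$this->conn = $conn;
	}

	public function __call( $name, array $arguments ) {
		if ( in_array( $name, self::$writeMethods, true ) ) {
			throw new RuntimeException(
				"$name() called on a connection obtained for DB_REPLICA"
			);
		}
		// Reads (select(), selectRow(), SELECT-only query(), ...) pass through.
		return $this->conn->$name( ...$arguments );
	}
}
```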
Background:
This code, introduced by a recent MCR change, obtains a DB handle using DB_REPLICA and then writes to it:
```php
$dbw = $this->getDBConnection( DB_REPLICA );

$old_id = $dbw->nextSequenceValue( 'text_old_id_seq' );
$dbw->insert(
	'text',
	[
		'old_id' => $old_id,
		'old_text' => $data,
		'old_flags' => $flags,
	],
	__METHOD__
);
$textId = $dbw->insertId();
```
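For comparison, the immediate fix at this call site is presumably just to request a master connection for the write (a sketch; `DB_MASTER` is the counterpart constant to `DB_REPLICA`):

```php
// Blob writes must go to the master; DB_REPLICA handles are for reads only.
$dbw = $this->getDBConnection( DB_MASTER );

$old_id = $dbw->nextSequenceValue( 'text_old_id_seq' );
$dbw->insert(
	'text',
	[
		'old_id' => $old_id,
		'old_text' => $data,
		'old_flags' => $flags,
	],
	__METHOD__
);
$textId = $dbw->insertId();
```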
Shockingly, nothing in MediaWiki prevented a write on a replica connection. Because the replica DB server in beta labs was not set to read-only (see T183245: Ensure replica DB in labs is read-only), the write actually went through on the replica server, which of course broke replication once the auto-increment counters on the text table drifted out of sync.
A DB connection handle obtained with DB_REPLICA should not allow writes. Any attempt to call ->insert(), ->delete(), ->replace(), etc. should throw an exception, even if the underlying connection is to the master (development environments typically only have one DB server, but we need to catch these errors early; thanks to @daniel for realizing this).
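From the caller's perspective the expected behavior is roughly the following; whether the exception is `DBReadOnlyError` or some other class is an implementation detail:

```php
$dbr = $lb->getConnection( DB_REPLICA );

try {
	// Must throw, even on a single-server setup where $dbr is
	// physically the same server as the master.
	$dbr->insert( 'text', [ 'old_text' => 'x' ], __METHOD__ );
} catch ( Wikimedia\Rdbms\DBReadOnlyError $e ) {
	// Fails hard instead of silently corrupting the replica,
	// as happened on the beta cluster.
}
```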
Chat log:
| # | Message |
|---|---|
| 1 | RoanKattouw: Aaron tried to help with investigating the chronology issue yesterday, but didn't get anywhere, it seems. That stuff is a bit esoterik to me... do you have an idea how to investigate? Basically, new revisions fail to load seconds after an edit. it *looks* like the page table is out iof sync with the revision table. But nobody knows how that can be. |
| 2 | [12:36:40] <DanielK_WMDE> ah, addshore merged the revert |
| 3 | [12:36:50] <DanielK_WMDE> I'll file a revert of the revert in a minute |
| 4 | [12:36:54] <addshore> not yet, just fixing the commit message |
| 5 | [12:37:00] <addshore> but gerrit has died for me |
| 6 | [12:37:17] <DanielK_WMDE> wffm |
| 7 | [12:37:18] <addshore> RoanKattouw: yes, the christmas time for merging and getting it on beta was the plan :) |
| 8 | [12:37:41] <DanielK_WMDE> addshore: we failed to consider that christmas is a bad time to get people to investigate arcane issues, though :) |
| 9 | [12:37:47] <DanielK_WMDE> aaron is gone for the holidays, i guess |
| 10 | [12:37:49] <addshore> indeed |
| 11 | [12:38:08] <DanielK_WMDE> i was expecting issues, but i was hoping they would be a bit more clear cut... |
| 12 | [12:39:01] <RoanKattouw> haha yeah that's a problem |
| 14 | [12:39:53] <RoanKattouw> I guess you could try to figure out what the state of the DB is that the broken requests see (revision table entry referring to a nonexistent page table entry? or the reverse?) and try to figure out if the order of insertions might possibly be wrong |
| 15 | [12:40:21] <addshore> DanielK_WMDE: Just finished fixing the commit message issues, anything else we need to add to that revert? |
| 16 | [12:40:34] <DanielK_WMDE> RoanKattouw: no, thje page table references a revision that then fails to load (it may or may not be in the db, we don't know) |
| 17 | [12:40:54] <DanielK_WMDE> addshore: i don't think so |
| 18 | [12:41:15] <RoanKattouw> Hmm that's weird |
| 19 | [12:41:31] <RoanKattouw> One thing I wonder is how the page table row was able to be written then |
| 20 | [12:41:41] <RoanKattouw> Because you don't know the revision ID until after you've written the revision row |
| 21 | [12:41:53] <addshore> hmm, beta db is currently locked? |
| 22 | [12:42:08] <RoanKattouw> Have you isolated a broken case? Do broken cases stay broken for a long-ish period of time, or only briefly? |
| 23 | [12:42:21] <addshore> RoanKattouw: I was about to say you can try it on beta, but the db is locked |
| 24 | [12:42:34] <DanielK_WMDE> RoanKattouw: it does get written. the revision is found once replicas catch up. it just fails to load right after the edit. |
| 25 | [12:43:23] <DanielK_WMDE> RoanKattouw: while investigating, aaron found that replication weas broken on beta. we firt thought that that was all there was to it. but fixing replication didn't fix the issue. it's all a bit... obscure |
| 26 | [12:44:29] <DanielK_WMDE> RoanKattouw: to replicate, just edit. you get redirected to the "plain" url, and get an error (revision not found). you reload, and it works fine. |
| 27 | [12:45:02] <DanielK_WMDE> RoanKattouw, addshore: the relevant code in Article pre-fills the page with that error message. so that's the error you'll see if *anything* goes wrong. which means we don't actually know what goes wrong. |
| 28 | [12:46:19] <addshore> DanielK_WMDE: I dont think the beta replication issue is fixed |
| 29 | [12:46:19] <addshore> "The database has been automatically locked while the replica database servers catch up to the master." |
| 30 | [12:46:50] <RoanKattouw> "You reload, and it works fine" --> so it's not a long-lasting issue, it's brief |
| 31 | [12:46:55] <RoanKattouw> That makes investigating it harder |
| 32 | [12:47:34] <DanielK_WMDE> RoanKattouw: indeed. and MAYBE it's actually an issue with replication on beta, not with the code, as addshore just said... |
| 33 | [12:47:39] <RoanKattouw> addshore: Hmm if replication is *completely* broken, then maybe that explains it, but I feel like this should be working even when replication is broken (or at least when it's very slow) |
| 34 | [12:47:52] <RoanKattouw> Well, let's see if the revert fixes it |
| 35 | [12:47:57] <addshore> https://deployment.wikimedia.beta.wmflabs.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb= is the same as when arron looked yesterday |
| 36 | [12:48:03] <RoanKattouw> If it does, then your code is doing at least something wrong |
| 37 | [12:48:06] <addshore> wtf is up with beta |
| 38 | [12:48:15] <DanielK_WMDE> RoanKattouw: that indeed is a meaningful experiment :) |
| 39 | [12:48:51] <RoanKattouw> addshore: Hmm at that URL I see 0 lag right now |
| 40 | [12:49:07] <addshore> RoanKattouw: try creating a page though, it says the db is locked due to lag |
| 41 | [12:49:38] <RoanKattouw> I see |
| 42 | [12:49:57] <RoanKattouw> Very strange |
| 43 | [12:50:30] <RoanKattouw> The DB does appear to be read-only |
| 44 | [12:50:35] <RoanKattouw> Marking notifications as read doesn't stick |
| 45 | [12:51:06] <Zppix> Would jenkins doing a scap cause that? |
| 46 | [12:51:15] <addshore> Zppix: the locking? no |
| 47 | [12:51:22] <DanielK_WMDE> RoanKattouw: maybe it's r/o for a reason that has nothing to do with replag? |
| 48 | [12:51:23] <RoanKattouw> No, something weird is going on |
| 49 | [12:51:36] <RoanKattouw> I checked $wgReadOnly and it's false, so that isn't it |
| 50 | [12:51:43] <addshore> DanielK_WMDE: when editing though the reason it shows says it is due to replag |
| 51 | [12:51:52] <RoanKattouw> wfReadOnly() returns true, so let's see why |
| 52 | [12:51:55] <DanielK_WMDE> addshore: wtf? |
| 53 | [12:52:07] <addshore> https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/lastFailedBuild/console |
| 54 | [12:52:26] <addshore> Wikimedia\\Rdbms\\DBReplicationWaitError from line 369 of /srv/mediawiki-staging/php-master/includes/libs/rdbms/lbfactory/LBFactory.php: Could not wait for replica DBs to catch up to deployment-db03 |
| 55 | [12:53:02] <addshore> first failure is Build #22122 (18-Dec-2017 19:20:00) |
| 56 | [12:55:47] <RoanKattouw> OK I can confirm that replication is broken on deployment-db04 |
| 57 | [12:55:58] <addshore> RoanKattouw: any idea how we can fix that? |
| 58 | [12:56:08] <RoanKattouw> https://www.irccloud.com/pastebin/tREOTlIr/ |
| 59 | [12:56:53] <addshore> ooooooh, well, it has things to do with externalstore and revisions / old text, could be related too the patch |
| 60 | [12:56:58] <RoanKattouw> addshore: I know like 1/3 of the theory and I haven't done it before |
| 61 | [12:57:26] <RoanKattouw> Yeah the error that broke replication is a primary key collision in the text table, I hadn't realized that that could be related to MCR but you're right |
| 62 | [12:57:42] <RoanKattouw> The query itself specifies NULL as the old_id so that should have been fine |
| 63 | [12:57:55] <addshore> That actually could be a very big hint as to what is wrong |
| 64 | [12:58:27] <addshore> patch got merged, code started being run, replication broke, lag wasn't bad enough to cause a db lock for a while though, we keep seeing the same issue, and eventually the db locks up |
| 65 | [12:58:52] <RoanKattouw> If replication breaks completely, it should only take 5 minutes for the lag to shut everything down |
| 66 | [12:59:03] <RoanKattouw> At most; I'm not sure if MW detects this broken state specially and goes r/o immediately |
| 67 | [12:59:28] <RoanKattouw> But obviously if replication is broken but lag measurements still work, the lag will increase at a rate of 1 second per second, and the 300s lag threshold will be hit pretty quickly |
| 68 | [13:00:21] <RoanKattouw> OK so I've looked at the text table on deployment-db04 which interestingly has IDs going all the way up to 20982, and has the auto_increment value correctly set to 20983 |
| 69 | [13:00:24] <RoanKattouw> So it doesn't look broken |
| 70 | [13:00:28] <RoanKattouw> Now going to look at that table on the master |
| 71 | [13:01:18] <RoanKattouw> Whoa, wtf |
| 72 | [13:01:24] <RoanKattouw> The text table on the master has FEWER entries |
| 73 | [13:01:33] <addshore> interesting....... urm :P |
| 74 | [13:01:42] <RoanKattouw> The highest old_id on the master is 20972 |
| 75 | [13:02:09] <addshore> 10 entries less |
| 76 | [13:02:31] <addshore> and 10 edits have been made on deployment wiki since the code landed |
| 77 | [13:02:38] <RoanKattouw> OK something weird happened |
| 78 | [13:02:48] <RoanKattouw> The master and slave disagree on some rows: |
| 79 | [13:03:07] <RoanKattouw> Master: |
| 80 | [13:03:08] <RoanKattouw> https://www.irccloud.com/pastebin/mJeq37Zx/ |
| 81 | [13:03:18] <RoanKattouw> Slave: |
| 82 | [13:03:22] <RoanKattouw> https://www.irccloud.com/pastebin/sRtUJazL/ |
| 83 | [13:03:33] <DanielK_WMDE> RoanKattouw: it'S possible that my patch broke that. I tried to keep that logic untouched while moving it to a different class, but it's possible I broke that. |
| 84 | [13:03:49] <DanielK_WMDE> But it's also very possible that it's completely unrelated. I sure hope it is :) |
| 85 | [13:03:58] <RoanKattouw> Any idea whether the difference between nativeDataArray and utf-8 is related to your patch? |
| 86 | [13:04:18] <addshore> we defintly touched that area, so most likely |
| 87 | [13:04:21] <RoanKattouw> Also note that there is perfect non-overlap here |
| 88 | [13:04:32] <addshore> right, revert time it is |
| 89 | [13:04:45] <RoanKattouw> The master and replica agree on everything up to oldid 20970 which has text address 8211 |
| 90 | [13:05:02] <Zppix> What is oldid 20970? |
| 91 | [13:05:08] <DanielK_WMDE> RoanKattouw: nativeDataArray? What is that? Never heard of it. I moved the code that handles these flags. If the logic changed in the meantime, it's possible i accidentally reverted that, in an inconsistent way |
| 92 | [13:05:12] <RoanKattouw> Then the replica picked up 8212-8217, 8219-8223 and 8225 |
| 93 | [13:05:16] <Zppix> Perhaps we could get a timeline? |
| 94 | [13:05:22] <RoanKattouw> And the master picked up 8218 and 8224 |
| 95 | [13:05:30] <DanielK_WMDE> RoanKattouw: what's the timestamp of that revision? |
| 96 | [13:05:37] <RoanKattouw> Which one, 20970? |
| 97 | [13:05:41] <RoanKattouw> The last one they agree on? |
| 98 | [13:05:43] <DanielK_WMDE> The patch was merged yesterday... I think around this time of day |
| 99 | [13:06:22] <DanielK_WMDE> RoanKattouw: merged yesterday at 16:10 CET. |
| 100 | [13:06:43] <RoanKattouw> There is no row in the revision table with either rev_id=20970 or rev_text_id=20970 |
| 101 | [13:07:41] <addshore> Merged @ Dec 18 15:10 PM UTC |
| 102 | [13:07:52] <RoanKattouw> Wait, on the MASTER there isn't, on the replica there is |
| 103 | [13:08:00] <DanielK_WMDE> ...what?! |
| 104 | [13:08:02] <RoanKattouw> The DBs also disagree on what's in the revision table |
| 105 | [13:08:22] <RoanKattouw> The 15 most recent revision table entries on the master: |
| 106 | [13:08:24] <addshore> 15:18 it is on the beta clster and scap has run |
| 107 | [13:08:24] <RoanKattouw> https://www.irccloud.com/pastebin/lnRPQI4Z/ |
| 108 | [13:08:25] <DanielK_WMDE> that sounds fubar... |
| 109 | [13:08:28] <RoanKattouw> And on the replica: |
| 110 | [13:08:37] <RoanKattouw> https://www.irccloud.com/pastebin/r7AxQfDT/ |
| 111 | [13:08:57] <addshore> 15:19 I have already spotted the issue https://usercontent.irccloud-cdn.com/file/Mc9x8ROA/image.png |
| 112 | [13:09:08] <addshore> that will be the first edit made with the new code |
| 113 | [13:09:44] <RoanKattouw> Oh wait that's not a disagreement, my bad |
| 114 | [13:09:49] <addshore> RoanKattouw: what is up with rev 7934 where does it appear & look like? |
| 115 | [13:10:05] <DanielK_WMDE> RoanKattouw: nativeDataArray seems to be an AbuseFilter thing. Poor man's content model from many years ago, it seems |
| 116 | [13:10:13] <RoanKattouw> There are strange gaps in the rev_id sequence, but the only real difference is the existence of rev_id 7944 on the master which seems reasonable |
| 117 | [13:10:15] <RoanKattouw> Aah OK |
| 118 | [13:10:16] <DanielK_WMDE> I didn't touch that. should still work as before. |
| 119 | [13:10:26] <RoanKattouw> Right and the query that broke replication had an AbuseFilter method in its comment |
| 120 | [13:10:36] <DanielK_WMDE> \O/ |
| 121 | [13:10:42] <RoanKattouw> \| 7934 \| 20972 \| 20171218152013 \| |
| 122 | [13:10:50] <addshore> Right |
| 123 | [13:10:52] <RoanKattouw> The two DBs agree on that |
| 124 | [13:11:00] <DanielK_WMDE> Buuut.... that actually may indicate an issue with my patch. Maybe AbuseFilter is relying on internals that i now changed |
| 125 | [13:11:06] <DanielK_WMDE> maybe I broke compat with AF |
| 126 | [13:11:07] <RoanKattouw> That is actually exactly the same rev_text_id as in the replication error! |
| 127 | [13:11:07] <addshore> DanielK_WMDE: indeed |
| 128 | [13:11:22] <RoanKattouw> Last_SQL_Error: Error 'Duplicate entry '20972' for key 'PRIMARY'' on query. Default database: 'deploymentwiki'. Query: 'INSERT /* AbuseFilter::storeVarDump */ INTO `text` (old_id,old_text,old_flags) VALUES (NULL,'DB://cluster1/8224','nativeDataArray,gzip,external')' |
| 129 | [13:11:30] <addshore> RoanKattouw: right, and I believe that is the first time i spotted the issue / the first edit i made after the code landed |
| 130 | [13:11:48] <addshore> Welll... |
| 131 | [13:11:48] <addshore> 15:19, 18 December 2017 Addshore (talk \| contribs \| block) deleted page TXTTN (view/restore) |
| 132 | [13:12:01] <addshore> (diff \| hist) . . N TXTTN; 15:20 . . (+7) . . Addshore (talk \| contribs \| block) (Created page with "slakfa;") |
| 133 | [13:12:01] <DanielK_WMDE> the timestamp matches |
| 134 | [13:12:26] <RoanKattouw> addshore: Oh you deleted stuff, that could explain the rev_id gap |
| 135 | [13:12:26] <DanielK_WMDE> addshore: but why is there an edit by abusefilter?... you didn't edit a filter, did you? |
| 136 | [13:12:27] <addshore> and that rev i created at 15:20 is https://deployment.wikimedia.beta.wmflabs.org/w/index.php?title=TXTTN&oldid=7934 |
| 137 | [13:12:53] * DanielK_WMDE feels an abuse filter rewrite coming on |
| 138 | [13:13:04] <addshore> RoanKattouw: yes, although in the UI i cant see what i deleted, or what timestamp or revid as the db is locked https://deployment.wikimedia.beta.wmflabs.org/wiki/Special:Undelete/TXTTN |
| 139 | [13:13:22] <RoanKattouw> I have it |
| 140 | [13:13:30] <RoanKattouw> https://www.irccloud.com/pastebin/Lh4gkYeA/ |
| 141 | [13:13:36] <RoanKattouw> It helps that it was the most recently deleted thing |
| 142 | [13:14:29] <DanielK_WMDE> so, what was broken? AF or deletion? |
| 143 | [13:14:31] <addshore> so i created a page, and it worked, iu deleted it, it wokred, i recreated it and started seeing the issue |
| 144 | [13:14:45] <addshore> and that was due to abusefilter exploding during the first creation or deletion? |
| 145 | [13:14:47] <DanielK_WMDE> how can *any* of this break replication, btw? |
| 146 | [13:15:16] <DanielK_WMDE> addshore: why would abuse filter write revisions? |
| 147 | [13:15:22] <DanielK_WMDE> unless youedit a filter? |
| 148 | [13:15:24] <addshore> DanielK_WMDE: no idea |
| 149 | [13:15:31] <addshore> So, "nativeDataArray" is ONLY used in AbuseFilter |
| 150 | [13:15:32] <DanielK_WMDE> maybe someone else was editing a filter? |
| 151 | [13:15:49] <DanielK_WMDE> (and maybe this is a red herring)? |
| 152 | [13:16:00] <addshore> https://github.com/wikimedia/mediawiki-extensions-AbuseFilter/blob/master/includes/AbuseFilter.class.php#L1286 |
| 153 | [13:16:45] <addshore> *scrolls up to look at the flag mess we already discovered* |
| 154 | [13:17:37] <addshore> 20970 has no nativeDataArray flag on either slave or master, 20971 and 20972 have it on master but not on the slave |
| 155 | [13:17:43] <DanielK_WMDE> addshore: can you find the user & commit message associated with the nativeArrayData entries? |
| 156 | [13:20:12] <RoanKattouw> Yeah old_id 20972 points to completely different things on the master and the replica |
| 157 | [13:20:20] <RoanKattouw> Different flags but also different external store addresses |
| 158 | [13:20:29] <DanielK_WMDE> how the hell does that happen? |
| 159 | [13:20:34] <RoanKattouw> No idea |
| 160 | [13:20:46] <addshore> its almost like some code is somehow manging to write to the slave... |
| 161 | [13:20:47] <RoanKattouw> I'm digging though the binlogs right now trying to find out what other queries were run on the tex ttable |
| 162 | [13:20:53] <RoanKattouw> So far I've only found the one that killed replication |
| 163 | [13:21:00] <RoanKattouw> Hmm yeah maybe |
| 164 | [13:21:12] <addshore> it looks like abusefilter stuff manage to write to master only |
| 165 | [13:21:19] <RoanKattouw> I thought that maybe the master and replica had switched roles, but maybe the replica is writeable when it shouldn't be? That would break replication pretty quickly too |
| 166 | [13:21:22] <DanielK_WMDE> addshore: oooohhhh.... what would happen if you tried that? it should fail HARD! But does it? |
| 167 | [13:21:44] <addshore> DanielK_WMDE: I would have thought it would fail really hard, but perhaps not?> |
| 168 | [13:21:46] <DanielK_WMDE> RoanKattouw: maybe aaron switched them yesterday when trying to fix things? |
| 169 | [13:21:56] <Zppix> Have you guys tried writing to the replicas directly? |
| 170 | [13:22:01] <RoanKattouw> That's possible |
| 171 | [13:22:09] <Zppix> It should fail no? |
| 172 | [13:22:10] <RoanKattouw> I'll check if the replica is read-only |
| 173 | [13:22:39] <RoanKattouw> Welp, the replica is not read-only |
| 174 | [13:23:11] <Zppix> Hmmm |
| 175 | [13:23:16] <Zppix> That cant be good |
| 176 | [13:23:41] <addshore> so, what is the first insert / update / delete we see on the replica that is not on the master? |
| 177 | [13:23:44] <addshore> (if there is one) ? |
| 178 | [13:24:12] <RoanKattouw> I'm trying to figure that out now |
| 179 | [13:25:09] <RoanKattouw> The master's binlog file is 489MB |
| 180 | [13:25:12] <addshore> I'll filter out messages out of this IRC channel and dump the chat log somewhere nice to look at once we have finished here, I think this is all rather valuable info we don't want to have to go hunting for agian |
| 181 | [13:25:24] <RoanKattouw> So analyzing it is a bit slow |
| 182 | [13:26:24] <DanielK_WMDE> I'll look into making unit tests use a r/o wrapper for the DB REPLICA connection. That may not be easy, though |
| 183 | [13:29:08] <RoanKattouw> OK so I'm going through the query history on both servers to try to prove that they received contradictory insertions |
| 184 | [13:29:13] <RoanKattouw> But it's slow going |
| 185 | [13:29:21] <addshore> I'm looking through the usages of DB SLAVE / MASTER in abusefilter and out new revstore stuff now |
| 186 | [13:30:10] <addshore> $dbw = $this->getDBConnection( DB_REPLICA ); |
| 187 | [13:30:11] <RoanKattouw> There is not much in the beta SAL so I can't tell if db04 was legitimately the master at some point |
| 188 | [13:30:12] <addshore> DanielK_WMDE: RoanKattouw ^^ |
| 189 | [13:30:14] <RoanKattouw> LOL |
| 190 | [13:30:16] <RoanKattouw> Wel... |
| 191 | [13:30:22] <RoanKattouw> That would do it |
| 192 | [13:30:22] <addshore> Think I might have found it.... |
| 193 | [13:30:34] <RoanKattouw> Do we not have a mechanism in MW to prevent writes to DB_REPLICA connections?! |
| 194 | [13:30:41] <addshore> RoanKattouw: apparently not |
| 195 | [13:30:45] <addshore> it just does ->insert |
| 196 | [13:31:13] <addshore> https://github.com/wikimedia/mediawiki/blob/master/includes/Storage/SqlBlobStore.php#L231 |
| 197 | [13:31:15] <DanielK_WMDE> RoanKattouw: looking into fixing that at least for unit tests |
| 198 | [13:31:22] <DanielK_WMDE> ideally for everything. |
| 199 | [13:31:50] <addshore> there should be a NoWrite implementation of the db interface |
| 200 | [13:31:56] <RoanKattouw> Well that would do it |
| 201 | [13:31:58] <addshore> that refuses to accept calls to any writ methods |
| 202 | [13:32:06] <RoanKattouw> That's exactly the place where we would expect this, too |
| 203 | [13:32:11] <addshore> or infact... 2 separate interfaces.. |
| 204 | [13:32:14] <RoanKattouw> OK I'm going to file a couple tasks here |
| 205 | [13:32:20] <addshore> RoanKattouw: agreed |
| 206 | [13:32:25] <RoanKattouw> 1) We need all replica DBs to have writes disabled, in labs as well as production |
| 207 | [13:32:31] <RoanKattouw> 2) |
| 208 | [13:32:45] <RoanKattouw> 2) Trying to write to a DB_REPLICA handle needs to throw an exception |
| 209 | [13:32:46] <Zppix> If we disable writes how could masters write to replicas? |
| 210 | [13:32:51] <apergos> O.o you can write to replicas?? |
| 211 | [13:32:58] <RoanKattouw> apergos: Yuuup |
| 212 | [13:32:59] <addshore> apergos: on the beta cluster yes |
| 213 | [13:33:02] <apergos> owie |
| 214 | [13:33:05] <RoanKattouw> Zppix: Replication is exempt from the no-writes flag |
| 215 | [13:33:06] <addshore> and within mediawiki yes, |
| 216 | [13:33:11] <Zppix> Ah ok |
| 217 | [13:33:20] <RoanKattouw> 3) deployment-db04 is hosed and needs to become un-hosed |
| 218 | [13:33:29] <Zppix> (I am not to smart when it comes to db) |
| 219 | [13:33:55] <RoanKattouw> (I wondered the same thing actually, so I looked up the documentation and it said that replication and admins were exempt) |
| 220 | [13:34:33] <RoanKattouw> I'm going to begin with depooling deployment-db04 so as to unbreak the beta wikis |
| 221 | [13:34:37] <apergos> what the heck is getting a db replica handle and trying to write to it anyways? |
| 222 | [13:34:43] <RoanKattouw> Then we probably need to reprovision it or something |
| 223 | [13:34:48] <RoanKattouw> apergos: The new MCR code :) |
| 224 | [13:34:54] <RoanKattouw> https://github.com/wikimedia/mediawiki/blob/master/includes/Storage/SqlBlobStore.php#L231 |
| 225 | [13:34:56] <apergos> oh dear... |
| 226 | [13:35:03] <RoanKattouw> Yeah... |
| 227 | [13:35:13] <RoanKattouw> So the text table's IDs diverged on the master and replica DBs |
| 228 | [13:35:21] <apergos> I know it's not really funny but I am lmao here |
| 229 | [13:35:24] <addshore> there could perhaps be a very simply phpcs sniff for that too |
| 230 | [13:35:26] <RoanKattouw> Then when the master tried to replicate its writes, replication died |
| 231 | [13:35:30] <apergos> of course |
| 232 | [13:35:33] <DanielK_WMDE> apergos: my bad. whoops. we need more eyes on this code... |
| 233 | [13:35:52] <apergos> well the good thing is it gets discovered in labs |
| 234 | [13:35:54] <RoanKattouw> apergos: If you're in #wikimedia-dev already, see my pastes in the backscroll there, you'll see the two servers disagreeing on the contents of the text and revision tables |
| 235 | [13:36:03] <addshore> apergos: RoanKattouw the great news is this issue is much simpler to fix than some of the other possible reasons for breaking that we were thinking about. |
| 236 | [13:36:07] <RoanKattouw> We also had revision IDs with different meanings on the two servers |
| 237 | [13:36:11] <RoanKattouw> Yes |
| 238 | [13:36:12] <apergos> eewwww |
| 239 | [13:36:20] <RoanKattouw> And it's not a chronology protection issue |
| 240 | [13:36:20] <DanielK_WMDE> RoanKattouw, addshore: i'm introducing allowWrite into the datbase base class. I'm also introducing a wrapper so writing to a replica connection will fail even if there is only one db and one connection. |
| 241 | [13:36:22] <apergos> yes I suppose it would all go pear-shaped pretty soon after |
| 242 | [13:36:33] <addshore> DanielK_WMDE: sounds good |
| 243 | [13:36:34] <apergos> doing the backread in the other channel |
| 244 | [13:36:49] <RoanKattouw> DanielK_WMDE: Yes that's good, and necessary to catch these things in dev. I'll file that task first so you can claim it |
| 245 | [13:36:53] <addshore> DanielK_WMDE: shall I fix the line or you? |
| 246 | [13:37:15] <DanielK_WMDE> RoanKattouw, addshore: thank you for helping with this! |
| 247 | [13:37:18] <DanielK_WMDE> addshore: i'll fix it. |
| 248 | [13:37:52] <Zppix> Im glad i mentioned directly writing to replicas :P |
| 249 | [13:38:15] <DanielK_WMDE> Zppix: thank you, indeed :)+ |
| 250 | [13:38:39] <DanielK_WMDE> Zppix: I would have expected that to just fail HARD, not silently corrupt the database |
| 251 | [13:38:51] <Zppix> I wonder if this possibly in prod as well |
| 252 | [13:38:55] <Zppix> Possible* |
| 253 | [13:39:06] <DanielK_WMDE> RoanKattouw: will you also file a ticket to make sure replicates are not writable in production? This kind of thing can SERIOUSLY mess up the database :( |
| 254 | [13:39:15] <DanielK_WMDE> Zppix: yea, scary |
| 255 | [13:39:33] <RoanKattouw> Yes |
| 256 | [13:39:38] <RoanKattouw> See my 1-2-3 above |
| 257 | [13:41:53] <apergos> that was a fun little read |
| 258 | [13:42:08] <apergos> not easy to track down either |
| 259 | [13:42:29] <DanielK_WMDE> yep... several wrong leads |
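For reference, the `allowWrite` flag Daniel mentions in the log above (13:36:20) might look roughly like this; the field and method names are sketch-level guesses, not the merged implementation:

```php
abstract class Database implements IDatabase {
	/** @var bool Whether this handle is allowed to perform writes */
	protected $allowWrite = true;

	/**
	 * The load balancer would pass false whenever it hands out a
	 * connection for DB_REPLICA, including the single-server case
	 * where the "replica" is physically the master.
	 */
	public function setAllowWrite( $allowWrite ) {
		$this->allowWrite = (bool)$allowWrite;
	}

	/** To be called at the top of insert(), update(), delete(), replace(), ... */
	protected function assertIsWritable( $fname ) {
		if ( !$this->allowWrite ) {
			// Exception class is illustrative.
			throw new DBReadOnlyError(
				$this,
				"$fname: write attempted on a connection obtained for DB_REPLICA"
			);
		}
	}

	// ...
}
```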