Paste P6485

Chat Log T183242

Authored by Zppix on Dec 19 2017, 2:06 PM.
RoanKattouw: Aaron tried to help with investigating the chronology issue yesterday, but didn't get anywhere, it seems. That stuff is a bit esoteric to me... do you have an idea how to investigate? Basically, new revisions fail to load seconds after an edit. It *looks* like the page table is out of sync with the revision table. But nobody knows how that can be.
[12:36:40] <DanielK_WMDE> ah, addshore merged the revert
[12:36:50] <DanielK_WMDE> I'll file a revert of the revert in a minute
[12:36:54] <addshore> not yet, just fixing the commit message
[12:37:00] <addshore> but gerrit has died for me
[12:37:17] <DanielK_WMDE> wffm
[12:37:18] <addshore> RoanKattouw: yes, the christmas time for merging and getting it on beta was the plan :)
[12:37:41] <DanielK_WMDE> addshore: we failed to consider that christmas is a bad time to get people to investigate arcane issues, though :)
[12:37:47] <DanielK_WMDE> aaron is gone for the holidays, i guess
[12:37:49] <addshore> indeed
[12:38:08] <DanielK_WMDE> i was expecting issues, but i was hoping they would be a bit more clear cut...
[12:39:01] <RoanKattouw> haha yeah that's a problem
[12:39:53] <RoanKattouw> I guess you could try to figure out what the state of the DB is that the broken requests see (revision table entry referring to a nonexistent page table entry? or the reverse?) and try to figure out if the order of insertions might possibly be wrong
[12:40:21] <addshore> DanielK_WMDE: Just finished fixing the commit message issues, anything else we need to add to that revert?
[12:40:34] <DanielK_WMDE> RoanKattouw: no, the page table references a revision that then fails to load (it may or may not be in the db, we don't know)
[12:40:54] <DanielK_WMDE> addshore: i don't think so
[12:41:15] <RoanKattouw> Hmm that's weird
[12:41:31] <RoanKattouw> One thing I wonder is how the page table row was able to be written then
[12:41:41] <RoanKattouw> Because you don't know the revision ID until after you've written the revision row
[12:41:53] <addshore> hmm, beta db is currently locked?
[12:42:08] <RoanKattouw> Have you isolated a broken case? Do broken cases stay broken for a long-ish period of time, or only briefly?
[12:42:21] <addshore> RoanKattouw: I was about to say you can try it on beta, but the db is locked
[12:42:34] <DanielK_WMDE> RoanKattouw: it does get written. the revision is found once replicas catch up. it just fails to load right after the edit.
[12:43:23] <DanielK_WMDE> RoanKattouw: while investigating, aaron found that replication was broken on beta. we first thought that that was all there was to it. but fixing replication didn't fix the issue. it's all a bit... obscure
[12:44:29] <DanielK_WMDE> RoanKattouw: to replicate, just edit. you get redirected to the "plain" url, and get an error (revision not found). you reload, and it works fine.
[12:45:02] <DanielK_WMDE> RoanKattouw, addshore: the relevant code in Article pre-fills the page with that error message. so that's the error you'll see if *anything* goes wrong. which means we don't actually know what goes wrong.
[12:46:19] <addshore> DanielK_WMDE: I don't think the beta replication issue is fixed
[12:46:19] <addshore> "The database has been automatically locked while the replica database servers catch up to the master."
[12:46:50] <RoanKattouw> "You reload, and it works fine" --> so it's not a long-lasting issue, it's brief
[12:46:55] <RoanKattouw> That makes investigating it harder
[12:47:34] <DanielK_WMDE> RoanKattouw: indeed. and MAYBE it's actually an issue with replication on beta, not with the code, as addshore just said...
[12:47:39] <RoanKattouw> addshore: Hmm if replication is *completely* broken, then maybe that explains it, but I feel like this should be working even when replication is broken (or at least when it's very slow)
[12:47:52] <RoanKattouw> Well, let's see if the revert fixes it
[12:47:57] <addshore> it's the same as when Aaron looked yesterday
[12:48:03] <RoanKattouw> If it does, then your code is doing at least something wrong
[12:48:06] <addshore> wtf is up with beta
[12:48:15] <DanielK_WMDE> RoanKattouw: that indeed is a meaningful experiment :)
[12:48:51] <RoanKattouw> addshore: Hmm at that URL I see 0 lag right now
[12:49:07] <addshore> RoanKattouw: try creating a page though, it says the db is locked due to lag
[12:49:38] <RoanKattouw> I see
[12:49:57] <RoanKattouw> Very strange
[12:50:30] <RoanKattouw> The DB does appear to be read-only
[12:50:35] <RoanKattouw> Marking notifications as read doesn't stick
[12:51:06] <Zppix> Would jenkins doing a scap cause that?
[12:51:15] <addshore> Zppix: the locking? no
[12:51:22] <DanielK_WMDE> RoanKattouw: maybe it's r/o for a reason that has nothing to do with replag?
[12:51:23] <RoanKattouw> No, something weird is going on
[12:51:36] <RoanKattouw> I checked $wgReadOnly and it's false, so that isn't it
[12:51:43] <addshore> DanielK_WMDE: when editing though the reason it shows says it is due to replag
[12:51:52] <RoanKattouw> wfReadOnly() returns true, so let's see why
[12:51:55] <DanielK_WMDE> addshore: wtf?
[12:52:07] <addshore>
[12:52:26] <addshore> Wikimedia\Rdbms\DBReplicationWaitError from line 369 of /srv/mediawiki-staging/php-master/includes/libs/rdbms/lbfactory/LBFactory.php: Could not wait for replica DBs to catch up to deployment-db03
[12:53:02] <addshore> first failure is Build #22122 (18-Dec-2017 19:20:00)
[12:55:47] <RoanKattouw> OK I can confirm that replication is broken on deployment-db04
[12:55:58] <addshore> RoanKattouw: any idea how we can fix that?
[12:56:08] <RoanKattouw>
[12:56:53] <addshore> ooooooh, well, it has things to do with externalstore and revisions / old text, could be related to the patch
[12:56:58] <RoanKattouw> addshore: I know like 1/3 of the theory and I haven't done it before
[12:57:26] <RoanKattouw> Yeah the error that broke replication is a primary key collision in the text table, I hadn't realized that that could be related to MCR but you're right
[12:57:42] <RoanKattouw> The query itself specifies NULL as the old_id so that should have been fine
[12:57:55] <addshore> That actually could be a very big hint as to what is wrong
[12:58:27] <addshore> patch got merged, code started being run, replication broke, lag wasn't bad enough to cause a db lock for a while though, we keep seeing the same issue, and eventually the db locks up
[12:58:52] <RoanKattouw> If replication breaks completely, it should only take 5 minutes for the lag to shut everything down
[12:59:03] <RoanKattouw> At most; I'm not sure if MW detects this broken state specially and goes r/o immediately
[12:59:28] <RoanKattouw> But obviously if replication is broken but lag measurements still work, the lag will increase at a rate of 1 second per second, and the 300s lag threshold will be hit pretty quickly
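[Editor: the timeline RoanKattouw describes — dead replication makes measured lag track wall-clock time until the max-lag threshold forces read-only mode — can be sketched as follows. This is illustrative Python, not MediaWiki code; the 300 s threshold is the one quoted in the chat.]

```python
# Sketch (not MediaWiki code) of the reasoning above: once a replica
# stops applying events, its measured lag grows at one second per
# second, so a 300 s max-lag threshold locks the wiki within 5 minutes.
MAX_LAG_SECONDS = 300  # the lag threshold mentioned in the chat

def is_read_only(seconds_since_replication_died: float) -> bool:
    # With replication fully stopped, "lag" is just elapsed wall-clock
    # time since the last applied event.
    return seconds_since_replication_died >= MAX_LAG_SECONDS

assert not is_read_only(60)   # one minute in: still writable
assert is_read_only(300)      # five minutes in: wiki goes read-only
```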
[13:00:21] <RoanKattouw> OK so I've looked at the text table on deployment-db04 which interestingly has IDs going all the way up to 20982, and has the auto_increment value correctly set to 20983
[13:00:24] <RoanKattouw> So it doesn't look broken
[13:00:28] <RoanKattouw> Now going to look at that table on the master
[13:01:18] <RoanKattouw> Whoa, wtf
[13:01:24] <RoanKattouw> The text table on the master has FEWER entries
[13:01:33] <addshore> interesting....... urm :P
[13:01:42] <RoanKattouw> The highest old_id on the master is 20972
[13:02:09] <addshore> 10 entries less
[13:02:31] <addshore> and 10 edits have been made on deployment wiki since the code landed
[13:02:38] <RoanKattouw> OK something weird happened
[13:02:48] <RoanKattouw> The master and slave disagree on some rows:
[13:03:07] <RoanKattouw> Master:
[13:03:08] <RoanKattouw>
[13:03:18] <RoanKattouw> Slave:
[13:03:22] <RoanKattouw>
[13:03:33] <DanielK_WMDE> RoanKattouw: it's possible that my patch broke that. I tried to keep that logic untouched while moving it to a different class, but it's possible I broke that.
[13:03:49] <DanielK_WMDE> But it's also very possible that it's completely unrelated. I sure hope it is :)
[13:03:58] <RoanKattouw> Any idea whether the difference between nativeDataArray and utf-8 is related to your patch?
[13:04:18] <addshore> we definitely touched that area, so most likely
[13:04:21] <RoanKattouw> Also note that there is perfect non-overlap here
[13:04:32] <addshore> right, revert time it is
[13:04:45] <RoanKattouw> The master and replica agree on everything up to oldid 20970 which has text address 8211
[13:05:02] <Zppix> What is oldid 20970?
[13:05:08] <DanielK_WMDE> RoanKattouw: nativeDataArray? What is that? Never heard of it. I moved the code that handles these flags. If the logic changed in the meantime, it's possible i accidentally reverted that, in an inconsistent way
[13:05:12] <RoanKattouw> Then the replica picked up 8212-8217, 8219-8223 and 8225
[13:05:16] <Zppix> Perhaps we could get a timeline?
[13:05:22] <RoanKattouw> And the master picked up 8218 and 8224
[13:05:30] <DanielK_WMDE> RoanKattouw: what's the timestamp of that revision?
[13:05:37] <RoanKattouw> Which one, 20970?
[13:05:41] <RoanKattouw> The last one they agree on?
[13:05:43] <DanielK_WMDE> The patch was merged yesterday... I think around this time of day
[13:06:22] <DanielK_WMDE> RoanKattouw: merged yesterday at 16:10 CET.
[13:06:43] <RoanKattouw> There is no row in the revision table with either rev_id=20970 or rev_text_id=20970
[13:07:41] <addshore> Merged @ Dec 18 15:10 UTC
[13:07:52] <RoanKattouw> Wait, on the MASTER there isn't, on the replica there is
[13:08:00] <DanielK_WMDE> ...what?!
[13:08:02] <RoanKattouw> The DBs also disagree on what's in the revision table
[13:08:22] <RoanKattouw> The 15 most recent revision table entries on the master:
[13:08:24] <addshore> 15:18 it is on the beta cluster and scap has run
[13:08:24] <RoanKattouw>
[13:08:25] <DanielK_WMDE> that sounds fubar...
[13:08:28] <RoanKattouw> And on the replica:
[13:08:37] <RoanKattouw>
[13:08:57] <addshore> 15:19 I have already spotted the issue
[13:09:08] <addshore> that will be the first edit made with the new code
[13:09:44] <RoanKattouw> Oh wait that's not a disagreement, my bad
[13:09:49] <addshore> RoanKattouw: what is up with rev 7934? where does it appear & what does it look like?
[13:10:05] <DanielK_WMDE> RoanKattouw: nativeDataArray seems to be an AbuseFilter thing. Poor man's content model from many years ago, it seems
[13:10:13] <RoanKattouw> There are strange gaps in the rev_id sequence, but the only real difference is the existence of rev_id 7944 on the master which seems reasonable
[13:10:15] <RoanKattouw> Aah OK
[13:10:16] <DanielK_WMDE> I didn't touch that. should still work as before.
[13:10:26] <RoanKattouw> Right and the query that broke replication had an AbuseFilter method in its comment
[13:10:36] <DanielK_WMDE> \O/
[13:10:42] <RoanKattouw> | 7934 | 20972 | 20171218152013 |
[13:10:50] <addshore> Right
[13:10:52] <RoanKattouw> The two DBs agree on that
[13:11:00] <DanielK_WMDE> Buuut.... that actually may indicate an issue with my patch. Maybe AbuseFilter is relying on internals that i now changed
[13:11:06] <DanielK_WMDE> maybe I broke compat with AF
[13:11:07] <RoanKattouw> That is actually exactly the same rev_text_id as in the replication error!
[13:11:07] <addshore> DanielK_WMDE: indeed
[13:11:22] <RoanKattouw> Last_SQL_Error: Error 'Duplicate entry '20972' for key 'PRIMARY'' on query. Default database: 'deploymentwiki'. Query: 'INSERT /* AbuseFilter::storeVarDump */ INTO `text` (old_id,old_text,old_flags) VALUES (NULL,'DB://cluster1/8224','nativeDataArray,gzip,external')'
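[Editor: the failure mode being pieced together here can be modeled in a few lines. Two servers each hand out text-table primary keys from their own auto_increment counter; once some inserts land directly on the replica, the master's replicated insert arrives with a key the replica already used, and the replication thread dies with exactly this kind of duplicate-key error. Purely illustrative Python, no real MySQL or MediaWiki code involved; the specific IDs and external-store addresses are taken loosely from the chat for flavor.]

```python
# Toy model (not MySQL): each server's text table keeps its own
# auto_increment counter, which is the state that diverged on beta.
class Server:
    def __init__(self, next_id):
        self.rows = {}          # old_id -> old_text payload
        self.next_id = next_id  # per-server auto_increment

    def insert(self, payload):
        old_id = self.next_id
        self.next_id += 1
        self.rows[old_id] = payload
        return old_id

    def replay(self, old_id, payload):
        # Replication replays the master's row with its assigned old_id.
        if old_id in self.rows:
            raise RuntimeError(f"Duplicate entry '{old_id}' for key 'PRIMARY'")
        self.rows[old_id] = payload

# Both servers agree up to old_id 20970, so each would assign 20971 next.
master, replica = Server(20971), Server(20971)

# Buggy code writes straight to the replica, consuming an ID there...
replica.insert("DB://cluster1/8219")

# ...while a legitimate write goes to the master with the same key.
old_id = master.insert("DB://cluster1/8224")

# When replication delivers the master's insert, the replica already has
# a row under that primary key, and the replication thread stops:
try:
    replica.replay(old_id, "DB://cluster1/8224")
except RuntimeError as e:
    print(e)  # Duplicate entry '20971' for key 'PRIMARY'
```

From this point on the replica applies nothing further, lag climbs at one second per second, and the wiki eventually locks, matching the sequence addshore reconstructed above.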
[13:11:30] <addshore> RoanKattouw: right, and I believe that is the first time i spotted the issue / the first edit i made after the code landed
[13:11:48] <addshore> Well...
[13:11:48] <addshore> 15:19, 18 December 2017 Addshore (talk | contribs | block) deleted page TXTTN (view/restore)
[13:12:01] <addshore> (diff | hist) . . N TXTTN‎; 15:20 . . (+7)‎ . . ‎Addshore (talk | contribs | block)‎ (Created page with "slakfa;")
[13:12:01] <DanielK_WMDE> the timestamp matches
[13:12:26] <RoanKattouw> addshore: Oh you deleted stuff, that could explain the rev_id gap
[13:12:26] <DanielK_WMDE> addshore: but why is there an edit by abusefilter?... you didn't edit a filter, did you?
[13:12:27] <addshore> and that rev i created at 15:20 is
[13:12:53] * DanielK_WMDE feels an abuse filter rewrite coming on
[13:13:04] <addshore> RoanKattouw: yes, although in the UI i can't see what i deleted, or what timestamp or revid as the db is locked
[13:13:22] <RoanKattouw> I have it
[13:13:30] <RoanKattouw>
[13:13:36] <RoanKattouw> It helps that it was the most recently deleted thing
[13:14:29] <DanielK_WMDE> so, what was broken? AF or deletion?
[13:14:31] <addshore> so i created a page, and it worked, i deleted it, it worked, i recreated it and started seeing the issue
[13:14:45] <addshore> and that was due to abusefilter exploding during the first creation or deletion?
[13:14:47] <DanielK_WMDE> how can *any* of this break replication, btw?
[13:15:16] <DanielK_WMDE> addshore: why would abuse filter write revisions?
[13:15:22] <DanielK_WMDE> unless you edit a filter?
[13:15:24] <addshore> DanielK_WMDE: no idea
[13:15:31] <addshore> So, "nativeDataArray" is ONLY used in AbuseFilter
[13:15:32] <DanielK_WMDE> maybe someone else was editing a filter?
[13:15:49] <DanielK_WMDE> (and maybe this is a red herring)?
[13:16:00] <addshore>
[13:16:45] <addshore> *scrolls up to look at the flag mess we already discovered*
[13:17:37] <addshore> 20970 has no nativeDataArray flag on either slave or master, 20971 and 20972 have it on master but not on the slave
[13:17:43] <DanielK_WMDE> addshore: can you find the user & commit message associated with the nativeDataArray entries?
[13:20:12] <RoanKattouw> Yeah old_id 20972 points to completely different things on the master and the replica
[13:20:20] <RoanKattouw> Different flags but also different external store addresses
[13:20:29] <DanielK_WMDE> how the hell does that happen?
[13:20:34] <RoanKattouw> No idea
[13:20:46] <addshore> it's almost like some code is somehow managing to write to the slave...
[13:20:47] <RoanKattouw> I'm digging through the binlogs right now trying to find out what other queries were run on the text table
[13:20:53] <RoanKattouw> So far I've only found the one that killed replication
[13:21:00] <RoanKattouw> Hmm yeah maybe
[13:21:12] <addshore> it looks like abusefilter stuff managed to write to master only
[13:21:19] <RoanKattouw> I thought that maybe the master and replica had switched roles, but maybe the replica is writeable when it shouldn't be? That would break replication pretty quickly too
[13:21:22] <DanielK_WMDE> addshore: oooohhhh.... what would happen if you tried that? it should fail HARD! But does it?
[13:21:44] <addshore> DanielK_WMDE: I would have thought it would fail really hard, but perhaps not?
[13:21:46] <DanielK_WMDE> RoanKattouw: maybe aaron switched them yesterday when trying to fix things?
[13:21:56] <Zppix> Have you guys tried writing to the replicas directly?
[13:22:01] <RoanKattouw> That's possible
[13:22:09] <Zppix> It should fail no?
[13:22:10] <RoanKattouw> I'll check if the replica is read-only
[13:22:39] <RoanKattouw> Welp, the replica is not read-only
[13:23:11] <Zppix> Hmmm
[13:23:16] <Zppix> That can't be good
[13:23:41] <addshore> so, what is the first insert / update / delete we see on the replica that is not on the master?
[13:23:44] <addshore> (if there is one) ?
[13:24:12] <RoanKattouw> I'm trying to figure that out now
[13:25:09] <RoanKattouw> The master's binlog file is 489MB
[13:25:12] <addshore> I'll filter the messages out of this IRC channel and dump the chat log somewhere nice to look at once we have finished here, I think this is all rather valuable info we don't want to have to go hunting for again
[13:25:24] <RoanKattouw> So analyzing it is a bit slow
[13:26:24] <DanielK_WMDE> I'll look into making unit tests use a r/o wrapper for the DB REPLICA connection. That may not be easy, though
[13:29:08] <RoanKattouw> OK so I'm going through the query history on both servers to try to prove that they received contradictory insertions
[13:29:13] <RoanKattouw> But it's slow going
[13:29:21] <addshore> I'm looking through the usages of DB SLAVE / MASTER in abusefilter and our new revstore stuff now
[13:30:10] <addshore> $dbw = $this->getDBConnection( DB_REPLICA );
[13:30:11] <RoanKattouw> There is not much in the beta SAL so I can't tell if db04 was legitimately the master at some point
[13:30:12] <addshore> DanielK_WMDE: RoanKattouw ^^
[13:30:14] <RoanKattouw> LOL
[13:30:16] <RoanKattouw> Well...
[13:30:22] <RoanKattouw> That would do it
[13:30:22] <addshore> Think I might have found it....
[13:30:34] <RoanKattouw> Do we not have a mechanism in MW to prevent writes to DB_REPLICA connections?!
[13:30:41] <addshore> RoanKattouw: apparently not
[13:30:45] <addshore> it just does ->insert
[13:31:13] <addshore>
[13:31:15] <DanielK_WMDE> RoanKattouw: looking into fixing that at least for unit tests
[13:31:22] <DanielK_WMDE> ideally for everything.
[13:31:50] <addshore> there should be a NoWrite implementation of the db interface
[13:31:56] <RoanKattouw> Well that would do it
[13:31:58] <addshore> that refuses to accept calls to any write methods
[13:32:06] <RoanKattouw> That's exactly the place where we would expect this, too
[13:32:11] <addshore> or in fact... 2 separate interfaces..
[13:32:14] <RoanKattouw> OK I'm going to file a couple tasks here
[13:32:20] <addshore> RoanKattouw: agreed
[13:32:25] <RoanKattouw> 1) We need all replica DBs to have writes disabled, in labs as well as production
[13:32:31] <RoanKattouw> 2)
[13:32:45] <RoanKattouw> 2) Trying to write to a DB_REPLICA handle needs to throw an exception
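[Editor: the guard addshore and DanielK_WMDE are proposing — a "NoWrite" wrapper around replica connections that refuses write calls instead of silently corrupting the replica — might look roughly like this. Illustrative Python; MediaWiki's actual fix lived in the PHP Rdbms library, and the method names here are hypothetical stand-ins.]

```python
# Hypothetical sketch of a read-only wrapper for DB_REPLICA handles:
# reads pass through to the wrapped connection, writes throw loudly.
class ReadOnlyConnectionError(Exception):
    pass

class ReadOnlyConnection:
    # Method names are illustrative, not MediaWiki's real IDatabase API.
    WRITE_METHODS = {"insert", "update", "delete", "upsert"}

    def __init__(self, conn):
        self._conn = conn

    def __getattr__(self, name):
        if name in self.WRITE_METHODS:
            raise ReadOnlyConnectionError(
                f"attempted {name}() on a DB_REPLICA connection"
            )
        return getattr(self._conn, name)

class FakeConnection:  # stand-in for a real DB handle
    def select(self, sql):
        return ["row"]
    def insert(self, sql):
        return None  # would corrupt a replica if ever reached

replica = ReadOnlyConnection(FakeConnection())
assert replica.select("SELECT 1") == ["row"]  # reads still work
try:
    replica.insert("INSERT ...")              # writes fail hard
except ReadOnlyConnectionError as e:
    print("blocked:", e)
```

With something like this in place, the `$dbw = $this->getDBConnection( DB_REPLICA )` bug above would have thrown on the first write instead of diverging the two servers.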
[13:32:46] <Zppix> If we disable writes how could masters write to replicas?
[13:32:51] <apergos> O.o you can write to replicas??
[13:32:58] <RoanKattouw> apergos: Yuuup
[13:32:59] <addshore> apergos: on the beta cluster yes
[13:33:02] <apergos> owie
[13:33:05] <RoanKattouw> Zppix: Replication is exempt from the no-writes flag
[13:33:06] <addshore> and within mediawiki yes,
[13:33:11] <Zppix> Ah ok
[13:33:20] <RoanKattouw> 3) deployment-db04 is hosed and needs to become un-hosed
[13:33:29] <Zppix> (I am not too smart when it comes to db)
[13:33:55] <RoanKattouw> (I wondered the same thing actually, so I looked up the documentation and it said that replication and admins were exempt)
[13:34:33] <RoanKattouw> I'm going to begin with depooling deployment-db04 so as to unbreak the beta wikis
[13:34:37] <apergos> what the heck is getting a db replica handle and trying to write to it anyways?
[13:34:43] <RoanKattouw> Then we probably need to reprovision it or something
[13:34:48] <RoanKattouw> apergos: The new MCR code :)
[13:34:54] <RoanKattouw>
[13:34:56] <apergos> oh dear...
[13:35:03] <RoanKattouw> Yeah...
[13:35:13] <RoanKattouw> So the text table's IDs diverged on the master and replica DBs
[13:35:21] <apergos> I know it's not really funny but I am lmao here
[13:35:24] <addshore> there could perhaps be a very simple phpcs sniff for that too
[13:35:26] <RoanKattouw> Then when the master tried to replicate its writes, replication died
[13:35:30] <apergos> of course
[13:35:33] <DanielK_WMDE> apergos: my bad. whoops. we need more eyes on this code...
[13:35:52] <apergos> well the good thing is it gets discovered in labs
[13:35:54] <RoanKattouw> apergos: If you're in #wikimedia-dev already, see my pastes in the backscroll there, you'll see the two servers disagreeing on the contents of the text and revision tables
[13:36:03] <addshore> apergos: RoanKattouw the great news is this issue is much simpler to fix than some of the other possible reasons for breaking that we were thinking about.
[13:36:07] <RoanKattouw> We also had revision IDs with different meanings on the two servers
[13:36:11] <RoanKattouw> Yes
[13:36:12] <apergos> eewwww
[13:36:20] <RoanKattouw> And it's not a chronology protection issue
[13:36:20] <DanielK_WMDE> RoanKattouw, addshore: i'm introducing allowWrite into the database base class. I'm also introducing a wrapper so writing to a replica connection will fail even if there is only one db and one connection.
[13:36:22] <apergos> yes I suppose it would all go pear-shaped pretty soon after
[13:36:33] <addshore> DanielK_WMDE: sounds good
[13:36:34] <apergos> doing the backread in the other channel
[13:36:49] <RoanKattouw> DanielK_WMDE: Yes that's good, and necessary to catch these things in dev. I'll file that task first so you can claim it
[13:36:53] <addshore> DanielK_WMDE: shall I fix the line or you?
[13:37:15] <DanielK_WMDE> RoanKattouw, addshore: thank you for helping with this!
[13:37:18] <DanielK_WMDE> addshore: i'll fix it.
[13:37:52] <Zppix> I'm glad i mentioned directly writing to replicas :P
[13:38:15] <DanielK_WMDE> Zppix: thank you, indeed :)
[13:38:39] <DanielK_WMDE> Zppix: I would have expected that to just fail HARD, not silently corrupt the database
[13:38:51] <Zppix> I wonder if this is possible in prod as well
[13:39:06] <DanielK_WMDE> RoanKattouw: will you also file a ticket to make sure replicas are not writable in production? This kind of thing can SERIOUSLY mess up the database :(
[13:39:15] <DanielK_WMDE> Zppix: yea, scary
[13:39:33] <RoanKattouw> Yes
[13:39:38] <RoanKattouw> See my 1-2-3 above
[13:41:53] <apergos> that was a fun little read
[13:42:08] <apergos> not easy to track down either
[13:42:29] <DanielK_WMDE> yep... several wrong leads