Paste P6485

Chat Log T183242

Authored by Zppix on Dec 19 2017, 2:06 PM.
RoanKattouw: Aaron tried to help with investigating the chronology issue yesterday, but didn't get anywhere, it seems. That stuff is a bit esoteric to me... do you have an idea how to investigate? Basically, new revisions fail to load seconds after an edit. It *looks* like the page table is out of sync with the revision table. But nobody knows how that can be.
[12:36:40] <DanielK_WMDE> ah, addshore merged the revert
[12:36:50] <DanielK_WMDE> I'll file a revert of the revert in a minute
[12:36:54] <addshore> not yet, just fixing the commit message
[12:37:00] <addshore> but gerrit has died for me
[12:37:17] <DanielK_WMDE> wffm
[12:37:18] <addshore> RoanKattouw: yes, the christmas time for merging and getting it on beta was the plan :)
[12:37:41] <DanielK_WMDE> addshore: we failed to consider that christmas is a bad time to get people to investigate arcane issues, though :)
[12:37:47] <DanielK_WMDE> aaron is gone for the holidays, i guess
[12:37:49] <addshore> indeed
[12:38:08] <DanielK_WMDE> i was expecting issues, but i was hoping they would be a bit more clear cut...
[12:39:01] <RoanKattouw> haha yeah that's a problem
[12:39:53] <RoanKattouw> I guess you could try to figure out what the state of the DB is that the broken requests see (revision table entry referring to a nonexistent page table entry? or the reverse?) and try to figure out if the order of insertions might possibly be wrong
[12:40:21] <addshore> DanielK_WMDE: Just finished fixing the commit message issues, anything else we need to add to that revert?
[12:40:34] <DanielK_WMDE> RoanKattouw: no, the page table references a revision that then fails to load (it may or may not be in the db, we don't know)
[12:40:54] <DanielK_WMDE> addshore: i don't think so
[12:41:15] <RoanKattouw> Hmm that's weird
[12:41:31] <RoanKattouw> One thing I wonder is how the page table row was able to be written then
[12:41:41] <RoanKattouw> Because you don't know the revision ID until after you've written the revision row
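The ordering constraint Roan describes, that the revision row must be written before the page row can point at it, can be sketched as follows. This is a minimal illustration using Python's sqlite3 with loosely MediaWiki-flavored table names, not the actual MediaWiki code path or schema:

```python
import sqlite3

# Sketch of the write ordering: the revision row is inserted first, because
# only then is its auto-generated rev_id known; the page row is updated to
# reference it afterwards. Table/column names are illustrative.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_latest INTEGER)")
db.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY AUTOINCREMENT, rev_page INTEGER)")

db.execute("INSERT INTO page (page_id, page_latest) VALUES (1, 0)")

# Step 1: insert the revision; only now do we learn its rev_id.
cur = db.execute("INSERT INTO revision (rev_page) VALUES (1)")
new_rev_id = cur.lastrowid

# Step 2: point the page row at the new revision.
db.execute("UPDATE page SET page_latest = ? WHERE page_id = 1", (new_rev_id,))

latest = db.execute("SELECT page_latest FROM page WHERE page_id = 1").fetchone()[0]
print(latest)  # 1
```

This is why a page row referencing a revision that fails to load is surprising: under this ordering, the revision should always exist by the time the page row names it.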
[12:41:53] <addshore> hmm, beta db is currently locked?
[12:42:08] <RoanKattouw> Have you isolated a broken case? Do broken cases stay broken for a long-ish period of time, or only briefly?
[12:42:21] <addshore> RoanKattouw: I was about to say you can try it on beta, but the db is locked
[12:42:34] <DanielK_WMDE> RoanKattouw: it does get written. the revision is found once replicas catch up. it just fails to load right after the edit.
[12:43:23] <DanielK_WMDE> RoanKattouw: while investigating, aaron found that replication was broken on beta. we first thought that that was all there was to it. but fixing replication didn't fix the issue. it's all a bit... obscure
[12:44:29] <DanielK_WMDE> RoanKattouw: to replicate, just edit. you get redirected to the "plain" url, and get an error (revision not found). you reload, and it works fine.
[12:45:02] <DanielK_WMDE> RoanKattouw, addshore: the relevant code in Article pre-fills the page with that error message. so that's the error you'll see if *anything* goes wrong. which means we don't actually know what goes wrong.
[12:46:19] <addshore> DanielK_WMDE: I don't think the beta replication issue is fixed
[12:46:19] <addshore> "The database has been automatically locked while the replica database servers catch up to the master."
[12:46:50] <RoanKattouw> "You reload, and it works fine" --> so it's not a long-lasting issue, it's brief
[12:46:55] <RoanKattouw> That makes investigating it harder
[12:47:34] <DanielK_WMDE> RoanKattouw: indeed. and MAYBE it's actually an issue with replication on beta, not with the code, as addshore just said...
[12:47:39] <RoanKattouw> addshore: Hmm if replication is *completely* broken, then maybe that explains it, but I feel like this should be working even when replication is broken (or at least when it's very slow)
[12:47:52] <RoanKattouw> Well, let's see if the revert fixes it
[12:47:57] <addshore> is the same as when Aaron looked yesterday
[12:48:03] <RoanKattouw> If it does, then your code is doing at least something wrong
[12:48:06] <addshore> wtf is up with beta
[12:48:15] <DanielK_WMDE> RoanKattouw: that indeed is a meaningful experiment :)
[12:48:51] <RoanKattouw> addshore: Hmm at that URL I see 0 lag right now
[12:49:07] <addshore> RoanKattouw: try creating a page though, it says the db is locked due to lag
[12:49:38] <RoanKattouw> I see
[12:49:57] <RoanKattouw> Very strange
[12:50:30] <RoanKattouw> The DB does appear to be read-only
[12:50:35] <RoanKattouw> Marking notifications as read doesn't stick
[12:51:06] <Zppix> Would jenkins doing a scap cause that?
[12:51:15] <addshore> Zppix: the locking? no
[12:51:22] <DanielK_WMDE> RoanKattouw: maybe it's r/o for a reason that has nothing to do with replag?
[12:51:23] <RoanKattouw> No, something weird is going on
[12:51:36] <RoanKattouw> I checked $wgReadOnly and it's false, so that isn't it
[12:51:43] <addshore> DanielK_WMDE: when editing though the reason it shows says it is due to replag
[12:51:52] <RoanKattouw> wfReadOnly() returns true, so let's see why
[12:51:55] <DanielK_WMDE> addshore: wtf?
[12:52:07] <addshore>
[12:52:26] <addshore> Wikimedia\Rdbms\DBReplicationWaitError from line 369 of /srv/mediawiki-staging/php-master/includes/libs/rdbms/lbfactory/LBFactory.php: Could not wait for replica DBs to catch up to deployment-db03
[12:53:02] <addshore> first failure is Build #22122 (18-Dec-2017 19:20:00)
[12:55:47] <RoanKattouw> OK I can confirm that replication is broken on deployment-db04
[12:55:58] <addshore> RoanKattouw: any idea how we can fix that?
[12:56:08] <RoanKattouw>
[12:56:53] <addshore> ooooooh, well, it has things to do with externalstore and revisions / old text, could be related to the patch
[12:56:58] <RoanKattouw> addshore: I know like 1/3 of the theory and I haven't done it before
[12:57:26] <RoanKattouw> Yeah the error that broke replication is a primary key collision in the text table, I hadn't realized that that could be related to MCR but you're right
[12:57:42] <RoanKattouw> The query itself specifies NULL as the old_id so that should have been fine
[12:57:55] <addshore> That actually could be a very big hint as to what is wrong
[12:58:27] <addshore> patch got merged, code started being run, replication broke, lag wasn't bad enough to cause a db lock for a while though, we keep seeing the same issue, and eventually the db locks up
[12:58:52] <RoanKattouw> If replication breaks completely, it should only take 5 minutes for the lag to shut everything down
[12:59:03] <RoanKattouw> At most; I'm not sure if MW detects this broken state specially and goes r/o immediately
[12:59:28] <RoanKattouw> But obviously if replication is broken but lag measurements still work, the lag will increase at a rate of 1 second per second, and the 300s lag threshold will be hit pretty quickly
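The lag-shutdown behavior Roan is describing can be sketched as a simple threshold check: if replication halts but lag is still measurable, lag grows one second per second, so a 300-second cutoff locks the wiki within five minutes. The function name and threshold below are illustrative, not MediaWiki's actual API:

```python
# Sketch of lag-based read-only logic (assumed names, not MediaWiki's):
# the wiki goes read-only as soon as any replica's measured lag exceeds
# the configured maximum.
MAX_LAG_SECONDS = 300

def is_read_only(replica_lags):
    """Return True if any replica's lag exceeds the threshold."""
    return any(lag > MAX_LAG_SECONDS for lag in replica_lags)

# Replication has been dead for six minutes: lag has climbed to 360s.
print(is_read_only([0, 360]))  # True
print(is_read_only([0, 12]))   # False
```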
[13:00:21] <RoanKattouw> OK so I've looked at the text table on deployment-db04 which interestingly has IDs going all the way up to 20982, and has the auto_increment value correctly set to 20983
[13:00:24] <RoanKattouw> So it doesn't look broken
[13:00:28] <RoanKattouw> Now going to look at that table on the master
[13:01:18] <RoanKattouw> Whoa, wtf
[13:01:24] <RoanKattouw> The text table on the master has FEWER entries
[13:01:33] <addshore> interesting....... urm :P
[13:01:42] <RoanKattouw> The highest old_id on the master is 20972
[13:02:09] <addshore> 10 entries less
[13:02:31] <addshore> and 10 edits have been made on deployment wiki since the code landed
[13:02:38] <RoanKattouw> OK something weird happened
[13:02:48] <RoanKattouw> The master and slave disagree on some rows:
[13:03:07] <RoanKattouw> Master:
[13:03:08] <RoanKattouw>
[13:03:18] <RoanKattouw> Slave:
[13:03:22] <RoanKattouw>
[13:03:33] <DanielK_WMDE> RoanKattouw: it's possible that my patch broke that. I tried to keep that logic untouched while moving it to a different class, but it's possible I broke that.
[13:03:49] <DanielK_WMDE> But it's also very possible that it's completely unrelated. I sure hope it is :)
[13:03:58] <RoanKattouw> Any idea whether the difference between nativeDataArray and utf-8 is related to your patch?
[13:04:18] <addshore> we definitely touched that area, so most likely
[13:04:21] <RoanKattouw> Also note that there is perfect non-overlap here
[13:04:32] <addshore> right, revert time it is
[13:04:45] <RoanKattouw> The master and replica agree on everything up to oldid 20970 which has text address 8211
[13:05:02] <Zppix> What is oldid 20970?
[13:05:08] <DanielK_WMDE> RoanKattouw: nativeDataArray? What is that? Never heard of it. I moved the code that handles these flags. If the logic changed in the meantime, it's possible i accidentally reverted that, in an inconsistent way
[13:05:12] <RoanKattouw> Then the replica picked up 8212-8217, 8219-8223 and 8225
[13:05:16] <Zppix> Perhaps we could get a timeline?
[13:05:22] <RoanKattouw> And the master picked up 8218 and 8224
[13:05:30] <DanielK_WMDE> RoanKattouw: what's the timestamp of that revision?
[13:05:37] <RoanKattouw> Which one, 20970?
[13:05:41] <RoanKattouw> The last one they agree on?
[13:05:43] <DanielK_WMDE> The patch was merged yesterday... I think around this time of day
[13:06:22] <DanielK_WMDE> RoanKattouw: merged yesterday at 16:10 CET.
[13:06:43] <RoanKattouw> There is no row in the revision table with either rev_id=20970 or rev_text_id=20970
[13:07:41] <addshore> Merged @ Dec 18 15:10 PM UTC
[13:07:52] <RoanKattouw> Wait, on the MASTER there isn't, on the replica there is
[13:08:00] <DanielK_WMDE> ...what?!
[13:08:02] <RoanKattouw> The DBs also disagree on what's in the revision table
[13:08:22] <RoanKattouw> The 15 most recent revision table entries on the master:
[13:08:24] <addshore> 15:18 it is on the beta cluster and scap has run
[13:08:24] <RoanKattouw>
[13:08:25] <DanielK_WMDE> that sounds fubar...
[13:08:28] <RoanKattouw> And on the replica:
[13:08:37] <RoanKattouw>
[13:08:57] <addshore> 15:19 I have already spotted the issue
[13:09:08] <addshore> that will be the first edit made with the new code
[13:09:44] <RoanKattouw> Oh wait that's not a disagreement, my bad
[13:09:49] <addshore> RoanKattouw: what is up with rev 7934, where does it appear & what does it look like?
[13:10:05] <DanielK_WMDE> RoanKattouw: nativeDataArray seems to be an AbuseFilter thing. Poor man's content model from many years ago, it seems
[13:10:13] <RoanKattouw> There are strange gaps in the rev_id sequence, but the only real difference is the existence of rev_id 7944 on the master which seems reasonable
[13:10:15] <RoanKattouw> Aah OK
[13:10:16] <DanielK_WMDE> I didn't touch that. should still work as before.
[13:10:26] <RoanKattouw> Right and the query that broke replication had an AbuseFilter method in its comment
[13:10:36] <DanielK_WMDE> \O/
[13:10:42] <RoanKattouw> | 7934 | 20972 | 20171218152013 |
[13:10:50] <addshore> Right
[13:10:52] <RoanKattouw> The two DBs agree on that
[13:11:00] <DanielK_WMDE> Buuut.... that actually may indicate an issue with my patch. Maybe AbuseFilter is relying on internals that i now changed
[13:11:06] <DanielK_WMDE> maybe I broke compat with AF
[13:11:07] <RoanKattouw> That is actually exactly the same rev_text_id as in the replication error!
[13:11:07] <addshore> DanielK_WMDE: indeed
[13:11:22] <RoanKattouw> Last_SQL_Error: Error 'Duplicate entry '20972' for key 'PRIMARY'' on query. Default database: 'deploymentwiki'. Query: 'INSERT /* AbuseFilter::storeVarDump */ INTO `text` (old_id,old_text,old_flags) VALUES (NULL,'DB://cluster1/8224','nativeDataArray,gzip,external')'
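The failure mode the team is converging on can be simulated in a few lines: writes landed on both servers, each handing out text-table IDs from its own auto_increment counter, so when the master's INSERT (old_id NULL at write time, replayed with the master's generated ID) reached the replica, that ID was already taken. This is a toy model of the divergence, not real MySQL replication:

```python
# Toy simulation of master/replica auto_increment divergence. Both servers
# start from the same next ID; AbuseFilter's writes mistakenly go straight
# to the replica while legitimate writes go to the master, so the same IDs
# get allocated on both sides. Replaying the master's rows on the replica
# then hits a primary-key collision, as in the Last_SQL_Error above.
master_ids, replica_ids = set(), set()
next_master = next_replica = 20971  # both servers share this state initially

def insert(store, counter):
    """Allocate the next ID in this server's text table."""
    store.add(counter)
    return counter + 1

# Two misdirected writes land on the replica (it takes 20971 and 20972):
for _ in range(2):
    next_replica = insert(replica_ids, next_replica)

# Two legitimate writes land on the master (it also takes 20971 and 20972):
for _ in range(2):
    next_master = insert(master_ids, next_master)

# Replication replays the master's rows on the replica by explicit ID;
# every overlapping ID is a duplicate-key error waiting to happen.
collisions = master_ids & replica_ids
print(sorted(collisions))  # [20971, 20972]
```

This also matches Roan's "perfect non-overlap" observation: the two servers never saw each other's writes, they just burned through the same ID range independently.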
[13:11:30] <addshore> RoanKattouw: right, and I believe that is the first time i spotted the issue / the first edit i made after the code landed
[13:11:48] <addshore> Welll...
[13:11:48] <addshore> 15:19, 18 December 2017 Addshore (talk | contribs | block) deleted page TXTTN (view/restore)
[13:12:01] <addshore> (diff | hist) . . N TXTTN‎; 15:20 . . (+7)‎ . . ‎Addshore (talk | contribs | block)‎ (Created page with "slakfa;")
[13:12:01] <DanielK_WMDE> the timestamp matches
[13:12:26] <RoanKattouw> addshore: Oh you deleted stuff, that could explain the rev_id gap
[13:12:26] <DanielK_WMDE> addshore: but why is there an edit by abusefilter?... you didn't edit a filter, did you?
[13:12:27] <addshore> and that rev i created at 15:20 is
[13:12:53] * DanielK_WMDE feels an abuse filter rewrite coming on
[13:13:04] <addshore> RoanKattouw: yes, although in the UI I can't see what I deleted, or what timestamp or revid as the db is locked
[13:13:22] <RoanKattouw> I have it
[13:13:30] <RoanKattouw>
[13:13:36] <RoanKattouw> It helps that it was the most recently deleted thing
[13:14:29] <DanielK_WMDE> so, what was broken? AF or deletion?
[13:14:31] <addshore> so i created a page, and it worked, i deleted it, it worked, i recreated it and started seeing the issue
[13:14:45] <addshore> and that was due to abusefilter exploding during the first creation or deletion?
[13:14:47] <DanielK_WMDE> how can *any* of this break replication, btw?
[13:15:16] <DanielK_WMDE> addshore: why would abuse filter write revisions?
[13:15:22] <DanielK_WMDE> unless you edit a filter?
[13:15:24] <addshore> DanielK_WMDE: no idea
[13:15:31] <addshore> So, "nativeDataArray" is ONLY used in AbuseFilter
[13:15:32] <DanielK_WMDE> maybe someone else was editing a filter?
[13:15:49] <DanielK_WMDE> (and maybe this is a red herring)?
[13:16:00] <addshore>
[13:16:45] <addshore> *scrolls up to look at the flag mess we already discovered*
[13:17:37] <addshore> 20970 has no nativeDataArray flag on either slave or master, 20971 and 20972 have it on master but not on the slave
[13:17:43] <DanielK_WMDE> addshore: can you find the user & commit message associated with the nativeDataArray entries?
[13:20:12] <RoanKattouw> Yeah old_id 20972 points to completely different things on the master and the replica
[13:20:20] <RoanKattouw> Different flags but also different external store addresses
[13:20:29] <DanielK_WMDE> how the hell does that happen?
[13:20:34] <RoanKattouw> No idea
[13:20:46] <addshore> it's almost like some code is somehow managing to write to the slave...
[13:20:47] <RoanKattouw> I'm digging through the binlogs right now trying to find out what other queries were run on the text table
[13:20:53] <RoanKattouw> So far I've only found the one that killed replication
[13:21:00] <RoanKattouw> Hmm yeah maybe
[13:21:12] <addshore> it looks like abusefilter stuff managed to write to master only
[13:21:19] <RoanKattouw> I thought that maybe the master and replica had switched roles, but maybe the replica is writeable when it shouldn't be? That would break replication pretty quickly too
[13:21:22] <DanielK_WMDE> addshore: oooohhhh.... what would happen if you tried that? it should fail HARD! But does it?
[13:21:44] <addshore> DanielK_WMDE: I would have thought it would fail really hard, but perhaps not?
[13:21:46] <DanielK_WMDE> RoanKattouw: maybe aaron switched them yesterday when trying to fix things?
[13:21:56] <Zppix> Have you guys tried writing to the replicas directly?
[13:22:01] <RoanKattouw> That's possible
[13:22:09] <Zppix> It should fail no?
[13:22:10] <RoanKattouw> I'll check if the replica is read-only
[13:22:39] <RoanKattouw> Welp, the replica is not read-only
[13:23:11] <Zppix> Hmmm
[13:23:16] <Zppix> That cant be good
[13:23:41] <addshore> so, what is the first insert / update / delete we see on the replica that is not on the master?
[13:23:44] <addshore> (if there is one) ?
[13:24:12] <RoanKattouw> I'm trying to figure that out now
[13:25:09] <RoanKattouw> The master's binlog file is 489MB
[13:25:12] <addshore> I'll filter out messages out of this IRC channel and dump the chat log somewhere nice to look at once we have finished here, I think this is all rather valuable info we don't want to have to go hunting for again
[13:25:24] <RoanKattouw> So analyzing it is a bit slow
[13:26:24] <DanielK_WMDE> I'll look into making unit tests use a r/o wrapper for the DB REPLICA connection. That may not be easy, though
[13:29:08] <RoanKattouw> OK so I'm going through the query history on both servers to try to prove that they received contradictory insertions
[13:29:13] <RoanKattouw> But it's slow going
[13:29:21] <addshore> I'm looking through the usages of DB SLAVE / MASTER in abusefilter and our new revstore stuff now
[13:30:10] <addshore> $dbw = $this->getDBConnection( DB_REPLICA );
[13:30:11] <RoanKattouw> There is not much in the beta SAL so I can't tell if db04 was legitimately the master at some point
[13:30:12] <addshore> DanielK_WMDE: RoanKattouw ^^
[13:30:14] <RoanKattouw> LOL
[13:30:16] <RoanKattouw> Well...
[13:30:22] <RoanKattouw> That would do it
[13:30:22] <addshore> Think I might have found it....
[13:30:34] <RoanKattouw> Do we not have a mechanism in MW to prevent writes to DB_REPLICA connections?!
[13:30:41] <addshore> RoanKattouw: apparently not
[13:30:45] <addshore> it just does ->insert
[13:31:13] <addshore>
[13:31:15] <DanielK_WMDE> RoanKattouw: looking into fixing that at least for unit tests
[13:31:22] <DanielK_WMDE> ideally for everything.
[13:31:50] <addshore> there should be a NoWrite implementation of the db interface
[13:31:56] <RoanKattouw> Well that would do it
[13:31:58] <addshore> that refuses to accept calls to any write methods
[13:32:06] <RoanKattouw> That's exactly the place where we would expect this, too
[13:32:11] <addshore> or in fact... 2 separate interfaces..
[13:32:14] <RoanKattouw> OK I'm going to file a couple tasks here
[13:32:20] <addshore> RoanKattouw: agreed
[13:32:25] <RoanKattouw> 1) We need all replica DBs to have writes disabled, in labs as well as production
[13:32:31] <RoanKattouw> 2)
[13:32:45] <RoanKattouw> 2) Trying to write to a DB_REPLICA handle needs to throw an exception
[13:32:46] <Zppix> If we disable writes how could masters write to replicas?
[13:32:51] <apergos> O.o you can write to replicas??
[13:32:58] <RoanKattouw> apergos: Yuuup
[13:32:59] <addshore> apergos: on the beta cluster yes
[13:33:02] <apergos> owie
[13:33:05] <RoanKattouw> Zppix: Replication is exempt from the no-writes flag
[13:33:06] <addshore> and within mediawiki yes,
[13:33:11] <Zppix> Ah ok
[13:33:20] <RoanKattouw> 3) deployment-db04 is hosed and needs to become un-hosed
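The safeguard in Roan's item 2 (and the "NoWrite" wrapper addshore sketches above) amounts to a connection proxy that raises on any write method instead of silently sending it to the replica. MediaWiki is PHP; this Python sketch only illustrates the pattern, and all class and method names here are illustrative:

```python
# Sketch of a read-only connection wrapper: any attempt to call a write
# method on a DB_REPLICA-style handle raises instead of reaching the
# database. Names (ReplicaConnection, WRITE_METHODS) are assumptions for
# illustration, not MediaWiki's Rdbms API.
class ReadOnlyConnectionError(RuntimeError):
    pass

class ReplicaConnection:
    WRITE_METHODS = {"insert", "update", "delete", "replace"}

    def __init__(self, real_connection):
        self._conn = real_connection

    def __getattr__(self, name):
        # Reads pass through to the wrapped connection; writes are blocked.
        if name in self.WRITE_METHODS:
            raise ReadOnlyConnectionError(
                f"Attempted write method '{name}' on a replica connection"
            )
        return getattr(self._conn, name)

class FakeConnection:
    """Stand-in for a real DB handle, for demonstration only."""
    def select(self, sql):
        return "rows"
    def insert(self, sql):
        return "written"

conn = ReplicaConnection(FakeConnection())
print(conn.select("SELECT 1"))  # rows
try:
    conn.insert("INSERT INTO text ...")
except ReadOnlyConnectionError:
    print("blocked")  # blocked
```

With a guard like this, the `$dbw = $this->getDBConnection( DB_REPLICA );` bug found above would have thrown on the first `->insert` call instead of corrupting the replica.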
[13:33:29] <Zppix> (I am not too smart when it comes to db)
[13:33:55] <RoanKattouw> (I wondered the same thing actually, so I looked up the documentation and it said that replication and admins were exempt)
[13:34:33] <RoanKattouw> I'm going to begin with depooling deployment-db04 so as to unbreak the beta wikis
[13:34:37] <apergos> what the heck is getting a db replica handle and trying to write to it anyways?
[13:34:43] <RoanKattouw> Then we probably need to reprovision it or something
[13:34:48] <RoanKattouw> apergos: The new MCR code :)
[13:34:54] <RoanKattouw>
[13:34:56] <apergos> oh dear...
[13:35:03] <RoanKattouw> Yeah...
[13:35:13] <RoanKattouw> So the text table's IDs diverged on the master and replica DBs
[13:35:21] <apergos> I know it's not really funny but I am lmao here
[13:35:24] <addshore> there could perhaps be a very simple phpcs sniff for that too
[13:35:26] <RoanKattouw> Then when the master tried to replicate its writes, replication died
[13:35:30] <apergos> of course
[13:35:33] <DanielK_WMDE> apergos: my bad. whoops. we need more eyes on this code...
[13:35:52] <apergos> well the good thing is it gets discovered in labs
[13:35:54] <RoanKattouw> apergos: If you're in #wikimedia-dev already, see my pastes in the backscroll there, you'll see the two servers disagreeing on the contents of the text and revision tables
[13:36:03] <addshore> apergos: RoanKattouw the great news is this issue is much simpler to fix than some of the other possible reasons for breaking that we were thinking about.
[13:36:07] <RoanKattouw> We also had revision IDs with different meanings on the two servers
[13:36:11] <RoanKattouw> Yes
[13:36:12] <apergos> eewwww
[13:36:20] <RoanKattouw> And it's not a chronology protection issue
[13:36:20] <DanielK_WMDE> RoanKattouw, addshore: i'm introducing allowWrite into the database base class. I'm also introducing a wrapper so writing to a replica connection will fail even if there is only one db and one connection.
[13:36:22] <apergos> yes I suppose it would all go pear-shaped pretty soon after
[13:36:33] <addshore> DanielK_WMDE: sounds good
[13:36:34] <apergos> doing the backread in the other channel
[13:36:49] <RoanKattouw> DanielK_WMDE: Yes that's good, and necessary to catch these things in dev. I'll file that task first so you can claim it
[13:36:53] <addshore> DanielK_WMDE: shall I fix the line or you?
[13:37:15] <DanielK_WMDE> RoanKattouw, addshore: thank you for helping with this!
[13:37:18] <DanielK_WMDE> addshore: i'll fix it.
[13:37:52] <Zppix> Im glad i mentioned directly writing to replicas :P
[13:38:15] <DanielK_WMDE> Zppix: thank you, indeed :)
[13:38:39] <DanielK_WMDE> Zppix: I would have expected that to just fail HARD, not silently corrupt the database
[13:38:51] <Zppix> I wonder if this possibly in prod as well
[13:38:55] <Zppix> Possible*
[13:39:06] <DanielK_WMDE> RoanKattouw: will you also file a ticket to make sure replicas are not writable in production? This kind of thing can SERIOUSLY mess up the database :(
[13:39:15] <DanielK_WMDE> Zppix: yea, scary
[13:39:33] <RoanKattouw> Yes
[13:39:38] <RoanKattouw> See my 1-2-3 above
[13:41:53] <apergos> that was a fun little read
[13:42:08] <apergos> not easy to track down either
[13:42:29] <DanielK_WMDE> yep... several wrong leads
