Paste P5290

Slightly cleaned up irc log for T163344

Authored by Nikerabbit on Apr 19 2017, 4:23 PM.
UTC+3
[18:13:02] @paravoid> so who can help debug the contenttranslation issue?
[18:13:07] @paravoid> anyone from the language team around?
[18:13:20] akosiaris> Nikerabbit: ^
[18:13:21] @paravoid> Nikerabbit: ?
[18:13:55] jynus> paravoid, basically, I see lots of SELECT FOR UPDATE, meaning it is the app blocking itself
[18:14:02] jynus> that didn't happen before
[18:14:12] marostegui> yes, it is full of table locks :_(
[18:18:27] @paravoid> Nikerabbit: TL;DR, we just performed the codfw switchover, cx database queries are taking table locks and there's contention resulting into slowness
[18:21:04] @paravoid> it's unclear why this started happening now, after the switch to codfw
[18:21:14] @paravoid> so investigating that would be interesting
[18:21:16] jynus> Nikerabbit, there is lots of traffic happening on translation tables
[18:21:26] jynus> so much that I am not sure it is working
[18:21:31] @paravoid> er, s/interesting/helpful/
[18:21:44] jynus> it is not a huge issue because I do not think it affects other functionality
[18:21:54] Nikerabbit> paravoid, jynus: we previously had issues where a lot of pending requests could arrive at the server at once; I wonder if a similar kind of thing happened during the read-only period
[18:21:58] jynus> but I would not expect such high traffic there
[18:22:08] jynus> if it is only that, no big deal
[18:22:13] jynus> but it would be nice to confirm it
[18:22:38] jynus> do you have access to, and know how to get, the right logs: logstash and tendril?
[18:23:03] Nikerabbit> I have access to logstash, but I'm not familiar with how to use tendril
[18:23:20] @paravoid> isn't SELECT FOR UPDATE frowned upon in general?
[18:23:41] jynus> paravoid, not really, assuming the query needs it and it only takes a few milliseconds
[18:23:51] jynus> locking is inevitable
[18:24:06] jynus> now the point is why it is not taking milliseconds :-)
[18:24:49] Nikerabbit> hmm
[18:24:50] jynus> Nikerabbit, https://tendril.wikimedia.org/activity?wikiadmin=0&research=0&labsusers=0
[18:24:58] jynus> ignore the ongoing dump user
[18:25:09] jynus> and check the long running queries on cx* tables
[18:25:11] Nikerabbit> Aaron worked on these queries, there are some long comments about locking that I don't claim to fully understand
[18:25:38] jynus> some are running for 53 minutes already, maybe I should just kill them manually?
[18:26:08] marostegui> is there any issue if the write is killed?
[18:26:09] Nikerabbit> jynus: I don't see any in that link
[18:26:19] marostegui> in terms of inconsistency or something that might be created?
[18:26:38] marostegui> Nikerabbit: look for db2033
[18:26:38] Nikerabbit> marostegui: the UI will retry if saving fails (assuming the page is still open in the client)
[18:26:48] marostegui> Nikerabbit: ok
[18:26:56] jynus> it probably isn't an issue, nobody has been waiting there for 53 minutes
[18:28:37] jynus> the one causing it is a SELECT /* ContentTranslation\TranslationStorageManager::{closure}..FOR UPDATE
[18:28:37] Nikerabbit> there are a lot of repeats in cxc_translation_id = XXX, which means the UI has already retried
[18:28:47] jynus> blocking all writes there
[18:29:10] jynus> several of them, actually
[18:29:34] jynus> if we can kill those, we can unblock them, but we must be sure not to retry them or the problem will return
[18:30:30] jynus> there are 2361 open connections, I would suggest killing them all
[18:31:21] Nikerabbit> any request that has been running for a few minutes has already timed out in the browser, I believe, so killing those FOR UPDATE queries would not do much harm
[18:31:33] marostegui> I will kill them then
[18:33:20] Nikerabbit> https://github.com/wikimedia/mediawiki-extensions-ContentTranslation/blob/master/includes/TranslationStorageManager.php#L81-L119
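(The pattern being discussed is, roughly, a SELECT ... FOR UPDATE issued through MediaWiki's IDatabase API inside a transaction. The sketch below is an illustration, not a copy of the linked code: the cx_corpora table name is inferred from the cxc_ column prefix, and the column list, conditions and transaction scope are assumptions.)

    // Sketch of a locking read like the one discussed here, using MediaWiki's
    // IDatabase API. Everything beyond cxc_translation_id is assumed.
    $translationId = 12345; // example id, for illustration only
    $dbw = wfGetDB( DB_MASTER );
    $dbw->startAtomic( __METHOD__ );
    // SELECT ... FOR UPDATE takes row locks that block other writers on the
    // same rows until the surrounding transaction commits or rolls back.
    $rows = $dbw->select(
        'cx_corpora',
        [ 'cxc_id', 'cxc_sequence_id' ],
        [ 'cxc_translation_id' => $translationId ],
        __METHOD__,
        [ 'FOR UPDATE' ]
    );
    // ... insert or update the translation units for this translation ...
    $dbw->endAtomic( __METHOD__ );

If such a transaction stays open for minutes instead of milliseconds, every retry touching the same cxc_translation_id queues up behind the locks, which matches the pileup described above.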
[18:33:54] jynus> marostegui, it is getting worse - queries are killed but they do not disconnect
[18:33:55] marostegui> I have killed them
[18:34:00] marostegui> I see yes
[18:34:29] Nikerabbit> I'm sure there must be a better way to do saves in CX... after all there is no problem with regular page editing, which has a much higher frequency
[18:35:12] jynus> at this point I would disable conntent translation everywhere
[18:35:20] jynus> before it affects all other functionality
[18:35:26] marostegui> agreed, the server is not recovering
[18:35:45] jynus> Nikerabbit, please disable the extension altogether
[18:35:55] jynus> or tell me how to do it
[18:36:02] marostegui> server already hit "too many connections"
[18:36:11] jynus> yes, it is creating an outage now
[18:36:30] icinga-wm> PROBLEM - MariaDB Slave IO: x1 on db2033 is CRITICAL: CRITICAL slave_io_state could not connect
[18:36:35] jynus> not only for translation, but for flow and the other servers
[18:36:38] marostegui> I will silence that
[18:36:38] Nikerabbit> wmgUseContentTranslation -> wikipedia => false
[18:36:39] icinga-wm> PROBLEM - MariaDB Slave SQL: x1 on db2033 is CRITICAL: CRITICAL slave_sql_state could not connect
[18:36:58] Nikerabbit> in InitialiseSettings.php
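(A sketch of the kind of InitialiseSettings.php change being described; only the 'wikipedia' => false value comes from the log, the surrounding entries are assumptions.)

    // wmf-config/InitialiseSettings.php (sketch): turn ContentTranslation off
    // on all Wikipedias. Keys other than 'wikipedia' are illustrative.
    'wmgUseContentTranslation' => [
        'default' => false,
        'wikipedia' => false, // flipped to false to disable CX everywhere
        'testwiki' => true,   // test wikis were apparently left on, see 18:58:26
    ],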
[18:37:23] * AaronSchulz sees "Error connecting to 10.192.32.4: Too many connections" spam
[18:37:35] marostegui> AaronSchulz: we are on it, see a few lines above
[18:37:46] @paravoid> so who is disabling wmgUseContentTranslation?
[18:37:56] -!- kart_ [~kartik_@60.243.27.49] has joined #wikimedia-operations
[18:38:04] Nikerabbit> I'm not, but changing that line does it
[18:38:08] +wikibugs> (PS1) Jcrespo: Disable cx_translation- it is causing an outage on x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/348951
[18:38:11] jynus> Nikerabbit, ^
[18:38:34] +wikibugs__> (CR) Faidon Liambotis: [C: 1] Disable cx_translation- it is causing an outage on x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/348951 (owner: Jcrespo)
[18:38:41] +wikibugs> (CR) Nikerabbit: [C: 2] Disable cx_translation- it is causing an outage on x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/348951 (owner: Jcrespo)
[18:38:46] +wikibugs__> (CR) Jcrespo: [V: 2 C: 2] Disable cx_translation- it is causing an outage on x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/348951 (owner: Jcrespo)
[18:38:59] +wikibugs> (CR) jenkins-bot: Disable cx_translation- it is causing an outage on x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/348951 (owner: Jcrespo)
[18:39:30] marostegui> If the connections aren't freed up we might need to kill -9 the server :(
[18:39:36] @paravoid> ugh
[18:40:04] jynus> marostegui, we can fail over to the slave
[18:40:14] jynus> it should not create new locking
[18:40:39] marostegui> jynus: but remember the new slave's hardware isn't great
[18:40:43] jynus> yeah
[18:40:44] marostegui> let's wait and see what happens once it is deployed
[18:40:57] jynus> one mw host is down, right?
[18:41:03] @paravoid> yes, mw2256
[18:41:04] jynus> sync-apaches: 99% (ok: 300; fail: 0; left: 1)
[18:41:43] jynus> db2033 is not getting any better
[18:41:51] volans> marostegui, jynus: if I can help let me know
[18:42:17] marostegui> jynus: I would do a kill -9 and let the server recover, it shouldn't take too long
[18:42:36] +logmsgbot> !log jynus@tin Synchronized wmf-config/InitialiseSettings.php: Disable cx_translation- it is causing an outage on x1 (duration: 02m 44s)
[18:42:39] jynus> wait, the deploy has not finished yet
[18:42:40] jynus> now
[18:42:41] akosiaris> heh, so I was looking at corrupted memory while rebooting mw2256? nice
[18:42:43] +stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:17] jynus> marostegui, port 3307 is reserved for admin login
[18:43:20] jynus> not saturated
[18:43:25] icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
[18:43:48] jynus> but queries do not get killed yet
[18:43:54] marostegui> :(
[18:44:21] jynus> let's do a graceful restart
[18:44:35] marostegui> we'll see if it is able to stop
[18:44:36] jynus> can you deploy the slave
[18:44:40] jynus> with more traffic?
[18:44:42] marostegui> yes
[18:44:45] jynus> quickly
[18:44:50] marostegui> doing it now
[18:45:01] @paravoid> would this issue affect Echo as well?
[18:45:05] jynus> yes
[18:45:48] +wikibugs__> (PS1) Marostegui: db-codfw.php: Give x1 slave more weight [mediawiki-config] - https://gerrit.wikimedia.org/r/348952
[18:45:53] marostegui> jynus: ^
[18:46:02] jynus> no, no
[18:46:06] jynus> just give it 0 to the master
[18:46:09] marostegui> ok
[18:46:38] +wikibugs> (PS2) Marostegui: db-codfw.php: Give x1 slave more weight [mediawiki-config] - https://gerrit.wikimedia.org/r/348952
[18:46:52] marostegui> jynus: ^
[18:46:59] +wikibugs__> (CR) Jcrespo: [V: 2 C: 2] db-codfw.php: Give x1 slave more weight [mediawiki-config] - https://gerrit.wikimedia.org/r/348952 (owner: Marostegui)
[18:47:10] +wikibugs> (CR) jenkins-bot: db-codfw.php: Give x1 slave more weight [mediawiki-config] - https://gerrit.wikimedia.org/r/348952 (owner: Marostegui)
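(In db-codfw.php terms, "give it 0 to the master" means something like the sketch below; the slave hostname and the weight value are placeholders, not the values from Gerrit change 348952.)

    // wmf-config/db-codfw.php (sketch): take read traffic off the x1 master by
    // giving it weight 0 so reads go to the slave instead.
    'sectionLoads' => [
        'x1' => [
            'db2033' => 0,   // master: keep it for writes only, no read weight
            'db2034' => 100, // slave (hostname assumed): serves the reads
        ],
    ],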
[18:47:43] jynus> deploying
[18:47:46] marostegui> ah
[18:47:51] marostegui> I was about to hit enter :)
[18:47:56] jynus> keep an eye on the slave
[18:48:00] jynus> while I reboot the master
[18:48:18] +logmsgbot> !log jynus@tin Synchronized wmf-config/db-codfw.php: Failing over x1-master (duration: 00m 41s)
[18:48:18] @paravoid> thanks guys, much <3
[18:48:20] marostegui> you going to attempt graceful one?
[18:48:25] +stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:25] icinga-wm> RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[18:48:33] @paravoid> Nikerabbit: in the meantime, did you find why this happened in the first place?
[18:49:05] jynus> !log shutting down db2033 (x1-master)
[18:49:11] +stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:39] jynus> it is ongoing
[18:49:50] jynus> we may get a lot of errors, but not more than we used to have
[18:50:49] jynus> marostegui, how is the slave taking the master going away?
[18:50:51] marostegui> the slave is having no issues, so that is good
[18:51:13] Nikerabbit> paravoid: my best theory is that the queries are somewhat inefficient by nature, and a sudden burst from requests piling up during the switch just made it explode. But I'm just guessing
[18:51:17] volans> I can see logs "Server 10.192.48.14 (#1) is not replicating?"
[18:51:20] jynus> mysql process is still up
[18:51:22] volans> I guess expected
[18:51:26] jynus> yep
[18:51:27] marostegui> yeah, I doubt it is going to stop :(
[18:51:36] jynus> give it a few minutes
[18:51:38] @paravoid> Nikerabbit: ok, so next steps?
[18:51:45] icinga-wm> PROBLEM - MariaDB Slave IO: x1 on db1031 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2033.codfw.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db2033.codfw.wmnet (111 Connection refused)
[18:51:46] volans> also "All replica DBs lagged. Switch to read-only mode", expected too
[18:51:47] jynus> dbstores take 20 minutes
[18:51:55] @paravoid> Nikerabbit: probably needs a task, and will certainly need an incident report, could you spearhead that?
[18:52:07] marostegui> I have silenced tempdb2001 replication alerts
[18:52:48] Nikerabbit> paravoid: mitigate the issue (your side), I can lead root cause investigation that will produce incident report
[18:53:22] @paravoid> Nikerabbit: thanks!
[18:54:06] marostegui> shall we leave the https://gerrit.wikimedia.org/r/348951 like that for now Nikerabbit?
[18:54:09] jynus> mysql still up
[18:54:25] kart_> Nikerabbit: We had, https://wikitech.wikimedia.org/wiki/Incident_documentation/20160713-ContentTranslation - anything related to that this time?
[18:54:29] jynus> so I am going to kill it, let it recover
[18:54:35] Nikerabbit> marostegui: I think safest is to leave it disabled for now
[18:54:35] marostegui> yeah, agreed
[18:54:43] marostegui> Nikerabbit: cool, thanks
[18:54:56] marostegui> jynus: +1 to kill it
[18:55:09] Nikerabbit> given TZ and everything, we will see if we can perhaps start enabling it incrementally (or fully if the root cause is figured out and fixed)
[18:55:24] marostegui> the slave is looking good, so that is good news
[18:55:42] Nikerabbit> I mean, we will see *tomorrow*
[18:57:08] marostegui> jynus: that was fast :)
[18:57:22] jynus> mysql going up, I think?
[18:57:25] volans> godog: maybe jynus would need to deploy mediawiki-config soon, I would probably hold a bit
[18:57:27] marostegui> it is up
[18:57:30] icinga-wm> RECOVERY - MariaDB Slave IO: x1 on db2033 is OK: OK slave_io_state Slave_IO_Running: Yes
[18:57:40] icinga-wm> RECOVERY - MariaDB Slave SQL: x1 on db2033 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[18:57:40] volans> or get confirmation that it is ok ;)
[18:57:45] icinga-wm> RECOVERY - MariaDB Slave IO: x1 on db1031 is OK: OK slave_io_state Slave_IO_Running: Yes
[18:57:52] godog> volans: good point, I'll coordinate with him
[18:57:59] jynus> the question is if it will happen again now
[18:58:08] jynus> and if it will happen again when enabled
[18:58:10] marostegui> jynus: it shouldn't, as it is disabled, right?
[18:58:20] marostegui> jynus: Nikerabbit said we can leave it off for now
[18:58:26] jynus> I think we left testwiki and all on now
[18:58:46] jynus> is heartbeat ok?
[18:58:57] jynus> slave, etc
[18:59:03] marostegui> yep
[18:59:11] marostegui> shall we revert the weight back to the master?
[18:59:18] jynus> can the slave hold the load?
[18:59:23] marostegui> it was able to, yes
[18:59:26] marostegui> but with that feature stopped
[18:59:27] jynus> how many connections running?
[18:59:30] marostegui> around 40
[18:59:37] marostegui> oh running
[18:59:53] marostegui> connected around 40, running less than 5
[19:00:07] Nikerabbit> I started a task for addressing the CX outage https://phabricator.wikimedia.org/T163344
[19:00:14] jynus> heartbeat seems ok
[19:00:23] jynus> not sure how to handle load
[19:00:37] jynus> if it happens again (pileups), it is better if the slave has most of the load
[19:01:12] marostegui> Let's give the master just a bit of weight and wait for T163344 to be addressed before re-enabling it?
[19:01:13] +stashbot> T163344: Do a root-cause analysis on CX outage during dc switch and get it back online - https://phabricator.wikimedia.org/T163344
[19:01:17] marostegui> does that sound reasonable?
[19:01:31] jynus> can someone check that flow and echo are working ok, and what is happening to translation?
[19:06:00] marostegui> jynus: yes, let's leave translation disabled as Nikerabbit said too
[19:06:38] jynus> it would be nice to announce the translation tool disabling
[19:06:46] jynus> that is user-impacting
[19:07:24] jynus> paravoid, can issues be translated through communication?
[19:07:29] jynus> *transmitted
[19:07:49] @paravoid> jynus: I'll handle that
[19:07:50] jynus> it would be nice to communicate it
[19:08:10] @paravoid> I'll send a post-switchover email soon
[19:08:16] jynus> "translation extension is temporarily disabled until performance issues are solved"
[19:08:20] jynus> ^that is the summary
[19:08:20] @paravoid> but I was waiting for things to settle down a little bit first
[19:09:58] Nikerabbit> we put up a notice on talk:cx on mediawiki.org, but yeah some notice that reaches more people makes sense