Slightly cleaned up irc log for T163344
ActivePublic
Actions

Authored by • Nikerabbit on Apr 19 2017, 4:23 PM.

Tags

Referenced Files

	F7640044: Slightly cleaned up irc log for T163344
	Apr 19 2017, 4:23 PM

Subscribers

None

	UTC+3
	[18:13:02] @paravoid> so who can help debug the contenttranslation issue?
	[18:13:07] @paravoid> anyone from the language team around?
	[18:13:20] akosiaris> Nikerabbit: ^
	[18:13:21] @paravoid> Nikerabbit: ?
	[18:13:55] jynus> paravoid, basically, i see lots of SELECT FOR UPDATE, meaning it is the app blocking itself
	[18:14:02] jynus> that didn't happen before
	[18:14:12] marostegui> yes, it is full of table lock :_(
	[18:18:27] @paravoid> Nikerabbit: TL;DR, we just performed the codfw switchover, cx database queries are taking table locks and there's contention resulting into slowness
	[18:21:04] @paravoid> it's unclear why this started happening now, after the switch to codfw
	[18:21:14] @paravoid> so investigating that would be interesting
	[18:21:16] jynus> Nikerabbit, there is lots of traffic happening on translation tables
	[18:21:26] jynus> so much I am not sure it is working
	[18:21:31] @paravoid> er, s/interesting/helpful/
	[18:21:44] jynus> it is not a huge issue because I do not think it affects other functionality
	[18:21:54] Nikerabbit> paravoid, jynus: we previously had issues that lot of pending requests could arrive to the server at once, I wonder if during the read-only similar kind of thing happened
	[18:21:58] jynus> but I would not exect such high traffic there
	[18:22:08] jynus> if it is only that, no big deal
	[18:22:13] jynus> but it would be nice to confirm it
	[18:22:38] jynus> do you have access on how to get the right logs: logstash and tendril?
	[18:23:03] Nikerabbit> I have access to logstash, not familiar how to use tendril
	[18:23:20] @paravoid> isn't SELECT FOR UPDATE frowned upon in general?
	[18:23:41] jynus> paravoid, not really, assuming the query needs it and it only takes a few miliseconds
	[18:23:51] jynus> locking is inevitabble
	[18:24:06] jynus> now the point is why it is not takiing miliseconds :-)
	[18:24:49] Nikerabbit> hmm
	[18:24:50] jynus> Nikerabbit, https://tendril.wikimedia.org/activity?wikiadmin=0&research=0&labsusers=0
	[18:24:58] jynus> ignore the ongoing dump user
	[18:25:09] jynus> and check the long running queries on cx* tables
	[18:25:11] Nikerabbit> Aaron worked on these queries, there is some long comments about locking I don't claim to fully understand
	[18:25:38] jynus> some are running for 53 minutes already, maybe I should just kill them manually?
	[18:26:08] marostegui> is there any issue if the write is killed?
	[18:26:09] Nikerabbit> jynus: I don't see any in that link
	[18:26:19] marostegui> in terms of incosistency or something that might be created?
	[18:26:38] marostegui> Nikerabbit: look for db2033
	[18:26:38] Nikerabbit> marostegui: the UI will retry if saving fails (assuming the page is still open in the client)
	[18:26:48] marostegui> Nikerabbit: ok
	[18:26:56] jynus> it probably isn't, nobody is there for 53 minutes waiting
	[18:28:37] jynus> the one causing it is a SELECT /* ContentTranslation\TranslationStorageManager::{closure}..FOR UPDATE
	[18:28:37] Nikerabbit> there are lot of repeats in cxc_translation_id = XXX, which means the UI has already retried
	[18:28:47] jynus> blockign all writes there
	[18:29:10] jynus> several of them, actually
	[18:29:34] jynus> if we can kill those, we can unblock them, but we must be sure not to retry them or the problem will return
	[18:30:30] jynus> there is 2361 open connections, I would suggest to kill them all
	[18:31:21] Nikerabbit> any request that has been running for few minutes has already timed out to the browser, I believe, so killing those FOR UPDATE queries would not do much harm
	[18:31:33] marostegui> I will kill them then
	[18:33:20] Nikerabbit> https://github.com/wikimedia/mediawiki-extensions-ContentTranslation/blob/master/includes/TranslationStorageManager.php#L81-L119
	[18:33:54] jynus> marostegui, it is getting worse- queries are killed but they do not disconnecty
	[18:33:55] marostegui> I have killed them
	[18:34:00] marostegui> I see yes
	[18:34:29] Nikerabbit> I'm sure there must be a better way to do saves in CX... after all there is no problem with regular page editing which has much higher frequency
	[18:35:12] jynus> at this point I would disable conntent translation everywhere
	[18:35:20] jynus> before it affects all other functionality
	[18:35:26] marostegui> agreed, the server is not recovering
	[18:35:45] jynus> Nikerabbit, please disable the extension all toghether
	[18:35:55] jynus> or tell me how to do it


	[18:36:02] marostegui> server went to too many connections already
	[18:36:11] jynus> yes, it is creating an outage now
	[18:36:30] icinga-wm> PROBLEM - MariaDB Slave IO: x1 on db2033 is CRITICAL: CRITICAL slave_io_state could not connect
	[18:36:35] jynus> not only for translation, but for flow and the other servers
	[18:36:38] marostegui> I will silence that
	[18:36:38] Nikerabbit> wmgUseContentTranslation -> wikipedia => false
	[18:36:39] icinga-wm> PROBLEM - MariaDB Slave SQL: x1 on db2033 is CRITICAL: CRITICAL slave_sql_state could not connect
	[18:36:58] Nikerabbit> in InitialiseSettings.php
	[18:37:23] * AaronSchulz sees "Error connecting to 10.192.32.4: Too many connections" spam
	[18:37:35] marostegui> AaronSchulz: we are on it, see a few lines above
	[18:37:46] @paravoid> so who is disabling wmgUseContentTranslation?
	[18:37:56] -!- kart_ [~kartik_@60.243.27.49] has joined #wikimedia-operations
	[18:38:04] Nikerabbit> I'm not, but changing that line does it
	[18:38:08] +wikibugs> (PS1) Jcrespo: Disable cx_translation- it is causing an outage on x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/348951
	[18:38:11] jynus> Nikerabbit, ^
	[18:38:34] +wikibugs__> (CR) Faidon Liambotis: [C: 1] Disable cx_translation- it is causing an outage on x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/348951 (owner: Jcrespo)
	[18:38:41] +wikibugs> (CR) Nikerabbit: [C: 2] Disable cx_translation- it is causing an outage on x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/348951 (owner: Jcrespo)
	[18:38:46] +wikibugs__> (CR) Jcrespo: [V: 2 C: 2] Disable cx_translation- it is causing an outage on x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/348951 (owner: Jcrespo)
	[18:38:59] +wikibugs> (CR) jenkins-bot: Disable cx_translation- it is causing an outage on x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/348951 (owner: Jcrespo)
	[18:39:30] marostegui> If the connections aren't freed up we might need to kill -9 the server :(
	[18:39:36] @paravoid> ugh
	[18:40:04] jynus> marostegui, we can failover to the slave
	[18:40:14] jynus> it should not create new locking
	[18:40:39] marostegui> jynus: but remember the new slave's hardware isn't great
	[18:40:43] jynus> yeah
	[18:40:44] marostegui> let's wait what happens once it is deployed
	[18:40:57] jynus> one mw host is down, right?
	[18:41:03] @paravoid> yes, mw2256
	[18:41:04] jynus> sync-apaches: 99% (ok: 300; fail: 0; left: 1)
	[18:41:43] jynus> db2033 is not getting any better
	[18:41:51] volans> marostegui, jynus: if I can help let me know
	[18:42:17] marostegui> jynus: I would do a kill -9 and let the server recover, it shouldn't take too long
	[18:42:36] +logmsgbot> !log jynus@tin Synchronized wmf-config/InitialiseSettings.php: Disable cx_translation- it is causing an outage on x1 (duration: 02m 44s)
	[18:42:39] jynus> wait, the deploy has not finished yet
	[18:42:40] jynus> now
	[18:42:41] akosiaris> heh, so I was looking at corrupted memory while rebooting mw2256 ? nice
	[18:42:43] +stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
	[18:43:17] jynus> marostegui, port 3307 is reserved for admin login
	[18:43:20] jynus> not saturatedf
	[18:43:25] icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
	[18:43:48] jynus> but queries do not get killed yet
	[18:43:54] marostegui> :(
	[18:44:21] jynus> let's do a graceful restart
	[18:44:35] marostegui> we'll see if it is able to stop
	[18:44:36] jynus> can you deploy the slave
	[18:44:40] jynus> with more traffic?
	[18:44:42] marostegui> yes
	[18:44:45] jynus> quickly
	[18:44:50] marostegui> doing it now
	[18:45:01] @paravoid> would this issue affect Echo as well?
	[18:45:05] jynus> yes
	[18:45:48] +wikibugs__> (PS1) Marostegui: db-codfw.php: Give x1 slave more weight [mediawiki-config] - https://gerrit.wikimedia.org/r/348952
	[18:45:53] marostegui> jynus: ^
	[18:46:02] jynus> no, no
	[18:46:06] jynus> just give it 0 to the master
	[18:46:09] marostegui> ok
	[18:46:38] +wikibugs> (PS2) Marostegui: db-codfw.php: Give x1 slave more weight [mediawiki-config] - https://gerrit.wikimedia.org/r/348952
	[18:46:52] marostegui> jynus: ^
	[18:46:59] +wikibugs__> (CR) Jcrespo: [V: 2 C: 2] db-codfw.php: Give x1 slave more weight [mediawiki-config] - https://gerrit.wikimedia.org/r/348952 (owner: Marostegui)
	[18:47:10] +wikibugs> (CR) jenkins-bot: db-codfw.php: Give x1 slave more weight [mediawiki-config] - https://gerrit.wikimedia.org/r/348952 (owner: Marostegui)
	[18:47:43] jynus> deploying
	[18:47:46] marostegui> ah
	[18:47:51] marostegui> I was aboput to hit enter :)
	[18:47:56] jynus> keep an eye on the slave
	[18:48:00] jynus> while I reboot the master
	[18:48:18] +logmsgbot> !log jynus@tin Synchronized wmf-config/db-codfw.php: Failing over x1-master (duration: 00m 41s)
	[18:48:18] @paravoid> thanks guys, much <3
	[18:48:20] marostegui> you going to attempt graceful one?
	[18:48:25] +stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
	[18:48:25] icinga-wm> RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
	[18:48:33] @paravoid> Nikerabbit: in the meantime, did you find why this happened in the first place?
	[18:49:05] jynus> !log shutting down db2033 (x1-master)
	[18:49:11] +stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
	[18:49:39] jynus> it is ongoing
	[18:49:50] jynus> we may get a lot of errors, but not more than we used to have
	[18:50:49] jynus> marostegui, how is the slave taking the master going away?
	[18:50:51] marostegui> the slave is having no issues, so that is good
	[18:51:13] Nikerabbit> paravoid: my best theory is that the queries are somewhat inefficient by nature, and sudden burst due piling up during the switch just made it explode. But I'm just guessing
	[18:51:17] volans> I can see logs Server 10.192.48.14 (#1) is not replicating?
	[18:51:20] jynus> mysql process is still up
	[18:51:22] volans> I guess expected
	[18:51:26] jynus> yep
	[18:51:27] marostegui> yeah, i doubt it is going to stop :(
	[18:51:36] jynus> give it a few minutes
	[18:51:38] @paravoid> Nikerabbit: ok, so next steps?
	[18:51:45] icinga-wm> PROBLEM - MariaDB Slave IO: x1 on db1031 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2033.codfw.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db2033.codfw.wmnet (111 Connection refused)
	[18:51:46] volans> also All replica DBs lagged. Switch to read-only mode, expected too
	[18:51:47] jynus> dbstores take 20 minutes
	[18:51:55] @paravoid> Nikerabbit: probably needs a task, and will certainly need an incident report, could you spearhead that?
	[18:52:07] marostegui> i have silenced tempdb2001 replication alerts
	[18:52:48] Nikerabbit> paravoid: mitigate the issue (your side), I can lead root cause investigation that will produce incident report
	[18:53:22] @paravoid> Nikerabbit: thanks!
	[18:54:06] marostegui> shall we leave the https://gerrit.wikimedia.org/r/348951 like that for now Nikerabbit?
	[18:54:09] jynus> mysql stil l up
	[18:54:25] kart_> Nikerabbit: We had, https://wikitech.wikimedia.org/wiki/Incident_documentation/20160713-ContentTranslation - anything related to that this time?
	[18:54:29] jynus> so I am going to kill it, let it recover
	[18:54:35] Nikerabbit> marostegui: I think safest is to leave it disabled for now
	[18:54:35] marostegui> yeah, agreed
	[18:54:43] marostegui> Nikerabbit: cool, thanks
	[18:54:56] marostegui> jynus: +1 to kill it
	[18:55:09] Nikerabbit> given TZ and everything, we will see if we can perhaps start enabling it incrementally (or fully if the root cause is figured out and fixed)
	[18:55:24] marostegui> the slave is looking good, so those are good news
	[18:55:42] Nikerabbit> I mean, we will see tomorrow
	[18:57:08] marostegui> jynus: that was fast :)
	[18:57:22] jynus> mysql going up, I think?
	[18:57:25] volans> godog: maybe jynus would need to deploy mediawiki-config soon, I would probably hold a bit
	[18:57:27] marostegui> it is up
	[18:57:30] icinga-wm> RECOVERY - MariaDB Slave IO: x1 on db2033 is OK: OK slave_io_state Slave_IO_Running: Yes
	[18:57:40] icinga-wm> RECOVERY - MariaDB Slave SQL: x1 on db2033 is OK: OK slave_sql_state Slave_SQL_Running: Yes
	[18:57:40] volans> or get confirmation is ok ;)
	[18:57:45] icinga-wm> RECOVERY - MariaDB Slave IO: x1 on db1031 is OK: OK slave_io_state Slave_IO_Running: Yes
	[18:57:52] godog> volans: good point, I'll coordinate with him
	[18:57:59] jynus> the question is if it will happen again now
	[18:58:08] jynus> and if it will happen again when enabled
	[18:58:10] marostegui> jynus: it shouldn't, as it is disabled, right?
	[18:58:20] marostegui> jynus: Nikerabbit said we can leave it off for now
	[18:58:26] jynus> I think we left testwiki and all on now
	[18:58:46] jynus> is heartbeat ok?
	[18:58:57] jynus> slave, etc
	[18:59:03] marostegui> yep
	[18:59:11] marostegui> shall we revert the weight back to the master?
	[18:59:18] jynus> can the slave hold the load?
	[18:59:23] marostegui> it was able yes
	[18:59:26] marostegui> but with that feature stopped
	[18:59:27] jynus> how many connections running?
	[18:59:30] marostegui> around 40
	[18:59:37] marostegui> oh running
	[18:59:53] marostegui> connected around 40, running less than 5
	[19:00:07] Nikerabbit> I started a task for addressing the CX outage https://phabricator.wikimedia.org/T163344
	[19:00:14] jynus> heartbeat seems ok
	[19:00:23] jynus> not sure how to handle load
	[19:00:37] jynus> if it happens again (pileups, it is better if the slave has most of the load)
	[19:01:12] marostegui> Let's give the master just a bit of weight and wait for T163344 to be addressed before enabling it back?
	[19:01:13] +stashbot> T163344: Do a root-cause analysis on CX outage during dc switch and get it back online - https://phabricator.wikimedia.org/T163344
	[19:01:17] marostegui> does that sound reasonable?
	[19:01:31] jynus> can someone check flow and echo working ok, and what is happening to translation?
	[19:06:00] marostegui> jynus: yes, let's leave translation disabled as Nikerabbit said too
	[19:06:38] jynus> it would be nice to announce the translation tool disabling
	[19:06:46] jynus> that is user-impacting
	[19:07:24] jynus> paravoid, can issues be translated through communication ?
	[19:07:29] jynus> *transmitted
	[19:07:49] @paravoid> jynus: I'll handle that
	[19:07:50] jynus> it would be nice to communicate it
	[19:08:10] @paravoid> I'll sent a post-switchover email soon
	[19:08:16] jynus> "translation extension is temporarily disabled until performance issues are solved"
	[19:08:20] jynus> ^that is the summary
	[19:08:20] @paravoid> but I was waiting for things to settle down a little bit first
	[19:09:58] Nikerabbit> we put up a notice on talk:cx on mediawiki.org, but yeah some notice that reaches more people makes sense

Event Timeline

• Nikerabbit created this paste.Apr 19 2017, 4:23 PM

• Nikerabbit mentioned this in T163344: Do a root-cause analysis on CX outage during dc switch and get it back online.

Slightly cleaned up irc log for T163344ActivePublicActions

Event Timeline

Slightly cleaned up irc log for T163344
ActivePublic
Actions