Page MenuHomePhabricator

Investigate server flapping after 3/7/2016 deploy
Closed, ResolvedPublic

Description

The deploy on 3/7/20016 (https://gerrit.wikimedia.org/r/#/c/275626/) caused the services to turn ciritcal and then back normal, and critical again, until I reverted the change and redeployed.

See https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2016-03-07
and from IRC log

[21:36:55] <bearND>	 !log mobileapps deployed 49169e9
[21:38:52] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:39:23] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:40:33] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[21:41:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy

[21:52:42] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:54:31] <ori>	 scb1001 is in trouble again
[21:55:05] <ori>	 it looks like inbound traffic to that host halved since the deployment: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=scb1001.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=network_report&c=Service+Cluster+B+eqiad
[21:57:54] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[21:59:34] <ori>	 I don't think it's really OK; I think it's flapping
[22:00:13] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.

[22:03:41] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy

[22:06:42] <bearND>	 !log starting to revert mobileapps deploy
[22:07:22] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[22:10:05] <bearND>	 !log revert of mobileapps deploy complete

Event Timeline

bearND created this task.Mar 8 2016, 4:53 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 8 2016, 4:53 PM
Mholloway renamed this task from Investigate server flapping after 3/7/20016 deploy to Investigate server flapping after 3/7/2016 deploy.Mar 8 2016, 5:07 PM
Mholloway awarded a token.

RESTBase logs reveal no error responses being received by the MCS during that period ( https://logstash.wikimedia.org/#dashboard/temp/AVNXUquQO3D718AOXssW ). This is consistent with the flapping behaviour as it signals either workers dying (for whatever reason) or being held on requests so that new requests cannot be accepted. Hence, no response actually reaches RESTBase. There should be something in the logs, though.

If we work under the assumption that this is caused by the service's code (as it currently stands), https://gerrit.wikimedia.org/r/#/c/275626/ seems like the only change that could potentially cause workers dying. The other patches do not affect the actual code path.

Mholloway added a comment.EditedMar 8 2016, 11:56 PM

I grabbed the log sections during the period of the bad deployment (from startup of the workers to immediately before restarting to revert the deployment) from scb1001 and scb1002. Unfortunately, I don't see anything very interesting or helpful.


Mholloway added a comment.EditedMar 9 2016, 12:06 AM

Actually there are a few potentially interesting lines in the scb1002 log (found with grep -v 404 scb1002.txt from the above file):

{"name":"mobileapps","hostname":"scb1002","pid":154,"level":40,"message":"Heap memory limit temporarily exceeded","limit":314572800,"memoryUsage":{"rss":449654784,"heapTotal":403587680,"heapUsed":371398688},"levelPath":"warn/service-runner/heap","msg":"Heap memory limit temporarily exceeded","time":"2016-03-07T21:37:33.940Z","v":0}
{"name":"mobileapps","hostname":"scb1002","pid":118,"level":40,"levelPath":"warn/mwapi","request_id":"b88d38d2-e4ad-11e5-b364-37bf8f411856","msg":"no mobileview in response: {\n  \"servedby\": \"mw1120\",\n  \"error\": {\n    \"code\": \"internal_api_error_Exception\",\n    \"info\": \"[6215e01f] Exception Caught: ApiMobileView::getParserOutput: PoolCounter didn't return parser output\"\n  }\n}","time":"2016-03-07T21:44:45.101Z","v":0}
{"name":"mobileapps","hostname":"scb1002","pid":118,"level":50,"message":"504: api_error","status":504,"type":"api_error","detail":{"servedby":"mw1120","error":{"code":"internal_api_error_Exception","info":"[6215e01f] Exception Caught: ApiMobileView::getParserOutput: PoolCounter didn't return parser output"}},"levelPath":"error/504","request_id":"b88d38d2-e4ad-11e5-b364-37bf8f411856","msg":"504: api_error","time":"2016-03-07T21:44:45.107Z","v":0}
{"name":"mobileapps","hostname":"scb1002","pid":145,"level":40,"message":"Heap memory limit temporarily exceeded","limit":314572800,"memoryUsage":
{"rss":657379328,"heapTotal":471307360,"heapUsed":400098032},"levelPath":"warn/service-runner/heap","msg":"Heap memory limit temporarily exceeded","time":"2016-03-07T22:08:33.043Z","v":0}

I'm really reaching, though.

mobrovac lowered the priority of this task from Unbreak Now! to High.Mar 9 2016, 4:35 PM
mobrovac added projects: Services, Operations.

So, I've checked out the offending deploy commit on scb1001 under my local user and ran it without problems. I think it's worth giving the deploy another shot.

Mholloway closed this task as Resolved.Mar 9 2016, 11:08 PM

The conclusion was that the node processes ran out of memory due to the previously mentioned patch.