Investigate lvs IP pages during codfw row C switch upgrade
Closed, Declined (Public)

Assigned To

None

Authored By

	fgiunchedi
	Jul 19 2017, 9:18 AM

Description

During the codfw row C switch upgrade, Icinga paged us for some service IPs being unreachable. I pulled the Icinga log with grep -ir svc.codfw /var/log/icinga.log | grep ALERT | perl -pe 's/(\d+)/localtime($1)/e'; the output is below in P5760.

P5760 Masterwork From Distant Lands

1	[Wed Jul 19 07:17:02 2017] SERVICE ALERT: restbase.svc.codfw.wmnet;Restbase LVS codfw;WARNING;SOFT;1;/en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle => None
2	[Wed Jul 19 07:18:02 2017] SERVICE ALERT: restbase.svc.codfw.wmnet;Restbase LVS codfw;OK;SOFT;2;All endpoints are healthy
3	[Wed Jul 19 08:20:42 2017] SERVICE ALERT: mathoid.svc.codfw.wmnet;Mathoid LVS codfw;CRITICAL;SOFT;1;/{format}/ (mass-energy equivalence (svg)) timed out before a response was received: /{format}/ (mass-energy equivalence (texvcinfo)) timed out before a response was received: /_info (retrieve service info) timed out before a response was received
4	[Wed Jul 19 08:20:43 2017] SERVICE ALERT: appservers.svc.codfw.wmnet;LVS HTTPS IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
5	[Wed Jul 19 08:20:52 2017] SERVICE ALERT: citoid.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
6	[Wed Jul 19 08:20:53 2017] SERVICE ALERT: search.svc.codfw.wmnet;ElasticSearch health check for shards;CRITICAL;SOFT;1;CRITICAL - elasticsearch http://10.2.1.30:9200/_cluster/health error while fetching: HTTPConnectionPool(host='10.2.1.30', port=9200): Read timed out. (read timeout=4)
7	[Wed Jul 19 08:20:54 2017] SERVICE ALERT: restbase.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
8	[Wed Jul 19 08:21:02 2017] SERVICE ALERT: parsoid.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
9	[Wed Jul 19 08:21:03 2017] SERVICE ALERT: mobileapps.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;connect to address 10.2.1.14 and port 8888: No route to host
10	[Wed Jul 19 08:21:03 2017] SERVICE ALERT: api.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
11	[Wed Jul 19 08:21:03 2017] SERVICE ALERT: rendering.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
12	[Wed Jul 19 08:21:12 2017] SERVICE ALERT: ores.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;connect to address 10.2.1.10 and port 8081: No route to host
13	[Wed Jul 19 08:21:12 2017] SERVICE ALERT: restbase.svc.codfw.wmnet;Restbase LVS codfw;CRITICAL;SOFT;1;/en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/revision/{revision} (Get rev by ID) timed out before a response was received
14	[Wed Jul 19 08:21:13 2017] SERVICE ALERT: thumbor.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;connect to address 10.2.1.24 and port 8800: No route to host
15	[Wed Jul 19 08:21:33 2017] SERVICE ALERT: rendering.svc.codfw.wmnet;LVS HTTPS IPv4;CRITICAL;SOFT;1;connect to address 10.2.1.21 and port 443: No route to host
16	[Wed Jul 19 08:21:42 2017] SERVICE ALERT: mobileapps.svc.codfw.wmnet;Mobileapps LVS codfw;CRITICAL;SOFT;1;/{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for 'cat') timed out before a response was received
17	[Wed Jul 19 08:21:52 2017] SERVICE ALERT: appservers.svc.codfw.wmnet;LVS HTTPS IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 16149 bytes in 0.258 second response time
18	[Wed Jul 19 08:21:52 2017] SERVICE ALERT: mathoid.svc.codfw.wmnet;Mathoid LVS codfw;OK;SOFT;2;All endpoints are healthy
19	[Wed Jul 19 08:21:52 2017] SERVICE ALERT: citoid.svc.codfw.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 921 bytes in 0.075 second response time
20	[Wed Jul 19 08:22:02 2017] SERVICE ALERT: restbase.svc.codfw.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 - 15600 bytes in 0.087 second response time
21	[Wed Jul 19 08:22:04 2017] SERVICE ALERT: search.svc.codfw.wmnet;ElasticSearch health check for shards;CRITICAL;SOFT;2;CRITICAL - elasticsearch http://10.2.1.30:9200/_cluster/health error while fetching: HTTPConnectionPool(host='10.2.1.30', port=9200): Read timed out. (read timeout=4)
22	[Wed Jul 19 08:22:04 2017] SERVICE ALERT: parsoid.svc.codfw.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 0.087 second response time
23	[Wed Jul 19 08:22:22 2017] SERVICE ALERT: rendering.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;2;connect to address 10.2.1.21 and port 80: No route to host
24	[Wed Jul 19 08:22:22 2017] SERVICE ALERT: restbase.svc.codfw.wmnet;Restbase LVS codfw;OK;SOFT;2;All endpoints are healthy
25	[Wed Jul 19 08:22:22 2017] SERVICE ALERT: ores.svc.codfw.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 4913 bytes in 0.074 second response time
26	[Wed Jul 19 08:22:22 2017] SERVICE ALERT: mobileapps.svc.codfw.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 960 bytes in 0.094 second response time
27	[Wed Jul 19 08:22:22 2017] SERVICE ALERT: api.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;2;CRITICAL - Socket timeout after 10 seconds
28	[Wed Jul 19 08:22:22 2017] SERVICE ALERT: thumbor.svc.codfw.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.074 second response time
29	[Wed Jul 19 08:22:42 2017] SERVICE ALERT: mobileapps.svc.codfw.wmnet;Mobileapps LVS codfw;OK;SOFT;2;All endpoints are healthy
30	[Wed Jul 19 08:22:43 2017] SERVICE ALERT: rendering.svc.codfw.wmnet;LVS HTTPS IPv4;CRITICAL;SOFT;2;connect to address 10.2.1.21 and port 443: No route to host
31	[Wed Jul 19 08:23:12 2017] SERVICE ALERT: search.svc.codfw.wmnet;ElasticSearch health check for shards;CRITICAL;HARD;3;CRITICAL - elasticsearch http://10.2.1.30:9200/_cluster/health error while fetching: HTTPConnectionPool(host='10.2.1.30', port=9200): Read timed out. (read timeout=4)
32	[Wed Jul 19 08:23:22 2017] SERVICE ALERT: rendering.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;HARD;3;connect to address 10.2.1.21 and port 80: No route to host
33	[Wed Jul 19 08:23:26 2017] SERVICE ALERT: appservers.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;connect to address 10.2.1.1 and port 80: No route to host
34	[Wed Jul 19 08:23:32 2017] SERVICE ALERT: api.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;HARD;3;CRITICAL - Socket timeout after 10 seconds
35	[Wed Jul 19 08:24:02 2017] SERVICE ALERT: rendering.svc.codfw.wmnet;LVS HTTPS IPv4;CRITICAL;HARD;3;connect to address 10.2.1.21 and port 443: No route to host
36	[Wed Jul 19 08:24:42 2017] SERVICE ALERT: appservers.svc.codfw.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 16149 bytes in 0.188 second response time
37	[Wed Jul 19 08:24:42 2017] SERVICE ALERT: api.svc.codfw.wmnet;LVS HTTP IPv4;OK;HARD;3;HTTP OK: HTTP/1.1 200 OK - 23115 bytes in 0.366 second response time
38	[Wed Jul 19 08:25:12 2017] SERVICE ALERT: search.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 523 bytes in 0.161 second response time
39	[Wed Jul 19 08:26:02 2017] SERVICE ALERT: appservers.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
40	[Wed Jul 19 08:26:02 2017] SERVICE ALERT: api.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
41	[Wed Jul 19 08:26:13 2017] SERVICE ALERT: search.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;2;HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 523 bytes in 0.168 second response time
42	[Wed Jul 19 08:27:02 2017] SERVICE ALERT: appservers.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;2;connect to address 10.2.1.1 and port 80: No route to host
43	[Wed Jul 19 08:27:03 2017] SERVICE ALERT: api.svc.codfw.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 23117 bytes in 9.403 second response time
44	[Wed Jul 19 08:27:22 2017] SERVICE ALERT: search.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;HARD;3;HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 523 bytes in 0.169 second response time
45	[Wed Jul 19 08:28:03 2017] SERVICE ALERT: appservers.svc.codfw.wmnet;LVS HTTP IPv4;OK;SOFT;3;HTTP OK: HTTP/1.1 200 OK - 16149 bytes in 0.186 second response time
46	[Wed Jul 19 08:29:33 2017] SERVICE ALERT: search.svc.codfw.wmnet;LVS HTTP IPv4;OK;HARD;3;HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.164 second response time
47	[Wed Jul 19 08:29:52 2017] SERVICE ALERT: rendering.svc.codfw.wmnet;LVS HTTPS IPv4;OK;HARD;3;HTTP OK: HTTP/1.1 200 OK - 16149 bytes in 0.282 second response time
48	[Wed Jul 19 08:30:13 2017] SERVICE ALERT: search.svc.codfw.wmnet;ElasticSearch health check for shards;OK;HARD;3;OK - elasticsearch status production-search-codfw: status: yellow, number_of_nodes: 36, unassigned_shards: 168, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3087, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-codfw, relocating_shards: 0, active_shards_percent_as_number: 97.7849810913, active_shards: 9050, initializing_shards: 37, number_of_data_nodes: 36, delayed_unassigned_shards: 0
49	[Wed Jul 19 08:30:33 2017] SERVICE ALERT: rendering.svc.codfw.wmnet;LVS HTTP IPv4;OK;HARD;3;HTTP OK: HTTP/1.1 200 OK - 16150 bytes in 0.212 second response time
50	[Wed Jul 19 08:30:42 2017] SERVICE ALERT: restbase.svc.codfw.wmnet;Restbase LVS codfw;CRITICAL;SOFT;1;/en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200)
51	[Wed Jul 19 08:31:42 2017] SERVICE ALERT: restbase.svc.codfw.wmnet;Restbase LVS codfw;CRITICAL;SOFT;2;/en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200)
52	[Wed Jul 19 08:31:52 2017] SERVICE ALERT: rendering.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
53	[Wed Jul 19 08:32:52 2017] SERVICE ALERT: rendering.svc.codfw.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 16149 bytes in 0.182 second response time
54	[Wed Jul 19 08:32:52 2017] SERVICE ALERT: restbase.svc.codfw.wmnet;Restbase LVS codfw;CRITICAL;HARD;3;/en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200)
55	[Wed Jul 19 08:37:02 2017] SERVICE ALERT: restbase.svc.codfw.wmnet;Restbase LVS codfw;OK;HARD;3;All endpoints are healthy

It looks like some services timed out and then recovered, while others (e.g. search, rendering, api) stayed down long enough to turn into HARD alerts.
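To double-check which checks actually reached a HARD non-OK state, the paste can be filtered mechanically. The following is a minimal sketch, assuming the log lines are in the converted format shown in P5760 (`[timestamp] SERVICE ALERT: host;service;state;type;attempt;output`). The names `ALERT_RE`, `parse_alert`, `hard_failures` and `SAMPLE` are made up for illustration and are not part of Icinga or any existing tooling.

```python
import re
from collections import defaultdict

# Matches one alert line as it appears in the paste. search() is used so that
# a leading paste line number (e.g. "31\t") does not prevent a match.
ALERT_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<type>[^;]+);(?P<attempt>\d+);(?P<output>.*)"
)


def parse_alert(line):
    """Return the alert fields as a dict, or None if the line is not an alert."""
    match = ALERT_RE.search(line)
    return match.groupdict() if match else None


def hard_failures(lines):
    """Map (host, service) to the timestamps of its non-OK HARD alerts."""
    result = defaultdict(list)
    for line in lines:
        alert = parse_alert(line)
        if alert and alert["type"] == "HARD" and alert["state"] != "OK":
            result[(alert["host"], alert["service"])].append(alert["ts"])
    return dict(result)


# Three lines copied from the paste above, used only as sample input.
SAMPLE = [
    "[Wed Jul 19 08:21:03 2017] SERVICE ALERT: api.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds",
    "[Wed Jul 19 08:23:32 2017] SERVICE ALERT: api.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;HARD;3;CRITICAL - Socket timeout after 10 seconds",
    "[Wed Jul 19 08:24:42 2017] SERVICE ALERT: api.svc.codfw.wmnet;LVS HTTP IPv4;OK;HARD;3;HTTP OK: HTTP/1.1 200 OK - 23115 bytes in 0.366 second response time",
]

if __name__ == "__main__":
    for (host, service), stamps in hard_failures(SAMPLE).items():
        print(f"{host} / {service}: HARD at {', '.join(stamps)}")
```

Feeding the whole paste to `hard_failures` instead of `SAMPLE` would list every host and check pair that went HARD, which is a quick way to verify the summary above against the log.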

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		ayounsi	T170380 codfw row C switch upgrade
		Declined		None	T171032 Investigate lvs IP pages during codfw row C switch upgrade

Event Timeline

fgiunchedi created this task. Jul 19 2017, 9:18 AM

ema moved this task from Backlog to LoadBalancer on the Traffic board. Jul 20 2017, 8:46 AM

Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 9:23 PM

This is almost 2 years old now, and I don't think we have any other logs to investigate it or to tell whether it happened again. Please reopen if you think it's still possible to investigate.
