Investigate lvs IP pages during codfw row C switch upgrade
Closed, Declined (Public)

Description

During the codfw row C switch upgrade, Icinga paged us because some service IPs were unreachable. I pulled the relevant Icinga log entries with grep -ir svc.codfw /var/log/icinga.log | grep ALERT | perl -pe 's/(\d+)/localtime($1)/e'; the output (P5760) is below.

[Wed Jul 19 07:17:02 2017] SERVICE ALERT: restbase.svc.codfw.wmnet;Restbase LVS codfw;WARNING;SOFT;1;/en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle => None
[Wed Jul 19 07:18:02 2017] SERVICE ALERT: restbase.svc.codfw.wmnet;Restbase LVS codfw;OK;SOFT;2;All endpoints are healthy
[Wed Jul 19 08:20:42 2017] SERVICE ALERT: mathoid.svc.codfw.wmnet;Mathoid LVS codfw;CRITICAL;SOFT;1;/{format}/ (mass-energy equivalence (svg)) timed out before a response was received: /{format}/ (mass-energy equivalence (texvcinfo)) timed out before a response was received: /_info (retrieve service info) timed out before a response was received
[Wed Jul 19 08:20:43 2017] SERVICE ALERT: appservers.svc.codfw.wmnet;LVS HTTPS IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Wed Jul 19 08:20:52 2017] SERVICE ALERT: citoid.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Wed Jul 19 08:20:53 2017] SERVICE ALERT: search.svc.codfw.wmnet;ElasticSearch health check for shards;CRITICAL;SOFT;1;CRITICAL - elasticsearch http://10.2.1.30:9200/_cluster/health error while fetching: HTTPConnectionPool(host='10.2.1.30', port=9200): Read timed out. (read timeout=4)
[Wed Jul 19 08:20:54 2017] SERVICE ALERT: restbase.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Wed Jul 19 08:21:02 2017] SERVICE ALERT: parsoid.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Wed Jul 19 08:21:03 2017] SERVICE ALERT: mobileapps.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;connect to address 10.2.1.14 and port 8888: No route to host
[Wed Jul 19 08:21:03 2017] SERVICE ALERT: api.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Wed Jul 19 08:21:03 2017] SERVICE ALERT: rendering.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Wed Jul 19 08:21:12 2017] SERVICE ALERT: ores.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;connect to address 10.2.1.10 and port 8081: No route to host
[Wed Jul 19 08:21:12 2017] SERVICE ALERT: restbase.svc.codfw.wmnet;Restbase LVS codfw;CRITICAL;SOFT;1;/en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/revision/{revision} (Get rev by ID) timed out before a response was received
[Wed Jul 19 08:21:13 2017] SERVICE ALERT: thumbor.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;connect to address 10.2.1.24 and port 8800: No route to host
[Wed Jul 19 08:21:33 2017] SERVICE ALERT: rendering.svc.codfw.wmnet;LVS HTTPS IPv4;CRITICAL;SOFT;1;connect to address 10.2.1.21 and port 443: No route to host
[Wed Jul 19 08:21:42 2017] SERVICE ALERT: mobileapps.svc.codfw.wmnet;Mobileapps LVS codfw;CRITICAL;SOFT;1;/{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for 'cat') timed out before a response was received
[Wed Jul 19 08:21:52 2017] SERVICE ALERT: appservers.svc.codfw.wmnet;LVS HTTPS IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 16149 bytes in 0.258 second response time
[Wed Jul 19 08:21:52 2017] SERVICE ALERT: mathoid.svc.codfw.wmnet;Mathoid LVS codfw;OK;SOFT;2;All endpoints are healthy
[Wed Jul 19 08:21:52 2017] SERVICE ALERT: citoid.svc.codfw.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 921 bytes in 0.075 second response time
[Wed Jul 19 08:22:02 2017] SERVICE ALERT: restbase.svc.codfw.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 - 15600 bytes in 0.087 second response time
[Wed Jul 19 08:22:04 2017] SERVICE ALERT: search.svc.codfw.wmnet;ElasticSearch health check for shards;CRITICAL;SOFT;2;CRITICAL - elasticsearch http://10.2.1.30:9200/_cluster/health error while fetching: HTTPConnectionPool(host='10.2.1.30', port=9200): Read timed out. (read timeout=4)
[Wed Jul 19 08:22:04 2017] SERVICE ALERT: parsoid.svc.codfw.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 1051 bytes in 0.087 second response time
[Wed Jul 19 08:22:22 2017] SERVICE ALERT: rendering.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;2;connect to address 10.2.1.21 and port 80: No route to host
[Wed Jul 19 08:22:22 2017] SERVICE ALERT: restbase.svc.codfw.wmnet;Restbase LVS codfw;OK;SOFT;2;All endpoints are healthy
[Wed Jul 19 08:22:22 2017] SERVICE ALERT: ores.svc.codfw.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 4913 bytes in 0.074 second response time
[Wed Jul 19 08:22:22 2017] SERVICE ALERT: mobileapps.svc.codfw.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 960 bytes in 0.094 second response time
[Wed Jul 19 08:22:22 2017] SERVICE ALERT: api.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;2;CRITICAL - Socket timeout after 10 seconds
[Wed Jul 19 08:22:22 2017] SERVICE ALERT: thumbor.svc.codfw.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.074 second response time
[Wed Jul 19 08:22:42 2017] SERVICE ALERT: mobileapps.svc.codfw.wmnet;Mobileapps LVS codfw;OK;SOFT;2;All endpoints are healthy
[Wed Jul 19 08:22:43 2017] SERVICE ALERT: rendering.svc.codfw.wmnet;LVS HTTPS IPv4;CRITICAL;SOFT;2;connect to address 10.2.1.21 and port 443: No route to host
[Wed Jul 19 08:23:12 2017] SERVICE ALERT: search.svc.codfw.wmnet;ElasticSearch health check for shards;CRITICAL;HARD;3;CRITICAL - elasticsearch http://10.2.1.30:9200/_cluster/health error while fetching: HTTPConnectionPool(host='10.2.1.30', port=9200): Read timed out. (read timeout=4)
[Wed Jul 19 08:23:22 2017] SERVICE ALERT: rendering.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;HARD;3;connect to address 10.2.1.21 and port 80: No route to host
[Wed Jul 19 08:23:26 2017] SERVICE ALERT: appservers.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;connect to address 10.2.1.1 and port 80: No route to host
[Wed Jul 19 08:23:32 2017] SERVICE ALERT: api.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;HARD;3;CRITICAL - Socket timeout after 10 seconds
[Wed Jul 19 08:24:02 2017] SERVICE ALERT: rendering.svc.codfw.wmnet;LVS HTTPS IPv4;CRITICAL;HARD;3;connect to address 10.2.1.21 and port 443: No route to host
[Wed Jul 19 08:24:42 2017] SERVICE ALERT: appservers.svc.codfw.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 16149 bytes in 0.188 second response time
[Wed Jul 19 08:24:42 2017] SERVICE ALERT: api.svc.codfw.wmnet;LVS HTTP IPv4;OK;HARD;3;HTTP OK: HTTP/1.1 200 OK - 23115 bytes in 0.366 second response time
[Wed Jul 19 08:25:12 2017] SERVICE ALERT: search.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 523 bytes in 0.161 second response time
[Wed Jul 19 08:26:02 2017] SERVICE ALERT: appservers.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Wed Jul 19 08:26:02 2017] SERVICE ALERT: api.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Wed Jul 19 08:26:13 2017] SERVICE ALERT: search.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;2;HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 523 bytes in 0.168 second response time
[Wed Jul 19 08:27:02 2017] SERVICE ALERT: appservers.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;2;connect to address 10.2.1.1 and port 80: No route to host
[Wed Jul 19 08:27:03 2017] SERVICE ALERT: api.svc.codfw.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 23117 bytes in 9.403 second response time
[Wed Jul 19 08:27:22 2017] SERVICE ALERT: search.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;HARD;3;HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 523 bytes in 0.169 second response time
[Wed Jul 19 08:28:03 2017] SERVICE ALERT: appservers.svc.codfw.wmnet;LVS HTTP IPv4;OK;SOFT;3;HTTP OK: HTTP/1.1 200 OK - 16149 bytes in 0.186 second response time
[Wed Jul 19 08:29:33 2017] SERVICE ALERT: search.svc.codfw.wmnet;LVS HTTP IPv4;OK;HARD;3;HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.164 second response time
[Wed Jul 19 08:29:52 2017] SERVICE ALERT: rendering.svc.codfw.wmnet;LVS HTTPS IPv4;OK;HARD;3;HTTP OK: HTTP/1.1 200 OK - 16149 bytes in 0.282 second response time
[Wed Jul 19 08:30:13 2017] SERVICE ALERT: search.svc.codfw.wmnet;ElasticSearch health check for shards;OK;HARD;3;OK - elasticsearch status production-search-codfw: status: yellow, number_of_nodes: 36, unassigned_shards: 168, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3087, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-codfw, relocating_shards: 0, active_shards_percent_as_number: 97.7849810913, active_shards: 9050, initializing_shards: 37, number_of_data_nodes: 36, delayed_unassigned_shards: 0
[Wed Jul 19 08:30:33 2017] SERVICE ALERT: rendering.svc.codfw.wmnet;LVS HTTP IPv4;OK;HARD;3;HTTP OK: HTTP/1.1 200 OK - 16150 bytes in 0.212 second response time
[Wed Jul 19 08:30:42 2017] SERVICE ALERT: restbase.svc.codfw.wmnet;Restbase LVS codfw;CRITICAL;SOFT;1;/en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200)
[Wed Jul 19 08:31:42 2017] SERVICE ALERT: restbase.svc.codfw.wmnet;Restbase LVS codfw;CRITICAL;SOFT;2;/en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200)
[Wed Jul 19 08:31:52 2017] SERVICE ALERT: rendering.svc.codfw.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Wed Jul 19 08:32:52 2017] SERVICE ALERT: rendering.svc.codfw.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 16149 bytes in 0.182 second response time
[Wed Jul 19 08:32:52 2017] SERVICE ALERT: restbase.svc.codfw.wmnet;Restbase LVS codfw;CRITICAL;HARD;3;/en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200)
[Wed Jul 19 08:37:02 2017] SERVICE ALERT: restbase.svc.codfw.wmnet;Restbase LVS codfw;OK;HARD;3;All endpoints are healthy

It looks like some services timed out and then recovered, while others did not recover in time and escalated to HARD alerts, e.g. search, rendering, and api.
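To double-check which checks actually went HARD, the log lines can be grouped per host and service. The following is only a rough Python sketch: it assumes the post-processed line format of the paste (timestamp in brackets, then semicolon-separated host, service, state, SOFT/HARD, attempt, and output), and the names parse_alert and hard_failures are made up for illustration; they are not part of Icinga.

```python
import re

# Assumed line format (taken from the paste in this task):
# [Wed Jul 19 08:23:32 2017] SERVICE ALERT: host;service;STATE;SOFT|HARD;attempt;output
# An optional leading number (e.g. a paste line number) is tolerated.
ALERT_RE = re.compile(
    r"^\d*\[(?P<when>[^\]]+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<kind>SOFT|HARD);(?P<attempt>\d+);(?P<output>.*)$"
)


def parse_alert(line):
    """Return the fields of one SERVICE ALERT line as a dict, or None."""
    match = ALERT_RE.match(line.strip())
    return match.groupdict() if match else None


def hard_failures(lines):
    """Map (host, service) to the timestamp of its first non-OK HARD alert."""
    first_hard = {}
    for line in lines:
        alert = parse_alert(line)
        if alert is None:
            continue  # not a SERVICE ALERT line in the assumed format
        if alert["kind"] == "HARD" and alert["state"] != "OK":
            key = (alert["host"], alert["service"])
            first_hard.setdefault(key, alert["when"])
    return first_hard


if __name__ == "__main__":
    # The path is the one used in the grep command in the description.
    with open("/var/log/icinga.log") as log:
        for (host, service), when in sorted(hard_failures(log).items()):
            print(f"{when}  {host}  {service}")
```

Note that the raw icinga.log has epoch timestamps; the sketch expects them to have been converted already (as the perl step does), although the regex itself only requires some text between the brackets.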

Event Timeline

This is almost 2 years old now, and I don't think we have any other logs to investigate it or to tell whether it happened again. Please reopen if you think it's still possible to investigate.