Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues
Open, Normal, Public

Description

Incident report on wikitech

Roughly what occurred (needs times and check for accuracy/order):

  • Rebooted labstore1004 to activate upgraded kernel
  • Promoted labstore1004 to primary and failed clients over to it
  • Load spiked on labstore1004
  • (LDAP outage from unrelated causes)
  • Rebooted labstore1005 to activate upgraded kernel
  • Promoted labstore1005 to primary and failed clients over to it
  • Load spiked on labstore1005
  • Load spiked on several NFS clients
  • Rebooted various NFS clients in the Tools project to rule out stale NFS handles as the cause of the load spikes
  • Kubernetes nodes not able to communicate with etcd
  • Reboot flannel and flannel etcd
  • Reboot kubernetes etcd
  • Load remains very high on labstore1005 (NFS primary)
  • Halt new pod scheduling on Kubernetes
  • Rolling reboot of Kubernetes nodes
  • Rolling reboot of grid engine nodes
  • Tune kernel parameters on labstore1005
  • Let things sit to see if load will settle down
  • Rollback labstore1004 kernel to 4.4.2-3+wmf8
  • Change i/o scheduler on labstore1005 from deadline to cfq
  • Let things sit to see if load will settle down
  • Re-enable new pod scheduling on Kubernetes to restore service to clients
  • Let things sit to see if load will settle down
  • 2017-06-30T00:28:26 Load spikes hit new high of 165.14 1m avg on labstore1005
  • Promote labstore1004 to NFS primary and fail clients over
  • Load on labstore1004 stays within pre-update expected values
  • Rollback labstore1005 kernel to 4.4.2-3+wmf8
  • Let things sit
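
The i/o scheduler change in the list above is normally made through sysfs on Linux. A minimal sketch, assuming a legacy (non-blk-mq) block device whose `/sys/block/<device>/queue/scheduler` file lists the available schedulers with the active one in brackets. The helper names `active_scheduler` and `set_scheduler` and the device name `sda` are hypothetical; the task does not record the exact commands that were run on labstore1005.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the file argument is expected to be a
# /sys/block/<device>/queue/scheduler path (device name is an assumption).

# Print the active scheduler from a line such as "noop deadline [cfq]".
active_scheduler() {
  sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}

# Select a scheduler by writing its name to the scheduler file.
# On real sysfs this needs root and does not persist across reboots.
set_scheduler() {
  local file=$1 name=$2
  printf '%s\n' "$name" > "$file"
}
```

Run as root, `set_scheduler /sys/block/sda/queue/scheduler cfq` followed by `active_scheduler /sys/block/sda/queue/scheduler` would confirm the switch, assuming `sda` is the device backing the NFS exports.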

Event Timeline

bd808 created this task. Jun 30 2017, 2:33 AM
Restricted Application added a project: Cloud-Services. Jun 30 2017, 2:33 AM
Restricted Application added a subscriber: Aklapper.
bd808 added a comment. Jun 30 2017, 2:38 AM

Some data about the system load we saw:

bd808$ while /bin/true; do w|head -1; sleep 60; done
21:48:22 up 6:26, 5 users, load average: 22.48, 21.74, 24.36
21:49:22 up 6:27, 5 users, load average: 20.15, 21.54, 24.15
21:50:22 up 6:28, 5 users, load average: 15.01, 19.54, 23.28
21:51:22 up 6:29, 5 users, load average: 23.41, 21.89, 23.90
21:52:22 up 6:30, 5 users, load average: 9.67, 18.17, 22.49
21:53:22 up 6:31, 5 users, load average: 5.26, 15.47, 21.30
21:54:22 up 6:32, 5 users, load average: 3.68, 13.05, 20.09
21:55:22 up 6:33, 5 users, load average: 9.31, 12.55, 19.46
21:56:22 up 6:34, 5 users, load average: 22.45, 16.02, 20.23
21:57:22 up 6:35, 5 users, load average: 16.71, 15.48, 19.78
21:58:22 up 6:36, 5 users, load average: 10.81, 14.23, 19.09
21:59:22 up 6:37, 5 users, load average: 4.34, 11.72, 17.92
22:00:22 up 6:38, 5 users, load average: 20.87, 13.74, 18.15
22:01:22 up 6:39, 5 users, load average: 107.12, 40.87, 27.27
22:02:22 up 6:40, 5 users, load average: 79.89, 46.28, 30.06
22:03:22 up 6:41, 5 users, load average: 39.44, 40.80, 29.19
22:04:22 up 6:42, 5 users, load average: 23.71, 35.94, 28.24
22:05:22 up 6:43, 5 users, load average: 23.53, 33.20, 27.75
22:06:22 up 6:44, 5 users, load average: 23.09, 31.63, 27.57
22:07:22 up 6:45, 5 users, load average: 13.49, 27.30, 26.33
22:08:22 up 6:46, 5 users, load average: 17.68, 25.54, 25.76
22:09:22 up 6:47, 5 users, load average: 18.33, 24.19, 25.28
22:10:23 up 6:48, 5 users, load average: 33.64, 26.41, 25.92
22:11:23 up 6:49, 5 users, load average: 42.27, 31.18, 27.65
22:12:23 up 6:50, 5 users, load average: 28.28, 29.27, 27.22
22:13:23 up 6:51, 5 users, load average: 23.89, 28.23, 27.01
22:14:23 up 6:52, 6 users, load average: 19.23, 26.27, 26.42
22:15:23 up 6:53, 6 users, load average: 29.90, 27.27, 26.71
# <chasemp> !log set cfq scheduler on labstore1005
22:16:23 up 6:54, 6 users, load average: 39.06, 30.90, 28.03
22:17:23 up 6:55, 6 users, load average: 29.66, 29.78, 27.83
22:18:23 up 6:56, 6 users, load average: 28.39, 29.13, 27.72
22:19:23 up 6:57, 6 users, load average: 19.96, 26.63, 26.95
22:20:23 up 6:58, 6 users, load average: 17.86, 24.48, 26.17
22:21:23 up 6:59, 6 users, load average: 31.95, 27.74, 27.21
22:22:23 up 7:00, 6 users, load average: 24.14, 26.34, 26.77
22:23:23 up 7:01, 6 users, load average: 18.88, 24.37, 26.05
22:24:23 up 7:02, 6 users, load average: 24.18, 24.08, 25.80
22:25:23 up 7:03, 6 users, load average: 40.34, 29.36, 27.55
22:26:23 up 7:04, 6 users, load average: 36.21, 30.58, 28.11
22:27:23 up 7:05, 6 users, load average: 22.02, 27.42, 27.17
22:28:23 up 7:06, 6 users, load average: 17.21, 24.87, 26.29
22:29:23 up 7:07, 6 users, load average: 9.84, 21.39, 25.01
22:30:23 up 7:08, 6 users, load average: 15.81, 20.20, 24.33
22:31:23 up 7:09, 6 users, load average: 42.33, 27.60, 26.66
22:32:23 up 7:10, 6 users, load average: 24.94, 25.29, 25.92
22:33:23 up 7:11, 6 users, load average: 15.22, 22.62, 24.98
22:34:23 up 7:12, 6 users, load average: 6.63, 18.73, 23.48
22:35:23 up 7:13, 6 users, load average: 11.32, 17.40, 22.70
22:36:23 up 7:14, 6 users, load average: 20.44, 19.36, 23.07
22:37:23 up 7:15, 6 users, load average: 9.59, 16.41, 21.82
22:38:23 up 7:16, 6 users, load average: 5.33, 13.94, 20.63
22:39:23 up 7:17, 6 users, load average: 4.71, 12.06, 19.56
22:40:23 up 7:18, 6 users, load average: 18.87, 14.60, 19.95
22:41:23 up 7:19, 6 users, load average: 18.17, 15.34, 19.88
22:42:23 up 7:20, 6 users, load average: 17.02, 15.53, 19.66
22:43:23 up 7:21, 6 users, load average: 9.19, 13.71, 18.78
22:44:23 up 7:22, 6 users, load average: 7.81, 12.37, 18.00
22:45:23 up 7:23, 6 users, load average: 15.55, 13.35, 17.96
22:46:23 up 7:24, 6 users, load average: 29.23, 17.72, 19.18
22:47:23 up 7:25, 6 users, load average: 22.96, 18.22, 19.27
22:48:24 up 7:26, 6 users, load average: 18.69, 17.75, 19.04
22:49:24 up 7:27, 6 users, load average: 17.10, 17.72, 18.96
22:50:24 up 7:28, 6 users, load average: 12.75, 16.20, 18.35
22:51:24 up 7:29, 6 users, load average: 18.27, 17.72, 18.77
22:52:24 up 7:30, 6 users, load average: 10.41, 15.61, 17.98
22:53:24 up 7:31, 6 users, load average: 6.43, 13.67, 17.18
22:54:24 up 7:32, 6 users, load average: 4.77, 11.82, 16.31
22:55:24 up 7:33, 6 users, load average: 12.29, 12.35, 16.20
22:56:24 up 7:34, 6 users, load average: 19.81, 14.69, 16.77
22:57:24 up 7:35, 6 users, load average: 15.23, 14.24, 16.48
22:58:24 up 7:36, 6 users, load average: 7.96, 12.50, 15.75
22:59:24 up 7:37, 6 users, load average: 4.49, 10.58, 14.88
# Re-enabled Kubernetes scheduler
23:00:24 up 7:38, 6 users, load average: 30.33, 15.23, 16.12
23:01:24 up 7:39, 6 users, load average: 77.63, 32.98, 22.25
23:02:24 up 7:40, 6 users, load average: 49.11, 33.68, 23.22
23:03:24 up 7:41, 6 users, load average: 31.96, 31.74, 23.21
23:04:24 up 7:42, 6 users, load average: 23.84, 29.24, 22.87
23:05:24 up 7:43, 6 users, load average: 25.82, 28.46, 22.99
23:06:24 up 7:44, 6 users, load average: 26.25, 28.69, 23.44
23:07:24 up 7:45, 6 users, load average: 14.66, 24.97, 22.49
23:08:24 up 7:46, 6 users, load average: 13.73, 22.73, 21.87
23:09:24 up 7:47, 6 users, load average: 9.77, 20.01, 20.99
23:10:24 up 7:48, 6 users, load average: 25.70, 21.49, 21.37
23:11:24 up 7:49, 6 users, load average: 24.64, 22.67, 21.83
23:12:24 up 7:50, 6 users, load average: 16.08, 20.23, 21.02
23:13:24 up 7:51, 6 users, load average: 15.49, 19.19, 20.61
23:14:24 up 7:52, 6 users, load average: 13.61, 18.20, 20.19
23:15:24 up 7:53, 6 users, load average: 22.57, 19.67, 20.55
23:16:24 up 7:54, 6 users, load average: 35.45, 23.99, 21.99
23:17:24 up 7:55, 6 users, load average: 29.25, 23.99, 22.10
23:18:24 up 7:56, 6 users, load average: 31.99, 25.49, 22.72
23:19:24 up 7:57, 6 users, load average: 39.21, 28.43, 23.89
23:20:24 up 7:58, 6 users, load average: 36.44, 29.37, 24.49
23:21:24 up 7:59, 6 users, load average: 43.42, 32.92, 26.03
23:22:24 up 8:00, 6 users, load average: 43.83, 35.04, 27.19
23:23:24 up 8:01, 6 users, load average: 30.10, 32.89, 26.96
23:24:24 up 8:02, 6 users, load average: 35.34, 33.66, 27.58
23:25:24 up 8:03, 6 users, load average: 53.45, 38.55, 29.60
23:26:25 up 8:04, 6 users, load average: 42.10, 38.54, 30.19
23:27:25 up 8:05, 6 users, load average: 27.22, 34.98, 29.49
23:28:25 up 8:06, 6 users, load average: 19.43, 31.20, 28.52
23:29:25 up 8:07, 6 users, load average: 16.17, 28.22, 27.67
23:30:25 up 8:08, 6 users, load average: 34.74, 30.24, 28.34
23:31:25 up 8:09, 6 users, load average: 55.58, 37.68, 31.05
23:32:25 up 8:10, 6 users, load average: 38.48, 36.08, 30.92
23:33:25 up 8:11, 6 users, load average: 30.98, 34.18, 30.58
23:34:25 up 8:12, 6 users, load average: 24.63, 31.89, 30.03
23:36:25 up 8:14, 6 users, load average: 34.26, 33.03, 30.64
23:37:25 up 8:15, 6 users, load average: 24.90, 30.52, 29.93
23:38:25 up 8:16, 6 users, load average: 21.08, 28.21, 29.16
23:39:25 up 8:17, 6 users, load average: 15.94, 25.44, 28.14
23:40:25 up 8:18, 6 users, load average: 23.06, 25.17, 27.85
23:41:25 up 8:19, 6 users, load average: 25.25, 26.07, 28.03
23:42:25 up 8:20, 6 users, load average: 20.00, 24.12, 27.22
23:43:25 up 8:21, 6 users, load average: 15.96, 22.19, 26.36
23:44:25 up 8:22, 6 users, load average: 23.61, 23.32, 26.49
23:45:25 up 8:23, 6 users, load average: 30.87, 24.97, 26.83
23:46:25 up 8:24, 6 users, load average: 37.74, 28.79, 28.07
23:47:25 up 8:25, 6 users, load average: 29.97, 27.89, 27.79
23:48:25 up 8:26, 6 users, load average: 29.88, 28.12, 27.86
23:49:25 up 8:27, 6 users, load average: 31.49, 28.37, 27.93
23:50:25 up 8:28, 6 users, load average: 24.30, 27.03, 27.51
23:51:25 up 8:29, 6 users, load average: 20.32, 25.68, 27.03
23:52:25 up 8:30, 6 users, load average: 17.82, 23.77, 26.28
23:53:25 up 8:31, 6 users, load average: 15.61, 22.18, 25.58
23:54:25 up 8:32, 6 users, load average: 14.75, 20.46, 24.76
23:55:25 up 8:33, 6 users, load average: 30.94, 23.56, 25.52
23:56:25 up 8:34, 6 users, load average: 37.85, 27.34, 26.73
23:57:25 up 8:35, 6 users, load average: 31.53, 27.32, 26.75
23:58:25 up 8:36, 6 users, load average: 21.11, 25.36, 26.13
23:59:25 up 8:37, 5 users, load average: 11.26, 21.81, 24.86
00:00:25 up 8:38, 5 users, load average: 30.35, 24.46, 25.53
00:01:25 up 8:39, 5 users, load average: 128.66, 53.90, 35.60
00:02:26 up 8:40, 5 users, load average: 118.68, 65.93, 40.96
00:03:26 up 8:41, 5 users, load average: 68.00, 61.41, 41.01
00:04:26 up 8:42, 5 users, load average: 48.45, 57.09, 40.80
00:05:26 up 8:43, 5 users, load average: 30.11, 50.17, 39.43
00:06:26 up 8:44, 5 users, load average: 37.95, 48.86, 39.66
00:07:26 up 8:45, 5 users, load average: 25.56, 43.11, 38.25
00:08:26 up 8:46, 5 users, load average: 17.92, 38.16, 36.88
00:09:26 up 8:47, 4 users, load average: 11.29, 32.67, 35.07
00:10:26 up 8:48, 4 users, load average: 32.73, 33.91, 35.29
00:11:26 up 8:49, 4 users, load average: 35.13, 34.68, 35.48
00:12:26 up 8:50, 4 users, load average: 31.50, 33.41, 34.98
00:13:26 up 8:51, 4 users, load average: 26.91, 31.79, 34.33
00:14:26 up 8:52, 4 users, load average: 29.68, 31.24, 33.95
00:15:26 up 8:53, 4 users, load average: 33.28, 32.27, 34.15
00:16:26 up 8:54, 4 users, load average: 42.20, 34.76, 34.88
00:17:26 up 8:55, 4 users, load average: 26.46, 31.80, 33.86
00:18:26 up 8:56, 4 users, load average: 24.92, 30.23, 33.18
00:19:26 up 8:57, 4 users, load average: 21.84, 28.31, 32.33
00:20:26 up 8:58, 4 users, load average: 27.89, 28.66, 32.19
00:21:26 up 8:59, 4 users, load average: 33.64, 30.41, 32.58
00:22:26 up 9:00, 4 users, load average: 30.08, 30.01, 32.31
00:23:26 up 9:01, 4 users, load average: 23.74, 28.01, 31.47
00:24:26 up 9:02, 4 users, load average: 51.83, 35.55, 33.86
00:25:26 up 9:03, 4 users, load average: 92.31, 48.55, 38.34
00:26:26 up 9:04, 4 users, load average: 124.97, 67.08, 45.42
00:27:26 up 9:05, 4 users, load average: 126.32, 76.00, 49.73
00:28:26 up 9:06, 4 users, load average: 165.14, 96.48, 58.43
00:29:26 up 9:07, 4 users, load average: 132.22, 100.43, 62.23
00:30:26 up 9:08, 4 users, load average: 98.44, 96.11, 63.12
00:31:26 up 9:09, 4 users, load average: 113.44, 99.98, 66.47
00:32:26 up 9:10, 4 users, load average: 91.29, 96.83, 67.53
00:33:26 up 9:11, 4 users, load average: 67.54, 88.47, 66.46
00:34:26 up 9:12, 4 users, load average: 90.20, 91.37, 68.85
00:35:26 up 9:13, 4 users, load average: 66.07, 84.52, 67.92
00:36:26 up 9:14, 4 users, load average: 65.21, 81.17, 67.82
00:37:27 up 9:15, 4 users, load average: 37.02, 70.30, 64.92
00:38:27 up 9:16, 4 users, load average: 23.21, 60.12, 61.75
00:39:27 up 9:17, 4 users, load average: 14.69, 50.88, 58.46
00:40:27 up 9:18, 4 users, load average: 18.34, 44.83, 55.88
00:41:27 up 9:19, 4 users, load average: 24.19, 42.21, 54.31
00:42:27 up 9:20, 4 users, load average: 19.45, 37.63, 51.98
00:43:27 up 9:21, 4 users, load average: 39.88, 40.26, 51.99
00:44:27 up 9:22, 4 users, load average: 35.88, 38.82, 50.74
00:45:27 up 9:23, 4 users, load average: 30.76, 36.33, 49.12
00:46:27 up 9:24, 4 users, load average: 50.64, 41.56, 50.14
00:47:27 up 9:25, 4 users, load average: 53.48, 44.15, 50.50
00:48:27 up 9:26, 4 users, load average: 50.56, 44.74, 50.29
00:49:27 up 9:27, 4 users, load average: 35.02, 41.29, 48.75
00:50:27 up 9:28, 4 users, load average: 38.36, 40.51, 47.98
00:51:27 up 9:29, 4 users, load average: 46.91, 42.62, 48.25
00:52:27 up 9:30, 4 users, load average: 28.36, 38.41, 46.47
00:53:27 up 9:31, 4 users, load average: 20.75, 34.56, 44.65
00:54:27 up 9:32, 4 users, load average: 29.30, 34.09, 43.83
00:55:27 up 9:33, 4 users, load average: 41.25, 36.48, 44.02
00:56:27 up 9:34, 4 users, load average: 36.59, 36.44, 43.56
00:57:27 up 9:35, 4 users, load average: 28.67, 34.31, 42.39
00:58:27 up 9:36, 4 users, load average: 36.35, 35.43, 42.26
00:59:27 up 9:37, 4 users, load average: 39.51, 36.75, 42.30
01:00:27 up 9:38, 4 users, load average: 53.77, 39.30, 42.72
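
To pull the worst sample out of a capture like the one above, the output can be filtered with awk. This is a rough sketch, not something that was run during the incident; the function name `peak_load` and the file name `load.log` in the usage note are hypothetical. It takes the first HH:MM:SS token on each line as the sample time and the first number after `load average:` as the 1-minute average, so it works whether or not the lines carry a leading paste line number.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `w | head -1` samples on stdin and print
# "<HH:MM:SS> <1m load>" for the sample with the highest 1-minute load.
peak_load() {
  awk '
    # Only consider lines that carry an uptime-style load report.
    /load average:/ {
      # The sample time is the first HH:MM:SS token on the line.
      if (!match($0, /[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/)) next
      ts = substr($0, RSTART, RLENGTH)
      # The 1-minute average is the first number after "load average:".
      rest = $0
      sub(/.*load average: */, "", rest)
      split(rest, avg, /, */)
      load = avg[1] + 0
      if (!seen || load > max) { max = load; max_ts = ts; seen = 1 }
    }
    END { if (seen) printf "%s %.2f\n", max_ts, max }
  '
}
```

With the samples saved to a file, `peak_load < load.log` prints a single line; for the capture above that should be the 00:28:26 sample at 165.14 noted in the task description.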

bd808 added a comment. Jun 30 2017, 2:56 AM

SAL entries:

== 2017-06-30 ==
02:29	<chasemp>	labstore1005 start drbd
02:14	<chasemp>	reboot labstore1005 (5m ago)
01:33	<chasemp>	time for i in `cat tools-hosts`; do ssh -i ~/.ssh/labs_root_id_rsa root@$i.eqiad.wmflabs 'hostname -f; uptime; tc-setup'; done
01:29	<andrewbogott>	rebooting tools-cron-01
01:25	<chasemp>	reboot labstoer1005
01:23	<chasemp>	fail nfs from labstore1005 to labstore1004 (I failed to log a previous failover to 1004 and back)

== 2017-06-29 ==
22:16	<chasemp>	set cfq scheduler on labstore1005
21:40	<chasemp>	reboot labstore1004 with grub set to gnulinux-advanced-1773f282-5a1b-441e-865c-8b70a0ebc925>gnulinux-4.4.0-3-amd64-advanced-1773f282-5a1b-441e-865c-8b70a0ebc925
20:33	<andrewbogott>	depooling, rebooting, and repooling every lighttpd node three at a time
18:30	<chasemp>	restart nfs on labstore1004 (primary)
15:38	<chasemp>	restart nfs-exportd on labstore1004
17:22	<bd808>	rebooting tools-static-11
17:20	<andrewbogott>	rebooting tools-static-10
16:27	<chasemp>	restart k8s components on master (madhu)
16:10	<chasemp>	tools-flannel-etcd-01:~$ sudo service etcd restart
15:57	<chasemp>	reboot tools-docker-registery-01 for nfs
15:09	<chasemp>	set downtimes for labstore1004/1005 failover see https://etherpad.wikimedia.org/p/labstore_reboots
bd808 updated the task description. Jun 30 2017, 3:13 AM
bd808 updated the task description. Jun 30 2017, 3:17 AM
Jay8g added a subscriber: Jay8g. Jun 30 2017, 4:07 AM
herron added a subscriber: herron. Jun 30 2017, 5:46 PM
chasemp triaged this task as High priority. Jun 30 2017, 6:15 PM
bd808 updated the task description. Jul 1 2017, 6:37 PM
Paladox added a subscriber: Paladox. Jul 1 2017, 6:45 PM
bd808 lowered the priority of this task from High to Normal. Nov 21 2018, 11:14 PM

Pretty sure this is not "High" priority 18 months later.