Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues
Closed, Resolved · Public

Description

Incident report on wikitech

Roughly what occurred (needs times and check for accuracy/order):

  • Rebooted labstore1004 to activate upgraded kernel
  • Promoted labstore1004 to primary and failed clients over to it
  • Load spiked on labstore1004
  • (LDAP outage from unrelated causes)
  • Rebooted labstore1005 to activate upgraded kernel
  • Promoted labstore1005 to primary and failed clients over to it
  • Load spiked on labstore1005
  • Load spiked on several NFS clients
  • Reboots of various NFS clients in Tools project to ensure that stale NFS handles are not to blame for load spikes
  • Kubernetes nodes not able to communicate with etcd
  • Reboot flannel and flannel etcd
  • Reboot kubernetes etcd
  • Load continues to be very high on labstore1005 (NFS primary)
  • Halt new pod scheduling on Kubernetes
  • Rolling reboot of Kubernetes nodes
  • Rolling reboot of grid engine nodes
  • Tune kernel parameters on labstore1005
  • Let things sit to see if load will settle down
  • Rollback labstore1004 kernel to 4.4.2-3+wmf8
  • Change i/o scheduler on labstore1005 from deadline to cfq
  • Let things sit to see if load will settle down
  • Re-enable new pod scheduling on Kubernetes to restore service to clients
  • Let things sit to see if load will settle down
  • 2017-06-30T00:28:26 Load spikes hit new high of 165.14 1m avg on labstore1005
  • Promote labstore1004 to NFS primary and fail clients over
  • Load on labstore1004 stays within pre-update expected values
  • Rollback labstore1005 kernel to 4.4.2-3+wmf8
  • Let things sit
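
For reference, the deadline to cfq change in the list above is made through sysfs on kernels of this era: /sys/block/<device>/queue/scheduler lists the available schedulers with the active one in brackets, and writing a name to that file switches it. The helper below is only a sketch for checking which scheduler is active. The function name and the "sda" device in the comments are assumptions for illustration, not what was actually run on labstore1005.

```shell
#!/bin/sh
# Print the active I/O scheduler from a sysfs-style line such as
# "noop [deadline] cfq". The active scheduler is the bracketed entry.
# Prints nothing if the line contains no bracketed entry.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Hypothetical usage on a host ("sda" is an assumed device name):
#   active_scheduler "$(cat /sys/block/sda/queue/scheduler)"
# Switching the scheduler, as root:
#   echo cfq > /sys/block/sda/queue/scheduler
```

A change made this way does not persist across reboots, so it would need to be reapplied (or set via boot configuration) after a kernel rollback.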

Event Timeline

Some data about the system load we saw:

bd808$ while /bin/true; do w|head -1; sleep 60; done
21:48:22 up 6:26, 5 users, load average: 22.48, 21.74, 24.36
21:49:22 up 6:27, 5 users, load average: 20.15, 21.54, 24.15
21:50:22 up 6:28, 5 users, load average: 15.01, 19.54, 23.28
21:51:22 up 6:29, 5 users, load average: 23.41, 21.89, 23.90
21:52:22 up 6:30, 5 users, load average: 9.67, 18.17, 22.49
21:53:22 up 6:31, 5 users, load average: 5.26, 15.47, 21.30
21:54:22 up 6:32, 5 users, load average: 3.68, 13.05, 20.09
21:55:22 up 6:33, 5 users, load average: 9.31, 12.55, 19.46
21:56:22 up 6:34, 5 users, load average: 22.45, 16.02, 20.23
21:57:22 up 6:35, 5 users, load average: 16.71, 15.48, 19.78
21:58:22 up 6:36, 5 users, load average: 10.81, 14.23, 19.09
21:59:22 up 6:37, 5 users, load average: 4.34, 11.72, 17.92
22:00:22 up 6:38, 5 users, load average: 20.87, 13.74, 18.15
22:01:22 up 6:39, 5 users, load average: 107.12, 40.87, 27.27
22:02:22 up 6:40, 5 users, load average: 79.89, 46.28, 30.06
22:03:22 up 6:41, 5 users, load average: 39.44, 40.80, 29.19
22:04:22 up 6:42, 5 users, load average: 23.71, 35.94, 28.24
22:05:22 up 6:43, 5 users, load average: 23.53, 33.20, 27.75
22:06:22 up 6:44, 5 users, load average: 23.09, 31.63, 27.57
22:07:22 up 6:45, 5 users, load average: 13.49, 27.30, 26.33
22:08:22 up 6:46, 5 users, load average: 17.68, 25.54, 25.76
22:09:22 up 6:47, 5 users, load average: 18.33, 24.19, 25.28
22:10:23 up 6:48, 5 users, load average: 33.64, 26.41, 25.92
22:11:23 up 6:49, 5 users, load average: 42.27, 31.18, 27.65
22:12:23 up 6:50, 5 users, load average: 28.28, 29.27, 27.22
22:13:23 up 6:51, 5 users, load average: 23.89, 28.23, 27.01
22:14:23 up 6:52, 6 users, load average: 19.23, 26.27, 26.42
22:15:23 up 6:53, 6 users, load average: 29.90, 27.27, 26.71
# <chasemp> !log set cfq scheduler on labstore1005
22:16:23 up 6:54, 6 users, load average: 39.06, 30.90, 28.03
22:17:23 up 6:55, 6 users, load average: 29.66, 29.78, 27.83
22:18:23 up 6:56, 6 users, load average: 28.39, 29.13, 27.72
22:19:23 up 6:57, 6 users, load average: 19.96, 26.63, 26.95
22:20:23 up 6:58, 6 users, load average: 17.86, 24.48, 26.17
22:21:23 up 6:59, 6 users, load average: 31.95, 27.74, 27.21
22:22:23 up 7:00, 6 users, load average: 24.14, 26.34, 26.77
22:23:23 up 7:01, 6 users, load average: 18.88, 24.37, 26.05
22:24:23 up 7:02, 6 users, load average: 24.18, 24.08, 25.80
22:25:23 up 7:03, 6 users, load average: 40.34, 29.36, 27.55
22:26:23 up 7:04, 6 users, load average: 36.21, 30.58, 28.11
22:27:23 up 7:05, 6 users, load average: 22.02, 27.42, 27.17
22:28:23 up 7:06, 6 users, load average: 17.21, 24.87, 26.29
22:29:23 up 7:07, 6 users, load average: 9.84, 21.39, 25.01
22:30:23 up 7:08, 6 users, load average: 15.81, 20.20, 24.33
22:31:23 up 7:09, 6 users, load average: 42.33, 27.60, 26.66
22:32:23 up 7:10, 6 users, load average: 24.94, 25.29, 25.92
22:33:23 up 7:11, 6 users, load average: 15.22, 22.62, 24.98
22:34:23 up 7:12, 6 users, load average: 6.63, 18.73, 23.48
22:35:23 up 7:13, 6 users, load average: 11.32, 17.40, 22.70
22:36:23 up 7:14, 6 users, load average: 20.44, 19.36, 23.07
22:37:23 up 7:15, 6 users, load average: 9.59, 16.41, 21.82
22:38:23 up 7:16, 6 users, load average: 5.33, 13.94, 20.63
22:39:23 up 7:17, 6 users, load average: 4.71, 12.06, 19.56
22:40:23 up 7:18, 6 users, load average: 18.87, 14.60, 19.95
22:41:23 up 7:19, 6 users, load average: 18.17, 15.34, 19.88
22:42:23 up 7:20, 6 users, load average: 17.02, 15.53, 19.66
22:43:23 up 7:21, 6 users, load average: 9.19, 13.71, 18.78
22:44:23 up 7:22, 6 users, load average: 7.81, 12.37, 18.00
22:45:23 up 7:23, 6 users, load average: 15.55, 13.35, 17.96
22:46:23 up 7:24, 6 users, load average: 29.23, 17.72, 19.18
22:47:23 up 7:25, 6 users, load average: 22.96, 18.22, 19.27
22:48:24 up 7:26, 6 users, load average: 18.69, 17.75, 19.04
22:49:24 up 7:27, 6 users, load average: 17.10, 17.72, 18.96
22:50:24 up 7:28, 6 users, load average: 12.75, 16.20, 18.35
22:51:24 up 7:29, 6 users, load average: 18.27, 17.72, 18.77
22:52:24 up 7:30, 6 users, load average: 10.41, 15.61, 17.98
22:53:24 up 7:31, 6 users, load average: 6.43, 13.67, 17.18
22:54:24 up 7:32, 6 users, load average: 4.77, 11.82, 16.31
22:55:24 up 7:33, 6 users, load average: 12.29, 12.35, 16.20
22:56:24 up 7:34, 6 users, load average: 19.81, 14.69, 16.77
22:57:24 up 7:35, 6 users, load average: 15.23, 14.24, 16.48
22:58:24 up 7:36, 6 users, load average: 7.96, 12.50, 15.75
22:59:24 up 7:37, 6 users, load average: 4.49, 10.58, 14.88
# Re-enabled Kubernetes scheduler
23:00:24 up 7:38, 6 users, load average: 30.33, 15.23, 16.12
23:01:24 up 7:39, 6 users, load average: 77.63, 32.98, 22.25
23:02:24 up 7:40, 6 users, load average: 49.11, 33.68, 23.22
23:03:24 up 7:41, 6 users, load average: 31.96, 31.74, 23.21
23:04:24 up 7:42, 6 users, load average: 23.84, 29.24, 22.87
23:05:24 up 7:43, 6 users, load average: 25.82, 28.46, 22.99
23:06:24 up 7:44, 6 users, load average: 26.25, 28.69, 23.44
23:07:24 up 7:45, 6 users, load average: 14.66, 24.97, 22.49
23:08:24 up 7:46, 6 users, load average: 13.73, 22.73, 21.87
23:09:24 up 7:47, 6 users, load average: 9.77, 20.01, 20.99
23:10:24 up 7:48, 6 users, load average: 25.70, 21.49, 21.37
23:11:24 up 7:49, 6 users, load average: 24.64, 22.67, 21.83
23:12:24 up 7:50, 6 users, load average: 16.08, 20.23, 21.02
23:13:24 up 7:51, 6 users, load average: 15.49, 19.19, 20.61
23:14:24 up 7:52, 6 users, load average: 13.61, 18.20, 20.19
23:15:24 up 7:53, 6 users, load average: 22.57, 19.67, 20.55
23:16:24 up 7:54, 6 users, load average: 35.45, 23.99, 21.99
23:17:24 up 7:55, 6 users, load average: 29.25, 23.99, 22.10
23:18:24 up 7:56, 6 users, load average: 31.99, 25.49, 22.72
23:19:24 up 7:57, 6 users, load average: 39.21, 28.43, 23.89
23:20:24 up 7:58, 6 users, load average: 36.44, 29.37, 24.49
23:21:24 up 7:59, 6 users, load average: 43.42, 32.92, 26.03
23:22:24 up 8:00, 6 users, load average: 43.83, 35.04, 27.19
23:23:24 up 8:01, 6 users, load average: 30.10, 32.89, 26.96
23:24:24 up 8:02, 6 users, load average: 35.34, 33.66, 27.58
23:25:24 up 8:03, 6 users, load average: 53.45, 38.55, 29.60
23:26:25 up 8:04, 6 users, load average: 42.10, 38.54, 30.19
23:27:25 up 8:05, 6 users, load average: 27.22, 34.98, 29.49
23:28:25 up 8:06, 6 users, load average: 19.43, 31.20, 28.52
23:29:25 up 8:07, 6 users, load average: 16.17, 28.22, 27.67
23:30:25 up 8:08, 6 users, load average: 34.74, 30.24, 28.34
23:31:25 up 8:09, 6 users, load average: 55.58, 37.68, 31.05
23:32:25 up 8:10, 6 users, load average: 38.48, 36.08, 30.92
23:33:25 up 8:11, 6 users, load average: 30.98, 34.18, 30.58
23:34:25 up 8:12, 6 users, load average: 24.63, 31.89, 30.03
23:36:25 up 8:14, 6 users, load average: 34.26, 33.03, 30.64
23:37:25 up 8:15, 6 users, load average: 24.90, 30.52, 29.93
23:38:25 up 8:16, 6 users, load average: 21.08, 28.21, 29.16
23:39:25 up 8:17, 6 users, load average: 15.94, 25.44, 28.14
23:40:25 up 8:18, 6 users, load average: 23.06, 25.17, 27.85
23:41:25 up 8:19, 6 users, load average: 25.25, 26.07, 28.03
23:42:25 up 8:20, 6 users, load average: 20.00, 24.12, 27.22
23:43:25 up 8:21, 6 users, load average: 15.96, 22.19, 26.36
23:44:25 up 8:22, 6 users, load average: 23.61, 23.32, 26.49
23:45:25 up 8:23, 6 users, load average: 30.87, 24.97, 26.83
23:46:25 up 8:24, 6 users, load average: 37.74, 28.79, 28.07
23:47:25 up 8:25, 6 users, load average: 29.97, 27.89, 27.79
23:48:25 up 8:26, 6 users, load average: 29.88, 28.12, 27.86
23:49:25 up 8:27, 6 users, load average: 31.49, 28.37, 27.93
23:50:25 up 8:28, 6 users, load average: 24.30, 27.03, 27.51
23:51:25 up 8:29, 6 users, load average: 20.32, 25.68, 27.03
23:52:25 up 8:30, 6 users, load average: 17.82, 23.77, 26.28
23:53:25 up 8:31, 6 users, load average: 15.61, 22.18, 25.58
23:54:25 up 8:32, 6 users, load average: 14.75, 20.46, 24.76
23:55:25 up 8:33, 6 users, load average: 30.94, 23.56, 25.52
23:56:25 up 8:34, 6 users, load average: 37.85, 27.34, 26.73
23:57:25 up 8:35, 6 users, load average: 31.53, 27.32, 26.75
23:58:25 up 8:36, 6 users, load average: 21.11, 25.36, 26.13
23:59:25 up 8:37, 5 users, load average: 11.26, 21.81, 24.86
00:00:25 up 8:38, 5 users, load average: 30.35, 24.46, 25.53
00:01:25 up 8:39, 5 users, load average: 128.66, 53.90, 35.60
00:02:26 up 8:40, 5 users, load average: 118.68, 65.93, 40.96
00:03:26 up 8:41, 5 users, load average: 68.00, 61.41, 41.01
00:04:26 up 8:42, 5 users, load average: 48.45, 57.09, 40.80
00:05:26 up 8:43, 5 users, load average: 30.11, 50.17, 39.43
00:06:26 up 8:44, 5 users, load average: 37.95, 48.86, 39.66
00:07:26 up 8:45, 5 users, load average: 25.56, 43.11, 38.25
00:08:26 up 8:46, 5 users, load average: 17.92, 38.16, 36.88
00:09:26 up 8:47, 4 users, load average: 11.29, 32.67, 35.07
00:10:26 up 8:48, 4 users, load average: 32.73, 33.91, 35.29
00:11:26 up 8:49, 4 users, load average: 35.13, 34.68, 35.48
00:12:26 up 8:50, 4 users, load average: 31.50, 33.41, 34.98
00:13:26 up 8:51, 4 users, load average: 26.91, 31.79, 34.33
00:14:26 up 8:52, 4 users, load average: 29.68, 31.24, 33.95
00:15:26 up 8:53, 4 users, load average: 33.28, 32.27, 34.15
00:16:26 up 8:54, 4 users, load average: 42.20, 34.76, 34.88
00:17:26 up 8:55, 4 users, load average: 26.46, 31.80, 33.86
00:18:26 up 8:56, 4 users, load average: 24.92, 30.23, 33.18
00:19:26 up 8:57, 4 users, load average: 21.84, 28.31, 32.33
00:20:26 up 8:58, 4 users, load average: 27.89, 28.66, 32.19
00:21:26 up 8:59, 4 users, load average: 33.64, 30.41, 32.58
00:22:26 up 9:00, 4 users, load average: 30.08, 30.01, 32.31
00:23:26 up 9:01, 4 users, load average: 23.74, 28.01, 31.47
00:24:26 up 9:02, 4 users, load average: 51.83, 35.55, 33.86
00:25:26 up 9:03, 4 users, load average: 92.31, 48.55, 38.34
00:26:26 up 9:04, 4 users, load average: 124.97, 67.08, 45.42
00:27:26 up 9:05, 4 users, load average: 126.32, 76.00, 49.73
00:28:26 up 9:06, 4 users, load average: 165.14, 96.48, 58.43
00:29:26 up 9:07, 4 users, load average: 132.22, 100.43, 62.23
00:30:26 up 9:08, 4 users, load average: 98.44, 96.11, 63.12
00:31:26 up 9:09, 4 users, load average: 113.44, 99.98, 66.47
00:32:26 up 9:10, 4 users, load average: 91.29, 96.83, 67.53
00:33:26 up 9:11, 4 users, load average: 67.54, 88.47, 66.46
00:34:26 up 9:12, 4 users, load average: 90.20, 91.37, 68.85
00:35:26 up 9:13, 4 users, load average: 66.07, 84.52, 67.92
00:36:26 up 9:14, 4 users, load average: 65.21, 81.17, 67.82
00:37:27 up 9:15, 4 users, load average: 37.02, 70.30, 64.92
00:38:27 up 9:16, 4 users, load average: 23.21, 60.12, 61.75
00:39:27 up 9:17, 4 users, load average: 14.69, 50.88, 58.46
00:40:27 up 9:18, 4 users, load average: 18.34, 44.83, 55.88
00:41:27 up 9:19, 4 users, load average: 24.19, 42.21, 54.31
00:42:27 up 9:20, 4 users, load average: 19.45, 37.63, 51.98
00:43:27 up 9:21, 4 users, load average: 39.88, 40.26, 51.99
00:44:27 up 9:22, 4 users, load average: 35.88, 38.82, 50.74
00:45:27 up 9:23, 4 users, load average: 30.76, 36.33, 49.12
00:46:27 up 9:24, 4 users, load average: 50.64, 41.56, 50.14
00:47:27 up 9:25, 4 users, load average: 53.48, 44.15, 50.50
00:48:27 up 9:26, 4 users, load average: 50.56, 44.74, 50.29
00:49:27 up 9:27, 4 users, load average: 35.02, 41.29, 48.75
00:50:27 up 9:28, 4 users, load average: 38.36, 40.51, 47.98
00:51:27 up 9:29, 4 users, load average: 46.91, 42.62, 48.25
00:52:27 up 9:30, 4 users, load average: 28.36, 38.41, 46.47
00:53:27 up 9:31, 4 users, load average: 20.75, 34.56, 44.65
00:54:27 up 9:32, 4 users, load average: 29.30, 34.09, 43.83
00:55:27 up 9:33, 4 users, load average: 41.25, 36.48, 44.02
00:56:27 up 9:34, 4 users, load average: 36.59, 36.44, 43.56
00:57:27 up 9:35, 4 users, load average: 28.67, 34.31, 42.39
00:58:27 up 9:36, 4 users, load average: 36.35, 35.43, 42.26
00:59:27 up 9:37, 4 users, load average: 39.51, 36.75, 42.30
01:00:27 up 9:38, 4 users, load average: 53.77, 39.30, 42.72
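
If the samples above are saved to a file, the worst spike can be pulled out mechanically instead of by eye. The function below is a rough sketch that assumes input lines shaped like `w | head -1` output. The name peak_load and the file name in the usage note are made up for illustration.

```shell
#!/bin/sh
# Read `w | head -1` style lines on stdin and print the sample with the
# highest 1-minute load average as "<time> <load>".
# Lines without "load average:" (for example "# ..." notes) are skipped.
peak_load() {
    awk '
        /load average:/ {
            # Locate the HH:MM:SS field, so an extra leading column is tolerated.
            t = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^[0-9]+:[0-9]+:[0-9]+$/) { t = $i; break }
            }
            # The three averages follow "load average: "; the first is the 1m value.
            split($0, parts, "load average: ")
            split(parts[2], avg, ", ")
            if (avg[1] + 0 > max) { max = avg[1] + 0; best = avg[1]; at = t }
        }
        END { if (best != "") printf "%s %s\n", at, best }
    '
}
```

Usage would be along the lines of `peak_load < load-samples.txt`, where load-samples.txt is a hypothetical file holding the captured lines.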

SAL entries:

== 2017-06-30 ==
02:29	<chasemp>	labstore1005 start drbd
02:14	<chasemp>	reboot labstore1005 (5m ago)
01:33	<chasemp>	time for i in `cat tools-hosts`; do ssh -i ~/.ssh/labs_root_id_rsa root@$i.eqiad.wmflabs 'hostname -f; uptime; tc-setup'; done
01:29	<andrewbogott>	rebooting tools-cron-01
01:25	<chasemp>	reboot labstore1005
01:23	<chasemp>	fail nfs from labstore1005 to labstore1004 (I failed to log a previous failover to 1004 and back)

== 2017-06-29 ==
22:16	<chasemp>	set cfq scheduler on labstore1005
21:40	<chasemp>	reboot labstore1004 with grub set to gnulinux-advanced-1773f282-5a1b-441e-865c-8b70a0ebc925>gnulinux-4.4.0-3-amd64-advanced-1773f282-5a1b-441e-865c-8b70a0ebc925
20:33	<andrewbogott>	depooling, rebooting, and repooling every lighttpd node three at a time
18:30	<chasemp>	restart nfs on labstore1004 (primary)
15:38	<chasemp>	restart nfs-exportd on labstore1004
17:22	<bd808>	rebooting tools-static-11
17:20	<andrewbogott>	rebooting tools-static-10
16:27	<chasemp>	restart k8s components on master (madhu)
16:10	<chasemp>	tools-flannel-etcd-01:~$ sudo service etcd restart
15:57	<chasemp>	reboot tools-docker-registery-01 for nfs
15:09	<chasemp>	set downtimes for labstore1004/1005 failover see https://etherpad.wikimedia.org/p/labstore_reboots

bd808 lowered the priority of this task from High to Medium. Nov 21 2018, 11:14 PM

Pretty sure this is not "High" priority 18 months later.

Bstorm subscribed.

Moving this to the graveyard and linking it to a more recent task.

labstore1005 is now on Debian stretch. When we fail over to it next week, it will be interesting to see how load looks.

Bstorm claimed this task.

Both labstores are now on stretch. Overall, the behavior seen on this ticket has not recurred. Load is higher in general on the servers, but it has no discernible impact on actual performance. Beyond that, the clients are not seeing load spikes.

Since the kernel version is far beyond the version in this ticket, I don't think it is useful to keep this open.