etcd cluster in codfw has raft consensus issues
Open, Low, Public

Description

For the second time in a month and a half, replication from the eqiad cluster to the codfw cluster for our main etcd deployment failed due to Raft consensus issues on the codfw cluster.

The start of the consensus loss appears to be:

Apr  2 01:01:16 conf2002 etcd[64619]: 42dba8818c7e95a2 [term: 3784] received a MsgVote message with higher term from 7c750d35ce52dd67 [term: 3785]
Apr  2 01:01:16 conf2002 etcd[64619]: 42dba8818c7e95a2 became follower at term 3785
Apr  2 01:01:16 conf2002 etcd[64619]: 42dba8818c7e95a2 [logterm: 3784, index: 11308810, vote: 0] voted for 7c750d35ce52dd67 [logterm: 3784, index: 11308810] at term 3785
Apr  2 01:01:16 conf2002 etcd[64619]: raft.node: 42dba8818c7e95a2 lost leader 42dba8818c7e95a2 at term 3785
Apr  2 01:01:17 conf2002 etcd[64619]: 42dba8818c7e95a2 [term: 3785] ignored a MsgAppResp message with lower term from 8e8915c7c2d67b6 [term: 3784]
Apr  2 01:01:17 conf2002 etcd[64619]: 42dba8818c7e95a2 [term: 3785] ignored a MsgAppResp message with lower term from 8e8915c7c2d67b6 [term: 3784]
Apr  2 01:01:17 conf2002 etcd[64619]: 42dba8818c7e95a2 [term: 3785] ignored a MsgAppResp message with lower term from 8e8915c7c2d67b6 [term: 3784]
Apr  2 01:01:17 conf2002 etcd[64619]: 42dba8818c7e95a2 [term: 3785] ignored a MsgHeartbeatResp message with lower term from 8e8915c7c2d67b6 [term: 3784]

and also on another member:

Apr  2 01:01:16 conf2001 etcd[64305]: 8e8915c7c2d67b6 [term: 3784] received a MsgVote message with higher term from 7c750d35ce52dd67 [term: 3785]
Apr  2 01:01:16 conf2001 etcd[64305]: 8e8915c7c2d67b6 became follower at term 3785
Apr  2 01:01:16 conf2001 etcd[64305]: 8e8915c7c2d67b6 [logterm: 3784, index: 11308810, vote: 0] voted for 7c750d35ce52dd67 [logterm: 3784, index: 11308810] at term 3785
Apr  2 01:01:16 conf2001 etcd[64305]: raft.node: 8e8915c7c2d67b6 lost leader 42dba8818c7e95a2 at term 3785
Apr  2 01:01:17 conf2001 etcd[64305]: 8e8915c7c2d67b6 [term: 3785] ignored a MsgApp message with lower term from 42dba8818c7e95a2 [term: 3784]

and on the third one:

Apr  2 01:01:16 conf2003 etcd[63864]: 7c750d35ce52dd67 is starting a new election at term 3784
Apr  2 01:01:16 conf2003 etcd[63864]: 7c750d35ce52dd67 became candidate at term 3785
Apr  2 01:01:16 conf2003 etcd[63864]: 7c750d35ce52dd67 received vote from 7c750d35ce52dd67 at term 3785
Apr  2 01:01:16 conf2003 etcd[63864]: 7c750d35ce52dd67 [logterm: 3784, index: 11308810] sent vote request to 42dba8818c7e95a2 at term 3785
Apr  2 01:01:16 conf2003 etcd[63864]: 7c750d35ce52dd67 [logterm: 3784, index: 11308810] sent vote request to 8e8915c7c2d67b6 at term 3785
Apr  2 01:01:16 conf2003 etcd[63864]: raft.node: 7c750d35ce52dd67 lost leader 42dba8818c7e95a2 at term 3785
Apr  2 01:01:17 conf2003 etcd[63864]: 7c750d35ce52dd67 [term: 3785] ignored a MsgApp message with lower term from 42dba8818c7e95a2 [term: 3784]
Apr  2 01:01:17 conf2003 etcd[63864]: 7c750d35ce52dd67 received vote from 42dba8818c7e95a2 at term 3785
Apr  2 01:01:17 conf2003 etcd[63864]: 7c750d35ce52dd67 [q:2] has received 2 votes and 0 vote rejections
Apr  2 01:01:17 conf2003 etcd[63864]: 7c750d35ce52dd67 became leader at term 3785
Apr  2 01:01:17 conf2003 etcd[63864]: raft.node: 7c750d35ce52dd67 elected leader 7c750d35ce52dd67 at term 3785
Apr  2 01:01:20 conf2003 etcd[63864]: 7c750d35ce52dd67 [term: 3785] received a MsgVote message with higher term from 42dba8818c7e95a2 [term: 3786]
Apr  2 01:01:20 conf2003 etcd[63864]: 7c750d35ce52dd67 became follower at term 3786
Apr  2 01:01:20 conf2003 etcd[63864]: 7c750d35ce52dd67 [logterm: 3785, index: 11308815, vote: 0] voted for 42dba8818c7e95a2 [logterm: 3785, index: 11308815] at term 3786
Apr  2 01:01:20 conf2003 etcd[63864]: raft.node: 7c750d35ce52dd67 lost leader 7c750d35ce52dd67 at term 3786
Apr  2 01:01:21 conf2003 etcd[63864]: raft.node: 7c750d35ce52dd67 elected leader 42dba8818c7e95a2 at term 3786

So it seems like a classic consensus-loss and convergence problem.

The positives:

  • etcd remained available read-only for the entire duration of the issue
  • Alerts worked correctly and notified us of the issue

The negatives:

  • the consensus loss lasted quite a long time, around one hour
  • the replication system is somewhat oversensitive to this kind of failure, though I don't see many ways to make it easier to recover from

Things we might want to do:

  • Experiment thoroughly with etcd3 and its v2 store, and upgrade the codfw cluster to v3
  • Tweak the way the replica works so that it does not die when it is merely unable to write because of a Raft error
Joe created this task. Apr 3 2017, 6:12 AM
Joe triaged this task as Low priority. Apr 3 2017, 6:12 AM
Joe added a project: User-Joe.
Volans added a subscriber: Volans. Sun, Sep 3, 9:58 AM

I might have a good candidate for what is causing it: mdadm checkarray

The mdadm crontab is:

57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi

As you can see, although the cron entry fires weekly, the day-of-month guard means it actually runs only on the first Sunday of each month, and for the last three months the alert triggered at roughly the same time on the first Sunday of the month.
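
For clarity, here is the same command annotated: the schedule field 57 0 * * 0 fires every Sunday at 00:57, and the date guard only lets it proceed during the first seven days of the month, i.e. on the first Sunday. This is just an annotated rewrite of the entry above, not a proposed change:

# Fires every Sunday at 00:57, but only proceeds when the day of the
# month is <= 7, which limits it to the first Sunday of each month.
if [ -x /usr/share/mdadm/checkarray ] && [ "$(date +%d)" -le 7 ]; then
    /usr/share/mdadm/checkarray --cron --all --idle --quiet
fi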

Here the timeline of the last event from dmesg on the 3 hosts:

  • conf2001
[Sep 3 00:59] md: data-check of RAID array md0
[  +0.000002] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[  +0.000001] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[  +0.000003] md: using 128k window, over a total of 48794624k.
[  +0.018149] md: delaying data-check of md2 until md0 has finished (they share one or more physical units)
[Sep 3 01:07] md: md0: data-check done.
[  +0.017437] md: data-check of RAID array md2
[  +0.000003] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[  +0.000001] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[  +0.000004] md: using 128k window, over a total of 438449152k.
[Sep 3 02:34] md: md2: data-check done.
  • conf2002:
[Sep 3 01:02] md: data-check of RAID array md0
[  +0.000003] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[  +0.000001] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[  +0.000003] md: using 128k window, over a total of 48794624k.
[  +0.018483] md: delaying data-check of md2 until md0 has finished (they share one or more physical units)
[Sep 3 01:10] md: md0: data-check done.
[  +0.018387] md: data-check of RAID array md2
[  +0.000003] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[  +0.000001] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[  +0.000004] md: using 128k window, over a total of 438449152k.
[Sep 3 02:39] md: md2: data-check done.
  • conf2003:
[Sep 3 00:55] md: data-check of RAID array md0
[  +0.000003] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[  +0.000002] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[  +0.000004] md: using 128k window, over a total of 48794624k.
[  +0.017168] md: delaying data-check of md2 until md0 has finished (they share one or more physical units)
[Sep 3 01:03] md: md0: data-check done.
[  +0.014045] md: data-check of RAID array md2
[  +0.000003] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[  +0.000001] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[  +0.000003] md: using 128k window, over a total of 438449152k.
[Sep 3 02:33] md: md2: data-check done.

And the timeline from the etcd and etcd-mirror logs:

Sep  3 01:04:05 conf2002 etcdmirror-conftool-eqiad-wmnet[17378]: [etcd-mirror] INFO: Replicating key /conftool/v1/pools/eqiad/api_appserver/apache2/mw1208.eqiad.wmnet at index 110038
[... SNIP ...]
Sep  3 01:04:12 conf2002 etcd[552]: got unexpected response error (etcdserver: request timed out)
Sep  3 01:04:12 conf2002 etcdmirror-conftool-eqiad-wmnet[17378]: [etcd-mirror] ERROR: Action compareAndSwap failed on /conftool/v1/pools/eqiad/api_appserver/apache2/mw1208.eqiad.wmnet: Raft Internal Error : etcdserver: request timed out
Joe moved this task from Backlog to Doing on the User-Joe board. Wed, Sep 6, 3:44 PM

Mentioned in SAL (#wikimedia-operations) [2017-09-07T14:44:59Z] <_joe_> manually running /usr/share/mdadm/checkarray --cron --all --idle --quiet on conf2001, trying to reproduce consensus issues in T162013

Joe added a comment. Fri, Sep 8, 9:46 AM

Running the mdadm command on one host caused a re-election to happen. It seems likely we have found the culprit, so I am now going to run the command at the same time on two hosts, and then on all three, to verify that this is really the origin of the issues.

Mentioned in SAL (#wikimedia-operations) [2017-09-08T09:48:00Z] <_joe_> running /usr/share/mdadm/checkarray --cron --all --idle --quiet on conf2001 and conf2003 trying to reproduce consensus issues in T162013

Mentioned in SAL (#wikimedia-operations) [2017-09-08T10:09:53Z] <_joe_> etcd cluster lost consensus T162013

Joe added a comment. Fri, Sep 8, 10:48 AM

Result of the latest experiment: running the check simultaneously on conf2001 and conf2003 again made the cluster lose consensus (see the SAL entry above).

So we confirmed the problem is tied to I/O latency. Things we might do to mitigate it:

  • Reduce the raid resync speed via sysctl
  • Stagger the raid resyncs across different weeks
  • Raise the Raft timeouts to account for the higher latency (see the sketch after this list)
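
As a sketch of the last option, and assuming we stick to etcd's standard tuning flags, the knobs are --heartbeat-interval and --election-timeout (both in milliseconds, defaults 100 and 1000); the values below are illustrative only, not a tested recommendation:

# Give followers more slack before they call an election when the leader
# is briefly stalled on I/O; etcd's tuning docs suggest keeping the
# election timeout around 5-10x the heartbeat interval.
etcd --heartbeat-interval=250 --election-timeout=2500   # plus the existing member flags
# (the same can be set via the ETCD_HEARTBEAT_INTERVAL and
#  ETCD_ELECTION_TIMEOUT environment variables)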

I will start with reducing the RAID sync speed manually and restarting the resync afterwards, in order to find a good balance between resync speed and I/O latency.
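
For reference, a minimal sketch of what "reduce the speed and restart the check" looks like with the standard md interfaces; the speed cap below is a starting guess to be tuned, not a final choice:

# Cap the md check/resync rate, in KB/s per device (the default max is 200000).
sysctl -w dev.raid.speed_limit_max=20000

# Re-run the data-check on the large array (aborting any running one first)
# to observe the I/O latency impact under the new limit.
echo idle  > /sys/block/md2/md/sync_action
echo check > /sys/block/md2/md/sync_action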

Mentioned in SAL (#wikimedia-operations) [2017-09-08T12:36:52Z] <_joe_> restarting a check of the md2 devices on conf200* T162013 after reducing sync speed

Joe added a comment. Fri, Sep 8, 12:52 PM

Reducing the sync speed manually did the job, so we can just puppetize this.
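
The actual change is linked below; conceptually, the persistent version of the manual tweak is just a sysctl setting along these lines (file path and value are illustrative, not necessarily what the patch does):

# e.g. /etc/sysctl.d/70-raid-resync.conf, deployed via puppet
dev.raid.speed_limit_max = 20000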

Change 376712 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] etcd: limit RAID resync speed if on linux software raid

https://gerrit.wikimedia.org/r/376712

Change 376712 merged by Giuseppe Lavagetto:
[operations/puppet@production] etcd: limit RAID resync speed if on linux software raid

https://gerrit.wikimedia.org/r/376712

Joe closed this task as Resolved. Fri, Sep 8, 1:40 PM
Joe claimed this task.
Joe added a comment. Fri, Sep 8, 3:21 PM

I was too optimistic, it appears, in declaring victory: the new resync at reduced speed still triggered consensus issues. The version of etcd we are running seems particularly sensitive to I/O latency spikes. So while I am inclined to either disable the RAID resyncs altogether or stagger them between the servers of each cluster, I will also consider upgrading etcd to a newer version (still in the 2.x series) as an option.
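
As for staggering, one way to spread the checks out would be to give each cluster member its own day-of-month window instead of the shared "first seven days" guard; the entries below are a hypothetical sketch, not a tested configuration:

# conf2001: first Sunday of the month
57 0 * * 0 root [ "$(date +\%d)" -le 7 ] && /usr/share/mdadm/checkarray --cron --all --idle --quiet
# conf2002: second Sunday of the month
57 0 * * 0 root [ "$(date +\%d)" -ge 8 ] && [ "$(date +\%d)" -le 14 ] && /usr/share/mdadm/checkarray --cron --all --idle --quiet
# conf2003: third Sunday of the month
57 0 * * 0 root [ "$(date +\%d)" -ge 15 ] && [ "$(date +\%d)" -le 21 ] && /usr/share/mdadm/checkarray --cron --all --idle --quiet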

Joe reopened this task as Open. Fri, Sep 8, 3:21 PM
Joe moved this task from Doing to Backlog on the User-Joe board. Thu, Sep 14, 9:43 AM