On October 20th, from ~08:00 to ~09:00 UTC, writes to some of the Elasticsearch eqiad indices (mainly commonswiki_file) were failing because write quorum could not be reached. This had no functional impact: failed writes are queued and retried. Still, some investigation is required.
What we know so far:
- master at that time was elastic1036
- elastic1027 lost connection to the master multiple times: [08:04:28-08:06:05] [08:09:08-08:09:54] [08:14:38-08:15:28] [08:18:38-08:19:54] (see log below)
- at the same time, puppet logged a few getaddrinfo errors on elastic1027 (see log below)
- PyBal detected connection errors to elastic1027 at roughly the same times (see log below)
- the above points suggest that elastic1027 lost network connectivity several times, for a few minutes in total
- ~33k write operations failed on commonswiki_file (each error produces 2 log lines):
gehel@fluorine:~$ grep -c QUORUM /a/mw-log/archive/CirrusSearch.log-20161020
66434
- the number of errors on other indices is negligible:
gehel@fluorine:~$ grep QUORUM /a/mw-log/archive/CirrusSearch.log-20161020 | grep -c -v commonswiki_file
56
- The actual error is (in fluorine:/a/mw-log/archive/CirrusSearch.log-20161020):
update: /commonswiki_file_1413317847/page/8372075 caused [commonswiki_file_1413317847][2] Not enough active copies to meet write consistency of [QUORUM] (have 1, needed 2). Timeout: [1ms], request: [BulkShardRequest to [commonswiki_file_1413317847] containing [1] requests] {"numBulk":1,"indexType":"file","queryType":"send_data_write","took":514,"message":"unknown: Error in one or more bulk request actions:\n\nupdate: /commonswiki_file_1413317847/page/8372075 caused [commonswiki_file_1413317847][2] Not enough active copies to meet write consistency of [QUORUM] (have 1, needed 2). Timeout: [1ms], request: [BulkShardRequest to [commonswiki_file_1413317847] containing [1] requests]\n"}
What we don't understand:
- losing 1 node, we lose at most 1 copy of each shard. commonswiki_file is configured with 1 primary and 2 replicas, which means the quorum (2 out of 3 copies) should still be reached with one copy lost.
- why was commonswiki_file practically the only index impacted? Hypotheses:
- very large shards: ~70GB
- larger writes (pdfs, images, ...)
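The quorum arithmetic behind that expectation, as a quick sanity check (Elasticsearch computes the quorum over all copies of a shard, primary included):

```shell
# Sanity check of the quorum arithmetic for commonswiki_file:
# 1 primary + 2 replicas = 3 copies; quorum = int(copies / 2) + 1.
primary=1
replicas=2
copies=$(( primary + replicas ))
quorum=$(( copies / 2 + 1 ))
echo "quorum=$quorum"                                        # quorum=2, the "needed 2" in the error
echo "copies left after losing one node: $(( copies - 1 ))"  # 2, still enough for quorum
```

So losing a single copy should leave 2 active copies, exactly the quorum, which is why these failures are surprising.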
What we should do:
- the current discovery settings are the defaults, which means a fault should be detected after 3 failed pings of up to 30 seconds each (~90 seconds in the worst case, per the "failed to ping, tried [3] times, each with maximum [30s] timeout" log line). We might want to increase the number of retries and the interval. It is unclear what the downside is. The current values are:
- ping_interval = 1s
- ping_timeout = 30s
- ping_retries = 3
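If we do change these, they map to the zen fault-detection settings in elasticsearch.yml. A sketch with a purely illustrative retry count, not a tested recommendation:

```yaml
# Illustrative sketch only; values are not validated recommendations.
discovery.zen.fd.ping_interval: 1s   # current default
discovery.zen.fd.ping_timeout: 30s   # current default
discovery.zen.fd.ping_retries: 6     # example: doubled from the default of 3
```

The obvious downside is that a genuinely dead node would stay in the cluster (and in write quorums) longer before being dropped.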
- decrease the size of the commonswiki_file shards by increasing the number of shards. We currently have 5 shards (each with 2 replicas); we should probably raise this to 15 shards, which should decrease shard size from ~70GB to ~20GB. Note that this implies we end up with more shards than nodes.
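Note that the shard count of an existing index cannot be changed in place, so this means creating a new index and reindexing into it. A minimal sketch of the settings a replacement index would be created with (index naming and the reindex procedure itself are out of scope here):

```json
{
  "settings": {
    "number_of_shards": 15,
    "number_of_replicas": 2
  }
}
```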
elastic1027 leaving and rejoining the cluster:
gehel@elastic1027:~$ egrep "(master_left|detected_master)" /var/log/elasticsearch/production-search-eqiad.log
[2016-10-20 08:04:28,593][INFO ][discovery.zen ] [elastic1027] master_left [{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2016-10-20 08:06:05,936][INFO ][cluster.service ] [elastic1027] detected_master {elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}, added {{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true},}, reason: zen-disco-receive(from master [{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}])
[2016-10-20 08:09:08,766][INFO ][discovery.zen ] [elastic1027] master_left [{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2016-10-20 08:09:54,533][INFO ][cluster.service ] [elastic1027] detected_master {elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}, added {{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true},}, reason: zen-disco-receive(from master [{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}])
[2016-10-20 08:14:38,090][INFO ][discovery.zen ] [elastic1027] master_left [{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2016-10-20 08:15:28,809][INFO ][cluster.service ] [elastic1027] detected_master {elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}, added {{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true},}, reason: zen-disco-receive(from master [{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}])
[2016-10-20 08:18:38,938][INFO ][discovery.zen ] [elastic1027] master_left [{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2016-10-20 08:19:54,945][INFO ][cluster.service ] [elastic1027] detected_master {elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}, added {{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true},}, reason: zen-disco-receive(from master [{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}])
Puppet getaddrinfo errors:
gehel@elastic1027:~$ sudo grep getaddrinfo /var/log/syslog
Oct 20 08:14:59 elastic1027 puppet-agent[30053]: getaddrinfo: Name or service not known
Oct 20 08:15:12 elastic1027 puppet-agent[30053]: (/File[/var/lib/puppet/lib]) Failed to generate additional resources using 'eval_generate': getaddrinfo: Name or service not known
Oct 20 08:15:21 elastic1027 puppet-agent[30053]: (/File[/var/lib/puppet/lib]) Could not evaluate: getaddrinfo: Name or service not known Could not retrieve file metadata for puppet://puppet/plugins: getaddrinfo: Name or service not known
PyBal logs related to elasticsearch:
gehel@lvs1003:~$ sudo journalctl -u pybal --since "2016-10-20 07:00" | grep search-https_9243
Oct 20 08:03:29 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/up/pooled): Fetch failed, 30.001 s
Oct 20 08:03:29 lvs1003 pybal[19384]: [search-https_9243] ERROR: Monitoring instance ProxyFetch reports server elastic1027.eqiad.wmnet (enabled/up/pooled) down: User timeout caused connection failure.
Oct 20 08:03:51 lvs1003 pybal[19384]: [search-https_9243] ERROR: Monitoring instance IdleConnection reports server elastic1027.eqiad.wmnet (enabled/down/not pooled) down: User timeout caused connection failure.
Oct 20 08:03:51 lvs1003 pybal[19384]: [search-https_9243 IdleConnection] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Connection failed.
Oct 20 08:03:53 lvs1003 pybal[19384]: [search-https_9243 IdleConnection] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Connection failed.
Oct 20 08:03:54 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Fetch failed, 15.037 s
Oct 20 08:04:02 lvs1003 pybal[19384]: [search-https_9243 IdleConnection] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Connection failed.
Oct 20 08:04:04 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Fetch failed, 0.002 s
Oct 20 08:04:14 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Fetch failed, 0.002 s
Oct 20 08:04:21 lvs1003 pybal[19384]: [search-https_9243 IdleConnection] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Connection failed.
Oct 20 08:04:24 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Fetch failed, 0.002 s
Oct 20 08:04:34 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Fetch failed, 0.002 s
Oct 20 08:04:44 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Fetch failed, 0.002 s
Oct 20 08:04:54 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Fetch failed, 0.026 s
Oct 20 08:05:04 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Fetch failed, 0.035 s
Oct 20 08:05:14 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.033 s
Oct 20 08:05:24 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.031 s
Oct 20 08:05:34 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.029 s
Oct 20 08:05:44 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.027 s
Oct 20 08:05:54 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.024 s
Oct 20 08:06:04 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.028 s
Oct 20 08:06:14 lvs1003 pybal[19384]: [search-https_9243] INFO: Server elastic1027.eqiad.wmnet (enabled/partially up/not pooled) is up
Oct 20 08:09:25 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/up/pooled): Fetch failed, 0.028 s
Oct 20 08:09:25 lvs1003 pybal[19384]: [search-https_9243] ERROR: Monitoring instance ProxyFetch reports server elastic1027.eqiad.wmnet (enabled/up/pooled) down: 503 Service Unavailable
Oct 20 08:09:35 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.034 s
Oct 20 08:09:45 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.038 s
Oct 20 08:09:55 lvs1003 pybal[19384]: [search-https_9243] INFO: Server elastic1027.eqiad.wmnet (enabled/partially up/not pooled) is up
Oct 20 08:14:56 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/up/pooled): Fetch failed, 0.032 s
Oct 20 08:14:56 lvs1003 pybal[19384]: [search-https_9243] ERROR: Monitoring instance ProxyFetch reports server elastic1027.eqiad.wmnet (enabled/up/pooled) down: 503 Service Unavailable
Oct 20 08:15:06 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.035 s
Oct 20 08:15:16 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.027 s
Oct 20 08:15:26 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.021 s
Oct 20 08:15:36 lvs1003 pybal[19384]: [search-https_9243] INFO: Server elastic1027.eqiad.wmnet (enabled/partially up/not pooled) is up
Oct 20 08:18:56 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/up/pooled): Fetch failed, 0.027 s
Oct 20 08:18:56 lvs1003 pybal[19384]: [search-https_9243] ERROR: Monitoring instance ProxyFetch reports server elastic1027.eqiad.wmnet (enabled/up/pooled) down: 503 Service Unavailable
Oct 20 08:19:06 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.031 s
Oct 20 08:19:16 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.026 s
Oct 20 08:19:26 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.036 s
Oct 20 08:19:36 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.028 s
Oct 20 08:19:46 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.028 s
Oct 20 08:19:57 lvs1003 pybal[19384]: [search-https_9243] INFO: Server elastic1027.eqiad.wmnet (enabled/partially up/not pooled) is up