
Write errors on elasticsearch eqiad
Closed, Resolved · Public

Description

On October 20th, from ~08:00 UTC to ~09:00 UTC, writes to some of the elasticsearch eqiad indices were failing because quorum was not reached (mainly on commonswiki_file). This had no functional impact: failed writes are queued and retried. Still, some investigation is required.

What we know so far:

  • master at that time was elastic1036
  • elastic1027 lost connection to the master multiple times: [08:04:28-08:06:05] [08:09:08-08:09:54] [08:14:38-08:15:28] [08:18:38-08:19:54] (see log below)
  • at the same time, puppet logged a few getaddrinfo errors on elastic1027 (see log below)
  • PyBal detected connection errors to elastic1027 at roughly the same times (see log below)
  • the points above lead us to think that elastic1027 lost network connectivity for roughly 3 minutes
  • ~30k write operations failed on commonswiki_file (2 lines per error):
gehel@fluorine:~$ grep -c QUORUM /a/mw-log/archive/CirrusSearch.log-20161020 
66434
  • the number of errors on other indices is negligible:
gehel@fluorine:~$ grep QUORUM /a/mw-log/archive/CirrusSearch.log-20161020 | grep -c -v commonswiki_file
56
  • The actual error is (in fluorine:/a/mw-log/archive/CirrusSearch.log-20161020):
update: /commonswiki_file_1413317847/page/8372075 caused [commonswiki_file_1413317847][2] Not enough active copies to meet write consistency of [QUORUM] (have 1, needed 2). Timeout: [1ms], request: [BulkShardRequest to [commonswiki_file_1413317847] containing [1] requests]
 {"numBulk":1,"indexType":"file","queryType":"send_data_write","took":514,"message":"unknown: Error in one or more bulk request actions:\n\nupdate: /commonswiki_file_1413317847/page/8372075 caused [commonswiki_file_1413317847][2] Not enough active copies to meet write consistency of [QUORUM] (have 1, needed 2). Timeout: [1ms], request: [BulkShardRequest to [commonswiki_file_1413317847] containing [1] requests]\n"}

What we don't understand:

  • losing 1 node, we lose at most 1 copy of each shard. commonswiki_file is configured with 1 primary and 2 replicas, which means the quorum should still be reached with one lost copy (see the quorum arithmetic sketch after this list).
  • why commonswiki_file was the only index impacted. Hypotheses:
    • very large shards: ~70GB
    • larger writes (PDFs, images, ...)
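
For reference, a minimal sketch of the quorum arithmetic, plus a way to double-check the replica count, assuming shell access to a cluster node with the local Elasticsearch HTTP port (9200) reachable:

# Quorum for a write with 1 primary + 2 replicas (3 copies total):
#   quorum = floor(3 / 2) + 1 = 2
# which matches the "needed 2" in the error above; losing one copy should still leave 2 active ones.
# Double-check the replica count actually configured on the index:
curl -s 'localhost:9200/commonswiki_file/_settings?pretty' | grep number_of_replicas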

What we should do:

  • current discovery settings are the defaults, which means that we should detect a fault in ~33 seconds. We might want to increase the number of retries and the interval. It is unclear what the downside is. The current values are:
    • ping_interval = 1s
    • ping_timeout = 30s
    • ping_retries = 3
  • decrease the size of the commonswiki_file shards by increasing the number of shards. We currently have 5 shards (with 2 replicas each); we should probably raise this to 15 shards, which should decrease shard size from ~70GB to ~20GB. Note that this implies that we end up with more shards than nodes. Current shard sizes can be checked as sketched below.
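
As a sanity check before and after the reshard, shard sizes and placement can be listed with the _cat API; a sketch, again assuming local access to port 9200 on a cluster node:

# One line per shard copy: index, shard number, primary/replica, state, on-disk size, node
curl -s 'localhost:9200/_cat/shards/commonswiki_file?v&h=index,shard,prirep,state,store,node'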

elastic1027 leaving and rejoining the cluster

gehel@elastic1027:~$ egrep "(master_left|detected_master)" /var/log/elasticsearch/production-search-eqiad.log
[2016-10-20 08:04:28,593][INFO ][discovery.zen            ] [elastic1027] master_left [{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2016-10-20 08:06:05,936][INFO ][cluster.service          ] [elastic1027] detected_master {elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}, added {{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true},}, reason: zen-disco-receive(from master [{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}])
[2016-10-20 08:09:08,766][INFO ][discovery.zen            ] [elastic1027] master_left [{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2016-10-20 08:09:54,533][INFO ][cluster.service          ] [elastic1027] detected_master {elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}, added {{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true},}, reason: zen-disco-receive(from master [{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}])
[2016-10-20 08:14:38,090][INFO ][discovery.zen            ] [elastic1027] master_left [{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2016-10-20 08:15:28,809][INFO ][cluster.service          ] [elastic1027] detected_master {elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}, added {{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true},}, reason: zen-disco-receive(from master [{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}])
[2016-10-20 08:18:38,938][INFO ][discovery.zen            ] [elastic1027] master_left [{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2016-10-20 08:19:54,945][INFO ][cluster.service          ] [elastic1027] detected_master {elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}, added {{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true},}, reason: zen-disco-receive(from master [{elastic1036}{2SkXa07VRC23HfTDTI1ChQ}{10.64.16.45}{10.64.16.45:9300}{rack=B3, row=B, master=true}])

Puppet getaddrinfo errors:

gehel@elastic1027:~$ sudo grep getaddrinfo /var/log/syslog
Oct 20 08:14:59 elastic1027 puppet-agent[30053]: getaddrinfo: Name or service not known
Oct 20 08:15:12 elastic1027 puppet-agent[30053]: (/File[/var/lib/puppet/lib]) Failed to generate additional resources using 'eval_generate': getaddrinfo: Name or service not known
Oct 20 08:15:21 elastic1027 puppet-agent[30053]: (/File[/var/lib/puppet/lib]) Could not evaluate: getaddrinfo: Name or service not known Could not retrieve file metadata for puppet://puppet/plugins: getaddrinfo: Name or service not known

PyBal logs related to elasticsearch:

gehel@lvs1003:~$ sudo journalctl -u pybal --since "2016-10-20 07:00" | grep search-https_9243
Oct 20 08:03:29 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/up/pooled): Fetch failed, 30.001 s
Oct 20 08:03:29 lvs1003 pybal[19384]: [search-https_9243] ERROR: Monitoring instance ProxyFetch reports server elastic1027.eqiad.wmnet (enabled/up/pooled) down: User timeout caused connection failure.
Oct 20 08:03:51 lvs1003 pybal[19384]: [search-https_9243] ERROR: Monitoring instance IdleConnection reports server elastic1027.eqiad.wmnet (enabled/down/not pooled) down: User timeout caused connection failure.
Oct 20 08:03:51 lvs1003 pybal[19384]: [search-https_9243 IdleConnection] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Connection failed.
Oct 20 08:03:53 lvs1003 pybal[19384]: [search-https_9243 IdleConnection] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Connection failed.
Oct 20 08:03:54 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Fetch failed, 15.037 s
Oct 20 08:04:02 lvs1003 pybal[19384]: [search-https_9243 IdleConnection] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Connection failed.
Oct 20 08:04:04 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Fetch failed, 0.002 s
Oct 20 08:04:14 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Fetch failed, 0.002 s
Oct 20 08:04:21 lvs1003 pybal[19384]: [search-https_9243 IdleConnection] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Connection failed.
Oct 20 08:04:24 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Fetch failed, 0.002 s
Oct 20 08:04:34 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Fetch failed, 0.002 s
Oct 20 08:04:44 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Fetch failed, 0.002 s
Oct 20 08:04:54 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Fetch failed, 0.026 s
Oct 20 08:05:04 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/down/not pooled): Fetch failed, 0.035 s
Oct 20 08:05:14 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.033 s
Oct 20 08:05:24 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.031 s
Oct 20 08:05:34 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.029 s
Oct 20 08:05:44 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.027 s
Oct 20 08:05:54 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.024 s
Oct 20 08:06:04 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.028 s
Oct 20 08:06:14 lvs1003 pybal[19384]: [search-https_9243] INFO: Server elastic1027.eqiad.wmnet (enabled/partially up/not pooled) is up
Oct 20 08:09:25 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/up/pooled): Fetch failed, 0.028 s
Oct 20 08:09:25 lvs1003 pybal[19384]: [search-https_9243] ERROR: Monitoring instance ProxyFetch reports server elastic1027.eqiad.wmnet (enabled/up/pooled) down: 503 Service Unavailable
Oct 20 08:09:35 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.034 s
Oct 20 08:09:45 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.038 s
Oct 20 08:09:55 lvs1003 pybal[19384]: [search-https_9243] INFO: Server elastic1027.eqiad.wmnet (enabled/partially up/not pooled) is up
Oct 20 08:14:56 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/up/pooled): Fetch failed, 0.032 s
Oct 20 08:14:56 lvs1003 pybal[19384]: [search-https_9243] ERROR: Monitoring instance ProxyFetch reports server elastic1027.eqiad.wmnet (enabled/up/pooled) down: 503 Service Unavailable
Oct 20 08:15:06 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.035 s
Oct 20 08:15:16 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.027 s
Oct 20 08:15:26 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.021 s
Oct 20 08:15:36 lvs1003 pybal[19384]: [search-https_9243] INFO: Server elastic1027.eqiad.wmnet (enabled/partially up/not pooled) is up
Oct 20 08:18:56 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/up/pooled): Fetch failed, 0.027 s
Oct 20 08:18:56 lvs1003 pybal[19384]: [search-https_9243] ERROR: Monitoring instance ProxyFetch reports server elastic1027.eqiad.wmnet (enabled/up/pooled) down: 503 Service Unavailable
Oct 20 08:19:06 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.031 s
Oct 20 08:19:16 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.026 s
Oct 20 08:19:26 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.036 s
Oct 20 08:19:36 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.028 s
Oct 20 08:19:46 lvs1003 pybal[19384]: [search-https_9243 ProxyFetch] WARN: elastic1027.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 0.028 s
Oct 20 08:19:57 lvs1003 pybal[19384]: [search-https_9243] INFO: Server elastic1027.eqiad.wmnet (enabled/partially up/not pooled) is up

Event Timeline


Change 316964 had a related patch set uploaded (by DCausse):
[cirrus] Increase the number of shards to 15 for commonswiki_file

https://gerrit.wikimedia.org/r/316964

Change 316976 had a related patch set uploaded (by Gehel):
elasticsearch: tuning of zen discovery settings

https://gerrit.wikimedia.org/r/316976

Configuration of zen discovery seems more complex than I first thought. There appear to be separate configurations for master election and fault detection: discovery.zen.ping_timeout (which seems to have been renamed to discovery.zen.ping.timeout, but the docs were not updated) vs discovery.zen.fd.ping_timeout. This needs digging into the elasticsearch code a bit more...
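
A quick way to see which discovery settings a node is actually running with (a sketch, assuming local access to port 9200; only explicitly configured settings are listed, so absent keys mean the compiled-in defaults apply):

# flat_settings makes the discovery.zen.* keys easy to grep for
curl -s 'localhost:9200/_nodes/elastic1036/settings?pretty&flat_settings=true' | grep 'discovery.zen'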

Deskana moved this task from needs triage to Current work on the Discovery-Search board.

The connectivity issues could be related to a typo in a DNS record. This is now corrected.

Note that we still need to find a way to be more resilient to this kind of problem.

This problem raised another issue.
The mw_cirrus_metastore index was recreated by mistake, causing all titlesuggest indices to fail to recycle.
This is because the push job script recreates this index if it does not exist; unfortunately, either elastic or the script is not reliable enough, and the index was not really gone (logs extracted from /var/log/mediawiki/cirrus-sanitize/push-jobs.log-20161021):

cuwiki
-----------------------------------------------------------------
cuwiki:  2016-10-20 08:13:24 Detected 7894 ids to check, selecting profile XS
cuwiki:  2016-10-20 08:13:24 0 checker job(s) in the queue.
cuwiki:  mw_cirrus_metastore missing creating.
cuwiki:  Creating metastore index... mw_cirrus_metastore_first ok
cuwiki:         Index is red retrying...
cuwiki:         Green!
cuwiki:  Switching mw_cirrus_metastore alias from mw_cirrus_versions to mw_cirrus_metastore_first.
cuwiki:  Deleting old index mw_cirrus_versions
cuwiki:  2016-10-20 08:16:11 Pushing CheckerJob( 101, 110, 1617,  )
cuwiki:  2016-10-20 08:16:11 Pushing CheckerJob( 111, 120, 2239,  )

We can see that the script believes the index is gone but then tries to delete it; the script should be more careful when checking index existence (a defensive check is sketched below).
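
As an illustration only (not the actual CirrusSearch maintenance code), a more defensive existence check would retry a few times before concluding that the index is really gone; a bash sketch, assuming local access to port 9200:

# Return 0 only if mw_cirrus_metastore is still missing after several checks,
# to ride out transient cluster hiccups like the one above.
metastore_really_missing() {
    local name="mw_cirrus_metastore"
    for attempt in 1 2 3; do
        # HEAD /{name} answers 200 when the name resolves to an existing index (or alias)
        code=$(curl -s -o /dev/null -w '%{http_code}' -I "http://localhost:9200/${name}")
        [ "$code" = "200" ] && return 1   # it exists, do not recreate it
        sleep 5
    done
    return 0
}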

Created T148821 to followup on this.

The problem was partially resolved by restoring a dump of this index from https://dumps.wikimedia.org/other/cirrussearch/current/cirrus-metastore-eqiad-20161017.json.gz
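
Assuming the dump is in Elasticsearch bulk format, a restore could look roughly like the sketch below (chunk size, target host and the exact bulk metadata in the dump are assumptions; the actual restore may have gone through a maintenance script instead):

# Reload the dump through the bulk API in small chunks
curl -s -O https://dumps.wikimedia.org/other/cirrussearch/current/cirrus-metastore-eqiad-20161017.json.gz
zcat cirrus-metastore-eqiad-20161017.json.gz | split -l 500 - metastore_chunk_
for f in metastore_chunk_*; do
    curl -s -XPOST 'localhost:9200/mw_cirrus_metastore/_bulk' --data-binary @"$f" > /dev/null
done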

Change 316964 merged by jenkins-bot:
[cirrus] Increase the number of shards to 15 for commonswiki_file

https://gerrit.wikimedia.org/r/316964

Mentioned in SAL (#wikimedia-operations) [2016-11-09T14:28:19Z] <zfilipin@tin> Synchronized tests/cirrusTest.php: SWAT: [[gerrit:316964|[cirrus] Increase the number of shards to 15 for commonswiki_file (T148736)]] (duration: 00m 58s)

Mentioned in SAL (#wikimedia-operations) [2016-11-09T14:29:33Z] <zfilipin@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:316964|[cirrus] Increase the number of shards to 15 for commonswiki_file (T148736)]] (duration: 00m 49s)

@dcausse should we get this merged before your reindex gets to commonswiki?

I think everything concerning this issue has been merged.
It's unclear to me whether we need to do more, but I don't know what...
@Gehel should we keep this task open?
What we've done so far:

  • decreased the shard size for commons
  • stopped auto-creating the metastore

For some reason I thought that hadn't been merged yet, even though the previous comments in the ticket are quite obviously about merging and deploying it. This seems done to me.