
Evaluate impact of adding ~2700 new shards to production cluster
Closed, ResolvedPublic

Description

The simplest way to solve our problems with elasticsearch and the single type restriction would be to move namespace to the meta store and add a new index for archive. Archives are small, at most about 50k documents per wiki, so each only needs a single shard. The index would still need to exist for 900 wikis with a primary and 2 replicas. If we want to reject this solution because it won't work at that scale, that should be easy enough to prove.
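For reference, that is where the ~2700 figure in the title comes from: 900 wikis × 1 shard × (1 primary + 2 replicas) = 2700 new shards.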

  • Devise a test that measures the latency of various master operations (a rough sketch of such a measurement loop follows the list of operations below).
  • Develop the test on a single-node local elasticsearch cluster
  • If promising, run the test on the hot-spare production cluster
  • If still promising, run the test on the live production cluster to see full-load impact

Cluster operations to measure:

  • Index create/delete
  • Index settings/mapping update
  • Move shard between nodes (if >1 node)
  • Cluster settings update
  • Time to read outputs like cluster state, _cat/shards, _cat/indices
  • Probably more
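As a rough illustration of the measurement loop described above, here is a minimal sketch using the elasticsearch-py client. The index names, iteration count, and settings values are placeholders rather than what the notebook actually ran, and the shard-move case (cluster reroute) is omitted for brevity:

```python
import time
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])
results = {}

def timed(label, fn):
    """Run fn() and record its wall-clock latency under label."""
    start = time.time()
    fn()
    results.setdefault(label, []).append(time.time() - start)

for i in range(100):  # iteration count is a placeholder
    index = "latency_test_%d" % i

    # Index create, plus the wait for the cluster to return to green
    timed("create_index", lambda: es.indices.create(
        index=index,
        body={"settings": {"number_of_shards": 1, "number_of_replicas": 2}}))
    timed("wait_for_green", lambda: es.cluster.health(
        wait_for_status="green", timeout="10m", request_timeout=660))

    # Index settings update (mapping updates would use put_mapping similarly)
    timed("update_settings", lambda: es.indices.put_settings(
        index=index, body={"index": {"refresh_interval": "60s"}}))

    # Time to read cluster outputs
    timed("cluster_state", lambda: es.cluster.state())
    timed("cat_shards", lambda: es.cat.shards())
    timed("cat_indices", lambda: es.cat.indices())

    # Index delete
    timed("delete_index", lambda: es.indices.delete(index=index))

for label, times in sorted(results.items()):
    print("%s avg %.2fs max %.2fs" % (label, sum(times) / len(times), max(times)))
```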

Event Timeline

EBernhardson created this task.

Basic report: http://paws-public.wmflabs.org/paws-public/User:EBernhardson_(WMF)/ElasticsearchMasterLatency/TooManyShards.ipynb

This is currently incomplete, but due to a mixup I will need to wait until Monday to run the final data collection in eqiad. Specifically, I ran the test today with 1800 extra indices instead of the 900 I was intending to. Overall the results look OK, although the 1800-extra-index test seems to suggest there are certainly limits.

For the eqiad tests, the with-archives version was run in the ~1 hour leading up to the busiest time of day. This might explain why the 2x-archives test shows the operations slowing down as time goes on. The default version without archive indices was run on the opposite side of the peak, as load was coming down, and does not show the same change. Because of this, I don't think it will be as useful to re-run the eqiad test now; instead we should wait until the busiest part of Monday to collect measurements.

TODO: Need to figure out how best to move report results from paws to phabricator for longer term reference in the tickets.

Currently re-running the eqiad tests.

The previous eqiad test with 2x archive indices did show a potential problem: about 80% of the way through the test something weird happened and create_index spiked, with a single create taking 600 seconds and a couple on each side of it taking 200+ seconds. This is worrying, and exactly the kind of thing we hoped to find with these tests (as opposed to in production after rollout). As this test was run with 1800 new indices instead of the 900 we expect, it is not as worrying, but it would still be nice to know what happened.

Correlating some graphs against when I know the test should have run, and looking through the master node logs, I have been able to determine that the 100 iterations of the test were run between approximately 16:50 and 18:30 UTC on 2018-04-27. The 10 minute lag is clearly visible in the logs when filtered by index creation, and puts the stall at 17:57:44 to 18:09:36. I took a look at the rest of the logs generated between those two points, but there is nothing too interesting there. All the messages are related to disk thresholds and being unable to assign shards to particular nodes. Those aren't great, but we have them all day every day anyway. I also looked over our node-comparison graphs between 1040 (the master) and 1044 (another server), and again nothing particularly interesting. Both servers continued serving typical queries; nothing looks out of the ordinary.

It is worth noting that the create_index action doesn't only create the index, it also waits for the cluster to return to green after creating the index. After further investigation into the appropriate logs, it seems create_index wasn't even the culprit. Just before it, at 17:57:41, we added a replica to kiwiki_archive_NNN. The message bringing the cluster from yellow back to green was that a kiwiki_archive_NNN shard was started. The add_replica mutation does wait for green, but perhaps there was a race condition where the cluster had not yet transitioned from green to yellow after updating auto_expand_replicas. So that doesn't solve our problem, but it does say that our measurements are far from perfect.
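To make that suspected race concrete, here is a minimal sketch of what the add_replica step looks like when driven through the settings API. The function name and the auto_expand_replicas range are illustrative, not the notebook's actual code:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

def add_replica(index):
    # Ask elasticsearch to expand to an extra replica for this index.
    # The "0-2" range here is an assumption for illustration.
    es.indices.put_settings(
        index=index,
        body={"index": {"auto_expand_replicas": "0-2"}})

    # Intended to block until the new replica has been assigned and
    # started. If the master has not yet reacted to the settings change
    # and flipped the cluster to yellow, this sees green and returns
    # immediately, and the real recovery time gets charged to whatever
    # timed operation runs next (here, the following create_index).
    es.cluster.health(wait_for_status="green", timeout="10m",
                      request_timeout=660)

add_replica("kiwiki_archive_NNN")  # index name as it appears in the logs
```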

I don't know why it took 10 minutes to come back to green after adding a replica. The logs don't give any good hints either.

Mentioned in SAL (#wikimedia-operations) [2018-04-30T21:46:05Z] <ebernhardson> T192972 increase eqiad elasticsearch disk watermarks from 75/80 to 85/85

Finished the new eqiad test with the expected number of archives; the notebook linked above has been updated. It shows a similar problem to the 2x-archive test: adding new shards to the cluster typically finishes in a reasonable timeframe, but sometimes waiting for the cluster to return to green takes several minutes (up to 5 minutes in this test).

Digging through and looking at things, I think I dismissed the disk watermark log messages too early. We limit where the cluster can place shards, such that an index with 4 copies per shard needs to have 1 copy in each row of the DC. Unfortunately 11 of our nodes (all of them part of elastic1017-1031, which have 500G disks) are not accepting new indices due to hitting the disk low watermark. I'm pretty sure what is happening is that we are waiting around for the cluster to figure out how to shuffle data between nodes so that it can create the new shard.

We actually set the watermarks pretty conservatively, with nodes not accepting new shards at 75% disk consumed and pushing shards away from the node at 80%. Most of the old nodes, though, are sitting between 75% and 80%, which means they are happy enough to keep serving traffic but won't accept new shards, and that makes the allocator unhappy. My first thought here is to raise both of these values and set them to the same value. We don't want to be in an extended state where a significant portion of the cluster is unable to accept new shards, but that's exactly what happens when the low and high disk watermarks have different values.

I plan to run a new test pushing both of these values to 85%. At 85% the nodes will start pushing shards away when they get down to about 74G free, which seems like more than enough headroom. We should keep in mind, though, that the reason they were lower, at 75%, was to encourage elasticsearch to better spread data across the cluster.
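For reference, here is roughly what that change looks like as a transient cluster settings update; whether it was actually applied transiently or through puppet is not stated above, so treat this as a sketch of the API call rather than the exact command used:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Set both disk watermarks to the same value so nodes either accept new
# shards or actively shed them, with no in-between state where a node
# keeps serving traffic but refuses allocations.
es.cluster.put_settings(body={
    "transient": {
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.disk.watermark.high": "85%",
    }
})
```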

Change 430066 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: raise alerting limit for free disk space

https://gerrit.wikimedia.org/r/430066

Mentioned in SAL (#wikimedia-operations) [2018-05-01T16:15:55Z] <ebernhardson> T192972 change eqiad elasticsearch disk watermarks from 85/85 to 80/80 to match disk space alerts

Nothing to add, really. But I read your last few comments and it all sounds reasonable.

Calling this one done, I've exported the report to pdf so it lasts longer:

The summary seems to be:

  • codfw is happy either way
  • eqiad is kinda-sorta happy right now
  • adding 900 indices to eqiad caused rare but significant spikes in time to go back to green.

Based on these results, after some discussion, we've decided T193654 is a good direction forward from here.

Since it was asked, and for the record, a list of the different index sizes: P7257

Change 430066 abandoned by Gehel:
elasticsearch: raise alerting limit for free disk space

Reason:
already implemented in another patch

https://gerrit.wikimedia.org/r/430066

Vvjjkkii renamed this task from Evaluate impact of adding ~2700 new shards to production cluster to zaeaaaaaaa.Jul 1 2018, 1:14 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed EBernhardson as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed subscribers: gerritbot, Aklapper.
Mainframe98 renamed this task from zaeaaaaaaa to Evaluate impact of adding ~2700 new shards to production cluster.Jul 1 2018, 7:42 AM
Mainframe98 closed this task as Resolved.
Mainframe98 assigned this task to EBernhardson.
Mainframe98 lowered the priority of this task from High to Medium.
Mainframe98 updated the task description. (Show Details)
Mainframe98 added subscribers: gerritbot, Aklapper.