In response to T192972, which measured (by proxy) the health of the elasticsearch cluster master, we split the production deployment into multiple clusters with separate masters. Re-run the analysis now that the planned fix has been deployed.
|Resolved||EBernhardson||T183281 [epic] ELK upgrade to 6.x (elasticsearch, kibana, logstash)|
|Resolved||None||T183282 [epic] Search cluster upgrade to 6.x|
|Resolved||EBernhardson||T192972 Evaluate impact of adding ~2700 new shards to production cluster|
|Resolved||EBernhardson||T215969 Measure mutation latency across the newly split elasticsearch clusters|
Re-ran data collection and the report. Of particular interest here is going to be chi-eqiad which is serving the majority of traffic. The over-time graphs for chi-eqiad aren't great, but they are better than before. Additionally the largest spikes are directly attributable to disk space issues we are currently experiencing in eqiad. Looking at the allocation explain while running the test shows that sometimes the master decides all nodes are above the disk threshold. I ended up needing to increase the watermark from 75% to 79% for the test to even run.
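For anyone repeating this, here is a rough sketch of the two pieces involved: the transient settings body for raising the watermark, and a helper that picks out which nodes the disk threshold decider rejected in an allocation explain response. It assumes the watermark in question is `cluster.routing.allocation.disk.watermark.low` (the comment above doesn't say which one was changed) and that the response has the `node_allocation_decisions` / `deciders` layout of recent Elasticsearch versions. The function names and node names are made up for illustration, and the sample response is not captured from the actual run.

```python
from typing import Dict, List


def watermark_body(low: str) -> Dict:
    """Body for PUT /_cluster/settings that sets the low disk watermark.

    Assumption: the setting changed during the test was the *low* watermark.
    """
    return {
        "transient": {
            "cluster.routing.allocation.disk.watermark.low": low,
        }
    }


def disk_rejections(explain: Dict) -> List[str]:
    """Names of nodes rejected by the disk_threshold decider.

    `explain` is the parsed JSON of GET /_cluster/allocation/explain.
    """
    rejected = []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if (decider.get("decider") == "disk_threshold"
                    and decider.get("decision") == "NO"):
                rejected.append(node.get("node_name", "<unknown>"))
                break  # one NO from this decider is enough for this node
    return rejected


# Illustrative response fragment with hypothetical node names.
sample_explain = {
    "can_allocate": "no",
    "node_allocation_decisions": [
        {"node_name": "node-a", "node_decision": "no",
         "deciders": [{"decider": "disk_threshold", "decision": "NO"}]},
        {"node_name": "node-b", "node_decision": "no",
         "deciders": [{"decider": "awareness", "decision": "NO"}]},
    ],
}

print(watermark_body("79%"))
print(disk_rejections(sample_explain))
```

If every node in the response shows up in `disk_rejections`, that matches the "all nodes are above the disk threshold" situation described above.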
The numbers that seem most important, only chi-eqiad (primary load-bearing cluster)
Absolute numbers seem to have generally improved across the board. Index creation at 4s is better, but not amazing. For reference the default master_timeout is 30s, which we had to bump to 2m in the cirrus configuration for index creation requests (and a few others). Based on these metrics I think it's safe to say the master node + cluster state sync process is healthier and is having an easier time mutating cluster state than before, but with the allocation problems related to disk space it's hard to say with confidence we don't still have problems with mutation latency spikes.
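To make the master_timeout discussion concrete, here is a minimal sketch of timing a single index creation with an explicit `master_timeout` query parameter. This is not the script used for the report; the helper names, the base URL and the index name are hypothetical, and only the standard library is used.

```python
import time
import urllib.request


def create_index_request(base_url, index, master_timeout="30s"):
    """Build a PUT request that creates `index` with an explicit master_timeout."""
    url = f"{base_url.rstrip('/')}/{index}?master_timeout={master_timeout}"
    return urllib.request.Request(
        url,
        data=b"{}",  # empty settings body; a real index would carry settings and mappings
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Against a live cluster (hypothetical host and index name):
#   req = create_index_request("http://localhost:9200", "latency_test_0", "2m")
#   response, seconds = timed(urllib.request.urlopen, req)
```

Repeating this in a loop and recording `seconds` gives a create_index latency distribution that can be compared against whatever timeout is configured.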
One possible way to test would be to drop our 2 minute master timeout back to 30s and see how daily completion suggester builds and whatnot work. I would love to rip out all the related master timeout code in cirrus.
Unfortunately completion suggester build logs don't seem to make it into logstash, but the local copies in mwmaint1002.eqiad.wmnet:/var/log/mediawiki/cirrus-suggest seem to suggest things went ok.
I used grep -L "Recycling index" *.eqiad.log *.eqiad.log-20190220 to get a list of logs that created new indexes. All 45 matching wikis had the cluster 'go green' in a single loop of the check. A look over the last few days of enwiki build logs seems to suggest this was already normal. Basically nothing looks broken, so let's stick with the 30s for now.
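Since `grep -L` is easy to misread, a small Python equivalent of that filter: it lists the files that do *not* contain the given string, which here means the builds that created a new index rather than recycling one. The function name is hypothetical and the match is a plain substring test, which is equivalent for this particular pattern.

```python
from pathlib import Path


def files_without(needle, paths):
    """Return the paths whose contents do not contain `needle` (like grep -L)."""
    missing = []
    for path in paths:
        # errors="replace" keeps stray non-UTF-8 bytes in a log from aborting the scan
        text = Path(path).read_text(errors="replace")
        if needle not in text:
            missing.append(str(path))
    return missing


# Example, run from the log directory:
#   logs = sorted(Path(".").glob("*.eqiad.log"))
#   created_new_index = files_without("Recycling index", logs)
```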
The spikes on create_index are pretty extreme, with 194s for chi-eqiad-with-archive and 291s for omega-eqiad-with-archive. Is that just bad luck, or is something going on with the archives that makes this sometimes take much longer?
As far as I could tell while the test was running, the big spikes were caused by the cluster entering a yellow state and staying there for an extended period. The elastic APIs were reporting no nodes available to assign shards to. This was happening because about half our cluster is unable to accept new shards due to disk space issues, and we require elastic to put the 3 copies of a shard into 3 different dc rows (of which there are only 4 to choose from). The end result is that every once in a while elastic decides it doesn't have anywhere it can assign a shard until it moves some things around.
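A toy model of that constraint, to show why losing nodes to the disk watermark hurts so quickly with only 4 rows. This is a deliberate simplification and not how the Elasticsearch allocator actually works: it only checks whether enough distinct rows still have at least one node under the watermark, and ignores shard sizes, existing copies and balancing. The node data is made up.

```python
def can_place_copies(nodes, copies=3, watermark=0.75):
    """True if `copies` distinct rows each have a node below the disk watermark.

    `nodes` is an iterable of (row, disk_used_fraction) pairs.
    """
    eligible_rows = {row for row, used in nodes if used < watermark}
    return len(eligible_rows) >= copies


# Hypothetical cluster: rows A and B are entirely above 75% disk usage.
nodes = [
    ("A", 0.80), ("A", 0.78),
    ("B", 0.77),
    ("C", 0.60),
    ("D", 0.55),
]

print(can_place_copies(nodes, watermark=0.75))  # only rows C and D qualify
print(can_place_copies(nodes, watermark=0.79))  # rows A (0.78), B, C and D qualify
```

With 4 rows and 3 required copies, it only takes two rows running out of eligible nodes for new shards to become unassignable until something moves.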
The spikes do make me less confident in the final results, but I don't have any good way to resolve that problem long enough to test. We could do something extreme like change the disk watermarks to 90% while running the test, but experience tells us that simply leads to elastic moving a bunch of shards onto the half of the cluster with more disk space. Additionally, after changing the watermarks we would have to wait around until elastic has rebalanced the cluster into a steady state.
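If we did go that route, the "wait around" part could be automated by polling cluster health. A minimal sketch, assuming "steady state" means green status with no relocating, initializing or unassigned shards (my definition, not an official one). `fetch_health` is a hypothetical callable returning the parsed JSON of GET /_cluster/health; it is injected so the loop can be tried without a live cluster.

```python
import time


def is_steady(health):
    """Green, and nothing relocating, initializing or unassigned."""
    return (
        health.get("status") == "green"
        and health.get("relocating_shards", 0) == 0
        and health.get("initializing_shards", 0) == 0
        and health.get("unassigned_shards", 0) == 0
    )


def wait_until_steady(fetch_health, timeout=3600.0, interval=10.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the cluster is steady; return False if `timeout` seconds pass first."""
    deadline = clock() + timeout
    while True:
        if is_steady(fetch_health()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Only starting the latency test after `wait_until_steady` returns True would at least keep rebalancing traffic out of the measurements.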