Measure mutation latency across the newly split elasticsearch clusters
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	EBernhardson
	Feb 12 2019, 9:59 PM

Description

T192972 measured, by proxy, the health of the Elasticsearch cluster master. In response, we split the production deployment into multiple clusters with separate masters. Re-run the analysis now that the planned fix has been deployed.

Details

	Subject	Repo	Branch	Lines +/-
	[cirrus] reduce master timeout to 30s	operations/mediawiki-config	master	+1 -1


Related Objects

Status	Assigned	Task
Resolved	EBernhardson	T183281 [epic] ELK upgrade to 6.x (elasticsearch, kibana, logstash)
Resolved	None	T183282 [epic] Search cluster upgrade to 6.x
Resolved	EBernhardson	T192972 Evaluate impact of adding ~2700 new shards to production cluster
Resolved	EBernhardson	T215969 Measure mutation latency across the newly split elasticsearch clusters

Event Timeline

EBernhardson triaged this task as Medium priority.Feb 12 2019, 9:59 PM

EBernhardson created this task.

Re-ran data collection and the report. Of particular interest here is chi-eqiad, which serves the majority of traffic. The over-time graphs for chi-eqiad aren't great, but they are better than before. Additionally, the largest spikes are directly attributable to disk space issues we are currently experiencing in eqiad. Looking at the allocation explain output while running the test shows that the master sometimes decides all nodes are above the disk threshold. I ended up needing to increase the watermark from 75% to 79% for the test to even run.
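For readers who want to picture the watermark change, here is a minimal sketch of the request body one might send to `PUT /_cluster/settings`. The comment above does not say which watermark was raised or how it was applied, so the use of the low watermark (`cluster.routing.allocation.disk.watermark.low`) and of a transient setting are both assumptions, and `watermark_settings_body` is a hypothetical helper.

```python
import json


def watermark_settings_body(low_pct: int, transient: bool = True) -> str:
    """Build a JSON body for PUT /_cluster/settings.

    Assumption: the low disk watermark is the one being changed; the task
    only says "the watermark".
    """
    if not 0 < low_pct < 100:
        raise ValueError("watermark percentage must be between 0 and 100")
    scope = "transient" if transient else "persistent"
    setting = "cluster.routing.allocation.disk.watermark.low"
    return json.dumps({scope: {setting: f"{low_pct}%"}})


# 79% is the value mentioned in the comment above.
body = watermark_settings_body(79)
print(body)
```

A transient setting is used in the sketch because the change was described as a workaround for the duration of the test; it would be dropped on a full cluster restart.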

TooManyShards-after-cluster-split.pdf (1 MB)

EBernhardson moved this task from Incoming to Needs review on the Discovery-Search (Current work) board.Feb 13 2019, 1:00 AM

The numbers that seem most important, for chi-eqiad only (the primary load-bearing cluster):

mutation	metric	old	old-w-replica	new	new-w-replica
add_replica	mean	1.68	4.23	1.00	1.28
create_index	mean	3.40	6.7	2.71	3.91
fetch_cluster_state	mean	7.0	8.98	2.13	2.70
move_shard	mean	2.64	4.43	1.35	1.28
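To make the size of the change easier to compare across mutations, the relative reduction in mean latency can be computed directly from the table above. This is only a convenience sketch: the values are copied from the table, presumably in seconds, and the `reduction_pct` helper is a hypothetical name.

```python
# Mean latencies copied from the table above:
# (old, old-w-replica, new, new-w-replica)
MEANS = {
    "add_replica": (1.68, 4.23, 1.00, 1.28),
    "create_index": (3.40, 6.7, 2.71, 3.91),
    "fetch_cluster_state": (7.0, 8.98, 2.13, 2.70),
    "move_shard": (2.64, 4.43, 1.35, 1.28),
}


def reduction_pct(old: float, new: float) -> float:
    """Percentage by which `new` is lower than `old`."""
    return (old - new) / old * 100.0


# For each mutation: (reduction without replica, reduction with replica)
reductions = {
    name: (reduction_pct(old, new), reduction_pct(old_r, new_r))
    for name, (old, old_r, new, new_r) in MEANS.items()
}

for name, (plain, with_replica) in reductions.items():
    print(f"{name}: {plain:.1f}% lower, {with_replica:.1f}% lower with replica")
```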

Absolute numbers seem to have generally improved across the board. Index creation at 4s is better, but not amazing. For reference, the default master_timeout is 30s, which we had to bump to 2m in the cirrus configuration for index creation requests (and a few others). Based on these metrics I think it's safe to say the master node and the cluster state sync process are healthier and are having an easier time mutating cluster state than before. However, with the allocation problems related to disk space, it's hard to say with confidence that we don't still have problems with mutation latency spikes.

One possible way to test would be to drop our 2 minute master timeout back to 30s and see how the daily completion suggester builds and similar jobs fare. I would love to rip out all the related master timeout code in cirrus.
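As an illustration of where this timeout applies, the Elasticsearch create index API accepts a `master_timeout` query parameter. The sketch below only builds the request URL; the host, the index name and the `create_index_url` helper are hypothetical and are not taken from the cirrus code.

```python
from urllib.parse import urlencode


def create_index_url(base_url: str, index: str, master_timeout: str = "30s") -> str:
    """Return the URL for a PUT create-index request with an explicit master_timeout."""
    query = urlencode({"master_timeout": master_timeout})
    return f"{base_url.rstrip('/')}/{index}?{query}"


# Hypothetical host and index name, for illustration only.
default_url = create_index_url("http://localhost:9200", "example_index")
bumped_url = create_index_url("http://localhost:9200", "example_index", "2m")
print(default_url)
print(bumped_url)
```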

Change 491231 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/mediawiki-config@master] [cirrus] reduce master timeout to 30s

https://gerrit.wikimedia.org/r/491231

gerritbot added a project: Patch-For-Review.Feb 18 2019, 10:24 AM

Sounds good to me and easy enough to test

EBernhardson moved this task from Needs review to Waiting on the Discovery-Search (Current work) board.Feb 19 2019, 6:30 PM

Change 491231 merged by jenkins-bot:
[operations/mediawiki-config@master] [cirrus] reduce master timeout to 30s

https://gerrit.wikimedia.org/r/491231

Mentioned in SAL (#wikimedia-operations) [2019-02-20T00:05:32Z] <ebernhardson@deploy1001> Synchronized wmf-config/CirrusSearch-production.php: SWAT T215969 Return cirrussearch master timeout back to the default value (duration: 00m 57s)

Unfortunately, completion suggester build logs don't seem to make it into logstash, but the local copies in mwmaint1002.eqiad.wmnet:/var/log/mediawiki/cirrus-suggest suggest things went OK.

I used grep -L "Recycling index" *.eqiad.log *.eqiad.log-20190220 to get a list of logs that created new indexes. All 45 matching wikis had the cluster 'go green' in a single loop of the check. A look over the last few days of enwiki build logs suggests this was already normal. Basically nothing looks broken, so let's stick with 30s for now.
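For anyone unfamiliar with `grep -L`, it lists the files that do not contain the pattern, which is why it selects the logs where an index was created rather than recycled. A rough Python equivalent, using made-up log contents in a temporary directory (the real log lines are not reproduced here, and `files_without` is a hypothetical helper):

```python
import tempfile
from pathlib import Path


def files_without(needle, paths):
    """Return the names of files whose contents never contain `needle` (like `grep -L`)."""
    missing = []
    for path in paths:
        text = Path(path).read_text(errors="replace")
        if needle not in text:
            missing.append(Path(path).name)
    return sorted(missing)


# Demo with invented log files; only the "Recycling index" marker comes from the task.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "aawiki.eqiad.log").write_text("placeholder line\nanother placeholder line\n")
    (root / "bbwiki.eqiad.log").write_text("placeholder line\nRecycling index\n")
    created = files_without("Recycling index", root.glob("*.eqiad.log"))

print(created)
```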

EBernhardson moved this task from Waiting to Needs Reporting on the Discovery-Search (Current work) board.Feb 20 2019, 7:26 PM

The spikes on create_index are pretty extreme, with 194s for chi-eqiad-with-archive and 291s for omega-eqiad-with-archive. Is that just bad luck, or is something going on with the archives that makes this sometimes take much longer?

In T215969#4969983, @TJones wrote:

The spikes on create_index are pretty extreme, with 194s for chi-eqiad-with-archive and 291s for omega-eqiad-with-archive. Is that just bad luck, or is something going on with the archives that makes this sometimes take much longer?

As far as I could tell while the test was running, the big spikes were caused by the cluster entering a yellow state and staying there for an extended period. The Elasticsearch APIs were reporting no nodes available to assign shards to. This was happening because about half our cluster is unable to accept new shards due to disk space issues, and we require Elasticsearch to put the 3 copies of a shard into 3 different DC rows (of which there are only 4 to choose from). The end result is that every once in a while Elasticsearch decides it doesn't have anywhere it can assign a shard until it moves some things around.

The spikes do make me less confident in the final results, but I don't have any good way to resolve that problem for long enough to test. We could do something extreme like change the disk watermarks to 90% while running the test, but experience tells us that simply leads to Elasticsearch moving a bunch of shards onto the half of the cluster with more disk space. Additionally, after changing the watermarks we would have to wait around until Elasticsearch has rebalanced the cluster into a steady state.
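The row constraint described above can be pictured with a deliberately simplified model: a shard's 3 copies can only be placed if at least 3 distinct rows each contain a node below the disk watermark. This is a toy sketch, not how the Elasticsearch allocator actually decides; it ignores existing shard placement and rebalancing, and every node name, row label and disk percentage below is invented.

```python
def can_place_copies(nodes, copies=3, watermark=75.0):
    """Return True if at least `copies` distinct rows have a node below the watermark.

    nodes: iterable of (name, row, disk_used_pct) tuples.
    """
    eligible_rows = {row for _, row, used in nodes if used < watermark}
    return len(eligible_rows) >= copies


# Hypothetical cluster: 4 rows, half the nodes above a 75% watermark.
concentrated = [
    ("n1", "A", 60), ("n2", "A", 62), ("n3", "B", 65), ("n4", "B", 70),
    ("n5", "C", 78), ("n6", "C", 80), ("n7", "D", 77), ("n8", "D", 79),
]
spread = [
    ("n1", "A", 60), ("n2", "A", 78), ("n3", "B", 65), ("n4", "B", 80),
    ("n5", "C", 70), ("n6", "C", 79), ("n7", "D", 77), ("n8", "D", 81),
]

blocked = can_place_copies(concentrated)            # full nodes fill two whole rows
placeable = can_place_copies(spread)                # full nodes spread across rows
relaxed = can_place_copies(concentrated, watermark=90.0)
print(blocked, placeable, relaxed)
```

In this model the same fraction of full nodes either blocks allocation or not depending on how they fall across rows, which is consistent with the intermittent nature of the spikes described above.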

@EBernhardson, thanks for the explanation!

debt closed this task as Resolved.Feb 22 2019, 8:38 PM

	F28205539: TooManyShards-after-cluster-split.pdf
	Feb 13 2019, 1:00 AM
