
Evaluate reducing shard counts for smaller wikis
Closed, Declined · Public

Description

We think that some of the slowness of the Elasticsearch master server is due to the number of indices and the number of shards in the cluster. We should write a script that parses the output of Elasticsearch's /_cat/indices API and determines which indices could get by with fewer total shards. In some of our documentation Nik suggested that ~2GB is a good size for a shard. We should also re-evaluate this number; we don't know that it's wrong, but we also don't know that it's right. There are tradeoffs in both directions.
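For concreteness, the ~2GB guideline maps a primary store size to a target shard count like this (a hypothetical sketch, not anything that exists in CirrusSearch):

```php
<?php
// Hypothetical helper: map a primary store size to a suggested primary
// shard count under a configurable per-shard cap (default ~2GB).
function suggestedShardCount( $priStoreBytes, $capBytes = 2147483648 ) {
	return max( 1, (int)ceil( $priStoreBytes / $capBytes ) );
}

echo suggestedShardCount( 7 * 1024 * 1024 * 1024 ), "\n"; // 7GB index -> 4 shards
echo suggestedShardCount( 500 * 1024 * 1024 ), "\n";      // 500MB index -> 1 shard
```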

This fits in with our Q3 goal of evaluating the current Elasticsearch configuration and optimizing it as appropriate.

Event Timeline

EBernhardson raised the priority of this task from to Needs Triage.
EBernhardson updated the task description.
EBernhardson added a project: CirrusSearch.
EBernhardson added subscribers: EBernhardson, dcausse.
Restricted Application added subscribers: StudiesWorld, Aklapper.

I took a first stab at this with P2509. To run it: `curl -s elastic1001:9200/_cat/indices?bytes=b | php parse_es_indices.php`

Using max shard sizes of 500MB for titlesuggest and 2GB for everything else, most of the gain would come from the titlesuggest indices. We can remove 174 shards, which is only 2% of the total shard count. It's something, but not amazing.

It turns out all of the normal indices except fiwiki_general are fine. fiwiki_general could be reduced from 10 shards to 3 (roughly 4.2GB of primary data, so about 1.4GB per shard). Numerous titlesuggest indices can still be shrunk.
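For reference, a minimal sketch of what a P2509-style parser might look like (the paste itself isn't reproduced here; column positions assume the default _cat/indices output of health, status, index, pri, rep, docs.count, docs.deleted, store.size, pri.store.size):

```php
<?php
// Sketch: read `_cat/indices?bytes=b` from stdin and report indices that
// could get by with fewer primary shards under the caps discussed above.
$capDefault = 2 * 1024 * 1024 * 1024; // 2GB for normal indices
$capSuggest = 500 * 1024 * 1024;      // 500MB for titlesuggest indices

$removable = 0;
while ( ( $line = fgets( STDIN ) ) !== false ) {
	$cols = preg_split( '/\s+/', trim( $line ) );
	if ( count( $cols ) < 9 ) {
		continue; // skip blank or malformed lines
	}
	$index = $cols[2];
	$pri = (int)$cols[3];
	$rep = (int)$cols[4];
	$priBytes = (int)$cols[8]; // pri.store.size, in bytes thanks to ?bytes=b
	$cap = strpos( $index, 'titlesuggest' ) !== false ? $capSuggest : $capDefault;
	$needed = max( 1, (int)ceil( $priBytes / $cap ) );
	if ( $needed < $pri ) {
		// Total shards freed includes replica copies of each dropped primary.
		$freed = ( $pri - $needed ) * ( 1 + $rep );
		$removable += $freed;
		printf( "%s: %d -> %d primaries (%d total shards freed)\n",
			$index, $pri, $needed, $freed );
	}
}
printf( "Total shards removable: %d\n", $removable );
```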

Change 261287 had a related patch set uploaded (by EBernhardson):
Adjust cirrus titlesuggest index shard counts

https://gerrit.wikimedia.org/r/261287
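I haven't reproduced the change here, but the knob it adjusts is CirrusSearch's per-index-type primary shard count, which (as I understand it) is exposed as $wgCirrusSearchShardCount; something of this shape, with the values here being illustrative only:

```php
<?php
// Illustrative only; the actual values live in the change linked above.
$wgCirrusSearchShardCount = [
	'content' => 4,       // made-up value
	'general' => 4,       // made-up value
	'titlesuggest' => 1,  // small wikis rarely need more than one suggest shard
];
```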

Something else to consider: shared indices for projects of the same language. This wouldn't make sense for everything; large indices like enwiki and dewiki should almost certainly keep their own. But what about the smaller wikis? The added complexity might not be worth the maintenance burden, though.

Additionally, the completion suggester would probably need to stay at one index per wiki.
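To make the idea concrete: a shared per-language index would need every query scoped to the originating project, e.g. with a term filter on a per-document wiki field. Both the index layout and the field below are assumptions, not current Cirrus behaviour (ES 1.x filtered-query syntax):

```php
<?php
// Hypothetical: scoping a search against a shared 'fr_general' index to a
// single project via a term filter on an assumed 'wiki' field.
$query = [
	'query' => [
		'filtered' => [ // ES 1.x syntax
			'query' => [ 'match' => [ 'text' => 'recherche' ] ],
			'filter' => [ 'term' => [ 'wiki' => 'frwikibooks' ] ],
		],
	],
];
// Would be POSTed to something like elastic1001:9200/fr_general/_search
echo json_encode( $query, JSON_PRETTY_PRINT ), "\n";
```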

Attempted to estimate combining projects of the same language with P2510.

Capping the combined project size at 2GB (one shard), we could reduce the total shard count across the cluster (including replicas) by 1761 shards (19.4%). This still keeps content and general indices separate.
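A rough sketch of how a P2510-style estimate could be computed (again, not the actual paste; the language extraction below is a stand-in for real project-name parsing):

```php
<?php
// Sketch: group general indices by language, sum their primary bytes, and
// count how many 2GB shards each combined index would need.
$capBytes = 2 * 1024 * 1024 * 1024;
$byLang = [];
while ( ( $line = fgets( STDIN ) ) !== false ) {
	$cols = preg_split( '/\s+/', trim( $line ) );
	if ( count( $cols ) < 9 || strpos( $cols[2], '_general' ) === false ) {
		continue; // only general indices; content stays separate
	}
	// Stand-in language extraction: 'frwikibooks_general_...' -> 'fr'.
	if ( preg_match( '/^([a-z]+?)wik/', $cols[2], $m ) ) {
		$lang = $m[1];
	} else {
		$lang = 'other';
	}
	if ( !isset( $byLang[$lang] ) ) {
		$byLang[$lang] = 0;
	}
	$byLang[$lang] += (int)$cols[8]; // pri.store.size in bytes
}
foreach ( $byLang as $lang => $bytes ) {
	$shards = max( 1, (int)ceil( $bytes / $capBytes ) );
	printf( "%s: %.2fGB combined -> %d shard(s)\n",
		$lang, $bytes / ( 1024 * 1024 * 1024 ), $shards );
}
```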

Deskana moved this task from Inbox to Technical on the CirrusSearch board.
Deskana subscribed.

I pulled this into the sprint and put it in "Needs review", since it has a patch assigned to it. Judging by the title of this task, that may or may not have been the correct action. Feel free to undo that if it's wrong.

Change 261287 merged by jenkins-bot:
Adjust cirrus titlesuggest index shard counts

https://gerrit.wikimedia.org/r/261287

posted to wrong ticket...

Deskana claimed this task.

There were some ideas about improving things here by putting a bunch of different projects into the same index, e.g. having French Wikibooks, Wiktionary, Wikisource, and so on share one index. This would have no user-facing changes at all, but would help with the technical issue. However, it would probably require enough added complexity in CirrusSearch that it wouldn't be worth the effort.

Given that, @EBernhardson and I decided to decline this task. We can always reopen in the future if we do decide to work on this.