Consider sharding big logging indices
Open, Stalled, Needs Triage, Public

Description

While thinking about the parent task I realized that we now have quite a few big indices that could use sharding, both to improve write throughput and to give the cluster smaller chunks to move around and allocate. From the tables below I'd say logstash-k8s- (daily) and ecs-default (weekly) would each benefit from three shards.

by primary store size

logging-sd1003:~$ curl -s 'localhost:9200/_cat/indices?v&s=pri.store.size:desc&h=index,docs.count,store.size,pri.store.size' | head -40
index                                     docs.count store.size pri.store.size
logstash-k8s-1-7.0.0-1-2025.03.20          353068203    756.5gb        378.2gb
logstash-k8s-1-7.0.0-1-2025.03.22          332984033    719.2gb        359.6gb
logstash-k8s-1-7.0.0-1-2025.03.23          322154768    710.9gb        355.2gb
logstash-k8s-1-7.0.0-1-2025.03.26          316145796    704.2gb        350.1gb
logstash-k8s-1-7.0.0-1-2025.03.21          317901461      689gb        346.9gb
logstash-k8s-1-7.0.0-1-2025.03.24          308202593    685.8gb        342.9gb
logstash-k8s-1-7.0.0-1-2025.03.25          306283460    674.3gb        336.9gb
logstash-k8s-1-7.0.0-1-2025.03.27          295521776    671.8gb        334.7gb
logstash-k8s-1-7.0.0-1-2025.03.19          300233118    662.4gb        331.9gb
ecs-default-1-1.11.0-7-2025.07             461966797    619.7gb        309.8gb
ecs-default-1-1.11.0-7-2025.13             514661974    914.2gb        304.8gb
ecs-default-1-1.11.0-7-2025.14             464281740    886.5gb        295.4gb
logstash-k8s-1-7.0.0-1-2025.03.18          249784956    583.5gb        291.7gb
ecs-default-1-1.11.0-7-2025.03             450099232    580.2gb        290.1gb
ecs-default-1-1.11.0-7-2025.12             431112217    869.8gb        290.1gb
ecs-default-1-1.11.0-7-2025.04             440656566      580gb          290gb
ecs-default-1-1.11.0-7-2025.11             422901061    572.8gb        286.4gb
logstash-k8s-1-7.0.0-1-2025.03.17          248211902    567.2gb        283.6gb
logstash-k8s-1-7.0.0-1-2025.03.14          266114632    565.7gb        282.8gb
logstash-k8s-1-7.0.0-1-2025.03.13          257479957    559.6gb          281gb
logstash-k8s-1-7.0.0-1-2025.03.16          247241754    556.6gb        278.3gb
ecs-default-1-1.11.0-7-2025.08             434139750    553.4gb        276.8gb
ecs-default-1-1.11.0-7-2025.10             411028271    553.1gb        276.5gb
ecs-default-1-1.11.0-7-2025.09             417823916      548gb        273.9gb
ecs-default-1-1.11.0-7-2025.05             415520704    543.3gb        271.6gb
ecs-default-1-1.11.0-7-2025.06             415670882    543.2gb        271.6gb
logstash-k8s-1-7.0.0-1-2025.03.15          237617113    540.5gb        270.5gb
logstash-k8s-1-7.0.0-1-2025.03.12          226569749      516gb          258gb
logstash-k8s-1-7.0.0-1-2025.03.11          216317415    495.9gb        247.9gb
logstash-k8s-1-7.0.0-1-2025.03.28          188622366    492.5gb        246.6gb
logstash-k8s-1-7.0.0-1-2025.03.10          214195345    492.6gb        246.3gb
logstash-k8s-1-7.0.0-1-2025.02.27          221387590    487.8gb        243.9gb
logstash-k8s-1-7.0.0-1-2025.03.29          180586474    474.7gb        238.5gb
logstash-k8s-1-7.0.0-1-2025.03.31          194919777    470.4gb          236gb
logstash-k8s-1-7.0.0-1-2025.03.04          212468794      467gb        233.5gb
logstash-k8s-1-7.0.0-1-2025.03.01          199220492    463.6gb        231.8gb
logstash-mediawiki-1-7.0.0-1-2025.03.12    350244105    683.1gb        227.7gb
logstash-k8s-1-7.0.0-1-2025.04.02          169500007    673.5gb        225.1gb
logstash-k8s-1-7.0.0-1-2025.03.06          190826564    445.2gb        222.6gb

by docs count

logging-sd1003:~$ curl -s 'localhost:9200/_cat/indices?v&s=docs.count:desc&h=index,docs.count,store.size,pri.store.size' | head -40
index                                     docs.count store.size pri.store.size
ecs-default-1-1.11.0-7-2025.13             514661974    914.2gb        304.8gb
ecs-default-1-1.11.0-7-2025.14             464281740    886.5gb        295.4gb
ecs-default-1-1.11.0-7-2025.07             461966797    619.7gb        309.8gb
ecs-default-1-1.11.0-7-2025.03             450099232    580.2gb        290.1gb
ecs-default-1-1.11.0-7-2025.04             440656566      580gb          290gb
ecs-default-1-1.11.0-7-2025.08             434139750    553.4gb        276.8gb
ecs-default-1-1.11.0-7-2025.12             431112217    869.8gb        290.1gb
ecs-default-1-1.11.0-7-2025.11             422901061    572.8gb        286.4gb
ecs-default-1-1.11.0-7-2025.09             417823916      548gb        273.9gb
ecs-default-1-1.11.0-7-2025.06             415670882    543.2gb        271.6gb
ecs-default-1-1.11.0-7-2025.05             415520704    543.3gb        271.6gb
ecs-default-1-1.11.0-7-2025.10             411028271    553.1gb        276.5gb
logstash-k8s-1-7.0.0-1-2025.03.20          353068203    756.5gb        378.2gb
logstash-mediawiki-1-7.0.0-1-2025.03.12    350244105    683.1gb        227.7gb
logstash-k8s-1-7.0.0-1-2025.03.22          332984033    719.2gb        359.6gb
logstash-k8s-1-7.0.0-1-2025.03.23          322154768    710.9gb        355.2gb
logstash-k8s-1-7.0.0-1-2025.03.21          317901461      689gb        346.9gb
logstash-k8s-1-7.0.0-1-2025.03.26          316145796    704.2gb        350.1gb
logstash-k8s-1-7.0.0-1-2025.03.24          308202593    685.8gb        342.9gb
logstash-k8s-1-7.0.0-1-2025.03.25          306283460    674.3gb        336.9gb
logstash-k8s-1-7.0.0-1-2025.03.19          300233118    662.4gb        331.9gb
logstash-k8s-1-7.0.0-1-2025.03.27          295521776    671.8gb        334.7gb
ecs-default-1-1.11.0-7-2025.15             271865234    548.7gb        185.7gb
logstash-k8s-1-7.0.0-1-2025.03.14          266114632    565.7gb        282.8gb
logstash-k8s-1-7.0.0-1-2025.03.13          257479957    559.6gb          281gb
logstash-k8s-1-7.0.0-1-2025.03.18          249784956    583.5gb        291.7gb
logstash-k8s-1-7.0.0-1-2025.03.17          248211902    567.2gb        283.6gb
logstash-k8s-1-7.0.0-1-2025.03.16          247241754    556.6gb        278.3gb
logstash-k8s-1-7.0.0-1-2025.03.15          237617113    540.5gb        270.5gb
ecs-alerts-2-1.7.0-5-2022                  231750002     93.6gb         46.8gb
ecs-k8s-1-1.11.0-7-2025.03                 228632290     74.7gb         37.3gb
logstash-k8s-1-7.0.0-1-2025.03.12          226569749      516gb          258gb
logstash-mediawiki-1-7.0.0-1-2025.03.13    222618910    496.1gb        165.7gb
logstash-k8s-1-7.0.0-1-2025.02.27          221387590    487.8gb        243.9gb
logstash-k8s-1-7.0.0-1-2025.03.11          216317415    495.9gb        247.9gb
logstash-k8s-1-7.0.0-1-2025.03.10          214195345    492.6gb        246.3gb
logstash-k8s-1-7.0.0-1-2025.03.04          212468794      467gb        233.5gb
logstash-syslog-1-7.0.0-1-2025.02.28       210954109      160gb           80gb
logstash-syslog-1-7.0.0-1-2025.04.01       209262871    164.8gb         55.1gb
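For a rough sense of what three shards would mean for chunk size, here is a minimal Python sketch that parses `_cat/indices` output like the tables above and divides each primary store size by a proposed shard count. It assumes each index currently has a single primary shard and that documents would spread evenly across the new shards; neither is confirmed by the output above. The function names are illustrative only.

```python
import re

# Binary unit multipliers, as used in the size columns of the _cat APIs.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}


def parse_size(text):
    """Convert a _cat size string such as '378.2gb' to bytes."""
    match = re.fullmatch(r"([\d.]+)(b|kb|mb|gb|tb)", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def per_shard_gb(cat_output, shards):
    """Map index name -> estimated primary size per shard, in GiB.

    Assumes the four columns requested above:
    index, docs.count, store.size, pri.store.size.
    """
    result = {}
    lines = cat_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        index, _docs, _store, pri = line.split()
        result[index] = parse_size(pri) / shards / 1024**3
    return result


# Two rows copied from the table sorted by primary store size.
SAMPLE = """\
index                                     docs.count store.size pri.store.size
logstash-k8s-1-7.0.0-1-2025.03.20          353068203    756.5gb        378.2gb
ecs-default-1-1.11.0-7-2025.07             461966797    619.7gb        309.8gb
"""

if __name__ == "__main__":
    for name, size in per_shard_gb(SAMPLE, shards=3).items():
        print(f"{name}: ~{size:.1f} GiB per primary shard")
```

Under those assumptions the largest daily logstash-k8s index would drop from one chunk of roughly 378 GiB to three chunks of roughly 126 GiB each.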

Event Timeline

Adding shards is worth a try IMO. This got me thinking: which metrics should we monitor to know how much of an improvement sharding makes?

I took a first stab at baseline metrics with a new performance row in https://grafana-rw.wikimedia.org/d/oXH_v3rWk/logstash-cluster-health that graphs indexing latency, segment counts, merge pressure, thread pool/queue, etc. What other performance metrics should we include and measure against as we tune?
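As one way to compare before and after, here is a minimal Python sketch that derives two of those numbers from a pair of index stats snapshots. It is not how the dashboard computes them. It assumes each snapshot is a dict shaped like the `total` section of the index stats response, with the cumulative counters `indexing.index_total`, `indexing.index_time_in_millis` and `merges.total_time_in_millis`. The sample values are made up for illustration.

```python
def indexing_latency_ms(before, after):
    """Average milliseconds spent per indexed document between two snapshots."""
    docs = after["indexing"]["index_total"] - before["indexing"]["index_total"]
    millis = (after["indexing"]["index_time_in_millis"]
              - before["indexing"]["index_time_in_millis"])
    return millis / docs if docs > 0 else 0.0


def merge_time_share(before, after, interval_ms):
    """Merge time accumulated during the interval, as a fraction of the interval.

    The value can exceed 1.0 when several merges run in parallel.
    """
    merged = (after["merges"]["total_time_in_millis"]
              - before["merges"]["total_time_in_millis"])
    return merged / interval_ms


# Made-up snapshots taken 60 seconds apart (illustrative values only).
BEFORE = {
    "indexing": {"index_total": 1_000_000, "index_time_in_millis": 200_000},
    "merges": {"total_time_in_millis": 50_000},
}
AFTER = {
    "indexing": {"index_total": 1_600_000, "index_time_in_millis": 350_000},
    "merges": {"total_time_in_millis": 80_000},
}

if __name__ == "__main__":
    print("indexing latency (ms/doc):", indexing_latency_ms(BEFORE, AFTER))
    print("merge time share:", merge_time_share(BEFORE, AFTER, interval_ms=60_000))
```

Because the counters are cumulative, working from deltas between two snapshots keeps the comparison tied to the measurement window rather than to the lifetime of the index.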

Change #1138754 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] logstash: bump shards for logstash-k8s

https://gerrit.wikimedia.org/r/1138754
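For context, a shard bump of this kind usually comes down to one index template setting. The Python sketch below builds such a template body. It is not the content of the change above: the `logstash-k8s-*` pattern is an assumption based on the index names in the description, and the function name is hypothetical.

```python
import json


def shard_template(pattern, shards):
    """Build a template body that sets the primary shard count for new indices."""
    if shards < 1:
        raise ValueError("an index needs at least one primary shard")
    return {
        "index_patterns": [pattern],  # assumed pattern, not taken from the patch
        "settings": {"index": {"number_of_shards": shards}},
    }


if __name__ == "__main__":
    print(json.dumps(shard_template("logstash-k8s-*", 3), indent=2))
```

`number_of_shards` is fixed when an index is created, so a template change like this would only affect indices created afterwards; existing daily indices would keep their current shard count.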

Another option is partitioning the data into more indexes to reduce index size.

As of at least 10 years ago, the optimal strategy for indexing speed was a single shard per index (archive link).

If we choose to begin sharding, let's put together some benchmarks. The cluster isn't broken right now, but reducing overall index size makes the user experience better and faster.

> Another option is partitioning the data into more indexes to reduce index size.

Linking to the related partitioning idea in T392230#10769053.

Change #1138754 abandoned by Filippo Giunchedi:

[operations/puppet@production] logstash: bump shards for logstash-k8s

Reason:

Per comment

https://gerrit.wikimedia.org/r/1138754

fgiunchedi changed the task status from Open to Stalled. May 6 2025, 12:42 PM

Thank you for the feedback. Since the incident a couple of weeks back we have been fine, so I'm stalling the task for now. We can revisit sharding, in addition to the current partitioning, if it becomes a problem.