Reduce replica count from 2 to 1 for indices that are >21 days old
Closed, Declined
Public

Description

If daily traffic grows to the point where we frequently hit 90-95% disk utilization, I would suggest that the first tuning step be to selectively reduce the replica count from 2 to 1. Right now each index is present on every data host in the cluster. If we scaled that back so that the data for a given index was present on only 2/3 of the cluster, we would still be fairly robust against individual node failure and would regain quite a bit of disk on each node.
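As a rough illustration of the disk arithmetic, here is a minimal sketch. It assumes three data nodes (inferred from "2 replicas, present on every host" and "2/3 of the cluster"; the task does not state the node count) and an even spread of shards. The function name `per_node_share` is hypothetical.

```python
from fractions import Fraction


def per_node_share(replicas: int, data_nodes: int) -> Fraction:
    """Average fraction of one index's data held by each data node.

    Assumes shards are spread evenly and that no node holds two
    copies of the same shard.
    """
    copies = replicas + 1  # one primary plus its replicas
    if copies > data_nodes:
        raise ValueError("more copies than data nodes")
    return Fraction(copies, data_nodes)


# Assumption: a 3-node cluster, as implied by the description.
before = per_node_share(replicas=2, data_nodes=3)  # every node holds the whole index
after = per_node_share(replicas=1, data_nodes=3)   # each node holds 2/3 of it on average
saved = 1 - after / before                         # disk regained for that index

print(before, after, saved)
```

Under these assumptions, each index switched from 2 replicas to 1 would free about a third of the disk it currently occupies on each node.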

I would suggest phasing in this decrease in redundancy slowly, as space is needed. For the first pass we could drop the replica count on days N-22 to N-30. When we outgrow that, we could start dropping it sooner and sooner, until we keep a full set of copies only for days N and N-1. That would give us quite a bit of headroom on the current hardware and should easily carry us until we can budget for adding nodes/disk to the cluster in the next fiscal year.
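The first pass of that schedule could be sketched roughly as follows. This is not the content of the abandoned patch. It assumes daily indices named `logstash-YYYY.MM.DD` (Logstash's default naming, not confirmed by this task), and the helper names are hypothetical. It only builds the Elasticsearch update-settings requests (`PUT /<index>/_settings`) and does not send them.

```python
import json
from datetime import date, timedelta


def indices_to_reduce(today, min_age_days=22, max_age_days=30,
                      prefix="logstash-"):
    """Names of the daily indices for days N-22 through N-30.

    Assumption: one index per day, named <prefix>YYYY.MM.DD.
    Lowering min_age_days over time phases the reduction in sooner.
    """
    names = []
    for age in range(min_age_days, max_age_days + 1):
        day = today - timedelta(days=age)
        names.append(prefix + day.strftime("%Y.%m.%d"))
    return names


def settings_request(index, replicas=1):
    """Build (method, path, body) for an update-index-settings call."""
    body = json.dumps({"index": {"number_of_replicas": replicas}})
    return ("PUT", "/" + index + "/_settings", body)


# Example with an arbitrary date chosen for illustration.
targets = indices_to_reduce(date(2015, 11, 1))
requests_to_send = [settings_request(name) for name in targets]
for method, path, body in requests_to_send:
    print(method, path, body)
```

Keeping the age window as parameters means that later phases only need a smaller `min_age_days`, down to 2 if full redundancy is kept only for days N and N-1.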

Event Timeline

bd808 claimed this task.
bd808 raised the priority of this task from to Medium.
bd808 updated the task description.
bd808 added subscribers: bd808, EBernhardson, Aklapper and 2 others.

Change 250501 had a related patch set uploaded (by BryanDavis):
logstash: Drop replica count to 1 after 21 days

https://gerrit.wikimedia.org/r/250501

Change 250501 abandoned by BryanDavis:
logstash: Drop replica count to 1 after 21 days

Reason:
Dropping runJobs info messages makes this unnecessary

https://gerrit.wikimedia.org/r/250501

Unneeded after dropping the runJobs MediaWiki logging channel to warning level.