Create a cookbook for memcached management
Open, Low, Public, Feature

Description

We need a cookbook to help us reboot/restart the memcached cluster.

Infrastructure Overview
  • Each DC contains 18 memcached hosts and 3 gutter pool hosts
  • Memcached cluster capacity: 2.4TB of available RAM
  • Gutter pool capacity: ~768GB
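To put these numbers in perspective, here is a rough back-of-the-envelope sketch. It assumes that the capacities above are per DC, that RAM is spread evenly across hosts, and that keys are distributed evenly; none of this is stated in the task, and the function names are purely illustrative.

```python
# Rough sizing sketch. Assumptions (not confirmed by the task): capacities
# are per DC, every host has the same amount of RAM, and keys are spread
# evenly across hosts.

def per_host_capacity(total_capacity: float, host_count: int) -> float:
    """Return the capacity of one host, in the same unit as total_capacity."""
    if host_count < 1:
        raise ValueError("host_count must be at least 1")
    return total_capacity / host_count


def cold_fraction(batch_size: int, host_count: int) -> float:
    """Return the share of the cluster's cache that is lost when
    batch_size hosts are restarted or rebooted at the same time."""
    if not 0 <= batch_size <= host_count:
        raise ValueError("batch_size must be between 0 and host_count")
    return batch_size / host_count


if __name__ == "__main__":
    print(f"main cluster, per host (TB): {per_host_capacity(2.4, 18):.3f}")
    print(f"gutter pool, per host (GB): {per_host_capacity(768, 3):.0f}")
    print(f"cache lost with a batch of 2: {cold_fraction(2, 18):.1%}")
```

Under these assumptions, a batch of 2 out of 18 hosts empties roughly one ninth of the main cluster's cache at once.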
Performance Impact
  • Both a daemon restart and a server reboot result in the loss of all cached data on that host
  • When a server goes offline, the gutter pool cluster replaces it until the host becomes available again
  • The gutter pool is always cold
  • MediaWiki works significantly harder to warm up a cold host or gutter pool
    • This translates to increased latency, additional database queries, etc.
Requirements

Create a cookbook that implements the following:

  • Accepts a range of servers or a single server
  • Batch size: restart/reboot no more than 2 hosts at a time
    • This can of course be configurable, with a warning whenever more than 2 hosts are rebooted/restarted at once
  • Warm-up period: after hosts are back online, wait for a specified duration to allow the cache to warm up before proceeding to the next batch
    • We could potentially monitor the cache hit ratio?
  • Ensure that a single run restarts hosts from either the main cluster or the gutter pool cluster, but never both
  • Operation modes: include separate flags for:
    • Restarting the daemon
    • Rebooting the server
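The requirements above could be sketched roughly as follows. This is a standalone illustration, not the cookbook in the linked patch: it does not use the Spicerack or Cumin APIs, and every name in it (`build_parser`, `cluster_of`, `batches`, `run_batches`, the flag names, the 600 second default) is hypothetical. The `hit_ratio` helper assumes the `get_hits` and `get_misses` counters from memcached's `stats` output have already been fetched into a dict.

```python
import argparse
import logging
import time
from typing import Callable, Dict, List, Sequence

MAX_SAFE_BATCH = 2  # from the requirements: no more than 2 hosts at a time
logger = logging.getLogger("memcached-cookbook-sketch")


def build_parser() -> argparse.ArgumentParser:
    """CLI with separate, mutually exclusive flags for the two modes."""
    parser = argparse.ArgumentParser(description="memcached restart/reboot sketch")
    parser.add_argument("hosts", nargs="+", help="one or more hostnames")
    parser.add_argument("--batch-size", type=int, default=MAX_SAFE_BATCH)
    # 600 is a placeholder, not a recommended warm-up duration.
    parser.add_argument("--warmup-seconds", type=int, default=600)
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--restart-daemon", action="store_true")
    mode.add_argument("--reboot", action="store_true")
    return parser


def cluster_of(hosts: Sequence[str], main: Sequence[str], gutter: Sequence[str]) -> str:
    """Return "main" or "gutter"; refuse empty, unknown, or mixed selections."""
    selected = set(hosts)
    if not selected:
        raise ValueError("no hosts given")
    unknown = selected - set(main) - set(gutter)
    if unknown:
        raise ValueError(f"unknown hosts: {sorted(unknown)}")
    in_main = bool(selected & set(main))
    in_gutter = bool(selected & set(gutter))
    if in_main and in_gutter:
        raise ValueError("refusing to mix main cluster and gutter pool hosts")
    return "main" if in_main else "gutter"


def batches(hosts: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split hosts into consecutive batches, warning above the safe size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if batch_size > MAX_SAFE_BATCH:
        logger.warning("batch size %d exceeds %d", batch_size, MAX_SAFE_BATCH)
    return [list(hosts[i:i + batch_size]) for i in range(0, len(hosts), batch_size)]


def hit_ratio(stats: Dict[str, int]) -> float:
    """Hit ratio from memcached counters; 0.0 if there were no gets yet."""
    total = stats["get_hits"] + stats["get_misses"]
    return stats["get_hits"] / total if total else 0.0


def run_batches(hosts: Sequence[str], action: Callable[[str], None],
                batch_size: int, warmup_seconds: float,
                sleep: Callable[[float], None] = time.sleep) -> None:
    """Apply action to each batch, waiting for warm-up between batches."""
    groups = batches(hosts, batch_size)
    for index, group in enumerate(groups):
        for host in group:
            action(host)
        if index < len(groups) - 1:  # no need to wait after the last batch
            sleep(warmup_seconds)
```

Because memcached's counters reset when the daemon restarts, a ratio computed this way on a freshly restarted host covers only the time since the restart, which is what a warm-up check would want. A fixed sleep is the simplest option; polling `hit_ratio` until it crosses a threshold would be a possible refinement.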

Event Timeline

Change #1211066 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] cumin: add aliases for memcached-gutter hosts

https://gerrit.wikimedia.org/r/1211066

Change #1211089 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/cookbooks@master] memcached: add memcached restart/reboot cookbook

https://gerrit.wikimedia.org/r/1211089

Change #1211066 merged by Effie Mouzeli:

[operations/puppet@production] cumin: add aliases for memcached-gutter hosts

https://gerrit.wikimedia.org/r/1211066

MLechvien-WMF changed the subtype of this task from "Task" to "Feature Request".
MLechvien-WMF subscribed.

@jijiki in which situation would that cookbook be useful?

@jijiki I'm assuming this is not critical for recurring operations or switchover, but please change priority if you disagree.