
Upgrade and improve our application object caching service (memcached)
Open, Medium, Public

Description

Overview
Our object caching service is based on mcrouter and memcached. Mcrouter is a memcached protocol router used to scale memcached. Currently, each MediaWiki server runs its own mcrouter instance, and every instance is configured with the same pool of memcached servers.

Current Issues

  • When a shard becomes unavailable, we get TKOs, which cause latency problems (T203786, T208934, T239983)
  • All memcached servers are on Debian Jessie, whose LTS support ends in June 2020
  • Redis is co-located on the same set of servers

To address the above issues, we will first introduce a secondary pool of memcached servers, called gutter servers, that can temporarily stand in for any unavailable server. Mcrouter will provide this functionality. Once the gutter servers are in place and failover works, we can proceed with a rolling upgrade of all memcached servers to Debian Buster. Since there have been no major changes to the memcached protocol, we do not expect any major issues.
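As an illustration of the gutter pool idea, here is a minimal sketch that generates an mcrouter-style configuration in which requests go to the primary pool first and fall back to the gutter pool on error. It assumes mcrouter's `FailoverRoute` route handle and the `PoolRoute|<name>` shorthand as described in the mcrouter documentation. The pool names, server addresses, and the `build_gutter_config` helper are hypothetical, and this is not the configuration produced by the puppet changes on this task.

```python
import json


def build_gutter_config(primary_servers, gutter_servers):
    """Return an mcrouter-style config dict with a gutter failover route.

    Pool names ("primary", "gutter") are illustrative assumptions.
    """
    return {
        "pools": {
            "primary": {"servers": list(primary_servers)},
            "gutter": {"servers": list(gutter_servers)},
        },
        "route": {
            # FailoverRoute tries its children in order: the primary pool
            # first, then the gutter pool if the primary returns an error.
            "type": "FailoverRoute",
            "children": ["PoolRoute|primary", "PoolRoute|gutter"],
        },
    }


# Hypothetical addresses, for illustration only.
config = build_gutter_config(
    ["10.0.0.1:11211", "10.0.0.2:11211"],
    ["10.0.1.1:11211"],
)
print(json.dumps(config, indent=2))
```

In a real deployment it would probably make sense to limit how long items live in the gutter pool, so that stale data does not linger once the primary shard returns. The exact route handle and parameters for that should be checked against the mcrouter documentation.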

Another thing to take into account is the mcrouter proxies. We have 4 mw servers in each DC that are used to replicate specific keys (dictated by MediaWiki) from one datacenter to the other. We want to test the gutter pool functionality at the proxy level, i.e. define a secondary set of mcrouter proxy servers in each DC, to which mcrouter will fail over when a primary proxy is unavailable.
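The proxy-level failover described above can be sketched as a small simulation: a key is mapped to one primary proxy, and if that proxy is down, the request falls back to a secondary set. This is a simplified model under stated assumptions, not mcrouter's actual behavior. The modulo hashing stands in for mcrouter's real hash function, and the host names and the `route_key` helper are hypothetical.

```python
import hashlib


def pick_index(key, n):
    """Map a key to an index in [0, n) using a stable hash (simplified)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n


def route_key(key, primaries, secondaries, is_up):
    """Return the proxy that should handle `key`, or None if none is up."""
    primary = primaries[pick_index(key, len(primaries))]
    if is_up(primary):
        return primary
    # The primary proxy is unavailable: try the secondary set, starting
    # from the hashed position so load spreads across secondaries.
    start = pick_index(key, len(secondaries))
    for offset in range(len(secondaries)):
        candidate = secondaries[(start + offset) % len(secondaries)]
        if is_up(candidate):
            return candidate
    return None


# Hypothetical host names, for illustration only.
primaries = ["proxy-a", "proxy-b", "proxy-c", "proxy-d"]
secondaries = ["gutter-proxy-a", "gutter-proxy-b"]

down = set()
healthy_choice = route_key("example-key", primaries, secondaries,
                           lambda host: host not in down)

# Mark the chosen primary as unavailable and route the same key again.
down.add(healthy_choice)
failover_choice = route_key("example-key", primaries, secondaries,
                            lambda host: host not in down)
print(healthy_choice, "->", failover_choice)
```

Only keys that hash to the failed proxy move to the secondary set. Keys owned by healthy proxies are unaffected.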

Lastly, developers are already working on completely retiring the use of Redis in MediaWiki, so there will be no need to worry about upgrading it. (TBA links to related tasks)

Action Plan

  • Test gutter pool servers in beta
  • Test new memcached settings in beta
  • Image 6 new gutter servers (3 in eqiad, 3 in codfw)
  • Make relevant puppet changes to get gutter pool metrics
  • Make relevant puppet changes to support memcached on Debian Buster
  • Test gutter pool in production (mwdebug*)
  • Test proxy gutter pool in eqiad and/or codfw
  • Make relevant puppet changes to support the gutter pool configuration
  • Enable and test the gutter pool in canaries
  • Test memcached 1.5.x (buster) in canaries
  • Enable and test the gutter pool in production
  • Test onhost memcached
  • Deploy onhost memcached
  • Rolling upgrade to buster in secondary datacenter (codfw)
  • Rolling upgrade to buster in primary datacenter (eqiad)
  • Upgrade memcached to version 1.6.x
  • Enable TLS

Related tasks: T203786

Reads:

Related Objects

Status      Assigned
Open        None
Resolved    elukey
Invalid     Jclark-ctr
Resolved    elukey
Resolved    aaron
Resolved    elukey
Resolved    jijiki
Resolved    aaron
Resolved    None
Open        Krinkle
Resolved    jijiki
Declined    aaron
Declined    None
Resolved    Krinkle
Resolved    Joe
Open        jijiki
Open        None
Declined    None
Open        None
Resolved    elukey
Resolved    aaron
Resolved    jijiki
Resolved    jijiki
Resolved    elukey
Resolved    elukey
Resolved    jbond
Resolved    jijiki
Resolved    jijiki
Resolved    hashar
Resolved    jijiki
Open        None

Event Timeline

jijiki triaged this task as Medium priority. (Feb 11 2020, 12:57 PM)
jijiki added projects: serviceops, SRE.
jijiki renamed this task from Upgrade and improve our application object caching service (memcachedd) to Upgrade and improve our application object caching service (memcached). (Feb 11 2020, 1:51 PM)
RobH closed subtask Unknown Object (Task) as Resolved. (Feb 18 2020, 10:35 PM)

Change 592519 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mcrouter: enable failover route for on all canaries

https://gerrit.wikimedia.org/r/592519

Change 592520 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mcrouter: enable the gutter pool everywhere.

https://gerrit.wikimedia.org/r/592520

Change 592519 merged by Giuseppe Lavagetto:
[operations/puppet@production] mcrouter: enable failover route for on all canaries

https://gerrit.wikimedia.org/r/592519

Change 592520 merged by Giuseppe Lavagetto:
[operations/puppet@production] mcrouter: enable the gutter pool everywhere.

https://gerrit.wikimedia.org/r/592520

Change 594239 had a related patch set uploaded (by RLazarus; owner: RLazarus):
[operations/puppet@production] mediawiki: Clean up $use_gutter now that it's true everywhere.

https://gerrit.wikimedia.org/r/594239

Change 594239 merged by RLazarus:
[operations/puppet@production] mcrouter_wancache: Clean up $use_gutter now that it's true everywhere.

https://gerrit.wikimedia.org/r/594239

Krinkle closed subtask Restricted Task as Resolved. (Feb 5 2021, 10:24 PM)