Deprecate the usage of nutcracker for memcached
Closed, Resolved · Public

Description

This task was originally meant only to remove nutcracker from MediaWiki's config, but why not be bold and expand its scope?

Nutcracker (for memcached) is currently used in the following:

  • labswiki/wikitech (there is a plan to fold labswiki into the MW appservers sometime during the 2019/2020 FY)
  • Thumbor (T221081)
  • MediaWiki appservers

The configuration on app/api appservers for memcached has been removed with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/510697.

The idea for this task is to migrate tools that use nutcracker to mcrouter (if possible), to avoid maintaining two code bases in the long run (nutcracker's is basically abandoned).

Event Timeline

elukey triaged this task as Normal priority. Jan 21 2019, 7:40 AM
elukey created this task.
Restricted Application added a subscriber: Aklapper. Jan 21 2019, 7:40 AM

(Tagging Performance-Team to track Aaron's implicit involvement through CC, as this appears implicitly blocked on approval or a recommendation for how to phase out the remaining uses of the nutcracker service.)

The main one I know about is Wikitech (labswiki), which still uses Nutcracker (instead of Mcrouter) because Wikitech's MW server provisioning is separate from the main MW servers. As a result, it doesn't yet have Mcrouter installed.

This morning, while looking at logstash, I found something that seems related to enwiki and mentions nutcracker's unix socket:

https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.01.22/mediawiki?id=AWh00iPPzpjgITg6elhH&_g=h@66534ad

kchapman moved this task from Inbox to Doing on the Performance-Team board. Jan 22 2019, 9:10 PM
Krinkle assigned this task to aaron. Jan 22 2019, 9:11 PM
Joe added a comment. Jan 23 2019, 8:24 AM

We still have some usages of nutcracker for memcached, specifically within $wgObjectCaches['mysql-multiwrite'], which is in turn used for the ParserCache.
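
To illustrate how a multi-write cache entry keeps every one of its backends in the read and write path, here is a toy Python model of a tiered cache: writes go to every tier, and reads stop at the first tier that has the key. This is a sketch of the general idea only, not MediaWiki's implementation; the class names, the dict-backed tiers, and the backfill-on-read behaviour are all assumptions made for illustration.

```python
class DictCache:
    """Hypothetical stand-in for one cache backend (memcached, a MySQL table, ...)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value


class MultiWriteCache:
    """Toy tiered cache: tiers are ordered from fastest to slowest."""

    def __init__(self, tiers):
        self.tiers = list(tiers)

    def set(self, key, value):
        # Every write is replicated to all tiers.
        for tier in self.tiers:
            tier.set(key, value)

    def get(self, key):
        # Read tiers in order and return the first hit.
        for index, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                # Assumption of this sketch: copy the value back into
                # the faster tiers that missed.
                for upper in self.tiers[:index]:
                    upper.set(key, value)
                return value
        return None
```

In this model, swapping the first tier for a different backend changes where reads and writes land without touching the callers, which is roughly the shape of the parser cache change discussed here.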

Change 486134 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/mediawiki-config@master] Switch parser cache tier 1 to mcrouter

https://gerrit.wikimedia.org/r/486134

Change 486134 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch parser cache top tier backing cache to mcrouter

https://gerrit.wikimedia.org/r/486134

Change 486558 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/mediawiki-config@master] Make labs just use the "mcrouter" object cache

https://gerrit.wikimedia.org/r/486558

Change 486558 merged by jenkins-bot:
[operations/mediawiki-config@master] Make labs just use the "mcrouter" object cache

https://gerrit.wikimedia.org/r/486558

aaron added a comment. Edited · Jan 25 2019, 10:11 PM

Things needed here:

  • Use only mcrouter in deployment-prep (no multiwrite) from MW
  • Remove puppet code for deployment-prep (nothing to do here; MW::appserver role via Horizon => MW::common => MW::nutcracker)
  • Install mcrouter on the memcached servers used by labswiki
  • Make MW use mcrouter on labswiki
  • Remove "memcached-pecl" cache entry from config
  • Remove labswiki nutcracker code from puppet

There is also the use of twemproxy as a redis proxy... which should be its own task, as that is a larger matter.

elukey added a subscriber: Andrew. Edited · Jan 31 2019, 11:11 PM

@Andrew if you have time, can we chat about the labswiki steps stated above?

Also adding @aborrero to get his thoughts about it (so that we can follow up in the EU timezone if needed).

elukey added a subscriber: aborrero. Feb 4 2019, 2:22 PM
Andrew added a comment. Feb 4 2019, 3:44 PM

My short answer is: I don't object to moving wikitech from nutcracker to mcrouter, but I'm currently a bit baffled by the mcrouter puppet setup.

There are two appservers for wikitech, labweb1001 and labweb1002, and the interesting stuff is puppetized via profile::openstack::base::wikitech::web. Presumably they'll want to be their own two-host memcached cluster.

Change 487889 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::wmcs::openstack::main::labweb: add mcrouter config

https://gerrit.wikimedia.org/r/487889

elukey moved this task from Backlog to In Progress on the User-Elukey board. Feb 4 2019, 5:51 PM

Change 487889 abandoned by Elukey:
role::wmcs::openstack::main::labweb: add mcrouter config

Reason:
After a chat with Joe it came up that we should explore different options, going to abandon this change.

https://gerrit.wikimedia.org/r/487889

elukey moved this task from In Progress to Stalled on the User-Elukey board. Feb 25 2019, 8:44 AM

nutcracker also seems to be used by thumbor when contacting its own memcached cluster, but due to T208934 I am not sure what would be the best thing to do. @fgiunchedi any thoughts?

IIRC at the time we used nutcracker for thumbor because it was easy; AFAIK there's nothing preventing us from moving to mcrouter and tossing nutcracker. I'll defer to @Gilles and @jijiki though!

Yes, you can go ahead and switch Thumbor to Mcrouter. Thumbor uses memcached for non-critical request throttling. Memcached can be down for a bit and Thumbor will continue working normally, albeit without some types of throttling working.
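
The "non-critical" behaviour described above is a fail-open pattern: if the counter store cannot be reached, the request is simply let through. A minimal Python sketch of that pattern follows. It is not Thumbor's actual code; `CacheUnavailable`, `allow_request`, and the two fake cache classes are hypothetical names introduced only for this example.

```python
class CacheUnavailable(Exception):
    """Hypothetical error raised when the cache backend cannot be reached."""


def allow_request(cache, key, limit):
    """Return True if the request may proceed.

    Fails open: when the cache is down, throttling is skipped
    instead of rejecting the request.
    """
    try:
        count = cache.incr(key)
    except CacheUnavailable:
        return True
    return count <= limit


class FakeCache:
    """In-memory counter store standing in for memcached."""

    def __init__(self):
        self.counts = {}

    def incr(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]


class DownCache:
    """Simulates an unreachable memcached."""

    def incr(self, key):
        raise CacheUnavailable()
```

With `FakeCache` and a limit of 2, the third call for the same key is rejected; with `DownCache`, every call is allowed, which mirrors "continue working normally, albeit without some types of throttling".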

aaron removed aaron as the assignee of this task. Mar 2 2019, 12:21 AM
kchapman moved this task from Doing to Radar on the Performance-Team board. Mar 25 2019, 9:33 PM
kchapman edited projects, added Performance-Team (Radar); removed Performance-Team.

While working on memcached on mc1022 I checked stats conns and observed a lot of connections like this:

STAT 27:addr tcp:10.64.32.57:48326
STAT 27:state conn_read
STAT 27:secs_since_last_cmd 6592671
STAT 28:addr tcp:10.64.48.81:47586

The connection above has been waiting for a command for ages. I checked on the host:

elukey@mw1246:~$ sudo lsof | grep 47586
nutcracke   758             nutcracker   49u     IPv4           23913412        0t0        TCP mw1246.eqiad.wmnet:47586->mc1022.eqiad.wmnet:11211 (ESTABLISHED)
nutcracke   758   773       nutcracker   49u     IPv4           23913412        0t0        TCP mw1246.eqiad.wmnet:47586->mc1022.eqiad.wmnet:11211 (ESTABLISHED)
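
A minimal Python sketch of how the stats output above could be scanned for long-idle connections. It assumes the `STAT <id>:<field> <value>` line shape shown in the excerpt; the function name and the one-day default threshold are illustrative choices, not existing tooling.

```python
from collections import defaultdict


def idle_connections(stats_text, min_idle_secs=86400):
    """Return (addr, idle_seconds) for connections idle at least min_idle_secs.

    Lines that do not look like 'STAT <id>:<field> <value>' are ignored.
    """
    conns = defaultdict(dict)
    for line in stats_text.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[0] != "STAT" or ":" not in parts[1]:
            continue
        conn_id, field = parts[1].split(":", 1)
        conns[conn_id][field] = parts[2]

    idle = []
    for fields in conns.values():
        secs = fields.get("secs_since_last_cmd")
        if secs is not None and secs.isdigit() and int(secs) >= min_idle_secs:
            idle.append((fields.get("addr"), int(secs)))
    return idle


# The four lines pasted in the comment above.
sample = """STAT 27:addr tcp:10.64.32.57:48326
STAT 27:state conn_read
STAT 27:secs_since_last_cmd 6592671
STAT 28:addr tcp:10.64.48.81:47586"""
```

On the pasted sample this reports only connection 27, since the excerpt is cut off before the idle time of connection 28. The client address and port in each result are what one would then look up with lsof on the client host.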

I think it is reasonable to assume that we don't use nutcracker anymore for mediawiki (on the app/api appservers), so in theory the next step for this task could be to remove its (memcached) config on api/appservers, reducing the connections to memcached that are not doing anything at the moment.

Change 504831 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::mediawiki::nutcracker: make memcached configuration optional

https://gerrit.wikimedia.org/r/504831

My idea going forward is to:

  1. get https://gerrit.wikimedia.org/r/504831 reviewed/merged, and then remove the hiera config for memcached in deployment-prep.
  2. let it bake for a bit to see if any error/weirdness comes up
  3. remove the memcached configuration in prod
  1. will need a roll restart of all nutcracker instances to be properly applied, which (probably) means breaking redis connections during the reload. This in turn may cause some errors for logged-in users, but I don't think it is avoidable (I can follow up with community liaisons to be ready in case).
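
As a rough sketch of the roll restart mentioned in the plan above, the Python below walks a host list in small batches so that only a few instances reload at once. The `restart` and `pause` hooks are hypothetical placeholders for whatever actually restarts nutcracker and waits between batches; nothing here reflects the tooling that was really used.

```python
def batches(hosts, size):
    """Yield consecutive slices of at most `size` hosts."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


def rolling_restart(hosts, restart, size=1, pause=lambda: None):
    """Restart hosts batch by batch and return them in the order handled.

    `restart(host)` and `pause()` are caller-supplied (hypothetical) hooks.
    """
    done = []
    for batch in batches(hosts, size):
        for host in batch:
            restart(host)
            done.append(host)
        # Give connections time to settle before the next batch.
        pause()
    return done
```

Keeping the batch size small limits how many redis connections break at the same moment, which is the user-facing risk called out in the plan.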

Change 504831 merged by Elukey:
[operations/puppet@production] profile::mediawiki::nutcracker: make memcached configuration optional

https://gerrit.wikimedia.org/r/504831

Change 508817 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove mediawiki's nutcracker config from deployment-prep

https://gerrit.wikimedia.org/r/508817

Change 508817 merged by Elukey:
[operations/puppet@production] Remove mediawiki's nutcracker config from deployment-prep

https://gerrit.wikimedia.org/r/508817

Mentioned in SAL (#wikimedia-releng) [2019-05-09T08:17:04Z] <elukey> remove mediawiki memcached nutcracker config from deployment-prep (should be unused) - T214275

Change 509462 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::mediawiki::canary_appserver: remove nutcracker memcached conf

https://gerrit.wikimedia.org/r/509462

Change 510121 had a related patch set uploaded (by Jbond; owner: Elukey):
[operations/puppet@production] role::mediawiki::canary_appserver: remove nutcracker memcached conf

https://gerrit.wikimedia.org/r/510121

Change 510124 had a related patch set uploaded (by Jbond; owner: Elukey):
[operations/puppet@production] role::mediawiki::canary_appserver: remove nutcracker memcached conf

https://gerrit.wikimedia.org/r/510124

Change 510124 abandoned by Elukey:
role::mediawiki::canary_appserver: remove nutcracker memcached conf

https://gerrit.wikimedia.org/r/510124

Change 509462 abandoned by Elukey:
role::mediawiki::canary_appserver: remove nutcracker memcached conf

Reason:
this doesn't override the hiera value, we'll probably need to use regex.yaml

https://gerrit.wikimedia.org/r/509462

Change 510153 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Empty mediawiki_memcached_servers for 3 mw hosts

https://gerrit.wikimedia.org/r/510153

Change 510121 abandoned by Jbond:
role::mediawiki::canary_appserver: remove nutcracker memcached conf

Reason:
test no longer needed

https://gerrit.wikimedia.org/r/510121

Change 510309 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove nutcracker memcached config in codfw

https://gerrit.wikimedia.org/r/510309

Change 510309 merged by Elukey:
[operations/puppet@production] Remove nutcracker memcached config in codfw

https://gerrit.wikimedia.org/r/510309

Change 510434 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::mediawiki::nutcracker: add port alarms only if config is deployed

https://gerrit.wikimedia.org/r/510434

Change 510434 abandoned by Elukey:
profile::mediawiki::nutcracker: add port alarms only if config is deployed

Reason:
All right then! Skipping this :)

https://gerrit.wikimedia.org/r/510434

Mentioned in SAL (#wikimedia-operations) [2019-05-16T05:34:21Z] <elukey> roll restart of nutcracker on mw2* to pick up new config changes (no more memcached config) - T214275

Connections to the memcached shards dropped nicely after the nutcracker restart.

Change 510153 merged by Elukey:
[operations/puppet@production] Empty mediawiki_memcached_servers for 3 mw hosts

https://gerrit.wikimedia.org/r/510153

Mentioned in SAL (#wikimedia-operations) [2019-05-16T08:22:32Z] <elukey> depool/restart-nutcracker-pool mw1238 - T214275

Mentioned in SAL (#wikimedia-operations) [2019-05-16T08:32:17Z] <elukey> depool/restart-nutcracker-pool mw1293/1313 - T214275

Change 510697 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove memcached config for nutcracker in eqiad

https://gerrit.wikimedia.org/r/510697

Change 510697 merged by Elukey:
[operations/puppet@production] Remove memcached config for nutcracker in eqiad

https://gerrit.wikimedia.org/r/510697

Mentioned in SAL (#wikimedia-operations) [2019-06-04T08:32:01Z] <elukey> remove memcached nutcracker config from mw1* hosts (not used). Changes will be picked up when nutcracker will be restarted (after reboots, etc..) - T214275

elukey renamed this task from Consider removing the last traces of nutcracker in Mediawiki configs to Deprecate the usage of nutcracker for memcached. Jun 4 2019, 8:50 AM
elukey updated the task description.
elukey added a comment. Jun 5 2019, 6:17 AM

Connections established to the eqiad memcached shards after the reboot of the mw1* hosts:

elukey updated the task description. Jun 5 2019, 6:21 AM
elukey moved this task from Stalled to Mcrouter/Memcached on the User-Elukey board. Jul 5 2019, 7:02 AM
elukey added a comment. Jul 5 2019, 9:47 AM

The two remaining use cases are:

  • labswiki
  • thumbor

The latter should be doable, but the former seems a bit more complicated. Is there any plan to deprecate the labswiki infra and fold it into the appserver layer? If so, I'd avoid working on labswiki; if not, we can surely think about moving it away from nutcracker.

Andrew added a comment. Jul 8 2019, 2:35 PM

The latter should be doable, but the former seems a bit more complicated. Is there any plan to deprecate the labswiki infra and fold it into the appserver layer?

There definitely is such a plan, although it will be quite a while (maybe end of this FY?) before we're able to move forward.

jijiki added a project: serviceops. Edited · Jul 8 2019, 2:55 PM

@elukey I can do thumbor, not sure when yet though. I opened T221081 a while ago for it.

The latter should be doable, but the former seems a bit more complicated. Is there any plan to deprecate the labswiki infra and fold it into the appserver layer?

There definitely is such a plan, although it will be quite a while (maybe end of this FY?) before we're able to move forward.

Thanks! I think this is totally fine, there is no real rush to move everything away from nutcracker now.

@elukey I can do thumbor, not sure when yet though. I opened T221081 a while ago for it.

Thanks a lot :)

@aaron what do you think? Would it be ok in your opinion to close this task?

elukey updated the task description. Jul 12 2019, 2:46 PM
aaron added a comment. Jul 12 2019, 5:39 PM

Seems reasonable for now.

elukey closed this task as Resolved. Thu, Jul 25, 9:46 AM
elukey claimed this task.