Reimage one memcached shard per DC to Buster
Closed, ResolvedPublic

Description

Given T251378 and T213089, I'd propose to move one memcached shard to Buster. The goal is to verify and test with production traffic that the configuration for Buster works as expected. There has been a lot of testing in the past, but I am pretty sure that some tuning will be needed for our production mc* hosts. The idea is to work on one shard for the moment, and then upgrade all the others once we are in a good state.

During the reimaging of the server, the mcrouter instances running on all mediawiki servers will fall back to the gutter pool servers. The production impact will be basically the same as if we were simply restarting a memcached server.

Due to switchover on the 27th, we will reimage a server by the 30th Oct.

Redis

One of our concerns is what will happen with the Redis server running on the newly reimaged buster host. Our options are:

  • Remove a redis shard, and do not install Redis on those hosts <-- this is what we are going with now.
    • mc1036
    • mc2036
  • Remove this host from the pool -> requires restarting nutcracker across all mediawiki hosts, which in turn will have some user impact
  • Install Redis 5.0 -> preferred solution as this is the default version in buster
  • Install Redis 2.8 -> this is the version we are currently running, but we need to repackage it for buster.
    • mc1034
    • mc2034

Memcached

  • There will be loss of EditorJourney data if a memcached shard becomes unavailable

Event Timeline

Given the libketama-style consistent hashing in twemproxy, and given that, AFAIK, CentralAuth sessions can regenerate (notwithstanding one-off CSRF token failures and such) from the existence of the long-term centralauth_Token cookie, removing a shard should at least not cause logouts.

ChronologyProtector would only have a short window of some anomalies if a redis server is de-pooled.

WikimediaEvents will lose track of the analytics funnel/conversion sessions for 1/18th of users (they will be seen as new users). Would be nice if the 'editor-journey' keys were copied over first, I suppose.

AbuseFilter will have a few stats reset...seems tolerable.

Thanks a lot for the response @aaron, can you expand a bit the WikimediaEvents use case? Also asking @Ottomata to check to verify if this could be a problem or not :)
Copying keys in this case might be tricky, since we'd have to know in advance how the mc1036's key will be remapped by libketama/twemproxy once the shard will disappear from the config..
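To illustrate the remapping concern, here is a minimal Python sketch of a ketama-style hash ring. It is not twemproxy's actual implementation: the shard names, the 160 points per server, the MD5-based point placement, and the sample keys are all assumptions made for illustration. It shows that when one of 18 shards is dropped, only the keys that lived on that shard move, and that their new owners can only be computed by rebuilding the ring without it.

```python
import bisect
import hashlib


def build_ring(servers, points_per_server=160):
    """Return a sorted list of (hash, server) points on the ring."""
    ring = []
    for server in servers:
        for i in range(points_per_server):
            digest = hashlib.md5(f"{server}-{i}".encode()).digest()
            ring.append((int.from_bytes(digest[:4], "little"), server))
    ring.sort()
    return ring


def lookup(ring, key):
    """Map a key to the first ring point at or after its hash, wrapping around."""
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "little")
    idx = bisect.bisect_left(ring, (h, ""))
    return ring[idx % len(ring)][1]


# 18 hypothetical shard names; the real pool uses host addresses.
shards = [f"shard{n:02d}" for n in range(1, 19)]
before = build_ring(shards)
after = build_ring([s for s in shards if s != "shard18"])

keys = [f"key:{i}" for i in range(20000)]

# Keys that lived on the removed shard, with their new owner.
moved = {k: lookup(after, k) for k in keys if lookup(before, k) == "shard18"}

# Keys on other shards that changed owner (expected to be none).
unmoved_changed = [
    k for k in keys
    if lookup(before, k) != "shard18" and lookup(before, k) != lookup(after, k)
]

print(f"{len(moved)} of {len(keys)} keys moved; {len(unmoved_changed)} others changed")
```

In this sketch the moved keys are spread over the remaining shards, so copying them ahead of time would require evaluating the post-removal ring for every key.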

In general, I'd like to proceed, possibly stopping Redis for some hours before pulling the trigger on the reimage. It will also be a good indicator of the impact caused by a Redis shard going down, something that may happen at any time for a variety of reasons.

First I've heard that WikimediaEvents uses memcached to track funnels. From https://meta.wikimedia.org/wiki/Schema_talk:EditorJourney, it looks like we should ask @kostajh.

Data in redis got evicted at furious rates for years before sessionstore. If we remove one server from the ring, I don't expect any real issue.

Side note: This question is also interesting from a DC switchover perspective (T243316) since that will also effectively be a Redis flush. In previous switchovers we only explicitly handled replication for sessions data, and now that's out of Redis. If there's anything else in there that we can't afford to drop and recreate, now would be a great time to know that.

Redis is currently replicated cross-DC, so the switchover won't mean that Redis gets flushed. We will flush the replicas in the secondary DC before the switch, in the hope of minimizing replica drift (yes, Redis replication is porous), as they will then re-replicate from scratch.

Most data will be still present after the switchover.

Yes, this stems from the work done in T208233: Improve hashing strategy for EditorJourney / cfd14c14d0a5d3e72bef1a620b0c6ab08ccd5d24. That said, I am pinging @nettrom_WMF and @MMiller_WMF to see if EditorJourney is something we are still interested in, because if not we could potentially remove or deactivate that code.

If we want to play this very safe, we could do the following steps:

  • step 1 - stop redis on mc1036 and wait a day to see if anything is reported or if any functionality is impaired. Rollback, if needed, is very quick and easy.
  • step 2 - remove mc1036/2036 from nutcracker's config merging https://gerrit.wikimedia.org/r/595810. Then wait again one/two days, to find if anything has been impacted. If so, rollback should be easy as well.

After step 2, we'd be ready to reimage the shard to Buster, with only memcached 1.6.6 running.

The only project I think we'd use it for is to investigate the Vietnamese Welcome Survey abandonment rate (T216668). There hasn't been time to dig into this data for quite a while, as evident by that task hanging around. As far as I'm concerned, we might as well remove/deactivate it and pick it up later if we find a need to understand newcomer usage patterns more. I'll leave it to @MMiller_WMF to decide.

@kostajh -- are you asking whether we should deactivate EditorJourney in all wikis, so as to stop it from recording data anywhere? If so, I am fine with that because we're not using the data right now. But we will need the ability to turn it back on in the future.

@nettrom_WMF -- the Vietnamese welcome survey question is likely going to come back, especially if we end up pursuing the integration with the Content Translation tool (in which we will use the welcome survey to identify whether a newcomer should be recommended to do translation tasks). But we probably don't have any EditorJourney data on Vietnamese welcome survey hanging around right now, because it's been so long since the survey has been active in that wiki. So to study the survey with EditorJourney, we would have to enable it, and have EditorJourney enabled at the same time, right?

Yeah, we could just deactivate it and leave the code in place should we wish to turn it back on again. Shall we do that?

@kostajh -- maybe we should do that, but I would like to hear from @nettrom_WMF about what that would mean for our analysis of Vietnamese welcome survey.

This comment was removed by MMiller_WMF.

@MMiller_WMF @nettrom_WMF following up on this again.

We have a few options:

  1. Leave EditorJourney logging code enabled, and know that there will be some inconsistent data during the period in which the server is reimaged (a few hours), so for any analyses we run, we should just exclude logging data for that time period.
    1. Or, we could switch off EditorJourney logging via a config patch, then SRE can proceed with reimaging, then we can re-enable
  2. Or we can just switch off EditorJourney logging entirely if we are not analyzing and don't have plans for further usage

@kostajh : Thanks for picking this up and pinging me about it. I think we should switch off EditorJourney since we're not actively using the data in any ongoing analysis.

Also, @MMiller_WMF and I should sit down and plan to run another experiment with the Welcome Survey in Vietnamese, and if we decide that abandonment rate should be a part of it we should enable EJ on only that wiki specifically for that experiment.

Hmm, I spoke too soon. We rely on wgWMEUnderstandingFirstDay being set in order to oversample in Schema:EditAttemptStep (in WikimediaEvents's shouldSchemaEditAttemptStepOversample()), so we need to disentangle the configuration value from that method before we can switch off EditorJourney logging. It shouldn't be that complicated -- I think instead of checking to see if wgWMEUnderstandingFirstDay is true, we instead want to see if the GrowthExperiments extension is enabled, because we want to oversample edit attempts for all GrowthExperiments users regardless of whether they are opted-in to the Homepage experiment. @nettrom_WMF does that sound right to you? /cc @Catrope and @Tgr

Change 633514 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[operations/mediawiki-config@master] Disable wgWMEUnderstandingFirstDay (EditorJourney) logging

https://gerrit.wikimedia.org/r/633514

From what I remember, we have two paths that require oversampling EditAttemptStep:

  1. A user opens the editor during their first 24 hours (in other words, tracked by EditorJourney).
  2. A user clicks on a suggested task from Newcomer Tasks and opens the editor.

I'm not sure if the second path is somehow intertwined with the first in the code? Apart from these two paths, I don't think we have other needs for oversampling EditAttemptStep, so I think we should make sure we fall back on EditAttemptStep's regular sampling rate if a user isn't following these paths in order to limit how much data we store. @MMiller_WMF, please pitch in if I've forgotten something or my suggestions are different from what you'd like.

I think instead of checking to see if wgWMEUnderstandingFirstDay is true, we instead want to see if GrowthExperiments extension is enabled, because we want to oversample edit attempts for all GrowthExperiments users regardless of whether they are opted-in to the Homepage experiment.

PageView::userIsInCohort() already checks that. Conceptually this is still part of understanding the first day, though, isn't it? (Or the first two weeks, in the case of the help panel, apparently.)

Ah, right. You do remember correctly @nettrom_WMF, there are two distinct paths to perform the oversampling:

$pageViews = new PageViews( $context );
$userInCohort = $wgWMEUnderstandingFirstDay && $pageViews->userIsInCohort();

// The editingStatsOversample request parameter can trigger oversampling
$fromRequest = $context->getRequest()->getBool( 'editingStatsOversample' );

$shouldOversample = $userInCohort || $fromRequest;
Hooks::run(
	'WikimediaEventsShouldSchemaEditAttemptStepOversample',
	[ $context, &$shouldOversample ]
);

So if we have switched EditorJourney off ($wgWMEUnderstandingFirstDay is false) then oversampling should still happen when editingStatsOversample is in the query parameter, which happens when a user clicks on a task via the suggested edits module.

And yes, to confirm, if a user is not in the editor journey cohort (or EditorJourney is switched off entirely), and the query parameter override isn't present, then we fall back to the normal sampling rate.
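The decision described above can be restated as a small truth-table check. This is a Python paraphrase of the PHP snippet, not code from WikimediaEvents; the function name, parameter names, and the way hook handlers are modeled (callables that take and return the flag) are assumptions for illustration only.

```python
def should_oversample(understanding_first_day, user_in_cohort, from_request, hooks=()):
    """Mirror the quoted logic: cohort membership only counts while the
    EditorJourney config flag is on; the request parameter always counts."""
    should = (understanding_first_day and user_in_cohort) or from_request
    # Hypothetical stand-in for the hook: each handler may override the flag.
    for hook in hooks:
        should = hook(should)
    return should


# EditorJourney switched off, user would have been in the cohort:
print(should_oversample(False, True, False))  # False: normal sampling rate
# EditorJourney switched off, editingStatsOversample present in the request:
print(should_oversample(False, True, True))   # True: still oversampled
```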

Change 634012 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[operations/mediawiki-config@master] Disable EditorJourney (UnderstandingFirstDay)

https://gerrit.wikimedia.org/r/634012

@kostajh Can you confirm whether something does or does not need to change in WikimediaEvents or GrowthExperiments prior to the partial deletion of Redis data for the maintenance here? I see a patch for turning it off, which suggests that partial loss ahead of it might be fine if it is already meant to be turned off. But, I also see a comment at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/633514/ which says that some logic in WikimediaEvents needs to change first?

Nothing needs to change in WikimediaEvents or GrowthExperiments (see also my comment in that config patch). I'll try to get those config patches merged and deployed on Monday.

Change 633514 merged by jenkins-bot:
[operations/mediawiki-config@master] labs: Disable EditorJourney (UnderstandingFirstDay)

https://gerrit.wikimedia.org/r/633514

Change 634012 merged by jenkins-bot:
[operations/mediawiki-config@master] Disable EditorJourney (UnderstandingFirstDay)

https://gerrit.wikimedia.org/r/634012

@jijiki EditorJourney logging is now switched off. We may at some point want to re-enable but will wait for this work to be finished before doing so.

Mentioned in SAL (#wikimedia-operations) [2020-10-19T11:20:02Z] <urbanecm@deploy1001> Synchronized wmf-config/InitialiseSettings.php: 26b97261f2b9d1991ea08fe32b6007ba6fe5088f: Disable EditorJourney (UnderstandingFirstDay) (T252391) (duration: 01m 10s)

Change 635987 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] Set debian buster for mc1019

https://gerrit.wikimedia.org/r/635987

Change 595810 abandoned by Elukey:
[operations/puppet@production] Remove mc1036/mc2036 from the Redis Nutcracker config

Reason:
Not needed anymore

https://gerrit.wikimedia.org/r/595810

Change 637708 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: remove shard18 from redis.yaml

https://gerrit.wikimedia.org/r/637708

Icinga downtime for 8:00:00 set by rzl@cumin1001 on 1 host(s) and their services with reason: reimaging to Buster

mc2036.codfw.wmnet

Change 637708 merged by Effie Mouzeli:
[operations/puppet@production] hiera: remove shard18 from redis.yaml

https://gerrit.wikimedia.org/r/637708

Mentioned in SAL (#wikimedia-operations) [2020-10-30T17:19:37Z] <effie> disable puppet on mc1036 and mc2036 - T252391

Change 635987 merged by Effie Mouzeli:
[operations/puppet@production] Set debian buster for mc2036

https://gerrit.wikimedia.org/r/635987

Script wmf-auto-reimage was launched by jiji on cumin2001.codfw.wmnet for hosts:

mc2036.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202010301739_jiji_15617_mc2036_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mc2036.codfw.wmnet']

Of which those FAILED:

['mc2036.codfw.wmnet']

Script wmf-auto-reimage was launched by jiji on cumin2001.codfw.wmnet for hosts:

mc2036.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202010302040_jiji_15243_mc2036_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mc2036.codfw.wmnet']

Of which those FAILED:

['mc2036.codfw.wmnet']

Script wmf-auto-reimage was launched by jiji on cumin2001.codfw.wmnet for hosts:

mc2036.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202010302044_jiji_17606_mc2036_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mc2036.codfw.wmnet']

and were ALL successful.

jijiki renamed this task from Reimage one memcached shard to Buster to Reimage one memcached shard per DC to Buster.Oct 30 2020, 9:16 PM
jijiki updated the task description. (Show Details)

We removed shard18 from redis.yaml so that we can avoid installing redis-server on this server pair (mc1036-mc2036).

  • mc1036.eqiad.wmnet is left with puppet being disabled, as it is broken due to the server being absent from redis.yaml
  • mc2036.codfw.wmnet has been reimaged to buster without redis-server installed 🎉

This is indeed really nice :)

One nit - for the gutter pool, we added some specific settings for 1.5.x, meanwhile on 2036 we are keeping the defaults that we use for Jessie. This is the hiera config for the gutter:

profile::memcached::version: 'present'
profile::memcached::threads: 16
profile::memcached::growth_factor: 1.15
profile::memcached::min_slab_size: 48
profile::memcached::size: 241591
profile::memcached::extended_options:
  - 'modern'
profile::memcached::port: 11211

Some notes:

  • the threads value is very important since 1.5+ code is way more scalable than 1.4 (less locking etc.), so the default of 4 threads can be increased. 2036 has 32 cores according to nproc, so 8/16 threads for memcached should be good.
  • growth_factor and min_slab_size are the tricky ones, I added some thoughts in T252391#6223839. Under load (namely on mc10xx) I think that we'll have to tune those, but it is difficult to know it in advance. The values above might be a good starting point.
  • modern options enable all the new 1.5+ goodies.
  • size needs to be adjusted, since we currently use -m 89088 meanwhile on the host we have 128G of ram. If we add 10/20G for each node we'll increase our overall capacity without really risking anything.
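To make the growth_factor and min_slab_size trade-off more concrete, here is a rough Python sketch of how slab chunk sizes grow geometrically. It is only loosely modeled on memcached's behavior and is not its actual code: it assumes min_slab_size maps to memcached's -n option and growth_factor to -f, and the 48-byte item header, 1 MiB page, 8-byte alignment, and class limit are illustrative assumptions.

```python
def slab_chunk_sizes(min_slab_size=48, growth_factor=1.15, item_header=48,
                     page_size=1024 * 1024, align=8, max_classes=63):
    """Return an approximate list of chunk sizes, one per slab class."""
    sizes = []
    size = item_header + min_slab_size
    while len(sizes) < max_classes - 1 and size <= page_size / growth_factor:
        # Round each chunk size up to the alignment boundary.
        if size % align:
            size += align - (size % align)
        sizes.append(size)
        size = int(size * growth_factor)
    # The last class holds the largest items (assumed to be one full page here).
    sizes.append(page_size)
    return sizes


tuned = slab_chunk_sizes(growth_factor=1.15)
coarse = slab_chunk_sizes(growth_factor=1.25)
print(len(tuned), "classes with factor 1.15, first sizes:", tuned[:5])
print(len(coarse), "classes with factor 1.25, first sizes:", coarse[:5])
```

A smaller factor gives more, finer-grained classes, so less space is wasted per item, at the cost of spreading memory over more classes. That is why the right values are hard to pick before seeing the real object size distribution under load.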

Last but not least, Moritz and John packaged 1.6.6 for IDP, which supports TLS, so we could think about using it for this use case as well (maybe also upgrading the gutter pool!)

Change 639089 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] Set debian buster for mc1036

https://gerrit.wikimedia.org/r/639089

Change 639089 merged by Effie Mouzeli:
[operations/puppet@production] Set debian buster for mc1036

https://gerrit.wikimedia.org/r/639089

Change 639099 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] mc1036: Initial memcached 1.5 tuning

https://gerrit.wikimedia.org/r/639099

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc1036.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202011041706_jiji_1322_mc1036_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mc1036.eqiad.wmnet']

and were ALL successful.

Change 639099 merged by Effie Mouzeli:
[operations/puppet@production] mc1036: Initial memcached 1.5.x tuning

https://gerrit.wikimedia.org/r/639099

Adding some thoughts about mc1036, in my opinion it is really flying with the new config :)

With the extra 20G of RAM that we added (since there was plenty of room for expansion on the host) we reached 50M objects stored, instead of 28M:

Screen Shot 2020-11-16 at 8.09.23 AM.png (998×1 px, 138 KB)

Handling more objects is not trivial for memcached, since with more space there is also the chance of more cache pollution. For example, on 1.4, slabs assigned to a size class cannot be reclaimed, even if they are empty. This makes the first allocations very important, since they will dictate the overall distribution of allocations for the shard in the long term. With 1.5 slabs can be reassigned when empty, and objects assigned to an expired key are automatically cleaned up by a thread in the background (as opposed to freed only when there is no space for new keys in the slab).

This brings us to another great improvement, namely evictions and reclaims:

Screen Shot 2020-11-16 at 8.15.19 AM.png (259×1 px, 98 KB)

Again this is due to the combination of extra space and new logic for memcached memory handling.

The get hit ratio went from 0.95XX to 0.96XX, one of the highest of the shards:

Screen Shot 2020-11-16 at 8.17.27 AM.png (944×1 px, 119 KB)
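For reference, the get hit ratio can be derived from the get_hits and get_misses counters that memcached reports via its stats command. The Python sketch below parses the plain-text "STAT name value" lines; the counter values in the sample are made up for illustration and are not taken from mc1036.

```python
def hit_ratio(stats_text):
    """Compute get_hits / (get_hits + get_misses) from memcached stats output."""
    stats = {}
    for line in stats_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    hits = int(stats["get_hits"])
    misses = int(stats["get_misses"])
    total = hits + misses
    return hits / total if total else 0.0


# Illustrative sample only, not real production counters.
sample = """STAT get_hits 9650
STAT get_misses 350
END"""
print(round(hit_ratio(sample), 4))
```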

Whoever is interested in slab metrics, please check the grafana dashboard. There are also new metrics in the 1.5.x panel related to the hot/warm/cold LRU cache stages.

@elukey beat me to writing the celebratory post :D

Since we are happy with the current settings, I think we can continue by installing redis 2.8 on those two hosts. We have forward ported redis 2.8 to buster, so we can try to re-add those two hosts to the nutcracker cluster, and see how it goes

@jijiki Redis' memory footprint is very low afaics (<1G) but its os page cache usage is some GBs on mc10xx nodes. I am pretty sure that Redis is the only one using page cache since on mc1036 the usage is zero (since we have only memcached running on it).

If we add Redis on mc1036 we should monitor memory usage, since memcached is using ~110G now. There should be plenty of space for Redis' page cache, but better safe than sorry :)

I agree. I am optimistic that, given we have reduced our redis usage, it will not be an issue:

image.png (566×334 px, 107 KB)

but I see your point:

image.png (1×3 px, 185 KB)

Thank you!

Change 646638 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] redis: define redis version on buster

https://gerrit.wikimedia.org/r/646638

Change 647197 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] redis: define redis version on buster for multidc

https://gerrit.wikimedia.org/r/647197

Change 646638 abandoned by Effie Mouzeli:
[operations/puppet@production] redis: define redis version on buster

Reason:
abandoned for 647197

https://gerrit.wikimedia.org/r/646638

Change 647204 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: install redis on mc1036

https://gerrit.wikimedia.org/r/647204

Change 647197 merged by Effie Mouzeli:
[operations/puppet@production] redis: define redis version on buster for multidc

https://gerrit.wikimedia.org/r/647197

Change 647204 merged by Effie Mouzeli:
[operations/puppet@production] hiera: install redis on shard16

https://gerrit.wikimedia.org/r/647204

jijiki claimed this task.
jijiki updated the task description. (Show Details)

Change 713552 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/WikimediaEvents@master] Remove UnderstandingFirstDay code

https://gerrit.wikimedia.org/r/713552

Change 713552 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] Remove UnderstandingFirstDay code

https://gerrit.wikimedia.org/r/713552