
Upgrade redis_misc hosts to Debian Trixie (Redis 8.0)
Open, High, Public

Description

Redis recommends upgrading one major version at a time, which would mean upgrading the servers twice:

  • Debian Bookworm and Redis 7.0
  • Debian Trixie and Redis 8.0

However, given that we have active hardware refreshes in both datacenters (T418918: rdb101[56] implementation tracking and T418924: rdb201[34] implementation tracking), we have an opportunity to skip Bookworm entirely and move directly to Trixie, provided that there are no concerns from the teams currently using redis_misc.

If possible, we should also take the chance to re-IP the eqiad hosts during the OS upgrade reimage (T421711: ServiceOps: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets).

Services using redis_misc

Pair 1

Port  DB    Usage                                          PoC/tag
6378  2, 3  Netbox tasks (db 2) and Netbox caching (db 3)  netbox Infrastructure-Foundations
6379  0     changeprop / cpjobqueue / api-gateway          MW-Interfaces-Team
6380  0     Ratelimit                                      MW-Interfaces-Team
6381  0     filebackend.php (redisLockManager)             MediaWiki-Platform-Team
6382  0     filebackend.php (redisLockManager)

Pair 2

Port  DB    Usage                                          PoC/tag
6378  0     IDP (CAS-SSO) Production                       Infrastructure-Foundations
6378  1     IDP (CAS-SSO) Test
6379  0     changeprop / cpjobqueue / api-gateway
6380  0     Ratelimit
6381  0     filebackend.php (redisLockManager)
6382  0     docker-registry                                ServiceOps new

How?

The new hosts will be reimaged directly to Trixie (Redis 8.0), and services will be migrated one at a time (or a few at a time). If any issues come up, we can simply revert the service by pointing it back at the old rdb hosts.

Open Questions to owners before proceeding
  • Shall we migrate the data?
    • We need to decide whether to migrate existing Redis data to the new hosts or start fresh. Given that Redis is generally considered to be ephemeral storage, not migrating the data should be an acceptable risk.
    • How does data persistence affect your service?
  • Application behavior under server unavailability
    • It is currently unknown how each of the services above behaves if their Redis storage becomes unavailable. Maybe this could be a good opportunity to test this in a controlled manner, since rolling back to existing hosts would be easy to do.
    • Can your service tolerate a brief Redis unavailability?
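
One way to frame the unavailability question for service owners is to ask which of three paths a read takes: served from cache, recomputed on a cold miss, or recomputed because Redis is unreachable. The sketch below is only an illustration of that framing. `FakeCache`, `CacheUnavailable`, and `cached_lookup` are hypothetical names, and the in-memory dict stands in for a real Redis client; none of the services listed above is known to be written this way.

```python
class CacheUnavailable(Exception):
    """Raised by the stand-in cache when the simulated Redis host is down."""


class FakeCache:
    """In-memory stand-in for a Redis client; `up` simulates reachability."""

    def __init__(self):
        self.up = True
        self.data = {}

    def get(self, key):
        if not self.up:
            raise CacheUnavailable(key)
        return self.data.get(key)

    def set(self, key, value):
        if not self.up:
            raise CacheUnavailable(key)
        self.data[key] = value


def cached_lookup(cache, key, compute):
    """Return (value, source) for `key`.

    source is "cache" on a hit, "origin" on a cold miss, and "degraded"
    when the cache is unreachable and the origin is used directly.
    """
    try:
        hit = cache.get(key)
    except CacheUnavailable:
        return compute(key), "degraded"
    if hit is not None:
        return hit, "cache"
    value = compute(key)
    try:
        cache.set(key, value)
    except CacheUnavailable:
        pass  # the cache went away between get and set; still serve the value
    return value, "origin"
```

A service that follows the "degraded" path only gets slower during an outage, while one that lets the connection error propagate fails outright; the controlled test proposed above would show which category each service falls into.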
Dashboard improvements
  • improve grafana dashboards: https://grafana-rw.wikimedia.org/d/000000174/redis
    • Role-aware filtering via $role
    • cache hit ratio, memory total, connected clients
    • new panels:
      • connected clients & replicas
      • replication lag
      • cache hits vs misses
    • reorganise layout
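
As a rough illustration of what the "cache hits vs misses" panel would compute, the sketch below derives a hit ratio from the `keyspace_hits` and `keyspace_misses` counters that Redis reports in its INFO output. The sample text and its values are invented for the example, and the helper names are hypothetical; how the actual dashboard obtains these counters is not stated in this task.

```python
def parse_info(text):
    """Parse the `field:value` lines of Redis INFO output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Stats"
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def hit_ratio(fields):
    """Return hits / (hits + misses), or None if there were no lookups yet."""
    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None


# Invented sample values, shaped like INFO output.
SAMPLE = """\
# Stats
keyspace_hits:900
keyspace_misses:100
# Clients
connected_clients:12
# Replication
connected_slaves:1
"""

fields = parse_info(SAMPLE)
print(hit_ratio(fields))  # 0.9
```

Both counters are cumulative since server start, so a panel would normally plot the ratio of their rates over a time window rather than the lifetime totals.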
Wikitech Updates

Event Timeline

jijiki added a subtask: Unknown Object (Task).
jijiki added a subtask: Unknown Object (Task).

Infrastructure-Foundations, two questions for netbox:

  • how will netbox behave if it loses connectivity to its redis and then starts with a cold cache?
  • do we have any concerns updating to redis 8?

Same questions for IDP, so I guess that makes 4 questions.

MW-Interfaces-Team, same questions for you for changeprop / cpjobqueue / api-gateway / Ratelimit:

  • how will they behave if they lose connectivity to their redis and then start with a cold cache?
  • do we have any concerns updating to redis 8?

Infrastructure-Foundations and MW-Interfaces-Team, we need your input on those questions.

This migration is required for the Debian upgrades this quarter, as Bullseye is at the very end of its EOL period.

Hey!

> Infrastructure-Foundations, two questions for netbox:
>
>   • how will netbox behave if it loses connectivity to its redis and then starts with a cold cache?

It will be slower, but I don't foresee a major issue. I can ask around my team and get back to you asap.

>   • do we have any concerns updating to redis 8?

Not really, it shouldn't be an issue.

> Same questions for IDP, so I guess that makes 4 questions.

For IDP I'd ask @SLyngshede-WMF for a confirmation :)

We are going to perform a quick test for netbox, namely switching the Redis database to an unused one and restarting. It is easy to roll back, and we'll see what the cost of starting netbox from "scratch" is, Redis-wise. I don't foresee major blockers, so please go ahead with the planning; I'll post an update asap for the final confirmation.

For IDP we do not need to migrate data, this is just session storage. The only downside is that people will need to sign back in.

IDP is deployed on two hosts, so we can change the Redis configuration for the standby host, test that everything is working as expected, then fail over to that one. Once we're happy, we can move the remaining host.

If Redis disappears from underneath CAS/IDP, the service will most likely need to be restarted. It should reconnect, but I wouldn't trust that too much.

We can start with IDP-Test, I'm happy to help and test behavior.

Regarding redisLockManager:

Shall we migrate the data?

The data is just ephemeral "session X has SH/EX lock on key Y" stuff. 2 of the 3 server slots in ProductionServices.php (rdb1, rdb2, and rdb3) have to be up for things to move along. In theory, with redisLockManager, you could migrate one, wait 5 minutes, and go to the next one. In practice, given the actual ProductionServices config we have, two of those "server slots" map to rdb1013, just on different ports. This used to be different years ago. Anyway, since you would be migrating to *new* hosts, the safest thing would just be to migrate *one* of the slots on rdb1013, such as rdb2, to the new host, wait 5 minutes, and then do the same for each of the other two slots (in any order).

Can your service tolerate a brief Redis unavailability?

In theory, one of the server slots could go down and little would happen (maybe a slight latency increase). In practice, for the same reason as above, this would not be true if the *host* of one of the slots went down. In that case, it looks like locks could not be acquired, causing upload failures on all wikis. However, if you just point one rdb[1-3] slot to a new host, and the connections there fail for some reason, the service will still work. So if that's the kind of unavailability you're worried about, then I don't see an issue.
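
The quorum reasoning above can be sketched as a small majority check. This is a toy model, not the redisLockManager code: the only fact taken from the comment is that two of the three slots share rdb1013. Which slots those are, the name `other-host`, and all port numbers are assumptions made up for the illustration.

```python
# Hypothetical slot -> (host, port) layout. Only "two slots share rdb1013"
# comes from the comment above; the other host name and the ports are made up.
SLOTS = {
    "rdb1": ("rdb1013", 6381),
    "rdb2": ("rdb1013", 6382),
    "rdb3": ("other-host", 6381),
}


def has_quorum(slots, down_hosts=(), down_slots=()):
    """Return True if a strict majority of slots is still reachable."""
    up = [
        name
        for name, (host, _port) in slots.items()
        if host not in down_hosts and name not in down_slots
    ]
    return len(up) > len(slots) // 2


print(has_quorum(SLOTS, down_slots={"rdb2"}))     # True: 2 of 3 slots remain
print(has_quorum(SLOTS, down_hosts={"rdb1013"}))  # False: only 1 of 3 remains
```

Under these assumptions, repointing a single slot at a new host leaves two slots on known-good servers even if the new connection fails, which is why migrating one slot at a time keeps locking available.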

> We are going to perform a quick test for netbox, namely switching the Redis database to an unused one and restarting. It is easy to roll back, and we'll see what the cost of starting netbox from "scratch" is, Redis-wise. I don't foresee major blockers, so please go ahead with the planning; I'll post an update asap for the final confirmation.

Great! Thank you!

> For IDP we do not need to migrate data, this is just session storage. The only downside is that people will need to sign back in.
>
> IDP is deployed on two hosts, so we can change the Redis configuration for the standby host, test that everything is working as expected, then fail over to that one. Once we're happy, we can move the remaining host.
>
> If Redis disappears from underneath CAS/IDP, the service will most likely need to be restarted. It should reconnect, but I wouldn't trust that too much.
>
> We can start with IDP-Test, I'm happy to help and test behavior.

Thank you! I will ping you. Any concerns regarding Redis 8?

> Regarding redisLockManager:
>
> Shall we migrate the data?
>
> The data is just ephemeral "session X has SH/EX lock on key Y" stuff. 2 of the 3 server slots in ProductionServices.php (rdb1, rdb2, and rdb3) have to be up for things to move along. In theory, with redisLockManager, you could migrate one, wait 5 minutes, and go to the next one. In practice, given the actual ProductionServices config we have, two of those "server slots" map to rdb1013, just on different ports. This used to be different years ago. Anyway, since you would be migrating to *new* hosts, the safest thing would just be to migrate *one* of the slots on rdb1013, such as rdb2, to the new host, wait 5 minutes, and then do the same for each of the other two slots (in any order).
>
> Can your service tolerate a brief Redis unavailability?
>
> In theory, one of the server slots could go down and little would happen (maybe a slight latency increase). In practice, for the same reason as above, this would not be true if the *host* of one of the slots went down. In that case, it looks like locks could not be acquired, causing upload failures on all wikis. However, if you just point one rdb[1-3] slot to a new host, and the connections there fail for some reason, the service will still work. So if that's the kind of unavailability you're worried about, then I don't see an issue.

Excellent, thank you! Are there any concerns regarding Redis 8?

> Excellent, thank you! Are there any concerns regarding Redis 8?

I didn't see any relevant breaking changes in the redis 7 and 8 changelogs.

> MW-Interfaces-Team, same questions for you for changeprop / cpjobqueue / api-gateway / Ratelimit:
>
>   • how will they behave if they lose connectivity to their redis and then start with a cold cache?
>   • do we have any concerns updating to redis 8?

I *think* that api-gateway/Ratelimit is just storing ephemeral counters. You'd want to check with @daniel though, in case there is some token/session stuff (though we try to keep that stateless, with expiration, to avoid storage lookups). I haven't been involved in the ongoing API rate limiter work.

For changeprop, redis is used for rate limiting and tracking problematic titles (failures, extreme hit rates) for ignore/backoff logic. So, more ephemeral counters.

For the changeprop-jobqueue, redis is used for job deduplication (per-job-hash and per-root-job). If you lose redis data, then some existing jobs that are duplicates of already-started jobs will no longer be seen as duplicates (determined when the offset reaches them). For any given job-hash, no more than 1 duplicate might get processed, e.g. 10 jobs with a hash matching some prior job that started after they were enqueued would still be seen as 1 non-duplicate (incorrectly) and 9 duplicates (correctly). Loss of redis data would also allow an extra round of backlink updates for some templates/modules/files to run instead of being a no-op. I think it's manageable. A correctly synced migration would be tricky given how fast-moving the data is, and you probably don't want to stop job processing for this. However, it could be done the dumb simple way, without caring about intervening changes, which would just let some duplicate jobs through, but fewer than without migration. Jobs are also supposed to be idempotent. I think data migration could be skipped if it's a huge pain.
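
The "1 non-duplicate and 9 duplicates" arithmetic can be reproduced with a simplified model of timestamp-based deduplication. This is a sketch under assumptions, not the changeprop-jobqueue implementation: it assumes the store maps each job hash to the time the most recent job with that hash started, and that a queued job is skipped when that start time is later than its own enqueue time. `run_queue` and the timestamps are hypothetical.

```python
def run_queue(jobs, store, clock_start):
    """Process (job_hash, enqueued_at) pairs in order; return the hashes that ran.

    `store` plays the role of the Redis dedup data: job_hash -> last start time.
    """
    ran = []
    now = clock_start
    for job_hash, enqueued_at in jobs:
        last_start = store.get(job_hash)
        if last_start is not None and last_start > enqueued_at:
            continue  # an identical job started after this one was enqueued
        store[job_hash] = now  # record that this hash has just started
        ran.append(job_hash)
        now += 1
    return ran


# Ten jobs with the same hash, enqueued at t=90..99; a prior identical job
# started at t=100, so all ten are duplicates of it.
jobs = [("h", t) for t in range(90, 100)]

print(run_queue(jobs, {"h": 100}, clock_start=200))  # [] with the data intact
print(run_queue(jobs, {}, clock_start=200))          # ['h'] after losing the data
```

In this model, the first job processed after the data loss repopulates the store, so the remaining nine are again recognised as duplicates, matching the bound of one extra execution per job hash described above.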

> I *think* that api-gateway/Ratelimit is just storing ephemeral counters. You'd want to check with @daniel though, in case there is some token/session stuff (though we try to keep that stateless, with expiration, to avoid storage lookups). I haven't been involved in the ongoing API rate limiter work.

For the ratelimit service we don't care about losing state. A reset of all counters is fine.

As far as I know, the ratelimit counters are the only thing the gateway(s) use Redis for, but I am not 100% sure. Best check with @Clement_Goubert and @hnowlan.

>> I *think* that api-gateway/Ratelimit is just storing ephemeral counters. You'd want to check with @daniel though, in case there is some token/session stuff (though we try to keep that stateless, with expiration, to avoid storage lookups). I haven't been involved in the ongoing API rate limiter work.
>
> For the ratelimit service we don't care about losing state. A reset of all counters is fine.
>
> As far as I know, the ratelimit counters are the only thing the gateway(s) use Redis for, but I am not 100% sure. Best check with @Clement_Goubert and @hnowlan.

Yep that's it.

> Thank you! I will ping you. Any concerns regarding Redis 8?

I'll test that in advance. I can't see it being much of an issue, we're using very few features in Redis, but I'll test and let you know.

CAS/IDP is now tested against Redis 8 locally. It all works perfectly.

Change #1275430 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] site.pp: add role for rdb2011

https://gerrit.wikimedia.org/r/1275430

Change #1275502 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] (DNM) site.pp: add role for rdb2011

https://gerrit.wikimedia.org/r/1275502

Change #1275752 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] redis::slave: Move to firewall::service

https://gerrit.wikimedia.org/r/1275752

Change #1275757 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] redis::master: Move to firewall::service

https://gerrit.wikimedia.org/r/1275757

Change #1275752 merged by Muehlenhoff:

[operations/puppet@production] redis::slave: Move to firewall::service

https://gerrit.wikimedia.org/r/1275752

The current Redis slaves have been migrated to use firewall::service by merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275752 which was confirmed to be a NOP on existing hosts. This unblocks the use of nftables for the new Redis trixie hosts:

jmm@puppetdb1003:~$ nftables-compat-check.py rdb1012.eqiad.wmnet
Warning: This server uses defs_from_etcd, which isn't implemented for nft yet

All firewall services are compatible with nftables. The full list is:
{'full_monitoring_metrics_access_udp', 'ssh_from_cumin_masters', 'redis_slave_role', 'ssh_from_bastion', 'full_monitoring_metrics_access_tcp'}

Change #1275757 merged by Muehlenhoff:

[operations/puppet@production] redis::master: Move to firewall::service

https://gerrit.wikimedia.org/r/1275757

The current Redis masters have been migrated to use firewall::service by merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275757 which was confirmed to be a NOP on existing hosts. This unblocks the use of nftables for the new Redis trixie hosts as well:

jmm@puppetdb1003:~$ nftables-compat-check.py rdb1013.eqiad.wmnet
Warning: This server uses defs_from_etcd, which isn't implemented for nft yet

All firewall services are compatible with nftables. The full list is:
{'ssh_from_cumin_masters', 'redis_master_role', 'ssh_from_bastion', 'full_monitoring_metrics_access_tcp', 'full_monitoring_metrics_access_udp'}
Jhancock.wm closed subtask Unknown Object (Task) as Resolved. Tue, Apr 21, 4:57 PM
Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Fri, Apr 24, 12:53 PM

Change #1275502 abandoned by Effie Mouzeli:

[operations/puppet@production] site.pp: add role for rdb2011

Reason:

rebase mumbo jumbo

https://gerrit.wikimedia.org/r/1275502

Change #1277429 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] site.pp: add role for rdb2011

https://gerrit.wikimedia.org/r/1277429

I'm updating the code for redis lock manager right now (for T366938: Reduce relying on database locks), and if you give me a week or two, I can make the system much more robust and safer against such issues.

Change #1282308 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] redis::master: Remove obsolete code only used for old ferm service

https://gerrit.wikimedia.org/r/1282308

Change #1282315 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] redis::master: Pass ports as an array, not a string

https://gerrit.wikimedia.org/r/1282315

Change #1282315 abandoned by Muehlenhoff:

[operations/puppet@production] redis::master: Pass ports as an array, not a string

Reason:

1282311 was merged instead

https://gerrit.wikimedia.org/r/1282315

Change #1282353 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] redis::master: Remove obsolete code only used for old ferm service

https://gerrit.wikimedia.org/r/1282353

Change #1282308 abandoned by Muehlenhoff:

[operations/puppet@production] redis::master: Remove obsolete code only used for old ferm service

Reason:

Replaced by 1282353

https://gerrit.wikimedia.org/r/1282308

Change #1277429 merged by Effie Mouzeli:

[operations/puppet@production] site.pp: add role for rdb2011 and rdb2012

https://gerrit.wikimedia.org/r/1277429

Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1003 for host rdb2011.codfw.wmnet with OS trixie

Change #1283733 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] idp_test: switch to rdb2011

https://gerrit.wikimedia.org/r/1283733

Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1003 for host rdb2012.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1003 for host rdb2011.codfw.wmnet with OS trixie completed:

  • rdb2011 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202605061157_jiji_582750_rdb2011.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@Blake and I reimaged rdb2011 and rdb2012 today and submitted an IDP test patch. @SLyngshede-WMF, please merge the patch and let us know how it went, the sooner the better :)

Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1003 for host rdb2012.codfw.wmnet with OS trixie completed:

  • rdb2012 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202605061224_jiji_609364_rdb2012.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1283733 merged by Slyngshede:

[operations/puppet@production] idp_test: switch to rdb2011

https://gerrit.wikimedia.org/r/1283733

@SLyngshede-WMF as discussed, since we accidentally wired IDP to a non-primary redis host, we can migrate IDP to rdb2011 directly and wrap up the migration.

Change #1285324 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] idp: migrate IDP to Redis 8

https://gerrit.wikimedia.org/r/1285324

Change #1285336 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mediawiki-common: add rdb2011 and rdb2012 IPs

https://gerrit.wikimedia.org/r/1285336

Change #1285339 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] redioscope: codfw: replace rdb2007 with rdb2011 (Redis 8)

https://gerrit.wikimedia.org/r/1285339

Change #1285340 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] api-gateway: codfw: replace rdb2007 with rdb2011 (Redis 8)

https://gerrit.wikimedia.org/r/1285340

Change #1285341 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] changeprop-jobqueue: codfw: replace rdb2007 with rdb2011 (Redis 8)

https://gerrit.wikimedia.org/r/1285341

Change #1285342 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] changeprop: codfw: replace rdb2007 with rdb2011 (Redis 8)

https://gerrit.wikimedia.org/r/1285342

Change #1285343 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] ratelimit: codfw: replace rdb2007 with rdb2011 (Redis 8)

https://gerrit.wikimedia.org/r/1285343

Change #1285344 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] rest-gateway: codfw: replace rdb2007 with rdb2011 (Redis 8)

https://gerrit.wikimedia.org/r/1285344