Redis recommends upgrading one major version at a time, which would mean upgrading the servers twice:
- Debian Bookworm and Redis 7.0
- Debian Trixie and Redis 8.0
However, we have active hardware refreshes in both datacenters (T418918: rdb101[56] implementation tracking and T418924: rdb201[34] implementation tracking). This gives us an opportunity to skip Bookworm entirely and move directly to Trixie, provided that there are no concerns from teams currently using redis_misc.
If possible, we should also use the OS upgrade reimage to re-IP the hosts, as tracked in T421711: ServiceOps: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets.
Services using redis_misc
Pair 1
| Port | DB | Usage | PoC/tag |
|---|---|---|---|
| 6378 | 2, 3 | Netbox tasks (db 2) and Netbox caching (db 3) | netbox Infrastructure-Foundations |
| 6379 | 0 | changeprop / cpjobqueue / api-gateway | MW-Interfaces-Team |
| 6380 | 0 | Ratelimit | MW-Interfaces-Team |
| 6381 | 0 | filebackend.php (redisLockManager) | MediaWiki-Platform-Team |
| 6382 | 0 | filebackend.php (redisLockManager) | |
Pair 2
| Port | DB | Usage | PoC/tag |
|---|---|---|---|
| 6378 | 0 | IDP (CAS-SSO) Production | Infrastructure-Foundations |
| 6378 | 1 | IDP (CAS-SSO) Test | |
| 6379 | 0 | changeprop / cpjobqueue / api-gateway | |
| 6380 | 0 | Ratelimit | |
| 6381 | 0 | filebackend.php (redisLockManager) | |
| 6382 | 0 | docker-registry | ServiceOps new |
How?
The new hosts will be reimaged directly to Trixie (Redis 8.0), and services will be migrated to them one at a time or in small batches. If any issues come up, we can roll back by pointing the affected service back at the old rdb hosts.
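Before pointing a service at a new host, it may be worth confirming that the host reports the expected Redis major version. The following is a minimal sketch, assuming the text of `INFO server` has already been fetched by some means; the function names and the sample payload are hypothetical and not part of any existing tooling.

```python
def redis_major_version(info_text: str) -> int:
    """Extract the major version from the text returned by `INFO server`."""
    for line in info_text.splitlines():
        line = line.strip()
        if line.startswith("redis_version:"):
            version = line.split(":", 1)[1]
            return int(version.split(".")[0])
    raise ValueError("redis_version field not found in INFO output")


def ready_for_cutover(info_text: str, expected_major: int = 8) -> bool:
    """Return True when the host reports the expected Redis major version."""
    return redis_major_version(info_text) == expected_major


# Trimmed, made-up INFO payload used only for illustration.
sample = "# Server\r\nredis_version:8.0.2\r\nredis_mode:standalone\r\n"
print(ready_for_cutover(sample))
```

The same check run against an old host would return False, which makes it usable as a guard in either direction of a cutover or rollback.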
Open questions for owners before proceeding
- Shall we migrate the data?
  - We need to decide whether to migrate existing Redis data to the new hosts or start fresh. Given that Redis is generally considered to be ephemeral storage, not migrating the data should be an acceptable risk.
  - How does data persistence affect your service?
- Application behavior under server unavailability
  - It is currently unknown how each of the services above behaves if its Redis storage becomes unavailable. This could be a good opportunity to test that in a controlled manner, since rolling back to the existing hosts would be easy.
  - Can your service tolerate a brief Redis unavailability?
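For services that use Redis purely as a cache, one way to tolerate a brief outage is to treat a connection failure like a cache miss. The sketch below illustrates that idea only; `UnavailableRedis` and `cached_lookup` are hypothetical names, and a real client library may raise its own exception types, which would need to be caught instead of the built-in ones used here. This pattern does not apply to uses such as locking, where a miss is not a safe default.

```python
from typing import Callable, Optional


class UnavailableRedis:
    """Hypothetical stand-in that fails like an unreachable Redis server."""

    def get(self, key: str) -> Optional[bytes]:
        raise ConnectionError("redis is unreachable")


def cached_lookup(client, key: str, compute: Callable[[], bytes]) -> bytes:
    """Return the cached value, falling back to compute() if Redis is down."""
    try:
        value = client.get(key)
    except (ConnectionError, TimeoutError):
        # Treat an outage like a cache miss instead of failing the request.
        value = None
    if value is None:
        value = compute()
    return value


# Simulate an outage: the lookup still returns the computed value.
print(cached_lookup(UnavailableRedis(), "greeting", lambda: b"hello"))
```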
Dashboard improvements
- Improve Grafana dashboards: https://grafana-rw.wikimedia.org/d/000000174/redis
  - Role-aware filtering via $role
  - cache hit ratio, memory total, connected clients
  - new panels:
    - connected clients & replicas
    - replication lag
    - cache hits vs misses
  - reorganise layout
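As a reference for the proposed panels, the sketch below shows how the cache hit ratio and a per-replica replication lag (in bytes of replication offset) can be derived from fields in Redis `INFO` output. How the Grafana panels actually obtain these metrics is not stated in this task, so this only illustrates the arithmetic; the function names and the sample payload are made up.

```python
def parse_info(info_text: str) -> dict:
    """Parse `INFO` output into a flat dict of field name to string value."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def hit_ratio(fields: dict) -> float:
    """keyspace_hits / (keyspace_hits + keyspace_misses), or 0.0 if no lookups."""
    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else 0.0


def replica_lag_bytes(fields: dict) -> dict:
    """Offset difference between the master and each replica entry (slaveN)."""
    master_offset = int(fields["master_repl_offset"])
    lags = {}
    for key, value in fields.items():
        if key.startswith("slave") and key[5:].isdigit():
            attrs = dict(item.split("=", 1) for item in value.split(","))
            lags[key] = master_offset - int(attrs["offset"])
    return lags


# Made-up INFO payload as it might look on a master with one replica.
sample = (
    "# Stats\r\n"
    "keyspace_hits:900\r\n"
    "keyspace_misses:100\r\n"
    "# Replication\r\n"
    "role:master\r\n"
    "connected_slaves:1\r\n"
    "slave0:ip=192.0.2.10,port=6379,state=online,offset=4200,lag=0\r\n"
    "master_repl_offset:4242\r\n"
)
fields = parse_info(sample)
print(hit_ratio(fields))
print(replica_lag_bytes(fields))
```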
Wikitech Updates
- Update the wikitech article accordingly