Hosts releases1003 and releases2003 should be upgraded or replaced with bookworm hosts.
| 1 | failing over the backend of https://releases.wikimedia.org - the plan |
|---|---|
| 2 | |
| 3 | status quo: |
| 4 | |
| 5 | the service has 2 backends, one in eqiad and one in codfw. As of 2025-11-13 the host names are: |
| 6 | releases1003.eqiad.wmnet & releases2003.codfw.wmnet |
| 7 | |
| 8 | releases1003 is still on bullseye while releases2003 has already been upgraded to bookworm (though not trixie) |
| 9 | |
| 10 | The DNS name releases.discovery.wmnet determines which of the backends gets the traffic and it is currently an alias for releases1003.eqiad.wmnet, making eqiad the active DC. |
| 11 | |
| 12 | |
| 13 | actual goal: no backend is still running the outdated bullseye release. |
| 14 | |
| 15 | via steps: fail over traffic from 1003 to 2003; verify 2003 works fine; reimage 1003 |
| 16 | |
| 17 | A-1) Stop rsync and puppet on both hosts |
| 18 | |
| 19 | A) Hiera: in hieradata/common.yaml swap the values of "releases_server" and "releases_servers_failover" |
| 20 | |
| 21 | https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204933 |
| 22 | |
| 23 | what this does: changes which server is the source for rsyncing data between the servers, i.e. the server that releases should be uploaded to or created on. |
| 24 | |
| 25 | https://puppet-compiler.wmflabs.org/output/1204933/7609/releases1003.eqiad.wmnet/index.html |
| 26 | |
| 27 | B) jenkins service: |
| 28 | |
| 29 | prepare: control the service by DC name, not hostname: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204980 |
| 30 | |
| 31 | then simply switch from eqiad to codfw to mask/stop the service in the inactive DC and unmask/start it in the active DC |
| 32 | |
| 33 | https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204982 |
| 34 | |
| 35 | C) DNS: in templates/wmnet switch the releases.discovery.wmnet name to the other backend |
| 36 | |
| 37 | https://gerrit.wikimedia.org/r/c/operations/dns/+/1204684 |
| 38 | |
| 39 | what this does: changes which of the servers gets the traffic from the CDN/caching servers. The Apache Traffic Server (ATS) config (trafficserver/backend.yaml) maps releases.wikimedia.org to the discovery name, not to a specific host. Therefore the ATS config does not have to be changed, only DNS. |
| 40 | |
| 41 | actual steps: |
| 42 | |
| 43 | - update docs, create wiki fingerprint / host pages (https://wikitech.wikimedia.org/wiki/Releases.wikimedia.org#Documentation_updates) |
| 44 | - schedule downtime(s) |
| 45 | - disable puppet on both backends |
| 46 | - merge and deploy change A (Hiera) |
| 47 | - merge and deploy change B (jenkins service) |
| 48 | - ensure no users are uploading and that they are informed of the maintenance and the new server name |
| 49 | - re-enable puppet on both |
| 50 | -- verify rsync changes look good |
| 51 | -- verify jenkins service masked on old, unmasked on new |
| 52 | - merge and deploy change C (DNS) |
| 53 | -- verify discovery name points to new backend |
| 54 | -- tail apache logs on 2003 while making some requests to releases.wikimedia.org |
| 55 | -- run httpbb tests from deployment server for releases services |
| 56 | -- delete downtimes |
| 57 | -- announce to members of the releasers-* admin groups (get email addresses from admin.yaml) |
| 58 | -- end maintenance - close tickets as resolved |
| 59 | |
| 60 | - (later) |
| 61 | -- reimage 1003 with bookworm |
| 62 | |
| 63 | - (later) |
| 64 | -- reimage 1003 with trixie ... (and continue the cycle) |
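
The "verify discovery name points to new backend" step after change C can be scripted. The following is a minimal sketch, not part of the plan itself: it assumes that comparing the resolved addresses of releases.discovery.wmnet and the new backend is a sufficient check, and the function names `resolve_addresses` and `discovery_points_to` are hypothetical helpers introduced for illustration.

```python
import socket
from typing import Callable, Set


def resolve_addresses(name: str) -> Set[str]:
    """Return the set of IP addresses that `name` currently resolves to."""
    infos = socket.getaddrinfo(name, None)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def discovery_points_to(
    discovery_name: str,
    expected_backend: str,
    resolver: Callable[[str], Set[str]] = resolve_addresses,
) -> bool:
    """True if the discovery name resolves to the same addresses as the backend.

    Assumption: the discovery name is a plain alias for exactly one backend,
    so both names should yield identical, non-empty address sets.
    """
    discovery_addrs = resolver(discovery_name)
    backend_addrs = resolver(expected_backend)
    return bool(discovery_addrs) and discovery_addrs == backend_addrs
```

Calling `discovery_points_to("releases.discovery.wmnet", "releases2003.codfw.wmnet")` from a host that can resolve .wmnet names should return True once the DNS change has been deployed and cached records have expired. The `resolver` parameter exists only so the comparison can be exercised without network access.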