
Replace deployment-etcd-01 with a Buster host
Closed, Resolved (Public)

Description

deployment-etcd-01.deployment-prep.eqiad.wmflabs is running Jessie and needs to be replaced with a Buster machine.

Event Timeline

taavi triaged this task as Medium priority.

The current instance is running etcd 2.2.1. Buster has 3.2.26 available.

etcdctl is convinced that it's running on wikimedia.cloud and refuses to connect, even when specifying a .wmflabs address:

root@deployment-etcd-01:~# etcdctl -C https://deployment-etcd-01.deployment-prep.eqiad.wmflabs:2379 ls
Error:  client: etcd cluster is unavailable or misconfigured
error #0: x509: certificate is valid for deployment-etcd-01.deployment-prep.eqiad.wmflabs, deployment-etcd-01, etcd.deployment-prep.eqiad.wmflabs, etcd.deployment-prep.eqiad.wmflabs, not deployment-etcd-01.deployment-prep.eqiad1.wikimedia.cloud

Looking at it, the instance holds conftool data:

root@deployment-etcd-01:~# curl -L https://deployment-etcd-01.deployment-prep.eqiad.wmflabs:2379/v2/keys
{"action":"get","node":{"dir":true,"nodes":[{"key":"/conftool","dir":true,"modifiedIndex":5,"createdIndex":5},{"key":"/test","value":"1","modifiedIndex":4,"createdIndex":4}]}}
root@deployment-etcd-01:~# curl -L https://deployment-etcd-01.deployment-prep.eqiad.wmflabs:2379/v2/keys/conftool
{"action":"get","node":{"key":"/conftool","dir":true,"nodes":[{"key":"/conftool/v1","dir":true,"modifiedIndex":5,"createdIndex":5}],"modifiedIndex":5,"createdIndex":5}}
root@deployment-etcd-01:~# curl -L https://deployment-etcd-01.deployment-prep.eqiad.wmflabs:2379/v2/keys/conftool/v1
{"action":"get","node":{"key":"/conftool/v1","dir":true,"nodes":[{"key":"/conftool/v1/mediawiki-config","dir":true,"modifiedIndex":48,"createdIndex":48},{"key":"/conftool/v1/services","dir":true,"modifiedIndex":5,"createdIndex":5}],"modifiedIndex":5,"createdIndex":5}}
root@deployment-etcd-01:~# curl -L https://deployment-etcd-01.deployment-prep.eqiad.wmflabs:2379/v2/keys/conftool/v1/services
{"action":"get","node":{"key":"/conftool/v1/services","dir":true,"nodes":[{"key":"/conftool/v1/services/cache_maps","dir":true,"modifiedIndex":19,"createdIndex":19},{"key":"/conftool/v1/services/scb","dir":true,"modifiedIndex":5,"createdIndex":5},{"key":"/conftool/v1/services/swift","dir":true,"modifiedIndex":23,"createdIndex":23},{"key":"/conftool/v1/services/thumbor","dir":true,"modifiedIndex":32,"createdIndex":32},{"key":"/conftool/v1/services/appserver","dir":true,"modifiedIndex":27,"createdIndex":27},{"key":"/conftool/v1/services/cache_text","dir":true,"modifiedIndex":11,"createdIndex":11},{"key":"/conftool/v1/services/dns","dir":true,"modifiedIndex":34,"createdIndex":34},{"key":"/conftool/v1/services/imagescaler","dir":true,"modifiedIndex":35,"createdIndex":35},{"key":"/conftool/v1/services/maps","dir":true,"modifiedIndex":46,"createdIndex":46},{"key":"/conftool/v1/services/parsoid","dir":true,"modifiedIndex":22,"createdIndex":22},{"key":"/conftool/v1/services/prometheus","dir":true,"modifiedIndex":38,"createdIndex":38},{"key":"/conftool/v1/services/videoscaler","dir":true,"modifiedIndex":14,"createdIndex":14},{"key":"/conftool/v1/services/cache_misc","dir":true,"modifiedIndex":6,"createdIndex":6},{"key":"/conftool/v1/services/eventbus","dir":true,"modifiedIndex":13,"createdIndex":13},{"key":"/conftool/v1/services/jobrunner","dir":true,"modifiedIndex":37,"createdIndex":37},{"key":"/conftool/v1/services/pdf","dir":true,"modifiedIndex":31,"createdIndex":31},{"key":"/conftool/v1/services/restbase","dir":true,"modifiedIndex":10,"createdIndex":10},{"key":"/conftool/v1/services/sca","dir":true,"modifiedIndex":16,"createdIndex":16},{"key":"/conftool/v1/services/testserver","dir":true,"modifiedIndex":28,"createdIndex":28},{"key":"/conftool/v1/services/api_appserver","dir":true,"modifiedIndex":47,"createdIndex":47},{"key":"/conftool/v1/services/aqs","dir":true,"modifiedIndex":36,"createdIndex":36},{"key":"/conftool/v1/services/cache_upload","dir":true,"modifiedIndex":9
,"createdIndex":9},{"key":"/conftool/v1/services/elasticsearch","dir":true,"modifiedIndex":18,"createdIndex":18},{"key":"/conftool/v1/services/phabricator","dir":true,"modifiedIndex":24,"createdIndex":24}],"modifiedIndex":5,"createdIndex":5}}
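
The directory-by-directory walk above can probably be collapsed into a single request. A minimal sketch, assuming the etcd v2 keys API's `recursive=true` query parameter and the same endpoint as in the transcript; `list_etcd_keys` is a hypothetical helper name, and the extraction is a plain-text match rather than a real JSON parser:

```shell
# Print every "key" value found in an etcd v2 JSON response read from stdin.
# This is a text match, not a JSON parser: it assumes keys contain no
# escaped double quotes, which holds for the conftool paths shown above.
list_etcd_keys() {
    grep -o '"key":"[^"]*"' | cut -d '"' -f 4
}

# Example against the old host (not run here):
#   curl -sL 'https://deployment-etcd-01.deployment-prep.eqiad.wmflabs:2379/v2/keys/conftool?recursive=true' | list_etcd_keys
```

The output is one key per line, which is easier to scan or diff than the raw JSON.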

Beta does not use etcd for db config. Is it still used for other services?

The etcd v2 -> v3 migration looks annoying. The cluster first needs to be upgraded to 2.3, then to 3.0, then to 3.1, and only after that to 3.2.
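
The ordering described in that comment can be written down as a small helper. A minimal sketch: `etcd_upgrade_steps` is a hypothetical name, and the version list is taken only from the comment above, not checked against the etcd upgrade guides:

```shell
# Print the minor releases to step through, one per line, when moving
# from one etcd minor version to another. The list encodes the order
# described above: 2.2 -> 2.3 -> 3.0 -> 3.1 -> 3.2.
etcd_upgrade_steps() {
    from=$1
    to=$2
    started=0
    for v in 2.2 2.3 3.0 3.1 3.2; do
        # Emit every version after the starting one.
        if [ "$started" -eq 1 ]; then
            echo "$v"
        fi
        if [ "$v" = "$from" ]; then
            started=1
        fi
        # Stop once the target version has been reached.
        if [ "$v" = "$to" ]; then
            break
        fi
    done
}
```

Calling `etcd_upgrade_steps 2.2 3.2` lists four intermediate upgrades, which is what makes an in-place migration unattractive compared with importing the data into a fresh host.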

Caller survey:

taavi@deployment-etcd-01:/var/log/nginx$ sudo cat etcd_access.log | cut -d ' ' -f1 | sort | uniq -c  | sort 
   1098 172.16.4.16
     13 172.16.5.46
    142 172.16.1.115
   3122 172.16.4.119
    628 172.16.4.18
    968 172.16.4.98
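
The counts above come out in text order rather than numeric order because the final `sort` compares the lines as strings. A small variant that lists the busiest callers first, assuming the client address is the first space-separated field of each log line as in the command above; `caller_survey` is a hypothetical helper name:

```shell
# Count requests per client address in an access log and print the
# busiest callers first. Assumes the address is the first field of
# each line, separated from the rest by a single space.
caller_survey() {
    cut -d ' ' -f 1 "$1" | sort | uniq -c | sort -rn
}

# Example (not run here; the log may need root to read):
#   caller_survey /var/log/nginx/etcd_access.log
```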

taavi@deployment-etcd-01:/var/log/nginx$ sudo cat etcd_access.log | cut -d ' ' -f1 | sort | uniq | sort | xargs -L 1 host
115.1.16.172.in-addr.arpa domain name pointer deployment-parsoid11.deployment-prep.eqiad1.wikimedia.cloud.
119.4.16.172.in-addr.arpa domain name pointer deployment-mediawiki-07.deployment-prep.eqiad1.wikimedia.cloud.
16.4.16.172.in-addr.arpa domain name pointer deployment-mwmaint01.deployment-prep.eqiad1.wikimedia.cloud.
18.4.16.172.in-addr.arpa domain name pointer deployment-deploy01.deployment-prep.eqiad1.wikimedia.cloud.
98.4.16.172.in-addr.arpa domain name pointer deployment-jobrunner03.deployment-prep.eqiad1.wikimedia.cloud.
46.5.16.172.in-addr.arpa domain name pointer deployment-etcd-01.deployment-prep.eqiad1.wikimedia.cloud.

Might be related: found some DNS records, _etcd._tcp.beta.wmflabs.org. and _etcd_server._tcp.beta.wmflabs.org., that are pointing to a non-existent instance that was deleted in early 2019: T218729#5140552.

Mentioned in SAL (#wikimedia-releng) [2021-03-05T13:40:20Z] <Majavah> create deployment-etcd02 and sign its puppet certificate T276462

Mentioned in SAL (#wikimedia-releng) [2021-03-05T17:50:13Z] <Majavah> switch deployment-prep hiera key etcd_host to use deployment-etcd02 ref T276462

Change 668751 had a related patch set uploaded (by Majavah; owner: Majavah):
[operations/mediawiki-config@master] betacluster: switch etcd to deployment-etcd02

https://gerrit.wikimedia.org/r/668751

progress update:

  • deployment-etcd02 is now running etcd v3 and had conftool data imported from deployment-etcd-01
  • I switched over deployment-prep global hiera key etcd_host to the new host.
  • There is a mediawiki-config patch, but it has not been merged or deployed yet.
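
One way to sanity-check an import like this is to compare the key names on both hosts. A minimal sketch, assuming two plain-text files with one key per line (however they were produced); `missing_keys` and the file names in the example are hypothetical:

```shell
# Print the keys listed in the first file that are absent from the second.
# Both inputs are sorted and de-duplicated first because comm requires
# sorted input.
missing_keys() {
    old_sorted=$(mktemp)
    new_sorted=$(mktemp)
    sort -u "$1" > "$old_sorted"
    sort -u "$2" > "$new_sorted"
    # -23 hides lines unique to the second file and lines common to both.
    comm -23 "$old_sorted" "$new_sorted"
    rm -f "$old_sorted" "$new_sorted"
}

# Example (hypothetical file names):
#   missing_keys etcd-01-keys.txt etcd02-keys.txt
```

Empty output means every key from the old host is also present on the new one. It says nothing about whether the values match.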

Change 668751 merged by jenkins-bot:
[operations/mediawiki-config@master] betacluster: switch etcd to deployment-etcd02

https://gerrit.wikimedia.org/r/668751

Mentioned in SAL (#wikimedia-releng) [2021-03-05T19:14:59Z] <Majavah> beta cluster etcd was switched from deployment-etcd-01 to deployment-etcd02 ref T276462

Mentioned in SAL (#wikimedia-releng) [2021-03-05T19:30:30Z] <Majavah> shutdown deployment-etcd-01 to see if anything breaks, will delete if nothing has broken during next week T276462

Mentioned in SAL (#wikimedia-releng) [2021-03-11T16:49:07Z] <Majavah> delete deployment-etcd-01 T276462