Toolforge: move docker nodes from eqiad to eqiad1
Closed, Resolved, Public

Description

We need to move and/or rebuild docker-related servers from eqiad to eqiad1.

Affected hosts:

  • tools-docker-builder-05 (rebuild in eqiad1 to try puppet code)
  • tools-docker-registry-01 (rebuild using new puppet code)
  • tools-docker-registry-02 (rebuild using new puppet code)

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2019-01-11T10:46:27Z] <arturo> T213418 migrating tools-docker-registry-02 from eqiad to eqiad1

Mentioned in SAL (#wikimedia-cloud) [2019-01-11T10:51:44Z] <arturo> T213418 created tools-docker-builder-06 in eqiad1

Change 483731 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: refactor docker builder profile

https://gerrit.wikimedia.org/r/483731

Change 483731 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: refactor docker builder profile

https://gerrit.wikimedia.org/r/483731

Mentioned in SAL (#wikimedia-cloud) [2019-01-11T11:55:14Z] <arturo> T213418 shutdown tools-docker-builder-05, will give a grace period before deleting the VM

Change 483739 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: clush: update references to tools-docker-builder

https://gerrit.wikimedia.org/r/483739

Change 483739 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: clush: update references to tools-docker-builder

https://gerrit.wikimedia.org/r/483739

Change 483763 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: docker builder: missing infrastructure profile

https://gerrit.wikimedia.org/r/483763

Change 483763 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: docker builder: missing infrastructure profile

https://gerrit.wikimedia.org/r/483763

Change 483765 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: refactor docker registry profile

https://gerrit.wikimedia.org/r/483765

The docker-registry-01 image has about 50 GB of storage in use (the registry data itself), so the migration script will take a long time to move the VM from main to eqiad1.
Currently the registry data doesn't seem to be synced between the registry servers (??), so there is no way to do this move without causing downtime for the registry.

I could:

  • just ignore the downtime and simply move docker-registry-01 to eqiad1
  • manually sync the registry data from docker-registry-01 to docker-registry-02, then do the floating IP failover, and then move docker-registry-01
  • fail over the floating IP, rebuild all docker containers in the registry and push them to the new registry
  • use this chance to introduce some syncing mechanism (rsync?), wait for the automatic sync to complete, and then move docker-registry-01 after the floating IP failover.

I believe the registry is a key piece of Toolforge and deserves a more robust HA deployment, so that we don't have to rebuild the whole registry in case of a disaster.
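For the rsync idea in the options above, here is a rough sketch of what a one-way sync could look like. It is an illustration only, not what was deployed: the function names are hypothetical, the /srv/registry path in the usage note is an assumption about where the registry data lives, and it assumes SSH access between the two registry hosts. It defaults to a dry run so nothing is copied or deleted by accident.

```python
import shlex
import subprocess


def build_rsync_command(source_host, source_dir, dest_dir, dry_run=True):
    """Build an rsync argv list that mirrors source_dir on source_host into dest_dir.

    Hypothetical helper: host names and paths are supplied by the caller.
    """
    # --archive preserves permissions, ownership and timestamps;
    # --delete removes files on the destination that no longer exist on the source.
    cmd = ["rsync", "--archive", "--delete"]
    if dry_run:
        # Only report what would change, without touching the destination.
        cmd.append("--dry-run")
    # A trailing slash on the source makes rsync copy the directory contents
    # rather than nesting the directory itself inside dest_dir.
    cmd.append(f"{source_host}:{source_dir.rstrip('/')}/")
    cmd.append(f"{dest_dir.rstrip('/')}/")
    return cmd


def sync_registry(source_host, source_dir, dest_dir, dry_run=True):
    """Run the rsync command on the destination host and raise if it fails."""
    cmd = build_rsync_command(source_host, source_dir, dest_dir, dry_run=dry_run)
    print(shlex.join(cmd))
    return subprocess.run(cmd, check=True)
```

Run on the standby server, something like `sync_registry("tools-docker-registry-01", "/srv/registry", "/srv/registry")` would print the command and perform a dry run; passing `dry_run=False` would do the real copy. Running it from a timer would give the periodic sync, although a sync that runs while images are being pushed could copy a half-written state.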

Mentioned in SAL (#wikimedia-cloud) [2019-01-14T16:44:54Z] <arturo> T213418 docker-registry.tools.wmflabs.org point floating IP to tools-docker-registry-02

Apparently, the docker-registry is now served by tools-docker-registry-02, which lives in eqiad1. I couldn't reuse the floating IP because the old and new regions use different ranges, so I simply updated the DNS record for docker-registry.tools.wmflabs.org.

From home, I can query the catalog on the new server:

curl https://docker-registry.tools.wmflabs.org/v2/_catalog
{"repositories":["grrrit","jessie-toollabs","jessie-wikimedia","jupyterhub-hub","jupyterhub-proxy","jupyterkube","nagf","pause","paws-cull","paws-hub","paws-hub-hail-mary","paws-mysql-proxy","paws-proxy","paws-public-nginx","paws-public-renderer","paws-query-killer","paws-singleuser-sample","pawshub","pawsuser","tiller","toollabs-base","toollabs-golang-base","toollabs-golang-web","toollabs-interactive","toollabs-java-base","toollabs-java-web","toollabs-jdk8-base","toollabs-jdk8-web","toollabs-nodejs-base","toollabs-nodejs-web","toollabs-php-base","toollabs-php-web","toollabs-php72-base","toollabs-php72-web","toollabs-python-base","toollabs-python-web","toollabs-python2-base","toollabs-python2-web","toollabs-ruby-base","toollabs-ruby-web","toollabs-static-web","toollabs-stretch","toollabs-tcl-base","toollabs-tcl-web","wikimedia-jessie","wikimedia-trusty"]}

On the server:

Jan 14 16:59:15 tools-docker-registry-02 docker-registry[17710]: time="2019-01-14T16:59:15Z" level=info msg="response completed" http.request.host=docker-registry.tools.wmflabs.org http.request.id=e4eaf33b-9d55-4140-a929-6854eda58eb0 http.request.method=GET http.request.remoteaddr=x.x.x.x http.request.uri="/v2/_catalog" http.request.useragent="curl/7.62.0" http.response.contenttype="application/json; charset=utf-8" http.response.duration=10.569601ms http.response.status=200 http.response.written=868 instance.id=335bf41d-8012-4843-8a5b-1aef8e6a8601 version="v2.1.1+debian"
Jan 14 16:59:15 tools-docker-registry-02 docker-registry[17710]: 127.0.0.1 - - [14/Jan/2019:16:59:15 +0000] "GET /v2/_catalog HTTP/1.1" 200 868 "" "curl/7.62.0"
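The catalog query above can also be used to compare two registries after a migration. The following is a minimal sketch under some assumptions: the function names are hypothetical, both registries are reachable over HTTP(S) without authentication, and the whole catalog fits in a single response (the endpoint can paginate, which this sketch ignores). It compares repository names only, not tags or image layers.

```python
import json
import urllib.request


def parse_catalog(body):
    """Extract the set of repository names from a /v2/_catalog JSON response."""
    return set(json.loads(body)["repositories"])


def fetch_catalog(base_url, timeout=10):
    """Fetch and parse the catalog of the registry at base_url (single page only)."""
    url = f"{base_url.rstrip('/')}/v2/_catalog"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_catalog(resp.read().decode("utf-8"))


def missing_repositories(old_repos, new_repos):
    """Return the repositories present in the old registry but absent from the new one."""
    return sorted(set(old_repos) - set(new_repos))
```

For example, `missing_repositories(fetch_catalog(old_url), fetch_catalog(new_url))` should return an empty list if every repository name was carried over, where `old_url` and `new_url` are whatever addresses the two servers answer on.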

Mentioned in SAL (#wikimedia-cloud) [2019-01-15T14:21:42Z] <arturo> T213418 put a backup of the docker registry in NFS just in case: aborrero@tools-docker-registry-02:$ sudo cp /srv/registry/registry.tar.gz /data/project/.system_sge/docker-registry-backup/

I've decided to create two brand-new servers, tools-docker-registry-03 and tools-docker-registry-04, using the new puppet code and including the code for T213695.

Registry data has been copied into tools-docker-registry-03/04. I will do some more tests, and then I think I'll be ready to migrate the registry to the new machines.

Mentioned in SAL (#wikimedia-cloud) [2019-01-16T14:23:59Z] <arturo> T213418 allocate floating IPs for tools-docker-registry-03 & 04

Mentioned in SAL (#wikimedia-cloud) [2019-01-16T14:34:52Z] <arturo> T213418 point docker-registry.tools.wmflabs.org to tools-docker-registry-03 (was in -02)

Mentioned in SAL (#wikimedia-cloud) [2019-01-16T16:38:18Z] <arturo> T213418 shutdown tools-docker-registry-01 and 02. Will delete the instances in a week or so

Change 483765 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: refactor docker registry profile

https://gerrit.wikimedia.org/r/483765

Generated some docs: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Docker-registry

Will close the task once the grace period (one week or so) expires for the old VMs, which are in shutdown state.

Mentioned in SAL (#wikimedia-cloud) [2019-01-24T09:45:35Z] <arturo> T213418 delete tools-docker-builder-05 and tools-docker-registry-01

Mentioned in SAL (#wikimedia-cloud) [2019-01-24T09:46:15Z] <arturo> T213418 delete tools-docker-registry-02