Migrate deployment servers (tin/mira) to jessie
Closed, Resolved, Public

Description

The deployment servers need to be migrated to jessie. Both mira and tin have been out of warranty for a year now, and the deployment servers seem like great candidates for a Ganeti instance instead of bare metal. Or does anyone insist on using dedicated hardware for those?

Event Timeline

Talked about it in our team meeting, and I think we're inclined towards bare metal here. Couple of reasons:

  1. Disk performance. Deployments require a lot of disk I/O: git and rsync both touch thousands of files constantly, and moving to a VM makes us a little worried. If this isn't a problem in Ganeti (like it definitely would be in OpenStack), then we could be convinced otherwise. Also: since the machines are out of warranty, upgrading to something with SSDs would be a good boost for performance as well.
  2. Paranoia about virtualization for mission-critical services. Being able to deploy needs very high availability--downtime is a major outage and cause for paging and immediate response. There's a general fear of "what if the host machine(s) go down and we can't deploy anymore?"

But if our fears are unfounded please let me know and we can look at using Ganeti instead :)

Now, the machines like terbium that just handle jobs (crons & one off tasks) I think are much better candidates for VMs.

I don't think it would cause problems: the I/O performance of the Ganeti clusters should be adequate for deployments (though of course bare metal is still faster). As for availability: we run important services like the LDAP servers on Ganeti, and in practice a Ganeti instance provides higher availability than a one-off bare metal system. If a Ganeti node dies, the VMs running on it are quickly available on the backup node, while a bare metal server needs to be reimaged on spare hardware, which takes at least an hour.

But let's disentangle migrating to jessie from replacing the hardware with either new hardware or a VM. Procedure-wise, I would propose we reimage mira to jessie after https://gerrit.wikimedia.org/r/#/c/308132/ is reviewed/merged and use it to tweak/test?

On the note about Ganeti vs bare metal performance for deploys: Before/If we migrate to Ganeti VMs for tin/mira I'd like a performance test of a full scap and other common actions. Just so we know what we're getting into.
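
A performance test like the one requested could be as simple as timing the same command several times on each host and comparing the spread. The following is only a sketch: the helper names are made up for illustration, and the command shown is a harmless placeholder, not an actual scap invocation.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Run argv `runs` times; return the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True)  # raises if the command fails
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (min, median, max) of the measured durations."""
    return min(durations), statistics.median(durations), max(durations)


# Placeholder command (assumption): substitute the deployment action to
# measure, and run the same script on bare metal and on the VM.
timings = time_command([sys.executable, "-c", "pass"])
print("min/median/max: %.3f / %.3f / %.3f s" % summarize(timings))
```

Reporting the median alongside min and max keeps one slow run (cold caches, a busy host) from dominating the comparison.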

I don't think it would cause problems: the I/O performance of the Ganeti clusters should be adequate for deployments (though of course bare metal is still faster). As for availability: we run important services like the LDAP servers on Ganeti, and in practice a Ganeti instance provides higher availability than a one-off bare metal system. If a Ganeti node dies, the VMs running on it are quickly available on the backup node, while a bare metal server needs to be reimaged on spare hardware, which takes at least an hour.

All this sounds ok to me :)

But let's disentangle migrating to jessie from replacing the hardware with either new hardware or a VM. Procedure-wise, I would propose we reimage mira to jessie after https://gerrit.wikimedia.org/r/#/c/308132/ is reviewed/merged and use it to tweak/test?

Yes, let's!

I've added a new deployment server mira02 based on jessie to deployment-prep. The arming of the keyholder went fine. There's a bug in "keyholder status", which uses the upstart-specific status command, but the keyholder seems to work in general.

@demon could you or anyone else from RelEng run a few more tests on that host? If everything looks fine on your side, I'd proceed with reimaging mira in production tomorrow.

I've added a new deployment server mira02

I thought we had decided that when mira.deployment-prep.eqiad.wmflabs was recreated it'd get the 'deployment-' prefix?

hashar added subscribers: dduvall, mmodell.

@mmodell @thcipriani @demon @dduvall can you check that everything is fine with mira02 on beta? I don't feel confident double-checking that it is working properly.

Also scap seems to rely on upstart, so I guess scap3 would need to be adjusted.
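
For scripts that have to work on both trusty (upstart) and jessie (systemd), one common approach is to detect the init system at runtime and pick the matching status command. The sketch below is not the actual keyholder or scap fix: the function names are hypothetical, and `keyholder-agent` is assumed as the service name purely for illustration. The directory check mirrors what systemd's `sd_booted()` does.

```python
import os


def uses_systemd(marker="/run/systemd/system"):
    """True when the host was booted with systemd (the sd_booted() check)."""
    return os.path.isdir(marker)


def status_command(service, systemd=None):
    """Return the argv that queries `service` on the running init system."""
    if systemd is None:
        systemd = uses_systemd()
    if systemd:
        return ["systemctl", "status", service]
    # upstart ships a standalone `status <job>` command
    return ["status", service]


# Service name is an assumption for illustration only.
print(status_command("keyholder-agent"))
```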

Mentioned in SAL (#wikimedia-releng) [2016-09-15T10:48:17Z] <hashar> beta: cherry picking moritzm patch https://gerrit.wikimedia.org/r/#/c/310793/ "Also handle systemd in keyholder script" T144578

I've added a new deployment server mira02

I thought we had decided that when mira.deployment-prep.eqiad.wmflabs was recreated it'd get the 'deployment-' prefix?

With labs now having a subdomain named after the project, I don't think the prefixes are still needed, e.g. mira02.deployment-prep.eqiad.wmflabs.

I also built trebuchet-trigger for jessie and uploaded it to apt.wikimedia.org

Change 311108 had a related patch set uploaded (by Muehlenhoff):
deployment_server: Daemonise redis when running on systemd

https://gerrit.wikimedia.org/r/311108

A few jessie-related changes have been sorted out, mira02.deployment-prep.eqiad.wmflabs should be ready for testing.

Change 311108 merged by Muehlenhoff:
deployment_server: Daemonise redis when running on systemd

https://gerrit.wikimedia.org/r/311108

@mmodell @thcipriani @demon @dduvall can you check that everything is fine with mira02 on beta? I don't feel confident double-checking that it is working properly.

Also scap seems to rely on upstart, so I guess scap3 would need to be adjusted.

As @hashar pointed out in the team meeting, it seems that /srv is a bit too small on this instance; it already has only 2.3GB of space left:

thcipriani@mira02:/srv/deployment/test/testrepo$ df -h
Filesystem                          Size  Used Avail Use% Mounted on
udev                                 10M     0   10M   0% /dev
tmpfs                               792M   87M  705M  11% /run
/dev/vda3                            19G  5.7G   13G  32% /
tmpfs                               2.0G     0  2.0G   0% /dev/shm
tmpfs                               5.0M     0  5.0M   0% /run/lock
tmpfs                               2.0G     0  2.0G   0% /sys/fs/cgroup
/dev/mapper/vd-second--local--disk   21G   17G  2.3G  89% /srv
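
A check like this is easy to script so a nearly full /srv is noticed before a deploy fails. The following is a minimal sketch that parses `df -h` style output; the `nearly_full` helper and the 85% threshold are assumptions for illustration, not part of any existing tooling.

```python
SAMPLE = """\
Filesystem                          Size  Used Avail Use% Mounted on
udev                                 10M     0   10M   0% /dev
/dev/vda3                            19G  5.7G   13G  32% /
/dev/mapper/vd-second--local--disk   21G   17G  2.3G  89% /srv
"""


def nearly_full(df_output, threshold=85):
    """Return [(mount_point, use_percent)] for filesystems at or above threshold."""
    results = []
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue  # ignore rows without a usable Use% column
        use_percent = int(fields[4].rstrip("%"))
        if use_percent >= threshold:
            results.append((fields[5], use_percent))
    return results


print(nearly_full(SAMPLE))  # [('/srv', 89)]
```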

We'll likely need to rebuild it. While doing so, I feel like the right name is, as @Krenair suggested, deployment-mira02.

From my brief checking, keyholder, scap3, and trebuchet all seem to function correctly. That is, you cannot deploy from this machine currently, but you should be able to if it were made the deployment master. It would be good to test that functionality before everything hits production.

I've rebuilt the host as deployment-mira02 with /srv/ on a separate 20 GB partition and deleted the mira02 instance. Please let me know if further tests are successful and we can proceed to reimage mira in production.

I've rebuilt the host as deployment-mira02 with /srv/ on a separate 20 GB partition and deleted the mira02 instance.

I think we need to use an m1.large instance for deployment hosts. mira02 was an m1.medium as well, with role::labs::lvm::srv applied, giving it the same 20GB /srv/ partition. On deployment-tin the current size of /srv/ is 18GB, which means that when the https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ job is turned back on (following labs database maintenance), deployment-mira02 will be left with very little room on /srv/.

Also, since l10n updates happen on the deployment machines, they're fairly sensitive to the number of cores. 2 cores will make updates a little slower, but that's less of a big deal than running out of space from inception.

Sorry it has taken me so long to give this the attention it needs to move forward :((

Let's get a custom flavor for the deployment servers.

8 CPUs to get faster l10n rebuild
8 GB RAM: 2G for system, 6G for cache, that is what deployment-tin has (m1.large)
60 GB disk: 20 G for system, 40 G for /srv
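
As a quick sanity check of the disk sizing, here is the arithmetic for the 18 GB that /srv currently holds on deployment-tin against a 20 GB and a 40 GB partition. The helper name is made up for illustration.

```python
def srv_headroom(partition_gb, used_gb):
    """Return (free_gb, percent_used) for a partition of the given size."""
    free_gb = partition_gb - used_gb
    percent_used = round(100 * used_gb / partition_gb)
    return free_gb, percent_used


# 18 GB is the /srv usage reported for deployment-tin.
print(srv_headroom(20, 18))  # (2, 90)
print(srv_headroom(40, 18))  # (22, 45)
```

With 40 GB the same data leaves more than half of /srv free, which gives the beta-scap job room to grow.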

Filed as T146209

Mentioned in SAL (#wikimedia-releng) [2016-09-20T20:47:53Z] <hashar> Creating deployment-mira instance with flavor c8.m8.s60 (8 cpu, 8G RAM and 60G disk) T144578

Mentioned in SAL (#wikimedia-releng) [2016-09-20T20:54:29Z] <hashar> from deployment-tin for T144578, accept ssh host key of deployment-mira : sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mira.deployment-prep.eqiad.wmflabs

Change 311760 had a related patch set uploaded (by Hashar):
Beta: change deployment-mira02 to deployment-mira

https://gerrit.wikimedia.org/r/311760

I did a sprint tonight:

  • Got a new flavor in openstack with a larger disk (T146209), huge thanks to Andrew for creating it upon request
  • Spawned deployment-mira with proper puppet classes
  • Repurposed Tyler's change https://gerrit.wikimedia.org/r/311760 and cherry picked it
  • Ran puppet, did a bunch of random sudo -u johnDoe ssh to validate fingerprints

Scap passed https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120899/

I have deleted deployment-mira02, whose disk was too small. What is left to do:

Then we can look at tin. That would be roughly the same, with the addition of it being a Jenkins slave :(

Change 311760 merged by Muehlenhoff:
Beta: change deployment-mira02 to deployment-mira

https://gerrit.wikimedia.org/r/311760

Change 311946 had a related patch set uploaded (by Hashar):
beta: add hiera deployment_server var from wikitech

https://gerrit.wikimedia.org/r/311946

Change 311947 had a related patch set uploaded (by Hashar):
beta: switch deploy server to deployment-mira

https://gerrit.wikimedia.org/r/311947

Mentioned in SAL (#wikimedia-releng) [2016-09-21T10:07:51Z] <hashar> Making deployment-mira a Jenkins slave by applying puppet class role::ci::slave::labs::common T144578

Change 311946 merged by Muehlenhoff:
beta: add hiera deployment_server var from wikitech

https://gerrit.wikimedia.org/r/311946

Mentioned in SAL (#wikimedia-releng) [2016-09-21T11:24:59Z] <hashar> removing Jenkins slave deployment-tin , deployment-mira is the new deployment master T144578

status for beta cluster

deployment-mira is the new master running Jessie. The Jenkins jobs are running on it. There are still some details being polished, such as missing Zend extensions, but that is going to be fixed soonish (T146286).

deployment-tin is Trusty, going to be reimaged.

My huge thanks to @elukey / @MoritzMuehlenhoff who aced it and @thcipriani for all the support with scap/keyholder.

Change 311947 merged by Muehlenhoff:
beta: switch deploy server to deployment-mira

https://gerrit.wikimedia.org/r/311947

mira is now running jessie. Please give it some more testing. For migrating tin, we could temporarily make mira the primary deployment server, set up tin in the meantime, and then switch back?

mira is now running jessie. Please give it some more testing

The next deployment window is the week after next due to the ops offsite. I guess we could test by syncing changes to some README file? Or let it take over running l10nupdate?

Change 312654 had a related patch set uploaded (by Hashar):
beta: drop deployment-tin add deployment-tin02

https://gerrit.wikimedia.org/r/312654

Change 312654 merged by Elukey:
beta: update deployment-tin IP and make it master

https://gerrit.wikimedia.org/r/312654

Change 315205 had a related patch set uploaded (by Hashar):
Switch primary deployment server from tin to mira

https://gerrit.wikimedia.org/r/315205

Change 315205 merged by Muehlenhoff:
Switch primary deployment server from tin to mira

https://gerrit.wikimedia.org/r/315205

mira is now the primary deployment server. tin will be reimaged to jessie on the 18th. After that we can switch back the primary deployment server to tin.

Change 315709 had a related patch set uploaded (by Mobrovac):
Deployment: use mira instead of tin

https://gerrit.wikimedia.org/r/315709

Change 315710 had a related patch set uploaded (by Mobrovac):
Deployment: Switch from tin to mira

https://gerrit.wikimedia.org/r/315710

Change 315721 had a related patch set uploaded (by Mobrovac):
Deployment: Switch to mira

https://gerrit.wikimedia.org/r/315721

Change 315725 had a related patch set uploaded (by Mobrovac):
Deployment: Switch from tin to mira

https://gerrit.wikimedia.org/r/315725

Change 315726 had a related patch set uploaded (by Mobrovac):
Deployment: Switch from tin to mira

https://gerrit.wikimedia.org/r/315726

Change 315727 had a related patch set uploaded (by Mobrovac):
Deployment: Switch from tin to mira

https://gerrit.wikimedia.org/r/315727

Change 315710 merged by Mobrovac:
Deployment: Switch from tin to mira

https://gerrit.wikimedia.org/r/315710

Change 315726 merged by Mobrovac:
Deployment: Switch from tin to mira

https://gerrit.wikimedia.org/r/315726

Change 315721 merged by Mobrovac:
Deployment: Switch to mira

https://gerrit.wikimedia.org/r/315721

Change 315727 merged by Mobrovac:
Deployment: Switch from tin to mira

https://gerrit.wikimedia.org/r/315727

Change 315725 merged by jenkins-bot:
Deployment: Switch from tin to mira

https://gerrit.wikimedia.org/r/315725

status

Both tin.eqiad.wmnet and mira.codfw.wmnet have been reimaged to Jessie, which essentially resolves this task. Kudos to @MoritzMuehlenhoff @elukey and @demon :]

What is left to do is to switch the primary master to be tin.eqiad.wmnet since the other way around tends to confuse multiple people.