Description

The deployment servers need to be migrated to jessie. Both mira and tin have been out of warranty for a year now, and the deployment servers seem like great candidates for a Ganeti instance instead of bare metal. Or does anyone insist on using dedicated hardware for those?

Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | MoritzMuehlenhoff | T143536 Upgrade all mw* servers to debian jessie |
| Resolved | | MoritzMuehlenhoff | T144578 Migrate deployment servers (tin/mira) to jessie |
| Resolved | | MoritzMuehlenhoff | T144043 Make keyholder work with systemd |
| Resolved | | Andrew | T146209 OpenStack flavor for beta cluster deployment servers |
| Resolved | | hashar | T146286 mwscript on jessie mediawiki fails |
Event Timeline
Talked about it in our team meeting, and I think we're inclined towards bare metal here. Couple of reasons:
- Disk performance. Deployments require a lot of disk io: git and rsync both touch thousands of files constantly and moving to a VM makes us a little worried. If this isn't a problem in Ganeti (like it definitely would be in OpenStack), then we could be convinced otherwise. Also: since the machines are out of warranty, upgrading to something with SSDs would be a good boost for performance as well.
- Paranoia about virtualization for mission-critical services. Being able to deploy needs very high availability--downtime is a major outage and cause for paging and immediate response. There's a general fear of "what if the host machine(s) go down and we can't deploy anymore?"
But if our fears are unfounded please let me know and we can look at using Ganeti instead :)
Now, machines like terbium that just handle jobs (crons & one-off tasks) are, I think, much better candidates for VMs.
I don't think it would cause problems: the I/O performance of the Ganeti clusters should be adequate for deployments (though bare metal is of course still faster). As for availability, we run important services like the LDAP servers on Ganeti, and in practice a Ganeti instance provides higher availability than a one-off bare-metal system: if a Ganeti node dies, the VMs running on it are quickly available on the backup node, while a bare-metal server needs to be reimaged on spare hardware, which takes at least an hour.
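To illustrate the failover point: with DRBD-backed instances, bringing a VM up on its secondary node is a single command on the Ganeti master (a sketch; the instance name is only an example):

```
# On the Ganeti master: bring the instance up on its secondary node.
# Add --ignore-consistency if the primary node is already down.
sudo gnt-instance failover mira.example.wmnet
```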
But let's disentangle migrating to jessie from replacing the hardware (with either new hardware or a VM). Procedure-wise, I would propose we reimage mira to jessie after https://gerrit.wikimedia.org/r/#/c/308132/ is reviewed/merged and use it to tweak/test?
On the note about Ganeti vs bare metal performance for deploys: before or if we migrate to Ganeti VMs for tin/mira, I'd like a performance test of a full scap and other common actions, just so we know what we're getting into.
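Something along these lines on both setups would give comparable numbers (the commands are only illustrative; whatever we normally run during a deploy window is what should actually be measured):

```
# Time a full sync and a single-file sync from the deployment master.
time scap sync "timing test for T144578"
time scap sync-file wmf-config/InitialiseSettings.php "timing test for T144578"
```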
All this sounds ok to me :)
> But let's disentangle migrating to jessie from replacing the hardware (with either new hardware or a VM). Procedure-wise, I would propose we reimage mira to jessie after https://gerrit.wikimedia.org/r/#/c/308132/ is reviewed/merged and use it to tweak/test?

Yes, let's!
I've added a new deployment server mira02 based on jessie to deployment-prep. Arming the keyholder went fine; there's a bug in "keyholder status", which uses the upstart-specific status command, but the keyholder seems to work in general.
@demon could you or anyone else from RelEng run a few more tests on that host? If everything looks fine on your side, I'd proceed with reimaging mira in production tomorrow.
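For reference, a minimal sketch of an init-system-aware status check in the spirit of that fix (this is not the actual keyholder code, and "keyholder-agent" as the service name is an assumption):

```
# Prefer systemd if present, otherwise fall back to the upstart "status" command.
if command -v systemctl >/dev/null 2>&1; then
    systemctl is-active --quiet keyholder-agent && echo "keyholder-agent is running"
else
    status keyholder-agent    # upstart-specific
fi
```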
I thought we had decided that when mira.deployment-prep.eqiad.wmflabs was recreated it'd get the 'deployment-' prefix?
@mmodell @thcipriani @demon @dduvall can you check that mira02 on beta is all fine? I don't feel confident double-checking that it is working properly.
Also, scap seems to rely on upstart, so I guess scap3 would need to be adjusted.
Mentioned in SAL (#wikimedia-releng) [2016-09-15T10:48:17Z] <hashar> beta: cherry picking moritzm patch https://gerrit.wikimedia.org/r/#/c/310793/ "Also handle systemd in keyholder script" T144578
With labs now having a subdomain named after the project, I don't think the prefix is still needed, e.g. mira02.deployment-prep.eqiad.wmflabs.
Change 311108 had a related patch set uploaded (by Muehlenhoff):
deployment_server: Daemonise redis when running on systemd
A few jessie-related changes have been sorted out; mira02.deployment-prep.eqiad.wmflabs should be ready for testing.
Change 311108 merged by Muehlenhoff:
deployment_server: Daemonise redis when running on systemd
As @hashar pointed out in the team meeting, it seems that /srv is a bit too small on this instance; it already has only 2.3 GB of space left:
```
thcipriani@mira02:/srv/deployment/test/testrepo$ df -h
Filesystem                          Size  Used Avail Use% Mounted on
udev                                 10M     0   10M   0% /dev
tmpfs                               792M   87M  705M  11% /run
/dev/vda3                            19G  5.7G   13G  32% /
tmpfs                               2.0G     0  2.0G   0% /dev/shm
tmpfs                               5.0M     0  5.0M   0% /run/lock
tmpfs                               2.0G     0  2.0G   0% /sys/fs/cgroup
/dev/mapper/vd-second--local--disk   21G   17G  2.3G  89% /srv
```
We'll likely need to rebuild it. While doing so, I feel like the right name is, as @Krenair suggested, deployment-mira02.
From my brief checking, keyholder, scap3, and trebuchet all seem to function correctly. That is, you cannot deploy from this machine currently, but you should be able to if it were made the deployment master. It would be good to test that functionality before everything hits production.
I've rebuilt the host as deployment-mira02 with /srv/ on a separate 20 GB partition and deleted the mira02 instance. Please let me know if further tests are successful and we can proceed to reimage mira in production.
I think we need to use an m1.large instance for the deployment hosts. mira02 was an m1.medium as well, with role::labs::lvm::srv applied, giving it the same 20 GB /srv/ partition. On deployment-tin the current size of /srv/ is 18 GB, which means that when the https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ job is turned back on (following labs database maintenance), deployment-mira02 will be left with very little room on /srv/.
Also, since l10n updates happen on the deployment machines, they're fairly sensitive to the number of cores; 2 cores will make updates a little slower, but that's less of a big deal than running out of space from inception.
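For reference, the l10n cache rebuild that scap triggers can parallelise across cores, which is why the core count matters; roughly something like the following (wiki name and paths are illustrative, and the exact invocation scap uses may differ):

```
mwscript rebuildLocalisationCache.php --wiki=enwiki \
    --outdir=/srv/mediawiki-staging/php-master/cache/l10n \
    --threads="$(nproc)"
```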
Sorry it has taken me so long to give this the attention it needs to move forward :((
Let's get a custom flavor for the deployment servers:
- 8 CPUs, to get a faster l10n rebuild
- 8 GB RAM: 2 GB for the system, 6 GB for cache, which is what deployment-tin has (m1.large)
- 60 GB disk: 20 GB for the system, 40 GB for /srv

Filed as T146209.
Mentioned in SAL (#wikimedia-releng) [2016-09-20T20:47:53Z] <hashar> Creating deployment-mira instance with flavor c8.m8.s60 (8 cpu, 8G RAM and 60G disk) T144578
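For the record, with the standard OpenStack CLI such a flavor could be created roughly like this (a sketch; not necessarily the exact command that was run):

```
openstack flavor create --vcpus 8 --ram 8192 --disk 60 c8.m8.s60
```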
Mentioned in SAL (#wikimedia-releng) [2016-09-20T20:54:29Z] <hashar> from deployment-tin for T144578, accept ssh host key of deployment-mira : sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mira.deployment-prep.eqiad.wmflabs
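A quick way to confirm the keyholder agent is armed before testing ssh through the proxy socket (same socket path as in the command above):

```
# List the keys exposed by the keyholder proxy; an empty list means it still needs arming.
SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh-add -l
# keyholder's own status subcommand should report the same once the systemd fix is in.
sudo keyholder status
```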
Change 311760 had a related patch set uploaded (by Hashar):
Beta: change deployment-mira02 to deployment-mira
I did a sprint tonight:
- Got a new flavor in openstack with a larger disk (T146209), huge thanks to Andrew for creating it upon request
- Spawned deployment-mira with proper puppet classes
- Repurposed Tyler's change https://gerrit.wikimedia.org/r/311760 and cherry-picked it
- Ran puppet, did a bunch of random sudo -u johnDoe ssh to validate fingerprints
Scap passed https://integration.wikimedia.org/ci/job/beta-scap-eqiad/120899/
I have deleted deployment-mira02, whose disk was too small. What is left to do:
- merge https://gerrit.wikimedia.org/r/311760
- clean up mira / 10.68.17.215 from puppet.git
- delete mira.deployment-prep.eqiad.wmflabs (the old Trusty host)
Then we can look at tin. That would be roughly the same, with the addition of it being a Jenkins slave :(
Change 311760 merged by Muehlenhoff:
Beta: change deployment-mira02 to deployment-mira
Change 311946 had a related patch set uploaded (by Hashar):
beta: add hiera deployment_server var from wikitech
Change 311947 had a related patch set uploaded (by Hashar):
beta: switch deploy server to deployment-mira
Mentioned in SAL (#wikimedia-releng) [2016-09-21T10:07:51Z] <hashar> Making deployment-mira a Jenkins slave by applying puppet class role::ci::slave::labs::common T144578
Change 311946 merged by Muehlenhoff:
beta: add hiera deployment_server var from wikitech
Mentioned in SAL (#wikimedia-releng) [2016-09-21T11:24:59Z] <hashar> removing Jenkins slave deployment-tin , deployment-mira is the new deployment master T144578
status for beta cluster
deployment-mira is the new master, running Jessie. The Jenkins jobs are running on it. There are still some details being polished, such as missing Zend extensions, but that is going to be fixed soonish (T146286).
deployment-tin is still Trusty and is going to be reimaged.
My huge thanks to @elukey / @MoritzMuehlenhoff who aced it and @thcipriani for all the support with scap/keyholder.
mira is now running jessie. Please give it some more testing. For migrating tin, we could temporarily make mira the primary deployment server, set up tin in the meantime, and then switch back?
The next deployment window is the week after next due to the ops offsite. I guess we could test by syncing changes to some README file? Or let it take over running l10nupdate?
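If we go the README route, a low-impact test sync could look roughly like this (file and commit message are only an example):

```
# Run from /srv/mediawiki-staging on the new deployment master.
scap sync-file README "no-op test deploy from mira, T144578"
```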
Change 312654 had a related patch set uploaded (by Hashar):
beta: drop deployment-tin add deployment-tin02
Change 315205 had a related patch set uploaded (by Hashar):
Switch primary deployment server from tin to mira
Change 315205 merged by Muehlenhoff:
Switch primary deployment server from tin to mira
mira is now the primary deployment server. tin will be reimaged to jessie on the 18th. After that we can switch back the primary deployment server to tin.
Change 315709 had a related patch set uploaded (by Mobrovac):
Deployment: use mira instead of tin
Change 315710 had a related patch set uploaded (by Mobrovac):
Deployment: Switch from tin to mira
Change 315721 had a related patch set uploaded (by Mobrovac):
Deployment: Switch to mira
Change 315725 had a related patch set uploaded (by Mobrovac):
Deployment: Switch from tin to mira
Change 315726 had a related patch set uploaded (by Mobrovac):
Deployment: Switch from tin to mira
Change 315727 had a related patch set uploaded (by Mobrovac):
Deployment: Switch from tin to mira
status
Both tin.eqiad.wmnet and mira.codfw.wmnet have been reimaged to Jessie, which essentially solves this task. Kudos to @MoritzMuehlenhoff, @elukey and @demon :]
What is left to do is to switch the primary deployment server back to tin.eqiad.wmnet, since the other way around tends to confuse multiple people.