Soon production will be using Debian Stretch for deployment machines (T175288); we should upgrade (and probably rename and resize) deployment-tin and deployment-mira.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | thcipriani | T191921 mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5)
Resolved | | thcipriani | T192561 Upgrade deployment-prep deployment servers to stretch
Event Timeline
As part of SRE clinic duty, I'm reviewing all unassigned, needs triage tasks in SRE and attempting to review if any are critical, or if they are normal priority.
This task appears to be normal priority, and I have set it as such. If anyone on this task disagrees, please comment and correct it. Anything with high priority or above typically requires response ahead of other items, so please ensure you have supporting documentation on why those priorities should be used.
Thanks!
I created deployment-deploy1001 as a stretch box. Here are my notes:
Create new instance
Via Horizon: deployment-deploy1001, stretch, flavor = c8.m8.s60 (8 cores, 8 GB memory, 60 GB hard disk)
Fix certs
On deployment-deploy1001
sudo rm -rf /var/lib/puppet/ssl
sudo mkdir -p /var/lib/puppet/client/ssl/certs
sudo puppet agent -t
sudo cp /var/lib/puppet/ssl/certs/ca.pem /var/lib/puppet/client/ssl/certs
On deployment-puppetmaster02
sudo puppet cert sign deployment-deploy1001.deployment-prep.eqiad.wmflabs
Back on deployment-deploy1001
sudo puppet agent -t
Apply roles
Via Horizon:
- role::ci::slave::labs::common
- role::labs::lvm::srv
- sudo puppet agent -t
- role::beta::deploymentserver
- role::deployment_server
- sudo puppet agent -t
- role::aptly::server
Project puppet
If this becomes the new deployment master, this will need to be changed in the project puppet YAML on Horizon:
scap::deployment_server: deployment-deploy1001.deployment-prep.eqiad.wmflabs
Broken stuff
*sigh*
- sudo -u trebuchet -g wikidev git clone inside of scap_source/default.rb fails because /etc/sudoers has root ALL=(ALL) ALL instead of root ALL=(ALL:ALL) ALL, so root can't execute commands as the wikidev group.
- We create the file /var/lock/scap-global-lock to prevent deploys, then we run scap deploy --init, which fails because it checks the lock file
- iegreview has invalid YAML, so scap deploy --init fails. This is because the check command has a ':' in it; if we change to using > to treat it as a block scalar, it works fine.
- dumps/dumps had no scap directory. Had to run git clone https://gerrit.wikimedia.org/r/p/operations/dumps/scap.git manually
- netbox/deploy scap deploy --init failed looking for /etc/dsh/group/librenms. Had to manually create it with the same contents as on deployment-tin: deployment-netbox.deployment-prep.eqiad.wmflabs
- There is no npm package for stretch
All but the last one should be easy(ish) puppet fixes. The last one looks like it's coming from beta::autoupdater and has something to do with Parsoid; it may require a new package or maybe an update to puppet.
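For the invalid-YAML issue above, the fix is to switch the check command from a plain scalar to a block scalar. A hypothetical sketch of a scap checks.yaml entry (the real iegreview check isn't shown here, so the command and names below are made up):

```yaml
# Hypothetical checks.yaml entry, not the actual iegreview config.
checks:
  endpoint:
    type: command
    # A plain scalar containing ': ' makes the YAML parser see a nested
    # mapping; the '>' block scalar treats the whole line as text instead:
    command: >
      /usr/local/bin/service-checker -t "reply: ok" http://localhost/healthz
```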
It seems you used the same flavor for deploy1001 that tin had. This would have been a great time to switch to a different flavor (with a bigger disk) and resolve T166492. This comment is probably too late for tin->deploy1001, but maybe not for mira->deploy2001 (or whatever it will be called).
(Also, the hostnames in beta usually don't use the 1001/2001 convention, but just 01, 02, 03, ... instead, as they're not distributed to different DCs anyway, and it proves less confusing to have these names differ from the production ones. But that's just an unimportant detail.)
Mentioned in SAL (#wikimedia-releng) [2018-05-30T18:32:52Z] <mutante> created instance deployment-deploy-01 with stretch and flavor x-large (T192561)
deployment-deploy1001 has been deleted by thcipriani.
deployment-deploy-01 has been created with x-large flavor for more disk space @EddieGP
Applied the "role(deployment_server)" role on it via instance puppet (as in prod; no other roles yet that would differ from prod but were used here before).
Krenair fixed puppet cert issues and signed on project specific puppet master.
Puppet is running now.
Change 436433 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: Add librenms dsh file
Addressed via D1064
- dumps/dumps had no scap directory. Had to run git clone https://gerrit.wikimedia.org/r/p/operations/dumps/scap.git manually
Fixed in Horizon
- netbox/deploy scap deploy --init failed looking for /etc/dsh/group/librenms. Had to manually create it with the same contents as on deployment-tin: deployment-netbox.deployment-prep.eqiad.wmflabs
https://gerrit.wikimedia.org/r/436433
Also, https://gerrit.wikimedia.org/r/#/c/361796/ ended up being necessary since scap deploy --init was failing (beyond the lockfile thing) due to the trebuchet group not existing.
Change 436433 merged by Alexandros Kosiaris:
[operations/puppet@production] Beta: Add librenms dsh file
Unfortunately this new -deploy-01 instance went into emergency mode after the security reboots yesterday. It's responding to ping but not accepting any connections e.g. to SSHd. @Andrew took a look but wasn't able to successfully mount its disk. To replace it do we just need to create a new one, get its puppet working, apply the deployment server role, and wait for puppet to run?
Yes, creating a new one, applying the puppet role, and running it would get us back to the state it was in. That being said, it probably also needs other roles besides the production role. In an ideal world it would be identical to production: just that one role.
Does someone already working on this want to replace the instance or shall I start a deployment-deploy-02?
Alright, should be back to roughly where we were 2 weeks ago now with deployment-deploy01.deployment-prep.eqiad.wmflabs. npm package is still failing, plus dumps/dumps is missing git_repo config.
During the process I did have to modify the scap lock path in /usr/lib/python2.7/dist-packages/scap/lock.py to make it not see the scap lock problem.
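For context, scap's global lock is just a file on disk; the check in lock.py amounts to something like the following sketch (this is an illustration, not scap's actual code; the path comes from the comments above):

```python
import os

# Path from the comments above; scap refuses to deploy while this file exists.
GLOBAL_LOCK = "/var/lock/scap-global-lock"

def deployments_locked(path=GLOBAL_LOCK):
    """Return True if the global scap lock file is present."""
    return os.path.exists(path)
```

Changing the lock path in lock.py, as described above, just points this kind of check at a file that doesn't exist, so the lock never trips.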
While it was running I noticed P7280 which I think is due to a missing dependency in puppet on the relevant Scap::Dsh::Group, as entries like this come up only after that error and I'm pretty sure it's the files that scap was looking for:
Notice: /Stage[main]/Scap::Dsh/Scap::Dsh::Group[librenms]/File[/etc/dsh/group/librenms]/ensure: defined content as '{md5}f2465fd75677ab2cc4e5fdf539f7e3fc'
Anyway that worked on the next puppet run.
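If the root cause really is ordering, the usual Puppet fix is an explicit dependency on the dsh group resource; a hypothetical sketch (the consuming resource here is made up, only Scap::Dsh::Group['librenms'] appears in the output above):

```puppet
# Hypothetical: whatever reads /etc/dsh/group/librenms should require it,
# so the file is guaranteed to exist before the consumer runs.
exec { 'deploy-init-netbox':
    command => '/usr/bin/scap deploy --init',
    require => Scap::Dsh::Group['librenms'],
}
```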
Also something missing here:
Notice: /Stage[main]/Scap::Master/Git::Clone[operations/mediawiki-config]/File[/srv/mediawiki-staging]/ensure: created
Error: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /srv/mediawiki-staging/docroot/wwwportal at /etc/puppet/modules/beta/manifests/autoupdater.pp:58
Error: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /srv/mediawiki-staging/docroot/wwwportal at /etc/puppet/modules/beta/manifests/autoupdater.pp:58
Wrapped exception: No such file or directory @ dir_chdir - /srv/mediawiki-staging/docroot/wwwportal
Error: /Stage[main]/Beta::Autoupdater/File[/srv/mediawiki-staging/docroot/wwwportal/portal-master]/ensure: change from absent to link failed: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /srv/mediawiki-staging/docroot/wwwportal at /etc/puppet/modules/beta/manifests/autoupdater.pp:58
Notice: /Stage[main]/Scap::Master/Git::Clone[operations/mediawiki-config]/Exec[git_clone_operations/mediawiki-config]/returns: executed successfully
This was fine on the next puppet run too.
@thcipriani's https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441491/ will fix broken stuff #4 above, cherry-picked
diff --git a/modules/beta/manifests/autoupdater.pp b/modules/beta/manifests/autoupdater.pp
index ad3af6bd06..5f224d87eb 100644
--- a/modules/beta/manifests/autoupdater.pp
+++ b/modules/beta/manifests/autoupdater.pp
@@ -7,8 +7,21 @@ class beta::autoupdater {
     $stage_dir = '/srv/mediawiki-staging'

     # Parsoid JavaScript dependencies are updated on beta via npm
+    apt::repository { 'node':
+        uri        => 'https://deb.nodesource.com/node_6.x',
+        dist       => $::lsbdistcodename,
+        components => 'main',
+        keyfile    => 'puppet:///modules/beta/nodesource.gpg', # from https://deb.nodesource.com/gpgkey/nodesource.gpg.key
+    }
+    apt::pin { 'nodejs':
+        package  => 'nodejs',
+        pin      => 'version 6.14.3-1nodesource1',
+        priority => '1002',
+        require  => Apt::Repository['node'],
+    }
     package { 'npm':
-        ensure => 'present',
+        ensure  => 'present',
+        require => Apt::Pin['nodejs'],
     }

     file { '/usr/local/bin/wmf-beta-autoupdate.py':
Change 442229 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] deployment-prep: Add new deployment host
Change 442229 merged by Andrew Bogott:
[operations/puppet@production] deployment-prep: Add new deployment host
Are there known unresolved issues with the new host? It seems deployment-tin is still used as primary for Jenkins.
Once the migration is done, we should probably phase out the deployment-tin and deployment-mira hosts.
@hashar Can we change Jenkins config to use the new host per Krinkle's question above?
<Krenair> thcipriani, where are we with deployment-deploy01?
<thcipriani> I looked at it Friday, it looks ready to go to me.
<Krenair> cool so we just get someone with integration privileges to swap jenkins to using the new host?
<thcipriani> yep, we'll need to make it the deploy master so that it's unlocked for deployments
<thcipriani> I can add it to jenkins and re-point the labels appropriately, that's on my todo list for this week.
Change 449520 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: ensure deployment-deploy01 is a co-master
Change 449521 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: Make deployment-deploy01 main deploy server
Change 449520 merged by Dzahn:
[operations/puppet@production] Beta: ensure deployment-deploy01 is a co-master
Since https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/449219/ got merged, puppet has failed to compile the catalog on deployment-tin and deployment-mira, and I think the only real way forward is to get rid of them soon. Luckily deploy01 is almost there?
deploy01 is currently handling all the beta-* jobs on jenkins. I tested scap3 deployment from deploy01. I've cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/449521/ so deploy01 is now acting as the main deployment server.
Probably ought to make a deploy02 and then shutdown deployment-tin and deployment-mira.
Change 449521 merged by Dzahn:
[operations/puppet@production] Beta: Make deployment-deploy01 main deploy server
I think I misunderstood something about the npm thing and I don't think my patch for it worked after all.
Change 449643 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] deployment-prep: Set up deployment-deploy02 as deployment-mira stretch replacement
<Krenair> thcipriani, do we also need to do something about the apt source pointing at deployment-tin?
<thcipriani> Krenair: afaik that's only used for scap. We'll want to fix it, but it shouldn't block anything. I don't know how it was setup.
<thcipriani> used to test out the master branch of scap on deployment-prep, that is
<Krenair> well it blocks shutting deployment-tin down right? :)
<thcipriani> so that apt isn't screaming on all the other machines? yes, that does seem correct.
It looks like we probably need to copy /srv/packages across and then update role::aptly::client::servername in https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep
Edit: Done
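For reference, the Hiera:Deployment-prep change amounts to pointing the aptly clients at the new host; a sketch (the key comes from the comment above, the host name is assumed to be deployment-deploy01):

```yaml
# Hiera:Deployment-prep on wikitech (sketch, not the literal page content)
role::aptly::client::servername: deployment-deploy01.deployment-prep.eqiad.wmflabs
```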
Also need to figure out what to do about hieradata/labs/deployment-prep/host/deployment-tin.yaml in puppet.git
Move the content to hieradata/labs/deployment-prep-host/common.yaml and delete ./host/deployment-tin.yaml and also ./host/deployment-mira.yaml. The content looks identical, so we can just as well move it to common and not have host-name-specific files.
No wait.. i'm wrong. That would affect other hosts in deployment-prep that aren't deployment-servers since this isn't role-based like in prod.
So just move the files to their new host name, staying in ./hosts. .. unfortunately.
AFAICT those hieradata files may not be needed for deploy0{1,2}. The mount_nfs stuff seems to be wrapped in the mount_nfs_volume function, which doesn't seem like it will return anything truthy for the deployment-prep project (also evidenced by there being no NFS mounts on those hosts). The light_process_count was meant to counter spammy (but otherwise harmless) output from the scap command (see T124956: Rise in "parent, LightProcess exiting" console spam). Since that doesn't appear to be happening on deployment-deploy01 (see beta-scap-eqiad console output in CI: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/218249/console), it doesn't appear to be needed anymore.
Change 450078 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: deployment-deploy02 is deployment host
Change 450079 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: remove deployment-{tin,mira}
Change 450078 merged by Dzahn:
[operations/puppet@production] Beta: deployment-deploy02 is deployment host
Change 449643 abandoned by Alex Monk:
beta: Set up deployment-deploy02 as deployment-mira replacement
Reason:
I1d044711
Change 450079 merged by Dzahn:
[operations/puppet@production] Beta: remove deployment-{tin,mira}
Mentioned in SAL (#wikimedia-releng) [2018-08-13T23:04:58Z] <Krenair> deactivated and cleaned puppet node entries for deployment-{tin,mira} T192561
Mentioned in SAL (#wikimedia-releng) [2018-08-20T15:58:40Z] <hashar> deleting Jenkins slave deployment-tin.eqiad the instance has been replaced | T192561
During cherry-pick review today I realised that my attempt (b59add730544b922e1fb68ec344bc26027aa9e37) to get the npm package working as expected (I think for some beta-specific Parsoid auto-deployment mechanism?), by adding an extra repository (https://deb.nodesource.com/node_6.x) with its key and pinning nodejs to a specific version, was still cherry-picked. Is that still necessary? Can someone check whether the other stretch packages for nodejs (from deb.debian.org or mirrors.wikimedia.org) are sufficient?
Production deployment servers don't have nodejs installed. I don't know why deployment-prep deployment servers do.
Thanks, i see:
# Parsoid JavaScript dependencies are updated on beta via npm
package { 'npm':
So the next questions would be "why do we need parsoid on the deployment server in beta when the deployment_server role in prod doesn't mention parsoid at all" and/or "how are Parsoid Javascript dependencies updated when not in beta".
Hrm. Looks like this is from: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/93939/
Adding @hashar to see if he can provide any information.
A while ago, once a change got merged for Parsoid, we would trigger a Jenkins job that had to run npm install on the host. That is no longer the case, however, so I guess npm can be removed entirely.
(note: I have no idea how Parsoid or other mediawiki services are updated on deployment-prep).
Change 456625 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: remove npm from deployment master
Change 456625 merged by Dzahn:
[operations/puppet@production] Beta: remove npm from deployment master