
Upgrade deployment-prep deployment servers to stretch
Closed, Resolved · Public

Description

Soon production will be using Debian Stretch for its deployment machines (T175288), so we should upgrade (and probably rename and resize) deployment-tin and deployment-mira.

Event Timeline

This would also give us a place to test various mwscripts used by scap with php7

RobH triaged this task as Medium priority. May 3 2018, 4:38 PM
RobH subscribed.

As part of SRE clinic duty, I'm reviewing all unassigned, needs-triage tasks in SRE and attempting to determine whether any are critical or just normal priority.

This task appears to be normal priority, and I have set it as such. If anyone on this task disagrees, please comment and correct it. Anything with high priority or above typically requires a response ahead of other items, so please ensure you have supporting documentation for why those priorities should be used.

Thanks!

I created deployment-deploy1001 as a stretch box. Here are my notes:

Create new instance

Via Horizon: deployment-deploy1001, stretch, flavor = c8.m8.s60 (8 cores, 8 GB memory, 60 GB hard disk)
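
For reference, roughly the same thing can be done from the command line with the OpenStack client (a sketch only; the instance above was created via the Horizon web UI, and the image name here is an assumption):

# Hypothetical CLI equivalent of the Horizon step above; image name is a guess.
openstack server create --image debian-9-stretch --flavor c8.m8.s60 deployment-deploy1001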

Fix certs

On deployment-deploy1001

sudo rm -rf /var/lib/puppet/ssl
sudo mkdir -p /var/lib/puppet/client/ssl/certs
sudo puppet agent -t
sudo cp /var/lib/puppet/ssl/certs/ca.pem /var/lib/puppet/client/ssl/certs

On deployment-puppetmaster02

sudo puppet cert sign deployment-deploy1001.deployment-prep.eqiad.wmflabs

Back on deployment-deploy1001

sudo puppet agent -t

Apply roles

Via Horizon:

  1. role::ci::slave::labs::common
  2. role::labs::lvm::srv
  • sudo puppet agent -t
  3. role::beta::deploymentserver
  4. role::deployment_server
  • sudo puppet agent -t
  5. role::aptly::server

Project puppet

If this becomes the new deployment master, this will need to be changed in the project puppet YAML on Horizon:

scap::deployment_server: deployment-deploy1001.deployment-prep.eqiad.wmflabs

Broken stuff

*sigh*

  1. sudo -u trebuchet -g wikidev git clone inside of scap_source/default.rb fails because /etc/sudoers has root ALL=(ALL) ALL instead of root ALL=(ALL:ALL) ALL, so root can't run commands as the wikidev group (see the sketch after this list).
  2. We create the file /var/lock/scap-global-lock to prevent deploys, then run scap deploy --init, which fails because it checks the lock file.
  3. iegreview has invalid YAML, so scap deploy --init fails: the check command contains a :, which YAML parses as a mapping. Changing the value to a block scalar (>) so the whole command is treated as a single string makes it work fine.
  4. dumps/dumps had no scap directory. Had to run git clone https://gerrit.wikimedia.org/r/p/operations/dumps/scap.git manually.
  5. For netbox/deploy, scap deploy --init failed looking for /etc/dsh/group/librenms. Had to create it manually with the same contents as on deployment-tin: deployment-netbox.deployment-prep.eqiad.wmflabs
  6. There is no npm package for stretch.
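
To illustrate item 1, a minimal sketch of the failing invocation and the sudoers change that allows it (the repository path is illustrative):

# Fails while /etc/sudoers only has "root ALL=(ALL) ALL",
# because that entry grants no runas group:
sudo -u trebuchet -g wikidev git clone https://gerrit.wikimedia.org/r/p/some/repo /srv/deployment/some/repo
# Works once the entry also grants a runas group:
#   root ALL=(ALL:ALL) ALL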

All but the last one should be easy(ish) puppet fixes. The last one looks like it's coming from beta::autoupdater and has something to do with Parsoid ... it may require a new package or maybe an update to puppet.

It seems you used the same flavor for deploy1001 that tin had. This would've been a great time to switch to a different flavor (with bigger disk) and resolve T166492. This comment is probably late for tin->deploy1001, but maybe not for mira->deploy2001 (or whatever it will be called).

(Also, hostnames in beta usually don't use the 1001/2001 convention but just 01, 02, 03, ..., since they're not distributed across different DCs anyway, and it proves less confusing to have these names differ from the production ones. That's just an unimportant detail.)

Mentioned in SAL (#wikimedia-releng) [2018-05-30T18:32:52Z] <mutante> created instance deployment-deploy-01 with stretch and flavor x-large (T192561)

deployment-deploy1001 has been deleted by thcipriani.

deployment-deploy-01 has been created with the x-large flavor for more disk space, @EddieGP.

applied the "role(deployment_server)" on it via instance puppet. (like in prod, no other roles yet that would differ from prod but were used here before)

Krenair fixed the puppet cert issues and signed the cert on the project-specific puppet master.

Puppet is running now.

Woohoo! Thanks Daniel and Tyler! :)

Change 436433 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: Add librenms dsh file

https://gerrit.wikimedia.org/r/436433

Broken stuff

  1. iegreview has invalid YAML, so scap deploy --init fails: the check command contains a :, which YAML parses as a mapping. Changing the value to a block scalar (>) so the whole command is treated as a single string makes it work fine.

Addressed via D1064

  1. dumps/dumps had no scap directory. Had to run git clone https://gerrit.wikimedia.org/r/p/operations/dumps/scap.git manually.

Fixed in Horizon

  1. For netbox/deploy, scap deploy --init failed looking for /etc/dsh/group/librenms. Had to create it manually with the same contents as on deployment-tin: deployment-netbox.deployment-prep.eqiad.wmflabs

https://gerrit.wikimedia.org/r/436433

Also, https://gerrit.wikimedia.org/r/#/c/361796/ ended up being necessary since scap deploy --init was failing (beyond the lockfile thing) due to the trebuchet group not existing.
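
A minimal sketch of a manual stopgap for the missing group (the patch above is the real fix; creating the group by hand is just illustrative):

# Create the system group that scap deploy --init expected to exist.
sudo groupadd --system trebuchet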

Change 436433 merged by Alexandros Kosiaris:
[operations/puppet@production] Beta: Add librenms dsh file

https://gerrit.wikimedia.org/r/436433

Unfortunately this new -deploy-01 instance went into emergency mode after the security reboots yesterday. It's responding to ping but not accepting any connections, e.g. to sshd. @Andrew took a look but wasn't able to mount its disk. To replace it, do we just need to create a new one, get its puppet working, apply the deployment server role, and wait for puppet to run?

Yes: creating a new one, applying the puppet role, and running it would get us back to the state it was in. That being said, it probably also needs other roles besides the production role. In an ideal world it would be identical to production, with just that one role.

Does someone already working on this want to replace the instance or shall I start a deployment-deploy-02?

Alright, we should be back to roughly where we were 2 weeks ago now with deployment-deploy01.deployment-prep.eqiad.wmflabs. The npm package is still failing, and dumps/dumps is missing its git_repo config.

During the process I did have to modify the scap lock path in /usr/lib/python2.7/dist-packages/scap/lock.py so it would not hit the scap lock problem.
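
A sketch of an alternative workaround that avoids editing lock.py, assuming the global lock file is the only blocker (per item 2 in the earlier list; the exact sequence is an assumption):

# Temporarily move the global lock aside, run the init, then restore it.
sudo mv /var/lock/scap-global-lock /var/lock/scap-global-lock.bak
scap deploy --init    # run from the repository's staging directory (assumption)
sudo mv /var/lock/scap-global-lock.bak /var/lock/scap-global-lock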

While it was running I noticed P7280, which I think is due to a missing dependency in puppet on the relevant Scap::Dsh::Group; entries like this come up only after that error, and I'm pretty sure these are the files scap was looking for:
Notice: /Stage[main]/Scap::Dsh/Scap::Dsh::Group[librenms]/File[/etc/dsh/group/librenms]/ensure: defined content as '{md5}f2465fd75677ab2cc4e5fdf539f7e3fc'
Anyway that worked on the next puppet run.

Also something missing here:

Notice: /Stage[main]/Scap::Master/Git::Clone[operations/mediawiki-config]/File[/srv/mediawiki-staging]/ensure: created
Error: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /srv/mediawiki-staging/docroot/wwwportal at /etc/puppet/modules/beta/manifests/autoupdater.pp:58
Error: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /srv/mediawiki-staging/docroot/wwwportal at /etc/puppet/modules/beta/manifests/autoupdater.pp:58
Wrapped exception:
No such file or directory @ dir_chdir - /srv/mediawiki-staging/docroot/wwwportal
Error: /Stage[main]/Beta::Autoupdater/File[/srv/mediawiki-staging/docroot/wwwportal/portal-master]/ensure: change from absent to link failed: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /srv/mediawiki-staging/docroot/wwwportal at /etc/puppet/modules/beta/manifests/autoupdater.pp:58
Notice: /Stage[main]/Scap::Master/Git::Clone[operations/mediawiki-config]/Exec[git_clone_operations/mediawiki-config]/returns: executed successfully

This was fine on the next puppet run too.

Crappy cherry-picked hack to try to get npm installed and puppet happy:
diff --git a/modules/beta/manifests/autoupdater.pp b/modules/beta/manifests/autoupdater.pp
index ad3af6bd06..5f224d87eb 100644
--- a/modules/beta/manifests/autoupdater.pp
+++ b/modules/beta/manifests/autoupdater.pp
@@ -7,8 +7,21 @@ class beta::autoupdater {
     $stage_dir = '/srv/mediawiki-staging'
 
     # Parsoid JavaScript dependencies are updated on beta via npm
+    apt::repository { 'node':
+        uri        => 'https://deb.nodesource.com/node_6.x',
+        dist       => $::lsbdistcodename,
+        components => 'main',
+        keyfile    => 'puppet:///modules/beta/nodesource.gpg', # from https://deb.nodesource.com/gpgkey/nodesource.gpg.key
+    }
+    apt::pin { 'nodejs':
+        package  => 'nodejs',
+        pin      => 'version 6.14.3-1nodesource1',
+        priority => '1002',
+        require  => Apt::Repository['node'],
+    }
     package { 'npm':
-        ensure => 'present',
+        ensure  => 'present',
+        require => Apt::Pin['nodejs'],
     }
 
     file { '/usr/local/bin/wmf-beta-autoupdate.py':

I've armed keyholder on the new host. So what's next, @thcipriani?

Change 442229 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] deployment-prep: Add new deployment host

https://gerrit.wikimedia.org/r/442229

Change 442229 merged by Andrew Bogott:
[operations/puppet@production] deployment-prep: Add new deployment host

https://gerrit.wikimedia.org/r/442229

Are there known unresolved issues with the new host? It seems deployment-tin is still used as the primary for Jenkins.

Once the migration is done, we should probably phase out the deployment-tin and deployment-mira hosts.

Dzahn added a subscriber: hashar.

@hashar Can we change Jenkins config to use the new host per Krinkle's question above?

<Krenair> thcipriani, where are we with deployment-deploy01?
<thcipriani> I looked at it Friday, it looks ready to go to me.
<Krenair> cool so we just get someone with integration privileges to swap jenkins to using the new host?
<thcipriani> yep, we'll need to make it the deploy master so that it's unlocked for deployments
<thcipriani> I can add it to jenkins and re-point the labels appropriately, that's on my todo list for this week.

Change 449520 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: ensure deployment-deploy01 is a co-master

https://gerrit.wikimedia.org/r/449520

Change 449521 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: Make deployment-deploy01 main deploy server

https://gerrit.wikimedia.org/r/449521

Change 449520 merged by Dzahn:
[operations/puppet@production] Beta: ensure deployment-deploy01 is a co-master

https://gerrit.wikimedia.org/r/449520

Since https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/449219/ got merged, puppet has failed to compile the catalog on deployment-tin and deployment-mira, and I think the only real way forward is to get rid of them soon. Luckily deploy01 is almost there?

Luckily deploy01 is almost there?

deploy01 is currently handling all the beta-* jobs on jenkins. I tested scap3 deployment from deploy01. I've cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/449521/ so deploy01 is now acting as the main deployment server.

Probably ought to make a deploy02 and then shut down deployment-tin and deployment-mira.

Change 449521 merged by Dzahn:
[operations/puppet@production] Beta: Make deployment-deploy01 main deploy server

https://gerrit.wikimedia.org/r/449521

Probably ought to make a deploy02 and then shut down deployment-tin and deployment-mira.

doing

I think I misunderstood something about the npm thing and I don't think my patch for it worked after all.

Change 449643 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] deployment-prep: Set up deployment-deploy02 as deployment-mira stretch replacement

https://gerrit.wikimedia.org/r/449643

<Krenair> thcipriani, do we also need to do something about the apt source pointing at deployment-tin?
<thcipriani> Krenair: afaik that's only used for scap. We'll want to fix it, but it shouldn't block anything. I don't know how it was setup.
<thcipriani> used to test out the master branch of scap on deployment-prep, that is
<Krenair> well it blocks shutting deployment-tin down right? :)
<thcipriani> so that apt isn't screaming on all the other machines? yes, that does seem correct.

It looks like we probably need to copy /srv/packages across and then update role::aptly::client::servername in https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep

Edit: Done
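
A rough sketch of the manual copy, run on the new host (the rsync flags and the assumption of root ssh access between the instances are mine):

# Pull the aptly package pool across from the old deployment server.
sudo rsync -av deployment-tin.deployment-prep.eqiad.wmflabs:/srv/packages/ /srv/packages/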

Also need to figure out what to do about hieradata/labs/deployment-prep/host/deployment-tin.yaml in puppet.git

Move the content to hieradata/labs/deployment-prep-host/common.yaml and delete ./host/deployment-tin.yaml and also ./host/deployment-mira.yaml. The content looks identical, so we can just as well move it to common and not have hostname-specific files.

It looks like we probably need to copy /srv/packages across

In production we use rsync::quickdatacopy for that. It can be automated.

Move the content to hieradata/labs/deployment-prep-host/common.yaml and delete ./host/deployment-tin.yaml and also ./host/deployment-mira.yaml. The content looks identical, so we can just as well move it to common and not have hostname-specific files.

No wait, I'm wrong. That would affect other hosts in deployment-prep that aren't deployment servers, since this isn't role-based like in prod.

So just move the files to their new host names, staying in ./host ... unfortunately.
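
In puppet.git that would amount to something like the following (a sketch; the target file names are assumptions based on the new host names):

git mv hieradata/labs/deployment-prep/host/deployment-tin.yaml \
       hieradata/labs/deployment-prep/host/deployment-deploy01.yaml
git mv hieradata/labs/deployment-prep/host/deployment-mira.yaml \
       hieradata/labs/deployment-prep/host/deployment-deploy02.yaml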

Move the content to hieradata/labs/deployment-prep-host/common.yaml and delete ./host/deployment-tin.yaml and also ./host/deployment-mira.yaml. The content looks identical, so we can just as well move it to common and not have hostname-specific files.

No wait, I'm wrong. That would affect other hosts in deployment-prep that aren't deployment servers, since this isn't role-based like in prod.

So just move the files to their new host names, staying in ./host ... unfortunately.

AFAICT those hieradata files may not be needed for deploy0{1,2}. The mount_nfs stuff seems to be wrapped in the mount_nfs_volume function, which doesn't look like it will return anything truthy for the deployment-prep project (also evidenced by the lack of NFS mounts on those hosts). The light_process_count was meant to counter spammy (but otherwise harmless) output from the scap command (see T124956: Rise in "parent, LightProcess exiting" console spam). Since that doesn't appear to be happening on deployment-deploy01 (see the beta-scap-eqiad console output in CI: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/218249/console), it doesn't appear to be needed anymore.
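
For the record, a quick way to confirm the no-NFS-mounts observation on those hosts (a sketch):

# Lists NFS mounts, if any; prints nothing when there are none.
findmnt -t nfs,nfs4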

Change 450078 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: deployment-deploy02 is deployment host

https://gerrit.wikimedia.org/r/450078

Change 450079 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: remove deployment-{tin,mira}

https://gerrit.wikimedia.org/r/450079

Change 450078 merged by Dzahn:
[operations/puppet@production] Beta: deployment-deploy02 is deployment host

https://gerrit.wikimedia.org/r/450078

Change 449643 abandoned by Alex Monk:
beta: Set up deployment-deploy02 as deployment-mira replacement

Reason:
I1d044711

https://gerrit.wikimedia.org/r/449643

Change 450079 merged by Dzahn:
[operations/puppet@production] Beta: remove deployment-{tin,mira}

https://gerrit.wikimedia.org/r/450079

thcipriani claimed this task.

@thcipriani resolved?

yep!

Mentioned in SAL (#wikimedia-releng) [2018-08-13T23:04:58Z] <Krenair> deactivated and cleaned puppet node entries for deployment-{tin,mira} T192561

Mentioned in SAL (#wikimedia-releng) [2018-08-20T15:58:40Z] <hashar> deleting Jenkins slave deployment-tin.eqiad the instance has been replaced | T192561

During cherry-pick review today I realised that my attempt (b59add730544b922e1fb68ec344bc26027aa9e37) to get the npm package working as expected (I think for some beta-specific Parsoid auto-deployment mechanism?) was still cherry-picked. It added an extra repository (https://deb.nodesource.com/node_6.x) with its key and pinned nodejs to a specific version. Is that still necessary? Can someone check whether the other stretch packages for nodejs (from deb.debian.org or mirrors.wikimedia.org) are sufficient?
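
One quick way to check that on the instance would be something like (a sketch):

# Shows the candidate versions and which repository each would come from.
apt-cache policy nodejs npm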

Production deployment servers don't have nodejs installed. I don't know why deployment-prep deployment servers do.

Production deployment servers don't have nodejs installed. I don't know why deployment-prep deployment servers do.

looks like it's coming from beta::autoupdater and has something to do with Parsoid ... it may require a new package or maybe an update to puppet.

Thanks, I see:

# Parsoid JavaScript dependencies are updated on beta via npm
 package { 'npm':

So the next questions would be "why do we need Parsoid on the deployment server in beta when the deployment_server role in prod doesn't mention Parsoid at all" and/or "how are Parsoid JavaScript dependencies updated when not in beta".

Thanks, I see:

# Parsoid JavaScript dependencies are updated on beta via npm
 package { 'npm':

So the next questions would be "why do we need Parsoid on the deployment server in beta when the deployment_server role in prod doesn't mention Parsoid at all" and/or "how are Parsoid JavaScript dependencies updated when not in beta".

Hrm. Looks like this is from: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/93939/

Adding @hashar to see if he can provide any information.

A while ago, once a change got merged for Parsoid, we would trigger a Jenkins job that had to run npm install on the host. That is no longer the case, however, so I guess npm can be removed entirely.

(Note: I have no idea how Parsoid or other MediaWiki services are updated on deployment-prep.)

Change 456625 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: remove npm from deployment master

https://gerrit.wikimedia.org/r/456625

Change 456625 merged by Dzahn:
[operations/puppet@production] Beta: remove npm from deployment master

https://gerrit.wikimedia.org/r/456625