
Upgrade deployment-prep deployment servers to stretch
Closed, Resolved · Public

Description

Soon production will be using Debian Stretch for its deployment machines (T175288), so we should upgrade (and probably rename and resize) deployment-tin and deployment-mira.

Event Timeline

This would also give us a place to test various mwscripts used by scap with php7

RobH triaged this task as Medium priority. May 3 2018, 4:38 PM
RobH subscribed.

As part of SRE clinic duty, I'm reviewing all unassigned, needs-triage tasks in SRE and attempting to determine whether any are critical or just normal priority.

This task appears to be normal priority, and I have set it as such. If anyone on this task disagrees, please comment and correct it. Anything with high priority or above typically requires a response ahead of other items, so please ensure you have supporting documentation for why those priorities should be used.

Thanks!

I created deployment-deploy1001 as a stretch box. Here are my notes:

Create new instance

Via Horizon: deployment-deploy1001, stretch, flavor = c8.m8.s60 (8 cores, 8 GB memory, 60 GB hard disk)
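
For reference, roughly the same thing can be done from the command line with the OpenStack client (a sketch only; the instance above was created via the Horizon web UI, and the image name here is an assumption):

# Hypothetical CLI equivalent of the Horizon step above; image name is a guess.
openstack server create --image debian-9-stretch --flavor c8.m8.s60 deployment-deploy1001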

Fix certs

On deployment-deploy1001

sudo rm -rf /var/lib/puppet/ssl
sudo mkdir -p /var/lib/puppet/client/ssl/certs
sudo puppet agent -t
sudo cp /var/lib/puppet/ssl/certs/ca.pem /var/lib/puppet/client/ssl/certs

On deployment-puppetmaster02

sudo puppet cert sign deployment-deploy1001.deployment-prep.eqiad.wmflabs

Back on deployment-deploy1001

sudo puppet agent -t

Apply roles

Via Horizon:

  1. role::ci::slave::labs::common
  2. role::labs::lvm::srv
  • sudo puppet agent -t
  3. role::beta::deploymentserver
  4. role::deployment_server
  • sudo puppet agent -t
  5. role::aptly::server

Project puppet

If this becomes the new deployment master, this will need to be changed in the project puppet YAML on Horizon:

scap::deployment_server: deployment-deploy1001.deployment-prep.eqiad.wmflabs

Broken stuff

*sigh*

  1. sudo -u trebuchet -g wikidev git clone inside of scap_source/default.rb fails because /etc/sudoers has root ALL=(ALL) ALL instead of root ALL=(ALL:ALL) ALL, so root can't run commands as the wikidev group (see the sketch after this list).
  2. We create the file /var/lock/scap-global-lock to prevent deploys, then run scap deploy --init, which fails because it checks the lock file.
  3. iegreview has invalid YAML, so scap deploy --init fails: the check command contains a :, which YAML parses as a mapping. Changing the value to a block scalar (>) so the whole command is treated as a single string makes it work fine.
  4. dumps/dumps had no scap directory. Had to run git clone https://gerrit.wikimedia.org/r/p/operations/dumps/scap.git manually.
  5. For netbox/deploy, scap deploy --init failed looking for /etc/dsh/group/librenms. Had to create it manually with the same contents as on deployment-tin: deployment-netbox.deployment-prep.eqiad.wmflabs
  6. There is no npm package for stretch.
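
To illustrate item 1, a minimal sketch of the failing invocation and the sudoers change that allows it (the repository path is illustrative):

# Fails while /etc/sudoers only has "root ALL=(ALL) ALL",
# because that entry grants no runas group:
sudo -u trebuchet -g wikidev git clone https://gerrit.wikimedia.org/r/p/some/repo /srv/deployment/some/repo
# Works once the entry also grants a runas group:
#   root ALL=(ALL:ALL) ALL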

All but the last one should be easy(ish) puppet fixes. The last one looks like it's coming from beta::autoupdater and has something to do with Parsoid ... it may require a new package or maybe an update to puppet.

It seems you used the same flavor for deploy1001 that tin had. This would've been a great time to switch to a different flavor (with bigger disk) and resolve T166492. This comment is probably late for tin->deploy1001, but maybe not for mira->deploy2001 (or whatever it will be called).

(Also, hostnames in beta usually don't use the 1001/2001 convention but just 01, 02, 03, ..., since they're not distributed across different DCs anyway, and it proves less confusing to have these names differ from the production ones. That's just an unimportant detail.)

Mentioned in SAL (#wikimedia-releng) [2018-05-30T18:32:52Z] <mutante> created instance deployment-deploy-01 with stretch and flavor x-large (T192561)

deployment-deploy1001 has been deleted by thcipriani.

deployment-deploy-01 has been created with the x-large flavor for more disk space, @EddieGP.

applied the "role(deployment_server)" on it via instance puppet. (like in prod, no other roles yet that would differ from prod but were used here before)

Krenair fixed the puppet cert issues and signed the cert on the project-specific puppet master.

Puppet is running now.

Woohoo! Thanks Daniel and Tyler! :)

Change 436433 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: Add librenms dsh file

https://gerrit.wikimedia.org/r/436433

Broken stuff

  1. iegreview has invalid YAML, so scap deploy --init fails: the check command contains a :, which YAML parses as a mapping. Changing the value to a block scalar (>) so the whole command is treated as a single string makes it work fine.

Addressed via D1064

  1. dumps/dumps had no scap directory. Had to run git clone https://gerrit.wikimedia.org/r/p/operations/dumps/scap.git manually.

Fixed in Horizon

  1. For netbox/deploy, scap deploy --init failed looking for /etc/dsh/group/librenms. Had to create it manually with the same contents as on deployment-tin: deployment-netbox.deployment-prep.eqiad.wmflabs

https://gerrit.wikimedia.org/r/436433

Also, https://gerrit.wikimedia.org/r/#/c/361796/ ended up being necessary since scap deploy --init was failing (beyond the lockfile thing) due to the trebuchet group not existing.
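
A minimal sketch of a manual stopgap for the missing group (the patch above is the real fix; creating the group by hand is just illustrative):

# Create the system group that scap deploy --init expected to exist.
sudo groupadd --system trebuchet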

Change 436433 merged by Alexandros Kosiaris:
[operations/puppet@production] Beta: Add librenms dsh file

https://gerrit.wikimedia.org/r/436433

Unfortunately this new -deploy-01 instance went into emergency mode after the security reboots yesterday. It's responding to ping but not accepting any connections, e.g. to sshd. @Andrew took a look but wasn't able to mount its disk. To replace it, do we just need to create a new one, get its puppet working, apply the deployment server role, and wait for puppet to run?

Yes: creating a new one, applying the puppet role, and running it would get us back to the state it was in. That being said, it probably also needs other roles besides the production role. In an ideal world it would be identical to production, with just that one role.

Does someone already working on this want to replace the instance or shall I start a deployment-deploy-02?

Alright, we should be back to roughly where we were 2 weeks ago now with deployment-deploy01.deployment-prep.eqiad.wmflabs. The npm package is still failing, and dumps/dumps is missing its git_repo config.

During the process I did have to modify the scap lock path in /usr/lib/python2.7/dist-packages/scap/lock.py so it would not hit the scap lock problem.
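
A sketch of an alternative workaround that avoids editing lock.py, assuming the global lock file is the only blocker (per item 2 in the earlier list; the exact sequence is an assumption):

# Temporarily move the global lock aside, run the init, then restore it.
sudo mv /var/lock/scap-global-lock /var/lock/scap-global-lock.bak
scap deploy --init    # run from the repository's staging directory (assumption)
sudo mv /var/lock/scap-global-lock.bak /var/lock/scap-global-lock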

While it was running I noticed P7280, which I think is due to a missing dependency in puppet on the relevant Scap::Dsh::Group; entries like this come up only after that error, and I'm pretty sure these are the files scap was looking for:
Notice: /Stage[main]/Scap::Dsh/Scap::Dsh::Group[librenms]/File[/etc/dsh/group/librenms]/ensure: defined content as '{md5}f2465fd75677ab2cc4e5fdf539f7e3fc'
Anyway that worked on the next puppet run.

Also something missing here:

Notice: /Stage[main]/Scap::Master/Git::Clone[operations/mediawiki-config]/File[/srv/mediawiki-staging]/ensure: created
Error: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /srv/mediawiki-staging/docroot/wwwportal at /etc/puppet/modules/beta/manifests/autoupdater.pp:58
Error: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /srv/mediawiki-staging/docroot/wwwportal at /etc/puppet/modules/beta/manifests/autoupdater.pp:58
Wrapped exception:
No such file or directory @ dir_chdir - /srv/mediawiki-staging/docroot/wwwportal
Error: /Stage[main]/Beta::Autoupdater/File[/srv/mediawiki-staging/docroot/wwwportal/portal-master]/ensure: change from absent to link failed: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /srv/mediawiki-staging/docroot/wwwportal at /etc/puppet/modules/beta/manifests/autoupdater.pp:58
Notice: /Stage[main]/Scap::Master/Git::Clone[operations/mediawiki-config]/Exec[git_clone_operations/mediawiki-config]/returns: executed successfully

This was fine on the next puppet run too.

Crappy cherry-picked hack to try to get npm installed and puppet happy:
diff --git a/modules/beta/manifests/autoupdater.pp b/modules/beta/manifests/autoupdater.pp
index ad3af6bd06..5f224d87eb 100644
--- a/modules/beta/manifests/autoupdater.pp
+++ b/modules/beta/manifests/autoupdater.pp
@@ -7,8 +7,21 @@ class beta::autoupdater {
     $stage_dir = '/srv/mediawiki-staging'
 
     # Parsoid JavaScript dependencies are updated on beta via npm
+    apt::repository { 'node':
+        uri        => 'https://deb.nodesource.com/node_6.x',
+        dist       => $::lsbdistcodename,
+        components => 'main',
+        keyfile    => 'puppet:///modules/beta/nodesource.gpg', # from https://deb.nodesource.com/gpgkey/nodesource.gpg.key
+    }
+    apt::pin { 'nodejs':
+        package  => 'nodejs',
+        pin      => 'version 6.14.3-1nodesource1',
+        priority => '1002',
+        require  => Apt::Repository['node'],
+    }
     package { 'npm':
-        ensure => 'present',
+        ensure  => 'present',
+        require => Apt::Pin['nodejs'],
     }
 
     file { '/usr/local/bin/wmf-beta-autoupdate.py':

I've armed keyholder on the new host. So what's next, @thcipriani?

Change 442229 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] deployment-prep: Add new deployment host

https://gerrit.wikimedia.org/r/442229

Change 442229 merged by Andrew Bogott:
[operations/puppet@production] deployment-prep: Add new deployment host

https://gerrit.wikimedia.org/r/442229

Are there known unresolved issues with the new host? It seems deployment-tin is still used as the primary for Jenkins.

Once the migration is done, we should probably phase out the deployment-tin and deployment-mira hosts.

Dzahn added a subscriber: hashar.

@hashar Can we change Jenkins config to use the new host per Krinkle's question above?

<Krenair> thcipriani, where are we with deployment-deploy01?
<thcipriani> I looked at it Friday, it looks ready to go to me.
<Krenair> cool so we just get someone with integration privileges to swap jenkins to using the new host?
<thcipriani> yep, we'll need to make it the deploy master so that it's unlocked for deployments
<thcipriani> I can add it to jenkins and re-point the labels appropriately, that's on my todo list for this week.

Change 449520 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: ensure deployment-deploy01 is a co-master

https://gerrit.wikimedia.org/r/449520

Change 449521 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: Make deployment-deploy01 main deploy server

https://gerrit.wikimedia.org/r/449521

Change 449520 merged by Dzahn:
[operations/puppet@production] Beta: ensure deployment-deploy01 is a co-master

https://gerrit.wikimedia.org/r/449520

Since https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/449219/ got merged, puppet has failed to compile the catalog on deployment-tin and deployment-mira, and I think the only real way forward is to get rid of them soon. Luckily deploy01 is almost there?

Luckily deploy01 is almost there?

deploy01 is currently handling all the beta-* jobs on jenkins. I tested scap3 deployment from deploy01. I've cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/449521/ so deploy01 is now acting as the main deployment server.

Probably ought to make a deploy02 and then shut down deployment-tin and deployment-mira.

Change 449521 merged by Dzahn:
[operations/puppet@production] Beta: Make deployment-deploy01 main deploy server

https://gerrit.wikimedia.org/r/449521

Probably ought to make a deploy02 and then shut down deployment-tin and deployment-mira.

doing

I think I misunderstood something about the npm thing and I don't think my patch for it worked after all.

Change 449643 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] deployment-prep: Set up deployment-deploy02 as deployment-mira stretch replacement

https://gerrit.wikimedia.org/r/449643

<Krenair> thcipriani, do we also need to do something about the apt source pointing at deployment-tin?
<thcipriani> Krenair: afaik that's only used for scap. We'll want to fix it, but it shouldn't block anything. I don't know how it was setup.
<thcipriani> used to test out the master branch of scap on deployment-prep, that is
<Krenair> well it blocks shutting deployment-tin down right? :)
<thcipriani> so that apt isn't screaming on all the other machines? yes, that does seem correct.

It looks like we probably need to copy /srv/packages across and then update role::aptly::client::servername in https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep

Edit: Done
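
A rough sketch of the manual copy, run on the new host (the rsync flags and the assumption of root ssh access between the instances are mine):

# Pull the aptly package pool across from the old deployment server.
sudo rsync -av deployment-tin.deployment-prep.eqiad.wmflabs:/srv/packages/ /srv/packages/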

Also need to figure out what to do about hieradata/labs/deployment-prep/host/deployment-tin.yaml in puppet.git

Move the content to hieradata/labs/deployment-prep-host/common.yaml and delete ./host/deployment-tin.yaml and also ./host/deployment-mira.yaml. The content looks identical, so we can just as well move it to common and not have hostname-specific files.

It looks like we probably need to copy /srv/packages across

In production we use rsync::quickdatacopy for that. It can be automated.

Move the content to hieradata/labs/deployment-prep-host/common.yaml and delete ./host/deployment-tin.yaml and also ./host/deployment-mira.yaml. The content looks identical, so we can just as well move it to common and not have hostname-specific files.

No wait, I'm wrong. That would affect other hosts in deployment-prep that aren't deployment servers, since this isn't role-based like in prod.

So just move the files to their new host names, staying in ./host ... unfortunately.
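
In puppet.git that would amount to something like the following (a sketch; the target file names are assumptions based on the new host names):

git mv hieradata/labs/deployment-prep/host/deployment-tin.yaml \
       hieradata/labs/deployment-prep/host/deployment-deploy01.yaml
git mv hieradata/labs/deployment-prep/host/deployment-mira.yaml \
       hieradata/labs/deployment-prep/host/deployment-deploy02.yaml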

Move the content to hieradata/labs/deployment-prep-host/common.yaml and delete ./host/deployment-tin.yaml and also ./host/deployment-mira.yaml. The content looks identical, so we can just as well move it to common and not have hostname-specific files.

No wait, I'm wrong. That would affect other hosts in deployment-prep that aren't deployment servers, since this isn't role-based like in prod.

So just move the files to their new host names, staying in ./host ... unfortunately.

AFAICT those hieradata files may not be needed for deploy0{1,2}. The mount_nfs stuff seems to be wrapped in the mount_nfs_volume function, which doesn't look like it will return anything truthy for the deployment-prep project (also evidenced by the lack of NFS mounts on those hosts). The light_process_count was meant to counter spammy (but otherwise harmless) output from the scap command (see T124956: Rise in "parent, LightProcess exiting" console spam). Since that doesn't appear to be happening on deployment-deploy01 (see the beta-scap-eqiad console output in CI: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/218249/console), it doesn't appear to be needed anymore.
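
For the record, a quick way to confirm the no-NFS-mounts observation on those hosts (a sketch):

# Lists NFS mounts, if any; prints nothing when there are none.
findmnt -t nfs,nfs4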

Change 450078 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: deployment-deploy02 is deployment host

https://gerrit.wikimedia.org/r/450078

Change 450079 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: remove deployment-{tin,mira}

https://gerrit.wikimedia.org/r/450079

Change 450078 merged by Dzahn:
[operations/puppet@production] Beta: deployment-deploy02 is deployment host

https://gerrit.wikimedia.org/r/450078

Change 449643 abandoned by Alex Monk:
beta: Set up deployment-deploy02 as deployment-mira replacement

Reason:
I1d044711

https://gerrit.wikimedia.org/r/449643

Change 450079 merged by Dzahn:
[operations/puppet@production] Beta: remove deployment-{tin,mira}

https://gerrit.wikimedia.org/r/450079

thcipriani claimed this task.

@thcipriani resolved?

yep!

Mentioned in SAL (#wikimedia-releng) [2018-08-13T23:04:58Z] <Krenair> deactivated and cleaned puppet node entries for deployment-{tin,mira} T192561

Mentioned in SAL (#wikimedia-releng) [2018-08-20T15:58:40Z] <hashar> deleting Jenkins slave deployment-tin.eqiad the instance has been replaced | T192561

During cherry-pick review today I realised that my attempt (b59add730544b922e1fb68ec344bc26027aa9e37) to get the npm package working as expected (I think for some beta-specific Parsoid auto-deployment mechanism?) was still cherry-picked. It added an extra repository (https://deb.nodesource.com/node_6.x) with its key and pinned nodejs to a specific version. Is that still necessary? Can someone check whether the other stretch packages for nodejs (from deb.debian.org or mirrors.wikimedia.org) are sufficient?
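
One quick way to check that on the instance would be something like (a sketch):

# Shows the candidate versions and which repository each would come from.
apt-cache policy nodejs npm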

Production deployment servers don't have nodejs installed. I don't know why deployment-prep deployment servers do.

Production deployment servers don't have nodejs installed. I don't know why deployment-prep deployment servers do.

looks like it's coming from beta::autoupdater and has something to do with Parsoid ... it may require a new package or maybe an update to puppet.

Thanks, I see:

# Parsoid JavaScript dependencies are updated on beta via npm
 package { 'npm':

So the next questions would be "why do we need Parsoid on the deployment server in beta when the deployment_server role in prod doesn't mention Parsoid at all" and/or "how are Parsoid JavaScript dependencies updated when not in beta".

Thanks, I see:

# Parsoid JavaScript dependencies are updated on beta via npm
 package { 'npm':

So the next questions would be "why do we need Parsoid on the deployment server in beta when the deployment_server role in prod doesn't mention Parsoid at all" and/or "how are Parsoid JavaScript dependencies updated when not in beta".

Hrm. Looks like this is from: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/93939/

Adding @hashar to see if he can provide any information.

A while ago, once a change got merged for Parsoid, we would trigger a Jenkins job that had to run npm install on the host. That is no longer the case, however, so I guess npm can be removed entirely.

(Note: I have no idea how Parsoid or other MediaWiki services are updated on deployment-prep.)

Change 456625 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: remove npm from deployment master

https://gerrit.wikimedia.org/r/456625

Change 456625 merged by Dzahn:
[operations/puppet@production] Beta: remove npm from deployment master

https://gerrit.wikimedia.org/r/456625