setup/install/deploy deploy1001 as deployment server
Closed, Resolved (Public)

Description

This task will track the setup and deployment of system WMF4748 as deploy1001.eqiad.wmnet, for use as a deployment server in eqiad. Tin is out of warranty, and failing hardware has called for its replacement on parent task T174452.

Details

Project            Branch      Lines +/-
operations/puppet  production  +0 -7
operations/puppet  production  +9 -6
operations/puppet  production  +8 -0
operations/dns     master      +2 -2
operations/puppet  production  +3 -4
operations/puppet  production  +0 -1
operations/puppet  production  +1 -2
operations/puppet  production  +2 -2
operations/puppet  production  +3 -3
operations/puppet  production  +1 -0
operations/puppet  production  +4 -0
operations/puppet  production  +0 -2
operations/puppet  production  +2 -1
operations/puppet  production  +2 -0
operations/dns     master      +2 -2
operations/puppet  production  +5 -6
operations/puppet  production  +2 -0
operations/puppet  production  +8 -0
operations/puppet  production  +9 -1
operations/puppet  production  +42 -1
operations/puppet  production  +5 -2
operations/puppet  production  +1 -7
operations/puppet  production  +2 -0
operations/puppet  production  +3 -0
operations/puppet  production  +1 -1
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +14 -5
operations/dns     master      +4 -0

Event Timeline

Change 422349 merged by Dzahn:
[operations/dns@master] Change cname for deployment.eqiad.wmnet and deployment.codfw.wmnet

https://gerrit.wikimedia.org/r/422349

Mentioned in SAL (#wikimedia-operations) [2018-03-27T22:28:21Z] <mutante> DNS - switching deployment service name to deploy1001 (T175288)

There has been a deploy from it, and all the changes above, including the DNS service name, are in place. Mukunda confirmed things are looking good. The switch was announced to the ops list, a wiki page was created on wikitech, and the fingerprints were pasted. It's up and running.

Change 422479 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: set deploy1001 to use jessie

https://gerrit.wikimedia.org/r/422479

Change 422479 merged by Dzahn:
[operations/puppet@production] install_server: set deploy1001 to use jessie

https://gerrit.wikimedia.org/r/422479

Mentioned in SAL (#wikimedia-operations) [2018-03-28T19:54:29Z] <mutante> deploy1001 - schedule downtime for reinstall with jessie, reinstalling (T175288)

Per the last ops meeting and Joe's comments:

  • Reinstall it one more time, going back to stretch instead of jessie.

Change 425331 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] deploy1001: reinstall with stretch instead of jessie

https://gerrit.wikimedia.org/r/425331

Change 425331 merged by Dzahn:
[operations/puppet@production] deploy1001: reinstall with stretch instead of jessie

https://gerrit.wikimedia.org/r/425331

Mentioned in SAL (#wikimedia-operations) [2018-04-10T20:30:47Z] <mutante> deploy1001 - reinstalled with jessie - re-adding to puppet (T175288)

Mentioned in SAL (#wikimedia-operations) [2018-04-10T20:30:58Z] <mutante> deploy1001 - reinstalled with stretch - re-adding to puppet (T175288)

Mentioned in SAL (#wikimedia-operations) [2018-04-11T18:11:47Z] <mutante> deploy1001 is back on stretch once again - it has been removed from scap hosts though (T175288 T185275)

Mentioned in SAL (#wikimedia-operations) [2018-04-19T08:14:22Z] <ema> reboot deploy1001 and arm keyholder T175288

Can this task be closed? (since everything in the task description is checked off)

No, it can't be closed since it's not done and tin is still the deployment server. The issue is that one or more checkboxes are missing from the task description.

I suggest we do the following:

  • Pick a date/time frame of a few hours where no deployments are happening (or cancel existing ones)
  • We switch the deployment server to deploy1001 and test deployments using the PHP7 setup present there
  • If anything breaks, we revert to tin and fix whatever problem we found with PHP 7 and deploy1001 and re-attempt at a later stage
  • If everything works fine, we keep tin for a few weeks as a fallback and then decom it.

> I suggest we do the following:
>
>   • Pick a date/time frame of a few hours where no deployments are happening (or cancel existing ones)
>   • We switch the deployment server to deploy1001 and test deployments using the PHP7 setup present there
>   • If anything breaks, we revert to tin and fix whatever problem we found with PHP 7 and deploy1001 and re-attempt at a later stage
>   • If everything works fine, we keep tin for a few weeks as a fallback and then decom it.

I would propose that if there are issues with PHP7 we just use HHVM on deploy1001 instead.

And let's do this on Friday; that leaves us until Monday's SWAT (if any).

The plan sounds good.

How do you guys feel about the maintenance server, terbium -> mwmaint1001 (T192092)? Should we do that first or after tin/deploy1001?

> How do you guys feel about the maintenance server, terbium -> mwmaint1001 (T192092)? Should we do that first or after tin/deploy1001?

We should do tin first, replacing terbium is more complex.

> And let's do this on Friday; that leaves us until Monday's SWAT (if any).

How about Friday, 25th of May?

@20after4 @greg We are trying to find a time to replace tin with deploy1001 (this time for real) and are thinking of Friday, May 25th. Any thoughts on that? It would also be interesting to know who is going to be the first deployer (mw and other) after that switch.

Change 434821 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mariadb: grant deploy1001 access to labswiki

https://gerrit.wikimedia.org/r/434821

We'll need the database grants above added, or that might block the switch. That being said, I'm not sure why exactly the deployment server needs to talk to the 'labswiki' (wikitech) database. What would break without that?

Change 434821 merged by Dzahn:
[operations/puppet@production] mariadb: grant deploy1001 access to labswiki

https://gerrit.wikimedia.org/r/434821

> We'll need the database grants above added or that might block the switch

This is unblocked. All 3 DB grant related changes have been merged/deployed/tested. (thanks Manuel)

@MoritzMuehlenhoff @Dzahn: The 25th should be fine.

We canceled today's planned migration. We are waiting because we need to be able to deploy at any time while T195520 is being resolved.

We will schedule a new date after that.

The new planned window for this migration is the upcoming Friday, June 1st (with thcipriani hoping he gets the train back on track, so there won't be too many deploys).

deployment-prep now also has a new instance using stretch with more disk space to match this (T192561#4243810)

Change 436814 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] scap/dsh: add deploy1001 to scap masters

https://gerrit.wikimedia.org/r/436814

Change 436814 merged by Dzahn:
[operations/puppet@production] scap/dsh: add deploy1001 to scap masters

https://gerrit.wikimedia.org/r/436814

Change 436816 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] scap: switch scap::deployment server to deploy1001

https://gerrit.wikimedia.org/r/436816

--- /etc/dsh/group/scap-masters	2018-05-24 14:25:47.608760286 +0000
+deploy1001.eqiad.wmnet

..

[deploy1001:~] $ scap pull
14:52:35 Copying to deploy1001.eqiad.wmnet from tin.eqiad.wmnet
14:52:35 Started rsync common

[deploy1001:~] $ scap pull-master tin.eqiad.wmnet
15:17:08 Copying to deploy1001.eqiad.wmnet from tin.eqiad.wmnet
15:17:08 Started rsync master
..
..

10:52 < mutante> !log deploy1001 - scap pull
11:05 < mutante> !log rsyncing /srv/mediawiki-staging to /srv/mediawiki-staging-before-backup/ on tin as a backup
11:12 < mutante> !log tin umask 022 && echo 'switching deploy servers' > /var/lock/scap-global-lock
11:17 < mutante> !log [deploy1001:~] $ scap pull-master tin.eqiad.wmnet
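
The lock step in the log above can be wrapped so that it refuses to overwrite a lock that somebody else already holds. This is only a sketch: the function names are hypothetical and not part of scap, the lock path is passed in rather than hard-coded, and it assumes (as the log entry suggests) that the mere presence of the file is what blocks deployments.

```shell
# Hypothetical helpers, not part of scap.

# set_deploy_lock FILE MESSAGE
# Creates FILE containing MESSAGE. Fails without touching FILE if it already exists.
set_deploy_lock() {
    lock_file=$1
    message=$2
    (
        umask 022   # same umask as in the log entry, so the file is world-readable
        set -C      # noclobber: '>' fails instead of overwriting an existing file
        printf '%s\n' "$message" > "$lock_file"
    ) 2>/dev/null
}

# clear_deploy_lock FILE
# Removes the lock file; succeeds even if it is already gone.
# ASSUMPTION: removing the file is enough to release the lock.
clear_deploy_lock() {
    rm -f -- "$1"
}

# Example, mirroring the log entry (left commented out on purpose):
#   set_deploy_lock /var/lock/scap-global-lock 'switching deploy servers'
```

Running the write in a subshell keeps the `umask` and noclobber settings from leaking into the calling shell.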

Change 436818 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] switch deployment_server from tin to deploy1001

https://gerrit.wikimedia.org/r/436818

11:41 < mutante> !log root@deploy1001:/srv/mediawiki-staging# find . -uid 996 -exec chown mwdeploy {} \;

11:47 < mutante> !log @deploy1001:/srv/deployment# find . -uid 997 -exec chown trebuchet {} \;
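
The two ownership fixes above follow the same pattern: find everything owned by a numeric UID and hand it to a named account, presumably because the numeric IDs copied over from tin did not match the accounts on deploy1001. A minimal sketch of that pattern with a dry-run default is shown below. The function name and its arguments are hypothetical. Unlike the logged commands, it uses `chown -h` so that a symlink itself is changed rather than its target, and `{} +` so that many paths are passed to each `chown` call.

```shell
# Hypothetical helper: remap_uid DIR OLD_UID NEW_OWNER [apply]
# Without the literal fourth argument "apply" it only lists what would change.
remap_uid() {
    dir=$1
    old_uid=$2
    new_owner=$3
    mode=${4:-dry-run}
    if [ "$mode" = "apply" ]; then
        # -h: change a symlink itself, not the file it points to
        # {} +: batch many paths into each chown invocation
        find "$dir" -uid "$old_uid" -exec chown -h "$new_owner" {} +
    else
        find "$dir" -uid "$old_uid" -print
    fi
}

# Example, mirroring the first logged command (left commented out on purpose):
#   remap_uid /srv/mediawiki-staging 996 mwdeploy          # list only
#   remap_uid /srv/mediawiki-staging 996 mwdeploy apply    # actually chown
```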

Change 436816 merged by Dzahn:
[operations/puppet@production] scap: switch scap::deployment server to deploy1001

https://gerrit.wikimedia.org/r/436816

Change 436818 merged by Dzahn:
[operations/puppet@production] switch deployment_server from tin to deploy1001

https://gerrit.wikimedia.org/r/436818

Mentioned in SAL (#wikimedia-operations) [2018-06-01T16:21:53Z] <mutante> deployment server has switched away from tin to deploy1001. set global scap lock on deploy1001, re-enabled puppet and ran puppet, disabled tin as deployment server (T175288)

Change 436827 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] scap: remove tin from scap masters and hosts

https://gerrit.wikimedia.org/r/436827

Change 436827 merged by Dzahn:
[operations/puppet@production] scap: rm tin from masters,hosts, add deploy1001 to hosts

https://gerrit.wikimedia.org/r/436827

Change 436830 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] remove tin from hosts kubernetes master is accessible to

https://gerrit.wikimedia.org/r/436830

Change 436830 merged by Dzahn:
[operations/puppet@production] remove tin from hosts kubernetes master is accessible to

https://gerrit.wikimedia.org/r/436830

Change 436831 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install/cumin/scap: update/remove tin-related comments

https://gerrit.wikimedia.org/r/436831

Change 436831 merged by Dzahn:
[operations/puppet@production] install/cumin/scap: update/remove tin-related comments

https://gerrit.wikimedia.org/r/436831

Dzahn changed the status of subtask T196175: decom/reclaim tin from Open to Stalled. (Jun 1 2018, 5:11 PM)

deploy1001 is now the active deployment server. From here it should just be about removing tin. We will wait a grace period, and then this continues on T185275 and finally T196175.

Change 436835 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] switch deployment.eqiad from tin to deploy1001

https://gerrit.wikimedia.org/r/436835

Change 436835 merged by Dzahn:
[operations/dns@master] switch deployment.[eqiad|codfw] from tin to deploy1001

https://gerrit.wikimedia.org/r/436835
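
After a service-name change like this one, a quick way to confirm the result is to compare the CNAME answer with the expected host. The sketch below only implements the comparison; the function name is hypothetical, and the `dig` call is left as a comment because it needs network access to a resolver that serves the internal zone.

```shell
# Hypothetical helper: same_host A B
# Succeeds if two DNS names are equal, ignoring case and a trailing root dot
# (dig prints fully qualified names with a trailing dot).
same_host() {
    a=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    b=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    [ "${a%.}" = "${b%.}" ]
}

# Example (left commented out on purpose):
#   answer=$(dig +short deployment.eqiad.wmnet CNAME)
#   same_host "$answer" deploy1001.eqiad.wmnet && echo "service name points at deploy1001"
```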

Change 436992 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] deployment::server: add rsync for home dirs

https://gerrit.wikimedia.org/r/436992

Change 436992 merged by Dzahn:
[operations/puppet@production] deployment::server: add rsync for home dirs

https://gerrit.wikimedia.org/r/436992

Change 420917 merged by Dzahn:
[operations/puppet@production] decom and remove remnants of tin.eqiad.wmnet

https://gerrit.wikimedia.org/r/420917

Dzahn changed the status of subtask T196175: decom/reclaim tin from Stalled to Open.

Mentioned in SAL (#wikimedia-operations) [2018-06-11T10:52:48Z] <mutante> phab1002 - editing cached scap config /srv/deployment/phabricator/deployment-cache/.config to replace tin.eqiad with deploy1001.eqiad deployment server, run puppet. other options: run scap with --refresh-config, delet cached .config file (T196019) (T175288)
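
The manual edit described in this log entry amounts to a host-name substitution in the cached file. A minimal sketch follows, assuming nothing about the file format beyond the old host name appearing in it as plain text. The function name is hypothetical, and both names are assumed to be ordinary host names (letters, digits, hyphens, dots).

```shell
# Hypothetical helper: replace_deploy_host FILE OLD NEW
# Rewrites every occurrence of host name OLD with NEW in FILE and keeps
# the previous contents as FILE.bak.
# Caveat: this is a substring match, so OLD also matches inside longer names.
replace_deploy_host() {
    file=$1
    old=$2
    new=$3
    # Escape the dots so they match literally instead of "any character"
    old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
    cp -p -- "$file" "$file.bak" &&
        sed "s|$old_re|$new|g" "$file.bak" > "$file"
}

# Example, using the path from the log entry (left commented out on purpose):
#   replace_deploy_host /srv/deployment/phabricator/deployment-cache/.config \
#       tin.eqiad.wmnet deploy1001.eqiad.wmnet
```

Keeping a `.bak` copy makes it easy to diff the result or to restore the file if the substitution touched more than intended.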

Change 440100 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mw-deployment: remove rsync for tin home dirs

https://gerrit.wikimedia.org/r/440100

Change 440100 merged by Dzahn:
[operations/puppet@production] mw-deployment: remove rsync for tin home dirs

https://gerrit.wikimedia.org/r/440100

Mentioned in SAL (#wikimedia-operations) [2019-08-27T11:51:31Z] <mutante> miscweb1001 - manually remove tin.eqiad.wmnet (!) from /srv/iegreview/iegreview-cache/.config and replace with deploy1001 after first puppet run. still existing bug that tin is not fully removed (T224247, T175288, T197470)
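
Since stale references to tin kept turning up in cached `.config` files, a search for the remaining ones can be sketched as below. The function name is hypothetical, and the sketch assumes only that the cached files are named `.config` and contain the old host name as plain text, as in the two cases logged on this task.

```shell
# Hypothetical helper: find_stale_configs ROOT HOST
# Prints every regular file named .config under ROOT that still contains
# HOST as a literal string (-F: no regex interpretation of the dots).
# Rely on the printed paths, not the exit status: find reports failure
# whenever a grep batch finds no match.
find_stale_configs() {
    root=$1
    host=$2
    find "$root" -type f -name .config -exec grep -l -F -- "$host" {} +
}

# Example (left commented out on purpose):
#   find_stale_configs /srv tin.eqiad.wmnet
```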

RobH mentioned this in Unknown Object (Task). (Oct 14 2020, 9:09 PM)