Maniphest T208101

Migrate deployment-prep to eqiad1
Closed, Resolved · Public
Assigned To

Authored By

	Andrew
	Oct 26 2018, 10:20 PM

Description

This will take a long time (possibly 2-3 workdays) and contains a lot of edge cases.

Details

Subject	Repo	Branch	Lines +/-
deployment-prep: Update IPs for Varnish	operations/mediawiki-config	master	+2 -2
openstack: remove out of date deployment-cache-upload04 IPs	operations/puppet	production	+0 -3
deployment-prep: Update parsoid09 IP	operations/mediawiki-config	master	+1 -1
deployment-prep: Update deployment-db* IPs	operations/mediawiki-config	master	+8 -8
deployment-prep: Update cache-upload private IP	operations/mediawiki-config	master	+2 -2
Update deployment-db3 and -db4 to new IPS in 172.16.5 (eqiad1-r)	operations/mediawiki-config	master	+8 -8
deployment-prep: Update BounceHandler deployment-mx02 IP for migration	operations/mediawiki-config	master	+1 -1
shinken: temporarily remove monitoring for deployment-prep	operations/puppet	production	+1 -1
deployment-prep: Update some IPs for the migration	operations/puppet	production	+3 -3
Horizon: enable deployment-prep in eqiad1-r	operations/puppet	production	+1 -1

Related Objects

Status	Assigned	Task
Open	None	T53494 Use Beta cluster as a true canary for code deployments (epic)
Open	None	T87220 Minimize infrastructure differences between Beta Cluster and production
Open	None	T196662 Set up LVS in beta like prod
Resolved	bd808	T166396 Program 1 Outcome 4: VPS hosting
Resolved	None	T167293 Nova-network to Neutron migration
Resolved	Krenair	T208101 Migrate deployment-prep to eqiad1
Resolved	• mmodell	T208262 Ensure there are no hard-coded IPs in use for beta
Resolved	bd808	T210030 RedisBagOStuff is broken on beta
Resolved	Krenair	T210214 deployment-cache-text04 decomissioning
Resolved	• dduvall	T210301 Status of deployment-redis0[56]?

Event Timeline

Andrew triaged this task as Medium priority. Oct 26 2018, 10:20 PM

Andrew created this task.

Restricted Application removed a project: Patch-For-Review. Oct 26 2018, 10:20 PM

@Krenair just rattled off a list of things we'll probably have to tweak by hand:

wmf-config/CommonSettings-labs.php: $wgBounceHandlerInternalIPs = [ '127.0.0.1', '::1', '10.68.23.220' ]; deployment-mx02.deployment-prep.eqiad.wmflabs
5:13 PM wmf-config/InitialiseSettings-labs.php: '10.68.20.142' => true deployment-parsoid09.deployment-prep.eqiad.wmflabs
5:13 PM lots of stuff in wmf-config/db-labs.php
5:13 PM and some in wmf-config/reverse-proxy-staging.php
5:13 PM plus security groups
5:15 PM oh and in puppet there's that nova_dnsmasq_aliases stuff that I don't understand, couple of entries in there
5:16 PM bunch of stuff in hieradata/labs/deployment-prep/common.yaml, probably also the horizon and wikitech pages, possibly cherry-picks
5:18 PM deployment hosts in modules/network/manifests/constants.pp

Krenair added a project: Beta-Cluster-Infrastructure. Oct 26 2018, 10:20 PM

This was mostly based on grepping for 10.68 by the way, there's likely more stuff out there lurking.
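The grep described above can be sketched as a small script. This is only an illustration: the `10.68.0.0/16` range is an assumption inferred from "grepping for 10.68", and `find_old_region_ips` is a hypothetical helper name, not something that exists in puppet.git or mediawiki-config.

```python
import ipaddress
import re

# Assumption: old-region addresses fall inside 10.68.0.0/16, inferred from
# the "grepping for 10.68" remark. Adjust the network if the range differs.
OLD_REGION = ipaddress.ip_network("10.68.0.0/16")

# Dotted-quad candidates that are not part of a longer digit/dot run.
IPV4_RE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?!\d)")


def find_old_region_ips(text):
    """Return (line_number, ip) pairs for IPv4 literals inside OLD_REGION."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in IPV4_RE.finditer(line):
            try:
                addr = ipaddress.ip_address(match.group())
            except ValueError:
                continue  # looks like an IP but is not valid, e.g. 999.1.1.1
            if addr in OLD_REGION:
                hits.append((lineno, str(addr)))
    return hits


if __name__ == "__main__":
    sample = (
        "$wgBounceHandlerInternalIPs = [ '127.0.0.1', '::1', '10.68.23.220' ];\n"
        "'172.16.4.18' => true\n"
    )
    for lineno, ip in find_old_region_ips(sample):
        print(f"line {lineno}: {ip}")
```

Unlike a plain grep for the string 10.68, this only reports literals that parse as addresses inside the network, so something like 110.68.1.1 would not be flagged. It still cannot find IPs that are assembled at runtime or stored outside the scanned files.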

Andrew's list of the first batch of VMs (these currently live on labvirt1017):
deployment-cpjobqueue
deployment-elastic06
deployment-kafka-jumbo-1
deployment-kafka-main-2
deployment-redis05

Of these, redis05 has a reference in puppet.git that is difficult to spot because it has a bad comment: https://gerrit.wikimedia.org/r/470095

Krenair mentioned this in T208262: Ensure there are no hard-coded IPs in use for beta. Oct 29 2018, 8:33 PM

Krenair added a subtask: T208262: Ensure there are no hard-coded IPs in use for beta. Oct 29 2018, 8:36 PM

• mmodell subscribed. Nov 5 2018, 5:14 PM

In order to keep this ball rolling, I propose that we schedule this move for November 27th, 28th, and 29th. Any objections? We could try to cram it in the week before but then we'd run up against Thanksgiving if it takes longer than expected.

...is anybody there?

@Andrew: I won't be around on those days to help out as I'm on vacation from the 26th through the 30th. Maybe @dan or @hashar will be available?

I think you mean @dduvall :)

hahah good catch @Krenair, thanks

I do not have any available bandwidth for the next 4 weeks at least, sorry. I am running the MediaWiki train from 11/12 to 11/16 and from 11/26 to 11/30. We then have an offsite in the first week of December, which I have to prepare for next week.

Andrew added a subscriber: greg. Nov 16 2018, 8:45 PM

So, by the time y'all are back from your offsite I will need to be powering down labvirt1010 and 1011 (they are leased and their leases are expiring), and repurposing labvirt1014. Those three servers contain 15 deployment-prep instances.

I can keep doing what I'm doing (shuffling things about within the old region) but that just means double the downtime for you since they WILL need to get moved again so we can decom the whole nova-network region. It's also double the work for me.

The option remains to do this one server at a time, adjusting for IP changes as needed and as we go. I'm starting to think that that's a more realistic option since I can't get anyone to commit to a migration window.

@Andrew: Can it be done next week? I know that's pushing it given that it's a 3 day work week.

Yep, next week would work just fine for me. I'm available until at least Wednesday evening, and can multi-task if it runs longer than that.

Change 474756 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Horizon: enable deployment-prep in eqiad1-r

https://gerrit.wikimedia.org/r/474756

Change 474756 merged by Andrew Bogott:
[operations/puppet@production] Horizon: enable deployment-prep in eqiad1-r

https://gerrit.wikimedia.org/r/474756

Change 474758 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] shinken: temporarily remove monitoring for deployment-prep

https://gerrit.wikimedia.org/r/474758

This is now underway!

Change 474820 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] deployment-prep: Update some IPs for the migration

https://gerrit.wikimedia.org/r/474820

Change 474820 merged by Andrew Bogott:
[operations/puppet@production] deployment-prep: Update some IPs for the migration

https://gerrit.wikimedia.org/r/474820

Change 474758 merged by Andrew Bogott:
[operations/puppet@production] shinken: temporarily remove monitoring for deployment-prep

https://gerrit.wikimedia.org/r/474758

Change 474823 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/mediawiki-config@master] deployment-prep: Update BounceHandler deployment-mx02 IP for migration

https://gerrit.wikimedia.org/r/474823

Change 474823 merged by Andrew Bogott:
[operations/mediawiki-config@master] deployment-prep: Update BounceHandler deployment-mx02 IP for migration

https://gerrit.wikimedia.org/r/474823

Mentioned in SAL (#wikimedia-releng) [2018-11-20T11:00:34Z] <hashar> deployment-deploy01 got migrated to a new region but the Jenkins configuration had not been updated. Adjusting IP address from 10.68.23.38 to 172.16.4.18 | T208101

Mentioned in SAL (#wikimedia-releng) [2018-11-20T11:47:23Z] <hashar> Armed keyholder on deployment-deploy01 Got shutdown while being migrated a new cloud region # T208101

Change 474890 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/mediawiki-config@master] deployment-prep: Update cache-upload private IP

https://gerrit.wikimedia.org/r/474890

Change 474891 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] openstack: remove out of date deployment-cache-upload04 IPs

https://gerrit.wikimedia.org/r/474891

Change 474892 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/mediawiki-config@master] deployment-prep: Update deployment-db* IPs

https://gerrit.wikimedia.org/r/474892

Mentioned in SAL (#wikimedia-releng) [2018-11-20T12:11:21Z] <twentyafterfour> scap failures on deployment-mediawiki-07 are related to uid/gid mismatch of the mwdeploy user, specifically the owner of that user's home dir is uid 603 but /etc/passwd|group have a different uid/gid for the same username. T208101
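
The uid/gid mismatch described in that log entry can be detected with a check along these lines. This is a sketch only: `ownership_mismatch` and `check_home` are hypothetical helper names, and the user to check (for example mwdeploy) is passed in by the caller.

```python
import os
import pwd


def ownership_mismatch(path, expected_uid, expected_gid):
    """Return None if path is owned by the expected uid/gid, else a report."""
    st = os.stat(path)
    actual = (st.st_uid, st.st_gid)
    expected = (expected_uid, expected_gid)
    if actual == expected:
        return None
    return {"path": path, "expected": expected, "actual": actual}


def check_home(username):
    """Compare a user's passwd entry with the owner of their home directory."""
    entry = pwd.getpwnam(username)  # raises KeyError for an unknown user
    return ownership_mismatch(entry.pw_dir, entry.pw_uid, entry.pw_gid)
```

Calling `check_home("mwdeploy")` on an affected host would be expected to return a report rather than None, but that has not been run against the hosts in this task.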

Change 474898 had a related patch set uploaded (by 20after4; owner: 20after4):
[operations/mediawiki-config@master] Update deployment-db3 and -db4 to new IPS in 172.16.5 (eqiad1-r)

https://gerrit.wikimedia.org/r/474898

Change 474898 abandoned by 20after4:
Update deployment-db3 and -db4 to new IPS in 172.16.5 (eqiad1-r)

Reason:
duplicate of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/474892

https://gerrit.wikimedia.org/r/474898

Change 474899 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/mediawiki-config@master] deployment-prep: Update parsoid09 IP

https://gerrit.wikimedia.org/r/474899

Change 474890 merged by jenkins-bot:
[operations/mediawiki-config@master] deployment-prep: Update cache-upload private IP

https://gerrit.wikimedia.org/r/474890

Change 474892 merged by jenkins-bot:
[operations/mediawiki-config@master] deployment-prep: Update deployment-db* IPs

https://gerrit.wikimedia.org/r/474892

Mentioned in SAL (#wikimedia-operations) [2018-11-20T12:45:38Z] <zfilipin@deploy1001> Synchronized wmf-config/reverse-proxy-staging.php: SWAT: [[gerrit:474890|deployment-prep: Update cache-upload private IP (T208101)]] (duration: 00m 45s)

Mentioned in SAL (#wikimedia-operations) [2018-11-20T12:55:22Z] <zfilipin@deploy1001> Synchronized wmf-config/db-labs.php: SWAT: [[gerrit:474892|deployment-prep: Update deployment-db* IPs (T208101)]] (duration: 00m 47s)

Mentioned in SAL (#wikimedia-releng) [2018-11-20T13:01:47Z] <twentyafterfour> PHP Startup: Unable to load dynamic library '/usr/lib/php/20151012/luasandbox.so' - /usr/lib/php/20151012/luasandbox.so: cannot open shared object file: No such file or directory T208101

Change 474899 merged by jenkins-bot:
[operations/mediawiki-config@master] deployment-prep: Update parsoid09 IP

https://gerrit.wikimedia.org/r/474899

Mentioned in SAL (#wikimedia-operations) [2018-11-20T13:03:24Z] <zfilipin@deploy1001> Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:474899|deployment-prep: Update parsoid09 IP (T208101)]] (duration: 00m 47s)

Change 474891 merged by Andrew Bogott:
[operations/puppet@production] openstack: remove out of date deployment-cache-upload04 IPs

https://gerrit.wikimedia.org/r/474891

Andrew added a subscriber: aborrero. Nov 20 2018, 4:53 PM

greg moved this task from INBOX to Kanban on the Release-Engineering-Team board. Nov 21 2018, 12:00 AM

greg edited projects, added Release-Engineering-Team (Kanban); removed Release-Engineering-Team.

greg moved this task from Backlog to In-progress on the Release-Engineering-Team (Kanban) board.

• mmodell closed subtask T210030: RedisBagOStuff is broken on beta as Resolved. Nov 21 2018, 8:13 PM

I just moved the A records over from deployment-cache-text04 to deployment-cache-text05. It seems to be working fine.
So we're left with deployment-cache-text04 in the old region to be shut down when all traffic is fully gone, and deployment-zotero01 which is going away with the trusty deprecation.

PerfektesChaos reopened subtask T210030: RedisBagOStuff is broken on beta as Open. Nov 22 2018, 6:07 PM

Krenair closed subtask T210030: RedisBagOStuff is broken on beta as Resolved. Nov 22 2018, 8:17 PM

Change 475363 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/mediawiki-config@master] deployment-prep: Update IPs for Varnish

https://gerrit.wikimedia.org/r/475363

Also:

DBs are RO
VE is out of action (HTTP 503s)

2018-11-22 20:48:10 [W-cWCawQBHcAAErbfF0AAAAH] deployment-mediawiki-07 enwiki 1.33.0-alpha DBReplication ERROR: Wikimedia\Rdbms\LoadMonitor::getServerStates: host 172.16.5.5:3306 is not replicating? {"db_server":"172.16.5.5:3306"}
2018-11-22 20:48:10 [W-cWCawQBHcAAErbfF0AAAAH] deployment-mediawiki-07 enwiki 1.33.0-alpha DBReplication ERROR: Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server 172.16.5.5:3306 is not replicating? {"host":"172.16.5.5:3306"}
2018-11-22 20:48:17 [5397214e708af0ce4fc3f26b] deployment-mwmaint01 wikidatawiki 1.32.0-alpha DBReplication ERROR: Wikimedia\Rdbms\LoadMonitor::getServerStates: host deployment-db04 is unreachable {"db_server":"deployment-db04"}
2018-11-22 20:48:17 [5397214e708af0ce4fc3f26b] deployment-mwmaint01 wikidatawiki 1.32.0-alpha DBReplication ERROR: Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server deployment-db04 is not replicating? {"host":"deployment-db04"}
2018-11-22 20:48:17 [5397214e708af0ce4fc3f26b] deployment-mwmaint01 wikidatawiki 1.32.0-alpha DBReplication ERROR: Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode
2018-11-22 20:48:31 [W-cWH6wQBHcAAErbfF4AAAAM] deployment-mediawiki-07 enwiki 1.33.0-alpha DBReplication ERROR: Wikimedia\Rdbms\LoadMonitor::getServerStates: host deployment-db04 is not replicating? {"db_server":"deployment-db04"}
2018-11-22 20:48:31 [W-cWH6wQBHcAAErbfF4AAAAM] deployment-mediawiki-07 enwiki 1.33.0-alpha DBReplication ERROR: Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server deployment-db04 is not replicating? {"host":"deployment-db04"}

PerfektesChaos subscribed. Nov 22 2018, 9:13 PM

root@BETA[(none)]> show slave status  \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: deployment-db03.deployment-prep.eqiad.wmflabs
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: deployment-db03-bin.000103
          Read_Master_Log_Pos: 504
               Relay_Log_File: deployment-db04-relay-bin.000002
                Relay_Log_Pos: 1840130
        Relay_Master_Log_File: deployment-db03-bin.000102
             Slave_IO_Running: Yes
[...]
        Seconds_Behind_Master: NULL
[...]

Wait, what? NULL?

Turns out that Slave_SQL_Running: No is a Bad Thing: despite Slave_IO_State saying it's waiting for master to send event, replication is actually broken.
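
For future reference, a rough way to turn that observation into a check. This is a sketch that parses the text output of show slave status \G; `parse_slave_status` and `replication_healthy` are hypothetical helper names, and the sample values mirror the fields mentioned above rather than a full real dump.

```python
def parse_slave_status(output):
    """Parse vertical-format SHOW SLAVE STATUS text into a field -> value dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # row separators and "[...]" lines contain no colon
            fields[key.strip()] = value.strip()
    return fields


def replication_healthy(fields):
    """Both threads must be running and the lag must be a real number."""
    return (
        fields.get("Slave_IO_Running") == "Yes"
        and fields.get("Slave_SQL_Running") == "Yes"
        and fields.get("Seconds_Behind_Master") not in (None, "NULL")
    )


if __name__ == "__main__":
    broken = """
                 Slave_IO_State: Waiting for master to send event
               Slave_IO_Running: Yes
              Slave_SQL_Running: No
          Seconds_Behind_Master: NULL
    """
    print(replication_healthy(parse_slave_status(broken)))
```

The point is that Slave_IO_State and Slave_IO_Running alone are not enough: the IO thread can be happily connected while the SQL thread has stopped, in which case Seconds_Behind_Master is reported as NULL.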

Alright, so I made it skip insertion of duplicate things until replication recovered, and Seconds_Behind_Master is now back down to 0.

As for VE, URLs like https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/html/User%3AKrenair%2Fsandbox/387145?redirect=false fail:

root@deployment-cache-text05:/etc/varnish# grep deployment-restbase * -A1
wikimedia-common_misc-backend.inc.vcl:	.host = "deployment-restbase01.deployment-prep.eqiad.wmflabs";
wikimedia-common_misc-backend.inc.vcl-	.port = "7231";
--
wikimedia-common_text-backend.inc.vcl:	.host = "deployment-restbase01.deployment-prep.eqiad.wmflabs";
wikimedia-common_text-backend.inc.vcl-	.port = "7231";
root@deployment-cache-text05:/etc/varnish# curl http://deployment-restbase01.deployment-prep.eqiad.wmflabs:7231 -v
* Rebuilt URL to: http://deployment-restbase01.deployment-prep.eqiad.wmflabs:7231/
*   Trying 172.16.5.26...
* TCP_NODELAY set
* connect to 172.16.5.26 port 7231 failed: Connection refused
* Failed to connect to deployment-restbase01.deployment-prep.eqiad.wmflabs port 7231: Connection refused
* Closing connection 0
curl: (7) Failed to connect to deployment-restbase01.deployment-prep.eqiad.wmflabs port 7231: Connection refused

krenair@deployment-restbase01:~$ sudo lsof -i :7231
krenair@deployment-restbase01:~$ sudo service restbase status
● restbase.service - "restbase service"
   Loaded: loaded (/lib/systemd/system/restbase.service; enabled)
   Active: active (running) since Thu 2018-11-22 22:58:32 UTC; 21min ago
 Main PID: 18100 (firejail)
   CGroup: /system.slice/restbase.service
           ├─18100 /usr/bin/firejail --blacklist=/root --blacklist=/home --caps --seccomp /usr/bin/nodejs restbase/server.js -c /etc/restbase/config.yaml
           ├─18102 /usr/bin/firejail --blacklist=/root --blacklist=/home --caps --seccomp /usr/bin/nodejs restbase/server.js -c /etc/restbase/config.yaml
           ├─18107 /usr/bin/nodejs restbase/server.js -c /etc/restbase/config.yaml
           ├─24196 /usr/bin/nodejs /srv/deployment/restbase/deploy-cache/revs/5b8ad3c0a8b6a52a7eacb731055ea4c3e392aa3f/node_modules/service-runner/service-runner.js -c /etc/restbase/config.yaml
           ├─24212 /usr/bin/nodejs /srv/deployment/restbase/deploy-cache/revs/5b8ad3c0a8b6a52a7eacb731055ea4c3e392aa3f/node_modules/service-runner/service-runner.js -c /etc/restbase/config.yaml
           └─24218 /usr/bin/nodejs /srv/deployment/restbase/deploy-cache/revs/5b8ad3c0a8b6a52a7eacb731055ea4c3e392aa3f/node_modules/service-runner/service-runner.js -c /etc/restbase/config.yaml

Nov 22 22:58:32 deployment-restbase01 systemd[1]: Started "restbase service".
Nov 22 22:58:32 deployment-restbase01 restbase[18100]: Reading profile /etc/firejail/default.profile
Nov 22 22:58:32 deployment-restbase01 restbase[18100]: Reading profile /etc/firejail/disable-common.inc
Nov 22 22:58:32 deployment-restbase01 restbase[18100]: Reading profile /etc/firejail/disable-programs.inc
Nov 22 22:58:32 deployment-restbase01 restbase[18100]: ** Note: you can use --noprofile to disable default.profile **
krenair@deployment-restbase01:~$ grep 7231 /etc/restbase/config.yaml
      port: 7231
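
The situation in the output above (service reported as active, but nothing accepting connections on the configured port) can be checked for directly. A minimal sketch, assuming a plain TCP connect is a good enough probe; `port_accepts_connections` is a hypothetical helper name, and the host and port in the usage example are the ones from the curl command above.

```python
import socket


def port_accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    host = "deployment-restbase01.deployment-prep.eqiad.wmflabs"
    print(port_accepts_connections(host, 7231))
```

This only tells you whether something is listening, not why it is not. It complements the service status output, which can say active (running) while the worker processes never bind the port.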

Krenair mentioned this in T210092: UnderflowException when creating an account, FancyCaptcha.php: Ran out of captcha images. Nov 23 2018, 12:23 AM

I do confirm that dewiki@BETA works properly now.

@fgiunchedi: Why is puppet disabled on deployment-mediawiki-07? The reason given is just filippo

Krenair moved this task from To Triage to Epics / Tracking on the Beta-Cluster-Infrastructure board. Nov 25 2018, 3:39 AM

In T208101#4770736, @Krenair wrote:

@fgiunchedi: Why is puppet disabled on deployment-mediawiki-07? The reason given is just filippo

I was performing testing for T205851; I've re-enabled puppet now.

Thanks @fgiunchedi

Change 475363 merged by jenkins-bot:
[operations/mediawiki-config@master] deployment-prep: Update IPs for Varnish

https://gerrit.wikimedia.org/r/475363

PerfektesChaos unsubscribed. Nov 26 2018, 10:46 AM

In T208101#4768886, @Krenair wrote:

As for VE, URLs like https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/html/User%3AKrenair%2Fsandbox/387145?redirect=false fail:

Should be fixed now. Cassandra wasn't starting properly.

In T208101#4774370, @mobrovac wrote:

In T208101#4768886, @Krenair wrote:

As for VE, URLs like https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/html/User%3AKrenair%2Fsandbox/387145?redirect=false fail:

Should be fixed now. Cassandra wasn't starting properly.

Thanks @mobrovac. For future reference, the command tail-restbase will show logs that indicate this sort of failure. I was looking at sudo service restbase status, which didn't show anything wrong.

While @mmodell was the point person from Release-Engineering-Team, @Krenair did much of the follow-up so I'm assigning this task to him (just to reflect reality).