
Migrate deployment-prep to eqiad1
Closed, Resolved, Public

Description

This will take a long time (possibly 2-3 workdays) and contains a lot of edge cases.

Event Timeline

Andrew triaged this task as Medium priority. Oct 26 2018, 10:20 PM
Andrew created this task.

@Krenair just rattled off a list of things we'll probably have to tweak by hand:

wmf-config/CommonSettings-labs.php: $wgBounceHandlerInternalIPs = [ '127.0.0.1', '::1', '10.68.23.220' ]; deployment-mx02.deployment-prep.eqiad.wmflabs
5:13 PM wmf-config/InitialiseSettings-labs.php: '10.68.20.142' => true
deployment-parsoid09.deployment-prep.eqiad.wmflabs
5:13 PM lots of stuff in wmf-config/db-labs.php
5:13 PM and some in wmf-config/reverse-proxy-staging.php
5:13 PM plus security groups
5:15 PM oh and in puppet there's that nova_dnsmasq_aliases stuff that I don't understand, couple of entries in there
5:16 PM bunch of stuff in hieradata/labs/deployment-prep/common.yaml, probably also the horizon and wikitech pages, possibly cherry-picks
5:18 PM deployment hosts in modules/network/manifests/constants.pp

This was mostly based on grepping for 10.68 by the way, there's likely more stuff out there lurking.
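For anyone repeating that grep, here is a rough Python sketch of the same idea. It assumes the old region's instances all sit in 10.68.0.0/16 (which is what grepping for 10.68 implies, but is not confirmed anywhere on this task), and the function name is made up for illustration.

```python
import ipaddress
import re

# Assumption: every old-region instance address is inside 10.68.0.0/16.
OLD_REGION = ipaddress.ip_network("10.68.0.0/16")

# Loose matcher for dotted quads; the ipaddress module does the real validation.
IPV4_CANDIDATE = re.compile(r"(?<!\d)(?:\d{1,3}\.){3}\d{1,3}(?!\d)")


def find_old_region_ips(text):
    """Return (line_number, ip) pairs for every old-region address in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for candidate in IPV4_CANDIDATE.findall(line):
            try:
                addr = ipaddress.ip_address(candidate)
            except ValueError:
                continue  # looked like an IP but is not a valid one
            if addr in OLD_REGION:
                hits.append((lineno, candidate))
    return hits
```

Feeding it the contents of files such as wmf-config/db-labs.php would list the lines that still need a new address. It only catches literal IPs, so hostnames and security group rules still need checking by hand.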

Andrew's list of the first batch of VMs (these currently live on labvirt1017):
deployment-cpjobqueue
deployment-elastic06
deployment-kafka-jumbo-1
deployment-kafka-main-2
deployment-redis05

Of these, redis05 has a reference in puppet.git that is difficult to spot because it has a bad comment: https://gerrit.wikimedia.org/r/470095

In order to keep this ball rolling, I propose that we schedule this move for November 27th, 28th, and 29th. Any objections? We could try to cram it in the week before but then we'd run up against Thanksgiving if it takes longer than expected.

mmodell added subscribers: Unknown Object (User), hashar. Nov 13 2018, 9:07 PM

@Andrew: I won't be around on those days to help out as I'm on vacation from the 26th through the 30th. Maybe @dan or @hashar will be available?

Krenair edited subscribers, added: dduvall; removed: Unknown Object (User). Nov 13 2018, 9:08 PM

I think you mean @dduvall :)

I do not have any available bandwidth for the next 4 weeks at least, sorry. I am running the MediaWiki train from 11/12 to 11/16 and from 11/26 to 11/30. We then have an offsite first week of December which I have to prepare next week.

So, by the time y'all are back from your offsite I will need to be powering down labvirt1010 and 1011 (they are leased and their leases are expiring), and repurposing labvirt1014. Those three servers contain 15 deployment-prep instances.

I can keep doing what I'm doing (shuffling things about within the old region) but that just means double the downtime for you since they WILL need to get moved again so we can decom the whole nova-network region. It's also double the work for me.

The option remains to do this one server at a time, adjusting for IP changes as needed and as we go. I'm starting to think that that's a more realistic option since I can't get anyone to commit to a migration window.

@Andrew: Can it be done next week? I know that's pushing it given that it's a 3 day work week.

Yep, next week would work just fine for me. I'm available until at least Wednesday evening, and can multi-task if it runs longer than that.

Change 474756 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Horizon: enable deployment-prep in eqiad1-r

https://gerrit.wikimedia.org/r/474756

Change 474756 merged by Andrew Bogott:
[operations/puppet@production] Horizon: enable deployment-prep in eqiad1-r

https://gerrit.wikimedia.org/r/474756

Change 474758 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] shinken: temporarily remove monitoring for deployment-prep

https://gerrit.wikimedia.org/r/474758

Change 474820 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] deployment-prep: Update some IPs for the migration

https://gerrit.wikimedia.org/r/474820

Change 474820 merged by Andrew Bogott:
[operations/puppet@production] deployment-prep: Update some IPs for the migration

https://gerrit.wikimedia.org/r/474820

Change 474758 merged by Andrew Bogott:
[operations/puppet@production] shinken: temporarily remove monitoring for deployment-prep

https://gerrit.wikimedia.org/r/474758

Change 474823 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/mediawiki-config@master] deployment-prep: Update BounceHandler deployment-mx02 IP for migration

https://gerrit.wikimedia.org/r/474823

Change 474823 merged by Andrew Bogott:
[operations/mediawiki-config@master] deployment-prep: Update BounceHandler deployment-mx02 IP for migration

https://gerrit.wikimedia.org/r/474823

Mentioned in SAL (#wikimedia-releng) [2018-11-20T11:00:34Z] <hashar> deployment-deploy01 got migrated to a new region but the Jenkins configuration had not been updated. Adjusting IP address from 10.68.23.38 to 172.16.4.18 | T208101

Mentioned in SAL (#wikimedia-releng) [2018-11-20T11:47:23Z] <hashar> Armed keyholder on deployment-deploy01 Got shutdown while being migrated a new cloud region # T208101

Change 474890 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/mediawiki-config@master] deployment-prep: Update cache-upload private IP

https://gerrit.wikimedia.org/r/474890

Change 474891 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] openstack: remove out of date deployment-cache-upload04 IPs

https://gerrit.wikimedia.org/r/474891

Change 474892 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/mediawiki-config@master] deployment-prep: Update deployment-db* IPs

https://gerrit.wikimedia.org/r/474892

Mentioned in SAL (#wikimedia-releng) [2018-11-20T12:11:21Z] <twentyafterfour> scap failures on deployment-mediawiki-07 are related to uid/gid mismatch of the mwdeploy user, specifically the owner of that user's home dir is uid 603 but /etc/passwd|group have a different uid/gid for the same username. T208101
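A minimal sketch of how that kind of mismatch could be detected, assuming a Unix host where the passwd database is readable through Python's pwd module. The function names are hypothetical and this is not what was actually run on deployment-mediawiki-07.

```python
import os
import pwd


def ownership_mismatch(path, expected_uid, expected_gid):
    """Return None if path is owned by the expected uid/gid, else a description."""
    st = os.stat(path)
    if (st.st_uid, st.st_gid) == (expected_uid, expected_gid):
        return None
    return {
        "path": path,
        "expected": (expected_uid, expected_gid),
        "actual": (st.st_uid, st.st_gid),
    }


def home_ownership_mismatch(username):
    """Compare a user's passwd entry with the owner of that user's home dir."""
    entry = pwd.getpwnam(username)  # raises KeyError if the user does not exist
    return ownership_mismatch(entry.pw_dir, entry.pw_uid, entry.pw_gid)
```

Calling home_ownership_mismatch("mwdeploy") on an affected host would be expected to return a dict showing the stale uid on the home directory next to the ids from the passwd entry.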

Change 474898 had a related patch set uploaded (by 20after4; owner: 20after4):
[operations/mediawiki-config@master] Update deployment-db3 and -db4 to new IPS in 172.16.5 (eqiad1-r)

https://gerrit.wikimedia.org/r/474898

Change 474898 abandoned by 20after4:
Update deployment-db3 and -db4 to new IPS in 172.16.5 (eqiad1-r)

Reason:
duplicate of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/474892

https://gerrit.wikimedia.org/r/474898

Change 474899 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/mediawiki-config@master] deployment-prep: Update parsoid09 IP

https://gerrit.wikimedia.org/r/474899

Change 474890 merged by jenkins-bot:
[operations/mediawiki-config@master] deployment-prep: Update cache-upload private IP

https://gerrit.wikimedia.org/r/474890

Change 474892 merged by jenkins-bot:
[operations/mediawiki-config@master] deployment-prep: Update deployment-db* IPs

https://gerrit.wikimedia.org/r/474892

Mentioned in SAL (#wikimedia-operations) [2018-11-20T12:45:38Z] <zfilipin@deploy1001> Synchronized wmf-config/reverse-proxy-staging.php: SWAT: [[gerrit:474890|deployment-prep: Update cache-upload private IP (T208101)]] (duration: 00m 45s)

Mentioned in SAL (#wikimedia-operations) [2018-11-20T12:55:22Z] <zfilipin@deploy1001> Synchronized wmf-config/db-labs.php: SWAT: [[gerrit:474892|deployment-prep: Update deployment-db* IPs (T208101)]] (duration: 00m 47s)

Mentioned in SAL (#wikimedia-releng) [2018-11-20T13:01:47Z] <twentyafterfour> PHP Startup: Unable to load dynamic library '/usr/lib/php/20151012/luasandbox.so' - /usr/lib/php/20151012/luasandbox.so: cannot open shared object file: No such file or directory T208101

Change 474899 merged by jenkins-bot:
[operations/mediawiki-config@master] deployment-prep: Update parsoid09 IP

https://gerrit.wikimedia.org/r/474899

Mentioned in SAL (#wikimedia-operations) [2018-11-20T13:03:24Z] <zfilipin@deploy1001> Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:474899|deployment-prep: Update parsoid09 IP (T208101)]] (duration: 00m 47s)

Change 474891 merged by Andrew Bogott:
[operations/puppet@production] openstack: remove out of date deployment-cache-upload04 IPs

https://gerrit.wikimedia.org/r/474891

I just moved the A records over from deployment-cache-text04 to deployment-cache-text05. It seems to be working fine.
So we're left with deployment-cache-text04 in the old region to be shut down when all traffic is fully gone, and deployment-zotero01 which is going away with the trusty deprecation.

Change 475363 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/mediawiki-config@master] deployment-prep: Update IPs for Varnish

https://gerrit.wikimedia.org/r/475363

Also:

  • DBs are RO
  • VE is out of action (HTTP 503s)

2018-11-22 20:48:10 [W-cWCawQBHcAAErbfF0AAAAH] deployment-mediawiki-07 enwiki 1.33.0-alpha DBReplication ERROR: Wikimedia\Rdbms\LoadMonitor::getServerStates: host 172.16.5.5:3306 is not replicating? {"db_server":"172.16.5.5:3306"}
2018-11-22 20:48:10 [W-cWCawQBHcAAErbfF0AAAAH] deployment-mediawiki-07 enwiki 1.33.0-alpha DBReplication ERROR: Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server 172.16.5.5:3306 is not replicating? {"host":"172.16.5.5:3306"}
2018-11-22 20:48:17 [5397214e708af0ce4fc3f26b] deployment-mwmaint01 wikidatawiki 1.32.0-alpha DBReplication ERROR: Wikimedia\Rdbms\LoadMonitor::getServerStates: host deployment-db04 is unreachable {"db_server":"deployment-db04"}
2018-11-22 20:48:17 [5397214e708af0ce4fc3f26b] deployment-mwmaint01 wikidatawiki 1.32.0-alpha DBReplication ERROR: Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server deployment-db04 is not replicating? {"host":"deployment-db04"}
2018-11-22 20:48:17 [5397214e708af0ce4fc3f26b] deployment-mwmaint01 wikidatawiki 1.32.0-alpha DBReplication ERROR: Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode
2018-11-22 20:48:31 [W-cWH6wQBHcAAErbfF4AAAAM] deployment-mediawiki-07 enwiki 1.33.0-alpha DBReplication ERROR: Wikimedia\Rdbms\LoadMonitor::getServerStates: host deployment-db04 is not replicating? {"db_server":"deployment-db04"}
2018-11-22 20:48:31 [W-cWH6wQBHcAAErbfF4AAAAM] deployment-mediawiki-07 enwiki 1.33.0-alpha DBReplication ERROR: Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server deployment-db04 is not replicating? {"host":"deployment-db04"}

root@BETA[(none)]> show slave status  \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: deployment-db03.deployment-prep.eqiad.wmflabs
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: deployment-db03-bin.000103
          Read_Master_Log_Pos: 504
               Relay_Log_File: deployment-db04-relay-bin.000002
                Relay_Log_Pos: 1840130
        Relay_Master_Log_File: deployment-db03-bin.000102
             Slave_IO_Running: Yes
[...]
        Seconds_Behind_Master: NULL
[...]

Wait, what? NULL?

Turns out that Slave_SQL_Running: No is a Bad Thing - despite it saying it's waiting for master to send event, replication is actually broken.
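Since Slave_IO_State alone is misleading here, a rough sketch of a check that looks at the fields that matter might be useful. It parses the vertical (\G) output pasted above; the function names are made up, and treating a NULL Seconds_Behind_Master as a problem is based only on what this task observed.

```python
def parse_vertical_status(output):
    """Parse 'Key: value' lines from SHOW SLAVE STATUS \\G output into a dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # row separators and '[...]' lines have no colon and are skipped
            fields[key.strip()] = value.strip()
    return fields


def replication_problems(fields):
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    # Both threads must be running; the IO thread alone is not enough.
    for thread in ("Slave_IO_Running", "Slave_SQL_Running"):
        if fields.get(thread) != "Yes":
            problems.append(f"{thread} is {fields.get(thread)!r}")
    if fields.get("Seconds_Behind_Master") in (None, "NULL"):
        problems.append("Seconds_Behind_Master is NULL")
    return problems
```

With the output above (IO thread running, SQL thread stopped, lag NULL) this reports two problems even though Slave_IO_State looks fine.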

Alright so I made it skip insertion of duplicate things until replication recovered and Seconds_Behind_Master is now back down to 0.

As for VE, URLs like https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/html/User%3AKrenair%2Fsandbox/387145?redirect=false fail:

root@deployment-cache-text05:/etc/varnish# grep deployment-restbase * -A1
wikimedia-common_misc-backend.inc.vcl:	.host = "deployment-restbase01.deployment-prep.eqiad.wmflabs";
wikimedia-common_misc-backend.inc.vcl-	.port = "7231";
--
wikimedia-common_text-backend.inc.vcl:	.host = "deployment-restbase01.deployment-prep.eqiad.wmflabs";
wikimedia-common_text-backend.inc.vcl-	.port = "7231";
root@deployment-cache-text05:/etc/varnish# curl http://deployment-restbase01.deployment-prep.eqiad.wmflabs:7231 -v
* Rebuilt URL to: http://deployment-restbase01.deployment-prep.eqiad.wmflabs:7231/
*   Trying 172.16.5.26...
* TCP_NODELAY set
* connect to 172.16.5.26 port 7231 failed: Connection refused
* Failed to connect to deployment-restbase01.deployment-prep.eqiad.wmflabs port 7231: Connection refused
* Closing connection 0
curl: (7) Failed to connect to deployment-restbase01.deployment-prep.eqiad.wmflabs port 7231: Connection refused
krenair@deployment-restbase01:~$ sudo lsof -i :7231
krenair@deployment-restbase01:~$ sudo service restbase status
● restbase.service - "restbase service"
   Loaded: loaded (/lib/systemd/system/restbase.service; enabled)
   Active: active (running) since Thu 2018-11-22 22:58:32 UTC; 21min ago
 Main PID: 18100 (firejail)
   CGroup: /system.slice/restbase.service
           ├─18100 /usr/bin/firejail --blacklist=/root --blacklist=/home --caps --seccomp /usr/bin/nodejs restbase/server.js -c /etc/restbase/config.yaml
           ├─18102 /usr/bin/firejail --blacklist=/root --blacklist=/home --caps --seccomp /usr/bin/nodejs restbase/server.js -c /etc/restbase/config.yaml
           ├─18107 /usr/bin/nodejs restbase/server.js -c /etc/restbase/config.yaml
           ├─24196 /usr/bin/nodejs /srv/deployment/restbase/deploy-cache/revs/5b8ad3c0a8b6a52a7eacb731055ea4c3e392aa3f/node_modules/service-runner/service-runner.js -c /etc/restbase/config.yaml
           ├─24212 /usr/bin/nodejs /srv/deployment/restbase/deploy-cache/revs/5b8ad3c0a8b6a52a7eacb731055ea4c3e392aa3f/node_modules/service-runner/service-runner.js -c /etc/restbase/config.yaml
           └─24218 /usr/bin/nodejs /srv/deployment/restbase/deploy-cache/revs/5b8ad3c0a8b6a52a7eacb731055ea4c3e392aa3f/node_modules/service-runner/service-runner.js -c /etc/restbase/config.yaml

Nov 22 22:58:32 deployment-restbase01 systemd[1]: Started "restbase service".
Nov 22 22:58:32 deployment-restbase01 restbase[18100]: Reading profile /etc/firejail/default.profile
Nov 22 22:58:32 deployment-restbase01 restbase[18100]: Reading profile /etc/firejail/disable-common.inc
Nov 22 22:58:32 deployment-restbase01 restbase[18100]: Reading profile /etc/firejail/disable-programs.inc
Nov 22 22:58:32 deployment-restbase01 restbase[18100]: ** Note: you can use --noprofile to disable default.profile **
krenair@deployment-restbase01:~$ grep 7231 /etc/restbase/config.yaml
      port: 7231

I do confirm that dewiki@BETA works properly now.

@fgiunchedi: Why is puppet disabled on deployment-mediawiki-07? The reason given is just filippo

I was performing testing for T205851, I've reenabled puppet now

Change 475363 merged by jenkins-bot:
[operations/mediawiki-config@master] deployment-prep: Update IPs for Varnish

https://gerrit.wikimedia.org/r/475363

Should be fixed now. Cassandra wasn't starting properly.

Thanks @mobrovac. For future reference, the command tail-restbase will show logs that indicate this sort of failure. I was looking at sudo service restbase status, which didn't show anything wrong.

While @mmodell was the point person from Release-Engineering-Team, @Krenair did much of the follow-up so I'm assigning this task to him (just to reflect reality).