
switch prod Phabricator from phab1003 to phab1001
Closed, Resolved, Public

Description

In T190568 both phab1001 and phab2001 have been reimaged with buster.

Currently the temporary server phab1003 is the production Phabricator server and on stretch.

Set a maintenance window and switch over from phab1003 to phab1001.

After a little while shut down phab1003 and decom it / give it back to dcops.


https://etherpad.wikimedia.org/p/Phabricator-migration-20191203


https://phabricator.wikimedia.org/T238956

switch prod Phabricator from phab1003 to phab1001

2019-12-03

branches:

phab-buster (the actual switch)
https://gerrit.wikimedia.org/r/q/topic:%22phab-buster%22+(status:open%20OR%20status:merged)

phab1003-decom: (later)
https://gerrit.wikimedia.org/r/q/topic:%22phab1003-decom%22+(status:open%20OR%20status:merged)

phab-buster:

PREPARE:

site/phabricator: apply phab role on phab1001 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/536712 (MERGED)
phabricator: support buster with PHP 7.3 packages - https://gerrit.wikimedia.org/r/c/operations/puppet/+/541666 (MERGED)
phabricator::httpd: support stretch/buster with/without php-fpm - https://gerrit.wikimedia.org/r/c/operations/puppet/+/541930 (MERGED)
phabricator: install s-nail instead of heirloom-mailx on buster - https://gerrit.wikimedia.org/r/c/operations/puppet/+/541967 (MERGED)
phabricator: install s-nail instead of heirloom-mailx on any distro - https://gerrit.wikimedia.org/r/c/operations/puppet/+/542191 (MERGED)
log downtime in icinga - DONE - scheduled downtime for host and all services on phab1003 for the next 2 days. MORE IS NEEDED: the PAGING service https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=phabricator.wikimedia.org is now also downtimed
stop phd: run puppet agent --disable "https://phabricator.wikimedia.org/T238956" and "systemctl stop phd" on both phab1001 and phab1003
change the "phabricator_failover_server" in Hiera common.yaml to the new server to allow it to rsync repo data from active server https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/554390
rsync /srv/repos from phab1003 to phab1001 IN PROGRESS

SWITCH:

phab1003-decom:

DECOM:
phabricator: remove phab1003 from list of phab servers - https://gerrit.wikimedia.org/r/c/operations/puppet/+/552592
remove service IPs and IPv6 for phab1003 - https://gerrit.wikimedia.org/r/c/operations/dns/+/552599
remove production IPs for phab1003 - https://gerrit.wikimedia.org/r/c/operations/dns/+/552601
mariadb: remove grants for users on phab1003 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/552607
site: turn phab1003 into a spare::system - https://gerrit.wikimedia.org/r/c/operations/puppet/+/552603
mtail: stop using phab1003 for tests, use phab1001 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/552604 <--- part of SWITCH ?? | No, but replaced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/554403
install_server: remove phab1003 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/552609

ABANDON?!

phabricator-new: https://gerrit.wikimedia.org/r/c/operations/puppet/+/551286 (abandoned)
phabricator-new: https://gerrit.wikimedia.org/r/c/operations/dns/+/551284 (abandoned)

DAY 2 (We have to switch back again to phab1003, change a BIOS setting on phab1001, reimage phab1001 and finally switch over again :/)

SIMPLIFY SETUP FOR MIGRATION (rsync, hiera, puppet)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/554610 (merged)
https://gerrit.wikimedia.org/r/c/operations/puppet/+/554628 (merged)
https://gerrit.wikimedia.org/r/c/operations/puppet/+/554643 (merged)

SWITCH BACK to phab1003:

  • change BIOS settings for ATA mode to AHCI on phab1001
  • reimage phab1001
  • rsync /srv/repos: pull from phab1003 onto phab1001 with --delete
  • reboot phab1001 to clear "microcode vulns not fixed" Icinga alert
  • rsync again
  • verify code in /srv/phab is up to date and both servers are on the same git tag

FINAL SWITCH BACK TO PHAB1001

  • restart ssh-phab service to make it listen on IPv6 (it wasn't, because puppet starts the service before it adds the v6 IP on the interface); that cleared Icinga alerts
  • delete stale confd files on puppetmaster to clear more Icinga alerts about confd template compilation failing (the reimage script crashed, so it never got to delete them)
  • check status of git-ssh (that wasn't switched twice) https://gerrit.wikimedia.org/r/c/operations/puppet/+/554957 (!), remove the lo:LVS IPs (v4 and v6!) from the interface on phab1003, restart ssh-phab
  • make phd run on correct server https://gerrit.wikimedia.org/r/c/operations/puppet/+/554960 to avoid breakage of repos
  • check Icinga for any alerts and remove downtimes (don't forget the "phabricator" meta virtual host)

Details

Repo                Branch       Lines +/-
operations/dns      master       +1 -1
operations/puppet   production   +2 -2
operations/puppet   production   +6 -6
operations/puppet   production   +1 -1
operations/puppet   production   +1 -1
operations/puppet   production   +6 -1
operations/puppet   production   +6 -10
operations/puppet   production   +14 -18
operations/puppet   production   +3 -3
operations/puppet   production   +1 -1
operations/puppet   production   +1 -1
operations/dns      master       +1 -1
operations/puppet   production   +2 -2
operations/puppet   production   +1 -1
operations/puppet   production   +2 -2
operations/puppet   production   +2 -2
operations/puppet   production   +1 -1


Event Timeline

Dzahn triaged this task as High priority. Nov 22 2019, 10:09 PM
Dzahn added a subscriber: 20after4.

@20after4 Let's find a date for this. Preferably relatively soon before the holidays and not during the holidays as we initially planned.

Change 552589 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator/conftool: switch phab-vcs (git-ssh) service to phab1001

https://gerrit.wikimedia.org/r/552589

Change 552591 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: switch "active server" from phab1003 to phab1001

https://gerrit.wikimedia.org/r/552591

Change 552593 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] dumps/phabricator: switch dumps host from phab1003 to phab1001

https://gerrit.wikimedia.org/r/552593

Change 552595 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] varnish: switch phabricator backend to phab1001

https://gerrit.wikimedia.org/r/552595

Change 552597 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: switch mail destination to phab1001

https://gerrit.wikimedia.org/r/552597

Change 552598 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] switch discovery record for phabricator to 1001 for ATS

https://gerrit.wikimedia.org/r/552598

Dzahn updated the task description. (Show Details)

This is happening today, starting in about 15 minutes.

Change 554390 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: switch failover server to phab1001

https://gerrit.wikimedia.org/r/554390

Change 554390 merged by Dzahn:
[operations/puppet@production] phabricator: switch failover server to phab1001

https://gerrit.wikimedia.org/r/554390

Change 552593 merged by Dzahn:
[operations/puppet@production] dumps/phabricator: switch dumps host from phab1003 to phab1001

https://gerrit.wikimedia.org/r/552593

Change 552589 merged by Dzahn:
[operations/puppet@production] phabricator/conftool: switch phab-vcs (git-ssh) service to phab1001

https://gerrit.wikimedia.org/r/552589

Phab1001 disk I/O seems a lot slower than phab1003. Running lshw -class storage yields one obvious difference: phab1001 is running in legacy IDE mode with the ata_piix driver while phab1003 was using AHCI which has always been the preferred mode for accessing SATA disks afaik.

IDE:

phab1001$ lshw -class storage
*-ide:0                   
      description: IDE interface
      product: C610/X99 series chipset sSATA Controller [IDE mode]
      vendor: Intel Corporation
      physical id: 11.4
      bus info: pci@0000:00:11.4
      logical name: scsi0
      logical name: scsi1
      version: 05
      width: 32 bits
      clock: 66MHz
      capabilities: ide pm pci_native_mode bus_master cap_list emulated
      configuration: driver=ata_piix latency=0
      resources: irq:16 ioport:20a8(size=8) ioport:20c4(size=4) ioport:20a0(size=8) ioport:20c0(size=4) ioport:2070(size=16) ioport:2060(size=16)

AHCI:

phab1003$ lshw -class storage
*-storage:0               
     description: SATA controller
     product: Lewisburg SSATA Controller [AHCI mode]
     vendor: Intel Corporation
     physical id: 11.5
     bus info: pci@0000:00:11.5
     version: 09
     width: 32 bits
     clock: 66MHz
     capabilities: storage msi pm ahci_1.0 bus_master cap_list
     configuration: driver=ahci latency=0
     resources: irq:29 memory:92e16000-92e17fff memory:92e1f000-92e1f0ff ioport:2068(size=8) ioport:2074(size=4) ioport:2040(size=32) memory:92d80000-92dfffff

I'm not really sure that is the reason for the poor performance but it seems like a likely culprit.
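The comparison above can be automated. This is a minimal sketch, assuming the `lshw -class storage` text has already been captured as a string; the function names are hypothetical, and treating `ata_piix` as the marker for legacy IDE mode is an assumption based only on the two outputs pasted above.

```python
import re


def storage_drivers(lshw_text):
    """Return every driver=... value found in `lshw -class storage` output."""
    return re.findall(r"driver=(\S+)", lshw_text)


def uses_legacy_ide(lshw_text):
    # Assumption: the ata_piix driver (as seen on phab1001) indicates
    # the controller is running in legacy IDE mode rather than AHCI.
    return "ata_piix" in storage_drivers(lshw_text)


# Sample lines copied from the two outputs above.
sample_ide = "configuration: driver=ata_piix latency=0"
sample_ahci = "configuration: driver=ahci latency=0"

print(uses_legacy_ide(sample_ide), uses_legacy_ide(sample_ahci))
```

Running a check like this across a fleet before a migration might flag hosts that need a BIOS change while they are still empty.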

@Dzahn: Can we switch it over to AHCI in the bios? Do we need DC-Ops for that?

Slow disk is manifesting with tasks blocked for extended periods of time waiting for I/O:

/var/log/kern.log
Dec  4 02:08:53 phab1001 kernel: [7006399.950911] INFO: task jbd2/dm-0-8:908 blocked for more than 120 seconds.
Dec  4 02:08:53 phab1001 kernel: [7006399.958692]       Not tainted 4.19.0-6-amd64 #1 Debian 4.19.67-2
Dec  4 02:08:53 phab1001 kernel: [7006399.965597] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  4 02:08:53 phab1001 kernel: [7006399.974531] jbd2/dm-0-8     D    0   908      2 0x80000000
Dec  4 02:08:53 phab1001 kernel: [7006399.974534] Call Trace:
Dec  4 02:08:53 phab1001 kernel: [7006399.974549]  ? __schedule+0x2a2/0x870
Dec  4 02:08:53 phab1001 kernel: [7006399.974551]  schedule+0x28/0x80
Dec  4 02:08:53 phab1001 kernel: [7006399.974552]  io_schedule+0x12/0x40
Dec  4 02:08:53 phab1001 kernel: [7006399.974557]  wait_on_page_bit_common+0xfd/0x180
Dec  4 02:08:53 phab1001 kernel: [7006399.974558]  ? page_cache_tree_insert+0xe0/0xe0
Dec  4 02:08:53 phab1001 kernel: [7006399.974560]  __filemap_fdatawait_range+0xe1/0x130
Dec  4 02:08:53 phab1001 kernel: [7006399.974564]  ? guard_bio_eod+0x32/0x100
Dec  4 02:08:53 phab1001 kernel: [7006399.974566]  filemap_fdatawait_range_keep_errors+0xe/0x30
Dec  4 02:08:53 phab1001 kernel: [7006399.974577]  jbd2_journal_commit_transaction+0xa6c/0x1890 [jbd2]
Dec  4 02:08:53 phab1001 kernel: [7006399.974585]  ? __switch_to_asm+0x35/0x70
Dec  4 02:08:53 phab1001 kernel: [7006399.974594]  kjournald2+0xbd/0x270 [jbd2]
Dec  4 02:08:53 phab1001 kernel: [7006399.974600]  ? finish_wait+0x80/0x80
Dec  4 02:08:53 phab1001 kernel: [7006399.974612]  ? commit_timeout+0x10/0x10 [jbd2]
Dec  4 02:08:53 phab1001 kernel: [7006399.974618]  kthread+0x112/0x130
Dec  4 02:08:53 phab1001 kernel: [7006399.974622]  ? kthread_bind+0x30/0x30
Dec  4 02:08:53 phab1001 kernel: [7006399.974629]  ret_from_fork+0x35/0x40

Note, this hasn't happened in the past hour but it was happening quite a bit when rsync was running.
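To see how often this happens, the hung-task messages can be pulled out of the log. A rough sketch, assuming the kern.log line format shown above; the regular expression and function name are illustrative, not an existing tool.

```python
import re

# Matches lines such as:
# "... INFO: task jbd2/dm-0-8:908 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<task>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def hung_tasks(lines):
    """Return (task name, pid, threshold in seconds) for each hung-task line."""
    found = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            found.append(
                (match.group("task"), int(match.group("pid")), int(match.group("secs")))
            )
    return found


sample = [
    "Dec  4 02:08:53 phab1001 kernel: [7006399.950911] INFO: task jbd2/dm-0-8:908 blocked for more than 120 seconds.",
    "Dec  4 02:08:53 phab1001 kernel: [7006399.974534] Call Trace:",
]
print(hung_tasks(sample))
```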

@mmodell I did reboot into BIOS and looked at it, but when switching the mode from ATA to AHCI it said "Data loss will occur", so I did not save the change, restored the settings and left BIOS again... for tonight.

Change 554411 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: temp. disable automatic rsync of repo data

https://gerrit.wikimedia.org/r/554411

Change 554411 merged by Dzahn:
[operations/puppet@production] phabricator: temp. disable automatic rsync of repo data

https://gerrit.wikimedia.org/r/554411

Change 554412 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: temp. make phab1001 the "failover" again for rsync

https://gerrit.wikimedia.org/r/554412

Change 554412 merged by Dzahn:
[operations/puppet@production] phabricator: temp. make phab1001 the "failover" again for rsync

https://gerrit.wikimedia.org/r/554412

https://integration.wikimedia.org/ci/job/mobileapps-periodic-test/ has been failing since this happened. I'm guessing that is not a coincidence. Looks like it is hitting a 500 when attempting a git fetch from https://phabricator.wikimedia.org/diffusion/GMOA.

It's not terribly urgent, as this job isn't used as a gate for development work, but I thought it worth reporting.

Mentioned in SAL (#wikimedia-operations) [2019-12-04T06:01:38Z] <mutante> rsyncing /srv/repos data once again. pulling from phab1003 to phab1001 (T238956)

Mentioned in SAL (#wikimedia-operations) [2019-12-04T06:13:39Z] <mutante> phab1001 - running rsync of /srv/repos with --delete because it's larger than the source by about 5GB - deleting objects to match phab1003, former prod server. now both 50G (T238956)

The recently added Gerrit integration directly beneath a tasks description seems to be gone, could that be related to this switch?

Edit: I can’t find the Phabricator task for that feature, but I think rPHEX7a526af8b46c: Add GerritPatchesCustomField is related to it?
Edit 2: It was T229934: Enable semantic relationship between code review changesets and maniphest tasks in phabricator (show "Related Gerrit Patches").

Unfortunately we have to switch back to the previous server, change a BIOS setting on the current server, reimage it and then switch over a third and hopefully final time.

Change 554610 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: simplify rsync setup, step 1

https://gerrit.wikimedia.org/r/554610

Change 554610 merged by Dzahn:
[operations/puppet@production] phabricator: simplify rsync setup, step 1

https://gerrit.wikimedia.org/r/554610

Change 554628 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: simplify rsync setup for migration, part 2

https://gerrit.wikimedia.org/r/554628

Change 554628 merged by Dzahn:
[operations/puppet@production] phabricator: simplify rsync setup for migration, part 2

https://gerrit.wikimedia.org/r/554628

Change 554643 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: get rid of "failover" server Hiera key, further simplify

https://gerrit.wikimedia.org/r/554643

Change 554643 merged by Dzahn:
[operations/puppet@production] phabricator: get rid of "failover" server Hiera key, further simplify

https://gerrit.wikimedia.org/r/554643

Change 554644 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: switch prod server to phab1003, enables dumps and ferm holes

https://gerrit.wikimedia.org/r/554644

Change 554644 merged by Dzahn:
[operations/puppet@production] phabricator: switch prod server to phab1003, enables dumps and ferm holes

https://gerrit.wikimedia.org/r/554644

Mentioned in SAL (#wikimedia-operations) [2019-12-05T01:07:41Z] <mutante> phab1001 - System BIOS Settings > SATA Settings > Embedded SATA: switch from ATA to AHCI mode (T238956)

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

phab1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201912050124_dzahn_252965_phab1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['phab1001.eqiad.wmnet']

Of which those FAILED:

['phab1001.eqiad.wmnet']

Change 554957 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: enable vcs listen addresses on phab1001, disable on phab1003

https://gerrit.wikimedia.org/r/554957

Change 554957 merged by Dzahn:
[operations/puppet@production] phabricator: enable vcs listen addresses on phab1001, disable on phab1003

https://gerrit.wikimedia.org/r/554957

Mentioned in SAL (#wikimedia-operations) [2019-12-05T22:03:26Z] <mutante> phabricator - git-ssh.wikimedia.org has been fixed and is up again (T238956)

Change 554960 had a related patch set uploaded (by 20after4; owner: 20after4):
[operations/puppet@production] Run phd on phab1001 instead of 1003

https://gerrit.wikimedia.org/r/554960

Change 554960 merged by Dzahn:
[operations/puppet@production] Run phd on phab1001 instead of 1003

https://gerrit.wikimedia.org/r/554960

I suspect that this move inadvertently caused T239786: Viewing MediaWiki repository in diffusion results in an Unhandled Exception ("CommandException") as I only started seeing it today.

Yes, indeed. It has been fixed now. Sorry for the inconvenience. I closed the linked ticket.

Summary:

We switched from phab1003 to phab1001.

Then we realized phab1001 had wrong BIOS settings (disks in legacy IDE mode and no write cache, which was really bad for performance) that could not be changed without data loss.

We had to switch back from phab1001 to phab1003, then fix BIOS settings and then reimage phab1001, then rsync data again.

Then, today, we switched once again from phab1003 to phab1001 finally concluding this.

I see the "related gerrit patches" box on this very page.

Yup, it’s back now. Thanks!

Volans subscribed.

I've noticed that Phabricator emails are failing the SPF check, re-opening to add details, feel free to move it to a separate task if needed.

GMail interpreted block:

SPF:	FAIL with IP 208.80.154.76

Directly from the source email:

ARC-Authentication-Results: i=1; mx.google.com;
       spf=fail (google.com: domain of no-reply@phabricator.wikimedia.org does not designate 2620:0:861:102:10:64:16:8 as permitted sender) smtp.mailfrom=no-reply@phabricator.wikimedia.org

Received-SPF: fail (google.com: domain of no-reply@phabricator.wikimedia.org does not designate 2620:0:861:102:10:64:16:8 as permitted sender) client-ip=2620:0:861:102:10:64:16:8;

Authentication-Results: mx.google.com;
       spf=fail (google.com: domain of no-reply@phabricator.wikimedia.org does not designate 2620:0:861:102:10:64:16:8 as permitted sender) smtp.mailfrom=no-reply@phabricator.wikimedia.org

Received: from phab1001.eqiad.wmnet ([2620:0:861:102:10:64:16:8]:52524) by mx1001.wikimedia.org with esmtp (Exim 4.89) (envelope-from <no-reply@phabricator.wikimedia.org>) id 1idHF9-0004ps-HO for [...SNIP...]; Fri, 06 Dec 2019 17:18:15 +0000
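The failure follows from how SPF evaluation works: the receiver checks whether the connecting IP is covered by a mechanism in the sender domain's record. Below is a simplified sketch that handles only unqualified `ip4:` and `ip6:` mechanisms (no `include:`, `a`, `mx` or qualifiers). The record strings are made up for illustration and are NOT the real wikimedia.org record; only the sender IP is taken from the headers above.

```python
import ipaddress


def spf_permits(record, sender_ip):
    """Return True if sender_ip matches an ip4:/ip6: mechanism in the record.

    Simplification: other SPF mechanisms and qualifiers are ignored.
    """
    ip = ipaddress.ip_address(sender_ip)
    for term in record.split():
        for prefix in ("ip4:", "ip6:"):
            if term.startswith(prefix):
                network = ipaddress.ip_network(term[len(prefix):], strict=False)
                if ip.version == network.version and ip in network:
                    return True
    return False


# Hypothetical records using documentation address ranges.
old_record = "v=spf1 ip4:192.0.2.10 ip6:2001:db8::/32 -all"
new_record = old_record.replace("-all", "ip6:2620:0:861:102:10:64:16:8 -all")

sender = "2620:0:861:102:10:64:16:8"  # phab1001, from the Received header above
print(spf_permits(old_record, sender), spf_permits(new_record, sender))
```

A check like this against the real record could be added to a switchover checklist so that a changed outbound mail address is caught before recipients see it.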

Change 555611 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] Switch phab SPF back to phab1001

https://gerrit.wikimedia.org/r/555611

Change 555611 merged by Dzahn:
[operations/dns@master] Switch phab SPF back to phab1001

https://gerrit.wikimedia.org/r/555611

@Volans Thanks for reporting! The SPF record has been updated now in DNS. This should go away now.

Dzahn changed the status of subtask T238957: decommission phab1003.eqiad.wmnet from Stalled to Open.

I get

$:acko\> ssh phab1001
The authenticity of host 'phab1001 (<no hostip for proxy command>)' can't be established.
ECDSA key fingerprint is SHA256:PijYsmTq9dQ5qNa1sS4LwBy8H/bJjWnODYjnSlykZ5Q.

but https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/phab1001.eqiad.wmnet lists X9iy8Eiy7s7nhNqV613AKOp7qyGp5zDVuDwkZSjuspw=

Does that page need to get updated? (Or am I doing something wrong, as usual?)

@Aklapper yes, as the host got reimaged I think the page was not updated, but I cannot edit it unfortunately.

The current fingerprints are:

root@phab1001:~# gen_fingerprints
 +---------+---------+-----------------------------------------------------+
 | Cipher  | Algo    | Fingerprint                                         |
 +---------+---------+-----------------------------------------------------+
 | RSA     | SHA-256 | SHA256:AJVNvxAF9IPqEl8cwDSkeMHBsJW6IXpnhmfM7hgy5DA  |
 +---------+---------+-----------------------------------------------------+
 | ECDSA   | SHA-256 | SHA256:PijYsmTq9dQ5qNa1sS4LwBy8H/bJjWnODYjnSlykZ5Q  |
 +---------+---------+-----------------------------------------------------+
 | ED25519 | SHA-256 | SHA256:ptmUDuPCjBXh09FPEqZK1mp6lB6UaZD8+dTf2oLB3xA  |
 +---------+---------+-----------------------------------------------------+
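For anyone comparing these by hand: an OpenSSH SHA256 fingerprint is the SHA-256 digest of the decoded public key blob, base64-encoded with the trailing `=` padding removed, which is why the wiki page value ends in `=` while the `ssh` prompt does not. A small sketch, with hypothetical helper names, that computes a fingerprint from a public key line and compares two fingerprints regardless of prefix or padding:

```python
import base64
import hashlib


def sha256_fingerprint(pubkey_line):
    """Fingerprint a public key line of the form "type base64-blob [comment]"."""
    blob = base64.b64decode(pubkey_line.split()[1])
    digest = hashlib.sha256(blob).digest()
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def same_fingerprint(a, b):
    """Compare fingerprints, ignoring the "SHA256:" prefix and "=" padding."""
    def normalize(value):
        if value.startswith("SHA256:"):
            value = value[len("SHA256:"):]
        return value.rstrip("=")
    return normalize(a) == normalize(b)


# Values quoted earlier in this thread.
prompt_fp = "SHA256:PijYsmTq9dQ5qNa1sS4LwBy8H/bJjWnODYjnSlykZ5Q"
wiki_fp = "X9iy8Eiy7s7nhNqV613AKOp7qyGp5zDVuDwkZSjuspw="
print(same_fingerprint(prompt_fp, wiki_fp))
```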

Thanks for updating the page. This was the same as T224677#5753016