setup/install phab1001.eqiad.wmnet
Closed, Resolved (Public)

Description

This task will track the setup and installation of phab1001.eqiad.wmnet. This was allocated via hardware-requests T156970, and approved to use system wmf4747.

  • - create sub-task for on-site update of hostname labels and visible label field in racktables
  • - update dns (mgmt and production) https://gerrit.wikimedia.org/r/#/c/350476/
  • - update switch port (description, enable, set internal vlan)
  • - operations/puppet repo updates (install_server, site.pp, basic replication of what exists for iridium/phab2001)
  • - resolution of hard disk failure on T163960
  • - os installation
  • - puppet/salt sign/accept/initial run
  • - handoff to release engineering for implementation

This plan was outdated:

Implementation plan:

Details

Repo               Branch      Lines +/-
operations/puppet  production  +20 -14
operations/puppet  production  +18 -17
operations/puppet  production  +7 -0
operations/puppet  production  +9 -1
operations/puppet  production  +4 -5
operations/puppet  production  +1 -1
operations/puppet  production  +2 -2
operations/puppet  production  +1 -4
operations/puppet  production  +1 -1
operations/puppet  production  +3 -3
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +3 -0
operations/puppet  production  +5 -0
operations/puppet  production  +3 -0
operations/puppet  production  +0 -2
operations/puppet  production  +0 -4
operations/puppet  production  +4 -0
operations/puppet  production  +1 -6
operations/puppet  production  +5 -1
operations/puppet  production  +57 -42
operations/dns     master      +2 -0
operations/puppet  production  +1 -0
operations/puppet  production  +15 -0
operations/puppet  production  +10 -2
operations/dns     master      +4 -1

Event Timeline

So this tries to load into the installer, and fails for sdb:

┌─────────────┤ [!!] Partition disks ├─────────────┐
│                                                  │
│ Input/output error during read on /dev/sdb       │
│                                                  │
│ ERROR!!!                                         │
│                                                  │
│                    Retry                         │
│                    Ignore                        │
│                    Cancel                        │
│                                                  │
│     <Go Back>                                    │

Additionally, when reading the installation debug logs, the following relevant errors show up:

Jul 19 17:47:28 kernel: [  133.714952] ata2.00: configured for UDMA/133
Jul 19 17:47:28 kernel: [  133.714966] sd 1:0:0:0: [sdb]  
Jul 19 17:47:28 kernel: [  133.714968] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 19 17:47:28 kernel: [  133.714971] sd 1:0:0:0: [sdb]  
Jul 19 17:47:28 kernel: [  133.714972] Sense Key : Aborted Command [current] [descriptor]
Jul 19 17:47:28 kernel: [  133.714976] Descriptor sense data with sense descriptors (in hex):
Jul 19 17:47:28 kernel: [  133.714978]         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
Jul 19 17:47:28 kernel: [  133.714988]         74 70 6d 00 
Jul 19 17:47:28 kernel: [  133.714993] sd 1:0:0:0: [sdb]  
Jul 19 17:47:28 kernel: [  133.714995] Add. Sense: No additional sense information
Jul 19 17:47:28 kernel: [  133.714997] sd 1:0:0:0: [sdb] CDB: 
Jul 19 17:47:28 kernel: [  133.714999] Read(10): 28 00 74 70 6d 00 00 00 08 00
Jul 19 17:47:28 kernel: [  133.715008] Buffer I/O error on device sdb, logical block 244190624
Jul 19 17:47:28 kernel: [  133.715021] ata2: EH complete


Jul 19 17:46:07 kernel: [   52.741935] sd 1:0:0:0: [sdb] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
Jul 19 17:46:07 kernel: [   52.741973] sd 1:0:0:0: [sdb] Write Protect is off
Jul 19 17:46:07 kernel: [   52.741976] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
Jul 19 17:46:07 kernel: [   52.741993] sd 1:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Jul 19 17:46:07 kernel: [   52.742024] sd 0:0:0:0: [sda] Write Protect is off
Jul 19 17:46:07 kernel: [   52.742029] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Jul 19 17:46:07 kernel: [   52.742064] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Jul 19 17:46:07 kernel: [   52.746292] random: nonblocking pool is initialized

Jul 19 17:46:08 kernel: [   52.836485] ata2.00: BMDMA stat 0x25
Jul 19 17:46:08 kernel: [   52.836489] ata2.00: failed command: READ DMA
Jul 19 17:46:08 kernel: [   52.836495] ata2.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Jul 19 17:46:08 kernel: [   52.836495]          res 51/04:08:00:00:00/00:00:00:00:00/e0 Emask 0x1 (device error)
Jul 19 17:46:08 kernel: [   52.836498] ata2.00: status: { DRDY ERR }
Jul 19 17:46:08 kernel: [   52.836500] ata2.00: error: { ABRT }
Jul 19 17:46:08 kernel: [   52.860923] ata2.00: configured for UDMA/133
Jul 19 17:46:08 kernel: [   52.860927] ata2: EH complete
Jul 19 17:46:08 kernel: [   52.868454] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 19 17:46:08 kernel: [   52.868459] ata2.00: BMDMA stat 0x25
Jul 19 17:46:08 kernel: [   52.868462] ata2.00: failed command: READ DMA
Jul 19 17:46:08 kernel: [   52.868468] ata2.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Jul 19 17:46:08 kernel: [   52.868468]          res 51/04:08:00:00:00/00:00:00:00:00/e0 Emask 0x1 (device error)
Jul 19 17:46:08 kernel: [   52.868472] ata2.00: status: { DRDY ERR }
Jul 19 17:46:08 kernel: [   52.868474] ata2.00: error: { ABRT }
Jul 19 17:46:08 kernel: [   52.892898] ata2.00: configured for UDMA/133
Jul 19 17:46:08 kernel: [   52.892903] ata2: EH complete
Jul 19 17:46:08 kernel: [   52.900432] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 19 17:46:08 kernel: [   52.900437] ata2.00: BMDMA stat 0x25
Jul 19 17:46:08 kernel: [   52.900440] ata2.00: failed command: READ DMA
Jul 19 17:46:08 kernel: [   52.900447] ata2.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Jul 19 17:46:08 kernel: [   52.900447]          res 51/04:08:00:00:00/00:00:00:00:00/e0 Emask 0x1 (device error)
Jul 19 17:46:08 kernel: [   52.900450] ata2.00: status: { DRDY ERR }
Jul 19 17:46:08 kernel: [   52.900452] ata2.00: error: { ABRT }
Jul 19 17:46:08 kernel: [   52.924870] ata2.00: configured for UDMA/133
Jul 19 17:46:08 kernel: [   52.924876] sd 1:0:0:0: [sdb]  
Jul 19 17:46:08 kernel: [   52.924878] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 19 17:46:08 kernel: [   52.924880] sd 1:0:0:0: [sdb]  
Jul 19 17:46:08 kernel: [   52.924882] Sense Key : Aborted Command [current] [descriptor]
Jul 19 17:46:08 kernel: [   52.924885] Descriptor sense data with sense descriptors (in hex):
Jul 19 17:46:08 kernel: [   52.924886]         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
Jul 19 17:46:08 kernel: [   52.924893]         00 00 00 00 
Jul 19 17:46:08 kernel: [   52.924896] sd 1:0:0:0: [sdb]  
Jul 19 17:46:08 kernel: [   52.924897] Add. Sense: No additional sense information
Jul 19 17:46:08 kernel: [   52.924899] sd 1:0:0:0: [sdb] CDB: 
Jul 19 17:46:08 kernel: [   52.924901] Read(10): 28 00 00 00 00 00 00 00 08 00
Jul 19 17:46:08 kernel: [   52.924906] end_request: I/O error, dev sdb, sector 0
Jul 19 17:46:08 kernel: [   52.924909] Buffer I/O error on device sdb, logical block 0
Jul 19 17:46:08 kernel: [   52.924916] ata2: EH complete
Jul 19 17:46:08 kernel: [   52.932406] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 19 17:46:08 kernel: [   52.932411] ata2.00: BMDMA stat 0x25
Jul 19 17:46:08 kernel: [   52.932415] ata2.00: failed command: READ DMA
Jul 19 17:46:08 kernel: [   52.932421] ata2.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Jul 19 17:46:08 kernel: [   52.932421]          res 51/04:08:00:00:00/00:00:00:00:00/e0 Emask 0x1 (device error)
Jul 19 17:46:08 kernel: [   52.932424] ata2.00: status: { DRDY ERR }
Jul 19 17:46:08 kernel: [   52.932426] ata2.00: error: { ABRT }
Jul 19 17:46:08 kernel: [   52.956848] ata2.00: configured for UDMA/133
Jul 19 17:46:08 kernel: [   52.956852] ata2: EH complete
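
As a cross-check on the log above, the Read(10) CDB bytes can be decoded by hand. This is a minimal sketch, assuming the standard SCSI Read(10) layout (opcode 0x28, a 4-byte big-endian LBA in bytes 2-5, a 2-byte transfer length in bytes 7-8). The function name is made up for illustration.

```python
def decode_read10(cdb_hex):
    """Decode a SCSI Read(10) CDB given as space-separated hex bytes.

    Returns (lba, sector_count). Layout assumed: byte 0 is the opcode (0x28),
    bytes 2-5 hold the big-endian LBA, bytes 7-8 the big-endian transfer length.
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a Read(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    count = int.from_bytes(cdb[7:9], "big")
    return lba, count


# CDB from the 17:47:28 error in the log
lba, count = decode_read10("28 00 74 70 6d 00 00 00 08 00")
total_sectors = 1953525168      # "512-byte logical blocks" reported for sdb

print(lba, count)               # 1953524992 8
print(lba // 8)                 # 244190624, the 4096-byte logical block in the log
print(total_sectors - lba)      # 176 sectors before the end of the disk
```

The other failing CDB (`28 00 00 00 00 00 00 00 08 00`) decodes to LBA 0. So the failed reads sit at both ends of the device, which is at least consistent with a whole-drive or connection fault rather than one isolated bad sector.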

It appears that sdb has gone bad, so the re-image cannot continue. I'm escalating this to @Cmjohnson to arrange a warranty disk replacement for sdb.

Once that is done, this system will have to be fully re-imaged, as it appears it was somehow booted into the installer, which wiped the MBR of the original installation. Once the disk replacement is done, please assign back to me for follow-up.

Thanks!

RobH renamed this task from setup/install phab1001.eqiad.wmnet to replace sdb and then setup/install phab1001.eqiad.wmnet.Jul 19 2017, 5:53 PM

This could also simply be a bad or loose cable on the drive bay! As basic troubleshooting, swapping sda and sdb to see whether the error follows the disk is recommended.

Dell dispatch SR951188562 for the replacement hard disk, shipping to eqiad.

Disk has been replaced:

Return shipping info is

USPS 9202 3946 5301 2436 1520 81
FEDEX 9611918 2393026 72902102

Change 366885 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: make phab1001 use role::spare for now

https://gerrit.wikimedia.org/r/366885

Change 366885 merged by Dzahn:
[operations/puppet@production] phabricator: make phab1001 use role::spare for now

https://gerrit.wikimedia.org/r/366885

This ran into a problem: the new role assigned phab1001 an IP that was already in use on iridium. So we've put it back to spare and are reimaging it to remove the bad role info. The reimage is still in progress.

RobH updated the task description.

back to @Dzahn for service implementation

@Dzahn: anything I can do to help get this one moving? I tried to log in to phab1001 so that I could verify that puppet has set things up correctly but I am not able to log in. I assume it hasn't had a puppet run yet?

@mmodell It doesn't have the puppet role for phab on it because we had to remove it. The role just isn't ready for being used on multiple hosts and it never worked before. It hardcodes an additional IP address for the second SSH server. So if you apply it you will get the same IP on multiple servers. We need to get rid of the hardcoded IP in there.
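
To illustrate the failure mode described here, a minimal sketch of a duplicate-IP check. The function name is hypothetical and `192.0.2.10` is a documentation placeholder, not the address the role actually hardcodes.

```python
from collections import defaultdict


def find_duplicate_ips(assignments):
    """Return {ip: sorted hosts} for every IP configured on more than one host.

    `assignments` maps a host name to an iterable of IP address strings.
    """
    hosts_by_ip = defaultdict(set)
    for host, ips in assignments.items():
        for ip in ips:
            hosts_by_ip[ip].add(host)
    return {ip: sorted(hosts) for ip, hosts in hosts_by_ip.items() if len(hosts) > 1}


# What happens when a role with one hardcoded IP is applied to two hosts.
# 192.0.2.10 is a placeholder address (assumption, not the real value).
hardcoded_ip = "192.0.2.10"
assignments = {
    "iridium": [hardcoded_ip],
    "phab1001": [hardcoded_ip],
}
print(find_duplicate_ips(assignments))  # {'192.0.2.10': ['iridium', 'phab1001']}
```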

@Dzahn: Ok, I can fix that. Thanks! I think there is a lot of room for improvement in the way we handle IP addresses.

So what should we do instead of having host-specific IPs in hieradata/role/[datacenter]/phabricator_server.yaml? Should puppet look up the info from DNS or should we have a host-specific hiera level?

role/[datacenter]/ seems actually correct and better than host names.

(well of course until iridium is gone we need a special solution and so yea, hieradata/hosts/ until we have one per datacenter again)
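
A rough model of the lookup order being discussed, written in Python rather than real Hiera: a host-level entry wins over the role/[datacenter] entry. The key name `vcs_listen_address` and both addresses are assumptions for illustration only.

```python
def lookup(key, host, role, datacenter, data):
    """Return the first value found, checking the host level before role/datacenter.

    `data` mimics a hierarchy:
    {"hosts": {host: {...}}, "role": {(datacenter, role): {...}}}
    """
    levels = [
        data.get("hosts", {}).get(host, {}),
        data.get("role", {}).get((datacenter, role), {}),
    ]
    for level in levels:
        if key in level:
            return level[key]
    raise KeyError(key)


# Placeholder data: the key name and the addresses are hypothetical.
data = {
    "role": {("eqiad", "phabricator_server"): {"vcs_listen_address": "192.0.2.10"}},
    "hosts": {"phab1001": {"vcs_listen_address": "192.0.2.20"}},
}

# phab1001 has a host-level override, iridium falls back to the role level.
print(lookup("vcs_listen_address", "phab1001", "phabricator_server", "eqiad", data))
print(lookup("vcs_listen_address", "iridium", "phabricator_server", "eqiad", data))
```

With this shape, the host-level entries can simply be deleted once there is one server per datacenter again, and the role-level value takes over.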

@Dzahn: Do we have an IP assigned for git-ssh on phab1001?

@mmodell Here's the thing.

There is the git-ssh IP for eqiad 208.80.154.250 and git-ssh for codfw 208.80.153.250. This IP is on the "lo" loopback interface. The one for eqiad is on iridium.

Then there is the "phab1001-vcs" 10.64.32.186 and "phab2001-vcs" 10.192.32.149. This IP is on the eth0 interface. Also here the "1001" one is on iridium.

When we added phab1001-vcs on _iridium_ we did not name it iridium-vcs because we knew iridium was supposed to be reinstalled as phab1001.

Then plans changed from "reinstall iridium as phab1001" to "set up separate phab1001 in parallel" and now we have "phab1001-vcs" on iridium but NOT on phab1001.

So now we have 2 options:

a) schedule downtime, take down iridium, remove the additional IPs, add them on phab1001. use phab1001, shut down iridium. done

or

b) add additional IPs with some random name that isn't phab1001-vcs even though it's on phab1001, keep the existing names/IPs on iridium, be able to test, switch to phab1001, then rename things again. Hmm, this seems like we would barely get less downtime out of it but would have more work.

Yeah I think scheduled downtime to switch the IP is reasonable. I'll make a patch and we can do it this week if you're up for it.

Change 368947 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator/admins: give phab admins access to phab1001

https://gerrit.wikimedia.org/r/368947

Change 368947 merged by Dzahn:
[operations/puppet@production] phabricator/admins: give phab admins access to phab1001

https://gerrit.wikimedia.org/r/368947

Change 368841 had a related patch set uploaded (by Paladox; owner: Paladox):
[operations/puppet@production] phabricator: rsync /srv/repos on iridium to phab1001

https://gerrit.wikimedia.org/r/368841

Change 368841 merged by Dzahn:
[operations/puppet@production] phabricator: rsync /srv/repos on iridium to phab1001

https://gerrit.wikimedia.org/r/368841

Change 368957 had a related patch set uploaded (by Paladox; owner: Dzahn):
[operations/puppet@production] phab1001: add interface::add_ip6_mapped

https://gerrit.wikimedia.org/r/368957

Change 368957 merged by Dzahn:
[operations/puppet@production] phab1001: add interface::add_ip6_mapped

https://gerrit.wikimedia.org/r/368957

Change 368969 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add IPv6 records for phab1001.eqiad.wmnet.

https://gerrit.wikimedia.org/r/368969

Change 368969 merged by Dzahn:
[operations/dns@master] add IPv6 records for phab1001.eqiad.wmnet.

https://gerrit.wikimedia.org/r/368969

Change 369001 had a related patch set uploaded (by Dzahn; owner: 20after4):
[operations/puppet@production] phabricator: Allow listen_address to be empty for migration from iridium

https://gerrit.wikimedia.org/r/369001

Change 369001 merged by Dzahn:
[operations/puppet@production] phabricator: Allow listen_address to be empty for migration from iridium

https://gerrit.wikimedia.org/r/369001

Change 369444 had a related patch set uploaded (by Dzahn; owner: Paladox):
[operations/puppet@production] Revert "phabricator: make phab1001 use role::spare for now"

https://gerrit.wikimedia.org/r/369444

Change 369444 merged by Dzahn:
[operations/puppet@production] phabricator: make phab1001 use role::phabricator_server

https://gerrit.wikimedia.org/r/369444

Change 369445 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] Revert "phabricator/admins: give phab admins access to phab1001"

https://gerrit.wikimedia.org/r/369445

Change 369447 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: enable stats mail and dumps on phab1001

https://gerrit.wikimedia.org/r/369447

Change 369445 merged by Dzahn:
[operations/puppet@production] Revert "phabricator/admins: give phab admins access to phab1001"

https://gerrit.wikimedia.org/r/369445

Change 369447 merged by Dzahn:
[operations/puppet@production] phabricator: enable stats mail and dumps on phab1001

https://gerrit.wikimedia.org/r/369447

Mentioned in SAL (#wikimedia-operations) [2017-08-03T00:05:31Z] <mutante> rsyncing /srv/repos from iridium to phab1001 (T163938)

Change 369820 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] cache::misc/phabricator: switch from iridium to phab1001 backend

https://gerrit.wikimedia.org/r/369820

mmodell renamed this task from replace sdb and then setup/install phab1001.eqiad.wmnet to setup/install phab1001.eqiad.wmnet.Aug 3 2017, 1:48 AM

Change 369821 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] cache::misc/phabricator: add director for phabricator-new, staging

https://gerrit.wikimedia.org/r/369821

Change 369823 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: set phab_domain to phabricator-new for phab1001

https://gerrit.wikimedia.org/r/369823

Change 369821 merged by Dzahn:
[operations/puppet@production] cache::misc/phabricator: add director for phabricator-new, staging

https://gerrit.wikimedia.org/r/369821

Change 369829 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] cache::misc/phabricator: add director for phab-new

https://gerrit.wikimedia.org/r/369829

Change 369829 merged by Dzahn:
[operations/puppet@production] cache::misc/phabricator: add director for phab-new

https://gerrit.wikimedia.org/r/369829

Change 369823 merged by Dzahn:
[operations/puppet@production] phabricator: set phab_domain to phabricator-new for phab1001

https://gerrit.wikimedia.org/r/369823

Change 369831 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] datasets/phabricator: allow phab1001 as rsync host for dumps

https://gerrit.wikimedia.org/r/369831

Change 369831 merged by Dzahn:
[operations/puppet@production] datasets/phabricator: allow phab1001 as rsync host for dumps

https://gerrit.wikimedia.org/r/369831

Change 369832 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mariadb/phabricator: update GRANTS from iridium to phab1001

https://gerrit.wikimedia.org/r/369832

I'm making this comment from phab1001.eqiad.wmnet :)

Change 369834 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] exim/phabricator: send mail to phab1001, not iridium

https://gerrit.wikimedia.org/r/369834

Change 369836 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: set phab1001 to active phab server

https://gerrit.wikimedia.org/r/369836

I found that https://phabricator-new.wikimedia.org/ redirects continuously until the browser aborts the loop. But https://phabricator-new.wikimedia.org/diffusion/ works :)

Change 370104 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phab: phabricator-new to phab2001, phab1001 using normal domain

https://gerrit.wikimedia.org/r/370104

steps for migration:

  • stop phd and puppet on iridium
  • rsync /srv/repos

/usr/bin/rsync -av rsync://iridium.eqiad.wmnet/srv-repos /srv/repos/

  • move IP addresses:

on iridium:

phab1001-vcs.eqiad.wmnet.
sudo ip addr del 10.64.32.186/32 dev eth0
sudo ip addr del 2620:0:861:103:10:64:32:186/128 dev eth0

git-ssh.eqiad.wikimedia.org.
sudo ip addr del 208.80.154.250/32 dev lo
sudo ip addr del 2620:0:861:ed1a::3:16/128 dev lo

on phab1001:

sudo ip addr add 10.64.32.186/32 dev eth0
sudo ip addr add 2620:0:861:103:10:64:32:186/128 dev eth0
(already done by puppet) # sudo ip addr add 208.80.154.250/32 dev lo
(already done by puppet) # sudo ip addr add 2620:0:861:ed1a::3:16/128 dev lo
...
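
The IP move above could also be generated from one table, so the `del` and `add` sides cannot drift apart. This is a minimal sketch that only builds command strings and runs nothing. The helper names are made up, and `mapped_ip6` is only inferred from the two phab1001-vcs addresses in this task, not from how interface::add_ip6_mapped is implemented.

```python
import ipaddress

# Service addresses listed in the steps above: (name, address, interface)
SERVICE_IPS = [
    ("phab1001-vcs.eqiad.wmnet", "10.64.32.186", "eth0"),
    ("phab1001-vcs.eqiad.wmnet", "2620:0:861:103:10:64:32:186", "eth0"),
    ("git-ssh.eqiad.wikimedia.org", "208.80.154.250", "lo"),
    ("git-ssh.eqiad.wikimedia.org", "2620:0:861:ed1a::3:16", "lo"),
]


def ip_commands(action, entries):
    """Build `ip addr <action>` command strings with host prefixes (/32 or /128)."""
    if action not in ("add", "del"):
        raise ValueError("action must be 'add' or 'del'")
    commands = []
    for _name, addr, dev in entries:
        ip = ipaddress.ip_address(addr)  # raises ValueError on a typo
        commands.append(f"sudo ip addr {action} {addr}/{ip.max_prefixlen} dev {dev}")
    return commands


def mapped_ip6(prefix64, ipv4):
    """Append the IPv4 octets, written as decimal-looking groups, to a /64 prefix.

    Assumption: this mirrors the pattern seen in the phab1001-vcs addresses.
    """
    return prefix64 + ":" + ipv4.replace(".", ":")


print("on iridium:")
for cmd in ip_commands("del", SERVICE_IPS):
    print(" ", cmd)

print("on phab1001:")
for cmd in ip_commands("add", SERVICE_IPS):
    print(" ", cmd)

print(mapped_ip6("2620:0:861:103", "10.64.32.186"))  # 2620:0:861:103:10:64:32:186
```

On phab1001 the two `lo` lines would be skipped, since the steps above note that puppet has already added them.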

Change 370119 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: switch service IPs to phab1001

https://gerrit.wikimedia.org/r/370119

Change 370122 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/phabricator: remove phab role from iridium, make it a spare

https://gerrit.wikimedia.org/r/370122

Change 370123 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator/dumps: remove iridium as allowed dumps host

https://gerrit.wikimedia.org/r/370123

Change 370119 had a related patch set uploaded (by Paladox; owner: Dzahn):
[operations/puppet@production] phabricator: switch service IPs to phab1001

https://gerrit.wikimedia.org/r/370119

Change 370119 merged by Dzahn:
[operations/puppet@production] phabricator: switch service IPs to phab1001

https://gerrit.wikimedia.org/r/370119

Mentioned in SAL (#wikimedia-operations) [2017-08-04T02:15:02Z] <mutante> phab1001 can't talk to mx servers via IPv6, but works via IPv4. iridium and other mailservers can also talk IPv6 to it. why? it did not change even when stopping ferm on client and on server it allows from anywhere. workaround for now was to hardcode IPv4 IP in phab config. (T163938)
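
For narrowing down this kind of IPv4-works-but-IPv6-doesn't problem, a small probe that tries a TCP connection per address family can help. This is a sketch using only the Python standard library. The function name is made up and any host name passed to it is up to the caller.

```python
import socket


def can_connect(host, port, family, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds over the given family.

    `family` is socket.AF_INET (IPv4) or socket.AF_INET6 (IPv6).
    """
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # the host has no address of this family
    for af, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(af, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            continue  # try the next address, if any
    return False
```

Calling `can_connect(mx_host, 25, socket.AF_INET)` and `can_connect(mx_host, 25, socket.AF_INET6)` from phab1001 and from iridium, with `mx_host` set to one of the mail servers, would show whether only the IPv6 path from phab1001 fails.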

Change 370123 merged by Dzahn:
[operations/puppet@production] phabricator/dumps: remove iridium as allowed dumps host

https://gerrit.wikimedia.org/r/370123

Change 370140 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: remove iridium remnants, replace with phab1001

https://gerrit.wikimedia.org/r/370140

Change 370140 merged by Dzahn:
[operations/puppet@production] phabricator: remove iridium remnants, replace with phab1001

https://gerrit.wikimedia.org/r/370140

Dzahn updated the task description.

Change 370122 merged by Dzahn:
[operations/puppet@production] site/phabricator: remove phab role from iridium, make it a spare

https://gerrit.wikimedia.org/r/370122

Change 370607 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: ensure /srv/dumps exists

https://gerrit.wikimedia.org/r/370607

Change 370607 merged by Dzahn:
[operations/puppet@production] phabricator: ensure /srv/dumps exists

https://gerrit.wikimedia.org/r/370607

https://gerrit.wikimedia.org/r/370518 by paladox merged as well, to fix another cron that sends the stats mail to admins

Mentioned in SAL (#wikimedia-operations) [2017-08-08T03:55:06Z] <mutante> phab1001 /usr/local/bin/community_metrics.sh | /usr/local/bin/project_changes.sh creating stats mails to admins (which failed before) (T163938)

We should probably only use @Aklapper's email for /usr/local/bin/project_changes.sh on prod. I was testing /usr/local/bin/project_changes.sh and it sent the mail to him, which was then forwarded to the list without me realising.

Change 369832 merged by Jcrespo:
[operations/puppet@production] mariadb/phabricator: update GRANTS from iridium to phab1001

https://gerrit.wikimedia.org/r/369832

Change 377703 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb - phabricator: Remove public hashes from configuration files

https://gerrit.wikimedia.org/r/377703

Change 377703 merged by Jcrespo:
[operations/puppet@production] mariadb - phabricator: Remove public hashes from configuration files

https://gerrit.wikimedia.org/r/377703