setup/install bast1002(WMF4749)
Closed, ResolvedPublic

Description

This task tracks the setup, installation, and deployment of bast1002.wikimedia.org (WMF4749). The work has two parts: 1) the normal setup and deployment of the server with an OS, and 2) the refactoring of the bastion role, since it has to support many possible sub-services (such as tftp).

bast1002:

Event Timeline

RobH triaged this task as Normal priority. Feb 6 2018, 5:45 PM
RobH created this task.
RobH renamed this task from setup/install bast1002 to setup/install bast1002(WMF4749). Feb 6 2018, 5:48 PM
RobH updated the task description.

Change 408565 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] set dns entries for bast1002

https://gerrit.wikimedia.org/r/408565

Change 408565 merged by RobH:
[operations/dns@master] set dns entries for bast1002

https://gerrit.wikimedia.org/r/408565

RobH updated the task description. Feb 6 2018, 6:04 PM

Change 408586 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] setting bast1002 install params

https://gerrit.wikimedia.org/r/408586

Change 408586 merged by RobH:
[operations/puppet@production] setting bast1002 install params

https://gerrit.wikimedia.org/r/408586

Change 408598 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] fixing bast1002 entry

https://gerrit.wikimedia.org/r/408598

Change 408598 merged by RobH:
[operations/dns@master] fixing bast1002 entry

https://gerrit.wikimedia.org/r/408598

RobH added a comment. Feb 6 2018, 8:24 PM

I'm getting an SDA error in the installer. I removed all auto partitioning, and it still has the error, so it is a defective disk.

I'm rebooting into the ePSA to test things. Otherwise we'll just report the bad disk via the self dispatch and get a new one sent out.

RobH reassigned this task from RobH to Cmjohnson. Feb 6 2018, 8:49 PM
RobH added a subscriber: Cmjohnson.

I put in self dispatch case # SR960490901 and the replacement disk will ship directly to eqiad with a return tag for the old one. Escalating to Chris for the swap.

@Cmjohnson please swap the sda and assign back to me for followup, thanks!

Dell now requires a new report with self dispatches, so this was kicked back. I resubmitted it just now.

Cmjohnson reassigned this task from Cmjohnson to RobH. Feb 16 2018, 3:33 PM

@RobH the disk has been replaced. Assigning back to you.

Return tracking info
USPS 9202 3946 5301 2437 9854 35
FEDEX 9611918 2393026 74735456

@RobH does this still need idrac setup?

@RobH I see it's in site.pp with role spare::system. Have any more of the checkboxes been done in the meantime?

Dzahn claimed this task. Feb 26 2018, 10:59 PM

Change 414848 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: turn bast1002 into a bastion host

https://gerrit.wikimedia.org/r/414848

I was able to log in to the DRAC and get a console, then I saw

Build : 4239.35 ePSA Pre-boot System Assessment

and shortly after, the system rebooted into the Debian installer.

There it still got an I/O error:

Input/output error during read on /dev/sdb

Dzahn reassigned this task from Dzahn to RobH. Feb 26 2018, 11:35 PM

Change 414882 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] partman: fix recipes for bastion servers

https://gerrit.wikimedia.org/r/414882

Change 414882 merged by Dzahn:
[operations/puppet@production] partman: fix recipes for bastion servers

https://gerrit.wikimedia.org/r/414882

Dzahn added a comment. Feb 27 2018, 1:47 AM

I found and fixed an issue with the selection of partman recipes, but after that I ran into the next problem.

While I was on the console my connection froze, and now I can't get back on it:

root@bast1002.mgmt.eqiad.wmnet's password: 

No more sessions are available for this type of connection!

Connection to bast1002.mgmt.eqiad.wmnet closed.

Dzahn added a comment. Edited Mar 5 2018, 5:13 PM

I could get a console again; the installer fails at:

Input/output error during write on /dev/sdb

Error while setting up RAID
An unexpected error occurred while setting up a preseeded RAID
configuration.

Change 417463 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] netboot: temp remove bast1002 for debugging

https://gerrit.wikimedia.org/r/417463

Change 417463 merged by Dzahn:
[operations/puppet@production] netboot: temp remove bast1002 for debugging

https://gerrit.wikimedia.org/r/417463

Dzahn added a comment. Mar 8 2018, 11:34 PM

I booted it without partman recipe to get into manual partitioning. And, as opposed to deploy1001, this time it worked for some reason.

It got past the partitioning and is now installing the base system, but with the wrong schema.

deploy1001 though is showing actual "SDA failed" errors, so not the same thing after all?

Change 417469 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] fix erroneous records for bast1002

https://gerrit.wikimedia.org/r/417469

Change 417469 merged by Dzahn:
[operations/dns@master] fix erroneous records for bast1002

https://gerrit.wikimedia.org/r/417469

Dzahn added a comment. Mar 9 2018, 1:28 AM

after reverting the changes to let it auto-partition again:

  • fails as before

tried another partman recipe with "gpt"

  • fails as before

switched SATA controller to ATA mode

  • fails to even detect hardware

Dzahn added a comment. Mar 9 2018, 5:47 PM

From /var/log/syslog, read from the installer shell:

Mar  9 01:25:26 kernel: [   72.856447] ata2.00: configured for UDMA/133
Mar  9 01:25:26 kernel: [   72.856460] ata2: EH complete
Mar  9 01:25:26 kernel: [   72.868665] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar  9 01:25:26 kernel: [   72.868667] ata2.00: BMDMA stat 0x25
[            (1*installer)  2 shell  3 shell  4- log           ][ Mar 09 17:41 ]
Mar  9 01:25:26 kernel: [   72.868676] ata2.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Mar  9 01:25:26 kernel: [   72.868676]          res 51/04:08:00:00:00/04:00:74:00:00/e0 Emask 0x1 (device error)
Mar  9 01:25:26 kernel: [   72.868678] ata2.00: status: { DRDY ERR }
Mar  9 01:25:26 kernel: [   72.868679] ata2.00: error: { ABRT }
Mar  9 01:25:26 kernel: [   72.876464] ata2.00: configured for UDMA/133
Mar  9 01:25:26 kernel: [   72.876477] ata2: EH complete
Mar  9 01:25:26 kernel: [   72.888662] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar  9 01:25:26 kernel: [   72.888664] ata2.00: BMDMA stat 0x25
Mar  9 01:25:26 kernel: [   72.888667] ata2.00: failed command: READ DMA
Mar  9 01:25:26 kernel: [   72.888672] ata2.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Mar  9 01:25:26 kernel: [   72.888672]          res 51/04:08:00:00:00/04:00:74:00:00/e0 Emask 0x1 (device error)
Mar  9 01:25:26 kernel: [   72.888674] ata2.00: status: { DRDY ERR }
Mar  9 01:25:26 kernel: [   72.888676] ata2.00: error: { ABRT }
Mar  9 01:25:26 kernel: [   72.896438] ata2.00: configured for UDMA/133
Mar  9 01:25:26 kern
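
For anyone triaging a paste like the one above, here is a minimal Python sketch that tallies the libata messages per device. It is not part of the original task: the function name `summarize_ata_errors`, the regular expression, and the choice of which messages to count are illustrative assumptions. Lines that do not parse (continuation lines, status bars, truncated entries) are simply skipped.

```python
import re
from collections import Counter

# Matches kernel libata lines such as:
#   "Mar  9 01:25:26 kernel: [   72.888676] ata2.00: error: { ABRT }"
# The pattern is an assumption based on the lines pasted in this task.
LINE_RE = re.compile(
    r"kernel: \[\s*(?P<ts>\d+\.\d+)\] (?P<dev>ata\d+(?:\.\d+)?): (?P<msg>.*)"
)


def summarize_ata_errors(lines):
    """Count exception, failed-command and error-flag messages per ATA device."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a parseable libata line; ignore it
        dev, msg = match.group("dev"), match.group("msg")
        if msg.startswith("exception"):
            counts[(dev, "exception")] += 1
        elif msg.startswith(("failed command:", "error:")):
            counts[(dev, msg)] += 1
    return counts


# A few lines copied from the paste above, including the truncated last one.
sample = [
    "Mar  9 01:25:26 kernel: [   72.876477] ata2: EH complete",
    "Mar  9 01:25:26 kernel: [   72.888662] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0",
    "Mar  9 01:25:26 kernel: [   72.888667] ata2.00: failed command: READ DMA",
    "Mar  9 01:25:26 kernel: [   72.888676] ata2.00: error: { ABRT }",
    "Mar  9 01:25:26 kern",
]

if __name__ == "__main__":
    for key, count in sorted(summarize_ata_errors(sample).items()):
        print(key, count)
```

A tally like this only shows which device keeps throwing the same failed command; it does not by itself tell a bad disk apart from a bad cable or controller.
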
RobH mentioned this in Unknown Object (Task). Mar 9 2018, 5:58 PM

Change 418034 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add wmf4727 as bast1003

https://gerrit.wikimedia.org/r/418034

Change 418034 merged by Dzahn:
[operations/dns@master] add wmf4727 as bast1003

https://gerrit.wikimedia.org/r/418034

Change 418036 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add wmf4727 as bast1003.wikimedia.org

https://gerrit.wikimedia.org/r/418036

Change 418036 merged by Dzahn:
[operations/dns@master] add wmf4727 as bast1003.wikimedia.org

https://gerrit.wikimedia.org/r/418036

Change 418157 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: add MAC of wmf4727 as bast1003

https://gerrit.wikimedia.org/r/418157

Change 418157 merged by Dzahn:
[operations/puppet@production] DHCP: add MAC of wmf4727 as bast1003

https://gerrit.wikimedia.org/r/418157

Change 418598 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] temp use bast-test.eqiad.wmnet for wmf4727

https://gerrit.wikimedia.org/r/418598

Change 418598 merged by Dzahn:
[operations/dns@master] temp use bast-test.eqiad.wmnet for wmf4727

https://gerrit.wikimedia.org/r/418598

Change 418600 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: test wmf4727 as bast-test.eqiad

https://gerrit.wikimedia.org/r/418600

Change 418600 merged by Dzahn:
[operations/puppet@production] DHCP: test wmf4727 as bast-test.eqiad

https://gerrit.wikimedia.org/r/418600

Dzahn added a comment. Edited Mar 10 2018, 7:12 PM

when using wmf4727 and testing stretch install:

No kernel modules were found. This probably is due to a mismatch
between the kernel used by this version of the installer and the
kernel version available in the archive.

If you're installing from a mirror, you can work around this problem
by choosing to install a different version of Debian. The install
will probably fail to work if you continue without kernel modules.

Continue the install without loading kernel modules?

    <Go Back>                                       <Yes>    <No>

No disk drive was detected. If you know the name of the driver needed
by your disk drive, you can select it from the list.

Driver needed for your disk drive:

    continue with no disk drive
    hv_storvsc
...

[!!] Partition disks

Software RAID not available
The current kernel doesn't seem to support software RAID (MD)
devices. This should be solved by loading the necessary modules.

Dzahn claimed this task. Mar 19 2018, 7:40 PM

Chris also replaced disks in this one today.

Dzahn added a comment. Edited Mar 19 2018, 8:26 PM

re: bast1002

I started the install process again and this time it worked, which confirms that replacing the disks solved it.

re: wmf4727 / bast1003:

I started the install process again and it went through partitioning and installing the base system, so we can confirm the hard disks themselves work.

That being said, the installation did not finish successfully because at the very end the GRUB install failed, for reasons that are so far unknown.

RobH mentioned I should leave the ticket open and that he'd get back to it.

Mentioned in SAL (#wikimedia-operations) [2018-03-19T20:32:06Z] <mutante> signing puppet certs for new host bast1002. initial puppet run, will replace bast1001 soon (T186623)

Change 414848 merged by Dzahn:
[operations/puppet@production] site: turn bast1002 into a bastion host

https://gerrit.wikimedia.org/r/414848

Dzahn closed this task as Resolved. Mar 19 2018, 9:37 PM
Dzahn updated the task description.

handed-over to self ;)

Change 422339 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] bastionhost: add MOTD warning of imminent bast1001 shutdown

https://gerrit.wikimedia.org/r/422339

Change 422339 merged by Dzahn:
[operations/puppet@production] bastionhost: add MOTD warning of imminent bast1001 shutdown

https://gerrit.wikimedia.org/r/422339

Change 436680 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: remove bast-test from DHCP

https://gerrit.wikimedia.org/r/436680

Change 436680 merged by Dzahn:
[operations/puppet@production] install_server: remove bast-test from DHCP

https://gerrit.wikimedia.org/r/436680

Change 436681 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] remove bast-test.eqiad.wmnet

https://gerrit.wikimedia.org/r/436681

Change 436681 merged by Dzahn:
[operations/dns@master] remove bast-test.eqiad.wmnet

https://gerrit.wikimedia.org/r/436681