rack/setup/install backup2001
Closed, Resolved · Public

Description

This task will track the racking, setup, installation, and deployment of the new backup2001.codfw.wmnet. This host is a direct replacement of heze.codfw.wmnet. It does have more storage capacity (shelves).

Racking Proposal: This needs to be in a 10G networked rack, but can be in ANY 10G rack. Its location in relation to heze is immaterial, since heze will be decommissioned when this is fully online. Just put it in any 10G rack where you have the most power/space/network/access.

Disk Shelf Cabling: These should be wired in series, taking up only one of the two ports of the external SAS controller. This leaves the other port open for other shelf additions at a later date.

backup2001 + backup2001-array1 + backup2001-array2:

  • - receive in system on procurement task T194977
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - handoff for service implementation

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 5 2018, 4:40 PM
Papaul updated the task description. (Show Details)Jun 7 2018, 8:00 PM

Change 438277 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: ADD mgmt & prod DNS entries for backup2001

https://gerrit.wikimedia.org/r/438277

Papaul updated the task description. (Show Details)Jun 8 2018, 5:28 PM
Papaul updated the task description. (Show Details)

Change 438277 merged by RobH:
[operations/dns@master] DNS: ADD mgmt & prod DNS entries for backup2001

https://gerrit.wikimedia.org/r/438277

Papaul updated the task description. (Show Details)Jun 8 2018, 8:48 PM

Change 439830 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Add MAC address and netboot entries for backup2001

https://gerrit.wikimedia.org/r/439830

Change 439830 merged by Dzahn:
[operations/puppet@production] DHCP: Add MAC address and netboot entries for backup2001

https://gerrit.wikimedia.org/r/439830

Change 440485 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Change backup2001 MAC address from 1G MAC to 10G MAC

https://gerrit.wikimedia.org/r/440485

Change 440485 merged by Dzahn:
[operations/puppet@production] DHCP: Change backup2001 MAC address from 1G MAC to 10G MAC

https://gerrit.wikimedia.org/r/440485

@MoritzMuehlenhoff our installer is missing network drivers for the NIC on this system (QLogic 10GE 2P QL41112HxCU-DE Adapter)

                                                                             
┌─────────────────────┤ [!] Detect network hardware ├─────────────────────┐   
│                                                                         │   
│ No Ethernet card was detected. If you know the name of the driver       │   
│ needed by your Ethernet card, you can select it from the list.          │   
│                                                                         │   
│ Driver needed by your Ethernet card:                                    │   
│                                                                         │   
│  no ethernet card                                                   -   │   
│  3c574_cs: 3Com 3c574 series PCMCIA Ethernet                        0   │   
│  3c589_cs: 3Com 3c589 series PCMCIA Ethernet                        ▒   │   
│  3c59x: 3Com 3c59x/3c9xx PCI Ethernet                               ▒   │   
│  8139cp: RealTek RTL-8139C+ series 10/100 PCI Ethernet              ▒   │   
│  8139too: RealTek RTL-8139 Fast Ethernet                            ▒   │   
│  8390: National Semiconductor 8390 Ethernet                         ▒   │   
│  acenic: AceNIC/3C985/GA620 Gigabit Ethernet driver                 ▒   │   
│  adm8211: Driver for IEEE 802.11b wireless cards based on ADMtek A  .   │   
│                                                                         │   
│     <Go Back>                                                           │   
│                                                                         │   
└─────────────────────────────────────────────────────────────────────────┘
Papaul updated the task description. (Show Details)Jun 15 2018, 4:55 PM

Could you open a maintenance shell and attach a screenshot of the output for "Network controller" of "lspci -v"? We need to figure out whether it misses a driver or firmware.

@MoritzMuehlenhoff please see below for the output you requested

Vvjjkkii renamed this task from rack/setup/install backup2001 to mlbaaaaaaa.Jul 1 2018, 1:05 AM
Vvjjkkii removed Papaul as the assignee of this task.
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii edited subscribers, added: Papaul; removed: gerritbot, Aklapper.
CommunityTechBot renamed this task from mlbaaaaaaa to rack/setup/install backup2001.Jul 2 2018, 6:59 AM
CommunityTechBot assigned this task to Papaul.
CommunityTechBot lowered the priority of this task from High to Normal.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot edited subscribers, added: gerritbot, Aklapper; removed: Papaul.
Papaul added a subscriber: Papaul.

@MoritzMuehlenhoff assigning you this task if you have time to look at it while I am gone for vacation. Thanks

I took a shot at that.

~ # lspci |grep net
3b:00.0 Ethernet controller: QLogic Corp. Device 8070 (rev 02)
3b:00.1 Ethernet controller: QLogic Corp. Device 8070 (rev 02)

Seems like the qede module should work for them but only after 4.12 [1]. @MoritzMuehlenhoff could we start the installer with a newer kernel ?

[1] https://cateee.net/lkddb/web-lkddb/QEDE.html
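A rough sketch of the version check implied above: the snippet compares a kernel release string (as printed by `uname -r`) against the 4.12 threshold mentioned for this device. The function name `kernel_at_least` and the constant `QEDE_MIN` are hypothetical, and reading "after 4.12" as "4.12 or later" is an assumption.

```python
import re


def kernel_at_least(release: str, minimum: tuple) -> bool:
    """Return True if a `uname -r` style release is >= minimum (major, minor)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


# Assumed threshold, taken from the comment above (qede support "after 4.12").
QEDE_MIN = (4, 12)

if __name__ == "__main__":
    for release in ("4.9.0-8-amd64", "4.14.17"):
        print(release, kernel_at_least(release, QEDE_MIN))
```

This only compares version numbers; it says nothing about distribution backports, which can add device support to an older kernel.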

There's no simple way to start the stretch installer with a more recent kernel. Some options were discussed in this recent talk at DebConf: https://meetings-archive.debian.net/pub/debian-meetings/2018/DebConf18/2018-07-31/backporting-hardware-support-in-debian.webm but there's no good solution at this point.

We could try backporting https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9c79ddaa0f962d1f26537a670b0652ff509a6fe0, but it's a bit of work and we'll need to investigate whether we also need more recent firmware from firmware-qlogic.

Does that server have additional NICs we could use (onboard or via an addon adapter)?

There's no simple way to start the stretch installer with a more recent kernel. Some options were discussed in this recent talk at DebConf: https://meetings-archive.debian.net/pub/debian-meetings/2018/DebConf18/2018-07-31/backporting-hardware-support-in-debian.webm but there's no good solution at this point.

Ouch.

We could try backporting https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9c79ddaa0f962d1f26537a670b0652ff509a6fe0, but it's a bit of work and we'll need to investigate whether we also need more recent firmware from firmware-qlogic.

Could we just setup a buster debian-installer and abuse the linux kernel from it just for the duration of the installation of this host ? Overriding temporarily in dhcp config stretch-installer/debian-installer/amd64/linux (and possibly the initrd as well) ? just for the installer phase.

Does that server have additional NICs we could use (onboard or via an addon adapter)?

Yes, but they are 1Gb and not connected (and were disabled in the bios)

Papaul added a comment.Aug 2 2018, 2:46 PM

@akosiaris @MoritzMuehlenhoff yes, we do have 2x 1GB NICs on the server. Since the server is in a rack with a 10G switch, we can use a 1000BASE-T SFP copper adapter to connect one of the 1GB NICs to the switch.

Could we just setup a buster debian-installer and abuse the linux kernel from it just for the duration of the installation of this host ? Overriding temporarily in dhcp config stretch-installer/debian-installer/amd64/linux (and possibly the initrd as well) ? just for the installer phase.

With some kludges that should be possible, but that still leaves us with a non-working NIC once Stretch is installed?

Can't we simply install the host using the currently disabled second NIC?

Yes, but they are 1Gb and not connected (and were disabled in the bios)

I assume we need 10G for day-to-day operation?

With some kludges that should be possible, but that still leaves us with a non-working NIC once Stretch is installed?

Unless we also upgrade to a kernel from say stretch-backports (4.16+94~bpo9+1 from what I see currently), yes it does.

Can't we simply install the host using the currently disabled second NIC?

Yes (with some DC work as @Papaul says above), but then we are back into the same state as above. Unless we upgrade to a kernel from stretch-backports as above

I assume we need 10G for day-to-day operation?

Strictly need, no. But crunching the numbers back when we did the procurement did point out that we should expect to need it in the midterm future.

Unless we also upgrade to a kernel from say stretch-backports (4.16+94~bpo9+1 from what I see currently), yes it does.

stretch-backports has no sensible security support. We'd essentially need to wait until a fixed version trickles into testing and someone then makes a backport. That's still somewhat (reluctantly) acceptable for a backup host, but hardly for any other host. It also means that we need to follow kernels (until it eventually turns into buster).

So we have essentially three options:

  1. Go with the 1GB NIC for now and enable the 10GB NIC when that host is migrated to buster. That way we can use all our default software stacks.
  2. Install with the 1 GB NIC and then install the stretch-backports kernel post d-i, enabling the 10GB NIC
  3. Hack the buster kernel into d-i from stretch and then install the stretch-backports kernel post d-i, enabling the 10 GB NIC

If we go for the 10 GB option, 2. seems preferable to avoid subtle errors, though.

Related link (coincidentally from today!) wrt steps needed in d-i to support installing from backports: https://lists.debian.org/debian-boot/2018/08/msg00015.html

ayounsi added a subscriber: ayounsi.Aug 8 2018, 4:30 PM

Change 455888 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Change MAC address to embedded NIC 1 for Moritzm to test 10GB drivers

https://gerrit.wikimedia.org/r/455888

Change 455888 merged by Dzahn:
[operations/puppet@production] DHCP: Change MAC address to embedded NIC 1 for Moritzm to test 10GB drivers

https://gerrit.wikimedia.org/r/455888

Dzahn added a subscriber: Dzahn.Aug 28 2018, 7:08 PM

MAC in DHCP has changed to embedded NIC 1 and puppet ran on install2002 to update config.

@Papaul: Does this maybe need some additional change in the BIOS to make the server PXE-boot from the internal NIC?

When I'm trying to install it, I still see that it's trying the Qlogic NIC for the PXE boot:

CLIENT MAC ADDR: F4 E9 D4 74 FD 78  GUID: 4C4C4544-0058-3510-8032-B2C04F525032
PXE-E51: No DHCP or proxyDHCP offers were received.

PXE-M0F: Exiting QLogic PXE ROM.
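The PXE ROM prints the MAC address as space-separated upper-case octets, while switch tables and most config files use a lower-case colon-separated form. A minimal sketch for converting between the two so they can be compared; `normalize_mac` is a hypothetical helper, not part of any existing tooling.

```python
import re


def normalize_mac(raw: str) -> str:
    """Convert a MAC written with spaces, colons, dots or dashes to aa:bb:cc:dd:ee:ff."""
    # Drop the common separators only, so stray characters still fail validation.
    compact = re.sub(r"[\s:.\-]", "", raw).lower()
    if not re.fullmatch(r"[0-9a-f]{12}", compact):
        raise ValueError(f"not a 48-bit MAC address: {raw!r}")
    return ":".join(compact[i:i + 2] for i in range(0, 12, 2))


if __name__ == "__main__":
    # Value copied from the PXE output above.
    print(normalize_mac("F4 E9 D4 74 FD 78"))
```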

@MoritzMuehlenhoff I changed the switch to use ge-2/0/12 instead of xe-2/0/12, since we are using a 1GB transceiver. The installation is in progress; I will let you know when it is done.

show ethernet-switching table interface ge-2/0/12    

MAC database for interface ge-2/0/12

MAC database for interface ge-2/0/12.0

MAC flags (S - static MAC, D - dynamic MAC, L - locally learned, P - Persistent static
           SE - statistics enabled, NM - non configured MAC, R - remote PE MAC, O - ovsdb MAC)


Ethernet switching table : 109 entries, 109 learned
Routing instance : default-switch
    Vlan                MAC                 MAC         Age    Logical
    name                address             flags              interface
    private1-d-codfw    d0:94:66:5f:4c:7c   D             -   ge-2/0/12.0
show interfaces ge-2/0/12 descriptions 
Interface       Admin Link Description
ge-2/0/12       up    up   backup2001:eth1
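To check which MAC the switch has learned on a port without reading the table by eye, the rows can be parsed. This is only a sketch: `learned_macs` is a hypothetical helper, and it assumes the five-column row layout shown in the output above.

```python
import re

MAC_RE = re.compile(r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}", re.IGNORECASE)


def learned_macs(table_output: str) -> list:
    """Return (vlan, mac, interface) tuples from an ethernet-switching table dump."""
    rows = []
    for line in table_output.splitlines():
        fields = line.split()
        # Data rows have five columns with a MAC in the second one;
        # header and legend lines do not match and are skipped.
        if len(fields) == 5 and MAC_RE.fullmatch(fields[1]):
            vlan, mac, _flags, _age, interface = fields
            rows.append((vlan, mac.lower(), interface))
    return rows


# Rows copied from the switch output above.
SAMPLE = """\
    Vlan                MAC                 MAC         Age    Logical
    name                address             flags              interface
    private1-d-codfw    d0:94:66:5f:4c:7c   D             -   ge-2/0/12.0
"""

if __name__ == "__main__":
    for vlan, mac, interface in learned_macs(SAMPLE):
        print(interface, vlan, mac)
```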

@MoritzMuehlenhoff the installation is complete; it is all yours.

First puppet run complete

Papaul updated the task description. (Show Details)Aug 29 2018, 3:42 PM

Thanks! I've installed my backported test kernel and figured out which additional firmware we need. It looks promising: the driver gets loaded along with the firmware:

jmm@backup2001:~$ uname -a
Linux backup2001 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u4+wmf1 (2018-08-28) x86_64 GNU/Linux
jmm@backup2001:~$ sudo dmesg | grep QLogic
[    5.047239] QLogic FastLinQ 4xxxx Core Module qed 8.10.9.20
[    5.073768] qede_init: QLogic FastLinQ 4xxxx Ethernet Driver qede 8.10.9.20
jmm@backup2001:~$

@Papaul: Now that we have an installed system with that NIC, can you please switch back the cable to the 10G card, so that I can run some tests with it?

@MoritzMuehlenhoff both the 10GB and 1GB NICs are already connected to the switch

10 GB NIC is on xe-2/0/11
1GB NIC is on ge-2/0/12

The hardware side is fixed, but I'm seeing a kernel error, looking into it.

Dzahn added a comment.Aug 31 2018, 7:31 PM

icinga is reporting that on backup2001 there is "enp59s0f1 reporting no carrier." since about 11h 9m

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=backup2001&service=configured+eth

@Dzahn you can disable those alerts; @MoritzMuehlenhoff is running some tests on that server.

Dzahn added a comment.Aug 31 2018, 8:31 PM

Yep, thanks Papaul. I realized after making the comment here. Done.

I've created a custom Linux 4.14 kernel which worked fine in my tests with an updated firmware-qlogic. I've also created a netboot image based on Linux 4.14.
It's based on the last version which was in unstable for 4.14.x (4.14.17), but that's good enough for initial tests. If it's working fine and we decide to keep using it, I'll update the packages to the latest 4.14.x kernel.

@Papaul Could you please revert the network config so that it tries to PXE-boot from the QLogic NIC (MAC: F4 E9 D4 74 FD 78) again?

Change 457923 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Change MAC address to test OS install on 10GB NIC

https://gerrit.wikimedia.org/r/457923

Change 457923 merged by Dzahn:
[operations/puppet@production] DHCP: Change MAC address to test OS install on 10GB NIC

https://gerrit.wikimedia.org/r/457923

@Papaul: That's expected; this also needs a change to the DHCP config to use the netboot image based on 4.14, e.g. by using the patch at https://gerrit.wikimedia.org/r/457930 or setting this manually on install2002. I'll test this tomorrow (or feel free to go ahead!). The installation still won't be 100% complete, as the 4.14 kernel is not yet uploaded to apt.wikimedia.org and we need another patch to install it in late-setup. With the current image it uses 4.14 in the installer, but then installs the 4.9 kernel in the end, which lacks the updated driver.

Papaul added a comment.Sep 4 2018, 4:15 PM

@MoritzMuehlenhoff ok I will change the install in DHCP

Change 457939 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: let backup2001 use stretch414-installer

https://gerrit.wikimedia.org/r/457939

Change 457939 merged by Dzahn:
[operations/puppet@production] install_server: let backup2001 use stretch414-installer

https://gerrit.wikimedia.org/r/457939

Papaul added a comment.Sep 4 2018, 4:38 PM

Here is what I get now

Papaul added a comment.Sep 4 2018, 4:40 PM

@MoritzMuehlenhoff i will leave it to you so you can play with it tomorrow

Change 465434 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Remove tweaks to use Linux 4.14 on backup2001

https://gerrit.wikimedia.org/r/465434

Papaul added a comment.Oct 9 2018, 4:15 PM

@MoritzMuehlenhoff the new NIC is in place

Pasting the (trimmed down) IRC discussion to this task to keep everyone in the loop:

<robh> there is no 830 replacement any longer for the 840 controller
<robh> they support the 730 and 740 swapping
<robh> but the 840 runs the new shelves at 12G not 6G and cannot backward compatible swap
<moritzm> the 740/840 controllers are not yet supported in stretch, I backported the driver to the 4.9 kernel, but while it got merged in git, there's not yet a stretch install image with that driver, scheduling for the next point release is still WIP, maybe in a month
<moritzm> https://lists.debian.org/debian-release/2018/10/msg00168.html
<moritzm> I'd also like to use backup2001 as a test host once a new official 4.9 kernel has been uploaded to stretch-proposed-updates, so far my tests have been done with an older test kernel I built about a month ago

Linux 4.9.130-1 (which also contains the backport of the H840 Perc controller I made) has now been uploaded to "stretch-proposed-updates", the staging directory for packages destined for a stretch point release. I just tried to install it, but unfortunately it's currently unavailable as it ended up in the NEW queue (a package queue mechanism which gets triggered when package names change, which was the case for the kernel due to changes in the kernel ABI). Once that has been resolved (it needs manual processing by the Debian FTP masters), I can install the new kernel and run some tests. Given that Papaul switched the NIC to the Broadcom model, it should then all be supported in stretch.

The revised Debian installer will only be available by the time of the point release, but given that the system has been installed via our Stretch 4.14 prototype already, that's not an issue for us. We could even use the same hack to install backup1001 if it's needed before the Stretch 9.6 point release (which is not yet finally scheduled).

4.9.130-1 seems fine on that host.

That point release has happened and I upgraded our netinst images earlier today, so this should be fine to re-install now.

Change 465434 merged by Muehlenhoff:
[operations/puppet@production] Remove tweaks to use Linux 4.14 on backup2001

https://gerrit.wikimedia.org/r/465434

Script wmf-auto-reimage was launched by banyek on cumin1001.eqiad.wmnet for hosts:

['backup2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201811271341_banyek_3136.log.

Banyek added a subscriber: Banyek.Nov 27 2018, 1:43 PM

It wasn't initiated by me; I don't know how my user showed up.

Completed auto-reimage of hosts:

['backup2001.codfw.wmnet']

Of which those FAILED:

['backup2001.codfw.wmnet']
akosiaris closed this task as Resolved.Nov 27 2018, 3:54 PM

Box is reimaged and is up and running. megacli sees the controller and the disks:

akosiaris@backup2001:~$ sudo megacli -AdpAllInfo -a0
                                     
Adapter #0

==============================================================================
                    Versions
                ================
Product Name    : PERC H840 Adapter 
Serial No       : 825001H
FW Package Build: 50.3.0-1022
...
                Device Present
                ================
Virtual Drives    : 0 
  Degraded        : 0 
  Offline         : 0 
Physical Devices  : 26 
  Disks           : 24 
  Critical Disks  : 0 
  Failed Disks    : 0

So I'd say this is successfully resolved. Thanks @MoritzMuehlenhoff!
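For repeated checks of this kind, the numeric counters in the megacli output could be parsed rather than read by eye. A minimal sketch, assuming the `Key : Value` layout shown in the excerpt above; `parse_device_counts` and `disks_healthy` are hypothetical helpers, and the full `-AdpAllInfo` output may contain repeated keys that this simple dict would overwrite.

```python
def parse_device_counts(output: str) -> dict:
    """Collect 'Key : <integer>' lines from megacli-style output into a dict."""
    counts = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        # Keep only purely numeric values; version strings etc. are skipped.
        if sep and value.isdigit():
            counts[key.strip()] = int(value)
    return counts


def disks_healthy(counts: dict) -> bool:
    """True when both the critical and failed disk counters are present and zero."""
    return counts.get("Critical Disks") == 0 and counts.get("Failed Disks") == 0


# Lines copied from the megacli excerpt above.
SAMPLE = """\
Product Name    : PERC H840 Adapter
Serial No       : 825001H
FW Package Build: 50.3.0-1022
Virtual Drives    : 0
  Degraded        : 0
  Offline         : 0
Physical Devices  : 26
  Disks           : 24
  Critical Disks  : 0
  Failed Disks    : 0
"""

if __name__ == "__main__":
    counts = parse_device_counts(SAMPLE)
    print(counts["Disks"], disks_healthy(counts))
```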

akosiaris updated the task description. (Show Details)Nov 27 2018, 3:55 PM