
(Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet
Closed, ResolvedPublic

Description

This task will track the racking and setup of kafka-jumbo100[789]

Info off procurement task

Hostnames: kafka-jumbo100[789]
Racking Proposal: Existing systems are in A1, A2, B1, B2, C4, D1; try to avoid these racks. (Try to put 2 of these in row D, as it is the least populated for kafka-jumbo.)
Networking/Subnet/VLAN/IP: 10G network, single connection, internal vlan per row.
Partitioning/Raid: match existing kafka-jumbo
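
For readers unfamiliar with the bracket shorthand used in the hostnames on this task (both `100[789]` and `100[7-9]` appear in the patch titles), here is a minimal sketch of how it expands. The function name `expand_hosts` is hypothetical and not part of any WMF tooling; it handles a single bracket group only.

```python
import re


def _expand_class(chars: str) -> list[str]:
    """Expand the inside of a bracket group: '789' or '7-9' -> ['7', '8', '9']."""
    out = []
    i = 0
    while i < len(chars):
        # A 'x-y' triple is treated as an inclusive character range.
        if i + 2 < len(chars) and chars[i + 1] == "-":
            out.extend(chr(c) for c in range(ord(chars[i]), ord(chars[i + 2]) + 1))
            i += 3
        else:
            out.append(chars[i])
            i += 1
    return out


def expand_hosts(pattern: str) -> list[str]:
    """Expand one [...] group in a hostname pattern (hypothetical helper)."""
    match = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", pattern)
    if match is None:
        return [pattern]  # nothing to expand
    prefix, chars, suffix = match.groups()
    return [f"{prefix}{c}{suffix}" for c in _expand_class(chars)]


for host in expand_hosts("kafka-jumbo100[789].eqiad.wmnet"):
    print(host)
```

Both spellings used on this task expand to the same three hosts, kafka-jumbo1007 through kafka-jumbo1009.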

Server setup checklists

kafka-jumbo1007:

  • - receive in system on procurement task T232016
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

kafka-jumbo1008:

  • - receive in system on procurement task T232016
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

kafka-jumbo1009:

  • - receive in system on procurement task T232016
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

Once the systems have been installed, are calling into puppet, and are staged in netbox, this task can be resolved.

Event Timeline

RobH added a parent task: Unknown Object (Task).Feb 6 2020, 5:23 PM
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.
RobH renamed this task from rack/setup/install kafka-jumbo100[789].eqiad.wmnet to (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet.Feb 24 2020, 9:10 PM

Racked hosts, cabled, updated netbox.

handing off to Chris for bios configuration

host             rack  unit  switchport
kafka-jumbo1007  c6    33    35
kafka-jumbo1009  d1    7     7
kafka-jumbo1009  d4    15    11
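
As a rough cross-check of this initial placement against the racking proposal in the description, a small sketch follows. The function name and data layout are hypothetical; the rack list and rows are copied from this task as posted (including the hostname that appears twice), and the proposal only says "try to avoid", so a reported conflict is a hint rather than an error.

```python
# Racks the proposal asks to avoid, copied from the task description.
AVOID = {"A1", "A2", "B1", "B2", "C4", "D1"}

# (host, rack) pairs exactly as posted in the racking comment.
PLACEMENTS = [
    ("kafka-jumbo1007", "c6"),
    ("kafka-jumbo1009", "d1"),
    ("kafka-jumbo1009", "d4"),
]


def check_racking(placements, avoid, preferred_row="D", wanted_in_row=2):
    """Return (racks that are on the avoid list, whether enough hosts sit in the preferred row)."""
    racks = [rack.upper() for _, rack in placements]
    conflicts = sorted(rack for rack in racks if rack in avoid)
    in_row = sum(1 for rack in racks if rack.startswith(preferred_row))
    return conflicts, in_row >= wanted_in_row


conflicts, row_ok = check_racking(PLACEMENTS, AVOID)
print("racks on the avoid list:", conflicts)
print("at least 2 hosts in row D:", row_ok)
```

With the rows as posted, the sketch reports D1 as a rack from the avoid list and confirms that two hosts are in row D.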

Change 579295 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Add mgmt/production dns for kafka-jumbo100[789]

https://gerrit.wikimedia.org/r/579295

Change 579295 merged by Cmjohnson:
[operations/dns@master] Add mgmt/production dns for kafka-jumbo100[789]

https://gerrit.wikimedia.org/r/579295

Change 579394 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Fix typo for kafka-jumbo1008

https://gerrit.wikimedia.org/r/579394

Change 579394 merged by Cmjohnson:
[operations/dns@master] Fix typo for kafka-jumbo1008

https://gerrit.wikimedia.org/r/579394

We have the wrong mgmt password on all 3 nodes.

Change 580132 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Add kafka-jumbo100[7-9] to netboot.cfg and dhcpd file

https://gerrit.wikimedia.org/r/580132

Change 580132 merged by Cmjohnson:
[operations/puppet@production] Add kafka-jumbo100[7-9] to netboot.cfg and dhcpd file

https://gerrit.wikimedia.org/r/580132

Change 584673 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Updating dhchp file for kafka-jumbo100[789] to reflect correct eth port

https://gerrit.wikimedia.org/r/584673

Change 584673 merged by Cmjohnson:
[operations/puppet@production] Updating dhchp file for kafka-jumbo100[789] to reflect correct eth port

https://gerrit.wikimedia.org/r/584673

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

kafka-jumbo1009.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202003301918_cmjohnson_250432_kafka-jumbo1009_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

kafka-jumbo1008.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202003301919_cmjohnson_250534_kafka-jumbo1008_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

kafka-jumbo1007.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202003301920_cmjohnson_250672_kafka-jumbo1007_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['kafka-jumbo1008.eqiad.wmnet']

Of which those FAILED:

['kafka-jumbo1008.eqiad.wmnet']

Completed auto-reimage of hosts:

['kafka-jumbo1009.eqiad.wmnet']

Of which those FAILED:

['kafka-jumbo1009.eqiad.wmnet']

Completed auto-reimage of hosts:

['kafka-jumbo1007.eqiad.wmnet']

Of which those FAILED:

['kafka-jumbo1007.eqiad.wmnet']

These are failing during install. @elukey, can you verify the raid configuration, please?

Failed to partition the selected disk

This probably happened because there are too many (primary) partitions in the partition table.

Didn't have the time but I'll do it tomorrow and I'll report back!

FYI, kafka-jumbo1008 switch port has been flapping and flooding logs.

Please disable the switch port if the host is neither in production nor being worked on.

The host seems to have some issue: when I force PXE boot after a powercycle, it gets stuck after trying to boot debian d-i. From install1003's logs I can see the DHCP happening correctly, so I am wondering if it is a problem of console redirection not working.

Nope, serial settings are good, but I have powered it down to avoid spamming logs while we work on partman.

This is what is displayed before the error message that Chris pointed out:

┌─────────────────────────┤ [!] Partition disks ├─────────────────────────┐
│                                                                         │
│ You may use the whole volume group for guided partitioning, or part     │
│ of it. If you use only part of it, or if you add more disks later,      │
│ then you will be able to grow logical volumes later using the LVM       │
│ tools, so using a smaller part of the volume group at installation      │
│ time may offer more flexibility.                                        │
│                                                                         │
│ The minimum size of the selected partitioning recipe is 22.1 TB (or     │
│ 92%); please note that the packages you choose to install may require   │
│ more space than this. The maximum available size is 24.0 TB.            │
│                                                                         │
│ Hint: "max" can be used as a shortcut to specify the maximum size, or   │
│ enter a percentage (e.g. "20%") to use that percentage of the maximum   │
│                                                                         │
│ 24.0 TB______________________________________________________________   │
│                                                                         │
│     <Go Back>                                            <Continue>

┌─────────────────────────┤ [!] Partition disks ├─────────────────────────┐
│                                                                         │
│ You may use the whole volume group for guided partitioning, or part     │
│ of it. If you use only part of it, or if you add more disks later,      │
│ then you will be able to grow logical volumes later using the LVM       │
│ tools, so using a smaller part of the volume group at installation      │
│ time may offer more flexibility.                                        │
│                                                                         │
│ The minimum size of the selected partitioning recipe is 700.0 GB (or    │
│ 147%); please note that the packages you choose to install may          │
│ require more space than this. The maximum available size is 474.6 GB.   │
│                                                                         │
│ Hint: "max" can be used as a shortcut to specify the maximum size, or   │
│ enter a percentage (e.g. "20%") to use that percentage of the maximum   │
│                                                                         │
│ 474.6 GB_____________________________________________________________   │
│                                                                         │
│     <Go Back>                                            <Continue>
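
The percentages in the two dialogs above can be reproduced with simple arithmetic. The sketch below is only a sanity check, not how debian-installer computes the value internally; `recipe_fit` is a hypothetical helper and rounding to the nearest integer is an assumption.

```python
def recipe_fit(minimum_gb: float, available_gb: float) -> tuple[int, bool]:
    """Return (minimum as a percentage of available space, whether the recipe fits)."""
    percent = round(100 * minimum_gb / available_gb)
    return percent, minimum_gb <= available_gb


# First dialog: 22.1 TB minimum, 24.0 TB available.
print(recipe_fit(22.1 * 1000, 24.0 * 1000))

# Second dialog: 700.0 GB minimum, 474.6 GB available.
print(recipe_fit(700.0, 474.6))
```

The first case gives 92% and fits. The second gives 147%: the recipe's minimum is larger than the volume group, which is consistent with the partitioning step failing on that volume group.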

Change 587560 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] autoinstall: fix kafka-jumbo.cfg for Buster

https://gerrit.wikimedia.org/r/587560

@Cmjohnson I powered 1008 up again and I don't see any DHCP ACK in syslog when PXE installing:

Apr  8 17:14:09 install1003 dhcpd[6278]: DHCPOFFER on 10.64.48.121 to b0:26:28:f0:8e:7e via 10.64.48.3
Apr  8 17:14:09 install1003 dhcpd[6278]: DHCPOFFER on 10.64.48.121 to b0:26:28:f0:8e:7e via 10.64.48.2
Apr  8 17:14:26 install1003 dhcpd[6278]: DHCPOFFER on 10.64.48.121 to b0:26:28:f0:8e:7e via 10.64.48.2
Apr  8 17:14:26 install1003 dhcpd[6278]: DHCPOFFER on 10.64.48.121 to b0:26:28:f0:8e:7e via 10.64.48.3

From the F2 menu:

Integrated NIC 1 Port 1: Broadcom Adv. Dual 10Gb Ethernet -
 B0:26:28:F0:8E:80
Integrated NIC 1 Port 2: Broadcom Adv. Dual 10Gb Ethernet -
 B0:26:28:F0:8E:81
Integrated NIC 1 Port 3: Broadcom Gigabit Ethernet BCM5720 -
 B0:26:28:F0:8E:7E
Integrated NIC 1 Port 4: Broadcom Gigabit Ethernet BCM5720 -
 B0:26:28:F0:8E:7F

B0:26:28:F0:8E:7E seems to be the 1G NIC, which shows up as connected, but we would need 10G, no?
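
The same check can be done mechanically. The sketch below scans the dhcpd lines pasted above for MAC addresses that received a DHCPOFFER but never a DHCPACK, then looks them up in the NIC list from the F2 menu. All names are hypothetical, the regexes only assume that each line contains a `DHCP...` message type and a colon-separated MAC, and the data is limited to the excerpt in this comment.

```python
import re
from collections import defaultdict

MSG_RE = re.compile(r"\b(DHCP[A-Z]+)\b")
MAC_RE = re.compile(r"\b(?:[0-9a-f]{2}:){5}[0-9a-f]{2}\b", re.IGNORECASE)

# Log excerpt copied from the comment above.
LOG_LINES = [
    "Apr  8 17:14:09 install1003 dhcpd[6278]: DHCPOFFER on 10.64.48.121 to b0:26:28:f0:8e:7e via 10.64.48.3",
    "Apr  8 17:14:09 install1003 dhcpd[6278]: DHCPOFFER on 10.64.48.121 to b0:26:28:f0:8e:7e via 10.64.48.2",
    "Apr  8 17:14:26 install1003 dhcpd[6278]: DHCPOFFER on 10.64.48.121 to b0:26:28:f0:8e:7e via 10.64.48.2",
    "Apr  8 17:14:26 install1003 dhcpd[6278]: DHCPOFFER on 10.64.48.121 to b0:26:28:f0:8e:7e via 10.64.48.3",
]

# NIC list copied from the F2 menu, keyed by lower-cased MAC.
NIC_PORTS = {
    "b0:26:28:f0:8e:80": "Integrated NIC 1 Port 1: Broadcom Adv. Dual 10Gb Ethernet",
    "b0:26:28:f0:8e:81": "Integrated NIC 1 Port 2: Broadcom Adv. Dual 10Gb Ethernet",
    "b0:26:28:f0:8e:7e": "Integrated NIC 1 Port 3: Broadcom Gigabit Ethernet BCM5720",
    "b0:26:28:f0:8e:7f": "Integrated NIC 1 Port 4: Broadcom Gigabit Ethernet BCM5720",
}


def message_types_by_mac(lines):
    """Map each MAC address to the set of DHCP message types logged for it."""
    seen = defaultdict(set)
    for line in lines:
        msg = MSG_RE.search(line)
        mac = MAC_RE.search(line)
        if msg and mac:
            seen[mac.group(0).lower()].add(msg.group(1))
    return dict(seen)


def offered_but_never_acked(lines):
    """Return MACs that got a DHCPOFFER but no DHCPACK in the given lines."""
    return sorted(
        mac
        for mac, types in message_types_by_mac(lines).items()
        if "DHCPOFFER" in types and "DHCPACK" not in types
    )


for mac in offered_but_never_acked(LOG_LINES):
    print(mac, "->", NIC_PORTS.get(mac, "unknown port"))
```

On this excerpt the only MAC that stalls after the offer is the one on Port 3, the Gigabit BCM5720 interface, not one of the 10Gb ports.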

I am also not able to ssh to kafka-jumbo1007.mgmt.eqiad.wmnet :(

Change 587560 merged by Elukey:
[operations/puppet@production] autoinstall: fix kafka-jumbo.cfg for Buster

https://gerrit.wikimedia.org/r/587560

Summary:

  • the partman recipe is fixed
  • 1009 seems good
  • 1007's mgmt is not reachable
  • 1008's mgmt works, but I can't pxe boot. There seems to be an issue with DHCP, see T244506#6040746

These are on 1G racks. If you need 10G they will have to be moved.

Yep we'd need 10G, but regardless it seems that mgmt consoles are not reachable or can't pxe boot :(

The servers have been moved to 10G racks. In order to keep 2 in row D, kafka-jumbo1008/1009 are in the same rack, D7. Once we are able to get a 3rd switch in there, we can move one of them to D4. I verified the raid configurations: the 2 smaller disks are RAID 1 and the 12 disks are RAID 10, according to the partman recipe.

Network switch has been updated, old entries removed and ports disabled.

Change 595208 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Fixing dhcp entries for kafkajumbo1007-9 to 10G

https://gerrit.wikimedia.org/r/595208

Change 595208 merged by Cmjohnson:
[operations/puppet@production] Fixing dhcp entries for kafkajumbo1007-9 to 10G

https://gerrit.wikimedia.org/r/595208

Change 595209 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Adding kafka-jumbo100[789] to site.pp insetup role

https://gerrit.wikimedia.org/r/595209

Change 595209 merged by Cmjohnson:
[operations/puppet@production] Adding kafka-jumbo100[789] to site.pp insetup role

https://gerrit.wikimedia.org/r/595209

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

kafka-jumbo1007.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005081856_cmjohnson_99703_kafka-jumbo1007_eqiad_wmnet.log.

unnamed.jpg (3×4 px, 1 MB)
@elukey, it doesn't appear to be a partman thing. Attached is a picture of the console monitor during the initial image.

Completed auto-reimage of hosts:

['kafka-jumbo1007.eqiad.wmnet']

Of which those FAILED:

['kafka-jumbo1007.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

kafka-jumbo1007.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005111609_elukey_44472_kafka-jumbo1007_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['kafka-jumbo1007.eqiad.wmnet']

Of which those FAILED:

['kafka-jumbo1007.eqiad.wmnet']

I had to abort the wmf reimage script because it wasn't getting to the point of running puppet; I then manually accepted the new puppet cert and ran puppet on 1007.

Partitions look good:

elukey@kafka-jumbo1007:~$ df -h
Filesystem            Size  Used Avail Use% Mounted on
udev                   32G     0   32G   0% /dev
tmpfs                 6.3G   17M  6.3G   1% /run
/dev/mapper/vg0-root  351G  1.6G  332G   1% /
tmpfs                  32G     0   32G   0% /dev/shm
tmpfs                 5.0M     0  5.0M   0% /run/lock
tmpfs                  32G     0   32G   0% /sys/fs/cgroup
/dev/mapper/vg1-srv    18T   24K   17T   1% /srv
tmpfs                 6.3G     0  6.3G   0% /run/user/13926

elukey@kafka-jumbo1007:~$ sudo pvs
  PV         VG  Fmt  Attr PSize   PFree
  /dev/sda2  vg0 lvm2 a--  446.34g 89.27g
  /dev/sdb1  vg1 lvm2 a--  <21.83t <4.37t

elukey@kafka-jumbo1007:~$ sudo lvs
  LV   VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root vg0 -wi-ao---- 357.07g
  srv  vg1 -wi-ao----  17.46t

NIC is also 10G, all good, will proceed with 1008/9.
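
The `pvs` and `lvs` figures above are consistent with each other: physical size minus free space gives the logical volume size on each volume group. A minimal sketch of that arithmetic follows; the helper names are hypothetical, and the `<` prefix that LVM prints on rounded values is simply stripped, so results are approximate.

```python
def parse_size(text: str) -> tuple[float, str]:
    """Split an LVM size such as '<21.83t' into (value, unit suffix)."""
    text = text.strip().lstrip("<")
    return float(text[:-1]), text[-1]


def allocation(psize: str, pfree: str) -> tuple[float, str, int]:
    """Return (allocated size, unit, allocated share of the PV in percent)."""
    size, unit = parse_size(psize)
    free, free_unit = parse_size(pfree)
    if unit != free_unit:
        raise ValueError("mixed units are not handled in this sketch")
    used = round(size - free, 2)
    return used, unit, round(100 * used / size)


# Values copied from the pvs output above.
print("vg0:", allocation("446.34g", "89.27g"))
print("vg1:", allocation("<21.83t", "<4.37t"))
```

This yields 357.07g for vg0 and 17.46t for vg1, matching the `root` and `srv` sizes reported by `lvs`, with each logical volume taking roughly 80% of its volume group.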

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

kafka-jumbo1008.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005111701_elukey_49836_kafka-jumbo1008_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['kafka-jumbo1008.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

kafka-jumbo1009.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005111736_elukey_56660_kafka-jumbo1009_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['kafka-jumbo1009.eqiad.wmnet']

and were ALL successful.

elukey updated the task description.

@elukey looks like kafka-jumbo1007 is failing to execute any of the NRPE commands, while kafka-jumbo1008 and 1009, for instance, are all green.
I have ack'ed 1007 on icinga for now; can you double-check it?

Thanks for the ping! Restarted the nagios server on the host and forced a recheck from icinga, let's see if it works.

Looks good now, also removed the downtime/acks!