⚓ T244506 (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet

Subject	Repo	Branch	Lines +/-
Adding kafka-jumbo100[789] to site.pp insetup role	operations/puppet	production	+4 -0
Fixing dhcp entries for kafkajumbo1007-9 to 10G	operations/puppet	production	+3 -3
autoinstall: fix kafka-jumbo.cfg for Buster	operations/puppet	production	+32 -42
Updating dhchp file for kafka-jumbo100[789] to reflect correct eth port	operations/puppet	production	+3 -3
Add kafka-jumbo100[7-9] to netboot.cfg and dhcpd file	operations/puppet	production	+16 -1
Fix typo for kafka-jumbo1008	operations/dns	master	+1 -1
Add mgmt/production dns for kafka-jumbo100[789]	operations/dns	master	+18 -2

Restricted Application added a project: SRE. Feb 6 2020, 5:23 PM

RobH added a parent task: Unknown Object (Task). Feb 6 2020, 5:23 PM

RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.

RobH unsubscribed.

test

These hosts need to be in 10G racks (see https://phabricator.wikimedia.org/T236327) :)

elukey mentioned this in T244211: Analytics Hardware for Fiscal Year 2019/2020. Feb 6 2020, 6:02 PM

Jclark-ctr updated the task description. Feb 13 2020, 12:44 AM

• Nuria moved this task from Incoming to Radar on the Analytics board. Feb 17 2020, 5:00 PM

RobH renamed this task from rack/setup/install kafka-jumbo100[789].eqiad.wmnet to (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet. Feb 24 2020, 9:10 PM

Racked hosts, cabled, updated Netbox.

Handing off to Chris for BIOS configuration.

host             rack  unit  switchport
kafka-jumbo1007  c6    33    35
kafka-jumbo1009  d1    7     7
kafka-jumbo1009  d4    15    11

Jclark-ctr reassigned this task from Jclark-ctr to Christopher. Mar 11 2020, 9:45 PM

Jclark-ctr subscribed.

Change 579295 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Add mgmt/production dns for kafka-jumbo100[789]

https://gerrit.wikimedia.org/r/579295

gerritbot added a project: Patch-For-Review. Mar 12 2020, 3:29 PM

Change 579295 merged by Cmjohnson:
[operations/dns@master] Add mgmt/production dns for kafka-jumbo100[789]

https://gerrit.wikimedia.org/r/579295

Maintenance_bot removed a project: Patch-For-Review. Mar 12 2020, 4:11 PM

• Cmjohnson updated the task description. Mar 12 2020, 4:54 PM

Ottomata mentioned this in T247561: kafka-jumbo1006 and stat1005 network issues. Mar 12 2020, 8:27 PM

Change 579394 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Fix typo for kafka-jumbo1008

https://gerrit.wikimedia.org/r/579394

Change 579394 merged by Cmjohnson:
[operations/dns@master] Fix typo for kafka-jumbo1008

https://gerrit.wikimedia.org/r/579394

Maintenance_bot removed a project: Patch-For-Review. Mar 12 2020, 9:10 PM

We have the wrong mgmt password on all 3 nodes.

elukey updated the task description. Mar 16 2020, 11:06 AM

wiki_willy reassigned this task from Christopher to • Cmjohnson. Mar 16 2020, 7:40 PM

wiki_willy added a subscriber: Christopher.

Change 580132 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Add kafka-jumbo100[7-9] to netboot.cfg and dhcpd file

https://gerrit.wikimedia.org/r/580132

gerritbot added a project: Patch-For-Review. Mar 16 2020, 10:07 PM

Change 580132 merged by Cmjohnson:
[operations/puppet@production] Add kafka-jumbo100[7-9] to netboot.cfg and dhcpd file

https://gerrit.wikimedia.org/r/580132

Maintenance_bot removed a project: Patch-For-Review. Mar 16 2020, 11:10 PM

• Cmjohnson updated the task description. Mar 30 2020, 6:08 PM

Change 584673 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Updating dhchp file for kafka-jumbo100[789] to reflect correct eth port

https://gerrit.wikimedia.org/r/584673

gerritbot added a project: Patch-For-Review. Mar 30 2020, 6:21 PM

Change 584673 merged by Cmjohnson:
[operations/puppet@production] Updating dhchp file for kafka-jumbo100[789] to reflect correct eth port

https://gerrit.wikimedia.org/r/584673

Maintenance_bot removed a project: Patch-For-Review. Mar 30 2020, 7:10 PM

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

kafka-jumbo1009.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202003301918_cmjohnson_250432_kafka-jumbo1009_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

kafka-jumbo1008.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202003301919_cmjohnson_250534_kafka-jumbo1008_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

kafka-jumbo1007.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202003301920_cmjohnson_250672_kafka-jumbo1007_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['kafka-jumbo1008.eqiad.wmnet']

Of which those FAILED:

['kafka-jumbo1008.eqiad.wmnet']

Completed auto-reimage of hosts:

['kafka-jumbo1009.eqiad.wmnet']

Of which those FAILED:

['kafka-jumbo1009.eqiad.wmnet']

Completed auto-reimage of hosts:

['kafka-jumbo1007.eqiad.wmnet']

Of which those FAILED:

['kafka-jumbo1007.eqiad.wmnet']

These are failing during install. @elukey, can you verify the RAID configuration please?

Failed to partition the selected disk
This probably happened because there are too many (primary) partitions in the partition table.

In T244506#6022851, @Cmjohnson wrote:
These are failing during install. @elukey, can you verify the RAID configuration please?

Failed to partition the selected disk
This probably happened because there are too many (primary) partitions in the partition table.

Didn't have the time but I'll do it tomorrow and I'll report back!

FYI, kafka-jumbo1008 switch port has been flapping and flooding logs.

Please disable the switch port if the host is neither in production nor being worked on.

In T244506#6038826, @ayounsi wrote:

FYI, kafka-jumbo1008 switch port has been flapping and flooding logs.

Please disable the switch port if the host is neither in production nor being worked on.

The host seems to have some issue, namely when I force PXE boot after a powercycle it gets stuck after trying to boot debian d-i. From install1003's logs I can see the DHCP happening correctly, I am wondering if it is a problem of console redirection not working.

Nope, serial settings are good, but I have powered it down to avoid spamming logs while we work on partman.

This is what is displayed before the error message that Chris pointed out:

┌─────────────────────────┤ [!] Partition disks ├─────────────────────────┐
│                                                                         │
│ You may use the whole volume group for guided partitioning, or part     │
│ of it. If you use only part of it, or if you add more disks later,      │
│ then you will be able to grow logical volumes later using the LVM       │
│ tools, so using a smaller part of the volume group at installation      │
│ time may offer more flexibility.                                        │
│                                                                         │
│ The minimum size of the selected partitioning recipe is 22.1 TB (or     │
│ 92%); please note that the packages you choose to install may require   │
│ more space than this. The maximum available size is 24.0 TB.            │
│                                                                         │
│ Hint: "max" can be used as a shortcut to specify the maximum size, or   │
│ enter a percentage (e.g. "20%") to use that percentage of the maximum   │
│                                                                         │
│ 24.0 TB______________________________________________________________   │
│                                                                         │
│     <Go Back>                                            <Continue>

┌─────────────────────────┤ [!] Partition disks ├─────────────────────────┐
│                                                                         │
│ You may use the whole volume group for guided partitioning, or part     │
│ of it. If you use only part of it, or if you add more disks later,      │
│ then you will be able to grow logical volumes later using the LVM       │
│ tools, so using a smaller part of the volume group at installation      │
│ time may offer more flexibility.                                        │
│                                                                         │
│ The minimum size of the selected partitioning recipe is 700.0 GB (or    │
│ 147%); please note that the packages you choose to install may          │
│ require more space than this. The maximum available size is 474.6 GB.   │
│                                                                         │
│ Hint: "max" can be used as a shortcut to specify the maximum size, or   │
│ enter a percentage (e.g. "20%") to use that percentage of the maximum   │
│                                                                         │
│ 474.6 GB_____________________________________________________________   │
│                                                                         │
│     <Go Back>                                            <Continue>
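The percentages on the two installer screens can be checked by hand. The following is a minimal sketch, not part of the installer: `recipe_fit` is a hypothetical helper, and it assumes the percentage shown is simply the recipe's minimum size divided by the maximum available size. It shows why the large volume group accepts the recipe while the small one cannot.

```python
def recipe_fit(min_recipe_gb, available_gb):
    """Return (percent of available space the recipe needs, whether it fits).

    Hypothetical helper: assumes the installer's percentage is
    minimum recipe size / maximum available size.
    """
    pct = round(min_recipe_gb / available_gb * 100)
    return pct, min_recipe_gb <= available_gb


# Figures copied from the two installer screens above (TB converted to GB).
data_vg = recipe_fit(22.1 * 1000, 24.0 * 1000)  # large volume group
root_vg = recipe_fit(700.0, 474.6)              # small volume group

print("data volume group:", data_vg)
print("root volume group:", root_vg)
```

A recipe whose minimum exceeds 100% of the volume group cannot be satisfied, which is consistent with the second screen being where the install gets stuck.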

Change 587560 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] autoinstall: fix kafka-jumbo.cfg for Buster

https://gerrit.wikimedia.org/r/587560

gerritbot added a project: Patch-For-Review. Apr 8 2020, 4:59 PM

@Cmjohnson I powered 1008 up again and I don't see any DHCP ACK in syslog when PXE installing:

Apr  8 17:14:09 install1003 dhcpd[6278]: DHCPOFFER on 10.64.48.121 to b0:26:28:f0:8e:7e via 10.64.48.3
Apr  8 17:14:09 install1003 dhcpd[6278]: DHCPOFFER on 10.64.48.121 to b0:26:28:f0:8e:7e via 10.64.48.2
Apr  8 17:14:26 install1003 dhcpd[6278]: DHCPOFFER on 10.64.48.121 to b0:26:28:f0:8e:7e via 10.64.48.2
Apr  8 17:14:26 install1003 dhcpd[6278]: DHCPOFFER on 10.64.48.121 to b0:26:28:f0:8e:7e via 10.64.48.3

From the F2 menu:

Integrated NIC 1 Port 1: Broadcom Adv. Dual 10Gb Ethernet -
 B0:26:28:F0:8E:80
Integrated NIC 1 Port 2: Broadcom Adv. Dual 10Gb Ethernet -
 B0:26:28:F0:8E:81
Integrated NIC 1 Port 3: Broadcom Gigabit Ethernet BCM5720 -
 B0:26:28:F0:8E:7E
Integrated NIC 1 Port 4: Broadcom Gigabit Ethernet BCM5720 -
 B0:26:28:F0:8E:7F

B0:26:28:F0:8E:7E seems to be the 1G NIC, which shows up as connected, but we would need 10G, no?
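The same check can be sketched in a few lines of Python. This is only an illustration of the reasoning above, assuming the dhcpd log lines look like the ones pasted from install1003 (`DHCPOFFER on <ip> to <mac>`, and `DHCPACK on <ip> to <mac>` when a lease completes). The function names and the `NICS` table are hypothetical; the MAC addresses are the ones from the F2 menu.

```python
import re

# Matches dhcpd lines of the form "DHCPOFFER on <ip> to <mac>" / "DHCPACK on <ip> to <mac>".
DHCP_RE = re.compile(r"dhcpd\[\d+\]: (DHCP[A-Z]+) on (\S+) to ([0-9A-Fa-f:]{17})")

LOG = [
    "Apr  8 17:14:09 install1003 dhcpd[6278]: DHCPOFFER on 10.64.48.121 to b0:26:28:f0:8e:7e via 10.64.48.3",
    "Apr  8 17:14:09 install1003 dhcpd[6278]: DHCPOFFER on 10.64.48.121 to b0:26:28:f0:8e:7e via 10.64.48.2",
    "Apr  8 17:14:26 install1003 dhcpd[6278]: DHCPOFFER on 10.64.48.121 to b0:26:28:f0:8e:7e via 10.64.48.2",
    "Apr  8 17:14:26 install1003 dhcpd[6278]: DHCPOFFER on 10.64.48.121 to b0:26:28:f0:8e:7e via 10.64.48.3",
]

# NIC ports as listed in the F2 menu, keyed by lower-case MAC address.
NICS = {
    "b0:26:28:f0:8e:80": "Port 1: Broadcom Adv. Dual 10Gb Ethernet",
    "b0:26:28:f0:8e:81": "Port 2: Broadcom Adv. Dual 10Gb Ethernet",
    "b0:26:28:f0:8e:7e": "Port 3: Broadcom Gigabit Ethernet BCM5720",
    "b0:26:28:f0:8e:7f": "Port 4: Broadcom Gigabit Ethernet BCM5720",
}


def summarize(lines):
    """Map each client MAC to the set of DHCP message types logged for it."""
    seen = {}
    for line in lines:
        match = DHCP_RE.search(line)
        if match:
            seen.setdefault(match.group(3).lower(), set()).add(match.group(1))
    return seen


def offers_without_ack(lines):
    """Return MACs that were offered a lease but never acknowledged."""
    return sorted(
        mac
        for mac, kinds in summarize(lines).items()
        if "DHCPOFFER" in kinds and "DHCPACK" not in kinds
    )


for mac in offers_without_ack(LOG):
    print(mac, "->", NICS.get(mac, "unknown NIC"))
```

Run against the four log lines above, the only client with offers and no ACK is the MAC that the F2 menu lists as a Gigabit port.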

I am also not able to ssh to kafka-jumbo1007.mgmt.eqiad.wmnet :(

Change 587560 merged by Elukey:
[operations/puppet@production] autoinstall: fix kafka-jumbo.cfg for Buster

https://gerrit.wikimedia.org/r/587560

ayounsi unsubscribed. Apr 9 2020, 8:54 AM

Summary:

the partman recipe is fixed
1009 seems good
1007's mgmt is not reachable
1008's mgmt works, but I can't pxe boot. There seems to be an issue with DHCP, see T244506#6040746

Maintenance_bot removed a project: Patch-For-Review. Apr 9 2020, 9:10 AM

These are on 1G racks. If you need 10G they will have to be moved.

In T244506#6053265, @Cmjohnson wrote:

These are on 1G racks. If you need 10G they will have to be moved.

Yep we'd need 10G, but regardless it seems that mgmt consoles are not reachable or can't pxe boot :(

The servers have been moved to 10G racks. In order to keep 2 in row D, KJ1008/1009 are in the same rack, D7. Once we are able to get a 3rd switch in there we can move one of them to D4. I verified the RAID configurations: the 2 smaller disks are RAID 1 and the 12 disks are RAID 10, according to the partman recipe.
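As a rough sanity check on that layout, usable capacity for the two RAID levels can be sketched as below. The per-disk sizes are assumptions, not taken from this task: 4.0 TB for each of the 12 large disks (chosen because it matches the 24.0 TB the installer reported) and 0.48 TB for each of the 2 small ones. `usable_capacity` is a hypothetical helper.

```python
def usable_capacity(level, disks, disk_size_tb):
    """Usable capacity in TB for a RAID 1 or RAID 10 array of equal-size disks."""
    if level == "raid1":
        if disks < 2:
            raise ValueError("RAID 1 needs at least 2 disks")
        # Every disk holds a full copy, so capacity is that of one disk.
        return disk_size_tb
    if level == "raid10":
        if disks < 4 or disks % 2:
            raise ValueError("RAID 10 needs an even number of disks, at least 4")
        # Disks are mirrored in pairs and the pairs are striped.
        return disks / 2 * disk_size_tb
    raise ValueError(f"unsupported RAID level: {level}")


# Assumed disk sizes (not stated in the task).
print("data array (12 disks, RAID 10):", usable_capacity("raid10", 12, 4.0), "TB")
print("OS array (2 disks, RAID 1):", usable_capacity("raid1", 2, 0.48), "TB")
```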

Network switch has been updated, old entries removed and ports disabled.

Change 595208 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Fixing dhcp entries for kafkajumbo1007-9 to 10G

https://gerrit.wikimedia.org/r/595208

Change 595208 merged by Cmjohnson:
[operations/puppet@production] Fixing dhcp entries for kafkajumbo1007-9 to 10G

https://gerrit.wikimedia.org/r/595208

Change 595209 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Adding kafka-jumbo100[789] to site.pp insetup role

https://gerrit.wikimedia.org/r/595209

Change 595209 merged by Cmjohnson:
[operations/puppet@production] Adding kafka-jumbo100[789] to site.pp insetup role

https://gerrit.wikimedia.org/r/595209

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

kafka-jumbo1007.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005081856_cmjohnson_99703_kafka-jumbo1007_eqiad_wmnet.log.

@elukey, it doesn't appear to be a partman thing. Attached is a picture of the console monitor during the initial image.

Maintenance_bot removed a project: Patch-For-Review. May 8 2020, 7:11 PM

Completed auto-reimage of hosts:

['kafka-jumbo1007.eqiad.wmnet']

Of which those FAILED:

['kafka-jumbo1007.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

kafka-jumbo1007.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005111609_elukey_44472_kafka-jumbo1007_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['kafka-jumbo1007.eqiad.wmnet']

Of which those FAILED:

['kafka-jumbo1007.eqiad.wmnet']

I had to abort the wmf reimage script because it wasn't getting to the point of running puppet. I then manually accepted the new puppet cert and ran puppet on 1007.

Partitions look good:

elukey@kafka-jumbo1007:~$ df -h
Filesystem            Size  Used Avail Use% Mounted on
udev                   32G     0   32G   0% /dev
tmpfs                 6.3G   17M  6.3G   1% /run
/dev/mapper/vg0-root  351G  1.6G  332G   1% /
tmpfs                  32G     0   32G   0% /dev/shm
tmpfs                 5.0M     0  5.0M   0% /run/lock
tmpfs                  32G     0   32G   0% /sys/fs/cgroup
/dev/mapper/vg1-srv    18T   24K   17T   1% /srv
tmpfs                 6.3G     0  6.3G   0% /run/user/13926

elukey@kafka-jumbo1007:~$ sudo pvs
  PV         VG  Fmt  Attr PSize   PFree
  /dev/sda2  vg0 lvm2 a--  446.34g 89.27g
  /dev/sdb1  vg1 lvm2 a--  <21.83t <4.37t

elukey@kafka-jumbo1007:~$ sudo lvs
  LV   VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root vg0 -wi-ao---- 357.07g
  srv  vg1 -wi-ao----  17.46t

NIC is also 10G, all good, will proceed with 1008/9.
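The pvs and lvs figures above are consistent with each other, which can be verified with a short sketch. The numbers are copied from the output above (the `<` prefixes in pvs mean the values were rounded up slightly); the `allocated` helper is hypothetical.

```python
def allocated(psize, pfree):
    """Space handed out to logical volumes: physical volume size minus free space."""
    return round(psize - pfree, 2)


# (PSize, PFree, LSize) per volume group, as reported by pvs and lvs above.
# vg0 is in GiB, vg1 in TiB.
VGS = {
    "vg0": (446.34, 89.27, 357.07),
    "vg1": (21.83, 4.37, 17.46),
}

for name, (psize, pfree, lsize) in VGS.items():
    used = allocated(psize, pfree)
    share = used / psize
    print(f"{name}: allocated {used}, lvs reports {lsize}, share of VG {share:.0%}")
```

In both volume groups the logical volume takes about 80% of the space, leaving the rest free for growing volumes later with the LVM tools.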

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

kafka-jumbo1008.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005111701_elukey_49836_kafka-jumbo1008_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['kafka-jumbo1008.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

kafka-jumbo1009.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202005111736_elukey_56660_kafka-jumbo1009_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['kafka-jumbo1009.eqiad.wmnet']

and were ALL successful.

elukey closed this task as Resolved. May 11 2020, 6:07 PM

elukey updated the task description.

@elukey, it looks like kafka-jumbo1007 is failing to execute any of the NRPE commands, while kafka-jumbo1008 and 1009, for instance, are all green.
I have ack'ed 1007 on icinga for now, can you double check it?

Thanks for the ping! Restarted the nagios server on the host and forced a recheck from icinga, let's see if it works.

Looks good now, removed also the downtime/acks!

Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:44 AM

(Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet
Closed, Resolved · Public

Description

Info off procurement task

Server setup checklists

Details

Related Objects

Event Timeline

		Status	Subtype	Assigned	Task
					Unknown Object (Task)
		Resolved		• Cmjohnson	T244506 (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet

	F31808337: unnamed.jpg
	May 8 2020, 7:09 PM
