Add new disks to syslog server in eqiad (lithium)
Closed, Resolved · Public

Description

Disks for lithium have arrived in {T139612}, and we'll need to replace the current disks once the syslog server in codfw is fully set up in T138073: setup syslog server in codfw

Event Timeline

Restricted Application added a subscriber: Aklapper. Aug 18 2016, 9:52 AM
fgiunchedi added a subtask: Unknown Object (Task). Aug 18 2016, 9:53 AM
fgiunchedi changed the task status from Open to Stalled. Aug 18 2016, 2:49 PM
fgiunchedi added a subscriber: Cmjohnson.

Stalled until T138073: setup syslog server in codfw is resolved and we have redundancy. cc @Cmjohnson

fgiunchedi changed the task status from Stalled to Open. Oct 5 2016, 3:10 PM
fgiunchedi reassigned this task from fgiunchedi to Cmjohnson.
fgiunchedi added a project: ops-eqiad.

@Cmjohnson we can go ahead with swapping the disks and reimaging now. wezen.codfw.wmnet has a month's worth of logs for redundancy.

Restricted Application added a subscriber: Southparkfan. Oct 5 2016, 3:10 PM

Change 314280 had a related patch set uploaded (by Filippo Giunchedi):
install_server: reinstall lithium with jessie and gpt

https://gerrit.wikimedia.org/r/314280

Change 314280 merged by Filippo Giunchedi:
install_server: reinstall lithium with jessie and gpt

https://gerrit.wikimedia.org/r/314280

@fgiunchedi The disks have been swapped and you're free to reinstall.

Mentioned in SAL (#wikimedia-operations) [2016-10-05T15:35:16Z] <godog> reimage lithium with bigger disks T143307

I see lithium still stuck at

Scanning for devices.  Please wait, this may take several minutes...

so a reseat or something similar is likely needed, @Cmjohnson?

@fgiunchedi The disks are fine and the BIOS sees them correctly. During this morning's attempt to install Jessie, I was able to see the offer/request and an image was served, but it eventually timed out. On subsequent attempts to do the same thing, lithium hits carbon for a DHCP offer but nothing happens.

Log from when I did get an image
Oct 6 10:32:06 carbon dhcpd: DHCPDISCOVER from c8:1f:66:bf:7f:ea via 10.64.32.3
Oct 6 10:32:06 carbon dhcpd: DHCPOFFER on 10.64.32.154 to c8:1f:66:bf:7f:ea via 10.64.32.3
Oct 6 10:32:10 carbon dhcpd: DHCPREQUEST for 10.64.32.154 (208.80.154.10) from c8:1f:66:bf:7f:ea via 10.64.32.3
Oct 6 10:32:10 carbon dhcpd: DHCPACK on 10.64.32.154 to c8:1f:66:bf:7f:ea via 10.64.32.3
Oct 6 10:32:10 carbon dhcpd: DHCPREQUEST for 10.64.32.154 (208.80.154.10) from c8:1f:66:bf:7f:ea via 10.64.32.2
Oct 6 10:32:10 carbon dhcpd: DHCPACK on 10.64.32.154 to c8:1f:66:bf:7f:ea via 10.64.32.2
Oct 6 10:32:11 carbon atftpd[7951]: Serving jessie-installer/debian-installer/amd64/pxelinux.0 to 10.64.32.154:2070
Oct 6 10:32:11 carbon atftpd[7951]: Serving jessie-installer/debian-installer/amd64/pxelinux.0 to 10.64.32.154:2071
Oct 6 10:32:11 carbon atftpd[7951]: Serving jessie-installer/ldlinux.c32 to 10.64.32.154:49152
Oct 6 10:32:11 carbon atftpd[7951]: Serving jessie-installer/pxelinux.cfg/ttyS1-115200 to 10.64.32.154:49153
Oct 6 10:32:11 carbon atftpd[7951]: Serving jessie-installer/pxelinux.cfg/boot.txt to 10.64.32.154:49154
Oct 6 10:32:21 carbon atftpd[7951]: Serving jessie-installer/debian-installer/amd64/linux to 10.64.32.154:49155
Oct 6 10:32:26 carbon atftpd[7951]: timeout: retrying...
Oct 6 10:33:48 atftpd[7951]: last message repeated 4 times

Log from when I didn't
Oct 6 10:39:54 carbon dhcpd: DHCPDISCOVER from c8:1f:66:bf:7f:ea via 10.64.32.3
Oct 6 10:39:54 carbon dhcpd: DHCPOFFER on 10.64.32.154 to c8:1f:66:bf:7f:ea via 10.64.32.3

Indeed that's odd @Cmjohnson. I can see the DHCP offers from _both_ cr1 and cr2 in eqiad coming in at roughly the same time

Oct  6 11:33:10 carbon dhcpd: DHCPDISCOVER from c8:1f:66:bf:7f:ea via 10.64.32.2
Oct  6 11:33:10 carbon dhcpd: DHCPOFFER on 10.64.32.154 to c8:1f:66:bf:7f:ea via 10.64.32.2
Oct  6 11:33:10 carbon dhcpd: DHCPDISCOVER from c8:1f:66:bf:7f:ea via 10.64.32.3
Oct  6 11:33:10 carbon dhcpd: DHCPOFFER on 10.64.32.154 to c8:1f:66:bf:7f:ea via 10.64.32.3

I tried putting in the 500GB disks but am still running into issues with the installer. I checked the VLAN, switch port, and DHCP file.

Oct 12 16:37:52 carbon dhcpd: DHCPDISCOVER from c8:1f:66:bf:7f:ea via 10.64.32.2
Oct 12 16:37:52 carbon dhcpd: DHCPOFFER on 10.64.32.154 to c8:1f:66:bf:7f:ea via 10.64.32.2
Oct 12 16:37:52 carbon dhcpd: DHCPDISCOVER from c8:1f:66:bf:7f:ea via 10.64.32.3
Oct 12 16:37:52 carbon dhcpd: DHCPOFFER on 10.64.32.154 to c8:1f:66:bf:7f:ea via 10.64.32.3
Oct 12 16:43:10 carbon dhcpd: DHCPDISCOVER from c8:1f:66:bf:7f:ea via 10.64.32.2
Oct 12 16:43:10 carbon dhcpd: DHCPOFFER on 10.64.32.154 to c8:1f:66:bf:7f:ea via 10.64.32.2
Oct 12 16:43:10 carbon dhcpd: DHCPDISCOVER from c8:1f:66:bf:7f:ea via 10.64.32.3
Oct 12 16:43:10 carbon dhcpd: DHCPOFFER on 10.64.32.154 to c8:1f:66:bf:7f:ea via 10.64.32.3

RobH closed subtask Unknown Object (Task) as Resolved. Oct 12 2016, 5:40 PM

I had a look at this. This is not network related. carbon answers as it should, the routers relay the DHCP packets as they should. AFAICT it's the motherboard that fails to acknowledge the receipt of the DHCP packets or the receipt of some TFTP packets.
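The pattern described above (carbon sends a DHCPOFFER, but the client never follows up with a DHCPREQUEST) can be spotted mechanically in the dhcpd log. The following is a minimal sketch, not something that was run as part of this task: the function names `summarize_dhcp` and `stalled_clients` are hypothetical, and the regular expression assumes only the dhcpd line format shown in the excerpts in this task.

```python
import re
from collections import defaultdict

# Matches the dhcpd message type and the client MAC in lines such as:
# "Oct 6 10:32:06 carbon dhcpd: DHCPDISCOVER from c8:1f:66:bf:7f:ea via 10.64.32.3"
# Assumption: only the four message types seen in this task are of interest.
DHCP_RE = re.compile(
    r"dhcpd: (DHCPDISCOVER|DHCPOFFER|DHCPREQUEST|DHCPACK)\b.*?"
    r"((?:[0-9a-f]{2}:){5}[0-9a-f]{2})",
    re.IGNORECASE,
)


def summarize_dhcp(lines):
    """Count DHCP message types per client MAC address."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        match = DHCP_RE.search(line)
        if match:
            msg_type = match.group(1).upper()
            mac = match.group(2).lower()
            counts[mac][msg_type] += 1
    return {mac: dict(per_type) for mac, per_type in counts.items()}


def stalled_clients(lines):
    """Return MACs that were offered a lease but never sent a DHCPREQUEST."""
    return sorted(
        mac
        for mac, per_type in summarize_dhcp(lines).items()
        if per_type.get("DHCPOFFER", 0) > 0 and per_type.get("DHCPREQUEST", 0) == 0
    )
```

Fed the two lines from the failed attempt at 10:39:54, `stalled_clients` reports lithium's MAC. Fed the lines from the 10:32 attempt, which include DHCPREQUEST and DHCPACK, it reports nothing. This only shows that the handshake stalls on the client side. It does not say why.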

Change 317550 had a related patch set uploaded (by Filippo Giunchedi):
base: send syslog only to codfw to reimage lithium

https://gerrit.wikimedia.org/r/317550

Change 317550 merged by Filippo Giunchedi:
base: send syslog only to codfw to reimage lithium

https://gerrit.wikimedia.org/r/317550

Change 317566 had a related patch set uploaded (by Filippo Giunchedi):
Revert "base: send syslog only to codfw to reimage lithium"

https://gerrit.wikimedia.org/r/317566

Change 317566 merged by Filippo Giunchedi:
Revert "base: send syslog only to codfw to reimage lithium"

https://gerrit.wikimedia.org/r/317566

fgiunchedi closed this task as Resolved. Oct 24 2016, 7:44 PM
fgiunchedi added a subscriber: faidon.

As pointed out by @faidon, PXE was failing because lithium was being hammered with UDP packets from the fleet. After removing lithium as a syslog target, the install went fine.

Lithium is back in service with 4TB disks, resolving.