
kafka1018 fails to boot
Closed, Resolved · Public · 13 Estimated Story Points

Description

After the last reboot for kernel updates, kafka1018 did not boot.

-------------------------------------------------------------------------------
Record:      43
Date/Time:   11/28/2017 15:19:00
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM1,DIMM2,DIMM3,DIMM4,DIMM5,DIMM6,DIMM7,DIMM8.
-------------------------------------------------------------------------------
Record:      44
Date/Time:   11/28/2017 15:30:08
Source:      system
Severity:    Critical
Description: Fault detected on drive 3 in disk drive bay 1.
-------------------------------------------------------------------------------

I tried to power cycle/hard reset it but nothing really works; the com2 output is blank now. Chris, would you mind double checking what's happening when you have time?

Event Timeline

Yes, there is definitely a bad disk, but the memory errors are probably a bad CPU or motherboard. It would be extremely bad fortune for all 8 DIMMs to fail at the same time. I did find a CPU1 failure in the SEL. I ended up swapping in both CPUs from an old Swift server of the same class, and CPU1 still shows as failed. This suggests a bad motherboard. If this were under warranty, Dell would repair it, but the warranty expired in October 2015.

Record: 45
Date/Time: 11/28/2017 16:07:54
Source: system
Severity: Critical

Description: CPU 1 has an internal error (IERR).

Record: 46
Date/Time: 11/28/2017 16:09:08
Source: system
Severity: Critical
Description: CPU 1 machine check error detected.

After Swap

Record: 58
Date/Time: 11/28/2017 16:31:12
Source: system
Severity: Critical

Description: The chassis is open while the power is off.

Record: 59
Date/Time: 11/28/2017 16:31:18
Source: system
Severity: Ok

Description: The chassis is closed while the power is off.

Record: 60
Date/Time: 11/28/2017 16:31:20
Source: system
Severity: Critical

Description: CPU 1 has an internal error (IERR).

Summary of what has been discussed so far on IRC:

  • Chris will try to find a used motherboard in the DC and see if it can be swapped into kafka1018, so that a simple reimage would be enough.
  • I asked @madhuvishy whether one of the notebook100[12] hosts could be temporarily repurposed as kafka1018, since they have the same hardware specs. She confirmed that notebook1002 is not heavily used at the moment, so it might be a good target if we don't find another solution.
  • @faidon mentioned that we might also be able to find a spare in the DC and use it as a temporary replacement, but we'd need to check with @RobH first.

Last but not least, we can't simply swap all Kafka clients over to the new Kafka Jumbo cluster, since we are not ready for the migration yet.

Just spoke to Chris and Faidon on IRC, and with my team. The best option seems to be to repurpose notebook1002.eqiad.wmnet as kafka1023.eqiad.wmnet (new hostname) and assign the Kafka broker id 18 to it, so it will replace kafka1018 (which will then be ready for decom).

I'd need Director approval, so I am going to loop in @mark to ask for his opinion, and @RobH for the new kafka1023 hostname.

  • Why are we trying to replace kafka1018 since it is out of warranty and the Analytics team already got the kafka-jumbo nodes?

Because we are not ready yet to move all the Kafka Analytics clients to the Kafka Jumbo cluster. Each Kafka topic partition is replicated on three different Kafka hosts, so at the moment we are only seeing alerts for topics that are not fully replicated due to the absence of kafka1018. If another broker goes down we might get into trouble, since we'll be one step away from losing data.
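
To see the current exposure concretely, the under-replicated partitions can be listed with the stock Kafka topic tool. This is a minimal sketch, not the exact command we run on the brokers: the ZooKeeper connection string and chroot below are assumptions for illustration.

# List topic partitions whose in-sync replica set is smaller than the
# configured replication factor (run from any analytics Kafka broker).
# The ZooKeeper address/chroot are assumptions; adjust to the real
# main-eqiad ZooKeeper cluster.
kafka-topics.sh --zookeeper conf1001.eqiad.wmnet:2181/kafka/eqiad \
  --describe --under-replicated-partitions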

  • Why notebook1002? Isn't it used?

After a chat with @madhuvishy, she confirmed that only notebook1001 is really used by users; notebook1002 is more of a testing machine now. These hosts were previously Analytics Hadoop workers, and hence have the same hardware specs as the kafka1012->1022 hosts (even the same number of disks).

  • When are you planning to decommission kafka1012->1022?

We should be able to move varnishkafka on cache misc to Kafka Jumbo by the end of this quarter, and hopefully move all the remaining clients during the next one.

Updating after a chat with Faidon: better to see if there is an onsite spare to repurpose, but for that I'd need to ping @RobH :)

So kafka1018 is a Dell PowerEdge R720xd: a 2U server with 12 LFF disk bays, dual Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz, 48 GB of memory, and 12 × 2 TB disks.

We don't have any spare hardware that comes close to this (in terms of disks), so this would be a new order, where we'll need to determine the best new CPU (this is a very old server we are replacing) and potentially move to a 1U system if possible.

As for the re-allocation of notebook1002, that would need to be approved by @mark; I wouldn't be in that approval process.

Hope that helps!

Edit addition: I also recommend against trying to resurrect one machine by pulling old, out-of-warranty mainboards from other machines. When hardware gets this old and breaks, we should just replace it with in-warranty hardware. (IMO)

I saw this host as DOWN when looking at Icinga, as it was in the unacknowledged section (though notifications were disabled).

Then I searched Phab for the host name, which usually works just fine, but for some reason Phab search is currently degraded and it did not find this ticket even though the host name is in the title.

Then I searched SAL for the host name but missed it because kafka1018 was part of a regex covering multiple servers.

That made me power cycle it to see what was up, and I can confirm I got:

Broadcom NetXtreme Ethernet Boot Agent
CPLD version: 103
Management Engine Mode ... BIOS: Active
Management Engine Firmware Version: 0002.0001
Uncorrectable Memory Error ...
Strike the F1 key to continue, F2 to run the system setup program

Afterwards I found this ticket :p

Repurposing notebook1002 as kafka1023 seems like a good temporary solution to get us by until the old hosts are decommissioned. Approved.

Change 394985 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] site.pp: set notebook1002 as spare::system

https://gerrit.wikimedia.org/r/394985

My idea is to:

  1. set notebook1002 as spare::system and clean up the services running on it.
  2. use wmf-auto-reimage to rename the host to kafka1023, if possible (also to test the feature).
  3. assign kafka::analytics::broker to kafka1023 (and the others, see site.pp), changing the hiera Kafka config in the following way:
kafka_clusters:
  # This is the analytics Kafka cluster, named just 'eqiad' for
  # historical reasons.
  eqiad:
    # Optional api_version indicates the Kafka API version the
    # brokers are running.  Clients can use this to override
    # version discovery for versions of Kafka where the version
    # request API doesn't exist (< 0.10).  Once all brokers
    # are on 0.10, this shouldn't be needed.
    api_version: 0.9
    zookeeper_cluster_name: main-eqiad
    brokers:
      kafka1012.eqiad.wmnet:
        id: 12  # Row A
      kafka1013.eqiad.wmnet:
        id: 13  # Row A
      kafka1014.eqiad.wmnet:
        id: 14  # Row C
-    kafka1018.eqiad.wmnet:
+    kafka1023.eqiad.wmnet:
        id: 18  # Row D
      kafka1020.eqiad.wmnet:
        id: 20  # Row D
      kafka1022.eqiad.wmnet:
        id: 22  # Row C
  4. run puppet on kafka1023 and verify that it effectively takes over broker id 18, fetching all the missing data from the other brokers (a quick check is sketched below).
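
As a quick post-Puppet sanity check (a sketch under assumptions: the default /etc/kafka/server.properties path, and the same illustrative ZooKeeper string used in the earlier under-replication check):

# Confirm the broker registered with id 18, i.e. the value Puppet should
# render into server.properties for kafka1023 (path is an assumption).
grep '^broker.id' /etc/kafka/server.properties    # expect: broker.id=18
# The under-replicated partition list should shrink to empty as kafka1023
# fetches the missing data from the other brokers.
kafka-topics.sh --zookeeper conf1001.eqiad.wmnet:2181/kafka/eqiad \
  --describe --under-replicated-partitions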

@elukey I recommend copying the home directories on notebook1002, backing them up somewhere on notebook1001, and sending a note to analytics and research-l asking folks to just use 1001. I don't think anyone uses 1002, but a few users have notebooks there, so notifying them would be good.

Change 394985 merged by Elukey:
[operations/puppet@production] site.pp: set notebook1002 as spare::system

https://gerrit.wikimedia.org/r/394985

Change 397397 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] netboot.cfg: add kafka1023

https://gerrit.wikimedia.org/r/397397

Change 397397 merged by Elukey:
[operations/puppet@production] netboot.cfg: add kafka1023

https://gerrit.wikimedia.org/r/397397

Change 397534 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Rename notebook1002 to kafka1023

https://gerrit.wikimedia.org/r/397534

Updated procedure after a chat with @Volans:

  1. shut down notebook1002
  2. replace it in DNS and puppet with kafka1023 (caveat: wmf-auto-reimage will only need the notebook1002 mgmt record to still be there; see the quick check sketched below)
  3. launch wmf-auto-reimage with --rename, which should take care of the manual work
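
A quick pre-flight check for step 2 (a sketch; the record names are inferred from the mgmt caveat above and from the SAL entries further down, not from a documented procedure):

# Before the rename, the old mgmt record must still resolve, while the new
# kafka1023 forward and mgmt records should already be in place.
dig +short notebook1002.mgmt.eqiad.wmnet
dig +short kafka1023.eqiad.wmnet
dig +short kafka1023.mgmt.eqiad.wmnet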

Change 397539 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Prepare the conditions to rename notebook1002 in kafka1023

https://gerrit.wikimedia.org/r/397539

Change 397743 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] netboot.cfg: change partman config for notebook1002

https://gerrit.wikimedia.org/r/397743

Change 397743 merged by Elukey:
[operations/puppet@production] netboot.cfg: change partman config for notebook1002

https://gerrit.wikimedia.org/r/397743

notebook1002 is now PXE installing fine. I removed the previous hardware RAID config and created twelve single-disk RAID0 virtual devices with the PERC controller. The other Kafka brokers use plain JBOD, but that seems cumbersome and a bit tricky to test, and since the difference is minimal I'll stop spending time on this and finally rename notebook1002 to kafka1023.
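
For the record, the same per-disk layout can be scripted with the LSI CLI instead of the controller setup menu. This is a hedged sketch, not what was actually run: the availability of megacli on the box and the cache-policy arguments are assumptions.

# Create one single-disk RAID0 virtual disk per physical drive on adapter 0,
# approximating JBOD while keeping the PERC happy. Tool and cache-policy
# flags below are assumptions for illustration only.
megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0
megacli -LDInfo -Lall -a0    # verify the twelve RAID0 virtual disks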

Mentioned in SAL (#wikimedia-operations) [2017-12-12T14:46:47Z] <elukey> start rename notebook1002 -> kafka1023 - step 2, dns config (host already shutdown) - T181518

Change 397539 merged by Elukey:
[operations/dns@master] Prepare the conditions to rename notebook1002 in kafka1023

https://gerrit.wikimedia.org/r/397539

Mentioned in SAL (#wikimedia-operations) [2017-12-12T15:02:51Z] <elukey> clear recdns records related to notebook1002/kafka1023 (rec_control wipe-cache kafka1023.eqiad.wmnet kafka1023.mgmt.eqiad.wmnet notebook1002.eqiad.wmnet 14.5.64.10.in-addr.arpa 104.3.65.10.in-addr.arpa) - T181518

Change 397534 merged by Elukey:
[operations/puppet@production] Rename notebook1002 to kafka1023

https://gerrit.wikimedia.org/r/397534

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

notebook1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201712121528_elukey_9474_notebook1002_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

kafka1023.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201712121731_elukey_20395_kafka1023_eqiad_wmnet.log.

Change 397884 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Remove any trace of notebook1002 records

https://gerrit.wikimedia.org/r/397884

Completed auto-reimage of hosts:

['kafka1023.eqiad.wmnet']

Of which those FAILED:

['kafka1023.eqiad.wmnet']

Change 397884 merged by Elukey:
[operations/dns@master] Remove any trace of notebook1002 records

https://gerrit.wikimedia.org/r/397884

Change 398255 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Replace kafka1018 with kafka1023 in the analytics kafka cluster

https://gerrit.wikimedia.org/r/398255

Change 398255 abandoned by Elukey:
Replace kafka1018 with kafka1023 in the analytics kafka cluster

Reason:
needs to be broken down in multiple steps

https://gerrit.wikimedia.org/r/398255

Change 398292 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Establish ipsec session between kafka1023 and the cp nodes

https://gerrit.wikimedia.org/r/398292

Change 398292 merged by Elukey:
[operations/puppet@production] Establish ipsec session between kafka1023 and the cp nodes

https://gerrit.wikimedia.org/r/398292

Change 398295 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add interface::add_ip6_mapped to kafka1023

https://gerrit.wikimedia.org/r/398295

Change 398295 merged by Elukey:
[operations/puppet@production] Add interface::add_ip6_mapped to kafka1023

https://gerrit.wikimedia.org/r/398295

Change 398299 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Allow ipsec in iptables/ferm rules for kafka1023

https://gerrit.wikimedia.org/r/398299

Change 398299 merged by Elukey:
[operations/puppet@production] Allow ipsec in iptables/ferm rules for kafka1023

https://gerrit.wikimedia.org/r/398299

Change 398301 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Replace kafka1018 with kafka1023 in the Analytics Kafka cluster

https://gerrit.wikimedia.org/r/398301

Change 398301 merged by Elukey:
[operations/puppet@production] Replace kafka1018 with kafka1023 in the Analytics Kafka cluster

https://gerrit.wikimedia.org/r/398301

kafka1023 is now fully productionized and catching up with the missing partitions. Opened https://phabricator.wikimedia.org/T182955 to decom kafka1018.

elukey set the point value for this task to 13. · Dec 15 2017, 4:02 PM