db1246 crashed yet again
Closed, Resolved, Public

Description

This server has crashed once more, after many HW parts were replaced: https://phabricator.wikimedia.org/T391372#10774902

Even so, it crashed again after less than a week.
Previous crashes: T359940 T361968 T363119 T374215 T387673 T391372

@wiki_willy this is the 7th crash with the same error. Can we just get Dell to replace the whole server?

Event Timeline

I was having some issues re-imaging this host, but @Volans was able to help by removing the host from PuppetDB. See the error below:

$ sudo puppet lookup --render-as s --compile --node db1246.eqiad.wmnet
               profile::puppet::agent::force_puppet7
08:31 <volans> Error: Could not run: Evaluation Error: Error while evaluating a
               Function Call, Could not find class ::raid::perccli for
               db1246.eqiad.wmnet (file:
               /srv/puppet_code/environments/production/modules/raid/manifests/init.pp,
               line: 30, column: 13)

The host is back up now and I am running some HW tests. I will hand the host back to the DB team next week.

Thanks

Icinga downtime and Alertmanager silence (ID=3e20d375-9adc-4351-ba8a-0bbdf71aba3b) set by swfrench@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Host has crashed - T393296

db1246.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2025-05-08T16:22:27Z] <swfrench@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: Host has crashed - T393296

I've re-added a 1w downtime, as the earlier one was removed as a side-effect of the reimage. If we expect the host to be powered on for ongoing work, and also expect that work to extend past 1w from now, it may be a better option to set profile::monitoring::notifications_enabled: false in hiera.

@FCeratto-WMF this host needs to be recloned and added to production, can you take care of this?

Change #1143988 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1246: Disable notifications

https://gerrit.wikimedia.org/r/1143988

Change #1143988 merged by Marostegui:

[operations/puppet@production] db1246: Disable notifications

https://gerrit.wikimedia.org/r/1143988

The server has been relocated and re-imaged since May 8th, and we haven't seen any issues so far. Can we please put the server back in production to see whether the same issue occurs when it has load on it? If it does, we will have to look into the disks' performance on the server under load.

Thanks

FCeratto-WMF changed the task status from Open to In Progress. May 16 2025, 11:04 AM
FCeratto-WMF claimed this task.
FCeratto-WMF triaged this task as High priority.

Started cloning db1188.eqiad.wmnet to db1246.eqiad.wmnet - fceratto@cumin1002

Completed depool of db1188 - Depool db1188.eqiad.wmnet to then clone it to db1246.eqiad.wmnet - fceratto@cumin1002 - fceratto@cumin1002

The host is not responding to ping or ssh, nor is it sending host-level metrics:
https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&from=2025-05-07T10%3A00%3A09.186Z&to=2025-05-16T11%3A51%3A44.389Z&timezone=utc&var-server=db1246&var-datasource=000000026&var-cluster=misc&refresh=5m
Was it shut down by mistake perhaps? Otherwise it crashed again.
The host was not in use so it had no CPU and I/O load.
@Papaul can you please take a look?

The host crashed yesterday with the same error as always:

-------------------------------------------------------------------------------
Record:      2
Date/Time:   05/15/2025 11:42:14
Source:      system
Severity:    Critical
Description: The system board BP1 PG voltage is outside of range.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   05/15/2025 11:42:34
Source:      system
Severity:    Ok
Description: The system board BP1 PG voltage is within range.
-------------------------------------------------------------------------------
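The length of the out-of-range window in the log above can be measured mechanically. The following is a minimal Python sketch, assuming the log keeps exactly the layout pasted above (dashed separators, `Key: value` lines, `MM/DD/YYYY HH:MM:SS` timestamps). `parse_sel` and `excursions` are hypothetical helper names, not part of any Dell or WMF tooling.

```python
from datetime import datetime

# The two records pasted above, verbatim.
SEL_TEXT = """\
-------------------------------------------------------------------------------
Record:      2
Date/Time:   05/15/2025 11:42:14
Source:      system
Severity:    Critical
Description: The system board BP1 PG voltage is outside of range.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   05/15/2025 11:42:34
Source:      system
Severity:    Ok
Description: The system board BP1 PG voltage is within range.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split a pasted event log into one dict per record."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("---"):
            # A dashed line closes the record collected so far, if any.
            if current:
                records.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        if value:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    for rec in records:
        rec["when"] = datetime.strptime(rec["Date/Time"], "%m/%d/%Y %H:%M:%S")
    return records


def excursions(records):
    """Pair each Critical event with the next Ok event; return durations in seconds."""
    durations, started = [], None
    for rec in records:
        if rec["Severity"] == "Critical" and started is None:
            started = rec["when"]
        elif rec["Severity"] == "Ok" and started is not None:
            durations.append((rec["when"] - started).total_seconds())
            started = None
    return durations


print(excursions(parse_sel(SEL_TEXT)))
```

For the two records above this reports a single 20-second excursion between the Critical and the Ok event.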

@wiki_willy @Papaul can we get Dell to replace this host? This server has had this same error since it was delivered. We have replaced parts, updated firmware, done cold restarts, and even changed rack locations, and the same error persists.

Start pool of db1188 gradually with 4 steps - Pooling back in - fceratto@cumin1002

@FCeratto-WMF @Marostegui yes I will talk with @wiki_willy on getting a replacement. Thank you

Completed pool of db1188 gradually with 4 steps - Pooling back in - fceratto@cumin1002

Hey @VRiley-WMF & @Jclark-ctr - I remember you two were working on tracking down and consolidating all the Dell Support tickets that we've opened for this server. Can you send me the full list of Dell Tech Support ticket numbers that we've created? I'll use that data to try and push for our account team to get us a replacement host. Thanks, Willy

Hey @wiki_willy Here is the list I have

188297490 - April 5th 2024
197398410 - September 10th 2024
198075128 - September 23rd 2024
200579927 - November 7th 2024
206617456 - March 7th 2025
208331710 - April 10th 2025

Awesome, thanks @VRiley-WMF! Can you do me one more favor and summarize what was replaced next to each ticket for each Tech Support request?

Yes. There were only 3 dispatches for this issue. It goes as follows:

188297490 - April 5th 2024 - Logs showed up "clean" on their end and didn't send anyone out.
197398410 - September 10th 2024 - They had me troubleshoot basics (reboot the server, flea power drain). Logs were "clean" and didn't send anyone out.
198075128 - September 23rd 2024 - They had me reseat parts, and perform a flea power drain again. Logs were "clean" and no dispatch.
200579927 - November 7th 2024 - Dispatch was sent out to replace the motherboard. Swapped it out.
206617456 - March 7th 2025 - Dispatch was sent out again, and replaced the motherboard again.
208331710 - April 10th 2025 - Dispatch was sent out again, and replaced the following:

  • ASSY Cable Power CPU Power Interposer Board Motherboard
  • ASSY Cable Power Power Interposer Board Motherboard
  • ASSY Card, Controller, PERC H755, With Battery Back Up, Front
  • ASSY Cable, Power, I2C, X8, X10
  • ASSY Controller Card Interface, PCB, V2
  • ASSY Card, Backplane, 10X2.5, 1U, Version 3
  • ASSY Motherboard, WMX Version 4
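The ticket history above can be tallied with a short script. This is an illustrative Python sketch only: the `TICKETS` structure and its dispatch flag are a hypothetical encoding of the list above, transcribed by hand, not an export from Dell's support system.

```python
from datetime import date

# (service request, date opened, dispatch sent?) transcribed from the list above.
TICKETS = [
    ("188297490", date(2024, 4, 5), False),
    ("197398410", date(2024, 9, 10), False),
    ("198075128", date(2024, 9, 23), False),
    ("200579927", date(2024, 11, 7), True),
    ("206617456", date(2025, 3, 7), True),
    ("208331710", date(2025, 4, 10), True),
]

# Tickets where Dell actually sent a technician with parts.
dispatches = [ticket for ticket in TICKETS if ticket[2]]

# Days between the first and the last support ticket.
span_days = (TICKETS[-1][1] - TICKETS[0][1]).days

print(f"{len(TICKETS)} tickets, {len(dispatches)} dispatches, over {span_days} days")
```

This gives 6 tickets and 3 dispatches over 370 days, matching the "only 3 dispatches" summary above.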

Perfect, thanks @VRiley-WMF! I just sent an email out to our Dell Account team and cc'd you and John on it.

Just a quick update: our Dell Account team is working on a resolution. There's a new open case requesting an RMA and a server replacement.

A new ticket has been opened for the RMA of this unit: Service Request 210136653.

@VRiley-WMF it was suggested that I assign this task to you while we wait for the RMA, I hope you don't mind :)

Placing in blocked for the time being while we wait for the RMA

I just filled out the registration for the seed server today, so it should be arriving in the next 1-2 weeks. @VRiley-WMF - just a heads up that it won't include the hard drives, so you'll have to move the disks over to the replacement chassis. It also probably won't have the normal packing slip that you see on new procurement requests.

Change #1153566 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] installserver: Allow reimage of db1246

https://gerrit.wikimedia.org/r/1153566

Change #1153566 merged by Marostegui:

[operations/puppet@production] installserver: Allow reimage of db1246

https://gerrit.wikimedia.org/r/1153566

Any update on the status of the delivery?
Thanks!

@Marostegui Looks like the Seed server was delivered Jun 12th to the data center

@VRiley-WMF this would be the Dell server that you placed in the new cage. The P.O. on the packing slip starts with S.S (Seed Server).

Thank you - from puppet side this host is ready to be installed and reimaged.

We have received the Seed Server for this unit. Would we like to use a new/different name but set it up in the same location?

Manuel is out until next week. I personally don't care either way but let's wait for him to come back!

@VRiley-WMF it is a completely new server, right? With a different asset tag? If that's the case, I think we should treat it as such and just give it a new hostname, db1259, and follow the normal racking+installation process.
If the asset tag is the same, then we should probably just reuse it, including the same IP.

Change #1166797 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Add db1259 to puppet

https://gerrit.wikimedia.org/r/1166797

Change #1166797 merged by Marostegui:

[operations/puppet@production] mariadb: Add db1259 to puppet

https://gerrit.wikimedia.org/r/1166797

@VRiley-WMF I've created the puppet patches for db1259.

@Marostegui thanks! I will be installing this as a "new" unit of db1259

<3

Just to verify with you, @Marostegui: the server is now in Netbox. However, this seed server only has a single 1.92TB drive, while the other server has ten 1.92TB drives. Is it safe to transfer those drives into the seed server and re-image it?

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host db1259.eqiad.wmnet with OS bookworm

Before proceeding with the imaging, I wanted to make sure: it's okay for me to wipe these drives, correct? I think that's why the reimage may fail.

Yes, you can totally wipe them and recreate the RAID 10.
Thank you

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host db1259.eqiad.wmnet with OS bookworm executed with errors:

  • db1259 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console db1259.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host db1259.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host db1259.eqiad.wmnet with OS bookworm executed with errors:

  • db1259 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console db1259.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host db1259.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host db1259.eqiad.wmnet with OS bookworm completed:

  • db1259 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507122048_vriley_3510369_db1259.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
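When comparing the three reimage attempts above, the outcome of each cookbook report can be read from its first bullet and the point of failure from its last. A minimal Python sketch, assuming a pasted report keeps the `• host (PASS|FAIL)` layout shown above; `reimage_outcome` is a hypothetical helper, not part of Spicerack or the cookbooks, and `REPORT` is an abbreviated excerpt of the first failed run.

```python
import re

# Abbreviated excerpt of the first failed report above (some steps omitted).
REPORT = """\
  • db1259 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Forced PXE for next reboot
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details.
"""


def reimage_outcome(report):
    """Return (host, status, last_step) from a pasted cookbook report."""
    # Drop blank lines, indentation and the leading bullet character.
    lines = [ln.strip().lstrip("•").strip() for ln in report.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty report")
    match = re.fullmatch(r"(\S+) \((PASS|FAIL)\)", lines[0])
    if match is None:
        raise ValueError("first line is not a 'host (STATUS)' header")
    return match.group(1), match.group(2), lines[-1]


host, status, last_step = reimage_outcome(REPORT)
print(host, status, "-", last_step)
```

The last step is the most useful field for a failed run, since the cookbook lists steps in the order they completed.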

@VRiley-WMF thank you so much, we can take it from here.
@FCeratto-WMF the host is accessible, please productionize it: puppet + recloning + pooling

Icinga downtime and Alertmanager silence (ID=94d69bcb-2e00-40fe-be67-efbd8d2cf1e5) set by fceratto@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: New host setup

db1259.eqiad.wmnet

Change #1169034 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] Prepare db1259 for production, remove db1246

https://gerrit.wikimedia.org/r/1169034

Mentioned in SAL (#wikimedia-operations) [2025-07-14T09:56:50Z] <fceratto@cumin1002> dbctl commit (dc=all): 'Add db1259 T393296', diff saved to https://phabricator.wikimedia.org/P78964 and previous config saved to /var/cache/conftool/dbconfig/20250714-095649-fceratto.json

Change #1169034 merged by Federico Ceratto:

[operations/puppet@production] Prepare db1259 for production

https://gerrit.wikimedia.org/r/1169034

Started cloning db1233.eqiad.wmnet to db1259.eqiad.wmnet - fceratto@cumin1002

Completed depool of db1233 - Depool db1233.eqiad.wmnet to then clone it to db1259.eqiad.wmnet - fceratto@cumin1002 - fceratto@cumin1002

Mentioned in SAL (#wikimedia-operations) [2025-07-14T10:54:17Z] <fceratto@cumin1002> dbctl commit (dc=all): 'Set db1259 T393296', diff saved to https://phabricator.wikimedia.org/P78979 and previous config saved to /var/cache/conftool/dbconfig/20250714-105416-fceratto.json

Started cloning db1233.eqiad.wmnet to db1259.eqiad.wmnet - fceratto@cumin1002

Start pool of db1233 gradually with 4 steps - Pool db1233.eqiad.wmnet in after cloning - fceratto@cumin1002

Completed pool of db1233 gradually with 4 steps - Pool db1233.eqiad.wmnet in after cloning - fceratto@cumin1002

Finished cloning db1233.eqiad.wmnet to db1259.eqiad.wmnet - fceratto@cumin1002

Change #1169110 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] db1259.yaml: enable notifications

https://gerrit.wikimedia.org/r/1169110

Completed depool of db1259 - Pooling in - fceratto@cumin1002

Change #1169110 merged by Federico Ceratto:

[operations/puppet@production] db1259.yaml: enable notifications

https://gerrit.wikimedia.org/r/1169110

Start pool of db1259 gradually with 4 steps - Pooling in - fceratto@cumin1002

Completed pool of db1259 gradually with 4 steps - Pooling in - fceratto@cumin1002

Jclark-ctr mentioned this in Unknown Object (Task). Dec 17 2025, 6:41 PM