
pc1017 crashed
Closed, Resolved, Public

Description

The pattern looks like what we've seen on T375382

journalctl output: P70583
actions taken: systemctl set-environment MYSQLD_OPTS="--innodb-force-recovery=1" && systemctl restart mariadb && journalctl -fln50 -u mariadb
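The recovery commands above can be sketched as an annotated sequence (a hedged sketch assuming a systemd-managed MariaDB, as on these hosts; innodb_force_recovery=1 is the least invasive level, and values above 3 can permanently lose data):

```shell
# Sketch of the recovery steps taken (assumption: systemd-managed MariaDB).
# innodb_force_recovery=1 lets InnoDB start even if it hits a corrupt page;
# levels above 3 are destructive and can permanently lose data.
systemctl set-environment MYSQLD_OPTS="--innodb-force-recovery=1"
systemctl restart mariadb
journalctl -fln50 -u mariadb   # follow the last 50 journal lines during recovery

# Once the server is healthy again, drop the override and restart cleanly:
systemctl unset-environment MYSQLD_OPTS
systemctl restart mariadb
```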

Event Timeline

ABran-WMF triaged this task as High priority. Edited Oct 24 2024, 9:00 AM
ABran-WMF moved this task from Triage to In progress on the DBA board.

mariadb version on the pc cluster is the same as the one we have issues with on other sections (10.6.19) - see T377718

Mentioned in SAL (#wikimedia-operations) [2024-10-24T09:12:02Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 4:00:00 on pc[1014,1017].eqiad.wmnet with reason: pc maintenance T378068

Mentioned in SAL (#wikimedia-operations) [2024-10-24T09:12:17Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on pc[1014,1017].eqiad.wmnet with reason: pc maintenance T378068

Change #1082721 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] parsercache: Move pc1014 to be the master of pc5 instead of pc1017

https://gerrit.wikimedia.org/r/1082721

Change #1082721 merged by Jcrespo:

[operations/puppet@production] parsercache: Move pc1014 to be the master of pc5 instead of pc1017

https://gerrit.wikimedia.org/r/1082721

Icinga downtime and Alertmanager silence (ID=d60a748b-47aa-4525-a863-4f0eef44fac6) set by jynus@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: moved pc number

pc1014.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-10-24T10:11:51Z] <jynus@cumin1002> dbctl commit (dc=all): 'promoting pc1014 as the master of pc5 T378068', diff saved to https://phabricator.wikimedia.org/P70584 and previous config saved to /var/cache/conftool/dbconfig/20241024-101150-jynus.json

Icinga downtime and Alertmanager silence (ID=75a868cf-4682-460c-b2fa-9d21f253934e) set by jynus@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: stopped being the active one, stopping replication

pc1017.eqiad.wmnet

pc1014 has been set up as the host covering pc5 in eqiad, instead of pc1017. dbctl and puppet may need further adjustments to determine how to configure the candidates now / mark them on puppet and dbctl, but otherwise, and with some 10% impact in cache misses, the servers seem to be working as intended.

As followups:

  • Clean up existing heartbeat table entries for pc1 contaminated by the pc1014 host (the hosts need to catch up first; if the entries are deleted now they reappear, as they are still being replicated)
  • Discover what happened to pc1017 / fix it / prevent it from happening again
  • Handle and decide candidates and spares on puppet and dbctl (pc1017 is still in pc5, but depooled)
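The heartbeat cleanup in the first follow-up could be sketched roughly as below (hypothetical: the exact predicate must be checked against the live table first, and it must run only after the hosts have caught up on replication, or the deleted rows are re-replicated and reappear):

```shell
# Hypothetical sketch: build the statement that would drop the stale pc1
# heartbeat rows left over from pc1014's previous role. Verify with a SELECT
# on the live heartbeat.heartbeat table before deleting anything.
shard_to_clean='pc1'
sql="DELETE FROM heartbeat.heartbeat WHERE shard = '${shard_to_clean}';"
echo "$sql"
# then, on the affected host and only after replication has caught up:
#   sudo mysql -e "$sql"
```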

  • Clean up existing heartbeat table entries for pc1 contaminated by the pc1014 host (the hosts need to catch up first; if the entries are deleted now they reappear, as they are still being replicated)
MariaDB root@(none):(none)> select * from heartbeat.heartbeat;
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+-------+------------+
| ts                         | server_id | file              | position  | relay_master_log_file | exec_master_log_pos | shard | datacenter |
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+-------+------------+
| 2024-10-29T12:35:58.002700 | 171978762 | pc1017-bin.015739 | 922604865 | pc1014-bin.341069     | 368247039           | pc5   | eqiad      |
| 2024-10-29T12:35:58.001180 | 171978841 | pc1014-bin.341069 | 368980950 | pc2017-bin.015750     | 136129720           | pc5   | eqiad      |
| 2024-10-29T12:35:58.001520 | 180360460 | pc2017-bin.015750 | 136190583 | pc1014-bin.341069     | 368246424           | pc5   | codfw      |
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+-------+------------+
MariaDB root@(none):(none)> select * from heartbeat.heartbeat;
+----------------------------+-----------+-------------------+------------+-----------------------+---------------------+-------+------------+
| ts                         | server_id | file              | position   | relay_master_log_file | exec_master_log_pos | shard | datacenter |
+----------------------------+-----------+-------------------+------------+-----------------------+---------------------+-------+------------+
| 2024-10-29T12:42:00.001030 | 171966521 | pc1011-bin.340179 | 931191173  | pc2011-bin.352869     | 1001310725          | pc1   | eqiad      |
| 2024-10-29T12:42:00.000940 | 180355144 | pc2011-bin.352869 | 1001346740 | pc1011-bin.340179     | 931167151           | pc1   | codfw      |
+----------------------------+-----------+-------------------+------------+-----------------------+---------------------+-------+------------+

heartbeat looks OK

  • Discover what happened to pc1017 / fix it / prevent it from happening again

T378076 has been created as a consequence of T378068. I think we are still uncertain whether this outage was triggered by a huge re-caching event?

  • Handle and decide candidates and spares on puppet and dbctl (pc1017 is still in pc5, but depooled)

@Ladsgroup wdyt? Do you want pc1017 to be repooled? should it be a floating spare maybe?

Discover what happened to pc1017 / fix it / prevent it from happening again

No, T378076 is separate. I mean what caused it to crash. Was it a memory or controller issue (my working theory), or was it a bug, or something else? Given that this and T375382 happened within such a short period of time, when we hadn't had issues in a long time before, maybe there is a pattern or something to look for. We would like to do something to prevent it, even if that means considering those hosts "unstable" and reimaging them/only using them as spares (that is an example; I mean deciding what's the best way to move forward).

Yeah, we need to figure out what happened to the host.

arnaudb@cumin1002:~ $ sudo ipmitool -I lanplus -H "pc1017.mgmt.eqiad.wmnet" -U root -E sel list
Unable to read password from environment
Password: 
   1 | 07/15/2024 | 09:10:04 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted

Nothing much on the SEL side. IPMI is unreachable though; I'll reset the interface.

Any logs on the controller? If we are unable to find any HW traces, my suggestion would be:
Let's get IPMI working and reimage the host (formatting /srv even). I'll install 10.6.20 on it when it is compiled, and let's set it as a spare for now (replicating)

Mentioned in SAL (#wikimedia-operations) [2024-11-05T13:16:42Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled

Mentioned in SAL (#wikimedia-operations) [2024-11-05T13:16:55Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled

Change #1087457 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] dotfiles: add ssh config for mgmt

https://gerrit.wikimedia.org/r/1087457

Any logs on the controller? If we are unable to find any HW traces, my suggestion would be:
Let's get IPMI working and reimage the host (formatting /srv even). I'll install 10.6.20 on it when it is compiled, and let's set it as a spare for now (replicating)

racadm soft reset + forced: failed
racadm hard reset + forced: failed
ipmitool -I lanplus -H "pc1017.mgmt.eqiad.wmnet" -U root -E chassis power cycle: failed as well

Nothing found in the megacli logs (not many logs to go on, btw). I'll reimage and get back to debugging IPMI.

Change #1087457 merged by Arnaudb:

[operations/puppet@production] dotfiles: add ssh config for mgmt

https://gerrit.wikimedia.org/r/1087457

Mentioned in SAL (#wikimedia-operations) [2024-11-05T13:54:05Z] <arnaudb> reimage pc1017 T378068

Did you issue the reimage with formatting /srv/?

Change #1087468 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] mariadb: wipe /srv on pc1017

https://gerrit.wikimedia.org/r/1087468

Change #1087468 merged by Arnaudb:

[operations/puppet@production] mariadb: wipe /srv on pc1017

https://gerrit.wikimedia.org/r/1087468

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm completed:

  • pc1017 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411051453_arnaudb_363231_pc1017.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm completed:

  • pc1017 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411051540_arnaudb_369530_pc1017.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

@Marostegui you're good to go to install 10.6.20 on this host! Let me know when you're done; I'll keep an eye on its hardware.

Please populate it + start replication and I will upgrade when I've done a few more tests on 10.6.20, as it is currently running only on test-s4. But let's not have this host cold for long

Mentioned in SAL (#wikimedia-operations) [2024-11-07T07:49:57Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled

Mentioned in SAL (#wikimedia-operations) [2024-11-07T07:50:02Z] <arnaudb@cumin1002> END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled

Mentioned in SAL (#wikimedia-operations) [2024-11-07T07:50:12Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled

Mentioned in SAL (#wikimedia-operations) [2024-11-07T07:50:14Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled

downgrade done:
Unpacking wmf-mariadb106 (10.6.17+deb12u1) over (10.6.19+deb12u1) ...
Setting up wmf-mariadb106 (10.6.17+deb12u1) ...

will fill it up now

Mentioned in SAL (#wikimedia-operations) [2024-11-12T07:52:22Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled

Mentioned in SAL (#wikimedia-operations) [2024-11-12T07:52:36Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled

Mentioned in SAL (#wikimedia-operations) [2024-11-13T15:47:36Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled

Mentioned in SAL (#wikimedia-operations) [2024-11-13T15:47:53Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled

Mentioned in SAL (#wikimedia-operations) [2024-11-18T07:46:13Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled

Mentioned in SAL (#wikimedia-operations) [2024-11-18T07:46:17Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled

Mentioned in SAL (#wikimedia-operations) [2024-11-25T07:55:40Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled

Mentioned in SAL (#wikimedia-operations) [2024-11-25T07:55:53Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled

Mentioned in SAL (#wikimedia-operations) [2024-12-02T08:00:39Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled

Mentioned in SAL (#wikimedia-operations) [2024-12-02T08:00:52Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm

ABran-WMF changed the task status from Open to In Progress. Dec 2 2024, 10:13 AM

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm completed:

  • pc1017 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202412021016_arnaudb_1725706_pc1017.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

I brought mariadb back online. I followed the instructions in https://wikitech.wikimedia.org/wiki/MariaDB#Setting_up_a_fresh_server (I couldn't run some parts of it because the mariadb daemon was running before running the init command; I killed it, exactly like the issue I had in T374355#10135227). Set up grants and heartbeat tables. Created the parsercache database, but before setting up replication, tables pc000 to pc255 need to be created. I can't find how I did that two months ago and I need to leave. Since pc1013 is now getting warmed up, the situation is much better. I will try to look into this when I'm back.

Change #1100890 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/software@master] create_pc_tables.sh: Create table in parsercache

https://gerrit.wikimedia.org/r/1100890

Marostegui reassigned this task from ABran-WMF to Ladsgroup.

I brought mariadb back online. I followed the instructions in https://wikitech.wikimedia.org/wiki/MariaDB#Setting_up_a_fresh_server (I couldn't run some parts of it because the mariadb daemon was running before running the init command; I killed it, exactly like the issue I had in T374355#10135227). Set up grants and heartbeat tables. Created the parsercache database, but before setting up replication, tables pc000 to pc255 need to be created. I can't find how I did that two months ago and I need to leave. Since pc1013 is now getting warmed up, the situation is much better. I will try to look into this when I'm back.

I have created the tables and configured replication on pc1017 from pc5 master (as that's what's configured on dbctl). Created the tables using a quick script that I pushed to dbtools so it is stored somewhere: https://gerrit.wikimedia.org/r/c/operations/software/+/1100890

So this is all done and the server is back.
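The table-creation step presumably looks something like the loop below (a hypothetical sketch of what create_pc_tables.sh does; `pc_template` is a placeholder name, not necessarily the base schema the real script uses):

```shell
# Hypothetical sketch: the parsercache database is sharded into 256 tables
# named pc000 through pc255. "pc_template" is a placeholder, not the real
# schema source used by create_pc_tables.sh.
for i in $(seq 0 255); do
  printf 'CREATE TABLE IF NOT EXISTS pc%03d LIKE pc_template;\n' "$i"
done
# The generated statements could then be piped into mysql, e.g.:
#   ... | sudo mysql parsercache
```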

Change #1100890 merged by jenkins-bot:

[operations/software@master] create_pc_tables.sh: Create table in parsercache

https://gerrit.wikimedia.org/r/1100890