The pattern looks like what we've seen on T375382
journalctl output: P70583
actions taken: systemctl set-environment MYSQLD_OPTS="--innodb-force-recovery=1" && systemctl restart mariadb && journalctl -fln50 -u mariadb
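For reference, a minimal sketch of how that force-recovery override can be escalated and then cleared, assuming the mariadb unit reads MYSQLD_OPTS (as the command above does); levels above 4 risk data loss, so go one step at a time:

```
# Escalate innodb-force-recovery one level at a time until the server starts.
for level in 1 2 3 4; do
    sudo systemctl set-environment MYSQLD_OPTS="--innodb-force-recovery=${level}"
    if sudo systemctl restart mariadb; then
        echo "mariadb started with innodb-force-recovery=${level}"
        break
    fi
done

# Follow the log to confirm InnoDB finished crash recovery:
sudo journalctl -fln50 -u mariadb

# Once the data has been checked/dumped, drop the override and restart cleanly:
sudo systemctl unset-environment MYSQLD_OPTS
sudo systemctl restart mariadb
```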
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | Ladsgroup | T378068 pc1017 crashed |
| Resolved | PRODUCTION ERROR | None | T378076 Parsercache issues in codfw causing large-scale outage |
The MariaDB version on the pc cluster is the same one we are having issues with on other sections (10.6.19) - see T377718
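To double-check which package version each pc host actually runs, something along these lines from a cumin host would do (the host list is illustrative, and the direct `D{}` backend is just one way to target them):

```
# Sketch: report the installed wmf-mariadb package on a few pc hosts.
sudo cumin 'D{pc1011.eqiad.wmnet,pc1014.eqiad.wmnet,pc1017.eqiad.wmnet}' \
    'dpkg -l "wmf-mariadb*" | grep ^ii'
```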
Mentioned in SAL (#wikimedia-operations) [2024-10-24T09:12:02Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 4:00:00 on pc[1014,1017].eqiad.wmnet with reason: pc maintenance T378068
Mentioned in SAL (#wikimedia-operations) [2024-10-24T09:12:17Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on pc[1014,1017].eqiad.wmnet with reason: pc maintenance T378068
Change #1082721 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] parsercache: Move pc1014 to be the master of pc5 instead of pc1017
Change #1082721 merged by Jcrespo:
[operations/puppet@production] parsercache: Move pc1014 to be the master of pc5 instead of pc1017
Mentioned in SAL (#wikimedia-operations) [2024-10-24T09:59:24Z] <jynus> restart pc1014 T378068
Icinga downtime and Alertmanager silence (ID=d60a748b-47aa-4525-a863-4f0eef44fac6) set by jynus@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: moved pc number
pc1014.eqiad.wmnet
Mentioned in SAL (#wikimedia-operations) [2024-10-24T10:11:51Z] <jynus@cumin1002> dbctl commit (dc=all): 'promoting pc1014 as the master of pc5 T378068', diff saved to https://phabricator.wikimedia.org/P70584 and previous config saved to /var/cache/conftool/dbconfig/20241024-101150-jynus.json
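For reference, the dbctl steps behind a commit like that look roughly like this (a sketch; the actual committed diff is the one saved to P70584 above):

```
# Sketch: swap the pc5 primary from pc1017 to pc1014 and commit the change.
sudo dbctl section pc5 set-master pc1014
sudo dbctl instance pc1017 depool
sudo dbctl config commit -m 'promoting pc1014 as the master of pc5 T378068'
```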
Icinga downtime and Alertmanager silence (ID=75a868cf-4682-460c-b2fa-9d21f253934e) set by jynus@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: stopped being the active one, stopping replication
pc1017.eqiad.wmnet
pc1014 has been set up as the host covering pc5 in eqiad, instead of pc1017. dbctl and puppet may need further adjustments to decide how to configure the candidates and mark them in puppet and dbctl, but otherwise, and with roughly a 10% impact in cache misses, the servers seem to be working as intended.
As followups:
- Discover what happened to pc1017, fix it, and prevent it from happening again
- Handle and decide candidates and spares on puppet and dbctl (pc1017 is still in pc5, but depooled)

T378076 has been created as a consequence of T378068; I think we are still uncertain whether this outage was triggered by a huge re-caching?

@Ladsgroup wdyt? Do you want pc1017 to be repooled? Should it be a floating spare, maybe?

MariaDB root@(none):(none)> select * from heartbeat.heartbeat;
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+-------+------------+
| ts                         | server_id | file              | position  | relay_master_log_file | exec_master_log_pos | shard | datacenter |
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+-------+------------+
| 2024-10-29T12:35:58.002700 | 171978762 | pc1017-bin.015739 | 922604865 | pc1014-bin.341069     | 368247039           | pc5   | eqiad      |
| 2024-10-29T12:35:58.001180 | 171978841 | pc1014-bin.341069 | 368980950 | pc2017-bin.015750     | 136129720           | pc5   | eqiad      |
| 2024-10-29T12:35:58.001520 | 180360460 | pc2017-bin.015750 | 136190583 | pc1014-bin.341069     | 368246424           | pc5   | codfw      |
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+-------+------------+

MariaDB root@(none):(none)> select * from heartbeat.heartbeat;
+----------------------------+-----------+-------------------+------------+-----------------------+---------------------+-------+------------+
| ts                         | server_id | file              | position   | relay_master_log_file | exec_master_log_pos | shard | datacenter |
+----------------------------+-----------+-------------------+------------+-----------------------+---------------------+-------+------------+
| 2024-10-29T12:42:00.001030 | 171966521 | pc1011-bin.340179 | 931191173  | pc2011-bin.352869     | 1001310725          | pc1   | eqiad      |
| 2024-10-29T12:42:00.000940 | 180355144 | pc2011-bin.352869 | 1001346740 | pc1011-bin.340179     | 931167151           | pc1   | codfw      |
+----------------------------+-----------+-------------------+------------+-----------------------+---------------------+-------+------------+

heartbeat looks OK
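A quick way to turn that heartbeat table into an actual lag figure (a sketch, assuming local root socket access with the mysql client on the replica):

```
# Compute replication lag in seconds per shard/datacenter from heartbeat rows.
sudo mysql -e "
  SELECT shard, datacenter,
         TIMESTAMPDIFF(MICROSECOND,
                       STR_TO_DATE(ts, '%Y-%m-%dT%H:%i:%s.%f'),
                       UTC_TIMESTAMP(6)) / 1e6 AS lag_seconds
  FROM heartbeat.heartbeat;
"
```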
> Discover what happened to pc1017, fix it, and prevent it from happening again
No, T378076 is separate. I mean what caused it to crash: was it a memory or controller issue (my working theory), a bug, or something else? Given that this and T375382 happened within such a short period, while we hadn't had issues for a long time before, maybe there is a pattern or something to look for. We would like to do something to prevent it, even if that means considering those hosts "unstable" and reimaging them or only using them as spares (that is just an example; I mean deciding the best way to move forward).
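To test the memory/controller theory, the kernel log and the EDAC/MCE counters are the cheapest things to check; a sketch (ras-mc-ctl only exists if rasdaemon is installed):

```
# Look for machine-check or memory errors logged by the kernel since boot.
sudo journalctl -k -b | grep -iE 'mce|edac|hardware error|uncorrect' \
    || echo "no MCE/EDAC hits in the current boot"

# If rasdaemon is installed, its database keeps a longer error history:
sudo ras-mc-ctl --summary
sudo ras-mc-ctl --errors
```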
arnaudb@cumin1002:~ $ sudo ipmitool -I lanplus -H "pc1017.mgmt.eqiad.wmnet" -U root -E sel list
Unable to read password from environment
Password:
   1 | 07/15/2024 | 09:10:04 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
Nothing much on the SEL side; IPMI is unreachable though, I'll reset the interface.
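When the LAN side of the BMC stops answering, a reset from the host itself over the local interface is usually still possible; a sketch:

```
# Reset the BMC locally (the OS keeps running; only the management controller reboots).
sudo modprobe -a ipmi_si ipmi_devintf   # make sure /dev/ipmi0 exists
sudo ipmitool mc reset cold

# Give it a few minutes, then verify the LAN side is back:
ipmitool -I lanplus -H "pc1017.mgmt.eqiad.wmnet" -U root -E mc info
```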
Any logs on the controller? If we are unable to find any HW traces my suggestion would be:
Let's get the IPMI working and reimage the host (formatting /srv even). I'll install 10.6.20 on there when it is compiled and let's set it as spare for now (replicating)
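On the controller side, the usual places to look are the adapter event log and the firmware terminal log; a sketch with megacli (the binary may be megacli or MegaCli64 depending on the package):

```
# Dump the RAID controller's own logs and get a quick drive health overview.
sudo megacli -AdpEventLog -GetEvents -f /tmp/pc1017-adpevents.log -aALL
sudo megacli -FwTermLog -Dsply -aALL > /tmp/pc1017-fwtermlog.txt
sudo megacli -PDList -aALL | grep -iE 'slot|media error|predictive|firmware state'
sudo megacli -LDInfo -Lall -aALL | grep -iE '^state|bad blocks'
```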
Mentioned in SAL (#wikimedia-operations) [2024-11-05T13:16:42Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled
Mentioned in SAL (#wikimedia-operations) [2024-11-05T13:16:55Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled
Change #1087457 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] dotfiles: add ssh config for mgmt
racadm soft reset + forced: failed
racadm hard reset + forced: failed
ipmitool -I lanplus -H "pc1017.mgmt.eqiad.wmnet" -U root -E chassis power cycle: failed as well
Nothing found in the megacli logs (not many logs to go on there, btw); I'll reimage and get back to debugging IPMI.
Mentioned in SAL (#wikimedia-operations) [2024-11-05T13:54:05Z] <arnaudb> reimage pc1017 T378068
Change #1087457 merged by Arnaudb:
[operations/puppet@production] dotfiles: add ssh config for mgmt
Change #1087468 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] mariadb: wipe /srv on pc1017
Change #1087468 merged by Arnaudb:
[operations/puppet@production] mariadb: wipe /srv on pc1017
Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm
Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm completed:
Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1087499 to revert partman change
Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm completed:
@Marostegui you're good to go to install 10.6.20 on this host! let me know when you're done, I'll keep an eye on its hardware.
Please populate it + start replication, and I will upgrade when I've done a few more tests on 10.6.20, as it is currently only running on test-s4. But let's not leave this host cold for long.
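For the record, pointing it at the current pc5 primary and starting replication is roughly this (a sketch; hostname, user and coordinates are placeholders and in practice come from the grants and the primary's binlog/GTID position):

```
# Sketch: configure pc1017 as a replica of the pc5 primary and start it.
sudo mysql -e "
  CHANGE MASTER TO
    MASTER_HOST='pc1014.eqiad.wmnet',
    MASTER_USER='repl',
    MASTER_PASSWORD='********',
    MASTER_USE_GTID=slave_pos;
  START SLAVE;
"
sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind'
```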
Mentioned in SAL (#wikimedia-operations) [2024-11-07T07:49:57Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled
Mentioned in SAL (#wikimedia-operations) [2024-11-07T07:50:02Z] <arnaudb@cumin1002> END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled
Mentioned in SAL (#wikimedia-operations) [2024-11-07T07:50:12Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled
Mentioned in SAL (#wikimedia-operations) [2024-11-07T07:50:14Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled
downgrade done: Unpacking wmf-mariadb106 (10.6.17+deb12u1) over (10.6.19+deb12u1) ...
Setting up wmf-mariadb106 (10.6.17+deb12u1) ...
will fill it up now
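For reference, a pinned downgrade like the one above can be done directly with apt (versions taken from the dpkg output above; the hold just keeps it from being bumped back by accident):

```
# Downgrade the WMF MariaDB package to a specific version and hold it there.
sudo apt-get update
sudo apt-get install --allow-downgrades wmf-mariadb106=10.6.17+deb12u1
sudo apt-mark hold wmf-mariadb106
```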
Mentioned in SAL (#wikimedia-operations) [2024-11-12T07:52:22Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled
Mentioned in SAL (#wikimedia-operations) [2024-11-12T07:52:36Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled
Mentioned in SAL (#wikimedia-operations) [2024-11-13T15:47:36Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled
Mentioned in SAL (#wikimedia-operations) [2024-11-13T15:47:53Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled
Mentioned in SAL (#wikimedia-operations) [2024-11-18T07:46:13Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled
Mentioned in SAL (#wikimedia-operations) [2024-11-18T07:46:17Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled
Mentioned in SAL (#wikimedia-operations) [2024-11-25T07:55:40Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled
Mentioned in SAL (#wikimedia-operations) [2024-11-25T07:55:53Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled
Mentioned in SAL (#wikimedia-operations) [2024-12-02T08:00:39Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled
Mentioned in SAL (#wikimedia-operations) [2024-12-02T08:00:52Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled
Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm
Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm completed:
I brought mariadb back online. I followed the instructions in https://wikitech.wikimedia.org/wiki/MariaDB#Setting_up_a_fresh_server (I couldn't run some parts of it because the mariadb daemon was running before I ran the init command, so I killed it, exactly like the issue I had in T374355#10135227). I set up the grants and heartbeat tables and created the parsercache database, but before setting up replication the pc000 to pc255 tables need to be created. I can't find how I did that two months ago and I need to leave. Since pc1013 is now getting warmed up, the situation is much better. I will try to look into this when I'm back.
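For the pc000..pc255 part, the kind of loop that can create them looks like this (a sketch; the CREATE TABLE is an illustrative objectcache-style schema, not necessarily the production definition, which ended up in the create_pc_tables.sh change below):

```
# Sketch: create the 256 parsercache shard tables pc000..pc255.
for i in $(seq 0 255); do
    t=$(printf 'pc%03d' "$i")
    sudo mysql parsercache -e "
      CREATE TABLE IF NOT EXISTS ${t} (
        keyname VARBINARY(255) NOT NULL DEFAULT '',
        value   MEDIUMBLOB,
        exptime DATETIME NOT NULL,
        PRIMARY KEY (keyname),
        KEY exptime (exptime)
      );"
done
```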
Change #1100890 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/software@master] create_pc_tables.sh: Create table in parsercache
I have created the tables and configured replication on pc1017 from pc5 master (as that's what's configured on dbctl). Created the tables using a quick script that I pushed to dbtools so it is stored somewhere: https://gerrit.wikimedia.org/r/c/operations/software/+/1100890
So this is all done and the server is back.
Change #1100890 merged by jenkins-bot:
[operations/software@master] create_pc_tables.sh: Create table in parsercache