17:49:57 <+icinga-wm> PROBLEM - Host db1179 #page is DOWN: PING CRITICAL - Packet loss = 100%
17:52:13 <+logmsgbot> !log rzl@cumin2002 dbctl commit (dc=all): 'db1179 depooled', diff saved to https://phabricator.wikimedia.org/P66324 and previous config saved to /var/cache/conftool/dbconfig/20240711-175212-rzl.json
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
db1179: Disable notification for db1179 | operations/puppet | production | +1 -1
Related Objects
- Mentioned In
  - T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis
- Mentioned Here
  - P66875 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: Maint over (T369855 T370304)'
  - P66874 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: Maint over (T369855 T370304)'
  - P66873 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: Maint over (T369855 T370304)'
  - T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis
  - P66872 dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: Maint over (T369855 T370304)'
Event Timeline
Mentioned in SAL (#wikimedia-operations) [2024-07-12T08:59:22Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on db1179.eqiad.wmnet with reason: T369855
Mentioned in SAL (#wikimedia-operations) [2024-07-12T08:59:35Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on db1179.eqiad.wmnet with reason: T369855
I am unable to reach it via the management interface either, so it might need some hands-on work.
Change #1054055 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[operations/puppet@production] db1179: Disable notification for db1179
Mentioned in SAL (#wikimedia-operations) [2024-07-15T07:17:50Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1179.eqiad.wmnet with reason: T369855
Mentioned in SAL (#wikimedia-operations) [2024-07-15T07:17:54Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1179.eqiad.wmnet with reason: T369855
Change #1054055 merged by Marostegui:
[operations/puppet@production] db1179: Disable notification for db1179
All hosts in x1 are potential candidate masters. They all run the same binlog format, so there's not really a fixed candidate master like on the other sections.
This server has been down for a few days. @wiki_willy, please let me know if I can help.
Hi @ABran-WMF - can you work with the onsite engineers on this? cc'ing @VRiley-WMF & @Jclark-ctr
sure thing!
@VRiley-WMF @Jclark-ctr the host has been depooled and is downtimed, you should be able to take it from here. Feel free to ping if needed!
Noted that the server doesn't want to power on. Tried to power cycle it, attempted a flea power drain. Reseated the power cable from the motherboard. Removed all the RAM except A1, and there was no change. No change when removing all the RAM. Will need to investigate further.
Attempted a bare minimum setup (CPU1, A1 RAM, no additional cards): no change. Attempted swapping out the power button module: still no change. Will attempt swapping out the MB.
After swapping the Mainboard, it was finally able to boot. It currently is having issues with RAM which I will continue to troubleshoot tomorrow.
After swapping out the MB, it'll boot. However, it's consistently throwing errors with a few DIMM slots (even after replacing the memory in those slots). Since this server is out of warranty, I am continuing to investigate options.
Mentioned in SAL (#wikimedia-operations) [2024-07-22T01:07:46Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: Maint over (T369855 T370304)', diff saved to https://phabricator.wikimedia.org/P66872 and previous config saved to /var/cache/conftool/dbconfig/20240722-010745-ladsgroup.json
Mentioned in SAL (#wikimedia-operations) [2024-07-22T01:22:51Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: Maint over (T369855 T370304)', diff saved to https://phabricator.wikimedia.org/P66873 and previous config saved to /var/cache/conftool/dbconfig/20240722-012251-ladsgroup.json
Mentioned in SAL (#wikimedia-operations) [2024-07-22T01:37:57Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: Maint over (T369855 T370304)', diff saved to https://phabricator.wikimedia.org/P66874 and previous config saved to /var/cache/conftool/dbconfig/20240722-013756-ladsgroup.json
Mentioned in SAL (#wikimedia-operations) [2024-07-22T01:53:02Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: Maint over (T369855 T370304)', diff saved to https://phabricator.wikimedia.org/P66875 and previous config saved to /var/cache/conftool/dbconfig/20240722-015302-ladsgroup.json
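For readers following the repool, the four SAL entries above can be summarized as a staged schedule. The snippet below is only a sketch that parses those log lines to recover the percentages and the gap between steps. The regular expression and the helper names (`parse_steps`, `intervals_seconds`) are assumptions made for illustration and are not part of dbctl or any Wikimedia tooling.

```python
import re
from datetime import datetime

# SAL entries copied from the timeline above (trimmed after the commit message).
SAL_LINES = [
    "[2024-07-22T01:07:46Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: Maint over (T369855 T370304)'",
    "[2024-07-22T01:22:51Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: Maint over (T369855 T370304)'",
    "[2024-07-22T01:37:57Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: Maint over (T369855 T370304)'",
    "[2024-07-22T01:53:02Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: Maint over (T369855 T370304)'",
]

# Hypothetical pattern: bracketed UTC timestamp, then the pooling percentage.
STEP_RE = re.compile(r"\[(?P<ts>[^\]]+)\].*?\(re\)pooling @ (?P<pct>\d+)%")


def parse_steps(lines):
    """Return a list of (timestamp, percentage) tuples, one per repool step."""
    steps = []
    for line in lines:
        match = STEP_RE.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%SZ")
            steps.append((ts, int(match.group("pct"))))
    return steps


def intervals_seconds(steps):
    """Return the number of seconds between consecutive repool steps."""
    return [
        int((later[0] - earlier[0]).total_seconds())
        for earlier, later in zip(steps, steps[1:])
    ]


steps = parse_steps(SAL_LINES)
percentages = [pct for _, pct in steps]
gaps = intervals_seconds(steps)
print(percentages, gaps)
```

Read this way, the host went from 10% to 100% in four steps spaced roughly fifteen minutes apart.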
No, but it had ten days of replication replayed (with RBR), and if it had issues, replication would have broken really quickly. The logs also said Aria recovery was done and was successful. On top of that, I ran several queries to make sure the data is the same before repooling. I can still do a checksum on top if you think it is needed, but replication not breaking with RBR is usually a good enough sign for me.
Sure, that's fine (remember we don't use Aria, so in this case that can be misleading).
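As an illustration of the optional checksum mentioned above, here is a minimal sketch of comparing two sets of rows by digest. It is not the tooling used in this task: the function names (`table_digest`, `same_data`) and the in-memory row lists are hypothetical, and a real comparison would run against the databases themselves, in chunks.

```python
import hashlib


def table_digest(rows):
    """Return a SHA-256 hex digest of the rows, independent of row order."""
    digest = hashlib.sha256()
    # Sort the encoded rows so that two hosts returning the same rows
    # in a different order still produce the same digest.
    for encoded in sorted(repr(tuple(row)).encode("utf-8") for row in rows):
        digest.update(encoded)
        digest.update(b"\n")
    return digest.hexdigest()


def same_data(reference_rows, candidate_rows):
    """Return True when both row sets hash to the same digest."""
    return table_digest(reference_rows) == table_digest(candidate_rows)


# Hypothetical sample rows standing in for query results from two hosts.
reference = [(1, "a"), (2, "b"), (3, "c")]
candidate_ok = [(3, "c"), (1, "a"), (2, "b")]
candidate_bad = [(1, "a"), (2, "b"), (3, "x")]

print(same_data(reference, candidate_ok))
print(same_data(reference, candidate_bad))
```

A matching digest only shows that the sampled rows agree; it complements, rather than replaces, the signal that RBR replication kept applying cleanly.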