User Details
- User Since: Jan 7 2025, 6:49 PM (74 w, 5 d)
- Availability: Available
- IRC Nick: federico3
- LDAP User: Federico Ceratto
- MediaWiki User: FCeratto-WMF
Today
Thu, Jun 11
Wed, Jun 10
Various cookbooks and scripts have grown organically with different feature sets, so I'm putting together a summary table at https://wikitech.wikimedia.org/wiki/MariaDB/Upgrading_a_section#Tool_summary
We can reimage db2185 first
Tue, Jun 9
Related to T426633
Schema change completed.
This is also preventing https://gerrit.wikimedia.org/r/c/operations/dns/+/1286411 from being merged
Fri, Jun 5
@Marostegui db1274 is not ready for replication (there's no /srv data, no MariaDB installed and it's not in zarcillo yet) as it's part of the new batch.
Thu, Jun 4
The host crashed again.
Wed, Jun 3
We can follow the same pattern as the schema change helper and the rolling restarts: walk across DCs and sections in the safest sequence, with optional CLI flags to limit the scope, e.g. --sections s1,s4 --dc codfw.
We can reuse a good chunk of existing code for this.
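A minimal sketch of how the scope-limiting flags could be parsed and applied. Only the flag names --sections and --dc come from the comment above; the function names and the idea of passing in an already ordered list of (dc, section) pairs are assumptions for illustration, not the existing cookbook code.

```python
import argparse
from typing import List, Optional, Sequence, Tuple


def csv_list(value: str) -> List[str]:
    """Turn 's1,s4' into ['s1', 's4'], ignoring empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only the two flag names are taken from the discussion.
    parser = argparse.ArgumentParser(description="Walk DCs and sections in order")
    parser.add_argument("--sections", type=csv_list, default=None,
                        help="comma-separated sections to limit the run to")
    parser.add_argument("--dc", default=None,
                        help="limit the run to a single datacenter")
    return parser


def limit_scope(
    ordered_targets: Sequence[Tuple[str, str]],
    sections: Optional[List[str]],
    dc: Optional[str],
) -> List[Tuple[str, str]]:
    """Filter (dc, section) pairs without changing their order.

    The caller is assumed to supply the pairs already sorted in the
    safest sequence; this function only narrows the scope.
    """
    return [
        (target_dc, section)
        for target_dc, section in ordered_targets
        if (dc is None or target_dc == dc)
        and (sections is None or section in sections)
    ]
```

Keeping the ordering and the filtering separate means the flags can only shrink the walk, never reorder it.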
Tue, Jun 2
Mon, Jun 1
Thanks @VRiley-WMF
journald is not showing hardware errors.
MariaDB started cleanly, replication is catching up as expected.
https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=%24__all&var-server=db1224&var-port=9104&from=now-30m&to=now&timezone=utc
The cookbook itself could compute the expected downtime expiration time when setting it (just datetime.now() + timedelta(...), nothing fancy)
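A small sketch of that computation, assuming the cookbook already knows the downtime duration as a timedelta. The function name and the use of UTC are illustrative assumptions, not part of the existing cookbook.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def downtime_expiry(duration: timedelta, now: Optional[datetime] = None) -> datetime:
    """Return the moment the downtime is expected to expire.

    `now` can be injected for testing; it defaults to the current UTC time.
    """
    start = now if now is not None else datetime.now(timezone.utc)
    return start + duration
```

The result could then be logged next to the downtime request, e.g. with downtime_expiry(timedelta(hours=4)).isoformat().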
Sat, May 30
Yes, MariaDB is not running.
Fri, May 29
Crash on May 26 after 48 minutes of uptime:
May 26 11:48:36 db2189 kernel: ? __pfx_flush_tlb_func+0x10/0x10
May 26 11:48:36 db2189 kernel: on_each_cpu_cond_mask+0x24/0x40
May 26 11:48:36 db2189 kernel: arch_tlbbatch_flush+0xe7/0x100
May 26 11:48:36 db2189 kernel: try_to_unmap_flush+0x2d/0x40
May 26 11:48:36 db2189 kernel: migrate_pages_batch+0x741/0xa50
May 26 11:48:36 db2189 kernel: migrate_pages+0x960/0xb70
May 26 11:48:36 db2189 kernel: ? __pfx_alloc_misplaced_dst_folio+0x10/0x10
May 26 11:48:36 db2189 kernel: migrate_misplaced_folio+0xda/0x290
May 26 11:48:36 db2189 kernel: __handle_mm_fault+0xcfb/0xf70
May 26 11:48:36 db2189 kernel: handle_mm_fault+0xe2/0x2c0
May 26 11:48:36 db2189 kernel: do_user_addr_fault+0x217/0x620
May 26 11:48:36 db2189 kernel: exc_page_fault+0x7e/0x180
May 26 11:48:36 db2189 kernel: asm_exc_page_fault+0x26/0x30
May 26 11:48:36 db2189 kernel: RIP: 0033:0x55e2dbe40614
May 26 11:48:36 db2189 kernel: Code: 00 00 0f 1f 40 00 55 48 89 d0 4c 89 c2 48 89 e5 41 57 41 56 41 55 49 89 cd 41 54 49 89 f4 53 48 89 fb 48 83 ec 28 48 8b 76 18 <0f> b6 4f fd f6 46 3c 01 0f 84 56 01 00 00 31 f6 45 31 ff 83 e1 07
May 26 11:48:36 db2189 kernel: RSP: 002b:00007f1603989ae0 EFLAGS: 00010202
May 26 11:48:36 db2189 kernel: RAX: 00007f1603989bd0 RBX: 00007f25f1362084 RCX: 0000000000000003
May 26 11:48:36 db2189 kernel: RDX: 0000000000000002 RSI: 00007f15fd6bbd80 RDI: 00007f25f1362084
May 26 11:48:36 db2189 kernel: RBP: 00007f1603989b30 R08: 0000000000000002 R09: 00007f1603989bb0
May 26 11:48:36 db2189 kernel: R10: 00007f141004f536 R11: 000000000000000d R12: 00007f15fd6f0620
May 26 11:48:36 db2189 kernel: R13: 0000000000000003 R14: 0000000000000043 R15: 000000000000004d
May 26 11:48:36 db2189 kernel: </TASK>
May 26 11:48:36 db2189 kernel: watchdog: BUG: soft lockup - CPU#21 stuck for 23s! [mysqld:13037]
May 26 11:48:36 db2189 kernel: Modules linked in: tcp_diag inet_diag binfmt_misc ipmi_ssif intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac skx_edac_common nfit libnvdimm dell_pc x86_pkg_temp_thermal platform_profile intel_powerclamp xfs coretemp crct10dif_pclmul gha>
May 26 11:48:36 db2189 kernel: configfs nfnetlink ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic cdc_ether usbnet mii dm_mod sd_mod ahci libahci xhci_pci iTCO_wdt intel_pmc_bxt libata xhci_hcd megaraid_sas iTCO_vendor_support tg3 watchdog crc32_pclmul bnxt_en scsi_mod usbcore libphy crc32c_inte>
May 26 11:48:36 db2189 kernel: CPU: 21 UID: 498 PID: 13037 Comm: mysqld Tainted: G L 6.12.88+deb13-amd64 #1 Debian 6.12.88-1
May 26 11:48:36 db2189 kernel: Tainted: [L]=SOFTLOCKUP
May 26 11:48:36 db2189 kernel: Hardware name: Dell Inc. PowerEdge R650xs/05FK0J, BIOS 1.10.2 03/03/2023
May 26 11:48:36 db2189 kernel: RIP: 0010:smp_call_function_many_cond+0x345/0x4c0
May 26 11:48:36 db2189 kernel: Code: e8 20 2e 4d 00 3b 05 9a 15 ca 01 0f 83 e1 fd ff ff 48 63 d0 49 8b 34 24 48 03 34 d5 20 6e 5f 9b 8b 56 08 83 e2 01 74 0a f3 90 <8b> 4e 08 83 e1 01 75 f6 83 c0 01 eb bb 65 8b 05 b3 fb e6 65 48 0f
May 26 11:48:36 db2189 kernel: RSP: 0000:ff45268d6abc7ae0 EFLAGS: 00000202
May 26 11:48:36 db2189 kernel: RAX: 0000000000000007 RBX: 0000000000000202 RCX: 0000000000000001
May 26 11:48:36 db2189 kernel: RDX: 0000000000000001 RSI: ff21ee203f1bd8a0 RDI: 0000000000000007
May 26 11:48:36 db2189 kernel: RBP: ff21ee203f535180 R08: ff21ede146282590 R09: 0000000000000000
May 26 11:48:36 db2189 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ff21ee203f537180
May 26 11:48:36 db2189 kernel: R13: ff21ede146282f58 R14: 0000000000000015 R15: 0000000000000020
May 26 11:48:36 db2189 kernel: FS: 00007f1387fff6c0(0000) GS:ff21ee203f500000(0000) knlGS:0000000000000000
May 26 11:48:36 db2189 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 26 11:48:36 db2189 kernel: CR2: 00007f768d5d6300 CR3: 000000408d230003 CR4: 0000000000771ef0
May 26 11:48:36 db2189 kernel: PKRU: 55555554
May 26 11:48:36 db2189 kernel: Call Trace:
May 26 11:48:36 db2189 kernel: <TASK>
May 26 11:48:36 db2189 kernel: ? __pfx_flush_tlb_func+0x10/0x10
May 26 11:48:36 db2189 kernel: on_each_cpu_cond_mask+0x24/0x40
May 26 11:48:36 db2189 kernel: arch_tlbbatch_flush+0xe7/0x100
May 26 11:48:36 db2189 kernel: try_to_unmap_flush+0x2d/0x40
May 26 11:48:36 db2189 kernel: migrate_pages_batch+0x741/0xa50
May 26 11:48:36 db2189 kernel: migrate_pages+0x960/0xb70
May 26 11:48:36 db2189 kernel: ? __pfx_alloc_misplaced_dst_folio+0x10/0x10
May 26 11:48:36 db2189 kernel: migrate_misplaced_folio+0xda/0x290
May 26 11:48:36 db2189 kernel: __handle_mm_fault+0xcfb/0xf70
May 26 11:48:36 db2189 kernel: handle_mm_fault+0xe2/0x2c0
May 26 11:48:36 db2189 kernel: do_user_addr_fault+0x217/0x620
May 26 11:48:36 db2189 kernel: exc_page_fault+0x7e/0x180
May 26 11:48:36 db2189 kernel: asm_exc_page_fault+0x26/0x30
May 26 11:48:36 db2189 kernel: RIP: 0033:0x55e2dbd0ccb8
May 26 11:48:36 db2189 kernel: Code: 5d 01 8b 10 48 c1 e2 05 48 01 fa 48 39 d7 73 55 66 0f 6f 05 1a 72 4c 00 66 0f ef c9 66 0f 1f 44 00 00 48 8b 07 48 85 c0 74 2e <49> 01 07 48 8b 47 08 49 01 47 08 48 8b 47 10 49 3b 47 10 0f 82 4f
May 26 11:48:36 db2189 kernel: RSP: 002b:00007f1387ffec80 EFLAGS: 00010202
May 26 11:48:36 db2189 kernel: RAX: 0000000000000176 RBX: 00007f768d790040 RCX: 0000000000000000
May 26 11:48:36 db2189 kernel: RDX: 00007f768fd255c0 RSI: 00007f768d790040 RDI: 00007f768fd25100
May 26 11:48:36 db2189 kernel: RBP: 00007f1387ffecc0 R08: 0013bb7d9236ba0f R09: 0000000000000923
May 26 11:48:36 db2189 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f768ff8aa40
May 26 11:48:36 db2189 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 00007f768d5d6300
May 26 11:48:36 db2189 kernel: </TASK>
May 26 11:48:36 db2189 kernel: watchdog: BUG: soft lockup - CPU#32 stuck for 78s! [mysqld:12782]
-- Boot 60f2a03b401641fe9a0553be3de9fffb --
May 27 14:33:46 db2189 kernel: Linux version 6.12.88+deb13-amd64 (debian-kernel@lists.debian.org) (x86_64-linux-gnu-gcc-14 (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44) #1 SMP PREEMPT_DYNAMIC Debian 6.12.88-1 (2026-05-15)
May 27 14:33:46 db2189 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.12.88+deb13-amd64 root=UUID=72d41885-6bd1-4362-999e-09a99d7a5406 ro console=ttyS1,115200n8 raid0.default_layout=2 elevator=deadline
May 27 14:33:46 db2189 kernel: x86/tme: not enabled by BIOS
Planned reboot on May 26:
May 26 11:00:07 db2189 ferm[1727008]: Stopping Firewall: ferm.
May 26 11:00:07 db2189 systemd[1]: ferm.service: Deactivated successfully.
May 26 11:00:07 db2189 systemd[1]: Stopped ferm.service - ferm firewall configuration.
May 26 11:00:07 db2189 systemd[1]: Reached target shutdown.target - System Shutdown.
May 26 11:00:07 db2189 systemd[1]: Reached target final.target - Late Shutdown Services.
May 26 11:00:07 db2189 systemd[1]: systemd-reboot.service: Deactivated successfully.
May 26 11:00:07 db2189 systemd[1]: Finished systemd-reboot.service - System Reboot.
May 26 11:00:07 db2189 systemd[1]: Reached target reboot.target - System Reboot.
May 26 11:00:07 db2189 systemd[1]: Shutting down.
May 26 11:00:07 db2189 systemd[1]: Using hardware watchdog 'iTCO_wdt', version 4, device /dev/watchdog0
May 26 11:00:07 db2189 systemd[1]: Watchdog running with a hardware timeout of 10min.
May 26 11:00:07 db2189 kernel: watchdog: watchdog0: watchdog did not stop!
May 26 11:00:07 db2189 systemd-shutdown[1]: Using hardware watchdog 'iTCO_wdt', version 4, device /dev/watchdog0
May 26 11:00:07 db2189 systemd-shutdown[1]: Watchdog running with a hardware timeout of 10min.
May 26 11:00:07 db2189 systemd-shutdown[1]: Syncing filesystems and block devices.
May 26 11:00:07 db2189 systemd-shutdown[1]: Sending SIGTERM to remaining processes...
May 26 11:00:07 db2189 systemd-journald[2882953]: Received SIGTERM from PID 1 (systemd-shutdow).
May 26 11:00:07 db2189 systemd-journald[2882953]: Journal stopped
-- Boot 9697fec88cae47e9bfe79fa19c35a0ad --
May 26 11:02:38 db2189 kernel: Linux version 6.12.88+deb13-amd64 (debian-kernel@lists.debian.org) (x86_64-linux-gnu-gcc-14 (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44) #1 SMP PREEMPT_DYNAMIC Debian 6.12.88-1 (2026-05-15)
May 26 11:02:38 db2189 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.12.88+deb13-amd64 root=UUID=72d41885-6bd1-4362-999e-09a99d7a5406 ro console=ttyS1,115200n8 raid0.default_layout=2 elevator=deadline
May 26 11:02:38 db2189 kernel: x86/tme: not enabled by BIOS
May 26 11:02:38 db2189 kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Prometheus metrics disappeared on May 26, an hour before the page: https://grafana.wikimedia.org/goto/bfnjbsk11296oc?orgId=1
The next day (28):
I'm seeing the following errors in the logs that look a bit suspicious, specifically "N/A, transition to Non-recoverable ; CPU 2 ;". Could it be a hardware issue?
Thu, May 28
@VRiley-WMF the host is not responding on ssh and not generating metrics so maybe it did not power up. Please update the firmware and tomorrow I'll try to powercycle it.
getsel:
-------------------------------------------------------------------------------
Record: 57 Date/Time: 05/28/2026 15:06:07 Source: system Severity: Ok Description: A problem was detected related to the previous server boot.
Record: 58 Date/Time: 05/28/2026 15:06:07 Source: system Severity: Critical Description: CPU 2 machine check error detected.
Record: 59 Date/Time: 05/28/2026 15:06:07 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 60 Date/Time: 05/28/2026 15:06:07 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 61 Date/Time: 05/28/2026 15:06:08 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 62 Date/Time: 05/28/2026 15:06:08 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 63 Date/Time: 05/28/2026 15:06:08 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 64 Date/Time: 05/28/2026 15:06:08 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 65 Date/Time: 05/28/2026 15:06:08 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 66 Date/Time: 05/28/2026 15:06:08 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 67 Date/Time: 05/28/2026 15:06:08 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 68 Date/Time: 05/28/2026 15:06:09 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 69 Date/Time: 05/28/2026 15:06:09 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 70 Date/Time: 05/28/2026 15:06:09 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 71 Date/Time: 05/28/2026 15:06:09 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 72 Date/Time: 05/28/2026 15:06:09 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 73 Date/Time: 05/28/2026 15:06:09 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 74 Date/Time: 05/28/2026 15:06:09 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 75 Date/Time: 05/28/2026 15:06:09 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 76 Date/Time: 05/28/2026 15:06:09 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 77 Date/Time: 05/28/2026 15:06:09 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 78 Date/Time: 05/28/2026 15:06:10 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 79 Date/Time: 05/28/2026 15:06:10 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 80 Date/Time: 05/28/2026 15:06:10 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 81 Date/Time: 05/28/2026 15:06:10 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 82 Date/Time: 05/28/2026 15:06:10 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 83 Date/Time: 05/28/2026 15:06:10 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 84 Date/Time: 05/28/2026 15:06:10 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 85 Date/Time: 05/28/2026 15:06:11 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 86 Date/Time: 05/28/2026 15:06:11 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 87 Date/Time: 05/28/2026 15:06:11 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 88 Date/Time: 05/28/2026 15:06:11 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 89 Date/Time: 05/28/2026 15:06:11 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 90 Date/Time: 05/28/2026 15:06:11 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 91 Date/Time: 05/28/2026 15:06:11 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 92 Date/Time: 05/28/2026 15:06:11 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 93 Date/Time: 05/28/2026 15:06:12 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 94 Date/Time: 05/28/2026 15:06:12 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 95 Date/Time: 05/28/2026 15:06:12 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 96 Date/Time: 05/28/2026 15:06:12 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 97 Date/Time: 05/28/2026 15:06:12 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 98 Date/Time: 05/28/2026 15:06:12 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
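Most of the records above are "Ok" diagnostic events, so the single Critical entry is easy to miss. A rough sketch of how the non-Ok records could be picked out of pasted getsel text follows. The regular expression is an assumption based only on the field layout visible in this paste (Record, Date/Time, Source, Severity, Description) and the function name is hypothetical.

```python
import re
from typing import Dict, List

# Matches one SEL record; the field order is assumed from the paste above.
RECORD_RE = re.compile(
    r"Record:\s*(?P<record>\d+)\s+"
    r"Date/Time:\s*(?P<when>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+"
    r"Source:\s*(?P<source>\S+)\s+"
    r"Severity:\s*(?P<severity>\S+)\s+"
    # The description ends at a dashed separator, the next record, or the end.
    r"Description:\s*(?P<description>.*?)\s*(?=-{5,}|Record:|\Z)",
    re.DOTALL,
)


def non_ok_records(text: str) -> List[Dict[str, str]]:
    """Return every record whose severity is anything other than 'Ok'."""
    return [
        match.groupdict()
        for match in RECORD_RE.finditer(text)
        if match.group("severity").lower() != "ok"
    ]
```

Run against the paste above, this is expected to surface only the machine check record for CPU 2.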
UI form added to https://zarcillo.wikimedia.org/ui/hosts - it should suggest only hosts that have no live instances
Example logs from local testbed:
INFO Deleting host db-test1002
DEBUG query 'SELECT instance_name, hostname, dc, port, instance_group, fqdn, section FROM instances_view WHERE hostname = :hn' returned 1
DEBUG query 'SELECT hb.hostname AS hn, hb.section, MAX(lag) AS lag FROM heartbeat_status hb JOIN dns_a dns ON hb.replication_source_ipv4 = dns.ipv4addr GROUP BY section, hn' returned 318
DEBUG query 'SELECT * FROM alertmanager' returned 17
DEBUG query 'SELECT hostname, section, role, MIN(pooled) AS pooled FROM noc_dbs GROUP BY hostname, section' returned 244
DEBUG query 'SELECT * FROM prometheus_kernel_dbs' returned 306
DEBUG query 'SELECT * FROM candidates' returned 34
DEBUG query 'DELETE FROM locks WHERE instance = :instance' returned 1 rows
DEBUG query 'DELETE FROM section_instances WHERE instance = :instance' returned 1 rows
DEBUG query 'DELETE FROM instances WHERE server = :fqdn' returned 1 rows
DEBUG query 'DELETE FROM host_meta WHERE hostname = :hostname' returned 0 rows
DEBUG query 'DELETE FROM puppet_hiera WHERE hostname = :hostname' returned 0 rows
DEBUG query 'DELETE FROM puppet_roles WHERE fqdn = :fqdn' returned 0 rows
INFO Deleted host db-test1002
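The DELETE sequence in the log above can be sketched as a single transaction. This is only an illustration: the statements are copied from the log, but the use of sqlite3, the one-column demo tables, the function names and the example FQDN are all assumptions and do not reflect the actual zarcillo code or schema.

```python
import sqlite3
from typing import Dict

# DELETE statements copied from the log, in the order they were logged.
DELETE_STATEMENTS = [
    "DELETE FROM locks WHERE instance = :instance",
    "DELETE FROM section_instances WHERE instance = :instance",
    "DELETE FROM instances WHERE server = :fqdn",
    "DELETE FROM host_meta WHERE hostname = :hostname",
    "DELETE FROM puppet_hiera WHERE hostname = :hostname",
    "DELETE FROM puppet_roles WHERE fqdn = :fqdn",
]


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory stand-in with hypothetical one-column tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE locks (instance TEXT);
        CREATE TABLE section_instances (instance TEXT);
        CREATE TABLE instances (server TEXT);
        CREATE TABLE host_meta (hostname TEXT);
        CREATE TABLE puppet_hiera (hostname TEXT);
        CREATE TABLE puppet_roles (fqdn TEXT);
        INSERT INTO locks VALUES ('db-test1002');
        INSERT INTO section_instances VALUES ('db-test1002');
        INSERT INTO instances VALUES ('db-test1002.example.org');
        """
    )
    return conn


def delete_host(conn: sqlite3.Connection, instance: str, fqdn: str,
                hostname: str) -> Dict[str, int]:
    """Run every DELETE in order and return the affected row counts."""
    params = {"instance": instance, "fqdn": fqdn, "hostname": hostname}
    counts: Dict[str, int] = {}
    # The connection context manager commits on success and rolls back
    # on an exception, so a failure leaves no partially deleted host.
    with conn:
        for statement in DELETE_STATEMENTS:
            counts[statement] = conn.execute(statement, params).rowcount
    return counts
```

Wrapping the deletes in one transaction is a design choice of this sketch; the log does not show whether the real implementation does the same.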
No error in the logs, replication is catching up.
The host was shut down cleanly so I can check and repool it.
Wed, May 27
Related to this and T427381: I updated the description with some more detailed steps
(added a long downtime just in case)
https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1294265 is related. Tested with:
There are no events in getsel after 06/13/2025 14:24:15
Yes, the initial fix should be ready for CR shortly
(we discussed taking the opportunity to make the pool/depool codebase simpler and more procedural and splitting the pool/depool parts in the two cookbooks)
Tue, May 26
db2196, db2221 and db2222 have had their silences removed and are fully pooled in
es2042 and es2041 in section es4 have been switched: es2041 is now a replica and can be depooled