
FCeratto-WMF (Federico Ceratto)
Site Reliability Engineer

User Details

User Since
Jan 7 2025, 6:49 PM (74 w, 5 d)
Availability
Available
IRC Nick
federico3
LDAP User
Federico Ceratto
MediaWiki User
FCeratto-WMF

Recent Activity

Today

FCeratto-WMF created T429128: Zarcillo: show alertmanager silences.
Mon, Jun 15, 8:02 AM · DBA
FCeratto-WMF closed T384212: Create a dashboard to show depooled hosts, a subtask of T384810: MariaDB lifetime management system, as Resolved.
Mon, Jun 15, 8:01 AM · Patch-For-Review, DBA
FCeratto-WMF closed T384212: Create a dashboard to show depooled hosts as Resolved.
Mon, Jun 15, 8:01 AM · DBA

Thu, Jun 11

FCeratto-WMF closed T426083: Switchover s1 master (db1163 -> db1184) as Resolved.
Thu, Jun 11, 6:45 AM · DBA
FCeratto-WMF updated the task description for T426083: Switchover s1 master (db1163 -> db1184).
Thu, Jun 11, 6:45 AM · DBA
FCeratto-WMF changed the status of T426083: Switchover s1 master (db1163 -> db1184) from Open to In Progress.
Thu, Jun 11, 6:20 AM · DBA

Wed, Jun 10

FCeratto-WMF added a comment to T428046: Automate mass upgrades (OS/mariadb).

Various cookbooks and scripts grew organically with different feature sets so I'm putting together a summary table at https://wikitech.wikimedia.org/wiki/MariaDB/Upgrading_a_section#Tool_summary

Wed, Jun 10, 11:14 AM · DBA
FCeratto-WMF closed T428460: Migrate zarcillo/orchestrator DBs to Debian Trixie as Resolved.
Wed, Jun 10, 8:09 AM · DBA
FCeratto-WMF closed T428460: Migrate zarcillo/orchestrator DBs to Debian Trixie, a subtask of T422365: Migration to Debian Trixie of production database-related hosts, as Resolved.
Wed, Jun 10, 8:09 AM · DBA
FCeratto-WMF updated the task description for T428460: Migrate zarcillo/orchestrator DBs to Debian Trixie.
Wed, Jun 10, 8:09 AM · DBA
FCeratto-WMF changed the status of T428460: Migrate zarcillo/orchestrator DBs to Debian Trixie, a subtask of T422365: Migration to Debian Trixie of production database-related hosts, from Open to In Progress.
Wed, Jun 10, 7:23 AM · DBA
FCeratto-WMF changed the status of T428460: Migrate zarcillo/orchestrator DBs to Debian Trixie from Open to In Progress.

We can reimage db2185 first

Wed, Jun 10, 7:23 AM · DBA

Tue, Jun 9

FCeratto-WMF updated the task description for T428460: Migrate zarcillo/orchestrator DBs to Debian Trixie.
Tue, Jun 9, 1:04 PM · DBA
FCeratto-WMF added a comment to T428460: Migrate zarcillo/orchestrator DBs to Debian Trixie.

Related to T426633

Tue, Jun 9, 11:28 AM · DBA
FCeratto-WMF closed T426086: Switchover s4 master (db1160 -> db1244) as Resolved.
Tue, Jun 9, 9:58 AM · DBA
FCeratto-WMF updated the task description for T426086: Switchover s4 master (db1160 -> db1244).
Tue, Jun 9, 9:58 AM · DBA
FCeratto-WMF updated the task description for T426086: Switchover s4 master (db1160 -> db1244).
Tue, Jun 9, 9:57 AM · DBA
FCeratto-WMF added a comment to T426086: Switchover s4 master (db1160 -> db1244).

Schema change completed.

Tue, Jun 9, 9:33 AM · DBA
FCeratto-WMF updated the task description for T426083: Switchover s1 master (db1163 -> db1184).
Tue, Jun 9, 9:30 AM · DBA
FCeratto-WMF updated the task description for T426083: Switchover s1 master (db1163 -> db1184).
Tue, Jun 9, 9:27 AM · DBA
FCeratto-WMF updated the task description for T426086: Switchover s4 master (db1160 -> db1244).
Tue, Jun 9, 7:40 AM · DBA
FCeratto-WMF added a comment to T428541: authdns-update failing.

This is also preventing https://gerrit.wikimedia.org/r/c/operations/dns/+/1286411 from being merged

Tue, Jun 9, 6:30 AM · SRE, Traffic, DNS
FCeratto-WMF updated the task description for T426086: Switchover s4 master (db1160 -> db1244).
Tue, Jun 9, 6:25 AM · DBA
FCeratto-WMF updated the task description for T426086: Switchover s4 master (db1160 -> db1244).
Tue, Jun 9, 6:15 AM · DBA
FCeratto-WMF updated the task description for T426086: Switchover s4 master (db1160 -> db1244).
Tue, Jun 9, 6:03 AM · DBA
FCeratto-WMF changed the status of T426086: Switchover s4 master (db1160 -> db1244) from Open to In Progress.
Tue, Jun 9, 5:58 AM · DBA

Fri, Jun 5

FCeratto-WMF created T428264: wmfdb upgrades.
Fri, Jun 5, 2:10 PM · DBA
FCeratto-WMF added a comment to T428240: db1274 is not booting up.

@Marostegui db1274 is not ready for replication (there's no /srv data, no MariaDB installed and it's not in zarcillo yet) as it's part of the new batch.

Fri, Jun 5, 1:19 PM · SRE, ops-eqiad, DC-Ops, DBA
FCeratto-WMF updated the task description for T419635: Drop il_to column from imagelinks table in wmf production.
Fri, Jun 5, 1:15 PM · Data-Engineering-Radar, Schema-change-in-production, Data-Engineering, DBA
FCeratto-WMF updated the task description for T428240: db1274 is not booting up.
Fri, Jun 5, 10:15 AM · SRE, ops-eqiad, DC-Ops, DBA
FCeratto-WMF added a subtask for T407942: Productionize db12[65-90]: T428240: db1274 is not booting up.
Fri, Jun 5, 10:13 AM · DBA
FCeratto-WMF added a parent task for T428240: db1274 is not booting up: T407942: Productionize db12[65-90].
Fri, Jun 5, 10:13 AM · SRE, ops-eqiad, DC-Ops, DBA
FCeratto-WMF created T428240: db1274 is not booting up.
Fri, Jun 5, 10:12 AM · SRE, ops-eqiad, DC-Ops, DBA

Thu, Jun 4

FCeratto-WMF placed T427535: db1224 is unreachable up for grabs.
Thu, Jun 4, 1:17 PM · SRE, DC-Ops, ops-eqiad, DBA
FCeratto-WMF reopened T427535: db1224 is unreachable as "Open".

The host crashed again.

Thu, Jun 4, 1:16 PM · SRE, DC-Ops, ops-eqiad, DBA

Wed, Jun 3

FCeratto-WMF added a comment to T428046: Automate mass upgrades (OS/mariadb).

We can follow the same pattern as the schema change helper and rolling restarts: walk across DCs and sections in the safest sequence, with optional CLI flags to limit scope, e.g. --sections s1,s4 --dc codfw.
We can reuse a good chunk of existing code for this.

Wed, Jun 3, 12:30 PM · DBA
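
A minimal sketch of the scope-limiting flags mentioned in the comment above, assuming a plain argparse parser. Only the --sections and --dc flag names come from the comment; the helper names (parse_sections, build_parser, limit_scope) and the list of datacenters are hypothetical and not taken from any existing cookbook.

```python
import argparse

# Assumption: the datacenter names accepted by --dc, for illustration only.
KNOWN_DCS = ("eqiad", "codfw")


def parse_sections(value):
    """Turn a comma-separated string such as 's1,s4' into a list of sections."""
    sections = [item.strip() for item in value.split(",") if item.strip()]
    if not sections:
        raise argparse.ArgumentTypeError("expected a list such as s1,s4")
    return sections


def build_parser():
    """Build a parser with the two optional scope-limiting flags."""
    parser = argparse.ArgumentParser(description="Mass upgrade sketch")
    parser.add_argument("--sections", type=parse_sections, default=None,
                        help="comma-separated sections, e.g. s1,s4")
    parser.add_argument("--dc", choices=KNOWN_DCS, default=None,
                        help="limit the run to one datacenter")
    return parser


def limit_scope(targets, sections=None, dc=None):
    """Filter (dc, section) pairs while preserving the incoming order.

    The caller is assumed to pass targets already sorted in the safest
    sequence; this function only narrows the list, it never reorders it.
    """
    return [
        (target_dc, section)
        for target_dc, section in targets
        if (dc is None or target_dc == dc)
        and (sections is None or section in sections)
    ]
```

With no flags given, limit_scope returns the full sequence unchanged, so the default behaviour remains a walk across every DC and section.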

Tue, Jun 2

FCeratto-WMF closed T426095: Switchover s8 master (db1209 -> db1193), a subtask of T419635: Drop il_to column from imagelinks table in wmf production, as Resolved.
Tue, Jun 2, 9:17 AM · Data-Engineering-Radar, Schema-change-in-production, Data-Engineering, DBA
FCeratto-WMF closed T426095: Switchover s8 master (db1209 -> db1193) as Resolved.
Tue, Jun 2, 9:17 AM · DBA
FCeratto-WMF updated the task description for T426095: Switchover s8 master (db1209 -> db1193).
Tue, Jun 2, 9:17 AM · DBA
FCeratto-WMF updated the task description for T427301: codfw: rack A3 maintenance.
Tue, Jun 2, 8:59 AM · DBA, ServiceOps new, Infrastructure-Foundations, netops
FCeratto-WMF updated the task description for T427301: codfw: rack A3 maintenance.
Tue, Jun 2, 8:59 AM · DBA, ServiceOps new, Infrastructure-Foundations, netops

Mon, Jun 1

FCeratto-WMF created P93463 (An Untitled Masterwork).
Mon, Jun 1, 4:58 PM
FCeratto-WMF claimed T427535: db1224 is unreachable.

Thanks @VRiley-WMF
journald is not showing hardware errors.
MariaDB started cleanly, replication is catching up as expected.
https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=%24__all&var-server=db1224&var-port=9104&from=now-30m&to=now&timezone=utc

Mon, Jun 1, 1:32 PM · SRE, DC-Ops, ops-eqiad, DBA
FCeratto-WMF closed T427381: sre.mysql.depool: do not depend on .list_hosts_instances() for depooling as Resolved.
Mon, Jun 1, 12:03 PM · DBA
FCeratto-WMF created P93431 (An Untitled Masterwork).
Mon, Jun 1, 11:27 AM
FCeratto-WMF created P93426 (An Untitled Masterwork).
Mon, Jun 1, 11:00 AM
FCeratto-WMF added a comment to T426318: Add support for automatic downtime when depooling instances using sre.mysql.depool.

The cookbook itself could compute the expected downtime expiration time when setting it (just datetime.now() + timedelta(...), nothing fancy)

Mon, Jun 1, 8:29 AM · DBA
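
The computation suggested in the comment above could look roughly like this. It is a sketch only: the function name downtime_expiration is hypothetical, and the use of timezone-aware UTC timestamps is an assumption rather than something the comment specifies.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def downtime_expiration(duration: timedelta,
                        now: Optional[datetime] = None) -> datetime:
    """Return the time at which a downtime set right now would expire.

    `now` can be injected for testing; by default the current UTC time
    is used (an assumption, the comment only says datetime.now()).
    """
    if duration <= timedelta(0):
        raise ValueError("downtime duration must be positive")
    start = now if now is not None else datetime.now(timezone.utc)
    return start + duration
```

Accepting `now` as a parameter keeps the function deterministic in tests while leaving the call site as simple as downtime_expiration(timedelta(hours=4)).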

Sat, May 30

FCeratto-WMF added a comment to T427535: db1224 is unreachable.

Yes, MariaDB is not running.

Sat, May 30, 6:22 PM · SRE, DC-Ops, ops-eqiad, DBA

Fri, May 29

FCeratto-WMF added a comment to P93403 Page 8024 for db2189 phabricator.wikimedia.org/T427376.

Crash on the 26th after 48 minutes of uptime:

May 26 11:48:36 db2189 kernel:  ? __pfx_flush_tlb_func+0x10/0x10
May 26 11:48:36 db2189 kernel:  on_each_cpu_cond_mask+0x24/0x40
May 26 11:48:36 db2189 kernel:  arch_tlbbatch_flush+0xe7/0x100
May 26 11:48:36 db2189 kernel:  try_to_unmap_flush+0x2d/0x40
May 26 11:48:36 db2189 kernel:  migrate_pages_batch+0x741/0xa50
May 26 11:48:36 db2189 kernel:  migrate_pages+0x960/0xb70
May 26 11:48:36 db2189 kernel:  ? __pfx_alloc_misplaced_dst_folio+0x10/0x10
May 26 11:48:36 db2189 kernel:  migrate_misplaced_folio+0xda/0x290
May 26 11:48:36 db2189 kernel:  __handle_mm_fault+0xcfb/0xf70
May 26 11:48:36 db2189 kernel:  handle_mm_fault+0xe2/0x2c0
May 26 11:48:36 db2189 kernel:  do_user_addr_fault+0x217/0x620
May 26 11:48:36 db2189 kernel:  exc_page_fault+0x7e/0x180
May 26 11:48:36 db2189 kernel:  asm_exc_page_fault+0x26/0x30
May 26 11:48:36 db2189 kernel: RIP: 0033:0x55e2dbe40614
May 26 11:48:36 db2189 kernel: Code: 00 00 0f 1f 40 00 55 48 89 d0 4c 89 c2 48 89 e5 41 57 41 56 41 55 49 89 cd 41 54 49 89 f4 53 48 89 fb 48 83 ec 28 48 8b 76 18 <0f> b6 4f fd f6 46 3c 01 0f 84 56 01 00 00 31 f6 45 31 ff 83 e1 07
May 26 11:48:36 db2189 kernel: RSP: 002b:00007f1603989ae0 EFLAGS: 00010202
May 26 11:48:36 db2189 kernel: RAX: 00007f1603989bd0 RBX: 00007f25f1362084 RCX: 0000000000000003
May 26 11:48:36 db2189 kernel: RDX: 0000000000000002 RSI: 00007f15fd6bbd80 RDI: 00007f25f1362084
May 26 11:48:36 db2189 kernel: RBP: 00007f1603989b30 R08: 0000000000000002 R09: 00007f1603989bb0
May 26 11:48:36 db2189 kernel: R10: 00007f141004f536 R11: 000000000000000d R12: 00007f15fd6f0620
May 26 11:48:36 db2189 kernel: R13: 0000000000000003 R14: 0000000000000043 R15: 000000000000004d
May 26 11:48:36 db2189 kernel:  </TASK>
May 26 11:48:36 db2189 kernel: watchdog: BUG: soft lockup - CPU#21 stuck for 23s! [mysqld:13037]
May 26 11:48:36 db2189 kernel: Modules linked in: tcp_diag inet_diag binfmt_misc ipmi_ssif intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac skx_edac_common nfit libnvdimm dell_pc x86_pkg_temp_thermal platform_profile intel_powerclamp xfs coretemp crct10dif_pclmul gha>
May 26 11:48:36 db2189 kernel:  configfs nfnetlink ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic cdc_ether usbnet mii dm_mod sd_mod ahci libahci xhci_pci iTCO_wdt intel_pmc_bxt libata xhci_hcd megaraid_sas iTCO_vendor_support tg3 watchdog crc32_pclmul bnxt_en scsi_mod usbcore libphy crc32c_inte>
May 26 11:48:36 db2189 kernel: CPU: 21 UID: 498 PID: 13037 Comm: mysqld Tainted: G             L     6.12.88+deb13-amd64 #1  Debian 6.12.88-1
May 26 11:48:36 db2189 kernel: Tainted: [L]=SOFTLOCKUP
May 26 11:48:36 db2189 kernel: Hardware name: Dell Inc. PowerEdge R650xs/05FK0J, BIOS 1.10.2 03/03/2023
May 26 11:48:36 db2189 kernel: RIP: 0010:smp_call_function_many_cond+0x345/0x4c0
May 26 11:48:36 db2189 kernel: Code: e8 20 2e 4d 00 3b 05 9a 15 ca 01 0f 83 e1 fd ff ff 48 63 d0 49 8b 34 24 48 03 34 d5 20 6e 5f 9b 8b 56 08 83 e2 01 74 0a f3 90 <8b> 4e 08 83 e1 01 75 f6 83 c0 01 eb bb 65 8b 05 b3 fb e6 65 48 0f
May 26 11:48:36 db2189 kernel: RSP: 0000:ff45268d6abc7ae0 EFLAGS: 00000202
May 26 11:48:36 db2189 kernel: RAX: 0000000000000007 RBX: 0000000000000202 RCX: 0000000000000001
May 26 11:48:36 db2189 kernel: RDX: 0000000000000001 RSI: ff21ee203f1bd8a0 RDI: 0000000000000007
May 26 11:48:36 db2189 kernel: RBP: ff21ee203f535180 R08: ff21ede146282590 R09: 0000000000000000
May 26 11:48:36 db2189 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ff21ee203f537180
May 26 11:48:36 db2189 kernel: R13: ff21ede146282f58 R14: 0000000000000015 R15: 0000000000000020
May 26 11:48:36 db2189 kernel: FS:  00007f1387fff6c0(0000) GS:ff21ee203f500000(0000) knlGS:0000000000000000
May 26 11:48:36 db2189 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 26 11:48:36 db2189 kernel: CR2: 00007f768d5d6300 CR3: 000000408d230003 CR4: 0000000000771ef0
May 26 11:48:36 db2189 kernel: PKRU: 55555554
May 26 11:48:36 db2189 kernel: Call Trace:
May 26 11:48:36 db2189 kernel:  <TASK>
May 26 11:48:36 db2189 kernel:  ? __pfx_flush_tlb_func+0x10/0x10
May 26 11:48:36 db2189 kernel:  on_each_cpu_cond_mask+0x24/0x40
May 26 11:48:36 db2189 kernel:  arch_tlbbatch_flush+0xe7/0x100
May 26 11:48:36 db2189 kernel:  try_to_unmap_flush+0x2d/0x40
May 26 11:48:36 db2189 kernel:  migrate_pages_batch+0x741/0xa50
May 26 11:48:36 db2189 kernel:  migrate_pages+0x960/0xb70
May 26 11:48:36 db2189 kernel:  ? __pfx_alloc_misplaced_dst_folio+0x10/0x10
May 26 11:48:36 db2189 kernel:  migrate_misplaced_folio+0xda/0x290
May 26 11:48:36 db2189 kernel:  __handle_mm_fault+0xcfb/0xf70
May 26 11:48:36 db2189 kernel:  handle_mm_fault+0xe2/0x2c0
May 26 11:48:36 db2189 kernel:  do_user_addr_fault+0x217/0x620
May 26 11:48:36 db2189 kernel:  exc_page_fault+0x7e/0x180
May 26 11:48:36 db2189 kernel:  asm_exc_page_fault+0x26/0x30
May 26 11:48:36 db2189 kernel: RIP: 0033:0x55e2dbd0ccb8
May 26 11:48:36 db2189 kernel: Code: 5d 01 8b 10 48 c1 e2 05 48 01 fa 48 39 d7 73 55 66 0f 6f 05 1a 72 4c 00 66 0f ef c9 66 0f 1f 44 00 00 48 8b 07 48 85 c0 74 2e <49> 01 07 48 8b 47 08 49 01 47 08 48 8b 47 10 49 3b 47 10 0f 82 4f
May 26 11:48:36 db2189 kernel: RSP: 002b:00007f1387ffec80 EFLAGS: 00010202
May 26 11:48:36 db2189 kernel: RAX: 0000000000000176 RBX: 00007f768d790040 RCX: 0000000000000000
May 26 11:48:36 db2189 kernel: RDX: 00007f768fd255c0 RSI: 00007f768d790040 RDI: 00007f768fd25100
May 26 11:48:36 db2189 kernel: RBP: 00007f1387ffecc0 R08: 0013bb7d9236ba0f R09: 0000000000000923
May 26 11:48:36 db2189 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f768ff8aa40
May 26 11:48:36 db2189 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 00007f768d5d6300
May 26 11:48:36 db2189 kernel:  </TASK>
May 26 11:48:36 db2189 kernel: watchdog: BUG: soft lockup - CPU#32 stuck for 78s! [mysqld:12782]
-- Boot 60f2a03b401641fe9a0553be3de9fffb --
May 27 14:33:46 db2189 kernel: Linux version 6.12.88+deb13-amd64 (debian-kernel@lists.debian.org) (x86_64-linux-gnu-gcc-14 (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44) #1 SMP PREEMPT_DYNAMIC Debian 6.12.88-1 (2026-05-15)
May 27 14:33:46 db2189 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.12.88+deb13-amd64 root=UUID=72d41885-6bd1-4362-999e-09a99d7a5406 ro console=ttyS1,115200n8 raid0.default_layout=2 elevator=deadline
May 27 14:33:46 db2189 kernel: x86/tme: not enabled by BIOS
Fri, May 29, 3:28 PM
FCeratto-WMF added a comment to P93403 Page 8024 for db2189 phabricator.wikimedia.org/T427376.

Planned reboot on the 26th:

May 26 11:00:07 db2189 ferm[1727008]: Stopping Firewall: ferm.
May 26 11:00:07 db2189 systemd[1]: ferm.service: Deactivated successfully.
May 26 11:00:07 db2189 systemd[1]: Stopped ferm.service - ferm firewall configuration.
May 26 11:00:07 db2189 systemd[1]: Reached target shutdown.target - System Shutdown.
May 26 11:00:07 db2189 systemd[1]: Reached target final.target - Late Shutdown Services.
May 26 11:00:07 db2189 systemd[1]: systemd-reboot.service: Deactivated successfully.
May 26 11:00:07 db2189 systemd[1]: Finished systemd-reboot.service - System Reboot.
May 26 11:00:07 db2189 systemd[1]: Reached target reboot.target - System Reboot.
May 26 11:00:07 db2189 systemd[1]: Shutting down.
May 26 11:00:07 db2189 systemd[1]: Using hardware watchdog 'iTCO_wdt', version 4, device /dev/watchdog0
May 26 11:00:07 db2189 systemd[1]: Watchdog running with a hardware timeout of 10min.
May 26 11:00:07 db2189 kernel: watchdog: watchdog0: watchdog did not stop!
May 26 11:00:07 db2189 systemd-shutdown[1]: Using hardware watchdog 'iTCO_wdt', version 4, device /dev/watchdog0
May 26 11:00:07 db2189 systemd-shutdown[1]: Watchdog running with a hardware timeout of 10min.
May 26 11:00:07 db2189 systemd-shutdown[1]: Syncing filesystems and block devices.
May 26 11:00:07 db2189 systemd-shutdown[1]: Sending SIGTERM to remaining processes...
May 26 11:00:07 db2189 systemd-journald[2882953]: Received SIGTERM from PID 1 (systemd-shutdow).
May 26 11:00:07 db2189 systemd-journald[2882953]: Journal stopped
-- Boot 9697fec88cae47e9bfe79fa19c35a0ad --
May 26 11:02:38 db2189 kernel: Linux version 6.12.88+deb13-amd64 (debian-kernel@lists.debian.org) (x86_64-linux-gnu-gcc-14 (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44) #1 SMP PREEMPT_DYNAMIC Debian 6.12.88-1 (2026-05-15)
May 26 11:02:38 db2189 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.12.88+deb13-amd64 root=UUID=72d41885-6bd1-4362-999e-09a99d7a5406 ro console=ttyS1,115200n8 raid0.default_layout=2 elevator=deadline
May 26 11:02:38 db2189 kernel: x86/tme: not enabled by BIOS
May 26 11:02:38 db2189 kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Fri, May 29, 3:27 PM
FCeratto-WMF updated the title for P93403 Page 8024 for db2189 phabricator.wikimedia.org/T427376 from untitled to Page 8024 for db2189 phabricator.wikimedia.org/T427376.
Fri, May 29, 3:24 PM
FCeratto-WMF added a comment to P93403 Page 8024 for db2189 phabricator.wikimedia.org/T427376.

Prometheus metrics disappeared on the 26th, an hour before the page: https://grafana.wikimedia.org/goto/bfnjbsk11296oc?orgId=1

Fri, May 29, 3:23 PM
FCeratto-WMF added a comment to P93403 Page 8024 for db2189 phabricator.wikimedia.org/T427376.

The next day (28):

Fri, May 29, 3:15 PM
FCeratto-WMF created P93402 (An Untitled Masterwork).
Fri, May 29, 2:54 PM
FCeratto-WMF closed T427388: db2212 failed to reboot as Resolved.
Fri, May 29, 11:44 AM · SRE, ops-codfw, DC-Ops, DBA
FCeratto-WMF created P93399 (An Untitled Masterwork).
Fri, May 29, 10:31 AM
FCeratto-WMF added a comment to T427535: db1224 is unreachable.

I'm seeing errors in the logs that look a bit suspicious, specifically "N/A, transition to Non-recoverable ; CPU 2 ;". Could it be a hardware issue?

Fri, May 29, 7:22 AM · SRE, DC-Ops, ops-eqiad, DBA

Thu, May 28

FCeratto-WMF added a comment to T427535: db1224 is unreachable.

@VRiley-WMF the host is not responding to SSH and is not generating metrics, so maybe it did not power up. Please update the firmware and I'll try to power-cycle it tomorrow.

Thu, May 28, 6:48 PM · SRE, DC-Ops, ops-eqiad, DBA
FCeratto-WMF updated the task description for T427535: db1224 is unreachable.
Thu, May 28, 4:24 PM · SRE, DC-Ops, ops-eqiad, DBA
FCeratto-WMF triaged T427535: db1224 is unreachable as High priority.
Thu, May 28, 4:21 PM · SRE, DC-Ops, ops-eqiad, DBA
FCeratto-WMF added a comment to T427535: db1224 is unreachable.

Dashboard: https://grafana.wikimedia.org/goto/cfnfwwukbq0hsd?orgId=1

Thu, May 28, 4:21 PM · SRE, DC-Ops, ops-eqiad, DBA
FCeratto-WMF added a comment to T427535: db1224 is unreachable.

getsel:

-------------------------------------------------------------------------------
Record:      57
Date/Time:   05/28/2026 15:06:07
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      58
Date/Time:   05/28/2026 15:06:07
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
Record:      59
Date/Time:   05/28/2026 15:06:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      60
Date/Time:   05/28/2026 15:06:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      61
Date/Time:   05/28/2026 15:06:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      62
Date/Time:   05/28/2026 15:06:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      63
Date/Time:   05/28/2026 15:06:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      64
Date/Time:   05/28/2026 15:06:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      65
Date/Time:   05/28/2026 15:06:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      66
Date/Time:   05/28/2026 15:06:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      67
Date/Time:   05/28/2026 15:06:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      68
Date/Time:   05/28/2026 15:06:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      69
Date/Time:   05/28/2026 15:06:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      70
Date/Time:   05/28/2026 15:06:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      71
Date/Time:   05/28/2026 15:06:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      72
Date/Time:   05/28/2026 15:06:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      73
Date/Time:   05/28/2026 15:06:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      74
Date/Time:   05/28/2026 15:06:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      75
Date/Time:   05/28/2026 15:06:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      76
Date/Time:   05/28/2026 15:06:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      77
Date/Time:   05/28/2026 15:06:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      78
Date/Time:   05/28/2026 15:06:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      79
Date/Time:   05/28/2026 15:06:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      80
Date/Time:   05/28/2026 15:06:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      81
Date/Time:   05/28/2026 15:06:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      82
Date/Time:   05/28/2026 15:06:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      83
Date/Time:   05/28/2026 15:06:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      84
Date/Time:   05/28/2026 15:06:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      85
Date/Time:   05/28/2026 15:06:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      86
Date/Time:   05/28/2026 15:06:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      87
Date/Time:   05/28/2026 15:06:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      88
Date/Time:   05/28/2026 15:06:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      89
Date/Time:   05/28/2026 15:06:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      90
Date/Time:   05/28/2026 15:06:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      91
Date/Time:   05/28/2026 15:06:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      92
Date/Time:   05/28/2026 15:06:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      93
Date/Time:   05/28/2026 15:06:12
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      94
Date/Time:   05/28/2026 15:06:12
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      95
Date/Time:   05/28/2026 15:06:12
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      96
Date/Time:   05/28/2026 15:06:12
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      97
Date/Time:   05/28/2026 15:06:12
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      98
Date/Time:   05/28/2026 15:06:12
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Thu, May 28, 4:20 PM · SRE, DC-Ops, ops-eqiad, DBA
FCeratto-WMF created T427535: db1224 is unreachable.
Thu, May 28, 4:19 PM · SRE, DC-Ops, ops-eqiad, DBA
FCeratto-WMF added a comment to T426613: Create a way to delete hosts from zarcillo and orchestrator for decommissioning.

UI form added to https://zarcillo.wikimedia.org/ui/hosts - it should suggest only hosts that have no live instances

Thu, May 28, 3:19 PM · Patch-For-Review, DBA
FCeratto-WMF added a comment to T426613: Create a way to delete hosts from zarcillo and orchestrator for decommissioning.

Example logs from local testbed:

INFO Deleting host db-test1002
DEBUG query 'SELECT instance_name, hostname, dc, port, instance_group, fqdn, section FROM instances_view WHERE hostname = :hn' returned 1
DEBUG query 'SELECT hb.hostname AS hn, hb.section, MAX(lag) AS lag FROM heartbeat_status hb JOIN dns_a dns ON hb.replication_source_ipv4 = dns.ipv4addr GROUP BY section, hn' returned 318
DEBUG query 'SELECT * FROM alertmanager' returned 17
DEBUG query 'SELECT hostname, section, role, MIN(pooled) AS pooled FROM noc_dbs GROUP BY hostname, section' returned 244
DEBUG query 'SELECT * FROM prometheus_kernel_dbs' returned 306
DEBUG query 'SELECT * FROM candidates' returned 34
DEBUG query 'DELETE FROM locks WHERE instance = :instance' returned 1 rows
DEBUG query 'DELETE FROM section_instances WHERE instance = :instance' returned 1 rows
DEBUG query 'DELETE FROM instances WHERE server = :fqdn' returned 1 rows
DEBUG query 'DELETE FROM host_meta WHERE hostname = :hostname' returned 0 rows
DEBUG query 'DELETE FROM puppet_hiera WHERE hostname = :hostname' returned 0 rows
DEBUG query 'DELETE FROM puppet_roles WHERE fqdn = :fqdn' returned 0 rows
INFO Deleted host db-test1002
Thu, May 28, 3:14 PM · Patch-For-Review, DBA
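
The deletion sequence in the log above can be reproduced on a toy database. This is a sketch under stated assumptions, not the zarcillo implementation: the DELETE statements and their order are copied from the log, while the function names, the single-column table definitions, the fqdn value and the use of SQLite are hypothetical stand-ins.

```python
import sqlite3

# Statements and order copied from the log above.
DELETES = (
    "DELETE FROM locks WHERE instance = :instance",
    "DELETE FROM section_instances WHERE instance = :instance",
    "DELETE FROM instances WHERE server = :fqdn",
    "DELETE FROM host_meta WHERE hostname = :hostname",
    "DELETE FROM puppet_hiera WHERE hostname = :hostname",
    "DELETE FROM puppet_roles WHERE fqdn = :fqdn",
)


def delete_host(conn, instance, fqdn, hostname):
    """Run every DELETE in one transaction and return the row counts."""
    params = {"instance": instance, "fqdn": fqdn, "hostname": hostname}
    counts = []
    with conn:  # commits on success, rolls back if any statement fails
        for statement in DELETES:
            counts.append(conn.execute(statement, params).rowcount)
    return counts


def make_testbed():
    """Create an in-memory database with a simplified, assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE locks (instance TEXT);
        CREATE TABLE section_instances (instance TEXT);
        CREATE TABLE instances (server TEXT);
        CREATE TABLE host_meta (hostname TEXT);
        CREATE TABLE puppet_hiera (hostname TEXT);
        CREATE TABLE puppet_roles (fqdn TEXT);
        INSERT INTO locks VALUES ('db-test1002');
        INSERT INTO section_instances VALUES ('db-test1002');
        INSERT INTO instances VALUES ('db-test1002.example');
    """)
    return conn
```

Running all statements inside a single transaction means a failure halfway through leaves no partially deleted host behind.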
FCeratto-WMF claimed T427388: db2212 failed to reboot.
Thu, May 28, 2:46 PM · SRE, ops-codfw, DC-Ops, DBA
FCeratto-WMF reopened T427388: db2212 failed to reboot as "In Progress".

No error in the logs, replication is catching up.

Thu, May 28, 2:46 PM · SRE, ops-codfw, DC-Ops, DBA
FCeratto-WMF added a comment to T427388: db2212 failed to reboot.

The host was shut down cleanly so I can check and repool it.

Thu, May 28, 2:41 PM · SRE, ops-codfw, DC-Ops, DBA
FCeratto-WMF created P93372 (An Untitled Masterwork).
Thu, May 28, 2:09 PM
FCeratto-WMF updated the task description for T419635: Drop il_to column from imagelinks table in wmf production.
Thu, May 28, 9:25 AM · Data-Engineering-Radar, Schema-change-in-production, DBA, Data-Engineering
FCeratto-WMF claimed T427381: sre.mysql.depool: do not depend on .list_hosts_instances() for depooling.
Thu, May 28, 8:54 AM · DBA
FCeratto-WMF claimed T427378: Use pool/depool cookbook in auto schema.
Thu, May 28, 8:53 AM · DBA
FCeratto-WMF moved T427378: Use pool/depool cookbook in auto schema from Triage to Ready on the DBA board.
Thu, May 28, 8:49 AM · DBA
FCeratto-WMF moved T427377: sre.mysql.pool: remove downtime before pooling from Triage to Ready on the DBA board.
Thu, May 28, 8:49 AM · DBA
FCeratto-WMF updated the task description for T426095: Switchover s8 master (db1209 -> db1193).
Thu, May 28, 6:18 AM · DBA
FCeratto-WMF updated the task description for T426095: Switchover s8 master (db1209 -> db1193).
Thu, May 28, 6:09 AM · DBA
FCeratto-WMF changed the status of T426095: Switchover s8 master (db1209 -> db1193) from Open to In Progress.
Thu, May 28, 6:03 AM · DBA
FCeratto-WMF changed the status of T426095: Switchover s8 master (db1209 -> db1193), a subtask of T419635: Drop il_to column from imagelinks table in wmf production, from Open to In Progress.
Thu, May 28, 6:03 AM · Data-Engineering-Radar, Schema-change-in-production, DBA, Data-Engineering
FCeratto-WMF closed T426590: Switchover s4 master (db2179 -> db2240), a subtask of T419635: Drop il_to column from imagelinks table in wmf production, as Resolved.
Thu, May 28, 6:02 AM · Data-Engineering-Radar, Schema-change-in-production, DBA, Data-Engineering
FCeratto-WMF closed T426590: Switchover s4 master (db2179 -> db2240) as Resolved.
Thu, May 28, 6:02 AM · DBA

Wed, May 27

FCeratto-WMF added a comment to T422361: sre.mysql.pool / depool / parsercache cleanup.

Related to this and T427381: I updated the description with more detailed steps.

Wed, May 27, 4:09 PM · Patch-For-Review, DBA
FCeratto-WMF updated the task description for T422361: sre.mysql.pool / depool / parsercache cleanup.
Wed, May 27, 4:08 PM · Patch-For-Review, DBA
FCeratto-WMF renamed T422361: sre.mysql.pool / depool / parsercache cleanup from Rename sre.mysql.parsercache to something else to sre.mysql.pool / depool / parsercache cleanup.
Wed, May 27, 4:06 PM · Patch-For-Review, DBA
FCeratto-WMF added a comment to T427376: db2189 crashed.

(added a long downtime just in case)

Wed, May 27, 2:39 PM · SRE, ops-codfw, DBA, DC-Ops
FCeratto-WMF added a comment to T427381: sre.mysql.depool: do not depend on .list_hosts_instances() for depooling.

https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1294265 is related. Tested with:

Wed, May 27, 1:18 PM · DBA
FCeratto-WMF added a comment to T427388: db2212 failed to reboot.

There are no events in getsel after 06/13/2025 14:24:15

Wed, May 27, 1:00 PM · SRE, ops-codfw, DC-Ops, DBA
FCeratto-WMF added projects to T427388: db2212 failed to reboot: DC-Ops, ops-codfw.
Wed, May 27, 12:57 PM · SRE, ops-codfw, DC-Ops, DBA
FCeratto-WMF added a comment to T427381: sre.mysql.depool: do not depend on .list_hosts_instances() for depooling.

Yes, the initial fix should be ready for CR shortly

Wed, May 27, 11:43 AM · DBA
FCeratto-WMF added a comment to T427381: sre.mysql.depool: do not depend on .list_hosts_instances() for depooling.

(we discussed taking the opportunity to make the pool/depool codebase simpler and more procedural and splitting the pool/depool parts in the two cookbooks)

Wed, May 27, 11:40 AM · DBA
FCeratto-WMF created T427381: sre.mysql.depool: do not depend on .list_hosts_instances() for depooling.
Wed, May 27, 11:35 AM · DBA
FCeratto-WMF created T427378: Use pool/depool cookbook in auto schema.
Wed, May 27, 11:25 AM · DBA
FCeratto-WMF created T427377: sre.mysql.pool: remove downtime before pooling.
Wed, May 27, 11:21 AM · DBA
FCeratto-WMF created P93228 (An Untitled Masterwork).
Wed, May 27, 11:06 AM

Tue, May 26

FCeratto-WMF added a comment to T426199: codfw: rack A2 maintenance.

db2196, db2221 and db2222 have silences removed and are fully pooled-in

Tue, May 26, 3:12 PM · ServiceOps-Upgrades-Hardware, ServiceOps new, Infrastructure-Foundations, netops
FCeratto-WMF added a comment to T426199: codfw: rack A2 maintenance.

es2042 and es2041 in section es4 have been switched: es2041 is now a replica and can be depooled

Tue, May 26, 11:22 AM · ServiceOps-Upgrades-Hardware, ServiceOps new, Infrastructure-Foundations, netops
FCeratto-WMF updated the task description for T419635: Drop il_to column from imagelinks table in wmf production.
Tue, May 26, 8:21 AM · Data-Engineering-Radar, Schema-change-in-production, Data-Engineering, DBA
FCeratto-WMF closed T425622: Switchover s2 master (db1222 -> db1162), a subtask of T419635: Drop il_to column from imagelinks table in wmf production, as Resolved.
Tue, May 26, 6:37 AM · Data-Engineering-Radar, Schema-change-in-production, Data-Engineering, DBA
FCeratto-WMF closed T425622: Switchover s2 master (db1222 -> db1162) as Resolved.
Tue, May 26, 6:37 AM · DBA
FCeratto-WMF closed T425622: Switchover s2 master (db1222 -> db1162), a subtask of T424615: Migrate s2 section to Debian Trixie, as Resolved.
Tue, May 26, 6:37 AM · DBA
FCeratto-WMF updated the task description for T425622: Switchover s2 master (db1222 -> db1162).
Tue, May 26, 6:23 AM · DBA