rdbms errors in eqiad
Open, Needs Triage, Public

Assigned To
None
Authored By
jijiki
Apr 7 2026, 1:07 PM
Referenced Files
F75239449: image.png
Apr 7 2026, 1:22 PM
F75239473: image.png
Apr 7 2026, 1:22 PM
F75238565: image.png
Apr 7 2026, 1:07 PM
F75238517: image.png
Apr 7 2026, 1:07 PM

Description

While investigating some other issues, I noticed that MediaWiki has been logging errors in the rdbms channel in eqiad but very few in codfw, even before T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad).

mediawiki channel errors

image.png (1×3 px, 504 KB)

image.png (1×3 px, 490 KB)

While the error rate remains within acceptable limits (?), we should discuss whether to investigate further.

Event Timeline

jijiki renamed this task from "rdbms erros in eqiad" to "rdbms errors in eqiad". Apr 7 2026, 1:13 PM
jijiki updated the task description.

I can take a look from the DB side of things, but remember that, since we are active-active, we make no changes at all on the DB layer during switchovers (apart from circular replication, which only affects the master for a few days).

If you look at the db errors dashboard, the top db with issues is db1224 (note that it's still only 4% of all errors), which is the vslow host of x1(!), so it has lower weight. The errors started on March 14, which doesn't correspond with the DC switchover. Worth looking into: https://logstash.wikimedia.org/goto/9fd372f130cc640f184a749639dcb8d7
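
For anyone who wants to double-check the per-host share and the start date offline, here is a minimal sketch. It assumes the error events have been exported as a list of dicts. The field names `db_server` and `timestamp`, the helper `summarize_errors`, and the sample hosts are all hypothetical. They are not confirmed Logstash field names or real data.

```python
from collections import Counter
from datetime import datetime


def summarize_errors(events):
    """Return (share of errors per host, first error time per host).

    `events` is an iterable of dicts with the hypothetical keys
    `db_server` (host name) and `timestamp` (ISO 8601 string).
    """
    counts = Counter()
    first_seen = {}
    for event in events:
        host = event["db_server"]
        seen_at = datetime.fromisoformat(event["timestamp"])
        counts[host] += 1
        # Keep the earliest timestamp observed for each host.
        if host not in first_seen or seen_at < first_seen[host]:
            first_seen[host] = seen_at
    total = sum(counts.values())
    shares = {host: n / total for host, n in counts.items()} if total else {}
    return shares, first_seen


# Made-up sample events, for illustration only.
sample_events = [
    {"db_server": "host-a", "timestamp": "2026-03-15T08:00:00"},
    {"db_server": "host-a", "timestamp": "2026-03-14T10:00:00"},
    {"db_server": "host-b", "timestamp": "2026-03-20T09:30:00"},
    {"db_server": "host-c", "timestamp": "2026-03-21T12:00:00"},
]

shares, first_seen = summarize_errors(sample_events)
for host, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{host}: {share:.0%} of errors, first seen {first_seen[host].date()}")
```

Comparing the earliest timestamp per host against the switchover date would be one way to confirm whether the two events line up.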

The part that worries me more is that it's doing x1 reads on code paths where it shouldn't (normal page views): https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-mediawiki-1-7.0.0-1-2026.04.08?id=k1bybJ0BDahWTbnKttaO

It looks like a different issue, but all x1 hosts seem to be having this problem, stemming from deferred updates. It started around March 14.