
db1169 is lagged over 16000 seconds
Closed, Resolved (Public)

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2024-03-01T18:12:22Z] <taavi@cumin1002> dbctl commit (dc=all): 'depool db1169 T358892', diff saved to https://phabricator.wikimedia.org/P58287 and previous config saved to /var/cache/conftool/dbconfig/20240301-181221-taavi.json

I will review what happened when I get home. That schema change is being done with the script, so it should've depooled the host.

Funnily enough, I don't see it being repooled in the ticket.

I can't find it in the logs, but the usual reason is that db1169 was depooled when the script got started. Since the config gets loaded at the start of the script, it assumed db1169 was one of the permanently depooled hosts (backup sources, etc.) and just went ahead with the schema change.

I do see it being depooled by the script: T354015#9587476

That's from yesterday :)

Marostegui added a subscriber: taavi.

> I can't find it in the logs, but the usual reason is that db1169 was depooled when the script got started. Since the config gets loaded at the start of the script, it assumed db1169 was one of the permanently depooled hosts (backup sources, etc.) and just went ahead with the schema change.

This is the issue - I just checked all SAL entries.
This host was depooled last night, and then this happened:

  • Start of the schema change (and hence the script reads the config): 06:36 marostegui@cumin1002: START - Cookbook sre.hosts.downtime for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance

^ At that time, even though this entry is for a different host, the config is read and db1169 is still depooled from last night.

  • 06:36 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P58275 and previous config saved to /var/cache/conftool/dbconfig/20240301-063647-root.json

^ I manually started repooling the host that had been depooled last night.

That was the race condition. Problem solved.
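
A minimal sketch of the race described above, for illustration only. It is not the real schema change script: the `live_pooled` dictionary and the functions `snapshot_pooled`, `plan_stale`, and `plan_fresh` are hypothetical names, and re-reading the live state before touching each host is just one possible mitigation, not something decided in this task.

```python
"""Hypothetical illustration of the stale-config race; not the real script."""

# Stand-in for the live pooling state kept in dbctl (an assumption).
live_pooled = {"db1163": True, "db1169": False}  # db1169 depooled the night before


def snapshot_pooled():
    """Read the pooling state once, as a script would at startup."""
    return dict(live_pooled)


def plan_stale(host, snapshot):
    """Decide using only the startup snapshot (the behaviour that raced)."""
    if snapshot[host]:
        return "depool, alter, repool"
    # Host looked permanently depooled (e.g. a backup source): alter in place.
    return "alter in place"


def plan_fresh(host):
    """Possible mitigation: re-read the live state right before touching the host."""
    return plan_stale(host, snapshot_pooled())


snapshot = snapshot_pooled()   # 06:36: the script starts and reads the config
live_pooled["db1169"] = True   # 06:36: db1169 is manually repooled at 5%

print(plan_stale("db1169", snapshot))  # decision based on the stale snapshot
print(plan_fresh("db1169"))            # decision based on the current state
```

With the stale snapshot the host is altered in place while it is serving traffic, which matches the lag seen here; with a fresh read it would have been depooled first.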

I am going to start repooling db1169 now given that it is back in sync with the master and the schema change is finished.

Thanks @JJMC89 for the heads up about this and @taavi for depooling it!
Closing this as fixed as the host will repool itself over the next couple of hours.