
Centralauth last deployment creating database contention on CentralAuthUser::saveSettings (Lock wait timeout exceeded; try restarting transaction)
Closed, Resolved | Public | Production Error


As I commented here initially: T119736#2445587

Since 2016-07-05 at 23h UTC (probably a software deploy), there has been a high rate of "Lock wait timeout exceeded; try restarting transaction" errors. This host is the centralauth master, and the queries are CentralAuthUser::saveSettings:

Event Timeline

These continue to account for 50% of the current database errors (once zhwiki users go to sleep, T140108).

<_joe_> jynus: I guess you should notify that to @Tgr and @Anomie at least

The majority of the logged URLs seem to be from #Echo, along the lines of /w/api.php?centralauthtoken=[varies]&format=json&notwikis=enwiki&action=query&meta=notifications&notprop=count%7Clist&notgroupbysection=1&notunreadfirst=1. Very likely this started because of their "fire off a lot of parallel API requests on $user->saveSettings()" feature, which they turned on recently and which has caused several other bugs.

But there's no reason that these requests should be having to save the CentralAuthUser. I suspect the reason it's happening is because CentralAuthUser doesn't set gu_auth_token on newly-added users, so the next request for the user is having to update the database to save it. So let's try fixing that first.
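To illustrate the suspicion above, here is a minimal Python sketch of the lazy-token pattern. It is not the CentralAuth code (which is PHP): `FakeUserStore`, `get_auth_token`, and `writes_on_first_read` are hypothetical names, and the in-memory dict only stands in for the database row. The assumption is that a read which finds no `gu_auth_token` has to generate one and write it back, so what should be a read-only request turns into a write.

```python
import secrets


class FakeUserStore:
    """Hypothetical in-memory stand-in for the user table; counts writes."""

    def __init__(self):
        self.rows = {}
        self.writes = 0

    def insert(self, name, auth_token=None):
        # Creating the row is one write, with or without a token.
        self.rows[name] = {"gu_auth_token": auth_token}
        self.writes += 1

    def update_token(self, name, token):
        # Models the extra UPDATE that a missing token forces later on.
        self.rows[name]["gu_auth_token"] = token
        self.writes += 1


def get_auth_token(store, name):
    """Return the token, generating and saving it first if it is missing."""
    row = store.rows[name]
    if row["gu_auth_token"] is None:
        store.update_token(name, secrets.token_hex(16))
    return row["gu_auth_token"]


def writes_on_first_read(set_token_on_insert):
    """Count the writes caused by the first token read after user creation."""
    store = FakeUserStore()
    token = secrets.token_hex(16) if set_token_on_insert else None
    store.insert("Example", auth_token=token)
    writes_before = store.writes
    get_auth_token(store, "Example")
    return store.writes - writes_before


if __name__ == "__main__":
    print("token missing at creation:", writes_on_first_read(False))
    print("token set at creation:", writes_on_first_read(True))
```

In this model, setting the token at insert time removes the write from the first read. With many parallel requests for the same user, each such write would compete for the same row, which is consistent with the lock wait timeouts reported here.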

Change 298963 had a related patch set uploaded (by Anomie):
Set gu_auth_token when adding new users

Change 298963 merged by jenkins-bot:
Set gu_auth_token when adding new users

Tons of 503s today for (what appears to be) the same issue: (credits to Jaime)

Outage lasted from ~13:24 to ~13:33 UTC

jcrespo claimed this task.

According to Kibana, this error no longer appears; it disappeared at 2016-07-19 15:00 UTC:

I suppose some kind of deployment happened at that time, so this is resolved for me.

Probably, which turns off the problematic Echo feature now that the transition is completed.

Seems very similar, but not exactly a duplicate. This one would happen when the lock is held so long that other connections time out, while that would happen when the lock isn't held long enough to time out.

The reduced contention now that Echo isn't making lots of simultaneous requests should have reduced the incidence of that one too, and the patch to avoid having CentralAuthUser->getAuthToken() need to call CentralAuthUser->resetAuthToken() in the first place will benefit that too.

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:11 PM