
Centralauth last deployment creating database contention on CentralAuthUser::saveSettings (Lock wait timeout exceeded; try restarting transaction)
Closed, Resolved · Public

Description

As I commented here initially: T119736#2445587

Since 2016-07-05 at 23h UTC (probably a software deploy), there has been a high rate of "Lock wait timeout exceeded; try restarting transaction" errors (10.64.16.30). This host is the centralauth master, and the queries come from CentralAuthUser::saveSettings:

https://logstash.wikimedia.org/#dashboard/temp/AVXVMNhAw3dCNxx2bE7U

Event Timeline

jcrespo created this task. Jul 11 2016, 5:19 PM
Restricted Application added subscribers: Zppix, Aklapper. Jul 11 2016, 5:19 PM

These still account for 50% of the current database errors (when zhwiki users go to sleep, T140108).

<_joe_> jynus: I guess you should notify that to @Tgr and @Anomie at least

The majority of the logged URLs seem to be from Notifications, along the lines of /w/api.php?centralauthtoken=[varies]&format=json&notwikis=enwiki&action=query&meta=notifications&notprop=count%7Clist&notgroupbysection=1&notunreadfirst=1. Very likely this started because of their "fire off a lot of parallel API requests on $user->saveSettings()" feature, which they turned on recently and which has caused several other bugs.

But there's no reason that these requests should be having to save the CentralAuthUser. I suspect the reason it's happening is because CentralAuthUser doesn't set gu_auth_token on newly-added users, so the next request for the user is having to update the database to save it. So let's try fixing that first.
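To illustrate the suspicion above, here is a toy model in Python. It is not CentralAuth code: the store class, function names, and the assumption that every parallel request loads the user row before any of them has saved are all hypothetical. It only shows why a missing gu_auth_token on a new user can turn a burst of read-only requests into a burst of writes to the same row.

```python
import secrets


class ToyGlobalUserStore:
    """In-memory stand-in for the globaluser table (hypothetical); counts row writes."""

    def __init__(self):
        self.rows = {}
        self.writes = 0

    def insert(self, name, auth_token=None):
        self.rows[name] = {"gu_auth_token": auth_token}
        self.writes += 1

    def update_token(self, name, token):
        self.rows[name]["gu_auth_token"] = token
        self.writes += 1


def new_token():
    # Any random token generator works for the illustration.
    return secrets.token_hex(16)


def writes_from_parallel_requests(set_token_on_create, n_requests=5):
    """Count row writes caused by n_requests 'parallel' reads of one new user."""
    store = ToyGlobalUserStore()
    store.insert("Example", new_token() if set_token_on_create else None)
    writes_after_create = store.writes

    # Assumption: each parallel request loads the row before any other has saved.
    snapshots = [store.rows["Example"]["gu_auth_token"] for _ in range(n_requests)]

    for seen_token in snapshots:
        if seen_token is None:
            # The request sees no token, so it generates one and saves the user.
            store.update_token("Example", new_token())

    return store.writes - writes_after_create


if __name__ == "__main__":
    print("token missing at creation:", writes_from_parallel_requests(False))
    print("token set at creation:", writes_from_parallel_requests(True))
```

In this model, setting the token when the user is added means the later requests never need to write, which is the idea behind trying that fix first.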

Change 298963 had a related patch set uploaded (by Anomie):
Set gu_auth_token when adding new users

https://gerrit.wikimedia.org/r/298963

Change 298963 merged by jenkins-bot:
Set gu_auth_token when adding new users

https://gerrit.wikimedia.org/r/298963

elukey added a subscriber: elukey. Jul 15 2016, 2:13 PM

Tons of 503s today for (what appears to be) the same issue:

https://logstash.wikimedia.org/#/dashboard/temp/AVXu3JKcT4MudYQNSuOT (credits to Jaime)

Outage lasted from ~13:24 to ~13:33 UTC

jcrespo closed this task as Resolved. Jul 20 2016, 3:36 PM
jcrespo claimed this task.

According to Kibana, this error no longer occurs; it disappeared at 2016-07-19 15:00 UTC:

https://logstash.wikimedia.org/goto/52ba432ceedd3803801fde6807e16737

I suppose some kind of deployment happened at that time, so this is resolved for me.

Probably https://gerrit.wikimedia.org/r/#/c/299704/, which turns off the problematic Echo feature now that the transition is completed.

Seems very similar, but not exactly a duplicate. This one would happen when the lock is held so long that other connections time out, while that would happen when the lock isn't held long enough to time out.

The reduced contention now that Echo isn't making lots of simultaneous requests should have reduced the incidence of that one too, and the patch to avoid having CentralAuthUser->getAuthToken() need to call CentralAuthUser->resetAuthToken() in the first place will help with that too.

Restricted Application added a project: Collaboration-Team-Triage. Dec 15 2016, 4:10 PM