Page MenuHomePhabricator

CAS update failed on user_touched (triggered from ApiLogin on Wikitech)
Closed, ResolvedPublic

Description

Labslogbot (user id 55 on wikitech) does some api.php queries that trigger CAS update failure from LdapAuthentication plugin. That happens on an hourly basis since Nov 8th 17:00 UTC.

Stack trace

{
    "file": "/srv/mediawiki/php-1.29.0-wmf.2/extensions/LdapAuthentication/LdapAuthenticationPlugin.php",
    "line": 1172,
    "function": "saveSettings",
    "class": "User",
    "type": "->",
    "args": []
}, {
    "file": "/srv/mediawiki/php-1.29.0-wmf.2/extensions/LdapAuthentication/LdapPrimaryAuthenticationProvider.php",
    "line": 142,
    "function": "updateUser",
    "class": "LdapAuthenticationPlugin",
    "type": "->",
    "args": ["User"]
}, {
    "function": "onUserLoggedIn",
    "class": "LdapPrimaryAuthenticationProvider",
    "type": "->",
    "args": ["User"]
}, {
    "file": "/srv/mediawiki/php-1.29.0-wmf.2/includes/Hooks.php",
    "line": 195,
    "function": "call_user_func_array",
    "args": ["array", "array"]
}, {
    "file": "/srv/mediawiki/php-1.29.0-wmf.2/includes/auth/AuthManager.php",
    "line": 2387,
    "function": "run",
    "class": "Hooks",
    "type": "::",
    "args": ["string", "array"]
}, {
    "file": "/srv/mediawiki/php-1.29.0-wmf.2/includes/auth/AuthManager.php",
    "line": 689,
    "function": "setSessionDataForUser",
    "class": "MediaWiki\\Auth\\AuthManager",
    "type": "->",
    "args": ["User", "boolean"]
}, {
    "file": "/srv/mediawiki/php-1.29.0-wmf.2/includes/auth/AuthManager.php",
    "line": 381,
    "function": "continueAuthentication",
    "class": "MediaWiki\\Auth\\AuthManager",
    "type": "->",
    "args": ["array"]
}, {
    "file": "/srv/mediawiki/php-1.29.0-wmf.2/includes/api/ApiLogin.php",
    "line": 145,
    "function": "beginAuthentication",
    "class": "MediaWiki\\Auth\\AuthManager",
    "type": "->",
    "args": ["array", "string"]
}, {
    "file": "/srv/mediawiki/php-1.29.0-wmf.2/includes/api/ApiMain.php",
    "line": 1435,
    "function": "execute",
    "class": "ApiLogin",
    "type": "->",
    "args": []
}, {
    "file": "/srv/mediawiki/php-1.29.0-wmf.2/includes/api/ApiMain.php",
    "line": 510,
    "function": "executeAction",
    "class": "ApiMain",
    "type": "->",
    "args": []
}, {
    "file": "/srv/mediawiki/php-1.29.0-wmf.2/includes/api/ApiMain.php",
    "line": 481,
    "function": "executeActionWithErrorHandling",
    "class": "ApiMain",
    "type": "->",
    "args": []
}, {
    "file": "/srv/mediawiki/php-1.29.0-wmf.2/api.php",
    "line": 83,
    "function": "execute",
    "class": "ApiMain",
    "type": "->",
    "args": []
}, {
    "file": "/srv/mediawiki/w/api.php",
    "line": 3,
    "args": ["string"],
    "function": "require"
}

Event Timeline

mmodell created this task.Nov 9 2016, 9:07 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 9 2016, 9:07 PM
mmodell triaged this task as Unbreak Now! priority.Nov 9 2016, 9:07 PM
Restricted Application added subscribers: Jay8g, Luke081515, TerraCodes. · View Herald TranscriptNov 9 2016, 9:07 PM

@cicalese: why LDAP-Authorization? The stack trace ends with LdapAuthentication/LdapAuthenticationPlugin.php

This task was initially tagged with LDAPAuthorization, which is an extension I manage. But, the stack trace is for LdapAuthentication. So, I removed the LDAPAuthorization tag and added the LdapAuthentication tag.

I'm sorry, my mistake!

elukey added a subscriber: elukey.Nov 10 2016, 11:39 AM
hashar added a subscriber: hashar.Nov 10 2016, 11:42 AM

Same as T95839 ?

That causes our fatal/exception monitoring probe to alarm every hour since Nov 8th 17:00 UTC. Apparently something hits the wikitech api.php on the hour which trigger the authentication issue.

hashar updated the task description. (Show Details)Nov 10 2016, 11:48 AM
Andrew added a subscriber: Andrew.Nov 10 2016, 3:02 PM

I can reproduce this issue pretty easily, and it's driving me a bit crazy; let me know if I can do anything to help.

@Andrew: can you point me to steps to reproduce?

mmodell added a comment.EditedNov 10 2016, 6:07 PM

I had a hypothesis that this was caused by the LDAP record being corrupted (saved with a future timestamp) however I've confirmed that isn't the case. Also the code for LdapAuthentication extension hasn't changed since July.

I'm rather lost about what could be causing this.

@mmodell: What I do (which is a bit ugly) is:

  1. Create 10 VMs in the 'testlabs' project at the same time
  2. Note that at least two of them show up as redlinks in wikitech
  3. Look in /var/log/nova/nova-compute.log on the host of the redlink vm for the failure message

I suspect that that I can write a standalone script that will cause the problem if that's necessary.

That's strange, I can't find anything in mediawiki which has changed recently (at least not within the codepath indicated by the stack trace)

It may have been happening for ages -- I assumed that it was my crappy code producing inconsistent behavior, but now that I've rewritten and reviewed my code I have renewed focus on the MW end :)

I'm also seeing this:

refreshLinksPrioritized Nova_Resource:Integration-slave-precise-1012.integration.eqiad.wmflabs rootJobTimestamp=20161110200232 useRecursiveLinksUpdate= triggeringUser={"userId":55,"userName":"Labslogbot"} triggeringRevisionId=966531 requestId=d09616729375a804f18550d5 (id=22453341,timestamp=20161110200232) t=65 error=DBUnexpectedError: LinksUpdate::acquirePageLock: Cannot flush pre-lock snapshot because writes are pending (SMWStore::updateData, SMW::writePropertyTableRowUpdates-delete-smw_di_number, SMW::writePropertyTableRowUpdates-delete-smw_di_blob, SMW::writePropertyTableRowUpdates-delete-smw_di_time, SMW::writePropertyTableRowUpdates-delete-smw_di_wikipage, SMWSql3SmwIds::setPropertyTableHashes, SMWSQLStore3Writers::addToPropertyUsageCount, SMWSQLStore3Writers::addToPropertyUsageCount, SMWSQLStore3Writers::addToPropertyUsageCount, SMWSQLStore3Writers::addToPropertyUsageCount, SMWSQLStore3Writers::addToPropertyUsageCount, SMWSQLStore3Writers::addToPropertyUsageCount, SMWSQLStore3Writers::addToPropertyUsageCount, SMWSQLStore3Writers::addToPropertyUsageCount, SMWSQLStore3Writers::addToPropertyUsageCount, SMWSQLStore3Writers::addToPropertyUsageCount, SMWSQLStore3Writers::addToPropertyUsageCount, SMWSQLStore3Writers::addToPropertyUsageCount, SMWSQLStore3Writers::addToPropertyUsageCount, SMWSQLStore3Writers::addToPropertyUsageCount, SMWSQLStore3Writers::addToPropertyUsageCount, SMWSQLStore3Writers::addToPropertyUsageCount, SMWSQLStore3Writers::addToPropertyUsageCount).

My summary is:

  • The issue is caused by an hourly cron that uses the account Labslogbot
  • The first occurence is Nov 8th 17:00 UTC

I suspect there are multiple threads refreshing the instances on Wikitech and there is a race condition when login which apparently cause the MW account to be refreshed from LDAP.

What would be nice to know is:

  • What/where is that hourly cron
  • What is the code?
  • What has changed on Nov 8th

Brain dump:

  • how one can have parallel threads hitting the API on a wiki having LDAP auth. Not sure why LdapAuthentication ALWAYS touch the user.
  • maybe oauth does not have the issue :]

To clarify. I am not working on this task.

Matanya lowered the priority of this task from Unbreak Now! to High.Nov 10 2016, 10:53 PM
Matanya added a subscriber: Matanya.

lowering to high, as it doesn't affect prod and not blocking release, should be fixed though, to avoid log spam.

Change 321677 had a related patch set uploaded (by Andrew Bogott):
wikistatus: Replace the max_retries throttle for wiki login

https://gerrit.wikimedia.org/r/321677

Change 321677 merged by Andrew Bogott:
wikistatus: Replace the max_retries throttle for wiki login

https://gerrit.wikimedia.org/r/321677

Change 322204 had a related patch set uploaded (by Reedy):
Try and prevent CAS errors in LDAP code

https://gerrit.wikimedia.org/r/322204

Change 322207 had a related patch set uploaded (by Andrew Bogott):
Wikitech: Increase login throttle limits x4

https://gerrit.wikimedia.org/r/322207

Change 322207 abandoned by Andrew Bogott:
Wikitech: Increase login throttle limits x4

Reason:
After fixes elsewhere it looks like we can live without this.

https://gerrit.wikimedia.org/r/322207

hashar removed a subscriber: hashar.Nov 22 2016, 5:00 PM

Change 322204 abandoned by Reedy:
Try and prevent CAS errors in LDAP code

https://gerrit.wikimedia.org/r/322204

Krinkle updated the task description. (Show Details)Apr 10 2018, 3:24 PM
Krinkle removed a project: Patch-For-Review.
Krinkle closed this task as Resolved.Apr 10 2018, 3:28 PM
Krinkle added a subscriber: Krinkle.

No instances of errors relating to "user_touched" in Logstash that also relate to LdapAuthentication or Wikitech, in the last 30 days.

Krinkle renamed this task from CAS update failed on user_touched for user ID '55' (read from replica); the version of the user to be saved is older than the current version. to CAS update failed on user_touched (triggered from ApiLogin on Wikitech).Apr 10 2018, 3:29 PM
mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:11 PM