Page MenuHomePhabricator

Don't show "Nonce already used" error on memcache failure
Closed, ResolvedPublic

Description

MWOAuthDataStore::lookup_nonce() calls memcached and does not differentiate between a successful lookup (which, for the OAuth protocol, means a nonce reuse error) and a memcached error. This means that the client receives an mwoauth-invalid-authorization error ("The authorization headers in your request are not valid: Nonce already used: XXX") even if the actual error is some memcached server dropping connections. This is very confusing for the client and should be improved. Unfortunately this behavior comes from the external OAuth library so that might not be easy (but maybe we can just throw an exception?), but at a minimum we should differentiate in our internal logging and set up an alert via T100735.

Event Timeline

Sitic raised the priority of this task from to Needs Triage.
Sitic updated the task description. (Show Details)
Sitic subscribed.

One possibility is double-posting of some kind. The other is a memcached error; the extension uses MWOAuthDataStore::lookup_nonce() to store nonces for 5 mins in memcached and check uniqueness, and it returns false both when the nonce is already used and when a memcached operation fails. Given that accepting a nonce without a duplication check opens us to replay attacks, I don't think much can be done about that (although we could improve logging to differentiate between the two cases).

I'm pretty sure that it isn't double-posting, the first instance of this error I can find is from 15 Jul 2015 20:19:29 -0700 (PDT), everything worked fine in the last weeks.
As a test 100 API requests resulted in 5 errors: (the nonce is definitely used for fist time here by me, there were no other API requests and the last 10 digits are the unix time)

import requests
from requests_oauthlib import OAuth1

auth1 = OAuth1(consumer_token['key'],
               client_secret=consumer_token['secret'],
               resource_owner_key=access_token['key'],
               resource_owner_secret=access_token['secret'])
for i in range(100):
    response = requests.get(
        "https://meta.wikimedia.org/w/api.php",
        params={
            'action': "query",
            'meta': "userinfo",
            'format': "json"  
        },
        auth=auth1
    )
    if 'error' in response.json():
        print response.json()
    time.sleep(3)     


{u'servedby': u'mw1128', u'error': {u'info': u'The authorization headers in your request are not valid: Nonce already used: 8100470808494508661437080839', u'*': u'See https://meta.wikimedia.org/w/api.php for API usage', u'code': u'mwoauth-invalid-authorization'}}
{u'servedby': u'mw1128', u'error': {u'info': u'The authorization headers in your request are not valid: Nonce already used: 91229791883386003641437080945', u'*': u'See https://meta.wikimedia.org/w/api.php for API usage', u'code': u'mwoauth-invalid-authorization'}}
{u'servedby': u'mw1128', u'error': {u'info': u'The authorization headers in your request are not valid: Nonce already used: 19195888973674011721437080983', u'*': u'See https://meta.wikimedia.org/w/api.php for API usage', u'code': u'mwoauth-invalid-authorization'}}
{u'servedby': u'mw1139', u'error': {u'info': u'The authorization headers in your request are not valid: Nonce already used: 141774547467664134131437081014', u'*': u'See https://meta.wikimedia.org/w/api.php for API usage', u'code': u'mwoauth-invalid-authorization'}}
{u'servedby': u'mw1128', u'error': {u'info': u'The authorization headers in your request are not valid: Nonce already used: 85106567837463406581437081030', u'*': u'See https://meta.wikimedia.org/w/api.php for API usage', u'code': u'mwoauth-invalid-authorization'}}

I've also just seen mw1134 as servedby, unfortunately I don't have the servedby information in my logfile for the previous errors.

Seems to have been fixed

[23:26:17] <ori> !log bounced nutcracker on mw1128 and mw1134
[23:26:21] <hashar> mw1139 seems to have the same issue :-(
[23:26:22] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:27:08] <grrrit-wm> (PS5) Lokal Profil: Add DCAT-AP for Wikibase [puppet] - https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087)
[23:27:11] <ori> !log bounced nutcracker on mw1139 as well. hashar noticed flood of errors from these hosts on https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki-errors . lack of monitoring / alerts is troubling.

Tgr renamed this task from Some OAuth API requests return "Nonce already used" errors to Don't show "Nonce already used" error on memcache failure.Jul 16 2015, 10:06 PM
Tgr updated the task description. (Show Details)
Tgr set Security to None.

Seems to have been fixed

[23:26:17] <ori> !log bounced nutcracker on mw1128 and mw1134
[23:26:21] <hashar> mw1139 seems to have the same issue :-(
[23:26:22] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:27:08] <grrrit-wm> (PS5) Lokal Profil: Add DCAT-AP for Wikibase [puppet] - https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087)
[23:27:11] <ori> !log bounced nutcracker on mw1139 as well. hashar noticed flood of errors from these hosts on https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki-errors . lack of monitoring / alerts is troubling.

@ori any idea if nutcracker failure was related to T105131: intermittent nutcracker failures ?

Tgr triaged this task as Low priority.Mar 7 2017, 4:10 AM

Regarding Memcached sometimes failing to store (e.g. due to being expired within one second of storing, or due to switching a shard to a new host, or due to resharding, or due to a host briefly not responding),

That's a fundamental contract between MW and Memcached that I don't think we will depart from anytime soon. Non-cache use cases do not belong in Memcached. The use case of T106066 is what session store, main stash, db-replicated, or dedicated tables are for.

Krinkle raised the priority of this task from Low to Medium.Mar 4 2021, 8:14 AM
Krinkle claimed this task.
Krinkle added a project: Performance-Team.

Change since:

  • mcrouter introduced, which handles failures differently from nutcracker.
  • gutter pool introduced, which adds fallback capacity to be able to keep storing things temporarily during a memc shard failover (T244852).
  • oauth nonces moved from nutcracker/Redis to mcrouter (T313578).