
Frequent "Nonce already used" errors in scripts and tools
Closed, Resolved · Public

Description

Two of my bot tasks, NoLicense and inrbot, started generating frequent authentication errors like the following:

2021-01-18 12:36:40 pywiki ERROR: Retrying failed OAuth authentication for commons:commons: The authorization headers in your request are not valid: Nonce already used:

The first was recorded at 2020-12-10 17:42:12 UTC. The problem went away between Dec. 16 and Jan 15, except for one occurrence on Jan 6. Since then, I have been getting errors every few hours. inrbot runs constantly, while NoLicense is started fresh every hour. The INaturalistReviewBot account is used solely by the inrbot task, while NoLicense runs as AntiCompositeBot alongside several other tasks. One of the other AntiCompositeBot tasks has had this error once. All errors have been on commons.wikimedia.org, but that is likely just selection bias, as my enwiki tasks make significantly fewer edits. Both use OAuth to log in using Pywikibot.
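For reference, OAuth login in Pywikibot is configured through the authenticate mapping in user-config.py; a rough sketch with placeholder credentials (the real keys are of course not shown here):

# user-config.py (sketch; the four credential values are placeholders)
usernames['commons']['commons'] = 'AntiCompositeBot'
authenticate['commons.wikimedia.org'] = (
    'consumer_key', 'consumer_secret',
    'access_key', 'access_secret',
)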

Timestamps of errors
2020-12-10 17:42:12
2020-12-14 20:18:52
2020-12-16 11:08:20
2021-01-06 17:06:19
2021-01-14 17:20:16
2021-01-14 19:15:29
2021-01-14 21:02:00
2021-01-14 23:45:44
2021-01-15 07:12:43
2021-01-15 07:25:53
2021-01-15 08:05:07
2021-01-15 09:05:16
2021-01-15 18:05:19
2021-01-15 19:05:17
2021-01-15 23:14:11
2021-01-16 01:41:47
2021-01-16 10:11:54
2021-01-16 10:13:46
2021-01-16 15:05:52
2021-01-16 15:06:48
2021-01-16 16:05:32
2021-01-16 17:36:13
2021-01-16 18:21:22
2021-01-16 21:05:53
2021-01-17 07:05:23
2021-01-17 10:11:27
2021-01-17 10:11:34
2021-01-17 11:06:24
2021-01-17 14:05:25
2021-01-17 17:05:16
2021-01-17 20:29:02
2021-01-17 20:55:04
2021-01-17 21:06:07
2021-01-17 21:26:31
2021-01-17 21:26:34
2021-01-17 21:42:30
2021-01-17 21:42:32
2021-01-17 21:52:30
2021-01-18 02:07:41
2021-01-18 02:42:00
2021-01-18 09:51:32
2021-01-18 12:05:20
2021-01-18 12:36:40
2021-01-18 20:05:25
2021-01-18 21:18:48
2021-01-18 22:05:17

Event Timeline

JJMC89 edited subscribers, added: JJMC89; removed: pywikibot-bugs-list.

This looks like T106066: Don't show "Nonce already used" error on memcache failure, not an issue with Pywikibot. The API returns code mwoauth-invalid-authorization with Nonce already used in the info.
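For what it's worth, the failure is easy to spot client-side by checking for that error code; a minimal sketch using requests and requests_oauthlib (the endpoint, parameters, and 5-second backoff are illustrative, not what Pywikibot does internally):

import time
import requests
from requests_oauthlib import OAuth1

def api_get_with_nonce_retry(api_url, params, auth, retries=1):
    # Retry once if the API reports the OAuth nonce error; the code and the
    # "Nonce already used" text arrive in the standard error structure.
    data = None
    for attempt in range(retries + 1):
        data = requests.get(api_url, params=params, auth=auth).json()
        error = data.get('error', {})
        if error.get('code') != 'mwoauth-invalid-authorization':
            break
        time.sleep(5)  # illustrative backoff before retrying
    return data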

My bots have seen it on the below dates.

  • 2021-01-05
  • 2021-01-11
  • 2021-01-14
  • 2021-01-15
  • 2021-01-16
  • 2021-01-17
  • 2021-01-18

Both use OAuth to log in using Pywikibot.

Presumably OAuth 2?

OAuth 2 does not use nonces, AFAIK.

Pywikibot should probably log the X-Request-Id header in the API response and include it into error reports.
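A minimal sketch of that suggestion, assuming the comms layer keeps the requests response object around (the header name is the one served by Wikimedia production; the exact call site is left open):

request_id = response.headers.get('X-Request-Id')  # None if the header is absent
if request_id:
    pywikibot.log('API error response X-Request-Id: {}'.format(request_id))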

Just eyeballing the error logs, there are some big spikes of memcached timeout errors, so probably related to that.

The proper fix would be replacing memcached with some more reliable storage for OAuth nonces. Kask, maybe? Although these aren't really sessions, so I'm not sure whether it would violate some design assumption.

Pywikibot should probably log the X-Request-Id header in the API response and include it into error reports.

Here are four from my logs.

  • YApm4wpAEMMAAJbQSVAAAAEL
  • YApm5wpAAE0AABFHwu4AAACW
  • YApmLQpAEJ8AAGJavwQAAABB
  • YApmMgpAMN4AAz-TWXMAAABT

Seen 1897 errors and counting on both of the bots I maintain, on both frwiki and Wikidata. It looks like the first occurred on Nov 3, 2020 at 22:50 CET.
But it really started happening multiple times an hour from Jan 15, 2021 01:02 CET (MediaWiki deployment?).

Here are four from my logs.

  • YApm4wpAEMMAAJbQSVAAAAEL
  • YApm5wpAAE0AABFHwu4AAACW
  • YApmLQpAEJ8AAGJavwQAAABB
  • YApmMgpAMN4AAz-TWXMAAABT

No memcached error in any of those requests, just the OAuth one. Are you sure Pywikibot is not repeating the request for some reason?

But it really started happening multiple times an hour from Jan 15, 2021 01:02 CET (MediaWiki deployment?).

No deployment around that time.

No memcached error in any of those requests, just the OAuth one. Are you sure Pywikibot is not repeating the request for some reason?

Pywikibot retries after getting the nonce already used response.

I hacked some more logging into Pywikibot before and after where it sends the request. From that, I don't see duplicate requests or reused nonces.

diff --git a/pywikibot/comms/http.py b/pywikibot/comms/http.py
index a96c6b82c..b1335a311 100644
--- a/pywikibot/comms/http.py
+++ b/pywikibot/comms/http.py
@@ -423,9 +423,12 @@ def fetch(uri: str, method: str = 'GET', headers: Optional[dict] = None,
         # Note that the connections are pooled which mean that a future
         # HTTPS request can succeed even if the certificate is invalid and
         # verify=True, when a request with verify=False happened before
+        pywikibot.log('Sending request: method={}, uri={}, headers={}, '
+                      'kwargs={}'.format(method, uri, headers, kwargs))
         response = session.request(method, uri,
                                    headers=headers, auth=auth, timeout=timeout,
                                    **kwargs)
+        pywikibot.log('Sent request: method={r.method}, url={r.url}, headers={r.headers}, body={r.body}'.format(r=response.request))
     except Exception as e:
         response = e
     else:

The issue has been occurring with other bot libraries too, not just Pywikibot. In a few cases, the issue recurs even after waiting 5 seconds and retrying the API request.

Samwalton9-WMF renamed this task from Frequent Nonce already used errors from Pywikibot scripts to Frequent "Nonce already used" errors in scripts and tools. Jan 25 2021, 11:49 AM
This comment was removed by Samwalton9-WMF.

This issue occurs even with Wikimedia's oauthclient-php, when trying to get a token. It does not happen on every request, just from time to time.

use MediaWiki\OAuthClient\Client;
use MediaWiki\OAuthClient\Token;

// $conf is the ClientConfig holding the consumer credentials
$client = new Client( $conf );
$accessToken = new Token( $_SESSION['tokenKey'], $_SESSION['tokenSecret'] );
$params = [ 'action' => 'query', 'meta' => 'tokens', 'format' => 'json' ];
$userInfo = $client->makeOAuthCall(
    $accessToken, $apiURL, true, $params
); // <<<——- intermittently fails with "Nonce already used"

This issue also happens every now and then in a web tool (https://replacer.toolforge.org) where users log in with OAuth 1.

In this case, the tool simply forces the user to log in again, with a new request/access token. But it is a little annoying.

The error messages are all similar to this one (which is trimmed in the middle, sorry): 400 BAD_REQUEST Bad Request,mwoauth-invalid-auth..52..ot valid: Nonce already used: 370062727

Looking into the logs, it has happened about 30 times in the last day, hopefully not all for the same user (I don't have that info).

I also had a similar problem when trying to run some scripts today. The error is sporadic and popped up about once every 10-15 page changes.
Fortunately, the scripts do not fail.
See below:

WARNING: API error mwoauth-invalid-authorization: The authorization headers in your request are not valid: Nonce already used: 179091647063069294231612029666
ERROR: Retrying failed OAuth authentication for wiktionary:mg: The authorization headers in your request are not valid: Nonce already used: 179091647063069294231612029666

I confirm the frequency of this error increased by a lot for my bot as well.

Looking at the server logs:

[urbanecm@mwlog1001 /srv/mw-log]$ for  i in archive/OAuth.log-202101*.gz; do date=$(echo "$i" | sed 's/archive\/OAuth.log-//g;s/\.gz//g'); hits=$(zgrep 'so nonce has been used by this consumer+token' "$i" | wc -l); echo -e "$date\t$hits"; done > ~/data.tsv
[urbanecm@mwlog1001 /srv/mw-log]$ phaste ~/data.tsv
https://phabricator.wikimedia.org/P14057
[urbanecm@mwlog1001 /srv/mw-log]$

Until January 15, the number of nonce-related errors was stable at around 50k per day. After that date, it jumped to ~110k errors per day, i.e. it more than doubled.

Visualisation (interactive; raw data are at P14057):

[attached chart: visualisation of the daily nonce-related error counts, showing the jump after January 15]

@Tgr Any idea about what would more than double the number of nonce-related errors?

For the record, my scripts that are affected by this bug are pretty simple, and submit requests via this method:

import json
import requests
from requests_oauthlib import OAuth1

def make_request(api, payload, post=False):
    # Credentials are re-read on every call; OAuth1 signs each request with a fresh nonce.
    with open('config.json') as f:
        config = json.load(f)
    auth = OAuth1(config['customer_token'], config['customer_secret'],
                  config['access_token'], config['access_secret'])
    if post:
        return requests.post(api, data=payload, auth=auth)
    return requests.get(api, params=payload, auth=auth)

I don't think any request got repeated (unless there is a bug in requests_oauthlib :)).
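For illustration, a hypothetical call to that helper (the endpoint and payload here are just examples):

resp = make_request('https://commons.wikimedia.org/w/api.php',
                    {'action': 'query', 'meta': 'tokens', 'format': 'json'})
print(resp.json())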

Has any action been applied? I haven't seen this error again for days.

Still occurring for me. The rate is subjectively a bit slower, but it has definitely not been fixed.

Still occurring, yes. This bot log crudely suggests a rate of 2 in 100.

This happens occasionally in the Wikidata Lexeme Forms tool as well, though I don’t have numbers for how often it happens. (Counting requests in ~/uwsgi.log would be misleading, since “bulk mode” requests can make hundreds of API requests, each of which can (apparently) fail with this error.)

@Tgr Any idea about what would more than double the number of nonce-related errors?

Some kind of infrastructure problem that makes redis requests error out (previously I said memcached but actually it seems OAuth nonces have been using redis for a long time). Beyond that, no clue.

I looked at redis error logs to see if there was a spike but apparently there's something like half a million redis errors logged on a normal day so it's hard to pick up patterns :( I tried to get the rate of OAuth errors specifically, but none of the ElasticSearch queries I could come up with yielded any results (I was looking for messages on the redis channel which had the words OAUTH or nonce in them, per OAuth's Utils::getCacheKey). Not sure if that means no errors were logged or just that the queries didn't work - I tested them with other keys and it was deterministic but (to me) pretty arbitrary whether they worked for a given key or not.

Mentioned in SAL (#wikimedia-cloud) [2021-02-18T20:44:38Z] <wm-bot> <lucaswerkmeister> deployed 23ccbcf6f6 (work around T272319)

My bot was working for months without such a problem.
It is set to pause on errors.
Today I noticed that it had not been running for a while,
and found that it froze on 2021.01.15.
I restarted it on Toolforge, and the same problem happened today (2021.02.24).
Restarted it from my own PC - same error.
The bot is custom, but I think it is not reusing nonces:
https://github.com/Vort/WikiTasks/blob/dcef5143858b3a4f1fd0dfb0914c536f39dddd2f/MwApi.cs#L110
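For comparison, a client normally generates a fresh nonce and timestamp for every signed request; a sketch of the usual OAuth 1.0a pattern in Python (generate_nonce and generate_timestamp are existing oauthlib helpers, the surrounding dict is purely illustrative):

from oauthlib.common import generate_nonce, generate_timestamp

# Each signed request carries a new random nonce plus the current timestamp;
# the server remembers recently seen nonces per consumer+token to reject replays.
oauth_params = {
    'oauth_nonce': generate_nonce(),
    'oauth_timestamp': generate_timestamp(),
}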

Added debug information:


nonce: xoaxytctpjfyapbpbpqueyefjeqagimc
X-Request-Id: YDZLzE8c0wGk0qURc3HLRAAAAI0
-> fine
nonce: mxtcxoqgqagclbuhujwoxwazjunucbng
X-Request-Id: YDZLzYlPs8Iw@d1YyqpJaQAAAMQ
-> bug

Tagging serviceops because this seems to be caused by high redis error rates (although I did not manage to get stats from logstash but I can't think of another potential cause) and Cloud Services since it mainly affects their audience.

Change 666852 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hieradata: remove shard06 from redis_sessions

https://gerrit.wikimedia.org/r/666852

We lost mc1024 (shard06 on redis) on Jan14, which might be related. On the other hand I can't be completely sure, since nutcracker ejects the faulty host and reshards https://github.com/twitter/twemproxy/blob/master/notes/recommendation.md#liveness.

Regardless, we will remove the shard properly from the pool, and hope that it will fix the issues we are experiencing.

Change 666861 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] redis: create new shard on mc1036 to sub in for mc1024

https://gerrit.wikimedia.org/r/666861

We lost mc1024 (shard06 on redis) on Jan14, which might be related. On the other hand I can't be completely sure, since nutcracker ejects the faulty host and reshards https://github.com/twitter/twemproxy/blob/master/notes/recommendation.md#liveness.

Regardless, we will remove the shard properly from the pool, and hope that it will fix the issues we are experiencing.

To further elaborate: I've uploaded a more conservative patch that installs a second redis instance on mc1036 to use as shard 6. That should cause minimal disruption to the other keys. We have free memory for another redis instance, and the network bandwidth used is reduced. I've asked for a review, and if others agree we'll just go with it.

Change 666861 merged by Giuseppe Lavagetto:
[operations/puppet@production] redis: create new shard on mc1036 to sub in for mc1024

https://gerrit.wikimedia.org/r/666861

After merging my change, the number of errors in OAuth.log regarding 'nonce already used' decreased from ~ 80/minute to ~ 17/minute, which seems to be in line with the rate we had before the incident.

Having said that, it would be better if OAuth didn't log such an error when it is unable to connect to redis and/or can't find the key, I guess?

I'll wait to hear back from the bot developers before resolving the task, but I'm optimistic.

Apologies for the disruption we caused to everyone.

Further update: in the 10 minutes after the dust of the resharding settled, we had just 36 errors, which seems more than acceptable.

My bot run finished successfully a minute ago.

I got a few reports of bots not having any more issues; I would consider the immediate problem solved.

The origin of this problem is that we're still using a battery of sharded redises as if they were a consistent datastore, not a cache, in MediaWiki. I hope OAuth tokens can be migrated off of redis ASAP, or that we change the storage abstraction used in MediaWiki for them.

Having said that, it would be better if OAuth didn't log such an error when it is unable to connect to redis and/or can't find the key, I guess?

The underlying problem is MediaWiki's BagOStuff abstraction layer not differentiating between a miss and a connection error. (Not finding the key is fine, the OAuth action would succeed then.)
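A minimal illustration of that failure mode (pseudo-Python, not MediaWiki code; it assumes the nonce check is an atomic cache add, which is what T106066 suggests):

def nonce_is_fresh(cache, consumer, token, nonce, ttl=300):
    key = 'nonce:{}:{}:{}'.format(consumer, token, nonce)  # hypothetical key layout
    try:
        # add() succeeds only if the key did not exist yet
        return cache.add(key, 1, ttl)
    except ConnectionError:
        # With BagOStuff-style semantics the backend error is swallowed and the
        # call simply reports False, indistinguishable from "nonce seen before",
        # so the request is rejected with a spurious "Nonce already used".
        return False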

The origin of this problem is that we're still using a battery of sharded redises as if they were a consistent datastore, not a cache, in MediaWiki. I hope OAuth tokens can be migrated off of redis ASAP, or that we change the storage abstraction used in MediaWiki for them.

These are short-TTL tokens just to prevent replay attacks, so we can just configure OAuth to use a different datastore without causing much disruption, as long as it is accessible via ObjectCache (granted, that's still a cache, but maybe there's a less problematic one).

Change 666852 abandoned by Effie Mouzeli:
[operations/puppet@production] hieradata: remove shard06 from redis_sessions

Reason:
in favour of I50f69891684c147caa27321d8ffab6ee27374884

https://gerrit.wikimedia.org/r/666852

There have been no errors since the fix was deployed...until now.

2021-03-04 03:05:24 pywiki ERROR: Retrying failed OAuth authentication for commons:commons: The authorization headers in your request are not valid: Nonce already used: 146...

@AntiCompositeNumber, my bot is also having problems again: [2021.03.04 02:00:19] the run froze. Should this issue be reopened?

@Vort @AntiCompositeNumber this is related to T276415 - we had another hardware failure. I'll route around it now, but we definitely need to make OAuth tokens less brittle.

Change 668258 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] redis: create new shard on mc1035 to sub in for mc1027

https://gerrit.wikimedia.org/r/668258

Change 668258 merged by Giuseppe Lavagetto:
[operations/puppet@production] redis: create new shard on mc1035 to sub in for mc1027

https://gerrit.wikimedia.org/r/668258