
Frequent "Nonce already used" errors in scripts and tools
Closed, Resolved · Public

Description

Two of my bot tasks, NoLicense and inrbot, started generating frequent authentication errors like the following:

2021-01-18 12:36:40 pywiki ERROR: Retrying failed OAuth authentication for commons:commons: The authorization headers in your request are not valid: Nonce already used:

The first was recorded at 2020-12-10 17:42:12 UTC. The problem went away between Dec. 16 and Jan 15, except for one occurrence on Jan 6. Since then, I have been getting errors every few hours. inrbot runs constantly, while NoLicense is started fresh every hour. The INaturalistReviewBot account is used solely by the inrbot task, while NoLicense runs as AntiCompositeBot alongside several other tasks. One of the other AntiCompositeBot tasks has had this error once. All errors have been on commons.wikimedia.org, but that is likely just selection bias, as my enwiki tasks make significantly fewer edits. Both use OAuth to log in using Pywikibot.
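For reference, OAuth login in Pywikibot is configured through the authenticate mapping in user-config.py; a rough sketch with placeholder credentials (the real keys are of course not shown here):

# user-config.py (sketch; the four credential values are placeholders)
usernames['commons']['commons'] = 'AntiCompositeBot'
authenticate['commons.wikimedia.org'] = (
    'consumer_key', 'consumer_secret',
    'access_key', 'access_secret',
)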

Timestamps of errors
2020-12-10 17:42:12
2020-12-14 20:18:52
2020-12-16 11:08:20
2021-01-06 17:06:19
2021-01-14 17:20:16
2021-01-14 19:15:29
2021-01-14 21:02:00
2021-01-14 23:45:44
2021-01-15 07:12:43
2021-01-15 07:25:53
2021-01-15 08:05:07
2021-01-15 09:05:16
2021-01-15 18:05:19
2021-01-15 19:05:17
2021-01-15 23:14:11
2021-01-16 01:41:47
2021-01-16 10:11:54
2021-01-16 10:13:46
2021-01-16 15:05:52
2021-01-16 15:06:48
2021-01-16 16:05:32
2021-01-16 17:36:13
2021-01-16 18:21:22
2021-01-16 21:05:53
2021-01-17 07:05:23
2021-01-17 10:11:27
2021-01-17 10:11:34
2021-01-17 11:06:24
2021-01-17 14:05:25
2021-01-17 17:05:16
2021-01-17 20:29:02
2021-01-17 20:55:04
2021-01-17 21:06:07
2021-01-17 21:26:31
2021-01-17 21:26:34
2021-01-17 21:42:30
2021-01-17 21:42:32
2021-01-17 21:52:30
2021-01-18 02:07:41
2021-01-18 02:42:00
2021-01-18 09:51:32
2021-01-18 12:05:20
2021-01-18 12:36:40
2021-01-18 20:05:25
2021-01-18 21:18:48
2021-01-18 22:05:17

Event Timeline

JJMC89 edited subscribers, added: JJMC89; removed: pywikibot-bugs-list.

This looks like T106066: Don't show "Nonce already used" error on memcache failure, not an issue with Pywikibot. The API returns code mwoauth-invalid-authorization with Nonce already used in the info.
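For what it's worth, the failure is easy to spot client-side by checking for that error code; a minimal sketch using requests and requests_oauthlib (the endpoint, parameters, and 5-second backoff are illustrative, not what Pywikibot does internally):

import time
import requests
from requests_oauthlib import OAuth1

def api_get_with_nonce_retry(api_url, params, auth, retries=1):
    # Retry once if the API reports the OAuth nonce error; the code and the
    # "Nonce already used" text arrive in the standard error structure.
    data = None
    for attempt in range(retries + 1):
        data = requests.get(api_url, params=params, auth=auth).json()
        error = data.get('error', {})
        if error.get('code') != 'mwoauth-invalid-authorization':
            break
        time.sleep(5)  # illustrative backoff before retrying
    return data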

My bots have seen it on the below dates.

  • 2021-01-05
  • 2021-01-11
  • 2021-01-14
  • 2021-01-15
  • 2021-01-16
  • 2021-01-17
  • 2021-01-18

Both use OAuth to log in using Pywikibot.

Presumably OAuth 2?

OAuth 2 does not use nonces, AFAIK.

Pywikibot should probably log the X-Request-Id header in the API response and include it into error reports.
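A minimal sketch of that suggestion, assuming the comms layer keeps the requests response object around (the header name is the one served by Wikimedia production; the exact call site is left open):

request_id = response.headers.get('X-Request-Id')  # None if the header is absent
if request_id:
    pywikibot.log('API error response X-Request-Id: {}'.format(request_id))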

Just eyeballing the error logs, there are some big spikes of memcached timeout errors, so probably related to that.

The proper fix would be replacing memcached with some more reliable storage for OAuth nonces. Kask, maybe? Although these aren't really sessions, so I'm not sure whether it would violate some design assumption.

Pywikibot should probably log the X-Request-Id header in the API response and include it into error reports.

Here are four from my logs.

  • YApm4wpAEMMAAJbQSVAAAAEL
  • YApm5wpAAE0AABFHwu4AAACW
  • YApmLQpAEJ8AAGJavwQAAABB
  • YApmMgpAMN4AAz-TWXMAAABT

Seen 1897 errors and counting on both of the bots I maintain, on both frwiki and Wikidata. It looks like the first occurred on Nov 3, 2020 at 22:50 CET.
But it really started happening multiple times an hour from Jan 15, 2021 01:02 CET (MediaWiki deployment?).

Here are four from my logs.

  • YApm4wpAEMMAAJbQSVAAAAEL
  • YApm5wpAAE0AABFHwu4AAACW
  • YApmLQpAEJ8AAGJavwQAAABB
  • YApmMgpAMN4AAz-TWXMAAABT

No memcached error in any of those requests, just the OAuth one. Are you sure Pywikibot is not repeating the request for some reason?

But it really started happening multiple times an hour from Jan 15, 2021 01:02 CET (MediaWiki deployment?).

No deployment around that time.

No memcached error in any of those requests, just the OAuth one. Are you sure Pywikibot is not repeating the request for some reason?

Pywikibot retries after getting the nonce already used response.

I hacked some more logging into Pywikibot before and after where it sends the request. From that, I don't see duplicate requests or reused nonces.

diff --git a/pywikibot/comms/http.py b/pywikibot/comms/http.py
index a96c6b82c..b1335a311 100644
--- a/pywikibot/comms/http.py
+++ b/pywikibot/comms/http.py
@@ -423,9 +423,12 @@ def fetch(uri: str, method: str = 'GET', headers: Optional[dict] = None,
         # Note that the connections are pooled which mean that a future
         # HTTPS request can succeed even if the certificate is invalid and
         # verify=True, when a request with verify=False happened before
+        pywikibot.log('Sending request: method={}, uri={}, headers={}, '
+                      'kwargs={}'.format(method, uri, headers, kwargs))
         response = session.request(method, uri,
                                    headers=headers, auth=auth, timeout=timeout,
                                    **kwargs)
+        pywikibot.log('Sent request: method={r.method}, url={r.url}, headers={r.headers}, body={r.body}'.format(r=response.request))
     except Exception as e:
         response = e
     else:

The issue has been occurring with other bot libraries too, not just Pywikibot. In a few cases, the issue recurs even after waiting 5 seconds and retrying the API request.

Samwalton9-WMF renamed this task from Frequent Nonce already used errors from Pywikibot scripts to Frequent "Nonce already used" errors in scripts and tools. Jan 25 2021, 11:49 AM
This comment was removed by Samwalton9-WMF.

This issue occurs even with Wikimedia's oauthclient-php, when trying to get a token. It does not happen on every request, just from time to time.

use MediaWiki\OAuthClient\Client;
use MediaWiki\OAuthClient\Token;

// $conf is the ClientConfig holding the consumer credentials
$client = new Client( $conf );
$accessToken = new Token( $_SESSION['tokenKey'], $_SESSION['tokenSecret'] );
$params = [ 'action' => 'query', 'meta' => 'tokens', 'format' => 'json' ];
$userInfo = $client->makeOAuthCall(
    $accessToken, $apiURL, true, $params
); // <<<——- intermittently fails with "Nonce already used"

This issue also happens every now and then in a web tool (https://replacer.toolforge.org) where users log in with OAuth 1.

In this case, the tool simply forces the user to log in again, with a new request/access token. But it is a little annoying.

The error messages are all similar to this one (which is trimmed in the middle, sorry): 400 BAD_REQUEST Bad Request,mwoauth-invalid-auth..52..ot valid: Nonce already used: 370062727

Looking into the logs, it has happened about 30 times in the last day, hopefully not all for the same user (I don't have that info).

I also had a similar problem when trying to run some scripts today. The error is sporadic and popped up about once every 10-15 page changes.
Fortunately, the scripts do not fail.
See below:

WARNING: API error mwoauth-invalid-authorization: The authorization headers in your request are not valid: Nonce already used: 179091647063069294231612029666
ERROR: Retrying failed OAuth authentication for wiktionary:mg: The authorization headers in your request are not valid: Nonce already used: 179091647063069294231612029666

I confirm the frequency of this error increased by a lot for my bot as well.

Looking at the server logs:

[urbanecm@mwlog1001 /srv/mw-log]$ for  i in archive/OAuth.log-202101*.gz; do date=$(echo "$i" | sed 's/archive\/OAuth.log-//g;s/\.gz//g'); hits=$(zgrep 'so nonce has been used by this consumer+token' "$i" | wc -l); echo -e "$date\t$hits"; done > ~/data.tsv
[urbanecm@mwlog1001 /srv/mw-log]$ phaste ~/data.tsv
https://phabricator.wikimedia.org/P14057
[urbanecm@mwlog1001 /srv/mw-log]$

Until January 15, the number of nonce-related errors was stable at around 50k per day. After that date, it jumped to ~110k errors per day, i.e. it more than doubled.

Visualisation (interactive; raw data are at P14057):

[attached chart: visualisation of the daily nonce-related error counts, showing the jump after January 15]

@Tgr Any idea about what would more than double the number of nonce-related errors?

For the record, my scripts that are affected by this bug are pretty simple, and submit requests via this method:

import json
import requests
from requests_oauthlib import OAuth1

def make_request(api, payload, post=False):
    # Credentials are re-read on every call; OAuth1 signs each request with a fresh nonce.
    with open('config.json') as f:
        config = json.load(f)
    auth = OAuth1(config['customer_token'], config['customer_secret'],
                  config['access_token'], config['access_secret'])
    if post:
        return requests.post(api, data=payload, auth=auth)
    return requests.get(api, params=payload, auth=auth)

I don't think any request got repeated (unless there is a bug in requests_oauthlib :)).
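For illustration, a hypothetical call to that helper (the endpoint and payload here are just examples):

resp = make_request('https://commons.wikimedia.org/w/api.php',
                    {'action': 'query', 'meta': 'tokens', 'format': 'json'})
print(resp.json())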

Has any action been applied? I haven't seen this error again for days.

Still occurring for me. The rate is subjectively a bit slower, but it has definitely not been fixed.

Still occurring, yes. This bot log crudely suggests a rate of 2 in 100.

This happens occasionally in the Wikidata Lexeme Forms tool as well, though I don’t have numbers for how often it happens. (Counting requests in ~/uwsgi.log would be misleading, since “bulk mode” requests can make hundreds of API requests, each of which can (apparently) fail with this error.)

@Tgr Any idea about what would more than double the number of nonce-related errors?

Some kind of infrastructure problem that makes redis requests error out (previously I said memcached but actually it seems OAuth nonces have been using redis for a long time). Beyond that, no clue.

I looked at redis error logs to see if there was a spike but apparently there's something like half a million redis errors logged on a normal day so it's hard to pick up patterns :( I tried to get the rate of OAuth errors specifically, but none of the ElasticSearch queries I could come up with yielded any results (I was looking for messages on the redis channel which had the words OAUTH or nonce in them, per OAuth's Utils::getCacheKey). Not sure if that means no errors were logged or just that the queries didn't work - I tested them with other keys and it was deterministic but (to me) pretty arbitrary whether they worked for a given key or not.

Mentioned in SAL (#wikimedia-cloud) [2021-02-18T20:44:38Z] <wm-bot> <lucaswerkmeister> deployed 23ccbcf6f6 (work around T272319)

My bot was working for months without such a problem.
It is set to pause on errors.
Today I noticed that it had not been running for a while,
and found that it froze on 2021.01.15.
I restarted it on Toolforge, and the same problem happened today (2021.02.24).
Restarted it from my own PC - same error.
The bot is custom, but I think it is not reusing nonces:
https://github.com/Vort/WikiTasks/blob/dcef5143858b3a4f1fd0dfb0914c536f39dddd2f/MwApi.cs#L110
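For comparison, a client normally generates a fresh nonce and timestamp for every signed request; a sketch of the usual OAuth 1.0a pattern in Python (generate_nonce and generate_timestamp are existing oauthlib helpers, the surrounding dict is purely illustrative):

from oauthlib.common import generate_nonce, generate_timestamp

# Each signed request carries a new random nonce plus the current timestamp;
# the server remembers recently seen nonces per consumer+token to reject replays.
oauth_params = {
    'oauth_nonce': generate_nonce(),
    'oauth_timestamp': generate_timestamp(),
}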

Added debug information:


nonce: xoaxytctpjfyapbpbpqueyefjeqagimc
X-Request-Id: YDZLzE8c0wGk0qURc3HLRAAAAI0
-> fine
nonce: mxtcxoqgqagclbuhujwoxwazjunucbng
X-Request-Id: YDZLzYlPs8Iw@d1YyqpJaQAAAMQ
-> bug

Tagging serviceops because this seems to be caused by high redis error rates (although I did not manage to get stats from logstash but I can't think of another potential cause) and Cloud Services since it mainly affects their audience.

Change 666852 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hieradata: remove shard06 from redis_sessions

https://gerrit.wikimedia.org/r/666852

We lost mc1024 (shard06 on redis) on Jan14, which might be related. On the other hand I can't be completely sure, since nutcracker ejects the faulty host and reshards https://github.com/twitter/twemproxy/blob/master/notes/recommendation.md#liveness.

Regardless, we will remove the shard properly from the pool, and hope that it will fix the issues we are experiencing.

Change 666861 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] redis: create new shard on mc1036 to sub in for mc1024

https://gerrit.wikimedia.org/r/666861

We lost mc1024 (shard06 on redis) on Jan14, which might be related. On the other hand I can't be completely sure, since nutcracker ejects the faulty host and reshards https://github.com/twitter/twemproxy/blob/master/notes/recommendation.md#liveness.

Regardless, we will remove the shard properly from the pool, and hope that it will fix the issues we are experiencing.

To further elaborate: I've uploaded a more conservative patch that installs a second redis instance on mc1036 to use as shard 6. That should cause minimal disruption to the other keys. We have free memory for another redis instance, and the network bandwidth used is reduced. I've asked for a review, and if others agree we'll just go with it.

Change 666861 merged by Giuseppe Lavagetto:
[operations/puppet@production] redis: create new shard on mc1036 to sub in for mc1024

https://gerrit.wikimedia.org/r/666861

After merging my change, the number of errors in OAuth.log regarding 'nonce already used' decreased from ~ 80/minute to ~ 17/minute, which seems to be in line with the rate we had before the incident.

Having said that, it would be better if OAuth didn't log such an error when it is unable to connect to redis and/or can't find the key, I guess?

I'll wait to hear back from the bot developers before resolving the task, but I'm optimistic.

Apologies for the disruption we caused to everyone.

Further update: in the 10 minutes after the dust of the resharding settled, we had just 36 errors, which seems more than acceptable.

My bot run finished successfully a minute ago.

I got a few reports of bots not having any more issues; I would consider the immediate problem solved.

The origin of this problem is that we're still using a battery of sharded redises as if they were a consistent datastore, not a cache, in MediaWiki. I hope OAuth tokens can be migrated off of redis ASAP, or that we change the storage abstraction used in MediaWiki for them.

Having said that, it would be better if OAuth didn't log such an error when it is unable to connect to redis and/or can't find the key, I guess?

The underlying problem is MediaWiki's BagOStuff abstraction layer not differentiating between a miss and a connection error. (Not finding the key is fine, the OAuth action would succeed then.)
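A minimal illustration of that failure mode (pseudo-Python, not MediaWiki code; it assumes the nonce check is an atomic cache add, which is what T106066 suggests):

def nonce_is_fresh(cache, consumer, token, nonce, ttl=300):
    key = 'nonce:{}:{}:{}'.format(consumer, token, nonce)  # hypothetical key layout
    try:
        # add() succeeds only if the key did not exist yet
        return cache.add(key, 1, ttl)
    except ConnectionError:
        # With BagOStuff-style semantics the backend error is swallowed and the
        # call simply reports False, indistinguishable from "nonce seen before",
        # so the request is rejected with a spurious "Nonce already used".
        return False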

The origin of this problem is that we're still using a battery of sharded redises as if they were a consistent datastore, not a cache, in MediaWiki. I hope OAuth tokens can be migrated off of redis ASAP, or that we change the storage abstraction used in MediaWiki for them.

These are short-TTL tokens just to prevent replay attacks, so we can just configure OAuth to use a different datastore without causing much disruption, as long as it is accessible via ObjectCache (granted, that's still a cache, but maybe there's a less problematic one).

Change 666852 abandoned by Effie Mouzeli:
[operations/puppet@production] hieradata: remove shard06 from redis_sessions

Reason:
in favour of I50f69891684c147caa27321d8ffab6ee27374884

https://gerrit.wikimedia.org/r/666852

There have been no errors since the fix was deployed...until now.

2021-03-04 03:05:24 pywiki ERROR: Retrying failed OAuth authentication for commons:commons: The authorization headers in your request are not valid: Nonce already used: 146...

@AntiCompositeNumber, my bot is also having problems again: [2021.03.04 02:00:19] the run froze. Should this issue be reopened?

@Vort @AntiCompositeNumber this is related to T276415 - we had another hardware failure. I'll route around it now, but we definitely need to make OAuth tokens less brittle.

Change 668258 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] redis: create new shard on mc1035 to sub in for mc1027

https://gerrit.wikimedia.org/r/668258

Change 668258 merged by Giuseppe Lavagetto:
[operations/puppet@production] redis: create new shard on mc1035 to sub in for mc1027

https://gerrit.wikimedia.org/r/668258