
"Nonce already used" regularly occurring on beta cluster
Closed, ResolvedPublic

Description

The Pywikibot-OAuth GSOC code has been merged, with unit tests regularly testing test.wikipedia, test.orain, and beta cluster sites en.wp and zh.wp.

On both the beta cluster sites, tests are regularly encountering failures which appear to be T106066: Don't show "Nonce already used" error on memcache failure

https://travis-ci.org/wikimedia/pywikibot-core/jobs/75715387#L910

Event Timeline

jayvdb created this task.Aug 15 2015, 10:48 AM
jayvdb raised the priority of this task from to High.
jayvdb updated the task description. (Show Details)
jayvdb added a subscriber: jayvdb.
jayvdb moved this task from Backlog to Tests on the Pywikibot-OAuth board.Aug 20 2015, 8:42 AM
jayvdb moved this task from Backlog to MediaWiki on the Pywikibot-network board.
greg added a subscriber: greg.Aug 28 2015, 3:23 PM

From my understanding, this is an issue in OAuth not Beta Cluster, yes? I'm removing the Beta Cluster project as such.

greg set Security to None.
Tgr added a subscriber: Tgr.

Unless there is a bug in OAuth or PWB that we are unaware of, this is a memcached availability issue (see the task linked in the description for details).

We are testing OAuth on multiple sites (see the task description), and we have only seen this on the beta cluster.

Another one on the beta cluster: https://travis-ci.org/wikimedia/pywikibot-core/jobs/77787307#L741

This happens so frequently, and in such varied code paths, that I would say OAuth on the beta cluster is effectively broken for any non-trivial amount of automated testing.

Tgr added a comment.EditedSep 7 2015, 11:38 PM

Probably just restart the memcached machines?

Krinkle renamed this task from Nonce already used regularly occurring on beta cluster to "Nonce already used" regularly occurring on beta cluster.Sep 8 2015, 3:21 AM
Tgr added a comment.Sep 8 2015, 8:36 PM

@jayvdb is there a way to track this error? I looked at the Travis history, but builds seem to break all over the place and mostly for unrelated reasons (or not obviously related, anyway - pretty much any error can be related to memcached failure), so I couldn't get a good picture of how frequent this issue is.

Tgr added a comment.Sep 9 2015, 12:39 AM

memcache failures in the last month, as seen by MediaWiki. Doesn't show any obvious recent increase.

jayvdb updated the task description. (Show Details)Sep 9 2015, 7:56 AM
jayvdb added a comment.Sep 9 2015, 8:48 AM

@Tgr, I am not sure what we can or should do to help, but if you can give us some suggestions we might be able to help.

It is occurring all over the place, at any test, before or after the first login.

To give you some idea of frequency, each test job performs about 40 tests, and each build includes two jobs: en.wp.beta and zh.wp.beta. So ~80 test methods are run, and my guess is about 300 API calls are made (we could add an API call counter if it helps). And as said above, almost every test build fails, with one or both jobs hitting this problem. So my back-of-envelope guess is ~1 in 300 API calls fails.
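To make that estimate concrete (the numbers below are the rough ones from above, not measured values):

jobs_per_build = 2         # en.wp.beta and zh.wp.beta
tests_per_job = 40
api_calls_per_build = 300  # a guess; an API call counter would pin this down
print(jobs_per_build * tests_per_job)   # ~80 test methods per build
# With roughly one nonce failure per build:
print(1.0 / api_calls_per_build)        # ~0.3%, i.e. ~1 failure in 300 API calls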

These exceptions are always the result of an API error with code = 'mwoauth-invalid-authorization' and info = The authorization headers in your request are not valid: Nonce already used: dddd.

We could include the API servedby value in the exception. But I am guessing that the Operations team would already have spotted the problem if a particular host were causing it more frequently.
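For illustration only (this is not the actual pywikibot change; the class below is a simplified stand-in for pywikibot's APIError, and it assumes servedby appears at the top level of the decoded API error response), carrying servedby into the raised exception could look like:

class APIError(Exception):
    """Simplified stand-in for pywikibot's APIError."""

    def __init__(self, code, info, servedby=None):
        self.code = code
        self.info = info
        self.servedby = servedby
        suffix = ' (servedby: %s)' % servedby if servedby else ''
        super(APIError, self).__init__('%s: %s%s' % (code, info, suffix))

def raise_api_error(data):
    """Raise APIError from a decoded API error response, keeping servedby."""
    error = data['error']
    raise APIError(error['code'], error['info'], data.get('servedby'))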

There is some variation in the OAuth logging:
Some of the nose logs include:

requests_oauthlib.oauth1_auth: DEBUG: Signing request <PreparedRequest [GET]> using client <Client nonce=None, signature_method=HMAC-SHA1, realm=None, encoding=utf-8, timestamp=None, resource_owner_secret=****, decoding=utf-8, verifier=None, signature_type=AUTH_HEADER, rsa_key=None, resource_owner_key=28c3e24fb4e98a9ddffdf46b120f381e, client_secret=****, callback_uri=None, client_key=6ad8150a40735dac98d4f07826061cd1>

Others are different:

requests_oauthlib.oauth1_auth: DEBUG: Signing request <PreparedRequest [POST]> using client <Client resource_owner_secret=****, client_secret=****, realm=None, verifier=None, signature_method=HMAC-SHA1, decoding=utf-8, rsa_key=None, client_key=6ad8150a40735dac98d4f07826061cd1, signature_type=AUTH_HEADER, encoding=utf-8, resource_owner_key=28c3e24fb4e98a9ddffdf46b120f381e, nonce=None, callback_uri=None, timestamp=None>

Nonce error before login:
https://travis-ci.org/wikimedia/pywikibot-core/jobs/77787307#L371 - Nonce failed
https://travis-ci.org/wikimedia/pywikibot-core/jobs/77787307#L410 - log in

But mostly it occurs after login (which isn't unusual, as the login happens early in the test sequence):
https://travis-ci.org/wikimedia/pywikibot-core/jobs/79135083#L440 - log in
https://travis-ci.org/wikimedia/pywikibot-core/jobs/79135083#L672 - Nonce failed
https://travis-ci.org/wikimedia/pywikibot-core/jobs/79135083#L763 - the API params which caused the problem were:

b'titles=%E9%A6%96%E9%A1%B5&maxlag=5&action=query&prop=info&meta=userinfo&uiprop=blockinfo%7Chasmsg&inprop=protection&format=json&indexpageids=&continue=

https://travis-ci.org/wikimedia/pywikibot-core/jobs/79188747#L440
https://travis-ci.org/wikimedia/pywikibot-core/jobs/79188747#L643

https://travis-ci.org/jayvdb/pywikibot-core/jobs/77788314#L549
https://travis-ci.org/jayvdb/pywikibot-core/jobs/77788314#L727

Another thing we could do is use https://pypi.python.org/pypi/nose-randomly to re-order the tests randomly; that doesn't have a high chance of flushing out the problem, though, and it will make logs harder to compare between test builds. Along these lines, we could change where the login action occurs, moving it later in the sequence to see if it is related.
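If adding a plugin is not wanted, a minimal plain-unittest sketch of the same idea (shuffling the order in which a TestCase's methods run; the seed handling is much simpler than what nose-randomly provides):

import random
import unittest

def load_shuffled(testcase_class, seed=None):
    """Return a suite running testcase_class's test methods in random order."""
    names = list(unittest.TestLoader().getTestCaseNames(testcase_class))
    random.Random(seed).shuffle(names)
    return unittest.TestSuite(testcase_class(name) for name in names)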

Also, as this seems to be an intermittent problem, pywikibot could simply retry whenever it encounters 'The authorization headers in your request are not valid: Nonce already used'. If the pywikibot builds stop being red so frequently, we can assume that the problem is recoverable on a short timescale.
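A minimal sketch of that retry idea (this is not the eventual pywikibot patch; request is assumed to be a callable that submits the API request and returns the decoded JSON response):

import time

NONCE_ERROR = 'Nonce already used'

def submit_with_retry(request, retries=2, wait=5):
    """Re-submit an API request when the OAuth nonce error comes back."""
    for attempt in range(retries + 1):
        data = request()
        error = data.get('error', {})
        if (error.get('code') == 'mwoauth-invalid-authorization'
                and NONCE_ERROR in error.get('info', '')
                and attempt < retries):
            time.sleep(wait)
            continue
        return data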

Tgr added a subscriber: bd808.Sep 9 2015, 9:25 PM

@Tgr, I am not sure what we can or should do to help, but if you can give us some suggestions we might be able to help.

Someone needs to be able to verify whether random attempts at hitting some service or other on the head have improved the situation, that's all.

For now, @bd808 restarted nutcracker on deployment-mediawiki02, which seemed to be the exclusive source of recent memcached errors in logstash. Please tell us whether that made a difference.

jayvdb closed this task as Resolved.Sep 9 2015, 11:06 PM
jayvdb claimed this task.

I've seen four builds in a row without problems, so it looks fixed. Will re-open if we encounter it again. Thanks!

jayvdb reassigned this task from jayvdb to bd808.Sep 9 2015, 11:06 PM
Tgr added a comment.Sep 9 2015, 11:10 PM

Looks like the memcached graphite counts are a good way of tracking this after all (nutcracker restart was around 18:30 UTC):

jayvdb lowered the priority of this task from Unbreak Now! to Medium.Sep 12 2015, 3:26 PM
Tgr added a subscriber: ori.Sep 13 2015, 1:01 AM

Looks like it was well-behaved for two days and now the errors are back:


This time it's all deployment-mediawiki01 per https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/memcached , something on the scale of 1-10 errors per minute.

Restarted, but I have no idea how to identify the underlying bug. @ori did a lot of investigation on nutcrackers in production, does that apply here?

jayvdb raised the priority of this task from Medium to High.Sep 13 2015, 1:18 AM

Builds are regularly failing due to this.
I'll add servedby to our error messages, so we can see where the problems are happening.

Change 237977 had a related patch set uploaded (by John Vandenberg):
Expand login and rights exception tree

https://gerrit.wikimedia.org/r/237977

jayvdb raised the priority of this task from High to Unbreak Now!.Sep 23 2015, 6:09 AM
XZise added a subscriber: XZise.Sep 23 2015, 5:03 PM

Not sure if that patch helped, but for the past few days all tests on Travis connecting via OAuth fail, similar to https://travis-ci.org/wikimedia/pywikibot-core/jobs/81789795#L1016

And I can't find servedby anywhere there (except for the 5 times where we actually create that test response).

greg removed bd808 as the assignee of this task.Sep 23 2015, 8:16 PM

Not sure if that patch helped, but for the past few days all tests on Travis connecting via OAuth fail, similar to https://travis-ci.org/wikimedia/pywikibot-core/jobs/81789795#L1016
And I can't find servedby anywhere there (except for the 5 times where we actually create that test response).

servedby doesn't appear in these exceptions, as they do not inherit from APIError, which is what https://gerrit.wikimedia.org/r/#/c/237977/ aims to solve.

This task has had "Unbreak Now!" priority for four weeks now.
How do we get this moving forward? What info is needed, and from whom?

XZise added a comment.Oct 20 2015, 1:56 PM

Primarily a response from @jayvdb to the comments in https://gerrit.wikimedia.org/r/#/c/237977/. As soon as that gets merged, we are going to get more information from Travis. Not sure who can then help with that information, though.

Tgr added a comment.Oct 20 2015, 7:03 PM

Per graphite, there are tens of thousands of memcached failures per day. Maybe @ori has ideas, he debugged a similar issue in production.

Change 249222 had a related patch set uploaded (by Gergő Tisza):
Make nutcracker's auto_eject_hosts setting configurable

https://gerrit.wikimedia.org/r/249222

hashar lowered the priority of this task from Unbreak Now! to High.Nov 9 2015, 1:08 PM

As mentioned in T129763, we are seeing this again now that the OAuth login has been updated.

FWIW, https://gerrit.wikimedia.org/r/#/c/237977/ hasn't been merged, and is waiting on me to push up a new version, but I don't see how that is going to be particularly useful - the server logs should have sufficient information about the problems.

What might be more useful is to re-rig our unit tests so that they repeat the same set of tests many times, to isolate the problem a little. The test suite can be run using nose or py.test, so there is a wide variety of plugins that can achieve this. Instructions are in tests/README.rst.
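A rough sketch of that with plain unittest (the dotted module name is hypothetical; a nose or py.test repeat plugin would do the same thing with less code):

import unittest

# Repeat one suspect pywikibot test module many times in a single run,
# to see how often the nonce error reproduces.
suite = unittest.TestSuite()
for _ in range(20):
    suite.addTests(unittest.defaultTestLoader.loadTestsFromName('tests.site_tests'))

if __name__ == '__main__':
    unittest.TextTestRunner(verbosity=2).run(suite)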

Tgr added a comment.May 18 2016, 8:45 PM

@jayvdb: I don't have much time this month, but do you want to shepherd https://gerrit.wikimedia.org/r/#/c/249222/ through a puppet SWAT (it needs a rebase too) and then disable the setting in beta? That's my best bet at improving the situation. The problem is somewhere in the memcached layer; I doubt logs from the appservers (or PWB) would give useful information.

Change 289582 had a related patch set uploaded (by John Vandenberg):
Retry after OAuth "Nonce already used" error

https://gerrit.wikimedia.org/r/289582

jayvdb added a comment.EditedMay 19 2016, 1:38 AM

In the new build https://travis-ci.org/wikimedia/pywikibot-core/builds/131215620, both zh.wpbeta and en.wpbeta had nonce problems, and on my own build, en.wpbeta saw the same problem: https://travis-ci.org/jayvdb/pywikibot-core/jobs/131281628 (but it gets stuck on "Test that protectedpages returns protected page titles. ... ")

It looks like a simple API request retry works for us, as can be seen at https://travis-ci.org/jayvdb/pywikibot-core/builds/131298000, which uses the patch above. Sometimes it needs to retry twice in the same unit test, so the message appears twice in succession, but those are different API requests in long-running tests.

@jayvdb: I don't have much time this month but do you want to shepherd https://gerrit.wikimedia.org/r/#/c/249222/ through a puppet SWAT (it needs a rebase too) and then disable the setting in beta? ...

Sorry, but my plate doth overflow already, and assuming the above patch is merged, this set of pywikibot test failures is fixed and we've got about 10 more curly test failures to solve before our tree is green again (after almost a year of red).

(I did get as far as copy-editing the relevant page, which is still currently very inaccessible to normal people.)

Change 289582 merged by jenkins-bot:
Retry after OAuth "Nonce already used" error

https://gerrit.wikimedia.org/r/289582

jayvdb lowered the priority of this task from High to Low.May 19 2016, 10:58 AM

The workaround is fairly easy.

Tgr closed this task as Resolved.Mar 7 2017, 4:20 AM
Tgr claimed this task.

This has not happened for a long time.

Change 249222 abandoned by Gergő Tisza:
Make nutcracker's auto_eject_hosts setting configurable

Reason:
not relevant anymore.

https://gerrit.wikimedia.org/r/249222