Page MenuHomePhabricator

NEED_TOKEN error spike when 1.27-wmf.11 SessionManager was deployed to group1
Closed, ResolvedPublicPRODUCTION ERROR

Details

Show related patches Customize query in gerrit

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
  • It could be that the user has a bad session cookie and SessionManager somehow is continuing to try (and fail) to use the bad cookie instead of sending a new one for a new, empty session.

From local testing against mediawiki.org, I think I've disproven this one. Logged in, then used eval.php to call User::setToken() and CentralAuthUser::resetAuthToken(), then confirmed meta=userinfo said I wasn't logged in on that session anymore, then tried a new login and it logged in without issue (including sending a replacement session cookie during the first step).

For the record regarding the rollout of wmf.11 (which was blocked due to this (and other issues) yesterday):
we will move ahead with the train (ie: roll out to group1, wait a bit, then roll out to wikipedis, as scheduled)

I did some live hacking earlier to log more data on the problem here. The logged data is available in Kibana from 21:08–21:49 UTC, the relevant messages are prefixed with "Live hack".

For most of the bots running into the issue, it appears that they are somehow not honoring the Set-Cookie headers: the incoming requests from these bots did not contain the commonswikiSession cookie. The record of cookies set during the request in WebResponse::$setCookie did show the proper cookies going out, and of course the cookies are showing up correctly for others as shown in T124252#1951112.

One bot had a different bug: in addition to the session cookie, it was sending cookies that should have been expired such as commonswikiUserID=deleted and centralauth_User=deleted. This causes the SessionProvider to reject the session in line with OWASP recommendations, since the claimed user or user ID is not valid.

Some of the top bots spamming login requests:

  • SDBot, UKBot, Community Tech bot, FlickreviewR 2, DanmicholoBot, CatWatchBot, SteinsplitterBot (~10/sec)
  • Ботчо (~1/sec)
  • BOTzilla, Mathbot, jembot (~1/10sec)

I'll try to contact the authors.

One bot had a different bug: in addition to the session cookie, it was sending cookies that should have been expired such as commonswikiUserID=deleted and centralauth_User=deleted. This causes the SessionProvider to reject the session in line with OWASP recommendations, since the claimed user or user ID is not valid.

That's exactly what's happening with the Android app: T124384#1954608

The old behavior was not to delete any cookie with the first API response, just set the new session. We can't do that anymore since we are stricter about cookie consistency; we should just probably treat cookies with the value deleted as nonexistent (at least as a B/C hack). Although probably replace deleted with something that's not a valid username.

Although probably replace deleted with something that's not a valid username.

deleted is apparently PHP magic (we just set the empty string). Shouldn't matter, lowercase usernames are not allowed, even on case-sensitive wikis.

Change 265685 had a related patch set uploaded (by Niedzielski):
Remove expired cookies

https://gerrit.wikimedia.org/r/265685

Change 265686 had a related patch set uploaded (by Niedzielski):
Remove expired cookies

https://gerrit.wikimedia.org/r/265686

Change 265689 had a related patch set uploaded (by Gergő Tisza):
Ignore auth cookies with value 'deleted'

https://gerrit.wikimedia.org/r/265689

Change 265690 had a related patch set uploaded (by Gergő Tisza):
Ignore auth cookies with value 'deleted'

https://gerrit.wikimedia.org/r/265690

I'm getting this error since last night, when trying to login through DotNetWikiBot 3.14. Any suggestions on how to get around it?

Here's what DNWB sends and receives during login:

Logging in...
postData = lgname=Lsjtest&lgpassword=XXX&lgdomain=

First response:
<?xml version="1.0"?><api><login result="NeedToken" token="266bf5000cd7057f231a00f1e3384050" cookieprefix="svwiki" sessionid="ob4no1uvpsbu7gg8ldbk6209035bdckg" /></api>

postData = lgname=Lsjtest&lgpassword=XXX&lgdomain=&lgtoken=266bf5000cd7057f231a00f1e3384050

Second response:
<?xml version="1.0"?><api><login result="NeedToken" token="83012cf23597c0769c2417d88f83337c" cookieprefix="svwiki" sessionid="q1aqcsgsjanf6c3da9o91tl2438picgd" /></api>

Unhandled Exception: DotNetWikiBot.WikiBotException:

Change 265686 merged by jenkins-bot:
Remove expired cookies

https://gerrit.wikimedia.org/r/265686

Change 265685 merged by jenkins-bot:
Remove expired cookies

https://gerrit.wikimedia.org/r/265685

I also ran into this, I think (and filed T124421 about crazy cookies).

I'm getting this error since last night, when trying to login through DotNetWikiBot 3.14. Any suggestions on how to get around it?

Here's what DNWB sends and receives during login:

Check your cookie handling to make sure (1) it's picking up the session cookie from the Set-Cookie header in the first response and (2) it's not sending "deleted" cookies.

Thank you @Anomie. Now it works. Though it does seem like a kluge to have to purge the "deleted" cookies "by hand" in the code.

Thank you @Anomie. Now it works. Though it does seem like a kluge to have to purge the "deleted" cookies "by hand" in the code.

You shouldn't have to purge "deleted" cookies by hand. These cookies have a Max-Age of 0 and an expires in 1970, both of which should cause your cookie-handling code to automatically delete the cookie as being expired. If your cookie-handling code doesn't do that, then it's broken and should be fixed.

Given this graph:

Selection_095.png (482×1 px, 24 KB)

What is the current plan?

Change 265859 had a related patch set uploaded (by Gergő Tisza):
Only delete cookies which are actually set

https://gerrit.wikimedia.org/r/265859

Change 265860 had a related patch set uploaded (by Gergő Tisza):
Only delete cookies which are actually set

https://gerrit.wikimedia.org/r/265860

@greg: https://gerrit.wikimedia.org/r/#/c/265689/ / https://gerrit.wikimedia.org/r/#/c/265690/ would fix this for bots that send deleted.

I pushed https://gerrit.wikimedia.org/r/265859 / https://gerrit.wikimedia.org/r/265860 for the rest, it should prevent unnecessary delete headers. It's a more scary hack than the other set though, I didn't have much time to test and now I'm off to (narrowly) catch a plane :(

Login sequence with the patch applied:

$ curl -sD- -d 'format=json&action=login&lgname=Admin&lgpassword=vagrant' 'http://localhost/w/api.php'
HTTP/1.1 200 OK
Date: Fri, 22 Jan 2016 22:35:11 GMT
Server: Apache/2.4.7 (Ubuntu) PHP/5.5.9-1ubuntu4.14
X-Powered-By: PHP/5.5.9-1ubuntu4.14
X-Content-Type-Options: nosniff
Cache-control: private, must-revalidate, max-age=0
Set-Cookie: wiki_session=8kqltku8q2auler58vs3c4jkpfsqi90h; path=/; httponly
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
X-Frame-Options: DENY
Content-Length: 144
Content-Type: application/json; charset=utf-8

{"login":{"result":"NeedToken","token":"0c6ae5d1512e1b6b7a2c9bcc1037a2b8","cookieprefix":"wiki","sessionid":"8kqltku8q2auler58vs3c4jkpfsqi90h"}}
$ curl -sD- -H 'Cookie: wiki_session=8kqltku8q2auler58vs3c4jkpfsqi90h' -d 'format=json&action=login&lgname=Admin&lgpassword=vagrant&lgtoken=0c6ae5d1512e1b6b7a2c9bcc1037a2b8' 'http://localhost/w/api.php'
HTTP/1.1 200 OK
Date: Fri, 22 Jan 2016 22:35:46 GMT
Server: Apache/2.4.7 (Ubuntu) PHP/5.5.9-1ubuntu4.14
X-Powered-By: PHP/5.5.9-1ubuntu4.14
X-Content-Type-Options: nosniff
Cache-control: private, must-revalidate, max-age=0
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Set-Cookie: wiki_session=8kqltku8q2auler58vs3c4jkpfsqi90h; path=/; httponly
Set-Cookie: wikiUserID=1; expires=Wed, 20-Jul-2016 22:35:47 GMT; Max-Age=15552000; path=/; httponly
Set-Cookie: wikiUserName=Admin; expires=Wed, 20-Jul-2016 22:35:47 GMT; Max-Age=15552000; path=/; httponly
Set-Cookie: wikiToken=357ea2dd5f4084723ca9c567374d1235; expires=Wed, 20-Jul-2016 22:35:47 GMT; Max-Age=15552000; path=/; httponly
X-Frame-Options: DENY
Content-Length: 178
Content-Type: application/json; charset=utf-8

{"login":{"result":"Success","lguserid":1,"lgusername":"Admin","lgtoken":"357ea2dd5f4084723ca9c567374d1235","cookieprefix":"wiki","sessionid":"8kqltku8q2auler58vs3c4jkpfsqi90h"}}

Change 265689 merged by Legoktm:
Ignore auth cookies with value 'deleted'

https://gerrit.wikimedia.org/r/265689

Change 265690 merged by Legoktm:
Ignore auth cookies with value 'deleted'

https://gerrit.wikimedia.org/r/265690

Change 265869 had a related patch set uploaded (by Legoktm):
Ignore auth cookies with value 'deleted'

https://gerrit.wikimedia.org/r/265869

Change 265870 had a related patch set uploaded (by Legoktm):
Ignore auth cookies with value 'deleted'

https://gerrit.wikimedia.org/r/265870

Change 265869 merged by Legoktm:
Ignore auth cookies with value 'deleted'

https://gerrit.wikimedia.org/r/265869

Change 265870 merged by Legoktm:
Ignore auth cookies with value 'deleted'

https://gerrit.wikimedia.org/r/265870

Change 265871 had a related patch set uploaded (by BryanDavis):
Only delete cookies which are actually set

https://gerrit.wikimedia.org/r/265871

Change 265872 had a related patch set uploaded (by BryanDavis):
Only delete cookies which are actually set

https://gerrit.wikimedia.org/r/265872

Change 265859 merged by jenkins-bot:
Only delete cookies which are actually set

https://gerrit.wikimedia.org/r/265859

Change 265871 merged by jenkins-bot:
Only delete cookies which are actually set

https://gerrit.wikimedia.org/r/265871

Change 265872 merged by jenkins-bot:
Only delete cookies which are actually set

https://gerrit.wikimedia.org/r/265872

Change 265860 merged by jenkins-bot:
Only delete cookies which are actually set

https://gerrit.wikimedia.org/r/265860

Screen Shot 2016-01-22 at 17.21.26.png (678×1 px, 57 KB)

For once I'm glad to see a graph move down and to the right.

For once I'm glad to see a graph move down and to the right.

Looks like I got excited too quickly. Graphite was just lagging on showing the data. A couple of minutes after I took the screen shot the data showed up.

Some more data in P2508#10411.

Those million lines spanned about 3.5 minutes. IIRC it wasn't much different when SessionManager was only on group1 (5 minutes or something like that) which would suggest that not many bots were affected. OTOH per grafana the group1 peak was about 30 req/s and the final peak about ten times more, so not sure how to square that.

Change 265928 had a related patch set uploaded (by Anomie):
Move avoidance of setting deleted cookies into WebResponse

https://gerrit.wikimedia.org/r/265928

Change 265928 merged by jenkins-bot:
Move avoidance of setting deleted cookies into WebResponse

https://gerrit.wikimedia.org/r/265928

Change 266284 had a related patch set uploaded (by Anomie):
Use $wgSecureCookie to decide whether to actually mark secure cookies as 'secure'

https://gerrit.wikimedia.org/r/266284

Change 266285 had a related patch set uploaded (by Anomie):
Use $wgSecureCookie to decide whether to actually mark secure cookies as 'secure'

https://gerrit.wikimedia.org/r/266285

Change 266284 merged by jenkins-bot:
Use $wgSecureCookie to decide whether to actually mark secure cookies as 'secure'

https://gerrit.wikimedia.org/r/266284

Change 266285 merged by jenkins-bot:
Use $wgSecureCookie to decide whether to actually mark secure cookies as 'secure'

https://gerrit.wikimedia.org/r/266285

Change 266416 had a related patch set uploaded (by BryanDavis):
Use $wgSecureCookie to decide whether to actually mark secure cookies as 'secure'

https://gerrit.wikimedia.org/r/266416

Change 266421 had a related patch set uploaded (by BryanDavis):
Move avoidance of setting deleted cookies into WebResponse

https://gerrit.wikimedia.org/r/266421

Change 266426 had a related patch set uploaded (by BryanDavis):
Use $wgSecureCookie to decide whether to actually mark secure cookies as 'secure'

https://gerrit.wikimedia.org/r/266426

Change 266421 merged by jenkins-bot:
Move avoidance of setting deleted cookies into WebResponse

https://gerrit.wikimedia.org/r/266421

Change 266426 merged by jenkins-bot:
Use $wgSecureCookie to decide whether to actually mark secure cookies as 'secure'

https://gerrit.wikimedia.org/r/266426

Change 266416 merged by jenkins-bot:
Use $wgSecureCookie to decide whether to actually mark secure cookies as 'secure'

https://gerrit.wikimedia.org/r/266416

No permanent increase in NeedToken (or any other error) this time. There was a small spike at the time of group1 deployment, which probably means some bots have been logged out but could re-login successfully.

Login works well enough now, but bot sessions are much less stable than they used to be. Far too often, the bot either gets logged out or starts receiving error messages in the middle of a run. The most frequent error is "Invalid token", but I have also seen messages saying that "The Wiki is currently in read-only mode". If I break the run and start over, everything works fine again.

Login works well enough now, but bot sessions are much less stable than they used to be. Far too often, the bot either gets logged out or starts receiving error messages in the middle of a run. The most frequent error is "Invalid token", but I have also seen messages saying that "Wikipedia is in read-only mode". If I break the run and start over, everything works fine again.

... which makes my multi-threaded bot unable to complete any task.

I started four bot tasks last night. Three of them, running on the same computer, failed at the same time with an Invalid Token error. The fourth one, on a different computer, is still running.

I started four bot tasks last night. Three of them, running on the same computer, failed at the same time with an Invalid Token error. The fourth one, on a different computer, is still running.

Of the three tasks I restarted this morning, two failed during the day, but at different times (several hours apart). One with the "read-only" error, the other through a failed search operation. The third one completed without failure. This kind of failure rate, which is a totally new phenomenon since the recent software changes, makes it very tedious to do any kind of serious bot work.

@Lsj: Please reserve comments on this task for inability to log in due to NeedToken loops. Also, your bot should be coded to retry on transient errors such as read-only so complaining about single transient failures anywhere is unlikely to obtain a positive result.

@Anomie: So where is it appropriate to comment on these errors? In any case, my bot IS coded to retry on the usual kinds of single transient errors that do occur every now and then. That's not what I'm talking about here.

We didn't have a crazy spike last week like we did two weeks ago, the increase last week seems to have been just one bot. So let's call this resolved, and if that one bot is still a problem next week I'll see about having a Tool Labs admin shut it off until the owner can be bothered to fix it.

So let's call this resolved, and if that one bot is still a problem next week I'll see about having a Tool Labs admin shut it off until the owner can be bothered to fix it.

It turned out the bot owner wasn't bothered to fix it. On 2016-02-25, @Andrew killed and disabled Rezabot jobs 3813521 "zamandara", 3808244 "rademaker_RC", 3811634 "otherwiki", 3782104 "neededarticle", and 3792970 "purge_cat" and the operator was notified here.

On 2016-03-02, @chasemp killed Rezabot jobs 3758653 "lighttpd-rezabot" and 3956069 "porbinandeh"; since the operator was previously notified that their bot jobs were disabled, no additional notification was posted.

The operator of Obaid-bot was notified of the same problem on 2016-02-25 here; if not fixed by 2015-03-03 I'll ping for the same treatment.

The operator of JAnDbot was notified of the same problem on 2016-03-02 here; if not fixed by 2016-03-09 I'll ping for the same treatment.

In all these cases it appears the bots are running an old version of pywikibot. I suspect a simple upgrade and restart on the part of the operators of these bots would solve their problems.

User:Obaid-bot not working since May 2017

@2526: This task has been resolved for 18 months. If something (what exactly?) is wrong with some bot account on some wiki (which wiki?), and you think that is related to this task (why?), feel free to bring that problem up on the talk page of that bot account. Or in a new task if you think that the problem is not related to the code of that bot account but to MediaWiki software code. Thanks.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:11 PM