Page MenuHomePhabricator

NEED_TOKEN error spike when 1.27-wmf.11 SessionManager was deployed to group1
Closed, ResolvedPublic

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Anomie added a subscriber: Anomie.Jan 21 2016, 3:54 PM

Initial guesses, although none of them seem obviously likely:

  • It could be that the user has a bad session cookie and SessionManager somehow is continuing to try (and fail) to use the bad cookie instead of sending a new one for a new, empty session. On the client side, this would be fixed when the client expires the bad cookie. This case might have also have been fixed by 4f5057b. But the symptom should have been the uncaught exception from T124126, not just failure of tokens to be saved right across requests.
  • It could be that there's a code path that's changing the session but not persisting it to the top-level WebRequest somehow. ApiLogin always does a ->persist() on the global session and LoginForm uses $wgRequest to access the session, though, so that seems unlikely.
  • It could be that the token secret is somehow being set in a wrong Session. But again, that seems unlikely since LoginForm uses $wgRequest to access the Session, which should always be the right one.

Initial plan of attack would be to figure out a way for me to reliably reproduce this from a client, then I can go from there to find out why it's breaking.

  • It could be that the user has a bad session cookie and SessionManager somehow is continuing to try (and fail) to use the bad cookie instead of sending a new one for a new, empty session.

From local testing against mediawiki.org, I think I've disproven this one. Logged in, then used eval.php to call User::setToken() and CentralAuthUser::resetAuthToken(), then confirmed meta=userinfo said I wasn't logged in on that session anymore, then tried a new login and it logged in without issue (including sending a replacement session cookie during the first step).

greg added a comment.Jan 21 2016, 6:15 PM

For the record regarding the rollout of wmf.11 (which was blocked due to this (and other issues) yesterday):
we will move ahead with the train (ie: roll out to group1, wait a bit, then roll out to wikipedis, as scheduled)

Tgr added a comment.Jan 21 2016, 9:26 PM

P2508 has some bot stats.

I did some live hacking earlier to log more data on the problem here. The logged data is available in Kibana from 21:08–21:49 UTC, the relevant messages are prefixed with "Live hack".

For most of the bots running into the issue, it appears that they are somehow not honoring the Set-Cookie headers: the incoming requests from these bots did not contain the commonswikiSession cookie. The record of cookies set during the request in WebResponse::$setCookie did show the proper cookies going out, and of course the cookies are showing up correctly for others as shown in T124252#1951112.

One bot had a different bug: in addition to the session cookie, it was sending cookies that should have been expired such as commonswikiUserID=deleted and centralauth_User=deleted. This causes the SessionProvider to reject the session in line with OWASP recommendations, since the claimed user or user ID is not valid.

Tgr added a comment.Jan 22 2016, 2:23 AM

Some of the top bots spamming login requests:

  • SDBot, UKBot, Community Tech bot, FlickreviewR 2, DanmicholoBot, CatWatchBot, SteinsplitterBot (~10/sec)
  • Ботчо (~1/sec)
  • BOTzilla, Mathbot, jembot (~1/10sec)

I'll try to contact the authors.

Tgr added a comment.Jan 22 2016, 5:08 AM

One bot had a different bug: in addition to the session cookie, it was sending cookies that should have been expired such as commonswikiUserID=deleted and centralauth_User=deleted. This causes the SessionProvider to reject the session in line with OWASP recommendations, since the claimed user or user ID is not valid.

That's exactly what's happening with the Android app: T124384#1954608

The old behavior was not to delete any cookie with the first API response, just set the new session. We can't do that anymore since we are stricter about cookie consistency; we should just probably treat cookies with the value deleted as nonexistent (at least as a B/C hack). Although probably replace deleted with something that's not a valid username.

Tgr added a comment.EditedJan 22 2016, 6:00 AM

Although probably replace deleted with something that's not a valid username.

deleted is apparently PHP magic (we just set the empty string). Shouldn't matter, lowercase usernames are not allowed, even on case-sensitive wikis.

Change 265685 had a related patch set uploaded (by Niedzielski):
Remove expired cookies

https://gerrit.wikimedia.org/r/265685

Change 265686 had a related patch set uploaded (by Niedzielski):
Remove expired cookies

https://gerrit.wikimedia.org/r/265686

Change 265689 had a related patch set uploaded (by Gergő Tisza):
Ignore auth cookies with value 'deleted'

https://gerrit.wikimedia.org/r/265689

Change 265690 had a related patch set uploaded (by Gergő Tisza):
Ignore auth cookies with value 'deleted'

https://gerrit.wikimedia.org/r/265690

Lsj added a subscriber: Lsj.Jan 22 2016, 11:05 AM

I'm getting this error since last night, when trying to login through DotNetWikiBot 3.14. Any suggestions on how to get around it?

Here's what DNWB sends and receives during login:

Logging in...
postData = lgname=Lsjtest&lgpassword=XXX&lgdomain=

First response:
<?xml version="1.0"?><api><login result="NeedToken" token="266bf5000cd7057f231a00f1e3384050" cookieprefix="svwiki" sessionid="ob4no1uvpsbu7gg8ldbk6209035bdckg" /></api>

postData = lgname=Lsjtest&lgpassword=XXX&lgdomain=&lgtoken=266bf5000cd7057f231a00f1e3384050

Second response:
<?xml version="1.0"?><api><login result="NeedToken" token="83012cf23597c0769c2417d88f83337c" cookieprefix="svwiki" sessionid="q1aqcsgsjanf6c3da9o91tl2438picgd" /></api>

Unhandled Exception: DotNetWikiBot.WikiBotException:

Change 265686 merged by jenkins-bot:
Remove expired cookies

https://gerrit.wikimedia.org/r/265686

Change 265685 merged by jenkins-bot:
Remove expired cookies

https://gerrit.wikimedia.org/r/265685

I also ran into this, I think (and filed T124421 about crazy cookies).

I'm getting this error since last night, when trying to login through DotNetWikiBot 3.14. Any suggestions on how to get around it?

Here's what DNWB sends and receives during login:

Check your cookie handling to make sure (1) it's picking up the session cookie from the Set-Cookie header in the first response and (2) it's not sending "deleted" cookies.

Lsj added a comment.Jan 22 2016, 8:04 PM

Thank you @Anomie. Now it works. Though it does seem like a kluge to have to purge the "deleted" cookies "by hand" in the code.

Thank you @Anomie. Now it works. Though it does seem like a kluge to have to purge the "deleted" cookies "by hand" in the code.

You shouldn't have to purge "deleted" cookies by hand. These cookies have a Max-Age of 0 and an expires in 1970, both of which should cause your cookie-handling code to automatically delete the cookie as being expired. If your cookie-handling code doesn't do that, then it's broken and should be fixed.

greg added a comment.Jan 22 2016, 9:58 PM

Given this graph:

What is the current plan?

Change 265859 had a related patch set uploaded (by Gergő Tisza):
Only delete cookies which are actually set

https://gerrit.wikimedia.org/r/265859

Change 265860 had a related patch set uploaded (by Gergő Tisza):
Only delete cookies which are actually set

https://gerrit.wikimedia.org/r/265860

Tgr added a comment.Jan 22 2016, 10:38 PM

@greg: https://gerrit.wikimedia.org/r/#/c/265689/ / https://gerrit.wikimedia.org/r/#/c/265690/ would fix this for bots that send deleted.

I pushed https://gerrit.wikimedia.org/r/265859 / https://gerrit.wikimedia.org/r/265860 for the rest, it should prevent unnecessary delete headers. It's a more scary hack than the other set though, I didn't have much time to test and now I'm off to (narrowly) catch a plane :(

Tgr added a comment.EditedJan 22 2016, 10:39 PM

Login sequence with the patch applied:

$ curl -sD- -d 'format=json&action=login&lgname=Admin&lgpassword=vagrant' 'http://localhost/w/api.php'
HTTP/1.1 200 OK
Date: Fri, 22 Jan 2016 22:35:11 GMT
Server: Apache/2.4.7 (Ubuntu) PHP/5.5.9-1ubuntu4.14
X-Powered-By: PHP/5.5.9-1ubuntu4.14
X-Content-Type-Options: nosniff
Cache-control: private, must-revalidate, max-age=0
Set-Cookie: wiki_session=8kqltku8q2auler58vs3c4jkpfsqi90h; path=/; httponly
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
X-Frame-Options: DENY
Content-Length: 144
Content-Type: application/json; charset=utf-8

{"login":{"result":"NeedToken","token":"0c6ae5d1512e1b6b7a2c9bcc1037a2b8","cookieprefix":"wiki","sessionid":"8kqltku8q2auler58vs3c4jkpfsqi90h"}}
$ curl -sD- -H 'Cookie: wiki_session=8kqltku8q2auler58vs3c4jkpfsqi90h' -d 'format=json&action=login&lgname=Admin&lgpassword=vagrant&lgtoken=0c6ae5d1512e1b6b7a2c9bcc1037a2b8' 'http://localhost/w/api.php'
HTTP/1.1 200 OK
Date: Fri, 22 Jan 2016 22:35:46 GMT
Server: Apache/2.4.7 (Ubuntu) PHP/5.5.9-1ubuntu4.14
X-Powered-By: PHP/5.5.9-1ubuntu4.14
X-Content-Type-Options: nosniff
Cache-control: private, must-revalidate, max-age=0
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Set-Cookie: wiki_session=8kqltku8q2auler58vs3c4jkpfsqi90h; path=/; httponly
Set-Cookie: wikiUserID=1; expires=Wed, 20-Jul-2016 22:35:47 GMT; Max-Age=15552000; path=/; httponly
Set-Cookie: wikiUserName=Admin; expires=Wed, 20-Jul-2016 22:35:47 GMT; Max-Age=15552000; path=/; httponly
Set-Cookie: wikiToken=357ea2dd5f4084723ca9c567374d1235; expires=Wed, 20-Jul-2016 22:35:47 GMT; Max-Age=15552000; path=/; httponly
X-Frame-Options: DENY
Content-Length: 178
Content-Type: application/json; charset=utf-8

{"login":{"result":"Success","lguserid":1,"lgusername":"Admin","lgtoken":"357ea2dd5f4084723ca9c567374d1235","cookieprefix":"wiki","sessionid":"8kqltku8q2auler58vs3c4jkpfsqi90h"}}

Change 265689 merged by Legoktm:
Ignore auth cookies with value 'deleted'

https://gerrit.wikimedia.org/r/265689

Change 265690 merged by Legoktm:
Ignore auth cookies with value 'deleted'

https://gerrit.wikimedia.org/r/265690

Change 265869 had a related patch set uploaded (by Legoktm):
Ignore auth cookies with value 'deleted'

https://gerrit.wikimedia.org/r/265869

Change 265870 had a related patch set uploaded (by Legoktm):
Ignore auth cookies with value 'deleted'

https://gerrit.wikimedia.org/r/265870

Change 265869 merged by Legoktm:
Ignore auth cookies with value 'deleted'

https://gerrit.wikimedia.org/r/265869

Change 265870 merged by Legoktm:
Ignore auth cookies with value 'deleted'

https://gerrit.wikimedia.org/r/265870

Change 265871 had a related patch set uploaded (by BryanDavis):
Only delete cookies which are actually set

https://gerrit.wikimedia.org/r/265871

Change 265872 had a related patch set uploaded (by BryanDavis):
Only delete cookies which are actually set

https://gerrit.wikimedia.org/r/265872

Change 265859 merged by jenkins-bot:
Only delete cookies which are actually set

https://gerrit.wikimedia.org/r/265859

Change 265871 merged by jenkins-bot:
Only delete cookies which are actually set

https://gerrit.wikimedia.org/r/265871

Change 265872 merged by jenkins-bot:
Only delete cookies which are actually set

https://gerrit.wikimedia.org/r/265872

Change 265860 merged by jenkins-bot:
Only delete cookies which are actually set

https://gerrit.wikimedia.org/r/265860

bd808 added a subscriber: bd808.Jan 23 2016, 12:22 AM

For once I'm glad to see a graph move down and to the right.

For once I'm glad to see a graph move down and to the right.

Looks like I got excited too quickly. Graphite was just lagging on showing the data. A couple of minutes after I took the screen shot the data showed up.

Tgr added a comment.Jan 23 2016, 3:44 AM

Some more data in P2508#10411.

Those million lines spanned about 3.5 minutes. IIRC it wasn't much different when SessionManager was only on group1 (5 minutes or something like that) which would suggest that not many bots were affected. OTOH per grafana the group1 peak was about 30 req/s and the final peak about ten times more, so not sure how to square that.

Change 265928 had a related patch set uploaded (by Anomie):
Move avoidance of setting deleted cookies into WebResponse

https://gerrit.wikimedia.org/r/265928

Change 265928 merged by jenkins-bot:
Move avoidance of setting deleted cookies into WebResponse

https://gerrit.wikimedia.org/r/265928

Change 266284 had a related patch set uploaded (by Anomie):
Use $wgSecureCookie to decide whether to actually mark secure cookies as 'secure'

https://gerrit.wikimedia.org/r/266284

Change 266285 had a related patch set uploaded (by Anomie):
Use $wgSecureCookie to decide whether to actually mark secure cookies as 'secure'

https://gerrit.wikimedia.org/r/266285

Change 266284 merged by jenkins-bot:
Use $wgSecureCookie to decide whether to actually mark secure cookies as 'secure'

https://gerrit.wikimedia.org/r/266284

Change 266285 merged by jenkins-bot:
Use $wgSecureCookie to decide whether to actually mark secure cookies as 'secure'

https://gerrit.wikimedia.org/r/266285

Change 266416 had a related patch set uploaded (by BryanDavis):
Use $wgSecureCookie to decide whether to actually mark secure cookies as 'secure'

https://gerrit.wikimedia.org/r/266416

Change 266421 had a related patch set uploaded (by BryanDavis):
Move avoidance of setting deleted cookies into WebResponse

https://gerrit.wikimedia.org/r/266421

Change 266426 had a related patch set uploaded (by BryanDavis):
Use $wgSecureCookie to decide whether to actually mark secure cookies as 'secure'

https://gerrit.wikimedia.org/r/266426

Change 266421 merged by jenkins-bot:
Move avoidance of setting deleted cookies into WebResponse

https://gerrit.wikimedia.org/r/266421

Change 266426 merged by jenkins-bot:
Use $wgSecureCookie to decide whether to actually mark secure cookies as 'secure'

https://gerrit.wikimedia.org/r/266426

Change 266416 merged by jenkins-bot:
Use $wgSecureCookie to decide whether to actually mark secure cookies as 'secure'

https://gerrit.wikimedia.org/r/266416

Tgr added a comment.Jan 28 2016, 4:59 AM

No permanent increase in NeedToken (or any other error) this time. There was a small spike at the time of group1 deployment, which probably means some bots have been logged out but could re-login successfully.

Lsj added a comment.EditedJan 28 2016, 9:42 PM

Login works well enough now, but bot sessions are much less stable than they used to be. Far too often, the bot either gets logged out or starts receiving error messages in the middle of a run. The most frequent error is "Invalid token", but I have also seen messages saying that "The Wiki is currently in read-only mode". If I break the run and start over, everything works fine again.

Login works well enough now, but bot sessions are much less stable than they used to be. Far too often, the bot either gets logged out or starts receiving error messages in the middle of a run. The most frequent error is "Invalid token", but I have also seen messages saying that "Wikipedia is in read-only mode". If I break the run and start over, everything works fine again.

... which makes my multi-threaded bot unable to complete any task.

Lsj added a comment.Jan 29 2016, 6:33 AM

I started four bot tasks last night. Three of them, running on the same computer, failed at the same time with an Invalid Token error. The fourth one, on a different computer, is still running.

Tgr added a comment.Jan 29 2016, 3:06 PM

We are still running the script for T124440: Invalidate all users sessions I believe?

Lsj added a comment.Jan 29 2016, 5:27 PM

I started four bot tasks last night. Three of them, running on the same computer, failed at the same time with an Invalid Token error. The fourth one, on a different computer, is still running.

Of the three tasks I restarted this morning, two failed during the day, but at different times (several hours apart). One with the "read-only" error, the other through a failed search operation. The third one completed without failure. This kind of failure rate, which is a totally new phenomenon since the recent software changes, makes it very tedious to do any kind of serious bot work.

@Lsj: Please reserve comments on this task for inability to log in due to NeedToken loops. Also, your bot should be coded to retry on transient errors such as read-only so complaining about single transient failures anywhere is unlikely to obtain a positive result.

Lsj added a comment.Jan 29 2016, 7:28 PM

@Anomie: So where is it appropriate to comment on these errors? In any case, my bot IS coded to retry on the usual kinds of single transient errors that do occur every now and then. That's not what I'm talking about here.

Anomie closed this task as Resolved.Feb 4 2016, 4:57 PM

We didn't have a crazy spike last week like we did two weeks ago, the increase last week seems to have been just one bot. So let's call this resolved, and if that one bot is still a problem next week I'll see about having a Tool Labs admin shut it off until the owner can be bothered to fix it.

Anomie added subscribers: Andrew, chasemp.EditedMar 2 2016, 3:19 PM

So let's call this resolved, and if that one bot is still a problem next week I'll see about having a Tool Labs admin shut it off until the owner can be bothered to fix it.

It turned out the bot owner wasn't bothered to fix it. On 2016-02-25, @Andrew killed and disabled Rezabot jobs 3813521 "zamandara", 3808244 "rademaker_RC", 3811634 "otherwiki", 3782104 "neededarticle", and 3792970 "purge_cat" and the operator was notified here.

On 2016-03-02, @chasemp killed Rezabot jobs 3758653 "lighttpd-rezabot" and 3956069 "porbinandeh"; since the operator was previously notified that their bot jobs were disabled, no additional notification was posted.

The operator of Obaid-bot was notified of the same problem on 2016-02-25 here; if not fixed by 2015-03-03 I'll ping for the same treatment.

The operator of JAnDbot was notified of the same problem on 2016-03-02 here; if not fixed by 2016-03-09 I'll ping for the same treatment.

In all these cases it appears the bots are running an old version of pywikibot. I suspect a simple upgrade and restart on the part of the operators of these bots would solve their problems.

2526 added a subscriber: 2526.Mar 20 2016, 6:49 PM
2526 added a comment.Aug 7 2017, 11:00 AM

User:Obaid-bot not working since May 2017

Aklapper added a comment.EditedAug 7 2017, 1:16 PM

@2526: This task has been resolved for 18 months. If something (what exactly?) is wrong with some bot account on some wiki (which wiki?), and you think that is related to this task (why?), feel free to bring that problem up on the talk page of that bot account. Or in a new task if you think that the problem is not related to the code of that bot account but to MediaWiki software code. Thanks.