Page MenuHomePhabricator

Problem with getMobileToken in MobileContext causing browser tests to fail
Closed, ResolvedPublic

Description

Seen often in browser tests run locally and on Cloudbees, this happens:

  • default status of Beta toggle is OFF
  • click Beta toggle to ON
  • click Save
  • get error: "Error saving settings. Please make sure that you have cookies enabled"
  • Beta toggle goes back to OFF
  • test fails

I have not yet triggered this manually. (but I haven't tried that hard either)


Version: unspecified
Severity: normal

Details

Reference
bz62614

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 3:07 AM
bzimport set Reference to bz62614.
bzimport added a subscriber: Unknown Object (MLST).

bingle-admin wrote:

Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/mobile/cards/1770

This looks like an issue on Cloudbees. The error shows only when cookies are disabled. Are we able to say with 100% certainty Cloudbees is not doing something funky?

As I said originally, this is pretty easy to repro locally, it is not just Cloudbees builds.

I've not been able to replicate this issue myself. However if there is a bug it's going to be in getMobileToken - the only time when this error should happen is if the token in the settings form differs from the result of getMobileToken

Max any ideas to why this might happen?

Chris, how do I repro this problem locally?

Max, for me when I run the browser test "editor_ve.feature" (or any test with the step "Given I am in alpha mode") against beta labs, I see the "Error saving settings. Please make sure that you have cookies enabled" pretty often, but not always.

I wonder if it's possibly an issue with Selenium? This is not something I've been able to repro either and certainly not something I've experienced locally or in production.

To the extent that it is an issue with Selenium, the speed of the test between clicking the "ON" toggle and the "Save" button is probably triggering a race condition that is difficult to trigger manually.

But if I run "editor_ve.feature" with the "Given I am in alpha mode" step 3 times on my local environment pointed to beta labs, I'll get the error 1 or 2 times.

If someone can help me get Chrome driver setup maybe I can find a way around this... does delaying the opt in event with a wait help at all Chris?

I have run this 6 times locally against beta labs with the latest ChromeDriver (v2.9.248307) and all passed.

I'm still not sure what's going on here.

I got the problem the first time I ran, on the second scenario:

$ bundle exec cucumber features/editor_ve.feature
@en.m.wikipedia.beta.wmflabs.org @login @test2.m.wikipedia.org
Feature: VisualEditor

Scenario: Toolbar VisualEditor                      # features/editor_ve.feature:4
  Given I am in alpha mode                          # features/step_definitions/common_steps.rb:77
  And I am logged into the mobile website           # features/step_definitions/common_steps.rb:10
  And I am on the "Selenium Edit Test" page         # features/step_definitions/common_steps.rb:88
  When I click the edit button                      # features/step_definitions/editor_steps.rb:7
  Then I see the VisualEditor overlay               # features/step_definitions/editor_ve_steps.rb:9
  And I see a toolbar in the overlay header         # features/step_definitions/editor_ve_steps.rb:17
  And The VisualEditor toolbar has a bold button    # features/step_definitions/editor_ve_steps.rb:21
  And The VisualEditor toolbar has an italic button # features/step_definitions/editor_ve_steps.rb:25

Scenario: I can edit a page using VisualEditor # features/editor_ve.feature:14
  Given I am in alpha mode                     # features/step_definitions/common_steps.rb:77
    timed out after 5 seconds, waiting for {:class=>"mw-mf-settings-off", :tag_name=>"span"} to become present (Watir::Wait::TimeoutError)
    ./features/step_definitions/common_steps.rb:82:in `block (3 levels) in <top (required)>'
    ./features/step_definitions/common_steps.rb:81:in `block (2 levels) in <top (required)>'
    ./features/step_definitions/common_steps.rb:78:in `/^I am in alpha mode$/'
    features/editor_ve.feature:15:in `Given I am in alpha mode'

<snip>

$ date
Tue Mar 18 11:35:44 MST 2014

And in the browser I can clearly see the red error message "Error saving settings. Please make sure that you have cookies enabled"

I understand but I ran exactly the same test 6 times and can't replicate this. This always passes for me.

What version of Chrome Driver are you using?

Created attachment 14852
error saving settings

Attached:

I understand - I can see this in the browser test screencast but I'm not seeing this so your environment must be setup slightly different to mine?

What version of Chrome Driver are you using?

ChromeDriver (v2.9.248304)

but that has nothing to do with the problem.

(In reply to Jon from comment #9)

If someone can help me get Chrome driver setup maybe I can find a way around
this... does delaying the opt in event with a wait help at all Chris?

Jon, did you manage to get chromedriver working on your machine?

Yep but still unable to replicate this bug in any way either locally or on beta labs...

Seems like a race condition in lines 204-208 of SpecialMobileOptions.php in submitSettingsForm.

As I read this:

  • submitSettingsForm calls out to do "$context = MobileContext::singleton();" and also "$request = $this->getRequest();"
  • both of these must take time
  • then the code immediately compares tokens with "if ( $request->getVal( 'token' ) != $context->getMobileToken() )"

It seems to me like submitSettingForm should wait for both $request and $context to complete before comparing values obtained from those operations.

Chris, these are synchronous operations, it's really hard to do stufff in parallel in PHP:) So no race condition.

As stated before, I really thing this is an environment issue. Chris what else might be different in your setup to us?

I think the fact that this is passing anywhere is an environment issue.

Same error happened twice in Firefox in the build just hours ago: https://wmf.ci.cloudbees.com/job/MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-firefox/438/

I agree with Max, in this case a race condition doesn't make sense due to the nature of PHP. However, it is conceivable that there are occasional hiccups/something going wrong with session storage on the cluster. If that were the case though, I would expect we would hear about more things like this happening - that said, it is probably worth checking into since this has been persistent and we have not been able to adequately explain it otherwise.

jeremygrajales wrote:

Is there anything I could do to attempt to replicate this on my end?

Jeremy if you (In reply to Jeremy L Grajales from comment #24)

Is there anything I could do to attempt to replicate this on my end?

Thanks Jeremy - if you don't have it set up already, the first thing to do would be to get MediaWiki and MobileFrontend setup and working. You would also need to install a few other extensions - ConfirmEdit (enable FancyCaptcha), GeoData, Echo, PageImages, VisualEditor (and get parsoid installed/running as well).

Then, you'd need to get your browser testing environment set up (https://www.mediawiki.org/wiki/Quality_Assurance/Browser_testing/Running_tests), and be sure to follow MobileFrontend-specific instructions in MobileFrontend/tests/browser/README.mediawiki

If you run into any problems, let us know and we can probably help you out. If you join #wikimedia-mobile on IRC (Freenode), you can usually get support faster than via Bugzilla.

(In reply to Arthur Richards from comment #23)

I agree with Max, in this case a race condition doesn't make sense due to
the nature of PHP. However, it is conceivable that there are occasional
hiccups/something going wrong with session storage on the cluster. If that
were the case though, I would expect we would hear about more things like
this happening - that said, it is probably worth checking into since this
has been persistent and we have not been able to adequately explain it
otherwise.

After chatting a bit with Chris today, I wonder if the problem is related to the platform and/or platform/browser combo. The jobs on sauce labs run in FF/Chrome on Linux - and when Chris tests locally he does so with a Linux box. AFAIK everyone else who has not been able to replicate this problem has been testing from a Mac.

I chatted with Chris Steipp about this a bit:

(03:53:51 PM) csteipp: So yeah, either the user doesn't have a session, wsMobileToken in the session is getting overwritten, or the form isn't correctly sending the token

<Antoine restarts all the caches on beta labs>

(04:06:19 PM) csteipp: chrismcmahon: Yeah, times out for me. So the thing is to narrow down which of those 3 is happening. A header capture of a failure would help determine that.

I'll look into getting those headers with WireShark or something.

jeremygrajales wrote:

I tested this on Firefox (Windows).
http://pastebin.com/yi9FGPVU

Seems to reproduce the same issue.
Although, I'm not sure.

Visually:
The browser gets to the settings page.
Toggles beta switch ON.
Saves settings.
And when the page refreshes, the beta button is toggled to OFF.

No obvious HTML errors are displayed as reported previously (ie. red text in the browser warning of disabled cookies).

jeremygrajales wrote:

(In reply to Jon from comment #9)

If someone can help me get Chrome driver setup maybe I can find a way around
this... does delaying the opt in event with a wait help at all Chris?

I also added a "sleep 5" below line 79 in common_steps.rb to see if that would resolve anything. Not sure if that is what you were suggesting to do.

But
(In reply to Jeremy L Grajales from comment #28)
adding that delay didn't change any of the previous results on Firefox (Windows).

(In reply to Chris McMahon from comment #27)

(03:53:51 PM) csteipp: So yeah, either the user doesn't have a session,
wsMobileToken in the session is getting overwritten, or the form isn't
correctly sending the token

I think this is heading in the right direction. My only objection to this is that when I attempt to reproduce this manually, everything works fine. This seems like a test handling error and not a session issue. Wouldn't a session issue affect all tests manual/automated?

If the opt into alpha happens after login does this resolve the issue? I wonder if it's something particularly strange with opting in whilst anonymous...

jeremygrajales wrote:

(In reply to Jon from comment #30)

If the opt into alpha happens after login does this resolve the issue? I
wonder if it's something particularly strange with opting in whilst
anonymous...

Don't you mean the opt into beta?
And this is something I thought of before.
I tried this manually, and anon or logged-in works fine.

Not sure about running an automated test. But I do not have enough experience editing tests to attempt this fix you suggest. If someone could show me how to do it, I'd love to try it out. But someone else would probably have the chance to do it before I got the chance to learn how to.

Change 120959 had a related patch set uploaded by Jdlrobson:
See if sleeping helps avoid token issues on Special:MobileOptions

https://gerrit.wikimedia.org/r/120959

jeremygrajales wrote:

(In reply to Gerrit Notification Bot from comment #32)

Change 120959 had a related patch set uploaded by Jdlrobson:
See if sleeping helps avoid token issues on Special:MobileOptions

https://gerrit.wikimedia.org/r/120959

I get the same behavior as before with the patch using Firefox on Windows.

jeremygrajales wrote:

(In reply to Gerrit Notification Bot from comment #32)

Change 120959 had a related patch set uploaded by Jdlrobson:
See if sleeping helps avoid token issues on Special:MobileOptions

https://gerrit.wikimedia.org/r/120959

I get the same behavior as before with the patch using Firefox on Windows.

Change 120959 abandoned by Jdlrobson:
See if sleeping helps avoid token issues on Special:MobileOptions

Reason:
Thanks to Jeremy L Grajales we can confirm this doesn't have any effect. Back to drawing board.

https://gerrit.wikimedia.org/r/120959

(In reply to Chris McMahon from comment #27)

I chatted with Chris Steipp about this a bit:

(03:53:51 PM) csteipp: So yeah, either the user doesn't have a session,
wsMobileToken in the session is getting overwritten, or the form isn't
correctly sending the token

<Antoine restarts all the caches on beta labs>

(04:06:19 PM) csteipp: chrismcmahon: Yeah, times out for me. So the thing is
to narrow down which of those 3 is happening. A header capture of a failure
would help determine that.

I'll look into getting those headers with WireShark or something.

Chris, any luck dumping those headers? I'm increasingly curious to see what's going on there.

jeremygrajales wrote:

(In reply to Arthur Richards from comment #36)

Chris, any luck dumping those headers? I'm increasingly curious to see
what's going on there.

What headers are you looking for at what point?
I'd like to see it on my end to see if the Windows reproduction diverges any further from the original bug.

Requested by Chris Steipp, Request headers and stuff pulled from the Network tab in Firefox after incurring the token error (I also have the entire response that contains the error message if you need it):

request header:

Host: en.m.wikipedia.beta.wmflabs.org
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://en.m.wikipedia.beta.wmflabs.org/wiki/Special:MobileOptions?useformat=mobile
Cookie: centralnotice_bucket=1-4.2; mf_useformat=true; enwikiSession=ed4a7df6517c9aa79ce5231905291c84
Connection: keep-alive

request body:

Content-Type: application/x-www-form-urlencoded
Content-Length: 87
enableImages=on&enableBeta=on&token=b9a06d2beb98ed9b35acded20a41e9fb&returnto=Main+Page

timings:
Waiting 892 ms
Receiving 3 ms

jeremygrajales wrote:

(Partial) Wireshark capture of editor_ve test

[attachments]
I captured the test with Wireshark.
The POST is obviously the first packet in the log.
I find the same behavior as Chris does in Windows Firefox.

Attached:

(In reply to Jeremy L Grajales from comment #39)

Created attachment 14914 [details]
(Partial) Wireshark capture of editor_ve test

[attachments]
I captured the test with Wireshark.
The POST is obviously the first packet in the log.
I find the same behavior as Chris does in Windows Firefox.

So in both Chris'es headers and Jeremy's pcap, there is a session cookie being sent. So that's good.

What I'm confused on is that Jeremy's pcap shows the POST to /wiki/Special:MobileOptions, and you see coming back a Location: http://en.m.wikipedia.beta.wmflabs.org/wiki/Main_Page header, which means that the inequality check in SpecialMobileOptions.php (line 208) was NOT true (i.e., the token parameter matched getMobileToken()), otherwise a form would have been returned.

Is the pcap not showing the issue? Or is this not a token issue?

Attached:

Created attachment 14943
token mismatch on POST shown in WireShark on Linux

Attached:

The screen shot above I think shows the token-mismatch problem on the POST.

However, the test itself behaves strangely on my local machine with WireShark running so I may have done this incorrectly

Chris, can you do "follow TCP Stream" and screenshot the result? I'd like to confirm that you're NOT seeing a Location header returned in the result of that call.

CSteipp, Wireshark is killing my poor underspec'd laptop. I spent a couple hours on this, but I cannot even get the test to complete or my wlan interface to stay functional with WireShark running.

Chris, in the meantime, can we point the saucelabs test to not use Linux, perhaps to use FF/Chrome in OS X since we haven't been able to replicate the issue there?

This is now the main source of our failing test builds (the last 3 builds failed solely because of this issue)

Thanks for all the attention being given here.

I wonder if in the mean time we should alter Special:MobileOptions to use a preference for logged in users rather than a cookie - this would be a way of getting around the failures and would be a useful user facing feature - opt in once, get beta/alpha everywhere...

Is anyone able to reproduce this in the office yet? If so, I'm happy to work on this in person tomorrow. Or if someone can walk me through the steps to reproduce, I can attempt from a couple of installs I have.

(In reply to Jon from comment #46)

This is now the main source of our failing test builds (the last 3 builds
failed solely because of this issue)

Thanks for all the attention being given here.

I wonder if in the mean time we should alter Special:MobileOptions to use a
preference for logged in users rather than a cookie - this would be a way of
getting around the failures and would be a useful user facing feature - opt
in once, get beta/alpha everywhere...

We would still have the same problem - at least in terms of setting the preference. In order to use that form, we need to set the CSRF token which requires the session/cookie. So when the form was submitted to set the preference, you would get stuck with the same behavior. I suppose though once the preference were set, the issue would be moot. It could be a stopgap measure... just as I think running the tests against FF/Chrome in OSX rather than linux/windows would be.

Chris Steipp:

Pretty reliable repo for me has been to run the editor_ve.feature test, and the second Scenario fails with the error almost every time. Watch the browser as the test runs to see the red error message appear before the ON/OFF toggle switches back to OFF:

$ export MEDIAWIKI_URL=http://en.m.wikipedia.beta.wmflabs.org/wiki/
$ export MEDIAWIKI_USER=USER
$ export MEDIAWIKI_PASSWORD=PASSWORD

:~/MobileFrontend/tests/browser$ bundle exec cucumber features/editor_ve.feature
@chrome @en.m.wikipedia.beta.wmflabs.org @firefox @login @test2.m.wikipedia.org
Feature: VisualEditor

Background:                               # features/editor_ve.feature:4
  Given I am in alpha mode                # features/step_definitions/common_steps.rb:77
  And I am logged into the mobile website # features/step_definitions/common_steps.rb:10

Scenario: Switch from VisualEditor to source editor               # features/editor_ve.feature:8
  Given I am on the "Selenium Edit Test" page                     # features/step_definitions/common_steps.rb:88
  And I click the edit button                                     # features/step_definitions/editor_steps.rb:7
  And I see the VisualEditor overlay                              # features/step_definitions/editor_ve_steps.rb:9
  And The VisualEditor overlay has an editor mode switcher button # features/step_definitions/editor_ve_steps.rb:29
  And I click the editor mode switcher button                     # features/step_definitions/editor_steps.rb:32
  When I click the source editor button                           # features/step_definitions/editor_steps.rb:36
  Then I see the wikitext editor overlay                          # features/step_definitions/editor_steps.rb:11

Scenario: Ensure we load the correct section # features/editor_ve.feature:17
  Given I am in alpha mode                # features/step_definitions/common_steps.rb:77
    timed out after 5 seconds, waiting for {:class=>"mw-mf-settings-off", :tag_name=>"span"} to become present (Watir::Wait::TimeoutError)
    ./features/step_definitions/common_steps.rb:82:in `block (3 levels) in <top (required)>'
    ./features/step_definitions/common_steps.rb:81:in `block (2 levels) in <top (required)>'
    ./features/step_definitions/common_steps.rb:78:in `/^I am in alpha mode$/'
    features/editor_ve.feature:5:in `Given I am in alpha mode'

chris steipp, it's however worth noting that we've been unable to replicate this with the above steps on OS X

Created attachment 14959
WireShark capture of error conversation

This should be the entire conversation containing the POST to MobileOptions that causes the token error.

(I happen to have another Ubuntu laptop. After 90 OS upgrades, installing RVM and installing Wireshark, I was able to capture the traffic on a repro of the error.)

Attached:

Thanks Chris. That confirms that this is different from what Jeremy posted, since his had a Location header come back from the POST request, whereas yours showed the form with the <div class="error">Error saving settings. Please make sure that you have cookies enabled.</div>.

The only codepath that sets wsMobileToken seems pretty safe, and since the Set-Cookie with enwikiSession is returned with the form, that seems pretty safe.

Could someone add a logging statement just after the token check to see what value MediaWiki thinks the mobile token is supposed to be?

https://gerrit.wikimedia.org/r/121879 added logging of token details to mobile.log

From the log in beta labs on eqiad just now:

-04-02 07:46:39 deployment-apache02 enwiki: SpecialMobileOptions::submitSettingsForm(): token mismatch, received cb9248cad0ffc2d5886f2322ac58c2f9 - expected ff97fcfc9650819e3788b20c26680d6b
2014-04-02 14:36:50 deployment-apache01 enwiki: SpecialMobileOptions::submitSettingsForm(): token mismatch, received b6752dcbd18953383d871c531d620f4c - expected a3e842905af2f624cb4511345978fdc4

Thanks Chris, that narrows down the problem significantly.

Can you rerun the tests with a different (or new) username, so we can rule out that this is a race condition with multiple tests using the same user account, or something is corrupted in the session cache?

I ran the tests from the office from my linux box, and the test (whenever they finally ran) passed without any errors. So it seems like there is either something about the user that you're using, or the network lag affecting a race condition.

These tests manipulate the ON/OFF toggle settings before logging in, and each browser starts with a fresh profile on every run of every scenario. Username should be immaterial.

I set the editor_ve.feature test to log in with a different user anyway and generated the following errors in the second run of the test on my local host:

2014-04-02 17:52:34 deployment-apache02 enwiki: SpecialMobileOptions::submitSettingsForm(): token mismatch, received 60fa1ffd7c4273c5c1e5ceeb7d44cfd2 - expected 361fda81efd0e50f37234db4384ee64e
2014-04-02 17:53:35 deployment-apache02 enwiki: SpecialMobileOptions::submitSettingsForm(): token mismatch, received 60cd4c3b308cac220e7a8629b0427056 - expected 34a28952e608764e4452a5b73b80175e
2014-04-02 17:53:50 deployment-apache02 enwiki: SpecialMobileOptions::submitSettingsForm(): token mismatch, received 60cd4c3b308cac220e7a8629b0427056 - expected 34a28952e608764e4452a5b73b80175e

Is it of interest that the last two errors have exactly the same mismatched tokens 15 seconds apart?

Jon updated the editor_ve.feature test today, but I cannot get a clean run of the test on beta labs.

In every case one token seems to persist longer than it should, on the order of 15+ seconds. I was tailing mobile.log for three successive unsuccessful runs of that test and got these results:

2014-04-09 22:04:34 deployment-apache02 enwiki: SpecialMobileOptions::submitSettingsForm(): token mismatch, received 0784e6702ac5df67d042c4a7969165bd - expected 0a84ee38edc0905b86d3620bd12765e9
2014-04-09 22:05:36 deployment-apache01 enwiki: SpecialMobileOptions::submitSettingsForm(): token mismatch, received 738fbdc0987c5b31cef77ae5e08aa9fb - expected a041b5f2632ef617d1b92681456d8866
2014-04-09 22:05:53 deployment-apache02 enwiki: SpecialMobileOptions::submitSettingsForm(): token mismatch, received 738fbdc0987c5b31cef77ae5e08aa9fb - expected a041b5f2632ef617d1b92681456d8866

2014-04-09 22:08:13 deployment-apache01 enwiki: SpecialMobileOptions::submitSettingsForm(): token mismatch, received b9547af8f118ad0bca2834777d62ba1c - expected 29d3774af72c2a0e6f0768092480c982
2014-04-09 22:08:31 deployment-apache02 enwiki: SpecialMobileOptions::submitSettingsForm(): token mismatch, received b9547af8f118ad0bca2834777d62ba1c - expected 29d3774af72c2a0e6f0768092480c982
2014-04-09 22:09:42 deployment-apache01 enwiki: SpecialMobileOptions::submitSettingsForm(): token mismatch, received c6c71ff97b5a448c78f3e85aa27e343d - expected 3a78d9b84bb3873f503baf7d0e41b4cc

2014-04-09 22:11:14 deployment-apache01 enwiki: SpecialMobileOptions::submitSettingsForm(): token mismatch, received 1ec21965b2c0eb6dcfd6a01ef1c4eacc - expected 520e7ebbfa9f0449d3ebc0c79e350911
2014-04-09 22:11:31 deployment-apache01 enwiki: SpecialMobileOptions::submitSettingsForm(): token mismatch, received 1ec21965b2c0eb6dcfd6a01ef1c4eacc - expected 520e7ebbfa9f0449d3ebc0c79e350911
2014-04-09 22:12:38 deployment-apache01 enwiki: SpecialMobileOptions::submitSettingsForm(): token mismatch, received 9ce4d541e91887bed12a76e988f90965 - expected 467de8a293b2a6de7a6fef0905660b46

So this problem seems to only happen when the user is anonymous...
It seems that $this->getRequest()->getSessionData( 'wsMobileToken' ) is returning null which results in a new session getting setup on submitting the form which results in invalidation of the form.

Don't understand this code well enough to know if this is misbehaving and if it is why.

The problematic function seems to be MobileContext::getMobileToken

There have been 8708 hits to this error in the last 30 days on our production wikis.

We deep dived this issue again and concluded we should remove getMobileToken in favour of user->getToken

Chris says that soon anonymous editors will get a unique token when that is called.

In mean time there is a CSRF risk to anons using Special:MobileOptions but we decided since risk was low we can live with this.

(In reply to Jon from comment #64)

We deep dived this issue again and concluded we should remove getMobileToken
in favour of user->getToken

Chris says that soon anonymous editors will get a unique token when that is
called.

Out of curiosity, is there a timeline for this?

In mean time there is a CSRF risk to anons using Special:MobileOptions but
we decided since risk was low we can live with this.

This makes sense to me since this would only be an issue for anons. I'm ok with it so long as Chris S is OK with it.

Change 126873 had a related patch set uploaded by Jdlrobson:
Use core token handling

https://gerrit.wikimedia.org/r/126873

Change 126873 merged by jenkins-bot:
Use core token handling

https://gerrit.wikimedia.org/r/126873