Login broken for beta cluster and Selenium tests
Closed, Resolved (Public)

Description

Login is not working on beta cluster sites, and all Selenium tests that use authentication are failing because Selenium cannot log in. The error is: "There seems to be a problem with your login session; this action has been canceled as a precaution against session hijacking. Please resubmit the form."

Example builds:

Successful build #1911 was triggered on Oct 10th at 20:31 UTC. It was for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/251312/119 which is not merged.

The first build that failed with the Create account error is build #1912. It was triggered at 20:40 UTC for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/838733

Both patches were for mediawiki/core, so something broke around that time. The cause might be found by diffing the console outputs of the two builds.

The last successful build is build #1913 triggered on Oct 10th at 20:49 for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/841191 which is not merged.
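The console diff suggested above could be sketched roughly as follows. This is only an illustration: it assumes each console line starts with an `HH:MM:SS` timestamp (an assumption about the log format), and the sample lines and commit hashes are made-up placeholders, not taken from builds #1911 or #1912.

```python
import difflib
import re

# Assumed format: console lines start with a timestamp such as "20:31:07 ".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}\s+")


def normalize(lines):
    """Strip leading timestamps so identical steps compare equal."""
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]


def console_diff(good_lines, bad_lines, good_name="build-1911", bad_name="build-1912"):
    """Return a unified diff between two console logs as a list of lines."""
    return list(
        difflib.unified_diff(
            normalize(good_lines),
            normalize(bad_lines),
            fromfile=good_name,
            tofile=bad_name,
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Placeholder log lines for illustration only.
    good = ["20:31:07 Cloning mediawiki/core\n", "20:31:09 HEAD is now at aaaaaaa\n"]
    bad = ["20:40:02 Cloning mediawiki/core\n", "20:40:05 HEAD is now at bbbbbbb\n"]
    print("\n".join(console_diff(good, bad)))
```

Stripping the timestamps first keeps the diff focused on what actually differs between the runs, such as the commits checked out for each repository.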

Event Timeline

kostajh triaged this task as Unbreak Now! priority.Oct 11 2022, 8:01 AM
kostajh created this task.

Looking at a GrowthExperiments test I see this error:

There seems to be a problem with your login session;
this action has been canceled as a precaution against session hijacking.
Please resubmit the form.

I can reproduce this problem on beta normally if I clear my browser cookies.

I can now reproduce even without clearing my browser cookies, just by logging out of an existing session and trying to log in again.

hashar added a subscriber: hashar.

The wmf-quibble jobs trigger with core, vendor, Vector and the following extensions:

mediawiki/extensions/AbuseFilter
mediawiki/extensions/AntiSpoof
mediawiki/extensions/Babel
mediawiki/extensions/BetaFeatures
mediawiki/extensions/CheckUser
mediawiki/extensions/CirrusSearch
mediawiki/extensions/Cite
mediawiki/extensions/CiteThisPage
mediawiki/extensions/CodeEditor
mediawiki/extensions/ConfirmEdit
mediawiki/extensions/ContentTranslation
mediawiki/extensions/Disambiguator
mediawiki/extensions/Echo
mediawiki/extensions/Elastica
mediawiki/extensions/EventBus
mediawiki/extensions/EventLogging
mediawiki/extensions/EventStreamConfig
mediawiki/extensions/FileImporter
mediawiki/extensions/Flow
mediawiki/extensions/Gadgets
mediawiki/extensions/GeoData
mediawiki/extensions/GlobalCssJs
mediawiki/extensions/GlobalPreferences
mediawiki/extensions/Graph
mediawiki/extensions/GrowthExperiments
mediawiki/extensions/GuidedTour
mediawiki/extensions/ImageMap
mediawiki/extensions/InputBox
mediawiki/extensions/Interwiki
mediawiki/extensions/JsonConfig
mediawiki/extensions/Kartographer
mediawiki/extensions/Math
mediawiki/extensions/MobileApp
mediawiki/extensions/MobileFrontend
mediawiki/extensions/NavigationTiming
mediawiki/extensions/PageImages
mediawiki/extensions/PageViewInfo
mediawiki/extensions/ParserFunctions
mediawiki/extensions/PdfHandler
mediawiki/extensions/Poem
mediawiki/extensions/ProofreadPage
mediawiki/extensions/SandboxLink
mediawiki/extensions/Scribunto
mediawiki/extensions/SiteMatrix
mediawiki/extensions/SpamBlacklist
mediawiki/extensions/TemplateData
mediawiki/extensions/Thanks
mediawiki/extensions/TimedMediaHandler
mediawiki/extensions/Translate
mediawiki/extensions/UniversalLanguageSelector
mediawiki/extensions/VisualEditor
mediawiki/extensions/WikiEditor
mediawiki/extensions/Wikibase
mediawiki/extensions/WikibaseCirrusSearch
mediawiki/extensions/WikibaseMediaInfo
mediawiki/extensions/WikimediaMessages
mediawiki/extensions/cldr

I can log in to https://login.wikimedia.beta.wmflabs.org, after which I can log in to other beta sites.

kostajh renamed this task from All Selenium tests using authentication are broken to Login broken for beta cluster and Selenium tests.Oct 11 2022, 9:12 AM
kostajh updated the task description. (Show Details)

The Selenium job that is failing has ArticlePlaceholder, CentralAuth, FlaggedRevs, PropertySuggester, Renameuser, SyntaxHighlight, WikibaseLexeme, WikibaseLexemeCirrusSearch, WikibaseQualityConstraints, WikimediaBadges and WikimediaEvents cloned, while one that is passing (WikibaseQualityConstraints, with quibble-vendor) does not have those extensions cloned.
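A comparison like the one above can be reduced to a set difference over the repositories each job cloned. The sketch below is illustrative only: `only_in_failing` is a hypothetical helper, and the two lists are small made-up samples, not the real clone lists of either job.

```python
def only_in_failing(failing_repos, passing_repos):
    """Return repositories cloned by the failing job but not by the passing one."""
    return sorted(set(failing_repos) - set(passing_repos))


# Illustrative samples only; the real jobs clone many more repositories.
failing = [
    "mediawiki/core",
    "mediawiki/extensions/CentralAuth",
    "mediawiki/extensions/FlaggedRevs",
    "mediawiki/extensions/Wikibase",
]
passing = [
    "mediawiki/core",
    "mediawiki/extensions/Wikibase",
]

if __name__ == "__main__":
    for repo in only_in_failing(failing, passing):
        print(repo)
```

The repositories that show up only on the failing side are the first candidates to look at.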

Change 841159 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/core@master] Revert "Skins: Config flag controls contributions link"

https://gerrit.wikimedia.org/r/841159

Change 841160 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/core@wmf/1.40.0-wmf.5] Revert "Skins: Config flag controls contributions link"

https://gerrit.wikimedia.org/r/841160

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/837634

The i18n file changes from this patch are not showing up on master; that is the cause.

Change 841159 merged by jenkins-bot:

[mediawiki/core@master] Revert "Skins: Config flag controls contributions link"

https://gerrit.wikimedia.org/r/841159

Change 841160 merged by jenkins-bot:

[mediawiki/core@wmf/1.40.0-wmf.5] Revert "Skins: Config flag controls contributions link"

https://gerrit.wikimedia.org/r/841160

kostajh lowered the priority of this task from Unbreak Now! to Needs Triage.Oct 11 2022, 11:09 AM
taavi assigned this task to kostajh.

The patch that got reverted is https://gerrit.wikimedia.org/r/c/mediawiki/core/+/838733 . It got a CR+2 from @Mabualruz, and CI triggered, reporting:

Starting gate-and-submit jobs.
https://integration.wikimedia.org/zuul/

Ten minutes later the change got V+2 and was submitted by @Mabualruz (effectively a force merge).

CI did not report back on the change, most probably because the build failed: that would cause a Verified -1 to be added, and Gerrit forbids changing votes after a change is closed. I have looked at the CI logs (contint2001.wikimedia.org /var/log/zuul/error.log.2022-10-10):

2022-10-10 20:52:03,998 Exception: 
Gerrit error executing gerrit review --project mediawiki/core --message "Gate pipeline build failed.
- https://integration.wikimedia.org/ci/job/mwgate-node14-docker/69044/console : SUCCESS in 1m 48s
- https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php74-docker/847/console : SUCCESS in 11m 16s
- https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php74-docker/1912/console : FAILURE in 1m 57s
- https://integration.wikimedia.org/ci/job/mediawiki-quibble-composer-mysql-php74-docker/1510/console : SUCCESS in 6m 12s
- https://integration.wikimedia.org/ci/job/mediawiki-quibble-vendor-mysql-php74-docker/12330/console : SUCCESS in 6m 47s
- https://integration.wikimedia.org/ci/job/mediawiki-quibble-vendor-mysql-php80-docker/291/console : SUCCESS in 6m 56s
- https://integration.wikimedia.org/ci/job/mediawiki-quibble-vendor-mysql-php81-docker/204/console : SUCCESS in 6m 04s
- https://integration.wikimedia.org/ci/job/mediawiki-quibble-composertest-php74-docker/700/console : SUCCESS in 1m 07s
- https://integration.wikimedia.org/ci/job/mediawiki-quibble-selenium-vendor-mysql-php74-docker/650/console : SUCCESS in 1m 37s
- https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74-docker/1215/console : SUCCESS in 2m 15s
- https://integration.wikimedia.org/ci/job/mediawiki-quibble-vendor-sqlite-php74-docker/122/console : SUCCESS in 5m 20s
- https://integration.wikimedia.org/ci/job/mediawiki-quibble-vendor-postgres-php74-docker/122/console : SUCCESS in 9m 24s
- https://integration.wikimedia.org/ci/job/mediawiki-core-php74-phan-docker/783/console : SUCCESS in 3m 10s
" --tag autogenerated:ci --verified -1 838733,8', 'error: fatal: Cannot reduce vote on labels for closed change: Verified

fatal: one or more reviews failed; review output above

This indicates: https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php74-docker/1912/console FAILURE

As I found out earlier today when investigating, build 1912 is the first one that failed with the Create account error, and the change it tested is the root cause of this incident.
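Picking the failing job out of a report like the one in the log above can be scripted. The following is a minimal sketch that assumes each job line has the shape `- <url> : <RESULT> in <duration>`, as in that log excerpt; `failed_jobs` is a hypothetical helper, not part of Zuul.

```python
import re

# Assumed line shape, taken from the log excerpt: "- <url> : <RESULT> in <duration>"
JOB_LINE = re.compile(r"^- (?P<url>\S+) : (?P<result>[A-Z_]+) in (?P<duration>.+)$")


def failed_jobs(report):
    """Return (url, result) pairs for every job that did not succeed."""
    failures = []
    for line in report.splitlines():
        match = JOB_LINE.match(line.strip())
        if match and match.group("result") != "SUCCESS":
            failures.append((match.group("url"), match.group("result")))
    return failures


# Two lines copied from the report above.
REPORT = """Gate pipeline build failed.
- https://integration.wikimedia.org/ci/job/mwgate-node14-docker/69044/console : SUCCESS in 1m 48s
- https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php74-docker/1912/console : FAILURE in 1m 57s
"""

if __name__ == "__main__":
    for url, result in failed_jobs(REPORT):
        print(result, url)
```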

Some more discussion happened on the revert change https://gerrit.wikimedia.org/r/c/mediawiki/core/+/841159

@Mabualruz, about your change https://gerrit.wikimedia.org/r/c/mediawiki/core/+/838733 : it did pass on October 6th at 7:40 UTC. But that result became obsolete almost instantly, since we trigger jobs which clone multiple repositories. As code gets merged in, e.g., ContentTranslation, another run of CI leads to different results, which is what happened in this case.

When a change causes a test to fail, every repository relying on that test ends up broken. In this case it affected a Selenium job which is shared by roughly 60 repositories (list at T320471#8306725). This means no changes to those repositories can be merged, which blocks a lot of people. The policy we have is to revert immediately to unbreak CI, given that code can always be proposed again. In this case it additionally blocked the MediaWiki train :)

Most probably someone could have reached out to you directly to synchronize, and maybe a hot fix would have been easy to do. But in most cases whoever is present on IRC at the time takes the easy path: revert to resolve the issue immediately, then investigate a proper patch later on. The aim is really to unblock everyone.

Also, Gerrit should not let us submit changes manually; that is T226123: Make test pipline vote Verified+1 instead of +2 to avoid unintentional submit, which would probably remove the Submit permission entirely.
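To illustrate why an earlier passing result goes stale when a job clones several repositories, here is a small sketch. It is not how Zuul tracks this: `stale_repos` is a hypothetical helper, and the commit hashes are placeholders. The idea is simply that a result only speaks for the exact set of commits it tested.

```python
def stale_repos(tested_heads, current_heads):
    """Return repositories whose current HEAD differs from the one CI tested."""
    return sorted(
        repo
        for repo, sha in current_heads.items()
        if tested_heads.get(repo) != sha
    )


# Placeholder hashes for illustration only.
tested = {
    "mediawiki/core": "aaaaaaa",
    "mediawiki/extensions/ContentTranslation": "bbbbbbb",
}
current = {
    "mediawiki/core": "aaaaaaa",
    "mediawiki/extensions/ContentTranslation": "ccccccc",
}

if __name__ == "__main__":
    moved = stale_repos(tested, current)
    if moved:
        print("Earlier result is stale; re-run CI. Changed:", ", ".join(moved))
```

If any cloned repository has moved since the passing run, the safe option is to let the gate re-run the jobs rather than submit on the old result.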

@hashar Thanks for the clarification. Just extra context below:

Sorry for all the inconvenience. I wanted to see the actions in manual mode, and since the pipeline had passed the same day, I did not do my full due diligence and run "recheck", and I did not notice the https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/837634 patch being merged.

Now I have seen and learned the manual steps, and whenever they are really needed I will make sure to do my due diligence properly. It is a bunch of lessons learned. I hope that adds to the clarity of the situation.

@Mabualruz no worries, and thanks for the follow-up. As one of our esteemed colleagues once told me during a stressful failed deployment:

don't worry the Wikis are still up and running

:-]