'MapCacheLRU::has called with invalid key' exception in CentralAuthUser::getInstanceByName at [[Special:GlobalUserRights]]
Closed, ResolvedPublicPRODUCTION ERROR
Actions

Description

When [[Special:GlobalUserRights]] is called without user parameter, the special page code always tries to initialize an user calling SpecialGlobalGroupMembership::fetchUser.

As we don't have an user that throws the following exception:

MapCacheLRU::has called with invalid key. Must be string or integer.

The expected behavior is a form to be able to offer an username.

This can be checked at https://meta.wikimedia.org/wiki/Special:GlobalUserRights.

Details

	Subject	Repo	Branch	Lines +/-
	Handle invalid names in CentralAuthGroupMembershipProxy::newFromName	mediawiki/extensions/CentralAuth	wmf/1.29.0-wmf.6	+3 -0
	Handle invalid names in CentralAuthGroupMembershipProxy::newFromName	mediawiki/extensions/CentralAuth	master	+3 -0

Customize query in gerrit

Related Objects

Mentioned In: T153920: Special:GlobalUserRights returns 500 everywhere
T153591: MapCacheLRU::has called with invalid key exception in CentralAuthUser::getInstanceByName
Mentioned Here: T153591: MapCacheLRU::has called with invalid key exception in CentralAuthUser::getInstanceByName
T153920: Special:GlobalUserRights returns 500 everywhere
rMW70c2223843b6: MapCacheLRU: Properly handle bogus cache keys
T153071: beta: /wiki/Special:GlobalUserRights MWException from line 79 of /srv/mediawiki/php-master/includes/libs/MapCacheLRU.php: MapCacheLRU::has called with invalid key. Must be string or integer.

Event Timeline

• MZMcBride created this task.Dec 18 2016, 7:04 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 18 2016, 7:04 AM

2016-12-18 07:03:22 [WFY0ugpAAEQAAEjNyusAAAAB] mw1273 metawiki 1.29.0-wmf.6 exception ERROR: [WFY0ugpAAEQAAEjNyusAAAAB] /wiki/Special:GlobalUserRights   MWException from line 79 of /srv/mediawiki/php-1.29.0-wmf.6/includes/libs/MapCacheLRU.php: MapCacheLRU::has called with invalid key. Must be string or integer. {"exception_id":"WFY0ugpAAEQAAEjNyusAAAAB"} 
[Exception MWException] (/srv/mediawiki/php-1.29.0-wmf.6/includes/libs/MapCacheLRU.php:79) MapCacheLRU::has called with invalid key. Must be string or integer.
  #0 /srv/mediawiki/php-1.29.0-wmf.6/includes/libs/MapCacheLRU.php(93): MapCacheLRU->has(boolean)
  #1 /srv/mediawiki/php-1.29.0-wmf.6/extensions/CentralAuth/includes/CentralAuthUser.php(125): MapCacheLRU->get(boolean)
  #2 /srv/mediawiki/php-1.29.0-wmf.6/extensions/CentralAuth/includes/CentralAuthGroupMembershipProxy.php(40): CentralAuthUser::getInstanceByName(boolean)
  #3 /srv/mediawiki/php-1.29.0-wmf.6/extensions/CentralAuth/includes/specials/SpecialGlobalGroupMembership.php(90): CentralAuthGroupMembershipProxy::newFromName(boolean)
  #4 /srv/mediawiki/php-1.29.0-wmf.6/includes/specials/SpecialUserrights.php(95): SpecialGlobalGroupMembership->fetchUser(NULL, boolean)
  #5 /srv/mediawiki/php-1.29.0-wmf.6/includes/specialpage/SpecialPage.php(522): UserrightsPage->execute(NULL)
  #6 /srv/mediawiki/php-1.29.0-wmf.6/includes/specialpage/SpecialPageFactory.php(577): SpecialPage->run(NULL)
  #7 /srv/mediawiki/php-1.29.0-wmf.6/includes/MediaWiki.php(283): SpecialPageFactory::executePath(Title, RequestContext)
  #8 /srv/mediawiki/php-1.29.0-wmf.6/includes/MediaWiki.php(851): MediaWiki->performRequest()
  #9 /srv/mediawiki/php-1.29.0-wmf.6/includes/MediaWiki.php(512): MediaWiki->main()
  #10 /srv/mediawiki/php-1.29.0-wmf.6/index.php(43): MediaWiki->run()
  #11 /srv/mediawiki/w/index.php(3): include(string)
  #12 {main}

Confirm as user with access to the page:

[WFZjMApAAEAAAZ6Sbp0AAABD] 2016-12-18 10:21:36: Fatal exception of type MWException

UBN since this has been working previously and now blocks management of wiki sites.

Restricted Application added subscribers: Jay8g, Luke081515, TerraCodes. · View Herald TranscriptDec 18 2016, 10:23 AM

CentralAuthGroupMembershipProxy

	/**
	 * @param $name
	 * @return CentralAuthGroupMembershipProxy|null
	 */
	public static function newFromName( $name ) {
		$name = User::getCanonicalName( $name );
		$globalUser = CentralAuthUser::getInstanceByName( $name );
		return $globalUser->exists() ? new CentralAuthGroupMembershipProxy( $globalUser ) : null;
	}

getCanonicalName can return bool... Which then passes it unchecked to CentralAuthUser::getInstanceByName() which means looking up a bool

#4 /srv/mediawiki/php-1.29.0-wmf.6/includes/specials/SpecialUserrights.php(95): SpecialGlobalGroupMembership->fetchUser(NULL, boolean)

looks spurious too... It only takes 1 parameter

@MarcoAurelio, hold on a second. Does https://meta.wikimedia.org/wiki/Special:GlobalUserRights/Legoktm work for you? It at least doesn't fatal for me.

Yes, with username it doesn't fatal, and it does not fatal either when adding/removing rights. Back to NT.

MarcoAurelio mentioned this in T153591: MapCacheLRU::has called with invalid key exception in CentralAuthUser::getInstanceByName.Dec 18 2016, 2:47 PM

Dereckson renamed this task from Fatal exception at Special:GlobalUserRights on Meta-Wiki to 'MapCacheLRU::has called with invalid key' exception in CentralAuthUser::getInstanceByName at [[Special:GlobalUserRights]].Dec 18 2016, 5:08 PM

Dereckson triaged this task as Medium priority.

Dereckson updated the task description. (Show Details)

Dereckson merged a task: T153591: MapCacheLRU::has called with invalid key exception in CentralAuthUser::getInstanceByName.

Dereckson added a subscriber: Linedwell.

Dereckson subscribed.

Teles subscribed.Dec 18 2016, 5:16 PM

TTO merged a task: T153071: beta: /wiki/Special:GlobalUserRights MWException from line 79 of /srv/mediawiki/php-master/includes/libs/MapCacheLRU.php: MapCacheLRU::has called with invalid key. Must be string or integer..Dec 21 2016, 5:25 AM

TTO added a subscriber: hashar.

So this issue was caught on Beta Labs (T153071) and someone still went ahead and deployed this to production? I'm very confused.

I am the one that filled the beta cluster task on T153071 . IIRC I noticed that while I was doing something else, got interrupted to add a user right for a user, I filled the task and forgot about it. I guess I should have made it a blocker to the deployment :-/

Not all issues found on the Beta Cluster are production deployment blockers. If it is, it needs to be marked as such by someone. The original task that @hashar opened had Beta-Cluster-reproducible. This task does not. There is no way for the release engineering team to discover this task nor know of it's blocking worthiness with any certainty without us being informed.

Also, this is marked as a normal priority task (it was UBN! for 4 minutes, but that was undone) and as such it definitely wouldn't have garnered much attention.

In T153578#2893943, @greg wrote:

Not all issues found on the Beta Cluster are production deployment blockers. If it is, it needs to be marked as such by someone. The original task that @hashar opened had Beta-Cluster-reproducible. This task does not. There is no way for the release engineering team to discover this task nor know of it's blocking worthiness with any certainty without us being informed.

Also, this is marked as a normal priority task (it was UBN! for 4 minutes, but that was undone) and as such it definitely wouldn't have garnered much attention.

I'm wondering what the value of the beta wikis is if really obvious bugs like this aren't getting caught. When you go to https://meta.wikimedia.org/wiki/Special:GlobalUserRights, it throws HTTP/1.1 500 Internal Server Error. Is there no test or check in place when we deploy to make sure that URLs like this are not erroring?

Is this page going to stay broken until the new year?

In T153578#2894469, @MZMcBride wrote:

I'm wondering what the value of the beta wikis is if really obvious bugs like this aren't getting caught. When you go to https://meta.wikimedia.org/wiki/Special:GlobalUserRights, it throws HTTP/1.1 500 Internal Server Error. Is there no test or check in place when we deploy to make sure that URLs like this are not erroring?

Welcome to the world of testing :) If it's not tested (and reported) it's not found. *Why* this page itself isn't tested by automated means is a question for the maintainers of the code (more on that below).

Is this page going to stay broken until the new year?

This is a question for the team maintaining the code in question. Well, two questions. 1) how bad is it and does it deserve a revert to a known good state and 2) if not, how long until a fix is in place accounting for developer time. Deployment time is a rounding error in that and we can push out emergency/high need fixes at any point (the freeze is only in effect for the normal train and SWATs).

This was reduced from UBN! to Normal after 4 minutes, suggesting to me (as a non-user of the specialpage in question) that it isn't an emergency and can wait. I could be wrong and always welcome more information/guidance.

According to https://www.mediawiki.org/wiki/Developers/Maintainers it looks like @Legoktm is the best bet here.

@Legoktm do you have an opinion here?

From the stack trace this looks like the special page code needs to be patched. That should not be a big deal, right? Worst case, the page gets even more broken (in which case the patch can be reverted). There is no risk to fundraising or general site stability.

I agree, @Tgr.

Matanya mentioned this in T153920: Special:GlobalUserRights returns 500 everywhere.Dec 21 2016, 11:48 PM

Change 328617 had a related patch set uploaded (by Gergő Tisza):
Make mediawiki/core cloning shallow by default

https://gerrit.wikimedia.org/r/328617

gerritbot added a project: Patch-For-Review.Dec 22 2016, 3:19 AM

Oops, sorry, wrong task.

Broken by 70c2223 (looking at logstash, Special:GlobalUserRights seems to be the only thing that's significantly affected). Should be fixed in CentralAuthGroupMembershipProxy::newFromName or CentralAuthUser::getInstanceByName as other callers might use those, and I don't want to touch those files during a deploy freeze. Can put a temporary fix in the special page if you feel this problem is urgent enough.

Aklapper merged a task: T153920: Special:GlobalUserRights returns 500 everywhere.Dec 22 2016, 1:35 PM

Aklapper added a subscriber: Matanya.

In T153578#2894725, @greg wrote:

In T153578#2894469, @MZMcBride wrote:

I'm wondering what the value of the beta wikis is if really obvious bugs like this aren't getting caught. When you go to https://meta.wikimedia.org/wiki/Special:GlobalUserRights, it throws HTTP/1.1 500 Internal Server Error. Is there no test or check in place when we deploy to make sure that URLs like this are not erroring?

Welcome to the world of testing :)

Welcome? Are you all just getting to the party now? I can provide some tips, such as setting up a script that checks common HTTP and HTTPS end-points and ensures that they return appropriate HTTP status codes, preferably before breaking this page and causing four tasks to be filed (T153071, T153920, T153591, and T153578).

I can't speak for the others, but when filing this task, I was thinking "surely this can't already have been reported, I need to tell someone since this is obviously broken and needs to be fixed." That's a decent amount of volunteer time wasted on this trivial bug, especially if you consider you're only seeing the reported cases in Maniphest.

*Why* this page itself isn't tested by automated means is a question for the maintainers of the code (more on that below).

Maintainers? I think anyone can make HTTP and HTTPS requests and check status codes.

Deployment time is a rounding error in that and we can push out emergency/high need fixes at any point (the freeze is only in effect for the normal train and SWATs).

Cool, thank you, this is good to know. The wikitech-l e-mails made it sound like all deployments were frozen until 2017.

I don't think this task is a particularly high priority, but I do think it exposes pretty large holes in our testing infrastructure, or lack thereof. Setting up an entire beta cluster while simultaneously not doing basic checks like "does this page that used to load still load after the new code has been applied?" is kind of insane.

In T153578#2893943, @greg wrote:

Not all issues found on the Beta Cluster are production deployment blockers. If it is, it needs to be marked as such by someone.

I don't like relying on humans. It creates issues like the one we're seeing in this task, where hashar dutifully filed a task, but it was not properly tagged/categorized/flagged. We should look at closing this gap with automation, so that if an HTTPS end-point that used to return a 200 OK now returns a 500 Internal Server Error, we know about it and address it prior to continuing the deployment.

In T153578#2896273, @MZMcBride wrote:

In T153578#2894725, @greg wrote:

*Why* this page itself isn't tested by automated means is a question for the maintainers of the code (more on that below).

Maintainers? I think anyone can make HTTP and HTTPS requests and check status codes.

[snip]

Setting up an entire beta cluster while simultaneously not doing basic checks like "does this page that used to load still load after the new code has been applied?" is kind of insane.

I don't like relying on humans. It creates issues like the one we're seeing in this task, where hashar dutifully filed a task, but it was not properly tagged/categorized/flagged. We should look at closing this gap with automation, so that if an HTTPS end-point that used to return a 200 OK now returns a 500 Internal Server Error, we know about it and address it prior to continuing the deployment.

To summarize your point: Why isn't there an automated test that tests the functionality of this specialpage?

For background, we have "unit" tests (in scare quotes because most aren't true unit tests) and integration tests that run on each patchset set to Gerrit. Some repos even has some (quick) browser tests included in that. Then we have the larger corpus of browser tests running against the Beta Cluster on a daily basis (some take multiple hours to run). You can see those at https://integration.wikimedia.org/ci/view/Selenium/ Different teams use their browser tests differently, as you can see.

Who maintains those E2E/browser tests? The developers/maintainers of the component in question. If they don't write tests for their code (nor manually test that it doesn't break something) there's nothing we can do to catch it other than looking at logs (which we do when deploying to production but Beta has the issue that it does not get organic/real-enough traffic patterns, it's mostly the browser tests and some manual/exploratory testing from the 2 exploratory testers at the foundation, so Beta logs aren't the most indicative of anything useful).

As Chad said in the change that caused the breakage:
"We should be throwing exceptions here so developers find this early when running tests and manual testing."

tl;dr: We need higher test coverage. When fixing this issue the person doing so should probably write a test to catch this (we need to get better at this as Wikimedia developers; writing tests to accompany bug fixes). It could be a unit test or browser test, whichever makes sense for the fix.

In T153578#2895774, @Tgr wrote:

Broken by 70c2223 (looking at logstash, Special:GlobalUserRights seems to be the only thing that's significantly affected). Should be fixed in CentralAuthGroupMembershipProxy::newFromName or CentralAuthUser::getInstanceByName as other callers might use those, and I don't want to touch those files during a deploy freeze. Can put a temporary fix in the special page if you feel this problem is urgent enough.

Doesn't appear to be so; don't risk it. Thanks @Tgr .

greg merged a task: T153961: Special:GlobalUserRights shows an error at meta.wikimedia.org.Dec 22 2016, 6:22 PM

greg added subscribers: Urbanecm, Bawolff.

Krenair merged a task: T153961: Special:GlobalUserRights shows an error at meta.wikimedia.org.Dec 22 2016, 6:23 PM

Trijnstel subscribed.Dec 23 2016, 12:27 AM

Shizhao raised the priority of this task from Medium to High.Dec 23 2016, 2:51 AM

Liuxinyu970226 subscribed.Dec 23 2016, 5:52 AM

@Shizhao As a workaround you can append /<username>: the 'useful' part of the special page is still working.

@Dereckson The problem that the page had is that it affects the tool to find some particular user.

Stang subscribed.Dec 24 2016, 11:27 AM

Change 329207 had a related patch set uploaded (by Gergő Tisza):
Handle invalid names in CentralAuthGroupMembershipProxy::newFromName

https://gerrit.wikimedia.org/r/329207

Shizhao added a project: Chinese-Sites.Dec 27 2016, 3:51 AM

jcrespo merged a task: T154185: Fatal exception of type MWException on all the Special:GlobalUserRights pages.Dec 29 2016, 3:09 PM

jcrespo added subscribers: Meno25, dgstranz, abian.

Change 329207 merged by jenkins-bot:
Handle invalid names in CentralAuthGroupMembershipProxy::newFromName

https://gerrit.wikimedia.org/r/329207

ReleaseTaggerBot added a project: MW-1.29-release (WMF-deploy-2017-01-03_(1.29.0-wmf.7)).Jan 2 2017, 9:00 AM

Change 330345 had a related patch set uploaded (by MaxSem):
Handle invalid names in CentralAuthGroupMembershipProxy::newFromName

https://gerrit.wikimedia.org/r/330345

Change 330345 merged by jenkins-bot:
Handle invalid names in CentralAuthGroupMembershipProxy::newFromName

https://gerrit.wikimedia.org/r/330345

ReleaseTaggerBot added a project: MW-1.29-release (WMF-deploy-2016-12-13_(1.29.0-wmf.6)).Jan 4 2017, 2:00 AM

I deployed Gergő's fix yesterday, this is fixed.

Shizhao added a project: User-notice.Jan 5 2017, 2:07 AM

Removing User-notice - it's for changes that needs to be announced to the community while this one is a simple bug.

Liuxinyu970226 unsubscribed.Jan 5 2017, 12:46 PM

Meno25 unsubscribed.Jan 13 2017, 6:32 AM

MarcoAurelio moved this task from Untriaged to Closed on the Stewards-and-global-tools board.Jan 22 2017, 12:55 PM

Liuxinyu970226 moved this task from Backlog to Closed on the Chinese-Sites board.Feb 18 2017, 3:25 AM

MarcoAurelio moved this task from Backlog to Done on the MediaWiki-extensions-CentralAuth board.Nov 22 2017, 9:46 AM

MarcoAurelio moved this task from Backlog to Done on the MediaWiki-extensions-CentralAuth board.

• mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:11 PM

Stang unsubscribed.Nov 13 2021, 8:48 PM

'MapCacheLRU::has called with invalid key' exception in CentralAuthUser::getInstanceByName at [[Special:GlobalUserRights]]Closed, ResolvedPublicPRODUCTION ERRORActions

Description

Details

Related Objects

Event Timeline

'MapCacheLRU::has called with invalid key' exception in CentralAuthUser::getInstanceByName at [[Special:GlobalUserRights]]
Closed, ResolvedPublicPRODUCTION ERROR
Actions