Page MenuHomePhabricator

'MapCacheLRU::has called with invalid key' exception in CentralAuthUser::getInstanceByName at [[Special:GlobalUserRights]]
Closed, ResolvedPublicPRODUCTION ERROR

Description

When [[Special:GlobalUserRights]] is called without user parameter, the special page code always tries to initialize an user calling SpecialGlobalGroupMembership::fetchUser.

As we don't have an user that throws the following exception:

MapCacheLRU::has called with invalid key. Must be string or integer.

The expected behavior is a form to be able to offer an username.

This can be checked at https://meta.wikimedia.org/wiki/Special:GlobalUserRights.

Event Timeline

2016-12-18 07:03:22 [WFY0ugpAAEQAAEjNyusAAAAB] mw1273 metawiki 1.29.0-wmf.6 exception ERROR: [WFY0ugpAAEQAAEjNyusAAAAB] /wiki/Special:GlobalUserRights   MWException from line 79 of /srv/mediawiki/php-1.29.0-wmf.6/includes/libs/MapCacheLRU.php: MapCacheLRU::has called with invalid key. Must be string or integer. {"exception_id":"WFY0ugpAAEQAAEjNyusAAAAB"} 
[Exception MWException] (/srv/mediawiki/php-1.29.0-wmf.6/includes/libs/MapCacheLRU.php:79) MapCacheLRU::has called with invalid key. Must be string or integer.
  #0 /srv/mediawiki/php-1.29.0-wmf.6/includes/libs/MapCacheLRU.php(93): MapCacheLRU->has(boolean)
  #1 /srv/mediawiki/php-1.29.0-wmf.6/extensions/CentralAuth/includes/CentralAuthUser.php(125): MapCacheLRU->get(boolean)
  #2 /srv/mediawiki/php-1.29.0-wmf.6/extensions/CentralAuth/includes/CentralAuthGroupMembershipProxy.php(40): CentralAuthUser::getInstanceByName(boolean)
  #3 /srv/mediawiki/php-1.29.0-wmf.6/extensions/CentralAuth/includes/specials/SpecialGlobalGroupMembership.php(90): CentralAuthGroupMembershipProxy::newFromName(boolean)
  #4 /srv/mediawiki/php-1.29.0-wmf.6/includes/specials/SpecialUserrights.php(95): SpecialGlobalGroupMembership->fetchUser(NULL, boolean)
  #5 /srv/mediawiki/php-1.29.0-wmf.6/includes/specialpage/SpecialPage.php(522): UserrightsPage->execute(NULL)
  #6 /srv/mediawiki/php-1.29.0-wmf.6/includes/specialpage/SpecialPageFactory.php(577): SpecialPage->run(NULL)
  #7 /srv/mediawiki/php-1.29.0-wmf.6/includes/MediaWiki.php(283): SpecialPageFactory::executePath(Title, RequestContext)
  #8 /srv/mediawiki/php-1.29.0-wmf.6/includes/MediaWiki.php(851): MediaWiki->performRequest()
  #9 /srv/mediawiki/php-1.29.0-wmf.6/includes/MediaWiki.php(512): MediaWiki->main()
  #10 /srv/mediawiki/php-1.29.0-wmf.6/index.php(43): MediaWiki->run()
  #11 /srv/mediawiki/w/index.php(3): include(string)
  #12 {main}
MarcoAurelio triaged this task as Unbreak Now! priority.Dec 18 2016, 10:23 AM
MarcoAurelio subscribed.

Confirm as user with access to the page:

[WFZjMApAAEAAAZ6Sbp0AAABD] 2016-12-18 10:21:36: Fatal exception of type MWException

UBN since this has been working previously and now blocks management of wiki sites.

CentralAuthGroupMembershipProxy

	/**
	 * @param $name
	 * @return CentralAuthGroupMembershipProxy|null
	 */
	public static function newFromName( $name ) {
		$name = User::getCanonicalName( $name );
		$globalUser = CentralAuthUser::getInstanceByName( $name );
		return $globalUser->exists() ? new CentralAuthGroupMembershipProxy( $globalUser ) : null;
	}

getCanonicalName can return bool... Which then passes it unchecked to CentralAuthUser::getInstanceByName() which means looking up a bool

#4 /srv/mediawiki/php-1.29.0-wmf.6/includes/specials/SpecialUserrights.php(95): SpecialGlobalGroupMembership->fetchUser(NULL, boolean)

looks spurious too... It only takes 1 parameter

MarcoAurelio lowered the priority of this task from Unbreak Now! to Needs Triage.Dec 18 2016, 10:27 AM

Yes, with username it doesn't fatal, and it does not fatal either when adding/removing rights. Back to NT.

Dereckson renamed this task from Fatal exception at Special:GlobalUserRights on Meta-Wiki to 'MapCacheLRU::has called with invalid key' exception in CentralAuthUser::getInstanceByName at [[Special:GlobalUserRights]].Dec 18 2016, 5:08 PM
Dereckson triaged this task as Medium priority.
Dereckson updated the task description. (Show Details)
Dereckson added a subscriber: Linedwell.
Dereckson subscribed.

So this issue was caught on Beta Labs (T153071) and someone still went ahead and deployed this to production? I'm very confused.

I am the one that filled the beta cluster task on T153071 . IIRC I noticed that while I was doing something else, got interrupted to add a user right for a user, I filled the task and forgot about it. I guess I should have made it a blocker to the deployment :-/

Not all issues found on the Beta Cluster are production deployment blockers. If it is, it needs to be marked as such by someone. The original task that @hashar opened had Beta-Cluster-reproducible. This task does not. There is no way for the release engineering team to discover this task nor know of it's blocking worthiness with any certainty without us being informed.

Also, this is marked as a normal priority task (it was UBN! for 4 minutes, but that was undone) and as such it definitely wouldn't have garnered much attention.

Not all issues found on the Beta Cluster are production deployment blockers. If it is, it needs to be marked as such by someone. The original task that @hashar opened had Beta-Cluster-reproducible. This task does not. There is no way for the release engineering team to discover this task nor know of it's blocking worthiness with any certainty without us being informed.

Also, this is marked as a normal priority task (it was UBN! for 4 minutes, but that was undone) and as such it definitely wouldn't have garnered much attention.

I'm wondering what the value of the beta wikis is if really obvious bugs like this aren't getting caught. When you go to https://meta.wikimedia.org/wiki/Special:GlobalUserRights, it throws HTTP/1.1 500 Internal Server Error. Is there no test or check in place when we deploy to make sure that URLs like this are not erroring?

Is this page going to stay broken until the new year?

I'm wondering what the value of the beta wikis is if really obvious bugs like this aren't getting caught. When you go to https://meta.wikimedia.org/wiki/Special:GlobalUserRights, it throws HTTP/1.1 500 Internal Server Error. Is there no test or check in place when we deploy to make sure that URLs like this are not erroring?

Welcome to the world of testing :) If it's not tested (and reported) it's not found. *Why* this page itself isn't tested by automated means is a question for the maintainers of the code (more on that below).

Is this page going to stay broken until the new year?

This is a question for the team maintaining the code in question. Well, two questions. 1) how bad is it and does it deserve a revert to a known good state and 2) if not, how long until a fix is in place accounting for developer time. Deployment time is a rounding error in that and we can push out emergency/high need fixes at any point (the freeze is only in effect for the normal train and SWATs).

This was reduced from UBN! to Normal after 4 minutes, suggesting to me (as a non-user of the specialpage in question) that it isn't an emergency and can wait. I could be wrong and always welcome more information/guidance.

According to https://www.mediawiki.org/wiki/Developers/Maintainers it looks like @Legoktm is the best bet here.

@Legoktm do you have an opinion here?

From the stack trace this looks like the special page code needs to be patched. That should not be a big deal, right? Worst case, the page gets even more broken (in which case the patch can be reverted). There is no risk to fundraising or general site stability.

Change 328617 had a related patch set uploaded (by Gergő Tisza):
Make mediawiki/core cloning shallow by default

https://gerrit.wikimedia.org/r/328617

Broken by 70c2223 (looking at logstash, Special:GlobalUserRights seems to be the only thing that's significantly affected). Should be fixed in CentralAuthGroupMembershipProxy::newFromName or CentralAuthUser::getInstanceByName as other callers might use those, and I don't want to touch those files during a deploy freeze. Can put a temporary fix in the special page if you feel this problem is urgent enough.

I'm wondering what the value of the beta wikis is if really obvious bugs like this aren't getting caught. When you go to https://meta.wikimedia.org/wiki/Special:GlobalUserRights, it throws HTTP/1.1 500 Internal Server Error. Is there no test or check in place when we deploy to make sure that URLs like this are not erroring?

Welcome to the world of testing :)

Welcome? Are you all just getting to the party now? I can provide some tips, such as setting up a script that checks common HTTP and HTTPS end-points and ensures that they return appropriate HTTP status codes, preferably before breaking this page and causing four tasks to be filed (T153071, T153920, T153591, and T153578).

I can't speak for the others, but when filing this task, I was thinking "surely this can't already have been reported, I need to tell someone since this is obviously broken and needs to be fixed." That's a decent amount of volunteer time wasted on this trivial bug, especially if you consider you're only seeing the reported cases in Maniphest.

*Why* this page itself isn't tested by automated means is a question for the maintainers of the code (more on that below).

Maintainers? I think anyone can make HTTP and HTTPS requests and check status codes.

Deployment time is a rounding error in that and we can push out emergency/high need fixes at any point (the freeze is only in effect for the normal train and SWATs).

Cool, thank you, this is good to know. The wikitech-l e-mails made it sound like all deployments were frozen until 2017.

I don't think this task is a particularly high priority, but I do think it exposes pretty large holes in our testing infrastructure, or lack thereof. Setting up an entire beta cluster while simultaneously not doing basic checks like "does this page that used to load still load after the new code has been applied?" is kind of insane.

Not all issues found on the Beta Cluster are production deployment blockers. If it is, it needs to be marked as such by someone.

I don't like relying on humans. It creates issues like the one we're seeing in this task, where hashar dutifully filed a task, but it was not properly tagged/categorized/flagged. We should look at closing this gap with automation, so that if an HTTPS end-point that used to return a 200 OK now returns a 500 Internal Server Error, we know about it and address it prior to continuing the deployment.

*Why* this page itself isn't tested by automated means is a question for the maintainers of the code (more on that below).

Maintainers? I think anyone can make HTTP and HTTPS requests and check status codes.

[snip]

Setting up an entire beta cluster while simultaneously not doing basic checks like "does this page that used to load still load after the new code has been applied?" is kind of insane.

I don't like relying on humans. It creates issues like the one we're seeing in this task, where hashar dutifully filed a task, but it was not properly tagged/categorized/flagged. We should look at closing this gap with automation, so that if an HTTPS end-point that used to return a 200 OK now returns a 500 Internal Server Error, we know about it and address it prior to continuing the deployment.

To summarize your point: Why isn't there an automated test that tests the functionality of this specialpage?

For background, we have "unit" tests (in scare quotes because most aren't true unit tests) and integration tests that run on each patchset set to Gerrit. Some repos even has some (quick) browser tests included in that. Then we have the larger corpus of browser tests running against the Beta Cluster on a daily basis (some take multiple hours to run). You can see those at https://integration.wikimedia.org/ci/view/Selenium/ Different teams use their browser tests differently, as you can see.

Who maintains those E2E/browser tests? The developers/maintainers of the component in question. If they don't write tests for their code (nor manually test that it doesn't break something) there's nothing we can do to catch it other than looking at logs (which we do when deploying to production but Beta has the issue that it does not get organic/real-enough traffic patterns, it's mostly the browser tests and some manual/exploratory testing from the 2 exploratory testers at the foundation, so Beta logs aren't the most indicative of anything useful).

As Chad said in the change that caused the breakage:
"We should be throwing exceptions here so developers find this early when running tests and manual testing."

tl;dr: We need higher test coverage. When fixing this issue the person doing so should probably write a test to catch this (we need to get better at this as Wikimedia developers; writing tests to accompany bug fixes). It could be a unit test or browser test, whichever makes sense for the fix.

Broken by 70c2223 (looking at logstash, Special:GlobalUserRights seems to be the only thing that's significantly affected). Should be fixed in CentralAuthGroupMembershipProxy::newFromName or CentralAuthUser::getInstanceByName as other callers might use those, and I don't want to touch those files during a deploy freeze. Can put a temporary fix in the special page if you feel this problem is urgent enough.

Doesn't appear to be so; don't risk it. Thanks @Tgr .

Shizhao raised the priority of this task from Medium to High.Dec 23 2016, 2:51 AM

@Shizhao As a workaround you can append /<username>: the 'useful' part of the special page is still working.

@Dereckson The problem that the page had is that it affects the tool to find some particular user.

Change 329207 had a related patch set uploaded (by Gergő Tisza):
Handle invalid names in CentralAuthGroupMembershipProxy::newFromName

https://gerrit.wikimedia.org/r/329207

Change 329207 merged by jenkins-bot:
Handle invalid names in CentralAuthGroupMembershipProxy::newFromName

https://gerrit.wikimedia.org/r/329207

Change 330345 had a related patch set uploaded (by MaxSem):
Handle invalid names in CentralAuthGroupMembershipProxy::newFromName

https://gerrit.wikimedia.org/r/330345

Change 330345 merged by jenkins-bot:
Handle invalid names in CentralAuthGroupMembershipProxy::newFromName

https://gerrit.wikimedia.org/r/330345

MaxSem claimed this task.
MaxSem subscribed.

I deployed Gergő's fix yesterday, this is fixed.

Removing User-notice - it's for changes that needs to be announced to the community while this one is a simple bug.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:11 PM