Page MenuHomePhabricator

beta-update-databases-eqiad failing: Uncaught exception 'MediaWiki\\Services\\NoSuchServiceException' with message 'No such service: MobileFrontend.Config'
Closed, ResolvedPublic

Description

22:20:02 Exception: (
	'command: ',
		'/usr/local/bin/mwscript update.php --wiki=aawiki --quick',
	'output: ',
		"#!/usr/bin/env php\n
		Fatal error: Uncaught exception 'MediaWiki\\Services\\NoSuchServiceException' with message 'No such service: MobileFrontend.Config' in /mnt/srv/mediawiki-staging/php-master/includes/Services/ServiceContainer.php:185\n
		Stack trace:\n
		#0 /mnt/srv/mediawiki-staging/php-master/includes/MediaWikiServices.php(201): MediaWiki\\Services\\ServiceContainer->peekService('MobileFrontend....')\n
		#1 /mnt/srv/mediawiki-staging/php-master/includes/MediaWikiServices.php(185): MediaWiki\\MediaWikiServices->salvage(Object(MediaWiki\\MediaWikiServices))\n
		#2 /mnt/srv/mediawiki-staging/php-master/includes/Setup.php(506): MediaWiki\\MediaWikiServices::resetGlobalInstance(Object(GlobalVarConfig), 'quick')\n
		#3 /mnt/srv/mediawiki-staging/php-master/maintenance/doMaintenance.php(97): require_once('/mnt/srv/mediaw...')\n
		#4 /mnt/srv/mediawiki-staging/php-master/maintenance/update.php(216): require_once('/mnt/srv/mediaw...')\n
		#5 /mnt/srv/mediawiki-staging/multiversion/MWScript.php(97): require_once('/mnt/srv/mediaw...')\n
		#6 {main}\n  thrown in /mnt/srv/mediawiki-staging/php-master/includes/Services/ServiceContainer.php on line 185\n"
)

https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/10876/console for example.

Event Timeline

Legoktm created this task.Aug 26 2016, 5:34 AM
Restricted Application added subscribers: Luke081515, TerraCodes, Aklapper. · View Herald TranscriptAug 26 2016, 5:34 AM

For some context: this is currently preventing the beta-update-databases-eqiad job from running for beta: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/

Florian claimed this task.Aug 28 2016, 2:10 PM
Florian added a subscriber: Florian.

Ok, I'll take this task. I'll try to fix the problem after I figured out, what the problem is. If this doesn't seem to be possible in a reasonable time, I'll revert the change, which caused this bug.

Florian updated the task description. (Show Details)Aug 28 2016, 2:12 PM
Florian updated the task description. (Show Details)
Florian added a comment.EditedAug 28 2016, 2:35 PM

That doesn't seem to be right @Paladox: Even if I'm pretty new in the Services stuff, I can say you, that the MobileFrontend.Config isn't a php class, it's a service, which instantiates a new object of the GlobalVarConfig class (which implements the Config interface). And as long as I don't miss something, the service name does not need to exist as a php class.

However, for now, I'm still not able to reproduce this problem locally. Any advise about where the problem could be would be helpful. My current idea is, that this could have to do something with ObjectCaching, but that's just a guess.

@Paladox: Like I said before. As far as I know, Services doesn't need to be named as the php classes, see, e.g. MediaInfoFilePageLookup in your example, which returns an object of the FilePageLookup class.

I guess we should revert then re do the patch.

The exception is thrown in the initialisation process of MediaWiki (when the global MediaWikiServices object is reset to a fresh one), not when calling getMFConfig() of the MobileContext.

Is there a possibility that because it has a . in MobileFrontend.Config that it may not work?

Could we see if changing to MobileFrontendConfig works please?

Is there a possibility that because it has a . in MobileFrontend.Config that it may not work?
Could we see if changing to MobileFrontendConfig works please?

This sounds like a blind shoot, and because I'm still not able to reproduce the error locally it would require to merge a change to master. Because I don't see anything in the implementation that backs your suggestion, I think this isn't a very good idea. If you really think this can fix the problem, could you provide some more information why you think that?

Paladox added a comment.EditedAug 28 2016, 3:19 PM

@Florian because looking at https://github.com/wikimedia/mediawiki/search?utf8=%E2%9C%93&q=_MediaWikiTitleCodec&type=Code it seems to work. Ie it dosent include the . in the name. Also I have never seen a class name with . in it because that would break it.

This is all guesses.

I doint even know why we needed a . in there. Or we can just replace it with _.

But it may or may not fix it. But we wont know unless we try.

Change 307126 had a related patch set uploaded (by Paladox):
Replace MobileFrontend.Config with MobileFrontend_Config

https://gerrit.wikimedia.org/r/307126

Change 307126 abandoned by Paladox:
Replace MobileFrontend.Config with MobileFrontend_Config

Reason:
Nope dosent fix things

https://gerrit.wikimedia.org/r/307126

Florian added a comment.EditedAug 28 2016, 4:28 PM

Ok, so far, so good :P Thanks to the LocalSettings.php from @Paladox I now was able to reproduce the problem locally, however in a slightly different way and the error now occurs on page loads, too. What I did:
I simply set my $wgMainCacheType from the production-near memcached to CACHE_ANYTHING (My assumption is, that my memcached setup isn't as production/beta-labs near as I thought ;)). Now, after loading a page I'll get the same error as described in this task. However, my entry point is WebStart.php and not doMaintenance.php, but ok.

So, what I found out so far is:
ExtensionRegistry calls wfMemcKey on L132 to check if the load queue is already in any cache, the resulting stack trace is:

Stack trace:\n
#0 /var/www/html/mediawiki/w/includes/objectcache/ObjectCache.php(231): MediaWiki\\MediaWikiServices::getInstance()\n
#1 /var/www/html/mediawiki/w/includes/objectcache/ObjectCache.php(175): ObjectCache::newAnything()\n
#2 /var/www/html/mediawiki/w/includes/objectcache/ObjectCache.php(131): ObjectCache::newFromParams()\n
#3 /var/www/html/mediawiki/w/includes/objectcache/ObjectCache.php(95): ObjectCache::newFromId()\n
#4 /var/www/html/mediawiki/w/includes/objectcache/ObjectCache.php(337): ObjectCache::getInstance()\n
#5 /var/www/html/mediawiki/w/includes/GlobalFunctions.php(3051): ObjectCache::getLocalClusterInstance()\n
#6 /var/www/html/mediawiki/w/includes/registration/ExtensionRegistry.php(132): wfMemcKey()\n
#7 /var/www/html/mediawiki/w/includes/Setup.php(40): ExtensionRegistry->loadFromQueue()\n
#8 /var/www/html/mediawiki/w/includes/WebStart.php(137): include()\n
#9 /var/www/html/mediawiki/w/index.php(40): include()\n
#10 {main}

This will create a new MediaWikiServices instance (ObjectCache uses MediaWikiServices). Now, ExtensionRegistry::loadFromQueue is called from Setup.php _before_ the MediaWikiServices instance is reset. This results in the following problem:
In the getInstance call to MediaWikiServices, where the "old"/"first" instance is created, no extension is loaded, yet (as this is the process where we check if we've to load all JSON files from fs or if we've the definition cached already). This means, no extra wiring file is loaded, as no MediaWikiServices could be registered so far. This works fine, anythign goes like expected, ExtensionRegistry loads the definition from cache or from fs (doesn't really matter) and sets all extension up as expected (at this point, the MediaWikiServices hook has at least one entry (MobileFrontend::onMediaWikiServices).

Now the process in Setup.php reaches line 506:

MediaWikiServices::resetGlobalInstance( new GlobalVarConfig(), 'quick' );

resetGlobalInstance will now save the "old" instance (created _before_ extensions were loaded, as explaine dabove) and creates a new instance with newInstance. This time, however, all extensions are already loaded, which means, that all extension services are registered in the new instance, too (e.g. MobileFrontend.Config). However, now the quick loading logic tries to get all services, which are registered in the new object, from the old object (from the first instantiation before extensions are loaded). This, of course, fails, because at least the services from extensions (like MobileFrontend.Config) does not exist in the old instance (as the extension was not loaded at this time).

So, my quick fix would be a patch to core, which first checks, if the service of the new instance is even registered in the old instance or not. However, this needs a review from someone who better understands this logic at all to make sure, this is the desried behaviour.

Btw.: Adding Architecture, as the "salvaged-logic" was implemented in T132707: When resetting the global service locator, allow some resources to be "salvaged".. So this could be a possible cause, too.

Change 307133 had a related patch set uploaded (by Florianschmidtwelzow):
MediaWikiServices: Don't assume, that old and new instances contains the same services

https://gerrit.wikimedia.org/r/307133

This comment was removed by Paladox.

I added the T142117: MW-1.28.0-wmf.17 deployment blockers as a parent task, so we either merge the above patch or revert the MobileFrontend change. Even if it seems there's no user-visible problem with that, this is (at least for me) too risky. That said, before Tuesday, 30th Aug, the problem will be fixed.

Hold on a second. Why is MobileFrontend loading its ServiceWiring file via a hook instead of $wgServiceWiringFiles?

Also, I don't think it is acceptable to leave beta broken for this long (it's nearly 4 days), so I'm going to revert.

And, thank you for looking into this Florian, it's very much appreciated.

Change 307224 had a related patch set uploaded (by Legoktm):
Revert "Introduce MediaWikiServices"

https://gerrit.wikimedia.org/r/307224

@Legoktm: I'm not pretty sure. Both ways are possible (and the comment for the MediaWikiServices hook is pretty clear, why it exists):
// Provide a traditional hook point to allow extensions to configure services.

Ok, we probably should switch to the config variable. However, if my analyses is correct (:P), this wouldn't prevent this issue, as the service wiring files (added via the config variable) aren't loaded before the first MediaWikiServices instance is created, too (as the extension isn't processed at this moment, so no config options, no hooks and nothing else, is loaded, yet). :)

@Legoktm hi, would you be able to review @Florian patch https://gerrit.wikimedia.org/r/307133 please?

Yeah, I just finished reading your analysis, and think your patch is the correct way to go.

@Florian thank you a ton for the investigation and @Legoktm for the revert! I guess we will want to loop back with @phuedx to polish it up.

Change 307224 merged by jenkins-bot:
Revert "Introduce MediaWikiServices"

https://gerrit.wikimedia.org/r/307224

Paladox added a comment.EditedAug 28 2016, 9:32 PM

Yeah, I just finished reading your analysis, and think your patch is the correct way to go.

@Legoktm Thanks :), could you +2 please?

RobLa-WMF added a project: Architecture.
RobLa-WMF added a subscriber: RobLa-WMF.

I'll put this on the agenda for the next private ArchCom planning meeting (E265)

hashar lowered the priority of this task from Unbreak Now! to High.Aug 29 2016, 8:10 AM

Lowering priority slightly since the revert has fixed update.php run on the beta cluster.

Thanks to @Florian for the excellent detective work and write up, and thanks to @Legoktm for the revert.


Hold on a second. Why is MobileFrontend loading its ServiceWiring file via a hook instead of $wgServiceWiringFiles?

I chose to use the hook as: it's clearer, IMO; it's more easily tested in isolation – shame on me for not including a unit test in 0805aa960; and its inline documentation suggested that this was the method for extensions to register services/service wiring files.


Also, I don't think it is acceptable to leave beta broken for this long (it's nearly 4 days), so I'm going to revert.

@dr0ptp4kt has previously brought up this scenario as a reason for the Reading Web team to meet on Fridays, which they tend not to. By "this scenario", I mean a critical bug that wasn't detected during code review/QA being detected on the last Friday of the sprint. I'll be bringing this up during the Reading Web team's next sprint retrospective, which falls on Tuesday, 30th August /cc @MBinder_WMF

The conversation may not have gone any differently but many hands make light work and @Florian may not have had to sacrifice some of his weekend.

Change 307133 merged by jenkins-bot:
MediaWikiServices: Don't assume, that old and new instances contains the same services

https://gerrit.wikimedia.org/r/307133

The core change was merged, I'll let this task open as long as the MF change isn't re-applied (so we can check, that the problem is really fixed :)).

Change 307716 had a related patch set uploaded (by Paladox):
Revert "Re-Apply Introduce MediaWikiServices"

https://gerrit.wikimedia.org/r/307716

Change 307716 abandoned by Paladox:
Revert "Re-Apply Introduce MediaWikiServices"

https://gerrit.wikimedia.org/r/307716

Change 307716 restored by Paladox:
Revert "Re-Apply Introduce MediaWikiServices"

https://gerrit.wikimedia.org/r/307716

Change 307716 merged by jenkins-bot:
Revert "Re-Apply Introduce MediaWikiServices"

https://gerrit.wikimedia.org/r/307716

Change 307720 had a related patch set uploaded (by Paladox):
Add missing use statement to MediaWikiServices.php

https://gerrit.wikimedia.org/r/307720

Change 307721 had a related patch set uploaded (by Phuedx):
MediaWikiServices: Import NoSuchServiceException

https://gerrit.wikimedia.org/r/307721

Change 307720 abandoned by Paladox:
Add missing use statement to MediaWikiServices.php

Reason:
We can go with https://gerrit.wikimedia.org/r/#/c/307721/

https://gerrit.wikimedia.org/r/307720

Change 307726 had a related patch set uploaded (by Paladox):
Revert "Revert "Re-Apply Introduce MediaWikiServices""

https://gerrit.wikimedia.org/r/307726

Change 307721 merged by jenkins-bot:
MediaWikiServices: Import NoSuchServiceException

https://gerrit.wikimedia.org/r/307721

Change 307726 merged by jenkins-bot:
Revert "Revert "Re-Apply Introduce MediaWikiServices""

https://gerrit.wikimedia.org/r/307726

greg added a subscriber: greg.Aug 31 2016, 4:42 PM

@Florian thank you a ton for the investigation and @Legoktm for the revert! I guess we will want to loop back with @phuedx to polish it up.

Next steps here? I just noticed it being moved into the "tracking" column for reading web, but I *believe* that is who has the next step/action (the issue is resolved on beta cluster's side).

So, for me: I now wait for the neyt beta-update job execution to verify, that anything is working fine now. After that, I'll plan to close this task :)

@greg: The change to MobileFrontend has been merged – again, again, and again! – and we'll see if the @Florian's fix did the job.

What would it take to catch issues like this in CI?

Florian closed this task as Resolved.Aug 31 2016, 10:04 PM

So, we had some runs since the MF change was merged and all are green, which is great! :D So I'm closing this task now as resolved.

@GWicke: The problem is, that it's highly cache-related and related to the setup order of MediaWiki, so I'm not sure, if we can always, with any possible configuration combination, catch these kind of errors. To be honest: It took me a lot of time to think about this kind of problem and tracking it down to the problem (even if I think that I could've done the job faster if I really thought about the possible problems). However, to bring my thoughts to one point: I don't think, that we can catch all these erros in the CI directly (even if we could do it in this specific case, by simply using the production near caching options, or an object cache at all). And I would count beta-labs as some kind of our QA infrastructure, too, and even if it does not count as "CI" directly, I would say, that it helped us to catch this error before it could break something :]