
Move Wikitech onto the production MW cluster
Open, High, Public

Description

Right now Wikitech runs on wmcs-managed systems in isolation from the main MediaWiki hosting cluster. This "snowflake" deployment leads to both confusion for other MediaWiki support teams and reduced functionality for Wikitech as a wiki.

Now that Wikitech doesn't involve any OpenStack API integration, it can and should move to the standard production wiki cluster.

Note that this does not mean that Wikitech will become an SUL wiki; that will happen later. Wikitech will still create, manage, and consume LDAP credentials.

Requirements:


Event Timeline

bd808 triaged this task as High priority. Feb 28 2020, 12:46 AM
bd808 updated the task description.

I have a very specific concern with this, and it is about isolation.

When MW is down, people still want to reach wikitech. Sure, we have wikitech-static, but for example *search won't work* there if wikitech is down.

So having a separation between the main cluster and wikitech is a welcome fact. What I think we could do is to create a couple of machines managed by the same puppet code as appservers, pointing to different resources though (so, mcrouter points to two memcached servers, etc).

Given the importance wikitech has for troubleshooting documentation, I would like to keep it separated from the main infrastructure as much as possible.

> Sure, we have wikitech-static, but for example *search won't work* there if wikitech is down.

If so, that is a bug. The search on wikitech static was broken for a bit (T243730), but that was a simple MySQL table corruption. I don't know of any quantum entanglement between wikitech and wikitech-static outside of the daily data export/import jobs.

> Given the importance wikitech has for troubleshooting documentation, I would like to keep it separated from the main infrastructure as much as possible.

The concern above about search is true here though; wikitech uses the main cirrussearch cluster. It is also behind the shared CDN layer. It also uses a database on the m5 cluster. It also lives in the eqiad DC. There is some separation, but that separation is actually a burden and not a benefit.

Features keep disappearing from Wikitech as the main cluster and the Puppet manifests around it are refactored and updated. The fact that we can't even have Visual Editor now feels like a big last straw for me. We can't have VE; we can't have beta features; we can't have Flow; we can't expose the metadata of the wiki on the Wiki Replicas; the bounce handler is broken; etc. The bugs I file about such things are given the broad response of "oops, sorry that only works if you are hooked up to <<infrastructure X>>". So, I believe we have to choose one of:

Of all of these options, moving wikitech seems the least disruptive to me. It also gets us one step closer to my ideal world of T161859: Make Wikitech an SUL wiki which very certainly can't happen until wikitech is running from an isolation zone that can access s7.

Related idea that @Reedy brought up in IRC chat: what if we skip the legacy cluster and make wikitech the first MW-on-K8s wiki? That could allow isolation of the php-ldap bits from T237889: Install php-ldap on all MW appservers and would give a low access rate but actively used wiki for indefinite testing of how to connect all the other things up to the Kubernetes cluster deployment. It doesn't even need multi-version and the related baggage there. If the deployed version on wikitech lagged the train by minutes, hours, days, or even weeks that would be ok too.

re: isolation -- I'd like us to continue to regard wikitech-static as the backstop for technical docs. If we have concerns about the availability/reliability of wikitech-static then we should list and address those issues.

re: wikitech-on-k8s -- if someone wants to take this on I have no objection, but my understanding is that making wikitech a 'normal' wiki is only a small amount of work, and I'd hate to see an ambitious unrelated task stand in the way of that. We could certainly still move wikitech to k8s after the fact if that seems useful.

I just checked the search on wikitech-static and it returned the following error:

[b17175d42b31c3fb3373c2cc] /w/index.php?search=data+persistence&title=Special%3ASearch&go=Go Error: Call to a member function caseFold() on null

Backtrace:

from /srv/mediawiki/w/extensions/TitleKey/TitleKey_body.php(60)
#0 /srv/mediawiki/w/extensions/TitleKey/TitleKey_body.php(232): TitleKey::normalize(string)
#1 /srv/mediawiki/w/extensions/TitleKey/TitleKey_body.php(228): TitleKey::exactMatch(integer, string)
#2 /srv/mediawiki/w/extensions/TitleKey/TitleKey_body.php(215): TitleKey::exactMatchTitle(Title)
#3 /srv/mediawiki/w/includes/HookContainer/HookContainer.php(330): TitleKey::searchGetNearMatch(string, NULL)
#4 /srv/mediawiki/w/includes/HookContainer/HookContainer.php(137): MediaWiki\HookContainer\HookContainer->callLegacyHook(string, array, array, array)
#5 /srv/mediawiki/w/includes/HookContainer/HookRunner.php(3329): MediaWiki\HookContainer\HookContainer->run(string, array)
#6 /srv/mediawiki/w/includes/search/SearchNearMatcher.php(162): MediaWiki\HookContainer\HookRunner->onSearchGetNearMatch(string, NULL)
#7 /srv/mediawiki/w/includes/search/SearchNearMatcher.php(64): SearchNearMatcher->getNearMatchInternal(string)
#8 /srv/mediawiki/w/includes/specials/SpecialSearch.php(341): SearchNearMatcher->getNearMatch(string)
#9 /srv/mediawiki/w/includes/specials/SpecialSearch.php(200): SpecialSearch->goResult(string)
#10 /srv/mediawiki/w/includes/specialpage/SpecialPage.php(646): SpecialSearch->execute(NULL)
#11 /srv/mediawiki/w/includes/specialpage/SpecialPageFactory.php(1382): SpecialPage->run(NULL)
#12 /srv/mediawiki/w/includes/MediaWiki.php(309): MediaWiki\SpecialPage\SpecialPageFactory->executePath(Title, RequestContext)
#13 /srv/mediawiki/w/includes/MediaWiki.php(913): MediaWiki->performRequest()
#14 /srv/mediawiki/w/includes/MediaWiki.php(546): MediaWiki->main()
#15 /srv/mediawiki/w/index.php(53): MediaWiki->run()
#16 /srv/mediawiki/w/index.php(46): wfIndexMain()
#17 {main}

I also get randomly redirected to regular Wikitech when trying to reach Wikitech-static. Couldn't figure out the exact conditions to trigger one or the other.

static needs updating to somewhere more near HEAD of REL1_36...

> static needs updating to somewhere more near HEAD of REL1_36...

  • done, although I had to hack around some dependency issues because composer.json asked for some impossible combinations.

Search is still failing. I also see the issue with random redirects but haven't found a fix.

@Reedy, did I ever succeed in getting you the root password for wikitech-static?

> @Reedy, did I ever succeed in getting you the root password for wikitech-static?

You did! I just couldn't remember where I'd saved it. I found it again!

https://wikitech.wikimedia.org/w/index.php?search=data+persistence&title=Special%3ASearch&go=Go&ns0=1&ns12=1&ns116=1&ns498=1 now works, I think? The extensions were a bit out of skew with MW, so I've fixed that, and that looks to have helped.

I've tidied up the composer stuff too.

T257643: https://wikitech-static.wikimedia.org/wiki/ redirecting improperly was filed for the redirects before. But no real luck on hammering it down.

> T257643: https://wikitech-static.wikimedia.org/wiki/ redirecting improperly was filed for the redirects before. But no real luck on hammering it down.

And is probably fixed now

With search and redirects fixed, looks like we should be in a good state to get back to the discussion about making Wikitech a regular wiki.

As a side note, we could use a process to check the state of Wikitech and other rarely used tools that are crucial in incident response. I captured this idea in T290130.
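A minimal sketch of what such a check could look like, covering the two failures seen in this thread (search returning a PHP error, and requests being redirected to the wrong host). This is only an illustration, not what T290130 specifies: the function names, the error markers, and the choice of Python are all assumptions.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from urllib.parse import urlsplit

# Hypothetical markers for a rendered PHP fatal, modelled on the backtrace
# pasted earlier in this task; tune them for the real pages.
ERROR_MARKERS = ("Backtrace:", "Call to a member function")


@dataclass
class CheckResult:
    ok: bool
    reason: str


def evaluate(status: int, final_url: str, body: str, expected_host: str) -> CheckResult:
    """Classify one fetched page. Pure function, so it can be tested offline."""
    host = urlsplit(final_url).hostname or ""
    if host != expected_host:
        # Covers the "randomly redirected to regular Wikitech" symptom.
        return CheckResult(False, f"redirected to {host or 'unknown host'}")
    if status != 200:
        return CheckResult(False, f"HTTP {status}")
    if any(marker in body for marker in ERROR_MARKERS):
        return CheckResult(False, "page rendered an error")
    return CheckResult(True, "ok")


def check_url(url: str, expected_host: str, timeout: float = 10.0) -> CheckResult:
    """Fetch url (following redirects) and classify the result."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate(resp.status, resp.geturl(), body, expected_host)
    except urllib.error.HTTPError as exc:
        # The status is reported against the requested URL in this sketch.
        return evaluate(exc.code, url, "", expected_host)
    except OSError as exc:
        # DNS failures, timeouts, TLS errors, and similar.
        return CheckResult(False, f"request failed: {exc}")
```

For example, `check_url("https://wikitech-static.wikimedia.org/w/index.php?search=data+persistence&title=Special%3ASearch&go=Go", "wikitech-static.wikimedia.org")` would exercise the search query from this thread. How the result gets scheduled and alerted on is left open here.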

> Related idea that @Reedy brought up in IRC chat: what if we skip the legacy cluster and make wikitech the first MW-on-K8s wiki? That could allow isolation of the php-ldap bits from T237889: Install php-ldap on all MW appservers and would give a low access rate but actively used wiki for indefinite testing of how to connect all the other things up to the Kubernetes cluster deployment. It doesn't even need multi-version and the related baggage there. If the deployed version on wikitech lagged the train by minutes, hours, days, or even weeks that would be ok too.

This is a good idea indeed, something I wanted to propose myself. Not sure how much of a disruption it would be to build another variant of the mediawiki image, but maybe we can include php-ldap in the debug image that also includes the profiler, and just disable the profiler on the wikitech installation.

Once the issues we have with mediawiki on k8s are all ironed out, I'll get back to this.

ebernhardson@mwdebug1002:~/mw-phpdbg$ mwscript shell.php --wiki=labswiki
>>> wfMessage( 'ok' )->text()
LogicException with message 'Process cache for 'en' should be set by now.'

> […] the labswiki-specific memcached not being available from mwmaint*

I ran into this again when running a maintenance script. I've documented it on Maintenance server so that, while people will inevitably try this again, it should reduce the time spent debugging.

T328768: Wikitech issues for datacentre switchover (March 2023) highlights yet again the extra costs of keeping Wikitech a snowflake deployment without the benefit of shared infrastructure used by every other wiki we host.

Can we give this some more priority?

This is ruining my day today because I have a fix for an important Horizon bug which I can't roll out because it requires Python >3.7, which means I'm having to totally rearrange my Horizon deploy rather than just upgrading the cloudweb hosts to a modern OS.

JFTR, the currently planned OKR for the buildout for https://idm.wikimedia.org is to "eliminate the need for the OpenStack Manager extension on WikiTech" (i.e., account creation and changing email addresses/core attributes/SSH keys will be handled within the IDM). This should also vastly simplify this migration since we no longer need to care about special cases like php-ldap.

> JFTR, the currently planned OKR for the buildout for https://idm.wikimedia.org is to "eliminate the need for the OpenStack Manager extension on WikiTech" (i.e., account creation and changing email addresses/core attributes/SSH keys will be handled within the IDM). This should also vastly simplify this migration since we no longer need to care about special cases like php-ldap.

I'm happy to hear that this work is being done, but not happy to hear that a new blocker has been inserted in line ahead of the wikitech migration. For a while there wikitech was at the head of the k8s-migration queue, now it seems to have moved to the end :(

> JFTR, the currently planned OKR for the buildout for https://idm.wikimedia.org is to "eliminate the need for the OpenStack Manager extension on WikiTech" (i.e., account creation and changing email addresses/core attributes/SSH keys will be handled within the IDM). This should also vastly simplify this migration since we no longer need to care about special cases like php-ldap.

Wikitech will still need php-ldap until there is time and energy to do a SUL migration of the accounts there. Making that a blocker of moving Wikitech into core hosting is not a reasonable ask.

I have no idea why you suddenly assume this is treated as a blocker? I've only mentioned what we have planned as the OKR and that it will simplify things; I never said anything about the prioritisation of the cloudweb migration to k8s.

> I have no idea why you suddenly assume this is treated as a blocker? I've only mentioned what we have planned as the OKR and that it will simplify things; I never said anything about the prioritisation of the cloudweb migration to k8s.

My apologies Moritz. This is stored trauma on my side. This work has been blocked in the past due to concerns about php-ldap (T237889: Install php-ldap on all MW appservers) so the mention of it in your comment prompted my response that a) we will need php-ldap until there is a major social project to convert wikitech to use SUL accounts and b) that I feel this would be an unreasonable constraint on this task.

> My apologies Moritz. This is stored trauma on my side. This work has been blocked in the past due to concerns about php-ldap (T237889: Install php-ldap on all MW appservers) so the mention of it in your comment prompted my response that a) we will need php-ldap until there is a major social project to convert wikitech to use SUL accounts and b) that I feel this would be an unreasonable constraint on this task.

No worries, at this point I'm confident that the days of separate cloudweb installs are finally numbered :-)

> I have no idea why you suddenly assume this is treated as a blocker? I've only mentioned what we have planned as the OKR and that it will simplify things; I never said anything about the prioritisation of the cloudweb migration to k8s.

My apologies as well -- I throw a little fit every time I have a new problem on the cloudwebs but it's totally not related to your helpful comment.