
ExtensionDistributor is broken on k8s
Closed, Resolved | Public | Bug Report

Description

Try to visit any ExtensionDistributor page, like https://www.mediawiki.org/wiki/Special:ExtensionDistributor/WikiEditor

It fails with "Unable to fetch extension list! "

Error
message
MediaWiki\Extension\ExtensionDistributor\Providers\GerritExtDistProvider::makeGerritApiRequest: Could not fetch "https://gerrit.wikimedia.org/r/projects/mediawiki%2Fextensions%2FComments/branches", received: * Error fetching URL: Failed to connect to gerrit.wikimedia.org port 443: Connection timed out
* There was a problem during the HTTP request: 0 Error
Impact
Notes

First entry at Jun 26, 2023 @ 10:31:10.163

Event Timeline

taavi changed the subtype of this task from "Task" to "Bug Report". Jun 26 2023, 7:15 PM

mw.o apparently moved to MW-on-K8s earlier today, so I suspect that's related. My best guess is that the Kubernetes network policies are missing egress rules that ExtDist uses to talk to Gerrit, but I'm not finding anything useful on Logstash to verify this.

hashar renamed this task from ExtDist is broken to ExtensionDistributor is broken. Jun 26 2023, 7:28 PM
hashar updated the task description.
hashar triaged this task as Unbreak Now! priority. Jun 26 2023, 7:37 PM
hashar updated the task description.

Change 933151 had a related patch set uploaded (by Reedy; author: Reedy):

[operations/puppet@production] Revert "mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s"

https://gerrit.wikimedia.org/r/933151

Change 933152 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] Revert "mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s"

https://gerrit.wikimedia.org/r/933152

Change 933152 abandoned by Majavah:

[operations/puppet@production] Revert "mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s"

Reason:

in favour of 933151

https://gerrit.wikimedia.org/r/933152

Change 933151 merged by Alexandros Kosiaris:

[operations/puppet@production] Revert "mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s"

https://gerrit.wikimedia.org/r/933151

Mentioned in SAL (#wikimedia-operations) [2023-06-26T19:48:36Z] <akosiaris> revert "Redirect www.mediawiki.org to mw-on-k8s", debugging T340483

Mentioned in SAL (#wikimedia-operations) [2023-06-26T19:49:17Z] <akosiaris> force puppet run on cp hosts T340483

Change 933179 had a related patch set uploaded (by Reedy; author: Reedy):

[operations/mediawiki-config@master] CommonSettings.php: Set a proxy for $wgExtDistAPIConfig

https://gerrit.wikimedia.org/r/933179

taavi@mwmaint1002 ~ $ mwscript shell.php mediawikiwiki
Psy Shell v0.11.10 (PHP 7.4.33 — cli) by Justin Hileman
> $cache = \MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache();
= WANObjectCache {#590}

> $cache->delete( $cache->makeGlobalKey( 'extdist-list', 'extensions' ) ); $cache->delete( $cache->makeGlobalKey( 'extdist-list', 'skins' ) );
= true
Reedy renamed this task from ExtensionDistributor is broken to ExtensionDistributor is broken on k8s. Jun 26 2023, 8:02 PM
Reedy lowered the priority of this task from Unbreak Now! to High. Jun 26 2023, 8:05 PM

We've chatted with @claime about this today. We discussed alternatives and gathered a few more data points:

  • Things are going to start moving to GitLab eventually
  • GitLab is going to participate in the Switchover regularly, following MediaWiki across DCs
  • GitLab is more actively maintained, seeing changes, maintenance windows, and downtimes at a higher rate than Gerrit
  • GitLab is more likely than Gerrit to have to change IP addresses in the future
  • Neither Gerrit nor GitLab is behind the edge caches (aka CDN), because we want them to stay functional in case the CDN has a meltdown
  • Both Gerrit and GitLab have public IP addresses, and neither has an internal endpoint. In fact, their HTTPS API endpoints can be considered, out of necessity and prudent architectural goals, as external to the rest of the infrastructure. Even internal workloads (case in point, this extension) access them in exactly the same way as the general public would (again, that's a good decision).

Given the above, we discussed the following alternatives:

  • Add Network Policy egress rules for Gerrit on mw-on-k8s. Because of the many, possibly shifting, IP addresses pointed out above, this doesn't sound like a very reliable or sustainable path forward.
  • Add Gerrit/GitLab to the Service Mesh/Proxy. This doesn't sound like a good idea to begin with. The service mesh is meant to address internal services, not external ones. The only reason we even considered this approach is our policy that says to use internal endpoints when available (and there isn't one). Furthermore, we'd still need to mess with egress rules.
  • Just use the proxy as @Reedy's patch above does. This sounds like the best path forward for this. It decouples the extension from the implementation details of a different service and allows the external service to evolve as the owning teams see best.

Thus, I'll be merging and deploying https://gerrit.wikimedia.org/r/933179

Change 933179 merged by Alexandros Kosiaris:

[operations/mediawiki-config@master] CommonSettings.php: Set a proxy for $wgExtDistAPIConfig

https://gerrit.wikimedia.org/r/933179

I tried a couple of URLs and I am getting "Error fetching URL: Received HTTP code 404 from proxy after CONNECT". Looking into it.

I've had to switch to $wgCopyUploadProxy, as $wgLocalHTTPProxy actually talks to MediaWiki itself and thus didn't work. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/933397 instructed the extension to use url-downloader.wikimedia.org to talk to Gerrit, which is what we wanted, but some outdated ACLs on the url-downloader side caused issues.
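
For readers unfamiliar with the setting, the proxy switch described above might look roughly like the fragment below. This is only a sketch and not the content of the merged patches: the 'proxy' key on $wgExtDistAPIConfig is an assumption inferred from the title of change 933179, and the placement in a CommonSettings.php-style file is illustrative.

```php
<?php
// Hypothetical excerpt in the style of wmf-config/CommonSettings.php.
// Assumption: $wgExtDistAPIConfig accepts a 'proxy' key, as the title of
// change 933179 suggests. All other keys are left untouched.

// First attempt (did not work): $wgLocalHTTPProxy routes requests back to
// MediaWiki itself, so it cannot be used to reach Gerrit.
// $wgExtDistAPIConfig['proxy'] = $wgLocalHTTPProxy;

// Follow-up: $wgCopyUploadProxy points at the url-downloader forward proxy,
// which is intended for talking to external endpoints such as Gerrit.
$wgExtDistAPIConfig['proxy'] = $wgCopyUploadProxy;
```

Reusing $wgCopyUploadProxy rather than hard-coding a proxy URL keeps the extension pointed at whatever url-downloader endpoint the rest of the configuration already uses.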

I've spent most of the day reworking that very old piece of Puppet code and changing its behavior so that it allows talking to hosts (BUT NOT LVS endpoints) on the various public networks we have. The situation is vastly different from what we had years ago: those public networks now hold only things that need to be there, so the risk is substantially smaller. It also allows us to treat those endpoints as external, which makes sense as pointed out in the comment above.

The list of patches is at https://gerrit.wikimedia.org/r/q/topic:url_downloader_updates (as well as one at https://gerrit.wikimedia.org/r/c/operations/puppet/+/933430)

I think this issue is resolved: I can see mw on legacy infra using the urldownloader proxy. @Clement_Goubert will follow up with a patch to move mediawiki.org to mw-on-k8s again, widening the communication this time around.

Thank you so much @akosiaris
I'd like to apologize for the lack of communication surrounding the first migration, and the subsequent scramble.
The migration will happen again on June 29th at 10:00 UTC, and this time a rollback procedure has been sent to all ops.

Clement_Goubert claimed this task.

ExtensionDistributor now confirmed working on kubernetes via url-downloader after cache purge, resolving.

hashar subscribed.

The error was not showing in the mediawiki-errors dashboard because it counterintuitively filters out ERROR messages coming from non-error channels (e.g. channel=ExtensionDistributor). I filed a follow-up action: T341815: Kibana dashboard mediawiki-errors lacks channel errors and exceptions