MediaWiki maintenance scripts using services proxied by the tls proxy might fail when running with mwscript-k8s
Closed, Duplicate, Public

Description

When executing mwscript-k8s --follow -- extensions/CirrusSearch/maintenance/Metastore.php --wiki testwiki --cluster codfw --show-index-version, the script fails when MediaWiki tries to connect to the search cluster:

Elastica\Exception\Connection\HttpException from line 186 of /srv/mediawiki/php-1.44.0-wmf.8/vendor/ruflin/elastica/src/Transport/Http.php: Couldn't connect to host, Elasticsearch down?
#0 /srv/mediawiki/php-1.44.0-wmf.8/vendor/ruflin/elastica/src/Request.php(183): Elastica\Transport\Http->exec(Object(Elastica\Request), Array)
#1 /srv/mediawiki/php-1.44.0-wmf.8/vendor/ruflin/elastica/src/Client.php(545): Elastica\Request->send()
[...]

This execution is supposed to initiate a connection to the service search-omega-codfw:

  • proxied via http://localhost:6203 to https://search.svc.codfw.wmnet:9443

A quick look at the egress rules and envoy configuration suggests they are correct.

A possible explanation is that the tls-proxy container is not yet ready when the php container's entrypoint (the mwscript itself) starts.
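One way to close that race would be a small entrypoint wrapper that polls the proxy's readiness endpoint before launching the maintenance script. This is a sketch only: the `wait_for_proxy` helper, the `PROXY_READY_URL` default, and the admin port are all assumptions, not the actual pod configuration.

```shell
#!/bin/bash
# Hypothetical entrypoint wrapper: block until the envoy tls-proxy
# answers on a readiness endpoint before starting the real command.
# The URL below is an assumption, not the actual proxy config.
PROXY_READY_URL="${PROXY_READY_URL:-http://localhost:1666/ready}"

# Poll the readiness endpoint up to $1 times, one second apart.
wait_for_proxy() {
  local attempts="${1:-30}" i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl --silent --fail --max-time 2 "$PROXY_READY_URL" >/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "tls-proxy not ready after ${attempts} attempts" >&2
  return 1
}

# Usage in the container entrypoint (command is illustrative):
#   wait_for_proxy 30 && exec php maintenance/run.php "$@"
```

This keeps the php image unchanged and confines the ordering logic to the entrypoint, at the cost of a startup delay bounded by the attempt count.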

Event Timeline

We (Data Platform SRE) encountered the same issue when attempting to deploy an envoy sidecar alongside each airflow task container, thus allowing the task to egress to the mesh.

What we eventually decided was to run an envoy Deployment instead of a sidecar, as it was a simpler solution all around:

  • no need to deploy the k8s-job-sidecar-controller to exec into the envoy container and SIGTERM it when the task container was done
  • no inherent race condition between both containers

I think that barring deploying envoy as a full-fledged sidecar (apparently only available from Kubernetes 1.29), which would start envoy as an init container and let it run for the whole lifetime of the main container, a solution here might be to add some retries in case of a connection error.
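A minimal sketch of the retry idea, as a shell wrapper around the connection attempt. The `retry_on_connection_error` helper and its backoff are illustrative, not an existing tool; the example endpoint reuses the localhost:6203 listener from the task description.

```shell
#!/bin/bash
# Hypothetical retry wrapper: re-run a command with linear backoff when
# it fails with a connection error, tolerating a tls-proxy that is
# still starting up.
retry_on_connection_error() {
  local max="$1"; shift
  local attempt=1
  while true; do
    "$@" && return 0
    local rc=$?
    # curl exit code 7 means "failed to connect to host"; give up on
    # any other error, or once the attempt budget is spent.
    if [ "$rc" -ne 7 ] || [ "$attempt" -ge "$max" ]; then
      return "$rc"
    fi
    echo "connection refused, retrying (${attempt}/${max})..." >&2
    sleep "$attempt"
    attempt=$((attempt + 1))
  done
}

# Example: probe the local envoy listener for the search cluster
# (port 6203 per the task description).
# retry_on_connection_error 5 curl --silent --fail http://localhost:6203/
```

For the MediaWiki case the equivalent retry would live in PHP around the Elastica connection attempt rather than in shell, but the shape is the same: retry only on connect failures, with a bounded backoff.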

We are targeting version 1.31 for the next Kubernetes version and progress toward the upgrade is good. See T341984: Update Kubernetes clusters to 1.31 and k8s-sig notes.
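For reference, once the clusters are on 1.29 or later, the native-sidecar shape would look roughly like this. Container names, images, and the probe port are placeholders, not the actual deployment config:

```yaml
# Sketch of the native-sidecar feature (beta and on by default since
# Kubernetes 1.29): an init container with restartPolicy: Always starts
# before the main container, keeps running for the pod's whole lifetime,
# and is stopped automatically when the main container exits.
apiVersion: v1
kind: Pod
spec:
  initContainers:
    - name: tls-proxy
      image: envoy-placeholder:latest
      restartPolicy: Always   # marks this init container as a sidecar
      startupProbe:           # later containers start only once this passes
        httpGet:
          path: /ready
          port: 1666          # placeholder admin/readiness port
  containers:
    - name: mediawiki
      image: mediawiki-placeholder:latest
```

Because the kubelet waits for the sidecar's startup probe before launching the main container, this would remove both the race condition and the need for the k8s-job-sidecar-controller.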

So we might want to keep this sidecar option in mind for a future iteration regarding Airflow. Having an envoyproxy deployed within each task pod would perhaps reduce the network hops, or allow for a specific listener configuration per task, or something like that.

T387208: Ensure tls-proxy container is started before launching main container should work around this issue until we have a proper sidecar. Basically, MwScript.php checks that the tls-proxy is up before proceeding when run in our production environments (mw-cron, mw-script-k8s).

@Clement_Goubert the script mentioned in this ticket now runs properly; I will mark this task as a duplicate of T387208, thanks!