Ensure tls-proxy container is started before launching main container
Closed, Resolved, Public

Description

The k8s-controller-sidecar we are using as a stopgap (until T386694: Replace k8s-controller-sidecars with built in Sidecar containers on k8s 1.31 is done) does not ensure that sidecars have fully started before the main container launches with its payload.

This causes unexpected failure modes: for instance, mediawiki tries to reach a service through the tls-proxy sidecar before the sidecar has started.

We should find a way to ensure that at least the envoy tls-proxy is functional before starting mediawiki payloads in mw-cron and mw-script.

Event Timeline

Change #1122578 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/mediawiki-config@master] When executing cli scripts, wait for the service mesh

https://gerrit.wikimedia.org/r/1122578

Change #1122606 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/docker-images/production-images@master] mwscript: do not run mesh checks when running in a loop

https://gerrit.wikimedia.org/r/1122606

In the case of pods accepting traffic, our readiness probe should be enough to ensure this.

For scripts, I think the simplest thing is to add a curl request in MWScript.php, which I did now.
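To make the idea concrete, here is a minimal sketch in Python of a "wait for the mesh" check: poll a local readiness URL until it answers or a deadline passes. It is not the code from the patch, which lives in MWScript.php. The URL, port, deadline and interval, and the names `http_probe` and `wait_for_mesh`, are all assumptions made for illustration.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable

# Hypothetical readiness URL; the real endpoint and port are not stated in this task.
MESH_READY_URL = "http://127.0.0.1:9901/ready"


def http_probe(url: str = MESH_READY_URL, timeout: float = 1.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or non-2xx: the proxy is not ready yet.
        return False


def wait_for_mesh(
    probe: Callable[[], bool] = http_probe,
    deadline: float = 10.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it succeeds or `deadline` seconds have elapsed."""
    start = clock()
    while True:
        if probe():
            return True
        if clock() - start >= deadline:
            return False
        sleep(interval)
```

A caller would run `wait_for_mesh()` once before starting the script payload and refuse to continue if it returns `False`. The probe, sleep and clock are injectable only so the loop can be exercised without a network.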

I added the ability to skip the mesh check, which is useful in foreachwiki (see the patch to production-images attached to the task) and foreachwikiindblist.

What remains to be done:

  • Add the same trick to the puppet versions of the foreachwiki.. scripts
  • [Debatable] add the ability to mwscript-k8s to pass the env variable to skip the check to k8s. This should be done as part of the parent task.
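The reason for skipping the check in loops can be sketched as follows: a foreachwiki-style wrapper checks the mesh once, then marks every child invocation so it does not repeat the check. This is an illustrative Python sketch, not the actual wrapper scripts. The variable name `MESH_CHECK_SKIP` and the function names are hypothetical, since the task does not name the real environment variable.

```python
import os
import subprocess
from typing import Callable, Iterable, List, Mapping, Optional, Sequence

# Hypothetical variable name; the real one is not given in this task.
SKIP_VAR = "MESH_CHECK_SKIP"


def child_env(base: Mapping[str, str]) -> dict:
    """Copy the environment and flag child runs to skip the mesh check."""
    env = dict(base)
    env[SKIP_VAR] = "1"
    return env


def run_for_each_wiki(
    wikis: Iterable[str],
    command: Sequence[str],
    check_mesh: Callable[[], bool],
    runner: Callable = subprocess.run,
    base_env: Optional[Mapping[str, str]] = None,
) -> List[int]:
    """Check the mesh once, then run `command` per wiki with the check skipped."""
    if not check_mesh():
        raise RuntimeError("service mesh is not ready")
    env = child_env(os.environ if base_env is None else base_env)
    codes = []
    for wiki in wikis:
        # Each child sees SKIP_VAR=1 and is expected to skip its own check.
        result = runner([*command, f"--wiki={wiki}"], env=env)
        codes.append(result.returncode)
    return codes
```

Without the skip flag, a loop over several hundred wikis would repeat the same readiness check on every iteration, even though the mesh was already verified at the start.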

Change #1123377 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mwscript: Do not run mesh checks in loops

https://gerrit.wikimedia.org/r/1123377

Change #1122606 merged by Clément Goubert:

[operations/docker-images/production-images@master] mwscript: do not run mesh checks when running in a loop

https://gerrit.wikimedia.org/r/1122606

Change #1123377 merged by Clément Goubert:

[operations/puppet@production] mwscript: Do not run mesh checks in loops

https://gerrit.wikimedia.org/r/1123377

Change #1122578 merged by jenkins-bot:

[operations/mediawiki-config@master] When executing cli scripts, wait for the service mesh

https://gerrit.wikimedia.org/r/1122578

Mentioned in SAL (#wikimedia-operations) [2025-02-27T15:46:05Z] <cgoubert@deploy2002> Started scap sync-world: Backport for [[gerrit:1122578|When executing cli scripts, wait for the service mesh (T387208)]]

Had to revert the mediawiki change: scap uses MWScript.php in a few places, and the check breaks it since there's no mesh on the deployment hosts.
I'll invert the logic of the patch tomorrow so that we explicitly *enable* the mesh check in mw-script and mw-cron, with a possible override, and disable the check in the general case. That way there's no need to modify scap.

We can revisit making the check default when all use-cases except scap have been moved to kubernetes.
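The opt-in logic proposed here can be sketched as a small decision function: the check is off unless explicitly enabled, and an override can still turn it off. This is a Python illustration of the proposal only, not what was eventually merged. Both variable names (`MESH_CHECK` and `MESH_CHECK_SKIP`) and the accepted truthy values are assumptions.

```python
from typing import Mapping

# Hypothetical variable names; the task does not specify the real ones.
ENABLE_VAR = "MESH_CHECK"
SKIP_VAR = "MESH_CHECK_SKIP"

_TRUTHY = {"1", "true", "yes", "on"}


def _is_set(env: Mapping[str, str], name: str) -> bool:
    """Treat a small set of common spellings as 'enabled'."""
    return env.get(name, "").strip().lower() in _TRUTHY


def mesh_check_enabled(env: Mapping[str, str]) -> bool:
    """Opt-in check: off by default, on when enabled, and the override wins."""
    if _is_set(env, SKIP_VAR):
        return False
    return _is_set(env, ENABLE_VAR)
```

Under this scheme a deployment host with neither variable set (the scap case) never runs the check, while mw-script and mw-cron would set the enabling variable in their pod environment.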

Scap is the exception, not the rule. A lot of maintenance scripts might be broken if the mesh isn't available. This is a common issue on k8s but it's still a very valid check on bare metal.

Change #1124051 had a related patch set uploaded (by Clément Goubert; author: Giuseppe Lavagetto):

[operations/mediawiki-config@master] Revert^2 "When executing cli scripts, wait for the service mesh"

https://gerrit.wikimedia.org/r/1124051

Change #1124051 merged by jenkins-bot:

[operations/mediawiki-config@master] Revert^2 "When executing cli scripts, wait for the service mesh"

https://gerrit.wikimedia.org/r/1124051

Mentioned in SAL (#wikimedia-operations) [2025-03-04T16:47:35Z] <cgoubert@deploy2002> Started scap sync-world: Backport for [[gerrit:1124051|Revert^2 "When executing cli scripts, wait for the service mesh" (T387208)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-04T16:50:32Z] <cgoubert@deploy2002> cgoubert, oblivian: Backport for [[gerrit:1124051|Revert^2 "When executing cli scripts, wait for the service mesh" (T387208)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-03-04T16:58:18Z] <cgoubert@deploy2002> Finished scap sync-world: Backport for [[gerrit:1124051|Revert^2 "When executing cli scripts, wait for the service mesh" (T387208)]] (duration: 10m 42s)

RLazarus assigned this task to Joe.

There's obviously a nondeterministic element to the original bug, but as far as I can tell from repeated testing on mw-script, this is now working consistently. Thanks!

Change #1135379 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/docker-images/production-images@master] php: mwscript bugfix

https://gerrit.wikimedia.org/r/1135379

Change #1133935 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/mediawiki-config@master] MWScript.php: exit code on mesh, longer timeout

https://gerrit.wikimedia.org/r/1133935

Change #1135379 merged by Clément Goubert:

[operations/docker-images/production-images@master] php: mwscript bugfix

https://gerrit.wikimedia.org/r/1135379

Mentioned in SAL (#wikimedia-operations) [2025-04-10T09:40:06Z] <claime> Rebuilding php base images to pick up 1135379 - T387208

Mentioned in SAL (#wikimedia-operations) [2025-04-10T09:44:04Z] <cgoubert@deploy1003> Started scap sync-world: Rebuilding mediawiki images to pick up new base images 1135379 - T387208

Mentioned in SAL (#wikimedia-operations) [2025-04-10T10:19:15Z] <cgoubert@deploy1003> sync-world aborted: Rebuilding mediawiki images to pick up new base images 1135379 - T387208 (duration: 35m 23s)

Mentioned in SAL (#wikimedia-operations) [2025-04-10T10:19:35Z] <cgoubert@deploy1003> Started scap sync-world: Rebuilding mediawiki images to pick up new base images 1135379 - T387208

Change #1133935 merged by jenkins-bot:

[operations/mediawiki-config@master] MWScript.php: exit code on mesh, longer timeout

https://gerrit.wikimedia.org/r/1133935

Mentioned in SAL (#wikimedia-operations) [2025-04-10T10:45:56Z] <cgoubert@deploy1003> Started scap sync-world: Backport for [[gerrit:1133935|MWScript.php: exit code on mesh, longer timeout (T390972 T387208)]]

Mentioned in SAL (#wikimedia-operations) [2025-04-10T10:54:08Z] <cgoubert@deploy1003> cgoubert: Backport for [[gerrit:1133935|MWScript.php: exit code on mesh, longer timeout (T390972 T387208)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-04-10T11:08:11Z] <cgoubert@deploy1003> Finished scap sync-world: Backport for [[gerrit:1133935|MWScript.php: exit code on mesh, longer timeout (T390972 T387208)]] (duration: 22m 15s)

Mentioned in SAL (#wikimedia-operations) [2025-04-10T11:28:54Z] <claime> Rebuilding php base images to pick up 1135694 - T387208

Mentioned in SAL (#wikimedia-operations) [2025-04-10T11:32:06Z] <cgoubert@deploy1003> Started scap sync-world: Rebuilding mediawiki images to pick up new base images 1135694 - T387208

Mentioned in SAL (#wikimedia-operations) [2025-04-10T12:15:32Z] <cgoubert@deploy1003> Finished scap sync-world: Rebuilding mediawiki images to pick up new base images 1135694 - T387208 (duration: 44m 51s)