Page MenuHomePhabricator

mwscript-k8s purgeList does not reliably purge cached URLs
Closed, DuplicatePublic

Description

In T387125 I tried to run the purgeList maintenance script using mwscript-k8s, but still got cached logo files afterwards. Only purgeList in legacy mwscript seemed to work (unless the logo just fixed itself coincidentally around the same time). See the SAL messages for the exact commands.

We need purgeList to work reliably (/static/ files are cached for up to a year otherwise), so IMHO this is a blocker for decommissioning the bare-metal maintenance script host.

Event Timeline

Speculation: something in our network blocks the purge request that MediaWiki sends to the cache servers? And/or the cache servers aren’t configured to trust these commands from the mw-on-k8s network?

RLazarus subscribed.

purgeList from mwscript-k8s has worked previously, so my instinct is this is an intermittent problem rather than any of the possible "we just forgot to set this up" situations.

In the serviceops meeting this morning we speculated the cause might be the Envoy sidecar isn't ready yet when the maintenance script starts trying to talk to it. And indeed, purgeList happily declares itself finished 170 ms before Envoy starts listening for requests:

rzl@deploy2002:~$ kubectl logs --timestamps mw-script.codfw.8vzjxdim-69b4x mediawiki-8vzjxdim-app
2025-02-24T14:51:00.355629443Z Purging 5 urls
2025-02-24T14:51:00.390015622Z Done!

rzl@deploy2002:~$ kubectl logs --timestamps mw-script.codfw.8vzjxdim-69b4x mediawiki-8vzjxdim-tls-proxy
[...]
2025-02-24T14:51:00.561167933Z [2025-02-24 14:51:00.561][1][info][main] [source/server/server.cc:897] starting main dispatch loop

So this has worked in the past, when we got lucky with the startup order, and failed today, when we got unlucky with it. We should arrange for the app container to start only when the tls-proxy container goes ready. (This will necessarily make maintenance script startup a few hundred milliseconds slower, but it means when outgoing requests are fired off at the very beginning, they'll work consistently.)

But purgeList's behavior also surprises me here -- the purge requests can't possibly have succeeded. I don't remember exactly what you'd get from Envoy at that point in its startup, either a 5xx or hopefully a "connection refused" but certainly some kind of error. Does purgeList not look at the response?