Page MenuHomePhabricator

Custom dblists for mwscript-k8s
Closed, ResolvedPublic

Description

We added --dblist to mwscript-k8s so that you can invoke foreachwikiindblist within the container, sequentially operating on n wikis using 1 Kubernetes job. This is less resource-intensive than running mwscript-k8s in a for loop, which creates n Kubernetes jobs.

The limitation is that foreachwikiindblist, of course, can only read dblists from the container. Complex expressions like --dblist "s1 + s2" are supported, but if you want to prepare a truly custom list of wikis, there's no easy way to do it.

@Ladsgroup's use case involves constructing a list of wikis in a shell one-liner, like expanddblist s3 | grep -v foo | grep something_weird | xargs -I{} .... @Urbanecm_WMF uses a one-off set of wikis in a loose text file, not committed in the config repo. Presently mwscript-k8s doesn't support either workflow.

We could go a few ways with this:

  1. Modify readDbListFile() in WmfConfig.php to make it possible to read from /data, and upload dblists to the container with --file.
  2. Add a flag to mwscript-k8s to take a dblist file specifically (e.g. --local-dblist=./mywikis), and ConfigMap it into dblists/.
  3. Add a flag to mwscript-k8s to take a list of wikis (e.g. --wikis="foowiki barwiki bazwiki") and ConfigMap that into a file in dblists/.

Open to other ideas, and to preferences between these. I tend to think #3 is the best UX; the shell one-liner case would look something like --wikis="$(expanddblist s3 | munge | twiddle | frobnicate)". (The trouble with #2 is we already have a --dblist flag which can't be used for this, so it's confusion-prone -- for this to be really good, we'd need an even clearer flag name than --local-dblist, I think. I included #1 for completeness but I think both the implementation and the resulting UX are worse than either 2 or 3.)

However, @Scott_French points out with #3 we might need to be careful about shell tokenization. That's true; I think as long as it's just "make sure you quote the $()" then that's okay. (Failure to do so would end up as mwscript-k8s --wikis=foowiki barwiki bazwiki Script.php. That'd get you an error, but a harmless one, unless barwiki was also the name of a maintenance script(!).) I think that's not a dealbreaker, but I'm open to the perspective that it is, and we could proceed with something like --local-dblist <(expanddblist s3 | munge | twiddle | frobnicate) instead.

One downside of all three approaches is opacity: all we'd be able to tell from examining the Kubernetes object is the dblist's filename, not its contents. To tell what wikis it's operating on, we'd have to go to the logs (or, while the script is running, reach in and cat the dblist).

One more option, for posterity:

  1. Stop using foreachwikiindblist and build our own Kubernetes-native thing. That new thing could take either a dblist filename or a custom set of wikis, it could provide better transparency, and it could even add cool stuff like checkpointing (if we restart the container, start from the wiki we were working on instead of going back to aawiki). If we had the engineering time to totally modernize the maintenance infrastructure, this would be on my list, but to solve this problem alone it's the wrong investment.

Event Timeline

My 2c on the solutions:

(1) upload dblists to the container with --file
Definitely workable, but it means the resulting syntax will be quite long and complex: I'd need to specify the dblist via both --file and --dblist (presumably). It is also not very clear how would the tool determine between built-in datasets and custom ones, so that there would be a similar problem to what @RLazarus mentions about --dblist and --local-dblist. It is not my preferred pathway.

(2) Add a flag to mwscript-k8s to take a dblist file specifically (e.g. --local-dblist=./mywikis), and ConfigMap it into dblists/
This one is my favourite so far. It is close to the existing workflow (for me at least), and it does not involve specifying things at multiple places.

If we can get something like mwscript-k8s --local-dblist=<(expanddblist s3 | munge | twiddle | frobnicate)(causes bash to map the output of the command under /dev/fd/XX), then it would also support the other workflow. However, I am unsure whether ConfigMap would accept such a special file – maybe it is not something we can make work easily.

@RLazarus says this would be confusing, as we'd need two flags (--dblist and --local-dblist). I'm not sure why that is true – maybe we can just assume that if the user specified an existing file, then they want to use the list from the file, even if it name-conflicts with a built-in dblist. If this seems too dangerous, maybe we can require specifying filepaths more explicitly (as in, --dblist=./wikipedia rather than just --dblist=wikipedia), so it is very clear the user meant to provide a file named wikipedia.

(3) Add a flag to mwscript-k8s to take a list of wikis (e.g. --wikis="foowiki barwiki bazwiki") and ConfigMap that into a file in dblists/
To me, this doesn't seem to be a good idea. It would work for small lists, but if you need to operate on a longer list, it would result in an awfully long command line. This would be potentially problematic to work with when looking at eg. ps aux, in htop or similar contexts.

Of course, #4 seems to be the clear winner :). However, since we probably do not have the time to build it, I think #2 would be an useful path forward at this point.

One downside of all three approaches is opacity: all we'd be able to tell from examining the Kubernetes object is the dblist's filename, not its contents. To tell what wikis it's operating on, we'd have to go to the logs (or, while the script is running, reach in and cat the dblist).

Would it make sense to create a paste of the user-provided file and log it...somewhere? I occasionally simulate that manually when logging into SAL, see the following log entry as an example

Mentioned in SAL (#wikimedia-operations) [2024-11-28T14:39:54Z] <urbanecm> [urbanecm@deploy2002 ~]$ while read wiki; do echo "== $wiki"; mwscript-k8s extensions/Flow/maintenance/FlowMoveBoardsToSubpages.php -- --wiki=$wiki; done < wikis.txt # wikis.txt is at P71349 # T378827

I could imagine cases when even the list of wikis might be sensitive (such as running a script to remedy a certain vulnerability), so this might have to be a feature the user can disable (cf. scap's --no-log-messages option). On the other hand, I do not recall any such case actually happening, so it might not be an issue at all.

However, I am unsure whether ConfigMap would accept such a special file

This would work fine (the Python wrapper reads the file and creates the ConfigMap with the contents) so using <(...) is a viable approach.

maybe we can just assume that if the user specified an existing file, then they want to use the list from the file, even if it name-conflicts with a built-in dblist

I'm mostly concerned about the usability implications of this. It means the behavior of the command depends on an external factor, which is a recipe for surprises ("whoa, it ran on the original dblist instead of my modified copy! why did it do that? oh, mine had the wrong filename, that's annoying"). I'm not ruling it out, I'd just like to find something better if we can.

maybe we can require specifying filepaths more explicitly (as in, --dblist=./wikipedia rather than just --dblist=wikipedia)

Maybe! The extra syntax is a downside but it's a possible tradeoff.

Would it make sense to create a paste of the user-provided file and log it...somewhere?

We can definitely consider doing this when --sal is passed.

One other note about all three ConfigMap approaches: they would insert the file into dblists/ but wouldn't update dblists-index.php. I don't think that's a problem for the mwscript use case, but if it is it might be a dealbreaker for this whole approach.

Change #1178666 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] mediawiki: Add support for mounting a custom dblist

https://gerrit.wikimedia.org/r/1178666

Change #1178667 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] deployment_server: Add --local-dblist to mwscript-k8s

https://gerrit.wikimedia.org/r/1178667

Change #1178666 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Add support for mounting a custom dblist

https://gerrit.wikimedia.org/r/1178666

Change #1178667 merged by RLazarus:

[operations/puppet@production] deployment_server: Add --local_dblist to mwscript-k8s

https://gerrit.wikimedia.org/r/1178667

This is implemented, as option #2 (--local_dblist rather than --local-dblist for consistency with other flags).

Thanks to @Scott_French for the discussion offline; we agreed this was the clearest and least error-prone UX. (The remaining source of confusion is "so wait, when should I say --dblist and when should I say --local_dblist?" which I find plausibly close to some user questions about the application of --file. We'll keep an ear out and adjust if necessary.)

Updating dblists-index.php isn't an issue; the implementation of evalDbExpressionForCli is all we care about, and this just patch means that implementation is load-bearing for this use case (continues to be so, as it always was in the old days of dropping your own file into dblists/).

I'll update wikitech shortly to document the new flag, and I'll follow up with a separate patch to make the dblist visible when logging to the SAL.

Change #1181779 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] deployment_server: Include --local_dblist contents when logging to SAL

https://gerrit.wikimedia.org/r/1181779

Change #1181779 merged by RLazarus:

[operations/puppet@production] deployment_server: Include --local_dblist contents when logging to SAL

https://gerrit.wikimedia.org/r/1181779