
[NEEDS GROOMING] schema services should be moved to k8s
Open, LowPublic

Description

@BTullis pointed out that the schema services currently run on VMs, but it would be better to run them in k8s instead, both to follow SRE practices and to gain some reliability.

This task is a placeholder and requires work estimation.

I wonder if we could deprecate them entirely (probably not: eventgate-analytics-external relies on dynamic configs), or at least require that eventgate-main uses bundled schemas only.

related

Might be a good moment to re-think how we deploy / deliver schema repos.

Event Timeline

Restricted Application added a subscriber: Aklapper.

I think this will be harder than it sounds. I don't think there is a way to automate dynamic deployments of data to k8s workers. Something has to host data in some data store (git+http or whatever) if it is going to be dynamic, and the k8s workers will request it. So, while we could maybe move the HTTP schema server to k8s, the schemas themselves have to be queryable from somewhere.


We had a quick discussion about this. I don't think that it's going to be as hard as all that. Naturally, I could be completely wrong, but here is my thinking.

The amount of data is relatively small, at just under 8 MB.

btullis@marlin:~/tmp/schemas$ git clone -q --depth 1 "ssh://btullis@gerrit.wikimedia.org:29418/schemas/event/primary"
Total 262 (delta 108), reused 188 (delta 80)

btullis@marlin:~/tmp/schemas$ git clone -q --depth 1 "ssh://btullis@gerrit.wikimedia.org:29418/schemas/event/secondary"
Total 703 (delta 353), reused 465 (delta 235)

btullis@marlin:~/tmp/schemas$ du -sh *
1.8M	primary
6.1M	secondary

I think that we could use an emptyDir for this, which is either backed by memory or node-ephemeral-storage.
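As a minimal sketch of that idea (the volume name, size limit, and mount path here are illustrative, not from any existing chart), the emptyDir definition in the pod spec might look like this, with `medium: Memory` available as a tmpfs-backed variant:

```yaml
# Hypothetical pod spec fragment. With no "medium" set, the emptyDir is backed
# by node ephemeral storage; uncommenting "medium: Memory" would use tmpfs.
volumes:
  - name: schema-repos
    emptyDir:
      sizeLimit: 64Mi     # generous headroom; the repos total under 8 MB today
      # medium: Memory    # optional: back the volume with RAM instead of disk
```

The serving container would then mount `schema-repos` (e.g. at `/srv/schemas`) via a `volumeMounts` entry.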

[Attached screenshot: image.png (135×767 px, 42 KB)]

Although this emptyDir would be empty whenever the pod starts up, we would have a post-install hook that runs a Job to populate the directory with a git clone before making the pod ready for work.
Then, in order to keep it up to date, we would create a CronJob object that runs git pull as frequently as we like (e.g. every 30 minutes, to replicate what puppet does).
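One caveat with the CronJob idea: an emptyDir is only visible to containers within the same pod, so a separate CronJob pod could not write into the serving pod's volume directly. A rough sketch of an equivalent design keeps everything in one pod: an initContainer does the initial clone before the pod is ready, and a sidecar container runs the periodic git pull. All image names and paths below are hypothetical placeholders:

```yaml
# Hypothetical pod spec fragment: clone on startup, refresh via sidecar.
# "git-client" and "schema-server" are placeholder image names; the repo URL
# and mount paths are illustrative.
spec:
  volumes:
    - name: schema-repos
      emptyDir: {}
  initContainers:
    - name: clone-schemas
      image: git-client:latest          # placeholder: any image with git
      command:
        - sh
        - -c
        - git clone --depth 1 https://gerrit.wikimedia.org/r/schemas/event/primary /srv/schemas/primary
      volumeMounts:
        - name: schema-repos
          mountPath: /srv/schemas
  containers:
    - name: schema-server
      image: schema-server:latest       # placeholder: the HTTP schema service
      volumeMounts:
        - name: schema-repos
          mountPath: /srv/schemas
          readOnly: true
    - name: refresh-schemas             # sidecar: pull every 30 minutes
      image: git-client:latest
      command:
        - sh
        - -c
        - while true; do sleep 1800; git -C /srv/schemas/primary pull --ff-only; done
      volumeMounts:
        - name: schema-repos
          mountPath: /srv/schemas
```

A Helm post-install/post-upgrade hook Job, as suggested above, could also trigger an initial sync; the sidecar approach just avoids the cross-pod volume problem for the ongoing refresh.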

If a container in the pod crashes and restarts, the emptyDir doesn't get wiped; it is reused for the lifetime of the pod (which is nice).

We have several other examples where we are using emptyDir volumes, some backed by memory, some by node-ephemeral-storage.

I'll add the Data-Platform-SRE tag because I'm sure that we would be happy to work on this. That said, I've just upgraded the four schema VMs to bookworm in T349286 so I'm not sure that there's much operational benefit to be had in the short term by migrating this service to Kubernetes. Running the schema service under Ganeti and LVS is working perfectly well at the moment, but it's certainly something to consider at some point.

Gehel triaged this task as Low priority.Jan 10 2024, 9:35 AM
Gehel moved this task from Incoming to Toil / Automation on the Data-Platform-SRE board.