
[NEEDS GROOMING] schema services should be moved to k8s
Open, LowPublic

Description

@BTullis pointed out that the schema services currently run on VMs, but it would be better to run them in k8s instead, both to follow SRE practices and to gain some reliability.

This task is a placeholder and requires work estimation.

I wonder if we could deprecate them entirely (probably not: eventgate-analytics-external relies on dynamic configs), or at least require that eventgate-main uses bundled schemas only.

related

Might be a good moment to re-think how we deploy / deliver schema repos.

Event Timeline

Restricted Application added a subscriber: Aklapper.

I think this will be harder than it sounds. I don't think there is a way to automate dynamic deployments of data to k8s workers. Something has to host data in some data store (git+http or whatever) if it is going to be dynamic, and the k8s workers will request it. So, while we could maybe move the HTTP schema server to k8s, the schemas themselves have to be queryable from somewhere.


We had a quick discussion about this. I don't think that it's going to be as hard as all that. Naturally, I could be completely wrong, but here is my thinking.

The amount of data is relatively small, at just under 8 MB.

btullis@marlin:~/tmp/schemas$ git clone -q --depth 1 "ssh://btullis@gerrit.wikimedia.org:29418/schemas/event/primary"
Total 262 (delta 108), reused 188 (delta 80)

btullis@marlin:~/tmp/schemas$ git clone -q --depth 1 "ssh://btullis@gerrit.wikimedia.org:29418/schemas/event/secondary"
Total 703 (delta 353), reused 465 (delta 235)

btullis@marlin:~/tmp/schemas$ du -sh *
1.8M	primary
6.1M	secondary

I think that we could use an emptyDir for this, which is either backed by memory or node-ephemeral-storage.
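As a minimal sketch of that idea (the volume name, size limit, and mount path here are illustrative, not from any existing chart), the emptyDir definition in the pod spec might look like this, with `medium: Memory` available as a tmpfs-backed variant:

```yaml
# Hypothetical pod spec fragment. With no "medium" set, the emptyDir is backed
# by node ephemeral storage; uncommenting "medium: Memory" would use tmpfs.
volumes:
  - name: schema-repos
    emptyDir:
      sizeLimit: 64Mi     # generous headroom; the repos total under 8 MB today
      # medium: Memory    # optional: back the volume with RAM instead of disk
```

The serving container would then mount `schema-repos` (e.g. at `/srv/schemas`) via a `volumeMounts` entry.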

[Attached screenshot: image.png (135×767 px, 42 KB)]

Although this emptyDir would be empty whenever the pod starts up, we would have a post-install hook that runs a Job to populate the directory with a git clone before making the pod ready for work.
Then, in order to keep it up to date, we would create a CronJob object that runs git pull as frequently as we like (e.g. every 30 minutes, to replicate what puppet does).
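One caveat with the CronJob idea: an emptyDir is only visible to containers within the same pod, so a separate CronJob pod could not write into the serving pod's volume directly. A rough sketch of an equivalent design keeps everything in one pod: an initContainer does the initial clone before the pod is ready, and a sidecar container runs the periodic git pull. All image names and paths below are hypothetical placeholders:

```yaml
# Hypothetical pod spec fragment: clone on startup, refresh via sidecar.
# "git-client" and "schema-server" are placeholder image names; the repo URL
# and mount paths are illustrative.
spec:
  volumes:
    - name: schema-repos
      emptyDir: {}
  initContainers:
    - name: clone-schemas
      image: git-client:latest          # placeholder: any image with git
      command:
        - sh
        - -c
        - git clone --depth 1 https://gerrit.wikimedia.org/r/schemas/event/primary /srv/schemas/primary
      volumeMounts:
        - name: schema-repos
          mountPath: /srv/schemas
  containers:
    - name: schema-server
      image: schema-server:latest       # placeholder: the HTTP schema service
      volumeMounts:
        - name: schema-repos
          mountPath: /srv/schemas
          readOnly: true
    - name: refresh-schemas             # sidecar: pull every 30 minutes
      image: git-client:latest
      command:
        - sh
        - -c
        - while true; do sleep 1800; git -C /srv/schemas/primary pull --ff-only; done
      volumeMounts:
        - name: schema-repos
          mountPath: /srv/schemas
```

A Helm post-install/post-upgrade hook Job, as suggested above, could also trigger an initial sync; the sidecar approach just avoids the cross-pod volume problem for the ongoing refresh.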

If a container in the pod crashes and restarts, the emptyDir doesn't get wiped; it is reused for the lifetime of the pod (which is nice).

We have several other examples where we are using emptyDir volumes, some backed by memory, some by node-ephemeral-storage.

I'll add the Data-Platform-SRE tag because I'm sure that we would be happy to work on this. That said, I've just upgraded the four schema VMs to bookworm in T349286 so I'm not sure that there's much operational benefit to be had in the short term by migrating this service to Kubernetes. Running the schema service under Ganeti and LVS is working perfectly well at the moment, but it's certainly something to consider at some point.

Gehel triaged this task as Low priority.Jan 10 2024, 9:35 AM
Gehel moved this task from Incoming to Toil / Automation on the Data-Platform-SRE board.