
[Commons Impact Metrics] Refactor/create helm config for AQS service accessing both Cassandra and Druid
Closed, ResolvedPublic

Description

There is already a need, in T355536, for a Helm config that allows an AQS service to query both Cassandra and Druid.
In the Commons Impact Metrics project it's very likely that we'll need that Helm config as well, since it's almost certain that the endpoints we design will target both serving layers.

Tasks:

This task is about either refactoring an existing Helm config (the Druid one or the Cassandra one) so that it can query both datastores, or creating a new Helm config designed to query both.

Definition of done:

  • MediaWiki History Reduced services can be deployed using the new Helm configuration
  • The new Commons Impact Metrics service can be deployed using the new Helm configuration

Currently blocks deployment of SDS 2.6.5 (automate MW History snapshot in AQS 2)
Commons Impact Metrics Deadline: April 19th (three weeks before alpha release launch at the Hackathon)

Event Timeline

VirginiaPoundstone set Due Date to Apr 19 2024, 4:00 AM.

I've had an initial look at this and it seems likely to be easier to merge the existing cassandra-http-gateway and druid-http-gateway charts into a new combined chart, rather than create an additional one alongside them.

There are only five files that differ between the two existing charts.

btullis@marlin:~/wmf/deployment-charts/charts$ diff -rq cassandra-http-gateway/ druid-http-gateway/
Files cassandra-http-gateway/Chart.yaml and druid-http-gateway/Chart.yaml differ
Files cassandra-http-gateway/templates/configmap.yaml and druid-http-gateway/templates/configmap.yaml differ
Files cassandra-http-gateway/templates/_config.yaml and druid-http-gateway/templates/_config.yaml differ
Files cassandra-http-gateway/values-test.yaml and druid-http-gateway/values-test.yaml differ
Files cassandra-http-gateway/values.yaml and druid-http-gateway/values.yaml differ

Expressing that more visually:

image.png (559×864 px, 51 KB)
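
The file listing above only names the files that differ. To read the actual differences file by file, something like the following shell function could be used. This is a minimal sketch rather than anything taken from the task: `chart_diff` is a hypothetical helper name, and it assumes that the chart paths contain no whitespace.

```shell
# chart_diff A B: print a unified diff for every file that exists in both
# chart directories but differs in content. Files that exist on only one
# side ("Only in ..." lines from diff -rq) are skipped.
# Assumption: no path contains whitespace.
chart_diff() {
    # LC_ALL=C keeps the "Files X and Y differ" message in English so that
    # the awk filter below can rely on its shape.
    LC_ALL=C diff -rq "$1" "$2" |
    awk '$1 == "Files" && $NF == "differ" { print $2, $4 }' |
    while read -r left right; do
        diff -u "$left" "$right"
    done
    # diff exits non-zero when files differ; that is expected here.
    return 0
}
```

Run from the `charts/` directory of a deployment-charts checkout, it would be called as `chart_diff cassandra-http-gateway druid-http-gateway`.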

So the question becomes, what shall we call this new chart? What about...

  • http-gateway
  • cassandra-druid-http-gateway
  • aqs-http-gateway
  • combined-http-gateway

I can take a look at that while @BTullis is away next week

Change #1014655 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Create a new aqs-http-gateway chart

https://gerrit.wikimedia.org/r/1014655

Change #1014656 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Migrate editor-analytics to use the new aqs-http-gateway chart

https://gerrit.wikimedia.org/r/1014656

Change #1014657 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Migrate edit-analytics to use the new aqs-http-gateway chart

https://gerrit.wikimedia.org/r/1014657

Change #1014658 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Migrate device-analytics to use the new aqs-http-gateway chart

https://gerrit.wikimedia.org/r/1014658

Change #1014659 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Migrate geo-analytics to use the new aqs-http-gateway chart

https://gerrit.wikimedia.org/r/1014659

Change #1014660 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Migrate image-suggestions to use the new aqs-http-gateway chart

https://gerrit.wikimedia.org/r/1014660

Change #1014661 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Migrate media-analytics to use the new aqs-http-gateway chart

https://gerrit.wikimedia.org/r/1014661

Change #1014662 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Migrate page-analytics to use the new aqs-http-gateway chart

https://gerrit.wikimedia.org/r/1014662

Change #1014663 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Remove separate charts for druid and cassandra AQS services

https://gerrit.wikimedia.org/r/1014663

I have created a stack of small patches to implement this.
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1014655

image.png (222×454 px, 29 KB)

First the new aqs-http-gateway chart is created, then each of the seven existing services (the AQS2 endpoint services plus image-suggestions) is switched to use the new chart in a separate patch.
This is just to avoid a big bang change to all services.
Finally, the two older charts are removed.

There isn't really any requirement to merge them all before continuing work on the Commons Impact Metrics endpoint. Only the first one in the stack is required, but I thought that I would create the migration and cleanup patches now, so that we don't forget.
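
Before the final cleanup patch is merged, it may be worth checking that no service still points at either of the older charts. The sketch below assumes that each service names its chart on a `chart:` line somewhere under a directory of helmfiles. That layout, the helper name `old_chart_users`, and the example path in the usage note are assumptions for illustration, not details taken from this task.

```shell
# old_chart_users DIR: print every file under DIR that has a `chart:` line
# still naming cassandra-http-gateway or druid-http-gateway.
# Prints nothing (and still succeeds) once all services have been migrated.
# Assumption: services declare their chart on a line containing "chart:".
old_chart_users() {
    grep -rlE 'chart:.*(cassandra|druid)-http-gateway' "$1" || true
}
```

For example, `old_chart_users helmfile.d/services` (path assumed) should print nothing once all of the migration patches have been merged.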

There is also an opportunity to remove the need to specify all of the cassandra server IPs by implementing T359423: Migrate charts to Calico Network Policies for this chart.
However, this would require adding the AQS cassandra cluster as a registered external service.
See: https://wikitech.wikimedia.org/wiki/Kubernetes/Deployment_Charts#Enabling_egress_to_services_external_to_Kubernetes for more information on that.
I decided not to implement that change as part of this chart migration, owing to the time-critical nature of this ticket, but it would be a good follow-up.

@mforns - Does this unblock your work on the MW history and/or commons metrics?

@mforns I can pick up this work until Ben comes back if necessary, depending on whether you're still blocked or if you have enough to proceed.

@BTullis and @brouberol sorry for the delay.
This is blocking the MW history snapshot automation work, but it's not urgent, so don't worry.
It doesn't block the commons metrics work so far. So also no worries there.
Thanks for pinging :-)

[ ... ]

So the question becomes, what shall we call this new chart? What about...

  • http-gateway
  • cassandra-druid-http-gateway
  • aqs-http-gateway
  • combined-http-gateway

As I (poorly) explained on Gerrit, the OG chart, cassandra-http-gateway, became something of a misnomer. It was originally intended to be the chart for the services (or maybe ultimately service, singular) of The Cassandra HTTP Gateway™, a limited REST query interface to Cassandra (that is, an HTTP gateway). Currently, the only thing making use of this is image suggestions (this being repos/generated-data-platform/cassandra-http-gateway, not the chart), but it may yet become a bigger thing (honestly, we should probably be talking about whether any future services under the AQS umbrella also do this, but I digress...). Apparently what this chart did matched the requirements of the AQS 2.0 services, and @hnowlan was unaware of where the naming came from, or of any of the broader plans. It makes complete sense to reuse it, of course, but it will surely become a source of confusion, particularly if The Cassandra HTTP Gateway™ ever becomes more than a footnote related to image suggestions.

If we step back and think about the class of things we want to use this chart for, is there anything about them that makes the term "http gateway" meaningful (or is that just being carried forward from the original name)?

Something else to think about: AQS means something pretty specific, but the AQS Cassandra cluster has been made multi-tenant to host the AQS datasets, in addition to other services that fit the similar pattern of precomputing and persisting results for low latency read access (image suggestions for example). I think the rest of the infrastructure has seen similar changes as well. I made an attempt at coming up with a different name, Generated Datasets, even if it wasn't well received/hasn't stuck. The idea was that Generated Datasets™ (or whatever we called it) would describe what we were now doing with that cluster, and AQS would be one of the tenants supported, and maybe we could avoid the confusion between The Analytics Query Service™, and all of the various clusters, code, etc that once existed solely for it, but now encompass a bunch of other things. TL;DR I'm not married to Generated Datasets, but perhaps some other terminology that means everything we do with this infrastructure could be incorporated into whatever we name this chart. :)

Change #1014655 merged by jenkins-bot:

[operations/deployment-charts@master] Create a new aqs-http-gateway chart

https://gerrit.wikimedia.org/r/1014655

There is also an opportunity to remove the need to specify all of the cassandra server IPs by implementing T359423: Migrate charts to Calico Network Policies for this chart.

Only seeing this now so this is a bit of a driveby comment that's coming a little late, but there is also a cassandra sextant module in deployment-charts that will slot in easily too. Not sure how the network policy module and it are reconciled though, as they overlap as far as defining the network side of things.

Thanks for your considered reply @Eevans.
I confess that I hadn't spotted this comment until moments after merging the CR above, so apologies for jumping the gun there.
On the positive side, it's easier to rename the chart(s) later than it is to decide on a good name, so we can definitely carry on discussing it. I didn't mean to shut down the conversation by proceeding.

If we step back and think about the class of things we want to use this chart for, is there anything about them that makes the term "http gateway" meaningful (or is that just being carried forward from the original name)?

I do see your point, but is there anything about it that makes the term 'http gateway' incorrect?
When creating this new chart (based on merging the functionality of the druid and cassandra charts used by AQS 2.0 services) I thought of the term 'http gateway' as purely descriptive. I had no knowledge of The Cassandra HTTP Gateway™ as its precursor.

Something else to think about: AQS means something pretty specific, but the AQS Cassandra cluster has been made multi-tenant to host the AQS datasets, in addition to other services that fit the similar pattern of precomputing and persisting results for low latency read access (image suggestions for example). I think the rest of the infrastructure has seen similar changes as well.

I agree with you. There's also an element of diversification, since this change came about because of a need to access Druid and Cassandra from the same chart.

I made an attempt at coming up with a different name, Generated Datasets, even if it wasn't well received/hasn't stuck. The idea was that Generated Datasets™ (or whatever we called it) would describe what we were now doing with that cluster, and AQS would be one of the tenants supported, and maybe we could avoid the confusion between The Analytics Query Service™, and all of the various clusters, code, etc that once existed solely for it, but now encompass a bunch of other things. TL;DR I'm not married to Generated Datasets, but perhaps some other terminology that means everything we do with this infrastructure could be incorporated into whatever we name this chart. :)

I see, thanks for explaining. I wasn't aware that you were behind the Generated Datasets™ label either, so maybe I just missed a meeting or two :-) - I think that this particular naming problem is getting to the point where we might need more input from data engineers. For example, how does Druid (in fact, both of our Druid clusters) fit into the long-term plan regarding the pre-computed and persisted datasets, alongside Cassandra? Are there any other data stores that fall under the same umbrella of Data Platform capabilities of low-latency query services, or are expected to do so in future?

Once again, apologies if I've muddied the water here. Hopefully there is some clean and elegant solution that we can find in the end.

There is also an opportunity to remove the need to specify all of the cassandra server IPs by implementing T359423: Migrate charts to Calico Network Policies for this chart.

Only seeing this now so this is a bit of a driveby comment that's coming a little late, but there is also a cassandra sextant module in deployment-charts that will slot in easily too. Not sure how the network policy module and it are reconciled though, as they overlap as far as defining the network side of things.

Thanks @hnowlan - I will check that out. I expect that it will also be of interest to @brouberol.
Maybe we can refactor and replace this cassandra module with the external services method, where it has been used. Not saying we have to, but overlapping functionality seems less than ideal in the long run.

I think that we can call this ticket done (for now) because the new chart is ready for use.
I'll move it to Done on the Data-Platform-SRE board, but I'll wait for confirmation from someone on the Experimentation Lab team to verify that it meets the requirements, before resolving it.

Change #1014656 merged by jenkins-bot:

[operations/deployment-charts@master] Migrate editor-analytics to use the new aqs-http-gateway chart

https://gerrit.wikimedia.org/r/1014656

Change #1014657 merged by jenkins-bot:

[operations/deployment-charts@master] Migrate edit-analytics to use the new aqs-http-gateway chart

https://gerrit.wikimedia.org/r/1014657

Change #1014658 merged by jenkins-bot:

[operations/deployment-charts@master] Migrate device-analytics to use the new aqs-http-gateway chart

https://gerrit.wikimedia.org/r/1014658

Change #1014659 merged by jenkins-bot:

[operations/deployment-charts@master] Migrate geo-analytics to use the new aqs-http-gateway chart

https://gerrit.wikimedia.org/r/1014659

Change #1014661 merged by jenkins-bot:

[operations/deployment-charts@master] Migrate media-analytics to use the new aqs-http-gateway chart

https://gerrit.wikimedia.org/r/1014661

Change #1014662 merged by jenkins-bot:

[operations/deployment-charts@master] Migrate page-analytics to use the new aqs-http-gateway chart

https://gerrit.wikimedia.org/r/1014662

The image-suggestions service is the last service to be migrated to the new helm chart, but it is not technically an AQS service.
I think it is best to get those most familiar with the service to review the change, so that it doesn't come as a surprise.

From the git history and the Wikitech page: https://wikitech.wikimedia.org/wiki/Image-suggestion it seems like the best people to contact would be @Cparle and @mfossati.
Once we have migrated this last service from the cassandra-http-gateway chart to the aqs-http-gateway chart, I will be able to carry on with removing the two separate cassandra and druid charts.

Change #1014660 merged by jenkins-bot:

[operations/deployment-charts@master] Migrate image-suggestions to use the new aqs-http-gateway chart

https://gerrit.wikimedia.org/r/1014660

Change #1014663 merged by jenkins-bot:

[operations/deployment-charts@master] Remove separate charts for druid and cassandra AQS services

https://gerrit.wikimedia.org/r/1014663

Mentioned in SAL (#wikimedia-analytics) [2024-05-22T08:48:38Z] <btullis> deploying AQS device-analytics for T360531