
Migrate AQS and image-suggestions services to Calico Network Policies
Closed, Resolved, Public

Description

We need to migrate all of the services that use the aqs-http-gateway chart to make use of the new external-services mechanism, which will make it easier to specify the correct network policies to enable access to their Cassandra and Druid data sources.

The following six are AQS services:

Service            External Services
device-analytics   cassandra aqs
edit-analytics     cassandra aqs, druid-public
editor-analytics   cassandra aqs, druid-public
geo-analytics      cassandra aqs
media-analytics    cassandra aqs
page-analytics     cassandra aqs

In addition, we have one more service that uses the same chart:

Service            External Services
image-suggestions  cassandra aqs

Image-suggestions is not technically an AQS service, which is why it is listed separately.

Currently, we use symlinks to share a set of common network policies in an _aqs2-common_ folder for the cassandra hosts, specific to each data centre:

btullis@deploy1002:/srv/deployment-charts/helmfile.d/services$ find *-analytics -type l -exec ls -o {} \;
lrwxrwxrwx 1 root 34 Nov  1  2023 device-analytics/global-staging.yaml -> ../_aqs2-common_/global-eqiad.yaml
lrwxrwxrwx 1 root 34 Nov  1  2023 device-analytics/global-eqiad.yaml -> ../_aqs2-common_/global-eqiad.yaml
lrwxrwxrwx 1 root 34 Nov  1  2023 device-analytics/global-codfw.yaml -> ../_aqs2-common_/global-codfw.yaml
lrwxrwxrwx 1 root 34 May  2 17:43 editor-analytics/global-staging.yaml -> ../_aqs2-common_/global-eqiad.yaml
lrwxrwxrwx 1 root 34 May  2 17:43 editor-analytics/global-eqiad.yaml -> ../_aqs2-common_/global-eqiad.yaml
lrwxrwxrwx 1 root 34 May  2 17:43 editor-analytics/global-codfw.yaml -> ../_aqs2-common_/global-codfw.yaml
lrwxrwxrwx 1 root 34 Nov  1  2023 geo-analytics/global-staging.yaml -> ../_aqs2-common_/global-eqiad.yaml
lrwxrwxrwx 1 root 34 Nov  1  2023 geo-analytics/global-eqiad.yaml -> ../_aqs2-common_/global-eqiad.yaml
lrwxrwxrwx 1 root 34 Nov  1  2023 geo-analytics/global-codfw.yaml -> ../_aqs2-common_/global-codfw.yaml
lrwxrwxrwx 1 root 34 Nov  1  2023 media-analytics/global-staging.yaml -> ../_aqs2-common_/global-codfw.yaml
lrwxrwxrwx 1 root 34 Nov  1  2023 media-analytics/global-eqiad.yaml -> ../_aqs2-common_/global-eqiad.yaml
lrwxrwxrwx 1 root 34 Nov  1  2023 media-analytics/global-codfw.yaml -> ../_aqs2-common_/global-codfw.yaml
lrwxrwxrwx 1 root 34 Nov  1  2023 page-analytics/global-staging.yaml -> ../_aqs2-common_/global-codfw.yaml
lrwxrwxrwx 1 root 34 Nov  1  2023 page-analytics/global-eqiad.yaml -> ../_aqs2-common_/global-eqiad.yaml
lrwxrwxrwx 1 root 34 Nov  1  2023 page-analytics/global-codfw.yaml -> ../_aqs2-common_/global-codfw.yaml
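
The listing above can also be reduced to bare "link -> target" pairs, which makes it easier to see which shared file each service resolves to. This is only a minimal sketch: it assumes GNU find (for -printf), and the function name list_links is illustrative rather than anything that exists in the repository.

```shell
# Print every symlink found inside a *-analytics directory as
# "service/file -> target", sorted by path.
# $1 is the directory that contains the service folders.
# Assumes GNU find, since -printf is not part of POSIX find.
list_links() {
    (
        cd "$1" || exit 1
        # %P is the path relative to the starting point, %l the link target.
        find . -path './*-analytics/*' -type l -printf '%P -> %l\n' | sort
    )
}
```

On the deployment host this might be run as `list_links /srv/deployment-charts/helmfile.d/services`.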

We noticed when modifying editor-analytics to use both Cassandra and Druid that this mechanism would no longer be the most suitable, so we should migrate to Calico Network Policies.

In addition, the druid-public entry in external-services contains all of the brokers, but it doesn't contain the LVS service IP, which is what clients currently use.
We may want to consider adding this IP address to the network policy.

btullis@puppetmaster1001:~$ host 10.2.2.38
38.2.2.10.in-addr.arpa domain name pointer druid-public-broker.svc.eqiad.wmnet.

The druid analytics cluster doesn't use LVS, so it doesn't need modifying.
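
To illustrate the gap described above, here is a minimal sketch that checks whether a given address appears in a list of permitted addresses. The broker addresses are placeholders invented for the example, and the helper name ip_allowed is hypothetical. Only 10.2.2.38 comes from the lookup shown above. A real network policy matches on CIDR ranges, so this exact-string comparison is a simplification.

```shell
# Succeed if the address in $1 appears, as a whole line, in the list of
# permitted addresses read from standard input.
ip_allowed() {
    grep -Fx -- "$1" >/dev/null
}

# Placeholder broker addresses (hypothetical, for illustration only).
broker_ips='10.64.0.11
10.64.0.12'

# 10.2.2.38 is the druid-public-broker LVS address from the lookup above.
if printf '%s\n' "$broker_ips" | ip_allowed 10.2.2.38; then
    echo "VIP already covered"
else
    echo "VIP missing from policy"
fi
```

With the placeholder list, the check takes the second branch, which mirrors the situation in which the policy lists the brokers but not the service IP that clients connect to.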

Event Timeline

Gehel triaged this task as Medium priority. (May 10 2024, 8:21 AM)
Gehel moved this task from Incoming to Toil / Automation on the Data-Platform-SRE board.
BTullis added a subscriber: WDoranWMF.

Moving this into our current sprint, since we would really like to get this done before the end of the month.

Change #1033405 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Migrate AQS2 services and image-suggestions to calico network policies

https://gerrit.wikimedia.org/r/1033405

I believe that this is ready for review now.

I have only used 'external-services' for the cassandra policy, not the druid-public policy.
The reason is that both services that access druid do so using the VIP of the LVS service, rather than the IP addresses of the hosts themselves.

So I've left the existing druid network policy in place and only added external_services to the global-{eqiad,codfw}.yaml symlink targets, for now.

Change #1033405 merged by jenkins-bot:

[operations/deployment-charts@master] Migrate AQS2 services and image-suggestions to calico network policies

https://gerrit.wikimedia.org/r/1033405

BTullis added a subscriber: mfossati.

All of the AQS services are now deployed, which unblocks the deployment of new versions of edit-analytics and editor-analytics by the end of the month, in support of: T355536: [Sprint 08 GOAL][SDS 2.6.5] AQS 2: Mediawiki history reduced snapshot automation

I will wait until next week to deploy image-suggestions but it should be similarly low-risk.
There is a small change to the list of cassandra servers that the application uses, so I would like to make sure that @mfossati is able to help verify that it is working, if possible.

I have now deployed image-suggestions with the new chart too. Marking this as resolved.