
Export development_network_probe data to Puppet servers for CDN deployment
Open, MediumPublic

Description

As part of WE5.4.3, we need to export a subset of data that currently lives in Hive (event.development_network_probe) into the production Puppet servers.
The goal is to run a query daily, export its results, and make them available to Puppet so they can be deployed to the CDN servers.

Proposed approach (from Data Engineering feedback on Slack):

  • Generate the file daily on HDFS with Airflow + Spark.
  • Use an "archiver" to manage file naming and ensure consistency.
  • Configure the Puppet server to fetch the file from HDFS (e.g., via hdfs_rsync) and then deploy it to the CDN hosts.

Open questions / next steps:

Acceptance criteria:

  • Daily job exports query results to HDFS.
  • Puppet server fetches and stores the file in a way usable by Puppet manifests.
  • File is deployed to CDN servers via the normal Puppet workflows.

Details

Other Assignee
brouberol
Related Changes in Gerrit:
Repo | Branch | Lines +/-
operations/puppet | production | +2 -6
operations/puppet | production | +2 -0
operations/puppet | production | +4 -2
operations/puppet | production | +17 -15
operations/puppet | production | +2 -0
operations/puppet | production | +34 -0
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +3 -0
operations/puppet | production | +10 -7
operations/puppet | production | +1 -1
operations/puppet | production | +1 -1
operations/deployment-charts | master | +17 -0
operations/deployment-charts | master | +2 -2
operations/deployment-charts | master | +114 -2
operations/deployment-charts | master | +41 -8
operations/deployment-charts | master | +9 -1
operations/puppet | production | +3 -3
operations/puppet | production | +34 -0
operations/puppet | production | +8 -0
operations/puppet | production | +5 -0
operations/deployment-charts | master | +4 -0
operations/deployment-charts | master | +71 -0
operations/deployment-charts | master | +60 -0
operations/deployment-charts | master | +10 -7
operations/dns | master | +2 -2
operations/dns | master | +3 -0
operations/puppet | production | +1 -1
operations/puppet | production | +18 -0
labs/private | master | +0 -0
operations/puppet | production | +0 -10
operations/puppet | production | +16 -0
operations/puppet | production | +1 -0
operations/puppet | production | +9 -0
operations/puppet | production | +13 -2
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
sre: fix container/pod security context for rsync task | repos/data-engineering/airflow-dags!2153 | elukey | elukey_T402512_fix_security_context | main
Create the first webrequest-based pipeline for SRE | repos/data-engineering/airflow-dags!2100 | elukey | elukey_T402512 | main
Add the netaddr dependency and regenerate poetry's lock | repos/data-engineering/airflow-dags!2082 | elukey | netaddr_dep | main
Define the sre root DAG folder | repos/data-engineering/airflow-dags!1930 | brouberol | T402512 | main

Event Timeline

There are a very large number of changes, so older changes are hidden.

I can take care of spinning up the airflow instance if required @BTullis.

That would be great. Thanks.

Sure thing. I'd need a couple of details from you @elukey, namely the default team name DAGs would be labeled with, as well as a default alert email for failing DAGs.

Change #1227731 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/dns@master] Define the airflow-sre public and internal domains

https://gerrit.wikimedia.org/r/1227731

Change #1227732 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Define the airflow-sre kubeconfig files

https://gerrit.wikimedia.org/r/1227732

Change #1227733 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Setup the caching and ATS rules to publicly expose airflow-sre.wikimedia.org

https://gerrit.wikimedia.org/r/1227733

@brouberol I'd say team name "sre" and the root wikimedia email as starter, then later on we can tune it with a different one!

For logging into the instance we can use cn=ops,ou=groups,dc=wikimedia,dc=org

Change #1227732 merged by Brouberol:

[operations/puppet@production] Define the airflow-sre kubeconfig files

https://gerrit.wikimedia.org/r/1227732

Change #1227829 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] deployment_server: Fix group name typo

https://gerrit.wikimedia.org/r/1227829

Change #1227829 merged by Clément Goubert:

[operations/puppet@production] deployment_server: Fix group name typo

https://gerrit.wikimedia.org/r/1227829

Next steps:

  • DP to create the Airflow SRE instance.
  • Me and DP to configure the rsync settings and credentials to be able to push data from the DSE Cluster to the puppetservers.
  • Me/Chris to test the Airflow workflow and publish the first bit of data to the puppetservers.

Change #1227731 merged by Brouberol:

[operations/dns@master] Define the airflow-sre public and internal domains

https://gerrit.wikimedia.org/r/1227731

Change #1228315 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/dns@master] Fix formatting of airflow-sre domain declarations

https://gerrit.wikimedia.org/r/1228315

Change #1228315 merged by Brouberol:

[operations/dns@master] Fix formatting of airflow-sre domain declarations

https://gerrit.wikimedia.org/r/1228315

Change #1228420 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] dse-k8s-eqiad: add the airflow-sre to the ceph/PG operator tenant ns

https://gerrit.wikimedia.org/r/1228420

Change #1228421 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] dse-k8s-eqiad: define the postgresql-airflow-sre service

https://gerrit.wikimedia.org/r/1228421

Change #1228422 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] dse-k8s-eqiad: define the airflow-sre service

https://gerrit.wikimedia.org/r/1228422

Change #1228420 merged by Brouberol:

[operations/deployment-charts@master] dse-k8s-eqiad: add the airflow-sre to the ceph/PG operator tenant ns

https://gerrit.wikimedia.org/r/1228420

Change #1228421 merged by jenkins-bot:

[operations/deployment-charts@master] dse-k8s-eqiad: define the postgresql-airflow-sre service

https://gerrit.wikimedia.org/r/1228421

@MoritzMuehlenhoff re cn=ops,ou=groups,dc=wikimedia,dc=org understood! The LDAP/Airflow role mapping is by default:

role_mappings:
  airflow-{{ $.Values.config.airflow.instance_name }}-ops: [Op]
  nda: [User]
  wmf: [User]
  ops: [Admin]

where the Op role is a powerful role, but not as powerful as Admin. In our case, as we're provisioning the instance for SREs, all users will be members of ops anyway, meaning that they will all get Admin powers. There'd be no need to create the airflow-sre-ops LDAP group.
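To make the consequence of that mapping concrete, here is a minimal sketch of how a user's LDAP groups could resolve to a single effective role. The "highest role wins" rule and the `ROLE_RANK` ordering are assumptions made for illustration (Airflow may simply grant the union of all mapped roles), and `effective_role` is a hypothetical helper, not part of the chart.

```python
# Mapping copied from the chart defaults above, with the instance name
# already substituted (airflow-sre).
ROLE_MAPPINGS = {
    "airflow-sre-ops": ["Op"],
    "nda": ["User"],
    "wmf": ["User"],
    "ops": ["Admin"],
}

# Assumed privilege ordering, lowest to highest (illustrative only).
ROLE_RANK = {"User": 0, "Op": 1, "Admin": 2}


def effective_role(ldap_groups):
    """Return the highest-ranked role granted by any group, or None."""
    roles = [
        role
        for group in ldap_groups
        for role in ROLE_MAPPINGS.get(group, [])
    ]
    return max(roles, key=ROLE_RANK.__getitem__, default=None)
```

Under these assumptions any member of ops resolves to Admin regardless of other memberships, which is why a dedicated airflow-sre-ops group would add nothing for an SRE-only instance.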

Change #1228422 merged by Brouberol:

[operations/deployment-charts@master] dse-k8s-eqiad: define the airflow-sre service

https://gerrit.wikimedia.org/r/1228422

Change #1228437 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] dse-k8s-eqiad/airflow-sre: define deploy role and TLS SAN

https://gerrit.wikimedia.org/r/1228437

Change #1228437 merged by Brouberol:

[operations/deployment-charts@master] dse-k8s-eqiad/airflow-sre: define deploy role and TLS SAN

https://gerrit.wikimedia.org/r/1228437

Change #1227733 merged by Brouberol:

[operations/puppet@production] trafficserver: setup caching and ATS rules to publicly expose airflow-sre.w.o

https://gerrit.wikimedia.org/r/1227733

Change #1228448 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Provision the OIDC config for airflow-sre

https://gerrit.wikimedia.org/r/1228448

Change #1228448 merged by Brouberol:

[operations/puppet@production] Provision the OIDC config for airflow-sre

https://gerrit.wikimedia.org/r/1228448

https://airflow-sre.wikimedia.org has been deployed!

Screenshot 2026-01-19 at 12.35.22.png (1,435×484 px, 78 KB)

Give it a good hour for the ATS changes to propagate everywhere.

Next steps:

  • DP to create the Airflow SRE instance.

That's done now.

  • Me and DP to configure the rsync settings and credentials to be able to push data from the DSE Cluster to the puppetservers.

We're ready to start configuring this now, @elukey

Could you tell me which posix user will be permitted to SSH into the puppetservers, please?
Will it be a new user, specifically for this purpose, or an existing account?

We will need the following resources to be configured.

In Puppet
  • An ssh::userkey resource on the puppetservers - (example here.) - This is an authorized_keys file containing the public part of an SSH keypair, along with some options to restrict the source addresses. We could force the command to be rsync --server <snip> as was shown in this commit, if you would like to lock down the options as much as possible.
  • The corresponding SSH private key goes into the private puppet repository
  • A firewall::service on the puppetservers that opens TCP port 22 to the DSE_KUBEPODS network. (Example: here.)
  • A profile::ssh::server::match_config block on the puppetservers (example here) that further restricts the options that apply to an SSH session matching this user. (e.g. AllowTcpForwarding: 'no')
In deployment-charts
  • A secret that contains the SSH private key mentioned earlier
  • ConfigMaps that contain the following:
    • A known_hosts file, containing the SSH host keys for the puppet servers
    • An rsync_targets file, containing the puppet server host names, and the posix username to be used
    • A config file containing other useful settings, such as the ciphers to use

You can look at these examples for the files themselves, plus the way that extra-config resources can be managed by the airflow chart.

There is no corresponding way of creating arbitrary secrets in the Airflow chart, in the way that the ConfigMap objects are managed. Perhaps we will want to add this, or perhaps we could add the secret using a raw YAML object in the helmfile.yaml for airflow-sre.

Have we decided whether to use cephfs or S3 for the intermediate data store? Either is fine by me.

Change #1229590 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::puppetserver: add the analytics-sre user key and configs

https://gerrit.wikimedia.org/r/1229590

Thanks a lot for the detailed explanation Ben! I tried to work on the Puppet part in https://gerrit.wikimedia.org/r/1229590, and I'll hopefully do the deployment-charts part tomorrow. From a quick glance I am a little concerned about the need for known_hosts, since it can easily get out of sync with reimages etc. Have you duplicated them for the dumps use case, or did you use a different approach?

Change #1229590 merged by Elukey:

[operations/puppet@production] role::puppetserver: add the analytics-sre user key and configs

https://gerrit.wikimedia.org/r/1229590

Replying to my own question - in helmfile.d/dse-k8s-services/mediawiki-dumps-legacy/values-dumps.yaml I see the following:

ssh_known_hosts:
  - clouddumps1001.wikimedia.org,208.80.154.142,2620:0:861:2:208:80:154:142 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBMA/Al2V+CTWWEdMJmqzhSmtn5tche1OmBxh67/g8AP7wdtUSZ6urOUZBe8lcjiAif9heJb7jwWWSNe+VCKCq0g=
  - clouddumps1002.wikimedia.org,208.80.154.71,2620:0:861:3:208:80:154:71 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBExvftUq/vgRdl20f4hKYECNRYoZNI2C789OOxz92IQBrsDU8NqgMy1o9bdfVc2acZBC3VD/LNCtiLx1kWtetro=

This is probably fine for the moment, but I am pretty sure it will cause issues down the line when a host is reimaged or otherwise changed and people forget to update these configs. I am wondering if there is a way in external-services or similar to pick the known hosts config from the deploy's trusted list and load it into a config map in k8s.
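Until something like that exists, the drift could at least be detected. The following is a rough sketch, assuming unhashed known_hosts lines with no markers (such as `@cert-authority`), that compares the configured entries against freshly collected ones, for example the output of `ssh-keyscan` run from a trusted host. Both function names are hypothetical.

```python
def parse_known_hosts(lines):
    """Map each host name or address to its set of (key_type, key) pairs."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        names, key_type, key = line.split()[:3]
        for name in names.split(","):
            entries.setdefault(name, set()).add((key_type, key))
    return entries


def stale_hosts(configured_lines, scanned_lines):
    """Return hosts whose configured keys all differ from the scanned keys."""
    configured = parse_known_hosts(configured_lines)
    scanned = parse_known_hosts(scanned_lines)
    return sorted(
        host
        for host, keys in configured.items()
        # Hosts absent from the scan are skipped rather than reported.
        if host in scanned and not keys & scanned[host]
    )
```

A check like this could run periodically and alert when `stale_hosts` returns a non-empty list, which would catch a reimaged host before the rsync task starts failing on host key verification.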

Change #1230970 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] dse-k8s-services: add service-secrets to airflow-sre's helmfile config

https://gerrit.wikimedia.org/r/1230970

I am wondering if there is a way in external-services or similar to pick the known hosts config from the deploy's trusted list and load it to a config map in k8s.

I'll admit that I don't know atm. Do you feel like this is a hard blocker? I'm happy to create a subticket for us to investigate, but still unblock you right now and proceed with keys you'd generate in puppet.

Once the keys are generated, I'll add some functionality to the airflow chart to support the definition of the known_hosts, rsync_targets and config file (config being .ssh/config) and mounting of these files in the task pods, when defined.

That being said, if that information is defined in a hiera value, we can export it to general-<env>.yaml via global_config.pp in puppet. If it's not in hiera, we might still be able to fetch the known hosts with a PQL query?

Change #1235837 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] hadoop/yarn: allow analytics-sre to submit jobs to the production queue

https://gerrit.wikimedia.org/r/1235837

Change #1235837 merged by Brouberol:

[operations/puppet@production] hadoop/yarn: allow analytics-sre to submit jobs to the production queue

https://gerrit.wikimedia.org/r/1235837

Change #1230970 merged by Elukey:

[operations/deployment-charts@master] dse-k8s-services: add service-secrets to airflow-sre's helmfile config

https://gerrit.wikimedia.org/r/1230970

Change #1236729 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-sre: enable ssh access from task pods to the puppetservers

https://gerrit.wikimedia.org/r/1236729

Change #1236756 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: allow the definition of rsync/ssh configuration

https://gerrit.wikimedia.org/r/1236756

Change #1236756 merged by Brouberol:

[operations/deployment-charts@master] airflow: allow the definition of rsync/ssh configuration

https://gerrit.wikimedia.org/r/1236756

Change #1236729 merged by Brouberol:

[operations/deployment-charts@master] airflow-sre: enable ssh access from task pods to the puppetservers

https://gerrit.wikimedia.org/r/1236729

Change #1237221 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: ensure the ssh privatekey is b64 encoded in the Secret

https://gerrit.wikimedia.org/r/1237221

Change #1237221 merged by Brouberol:

[operations/deployment-charts@master] airflow: ensure the ssh privatekey is b64 encoded in the Secret

https://gerrit.wikimedia.org/r/1237221

We have deployed the "infrastructure" that will support rsync-ing files from airflow-sre task pods to the puppetserver hosts:

brouberol@deploy2002:~$ kubectl get configmap | grep puppetservers
airflow-rsync-puppetservers              1      54s
airflow-ssh-puppetservers                2      54s
brouberol@deploy2002:~$ kubectl get networkpolicies | grep puppetservers
airflow-production-task-pod-egress-ssh-puppetservers   app=airflow,component=task-pod,release=production   64s

We can see that task pods now have egress to port 22 of the puppetservers enabled:

runuser@airflow-task-shell-66d66dd999-2mxbr:/opt/airflow$ ./usr/bin/is_port_open puppetserver1001.eqiad.wmnet 22
puppetserver1001.eqiad.wmnet:22 is open
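For readers without access to the task shell image, an equivalent check can be sketched in a few lines of Python. This is not the implementation of the `is_port_open` script shown above, only an assumed approximation of what such a check does: attempt a TCP connection and report whether it succeeds.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    target = ("puppetserver1001.eqiad.wmnet", 22)
    state = "open" if is_port_open(*target) else "closed or filtered"
    print("{}:{} is {}".format(target[0], target[1], state))
```

Note that a successful connection only shows the network policy and firewall allow the traffic; it says nothing about whether the SSH key is accepted.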

Change #1239029 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] dse-k8s-services: add pvc config for airflow-sre

https://gerrit.wikimedia.org/r/1239029

Change #1239029 merged by Elukey:

[operations/deployment-charts@master] dse-k8s-services: add pvc config for airflow-sre

https://gerrit.wikimedia.org/r/1239029

To keep archives happy - I created a new pvc called airflow-sre-webrequest-to-cdn for airflow-sre, and created another one with the same name in airflow-dev (following T396495#11128447) for testing purposes.

The main idea is the following for the DAG:

Hive queries webrequest and saves the result to a file on HDFS ---> the file is copied from HDFS to a persistent volume ---> the file is copied from the pvc to the puppetservers via rsync
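The last two hops of that flow can be sketched as plain command lines. This is only an illustration of the idea, not the actual DAG: the helper name, the paths and the choice of rsync flags are assumptions, and the real tasks run inside Airflow task pods with the ssh configuration mounted by the chart.

```python
import shlex


def build_transfer_commands(hdfs_path, pvc_dir, remote_user, remote_host, remote_dir):
    """Return the HDFS-to-PVC and PVC-to-puppetserver commands as argv lists."""
    file_name = hdfs_path.rsplit("/", 1)[-1]
    local_path = "{}/{}".format(pvc_dir.rstrip("/"), file_name)
    # Step 2: copy the query output from HDFS to the persistent volume,
    # overwriting any copy left by a previous run (-f).
    fetch = ["hdfs", "dfs", "-get", "-f", hdfs_path, local_path]
    # Step 3: push the file from the persistent volume over ssh.
    push = [
        "rsync", "--archive", "--checksum", "-e", "ssh",
        local_path,
        "{}@{}:{}/".format(remote_user, remote_host, remote_dir.rstrip("/")),
    ]
    return [fetch, push]


if __name__ == "__main__":
    # All paths below are placeholders.
    for command in build_transfer_commands(
        "/tmp/example/ips.txt", "/mnt/pvc", "analytics-sre",
        "puppetserver1001.eqiad.wmnet", "/srv/example",
    ):
        print(shlex.join(command))
```

Keeping the persistent volume as an intermediate hop means the rsync step does not need HDFS credentials, and the HDFS step does not need the SSH key.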

Change #1240714 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::puppetserver: simplify analytics-sre's authorized key

https://gerrit.wikimedia.org/r/1240714

Change #1240714 merged by Elukey:

[operations/puppet@production] profile::puppetserver: simplify analytics-sre's authorized key

https://gerrit.wikimedia.org/r/1240714

Change #1241004 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] admin: set home dir for analytics-sre

https://gerrit.wikimedia.org/r/1241004

Change #1241004 merged by Elukey:

[operations/puppet@production] admin: set home dir for analytics-sre

https://gerrit.wikimedia.org/r/1241004

Change #1241012 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::puppetserver: rework and fix the analytics-sre config

https://gerrit.wikimedia.org/r/1241012

Change #1241012 merged by Elukey:

[operations/puppet@production] profile::puppetserver: rework and fix the analytics-sre config

https://gerrit.wikimedia.org/r/1241012

I am finally able to query a week's worth of IPs from webrequest and dump a txt file into puppetserver1001's volatile dir (one IP per line, something that haproxy can ingest).

The airflow dag is still running in the dev env; I'll polish it and send a patch to merge it into main.
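For illustration, the "one IP per line" format can be produced and sanity-checked with a small helper like the one below. This is a sketch using only the standard library `ipaddress` module (the DAG repository added netaddr, which may be what the real job uses); dropping invalid values, deduplicating and sorting are assumptions about what a consumer such as haproxy would want, not something stated in this task.

```python
import ipaddress


def normalize_ip_list(raw_values):
    """Return validated, deduplicated IPs as text with one address per line."""
    seen = set()
    for value in raw_values:
        try:
            seen.add(ipaddress.ip_address(value.strip()))
        except ValueError:
            # Skip anything that is not a valid IPv4 or IPv6 address.
            continue
    # IPv4 and IPv6 objects cannot be compared directly, so sort on
    # (version, integer value) instead.
    ordered = sorted(seen, key=lambda ip: (ip.version, int(ip)))
    return "".join("{}\n".format(ip) for ip in ordered)


if __name__ == "__main__":
    sample = ["192.0.2.10", " 192.0.2.2 ", "not-an-ip", "2001:DB8::1"]
    print(normalize_ip_list(sample), end="")
```

Validating before the file leaves the cluster keeps a malformed query result from reaching the CDN hosts through Puppet.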

Next steps:

  • Refine the query to Hive.
  • File the merge request for the main branch of the airflow dag repo.

Change #1253422 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Increase the size of the WAL volume for postgresql-airflow-sre

https://gerrit.wikimedia.org/r/1253422

Change #1253422 merged by jenkins-bot:

[operations/deployment-charts@master] Increase the size of the WAL volume for postgresql-airflow-sre

https://gerrit.wikimedia.org/r/1253422

Change #1254887 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] dse-k8s-services: update the base Airflow image

https://gerrit.wikimedia.org/r/1254887

Change #1254887 merged by Elukey:

[operations/deployment-charts@master] dse-k8s-services: update the base Airflow image

https://gerrit.wikimedia.org/r/1254887

Change #1283821 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::cache::haproxy: add webrequest-based ip reputation data

https://gerrit.wikimedia.org/r/1283821

Change #1283821 merged by Elukey:

[operations/puppet@production] profile::cache::haproxy: add webrequest-based ip reputation data

https://gerrit.wikimedia.org/r/1283821

Change #1289808 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] haproxy: Enable use_webrequest_ipreputation flag for cp7002/cp7012

https://gerrit.wikimedia.org/r/1289808

Change #1289808 merged by Elukey:

[operations/puppet@production] haproxy: Enable use_webrequest_ipreputation flag for cp7002/cp7012

https://gerrit.wikimedia.org/r/1289808

Change #1290047 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:cache::haproxy: guard webrequest IP reputation data for beta

https://gerrit.wikimedia.org/r/1290047

Change #1290047 merged by Ssingh:

[operations/puppet@production] P:cache::haproxy: guard webrequest IP reputation data for beta

https://gerrit.wikimedia.org/r/1290047

Change #1290767 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::cache::haproxy: move webrequest ip reputation exp to all magru

https://gerrit.wikimedia.org/r/1290767

Change #1290767 merged by Elukey:

[operations/puppet@production] role::cache::haproxy: move webrequest ip reputation exp to all magru

https://gerrit.wikimedia.org/r/1290767

Change #1297641 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::cache::{text,upload}: enable webrequest tagging in eqsin

https://gerrit.wikimedia.org/r/1297641

Change #1297641 merged by Elukey:

[operations/puppet@production] role::cache::{text,upload}: enable webrequest tagging in eqsin

https://gerrit.wikimedia.org/r/1297641

Change #1298318 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::cache::{text,upload}: enable webrequest tagging globally

https://gerrit.wikimedia.org/r/1298318

Change #1298318 merged by Elukey:

[operations/puppet@production] role::cache::{text,upload}: enable webrequest tagging globally

https://gerrit.wikimedia.org/r/1298318