
Switch over the Dumps_v1 system to run from Airflow instead of snapshot servers
Closed, Resolved · Public

Description

This ticket tracks the go-live preparation and process for the parent epic:
T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes

The intention is to switch over the service to Airflow in time for the first full set of dumps on July 1st.

I would like to continue to run the July 1st dumps in parallel on the snapshot servers, but to stop publishing them to the clouddumps servers.
This means that we would still have a backup process for the dumps, should it be required.

It will also place additional load on the ES (external storage) database servers, from the article dumps.

Event Timeline

@Marostegui @Ladsgroup - I'd like to check in with you about our plans for this switch-over of the Dumps v1 system to Airflow, which is scheduled for next Tuesday.

Currently, the plan is for us to run two sets of dumps in parallel, just for this July 1st run.

  • The airflow based system will publish the dumps results to clouddumps100[1-2]:/srv/dumps/xmldatadumps/public
  • The existing dumps that run on the snapshot servers will still run, but dumps will no longer be synced from the dumpsdata servers to the clouddumps servers.

The idea is that the existing system will just be running as a backup that we can use if we need it. We can also use it for data-consistency checks.
If all is well, we can stop running the backups on the snapshot hosts before the July 20th partial run.

This plan has implications for the DB servers:

  • the external store sections, specifically for the article dumps.
  • the s8 vslow replica in eqiad, which I believe is db1167 at the moment. This is because of T389199, which means that the wikibase dumps still use it.

I intend to copy the 20250601 full set of dumps to the cephfs volume used by the airflow workers, so that the prefetch mechanism is available for this run.
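As a sketch, that seeding step amounts to a recursive copy of the prior full run into the volume the workers mount. All paths below are hypothetical stand-ins (the real dumpsdata source and cephfs mount point are not named in this ticket), and in production this would more likely be an rsync between hosts:

```shell
# All paths are hypothetical stand-ins for the real dumpsdata source and
# cephfs mount point used by the Airflow workers.
SRC=/tmp/demo-dumpsdata/xmldatadumps/public
DST=/tmp/demo-cephfs/mediawiki-dumps-legacy-fs

# Simulate one file from the 20250601 full run on the source side.
mkdir -p "$SRC/enwiki/20250601" "$DST"
touch "$SRC/enwiki/20250601/enwiki-20250601-stub-meta-history.xml.gz"

# Copy the prior run across, preserving layout, so the workers' prefetch
# logic can locate files from the previous dump by date directory.
cp -a "$SRC/enwiki" "$DST/"

ls "$DST/enwiki/20250601"
```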

The dumps will kick off at 08:00 UTC next Tuesday, but the article dump jobs will start at different times, depending on the size of the wikis and their different rates of progress.

In terms of parallelism, the current settings in Airflow are to run up to:

  • 32 regular wiki dump tasks concurrently
  • 16 large wiki dump tasks concurrently

These values can be adjusted dynamically (at https://airflow-test-k8s.wikimedia.org/cluster_activity) if they cause an issue.
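Airflow enforces caps like these through pool slots and task-concurrency settings; the effect on the workers is semaphore-like, with excess tasks queuing until a slot frees up. A purely illustrative Python sketch of that behaviour (not Airflow code; the limit mirrors the 32-task regular-wiki setting above):

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Illustrative only: Airflow applies this cap via pools, but the observable
# effect is that of a bounded semaphore around task execution.
REGULAR_LIMIT = 32

slots = threading.BoundedSemaphore(REGULAR_LIMIT)
lock = threading.Lock()
running = 0
peak = 0

def dump_task(wiki):
    global running, peak
    with slots:                    # blocks once 32 tasks are in flight
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.005)          # stand-in for the actual dump work
        with lock:
            running -= 1

with ThreadPoolExecutor(max_workers=100) as ex:
    list(ex.map(dump_task, [f"wiki{i}" for i in range(200)]))

print(peak)  # never exceeds REGULAR_LIMIT
```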


Are you OK with this approach, in general? Have you any concerns about whether the ES clusters can support two sets of dumps running in parallel?

Is there anything else that you feel might help us to make this switchover safer, from the DB perspective? Thanks.

Change #1164150 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Dumps_v1: Disable the sync job that publishes from dumpsdata servers

https://gerrit.wikimedia.org/r/1164150

Change #1164157 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Dumps_v1: Stop updating dumps monitor HTML/JSON from the legacy system

https://gerrit.wikimedia.org/r/1164157

> @Marostegui @Ladsgroup - I'd like to check in with you about our plans for this switch-over of the Dumps v1 system to Airflow, which is scheduled for next Tuesday.
> Are you OK with this approach, in general? Have you any concerns about whether the ES clusters can support two sets of dumps running in parallel?
>
> Is there anything else that you feel might help us to make this switchover safer, from the DB perspective? Thanks.

Manuel is ooo this week. From my POV, we have very little capacity in ES. We are adding replicas for next FY but they are not done yet. OTOH, to my understanding these must be behind memcached so it actually shouldn't cause a major uptick in queries. Please ping us the second they start so we can keep an eye on them.

> Please ping us the second they start so we can keep an eye on them.

OK, thanks. Understood. They're due to start at 08:00 UTC on Tuesday July 1st.
We can monitor them here: https://airflow-test-k8s.wikimedia.org/home?tags=full-dump

> ...to my understanding these must be behind memcached so it actually shouldn't cause a major uptick in queries.

How can we check whether or not this is the case?

I'm looking at an example dump pod and I don't see an mcrouter sidecar or much in the way of memcached configuration.
e.g.

btullis@deploy1003:~$ kube-env mediawiki-dumps-legacy-deploy dse-k8s-eqiad

btullis@deploy1003:~$ kubectl describe pod enwiki-sql-xml-enwiki-dump-xmlstubsdump-partial-6jeydg3
Name:         enwiki-sql-xml-enwiki-dump-xmlstubsdump-partial-6jeydg3
Namespace:    mediawiki-dumps-legacy
Priority:     0
Node:         dse-k8s-worker1006.eqiad.wmnet/10.64.132.8
Start Time:   Thu, 26 Jun 2025 12:11:44 +0000
Labels:       airflow_kpo_in_cluster=True
              airflow_version=2.10.5
              app=mediawiki
              dag_id=mediawiki_dumps_sql_xml_large_a_to_z_partial
              deployment=mediawiki-dumps-legacy
              kubernetes_pod_operator=True
              release=production
              run_id=scheduled__2025-05-20T0800000000-e08291d03
              task_id=enwiki.dump_xmlstubsdump_partial
              try_number=3
Annotations:  cni.projectcalico.org/containerID: edf835f5e7ea3c773b10b9581accc07f5ede6f8b9c20ec08b558af35d4e664a3
              cni.projectcalico.org/podIP: 10.67.26.102/32
              cni.projectcalico.org/podIPs: 10.67.26.102/32,2620:0:861:302:32a6:61fe:4eeb:d0e6/128
              container.seccomp.security.alpha.kubernetes.io/mediawiki-dump-sql-xml: runtime/default
              container.seccomp.security.alpha.kubernetes.io/mediawiki-production-rsyslog: runtime/default
              container.seccomp.security.alpha.kubernetes.io/mediawiki-production-tls-proxy: runtime/default
              pod.kubernetes.io/sidecars: mediawiki-production-tls-proxy,mediawiki-production-rsyslog
Status:       Running
IP:           10.67.26.102
IPs:
  IP:  10.67.26.102
  IP:  2620:0:861:302:32a6:61fe:4eeb:d0e6
Containers:
  mediawiki-dump-sql-xml:
    Container ID:  containerd://b432a3108d671a801e40812deff113df75725d6ba91b2cb032aeed504ba85297
    Image:         docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-cli:2025-06-19-202638-publish-81
    Image ID:      docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-cli@sha256:ae225f27c5166564ebc7a8b8bbd8bbfa6944c0ea1f309a137bba06657c57fc5c
    Port:          <none>
    Host Port:     <none>
    Command:
      /srv/deployment/dumps/xmldumps-backup/worker
    Args:
      --date
      20250620
      --skipdone
      --log
      --configfile
      /etc/dumps/confs/wikidump.conf.dumps:enwiki
      --wiki
      enwiki
      --job
      xmlstubsdump
      --skipjobs
      metahistorybz2dump,metahistorybz2dumprecombine,metahistory7zdump,metahistory7zdumprecombine
    State:          Running
      Started:      Thu, 26 Jun 2025 12:12:18 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     4
      memory:  8Gi
    Requests:
      cpu:     2
      memory:  4Gi
    Environment:
      SERVERGROUP:                            kube-dumps
      PHP__opcache__memory_consumption:       500
      PHP__opcache__max_accelerated_files:    32531
      PHP__opcache__interned_strings_buffer:  50
      PHP__auto_prepend_file:                 /srv/mediawiki/wmf-config/PhpAutoPrepend.php
      FPM__request_terminate_timeout:         201
      PHP__apc__shm_size:                     768M
      FPM__pm__max_children:                  8
      FPM__request_slowlog_timeout:           5
      PHP__display_errors:                    Off
      PHP__error_reporting:                   30719
      PHP__pcre__backtrack_limit:             5000000
      PHP__max_execution_time:                210
      PHP__error_log:                         /var/log/php-fpm/error.log
      FCGI_ALLOW:                             127.0.0.1
      FPM__slowlog:                           /var/log/php-fpm/slowlog.log
      ENVOY_MW_API_HOST:                      http://localhost:6501
    Mounts:
      /etc/dumps/confs from mediawiki-dumps-legacy-configs-volume (ro)
      /etc/dumps/dblists from mediawiki-dumps-legacy-dblists-volume (ro)
      /etc/dumps/templs from mediawiki-dumps-legacy-templates-volume (ro)
      /etc/wikimedia-cluster from mediawiki-production-wikimedia-cluster (rw,path="wikimedia-cluster")
      /etc/wmerrors from mediawiki-production-wmerrors (rw)
      /mnt/dumpsdata from mediawiki-production-dumps (rw)
      /run/shared from shared-socket (rw)
      /usr/share/GeoIP/ from mediawiki-production-geoip (ro)
      /usr/share/GeoIPInfo/ from mediawiki-production-geoipinfo (ro)
      /var/log/php-fpm from php-logging (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hn2g2 (ro)
  mediawiki-production-tls-proxy:
    Container ID:   containerd://5f9c5eb1de4a869d939c7bad64dfd82cf419899757df4fa0d3ebd63339a6f150
    Image:          docker-registry.discovery.wmnet/envoy:1.23.10-3
    Image ID:       docker-registry.discovery.wmnet/envoy@sha256:1b47d8501df480d605a7aa2163d6ae130f3f5ec6d54f6e28a74d559b397f71b9
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Thu, 26 Jun 2025 12:12:18 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  500Mi
    Requests:
      cpu:      200m
      memory:   100Mi
    Readiness:  http-get http://:9361/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      SERVICE_NAME:    production
      SERVICE_ZONE:    default
      CONCURRENCY:     12
      ADMIN_PORT:      1666
      DRAIN_TIME_S:    600
      DRAIN_STRATEGY:  gradual
    Mounts:
      /etc/envoy/ from envoy-config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hn2g2 (ro)
  mediawiki-production-rsyslog:
    Container ID:   containerd://5e5a2ad0f609c44b1b42b98f600cdede31a3f8de779b6ee9c8e852d3e26308d4
    Image:          docker-registry.discovery.wmnet/rsyslog:8.2102.0-3
    Image ID:       docker-registry.discovery.wmnet/rsyslog@sha256:4b47de25884bad139ce97dff9b09b785ed53e22fc19534edbbb7ff1c76c932dd
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Thu, 26 Jun 2025 12:12:18 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  300Mi
    Requests:
      cpu:     100m
      memory:  200Mi
    Environment:
      KUBERNETES_NAMESPACE:   mediawiki-dumps-legacy (v1:metadata.namespace)
      KUBERNETES_NODE:         (v1:spec.nodeName)
      KUBERNETES_POD_NAME:    enwiki-sql-xml-enwiki-dump-xmlstubsdump-partial-6jeydg3 (v1:metadata.name)
      KUBERNETES_RELEASE:      (v1:metadata.labels['release'])
      KUBERNETES_DEPLOYMENT:   (v1:metadata.labels['deployment'])
    Mounts:
      /etc/rsyslog.d from mediawiki-production-rsyslog-config (rw)
      /var/log/php-fpm from php-logging (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hn2g2 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  mediawiki-production-wikimedia-cluster:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      mediawiki-production-wikimedia-cluster-config
    Optional:  false
  envoy-config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      mediawiki-production-envoy-config-volume
    Optional:  false
  shared-socket:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  mediawiki-production-wmerrors:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      mediawiki-production-wmerrors
    Optional:  false
  mediawiki-production-rsyslog-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      mediawiki-production-rsyslog-config
    Optional:  false
  php-logging:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  mediawiki-production-geoip:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/share/GeoIP
    HostPathType:  
  mediawiki-production-geoipinfo:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/share/GeoIPInfo
    HostPathType:  
  mediawiki-production-dumps:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  mediawiki-dumps-legacy-fs
    ReadOnly:   false
  mediawiki-dumps-legacy-configs-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      mediawiki-dumps-legacy-configs
    Optional:  false
  mediawiki-dumps-legacy-templates-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      mediawiki-dumps-legacy-templates
    Optional:  false
  mediawiki-dumps-legacy-dblists-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      mediawiki-dumps-legacy-dblists
    Optional:  false
  kube-api-access-hn2g2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>
btullis@deploy1003:~$

If you want to get a bash shell to check it out, you could use our toolbox pod.

btullis@deploy1003:~$ kube-env mediawiki-dumps-legacy-deploy dse-k8s-eqiad

btullis@deploy1003:~$ kubectl exec -it mediawiki-dumps-legacy-toolbox-5fc88c9c76-k7l8c -- bash
Defaulted container "toolbox" out of: toolbox, mediawiki-dumps-legacy-resources-tls-proxy
www-data@mediawiki-dumps-legacy-toolbox-5fc88c9c76-k7l8c:/$

Memcached is central, not a sidecar, in our production; and even if there were a sidecar, that would just be another layer. If it uses MediaWiki (and SqlBlobStore), it'll go through memcached. More information: https://wikitech.wikimedia.org/wiki/Memcached_for_MediaWiki
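The read path described here is the standard cache-aside pattern: each blob address is fetched from the ES database at most once per cache lifetime, and repeat reads are served from memcached. A minimal sketch of that pattern (not MediaWiki's actual SqlBlobStore code; the dict stands in for memcached, and the address format is only illustrative):

```python
# Minimal cache-aside sketch; not MediaWiki's actual SqlBlobStore code.
class BlobStore:
    def __init__(self):
        self.cache = {}       # stands in for memcached
        self.db_reads = 0     # counts reads that reach the ES database

    def _fetch_from_es(self, address):
        self.db_reads += 1
        return f"blob-at-{address}"

    def get(self, address):
        if address not in self.cache:        # cache miss: one DB read
            self.cache[address] = self._fetch_from_es(address)
        return self.cache[address]           # repeat reads hit the cache

store = BlobStore()
for _ in range(2):                           # read each blob twice
    for addr in ("tt:101", "tt:102"):
        store.get(addr)

print(store.db_reads)  # 2: each blob reached the DB only once
```

This is why a second, parallel pass over the same revisions shouldn't double the ES query load, provided the cache is warm and entries haven't expired.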

The current SQL/XML dumps are idle right now, so I think that it is a good time to switch over.


Change #1164150 merged by Btullis:

[operations/puppet@production] Dumps_v1: Disable the sync job that publishes from dumpsdata servers

https://gerrit.wikimedia.org/r/1164150

Change #1164157 merged by Btullis:

[operations/puppet@production] Dumps_v1: Stop updating dumps monitor HTML/JSON from the legacy system

https://gerrit.wikimedia.org/r/1164157