
Repeated failures to resolve an-master100[3-4] from an-launcher1002 - resulting in pipeline failures
Closed, Resolved, Public

Assigned To
Authored By
xcollazo
Aug 26 2025, 4:14 PM

Description

Over the course of the current OpsWeek, I have noticed multiple network issues on an-launcher1002.eqiad.wmnet across multiple use cases.

Here are the detected symptoms:

Job: refine_event_sanitized_analytics_immediate

Example full email.

Occurrences as per email timestamps (ET) on my opsweek:

Aug 21, 3:14 AM
Aug 24, 3:07 AM
Aug 26, 4:14 AM

There is, however, email evidence of the same stack trace since at least Thu, Jul 17, 9:29 AM ET.

Stack trace:

25/08/26 08:14:43 INFO RetryInvocationHandler: java.net.ConnectException: Call From an-launcher1002/10.64.21.109 to an-master1004.eqiad.wmnet:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.getClusterMetrics over an-master1004-eqiad-wmnet after 449 failover attempts. Trying to failover after sleeping for 2022ms.
25/08/26 08:14:45 INFO ConfiguredRMFailoverProxyProvider: Failing over to an-master1003-eqiad-wmnet
Exception in thread "main" java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "an-master1003.eqiad.wmnet":8032; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost
        at sun.reflect.GeneratedConstructorAccessor1.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:768)
        at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:449)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1552)
        at org.apache.hadoop.ipc.Client.call(Client.java:1403)
        at org.apache.hadoop.ipc.Client.call(Client.java:1367)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at com.sun.proxy.$Proxy7.getClusterMetrics(Unknown Source)
        at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:271)
        at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy8.getClusterMetrics(Unknown Source)
        at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:605)
        at org.apache.spark.deploy.yarn.Client.$anonfun$submitApplication$1(Client.scala:179)
        at org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
        at org.apache.spark.internal.Logging.logInfo$(Logging.scala:56)
        at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:65)
        at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:179)
        at org.apache.spark.deploy.yarn.Client.run(Client.scala:1227)
        at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1634)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.UnknownHostException
        at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:450)
        ... 31 more

We have also seen this in recent sqoop failures.
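
To tell intermittent lookup failures apart from a permanently missing record, a small probe run from the launcher host can count failed resolutions over a number of attempts. The following is only a sketch: the function name probe_resolution and its parameters are hypothetical, the two hostnames are copied from the stack trace above, and it only exercises the resolver as seen by Python, not the exact code path that Hadoop's IPC client takes.

```python
import socket
import time
from typing import Callable, Dict, List

# Hostnames copied from the stack trace above.
MASTERS = ["an-master1003.eqiad.wmnet", "an-master1004.eqiad.wmnet"]


def probe_resolution(
    hosts: List[str],
    attempts: int = 5,
    delay_s: float = 1.0,
    resolver: Callable[[str], str] = socket.gethostbyname,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Return the number of failed lookups per host over `attempts` tries."""
    failures = {host: 0 for host in hosts}
    for attempt in range(attempts):
        for host in hosts:
            try:
                resolver(host)
            except OSError:
                # socket.gaierror is a subclass of OSError.
                failures[host] += 1
        if attempt < attempts - 1:
            sleep(delay_s)  # pause between rounds, but not after the last one
    return failures
```

Calling probe_resolution(MASTERS) returns a dict of failure counts per host. The resolver and sleep arguments are injectable so that the logic can be exercised without touching the network.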

Details

Related Changes in Gerrit:
Repo                Branch       Lines +/-
operations/puppet   production   +8 -7
operations/puppet   production   +3 -2
operations/puppet   production   +1 -1
operations/puppet   production   +1 -1
operations/puppet   production   +1 -1
operations/puppet   production   +2 -2
operations/puppet   production   +3 -3
operations/puppet   production   +1 -1
operations/puppet   production   +2 -3
operations/puppet   production   +0 -1
operations/puppet   production   +0 -3
operations/puppet   production   +1 -0
operations/puppet   production   +3 -0
operations/puppet   production   +0 -4
operations/puppet   production   +7 -1
operations/puppet   production   +2 -0
operations/puppet   production   +12 -6
labs/private        master       +0 -0
operations/puppet   production   +8 -0
operations/puppet   production   +1 -1
operations/puppet   production   +4 -1
Related Changes in GitLab:
Title: Install our libssl1.1 package in order to support hadoop
Reference: repos/data-engineering/spark!41
Author: btullis
Source Branch: add_snappy_ssl_support
Dest Branch: main

Event Timeline


Having investigated this, my feeling is that we should create a new VM called an-launcher1003 and migrate the remaining systemd jobs from an-launcher1002 to this new host.

+1 from my side to migrate if you think it will solve the connectivity issue. Plus, we move forward in the datacenter refresh as you mention.

Although an-launcher1002 has 64 GB of RAM, we can see that in the last 30 days, the amount of RAM used has only been around 6 GB.

Right, we no longer run the analytics Airflow instance (which used to peg the CPUs) nor refine on this server (refine is now on Airflow k8s). I think refine-sanitize is the critical workload here?

CC @JAllemandou, @Antoine_Quhen

@BTullis I just thought of another use case of an-launcher1002 that I had forgotten about: running ad-hoc queries while sudoing as the analytics user, to fix bugs in production, etc. One such example is the work I am currently doing for T404975.

I wanted to mention it so that, in the event we do decommission this server, we first have another way to run such jobs.

Thanks @xcollazo - yes, that's a good shout.
In fact, there is already another way of doing this, which is to use the hadoop-shell container that we deploy along with each Airflow instance.

For example, you can see here that an hdfs dfs -ls command shows the contents of /user/analytics

btullis@deploy1003:~$ kube-env airflow-main-deploy dse-k8s-eqiad

btullis@deploy1003:~$ kubectl exec -it airflow-hadoop-shell-798d8f6846-mpssf -- bash

airflow@airflow-hadoop-shell-798d8f6846-mpssf:/opt/airflow$ hdfs dfs -ls
Found 13 items
drwxr-xr-x   - analytics analytics                            0 2025-09-19 00:00 .Trash
drwx------   - analytics analytics                            0 2023-01-25 17:28 .flink
drwxr-x---   - analytics analytics                            0 2025-09-19 16:24 .skein
drwxr-xr-x   - analytics analytics                            0 2025-09-19 16:24 .sparkStaging
drwx------   - analytics analytics                            0 2025-09-19 16:21 .staging
-rw-------   3 analytics analytics                           89 2023-06-23 13:44 aqs_testing_password.txt
drwxr-xr-x   - analytics analytics                            0 2019-12-18 17:13 data
-rw-r-----   3 analytics analytics                           12 2024-11-07 12:39 hello.txt
-rw-------   3 analytics analytics                           16 2019-06-01 07:55 mysql-analytics-labsdb-client-pw.txt
-rw-------   3 analytics analytics                           24 2019-06-01 07:55 mysql-analytics-research-client-pw.txt
-rw-r-----   3 hdfs      analytics                       860259 2024-11-05 14:43 spark-examples_2.10-1.1.1.jar
drwxr-x---   - analytics analytics                            0 2024-10-01 12:20 staging
-rw-r-----   3 analytics analytics-privatedata-users        133 2021-12-06 16:05 swift_auth_analytics_admin.env

This works because it is the Kerberos keytab belonging to the instance, which in this case is analytics, that is used to authenticate you to HDFS.
That's maybe not as convenient or powerful as using a real server to do your work, so for the moment I think that we will continue to put the analytics keytab onto the new server (an-launcher1003) so that it works in the same way as it does on an-launcher1002.

I created T405341 to request a suitably-sized VM and check that Infrastructure-Foundations is happy with the idea.

Change #1190652 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add insetup role and partman config for new an-launcher host

https://gerrit.wikimedia.org/r/1190652

Change #1190652 merged by Btullis:

[operations/puppet@production] Add insetup role and partman config for new an-launcher host

https://gerrit.wikimedia.org/r/1190652

Change #1192107 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Configure an-launcher1003 with its role, but absent job timers

https://gerrit.wikimedia.org/r/1192107

Change #1192120 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Add new dummy keytabs for an-launcher1003

https://gerrit.wikimedia.org/r/1192120

Change #1192120 merged by Btullis:

[labs/private@master] Add new dummy keytabs for an-launcher1003

https://gerrit.wikimedia.org/r/1192120

Change #1192107 merged by Btullis:

[operations/puppet@production] Configure an-launcher1003 with its role, but absent job timers

https://gerrit.wikimedia.org/r/1192107

Change #1192511 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add an-launcher1003 to the list of permitted rsync and nfs clients

https://gerrit.wikimedia.org/r/1192511

Change #1192511 merged by Btullis:

[operations/puppet@production] Add an-launcher1003 to the list of permitted rsync and nfs clients

https://gerrit.wikimedia.org/r/1192511

There are several small puppet issues with an-launcher1003, presumably because this is the first bookworm host with the Hadoop packages installed.

  1. There is an issue with component/openjdk-8-jdk - After this is installed, apt breaks and the error message is:
btullis@an-launcher1003:/etc/apt$ sudo apt update
E: Conflicting values set for option Signed-By regarding source http://apt.wikimedia.org/wikimedia/ bookworm-wikimedia: /etc/apt/keyrings/wikimedia-archive-keyring.gpg != 
E: The list of sources could not be read.

We can test this by manually removing the Signed-By: /etc/apt/keyrings/wikimedia-archive-keyring.gpg line from /etc/apt/sources.list.d/component-jdk8-apt.wikimedia.org-wikimedia-bookworm-wikimedia.sources

This fixes the issue once, but it comes back on the next puppet run.

ponent-jdk8-apt.wikimedia.org-wikimedia-bookworm-wikimedia.sources]/File[/etc/apt/sources.list.d/component-jdk8-apt.wikimedia.org-wikimedia-bookworm-wikimedia.sources]/content: 
--- /etc/apt/sources.list.d/component-jdk8-apt.wikimedia.org-wikimedia-bookworm-wikimedia.sources	2025-09-30 10:06:30.250893981 +0000
+++ /tmp/puppet-file20250930-643901-zqz4y7	2025-09-30 10:16:54.437000084 +0000
@@ -14,4 +14,4 @@
 URIs: http://apt.wikimedia.org/wikimedia
 Suites: bookworm-wikimedia
 Components: component/jdk8
-# Signed-By: /etc/apt/keyrings/wikimedia-archive-keyring.gpg
+Signed-By: /etc/apt/keyrings/wikimedia-archive-keyring.gpg
  2. There is an issue with the libssl1.1 package.
Error: /Stage[main]/Profile::Analytics::Cluster::Packages::Common/Package[libssl1.1]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install libssl1.1' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package libssl1.1
E: Couldn't find any package by glob 'libssl1.1'
E: Couldn't find any package by regex 'libssl1.1'
  3. There is an issue with the libyaml-cpp0.6 package.
Error: /Stage[main]/Profile::Analytics::Cluster::Packages::Statistics/Package[libyaml-cpp0.6]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install libyaml-cpp0.6' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package libyaml-cpp0.6
E: Couldn't find any package by glob 'libyaml-cpp0.6'
E: Couldn't find any package by regex 'libyaml-cpp0.6'
  4. There is an issue with the airflow package.
E: Unable to locate package airflow
Error: /Stage[main]/Airflow/Package[airflow]/ensure: change from 'purged' to '2.10.3-py3.10-20250212' failed: Could not update: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install airflow=2.10.3-py3.10-20250212' returned 100: Reading package lists...

This should be easy to fix, because we no longer want to install airflow on this server.

The other issues will take a little more investigation before deciding what to do.
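
For the Signed-By conflict in the first item above, one way to find which entries disagree is to group every deb822 stanza by URI and suite and report the groups that carry more than one Signed-By value. This is a rough sketch under stated assumptions: it assumes the error arises because two entries describe the same source with different Signed-By values (the message shows one side empty), it uses a simplified parser with hypothetical function names, and it only reads deb822-style .sources content, not one-line sources.list entries.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def parse_stanzas(text: str) -> List[Dict[str, str]]:
    """Parse simplified deb822 text: blank-line-separated 'Key: value' stanzas."""
    stanzas, current, last_key = [], {}, None
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # comment lines are ignored, so a commented Signed-By counts as absent
        if not line.strip():
            if current:
                stanzas.append(current)
            current, last_key = {}, None
            continue
        if line[0] in " \t" and last_key:
            current[last_key] += " " + line.strip()  # continuation line
            continue
        key, _, value = line.partition(":")
        last_key = key.strip().lower()
        current[last_key] = value.strip()
    if current:
        stanzas.append(current)
    return stanzas


def signed_by_conflicts(files: Dict[str, str]) -> Dict[Tuple[str, str], Set[str]]:
    """Map (URI, suite) to its Signed-By values wherever more than one is seen."""
    seen = defaultdict(set)
    for text in files.values():
        for stanza in parse_stanzas(text):
            signed_by = stanza.get("signed-by", "")  # "" means no Signed-By field
            for uri in stanza.get("uris", "").split():
                for suite in stanza.get("suites", "").split():
                    # Trailing slashes are dropped so both spellings of a URI compare equal.
                    seen[(uri.rstrip("/"), suite)].add(signed_by)
    return {source: values for source, values in seen.items() if len(values) > 1}
```

The files argument maps a file name to that file's contents, so the caller decides which files under /etc/apt/sources.list.d/ to read. An empty string in a reported set means that at least one entry for that source has no Signed-By field.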

I have addressed the libssl1.1 issue by copying the debs that were built for haproxy version 2.6 into the thirdparty/bigtop15 component.

btullis@apt1002:~$ sudo -i reprepro -C thirdparty/bigtop15 includedeb bookworm-wikimedia /srv/wikimedia/pool/component/haproxy26/o/openssl11/libssl11-dev_1.1.1w-0+deb11u1+wmf2_amd64.deb
Exporting indices...
btullis@apt1002:~$ sudo -i reprepro -C thirdparty/bigtop15 includedeb bookworm-wikimedia /srv/wikimedia/pool/component/haproxy26/o/openssl11/libssl1.1-dbgsym_1.1.1w-0+deb11u1+wmf2_amd64.deb
Exporting indices...
btullis@apt1002:~$ sudo -i reprepro -C thirdparty/bigtop15 includedeb bookworm-wikimedia /srv/wikimedia/pool/component/haproxy26/o/openssl11/libssl1.1_1.1.1w-0+deb11u1+wmf2_amd64.deb
Exporting indices...
btullis@an-launcher1003:/etc/apt$ apt-cache policy libssl1.1
libssl1.1:
  Installed: (none)
  Candidate: 1.1.1w-0+deb11u1+wmf2
  Version table:
     1.1.1w-0+deb11u1+wmf2 1001
       1001 http://apt.wikimedia.org/wikimedia bookworm-wikimedia/thirdparty/bigtop15 amd64 Packages

The need for these files was discussed in T352744: OpenSSL 3.x performance issues

Change #1192560 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the libyaml-cpp version installed on bookworm

https://gerrit.wikimedia.org/r/1192560

Change #1192560 merged by Btullis:

[operations/puppet@production] Update the libyaml-cpp version installed on bookworm

https://gerrit.wikimedia.org/r/1192560

Change #1192889 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove the airflow profile from the analytics_cluster::launcher role

https://gerrit.wikimedia.org/r/1192889

Change #1192889 merged by Btullis:

[operations/puppet@production] Remove the airflow profile from the analytics_cluster::launcher role

https://gerrit.wikimedia.org/r/1192889

Change #1194570 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Disable monitoring for an-launcher1003

https://gerrit.wikimedia.org/r/1194570

Change #1194570 merged by Btullis:

[operations/puppet@production] Disable monitoring for an-launcher1003

https://gerrit.wikimedia.org/r/1194570

I think that I'm going to reimage an-launcher1003 as bullseye, rather than continue with the process of getting our bigtop packages to work on bookworm.

Although puppet now runs and apt update works reliably, there are still errors that are going to take a reasonable amount of effort to solve.

These include:

  • Package[hive]: depends on python but it is not installable - This is likely to be related to a dependency on python2.7
  • Package[sqoop]: Unable to locate package sqoop - We were unable to build this for bookworm

I will create another ticket for ensuring that our bigtop stack works on bookworm, then we can work on that in isolation from this problem on an-launcher1002

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host an-launcher1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host an-launcher1003.eqiad.wmnet with OS bullseye completed:

  • an-launcher1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510081518_btullis_1838146_an-launcher1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1195767 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] analytics::launcher: Fix the ensure parameter on the drop_event timer

https://gerrit.wikimedia.org/r/1195767

Change #1195767 merged by Btullis:

[operations/puppet@production] analytics::launcher: Fix the ensure parameter on the drop_event timer

https://gerrit.wikimedia.org/r/1195767

Change #1195775 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Enable notifications for an-launcher1003

https://gerrit.wikimedia.org/r/1195775

Change #1195778 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Enable canary events on an-launcher1003

https://gerrit.wikimedia.org/r/1195778

Change #1195779 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove stray hiera value for migrated refinery job

https://gerrit.wikimedia.org/r/1195779

Change #1195780 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Migrate data_check refinery job to an-launcher1003

https://gerrit.wikimedia.org/r/1195780

Change #1195781 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Migrate the hdfs_cleaner refinery jobs to an-launcher1003

https://gerrit.wikimedia.org/r/1195781

Change #1195782 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Migrate the import_*_dumps systemd jobs to an-launcher1003

https://gerrit.wikimedia.org/r/1195782

Change #1195783 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Migrate the project_namespace_map refinery job to an-launcher1003

https://gerrit.wikimedia.org/r/1195783

Change #1195784 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Migrate sqoop jobs to an-launcher1003

https://gerrit.wikimedia.org/r/1195784

Change #1195785 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Migrate the data_purge jobs to an-launcher1003

https://gerrit.wikimedia.org/r/1195785

Change #1195786 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Migrate refine_sanitize jobs to an-launcher1003

https://gerrit.wikimedia.org/r/1195786

Change #1195775 merged by Btullis:

[operations/puppet@production] Enable notifications for an-launcher1003

https://gerrit.wikimedia.org/r/1195775

Change #1195778 merged by Btullis:

[operations/puppet@production] Enable canary events on an-launcher1003

https://gerrit.wikimedia.org/r/1195778

Change #1195780 merged by Btullis:

[operations/puppet@production] Remove the data_check refinery job from both an-launcher hosts

https://gerrit.wikimedia.org/r/1195780

Change #1195781 merged by Btullis:

[operations/puppet@production] Migrate the hdfs_cleaner refinery jobs to an-launcher1003

https://gerrit.wikimedia.org/r/1195781

Change #1195782 merged by Btullis:

[operations/puppet@production] Migrate the import_*_dumps systemd jobs to an-launcher1003

https://gerrit.wikimedia.org/r/1195782

Change #1195783 merged by Btullis:

[operations/puppet@production] Migrate the project_namespace_map refinery job to an-launcher1003

https://gerrit.wikimedia.org/r/1195783

Change #1195785 merged by Btullis:

[operations/puppet@production] Migrate the data_purge jobs to an-launcher1003

https://gerrit.wikimedia.org/r/1195785

Change #1195784 merged by Btullis:

[operations/puppet@production] Migrate sqoop jobs to an-launcher1003

https://gerrit.wikimedia.org/r/1195784

We have migrated most of the workload to an-launcher1003 and it has been running since yesterday without any errors.
One thing that is interesting is that one of the jobs is already exceeding the network throughput that it would have been able to achieve on an-launcher1002.

image.png (958×970 px, 122 KB)

https://grafana.wikimedia.org/goto/0ChlpHgDR?orgId=1

This would previously have been hitting the physical limit of 1 Gbps (~110 MB/s) on an-launcher1002 and contributing to the network congestion that caused the failure to resolve in other jobs.

I'm therefore hopeful that moving this workload to a VM on a host with a 10 Gbps network connection will mitigate this congestion problem.
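
As a quick check on these figures, the conversion from link speed to byte throughput can be worked through as below. The helper name is hypothetical, and the calculation uses decimal units and ignores protocol overhead, which plausibly accounts for the gap between the theoretical 125 MB/s and the ~110 MB/s quoted above.

```python
def line_rate_mb_per_s(gbps: float) -> float:
    """Convert a link speed in Gbps to MB/s (decimal units, 8 bits per byte)."""
    return gbps * 1000 / 8


print(line_rate_mb_per_s(1))                  # 125.0
print(line_rate_mb_per_s(10))                 # 1250.0
print(round(110 / line_rate_mb_per_s(1), 2))  # 0.88
```

On these numbers, ~110 MB/s is roughly 88% of the theoretical 1 Gbps line rate, and a 10 Gbps link raises the ceiling tenfold.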

With a bit of investigation, it's clear which jobs are the most network-heavy. It's all of the jobs covered by this patch.

Specifically, it's these puppet classes:

These classes create the following systemd timers:

Type        Name                                           Start time
MediaWiki   refinery-import-siteinfo-dumps                 02:00
WikiData    refinery-import-wikidata-all-json-dumps        01:00
WikiData    refinery-import-wikidata-all-ttl-dumps         01:30
WikiData    refinery-import-wikidata-lexemes-ttl-dumps     02:00
Commons     refinery-import-commons-mediainfo-ttl-dumps    02:30
Commons     refinery-import-commons-mediainfo-json-dumps   03:00

This makes sense in terms of the network traffic, because the jobs read the dumps over NFS from clouddumps1002.wikimedia.org and then write to HDFS.

All of these timers ultimately use hdfs-rsync with an NFS source and an HDFS target.
e.g.

btullis@an-launcher1003:/usr/local/bin$ sudo cat refinery-import-wikidata-all-json-dumps 
#!/bin/bash
# NOTE: This file is managed by puppet
#

# Rsync wikidata dumps to HDFS
/usr/local/bin/hdfs-rsync \
    --recursive           \
    --times               \
    --delete              \
    --prune-empty-dirs    \
    --chmod=go-w          \
    --include "/[0-9]*"   \
    --include "/*/*.json.bz2" \
    --exclude "**"        \
    file:/mnt/data/xmldatadumps/public/wikidatawiki/entities/ \
    hdfs://analytics-hadoop/wmf/data/raw/wikidata/dumps/all_json

# Touch flag in all folders to let Airflow start jobs
for f in $(/usr/bin/hdfs dfs -ls /wmf/data/raw/wikidata/dumps/all_json | awk '{print $8}')
do
    /usr/bin/hdfs dfs -touchz $f/_IMPORTED
done
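
The flag-touching loop at the end of that script relies on the eighth column of the hdfs dfs -ls output. For anyone reimplementing this step elsewhere, here is a minimal sketch of the same column extraction in Python. The function names are hypothetical, and the sketch only computes the flag paths; it does not call hdfs itself. It skips the "Found N items" header explicitly and keeps paths that contain spaces intact.

```python
from typing import List


def listed_paths(ls_output: str) -> List[str]:
    """Extract the path column from `hdfs dfs -ls` output."""
    paths = []
    for line in ls_output.splitlines():
        # Columns: permissions, replication, owner, group, size, date, time, path.
        fields = line.strip().split(None, 7)
        if len(fields) < 8:
            continue  # skips the "Found N items" header and blank lines
        paths.append(fields[7])
    return paths


def flag_paths(ls_output: str, flag: str = "_IMPORTED") -> List[str]:
    """Return the flag file path for every listed entry."""
    return [f"{path.rstrip('/')}/{flag}" for path in listed_paths(ls_output)]
```

Each returned path could then be passed to hdfs dfs -touchz, as the shell loop does.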

I'm not sure that we need to re-engineer anything right now, but it's pretty clear that this touches on the work that we are planning to carry out in T405360: Implement an Airflow operator for moving data from point A to B.

Change #1195779 merged by Btullis:

[operations/puppet@production] Migrate the refine_netflow job to an-launcher1003

https://gerrit.wikimedia.org/r/1195779

Mentioned in SAL (#wikimedia-analytics) [2025-10-22T11:19:55Z] <btullis> migrated refine_netflow from an-launcher1002 to an-launcher1003 for T402943

Change #1195786 merged by Btullis:

[operations/puppet@production] Migrate refine_sanitize jobs to an-launcher1003

https://gerrit.wikimedia.org/r/1195786

Mentioned in SAL (#wikimedia-analytics) [2025-10-22T16:53:19Z] <btullis> migrated refine_sanitize from an-launcher1002 to an-launcher1003 for T402943

We have now migrated all of the workload from an-launcher1002 to an-launcher1003, so I think that we can tentatively call this done.

Change #1201301 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add the python3-pymysql package to the analytics::refinery profile

https://gerrit.wikimedia.org/r/1201301

Change #1201301 merged by Btullis:

[operations/puppet@production] Add the python3-pymysql package to the analytics::refinery profile

https://gerrit.wikimedia.org/r/1201301