
Migrate analytics_test airflow instance to bullseye an-test-client1002
Closed, ResolvedPublic

Description

As part of the Bullseye upgrade we need to test Airflow functionality on Bullseye. an-test-client1002 is already running on Bullseye.
We can only have one Airflow scheduler running for the test cluster at a time, as detailed in client.pp#L18-L30:

"We run this here in the analytics-test cluster because we don't have a 'launcher' role node there, and we can't run hive clients on the same node as the hive server, as we use dns_canonicalize_hostname=true there, which causes Hive Kerberos authentication to fail from that host.
we only want airflow on ONE client instance. This conditional is a hack to ensure that if someone ever creates more an-test-client instances, that the airflow-analytics-test instance is not created there accidentally."

The airflow 2.6.1 deb is already available for Bullseye, so to test full functionality we need to:

  • Schedule downtime and stop the scheduler job on an-test-client1001
  • Schedule downtime and disable puppet on an-test-client1001
  • Remove an-test-client1001 from ::profile::airflow and replace it with an-test-client1002
  • Verify the status and functionality of Airflow on Bullseye.
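
The first three steps above could be sketched as a dry run like the following; it only echoes the commands rather than executing them, the unit names are the ones discussed later in this task, and the `disable-puppet` wrapper is an assumption about what is available on the host:

```shell
# Dry run of the an-test-client1001 shutdown steps: prints each command
# instead of executing it. Drop the `echo` (and add sudo) to run for real.
units="airflow-scheduler@analytics_test.service
wmf_auto_restart_airflow-scheduler@analytics_test.service
wmf_auto_restart_airflow-scheduler@analytics_test.timer"

echo "disable-puppet 'migrate airflow to an-test-client1002 - T341700'"
for u in $units; do
  echo "systemctl stop $u"
done
for u in $units; do
  echo "systemctl disable $u"
done
echo "systemctl daemon-reload"
```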

Event Timeline

Stevemunene renamed this task from Migrate analytics to Migrate analytics_test airflow instance to bullseye an-test-client1002 .Jul 12 2023, 2:38 PM
Stevemunene claimed this task.
Stevemunene updated the task description. (Show Details)
Stevemunene moved this task from Incoming to In Progress on the Data-Platform-SRE board.
Stevemunene removed subscribers: odimitrijevic, Ottomata, Aklapper and 6 others.

Change 937577 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Change analytics_test airflow to use an-test-client1002

https://gerrit.wikimedia.org/r/937577

Change 937577 merged by Stevemunene:

[operations/puppet@production] Change analytics_test airflow to use an-test-client1002

https://gerrit.wikimedia.org/r/937577

The Airflow services are running OK on an-test-client1002 with zero errors/alerts. Next is to find an alternative way to ensure the Airflow scheduler is not present on more than one test client.

Next is to find an alternative way to ensure the Airflow scheduler is not present on more than one test client.

I don't really see why we would need to do that though. Can't we simply remove the packages manually from an-test-client1001 for now? Or reimage it?
It's going to be decommissioned soon anyway, isn't it?

Change 938803 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Deploy airflow version 2.6.3 to analytics_test

https://gerrit.wikimedia.org/r/938803

Change 938803 merged by Btullis:

[operations/puppet@production] Deploy airflow version 2.6.3 to analytics_test

https://gerrit.wikimedia.org/r/938803

Could we first disable the jobs, then notify our users to move Airflow-related work and all other related work to an-test-client1002, as we prepare for the eventual decommission of an-test-client1001?

systemctl stop airflow-scheduler@analytics_test.service
systemctl stop wmf_auto_restart_airflow-scheduler@analytics_test.service
systemctl stop wmf_auto_restart_airflow-scheduler@analytics_test.timer

systemctl disable airflow-scheduler@analytics_test.service
systemctl disable wmf_auto_restart_airflow-scheduler@analytics_test.service
systemctl disable wmf_auto_restart_airflow-scheduler@analytics_test.timer

systemctl daemon-reload

Then re-enable Puppet.

We will need to update this: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags-scap-analytics_test/-/blob/main/targets
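
The targets file in that repo is just a list of deploy hostnames, so this is a one-line edit. A sketch, recreating the assumed current content in a scratch directory rather than touching the real repo:

```shell
# Sketch: the scap targets file is one hostname per line (current content
# assumed); the migration is a single substitution.
cd "$(mktemp -d)"
printf 'an-test-client1001.eqiad.wmnet\n' > targets
sed -i 's/an-test-client1001/an-test-client1002/' targets
cat targets   # → an-test-client1002.eqiad.wmnet
```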

Could we first disable the jobs, then notify our users to move Airflow-related work and all other related work to an-test-client1002, as we prepare for the eventual decommission of an-test-client1001?

We will need to update the hostname here: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow/Instances#analytics_test

image.png (408×919 px, 57 KB)

Then we will need to do an airflow-dags deployment to make sure that the DAGs are up to date on an-test-client1001.

systemctl stop airflow-scheduler@analytics_test.service
systemctl stop wmf_auto_restart_airflow-scheduler@analytics_test.service
systemctl stop wmf_auto_restart_airflow-scheduler@analytics_test.timer

systemctl disable airflow-scheduler@analytics_test.service
systemctl disable wmf_auto_restart_airflow-scheduler@analytics_test.service
systemctl disable wmf_auto_restart_airflow-scheduler@analytics_test.timer

systemctl daemon-reload

Then re-enable Puppet.

Yes, that's fine, but you could also apt purge airflow and/or go straight to the decommissioning.
Is there any other reason why we particularly want to keep an-test-client1001 around? You could check with users who have stuff in /home whether they're happy for it to be deleted.

Mentioned in SAL (#wikimedia-analytics) [2023-07-18T13:20:25Z] <stevemunene> deploy airflow-dags to an-test-client1002 T341700

Updated the targets to an-test-client1002.
Updated the docs as well: Airflow/Instances.
Did the scap deploy, but the host shown is still an-test-client1001.eqiad.wmnet:

stevemunene@deploy1002:/srv/deployment/airflow-dags/analytics_test$ git fetch && git rebase
First, rewinding head to replay your work on top of it...
Fast-forwarded update_airflow_2_6 to refs/remotes/origin/update_airflow_2_6.
stevemunene@deploy1002:/srv/deployment/airflow-dags/analytics_test$ scap deploy
13:35:17 Started deploy [airflow-dags/analytics_test@be05071]
13:35:17 Deploying Rev: HEAD = c203642a6fb793feb1b52141e51d9a6100a75ca9
13:35:17 Started deploy [airflow-dags/analytics_test@be05071]: (no justification provided)
13:35:17 
== DEFAULT ==
:* an-test-client1001.eqiad.wmnet
13:35:19 airflow-dags/analytics_test: fetch stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0) |
13:35:19 airflow-dags/analytics_test: config_deploy stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0) |
13:35:20 airflow-dags/analytics_test: promote stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0) |
13:35:20 default deploy successful
13:35:20 
== DEFAULT ==
:* an-test-client1001.eqiad.wmnet
13:35:21 airflow-dags/analytics_test: finalize stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0) |
13:35:21 default deploy successful
13:35:21 Finished deploy [airflow-dags/analytics_test@be05071]: (no justification provided) (duration: 00m 03s)
13:35:21 Finished deploy [airflow-dags/analytics_test@be05071] (duration: 00m 03s)
stevemunene@deploy1002:/srv/deployment/airflow-dags/analytics_test$ logout

OK, it looks like the instructions are a little incomplete.
For each instance in here: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow/Instances#List_of_instances
...we have a set of instructions for deploying the DAGs. However, we don't include a check to make sure that the scap configuration itself is up-to-date.
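
Such a check could look roughly like this; `scap_config_fresh` is a hypothetical helper, not part of scap, and it simply compares the local checkout against its origin:

```shell
# Hypothetical pre-deploy check: succeed only if the local scap config
# checkout matches its origin. Not an existing scap feature.
scap_config_fresh() {
  repo="$1"
  git -C "$repo" fetch --quiet origin || return 2
  [ "$(git -C "$repo" rev-parse HEAD)" = "$(git -C "$repo" rev-parse origin/HEAD)" ]
}

# Usage on the deploy host:
#   scap_config_fresh /srv/deployment/airflow-dags/analytics_test/scap \
#     || echo "scap config out of date; run git pull before scap deploy"
```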

In this case, I pulled the latest version of the scap config, then verified with git log -n 1

btullis@deploy1002:/srv/deployment/airflow-dags/analytics_test/scap$ git pull
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 4 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (4/4), done.
From https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags-scap-analytics_test
   8e87119..930ea7e  main       -> origin/main
Updating 8e87119..930ea7e
Fast-forward
 targets | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
btullis@deploy1002:/srv/deployment/airflow-dags/analytics_test/scap$ git log -n 1
commit 930ea7eb6b8276569d390ba9b2bd8aa1776be141 (HEAD -> main, origin/main, origin/HEAD)
Merge: 8e87119 cae7823
Author: Stevemunene <smunene@wikimedia.org>
Date:   Tue Jul 18 13:13:34 2023 +0000

    Merge branch 'change_test_instance_to_1002' into 'main'
    
    Change test instance scap target to an-test-client1002
    
    See merge request repos/data-engineering/airflow-dags-scap-analytics_test!1

Then I did another scap deploy from the parent directory.

btullis@deploy1002:/srv/deployment/airflow-dags/analytics_test$ scap deploy
09:14:08 Started deploy [airflow-dags/analytics_test@be05071]
09:14:08 Deploying Rev: HEAD = c203642a6fb793feb1b52141e51d9a6100a75ca9
09:14:08 Started deploy [airflow-dags/analytics_test@be05071]: (no justification provided)
09:14:08 
== DEFAULT ==
:* an-test-client1002.eqiad.wmnet
09:14:09 airflow-dags/analytics_test: fetch stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0) |
09:14:10 airflow-dags/analytics_test: config_deploy stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0) |
09:14:11 airflow-dags/analytics_test: promote stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0) |
09:14:11 default deploy successful
09:14:11 
== DEFAULT ==
:* an-test-client1002.eqiad.wmnet
09:14:12 airflow-dags/analytics_test: finalize stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0) |
09:14:12 default deploy successful
09:14:12 Finished deploy [airflow-dags/analytics_test@be05071]: (no justification provided) (duration: 00m 04s)
09:14:12 Finished deploy [airflow-dags/analytics_test@be05071] (duration: 00m 04s)

Looks OK. I'll check the airflow web UI now and make sure that the logs look OK.
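
For the service side of that verification, something like the following could summarise unit health on an-test-client1002; `unit_ok` is a made-up helper, and the real check is just `systemctl is-active` against the units listed earlier in this task:

```shell
# Tiny hypothetical helper: turn `systemctl is-active` output into a
# pass/fail summary. Not an existing tool.
unit_ok() {
  if [ "$1" = "active" ]; then echo "OK"; else echo "NOT RUNNING"; fi
}

# On the host:
#   unit_ok "$(systemctl is-active airflow-scheduler@analytics_test.service)"
unit_ok active      # → OK
unit_ok inactive    # → NOT RUNNING
```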