
New airflow instance related to Image Suggestion Jobs
Closed, Resolved | Public | 3 Estimated Story Points

Description

Spin up a new Airflow server to serve as the new platform_eng instance, following the conventions from airflow_dags. Announce the change so that anyone potentially affected is aware.

Event Timeline

EChetty renamed this task from New airflow instance related to T311417 to New airflow instance related to Image Suggestion Jobs. Jul 25 2022, 2:44 PM

Some more context:

We do have a platform_eng instance as described here https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow#platform_eng. However, this instance does not follow the new conventions from the airflow-dags project. So the ask is to:

  1. Not touch the current platform_eng Airflow instance since the production image_suggestions dag is running on it.
  2. Create a new Airflow instance that will be the new platform_eng instance. Make sure there are no name conflicts?

Later, we will test the image_suggestions dag on the new instance, and after a successful run, we can nuke the old one.

Synced up with @mforns on this task. We will attempt to move it forward as much as we can until we get an SRE to help.

I will be working on the scaffolding of the scap gitlab repo, as per https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow#Create_the_instance_specific_scap_repository

EChetty updated the task description.

Put together what I think the correct scap configuration is at https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags-scap-platform_eng.

Followed changes done to spin up the research instance on this commit by @Ottomata.

For the target hostname, I chose an-airflow1004.eqiad.wmnet since an-airflow1003.eqiad.wmnet is already taken by the old platform_eng instance that we want to keep until we test the new one.

xcollazo changed the task status from Open to Stalled. Jul 26 2022, 8:12 PM

While following the instructions to make the puppet changes for https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow#Create_a_scap_deployment_source, I hit a wall. It seems @Ottomata had set things up so that converting the current platform_eng Airflow instance would be a simple config change, as seen here: https://github.com/wikimedia/puppet/blob/production/hieradata/role/common/analytics_cluster/airflow/platform_eng.yaml#L53-L57. However, since the prod run of the image_suggestions dag is on the original server, I believe that going forward with this change would nuke it.

So without further context from an SRE I think I shouldn't touch the code further.

Change 817774 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[operations/puppet@production] airflow - Modify platform_eng instance to do deployment of airflow-dags

https://gerrit.wikimedia.org/r/817774

(not confident about the patch above, but still wanted to have something for review.)

JArguello-WMF changed the task status from Stalled to Open. Aug 1 2022, 3:27 PM

The original request for an-airflow1003 was here: T284225: Create airflow instances for Platform Engineering and Research. The VM request form was submitted here: T284934: Site: 2 VM request for an-airflow100{2,3}

I will complete another such form for the an-airflow1004 machine, as a sub-task of this ticket.

I think what I'd do regarding puppet would be to rename the analytics_cluster::airflow::platform_eng role to analytics_cluster::airflow::platform_eng_legacy and apply this to an-airflow1003 - to indicate that this machine will be decommissioned once the migration is complete.

Change 820122 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add DHCP details for an-airflow1004

https://gerrit.wikimedia.org/r/820122

Change 820122 merged by Btullis:

[operations/puppet@production] Add DHCP details for an-airflow1004

https://gerrit.wikimedia.org/r/820122

Change 820126 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Configure an-airflow1004 to install with buster

https://gerrit.wikimedia.org/r/820126

Change 820126 merged by Btullis:

[operations/puppet@production] Configure an-airflow1004 to install with buster

https://gerrit.wikimedia.org/r/820126

@xcollazo - The new airflow VM is up and running now, but I have just put it into the insetup role, which means it's ready to be assigned a puppet role.

btullis@marlin-wsl:~$ ssh an-airflow1004.eqiad.wmnet
Linux an-airflow1004 4.19.0-21-amd64 #1 SMP Debian 4.19.249-2 (2022-06-30) x86_64
Debian GNU/Linux 10 (buster)
an-airflow1004 is a Host being setup for later application of a role (insetup)
The last Puppet run was at Wed Aug  3 16:32:38 UTC 2022 (1 minutes ago).
Last puppet commit: (f64a94548d) Dan Andreescu - role::common::aqs: update mw history
Debian GNU/Linux 10 auto-installed on Wed Aug 3 13:40:42 UTC 2022.

I was wondering actually, maybe @Ottomata might have an opinion...

What would happen if we were to share a database between an-airflow1003 and an-airflow1004 temporarily?

We could assign both hosts to the analytics_cluster::airflow::platform_eng role but override the profile::airflow::use_wmf_defaults parameter in a host specific hiera file.

Would you think that this is workable, or would you advise that we use two roles for the two servers? How long do we think that platform engineering would like to keep the two servers running in parallel?
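To illustrate the host-level override idea above, here is a minimal Python sketch of a first-match hierarchical lookup, in which a host-specific layer is consulted before the role layer. It is a simplified model rather than Puppet's actual implementation; the function names and the True/False values below are hypothetical, and only the key name profile::airflow::use_wmf_defaults comes from this thread.

```python
from typing import Any, Mapping, Sequence


def hiera_lookup(key: str, layers: Sequence[Mapping[str, Any]]) -> Any:
    """Return the value from the first (most specific) layer that defines key."""
    for layer in layers:
        if key in layer:
            return layer[key]
    raise KeyError(key)


KEY = "profile::airflow::use_wmf_defaults"

# Hypothetical role-level data shared by both hosts.
role_layer = {KEY: True}

# Hypothetical host-level data: only one host carries an override.
host_layers = {
    "an-airflow1003": {},
    "an-airflow1004": {KEY: False},
}


def effective_value(hostname: str) -> Any:
    """Resolve KEY for a host, checking the host layer before the role layer."""
    return hiera_lookup(KEY, [host_layers[hostname], role_layer])


if __name__ == "__main__":
    for name in host_layers:
        print(name, effective_value(name))
```

The point of the sketch is that both hosts can share one role while a single host file changes the outcome for that host only.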

Synced up with Ben over chat, copying here:

> What would happen if we were to share a database between an-airflow1003 and an-airflow1004 temporarily?

We would need two separate databases, since the dags (and thus whatever airflow keeps as state) will be different between instances.

> How long do we think that platform engineering would like to keep the two servers running in parallel?

A couple weeks tops.

> We could assign both hosts to the analytics_cluster::airflow::platform_eng role but override the profile::airflow::use_wmf_defaults parameter in a host specific hiera file.

This makes sense to me @BTullis! The only annoying thing here is that unless we rename the existing MySQL database, we will always have to override some of the use_wmf_defaults values. After we shut down an-airflow1003, will it be possible to drop the old database and rename the new one with the old name?
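The separate-databases requirement discussed above can be sketched as a small check that no two instances point at the same metadata database. This is only an illustration: the database names, the database host, and the user below are made up, and the URI follows the general SQLAlchemy URL shape rather than the real configuration of either server.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class AirflowInstance:
    host: str
    db_name: str  # hypothetical database name, not taken from the real setup


def metadata_db_uri(instance: AirflowInstance,
                    db_host: str = "db.example.invalid",
                    user: str = "airflow") -> str:
    """Build a SQLAlchemy-style URI for the instance's metadata database."""
    return f"mysql://{user}@{db_host}/{instance.db_name}"


def find_shared_databases(
    instances: Sequence[AirflowInstance],
) -> List[Tuple[str, str, str]]:
    """Return (first_host, second_host, db_name) for every database name reused."""
    seen = {}
    clashes = []
    for inst in instances:
        if inst.db_name in seen:
            clashes.append((seen[inst.db_name].host, inst.host, inst.db_name))
        else:
            seen[inst.db_name] = inst
    return clashes


# Hypothetical names for the old and new instances.
legacy = AirflowInstance("an-airflow1003.eqiad.wmnet", "airflow_platform_eng_old")
new = AirflowInstance("an-airflow1004.eqiad.wmnet", "airflow_platform_eng_new")

if __name__ == "__main__":
    print(metadata_db_uri(legacy))
    print(metadata_db_uri(new))
    print(find_shared_databases([legacy, new]))
```

A check like this could run before both hosts are assigned the same role, so that a missing override is caught before two schedulers write state to one database.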

Status update:

@BTullis has deployed the virtual machine, and we have a code-reviewed patch to install Airflow here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/817774.

I want to learn how all this works, so @Ottomata has agreed to wait till the week of Aug 15 to shepherd deployment of that patch jointly with me.

Change 823722 had a related patch set uploaded (by Ottomata; author: Ottomata):

[labs/private@master] Add dummy an-airflow1004.eqiad.wmnet/analytics-platform-eng keytab

https://gerrit.wikimedia.org/r/823722

Change 823722 merged by Ottomata:

[labs/private@master] Add dummy an-airflow1004.eqiad.wmnet/analytics-platform-eng keytab

https://gerrit.wikimedia.org/r/823722

Change 817774 merged by Ottomata:

[operations/puppet@production] airflow - Configure new platform_eng instance and rename old one as legacy.

https://gerrit.wikimedia.org/r/817774

Change 823727 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] airflow-dags/platform_eng - fix typo in scap source

https://gerrit.wikimedia.org/r/823727

Change 823727 merged by Ottomata:

[operations/puppet@production] airflow-dags/platform_eng - fix typo in scap source

https://gerrit.wikimedia.org/r/823727

Mentioned in SAL (#wikimedia-operations) [2022-08-16T20:39:52Z] <otto@deploy1002> Started deploy [airflow-dags/platform_eng@eba3ff8]: initial scap deploy to an-airflow1004 - T312858

Mentioned in SAL (#wikimedia-operations) [2022-08-16T20:42:22Z] <otto@deploy1002> Finished deploy [airflow-dags/platform_eng@eba3ff8]: initial scap deploy to an-airflow1004 - T312858 (duration: 02m 30s)

Mentioned in SAL (#wikimedia-operations) [2022-08-16T20:53:00Z] <otto@deploy1002> Started deploy [airflow-dags/platform_eng@da511ee]: initial scap deploy to an-airflow1004, take 2 - T312858

Mentioned in SAL (#wikimedia-operations) [2022-08-16T20:54:05Z] <otto@deploy1002> Finished deploy [airflow-dags/platform_eng@da511ee]: initial scap deploy to an-airflow1004, take 2 - T312858 (duration: 01m 05s)

Mentioned in SAL (#wikimedia-operations) [2022-08-16T21:05:37Z] <otto@deploy1002> Started deploy [airflow-dags/platform_eng@33afb85]: initial scap deploy to an-airflow1004, take 3 - T312858

Mentioned in SAL (#wikimedia-operations) [2022-08-16T21:05:56Z] <otto@deploy1002> Finished deploy [airflow-dags/platform_eng@33afb85]: initial scap deploy to an-airflow1004, take 3 - T312858 (duration: 00m 18s)

We did it! an-airflow1004 is now running a new platform_eng instance. Outstanding TODOs:

Change 824241 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[operations/puppet@production] Add missing airflow service users to yarn's production queue

https://gerrit.wikimedia.org/r/824241

Mentioned in SAL (#wikimedia-operations) [2022-08-18T19:47:42Z] <ottomata> temporarily disable puppet on an-master100* while applying change in test cluster - T312858

Change 824241 merged by Ottomata:

[operations/puppet@production] Add missing airflow service users to yarn's production queue

https://gerrit.wikimedia.org/r/824241

Mentioned in SAL (#wikimedia-analytics) [2022-08-18T19:57:40Z] <ottomata> apply yarn production queue changes to allow analytics-research and analytics-platform-eng users to submit jobs to production queue - T312858
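As a rough illustration of the queue change logged above, the following Python sketch evaluates a Hadoop-style ACL string (comma-separated users, then a space, then comma-separated groups, with "*" meaning everyone). Whether the production queue is configured with exactly this string format is an assumption, and the pre-existing ACL entry "existing-user" is hypothetical; only the two analytics-* user names come from the log entry.

```python
from typing import Iterable


def can_submit(acl: str, user: str, groups: Iterable[str] = ()) -> bool:
    """Check a user against an ACL of the form 'user1,user2 group1,group2'."""
    if acl.strip() == "*":
        return True  # wildcard: everyone may submit
    users_part, _, groups_part = acl.partition(" ")
    allowed_users = {u for u in users_part.split(",") if u}
    allowed_groups = {g for g in groups_part.strip().split(",") if g}
    return user in allowed_users or bool(allowed_groups.intersection(groups))


# Hypothetical ACL before the change, and the same ACL with the two
# airflow service users from the log entry appended.
acl_before = "existing-user"
acl_after = acl_before + ",analytics-research,analytics-platform-eng"

if __name__ == "__main__":
    for name in ("analytics-research", "analytics-platform-eng"):
        print(name, can_submit(acl_before, name), can_submit(acl_after, name))
```

Under these assumptions, a job submitted by a service user missing from the list is rejected by the queue, which is why the users had to be added before jobs from the new instance could run there.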

All right! Just verified that the image_suggestions job is running smoothly on the new an-airflow1004.eqiad.wmnet Airflow instance.

Thanks for all the help @Ottomata and @BTullis!

Closing.

We should probably keep this open until we finish the items listed in https://phabricator.wikimedia.org/T312858#8159458, or make those items their own follow-up task.

OH! I just saw you made the follow-up. Disregard my last comment please!