
Test aqs_hourly job from Airflow testing instance
Closed, Resolved · Public · 8 Estimated Story Points

Description

We are considering migrating our Oozie jobs to Airflow.
This task is part of the POC that we are working on (T241246).
As a first trial, we'd like to add the aqs_hourly job to the Search team's Airflow installation.
Here's the initial code, written by @JAllemandou:
https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/582114/

Event Timeline

Hi @EBernhardson! As we discussed in IRC, this is the task I was mentioning.
I'd like to access your Airflow instance to be able to test our job, could you help me with that?
I couldn't find any docs on Wikitech or Meta apart from https://wikitech.wikimedia.org/wiki/Discovery/Analytics.
I believe the code for that goes here, right? https://gerrit.wikimedia.org/r/#/admin/projects/wikimedia/discovery/analytics
Thanks!

Correct, there isn't (yet) any documentation, but that page is where it would go. All WMF-specific code goes into the repository you linked; there is a second repository (search/airflow) for deploying the upstream code and Python dependencies.

A few pointers:

  • Airflow is installed on an-airflow1001. The web UI is not publicly accessible; I use ssh -L 8778:localhost:8778 an-airflow1001.eqiad.wmnet and visit localhost:8778 in my browser. I use this to check on run history, view logs, and turn DAGs on/off. DAGs start in the "off" state, so on first deployment of a new DAG nothing will happen except populating the UI.
  • Invoking airflow commands is done via /usr/local/bin/airflow. This will fail and complain if you didn't sudo to the analytics-search user (you might need additional rights to be able to sudo, not sure). My most common reason to use this is airflow test ... to manually invoke a single operation from a deployed DAG (see the sketch after this list).
  • The discovery/analytics repository is deployed from the standard deployment servers using scap. Deploying will ship to an-airflow1001 and stat1007. The Airflow instance will read the updated files within a minute or two. stat1007 will not auto-magically start using any of the deployed code, so no risk there.
  • Take note that if you add or update an Airflow plugin, the Airflow webserver needs to be restarted. The scheduler may also need to be restarted; it's safest to do both. Updating DAGs requires no special action.
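
To make the airflow test workflow concrete, here is a minimal sketch of the kind of DAG file that gets deployed. The DAG id, task, and command are illustrative placeholders (not the real aqs_hourly job), and it assumes an Airflow 1.10-style install like the one described above:

```
# Minimal, hypothetical DAG sketch -- not the real aqs_hourly job.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'analytics-search',           # jobs on this instance run as analytics-search
    'start_date': datetime(2020, 4, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    dag_id='example_hourly_dag',           # placeholder id
    default_args=default_args,
    schedule_interval='@hourly',
    catchup=False,                         # new DAGs start "off"; avoid surprise backfills
) as dag:
    # Single task. Once deployed, it can be run once, without the scheduler
    # and without recording state, via:
    #   airflow test example_hourly_dag print_hour 2020-04-01T00:00:00
    print_hour = BashOperator(
        task_id='print_hour',
        bash_command='echo "processing hour {{ execution_date }}"',
    )
```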

Oh, one other limitation I implied but didn't call out above: I couldn't figure out Kerberos + multi-tenancy in Airflow. It could perhaps be figured out, but we didn't need it at the time, so I went the easier route. What this means is that any job you schedule in Airflow will have to run as the analytics-search user and save files to HDFS as that user.

fdans added a project: Analytics-Kanban.
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

@EBernhardson
In the end, I preferred to create an Airflow instance under my user on stat1007, because of the multi-tenancy issue you mentioned.
Nevertheless, your tips, code, and Airflow setup (along with help from @elukey) helped me a lot in installing, configuring, and running my own Airflow instance.
Plus, I managed to run the aqs_hourly.py job written by @JAllemandou.
So, I will update the title of this task, and consider it done.

For whoever is interested, I created a tutorial on how to quickly set up a test instance of Airflow on a stats machine.
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow/Airflow_testing_instance_tutorial
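
As a rough illustration of the kind of smoke test that pairs with such a test instance, Airflow's DagBag can load a dags folder and report import errors before the scheduler ever picks the files up. The path below is a placeholder for wherever your DAG files live:

```
# Hypothetical sketch: parse a dags folder with Airflow's DagBag and report
# any import errors, without starting the scheduler or webserver.
from airflow.models import DagBag

# Placeholder path; point it at your own dags folder on the test instance.
bag = DagBag(dag_folder='/home/your-user/airflow/dags', include_examples=False)

print('DAGs found:', sorted(bag.dags))
for path, error in bag.import_errors.items():
    print('Import error in', path)
    print(error)
```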

Cheers!

mforns renamed this task from Test aqs_hourly job from (Search team's) Airflow to Test aqs_hourly job from Airflow testing instance.Apr 16 2020, 8:29 PM
mforns set the point value for this task to 8.
mforns moved this task from In Progress to Done on the Analytics-Kanban board.