[Airflow] Setup Airflow instance for WMDE
Closed, ResolvedPublic

Description

Let's set up an Airflow instance for WMDE folks.

Steps
  • Create WMDE airflow admin group
    • send request for new system user
    • disable analytics-wmde user on stat1007
  • an-airflow1007 keytabs
  • Create an-airflow1007 instance (currently in role(insetup::data_engineering))
  • Create analytics-wmde user for an-airflow1007
  • Create Airflow Postgresql Database
  • Create Airflow puppet configuration
  • Create the instance specific dags folder
  • Create the instance specific scap repository
  • Create WMDE service user
  • Add service user to the Yarn production queue
  • Update Documentation
  • Announce to WMDE team
Success Criteria

Related Objects

Event Timeline

Linking this comment for transparency on what is in progress.

I think that there's one other issue, which is that the analytics-wmde user to whom the keytabs belong is only created by the statistics::wmde class.
So that user won't be created on an-airflow1007. It's currently pulled in by the profile: profile::statistics::explorer::misc_jobs on stats boxes, but I don't think we want all of that. We might need to make a new profile and include only that to pull in the analytics-wmde user and group.

Working on making the analytics-wmde user, which is currently only available on stat1007, available on the new an-airflow1007 instance, as required for the Airflow WMDE puppet configuration.

I created this MR for the subtask "Create the instance specific dags folder (ready to merge)"
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/474
Please, check that all is correct :-)

Change 947714 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] airflow-wmde: create analytics-wmde user for airflow

https://gerrit.wikimedia.org/r/947714

Change 948534 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] airflow-wmde: Add wmde airflow instance to insetup role

https://gerrit.wikimedia.org/r/948534

Change 948534 merged by Stevemunene:

[operations/puppet@production] airflow-wmde: Add wmde airflow instance to insetup role

https://gerrit.wikimedia.org/r/948534

We are unblocked on T342546. Working to merge the tasks listed as in progress and as ready to merge on the ticket.

Change 940936 merged by Stevemunene:

[labs/private@master] Dummy db for new wmde airflow

https://gerrit.wikimedia.org/r/940936

Change 940937 merged by Stevemunene:

[labs/private@master] Add dummy keytabs for new an-airflow1007

https://gerrit.wikimedia.org/r/940937

Change 940961 merged by Stevemunene:

[operations/puppet@production] airflow-wmde: Add a postgresql database and user for airflow wmde

https://gerrit.wikimedia.org/r/940961

Change 940863 merged by Stevemunene:

[operations/puppet@production] airflow-wmde: Add Kara Payne to analytics-wmde

https://gerrit.wikimedia.org/r/940863

Change 949001 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] airflow-wmde: Create analytics-wmde airflow admin group

https://gerrit.wikimedia.org/r/949001

Change 949019 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] airflow-wmde: Add wmde service user to the Yarn production queue

https://gerrit.wikimedia.org/r/949019

From the review of "Create WMDE airflow admin group": the airflow-wmde-admins group requires a system user in order to perform the "admin tasks" for the airflow instance.
Our current user analytics-wmde is not a system user, since the user was originally created by statistics::wmde, a class for running WMDE related statistics & analytics scripts on a statsd host.
The user is currently made available on the stat host via the profile profile::statistics::explorer::misc_jobs, along with the other scripts and jobs required for WMDE related statistics & analytics scripts.

We are currently working through the procedures to add analytics-wmde as a system user, or to use a different one, considering that all the airflow system users and those who can access them are members of analytics_privatedata_users, documented here. Given that Andrew and Manuel are already members, we would likely only need to add Kara and then proceed with the right approvals.

Thanks for the efforts on this, @Stevemunene! Please let us know if there's anything needed on our end :)

Adding some more context to this: we already have system users such as analytics-product and analytics-search, as seen in the analytics private data system users.
These system users are also responsible for submitting jobs to Yarn and for running regular Airflow service maintenance, as seen here for the analytics-platform-eng system user.

Looping in @elukey for some help with this.

Hi folks! Yes, I'd follow what we did for analytics-product etc., since we'll create the same system user (uid/gid) across nodes (airflow, stat100x, hadoop worker nodes, etc.). You can reserve a uid/gid combination in puppet admin's data.yaml file, and add the related system user to the analytics-privatedata-users group (as the others).

The only follow-up that I can think of is that on stat1007, where analytics-wmde is already present IIRC, we'll almost surely have a different uid/gid, so some follow-up (chown -R etc.) will be needed.

In agreement with this, the current analytics-wmde details on stat1007 are

@stat1007:~$ id analytics-wmde
uid=493(analytics-wmde) gid=1002(analytics-wmde) groups=1002(analytics-wmde)

This will need changing, as Luca mentioned. For this change we would like to set up a timeframe to monitor the WMDE jobs running on stat1007, with help and guidance on the jobs and related processes from WMDE engineer(s). cc @AndrewTavis_WMDE
Depending on the jobs running on stat1007 and the effect a change would have, an alternative would be to create a separate system user and follow the same steps as we would have with analytics-wmde, without interfering with existing jobs/systems.
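
To make the uid/gid comparison concrete, here is a minimal Python sketch that parses a line of `id` output like the one above and checks it against a reserved pair. The helper names (`parse_id_output`, `needs_reown`) are hypothetical, and the reserved values are placeholders passed in by the caller; this is an illustration, not part of the puppet admin module.

```python
import re
from typing import NamedTuple


class IdInfo(NamedTuple):
    uid: int
    user: str
    gid: int
    group: str


# Matches the leading "uid=N(name) gid=N(name)" part of `id` output.
_ID_RE = re.compile(r"uid=(\d+)\(([^)]+)\)\s+gid=(\d+)\(([^)]+)\)")


def parse_id_output(line: str) -> IdInfo:
    """Extract the uid, user, gid and primary group from one line of `id` output."""
    match = _ID_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognised id output: {line!r}")
    return IdInfo(int(match[1]), match[2], int(match[3]), match[4])


def needs_reown(current: IdInfo, reserved_uid: int, reserved_gid: int) -> bool:
    """True when the host's uid/gid differ from the reserved pair (hypothetical check)."""
    return (current.uid, current.gid) != (reserved_uid, reserved_gid)
```

Calling `parse_id_output` on the stat1007 line gives uid 493 and gid 1002, so any reserved pair other than 493:1002 would flag the host as needing an ownership follow-up.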

As Wikidata's Analytics Product Manager I am not focused on the technical engineering aspects. But let me still try to provide some context that might be helpful:

  • The general plan for all WMDE jobs is to either transition them to Airflow in the future or deprecate them.
  • Based on our documentation of jobs, it seems that "WD_PageviewsPerType" is the sole job currently running on stat1007, with other jobs being documented on stat1004, stat1005, and stat1008.
  • In discussions with @JAllemandou during the Spark migration initiative, it became evident that "WD_PageviewsPerType" on stat1007 has been experiencing failures since February 17th. Consequently, a decision was made to already stop the Cron job for it (T334951#8985070).
  • Given these circumstances, it seems reasonable from my perspective to consider options such as deleting the old System User and starting fresh with a clear focus on the new purpose or creating a fresh second user account.

Andrew and I will be back from our leave after next week. Upon our return, you can count on us for anything you may need from our side! :)

Thank you for your response, @Manuel. We shall be moving forward with the analytics-wmde user; I have sent out the access request for this. Corresponding patches to follow.

@BTullis With the upcoming elevation of the analytics-wmde user to a systemwide user across nodes (airflow, stat100x, hadoop worker nodes, etc..) and membership of analytics-privatedata-users, I'm considering removing access to analytics-wmde for the general analytics-wmde-users group and having only the airflow-wmde-admins with access to the user. This shouldn't affect much since the user was only on stat1007.

@Stevemunene I have uploaded a new patchset to https://gerrit.wikimedia.org/r/c/operations/puppet/+/949001 to fix the CI issue.
It was a YAML indenting issue that was causing the CI to fail.

I've also updated the commit message to try to clarify what the patch is for, but please feel free to reword it if you think I've misunderstood anything.

Change 947714 abandoned by Stevemunene:

[operations/puppet@production] airflow-wmde: create analytics-wmde users class for wmde services

Reason:

We decided to create the analytics-wmde user as a system user on the admin module. Thus we no longer need to create the user using classes as done here.

https://gerrit.wikimedia.org/r/947714

Change 959222 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] airflow-wmde: Remove statsd analytics-wmde user

https://gerrit.wikimedia.org/r/959222

Change 949001 merged by CDanis:

[operations/puppet@production] admin: Create analytics-wmde system user and airflow admin group

https://gerrit.wikimedia.org/r/949001

Change 959222 merged by Stevemunene:

[operations/puppet@production] airflow-wmde: Remove statsd analytics-wmde user

https://gerrit.wikimedia.org/r/959222

We were unblocked on the analytics-wmde admin group and user, and were able to create the admin group after all the right approvals. So we are actively back in progress and this should be ready in a couple of days.

I think it is fairly urgent to do a proper cleanup of the wmde scripts and cronjobs that were running on stat1007, as the change in user details has affected most of the folders and associated group permissions, thus causing puppet failures on the host. cc @Manuel

Examples shown here

Notice: /Stage[main]/Statistics::Wmde::Graphite/Git::Clone[wmde/scripts]/Exec[git_set_origin_wmde/scripts]/returns: 	git config --global --add safe.directory /srv/analytics-wmde/graphite/src/scripts
Error: '/usr/bin/git remote set-url origin https://gerrit.wikimedia.org/r/analytics/wmde/scripts' returned 128 instead of one of [0]
Error: /Stage[main]/Statistics::Wmde::Graphite/Git::Clone[wmde/scripts]/Exec[git_set_origin_wmde/scripts]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/git remote set-url origin https://gerrit.wikimedia.org/r/analytics/wmde/scripts' returned 128 instead of one of [0] (corrective)
Notice: /Stage[main]/Statistics::Wmde::Graphite/Git::Clone[wmde/scripts]/Exec[git_pull_wmde/scripts]: Dependency Exec[git_set_origin_wmde/scripts] has failures: true
Warning: /Stage[main]/Statistics::Wmde::Graphite/Git::Clone[wmde/scripts]/Exec[git_pull_wmde/scripts]: Skipping because of failed dependencies
Error: '/usr/bin/git remote set-url origin https://gerrit.wikimedia.org/r/analytics/wmde/toolkit-analyzer-build' returned 128 instead of one of [0]
Error: /Stage[main]/Statistics::Wmde::Graphite/Git::Clone[wmde/toolkit-analyzer-build]/Exec[git_set_origin_wmde/toolkit-analyzer-build]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/git remote set-url origin https://gerrit.wikimedia.org/r/analytics/wmde/toolkit-analyzer-build' returned 128 instead of one of [0] (corrective)
Notice: /Stage[main]/Statistics::Wmde::Graphite/Git::Clone[wmde/toolkit-analyzer-build]/Exec[git_pull_wmde/toolkit-analyzer-build]: Dependency Exec[git_set_origin_wmde/toolkit-analyzer-build] has failures: true
Warning: /Stage[main]/Statistics::Wmde::Graphite/Git::Clone[wmde/toolkit-analyzer-build]/Exec[git_pull_wmde/toolkit-analyzer-build]: Skipping because of failed dependencies
Notice: /Stage[main]/Statistics::Wmde::Wdcm/Git::Clone[analytics/wmde/WDCM]/Exec[git_set_origin_analytics/wmde/WDCM]/returns: fatal: detected dubious ownership in repository at '/srv/analytics-wmde/wdcm/src'
Notice: /Stage[main]/Statistics::Wmde::Wdcm/Git::Clone[analytics/wmde/WDCM]/Exec[git_set_origin_analytics/wmde/WDCM]/returns: 	git config --global --add safe.directory /srv/analytics-wmde/wdcm/src
Error: '/usr/bin/git remote set-url origin https://gerrit.wikimedia.org/r/analytics/wmde/WDCM' returned 128 instead of one of [0]
Error: /Stage[main]/Statistics::Wmde::Wdcm/Git::Clone[analytics/wmde/WDCM]/Exec[git_set_origin_analytics/wmde/WDCM]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/git remote set-url origin https://gerrit.wikimedia.org/r/analytics/wmde/WDCM' returned 128 instead of one of [0] (corrective)
Warning: /Stage[main]/Statistics::Wmde::Graphite/Systemd::Timer::Job[wmde-analytics-minutely]/Systemd::Unit[wmde-analytics-minutely.service]/File[/lib/systemd/system/wmde-analytics-minutely.service]: Skipping because of failed dependencies

Change 961699 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Disable WMDE misc jobs on stat1007

https://gerrit.wikimedia.org/r/961699

Hello, while doing a cleanup of the WMDE scripts/jobs we came across a situation where some of the job timers are still running and seem to be doing so successfully.
Thus we are hesitant to purge the jobs as had initially been suggested, since we might inadvertently cause a gap in some data upon which people are relying. We would like to request some help reviewing the jobs running on stat1007, especially those using the previous analytics-wmde user/group permissions, before we begin the purge operation. cc @Manuel, @AndrewTavis_WMDE

These are some of those timers.

btullis@stat1007:~$ systemctl list-timers |grep wmde
Fri 2023-09-29 09:11:00 UTC  2s left             Fri 2023-09-29 09:10:00 UTC  57s ago            wmde-analytics-minutely.timer                   wmde-analytics-minutely.service
Fri 2023-09-29 12:00:00 UTC  2h 49min left       Thu 2023-09-28 12:00:01 UTC  21h ago            wmde-analytics-daily-noon.timer                 wmde-analytics-daily-noon.service
Sat 2023-09-30 03:00:00 UTC  17h left            Fri 2023-09-29 03:00:01 UTC  6h ago             wmde-analytics-daily-early.timer                wmde-analytics-daily-early.service
Sun 2023-10-01 00:00:00 UTC  1 day 14h left      Sun 2023-09-24 00:00:01 UTC  5 days ago         wmde-analytics-weekly.timer                     wmde-analytics-weekly.service
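
As a rough aid for reviewing which timers are involved, the listing above can be reduced to timer/service pairs with a few lines of Python. This is only a sketch: `parse_timer_lines` is a hypothetical helper, and it assumes the whitespace-separated layout shown above, where the next run occupies the first four fields and the last two fields are the timer and the unit it activates. Rows whose next run is shown as `n/a` would not fit that assumption.

```python
from typing import Dict, List


def parse_timer_lines(text: str) -> List[Dict[str, str]]:
    """Parse `systemctl list-timers` rows into dicts (layout assumed as in the listing above)."""
    timers = []
    for line in text.splitlines():
        fields = line.split()
        # Skip headers, footers and anything that does not end in "<x>.timer <unit>".
        if len(fields) < 6 or not fields[-2].endswith(".timer"):
            continue
        timers.append(
            {
                "next": " ".join(fields[:4]),  # e.g. "Fri 2023-09-29 09:11:00 UTC"
                "timer": fields[-2],
                "service": fields[-1],
            }
        )
    return timers
```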

They're also still running.

Sep 29 09:13:00 stat1007 systemd[1]: Starting Minutely jobs for wmde analytics infrastructure...
Sep 29 09:13:00 stat1007 minutely.sh[25022]: + '[' -z /srv/analytics-wmde/graphite/src/scripts ']'
Sep 29 09:13:00 stat1007 minutely.sh[25022]: + date '+%F %T minutely.sh Started!'
Sep 29 09:13:00 stat1007 minutely.sh[25022]: 2023-09-29 09:13:00 minutely.sh Started!
Sep 29 09:13:00 stat1007 minutely.sh[25022]: + eval /srv/analytics-wmde/graphite/src/scripts/src/wikidata/wb_changes.php
Sep 29 09:13:00 stat1007 minutely.sh[25022]: ++ /srv/analytics-wmde/graphite/src/scripts/src/wikidata/wb_changes.php
Sep 29 09:13:00 stat1007 minutely.sh[25022]: + date '+%F %T minutely.sh Waiting!'
Sep 29 09:13:00 stat1007 minutely.sh[25022]: + eval /srv/analytics-wmde/graphite/src/scripts/src/wikidata/maxlag.php
Sep 29 09:13:00 stat1007 minutely.sh[25022]: ++ /srv/analytics-wmde/graphite/src/scripts/src/wikidata/maxlag.php
Sep 29 09:13:00 stat1007 minutely.sh[25022]: + eval /srv/analytics-wmde/graphite/src/scripts/src/wikidata/recentChanges.php
Sep 29 09:13:00 stat1007 minutely.sh[25022]: ++ /srv/analytics-wmde/graphite/src/scripts/src/wikidata/recentChanges.php
Sep 29 09:13:00 stat1007 minutely.sh[25022]: 2023-09-29 09:13:00 minutely.sh Waiting!
Sep 29 09:13:00 stat1007 minutely.sh[25022]: + wait
Sep 29 09:13:00 stat1007 minutely.sh[25022]: 2023-09-29 09:13:00 wikidata-wb_changes Script Started!
Sep 29 09:13:00 stat1007 minutely.sh[25022]: 2023-09-29 09:13:00 wikidata-maxlag Script Started!
Sep 29 09:13:00 stat1007 minutely.sh[25022]: 2023-09-29 09:13:00 wikidata-recentChanges Script Started!
Sep 29 09:13:00 stat1007 minutely.sh[25022]: 2023-09-29 09:13:00 wikidata-wb_changes Script Finished!
Sep 29 09:13:00 stat1007 minutely.sh[25022]: 2023-09-29 09:13:00 wikidata-maxlag Script Finished!
Sep 29 09:13:00 stat1007 minutely.sh[25022]: 2023-09-29 09:13:00 wikidata-recentChanges Script Finished!
Sep 29 09:13:00 stat1007 minutely.sh[25022]: + date '+%F %T minutely.sh Ended!'
Sep 29 09:13:00 stat1007 minutely.sh[25022]: 2023-09-29 09:13:00 minutely.sh Ended!
Sep 29 09:13:00 stat1007 systemd[1]: wmde-analytics-minutely.service: Succeeded.
Sep 29 09:13:00 stat1007 systemd[1]: Started Minutely jobs for wmde analytics infrastructure.

This is the patch we have been working on, gerrit 961699: Disable WMDE misc jobs on stat1007. Therein lie the comments shown and the pcc results for the config removal.

pcc = Help:Puppet-compiler, used to get the results of a given puppet configuration without having to deploy it to servers.

Hi @Stevemunene and @BTullis, thank you so much for reaching out about this before pulling the plug! I was unaware that the cronjobs for other systems were hosted/started from stat1007 (this is how I understood your comments). I'll look more into this and come back to you.

Hi @Stevemunene and @BTullis, thank you again for making us aware of this important issue! It turns out that these timers are indeed critical for different teams, so we must ensure not to stop them. What does it mean for our work here that we cannot stop the cronjobs for now? Can we go another route (e.g. you create a completely new user for this, after all, and we clear up the analytics-wmde situation separately)?

Also, unfortunately, the cronjobs are currently undocumented and unowned on our end, and I am trying to clarify the ownership, but this might take some time. A joint meeting/chat with one of you and one of our engineers might help us, to fully understand the situation. Would you be up for such a conversation?

Change 966256 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] airflow-wmde: Place airflow1007 in airflow-wmde role

https://gerrit.wikimedia.org/r/966256

Change 940939 merged by Stevemunene:

[operations/puppet@production] airflow-wmde: Create scap deployment source for wmde

https://gerrit.wikimedia.org/r/940939

Change 940938 merged by Stevemunene:

[operations/puppet@production] airflow-wmde: configure wmde airflow instance

https://gerrit.wikimedia.org/r/940938

Change 966256 merged by Stevemunene:

[operations/puppet@production] airflow-wmde: Place airflow1007 in airflow-wmde role

https://gerrit.wikimedia.org/r/966256

Icinga downtime and Alertmanager silence (ID=bb8fddcb-96c9-4078-bcf9-9fd9d4c95358) set by stevemunene@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Downtime as we setup the new WMDE Airflow instance

an-airflow1007.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=cae0d6d1-edbc-4b22-8059-9236cb8823bc) set by stevemunene@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Downtime as we setup the new WMDE Airflow instance

an-airflow1007.eqiad.wmnet

Hi @Manuel Thanks for your response, and yes a conversation would certainly be of help in identifying the jobs and how to move forward.

For now we have been able to resolve this error from a previous comment that was causing puppet failures on stat1007.

The error was mainly caused by the recent change of the uid and gid of the analytics-wmde user. The folders bound to the previous uid and gid were no longer owned by the new user, so git operations in those folders (such as setting the remote or pulling) failed with these errors:

Notice: /Stage[main]/Statistics::Wmde::Wdcm/Git::Clone[analytics/wmde/WDCM]/Exec[git_set_origin_analytics/wmde/WDCM]/returns: 	git config --global --add safe.directory /srv/analytics-wmde/wdcm/src
Error: '/usr/bin/git remote set-url origin https://gerrit.wikimedia.org/r/analytics/wmde/WDCM' returned 128 instead of one of [0]
Error: /Stage[main]/Statistics::Wmde::Wdcm/Git::Clone[analytics/wmde/WDCM]/Exec[git_set_origin_analytics/wmde/WDCM]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/git remote set-url origin https://gerrit.wikimedia.org/r/analytics/wmde/WDCM' returned 128 instead of one of [0] (corrective)
Warning: /Stage[main]/Statistics::Wmde::Graphite/Systemd::Timer::Job[wmde-analytics-minutely]/Systemd::Unit[wmde-analytics-minutely.service]/File[/lib/systemd/system/wmde-analytics-minutely.service]: Skipping because of failed dependencies

This was fixed by changing the ownership of every file and folder still owned by the old uid to the new uid and gid used by the analytics-wmde user.
sudo find . -uid 493 -exec chown 927:927 {} \;
Subsequent puppet runs were successful on the host.
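
For anyone who wants to preview such an ownership change before applying it, here is a minimal Python sketch of the same idea with a dry-run mode. The function name `reown_tree` is hypothetical, and unlike the `find ... -exec chown` command above it uses `os.lchown`, so symlinks themselves are re-owned rather than their targets. Actually changing ownership requires root.

```python
import os
from typing import List


def reown_tree(root: str, old_uid: int, new_uid: int, new_gid: int,
               dry_run: bool = True) -> List[str]:
    """Return every path under root owned by old_uid; chown them unless dry_run."""
    matched: List[str] = []

    def visit(path: str) -> None:
        # lstat/lchown act on the entry itself and never follow symlinks.
        if os.lstat(path).st_uid == old_uid:
            matched.append(path)
            if not dry_run:
                os.lchown(path, new_uid, new_gid)

    visit(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            visit(os.path.join(dirpath, name))
    return matched
```

With the default `dry_run=True` the function only reports what would change, for example `reown_tree("/srv/analytics-wmde", 493, 927, 927)`, where the path is taken from the puppet log above and the ids from the command above.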

Change 949019 merged by Ryan Kemper:

[operations/puppet@production] airflow-wmde: Add wmde service user to the Yarn production queue

https://gerrit.wikimedia.org/r/949019

Mentioned in SAL (#wikimedia-analytics) [2023-10-18T16:53:58Z] <stevemunene> Add analytics-wmde service user to the Yarn production queue T340648

Icinga downtime and Alertmanager silence (ID=9120ee3d-6b27-4a2d-bc4d-e4f592882343) set by stevemunene@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Downtime as we setup the new WMDE Airflow instance

an-airflow1007.eqiad.wmnet

Mentioned in SAL (#wikimedia-analytics) [2023-10-18T18:03:24Z] <stevemunene> revert Add analytics-wmde service user to the Yarn production queue T340648

Icinga downtime and Alertmanager silence (ID=ca92fb50-91c0-4832-a18d-b71b3e5cae7d) set by stevemunene@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Downtime as we setup the new WMDE Airflow instance

an-airflow1007.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=60f9f206-7461-495b-8373-fa1c60ddaf2e) set by stevemunene@cumin1001 for 3 days, 0:00:00 on 1 host(s) and their services with reason: Downtime as we setup the new WMDE Airflow instance

an-airflow1007.eqiad.wmnet

While setting up the wmde instance, we noticed an error on the deployment server during the initial setup of the scap repo. As per the instance creation instructions we use, the "Create the instance specific dags folder" step was completed and merged here: Airflow Dags Scap wmde. Next comes the "Create a scap deployment source" step, which when merged should update the deployers group to add airflow-wmde-admins and define the scap sources.
However, after merging the patch, we are getting some puppet errors initializing the scap repo:

Error: Execution of '/usr/bin/scap deploy --init' returned 1: 
Error: /Stage[main]/Profile::Mediawiki::Deployment::Server/Scap::Source[airflow-dags/wmde]/Scap_source[airflow-dags/wmde]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy --init' returned 1:

Icinga downtime and Alertmanager silence (ID=ec8c920c-26ba-48f2-b8e1-6bc3a415578a) set by stevemunene@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Downtime as we setup the new WMDE Airflow instance

an-airflow1007.eqiad.wmnet

Mentioned in SAL (#wikimedia-analytics) [2023-11-08T15:52:16Z] <stevemunene> Add analytics-wmde service user to the Yarn production queue T340648

Icinga downtime and Alertmanager silence (ID=ecc0e9fa-3af2-4029-86fe-4b42f82adef6) set by stevemunene@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Downtime as we setup the new WMDE Airflow instance

an-airflow1007.eqiad.wmnet

We have made progress with the WMDE airflow instance: we merged and applied all the pending setup patches and initialized the DB as shown below.
Check connection

stevemunene@an-airflow1007:/usr/lib/airflow/bin$ sudo AIRFLOW_HOME=/srv/airflow-wmde/ ./airflow db check
/usr/lib/airflow/lib/python3.10/site-packages/airflow/utils/db.py:1740 RemovedIn20Warning: Deprecated API features detected! These feature(s) are not compatible with SQLAlchemy 2.0. To prevent incompatible upgrades prior to updating applications, ensure requirements files are pinned to "sqlalchemy<2.0". Set environment variable SQLALCHEMY_WARN_20=1 to show all deprecation warnings.  Set environment variable SQLALCHEMY_SILENCE_UBER_WARNING=1 to silence this message. (Background on SQLAlchemy 2.0 at: https://sqlalche.me/e/b8d9)
[2023-11-09T10:39:39.781+0000] {db.py:1741} INFO - Connection successful

Initialize DB

stevemunene@an-airflow1007:/usr/lib/airflow/bin$ sudo AIRFLOW_HOME=/srv/airflow-wmde/ ./airflow db init
DB: postgresql://airflow_wmde:***@an-db1001.eqiad.wmnet/airflow_wmde?sslmode=require&sslrootcert=%2Fetc%2Fssl%2Fcerts%2Fwmf-ca-certificates.crt
[2023-11-09T11:00:48.603+0000] {migration.py:213} INFO - Context impl PostgresqlImpl.
[2023-11-09T11:00:48.604+0000] {migration.py:216} INFO - Will assume transactional DDL.
INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO  [alembic.runtime.migration] Will assume transactional DDL.
INFO  [alembic.runtime.migration] Running stamp_revision  -> c804e5c76e3e
Initialization done
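
The `DB:` line in the output above packs several settings into one URL, with the certificate path percent-encoded. A small Python sketch using only the standard library can unpack it for inspection; `describe_db_url` is a hypothetical helper, not something Airflow provides.

```python
from typing import Dict, Optional
from urllib.parse import parse_qs, urlsplit


def describe_db_url(url: str) -> Dict[str, Optional[str]]:
    """Split a database URL into its parts, decoding percent-encoded query values."""
    parts = urlsplit(url)
    # parse_qs returns lists; keep the first value of each parameter.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "database": parts.path.lstrip("/"),
        "sslmode": query.get("sslmode"),
        "sslrootcert": query.get("sslrootcert"),
    }
```

Applied to the URL printed by `airflow db init`, this shows that the instance connects to the `airflow_wmde` database on `an-db1001.eqiad.wmnet` as user `airflow_wmde`, with `sslmode=require` and the root certificate at `/etc/ssl/certs/wmf-ca-certificates.crt`.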

The airflow webserver is up and accessible via the instructions available here.

However, for the acceptance criteria, there is an error with the sample DAG initially provided, which I am looking into. cc @mforns

Broken DAG: [/srv/deployment/airflow-dags/wmde/wmde/dags/example/example_dag.py] Traceback (most recent call last):
  File "/srv/deployment/airflow-dags/wmde/wmde/config/dag_config.py", line 39, in <module>
    dataset_registry = DatasetRegistry(dataset_file_paths)
  File "/srv/deployment/airflow-dags/wmde/wmf_airflow_common/dataset.py", line 466, in __init__
    for dataset_name, dataset_params in yaml.safe_load(dataset_file).items():
AttributeError: 'NoneType' object has no attribute 'items'

Oh! The datasets.yaml file in the wmde/config folder does not specify any dataset yet.
That's why loading the DatasetRegistry fails: it expects at least one dataset to be defined.
I created an MR to allow empty dataset files.
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/541
Hope it works!
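
For context on the traceback above: PyYAML's `yaml.safe_load` returns `None` for an empty document, so calling `.items()` on the result fails. One way to tolerate that is sketched below. This is an illustration of the general pattern only, not necessarily what the MR does, and both function names are hypothetical.

```python
from typing import Any, Dict, Iterator, Mapping, Optional, Tuple


def iter_datasets(parsed: Optional[Mapping[str, Any]]) -> Iterator[Tuple[str, Any]]:
    """Yield (name, params) pairs, treating an empty YAML document (None) as no datasets."""
    yield from (parsed or {}).items()


def load_dataset_file(path: str) -> Dict[str, Any]:
    """Read one dataset file; an empty file yields an empty dict instead of raising."""
    import yaml  # PyYAML, imported here so iter_datasets has no third-party dependency

    with open(path) as dataset_file:
        return dict(iter_datasets(yaml.safe_load(dataset_file)))
```

The `parsed or {}` guard keeps the loop body unchanged for populated files while making an empty datasets.yaml a valid, empty registry.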

Change 961699 abandoned by Stevemunene:

[operations/puppet@production] Disable WMDE misc jobs on stat1007

Reason:

The misc jobs on stat1007 are active and in use so we shall not be disabling all of them.

https://gerrit.wikimedia.org/r/961699

We now have a sample DAG on the WMDE instance which should meet the acceptance criteria, huge thanks to @mforns for the help with the sample DAG.

The WMDE instance is accessible by following the instructions available here (#wmde). Feel free to reach out to the Data Platform SREs for any help accessing the instance or deploying DAGs. cc @Manuel @AndrewTavis_WMDE

Hey @Stevemunene! Thanks so much for the efforts here! Further thanks to the others at WMF who have helped along the way :) This is such an important step for analytics at WMDE! 🎉

We'll be in contact with you all in the New Year to set up some meetings for how to get our own DAGs up and running. @JAllemandou and I already have plans to look into these topics in our bi-weeklies, and maybe some further support will be needed from there regarding how best to set up our processes as well as checking desired output schemas to plan it all out.

Really looking forward to the next steps!

Gehel subscribed.

Re-opening until we get validation from WMDE that things are working for them as expected.

Thanks for the attention on this, @Gehel! I've put checking the wmde instance as the first and only thing for @JAllemandou and my 1:1 on Monday. I'll get back to everyone after we've had a chance to check it out!

Hello all! Closing this ticket post the 1:1 with @JAllemandou as I have SSH access to the wmde instance on Airflow, have been added as a collaborator on GitLab and have successfully cloned the Airflow DAGs repo. Thanks all so much for the support here! Looking forward to the steps ahead! 🚀

Change 993667 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add the wmde instance to cumin A:analytics-airflow alias

https://gerrit.wikimedia.org/r/993667

Change 993667 merged by Btullis:

[operations/puppet@production] Add the wmde instance to cumin A:analytics-airflow alias

https://gerrit.wikimedia.org/r/993667