In the context of current work around pageviews data set enhancements, the WME team needs a dedicated Airflow instance running data enrichment and AWS synchronization jobs.
A dedicated instance is further recommended for future data pipelining needs.
Description
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| Add access for platform engineering Airflow and data | operations/puppet | production | +3 -2 |
Related Objects
Event Timeline
Interesting! I'm very happy to help here, but I'd like to know a little more about why a new instance is seen as a specific requirement.
Since the migration to Kubernetes, we've been working on the understanding that the likely shift would be towards consolidation of Airflow instances, as opposed to a proliferation.
Maybe we could have a chat about what kind of DAGs you expect to be running on this instance and what their specific features will be. It might be that the WME DAGs would be suitable to run on the airflow-main instance, but let's discuss.
@HShaikh please provide more input on concrete future Airflow needs.
As an alternative to its own instance, WME could leverage platform-eng for now and migrate later once other use cases become clear.
Hey Ben, the need for a new instance came from a future isolation perspective: WME doesn't want its jobs to potentially cause resource starvation for others, or to be affected by other jobs' resource hunger.
The Pageviews project is a starter project where we are looking to start doing some compute inside the DPE infrastructure, to benefit from access to data at an earlier stage in the pipelines, allowing us to be more timely with the signals we would like to create.
Creation of this pageviews data set is one part of the pipeline. The second part is syncing that data over to AWS. The mechanism for the syncing is still being designed to be optimal in terms of form and frequency.
Looking at the GitLab folder structure, the team was thinking of having a high-level folder for WME jobs (which, it turns out, translates to a separate instance).
In the initial phase our compute needs will probably not be too high, and we can use the shared resources of the airflow-main instance.
Would it be possible to create a folder at the https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags level but have it run on the airflow-main instance?
OK, thanks. That makes sense. However, there might be other, easier ways that we can ensure fair resource allocation for your pipelines.
We don't have to rule out a new instance, but I wouldn't start with it, especially for a single pipeline.
The Pageviews project is a starter project where we are looking to start doing some compute inside the DPE infrastructure, to benefit from access to data at an earlier stage in the pipelines, allowing us to be more timely with the signals we would like to create.
Creation of this pageviews data set is one part of the pipeline.
Great, thanks for explaining. Do you know what that data set will look like, yet? Will it be parquet files forming a hive table, or something else? Will you be using Spark to generate it?
You can also choose whether any distributed part of your pipeline runs on Hadoop/YARN or on Kubernetes.
The second part is syncing that data over to AWS. The mechanism for the syncing is still being designed to be optimal in terms of form and frequency.
We might be able to help design this, based on some prior art here.
For example, we have a sync-utils image, that we use to perform parallel-rsync operations to move files around from DAG tasks.
Some reference code for how to launch this image using the KubernetesPodOperator can be found here.
That image also contains the rclone utility, which is extremely capable.
If your dataset files are generated on HDFS, then you can use the hdfs remote type on one end, and the S3 remote on the other end.
You could also use our Ceph server and have your interim files stored on either CephFS or an S3 bucket on that cluster, before being synced to AWS.
Whatever your source type, rclone can be used with the sync command to bring the AWS version up-to-date efficiently with the source files.
If rclone doesn't meet your needs, we can likely add another tool to the sync-utils image that will help.
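To make the sync approach above concrete, here is a minimal sketch of building an `rclone sync` invocation from a DAG task, for example via a wrapper script run inside the sync-utils image. The remote names `hdfs` and `aws-s3` and the paths are hypothetical; real remotes would be defined in the rclone configuration for the instance.

```python
import subprocess


def build_rclone_sync_cmd(source_remote: str, source_path: str,
                          dest_remote: str, dest_path: str,
                          dry_run: bool = False) -> list[str]:
    """Build an `rclone sync` command that brings the destination up to
    date with the source: new and changed files are copied, and files
    that no longer exist at the source are deleted at the destination."""
    cmd = [
        "rclone", "sync",
        f"{source_remote}:{source_path}",
        f"{dest_remote}:{dest_path}",
    ]
    if dry_run:
        # Report what would change without modifying anything.
        cmd.append("--dry-run")
    return cmd


# Hypothetical remotes/paths for illustration only.
cmd = build_rclone_sync_cmd("hdfs", "/wmf/data/pageview_combined",
                            "aws-s3", "wme-pageviews/pageview_combined",
                            dry_run=True)
# subprocess.run(cmd, check=True)  # uncomment to actually invoke rclone
```

The same helper works unchanged whether the source remote points at HDFS, CephFS, or an S3 bucket on the Ceph cluster, since rclone abstracts the backend behind the remote name.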
Looking at the GitLab folder structure, the team was thinking of having a high-level folder for WME jobs (which, it turns out, translates to a separate instance).
In the initial phase our compute needs will probably not be too high, and we can use the shared resources of the airflow-main instance.
Would it be possible to create a folder at the https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags level but have it run on the airflow-main instance?
How about if you had a sub-folder for your DAGs, to begin with? Would this be a reasonable way to start?
There was some discussion recently about whether airflow-main or airflow-platform-eng would be better, as platform-eng has been used to hold some miscellaneous DAGs, but I don't have very strong feelings about it.
Hi @BTullis!
Chiming in here to add a bit of info / my personal views on some of the questions and points you've raised above:
Do you know what that data set will look like, yet? Will it be parquet files forming a hive table, or something else? Will you be using Spark to generate it? You can also choose whether any distributed part of your pipeline runs on Hadoop/YARN or on Kubernetes.
You can see a sample version of the final data schema (or something pretty close to it) in Hive in my personal namespace (specifically in htriedman.pageview_combined_analytics). That table is created using several precursor tables, which you can see samples of in Hive at:
- htriedman.pageview_geo_proportion
- htriedman.pageview_geo_top10
- htriedman.pageview_associated_proportion
- htriedman.pageview_associated_top10
All of these sample tables were created as parquet files in Hive using a standard yarn-large-sized Spark cluster. There may be some desire (not now, but at some point in the future) to investigate ways of collecting streaming data using Flink, but I'm only bringing that up to put it on your radar, not because it's necessary for this initial product.
Data will only be kept for a limited time before it is dropped, to keep table sizes down. I haven't finalized these numbers yet, but I'm thinking 3 months of pageview_associated_* data (which comes from the clickstream dataset, released monthly), 60 days of pageview_geo_* data (which comes from the DP geo pageviews dataset, released daily), and ~15-20 days of hourly pageview data.
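The retention windows above could be applied by a periodic cleanup task that drops partitions older than a per-table cutoff. A minimal sketch of the cutoff arithmetic, using the proposed (not yet final) windows from this comment; the table-family names are illustrative:

```python
from datetime import date, timedelta

# Proposed retention windows from the discussion above (not finalized).
RETENTION = {
    "pageview_associated": timedelta(days=90),  # ~3 months, clickstream-based
    "pageview_geo": timedelta(days=60),         # DP geo pageviews, released daily
    "pageview_hourly": timedelta(days=20),      # ~15-20 days of hourly data
}


def retention_cutoff(table_family: str, today: date) -> date:
    """Return the cutoff date: partitions dated strictly before it
    should be dropped by the cleanup job."""
    return today - RETENTION[table_family]


cutoff = retention_cutoff("pageview_geo", date(2024, 6, 30))
# cutoff is 2024-05-01: any pageview_geo partition before it is droppable
```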
Any other questions / clarifying points on your end?
How about if you had a sub-folder for your DAGs, to begin with? Would this be a reasonable way to start?
I would be fine with putting these Airflow jobs in either of the options you listed, with a very slight preference for airflow-platform-eng (because it already contains the DP geo pageviews aggregation, which is part of this data release).
With regard to data transfer options, I defer to other WME engineers who are going to be receiving the dataset and processing it on WME's AWS clusters.
Update here, I took the access patch for Haroon, Ricardo, and me out of WIP:
Looking for a review and merge from someone with rights on the repo.
Change #1165605 had a related patch set uploaded (by Dr0ptp4kt; author: Dr0ptp4kt):
[operations/puppet@production] Add access for platform engineering Airflow and data
Thanks for the patch. I am happy to review and merge. As per the data.yaml file, @Ottomata's approval is required for this so adding him here for that.
Change #1165605 merged by Ssingh:
[operations/puppet@production] Add access for platform engineering Airflow and data
Thanks @Ottomata!
@dr0ptp4kt: Access request merged. Please try in ~30 minutes and let us know if there are any issues with the access.
Thanks @Ottomata. It looks like an extra command may be needed to grant the permissions in Airflow; I believe it's the procedure at wikitech:SRE/Clinic_Duty/Access_requests#Modify_LDAP_groups.
Would someone be able to do the LDAP step to grant the additional access for resquito and dr0ptp4kt?
I logged out and logged back into Airflow, but it didn't seem to take effect.
From https://ldap.toolforge.org/group/airflow-platform-eng-ops I can see some of the others (who are presently in the group in data.yaml), and I'm thinking that was probably because of previous LDAP commands that were run.
Now, I noticed that there are some additional (trusted) folks in the LDAP group as well. I'll go check with the committer of the parent commit to see if those folks still want/need access to the elevated Airflow access, in which case I think they should be added explicitly into the analytics_platform_eng_admins_members / airflow-platform-eng-ops assignment in data.yaml as well just to reflect reality.
Related: T399899#11017204 , I'll go check about thoughts on updates to the documentation about access.
I am not very sure which groups need to be modified here (I do promise to update the documentation once that becomes clear) but for now, I added you and resquito to airflow-platform-eng-ops. Let me know if that works?
@BTullis I think we could use your knowledge on this one. Any pointers?
@ssingh, I think this LDAP grant (thanks for that!) conferred some additional access that will be needed. However, I incorrectly assumed that the rights in Airflow would be conferred along the lines of what probably descends from puppet/hieradata/role/common/search/airflow.yaml (231ec7) for Search, where the rights are expanded in the Airflow UI for users apparently corresponding to airflow-search-admins in data.yaml.

In particular, it's useful to have the ability to access Browse > Audit Logs, Admin > Variables, and Admin > Connections, although having all seven of the Browse options and all six of the Admin options is useful. What I didn't fully consider was that, apparently, the analytics-platform-eng-admins mapping may not exist for this Airflow instance the same way it does for the Search instance in Puppet and the other configuration wireup. A classic hasty conclusion; my bad.

Based on the commit history between when I started the WIP of my access patch and the current tip of production on the puppet repo, I suspect there may be some configuration wireup in flight (or already landed) for T362788: Migrate Airflow to the dse-k8s cluster that requires some further tweaks.
I see three different Browse & Admin dropdown combos amongst airflow-search.wikimedia.org, airflow.wikimedia.org, and airflow-platform-eng.wikimedia.org for myself. What I'm hoping for here is for airflow-platform-eng.wikimedia.org to have the same level of access as what I see in airflow-search.wikimedia.org.
I'll share what I'm seeing.
airflow-search.wikimedia.org
airflow.wikimedia.org
(Doesn't have Audit Logs)
airflow-platform-eng.wikimedia.org
(Does have Audit Logs)
(Doesn't have the useful Admin options)
@dr0ptp4kt The most useful thing to check on each instance is the Your Profile link, which you can always access from the top-right icon on the page.
This will tell you what roles you have assigned. For example, I see this:
What you want is to see the Ops role assigned to you.
Each of the Airflow instances has a specific LDAP role that is assigned to the Ops role.
Ops users should be able to see the connections. For the airflow-platform-eng Airflow instance, this is airflow-platform-eng-ops
That group is assigned the Ops role here:
https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/dse-k8s-services/airflow-platform-eng/values-production.yaml#L5-L7
The rest of the roles are assigned according to the defaults here:
https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/airflow/values.yaml#L64-L72
So if you're not in the airflow-platform-eng-ops LDAP group, then you will be assigned the User role.
But we can see here that you are in that group.
```
btullis@seaborgium:~$ ldapsearch -x cn=airflow-platform-eng-ops
# extended LDIF
#
# LDAPv3
# base <dc=wikimedia,dc=org> (default) with scope subtree
# filter: cn=airflow-platform-eng-ops
# requesting: ALL
#

# airflow-platform-eng-ops, groups, wikimedia.org
dn: cn=airflow-platform-eng-ops,ou=groups,dc=wikimedia,dc=org
cn: airflow-platform-eng-ops
objectClass: groupOfNames
member: uid=bpirkle,ou=people,dc=wikimedia,dc=org
member: uid=cparle,ou=people,dc=wikimedia,dc=org
member: uid=daniel,ou=people,dc=wikimedia,dc=org
member: uid=dr0ptp4kt,ou=people,dc=wikimedia,dc=org
member: uid=fab,ou=people,dc=wikimedia,dc=org
member: uid=gmodena,ou=people,dc=wikimedia,dc=org
member: uid=hokwelum,ou=people,dc=wikimedia,dc=org
member: uid=htriedman,ou=people,dc=wikimedia,dc=org
member: uid=kevinbazira,ou=people,dc=wikimedia,dc=org
member: uid=mfossati,ou=people,dc=wikimedia,dc=org
member: uid=mlitn,ou=people,dc=wikimedia,dc=org
member: uid=resquito,ou=people,dc=wikimedia,dc=org
member: uid=sg912,ou=people,dc=wikimedia,dc=org
member: uid=tchin,ou=people,dc=wikimedia,dc=org
member: uid=xcollazo,ou=people,dc=wikimedia,dc=org

# search result
search: 2
result: 0 Success

# numResponses: 2
# numEntries: 1
```
So if you're not seeing the Ops role, then the first thing to do is try logging out of the CAS-SSO system by visiting https://idp.wikimedia.org/logout and then logging in again.
Also, if you go here: https://idp.wikimedia.org/login then you should be able to see which LDAP groups were retrieved when you logged in most recently, so you can check that the groups you expect to see are there.
Let us know how it goes - maybe we need some more documentation about how this works.
@BTullis thanks! IIRC I had previously logged out and logged back in after the LDAP grant without seeing the additional functionality, and when following your link a little while ago I still saw [User, Public] for the rights.
But, after reading your thorough explanation of the access mappings (very helpful, by the way!) and following the suggestion to log out and log back in again, I did that via the user icon in the top right of https://airflow-platform-eng.wikimedia.org/ and the CAS sign-in screen. Now I see all six of the Browse menu items and all seven of the Admin menu items, and the assignments at https://airflow-platform-eng.wikimedia.org/users/userinfo/ show [Op, User, Public]. That makes sense given the Helm mapping for the Op role.
So, this makes me wonder whether there was some synchronization delay after the role association was added in LDAP, or perhaps a stashed set of claims associated with a cookie. The session cookie in Airflow definitely seems to rotate between logging out and logging back in, but it's hard to say. In any case, I'm in now. Thanks again!
I am resolving this, as I think that we have a way forward here, but please feel free to reopen if you believe there to be more work on this ticket.