User Details
- User Since: Jan 4 2022, 1:16 PM
- Availability: Available
- LDAP User: Unknown
- MediaWiki User: AQuhen (WMF)
Wed, May 24
There is a second problem hidden behind the missing Scala lib: the Guava version mismatch between the one provided by Hadoop and the one included in eventutilities.
The Sqoop test is conclusive, and the patch could be merged right now.
Refinery-source no longer ships the Scala library: it was pulled in through wikihadoop, which is no longer included.
https://archiva.wikimedia.org/#artifact-dependencies/org.wikimedia/wikihadoop/0.3-wmf1
Tue, May 23
Thanks all for the reviews. Even though the DAG is working, it would be great to decide on a single source of truth for our datasets' metadata (see the sketch after this list). Right now, it is spread across:
- airflow-dags/../dataset.yml
- airflow-dags/../..._dag.py
- DataHub
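For illustration, here is a minimal sketch of what a single source of truth could look like: one YAML file that both the DAG definition and the DataHub emission read from. The file name, keys, and helper function below are hypothetical, not the current airflow-dags convention.

```python
# Hypothetical layout: one dataset.yml entry read by the DAG instead of
# duplicating the metadata in the *_dag.py file. Names and keys are assumptions.
from pathlib import Path

import yaml


def load_dataset_metadata(name: str, config_path: str = "dataset.yml") -> dict:
    """Return the metadata block for one dataset from the shared YAML file."""
    with Path(config_path).open() as f:
        datasets = yaml.safe_load(f)
    return datasets[name]


# Example dataset.yml content (hypothetical):
#   webrequest:
#     datastore: hive
#     table: wmf.webrequest
#     partitioning: "@hourly"
#
# In the DAG file, the same entry would drive both scheduling and lineage emission:
meta = load_dataset_metadata("webrequest")
print(meta["table"], meta["partitioning"])
```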
Mon, May 22
Do you know if there is a DB with the new schema version? It would be cool to have a place to test the import.
Mon, May 15
Here is a standardized version of the first iteration, for easy use by people without prior knowledge of DataHub: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/386
Thu, May 11
Some proposals for an immediate and more useful next step:
Wed, May 10
Update: I'm emitting metadata to Kafka from an ad-hoc Airflow data-lineage task. The configuration sets up the communication with Kafka and the schema registry (Karapace). The metadata is then correctly picked up by the mce-consumer service on the DataHub side. Now I'm looking into using the detailed version of the data-lineage event, which carries more information than just the upstream<>downstream link.
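For reference, a minimal sketch of that emission path using the acryl-datahub Python client: a simple upstream<>downstream lineage aspect is wrapped in a MetadataChangeProposal and sent to Kafka, where the mce-consumer picks it up. The broker address, Karapace URL, and dataset names below are placeholders, and exact class and method names can differ between client versions.

```python
# Sketch only: emit a simple lineage aspect to Kafka with the acryl-datahub client.
# Broker, schema-registry URL, and dataset names are assumed placeholders.
from datahub.emitter.kafka_emitter import DatahubKafkaEmitter, KafkaEmitterConfig
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

emitter = DatahubKafkaEmitter(
    KafkaEmitterConfig.parse_obj(
        {
            "connection": {
                "bootstrap": "kafka-broker.example:9092",          # placeholder broker
                "schema_registry_url": "https://karapace.example:8081",  # placeholder Karapace URL
            }
        }
    )
)

# One upstream dataset feeding one downstream dataset (the simple lineage link).
upstream = UpstreamClass(
    dataset=make_dataset_urn(platform="hive", name="upstream_db.upstream_table", env="PROD"),
    type=DatasetLineageTypeClass.TRANSFORMED,
)
mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="hive", name="downstream_db.downstream_table", env="PROD"),
    aspect=UpstreamLineageClass(upstreams=[upstream]),
)

emitter.emit(mcp)  # picked up by mce-consumer from the Kafka topic
emitter.flush()
```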
May 3 2023
I've checked the result on HDFS; it looks as expected.
Apr 13 2023
OK to separate the migration from this task.
Bug: There is an extra systemd check making sure SUCCESS files are generated:
https://github.com/wikimedia/operations-puppet/blob/fc98a524be9be65935b8d80b506ca33af5d442b2/modules/profile/manifests/analytics/refinery/job/data_check.pp#L27
Apr 11 2023
I like idea A because the conda env encapsulates all needed libs.
Apr 7 2023
I can confirm that the data now looks good in both:
Apr 6 2023
The data has been regenerated and should be pushed automatically to the web endpoint at 5 am UTC.
Mar 31 2023
Optionally, some updates to the Java code: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/904778
Mar 17 2023
No history was lost. Some DAGs have been renamed: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/commit/760f31789ee20f3e6e263fa4733ff51202fa52a0
Mar 6 2023
Here is the Airflow job: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/260
Feb 28 2023
In the description, I've added a list of jobs that look like dependencies.
Feb 23 2023
Note: we should create a new branch, main_airflow_2_5_1, in airflow-dags so the code is deployed only on the instances that have already been migrated. Hopefully, the migration shouldn't take long. Once it is finished, we can merge the branch into main and deploy from main again.