Mon, Sep 18
Mon, Sep 11
For Airflow DAGs, we are using the trusted runners provided by rel-eng.
Thu, Sep 7
I have done part of the refactor in this change: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/938941/2..3
- Add unit tests for the important parts of the code
- Add some new traits and classes, and rename some classes for better comprehension
Wed, Sep 6
What has been done as a first step:
- Custom partitioner POC
- First implementation
- Clarifying source & result expectation
Mon, Sep 4
One metric that could have been useful is the number of task retries.
Wed, Aug 30
Aug 17 2023
OK to move to Gitlab. 👍 I'm making it work first.
Jul 17 2023
I have the first draft version in Gerrit.
Jul 8 2023
Jul 6 2023
Jun 29 2023
Jun 27 2023
Three of our datasets are now going to use canonical.countries.is_protected:
- Cassandra AQS pageview_top_percountry_daily
- Cassandra AQS pageview_top_bycountry_monthly
- Hive geoeditors_public_monthly
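As a rough illustration of how a job might consume such a flag, here is a minimal sketch. It assumes the flag is used to exclude rows for protected countries before publishing; the `COUNTRIES` mapping, the `country_code` field, and the `drop_protected` helper are hypothetical and do not reflect the real table contents or the actual job code.

```python
from typing import Dict, List

# Hypothetical stand-in for canonical.countries: country code -> is_protected.
# The codes and flags below are placeholders, not real values.
COUNTRIES: Dict[str, bool] = {"AA": False, "BB": True, "CC": False}


def drop_protected(rows: List[dict], countries: Dict[str, bool]) -> List[dict]:
    """Keep only rows whose country is known and not flagged as protected.

    Assumption: a country code missing from the mapping is treated as
    protected, so unknown codes are dropped rather than published.
    """
    return [
        row for row in rows
        if not countries.get(row["country_code"], True)
    ]


rows = [
    {"country_code": "AA", "views": 10},
    {"country_code": "BB", "views": 5},   # protected, dropped
    {"country_code": "ZZ", "views": 1},   # unknown, dropped by assumption
]
public_rows = drop_protected(rows, COUNTRIES)
```

Treating unknown codes as protected is a conservative choice made for this sketch only; the real jobs may handle them differently.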
Documentation added here: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow/Developer_guide
Hello @lbowmaker, a patch has already been proposed for this ticket.
Jun 26 2023
Jun 21 2023
Jun 20 2023
Jun 19 2023
Jun 13 2023
With those patches, I'm proposing:
Jun 8 2023
Jun 7 2023
Should we merge those CI pipeline changes to make them the standard in workflow utils?
We have three new pipelines in this MR:
- Build image
- Run tests
- Build & publish artifacts into the registry
I have an MR with:
Jun 5 2023
Today we decided not to automate the tagging process.
May 30 2023
May 29 2023
May 25 2023
May 24 2023
There is a second problem hidden behind the missing Scala lib: the Guava version mismatch between the one provided by Hadoop and the one included in eventutilities.
The Sqoop test is conclusive, and the patch can be merged right now.
Refinery-source does not ship Scala anymore because it was included in wikihadoop, which is not included anymore.
May 23 2023
Thanks all for the reviews. Even though the DAG is working, it would be good to decide on a single source of truth for our dataset metadata. Right now, it's located in:
May 22 2023
Do you know if there is a DB with the new schema version? It would be cool to have a place to test the import.
May 16 2023
May 15 2023
Here is a standardized version of the first iteration, for easy use by people without knowledge of DataHub: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/386
May 11 2023
Some proposals for an immediate and more useful next step:
May 10 2023
Update: I'm emitting metadata to Kafka from an ad-hoc Airflow data lineage task. The configuration sets up the communication with Kafka and the schema registry, Karapace. The metadata is then correctly fetched by the mce-consumer service on the DataHub side. Now I'm looking into using the detailed version of the data lineage event, which contains more information than just the upstream-to-downstream link.
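To make the "upstream to downstream link" concrete, here is a minimal sketch of the kind of information such a lineage event carries. It uses only the standard library and a simplified dictionary payload, not DataHub's actual Avro schema or the emitter used in the task. The helper names and the example table names are hypothetical; only the dataset URN layout follows DataHub's documented convention.

```python
import json
from typing import Dict, List


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub-style dataset URN from platform, dataset name and env."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def lineage_event(downstream: str, upstreams: List[str]) -> Dict:
    """Simplified lineage payload: one downstream dataset and its upstreams.

    Assumption: this shape is illustrative only. A real event would be
    serialized against the schema held in the schema registry.
    """
    return {
        "downstream": downstream,
        "upstreams": [{"dataset": u, "type": "TRANSFORMED"} for u in upstreams],
    }


# Hypothetical table names, used only for illustration.
event = lineage_event(
    downstream=dataset_urn("hive", "example_db.target_table"),
    upstreams=[dataset_urn("hive", "example_db.source_table")],
)
payload = json.dumps(event, sort_keys=True)
```

The detailed version of the event would add fields on top of this basic link; they are not sketched here because the source does not list them.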
May 4 2023
May 3 2023
I've checked the result on HDFS. It performs as expected.
May 2 2023
Apr 24 2023
Apr 14 2023
Apr 13 2023
OK to separate the migration from this task.
Bug: There is an extra systemd check that makes sure SUCCESS files are generated:
Apr 12 2023
Apr 11 2023
I like idea A because the conda env encapsulates all the needed libs.