
mforns (Marcel Ruiz Forns)
Software Engineer @ Analytics

User Details

User Since
Nov 7 2014, 8:52 PM (420 w, 1 d)
Availability
Available
IRC Nick
mforns
LDAP User
Mforns
MediaWiki User
Unknown

Recent Activity

Mon, Nov 21

mforns updated the task description for T282035: Catalog, Categorize, and Templetize existing scheduled workflows.
Mon, Nov 21, 3:23 PM · Data Pipelines, Data-Engineering, Platform Engineering

Mon, Nov 14

mforns updated the task description for T318367: Create Plan for Spark 2 Deprecation.
Mon, Nov 14, 2:31 PM · Data Pipelines (Sprint 04), Shared-Data-Infrastructure, Data-Engineering-Planning
mforns updated subscribers of T298777: Prevent gaps when providing banner impression data of WMDE fundraising banners.

Hi @kai.nissen!
Since the last comments in this task, there has been progress on the Airflow front.
We in Data Engineering are aiming to replace all our legacy scheduling systems with Airflow.
We have already migrated approximately half of our data pipelines to it, and other teams are using it as well.
Airflow allows scheduling at any necessary interval, such as every 15 minutes.
Also, with Airflow, you can specify data dependencies, which is much more robust against gaps, and also offers easy re-running of jobs.
Would you be interested in being able to schedule jobs with Airflow as well?
If so, let us know! You can ping @EChetty, and we can discuss if it would be possible.
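
To illustrate why data dependencies are more robust against gaps than purely time-based scheduling, here is a minimal sketch in plain Python. It is not Airflow's API; the function names and the sets of ready inputs and finished outputs are hypothetical. It lists the 15-minute intervals whose input data exists but whose output is still missing, which are exactly the intervals a dependency-aware scheduler could (re-)run.

```python
from datetime import datetime, timedelta


def expected_intervals(start, end, step=timedelta(minutes=15)):
    """Yield the start of every interval in [start, end)."""
    current = start
    while current < end:
        yield current
        current += step


def runnable_gaps(start, end, inputs_ready, outputs_done):
    """Return intervals whose input exists but whose output is still missing."""
    return [
        t for t in expected_intervals(start, end)
        if t in inputs_ready and t not in outputs_done
    ]


# Hypothetical state: inputs exist for 00:00, 00:15 and 00:30,
# but only the 00:00 interval has been processed so far.
start = datetime(2022, 11, 14, 0, 0)
end = datetime(2022, 11, 14, 1, 0)
inputs_ready = {start + timedelta(minutes=15 * i) for i in range(3)}
outputs_done = {start}

gaps = runnable_gaps(start, end, inputs_ready, outputs_done)
```

The 00:45 interval is not returned because its input is not ready yet; it would be picked up on a later pass instead of being silently skipped.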

Mon, Nov 14, 1:48 PM · WMDE Requirements on WMF needing Prioritisation, Fundraising-Analysis, Fundraising-Backlog, WMDE-FUN-Team, WMDE-Fundraising-Tech

Tue, Nov 8

mforns created T322690: Add support for repository artifacts in Airflow.
Tue, Nov 8, 8:30 PM · Data Pipelines
mforns renamed T322534: Spike: Product Analytics ETL options - Timebox 1 Sprint. from Spike: Notebook Schedular options - Timebox 1 Sprint. to Spike: Product Analytics ETL options - Timebox 1 Sprint..
Tue, Nov 8, 4:21 PM · Data Pipelines (Sprint 04)

Mon, Nov 7

mforns created T322545: wmf.virtualpageview_hourly's language_variant field is corrupted.
Mon, Nov 7, 1:41 PM · Data Pipelines

Wed, Nov 2

mforns added a comment to T316049: Unify all Product Analytics ETL jobs.

I'd like to advise against this decision a bit. Although solution 1 might seem simpler, I can see some problems that could hit us, for instance:

Wed, Nov 2, 8:02 PM · Product-Analytics (Kanban), Epic
mforns updated the task description for T318367: Create Plan for Spark 2 Deprecation.
Wed, Nov 2, 4:59 PM · Data Pipelines (Sprint 04), Shared-Data-Infrastructure, Data-Engineering-Planning
mforns added a comment to T318367: Create Plan for Spark 2 Deprecation.

Of course @Miriam! just wanted to know whether those were all. Let me know if I can help! Cheers

Wed, Nov 2, 4:30 PM · Data Pipelines (Sprint 04), Shared-Data-Infrastructure, Data-Engineering-Planning
mforns updated subscribers of T321960: Presto returns incorrect data for an added field.
Wed, Nov 2, 3:46 PM · Data Pipelines (Sprint 04), Data-Engineering-Planning, Product-Analytics
mforns added a comment to T321960: Presto returns incorrect data for an added field.

Hm, this seems like something we should prioritize... Let's bring this up next Monday at our sprint planning!

Wed, Nov 2, 3:46 PM · Data Pipelines (Sprint 04), Data-Engineering-Planning, Product-Analytics
mforns added a comment to T318367: Create Plan for Spark 2 Deprecation.

And thank you @Miriam as well! I saw your team added a couple jobs. Would that be all? If so I will close the list :-)

Wed, Nov 2, 3:43 PM · Data Pipelines (Sprint 04), Shared-Data-Infrastructure, Data-Engineering-Planning
mforns added a comment to T318367: Create Plan for Spark 2 Deprecation.

Thank you @mpopov!!

Wed, Nov 2, 3:33 PM · Data Pipelines (Sprint 04), Shared-Data-Infrastructure, Data-Engineering-Planning

Mon, Oct 31

mforns created T322036: Implement periodical cleaning of Airflow databases.
Mon, Oct 31, 3:14 PM · Data-Engineering-Planning, Data Pipelines
mforns moved T304852: Reduce the number of files generated by geoeditors airflor jobs from Next Up to In Progress on the Data Pipelines (Sprint 03) board.
Mon, Oct 31, 3:01 PM · Data Pipelines, Data-Engineering
mforns claimed T304852: Reduce the number of files generated by geoeditors airflor jobs.
Mon, Oct 31, 3:01 PM · Data Pipelines, Data-Engineering
mforns added a comment to T321960: Presto returns incorrect data for an added field.

Interesting, @Ottomata!
It says here that the fix for the bug you pasted was included in release 0.258. So, we already have that fix in prod, right? It seems very similar to me, though.

Mon, Oct 31, 2:49 PM · Data Pipelines (Sprint 04), Data-Engineering-Planning, Product-Analytics

Fri, Oct 28

mforns created T321925: Allow Cormac Parle and Marco Fossati to deploy analytics-platform-eng Airflow instance.
Fri, Oct 28, 5:23 PM · Data Pipelines (Sprint 04), Data-Engineering-Planning
mforns added a comment to T314131: Some reliability metrics missing since June 20th '22.

Yes, if we had implemented the DAG differently, re-running would be a task that Airflow users could easily do!
However, this particular DAG (and a couple of others) follows a pattern that makes partial re-runs difficult.
We plan to change those DAGs to a better structure and add the documentation to our Airflow developer guide.

Fri, Oct 28, 2:33 PM · Data Pipelines (Sprint 03), Data-Engineering-Planning, Wikidata Analytics, Wikidata

Oct 27 2022

mforns added a comment to T314131: Some reliability metrics missing since June 20th '22.

I've created a task to specifically tackle the back-filling: https://phabricator.wikimedia.org/T321838

Oct 27 2022, 4:32 PM · Data Pipelines (Sprint 03), Data-Engineering-Planning, Wikidata Analytics, Wikidata
mforns created T321838: Back-fill Wikidata reliability Graphite metrics.
Oct 27 2022, 4:30 PM · Data-Engineering-Planning, Data Pipelines
mforns updated the task description for T318367: Create Plan for Spark 2 Deprecation.
Oct 27 2022, 4:07 PM · Data Pipelines (Sprint 04), Shared-Data-Infrastructure, Data-Engineering-Planning

Oct 26 2022

mforns added a comment to T316049: Unify all Product Analytics ETL jobs.

Thanks @mpopov for the summary!

Oct 26 2022, 9:26 PM · Product-Analytics (Kanban), Epic
mforns added a comment to T317549: [airflow] Normalize the use of timeouts in Airflow DAGs.

@xcollazo This would be for the analytics and the analytics_test instances.
Although, if other teams are following our developer guide docs, they might do it as well, but that is their decision to make, I think!

Oct 26 2022, 9:09 PM · Data Pipelines (Sprint 03), Data-Engineering-Planning
mforns added a comment to T314131: Some reliability metrics missing since June 20th '22.

Hi @Michael! Yes, we will back-fill as much as we can.
I have to talk to the team tomorrow to see how we want to approach that, since that particular Airflow DAG is not easy to re-run partially...
I'll keep you posted!

Oct 26 2022, 7:14 PM · Data Pipelines (Sprint 03), Data-Engineering-Planning, Wikidata Analytics, Wikidata

Oct 24 2022

mforns created T321506: refinery scap deployment to thin nodes is broken.
Oct 24 2022, 5:11 PM · Data-Engineering-Planning, Data Pipelines

Oct 21 2022

mforns added a comment to T317549: [airflow] Normalize the use of timeouts in Airflow DAGs.

Here's the documentation about timeouts in Airflow's developer guide:
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow/Developer_guide#Timeouts

Oct 21 2022, 3:50 PM · Data Pipelines (Sprint 03), Data-Engineering-Planning
mforns added a comment to T317549: [airflow] Normalize the use of timeouts in Airflow DAGs.

This is the MR that removes the timeout from the existing DAGs:
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/176

Oct 21 2022, 3:38 PM · Data Pipelines (Sprint 03), Data-Engineering-Planning
mforns added a comment to T317549: [airflow] Normalize the use of timeouts in Airflow DAGs.

We had a discussion with the team and here's a summary:

Oct 21 2022, 3:20 PM · Data Pipelines (Sprint 03), Data-Engineering-Planning

Oct 18 2022

mforns added a comment to T317549: [airflow] Normalize the use of timeouts in Airflow DAGs.

Some thoughts to start a conversation:

Oct 18 2022, 2:24 PM · Data Pipelines (Sprint 03), Data-Engineering-Planning
mforns added a comment to T319440: Deploy a PostgreSQL service for Airflow to use.

Thanks @BTullis!

Oct 18 2022, 12:36 PM · Data Pipelines (Sprint 04), Patch-For-Review

Oct 17 2022

mforns updated subscribers of T317549: [airflow] Normalize the use of timeouts in Airflow DAGs.
Oct 17 2022, 8:04 PM · Data Pipelines (Sprint 03), Data-Engineering-Planning
mforns added a comment to T317549: [airflow] Normalize the use of timeouts in Airflow DAGs.

After some reading, here's what I learned:

Oct 17 2022, 8:03 PM · Data Pipelines (Sprint 03), Data-Engineering-Planning
mforns moved T317549: [airflow] Normalize the use of timeouts in Airflow DAGs from Ready to In Progress on the Data Pipelines (Sprint 03) board.
Oct 17 2022, 7:17 PM · Data Pipelines (Sprint 03), Data-Engineering-Planning
mforns claimed T317549: [airflow] Normalize the use of timeouts in Airflow DAGs.
Oct 17 2022, 7:17 PM · Data Pipelines (Sprint 03), Data-Engineering-Planning
mforns added a comment to T318367: Create Plan for Spark 2 Deprecation.

Here's the link to the spreadsheet with the list of Spark2 jobs still running in the Hadoop cluster.
https://docs.google.com/spreadsheets/d/1j1HzjebGU61mDRRMS8Lyx7DpYiqQrxaM09HL5QRPRfc

Oct 17 2022, 7:08 PM · Data Pipelines (Sprint 04), Shared-Data-Infrastructure, Data-Engineering-Planning
mforns added a comment to T319440: Deploy a PostgreSQL service for Airflow to use.

Thanks for all the notes!!
One question: I don't understand what replacing an-db100[1-2] with an-mariadb100[1-2] means. Does it mean that we will use the already provisioned machines an-db100[1-2] for PostgreSQL and then purchase 2 new machines, an-mariadb100[1-2], for all the MariaDB databases? (Sorry, I don't have permission to view T319437.)

Oct 17 2022, 4:25 PM · Data Pipelines (Sprint 04), Patch-For-Review

Oct 11 2022

mforns moved T314131: Some reliability metrics missing since June 20th '22 from In Review to Ready to Deploy on the Data Pipelines (Sprint 02) board.
Oct 11 2022, 1:38 PM · Data Pipelines (Sprint 03), Data-Engineering-Planning, Wikidata Analytics, Wikidata

Oct 10 2022

mforns moved T314131: Some reliability metrics missing since June 20th '22 from Ready to In Review on the Data Pipelines (Sprint 02) board.
Oct 10 2022, 8:23 PM · Data Pipelines (Sprint 03), Data-Engineering-Planning, Wikidata Analytics, Wikidata

Oct 3 2022

mforns added a comment to T305841: Migrate unique devices jobs.

Regarding the back-filling of the correct computation of unique devices data:
We have already re-run (back-filled) the correct metrics since 1st of July. This is the earliest we can back-fill, since older source data is not available any more.

Oct 3 2022, 4:23 PM · Data Pipelines (Sprint 02), Data-Engineering-Planning
mforns added a comment to T305841: Migrate unique devices jobs.

The unique devices jobs have been migrated to Airflow successfully!

Oct 3 2022, 4:21 PM · Data Pipelines (Sprint 02), Data-Engineering-Planning

Sep 28 2022

mforns added a comment to T305841: Migrate unique devices jobs.

GitLab changes associated with this task:
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/140
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/163

Sep 28 2022, 8:09 PM · Data Pipelines (Sprint 02), Data-Engineering-Planning
mforns added a comment to T318849: analytics-dumps-fetch-unique_devices.service failing on dumps servers.

The permissions of the unique devices dumps have been restored.

Sep 28 2022, 7:05 PM · Analytics, Dumps-Generation, cloud-services-team (Kanban)

Sep 12 2022

mforns created T317549: [airflow] Normalize the use of timeouts in Airflow DAGs.
Sep 12 2022, 4:45 PM · Data Pipelines (Sprint 03), Data-Engineering-Planning

Sep 7 2022

mforns added a comment to T314131: Some reliability metrics missing since June 20th '22.

I've looked a bit into this and I think I found what's happening.
Indeed the metrics query is able to gather the data correctly, but the metrics do not reach Graphite.
The reason is the HiveToGraphite Spark job is failing when sending the metrics to Graphite, because the values of the metrics are doubles.

22/09/07 00:33:14 ERROR HiveToGraphite: java.lang.Double cannot be cast to java.lang.Long. Failed to send message to Graphite.

HiveToGraphite expects that the metric values are longs, and not doubles.
Although the queries do not explicitly specify the double type, my suspicion is that the percentile_approx calculation in some metrics outputs a double,

percentile_approx(time_firstbyte, 0.5) as metric_count,

which after the UNION statement affects all the results (all the metric values become doubles since they share the same column). But maybe I'm wrong!
In any case, we have to modify the query file to make sure the output values are compatible with the type long.
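
As a rough illustration of the type problem, the sketch below coerces a metric value to an integer before formatting it in Graphite's plaintext format (`<path> <value> <timestamp>`). This is not the HiveToGraphite code, which is not shown here; the function names, the metric path and the timestamp are hypothetical.

```python
def to_long(value):
    """Round a numeric metric value to the nearest integer.

    Python 3 rounds exact ties to the nearest even integer.
    """
    return int(round(value))


def graphite_line(path, value, timestamp):
    """Format one metric in Graphite's plaintext protocol, with an integral value."""
    return f"{path} {to_long(value)} {int(timestamp)}\n"


# A percentile computation typically yields a float even for whole numbers.
line = graphite_line("example.reliability.time_firstbyte_p50", 123.0, 1662508800)
```

The same effect could probably be achieved on the query side by casting the result, for example `CAST(percentile_approx(time_firstbyte, 0.5) AS BIGINT)`, so that every branch of the UNION yields a long-compatible column.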

Sep 7 2022, 7:07 PM · Data Pipelines (Sprint 03), Data-Engineering-Planning, Wikidata Analytics, Wikidata
mforns moved T314131: Some reliability metrics missing since June 20th '22 from Ready to In Progress on the Data Pipelines (Sprint 01) board.
Sep 7 2022, 4:06 PM · Data Pipelines (Sprint 03), Data-Engineering-Planning, Wikidata Analytics, Wikidata
mforns claimed T314131: Some reliability metrics missing since June 20th '22.
Sep 7 2022, 4:05 PM · Data Pipelines (Sprint 03), Data-Engineering-Planning, Wikidata Analytics, Wikidata
mforns added a comment to T315674: Remove materialized .json files from event schema repositories.

My understanding is that yaml is the only format actually in use. If so, I think there is no point in supporting both; I'd lean towards the path of least resistance (even if suboptimal) and deprecate json support.

Sep 7 2022, 1:54 PM · Event-Platform Value Stream (Sprint 02), Data-Engineering-Planning

Sep 5 2022

mforns added a comment to T316746: Fix `refinery-drop-older-than` script for end-of-month/end-of-year.

I think it does belong to data pipelines. Probably infrastructure, since it's the tool (and not any particular job) that is failing.
It's not tech debt, it's a bug that will break production jobs at the end of the month :D
We should fix this before then!

Sep 5 2022, 1:40 PM · Data Pipelines (Sprint 03), Data-Engineering-Planning

Sep 2 2022

mforns added a comment to T316572: Triage and Report on Unique Devices Data Issue.

The process has been documented in
https://docs.google.com/document/d/1Aj2oOZvwb6D6lm89XU_T9brtXApz5b7R00UtbzEiYkI/edit

Sep 2 2022, 12:52 PM · Data Pipelines

Aug 25 2022

mforns added a comment to T315329: Update Search Engine list.

Code and changes make sense to me @Isaac!
I'll confirm with the team, but feel free to create a patch for that if you feel like it!

Aug 25 2022, 2:13 PM · Data Pipelines

Aug 24 2022

mforns added a comment to T305841: Migrate unique devices jobs.

During the initial phase of this task we had some issues: the queries that currently compute unique devices metrics in Hive (with the Oozie job) didn't work in Spark 3. The data is quite skewed, and Spark has difficulties with it since it computes everything in memory (as opposed to Hive, which can handle big skewed data more robustly).

Aug 24 2022, 12:57 PM · Data Pipelines (Sprint 02), Data-Engineering-Planning

Aug 22 2022

mforns placed T250845: Anomaly detection alarms for the edit event stream up for grabs.
Aug 22 2022, 12:59 PM · Data-Engineering
mforns added a comment to T310542: [Airflow] Refactor HDFSArchiveOperator to run in Skein.

Another option here is to use spark-submit instead of java to run the Archiver.
I know it's using a gun to kill a fly, but it's there...

Aug 22 2022, 12:59 PM · Data Pipelines (Sprint 02), Data-Engineering-Planning

Aug 17 2022

mforns moved T305841: Migrate unique devices jobs from Ready to In Progress on the Data Pipelines (Sprint 00) board.
Aug 17 2022, 3:14 PM · Data Pipelines (Sprint 02), Data-Engineering-Planning
mforns claimed T305841: Migrate unique devices jobs.
Aug 17 2022, 3:14 PM · Data Pipelines (Sprint 02), Data-Engineering-Planning
mforns added a comment to T290303: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy.

🙏 🙏 🙏

Aug 17 2022, 2:57 PM · MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), Metrics-Platform-Planning, Patch-For-Review, Data-Engineering-Kanban, Wikidata-Termbox, Wikidata-Campsite, wdwb-tech, Wikidata, Event-Platform Value Stream

Aug 16 2022

mforns created T315326: [Airflow] Add log rotation to scheduler logs.
Aug 16 2022, 2:45 PM · Data-Engineering-Planning, Data Pipelines

Aug 11 2022

mforns added a comment to T290303: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy.

Thanks a lot @phuedx!

Aug 11 2022, 2:50 PM · MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), Metrics-Platform-Planning, Patch-For-Review, Data-Engineering-Kanban, Wikidata-Termbox, Wikidata-Campsite, wdwb-tech, Wikidata, Event-Platform Value Stream

Jul 29 2022

mforns added a comment to T311976: Investigate why airflow sensor tasks fail without sending errors.

Since we deleted some airflow logs under an-launcher1002:/srv/analytics-airflow/logs this issue has not happened.
Also, every time there was a sensor silent failure, at least one of the sensor's log files was missing (Airflow couldn't find it).
From this, and some team conversations, we suspect that the silent failures could be caused by the logs filling up the /srv disk in an-launcher1002.
So, I was looking at the logs and found out that an Airflow bug is polluting the logs (and breaking SLAs), see: https://phabricator.wikimedia.org/T314181#8116392
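
If old log files are indeed what fills the disk, one possible mitigation is to prune them periodically. The sketch below only lists candidate files and deletes nothing. The function name and the age threshold are hypothetical, and this is not an existing Airflow or WMF tool.

```python
import os
import time


def find_stale_logs(root, max_age_days, now=None):
    """Return paths under root last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)
```

A caller could review the returned list before removing anything, for example `find_stale_logs("/srv/analytics-airflow/logs", 30)`.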

Jul 29 2022, 8:15 PM · Data Pipelines, Data-Engineering-Planning, Data-Engineering-Kanban
mforns added a comment to T314181: Airflow does not send SLA emails nor update SLA misses in the db.

While troubleshooting T311976 I found out that *all* DAGs are silently failing at the scheduler stage.
They are outputting logs like this every couple seconds:

[2022-07-29 19:19:54,652] {processor.py:552} ERROR - Error executing SlaCallbackRequest callback for file: /srv/deployment/airflow-dags/analytics/analytics/dags/aqs/aqs_hourly_dag.py
Traceback (most recent call last):
  File "/usr/lib/airflow/lib/python3.7/site-packages/airflow/dag_processing/processor.py", line 545, in execute_callbacks
    self.manage_slas(dagbag.dags.get(request.dag_id))
  File "/usr/lib/airflow/lib/python3.7/site-packages/airflow/utils/session.py", line 70, in wrapper
    return func(*args, session=session, **kwargs)
  File "/usr/lib/airflow/lib/python3.7/site-packages/airflow/dag_processing/processor.py", line 413, in manage_slas
    if following_schedule + task.sla < timezone.utcnow():
TypeError: unsupported operand type(s) for +: 'DateTime' and 'NoneType'
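
The TypeError at the end of the traceback can be reproduced in isolation: adding `None` to a datetime fails in the same way when a task has no SLA set. The sketch below uses the standard library's `datetime` rather than the `DateTime` type named in the traceback, and `sla_missed` is a hypothetical helper, not Airflow's actual fix.

```python
from datetime import datetime, timedelta, timezone


def sla_missed(following_schedule, sla, now):
    """Return True when the SLA deadline has passed; tasks without an SLA never miss."""
    if sla is None:
        return False
    return following_schedule + sla < now


now = datetime(2022, 7, 29, 19, 19, tzinfo=timezone.utc)
scheduled = datetime(2022, 7, 29, 18, 0, tzinfo=timezone.utc)

# Unguarded version: the same kind of failure as in the log above.
try:
    scheduled + None
    unguarded_error = None
except TypeError as exc:
    unguarded_error = str(exc)

# Guarded version: a missing SLA is simply skipped.
no_sla = sla_missed(scheduled, None, now)
one_hour_sla = sla_missed(scheduled, timedelta(hours=1), now)
```
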

This has several effects:

Jul 29 2022, 8:06 PM · Data-Engineering-Planning
mforns renamed T314181: Airflow does not send SLA emails nor update SLA misses in the db from [airflow] Airflow does not send SLA emails nor update SLA misses in the db to Airflow does not send SLA emails nor update SLA misses in the db.
Jul 29 2022, 7:51 PM · Data-Engineering-Planning
mforns moved T314181: Airflow does not send SLA emails nor update SLA misses in the db from Ready to In progress on the Data-Engineering-Planning (Sprint 02) board.
Jul 29 2022, 7:50 PM · Data-Engineering-Planning
mforns edited projects for T314181: Airflow does not send SLA emails nor update SLA misses in the db, added: Data-Engineering-Planning (Sprint 02); removed Data-Engineering-Planning.
Jul 29 2022, 7:49 PM · Data-Engineering-Planning
mforns created T314181: Airflow does not send SLA emails nor update SLA misses in the db.
Jul 29 2022, 7:48 PM · Data-Engineering-Planning
mforns updated subscribers of T312514: Check home/HDFS leftovers of aniketars.

Heya @Miriam :]
I underestimated the size of the data, I'm sorry.
I've moved the part of aniketars' data that was in their HDFS home folder over to hdfs://user/mirrys/aniketars (still HDFS, since it is too big to move to your stat1005 home folder).
To move the data on stat boxes I need the help of an SRE (root access).

Jul 29 2022, 6:35 PM · Data-Engineering-Planning

Jul 27 2022

mforns added a comment to T312514: Check home/HDFS leftovers of aniketars.

@Miriam Sure! I can copy them to your home folder, and then when you confirm you have everything, I will delete the original ones.
In which of your home directories do you want me to put these files? HDFS? Or any particular stat machine?
Also, can you give me your username? Couldn't find it!
Cheers!

Jul 27 2022, 3:15 PM · Data-Engineering-Planning

Jul 26 2022

mforns created T313834: Add meta.wikidata to the pageview allow-list.
Jul 26 2022, 5:30 PM · Data-Engineering-Planning (Sprint 02)
mforns added a comment to T313816: Add nokafor to receive analytics-alerts emails and have sudo -u hdfs rights in hdfs.

Arfff, of course, sorry for that.
Thanks for adding nokafor to the team's alerts!

Jul 26 2022, 3:56 PM · Data-Engineering
mforns added a comment to T313816: Add nokafor to receive analytics-alerts emails and have sudo -u hdfs rights in hdfs.

When trying to execute sudo -u hdfs hdfs dfs -ls in stat1008.eqiad.wmnet she gets the error:

ls: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "stat1008/10.64.5.35"; destination host is: "an-master1001.eqiad.wmnet":8020;

She ran kinit beforehand and has a valid ticket!

Jul 26 2022, 2:59 PM · Data-Engineering
mforns updated the task description for T313816: Add nokafor to receive analytics-alerts emails and have sudo -u hdfs rights in hdfs.
Jul 26 2022, 2:58 PM · Data-Engineering
mforns created T313816: Add nokafor to receive analytics-alerts emails and have sudo -u hdfs rights in hdfs.
Jul 26 2022, 2:56 PM · Data-Engineering

Jul 25 2022

mforns added a comment to T312514: Check home/HDFS leftovers of aniketars.

These are the files that belonged to Aniket Bharti,
can you please confirm whether they should be deleted or moved to another location?

Jul 25 2022, 6:38 PM · Data-Engineering-Planning

Jul 15 2022

mforns updated the task description for T290303: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy.
Jul 15 2022, 7:16 PM · MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), Metrics-Platform-Planning, Patch-For-Review, Data-Engineering-Kanban, Wikidata-Termbox, Wikidata-Campsite, wdwb-tech, Wikidata, Event-Platform Value Stream
mforns updated the task description for T290303: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy.
Jul 15 2022, 7:16 PM · MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), Metrics-Platform-Planning, Patch-For-Review, Data-Engineering-Kanban, Wikidata-Termbox, Wikidata-Campsite, wdwb-tech, Wikidata, Event-Platform Value Stream
mforns moved T290303: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy from In progress to Ready to deploy on the Data-Engineering-Planning (Sprint 01) board.
Jul 15 2022, 4:59 PM · MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), Metrics-Platform-Planning, Patch-For-Review, Data-Engineering-Kanban, Wikidata-Termbox, Wikidata-Campsite, wdwb-tech, Wikidata, Event-Platform Value Stream

Jul 11 2022

mforns updated the task description for T290303: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy.
Jul 11 2022, 11:14 PM · MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), Metrics-Platform-Planning, Patch-For-Review, Data-Engineering-Kanban, Wikidata-Termbox, Wikidata-Campsite, wdwb-tech, Wikidata, Event-Platform Value Stream

Jul 5 2022

phuedx awarded T290303: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy a Mountain of Wealth token.
Jul 5 2022, 2:20 PM · MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), Metrics-Platform-Planning, Patch-For-Review, Data-Engineering-Kanban, Wikidata-Termbox, Wikidata-Campsite, wdwb-tech, Wikidata, Event-Platform Value Stream

Jun 30 2022

mforns moved T311763: Tech All Hands - Airflow Presentation from Ready to In progress on the Data-Engineering-Planning (Sprint 01) board.
Jun 30 2022, 6:54 PM · Data-Engineering-Planning (Sprint 01)

Jun 28 2022

mforns moved T307935: [Airflow] Proof of concept of Cassandra loading from Discussed (Tracking) to Done on the Data Pipelines board.
Jun 28 2022, 6:22 PM · Data-Engineering-Kanban, Data-Engineering, Data Pipelines
mforns renamed T307935: [Airflow] Proof of concept of Cassandra loading from Migrate Cassandra pageview-per-project-hourly Job to [Airflow] Proof of concept of Cassandra loading.
Jun 28 2022, 6:21 PM · Data-Engineering-Kanban, Data-Engineering, Data Pipelines
mforns moved T307937: SparkSubmitOperator should make it easier to use conda dist envs from Estimated to Done on the Data Pipelines board.
Jun 28 2022, 6:16 PM · Data-Engineering, Data Pipelines

Jun 24 2022

mforns created T311315: [Wikistats] Add newly translated languages.
Jun 24 2022, 4:11 PM · Data Pipelines, Analytics-Wikistats, Data-Engineering-Planning

Jun 23 2022

mforns added a comment to T301568: [Airflow] Research, discuss and decide on DAG/task dependencies VS. success/failure files (Oozie style).

I think the decision depends on research that we have not yet done. We should do a time-boxed spike to test cascading of DAG/task dependencies, taking into consideration how those feed into the data catalog.
Alternatively, we could just continue the migration without creating DAG/task dependencies (like we've done so far), and then in the future, if we need to improve the pipeline dependency management, we can do the spike and modify the jobs if feasible!
I lean a bit towards the latter, but let me know what you think, team!

Jun 23 2022, 3:48 PM · Data Pipelines, Data-Engineering

Jun 13 2022

mforns created T310542: [Airflow] Refactor HDFSArchiveOperator to run in Skein.
Jun 13 2022, 7:25 PM · Data Pipelines (Sprint 02), Data-Engineering-Planning
mforns moved T300054: [Airflow] Add DAG subfolder name to error email's subject from Discussed (Tracking) to In Review on the Data Pipelines board.
Jun 13 2022, 3:45 PM · Data Pipelines, Data-Engineering
mforns added a parent task for T300054: [Airflow] Add DAG subfolder name to error email's subject: T309993: Spark 3 Migration .
Jun 13 2022, 3:44 PM · Data Pipelines, Data-Engineering
mforns added a subtask for T309993: Spark 3 Migration : T300054: [Airflow] Add DAG subfolder name to error email's subject.
Jun 13 2022, 3:44 PM · Data-Engineering-Kanban, Data-Engineering, Epic, Data Pipelines
mforns moved T308049: Airflow Job for Ingesting: Hive Metadata into DataHub from Estimated to In Review on the Data Pipelines board.
Jun 13 2022, 3:41 PM · Data-Catalog
mforns moved T308050: Airflow Job for Ingesting Kafka Metadata into DataHub from Estimated to In Review on the Data Pipelines board.
Jun 13 2022, 3:41 PM · Data-Catalog
mforns moved T308051: Airflow Job for Ingesting Druid Metadata into DataHub from Estimated to In Review on the Data Pipelines board.
Jun 13 2022, 3:41 PM · Data-Catalog
mforns moved T308767: Fix api_daily job from In Review to Done on the Data Pipelines board.
Jun 13 2022, 3:40 PM · Patch-For-Review, Data Pipelines, Data-Engineering-Kanban, Data-Engineering
mforns moved T306955: Spark3 migration - Currently existing airflow jobs from In Review to Done on the Data Pipelines board.
Jun 13 2022, 3:40 PM · Data Pipelines, Data-Engineering-Kanban, Data-Engineering
mforns moved T306955: Spark3 migration - Currently existing airflow jobs from Estimated to In Review on the Data Pipelines board.
Jun 13 2022, 3:40 PM · Data Pipelines, Data-Engineering-Kanban, Data-Engineering
mforns moved T309993: Spark 3 Migration from Estimated to In Review on the Data Pipelines board.
Jun 13 2022, 3:40 PM · Data-Engineering-Kanban, Data-Engineering, Epic, Data Pipelines

Jun 8 2022

JAllemandou awarded T309563: [Airflow] URLSensor might be preventing alerts to fire correctly a Hungry Hippo token.
Jun 8 2022, 6:50 PM · Data-Engineering-Planning (Sprint 02), Data Pipelines

Jun 6 2022

mforns moved T309718: [Airflow] Migrate Oozie's mediawiki_history_load jobs to Airflow from Discussed (Tracking) to Estimated on the Data Pipelines board.
Jun 6 2022, 3:40 PM · Data-Engineering-Kanban, Data Pipelines
mforns moved T308766: Fix airflow interlanguage job from In Review to Done on the Data Pipelines board.
Jun 6 2022, 3:38 PM · Data Pipelines, Data-Engineering-Kanban, Data-Engineering
mforns moved T307540: Migrate 1+ reportupdater jobs from Estimated to Discussed (Tracking) on the Data Pipelines board.
Jun 6 2022, 3:38 PM · Data-Engineering, Data Pipelines
mforns moved T307505: Migrate 1+ Refine jobs from Estimated to Discussed (Tracking) on the Data Pipelines board.
Jun 6 2022, 3:38 PM · Data-Engineering, Data Pipelines