
amastilovic (Aleksandar Mastilovic)
User

User Details

User Since
Jan 20 2024, 12:05 AM (116 w, 5 d)
Availability
Available
IRC Nick
amastilovic
LDAP User
Aleksandar Mastilovic
MediaWiki User
AMastilovic-WMF

Recent Activity

Tue, Apr 7

amastilovic added a comment to T421941: SUL session and password issues for AMastilovic-WMF.

It might be a coincidence that my OfficeWiki session ended right around the time I changed my WikiTech password. As you correctly suspected, my password manager keeps the passwords for both sites under the same entry, so the old OfficeWiki password was replaced with the new WikiTech password and I ended up logged out.

Tue, Apr 7, 4:46 PM · WMF-General-or-Unknown
amastilovic added a comment to T422459: Re-run maintainviews on all clouddb* and an-redacteddb1001.eqiad.wmnet.

Can confirm that this resolved our Sqoop issue - thank you @Marostegui !

Tue, Apr 7, 3:33 PM · cloud-services-team, Data-Services, Data-Engineering-Radar, DBA, Data-Engineering

Mon, Apr 6

amastilovic added a comment to T421941: SUL session and password issues for AMastilovic-WMF.

@Reedy do you happen to know which team I should talk to regarding SUL and/or OfficeWiki accounts?

Mon, Apr 6, 11:04 PM · WMF-General-or-Unknown
amastilovic created T422412: Improve the quality of Sqoop error logging.
Mon, Apr 6, 5:53 PM · Patch-For-Review, Data-Engineering (Q4 FS25/26 April 1st - June 30st)

Fri, Apr 3

amastilovic added a comment to T421789: Add support for variables to DbtSkeinOperator.

@Mayakp.wiki yes this is for the backfill functionality, among other stuff!

Fri, Apr 3, 12:32 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)

Thu, Apr 2

amastilovic added a comment to T421941: SUL session and password issues for AMastilovic-WMF.

I think @bd808 is on to something here. Right after I changed the password on my SUL account, officewiki logged me out and now I can't log back in.

Thu, Apr 2, 12:19 AM · WMF-General-or-Unknown

Wed, Apr 1

amastilovic created T422080: Remove the test DBT DAG from test_k8s Airflow.
Wed, Apr 1, 8:48 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)

Tue, Mar 31

amastilovic created T421941: SUL session and password issues for AMastilovic-WMF.
Tue, Mar 31, 5:15 PM · WMF-General-or-Unknown

Mon, Mar 30

amastilovic created T421789: Add support for variables to DbtSkeinOperator.
Mon, Mar 30, 10:24 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)

Thu, Mar 26

amastilovic updated the task description for T421434: Move all currently scheduled DBT DAGs to the `dbt_scheduled` Airflow DAGs.
Thu, Mar 26, 7:33 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
amastilovic created T421434: Move all currently scheduled DBT DAGs to the `dbt_scheduled` Airflow DAGs.
Thu, Mar 26, 7:33 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
amastilovic closed T420069: Schedule three new monthly DBT models for Movement Insights as Resolved.
Thu, Mar 26, 7:29 PM · OKR-Work (WE1 FY2025-26), Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic added a comment to T420069: Schedule three new monthly DBT models for Movement Insights.

Update: I've finally figured out the reason this was failing in Airflow - the skein driver was running out of memory and silently exiting. I've fixed the DAG: https://airflow.wikimedia.org/dags/dbt_demo/grid

Thu, Mar 26, 7:28 PM · OKR-Work (WE1 FY2025-26), Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Wed, Mar 25

amastilovic updated the task description for T419925: Build a set of configurable pre-scheduled DBT Airflow DAGs executing dbt-jobs models.
Wed, Mar 25, 9:53 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic closed T419594: Implement more fine-grained selection of DBT models in DbtSkeinOperator as Resolved.
Wed, Mar 25, 9:51 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic closed T410268: Run dbt from Airflow, a subtask of T406764: Provide a dbt-core development environment and production setup in the data-platform, as Resolved.
Wed, Mar 25, 9:51 PM · Patch-For-Review, Data-Engineering-Roadmap, Movement-Insights, Epic, Data-Platform-SRE
amastilovic closed T410268: Run dbt from Airflow as Resolved.
Wed, Mar 25, 9:51 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), OKR-Work, Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, Movement-Insights

Thu, Mar 19

amastilovic added a comment to T419286: GrowthBook experiment analysis keeps failing/stalling.

@mpopov hey even if the ticket is closed, I still think it'd be beneficial if I left a written trace of the optimizations that Claude AI suggested when it comes to improving the performance of experiment queries, focusing on Presto-specific optimizations. I understand that experiment queries are assembled together by GrowthBook itself and we don't have a way of modifying them, so only a few of these improvement{F73158376}

Thu, Mar 19, 9:39 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Test Kitchen
amastilovic added a comment to T420069: Schedule three new monthly DBT models for Movement Insights.

For some reason Phab link to GitLab doesn't seem to be working so here's the related airflow-dags change that's been merged already: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2089

Thu, Mar 19, 12:48 AM · OKR-Work (WE1 FY2025-26), Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic added a comment to T419925: Build a set of configurable pre-scheduled DBT Airflow DAGs executing dbt-jobs models.

https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2087

Thu, Mar 19, 12:46 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Tue, Mar 17

amastilovic added a comment to T419925: Build a set of configurable pre-scheduled DBT Airflow DAGs executing dbt-jobs models.

I'm now wondering, should we give a try to Cosmos? It is supposed to split dbt models into Airflow tasks automatically.

Tue, Mar 17, 3:35 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Mar 17 2026

amastilovic added a comment to T419925: Build a set of configurable pre-scheduled DBT Airflow DAGs executing dbt-jobs models.

I personally like the idea of running a single dbt run and letting dbt handle all the dependencies, but I see the point where that could become complex to manage. What happens if the dbt job becomes very big and a single model fails? What if we need to backfill specific models?

Mar 17 2026, 12:38 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic added a comment to T419925: Build a set of configurable pre-scheduled DBT Airflow DAGs executing dbt-jobs models.

As I understand it, orchestrating models would depend on submitting MRs to two repositories: dbt-jobs, to create the model, and airflow-dags, to define how the model is orchestrated in a DAG. I think we had also discussed having the schedule be configured in the dbt model metadata, where each pre-scheduled DAG would do dbt select to find the models to run; essentially the team only needs one MR to dbt-jobs to configure orchestration too, via the team's or model's metadata. While this (if feasible) makes the user experience a bit simpler, there are probably trade-offs here that I'm not seeing. How do the two approaches compare?

Mar 17 2026, 12:30 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Mar 13 2026

amastilovic created T420069: Schedule three new monthly DBT models for Movement Insights.
Mar 13 2026, 10:13 PM · OKR-Work (WE1 FY2025-26), Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic created T419925: Build a set of configurable pre-scheduled DBT Airflow DAGs executing dbt-jobs models.
Mar 13 2026, 12:15 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Mar 10 2026

amastilovic created T419594: Implement more fine-grained selection of DBT models in DbtSkeinOperator.
Mar 10 2026, 6:52 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Mar 9 2026

amastilovic added a comment to T419310: Create a custom DBT materialization macro.

@JMonton-WMF microbatch incremental strategy looks exactly like dbt's answer to the common batching practice we employ. I've read some docs on it just now and it seems like we could easily fit its usage into our common usage patterns, with the caveat that our models would need the event_time column which they sometimes lack unfortunately.

Mar 9 2026, 11:25 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)

Mar 7 2026

amastilovic added a comment to T419286: GrowthBook experiment analysis keeps failing/stalling.

The Presto-Iceberg connection setup in GrowthBook had a request timeout set to 170 seconds (2.83 minutes). When I tried to update the experiment queries' results, Presto queries all got "user canceled" after 2.83 minutes - which means that GrowthBook was canceling them. I've increased the Presto-Iceberg connection request timeout to 5 minutes/300 seconds, I've re-run the experiment queries and they did finish successfully, but they barely made it in time (longest took 4.6 mins).
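
The timing figures in this comment can be checked with a few lines of arithmetic. The sketch below only restates the numbers quoted above (170 s, 300 s, 4.6 min); the variable names are illustrative and not taken from GrowthBook or Presto.

```python
# Numbers are taken from the comment above; variable names are illustrative.
old_timeout_s = 170        # original request timeout, in seconds
new_timeout_s = 300        # increased request timeout, in seconds
longest_query_min = 4.6    # slowest successful query, in minutes

# Old timeout expressed in minutes (when the queries were being canceled).
old_timeout_min = round(old_timeout_s / 60, 2)

# Slowest query expressed in seconds, and the margin left under the new timeout.
longest_query_s = round(longest_query_min * 60)
headroom_s = new_timeout_s - longest_query_s

print(old_timeout_min, longest_query_s, headroom_s)
```

With these figures the slowest query leaves about 24 seconds of margin under the new timeout, which is consistent with the "barely made it" observation.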

Mar 7 2026, 2:29 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Test Kitchen
amastilovic created T419310: Create a custom DBT materialization macro.
Mar 7 2026, 12:59 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)

Mar 5 2026

amastilovic added a comment to T419121: druid_load_webrequest_sampled_live_hourly SerDe error in singular DAG run.

Is this cleanup process something we should implement as part of the pipeline?

Mar 5 2026, 6:42 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Feb 25 2026

amastilovic edited projects for T415283: Refactor pingback analytics pipeline, added: Data-Engineering (Q3 FY25/26 January 1st - March 31th); removed Data-Engineering.
Feb 25 2026, 11:17 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
amastilovic added a parent task for T418190: Refactor pingback reports pipelines using dbt: T415283: Refactor pingback analytics pipeline.
Feb 25 2026, 11:17 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
amastilovic added a subtask for T415283: Refactor pingback analytics pipeline: T418190: Refactor pingback reports pipelines using dbt.
Feb 25 2026, 11:17 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
amastilovic claimed T415283: Refactor pingback analytics pipeline.
Feb 25 2026, 11:16 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
amastilovic added a comment to T418190: Refactor pingback reports pipelines using dbt.

Nice! Just wondering, do we know who uses that output CSV and where?

I'm asking for my own education (I don't know much about pingback), but I'm also wondering: do we actually want to produce a CSV, or can we migrate its readers to reading a table at the end of this work?

Feb 25 2026, 11:16 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)

Feb 24 2026

amastilovic added a comment to T209453: Refine: Use Spark SQL instead of Hive JDBC.

@Ottomata I'm not sure if it will work when using a Hive adapter, but it should work through a Spark adapter since it works from Jupyter's wmf.spark.run.

Feb 24 2026, 2:25 AM · Data-Engineering, Data Pipelines
amastilovic created T418190: Refactor pingback reports pipelines using dbt.
Feb 24 2026, 2:03 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
amastilovic added a comment to T418065: spark-sql warns about mismatching table schema for event.EditAttemptStep.

So, the HiveMetastore had the fields, but Spark's version in TBLPROPERTIES did not, is that correct? If so, then that makes sense.

Feb 24 2026, 1:42 AM · Data-Engineering
amastilovic added a comment to T418065: spark-sql warns about mismatching table schema for event.EditAttemptStep.

@Ottomata yeah I just ran that command above, via wmf.spark.run in my Jupyter notebook. The trick with the structs is that you have to provide a whole new struct and not just the new fields of the struct.

Feb 24 2026, 1:40 AM · Data-Engineering

Feb 23 2026

amastilovic added a comment to T418065: spark-sql warns about mismatching table schema for event.EditAttemptStep.

A somewhat better way to manually fix this is to determine the difference in Hive vs Spark schemas, and apply ALTER TABLE ... ALTER COLUMN in Spark SQL to reflect what is in Hive metastore.
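
The "determine the difference" step could be sketched roughly as follows. This is a minimal illustration in plain Python, assuming each schema has already been read into a mapping of column name to type string. The function name, column names and types are hypothetical and are not taken from event.EditAttemptStep. Struct columns are compared as whole type strings, so a struct that gained a field shows up as a single mismatched column.

```python
def diff_schemas(hive_fields, spark_fields):
    """Compare two {column_name: type_string} mappings.

    Returns (missing, mismatched):
      missing    - columns present in Hive but absent from Spark's schema
      mismatched - columns present in both with different type strings,
                   mapped to (spark_type, hive_type)
    """
    missing = {
        name: hive_type
        for name, hive_type in hive_fields.items()
        if name not in spark_fields
    }
    mismatched = {
        name: (spark_fields[name], hive_type)
        for name, hive_type in hive_fields.items()
        if name in spark_fields and spark_fields[name] != hive_type
    }
    return missing, mismatched


# Hypothetical schemas: the Hive metastore knows about one extra struct field.
hive = {
    "event": "struct<action:string,new_field:string>",
    "dt": "string",
}
spark = {
    "event": "struct<action:string>",
    "dt": "string",
}

missing, mismatched = diff_schemas(hive, spark)
print(missing)
print(mismatched)
```

Each entry in `mismatched` then identifies a column whose full type, as recorded in Hive, would need to be applied on the Spark side.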

Feb 23 2026, 10:59 PM · Data-Engineering

Feb 20 2026

amastilovic added a comment to T416672: dbt repository structure (Milestone 3).

dbt doesn't allow multiple files with the same name, even if they live in different folders

What happens if there is a clash? Would the dbt run fail? Can we detect this before orchestration with a linter/CI check or so?

Feb 20 2026, 10:33 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Movement-Insights (FY25-26 H2)
amastilovic added a comment to T416672: dbt repository structure (Milestone 3).
Some thoughts on model naming guidelines
Feb 20 2026, 10:31 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Movement-Insights (FY25-26 H2)
amastilovic added a comment to T416672: dbt repository structure (Milestone 3).

Great writeup @JMonton-WMF ! A monorepo shared by all teams definitely sounds like the way to go about this. I’ll try to offer some concrete suggestions on how to organize the monorepo, and address some questions posted above.

Feb 20 2026, 8:30 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Movement-Insights (FY25-26 H2)

Feb 12 2026

amastilovic added a comment to T417152: Sqlfluff Rules for dbt.

Thanks for working through these details, @JMonton-WMF and @Mayakp.wiki !

Feb 12 2026, 5:57 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic added a comment to T417171: Sqlufluff on Stat hosts.

@amastilovic - I think that you can already install lots of different Python apps and packages, can't you? It's just that conda-analytics is currently the framework that we install at the operating system level to give people this functionality of creating virtual environments and easily switching between them.

Feb 12 2026, 5:45 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic added a comment to T417171: Sqlufluff on Stat hosts.

I'd be in favor of the second option, installing Poetry (or pipx for that matter) on the Stat hosts. This would enable Stat machine users to safely install many different Python apps/packages, not just sqlfluff.

Feb 12 2026, 1:54 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic added a comment to T415194: Create a `DbtSkeinOperator` in the Airflow `wmf_airflow_common` library.

I see you ended up going with DbtSkeinOperator inheriting from SimpleSkeinOperator. I'm curious to learn why you chose that over a DbtOperator and using the SkeinHook and SkeinHookBuilder to implement this.

Mostly for the reasons of expediency - inheriting from SimpleSkeinOperator was a much quicker and well-tested route. Also, SimpleSkeinOperator itself is using SkeinHook and its builder. I might be missing something but from what I could see I would basically end up duplicating that same code in a DbtOperator.

Feb 12 2026, 12:25 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic updated the task description for T415194: Create a `DbtSkeinOperator` in the Airflow `wmf_airflow_common` library.
Feb 12 2026, 12:17 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Feb 6 2026

amastilovic added a comment to T416709: Airflow instance for Experiment Platform.

On DPE side I believe we need to cover the following items in order to support this new instance:

Feb 6 2026, 9:24 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Test Kitchen, Data-Engineering
amastilovic added a comment to T408146: [Hypothesis] WE1.5.1 Contributor metrics dashboard.

Weekly update from the Data Engineering team:

Feb 6 2026, 5:25 PM · OKR-Work (WE1 FY2025-26), Movement-Insights (FY25-26 H1)

Jan 23 2026

amastilovic added a comment to T415267: aggregate_for_fundraising_hourly failing for last 24 hours.

The issue has now been fixed: https://airflow-platform-eng.wikimedia.org/dags/aggregate_for_fundraising_hourly/grid

Jan 23 2026, 9:48 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Jan 22 2026

amastilovic claimed T415267: aggregate_for_fundraising_hourly failing for last 24 hours.
Jan 22 2026, 3:57 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic created T415275: Put aggregate_for_fundraising.hql into refinery.
Jan 22 2026, 3:30 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Jan 21 2026

amastilovic added a comment to T415194: Create a `DbtSkeinOperator` in the Airflow `wmf_airflow_common` library.

@Ottomata I'm looking into that, thanks for the suggestion!

Jan 21 2026, 3:50 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic added a subtask for T410268: Run dbt from Airflow: T415194: Create a `DbtSkeinOperator` in the Airflow `wmf_airflow_common` library.
Jan 21 2026, 2:04 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), OKR-Work, Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, Movement-Insights
amastilovic added a parent task for T415194: Create a `DbtSkeinOperator` in the Airflow `wmf_airflow_common` library: T410268: Run dbt from Airflow.
Jan 21 2026, 2:04 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic created T415194: Create a `DbtSkeinOperator` in the Airflow `wmf_airflow_common` library.
Jan 21 2026, 1:50 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic closed T405040: Global Editor Metrics - backfill pageview metric data, a subtask of T403660: WE3.3.7 Year in Review and Activity Tab Services - Global Editor Metrics, as Resolved.
Jan 21 2026, 1:38 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), OKR-Work, MediaWiki-Page-derived-data, Growth-Team, Wikipedia-Android-App-Backlog, Wikipedia-iOS-App-Backlog
amastilovic closed T405040: Global Editor Metrics - backfill pageview metric data as Resolved.
Jan 21 2026, 1:38 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), OKR-Work, MediaWiki-Page-derived-data
amastilovic added a comment to T414784: Test the dbt+skein approach to running dbt Spark jobs in K8s.

I know close to zero about dbt, but if dbt is launching a spark job, then these are probably two separate settings. SimpleSkeinOperator will only request a single YARN container (the application master) in which it will run a command. For our Spark Operators, launcher=skein will create a single node YARN application in which it will run the spark-submit command.

The way we run Spark via dbt is through the dbt-spark adapter, which supports 4 different ways of interacting with Spark: ODBC, Thrift, HTTP and session. We are using the session method which effectively spins up a pyspark session to run the SQL commands. I guess in this way it's similar to our Jupyter notebooks.

Jan 21 2026, 1:36 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic edited projects for T414784: Test the dbt+skein approach to running dbt Spark jobs in K8s, added: Data-Engineering (Q3 FY25/26 January 1st - March 31th); removed Data-Engineering (Q2 FY25/26 October 1st - December 31th).
Jan 21 2026, 12:57 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic added a comment to T414784: Test the dbt+skein approach to running dbt Spark jobs in K8s.
  • About profiles.yml, I think we could consider the profiles.yml a default profile for local development, and maybe overwrite it from Airflow before running dbt.
Jan 21 2026, 12:54 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Jan 20 2026

amastilovic moved T409601: Review and productionize the WME differential privacy data set from In progress to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Jan 20 2026, 4:39 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
amastilovic moved T414714: Add data-steward-alerts mail to anomaly_detection_traffic_distribution_daily DAG from In progress to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Jan 20 2026, 4:38 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic moved T405039: Global Editor Metrics - Data Pipeline from In progress to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Jan 20 2026, 4:14 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, OKR-Work, MediaWiki-Page-derived-data
amastilovic moved T405040: Global Editor Metrics - backfill pageview metric data from Ready to Deploy to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Jan 20 2026, 4:13 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), OKR-Work, MediaWiki-Page-derived-data
amastilovic moved T406069: Global Editor Metrics - Druid mediawiki_history_reduced changes from In progress to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Jan 20 2026, 4:13 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), OKR-Work, MediaWiki-Page-derived-data
amastilovic moved T414107: Inventory of SystemD timer based jobs and pipelines from Next Up to In progress on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Jan 20 2026, 4:12 PM · Essential-Work, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic moved T414109: Technical assessment of AQS framework from Next Up to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Jan 20 2026, 4:10 PM · AQS2.0, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic updated the task description for T414784: Test the dbt+skein approach to running dbt Spark jobs in K8s.
Jan 20 2026, 1:43 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic added a comment to T414784: Test the dbt+skein approach to running dbt Spark jobs in K8s.

The dbt+skein test as outlined in this ticket has been performed successfully: https://airflow-test-k8s.wikimedia.org/dags/test_dbt_skein_dag/

Jan 20 2026, 1:42 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Jan 16 2026

amastilovic created T414784: Test the dbt+skein approach to running dbt Spark jobs in K8s.
Jan 16 2026, 10:54 AM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Jan 14 2026

amastilovic closed T411536: Set a custom From: email address for alerts from Airflow dev instances as Resolved.
Jan 14 2026, 2:49 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
amastilovic added a comment to T411536: Set a custom From: email address for alerts from Airflow dev instances.

This has been kindly completed by @brouberol in the above change, so I'm closing this ticket.

Jan 14 2026, 2:49 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Dec 2 2025

amastilovic created T411536: Set a custom From: email address for alerts from Airflow dev instances.
Dec 2 2025, 6:08 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Dec 1 2025

amastilovic moved T400283: Clean up airflow-dags gitlab-ci.yaml CI/CD pipelines from In progress to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 1 2025, 4:54 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
amastilovic moved T410285: SDS 1.3.6 SPUR bot detection - Productionize SPUR datasets import from In progress to In Review on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 1 2025, 4:44 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
amastilovic moved T409601: Review and productionize the WME differential privacy data set from Next Up to In progress on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 1 2025, 4:32 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
amastilovic moved T410688: Implement a new pipeline and table with reconciled historical revision data from Urgent to In Review on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 1 2025, 4:25 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Nov 25 2025

amastilovic closed T409782: Update thresholds configuration for MediaWiki History Reduced error checks as Resolved.
Nov 25 2025, 9:55 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
amastilovic added a comment to T377023: Add CI step to event schema repositories to test to fail if a schema is deleted.

Ditto what @xcollazo said above. In order to have the desired behavior for this pipeline job, I think you need:

Nov 25 2025, 2:05 AM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Event-Platform
amastilovic created T410972: Requesting access to cassandra-staging-devs group for amastilovic.
Nov 25 2025, 1:04 AM · SRE, SRE-Access-Requests

Nov 24 2025

amastilovic added a comment to T410962: Provision Global Editor Metrics tables & endpoints.

@Eevans thank you for that MR! You are correct, wiki_id should be TEXT - we've already implemented it in the Hive counterpart for that table: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1206879

Nov 24 2025, 10:50 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence

Nov 18 2025

amastilovic added a comment to T409782: Update thresholds configuration for MediaWiki History Reduced error checks.

Addressed in https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1794

Nov 18 2025, 1:58 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Nov 14 2025

amastilovic updated subscribers of T405039: Global Editor Metrics - Data Pipeline.

OK so I've now officially backfilled the wmf_contributors.* and wmf_readership.* tables, but the process I had to use in order for the number of files to be small enough is complicated enough that it warrants being documented somewhere:

Nov 14 2025, 7:37 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, OKR-Work, MediaWiki-Page-derived-data

Nov 10 2025

amastilovic created T409782: Update thresholds configuration for MediaWiki History Reduced error checks.
Nov 10 2025, 8:44 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Nov 7 2025

amastilovic updated the task description for T409514: Migrate Sqoop jobs to Airflow.
Nov 7 2025, 3:02 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)
amastilovic created T409514: Migrate Sqoop jobs to Airflow.
Nov 7 2025, 1:04 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)

Oct 31 2025

amastilovic added a comment to T408942: Add code styles rules to analytics-refinery-source.

We definitely already have the maven-checkstyle-plugin set up in the main pom.xml - I know because it's very annoying since the codebase doesn't seem to conform to the style being checked, and on each compile it produces a ton of ERRORs in output.

Oct 31 2025, 9:39 PM · Data-Engineering, Essential-Work

Oct 30 2025

amastilovic added a comment to T408687: Create example dbt models using Iceberg.

That specific use-case sounds like what dbt calls a microbatch incremental strategy that replaces time intervals given the event_time column: https://docs.getdbt.com/docs/build/incremental-microbatch
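
The idea of replacing time intervals keyed on an event_time column can be illustrated conceptually. The sketch below is plain Python, not dbt code and not dbt's implementation. It assumes daily batches and rows stored as dictionaries with an `event_time` key, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta


def daily_batches(start, end):
    """Yield half-open (batch_start, batch_end) day windows covering [start, end)."""
    cursor = start
    while cursor < end:
        nxt = min(cursor + timedelta(days=1), end)
        yield cursor, nxt
        cursor = nxt


def replace_batch(table, batch_start, batch_end, new_rows):
    """Drop rows whose event_time falls in the window, then add the recomputed rows."""
    kept = [r for r in table if not (batch_start <= r["event_time"] < batch_end)]
    fresh = [r for r in new_rows if batch_start <= r["event_time"] < batch_end]
    return kept + fresh


# Two daily windows between Oct 1 and Oct 3.
windows = list(daily_batches(datetime(2025, 10, 1), datetime(2025, 10, 3)))

table = [
    {"event_time": datetime(2025, 10, 1, 5), "value": 1},
    {"event_time": datetime(2025, 10, 2, 5), "value": 2},
]
recomputed = [{"event_time": datetime(2025, 10, 1, 6), "value": 10}]

# Reprocess only the first window; rows from the second day are left untouched.
first_start, first_end = windows[0]
table = replace_batch(table, first_start, first_end, recomputed)
print([row["value"] for row in table])
```

Because each window is replaced as a unit, re-running one batch does not touch rows outside its time interval, which is the property that makes per-interval backfills straightforward.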

Oct 30 2025, 8:18 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Movement-Insights, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Epic
amastilovic added a comment to T407322: Create dbt folder structure.

@JMonton-WMF I think we could use this task to include a .dbtignore file that will let dbt commands ignore the .ipynb_checkpoint folders: https://docs.getdbt.com/reference/dbtignore

Oct 30 2025, 5:06 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
amastilovic added a comment to T408687: Create example dbt models using Iceberg.

insert_overwrite is what @JAllemandou is describing, perfect.

Oct 30 2025, 4:46 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Movement-Insights, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Epic

Oct 29 2025

amastilovic added a comment to T407994: Move Druid realtime configuration out of Refinery into standalone repo on GitLab.

Do we want only Druid realtime configs in their own repo? Perhaps we want the batch ones in the same place?

My uninformed thought on this is that this should be a "Druid config stuff" repo, which would therefore include both realtime AND batch configs :)

Oct 29 2025, 11:42 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Data-Platform-SRE, SRE

Oct 28 2025

amastilovic added a comment to T406263: mediawiki_history_reduced - add page_id and user_central_id fields.

user_id, user_central_id and page_id fields are now available in both the Hive dataset wmf.mediawiki_history_reduced as well as in the corresponding Druid dataset.

Oct 28 2025, 8:26 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data
amastilovic updated the task description for T406263: mediawiki_history_reduced - add page_id and user_central_id fields.
Oct 28 2025, 8:25 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data
amastilovic updated the task description for T406263: mediawiki_history_reduced - add page_id and user_central_id fields.
Oct 28 2025, 8:20 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data

Oct 22 2025

amastilovic claimed T365648: Add user_central_id to mediawiki_history and mediawiki_history_reduced Hive tables.
Oct 22 2025, 11:55 PM · Data-Engineering, Data Pipelines
amastilovic added a comment to T406766: Add dbt related packages to conda-analytics.

Yes, that's true. But conda-analytics isn't necessarily a long-term solution. I'd be much happier to start out with a container based solution as per: T406636: Create a dbt Docker container but container runtimes are not available to us on the stat servers at the moment.

At least this way, we will have something uniform to work with already on the stat servers.

Oct 22 2025, 11:33 PM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Data-Engineering (Q2 FY25/26 October 1st - December 31th)
amastilovic updated the task description for T407994: Move Druid realtime configuration out of Refinery into standalone repo on GitLab.
Oct 22 2025, 7:09 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Data-Platform-SRE, SRE
amastilovic created T407994: Move Druid realtime configuration out of Refinery into standalone repo on GitLab.
Oct 22 2025, 3:39 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Data-Platform-SRE, SRE