User Details
- User Since
- Jan 20 2024, 12:05 AM (116 w, 5 d)
- Availability
- Available
- IRC Nick
- amastilovic
- LDAP User
- Aleksandar Mastilovic
- MediaWiki User
- AMastilovic-WMF [ Global Accounts ]
Tue, Apr 7
It might be a coincidence that my officewiki session ended right around the time I changed my WikiTech password. As you correctly suspected, my password manager keeps passwords for both sites under the same entry, so the old OfficeWiki password was replaced with the new WikiTech password and I ended up logged out.
Can confirm that this resolved our Sqoop issue - thank you @Marostegui !
Mon, Apr 6
@Reedy do you happen to know which team I should talk to regarding SUL and/or OfficeWiki accounts?
Fri, Apr 3
@Mayakp.wiki yes this is for the backfill functionality, among other stuff!
Thu, Apr 2
I think @bd808 is on to something here. Right after I changed the password on my SUL account, officewiki logged me out and now I can't log back in.
Wed, Apr 1
Tue, Mar 31
Mon, Mar 30
Thu, Mar 26
Update: I've finally figured out the reason this was failing in Airflow - the skein driver was running out of memory and silently exiting. I've fixed the DAG: https://airflow.wikimedia.org/dags/dbt_demo/grid
Wed, Mar 25
Thu, Mar 19
@mpopov hey, even if the ticket is closed, I still think it'd be beneficial to leave a written trace of the optimizations Claude AI suggested for improving the performance of experiment queries, focusing on Presto-specific optimizations. I understand that experiment queries are assembled by GrowthBook itself and we don't have a way of modifying them, so only a few of these improvements are actually applicable: {F73158376}
For some reason the Phab link to GitLab doesn't seem to be working, so here's the related airflow-dags change that has already been merged: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2089
Tue, Mar 17
I'm now wondering, should we give Cosmos a try? It is supposed to split dbt models into Airflow tasks automatically.
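For reference, a rough sketch of what a Cosmos-backed DAG might look like, going off the Cosmos docs - the project path, profile name and target name below are placeholders, not our actual setup:

```python
# Sketch only: Cosmos parses the dbt project and renders each model as its
# own Airflow task, letting Airflow handle inter-model dependencies.
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig

dbt_demo_cosmos = DbtDag(
    # Placeholder paths/names - adjust to wherever the dbt project and
    # profiles.yml actually live.
    project_config=ProjectConfig("/srv/dbt-jobs/dbt_demo"),
    profile_config=ProfileConfig(
        profile_name="dbt_demo",
        target_name="prod",
        profiles_yml_filepath="/srv/dbt-jobs/profiles.yml",
    ),
    dag_id="dbt_demo_cosmos",
    start_date=datetime(2026, 3, 1),
    schedule="@daily",
    catchup=False,
)
```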
Mar 17 2026
I personally like the idea of running a single dbt run and letting dbt handle all the dependencies, but I can see how that could become complex to manage. What happens if the dbt job becomes very big and a single model fails? What if we need to backfill specific models?
As I understand it, orchestrating models would depend on submitting MRs to two repositories: dbt-jobs, to create the model, and airflow-dags, to define how the model is orchestrated in a DAG. I think we had also discussed having the schedule be configured in the dbt model metadata, where each pre-scheduled DAG would do dbt select to find the models to run; essentially the team only needs one MR to dbt-jobs to configure orchestration too, via the team's or model's metadata. While this (if feasible) makes the user experience a bit simpler, there are probably trade-offs here that I'm not seeing. How do the two approaches compare?
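To make the metadata-driven option a bit more concrete, here's a rough sketch of a pre-scheduled DAG that selects models by tag - the tag name, paths and dag_id are hypothetical, not a proposal for the actual repo layout:

```python
# Sketch only: each pre-scheduled DAG runs dbt with a tag selector, so a team
# opts a model into a schedule purely through its dbt metadata.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_daily_selector",
    start_date=datetime(2026, 3, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_daily_models = BashOperator(
        task_id="dbt_run_daily",
        # Picks up every model whose config declares tags: ["daily"].
        # Project path is a placeholder.
        bash_command="dbt run --select tag:daily --project-dir /srv/dbt-jobs",
    )
```

Under this scheme, the only MR a team would need is against dbt-jobs, to add the appropriate tag to the model's config.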
Mar 13 2026
Mar 10 2026
Mar 9 2026
@JMonton-WMF The microbatch incremental strategy looks exactly like dbt's answer to the common batching practice we employ. I've read some docs on it just now and it seems like we could easily fit it into our usual usage patterns, with the caveat that our models would need an event_time column, which they unfortunately sometimes lack.
Mar 7 2026
The Presto-Iceberg connection setup in GrowthBook had a request timeout of 170 seconds (2.83 minutes). When I tried to update the experiment queries' results, all the Presto queries got "user canceled" after 2.83 minutes, which means GrowthBook was canceling them. I've increased the Presto-Iceberg connection request timeout to 300 seconds (5 minutes) and re-run the experiment queries; they did finish successfully, but they barely made it in time (the longest took 4.6 minutes).
Mar 5 2026
Is this cleanup process something we should implement as part of the pipeline?
Feb 25 2026
Feb 24 2026
@Ottomata I'm not sure if it will work when using a Hive adapter, but it should work through a Spark adapter since it works from Jupyter's wmf.spark.run.
@Ottomata yeah I just ran that command above, via wmf.spark.run in my Jupyter notebook. The trick with the structs is that you have to provide a whole new struct and not just the new fields of the struct.
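For anyone hitting the same thing later, a hypothetical example of what such a command looks like - the table, column and field names are made up, and the exact DDL form depends on the Spark version and table format:

```python
# Illustrative only: the point is that the type clause must spell out the
# ENTIRE struct, existing fields included, not just the field being added.
import wmfdata as wmf  # wmf.spark.run executes Spark SQL from a Jupyter notebook

wmf.spark.run("""
    ALTER TABLE some_db.some_table
    CHANGE COLUMN user_info user_info
    STRUCT<user_id: BIGINT, user_text: STRING, user_central_id: BIGINT>
""")
```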
Feb 23 2026
A somewhat better way to manually fix this is to determine the difference between the Hive and Spark schemas, and apply ALTER TABLE ... ALTER COLUMN in Spark SQL to reflect what is in the Hive metastore.
Feb 20 2026
dbt doesn't allow multiple files with the same name, even if they live in different folders
What happens if there is a clash? Would dbt run fail? Can we detect this before orchestration with a linter/CI check or so?
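A simple CI step could catch the clash before anything is orchestrated. A minimal sketch - the models directory and where the script would live are assumptions about the monorepo layout, not decisions:

```python
# Fail CI if two model files share a basename anywhere in the dbt project,
# since dbt requires model names to be unique across folders.
import sys
from collections import defaultdict
from pathlib import Path


def find_duplicate_model_names(models_dir: str) -> dict:
    """Return {model_name: [paths]} for every name used by more than one file."""
    seen = defaultdict(list)
    for path in Path(models_dir).rglob("*.sql"):
        seen[path.stem].append(path)
    return {name: paths for name, paths in seen.items() if len(paths) > 1}


if __name__ == "__main__":
    duplicates = find_duplicate_model_names("models")  # placeholder path
    for name, paths in duplicates.items():
        print(f"Duplicate model name '{name}':")
        for p in paths:
            print(f"  {p}")
    sys.exit(1 if duplicates else 0)
```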
Some thoughts on model naming guidelines
Great writeup @JMonton-WMF ! A monorepo shared by all teams definitely sounds like the way to go about this. I’ll try to offer some concrete suggestions on how to organize the monorepo, and address some questions posted above.
Feb 12 2026
Thanks for working through these details, @JMonton-WMF and @Mayakp.wiki !
@amastilovic - I think that you can already install lots of different Python apps and packages, can't you? It's just that conda-analytics is currently the framework that we install at the operating system level to give people this functionality of creating virtual environments and easily switching between them.
I'd be in favor of the second option, installing Poetry (or pipx for that matter) on the Stat hosts. This would enable Stat machine users to safely install many different Python apps/packages, not just sqlfluff.
Mostly for reasons of expediency - inheriting from SimpleSkeinOperator was a much quicker and well-tested route. Also, SimpleSkeinOperator itself uses SkeinHook and its builder. I might be missing something, but from what I can see I would basically end up duplicating that same code in a DbtOperator.
Feb 6 2026
On the DPE side, I believe we need to cover the following items in order to support this new instance:
Weekly update from the Data Engineering team:
Jan 23 2026
The issue has now been fixed: https://airflow-platform-eng.wikimedia.org/dags/aggregate_for_fundraising_hourly/grid
Jan 22 2026
Jan 21 2026
@Ottomata I'm looking into that, thanks for the suggestion!
The way we run Spark via dbt is through the dbt-spark adapter, which supports 4 different ways of interacting with Spark: ODBC, Thrift, HTTP and session. We are using the session method, which effectively spins up a PySpark session to run the SQL commands. I guess in this way it's similar to our Jupyter notebooks.
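A rough illustration (not dbt-spark's actual code) of what the session method boils down to:

```python
# The compiled model SQL is executed inside an in-process PySpark session,
# much like running SQL from a Jupyter notebook.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dbt-spark-session-illustration").getOrCreate()

# dbt-spark would submit each compiled model's SQL statement here.
spark.sql("SELECT 1 AS smoke_test").show()
```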
Jan 20 2026
The dbt+skein test as outlined in this ticket has been performed successfully: https://airflow-test-k8s.wikimedia.org/dags/test_dbt_skein_dag/
Jan 16 2026
Jan 14 2026
This has been kindly completed by @brouberol in the above change, so I'm closing this ticket.
Dec 2 2025
Dec 1 2025
Nov 25 2025
Ditto what @xcollazo said above. In order to have the desired behavior for this pipeline job, I think you need:
Nov 24 2025
@Eevans thank you for that MR! You are correct, wiki_id should be TEXT - we've already implemented it in the Hive counterpart for that table: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1206879
Nov 18 2025
Nov 14 2025
OK so I've now officially backfilled the wmf_contributors. and wmf_readership. tables, but the process I had to use to keep the number of files small enough is complicated enough that it warrants being documented somewhere:
Nov 10 2025
Nov 7 2025
Oct 31 2025
We definitely already have the maven-checkstyle-plugin set up in the main pom.xml - I know because it's quite annoying: the codebase doesn't seem to conform to the style being checked, so each compile produces a ton of ERRORs in the output.
Oct 30 2025
That specific use case sounds like what dbt calls a microbatch incremental strategy, which replaces time intervals based on an event_time column: https://docs.getdbt.com/docs/build/incremental-microbatch
@JMonton-WMF I think we could use this task to add a .dbtignore file that will let dbt commands ignore the .ipynb_checkpoints folders: https://docs.getdbt.com/reference/dbtignore
insert_overwrite is what @JAllemandou is describing, perfect.
Oct 29 2025
Do we want only the Druid realtime configs in their own repo? Perhaps we want the batch ones in the same place?
My uninformed thought on this is that this should be a "Druid config stuff" repo, which would therefore include both realtime AND batch configs :)
Oct 28 2025
The user_id, user_central_id and page_id fields are now available both in the Hive dataset wmf.mediawiki_history_reduced and in the corresponding Druid dataset.
