User Details
- MediaWiki User: JMonton-WMF (Global Accounts)
- User Since: Oct 1 2025, 2:42 PM (19 w, 2 d)
- Availability: Available
- LDAP User: Unknown
Today
If installing these tools is a concern, and we don't need to provide sqlfluff to many people, maybe we can just add an explanation of how to install it in a new virtual environment, or as part of a Makefile, so anyone who needs it can install it only for themselves.
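The per-user install described above could look roughly like this Makefile target (a sketch: the target name, venv path, and Python invocation are all assumptions, not an agreed convention; recipe lines must be tab-indented):

```make
# Hypothetical target: install sqlfluff into a local virtual
# environment instead of system-wide, only for whoever runs it.
.PHONY: sqlfluff-env
sqlfluff-env:
	python3 -m venv .sqlfluff-venv
	.sqlfluff-venv/bin/pip install --upgrade pip sqlfluff
	@echo "Run with: .sqlfluff-venv/bin/sqlfluff lint <path>"
```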
Yesterday
For most of the rules, we can take the ones used in Airflow, but there are some rules that @Mayakp.wiki would like to change a bit.
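For reference, rule overrides like the ones mentioned would live in a `.sqlfluff` file at the repository root; the dialect, excluded rule, and policy below are placeholders, not the team's actual choices:

```ini
[sqlfluff]
dialect = hive            ; placeholder dialect
exclude_rules = L034      ; example: disable a rule the team disagrees with

[sqlfluff:rules:capitalisation.keywords]
capitalisation_policy = upper   ; example override
```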
Wed, Feb 11
Mon, Feb 9
Let me make sure I understand: when we say "subfolders by projects", these are not distinct dbt projects right? The repository itself is the dbt project, and the subfolders are units of organization that the team defines?
That's right, there is a single dbt project, folders are just ways of organizing models.
Fri, Feb 6
There are many different ways of configuring a dbt project to work with multiple teams and projects. Here is a list of some features that allow different configurations, and a proposal.
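As one illustration of the "single project, subfolders per team" setup discussed here, model subdirectories can carry their own settings in `dbt_project.yml` (the project and team names below are invented):

```yaml
# dbt_project.yml (fragment) -- one dbt project, per-team subfolders
name: analytics            # assumed project name
models:
  analytics:
    team_a:                # models/team_a/...
      +schema: team_a      # write this folder's models to a team schema
      +tags: ["team_a"]
    team_b:                # models/team_b/...
      +schema: team_b
      +tags: ["team_b"]
```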
Mon, Feb 2
Fri, Jan 30
Weekly update from the Data Engineering team:
Tue, Jan 27
@Ottomata I've updated the MR to the latest structure inside the Python folder. I'm assuming the version .dev0 would be set in the helm files, so I'll start working on it.
Maybe the MR could be merged after a review, and we can iterate on it with the new schema once it's decided in T415158.
Mon, Jan 26
Fri, Jan 23
Weekly update from the Data Engineering team:
Tue, Jan 20
Just a couple of comments:
Fri, Jan 16
Weekly update from the Data Engineering team:
Jan 8 2026
The topics now have 3 partitions. We are not changing the number of tasks on Flink for now, but this change will allow us to parallelize better when needed.
Many thanks @MoritzMuehlenhoff !
Dec 19 2025
Weekly update from the Data Engineering team:
Dec 18 2025
Dec 17 2025
Dec 12 2025
Weekly update from the Data Engineering team:
Thanks for the info @xcollazo. This is actually a good example for this ticket: even if we increase to 20 TaskManagers temporarily, throughput is limited by the topic having only 1 partition. Having 3 partitions should help in similar scenarios.
Dec 11 2025
After some conversations, we think there are different paths related to this:
I think this might be a bit more complex than it seems. I've been looking at the import events in MediaWiki, and it seems that there are 2 different fields:
- EventType: Which can be PageCreated, PageDeleted, PageRevisionUpdated, etc.
- Cause: Which can be edit, move, delete, import, rollback, etc.
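To illustrate why the two fields matter, a filter for imported pages would have to look at the cause rather than the event type. The class and values below are only a sketch of the two fields described above, not the actual event schema:

```python
from dataclasses import dataclass

@dataclass
class PageEvent:
    # Two independent dimensions, as described above:
    event_type: str  # e.g. "PageCreated", "PageDeleted", "PageRevisionUpdated"
    cause: str       # e.g. "edit", "move", "delete", "import", "rollback"

def is_import(event: PageEvent) -> bool:
    """An import is identified by its cause, not its event type: an
    import can surface as a PageCreated *or* a PageRevisionUpdated."""
    return event.cause == "import"

events = [
    PageEvent("PageCreated", "edit"),
    PageEvent("PageCreated", "import"),
    PageEvent("PageRevisionUpdated", "import"),
]
imported = [e for e in events if is_import(e)]
```

Filtering on `event_type` alone would either miss imports that arrive as revision updates or sweep in ordinary edits.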
Dec 10 2025
Dec 9 2025
Good point @GGoncalves-WMF, I got lost after many Slack messages. I'll keep Java and Maven as "decided", and we can discuss more about the use cases.
After a conversation in Slack, it looks like the majority of the people agree with at least a few things: We'll do the migration, and we'll use Java and Maven. I'll create some subtasks describing the work needed.
Dec 5 2025
Weekly update from the Data Engineering team:
Completely agree with that @elukey, thanks for the help!
Dec 4 2025
That's interesting.
In general, I wouldn't reduce the number of partitions of a topic to slow down a pipeline; I assume we could slow the pipeline down by not increasing the parallelism of the Flink application, or by adding some kind of back-off system to the process. But I understand the concern. So, at least it seems that it doesn't make sense to increase the number of workers of the PyFlink application, and if we don't need that, one of the reasons to increase partitions also goes away.
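The "back-off system" mentioned above could be as simple as exponentially growing delays between processing attempts. This generic sketch is not tied to PyFlink; all names and defaults are illustrative:

```python
def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   max_delay: float = 60.0, attempts: int = 5):
    """Yield exponentially growing sleep durations, capped at max_delay.
    A consumer could sleep for these between attempts to pace itself,
    instead of shrinking the topic's partition count."""
    delay = base
    for _ in range(attempts):
        yield min(delay, max_delay)
        delay *= factor

delays = list(backoff_delays())
# delays == [1.0, 2.0, 4.0, 8.0, 16.0]
```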
Hi @elukey!
We don't need this often, to be honest; maybe it's more about being able to help the DP SRE team with small tasks rather than giving them more work, but I totally understand the concerns.
In case this is approved, I've created the patch I believe is needed to help with the process.
We had a discussion on Slack: we'll start with 3 partitions. That should help balance the storage across the brokers and speed up the process, and we don't need many partitions because this process runs only once a month.
Dec 3 2025
Another option: in the past I worked with https://github.com/devshawn/kafka-gitops to manage all topic settings from CI/CD, and it worked pretty well. The repo seems a bit old now, but it does the job.
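For context, kafka-gitops describes the desired cluster state in a YAML file that CI then applies. This fragment follows the layout from the project's README; the topic name is reused from this thread and the settings are illustrative:

```yaml
# state.yaml -- desired-state file applied by kafka-gitops from CI
topics:
  eqiad.mediawiki.page_content_change.v1:
    partitions: 3
    replication: 3
    configs:
      retention.ms: "604800000"   # example: 7 days
```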
A question we haven't solved yet is: How many partitions should we use?
Nov 28 2025
Weekly update from the Data Engineering team:
Nov 26 2025
Nov 25 2025
Thanks both! I've added the changes. Now it requires a manual execution, and it can be bypassed by setting CI_BREAK_GLASS_REASON.
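A break-glass rule like the one described could be expressed in `.gitlab-ci.yml` roughly as follows. Only the `CI_BREAK_GLASS_REASON` variable comes from this thread; the job name, stage, and check script are placeholders:

```yaml
check-schema-deletions:
  stage: test
  when: manual                 # requires a manual execution
  script:
    - |
      if [ -n "$CI_BREAK_GLASS_REASON" ]; then
        echo "Bypassing deletion check: $CI_BREAK_GLASS_REASON"
        exit 0
      fi
    - ./run-deletion-check.sh  # hypothetical check script
```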
Nov 24 2025
Aiming to move this check to CI while keeping it standard in the jsonschema-tools repository, I'd like to go with this proposal:
Nov 21 2025
A few days ago I created a new CI check in an MR to check for deletions, but @Ottomata suggested that it could be moved to the jsonschema-tools repository.
The issue with this MR is that it would need to be created on every repository using JSON schemas.
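The deletion check itself can be a small script over `git diff --name-status` output. This standalone sketch only parses the diff text; the function name, suffix default, and the CI invocation mentioned in the docstring are assumptions:

```python
def deleted_schemas(name_status: str, suffix: str = ".yaml") -> list[str]:
    """Return schema files marked 'D' (deleted) in `git diff --name-status`
    output. In CI the input would come from something like
    `git diff --name-status origin/main...HEAD` (assumed invocation)."""
    deleted = []
    for line in name_status.splitlines():
        parts = line.split("\t")
        if len(parts) == 2 and parts[0] == "D" and parts[1].endswith(suffix):
            deleted.append(parts[1])
    return deleted

diff = "M\tjsonschema/a/current.yaml\nD\tjsonschema/b/current.yaml\nA\tREADME.md"
removed = deleted_schemas(diff)
```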
Weekly update from the Data Engineering team:
Hi all,
I have some room to work on Event Platform tasks and I could take the work done on this one and try to push it to the finish line.
As the ticket is a bit old, I'd like to confirm if this is still needed.
Nov 17 2025
That sounds good. Then we could consider increasing the partitions in Jumbo too: codfw.mediawiki.page_content_change.v1 and eqiad.mediawiki.page_content_change.v1 both have 3 partitions right now. I'm not sure if any consumer in Jumbo relies on message ordering that could be affected by the change in partitions; I'm guessing not.
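On the ordering question: Kafka only guarantees order within a partition, and the default partitioner assigns a keyed message to a partition by hashing the key modulo the partition count, so increasing the count re-maps some keys. This toy model uses a stand-in hash (Kafka actually uses murmur2) and invented keys, just to show the mechanism:

```python
def partition_for(key: str, num_partitions: int) -> int:
    """Toy stand-in for Kafka's default partitioner: a given key always
    lands on the same partition *for a fixed partition count*."""
    return sum(key.encode()) % num_partitions

keys = ["enwiki:123", "dewiki:456", "frwiki:789"]
before = {k: partition_for(k, 3) for k in keys}  # mapping with 3 partitions
after = {k: partition_for(k, 6) for k in keys}   # mapping after an increase
# Keys whose partition changed lose ordering guarantees across the resize:
moved = [k for k in keys if before[k] != after[k]]
```

A consumer that only needs per-key ordering going forward is unaffected; one that replays across the resize boundary could see a key's messages split between its old and new partition.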
Nov 14 2025
@Ottomata, what work is required to produce mediawiki.page_content_change.v1 to Kafka main? I'm expecting just some helmfile changes, for example in mw-page-content-change-enrich/values-codfw.yaml, without requiring changes to the mediawiki event enrichment code, right?
Hive tables event.mediawiki_wikistories_consumption_event and event.mediawiki_wikistories_contribution_event, along with the associated data in HDFS, have been removed.
