User Details
- User Since
- Jan 20 2024, 12:05 AM (98 w, 1 d)
- Availability
- Available
- IRC Nick
- amastilovic
- LDAP User
- Aleksandar Mastilovic
- MediaWiki User
- AMastilovic-WMF
Tue, Nov 25
Ditto what @xcollazo said above. In order to have the desired behavior for this pipeline job, I think you need:
Mon, Nov 24
@Eevans thank you for that MR! You are correct, wiki_id should be TEXT - we've already implemented it in the Hive counterpart for that table: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1206879
Fri, Nov 14
OK so I've now officially backfilled the wmf_contributors. and wmf_readership. tables, but the process I had to use to keep the number of files small is involved enough that it warrants being documented somewhere:
Oct 31 2025
We definitely already have the maven-checkstyle-plugin set up in the main pom.xml. I know because it's very annoying: the codebase doesn't seem to conform to the style being checked, so each compile produces a flood of ERROR messages in the output.
Oct 30 2025
That specific use case sounds like what dbt calls a microbatch incremental strategy, which replaces time intervals based on the event_time column: https://docs.getdbt.com/docs/build/incremental-microbatch
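For reference, the microbatch strategy is configured per model. A minimal sketch, assuming a hypothetical model with an event-time column named event_dt (the model name, source name, and dates are all illustrative, not from this codebase):

```sql
-- Hypothetical dbt model file, e.g. models/pageviews_daily.sql
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='event_dt',   -- column dbt uses to slice time batches
    batch_size='day',        -- each run replaces one day's worth of data
    begin='2025-01-01'       -- start date for the initial build
) }}

select * from {{ ref('events_source') }}
```

Note that for dbt to filter the upstream relation automatically, the referenced model (events_source here) also needs event_time set in its own config, and this strategy requires dbt Core 1.9+.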
@JMonton-WMF I think we could use this task to include a .dbtignore file that will let dbt commands ignore the .ipynb_checkpoints folders: https://docs.getdbt.com/reference/dbtignore
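For illustration, .dbtignore lives at the root of the dbt project and uses gitignore-style patterns; a minimal sketch of what that file could contain:

```
# .dbtignore at the dbt project root (gitignore-style patterns)
.ipynb_checkpoints/
**/.ipynb_checkpoints/
```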
insert_overwrite is what @JAllemandou is describing, perfect.
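A sketch of what insert_overwrite looks like in a model config on the Spark/Hive adapters, with hypothetical partition columns and model names (all illustrative):

```sql
-- Hypothetical incremental model using insert_overwrite (dbt-spark style)
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by=['year', 'month', 'day'],  -- only touched partitions are rewritten
    file_format='parquet'
) }}

select * from {{ ref('staging_events') }}
```

With this strategy, each incremental run overwrites exactly the partitions present in the new result set, which matches the "replace a time interval" behavior described above.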
Oct 29 2025
Do we want only the Druid realtime configs in their own repo? Perhaps we want the batch ones in the same place?
My uninformed thought on this is that this should be a "Druid config stuff" repo, which would therefore include both realtime AND batch configs :)
Oct 28 2025
The user_id, user_central_id and page_id fields are now available in both the Hive dataset wmf.mediawiki_history_reduced and the corresponding Druid dataset.
Oct 22 2025
@Ottomata so it sounds like we are ready to accept the mediawiki_history_reduced dataset as it is right now, but with user_central_id and page_id columns added? If so, I'll start backfilling September 2025.
Oct 21 2025
WIP Typo! I do that when I'm creating test tables so I can increment as I make changes. The final one will not have 0, but maybe _v1 if we want to version it!
@BTullis wouldn't this approach introduce a discrepancy between what users use on stat boxes and what is run in GitLab CI/CD and eventually in Airflow? The latter two will run in Docker images, and I wonder how different the two installations will end up being.
Oct 14 2025
Oh TIL that we can use kokkuri as a GitLab component! Thumbs up.
Sep 25 2025
This all feels very achievable, but I wonder if we might be making things difficult for ourselves by trying to define one operator that can do it all, like a single Swiss Army knife.
Sep 23 2025
This might be quite onerous on ops week duty and/or folks just trying to upgrade or deploy their job.
We have that manual forced cache warmup for precisely this scenario by the way.
Would data sizes be any concern? One of our use cases is a weekly transfer of a few hundred GB spread across ~10k files in a nested directory structure.
Blunderbuss could easily do this for you, with minimal resource usage on the Airflow executor side :-)
Sep 18 2025
The 2025-08 backfill run of the DAG has completed successfully, and judging by data sizes on HDFS I'd say it falls in line with what we've seen in the previous months. @GFontenelle_WMF if you have some basic validation checks to run on this data, now would be a good time. Thank you!
Sep 17 2025
The new SQL query is:
Aug 29 2025
As far as I can see, all these checkers already use DeequColumnAnalysis so the basic plumbing seems to already be there. I'll have to investigate a bit deeper to see if and how it's being used right now, but hopefully the scope creep won't be too big. I'll keep you updated here.
Aug 28 2025
It seems to me that this growth is organic and should be taken into account when doing the error checks, but the MediawikiHistoryChecker unfortunately doesn't provide for fine-tuning the error boundaries. The way it works right now, it accepts a single pair of minimum and maximum boundaries for any kind of growth, and those same two boundaries are then used for the user growth, page growth, denormalized history growth, and reduced history growth checks.
Aug 19 2025
Use the following example of how these values should be saved in the deployment-charts repository: https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/dse-k8s-services/_airflow_common_/values-analytics-production.yaml
Aug 12 2025
Another vote for the JRE version, too.
Aug 4 2025
Thank you, @BTullis !!
Jul 30 2025
OK so I've done some investigation work and here's what I think needs to be done:
Agreed, I think this is WAD (working as designed).
Sounds good, closing the ticket.
Jul 28 2025
I can't explain why it didn't hit the permission errors when I ran it manually.
