User Details
- User Since
- Oct 8 2014, 5:48 PM (492 w, 5 d)
- Availability
- Available
- IRC Nick
- Milimetric
- LDAP User
- Milimetric
- MediaWiki User
- Milimetric (WMF)
Sun, Mar 17
Feb 5 2024
Jan 9 2024
Jan 8 2024
@VirginiaPoundstone this issue came up again (thanks very much to @xcollazo who remembered this task). I support option b) in Xabriel's plan above, and I think this should be triaged with high importance as a production issue. This table is used by lots of people and it seems to me it'll keep failing. If the folks looking into it don't remember this, it's a lot of time wasted.
Quick mention of this other task where some of the work took place: T353296. Relevant to this, the gerrit change https://gerrit.wikimedia.org/r/c/analytics/refinery/+/982899 included updates to the following pipelines/datasets:
Jan 4 2024
TL;DR: the data pipeline up to AQS seems fine. My guess is we're not filtering properly to exclude redirects in AQS 2; the timeline corresponds with the reported problem. Sorry for the inconvenience, working on a fix.
@Mayakp.wiki the patch to watch is: https://gerrit.wikimedia.org/r/c/operations/puppet/+/981352/. This has not yet been merged and deployed. When it is, you'll start seeing the changes in x_analytics.
Datahub allows you to add descriptions at sub-field level. We should at some point get to consensus about where we want all this description stuff to live. We talked about:
Dec 22 2023
Dec 12 2023
Quick recap for anyone looking to implement lineage. First, a note regarding lineage as part of centralized configuration. I think this would be very useful, and I'm in no way suggesting that we slow down on the work that @JAllemandou and @lbowmaker are leading on that front. The reality is that a centralized config may take a few more months to get implemented. In the meantime, we could instrument lineage in the Airflow DAGs in a few minutes per DAG. Done in a standard way, this would be very easy to migrate to centralized config. In addition, as we implement this we may find exceptions and edge cases that would inform the centralized config. If anyone disagrees with anything here, you are very welcome to say so; please don't take this as a "decision". Just a thought. If we agree with this and there's some slow-down to migrate back to the centralized config, I hereby promise that I'll do it myself on all DAGs.
The following is a quick rundown of what I would think about if something goes wrong, and how I would check.
Dec 11 2023
A full list of current use cases could only be compiled by reaching out to researchers who download this dataset. Limited to what we know, current use cases are roughly:
MediaWiki History is described in detail in the following places:
The algorithm is explained at length starting here.
A shortened and updated list of Changes and Known Problems.
wmf_raw.mediawiki_pagelinks and wmf_raw.mediawiki_page_props are available with snapshot 2023-11.
Dec 8 2023
I agree, @stjn, hopefully that's not as hyper-urgent and maybe @VirginiaPoundstone + @lbowmaker can triage.
Dec 7 2023
I'm really sorry this didn't get through the pipeline sooner; someone only told me about the issue last week. Had I known sooner, I would have made the fix sooner. We are going to bring this up in our retro.
Dec 6 2023
The above patches do what I suggested in a comment on the talk page: https://meta.wikimedia.org/wiki/Talk:Requests_for_comment/Hiding_the_number_of_Russian/Belorussian/Kazakh_contributors_on_the_statistics_map which is to gray out the countries currently on the protection list and explain that the data is hidden. If and when the country list changes, we should update this or make it more reactive to the data itself.
Sqooping from the production replicas would mean applying the same sanitization rules on our side. I see the filter here is:
This is the varnish code (VCL) that does analytics-y things to create and update the X-analytics header. Adding stuff here would prevent us from having to change varnishkafka. Or maybe I misunderstood the whole thing, which is always possible in Varnish land :)
Dec 5 2023
This sounds like it would work... but I do want to point out a potential maintenance issue:
Nov 30 2023
Nov 29 2023
Nov 28 2023
I would like to emphatically support Timo in T169027#9362252 here. And just to re-state what I think is the most critical part of the argument:
Merged and deployed right now, used to fix another instance of the webrequest duplicate map key failures. Note for future selves: it would still be good to figure out where these are coming from.
Nov 27 2023
Besides the great discussion above, I just want to point out some related things.
Nov 20 2023
@SGupta-WMF may I please have permissions to the doc too? Will asked me to review.
Nov 16 2023
+1 for leaving writing to Hive tables alone (and erring towards correctness and jobs failing and hopefully comments that we can find)
+1 to instead focusing on the Iceberg migration
My apologies for the late review, +1 to Scott's point of resolving this and making it public.
Nov 14 2023
Nov 13 2023
Nov 9 2023
Since the dumps for enwiki and ukwikinews are both complete now, I looked at the snapshot hosts 101[0123]. I see that the code that seems to be failing in the stack trace has been updated to -wmf.4 (the stack traces are from -wmf.2 and -wmf.3 respectively). So this seems like it was fixed by someone else, deployed, and the snapshot hosts resumed their work.
Full output from email:
Indeed, there are quite a few differences between the pipelines. When the Wikidata folks look at this, do ping us, as we have been working on a new dumps process and migrating other dumps to our Airflow scheduler. cc @VirginiaPoundstone
When you all would like to start this work, let's talk. We would love to move this kind of dump to an Airflow pipeline for ease of maintenance.
Try and combine that into one.
Nov 7 2023
Nov 6 2023
Nov 4 2023
This has been used over the last few days to generate trees and it seems to be working well so far. We have some sample data and can use the logic to output a new set once Fiona and Virginia decide on it. Code is at https://phabricator.wikimedia.org/P53125
Nov 3 2023
Nov 2 2023
Oct 31 2023
New logic includes vertexType and writes to milimetric.sample_category_graph (it's writing right now). See updated spark for coordinating the rest of the work:
Oct 23 2023
Oct 20 2023
This is ready for deploy
Oct 18 2023
Oct 17 2023
Thomas deployed (did a great job!). I checked the table and it looks good; this is ready for sign-off.
Oct 16 2023
@JEbe-WMF - I'm sorry I had this comment but forgot to Submit! Your plan looks good to me, thank you for putting it together.
Oct 12 2023
Just to wrap this task up, the code that's merged now uses the rc1 schema. This was mostly done by Antoine. Any remaining work on XML publishing has been broken up in separate tasks, all of which are part of epic T347994. This task can be considered done.