Quick update: agreed with Eric to work on this as a separate component. I'll start on a patch now, keeping it in the Codex sandbox for the moment, with T363432 as the goal.
Thu, May 16
Wed, May 15
Thu, May 9
Wed, May 1
OK, I didn't do much here, just provided a very short description and detailed the schemas as Marcel had them in the design doc. Please let me know if anyone was imagining something else.
filling out the readme right now, thanks Ben!
I am not sure this is 100% squashed because the behavior is so weird. Here's what I found, in short:
Apr 18 2024
It would be cool to do a quick spike into Scalar and the customization we'd need there. Abstain as a voter here, I like all the options just fine and I have bad aesthetics when it comes to reading docs because I just start hacking and see what happens :)
+1 for Option 2. For what it's worth, when we initially put up the endpoint docs on wikitech we were just doing so while we waited for a better end user experience than the Swagger UI afforded us. I especially like the integration with wikitech described in option 2 (the discovery pages that would lead wiki users to the docs).
Apr 16 2024
+1, SSR is kind of a pain if done in fancier ways, but done this way you get a lot for free and it even helps reduce code. As a bonus, the user gets a great experience.
Apr 15 2024
I've broken this down into subtasks but I'm keeping this as something between an epic and an actual task: it coordinates the work and holds all the acceptance criteria, it was just too big as a single task. So I'll leave the other two subtasks on the boards while I'm on vacation and put this one in paused. It can be resumed whenever you'd like to continue work on coordination and deployment.
Apr 11 2024
Approved, welcome back Andy :)
Approved
Apr 4 2024
Apr 3 2024
Apr 1 2024
Mar 29 2024
I found a candidate bug. The script used to ask for the year and month; after the change it asks for the day. generate_druid_unique_devices_per_domain_daily_aggregated_monthly.hql seems to have been adapted to give the correct result, but despite that, the Druid output seems to contain only one day. Running it now to confirm or rule this out.
Merge request 582 seems to have changed how we do this monthly Druid segment aggregation, so the answer must be around there. I checked the new source table again, now Iceberg (wmf_readership.unique_devices_per_project_family_daily), and it too seems to have data for all of January, for example.
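To make the "only one day in the Druid output" suspicion concrete, this is the kind of coverage check I'd run once I have the distinct dates back from Druid (or from the Iceberg source table). A minimal sketch; the `missing_days` helper and the example date set are mine, not part of the pipeline.

```python
from datetime import date, timedelta

def missing_days(year: int, month: int, present: set[str]) -> list[str]:
    """Return the ISO dates in the given month that are absent from `present`."""
    d = date(year, month, 1)
    missing = []
    while d.month == month:
        iso = d.isoformat()
        if iso not in present:
            missing.append(iso)
        d += timedelta(days=1)
    return missing

# Symptom described above: the monthly segment appears to hold a single day.
present = {"2024-01-01"}
print(len(missing_days(2024, 1, present)))  # 30 days of January unaccounted for
```

If the source table is healthy (all of January present) but this check fails against the Druid datasource, the bug is in the aggregation/loading step rather than upstream.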
Looked at this a bit today.
Mar 25 2024
@VirginiaPoundstone: Looks like Giuseppe patched varnish to send more requestctls, so maybe that completely or partially solves the problem. I'd have to look through the data to see. I'm going to do a good job focusing and only do that if you put it in the sprint :) (it should take no more than an hour, but more than a few seconds if I want to be thorough)
Mar 22 2024
I made the puppet change but I need an SRE to merge it. This is indeed not well documented; we should talk about a better way to maintain this interface that so many people use.
Approved!
Mar 20 2024
deleted from meta
I believe this dataset that's already being published is strictly better and in my opinion should replace the current active editors by country data: https://analytics.wikimedia.org/published/datasets/geoeditors_weekly/ (also the monthly version)
Ah, thanks @will for finding T358793: Decommission AQS 1.0, @brouberol and others can go ahead and take AQS 1 offline and follow through with decommissioning. Take note of what Eric said there, the servers themselves are still useful, just AQS 1 is going away.
I'm working to find the relevant tickets, but AQS 1 should be sunset and I think it's ok to take it offline for now and follow through with the rest of the process. I've just been absent for a couple months and might be missing some nuance.
Mar 17 2024
Feb 5 2024
Jan 9 2024
Jan 8 2024
@VirginiaPoundstone this issue came up again (thanks very much to @xcollazo who remembered this task). I support option b) in Xabriel's plan above, and I think this should be triaged with high importance as a production issue. This table is used by lots of people and it seems to me it'll keep failing. If the folks looking into it don't remember this task, a lot of time will be wasted.
Quick mention of this other task where some of the work took place: T353296. Relevant to this, the gerrit change https://gerrit.wikimedia.org/r/c/analytics/refinery/+/982899 included updates to the following pipelines/datasets:
Jan 4 2024
TL;DR: the data pipeline up to AQS seems fine. My guess is we're not filtering properly to exclude redirects in AQS 2; the timeline corresponds with the reported problem. Sorry for the inconvenience, working on a fix.
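For anyone following along, the suspected missing filter amounts to dropping redirect rows before ranking. This is only an illustrative sketch; the row shape and the `is_redirect` field name are my assumptions, not the actual AQS 2 schema.

```python
def top_articles(rows: list[dict], limit: int = 10) -> list[dict]:
    """Rank pageview rows by views, excluding redirects.

    Each row is a dict like {"article": ..., "views": ..., "is_redirect": ...};
    these field names are hypothetical, for illustration only.
    """
    filtered = [r for r in rows if not r.get("is_redirect", False)]
    return sorted(filtered, key=lambda r: r["views"], reverse=True)[:limit]

rows = [
    {"article": "Main_Page", "views": 1000, "is_redirect": False},
    {"article": "Old_Title", "views": 900, "is_redirect": True},
    {"article": "Some_Article", "views": 500, "is_redirect": False},
]
print([r["article"] for r in top_articles(rows)])  # ['Main_Page', 'Some_Article']
```

Without the `filtered` step, `Old_Title` would rank second, which matches the kind of anomaly reported.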
@Mayakp.wiki the patch to watch is: https://gerrit.wikimedia.org/r/c/operations/puppet/+/981352/. This has not yet been merged and deployed. When it is, you'll start seeing the changes in x_analytics.
Datahub allows you to add descriptions at sub-field level. We should at some point get to consensus about where we want all this description stuff to live. We talked about:
Dec 22 2023
Dec 12 2023
Quick recap for anyone looking to implement lineage. First, a note regarding lineage as part of centralized configuration: I think centralized config would be very useful, and I'm in no way suggesting that we slow down the work that @JAllemandou and @lbowmaker are leading on that front. The reality, though, is that a centralized config may take a few more months to implement. In the meantime, we could instrument lineage in the airflow DAGs in a few minutes per DAG. Done in a standard way, this would be very easy to migrate to centralized config later. In addition, as we implement this we may find exceptions and edge cases that would inform the centralized config.

If anyone disagrees with anything here, that's very welcome; please don't take this as a "decision", just a thought. If we agree on this and migrating back to the centralized config causes some slow-down, I hereby promise that I'll do it myself on all DAGs.
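As a sketch of what "a few minutes per DAG, done in a standard way" could look like: a tiny shared helper producing a uniform lineage record that each DAG attaches to its tasks. The record shape and the `lineage` helper are assumptions of mine, not an agreed-on schema; `wmf.webrequest` and `wmf.pageview_hourly` are just example table names.

```python
def lineage(inputs: list[str], outputs: list[str]) -> dict:
    """Build a minimal, uniform lineage record for one pipeline step.

    Datasets are plain strings (e.g. Hive table names). Keeping the
    shape trivial is the point: a record like this is easy to migrate
    into a centralized config later.
    """
    return {"inputs": sorted(set(inputs)), "outputs": sorted(set(outputs))}

# In a DAG file this would be roughly one extra call per task:
step_lineage = lineage(
    inputs=["wmf.webrequest"],
    outputs=["wmf.pageview_hourly"],
)
print(step_lineage)  # {'inputs': ['wmf.webrequest'], 'outputs': ['wmf.pageview_hourly']}
```

Deduplicating and sorting the dataset names keeps records comparable across DAGs, which should make the eventual migration to centralized config a mechanical transformation.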
In T351117#9379025, @Fabfur wrote:Hi @Milimetric sorry for the late reply, I'll try to answer to your question but consider we're still investigating about all pro and cons of this "migration", and for sure we'll share our thought and our action plan before moving on with this...
The following is a quick rundown of what I would think about if something goes wrong, and how I would check.
Dec 11 2023
A full list of current use cases could only be compiled by reaching out to researchers who download this dataset. Limited to what we know, current use cases are roughly:
MediaWiki History is described in detail in the following places:
The algorithm is explained at length starting here.
A shortened and updated list of Changes and Known Problems.
MediaWiki History is described in detail in the following places:
wmf_raw.mediawiki_pagelinks and wmf_raw.mediawiki_page_props are available with snapshot 2023-11
Dec 8 2023
I agree, @stjn, hopefully that's not as hyper-urgent and maybe @VirginiaPoundstone + @lbowmaker can triage.
Dec 7 2023
I'm really sorry this didn't get through the pipeline sooner, someone only told me about the issue last week. Had I known sooner I would have made the fix sooner. We are going to bring this up in our retro.
In T333716#9389355, @stjn wrote:@Milimetric: this is great, but I think it should be also indicated under the map that some countries do not have any results, so people can see this easier. For example, page view stats have this in the bottom: Those countries with less than 100 views are not reported and are blank in the map. Seems like the absence of data for privacy reasons is good to report there as well. Can you also add that?
Dec 6 2023
The above patches do what I suggested in a comment on the talk page: https://meta.wikimedia.org/wiki/Talk:Requests_for_comment/Hiding_the_number_of_Russian/Belorussian/Kazakh_contributors_on_the_statistics_map which is to gray out the countries currently on the protection list and explain that the data is hidden. If and when the country list changes, we should update this or make it more reactive to the data itself.
Sqooping from the production replicas would mean applying the same sanitization rules on our side. I see the filter here is:
This is the varnish code (VCL) that does analytics-y things to create and update the X-analytics header. Adding stuff here would prevent us from having to change varnishkafka. Or maybe I misunderstood the whole thing, which is always possible in Varnish land :)
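For reference, X-Analytics is a semicolon-delimited list of key=value pairs, and the VCL snippet's job is essentially to append pairs to it. A minimal sketch of that operation in Python, to show the intended header shape; the example keys are illustrative, not a claim about what the VCL actually sets.

```python
def add_x_analytics(header: str, key: str, value: str) -> str:
    """Append a key=value pair to an X-Analytics header.

    The header is a semicolon-delimited list of key=value pairs;
    an empty header just becomes the single new pair.
    """
    pair = f"{key}={value}"
    return f"{header};{pair}" if header else pair

h = add_x_analytics("ns=0;page_id=123", "loggedIn", "1")
print(h)  # ns=0;page_id=123;loggedIn=1
```

Doing this append in VCL (rather than in varnishkafka) is exactly what saves us from having to change varnishkafka, since the kafka side just passes the finished header through.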
Dec 5 2023
This sounds like it would work... but I do want to point out a potential maintenance issue: