Fri, Sep 22
I burned through my budgeted hour (really two hours) on this: I added a bunch of links, details, and cleanup, but only got through about 15% of the sheet. My estimate of 12 hours is closer to what it would take to finish, but marking this done as per the spike.
Also, while we're talking canaries: if it's just as easy, enabling them for all EventBus-sourced streams is a good idea. Otherwise we have the problem Xabriel explains above. Here's an example job using the page move events.
Thu, Sep 21
Ok, so the action here would be to label the data better, and add an annotation for Phase 5 and any other big changes.
AQS 1.0 is sending the required headers now, and ETag is enabled on all endpoints (not just knowledge gaps). Hugh, please verify and let us know if anything else needs to happen before we can route to the knowledge gaps endpoint. Thank you!
- Do we want/need a public-facing API? @Ladsgroup's use-case doesn't require one, is there demand for this elsewhere?
I marked a couple of these as bad just to see what that process was like, see T346969
Wed, Sep 20
mwscript maintenance/findBadBlobs.php --wiki hrwiki --revisions 1705637
mwscript maintenance/findBadBlobs.php --wiki azwiki --revisions 413206,413238,413328
Also related, T342267: Investigate surprising "10% Other" portion of Analytics Browsers report which really needs some love as well.
I vaguely remember this thing in 2018... Windows did get grouped up, but I agree with the DJ's points and that this data makes no sense without at least some kind of annotation.
Tue, Sep 19
To keep the archives happy: I talked to Fabian on Monday and answered this question. Yes, all-projects means all wikis. For aggregates at the project-family level, for example "all wikipedias", we use all-wikipedia-projects (see the wikistats example).
Ok, I hear Ben's concerns, but I decided to risk updating everything at once (because it's easier to roll back now than after we move to AQS 2.0). Once this is reviewed, deployed, and validated, I will move it to blocked until Hugh can review.
Ok, to resolve this I'm going to erase the dvd.html file from all the dumpsdata hosts, as per the docs:
Moving is fine, let's not make a new RC until we have a new schema
Hmmm... no my prompt for that would be something more like "in the theme of Tron crossed with Lawnmower man but replacing any sadness or darkness with joy and dance"
Mon, Sep 18
@hnowlan: TL;DR: do you see the Cache-Control header that AQS is already setting, and do we need an ETag header, or is that just a nice-to-have?
This is now done
(sorry this slipped through)
Fri, Sep 15
I know they're just computers and they don't have feelings and stuff, but something about this makes me so happy, just picturing free RAM and CPU resources frolicking in the YARN clouds...
- we need to find a way to tag the prefetch proxy traffic. Ideally in webrequest and other derived pageview tables for easy analysis.
- this is possible using the 'Sec-Purpose: prefetch; anonymous-client-ip' request header.
- note that we would not want to change any of our existing dimensions (like agent_type) to indicate prefetch pageviews, since that would break our reporting and has consequences for Superset dashboards. Instead, find a way to store this in an existing field or create a new field
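A minimal sketch of the idea above (plain Python rather than the actual webrequest refinement code; the row shape and the is_prefetch field name are hypothetical, only the Sec-Purpose header value comes from the notes): derive a separate flag from the request headers and leave agent_type untouched.

```python
def tag_prefetch(row: dict) -> dict:
    """Annotate a webrequest-like row with an is_prefetch flag based on
    the Sec-Purpose request header, leaving existing dimensions such as
    agent_type untouched so current reports keep working."""
    headers = row.get("request_headers") or {}
    sec_purpose = headers.get("Sec-Purpose", "")
    tagged = dict(row)
    # The prefetch proxy sends e.g. "prefetch;anonymous-client-ip".
    tagged["is_prefetch"] = "prefetch" in sec_purpose.lower()
    return tagged
```

Downstream pageview tables could then carry is_prefetch as a new column, and dashboards opt in to filtering on it instead of having agent_type change under them.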
When this gets prioritized, we can make it into a proper epic and break it down, but if anybody else wants to take any piece of it, please don't let this stop you.
Wed, Sep 13
It makes sense, @mforns, it's just a little strange: I would've expected the minor versions to be in the first 200 characters of the strings, and we did indeed see a drop in the UA string entropy, so it's odd that this doesn't affect everything else. But I believe you, and I'll just file that away with life's other curiosities :)
Tue, Sep 12
quick recap of cleanup:
Mon, Sep 11
running a manual version in a screen session, as the dumpsgen user: 14381.pts-0.snapshot1009
@Ladsgroup: would you prefer that to the settings change? I'm happy to delete, but as I understand it, the maintenance delete scripts won't work without a content handler. So I guess I could update all the content models to json and then delete?
Fri, Sep 8
I spoke to Antoine, and it turns out this was not really the biggest issue; some Spark tuning shrugged off the problem. There are lots of other super interesting details in the XML publishing machinery built as part of T335862: Implement job to generate Dump XML files. See the code here: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/938941/
The above gives us 524,283,851 pages across all projects and namespaces to play with.
Thu, Sep 7
Approved! I think maybe you also need analytics-admins as per data access docs
Wed, Sep 6
Tue, Sep 5
Fri, Sep 1
First, some setup.
Thu, Aug 31
Wed, Aug 30
Parking some links that will be useful to this work:
Tue, Aug 29
Mon, Aug 28
Tue, Aug 22
Thu, Aug 17
Tue, Aug 8
Mon, Aug 7
Just a random drive-by note, since I'm not the one playing with this, but it might be interesting to instrument EventBus a little bit. For example, from the deferred job that publishes to Kafka, we could log a basic key for each event that we publish. It should be possible to aggregate these logs and compare them against what we see in Kafka to figure out what we missed, perhaps even facilitate retries.
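The reconciliation step in the note above could be sketched roughly like this (names are hypothetical; it assumes we can extract a comparable per-event key both from the producer-side logs and from the Kafka topic):

```python
def find_missed_events(published_keys, kafka_keys):
    """Compare the keys EventBus logged at publish time against the keys
    actually observed in Kafka. Anything logged but never seen in Kafka
    is a candidate for a retry (or at least a loss metric)."""
    published = set(published_keys)
    seen = set(kafka_keys)
    return sorted(published - seen)
```

The missed keys could then be fed into a retry mechanism, or simply counted over time to quantify how much we're dropping.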
Sat, Aug 5
Thank you for filing this, @VeniVidiVicipedia! We're going through a reorg, so things are in a bit of a messy state right now. Bear with us as we triage.
Tue, Aug 1
My ramblings that got me to the null edits; they might be useful for someone else verifying that there is no other source of unexpected drift: