Fri, Sep 22
I burned through my budgeted hour (really two hours) on this: I added a bunch of links, details, and cleanup, but only got through about 15% of the sheet. My estimate of 12 hours is closer to what it would take to finish, but marking this done as per the spike.
Also, while we're talking canaries: if it's just as easy, enabling them for all EventBus-sourced streams is a good idea. Otherwise we have the problem Xabriel explains above. Here's an example job using the page move events.
Thu, Sep 21
Ok, so the action here would be to label the data better, and add an annotation for Phase 5 and any other big changes.
AQS 1.0 is sending the required headers now, and ETag is enabled on all endpoints (not just knowledge gaps). Hugh, please verify and let us know if anything else needs to happen before we can route to the knowledge gaps endpoint. Thank you!
- Do we want/need a public-facing API? @Ladsgroup's use-case doesn't require one, is there demand for this elsewhere?
I marked a couple of these as bad just to see what that process was like, see T346969
Wed, Sep 20
mwscript maintenance/findBadBlobs.php --wiki hrwiki --revisions 1705637
mwscript maintenance/findBadBlobs.php --wiki azwiki --revisions 413206,413238,413328
Also related, T342267: Investigate surprising "10% Other" portion of Analytics Browsers report which really needs some love as well.
I vaguely remember this thing in 2018... Windows did get grouped up, but I agree with the DJ's points and that this data makes no sense without at least some kind of annotation.
Tue, Sep 19
To keep the archives happy: I talked to Fabian on Monday and answered this question. Yes, all-projects means all wikis. For aggregates at the project-family level, for example "all wikipedias", we use all-wikipedia-projects (see the wikistats example).
Ok, I hear Ben's concerns, but I decided to risk updating everything at once (because it's easier to roll back now than after we move to AQS 2.0). Once this is reviewed, deployed, and validated, I will move it to blocked until Hugh can review.
Ok, to resolve this I'm going to erase the dvd.html file from all the dumpsdata hosts, as per the docs:
Moving is fine, let's not make a new RC until we have a new schema
Hmmm... no my prompt for that would be something more like "in the theme of Tron crossed with Lawnmower man but replacing any sadness or darkness with joy and dance"
Mon, Sep 18
@hnowlan: TL;DR: do you see the Cache-Control header that AQS is already setting, and do we need an ETag header, or is that just a nice-to-have?
This is now done
(sorry this slipped through)
Fri, Sep 15
I know they're just computers and they don't have feelings and stuff, but something about this makes me so happy, just picturing free RAM and CPU resources frolicking in the YARN clouds...
- we need to find a way to tag the prefetch proxy traffic. Ideally in webrequest and other derived pageview tables for easy analysis.
- this is possible using the 'Sec-Purpose: prefetch; anonymous-client-ip' request header.
- note that we would not want to change any of our existing dimensions (like agent_type) to indicate prefetch pageviews, since that would break our reporting and has consequences for Superset dashboards. Instead, find a way to store this in an existing field or create a new field
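A minimal sketch of the idea above (plain Python rather than the actual webrequest refinement code; the row shape and the is_prefetch field name are hypothetical, only the Sec-Purpose header value comes from the notes): derive a separate flag from the request headers and leave agent_type untouched.

```python
def tag_prefetch(row: dict) -> dict:
    """Annotate a webrequest-like row with an is_prefetch flag based on
    the Sec-Purpose request header, leaving existing dimensions such as
    agent_type untouched so current reports keep working."""
    headers = row.get("request_headers") or {}
    sec_purpose = headers.get("Sec-Purpose", "")
    tagged = dict(row)
    # The prefetch proxy sends e.g. "prefetch;anonymous-client-ip".
    tagged["is_prefetch"] = "prefetch" in sec_purpose.lower()
    return tagged
```

Downstream pageview tables could then carry is_prefetch as a new column, and dashboards opt in to filtering on it instead of having agent_type change under them.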
When this gets prioritized, we can make it into a proper epic and break it down, but if anybody else wants to take any piece of it, please don't let this stop you.
Wed, Sep 13
It makes sense, @mforns, it's just a little strange: I would've expected the minor versions to be in the first 200 characters of the strings, and we did indeed see a drop in the UA string entropy, so it's odd that this doesn't affect everything else. But I believe you, and I'll just file that away with life's other curiosities :)
Tue, Sep 12
quick recap of cleanup:
Mon, Sep 11
running a manual version in a screen session, as the dumpsgen user: 14381.pts-0.snapshot1009
@Ladsgroup: would you prefer that to the settings change? I'm happy to delete, but as I understand it, the maintenance delete scripts won't work without a content handler. So I guess I could update all the content models to json and then delete?
Fri, Sep 8
I spoke to Antoine, and it turns out this was not really the biggest issue; some Spark tuning shrugged off the problem. There are lots of other super interesting details in the XML publishing machinery built as part of T335862: Implement job to generate Dump XML files. See the code here: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/938941/
The above gives us 524,283,851 pages across all projects and namespaces to play with.
Thu, Sep 7
Approved! I think maybe you also need analytics-admins as per data access docs
Wed, Sep 6
Tue, Sep 5
Fri, Sep 1
First, some setup.
Thu, Aug 31
Wed, Aug 30
Parking some links that will be useful to this work:
Tue, Aug 29
Mon, Aug 28
Tue, Aug 22
Thu, Aug 17
Tue, Aug 8
Mon, Aug 7
Just a random drive-by note, since I'm not the one playing with this, but it might be interesting to instrument EventBus a little bit. For example, from the deferred job that publishes to Kafka, we could log a basic key for each event that we publish. It should be possible to aggregate these logs and compare them against what we see in Kafka to figure out what we missed, perhaps even facilitate retries.
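The reconciliation step in the note above could be sketched roughly like this (names are hypothetical; it assumes we can extract a comparable per-event key both from the producer-side logs and from the Kafka topic):

```python
def find_missed_events(published_keys, kafka_keys):
    """Compare the keys EventBus logged at publish time against the keys
    actually observed in Kafka. Anything logged but never seen in Kafka
    is a candidate for a retry (or at least a loss metric)."""
    published = set(published_keys)
    seen = set(kafka_keys)
    return sorted(published - seen)
```

The missed keys could then be fed into a retry mechanism, or simply counted over time to quantify how much we're dropping.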
Sat, Aug 5
Thank you for filing this, @VeniVidiVicipedia! We're going through a reorg, so things are in a bit of a messy state right now. Bear with us as we triage.
Tue, Aug 1
My ramblings that got me to the null edits; they might be useful for someone else verifying that there is no other source of unexpected drift: