Sorry, I responded to this on Slack (the wrong place). Here's what I said: I would say https://phabricator.wikimedia.org/T422522 (technical build), because it's not really about correctness (2.5.6) but more about measuring how performant our replacement for the Blazegraph label service handling is.
Mon, Apr 20
Fri, Apr 17
Friday update: the WDP engineers are meeting with Data Platform contacts on Tuesday to discuss the project and hear their recommendations. We've prepared a short doc summarizing the current state, which we'll share with the rest of the meeting attendees on Monday. Updates from that meeting will be posted next week.
Thu, Apr 16
Wed, Apr 15
Tue, Apr 14
Mon, Apr 13
Update: the new repository exists at wikidata-platform/rdf-streaming-consumer. So far I have only cloned the original RDF repo; I'm working on an MR to delete everything except the consumer. It will depend on an artifact created by the new wdqs-common repo.
Fri, Apr 10
Thu, Apr 9
This is all done: I added a linting pipeline for MRs in the Analytics repo, and I also added a license and README. It turns out the "main" branch of the repo was already protected, so that wraps up this cleanup.
Wed, Apr 8
Thu, Apr 2
I've added the Apache 2.0 license and a very short README. I'll work on the other items next week.
Update: I've added some documentation to help others solve this problem in the future.
Thank you for reporting those, @Mbch331; I ran the same script on those entities, and they should be unqueryable now (or soon, due to eventual consistency).
Wed, Apr 1
Hi there! First, sorry for the slow response on this. I've just run a script to trigger RDF reconciliation for all the entities named here, and they should be deleted from WDQS now. If you see that your deleted entry is still queryable, please follow up and let me know.
Tue, Mar 31
Mon, Mar 30
The DAG to materialize the top user agents (>1000 queries per day) is running. It also cleans up data older than 90 days, and I set up a monthly maintenance DAG to delete old Iceberg snapshots. This follows the standard practices for this type of data, per a consultation with the data engineering advisors.
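For anyone curious what those two maintenance steps look like in practice, here is a rough, illustrative sketch (not the deployed DAG code): row-level retention plus Iceberg's standard expire_snapshots procedure, run as Spark SQL from PySpark. The partition column name ("day"), catalog name ("spark_catalog"), and retention parameters are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

from pyspark.sql import SparkSession

TABLE = "wikidata.wdqs_external_queries_by_user_agent_daily"

spark = SparkSession.builder.appName("wdqs_query_metrics_maintenance").getOrCreate()

# 1) Row-level retention: drop aggregate rows older than 90 days.
#    ("day" as the date partition column is an assumption.)
spark.sql(f"DELETE FROM {TABLE} WHERE day < date_sub(current_date(), 90)")

# 2) Iceberg table maintenance: expire snapshots older than 90 days so the
#    data files behind deleted rows can actually be removed. This uses the
#    standard Iceberg expire_snapshots stored procedure.
cutoff = (datetime.now(timezone.utc) - timedelta(days=90)).strftime("%Y-%m-%d %H:%M:%S")
spark.sql(
    f"""
    CALL spark_catalog.system.expire_snapshots(
      table => '{TABLE}',
      older_than => TIMESTAMP '{cutoff}',
      retain_last => 5
    )
    """
)
```

The real monthly DAG wires these up as Airflow tasks; the sketch only shows the Spark SQL they would run.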
Fri, Mar 27
Thu, Mar 26
- The table exists: wikidata.wdqs_external_queries_by_user_agent_daily aggregates queries by latency time class for each user agent per day, for user agents with >1000 queries per day. We also store total query counts and the number of successful queries. (A rough sketch of what the aggregation could look like follows this list.)
- The DAG to populate this table has been deployed.
- Row-level cleanup of data older than 90 days also happens via the data-populating DAG.
- Another DAG to do Iceberg snapshot cleanup runs separately; this MR is under code review.
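To make the table's shape concrete, here is a rough sketch of what the daily aggregation could look like. This is illustrative only, not the deployed DAG code: the source table name, latency class thresholds, source column names, and the hardcoded date are made up for the example; only the target table name comes from the update above.

```python
from pyspark.sql import SparkSession

TARGET = "wikidata.wdqs_external_queries_by_user_agent_daily"

spark = SparkSession.builder.appName("wdqs_queries_by_ua_daily").getOrCreate()

daily = spark.sql("""
    SELECT day, user_agent, latency_class, query_count,
           successful_query_count, total_query_count
    FROM (
        SELECT
            per_class.*,
            SUM(query_count) OVER (PARTITION BY day, user_agent) AS total_query_count
        FROM (
            SELECT
                day,
                user_agent,
                -- bucket each query into a latency time class (thresholds illustrative)
                CASE
                    WHEN query_time_ms < 100    THEN 'fast'
                    WHEN query_time_ms < 10000  THEN 'medium'
                    ELSE 'slow'
                END                              AS latency_class,
                COUNT(*)                         AS query_count,
                SUM(IF(http_status = 200, 1, 0)) AS successful_query_count
            FROM wdqs_external_query_events      -- hypothetical source table
            WHERE day = '2026-03-25'             -- the real DAG takes this from the Airflow run date
            GROUP BY 1, 2, 3
        ) per_class
    ) with_totals
    WHERE total_query_count > 1000               -- keep only user agents with >1000 queries/day
""")

# Overwrite the affected partition in the Iceberg target table
# (assumes the target schema matches the columns selected above).
daily.writeTo(TARGET).overwritePartitions()
```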
Mar 20 2026
Project update:
That sounds like the perfect solution. Thank you!
Hi! This was a mixup on my part; I didn't realize that I needed to submit an access ticket to add myself to my team's analytics user. This ticket is still open and tracks a bigger project, which is what I needed the analytics access for. Sorry about the confusion; we'd still like to keep this task open to track my additional work.
Mar 9 2026
Mar 5 2026
Hi again,
Mar 4 2026
Thanks for requesting this data; we understand it would be a really useful dataset for WMF to release.
Mar 3 2026
Feb 24 2026
Hi there @Pfps! I'm looking into refreshing the code that was previously used for query anonymization. Could you clarify exactly what you're hoping to get as output? The queries in anonymized form, of course, but what else: timestamps, user agent categories, etc.?
Feb 23 2026
Oh, I'm closing this because I found the documentation for getting a Kerberos identity, and I'll follow that process: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Kerberos#Get_a_password_for_Kerberos
Feb 20 2026
Hi again! Thanks for granting this access!