request for access to Wikidata Query logs
Closed, Declined (Public)

Description

I've been doing some benchmarking of QLever and other SPARQL engines on Wikidata.

The first part was a comparison of Blazegraph, MillenniumDB, QLever, and Virtuoso. QLever was fastest.

The second part is estimating how much QLever (and maybe other engines) might slow down as Wikidata becomes larger. Initial indications are that the slowdown is only moderate.

But this work mostly uses query sets that are derived from old WDQS logs. I did create a new benchmark from Scholia queries, but that's only a small subset of the current WDQS queries.

Is it possible to get access to some current query logs? Ideally they would be anonymized the same way that some old logs were. See https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en for more information. Alternatively, I could do the anonymizing if I'm allowed access to the raw logs.

If the raw logs are turned over, this would be little work. If the logs are anonymized in-house, then it would be a moderate amount of work to obtain and run the anonymizing script.

Event Timeline

Pfps renamed this task from reqeust for access to Wikidata Query logs to request for access to Wikidata Query logs. Nov 10 2025, 8:39 PM
BTracy-WMF triaged this task as Medium priority.
BTracy-WMF moved this task from Incoming to Analysis on the Wikidata-Query-Service board.

Hi @Pfps - we are coordinating with the WMF safety and security team on how they advise we share this data. We expect to have guidance from them and next steps on this request by the end of next week (11/21).

@Pfps we have submitted a request to safety and security and are waiting on a response before proceeding. We are still targeting 11/21, but this date is tentative as we work through the correct process.

It's well past 11/21. Has there been any progress?

Our request is still under review by privacy & legal. We'll update the task once we receive feedback.

It is very frustrating to have this task languish without any way of contacting the team that appears to be blocking progress.

Hi @Pfps, apologies for the ambiguity on this request. The delay is due to the security review, a week-long team offsite at the beginning of this month, and a refresh needed on the code used to anonymize the logs. This item is included in our current sprint. We will share updates as we have them.

I just noticed the paper https://arxiv.org/abs/2602.14594

That's very interesting work, except that it uses query logs that are about 8 years old!

Think how much better this could have been if it used current logs!

Hi there @Pfps! I'm looking into refreshing the code that was previously used for query anonymization. Would you clarify what exactly you're hoping to get as output? The queries in anonymized form, of course, but what else: timestamps, user agent categories, etc.?

Thank you!

I want to mock up a server to investigate its load, so all I need for that is the anonymized query and a relative (or absolute) timestamp. User agent categories could be useful to better estimate future loads. Anonymizing string literals will be a problem for me, but I understand if this has to be done.

You might want to contact the authors of the paper I reference above to see if they are interested in redoing their work with current queries and what they would want.

DSantamaria changed the task status from Open to In Progress. Feb 25 2026, 3:40 PM

One more thing that would be useful, if possible, is whether the query was syntactically legal according to Blazegraph. I could get this information by running the query through Blazegraph, but if this information is in the log I could use that instead.

Hi @Pfps,

Thanks for your request for this data, and we understand this would be a really useful dataset for WMF to release.

A couple months ago the Wikidata Platform team did the due diligence to have this request reviewed by legal and privacy, who gave the OK for data to be released under the condition that the queries were anonymized. Following that, I did some exploratory work. Given the limitations I discovered (see below), we have unfortunately decided to close this request, and we aren’t able to release a dataset now. However, we do have a suggestion for another avenue for you to pursue to get this data (see the end of this note).

Quick summary of my work and findings:

  • I forked the ~10 year old repo that already existed for logs analysis+anonymization: https://github.com/Wikidata/QueryAnalysis/
  • I updated my fork in a few relatively minor ways to get it to run in the present day: https://github.com/lindsayerickson/QueryAnalysis/
  • Upon examining the output, I noticed that the anonymization didn’t seem good enough. The anonymized queries occasionally contained raw strings and URLs. So we definitely cannot release the output as-is.

Since this is no longer just a matter of running some lightly-updated code, but also now includes updating the underlying anonymization code itself and then validating it, we determined that it’s out of scope for us to fulfill this request now. I’m sorry about this outcome and I wish the anonymization had just worked as we had hoped.

Also relevant is that data releases like this are not generally done by individual teams. We were hoping it might work out in this case, which is why we did the background work and investigation, but alas. However, we’d suggest that you follow up with research-wmf@lists.wikimedia.org to ask about releasing this data, as that’s the right place for this topic.

Thanks!

I am disappointed in this abrupt ending, particularly after nearly four months.

I indicated that I would be willing to do the anonymization if I was allowed access to the raw logs. I would be willing to enter into an agreement to not share the raw logs. I understand the privacy issues involved in access to raw data - at one time I had CERT certification which gave me potential access to quite sensitive information.

I just learned that the initial work to anonymize the queries was supported by the Wikimedia Foundation, https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries

Hi again,

Sorry for the undesirable outcome! Most of the wait time was us doing the due diligence with legal and privacy. Our team isn't able to manage a collaboration where we would share the raw logs with you. Again, I recommend following up with research-wmf@lists.wikimedia.org.

And yes, the code I forked is WMDE's. But it's 10 years old and the underlying data seems to have changed in such a way that we are no longer confident in its anonymization capabilities.

One problem I have here is that I haven't seen any of the interaction with the privacy and security people so I don't know what their requirements for anonymization are.

A second problem is that the original anonymization was not designed to scrub all strings, and certainly not designed to remove all IRIs. In fact, the code likely introduces many IRIs, because it appears to expand qnames into IRIs; if so, the result includes an IRI wherever the query had something like wd:Q5. Scrubbing all such identifiers would make the queries unusable. So it may be that the code is actually still doing what was considered the right thing at the time.
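
To make the qname-expansion point concrete, here is a minimal Python sketch. It is not the QueryAnalysis code: the prefix map, the regular expressions, the placeholder "string1", and the function names expand_prefixed_names, scrub_string_literals, and count_iris are all assumptions made for illustration. A real tool would work on a parsed SPARQL query rather than on regexes.

```python
import re

# Assumed prefix map: only two common Wikidata prefixes, for illustration.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
}

# Rough patterns; they do not cover the full SPARQL grammar.
PNAME = re.compile(r"\b(wdt|wd):([A-Za-z0-9_]+)")
STRING_LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')
IRI = re.compile(r"<https?://[^>\s]+>")


def expand_prefixed_names(query: str) -> str:
    """Rewrite prefixed names such as wd:Q5 as full IRIs in angle brackets."""
    return PNAME.sub(lambda m: f"<{PREFIXES[m.group(1)]}{m.group(2)}>", query)


def scrub_string_literals(query: str, placeholder: str = "string1") -> str:
    """Replace every double-quoted literal with a fixed placeholder."""
    return STRING_LITERAL.sub(f'"{placeholder}"', query)


def count_iris(query: str) -> int:
    """Count full http(s) IRIs written in angle brackets."""
    return len(IRI.findall(query))


query = 'SELECT ?p WHERE { ?p wdt:P31 wd:Q5 ; rdfs:label "Douglas Adams"@en }'
expanded = expand_prefixed_names(query)
print(count_iris(query), count_iris(expanded))
print(scrub_string_literals(expanded))
```

In this sketch the input query contains no full IRIs, while the expanded form contains one for each prefixed name. A check that flags every IRI in the output would therefore flag output that is behaving as designed, whereas the string literal is the part that actually carries user-supplied text.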

Unfortunately, reviews with privacy and security have their own process outside of Phabricator. Permissions to view those issues are not controlled by us.

Dataset releases are primarily handled by Wikimedia Research. Our team picked up this request because we thought the code mentioned above would work with little input from us. After a fair amount of time and energy invested we found that wasn't the case. As mentioned above, we recommend taking this request through the standard process with Research.

So who can I ask to see the interaction with the Privacy and Security team?