Data request for logs from SparQL interface at query.wikidata.org
Closed, Duplicate, Public

Description

I would like to access the query logs from the SPARQL interface at query.wikidata.org. I am interested in these logs as a way of providing quantitative evidence of how much use claims are getting through the SPARQL interface. One driving purpose is to provide evidence, in discussions with potential data providers, of the increased impact that their work can have if it is released via Wikidata.

Event Timeline

Aklapper renamed this task from "Develop data usage code" to "Data request for logs from SparQL interface at query.wikidata.org". Aug 25 2016, 11:27 AM
Aklapper removed I9606 as the assignee of this task.
Aklapper edited projects, added Wikidata; removed WMF-NDA-Requests.
thiemowmde triaged this task as Medium priority. Sep 5 2016, 3:13 PM

I'm not even sure if we have such query logs. @Smalyshev, do you have a hint for the user? Maybe he can use an existing Grafana board or something?

We have the logs, but they are not publicly accessible. See https://meta.wikimedia.org/wiki/Discovery/Data_access_guidelines#Request_logs for access guidelines.

Hi folks. It sounds like there is a reasonably clear pattern for access. I have a student who could execute this project starting Sept. 19 if the barriers were cleared. Is there anything I can provide to move this along? Thanks!

Hi @I9606 - we have an NDA process that your student would need to go through before we can go much further with this being done in a volunteer capacity.

The link is here for the main page: https://meta.wikimedia.org/wiki/Non-disclosure_agreements and your student would need to start here: https://wikitech.wikimedia.org/wiki/File:Volunteer_Non-disclosure_Agreement_Template.pdf.

It looks like some of this might have been done in this ticket: T143819

OK. Do we just sign and mail that in or is there a specific contact person we should be in touch with?

Let's bring in @leila to confirm next steps, thanks.

@I9606 I imagine that what you are interested in will be one of the early outputs of the research documented at https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries . If that is the case, we should wait for the result of that research to gradually start coming out.

Assuming that we can gain access to the output of that work and that it allows us to explore subject-matter specific aspects of the data, then yes, it sounds like it would be a great foundation for what we want to do.

I notice that this project started in May this year and that it has not added a timeline section yet. What are the expectations for when it will be completed? If it is not making active progress, perhaps we could join forces with them immediately rather than waiting around for it to finish?

@I9606 that specific project proposal was initiated in May 2016. The access to data was granted only in September 2016. Timelines will be updated once we know more. :)

Would it be possible for our team to get access to these log files so that we can perform our analyses that are related to, but distinct from, the ones that @mkroetzsch is doing? We are happy to coordinate with Markus so that there is no duplication of effort. But, I suspect that our analyses are much more specific to the biomedical research community than their general purpose ones.

For context, our team (led by @I9606 and myself) has been spearheading the loading of biomedical data into Wikidata through https://www.wikidata.org/wiki/Wikidata:WikiProject_Molecular_biology. We are at a critical point with several potential data providers in convincing them to upload their data, and showing we can provide summarized usage reports for their funding agencies is a key blocker.

If this sounds reasonable, please let us know how we submit our signed NDA agreement forms. (Email? Attach to this ticket? Mail?)

@AndrewSu As I just replied to Benjamin Good in this matter, it is a bit too early for this, since we only have the basic technical access as of very recently. We have not had a chance to extract any community shareable data sets yet, and it is clear that it will require some time to get clearance for such data even after we believe it is ready.

In the long run, I would find some collaboration very interesting, but we need to lay the foundations for this first, which will likely take a few more months.

@mkroetzsch Thank you for the info. We look forward to coordinating more when/if you see fit in the future.

Since our project is not dependent on Markus' work, and since I don't believe that our work will negatively impact Markus' project, I propose we treat our request here as a completely separate initiative. So unless anyone has an objection to our plan, we await information on next steps. Again, we are ready to submit signed NDAs as soon as we receive instructions.

@AndrewSu please read https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations to learn about how we start formal collaborations (which is a pre-requisite for accessing the data). If you are interested, please attach a proposal to this phabricator task, ping me, and I'll make sure the Research team will review your request in the coming weeks and get back to you.

From my side the team around @I9606 and @AndrewSu has useful things to contribute on this topic and it'd be great if their request can be granted.

@AndrewSu Lydia and I had some off-list discussions and we thought it's a good idea that I leave a bit more information for you here:

  • Please don't spend days on the proposal if you decide to submit it. This is supposed to be a 1-2 page proposal that will help us understand what the problem is, why it's important and how you want to solve it (methodology). Some lit review would be great, but at a high level.
  • Dario and myself are operating at capacity in terms of forming new collaborations at the moment (we have a few more in the pipeline and we already have some in place). This being said, there are at least two other people in my team who may want to initiate this collaboration. Also, if the proposal ends up being aligned with something I will work on this year, I may drop something else to make it happen. This is to say that there is uncertainty on our end until we read the proposal, and there are some resource constraints.

I hope this extra information helps you with your decision. If you have any questions, please ping me.

Thank you @leila for the guidance on the process and next steps -- very helpful! @I9606 and I will touch base to see how we want to proceed/prioritize from our end...

@Nuria @Smalyshev this request (please read the current task description) is something we have been getting from other people as well. Basically, any time the community wants to convince an entity to donate data to Wikidata, they have to somehow show how much the previous donations have been used. It makes sense to me that instead of granting access to logs for specific requests, we look into offering a metric that indicates the daily/weekly/monthly counts of how often a Wikidata item or statement has been requested. Is this something that Analytics can help with or is this Search related? :)

Hmm not sure how to implement this yet, as we do not track which items were in query results (might be possible from GUI, though expensive, and probably not possible from API) but may be possible to analyze e.g. property usage in queries. Anybody in Analytics interested in helping with this?

Just want to add a note that if someone on the WMF side was interested in building the infrastructure to compute these usage metrics, the "Gene Wiki" team would be very willing collaborators in evaluating and refining the metrics. We have been working hard loading biomedical data into Wikidata. We've convinced several resources to convert to CC0, but we're also talking with many data providers who have reservations (many of which might be addressed by usage statistics). Based on these interactions, I think we have a pretty good perspective on what metrics would be valuable to this cross section of data providers.

If @Smalyshev thinks this would be a good idea and can develop the instrumentation for the metrics and own the metric definition (together with "Gene Wiki"), we can help on the project as needed. It seems to me that things like these could be computed with the infrastructure we have in place.

As far as I understand, you need to publish not only the queries sent to the service but also the query results (is this correct, @Smalyshev?); analyzing those will produce the metric counts @AndrewSu and @leila are interested in. This requires a schema definition of what a query result is (I imagine), so it seems that there is some work to do on the Wikidata end before being able to produce counts.

It may be hard to capture query results, given that we don't have any mechanism of tracking them now. We do have logs for queries themselves, so that's what I would start with...

@AndrewSu if you have any suggestions about the metrics that would be very helpful. Please add them here.

My initial thought is that there will be two types of metrics. First, we want to look at statement-level metrics. For all the statements that our team has loaded into Wikidata, we have been referencing specific resources that assert that statement. For example, see the human gene reelin (Q414043). This gene has a genetic association (P2293) with the disease Alzheimer's disease (Q11081), as stated in (P248) a database called Phenocarta (Q22330995). We would like to provide the Phenocarta team statistics on how often Phenocarta-referenced statements are used in SPARQL queries. Those statements might be part of the output of the SPARQL query, or they might simply be structural intermediates.

Second, we might also want to look at item-level metrics. See for example visual agnosia (Q18742). This item is mapped to the Disease Ontology (Q5282129) through the Disease Ontology ID (P699) (and one intermediate item for the specific release of the ontology). Again, we would want to provide the Disease Ontology team metrics on how often DO-linked items were utilized (either directly or indirectly) in SPARQL queries. (Note also that ontologies that are referenced as external identifiers in Wikidata items will very often also be referenced in support of instance of (P31) or subclass of (P279) statements, which may fall under the previous category.)

Computing one or both of these metrics in my mind would be good first steps, though I'm guessing there would need to be further iteration once we examine the results. Hope this is helpful...

Those statements might be part of the output of the SPARQL query, or they might simply be structural intermediates.

We don't currently have tools to capture statistics about the output of a query, let alone intermediaries. We could, however (with some work) capture usage of certain property, or item, or property-item combination, in the original query. Would that be useful?
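To make the query-text approach concrete, here is a minimal sketch of counting explicit mentions of items, properties, and property-item co-mentions across a set of logged queries. It assumes entity IDs appear with the usual WDQS prefixes (wd: for items; wdt:, p:, ps:, pq:, pr: for properties) and ignores full-IRI forms; the function names are hypothetical and not part of any existing WDQS tooling. It only sees what is written in the query, not what is returned.

```python
import re
from collections import Counter

# Assumption: IDs are written with the standard WDQS prefixes.
ITEM_RE = re.compile(r"\bwd:(Q\d+)\b")
PROP_RE = re.compile(r"\b(?:wdt|p|ps|pq|pr):(P\d+)\b")


def entity_mentions(query):
    """Return (item IDs, property IDs) explicitly written in one query."""
    return set(ITEM_RE.findall(query)), set(PROP_RE.findall(query))


def count_usage(queries):
    """Count in how many queries each item, property, and
    (property, item) co-mention appears."""
    items, props, pairs = Counter(), Counter(), Counter()
    for query in queries:
        item_ids, prop_ids = entity_mentions(query)
        items.update(item_ids)
        props.update(prop_ids)
        # Co-mention only: both IDs occur somewhere in the same query.
        pairs.update((p, i) for p in prop_ids for i in item_ids)
    return items, props, pairs


queries = [
    "SELECT ?gene WHERE { ?gene wdt:P2293 wd:Q11081 }",
    "SELECT ?disease ?doid WHERE { ?disease wdt:P699 ?doid }",
]
items, props, pairs = count_usage(queries)
```

A pair count here means the property and the item were mentioned in the same query, which is weaker than saying they occur in the same triple pattern; that would need a real SPARQL parser.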

@Smalyshev @AndrewSu please take a look at the other metric definitions we have. Once you decide on a metric definition, please be so kind as to document it on Meta: https://meta.wikimedia.org/wiki/Research:Standard_metrics#Newly_registered_user

This helps a lot to quantify what things mean when you see them on a dashboard.

We could, however (with some work) capture usage of certain property, or item, or property-item combination, in the original query. Would that be useful?

  • Property usage: I think there is some small-ish subset of properties that are very closely tied to a single data provider (e.g., Disease Ontology ID (P699)) where property usage would be informative to that data provider. But since usage of a single property could (usually?) span many different data providers, I think this will not be sufficient for most data providers.
  • Item usage: This partially addresses the "item-level metrics" in my last post, but it depends on how it's counted. Again, suppose I'm interested in metrics on Alzheimer's disease. If you mean counting the number of explicit mentions of Q11081 in a SPARQL query (e.g., "How many symptoms does Alzheimer's disease have?"), that's a good start. But that misses out on cases where the item is returned as a result but not explicitly mentioned (e.g., "What diseases have a symptom of memory loss?").
  • Property-item usage: Not seeing clearly exactly how this might work, but I think the same caveats as Item usage apply.

Note also that I don't think any of these metrics get at the "statement-level metrics" I described above. These arguably will be the more common case too.

As one very vague idea regarding a possible implementation that did account for outputs and intermediates, perhaps we could set up a temporary database that removed all items/statements from a given data provider, reran a set of SPARQL queries, and then compared results. If the results differed, then you could empirically say that the data provider was important for that query. Obviously complexity/scale are issues here...
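The remove-and-rerun idea can be illustrated on a toy in-memory model. This is only a sketch under strong assumptions: statements are plain tuples tagged with the item of their "stated in" source, a "query" is any Python function over the statement set, and the DOID value is a placeholder. A real version would need a copy of the triple store and actual SPARQL execution.

```python
def results_depend_on(statements, query, provider):
    """True if removing every statement sourced to `provider`
    changes the result of `query`."""
    full = query(statements)
    without = query({s for s in statements if s[3] != provider})
    return full != without


# (subject, property, value, stated-in source); IDs taken from the examples above.
statements = {
    ("Q414043", "P2293", "Q11081", "Q22330995"),          # reelin, genetic association, Alzheimer's; Phenocarta
    ("Q18742", "P699", "DOID:placeholder", "Q5282129"),   # visual agnosia, Disease Ontology ID (placeholder value)
}


def genes_associated_with_alzheimers(stmts):
    """Toy stand-in for a SPARQL query: subjects with P2293 -> Q11081."""
    return {s for (s, p, v, _src) in stmts if p == "P2293" and v == "Q11081"}


phenocarta_matters = results_depend_on(
    statements, genes_associated_with_alzheimers, "Q22330995")
disease_ontology_matters = results_depend_on(
    statements, genes_associated_with_alzheimers, "Q5282129")
```

Because the comparison is on final results, it also catches statements that act only as structural intermediates, at the cost of one extra query run per provider.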

For overall context, data providers are continually having to justify their existence to funders (e.g. the NIH), usually in terms of how important they are to a community of third-parties (e.g. research scientists). Currently they do that through restrictive licenses, so they can point to the number of licensees they have. If we want to convince them to contribute to Wikidata, they immediately lose the licensee count metric because there is no requirement to license. To incentivize them to contribute, we have to give them even better metrics of community usage/impact that they can give to funders. Just want to explain this perspective in case it wasn't clear...

To incentivize them to contribute, we have to give them even better metrics of community usage/impact that they can give to funders

Understood, as I said we are willing to help in any way we can, seems like a great objective. My main point is that if we come up with a metric we should document it outside this ticket once we have some agreement.

Got it, definitely will do that! Thanks!

See also: T164019 which could probably provide a platform for collecting the stats.

Here's how I see the process for handling releases (see also T183020: Investigate the possibility to release Wikidata queries):

  1. WDQS logs are placed in a separate partition on Hive
  2. We create a pipeline that parses these logs and produces sanitized logs containing successful queries with:
    • Timestamp
    • Sanitized query, as described in T183020, probably with additional provisions mentioned there in comments
    • Sanitized user agent, as described in T183020
    • Bot flag
    • Session-hashed client IP (i.e., same IP produces same hash in the short term, but not necessarily over all data set)
    • Possibly geocoded_data - i.e. country etc. (probably not more specific than that)
    • Time to first byte
    • Response size
    • Referer class (external/internal)

This data can still be sensitive, but not as sensitive as raw source data, so can be easily shared with researchers after appropriate vetting and NDA procedures.
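One possible reading of "session-hashed client IP" is a keyed hash whose key is replaced periodically, so the same IP maps to the same value within a period but cannot be linked across periods. The sketch below assumes a one-hour period and an in-memory random key; the class name and parameters are hypothetical, and T183020 may specify something different.

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timezone


class SessionHasher:
    """Hash client IPs with a random key that changes every `period_seconds`."""

    def __init__(self, period_seconds=3600):
        self.period_seconds = period_seconds
        self._keys = {}  # period index -> random key, held in memory only

    def _key_for(self, ts):
        period_index = int(ts.timestamp()) // self.period_seconds
        if period_index not in self._keys:
            self._keys[period_index] = secrets.token_bytes(32)
        return self._keys[period_index]

    def hash_ip(self, ip, ts):
        digest = hmac.new(self._key_for(ts), ip.encode("ascii"), hashlib.sha256)
        return digest.hexdigest()[:16]


hasher = SessionHasher()
t0 = datetime(2018, 1, 1, 10, 5, tzinfo=timezone.utc)
t1 = datetime(2018, 1, 1, 10, 50, tzinfo=timezone.utc)
t2 = datetime(2018, 1, 1, 11, 5, tzinfo=timezone.utc)
same_hour = hasher.hash_ip("192.0.2.1", t0) == hasher.hash_ip("192.0.2.1", t1)
next_hour = hasher.hash_ip("192.0.2.1", t0) == hasher.hash_ip("192.0.2.1", t2)
```

Discarding each key once its period has been processed would make it impossible to recompute or link the hashes afterwards, even for someone who still has the raw IPs.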

  3. From the data set above, we would create another data set, that includes:
    • Timestamp
    • Sanitized query, from above
    • (?) External/internal flag
    • (?) Bot flag
    • Additional tagging by items (Q-ids) and properties (P-ids) used in the query, so people could see usage by specific properties.

This data set could be periodically published openly.

Would like to hear comments about this idea.

@Smalyshev: Take a look at the information we keep on pageview hourly. For long-term keeping we need to remove PII, and we store neither detailed timestamps nor session IDs, precisely because we want to avoid session reconstruction. So probably if we round timestamp and remove sessionId your proposal for dataset #1 is safe to keep long term (cc @mforns for anything I might be missing)

The proposal for dataset #2 looks good

I think with the mentioned adjustments both datasets would be useful for public consumption.
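The two adjustments suggested here for dataset #1 (round the timestamp, drop the session identifier) amount to a small per-record transformation. A minimal sketch, assuming records are dictionaries with hypothetical field names "timestamp" and "session_hash":

```python
from datetime import datetime


def coarsen(record):
    """Return a copy of the record with the timestamp truncated to the hour
    and the session identifier removed."""
    out = {k: v for k, v in record.items() if k != "session_hash"}
    out["timestamp"] = record["timestamp"].replace(minute=0, second=0, microsecond=0)
    return out


raw = {
    "timestamp": datetime(2018, 1, 1, 10, 37, 12),
    "session_hash": "3f9a0c1d2e4b5a67",
    "query": "SELECT ?gene WHERE { ?gene wdt:P2293 wd:Q11081 }",
}
safe = coarsen(raw)
```

The input record is left untouched, so the raw form can be kept only for the short term while the coarsened copy is what is retained.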

Thinking about it, I don't think we would ever need more than hourly resolution for anything related to queries (we can get hit stats from the usual stats places, I assume). I also thought of dataset #1 as more short-lived. But I am not that insistent on the session ID thing; maybe dropping it is fine too, and then we could make it public.

@Smalyshev We like to default to public if possible, the more eyes on the data the more useful it can be.

@Nuria @Smalyshev

So probably if we round timestamp and remove sessionId your proposal for dataset #1 is safe to keep long term (cc @mforns for anything I might be missing)

I think it depends highly on how drastically we sanitize the potentially identifying fields (user agent and client IP) and the fields that can indicate user activity/features (query, location).
Intuitively it seems to me that we can keep this data in a private store indefinitely if sanitized. But having those four sensitive fields in the same data set will make it difficult to publicize, even if sanitized. I don't know how frequent WDQS queries are, but I imagine they are several orders of magnitude less frequent than pageviews. Thus the buckets of this data set are likely to be sparse and small, which increases the threat to user privacy.

If we wanted to make this public, I'd go for removing the geographic location field entirely, and probably for daily or monthly resolution instead of hourly (depending on bucket size).
Also, splitting the data set in several unrelatable thematic data sets could help: queries by country, queries by user agent, session queries, etc.

Sorry if I'm too pessimistic, I'm not familiar with the kind of information that WDQS queries can give away about users.
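The bucket-size concern can be made concrete with a simple suppression rule: aggregate records by a small set of fields and publish only the buckets that contain enough queries. This is a sketch, not an agreed policy; the threshold, field names, and function name are all hypothetical.

```python
from collections import Counter


def publishable_counts(records, key_fields, min_bucket_size=10):
    """Count records per combination of `key_fields` and keep only
    buckets with at least `min_bucket_size` records."""
    buckets = Counter(tuple(r[f] for f in key_fields) for r in records)
    return {key: n for key, n in buckets.items() if n >= min_bucket_size}


records = [
    {"day": "2018-01-01", "agent_class": "browser"},
    {"day": "2018-01-01", "agent_class": "browser"},
    {"day": "2018-01-01", "agent_class": "browser"},
    {"day": "2018-01-01", "agent_class": "rare-client"},
]
by_agent = publishable_counts(records, ("day", "agent_class"), min_bucket_size=2)
```

Calling the function once per theme, each time with a different small set of key fields, gives the separate thematic data sets suggested above, since no single output combines all the sensitive fields.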

I made a more formal full description of which data I'd like to be in the public dataset, so people don't have to read through all the comments here: https://www.wikidata.org/wiki/User:Smalyshev_(WMF)/Publishing_query_data

Please review and comment if you see anything missing or wrong.

I think notes look good.

@mforns main point that I missed is that we probably also want to remove geolocation from dataset #1; I see from your summary that you did.

The remaining item is sanitization of SPARQL queries, and on that I think we have to trust your expertise. As in any system, any non-parseable queries should be removed because, as we have seen before, bad queries might contain someone's credit card number (for real). From your notes you are also removing non-parseable queries, so good again. I think grouping user agents should also be useful and not a privacy concern, as long as you only include broad categories. For example, we do not want to include "user agent of the Linux distro only 3 people in the world have access to"; this case is covered by your removing the long tail of browsers with fewer than 10,000 requests.
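The long-tail rule for user agents can be sketched as follows, assuming records are dictionaries with a hypothetical "user_agent" field that has already been reduced to a coarse category. Any value seen in fewer than the threshold number of requests is replaced by a generic label; the 10,000 default mirrors the figure mentioned above.

```python
from collections import Counter


def group_user_agents(records, min_requests=10_000, other="other"):
    """Replace user agents seen in fewer than `min_requests` records
    with a generic label, leaving the other fields unchanged."""
    counts = Counter(r["user_agent"] for r in records)
    return [
        {**r, "user_agent": r["user_agent"]
            if counts[r["user_agent"]] >= min_requests else other}
        for r in records
    ]


records = [
    {"user_agent": "Firefox", "query": "q1"},
    {"user_agent": "Firefox", "query": "q2"},
    {"user_agent": "obscure-distro-browser", "query": "q3"},
]
grouped = group_user_agents(records, min_requests=2)
```

The counts are taken over whatever batch is passed in, so the effective threshold depends on whether the job runs hourly, daily, or over a longer window.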

Let's go ahead and start working on this. Oozie/Spark will be the way to go. Since you already have tags on webrequest data, you can probably run the job that creates this data once a day, as you are only querying a subset of the webrequest table? (cc @JAllemandou to confirm). Here are some Spark examples that might be useful for seeing how to approach the problem in general: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/ClickstreamBuilder.scala

@Nuria, @Smalyshev: Given that all wikidata-query tagged rows belong in misc, which is super small, I have no objection to running jobs either hourly or daily.

@Esc3300 Which users? WDQS does not track users, only queries. The log does contain query IP but the data processing will remove it, as well as any other PII. Additionally, we don't really know who runs any of the queries (beyond the basic information like IP) so I am not sure how personal opt-out is possible.