
[Story] Statistics for Special:EntityData usage
Closed, ResolvedPublic

Description

Can you please make access statistics for the exports at https://www.wikidata.org/entity/Q1.json and similar available? The available formats are json, rdf, n3, xml and ttl.
Statistics split by format would be most useful.

Details

Reference
bz62874
Related Gerrit Patches:
analytics/limn-wikidata-data (master): EntityData usage tracking

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 3:06 AM
bzimport set Reference to bz62874.
bzimport added a subscriber: Unknown Object (MLST).

Erik Z and I will discuss and prioritize.

-Toby

I reckon prio 'high' is for assessment of requirements and doability. So without further ado some questions here:

Lydia, can you please explain in more detail what this is about? The link above just points to some structured data file, without any further explanation. Is this a data dump for one article? Just guessing.

Statistics split by format: do I understand correctly that you want as many monthly totals as there are formats, with no further granularity? (I hope so.)

Where to find those numbers? Is there a table or API log which stores API requests that you know of? Or should we be looking at general traffic logs? We have 1:1000 sampled squid log reports (that would only work if API requests come in by the 100,000s per month; also, those are more or less frozen in a partially functional state, as new infrastructure for traffic analysis is still expected to happen soonish).

Thanks for follow-up.

bingle-admin wrote:

Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1491

(In reply to Erik Zachte from comment #2)

Lydia, can you please explain in more detail what this is about? The link
above just points to some structured data file, without any further
explanation. Is this a data dump for one article? Just guessing.

Yes. https://www.wikidata.org/entity/Q1.json is part of Wikidata's linked data interface. Basically, https://www.wikidata.org/entity/<id>[.<format>] URLs allow access to the machine-readable description of an entity in the given format. If the format is not given, content negotiation is applied.

In the end, these URLs are resolved to a redirect (303 and/or 302) to wiki/Special:EntityData with the appropriate parameters. E.g. the example above results in a redirect to https://www.wikidata.org/wiki/Special:EntityData/Q1.json

I suppose that is already counted, but only as a single count, not for each entity/format.

Statistics split by format. Do I understand correctly you want as many
monthly totals as there are formats, no further granularity (I hope so).

From Lydia's original description, I gather that we are not interested in per-entity counters, but only per format. Considering that the format is not always given explicitly in the original URL, it would probably be easiest to look at requests for wiki/Special:EntityData/*.<format> and base the statistics on that.
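
To make that concrete, here is a minimal Python sketch of the suggested suffix extraction (the regex mirrors what a Hive regexp_extract over the URI path could use; the function name is illustrative, not part of any existing tool):

```python
import re

# Pull the explicit serialization suffix (e.g. '.json') out of a
# Special:EntityData path; an empty result means no format was given
# and content negotiation would have applied instead.
FORMAT_RE = re.compile(r'^/wiki/Special:EntityData/.+(\..+)$')

def extract_format(uri_path):
    """Return the format suffix, or '' when none is present."""
    m = FORMAT_RE.match(uri_path)
    return m.group(1) if m else ''

print(extract_format('/wiki/Special:EntityData/Q1.json'))  # .json
print(extract_format('/wiki/Special:EntityData/Q1'))       # (empty string)
```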

Where to find those numbers? Is there a table or api log which stores api
requests that you know of? Or should we be looking at general traffic logs?

That's our question to you (and Dario, I guess). But this has nothing to do with the API. This is a special purpose URL path that gets resolved to a special page. So I guess looking at the general purpose web logs should work.

This task has not seen updates for 16 months. Is this still high priority?

Nemo_bis renamed this task from stats for Wikidata exports to Stats for Wikidata exports.Jul 31 2015, 7:24 AM
Nemo_bis changed the task status from Open to Stalled.
Nemo_bis set Security to None.

It is still important for us, yes.

Lydia_Pintscher changed the task status from Stalled to Open.Jul 31 2015, 3:52 PM
Lydia_Pintscher added a project: Wikidata.
thiemowmde renamed this task from Stats for Wikidata exports to [Story] Statistics for Wikidata exports.Aug 13 2015, 5:11 PM
thiemowmde updated the task description. (Show Details)
thiemowmde removed a subscriber: wikibugs-l-list.

I've just had another person ask for this (a 3rd-party researcher who wants to use it for a workshop). It'd be really great to get this published.

Adam: Can you get us an internal overview from hadoop?

Adam: Poke? This is getting rather important and urgent.

Can you get us an internal overview from hadoop?

Yes

Just as a sample, here are the counts for a single hour:

6 text/html; charset=UTF-8
2460 application/rdf+xml; charset=UTF-8
2473 application/vnd.php.serialized; charset=UTF-8
2479 application/n-triples; charset=UTF-8
2487 text/n3; charset=UTF-8
8820 application/json; charset=UTF-8
36716 text/turtle; charset=UTF-8

2015-11-10 1:00 UTC

Hi, quick questions on that:
Is the need regular, or would one-shot reports be enough?
Also, what level of aggregation? Is daily good?
Below is a Hive query that computes daily aggregates over (what I thought were) interesting dimensions.
DISCLAIMER: These queries need to scan a BIG volume of data (500GB per day), so let's discuss how to handle this if you need regular updates.

SELECT
    CONCAT(LPAD(year, 4, '0'), '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) as day,
    regexp_extract(uri_path, '^/entity/.+(\\..+)$', 1) AS entity_format,
    regexp_extract(uri_path, '^/wiki/Special:EntityData/.+(\\..+)$', 1) AS special_entity_format,
    access_method,
    agent_type,
    http_status,
    COUNT(1) as count
FROM wmf.webrequest
WHERE webrequest_source = 'text'
    AND year = 2015
    AND month = 11
    AND day = 16
    AND normalized_host.project_class = 'wikidata'
    AND uri_path rlike '^(/entity/|/wiki/Special:EntityData/).*$'
GROUP BY
    year, month, day,
    access_method,
    agent_type,
    http_status,
    regexp_extract(uri_path, '^/entity/.+(\\..+)$', 1),
    regexp_extract(uri_path, '^/wiki/Special:EntityData/.+(\\..+)$', 1)
ORDER BY
    day, entity_format, special_entity_format, access_method, agent_type, http_status
LIMIT 100000;
| day | entity_format | special_entity_format | access_method | agent_type | http_status | count |
| 2015-11-16 | | | desktop | spider | 200 | 2 |
| 2015-11-16 | | | desktop | spider | 301 | 345473 |
| 2015-11-16 | | | desktop | spider | 302 | 75 |
| 2015-11-16 | | | desktop | spider | 303 | 312186 |
| 2015-11-16 | | | desktop | spider | 400 | 21 |
| 2015-11-16 | | | desktop | spider | 503 | 2 |
| 2015-11-16 | | | desktop | user | 200 | 18 |
| 2015-11-16 | | | desktop | user | 301 | 1398 |
| 2015-11-16 | | | desktop | user | 302 | 38 |
| 2015-11-16 | | | desktop | user | 303 | 2714 |
| 2015-11-16 | | | desktop | user | 400 | 25 |
| 2015-11-16 | | | desktop | user | 429 | 2 |
| 2015-11-16 | | .json | desktop | spider | 200 | 719297 |
| 2015-11-16 | | .json | desktop | spider | 301 | 501004 |
| 2015-11-16 | | .json | desktop | spider | 304 | 10315 |
| 2015-11-16 | | .json | desktop | spider | 400 | 7 |
| 2015-11-16 | | .json | desktop | spider | 404 | 1777 |
| 2015-11-16 | | .json | desktop | spider | 503 | 4 |
| 2015-11-16 | | .json | desktop | user | 200 | 7675 |
| 2015-11-16 | | .json | desktop | user | 301 | 97 |
| 2015-11-16 | | .json | desktop | user | 302 | 10 |
| 2015-11-16 | | .json | desktop | user | 304 | 1017 |
| 2015-11-16 | | .json | desktop | user | 400 | 1 |
| 2015-11-16 | | .json | desktop | user | 404 | 2 |
| 2015-11-16 | | .json | desktop | user | 429 | 42 |
| 2015-11-16 | | .n3 | desktop | spider | 200 | 65982 |
| 2015-11-16 | | .n3 | desktop | spider | 301 | 1952 |
| 2015-11-16 | | .n3 | desktop | spider | 304 | 17417 |
| 2015-11-16 | | .n3 | desktop | spider | 404 | 13 |
| 2015-11-16 | | .n3 | desktop | spider | 503 | 1 |
| 2015-11-16 | | .n3 | desktop | user | 200 | 169 |
| 2015-11-16 | | .n3 | desktop | user | 302 | 12 |
| 2015-11-16 | | .nt | desktop | spider | 200 | 65717 |
| 2015-11-16 | | .nt | desktop | spider | 301 | 2045 |
| 2015-11-16 | | .nt | desktop | spider | 304 | 7927 |
| 2015-11-16 | | .nt | desktop | spider | 404 | 13 |
| 2015-11-16 | | .nt | desktop | spider | 503 | 1 |
| 2015-11-16 | | .nt | desktop | user | 200 | 203 |
| 2015-11-16 | | .nt | desktop | user | 302 | 15 |
| 2015-11-16 | | .org/resource/ | desktop | user | 400 | 4 |
| 2015-11-16 | | .php | desktop | spider | 200 | 65394 |
| 2015-11-16 | | .php | desktop | spider | 301 | 2048 |
| 2015-11-16 | | .php | desktop | spider | 304 | 7745 |
| 2015-11-16 | | .php | desktop | spider | 404 | 22 |
| 2015-11-16 | | .php | desktop | user | 200 | 168 |
| 2015-11-16 | | .php | desktop | user | 302 | 14 |
| 2015-11-16 | | .rdf | desktop | spider | 200 | 66840 |
| 2015-11-16 | | .rdf | desktop | spider | 301 | 2088 |
| 2015-11-16 | | .rdf | desktop | spider | 304 | 13343 |
| 2015-11-16 | | .rdf | desktop | spider | 404 | 17 |
| 2015-11-16 | | .rdf | desktop | spider | 503 | 4 |
| 2015-11-16 | | .rdf | desktop | user | 200 | 182 |
| 2015-11-16 | | .rdf | desktop | user | 302 | 10 |
| 2015-11-16 | | .ttl | desktop | spider | 200 | 867900 |
| 2015-11-16 | | .ttl | desktop | spider | 301 | 2069 |
| 2015-11-16 | | .ttl | desktop | spider | 304 | 17560 |
| 2015-11-16 | | .ttl | desktop | spider | 404 | 35 |
| 2015-11-16 | | .ttl | desktop | spider | 503 | 24 |
| 2015-11-16 | | .ttl | desktop | user | 200 | 181 |
| 2015-11-16 | | .ttl | desktop | user | 302 | 13 |
| 2015-11-16 | | .ttl | desktop | user | 304 | 1 |
| 2015-11-16 | .json | | desktop | spider | 301 | 9089 |
| 2015-11-16 | .json | | desktop | spider | 303 | 4551 |
| 2015-11-16 | .json | | desktop | user | 303 | 20 |
| 2015-11-16 | .org/resource/ | | desktop | user | 301 | 4 |
| 2015-11-16 | .org/resource/ | | desktop | user | 303 | 4 |
| 2015-11-16 | .ttl | | desktop | user | 303 | 2 |
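
For reference, a small Python sketch of how the query's two format columns split a request path (an empty string stands in for a non-match; the function and constant names are illustrative):

```python
import re

# The Hive query derives two columns from uri_path: the format suffix on
# /entity/<id>.<fmt> requests (which the server redirects with 301/303)
# and the suffix on the final /wiki/Special:EntityData/<id>.<fmt> page.
ENTITY_RE = re.compile(r'^/entity/.+(\..+)$')
SPECIAL_RE = re.compile(r'^/wiki/Special:EntityData/.+(\..+)$')

def classify(uri_path):
    """Return (entity_format, special_entity_format); '' where no match."""
    m1 = ENTITY_RE.match(uri_path)
    m2 = SPECIAL_RE.match(uri_path)
    return (m1.group(1) if m1 else '', m2.group(1) if m2 else '')

print(classify('/entity/Q1.json'))                  # ('.json', '')
print(classify('/wiki/Special:EntityData/Q1.ttl'))  # ('', '.ttl')
```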
Addshore renamed this task from [Story] Statistics for Wikidata exports to [Story] Statistics for Special:EntityData usage.Nov 18 2015, 12:38 PM

Daily would be good.
Grouped by format.
We can probably ignore /entity/ as they should all redirect to Special:EntityData.

It would also be great to have this running (perhaps with all possible historical data; I think that is about a month's worth) by the end of this year!

Nuria added a subscriber: Nuria.Nov 19 2015, 6:05 PM

@Addshore: Do you have access to cluster 1002 to run queries yourself? Timeline-wise, if you need this before the end of the year it might be faster if you start working on it while we help you get changes going.

@Nuria yes I do!
I should be able to do this but of course if there is any chance of your team doing it that would be great!
Otherwise this will probably be one of the later things on my list!

Nuria added a comment.Nov 20 2015, 4:44 PM

@Addshore: It is on our backlog but we have several things before it so we cannot give an ETA. Now, I suggest that 1) you do some ad-hoc querying and get the data you need to meet your end-of-December deadline, and 2) we work together on oozification of this job later.

This is so our team doesn't block you and you can have your data for the dev summit; that is a different deliverable than having an Oozie job that calculates this data on a fixed interval.

Let me know if this sounds Ok.

I have reduced the above query to only show what we want to look at:

SELECT
    CONCAT(LPAD(year, 4, '0'), '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) as day,
    regexp_extract(uri_path, '^/wiki/Special:EntityData/.+(\\..+)$', 1) AS format,
    agent_type,
    COUNT(1) as count
FROM wmf.webrequest
WHERE webrequest_source = 'text'
    AND year = 2015
    AND month = 12
    AND day = 22
    AND http_status = 200
    AND normalized_host.project_class = 'wikidata'
    AND uri_path rlike '^/wiki/Special:EntityData/.*$'
GROUP BY
    year, month, day,
    agent_type,
    regexp_extract(uri_path, '^/wiki/Special:EntityData/.+(\\..+)$', 1)
ORDER BY
    day, format, agent_type
LIMIT 100000;
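
As a post-processing sketch (hypothetical: it assumes the reduced query's output is available as (day, format, agent_type, count) tuples), daily per-format totals could be folded like this:

```python
from collections import Counter

def daily_totals(rows):
    """Sum counts per (day, format), merging agent types; an empty
    format means the URL carried no explicit suffix."""
    totals = Counter()
    for day, fmt, agent_type, count in rows:
        totals[(day, fmt or '(no suffix)')] += int(count)
    return dict(totals)

# Illustrative numbers, not real query output.
sample = [
    ('2015-12-22', '.json', 'spider', '700'),
    ('2015-12-22', '.json', 'user', '70'),
    ('2015-12-22', '.ttl', 'spider', '900'),
]
print(daily_totals(sample))  # {('2015-12-22', '.json'): 770, ('2015-12-22', '.ttl'): 900}
```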

Change 260576 had a related patch set uploaded (by Addshore):
EntityData usage tracking

https://gerrit.wikimedia.org/r/260576

Change 260576 merged by Addshore:
EntityData usage tracking

https://gerrit.wikimedia.org/r/260576