
Reproduce wikidata-todo/stats data using analytics infrastructure
Closed, Resolved · Public

Description

The stats at https://tools.wmflabs.org/wikidata-todo/stats.php are generated every two weeks from the JSON dump by this PHP script: https://bitbucket.org/magnusmanske/wikidata-todo/src/fd1a1177c709c4fb1c6fe174ad437cd3c5938410/stats/stats_references.php

It is possible to compute node metrics (quantification and comparison) more efficiently and more frequently with either RDF/SPARQL or a Spark application.
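
For example, a simple node metric such as the total number of items could be expressed as a single SPARQL count. This is only a minimal sketch, and it assumes the rdf:type (wikibase:Item) triples are present in the loaded data, which is discussed further below:

PREFIX wikibase: <http://wikiba.se/ontology#>

# count all nodes typed as wikibase:Item (assumes type triples were not filtered out by the munger)
SELECT (COUNT(?item) AS ?itemCount) WHERE {
  ?item a wikibase:Item .
}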

Event Timeline

Christopher raised the priority of this task from to Needs Triage.
Christopher updated the task description.
Christopher moved this task to Incoming on the WMDE-Analytics-Engineering board.
Christopher subscribed.
Christopher renamed this task from Reproduce wikidata-todo data using analytics infrastructure to Reproduce wikidata-todo/stats data using analytics infrastructure. Oct 30 2015, 4:45 PM
Christopher set Security to None.

Another option here would be to use the wikidata toolkit to scan the dump (which we are already doing for some other things)
http://github.com/wmde/wikidata-analysis
and simply count the stuff that we want to count! :)

(that is, if we can't simply do it all through our SPARQL endpoint)

I have started working on a rough implementation of the Java thing to get these numbers in the wikidata-analysis repo.
See:

and

Of course simply using the SPARQL endpoint would be the easiest and probably best solution here.
Or perhaps loading the RDF or JSON into something else.

But if neither of these works, we have something to fall back on besides the wikidata-todo stats.
We can also easily add more counters to the dump scanner, and can easily understand where each number comes from.
These could also possibly be fired straight into graphite for easy use and graphing etc!

@Christopher thoughts?

I think one important thing to think about here is to only count what we actually want to count, rather than mindlessly copying the wikidata-todo stats.

My thinking is that we should look at this processing and measuring task primarily from the perspective of performance and manageability. The dumps are going to keep growing, and more and more data will have to be analysed. The solution that we implement now should be scalable to a multiple or even exponential growth factor of the current data set.

So, how can we evaluate our methods? I suggest that we profile them at the application level first. This can be quite scientific if done correctly, but it is not trivial. With Java apps, profiling can be done using jstat, jconsole or MBeans/JMX. I think some expectations of processing time per metric should be established as well. What kind of time is "normal" and what is "excessive"? Of course, faster is always better, so considering methods to optimize throughput (like Hadoop/Spark, etc.) is important as well.

I agree that obtaining clarity on which counts are relevant and which are not is significant, but the task can be generalized to the point of providing counts for any type-to-type relationship. And there are a lot of possibilities for this, I think, even more than what is done with the wikidata-todo stats. SPARQL and/or SQL definitely facilitates the comparison logic needed for grouping, etc., which is why offline db analysis seems the natural way to go here.
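
As a minimal sketch of the kind of type-to-type grouping meant here (illustrative only; an ungrouped scan like this over the full dataset would be expensive):

SELECT ?type (COUNT(?s) AS ?count) WHERE {
  ?s a ?type .
} GROUP BY ?type
ORDER BY DESC(?count)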

For testing, I modified the wdqs munger (see T115242 in https://github.com/christopher-johnson/wdm-rdf) and rebuilt it on a labs instance (wdm-rdf.eqiad.wmflabs). The munger took 4 hours to process the RDF dump from 10-26-2015 (with all RDF types, but without sitelinks and only English labels) and created 195 gz chunks.

Loading these chunks into Blazegraph takes approximately 2 to 4 minutes per chunk, so I expect the entire loading process will take ~8-10 hours. Total time to process ~21M items into Blazegraph for measuring is thus 12-16 hours using the current wdqs code.

Once this is loaded, I will analyze how SPARQL performs for counting comparison queries like (# of statements without references).

Note that this is a one-time process, and updates can be done incrementally, which may be a management advantage over continuous full dump processing as done by todo/stats (and the toolkit?).

I totally agree, being able to avoid using the dumps here would be the best solution!
SPARQL it all!

Update: All data loaded into Blazegraph (it took over 24 hours). Sync now running and up to 27 October.

Using Fast Range Counts returns counts of content objects instantly.

Examples:
curl -G http://wdm-rdf.wmflabs.org/bigdata/namespace/wdq/sparql --data-urlencode ESTCARD --data-urlencode 'o=http://wikiba.se/ontology#Item'
Number of Items: 18,733,307
curl -G http://wdm-rdf.wmflabs.org/bigdata/namespace/wdq/sparql --data-urlencode ESTCARD --data-urlencode 'o=http://wikiba.se/ontology#Statement'
Number of Statements: 74,709,111
curl -G http://wdm-rdf.wmflabs.org/bigdata/namespace/wdq/sparql --data-urlencode ESTCARD --data-urlencode 'p=http://www.w3.org/ns/prov#wasDerivedFrom'
Number of Predicate wasDerivedFrom: 38,985,221

Trending these kinds of objects should show interesting usage frequency patterns.

As the above blocking task has been resolved, is it possible to perform these on the live query service?

No. The blocking task code enables an option to not filter item, statement, value and reference rdf:types in the munger. I decided not to wait for this so that I could get started, but having it in master is very helpful going forward.

Having these types on live wdqs would require a complete rebuild of its data, which takes a long time. The wdm-rdf instance is a clone that includes these types, and should eventually sync up to production (hopefully in another 5 or 6 days ... 24 hours of edits takes approx. 12 hours to process).

It is possible to do estimated cardinality queries on live wdqs for the property usage counts and anything else other than these primary types, however.

Yes. It seems I need to disable the 10 minute query timeout set here first:
https://github.com/wikimedia/wikidata-query-rdf/blob/b3e646284f0b74131bce99a1b7d5fc6bfe675ec1/war/src/config/web.xml#L55

A fat query like this:

PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT (count(distinct(?wds)) AS ?scount) WHERE {
  ?wds ?p wikibase:Statement .
  OPTIONAL {
    ?wds1 prov:wasDerivedFrom ?o .
    FILTER (?wds1 = ?wds) .
  }
  FILTER (!bound(?wds1)) .
}

to find out how many statements do not have references is currently not possible.

There may be a better way to ask for this, but the way that the data is coded does not really facilitate type joins. An important point is that wikidata-todo/stats, and possibly the standing perception of the data, assumes an iterable hierarchy. But RDF does not encode a hierarchy: an Item does not "contain" statements, and statements do not "contain" references.

The relationship between statements and references is difficult to query by type, because a binding triple looks like this:

wd:statement/Q20913766-CD281698-E1D0-43A1-BEEA-E2A60E5A88F1 prov:wasDerivedFrom	wdref:39f3ce979f9d84a0ebf09abe1702bf22326695e9

Note that simply counting the frequency of http://www.w3.org/ns/prov#wasDerivedFrom and comparing it to the frequency of wikibase:Statement provides a kind of global ratio (here 38,985,221 / 74,709,111 ≈ 0.52 reference links per statement) that is a fast and easy alternative to counting individual statements without references.

I am rebuilding wdm-rdf now with the new Munger and no query timeout.

Also, I will load the dump from 17 November, so that the updater has some chance to sync. It had fallen 14 days behind, and I doubt that it would ever have caught up.

to find out how many statements do not have references is currently not possible.

We may not actually need this: for example, if we know the total number of statements and the number of referenced statements, we must know the number of unreferenced statements.

Also, for reference, I have just created T119182, which covers the notes for the general topic of data completeness.

True, a statement is either referenced or "unreferenced". Getting the number of references (currently 41,735,203) is easy and fast with:

curl -G https://query.wikidata.org/bigdata/namespace/wdq/sparql --data-urlencode ESTCARD --data-urlencode 'p=<http://www.w3.org/ns/prov#wasDerivedFrom>'

So we use the total of wikibase:Statement objects to represent the total number of statements and subtract referenced statements to get "unreferenced statements".

What is still murky to me, and I think possibly wrong in the todo/stats data, is the "Referenced statements by statement type". Something does not add up there, because the total should not be greater than the sum of "Statements referenced to Wikipedia by statement type" and "Statements referenced to other sources by statement type"?

For getting counts of objects per item, does this mean running 19M separate queries, or is there another way (one option is sketched below)? Creating a script to do this would be very similar to the property distribution method that I have already done, I guess. Basically ask "list all of the items" and then "lapply(items, count labels, statements, links, descriptions)".
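
One possible alternative to 19M separate queries would be a single grouped query per metric. This is only a sketch: it assumes the rdf:type triples kept on the wdm-rdf instance, and with ~19M groups it may well hit the timeout and heap limits discussed elsewhere in this task. For labels per item, for example:

PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# one result row per item, with the number of labels attached to it (0 if none)
SELECT ?item (COUNT(?label) AS ?labelCount) WHERE {
  ?item a wikibase:Item .
  OPTIONAL { ?item rdfs:label ?label }
} GROUP BY ?item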

What is still murky to me, and I think possibly wrong in the todo/stats data, is the "Referenced statements by statement type". Something does not add up there, because the total should not be greater than the sum of "Statements referenced to Wikipedia by statement type" and "Statements referenced to other sources by statement type"?

Well, I noticed something odd here yesterday: novalue and somevalue are counted as statement types. They are not statement types!

So for the wikidata-todo stats, the counts of statements by type, minus the novalue count, minus the somevalue count, should give you the number of statements (from the table below: 73,244,979 - 9,630 - 4,436 = 73,230,913).

OK. So the title "Referenced Statements by Statement Type" is just wrong then. Rather, it shows "All Statements by Type":

Date       | itemlink   | string     | globecoordinate | time      | quantity | somevalue | novalue | Total
2015-10-19 | 46,177,560 | 20,631,391 | 2,363,191       | 3,588,295 | 470,476  | 9,630     | 4,436   | 73,244,979

Truthy statement counts per Item can be done like this:

PREFIX wd: <http://www.wikidata.org/entity/>

SELECT (count(distinct(?o)) AS ?ocount)   WHERE {
 wd:Q7239 ?p ?o
 FILTER(STRSTARTS(STR(?p), "http://www.wikidata.org/prop/direct")) 
}

Labels per Item like this:

PREFIX wd: <http://www.wikidata.org/entity/>

SELECT (count(distinct(?o)) AS ?ocount)   WHERE {
 wd:Q7239 ?p ?o
 FILTER (REGEX(STR(?p), "http://www.w3.org/2000/01/rdf-schema#label")) 
}

Descriptions per Item:

PREFIX wd: <http://www.wikidata.org/entity/>

SELECT (count(distinct(?o)) AS ?ocount)   WHERE {
 wd:Q7239 ?p ?o
 FILTER (REGEX(STR(?p), "http://schema.org/description")) 
}

Sitelinks per item:

PREFIX wd: <http://www.wikidata.org/entity/>

SELECT (count(distinct(?s)) AS ?ocount)   WHERE {
 ?s ?p wd:Q7239
 FILTER (REGEX(STR(?p), "http://schema.org/about")) 
}

OK. So the title "Referenced Statements by Statement Type" is just wrong then. Rather, it shows "All Statements by Type":

Date       | itemlink   | string     | globecoordinate | time      | quantity | somevalue | novalue | Total
2015-10-19 | 46,177,560 | 20,631,391 | 2,363,191       | 3,588,295 | 470,476  | 9,630     | 4,436   | 73,244,979

Yes, wow, how has no one spotted that before?

I am blocked on this by several problems with the data model/ontology. The question of the relationship between the data model and the RDF node definitions is a bit complicated, perhaps more so than it should be. A reference is a special type of statement, defined by its relationship to other statements. An "unreferenced statement" is undefined in the ontology and in the RDF format. All statements should in practice have a reference node, but this is not an enforceable constraint in the data model, apparently.

I think that when a statement is born, it should also create a reference "placeholder" or blank node in the RDF. With this information in the RDF, counting these "bad" statements would be much easier.
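
As an illustration only (the placeholder node below is hypothetical and does not exist in the current serialization), counting unreferenced statements would then reduce to a single pattern match:

PREFIX prov: <http://www.w3.org/ns/prov#>

# <http://www.wikidata.org/reference/empty> stands in for the hypothetical placeholder reference node
SELECT (COUNT(?wds) AS ?unreferenced) WHERE {
  ?wds prov:wasDerivedFrom <http://www.wikidata.org/reference/empty> .
}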

So lots of this is now done using the query service.
We need to assess what has been missed / is missing and doesn't already have a ticket on the board

The only way to get a count of statements with references in the current model/format is like this:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT (count(distinct(?s)) AS ?scount) WHERE {
  ?s prov:wasDerivedFrom ?wdref .  
}

This query is super slow! In fact, it has crashed Blazegraph, because with an unlimited query timeout it uses all of the 8GB allocated heap space.

Since a single statement can have multiple references, just counting prov:wasDerivedFrom using estimated cardinality only returns the total number of reference links, not the number of referenced statements.

I asked the experts on the mailing list how we can address this reference query problem, and no one has responded with anything useful yet. This is an issue that could be handled in the Wikibase RDF serialization with any number of different solutions. In addition to the idea of introducing a null reference object, another possibility would be to create a new attribute like wikibase:hasReference with a boolean datatype constraint. I will create a new ticket for this issue, I guess.
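
To illustrate the second idea (wikibase:hasReference is hypothetical; it is not part of the current ontology), the unreferenced-statement count would become a trivial query:

PREFIX wikibase: <http://wikiba.se/ontology#>

# wikibase:hasReference is the hypothetical boolean flag proposed above, not an existing ontology term
SELECT (COUNT(?wds) AS ?unreferenced) WHERE {
  ?wds wikibase:hasReference false .
}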

So a rough version of my approach can be seen at https://github.com/wikimedia/analytics-limn-wikidata-data/blob/master/graphite/sparql/references.php
First, get all properties that should be used as references:

SELECT ?s WHERE {?s wdt:P31/wdt:P279* wd:Q18608359}

And then query the counts for each

		$query .= "SELECT (count(?s) AS ?scount) WHERE {";
		$query .= "?wdref <http://www.wikidata.org/prop/reference/$propertyId> ?x .";
		$query .= "?s prov:wasDerivedFrom ?wdref";
		$query .= "}";

Of course this runs into the issue that a single statement can be returned in multiple counts.

So instead of this I will simply query for the statements that are referenced by each property (a query which completes, but on the public interface times out when sending back the result) and then do some post-processing to figure out the number that we actually want (sketched below).
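
A minimal sketch of such a per-property query, using P143 purely as an example and leaving deduplication of the returned statement URIs to post-processing instead of a server-side DISTINCT (pr: is <http://www.wikidata.org/prop/reference/>, as in the PHP snippet above):

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX pr: <http://www.wikidata.org/prop/reference/>

# returns one row per statement/reference link (duplicates possible); unique statements are counted client-side afterwards
SELECT ?s WHERE {
  ?wdref pr:P143 ?x .
  ?s prov:wasDerivedFrom ?wdref .
}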

Me doing this is just blocked on T120010 now.

Also, when digging into all of these queries, it turns out that adding distinct is what actually causes them to run over the execution time limit.
If you remove the distinct from your query, it will actually complete rather quickly.

So basically a clever adaptation of what I suggested in T119775 to get statements referenced to the Wikipedias. It works, but it seems a very hacky way around the core problem of not having a way to ask how many references a statement has.

So, just so I am clear on this: a statement-to-reference triple is always unique in the dataset? I was under the assumption that a single reference statement could potentially be duplicated with different hashes, which is why distinct would need to be enforced on the subject. In theory, there should also be metadata on the reference that identifies it as "the latest" version, and previous revisions should not simply be replaced. This is another issue, I guess.

Imho, there are clear problems with the reference implementation that should be addressed and not just worked around, which is why I created T120166 to start. Is the objective here just to produce some numbers, or to improve the quality of the data?

You do need distinct if you want the correct number there! I was simply pointing out that distinct is what makes the query a long one, not actually the count.

I think the issue with potential duplication is being addressed and the datasets are being rebuilt this week https://phabricator.wikimedia.org/T116622#1839670

My goal is to have these numbers by the end of the year, hence my working around any potential problems right now!

@Addshore Some progress was made on this in T120166. The only "practical" way to get the statement and reference metrics is to facet the data by property. It is just not possible to run counting queries against the whole database and get any reasonable response time.

This means that any large domain or range metric count should iterate over all 1800+ properties with separate SPARQL calls and then aggregate the numbers. We can do this for the statement -> reference arity with:

PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wd: <http://www.wikidata.org/entity/> 
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prov: <http://www.w3.org/ns/prov#>
prefix p: <http://www.wikidata.org/prop/>

SELECT ?nrefs (COUNT(?wds) AS ?count) WHERE {
  {
    SELECT ?wds (COUNT(DISTINCT(?ref)) AS ?nrefs)
    WHERE {
        ?item p:$property ?wds .
        OPTIONAL {?wds prov:wasDerivedFrom ?ref } .
    } GROUP BY ?wds
  }
} GROUP BY ?nrefs 
ORDER BY ?nrefs

Would you do this in PHP? If you want to handle this, just let me know, otherwise we could reuse the bulk sparql scripts that I have already done in R.

In addition to tracking aggregates, it would also be useful to show all property counts in a table, like I did here: http://wdm.wmflabs.org/?t=wikidata_property_usage_count.

Your above query is slightly off somewhere, and the one below is actually correct!

PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wd: <http://www.wikidata.org/entity/> 
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prov: <http://www.w3.org/ns/prov#>
prefix p: <http://www.wikidata.org/prop/>

SELECT ?nrefs (COUNT(?wds) AS ?count) WHERE {
  {
    SELECT ?wds (COUNT(DISTINCT(?ref)) AS ?nrefs)
    WHERE {
        ?item p:P227 ?wds .
        OPTIONAL { ?wds prov:wasDerivedFrom ?ref } .
    } GROUP BY ?wds
  }
} GROUP BY ?nrefs 
ORDER BY ?nrefs

So this is listing the number of statements that have n number of references?

In addition to tracking aggregates, it would also be useful to show all property counts in a table, like I did here: http://wdm.wmflabs.org/?t=wikidata_property_usage_count.

Table support is in the next version of Grafana.

I can look at adding this query to our collection next week!

I think that you may have missed the point. I added the $property variable in the above query to indicate that this has to be run for every property. p:P227 is a random example.

I am still confused. Running this for P143 gives the following:

nrefs 	count
0	920
1	8

Since P143 is primarily a "reference type" property, it should be used when the reference node is the subject (with a few exceptions, apparently). The query only evaluates the arity of the reference nodes as objects. So, the results for P143 are expected.

Okay, I'm struggling to see which part of the todo stats this is covering.

Obviously, a main aspect of the data presented in the todo stats is "referenced statements" (even though the chart labels there are wrong). Whether or not this query maps directly to todo is actually not the key issue. Clearly, measuring data quality requires that the arity of statement-to-reference relationships is quantified. Right?

This assumption is based on Wikipedia's policy of maintaining an NPOV. And, unfortunately, all unreferenced statements contain a "bias" that makes the data theoretically worthless, even though they may in fact be "correct".

As far as I can see we are now covering all of the parts of wikidata-todo/stats that we wanted!