
data for analyzing and visualizing the identifier landscape of Wikidata
Closed, Resolved (Public)

Description

We want to visualize the external identifier landscape of Wikidata. For this Goran needs data. We need the following for every Item and Lexeme:

Item ID -> Property ID of Property used on that Item of type external identifier
Lexeme ID -> Property ID of Property used on that Lexeme of type external identifier (including usage in Forms and Senses)

We want to get this data again roughly every 3 months to update the visualisations.

Event Timeline

Lydia_Pintscher created this task.
Lydia_Pintscher moved this task from Incoming to Ready to estimate on the Wikidata-Campsite board.

@Lydia_Pintscher @Addshore @JAllemandou

Hey: don't you think that we could get this from the pagelinks table in Wikidatawiki's database?

The way I see it (please correct me if I am wrong):

0. In the pagelinks table we have the following fields:

  • pl_from int unsigned NOT NULL default 0 - Key to the page_id of the page containing the link;
  • pl_from_namespace int NOT NULL default 0 - Namespace for this page;
  • pl_namespace int NOT NULL default 0 - Key to page_namespace of the target page;
  • pl_title varchar(255) binary NOT NULL default '' - Key to page_title of the target page;

1. I get the Wikidata IDs of all external identifiers of interest - and these should also correspond to the respective page titles on Wikidatawiki (I can do SPARQL, or a Blazegraph GAS program, for this);
2. I would probably need to have Wikidatawiki's pagelinks table sqooped to Hadoop, so that I can cache it in Spark and join it with the external identifiers' titles there.

Since we have the pl_from_namespace field in the pagelinks table, we don't even need the whole table (because we are looking for items and lexemes only, so we first filter out everything else).

Also, it is not impossible that R alone could handle this on, say, stat1007 - the Wiktionary Cognate Dashboard back-end/update engine works with hundreds of millions of rows exported from SQL there, and crunches the numbers w. the {data.table} package regularly (it's updated every six hours). I could experiment to see how the problem scales and then decide whether we go for Pyspark or do it in R.

Thus: if the external identifiers are recognized as pagelinks on the respective item/lexeme page on Wikidatawiki... well, we have the data then and it boils down to the question of optimal data engineering for the problem at hand.

Please let me know what you think. Thanks.
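
For illustration, here is a rough Pyspark sketch of step 2 (joining the sqooped pagelinks with the external-identifier titles), assuming the pagelinks table has been sqooped into a Hive table (goransm.wikidatawiki_pagelinks is a placeholder name) and that the external-identifier property IDs have been fetched beforehand via SPARQL into a one-column CSV file ext_id_props.csv with a header property_id; all of these names are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wd_extid_pagelinks").getOrCreate()

# External-identifier property IDs, one per line (e.g. "P213"); placeholder file.
ext_props = spark.read.csv("ext_id_props.csv", header=True)

# Placeholder name for the sqooped wikidatawiki pagelinks table.
pagelinks = spark.table("goransm.wikidatawiki_pagelinks")

# Keep only links that originate from items (NS 0) or lexemes (NS 146)
# and point into the Property namespace (NS 120).
candidate_links = (pagelinks
    .filter(F.col("pl_from_namespace").isin(0, 146))
    .filter(F.col("pl_namespace") == 120))

# pl_title of a link into the Property namespace is the bare property ID
# (e.g. "P213"), so an inner join keeps external-identifier properties only.
item_property_pairs = (candidate_links
    .join(ext_props, candidate_links.pl_title == ext_props.property_id)
    .select("pl_from", "pl_from_namespace", "property_id"))

item_property_pairs.write.parquet("extid_pairs_from_pagelinks", mode="overwrite")

Note that this keys the pairs by page_id (pl_from), so a further join against the sqooped page table would still be needed to map page IDs back to Q/L entity IDs.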

It's probably more accurate to generate this from the actual JSON of the entities.
The page links table might get you a pretty accurate result, but there can be other situations where a property could be "linked" on an item when it is not actually used as a property value for a statement.
For example, properties can be used in statement values (or in any snak value).

image.png (281×693 px, 15 KB)

@Addshore Exactly (@Lydia_Pintscher - We have already discussed similar approximate solutions).

Ok, I do not know if the Wikidata dump is already in the Data Lake or not, but I can check (I forgot which Phab ticket that was). Now,

  • if I have the dump in the Data Lake, I can study the structure of the table and get the dataset out for T204440, but if not
  • someone who can parse the dumps would need to give me a hand (in theory I could parse it in R, but I guess a Java person who can do WDTK could do it more efficiently).

The page links table might get you a pretty accurate result, but there can be other situations where a property could be "linked" on an item when it is not actually used as a property value for a statement.
For example, properties can be used in statement values (or in any snak value).

True, but that likely happens so rarely that we can ignore it for a general visualization. So I'd be ok with it if it makes things easier.

@Lydia_Pintscher

So I'd be ok with it if it makes things easier.

Depends :) My proposal in T214897#4918562 does not make things easier for me in any way, on the contrary.
But I guess we should give it a try if other approaches (e.g. someone parsing the dump w. WDTK to produce the dataset, having the WD dump in the Data Lake) are more expensive.
Let me know what you think.

Hi folks - Sorry for the late answer, I was at WMF all-hands last week and did not check tasks.
I have started work on having the wikidata-json dumps imported on the cluster, and while some data is available for ad-hoc analysis (see hdfs:///user/joal/wmf/data/wmf/mediawiki/wikidata_parquet), this dataset is not updated on a regular basis (not production-ready).
I however think that for a manual update every 3 months, it could be easy.
@GoranSMilovanovic - What do you think?

@JAllemandou Let me take a look at it and I'll get back to you. Thanks!

@JAllemandou @Lydia_Pintscher @Addshore I can access the JSON WD dumps from hdfs now.
I should be able to solve this in Pyspark. Reporting back as soon as I have something.

@JAllemandou Thanks for the dataset.

@Addshore @RazShuty @Lydia_Pintscher

Is there any reason why the Wikidata namespace = 0 (items) page ids are not included in the wbc_entity_usage table for wikidatawiki?

Of course the table would be huge, but huge SQL tables can still be sqooped.

  • testing pyspark solution for a visualization dataset now;
  • dump parsed; attempt at stat.crosstab before filter.
  • resulting dataset is too large to collect.
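
For reference, a minimal Pyspark sketch of the ETL step just described, assuming the ad-hoc dump at hdfs:///user/joal/wmf/data/wmf/mediawiki/wikidata_parquet exposes an id column and a claims array whose elements carry mainSnak.property and mainSnak.dataType; the actual parquet schema may use different field names, so treat these as assumptions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wd_extid_pairs").getOrCreate()

entities = spark.read.parquet(
    "hdfs:///user/joal/wmf/data/wmf/mediawiki/wikidata_parquet")

# One row per (entity, claim), keeping only claims whose property is of the
# external-id data type, then reduce to distinct item-property pairs.
pairs = (entities
    .select("id", F.explode("claims").alias("claim"))
    .filter(F.col("claim.mainSnak.dataType") == "external-id")
    .select(F.col("id").alias("item"),
            F.col("claim.mainSnak.property").alias("property"))
    .distinct())

pairs.write.csv("wd_extid_item_property_pairs", sep="\t", mode="overwrite")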

Is there any reason why the Wikidata namespace = 0 (items) page ids are not included in the wbc_entity_usage table for wikidatawiki?

Yes, as data is not provided to Wikidata NS 0 via the Wikidata client code.

You should, however, be able to infer which entities are used on an NS 0 item by looking at the links tables.

@Addshore

You should, however, be able to infer which entities are used on an NS 0 item by looking at the links tables.

I've already tried, but it's a monstrous SQL query to run; experimenting w. JSON dumps and pyspark still.

Thanks.

  • Current status: attempt no. 2 at Spark stat.crosstab, this time after filtering for the external-id dataType.

No way this is going to work with Spark stat.crosstab:

  • the limit on the number of pairs to collect from a contingency table is 1e6,
  • while we're looking at a problem of approximately 55M x 4247
  • (i.e. there are ~55M items to inspect x 4247 external identifiers to cross-tabulate across the items).

This is going to be tough.

Edit. The size of the problem is ~26M items x 1K+ identifiers.
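
For the record, a small sketch of why the crosstab route cannot work, reusing the pairs dataset from the ETL sketch earlier in this thread (path and column names are carried-over assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wd_extid_crosstab").getOrCreate()

# Item-property pairs written by the earlier ETL sketch (placeholder path).
pairs = (spark.read
    .csv("wd_extid_item_property_pairs", sep="\t")
    .toDF("item", "property"))

n_items = pairs.select("item").distinct().count()       # ~26M with at least one external-id
n_props = pairs.select("property").distinct().count()   # ~1.2K

# pairs.stat.crosstab("property", "item") would have to pivot `item` into
# millions of columns; the Spark documentation for
# DataFrameStatFunctions.crosstab expects fewer than 1e4 distinct values per
# column and returns at most 1e6 non-zero pair frequencies, so the full
# contingency table cannot be produced this way.
print(n_items, n_props)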

  • Cutting this thing into batches (2.5M items x a varying number of external identifiers per batch) w. pyspark (a rough sketch of this batching step follows the dcast error output below);
  • hopefully, R {data.table} will be able to put it back together w. rbindlist() and compute the contingency table.
  • 250,728,825 item-property pairs, and we are talking external identifiers only. That's Wikidata.
  • item-property pairs in long format imported to R (w. fread, rbindlist);
  • running a spread operation now to produce the contingency table.
  • spread operation failed (in-memory R);
  • working on an alternative approach now.
  • testing bigmemory R approaches now (bigtable()).
  • failed; now experimenting w. {data.table} and dcast() to produce contingency, and
  • {parallelDist} to compute the similarity matrix;
  • if this fails, I am going to test a sampling strategy.
  • failed; even {data.table}, which is notoriously efficient in dealing with large data sets, fails to produce the contingency table (run on stat1007):
### --- Contingency w. {data.table}
t1 <- Sys.time()
cT <- dcast(dataSet, property ~ item, 
            fill = 0,
            drop = F, 
            value.var = "value")
Sys.time() - t1

results in

Error in CJ(1:1190, 1:22295882) :
  Cross product of elements provided to CJ() would result in 26532099580 rows which exceeds .Machine$integer.max == 2147483647

Fallback option is active now: sampling.
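
For completeness, a rough Pyspark sketch of the batching step mentioned above; assigning items to batches by hashing the item ID is an illustrative assumption, not necessarily the batching scheme actually used:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wd_extid_batches").getOrCreate()

# Item-property pairs from the earlier ETL sketch (placeholder path).
pairs = (spark.read
    .csv("wd_extid_item_property_pairs", sep="\t")
    .toDF("item", "property"))

# Spread the ~26M items over batches of roughly 2.5M items each; the hash is
# only used to split the output into files small enough for R to fread()
# and rbindlist() afterwards.
n_batches = 11
batched = pairs.withColumn("batch", F.abs(F.hash("item")) % n_batches)

(batched
    .write
    .partitionBy("batch")
    .csv("wd_extid_pairs_batched", sep="\t", mode="overwrite"))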

Hey @GoranSMilovanovic - I don't have a good understanding of what you're after, but having read pairs and contingency table above, maybe this Spark function could be helpful: https://spark.apache.org/docs/2.3.0/api/java/index.html?org/apache/spark/sql/DataFrameStatFunctions.html

@JAllemandou Thank you. I have already considered using stat.crosstab, see T214897#5024647. Spark did the ETL part here and produced the data successfully.

I don't have a good understanding of what you're after

In a nutshell:

  • 250M rows = item x property pairs = two columns,
  • build a contingency table of unique(items) x unique(properties);
  • it will be binary, since every considered property matches a given item either zero times or once;
  • the contingency table is of approx. 26M x 1K+ size;
  • compute a property x property Jaccard distance matrix from the binary contingencies.

And I will have to sample; at this point I don't see a workaround.
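
To make the target computation concrete, here is a toy-sized Python sketch (pandas/scipy, with made-up item and property IDs); the real 26M x 1K+ matrix is of course far too large for this direct approach, which is exactly why sampling comes in:

import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Toy item-property pairs standing in for the real 250M-row dataset.
pairs = pd.DataFrame({
    "item":     ["Q1", "Q1", "Q2", "Q3", "Q3"],
    "property": ["P213", "P227", "P213", "P227", "P345"],
})

# Binary contingency: rows = items, columns = external-identifier properties.
contingency = pd.crosstab(pairs["item"], pairs["property"]).clip(upper=1)

# Pairwise Jaccard distances between properties (columns), i.e. between the
# sets of items each identifier is used on.
jaccard = squareform(pdist(contingency.T.values.astype(bool), metric="jaccard"))
jaccard_df = pd.DataFrame(jaccard,
                          index=contingency.columns,
                          columns=contingency.columns)
print(jaccard_df.round(3))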

  • The results of first experiments show the sampling approach to be quite reliable.

@RazShuty @Lydia_Pintscher @JAllemandou

Our approach here will be to use an approx. 1M sized sample of WD item-property pairs to produce the identifier x identifier Jaccard distance matrix.

  • Rationale:
    • the dataset as produced in Spark has 250M rows x two columns (item-identifier pairs);
    • the desired binary contingency matrix to compute the Jaccard distances is of approx. dimension 26M x 1K+;
    • due to internal constraints, Spark stat.crosstab() cannot produce the binary contingency matrix that we need to compute the Jaccard distances;
    • while R {data.table} can manage the dataset, it still cannot produce the desired contingency matrix;
    • moreover, even if we could have the contingency matrix produced in an efficient manner, it is questionable what procedure could deliver the Jaccard distances efficiently.

      The results of the following experiment indicate that we can safely proceed with sampling:
  • Take 10 random samples from the ~250M item x identifier pairs,
  • sampling identifiers proportionally (i.e. compute p(identifier) and weight the identifier sample accordingly),
  • and including one observation for each identifier whose p rounds to 0 (due to rounding, not due to the absence of the identifier);
  • for each sample, produce a binary contingency matrix;
  • from each contingency matrix compute all pair-wise identifier-identifier Jaccard distances, stored as a vector;
  • compute Pearson correlation coefficients between the distance vectors obtained from the 10 random samples (a sketch of this procedure follows the correlation matrix below).

Here's the correlation matrix; obviously, ~1M sized proportional random samples of item-property pairs are quite representative of the full dataset of ~250M item-property pairs:

         sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
sample1  1.0000  0.9992  0.9992  0.9989  0.9991  0.9991  0.9992  0.9992  0.9990  0.9992
sample2  0.9992  1.0000  0.9995  0.9992  0.9996  0.9992  0.9994  0.9995  0.9991  0.9994
sample3  0.9992  0.9995  1.0000  0.9993  0.9996  0.9993  0.9994  0.9996  0.9992  0.9994
sample4  0.9989  0.9992  0.9993  1.0000  0.9992  0.9991  0.9992  0.9991  0.9990  0.9991
sample5  0.9991  0.9996  0.9996  0.9992  1.0000  0.9994  0.9994  0.9995  0.9992  0.9994
sample6  0.9991  0.9992  0.9993  0.9991  0.9994  1.0000  0.9993  0.9992  0.9991  0.9991
sample7  0.9992  0.9994  0.9994  0.9992  0.9994  0.9993  1.0000  0.9994  0.9992  0.9994
sample8  0.9992  0.9995  0.9996  0.9991  0.9995  0.9992  0.9994  1.0000  0.9992  0.9995
sample9  0.9990  0.9991  0.9992  0.9990  0.9992  0.9991  0.9992  0.9992  1.0000  0.9991
sample10 0.9992  0.9994  0.9994  0.9991  0.9994  0.9991  0.9994  0.9995  0.9991  1.0000
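
A hedged sketch of the validation procedure above, written in Python (pandas/scipy) as a stand-in for the actual Pyspark + R pipeline; the file path, column names and the exact stratified-sampling details are illustrative assumptions:

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

# Full item-property pairs exported from Spark (placeholder path and columns).
pairs = pd.read_csv("wd_extid_item_property_pairs.tsv", sep="\t",
                    names=["item", "property"])

def jaccard_vector(sample):
    # Binary contingency for one sample, then all pairwise property-property
    # Jaccard distances flattened to a vector; the pair ordering is stable
    # across samples as long as every property occurs in every sample.
    contingency = pd.crosstab(sample["item"], sample["property"]).clip(upper=1)
    return pdist(contingency.T.values.astype(bool), metric="jaccard")

def proportional_sample(pairs, n, seed):
    # Sample ~n pairs, stratified by property so that each identifier keeps
    # roughly its share of the full dataset and contributes at least one row.
    total = len(pairs)
    return (pairs
            .groupby("property", group_keys=False)
            .apply(lambda g: g.sample(
                n=max(1, int(round(n * len(g) / total))),
                random_state=seed)))

samples = [proportional_sample(pairs, n=1_000_000, seed=s) for s in range(10)]
vectors = np.vstack([jaccard_vector(s) for s in samples])

# Pearson correlations between the 10 distance vectors (rows of `vectors`),
# i.e. the kind of correlation matrix reported above.
corr = np.corrcoef(vectors)
print(np.round(corr, 4))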
  • Next steps: (1) proceed to produce the dataset; (2) resolve the ticket and proceed to visualization: T204440.

Final data set produced. Closing the ticket.

  • Sampling is phased out;
  • the Jaccard similarity matrix is computed from the full dataset.