
RDF dump performance for SDC
Open, Needs Triage · Public


I did a small test of RDF dump generation for SDC/MediaInfo. Elasticsearch data shows that there are about 500k files on Commons with labels and about 850k files with statements (these sets largely intersect). The way we dump entities right now, we scan all files (page IDs) and skip those that do not have structured data. However, since only about 2% of files currently have data, this is a very wasteful process: we essentially process about 100 pages to find one proper MediaInfo entity. We may want to find a way to do better, though I'm not sure the current classes allow it; we may have to implement some special class instead of SqlEntityIdPager.
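A minimal sketch of the "special pager" idea, using SQLite and made-up table/column names (this is not the real MediaWiki schema): instead of scanning every page ID and discarding the ~98% without structured data, the pager's query joins against the slot data so it only ever returns pages that have a mediainfo slot.

```python
import sqlite3

# Illustrative schema only: the real MediaWiki slots table is keyed by
# revision, not page, and uses role IDs; this sketch just shows the
# filtering-join idea behind a hypothetical mediainfo-only pager.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
CREATE TABLE slots (slot_page_id INTEGER, slot_role TEXT);
""")

# 10 file pages; only pages 3 and 7 carry a mediainfo slot.
conn.executemany("INSERT INTO page VALUES (?, ?)",
                 [(i, f"File:Example_{i}.jpg") for i in range(1, 11)])
conn.executemany("INSERT INTO slots VALUES (?, ?)",
                 [(3, "mediainfo"), (7, "mediainfo"), (5, "main")])

def fetch_ids_with_mediainfo(after_id, batch_size):
    """Page through only the page IDs that actually have structured data."""
    rows = conn.execute(
        """SELECT DISTINCT p.page_id
           FROM page p JOIN slots s ON s.slot_page_id = p.page_id
           WHERE s.slot_role = 'mediainfo' AND p.page_id > ?
           ORDER BY p.page_id LIMIT ?""",
        (after_id, batch_size)).fetchall()
    return [r[0] for r in rows]

print(fetch_ids_with_mediainfo(0, 100))  # → [3, 7]
```

The batching interface (`after_id` / `batch_size`) mirrors how an ID pager is typically driven; the point is that the wasteful skip loop moves into the database, which can use an index on the slot role.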

I tried dumping 100K MediaInfo entities, and that took 166.5 minutes. On one hand, given that we can parallelize, if we split the work into 8 shards we might finish in a reasonable time. On the other hand, an average of 10 items per second is too slow. If we expect the coverage of files with MediaInfo data to increase significantly (e.g. 10x or more), then it's maybe not that big of a deal (though T222497: dumpRDF for MediaInfo entities loads each page individually still remains a factor); but as it stands, the RDF dumping process for MediaInfo is very inefficient.
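A quick back-of-the-envelope check of the figures above (the 100K/166.5-minute numbers come from the test in this comment; the 850k total and 8 shards are the scenario discussed, with linear scaling assumed):

```python
# Throughput measured in the test: 100K entities in 166.5 minutes.
entities = 100_000
minutes = 166.5
rate = entities / (minutes * 60)  # items per second
print(f"{rate:.1f} items/s")      # → 10.0 items/s

# Projected wall-clock time to dump 850k entities (the current count of
# files with statements) across 8 parallel shards, assuming linear scaling.
total = 850_000
shards = 8
hours = total / rate / shards / 3600
print(f"{hours:.1f} hours")       # → 2.9 hours
```

So at today's coverage a sharded run is tolerable, but a 10x growth in coverage at the same per-item cost would push a full dump toward ~30 hours, which is why the per-item rate itself needs to improve.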

Event Timeline

@Smalyshev Do you know how many entries have structured data on deployment-prep? Is that a useful testing ground right now or should we be populating the data over there first?

Probably not a lot. A search for English labels returns 188 results; unfortunately, searching for statements and for any label doesn't seem to work (probably needs a reindex?), so I don't know how many there are, but probably also not a lot. I'll check tomorrow if I can get more specific figures.

I'm looking at deployment-db05 now, and there are 63,332 rows in the revision table and 53,250 rows in the content table. I guess we need to double the number of revisions and then add the structured data for those entries. We can probably be clever about this via a script: generate ~50k small images with metadata that can be extracted for use in depicts and/or caption statements.
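A hypothetical sketch of that script's data-generation half: produce synthetic metadata rows (filename, caption, depicts target) that an import step could then turn into files plus structured data on deployment-prep. The field names and the Q-item pool here are made up for illustration.

```python
import csv
import io
import random

# Illustrative pool of depicts targets; real Q-IDs would come from a
# curated list of items that exist on the target wiki.
DEPICTS_POOL = ["Q144", "Q146", "Q3736439"]

def generate_metadata(n, seed=42):
    """Build n synthetic records, deterministically for a given seed."""
    rng = random.Random(seed)
    return [
        {
            "filename": f"Synthetic_test_image_{i:05d}.png",
            "caption_en": f"Synthetic test image number {i}",
            "depicts": rng.choice(DEPICTS_POOL),
        }
        for i in range(n)
    ]

def to_csv(rows):
    """Serialize the records to CSV for a downstream import step."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["filename", "caption_en", "depicts"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

rows = generate_metadata(50_000)
print(len(rows), rows[0]["filename"])
```

Generating the metadata separately from the images keeps the expensive part (uploading and attaching mediainfo) replayable, and a fixed seed makes the test corpus reproducible across environments.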

Gehel added a subscriber: Gehel.

Note that T222497 needs to be resolved before we can actually have a working dump.