
RDF dump performance for SDC
Closed, ResolvedPublic


I did a small test of RDF dump generation for SDC/mediainfo. Elasticsearch data shows there are about 500k files on Commons with labels and about 850k files with statements (these sets largely intersect). The way we dump entities right now, we scan all the files (page IDs) and skip those that do not have structured data. However, since only about 2% of files currently have data, this is a very wasteful process: we process roughly 100 pages to find one proper mediainfo entity. We may want to find a way to do better, though I'm not sure the current classes allow it - we may have to implement some special class instead of SqlEntityIdPager.
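A minimal sketch of the idea, assuming a simplified version of MediaWiki's multi-content-revision schema (`page`, `slots`, `slot_roles` tables), illustrated here with an in-memory SQLite database: instead of paging over every page ID, select only the pages whose latest revision actually has a `mediainfo` slot. The table layout is a stand-in, not the production schema.

```python
import sqlite3

# Build a tiny mock of the MCR tables (assumed, simplified schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_latest INTEGER);
CREATE TABLE slots (slot_revision_id INTEGER, slot_role_id INTEGER);
CREATE TABLE slot_roles (role_id INTEGER PRIMARY KEY, role_name TEXT);
""")
conn.executemany("INSERT INTO slot_roles VALUES (?, ?)",
                 [(1, "main"), (2, "mediainfo")])
# Pages 10 and 30 have structured data; page 20 has only a main slot.
conn.executemany("INSERT INTO page VALUES (?, ?)",
                 [(10, 100), (20, 200), (30, 300)])
conn.executemany("INSERT INTO slots VALUES (?, ?)",
                 [(100, 1), (100, 2), (200, 1), (300, 1), (300, 2)])

# Select only page IDs whose latest revision carries a mediainfo slot,
# rather than scanning all pages and discarding the ~98% without data.
rows = conn.execute("""
    SELECT p.page_id FROM page p
    JOIN slots s ON s.slot_revision_id = p.page_latest
    JOIN slot_roles r ON r.role_id = s.slot_role_id
    WHERE r.role_name = 'mediainfo'
    ORDER BY p.page_id
""").fetchall()
print([pid for (pid,) in rows])  # → [10, 30]
```

A pager built around a query like this would hand the dumper only real mediainfo entities, at the cost of an extra join per batch.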

I tried dumping 100K mediainfo entities, and that took 166.5 minutes. On one hand, given that we can parallelize, if we split the work into 8 shards we might finish in a reasonable time. On the other hand, an average of 10 entities per second is too slow. If we expect coverage of files with mediainfo to increase significantly (e.g. 10x or more), then the scanning overhead becomes less of a big deal (though T222497: dumpRDF for MediaInfo entities loads each page individually still remains a factor). As it stands, the RDF dumping process for mediainfo is very inefficient.
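The arithmetic behind those numbers, using the figures from the test above (the 8-shard split is the hypothetical from this comment, not a measured run):

```python
# Observed: 100K entities dumped in 166.5 minutes.
entities = 100_000
minutes = 166.5
rate = entities / (minutes * 60)   # entities per second

# Extrapolate to the ~850k files with statements, split over 8 shards.
total_files = 850_000
shards = 8
hours = total_files / (rate * shards) / 3600

print(round(rate, 1), "entities/s,", round(hours, 1), "hours with 8 shards")
# → 10.0 entities/s, 2.9 hours with 8 shards
```

So sharding makes the current corpus tractable, but a 10x growth in coverage would push a full dump back toward the same wall-clock cost unless the per-entity rate improves too.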

Related Objects

Event Timeline

@Smalyshev Do you know how many entries have structured data on deployment-prep? Is that a useful testing ground right now or should we be populating the data over there first?

Probably not a lot. A search for English labels returns 188 results; unfortunately, search for statements (and for labels in every language) doesn't seem to work (probably needs a reindex?), so I don't know how many there are, but probably also not a lot. I'll check tomorrow whether I can get more specific figures.

I'm looking at deployment-db05 now; there are 63,332 rows in the revision table and 53,250 rows in the content table. I guess we need to double the number of revisions and then add structured data for those entries. We can probably be clever about this via a script. Starting to get clever about this: a script able to generate 50k small images with metadata that can be extracted for use in depicts and/or caption statements.
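One way the metadata side of such a generator could look, as a hedged sketch: cycle through border/background color combinations and attach a caption plus a depicts target to each generated filename. The color list, filename pattern, and Q-id pool here are all illustrative placeholders, not the actual script.

```python
import itertools
import json

COLORS = ["red", "green", "blue", "yellow", "black"]
# Hypothetical pool of 1k item IDs to use as depicts targets.
DEPICTS_POOL = [f"Q{n}" for n in range(1, 1001)]

def make_records(count):
    """Yield metadata records for `count` generated test images."""
    combos = itertools.cycle(itertools.product(COLORS, COLORS))
    for i, (border, bg) in zip(range(count), combos):
        yield {
            "filename": f"Icon_{border}_{bg}_{i:05d}.png",
            "caption": f"{border}-bordered icon on {bg}",
            "depicts": DEPICTS_POOL[i % len(DEPICTS_POOL)],
        }

records = list(make_records(3))
print(json.dumps(records[0]))
```

Scaling `count` to 50_000 gives every image a deterministic caption and depicts target, which makes the resulting dump output easy to verify.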

Gehel added a subscriber: Gehel.

Note that T222497 needs to be resolved before we can actually have a working dump.

Assigning to Cormac for now so he can assess.

I've started generating, uploading and captioning images on beta commons today, using the latest version of the script linked above. I'd like to add some depicts statements too. In any case, by the end of the week expect that we'll have several batches of these little icons with different borders and background colors, all captioned for folks' testing needs.

Adding items to wikidata in deployment-prep for use in depicts statements for the uploaded images in beta commons. Depicts statements early next week most likely.

Bulk adds of depicts statements on deployment-prep will start this evening, now that the code is ready. It will run over at least a couple of days. Once complete, we'll have 3k images on beta commons with captions and depicts statements, referencing 1k items on beta wikidata. I'd like to get us up to 50k total over the next few weeks.
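The "run over a couple of days" pacing can be sketched as a throttled batch loop; the `edit` callable here is a stub standing in for whatever makes the actual wiki edit, and the batch size and delay are illustrative, not the values used on deployment-prep.

```python
import time

def batches(items, size):
    """Split `items` into consecutive chunks of at most `size`."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def run(items, batch_size=50, delay=0.0, edit=lambda item: None):
    """Apply `edit` to each item, sleeping between batches to throttle."""
    done = 0
    for batch in batches(items, batch_size):
        for item in batch:
            edit(item)
            done += 1
        time.sleep(delay)  # spread the run out over hours/days
    return done

print(run(list(range(120)), batch_size=50))  # → 120
```

Throttling between batches rather than between individual edits keeps the per-edit overhead low while still bounding the sustained request rate.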

I'm not sure if T222497 covers this stuff and, if not, what is actionable here by the structured data team. @ArielGlenn any thoughts?


Yes, it does. In the meantime, there are now about 15k entries on beta commons with MediaInfo content (captions and depicts), which can be used for short tests.