RDF dump performance for SDC
Closed, Resolved (Public)

Description

I did a small test of RDF dump generation for SDC/mediainfo. Elasticsearch data shows that there are about 500k files on Commons with labels and about 850k files with statements (these sets largely intersect). The way we dump entities right now, we scan all the files (page IDs) and skip those that do not have structured data. However, right now only about 2% of files have data, so this is a very wasteful process: we essentially process 100 pages to find one proper mediainfo entity. We may want to find a way to do better, though I'm not sure the current classes allow it; we may have to implement a special class instead of SqlEntityIdPager.
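To make the contrast concrete, here is a minimal Python sketch of the two iteration strategies described above. It is only an illustration, not the Wikibase code: `scan_and_skip`, `paged_ids`, `fetch_batch` and the toy data are all hypothetical names, and the 1-in-50 ratio simply mirrors the "about 2%" figure.

```python
from typing import Callable, Iterable, Iterator, List


def scan_and_skip(page_ids: Iterable[int],
                  has_data: Callable[[int], bool]) -> Iterator[int]:
    """Approach described above: visit every page ID, skip pages without data."""
    for page_id in page_ids:
        if has_data(page_id):
            yield page_id


def paged_ids(fetch_batch: Callable[[int, int], List[int]],
              batch_size: int = 100) -> Iterator[int]:
    """Alternative: only ever ask for IDs that have data, one batch at a time.

    fetch_batch(after_id, limit) is a hypothetical callback returning up to
    `limit` page IDs greater than `after_id` that carry structured data,
    in ascending order. An empty list means there is nothing left.
    """
    after_id = 0
    while True:
        batch = fetch_batch(after_id, batch_size)
        if not batch:
            return
        yield from batch
        after_id = batch[-1]  # continue after the last ID we have seen


# Toy data (assumption): 5000 pages, 1 in 50 has structured data.
ALL_PAGES = range(1, 5001)
WITH_DATA = [p for p in ALL_PAGES if p % 50 == 0]

checked = 0  # counts how many pages the scan-and-skip approach inspects


def toy_has_data(page_id: int) -> bool:
    global checked
    checked += 1
    return page_id % 50 == 0


def toy_fetch_batch(after_id: int, limit: int) -> List[int]:
    # Stand-in for an indexed lookup restricted to pages that have data.
    return [p for p in WITH_DATA if p > after_id][:limit]


slow = list(scan_and_skip(ALL_PAGES, toy_has_data))
fast = list(paged_ids(toy_fetch_batch, batch_size=30))
```

Both strategies yield the same IDs, but the first inspects every page while the second only touches pages that have data. Whether the real storage layer can answer such a query cheaply is exactly the open question raised above.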

I tried dumping 100K mediainfo entities, and that took 166.5 minutes. On one hand, given that we can parallelize, if we split the work into 8 shards we might be done in a reasonable time. On the other hand, an average of 10 items per second is too slow. If we expect coverage of files with mediainfo to increase significantly (e.g. 10x or more), then the scan overhead is maybe not that big of a deal, though T222497 (dumpRDF for MediaInfo entities loads each page individually) still remains a factor. As it is now, though, the RDF dumping process for mediainfo is very inefficient.
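The arithmetic behind these figures can be checked with a short Python sketch. The 100K entities, 166.5 minutes, 850k files and 8 shards come from the text above; the helper `estimated_hours` and the assumption that every shard sustains the same measured rate are mine, so treat the projections as rough estimates only.

```python
ENTITIES_DUMPED = 100_000   # from the test run described above
ELAPSED_MINUTES = 166.5     # from the test run described above

# Measured throughput in entities per second.
rate_per_second = ENTITIES_DUMPED / (ELAPSED_MINUTES * 60)


def estimated_hours(total_entities: int, shards: int,
                    rate: float = rate_per_second) -> float:
    """Wall-clock hours, assuming each shard sustains `rate` and the
    entities are split evenly across shards (an idealisation)."""
    return total_entities / (rate * shards) / 3600


print(f"{rate_per_second:.2f} entities/s")
print(f"{estimated_hours(850_000, 8):.1f} h for 850k entities on 8 shards")
print(f"{estimated_hours(8_500_000, 8):.1f} h at 10x coverage on 8 shards")
```

Under these assumptions the rate comes out at about 10 entities per second, giving roughly 3 hours for 850k entities on 8 shards and roughly 29 hours at 10x coverage. The 10x projection is probably pessimistic, since the per-entity cost of scanning empty pages would shrink as coverage grows.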

Related Objects

Status	Subtype	Assigned	Task
Declined		dchen	T118706 Conduct heuristic evaluation of image upload and insert flow in VisualEditor
Open		None	T115858 Design improvements for mw.ForeignStructuredUpload.BookletLayout
Open		None	T115865 Insert image in content immediately after it's uploaded, skipping the "General settings" step
Duplicate		None	T115864 Figure out if the description of the image can be used as the caption on-wiki
Open	Feature	None	T53032 When inserting an image, set its caption by default to be the Commons image description
Open	Feature	None	T39534 Wikimedia Commons should support searching by color
Duplicate		None	T39535 Wikimedia Commons should support filtering by color
Resolved		None	T19503 Provide metadata support on Wikimedia Commons
Resolved		None	T51662 VisualEditor: Use Multimedia/Wikidata's proposed rich structured meta-data in the image insertion dialog
Resolved		None	T68108 [Epic] Store media information for files on Wikimedia Commons as structured data
Duplicate		None	T141602 [Objective Fiscal 19-20/Q4] (9) Provide a Proof of Concept SPARQL endpoint in support of SDoC project
Resolved		ArielGlenn	T221917 Create RDF dump of structured data on Commons
Resolved		Cparle	T230856 RDF dump performance for SDC
Resolved		Cparle	T222497 dumpRDF for MediaInfo entities loads each page individually
Resolved		ArielGlenn	T239905 dumpRdf for mediainfo entities loads data from db more often than it needs to

Event Timeline

Smalyshev created this task. Aug 21 2019, 3:40 AM

@Smalyshev Do you know how many entries have structured data on deployment-prep? Is that a useful testing ground right now or should we be populating the data over there first?

Probably not a lot. A search for English labels returns 188 results. Unfortunately, search for statements and for every label doesn't seem to work (it probably needs a reindex?), so I don't know how many there are, but probably also not a lot. I'll check tomorrow if I can get more specific figures.

I'm looking at deployment-db05 now, and there are 63332 rows in the revision table, with 53250 rows in the content table. I guess we need to double the number of revisions and then add the structured data for those entries. We can probably be clever about this via a script.

https://github.com/apergos/misc-wmf-crap/tree/master/glyph-image-generator Starting to get clever about this: the script can generate 50k small images with metadata that can be extracted for use in depicts and/or caption statements.

Jdforrester-WMF subscribed. Aug 23 2019, 4:45 PM

Multichill subscribed. Aug 26 2019, 7:45 PM

Smalyshev moved this task from Incoming to SDAW on the Wikidata-Query-Service board. Aug 27 2019, 9:39 PM

Note that T222497 needs to be resolved before we can actually have a working dump.

Assigning to Cormac for now so he can assess.

I've started generating, uploading and captioning images in beta commons today, using the latest version of the script linked above. I'd like to add some depicts statements too. In any case, by the end of the week expect that we'll have several batches of these little icons with different borders and background colors, all captioned for folks' testing needs.

Addshore moved this task from incoming to monitoring on the Wikidata board. Oct 30 2019, 2:17 PM

I'm adding items to wikidata in deployment-prep for use in depicts statements for the uploaded images in beta commons. Depicts statements will most likely follow early next week.

Bulk adds of depicts statements on deployment-prep will start this evening, now that the code is ready. It will run over a couple of days at least. Once complete we'll have 3k images on beta commons with captions and depicts statements in them, referencing 1k items on beta wikidata. I'd like to get us up to 50k total over the next few weeks.

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board. Nov 4 2019, 1:51 PM

Ramsey-WMF moved this task from Incoming to Doing on the Structured-Data-Backlog (Current Work) board. Nov 25 2019, 5:27 PM

I'm not sure if T222497 covers this stuff and, if not, what is actionable here by the structured data team. @ArielGlenn any thoughts?

In T230856#5692766, @Cparle wrote:

I'm not sure if T222497 covers this stuff and, if not, what is actionable here by the structured data team. @ArielGlenn any thoughts?

Yes it does. In the meantime there are about 15k entries on beta commons with MediaInfo content now (captions and depicts), which can be used for short tests.

ArielGlenn moved this task from Active to Blocked/Stalled/Waiting for event on the Dumps-Generation board. Nov 27 2019, 1:52 PM

Cparle closed subtask T239905: dumpRdf for mediainfo entities loads data from db more often than it needs to as Resolved. Jan 6 2020, 9:17 AM

Cparle closed this task as Resolved. Jan 6 2020, 11:18 AM

Cparle closed subtask T222497: dumpRDF for MediaInfo entities loads each page individually as Resolved.

ArielGlenn moved this task from Blocked/Stalled/Waiting for event to Done on the Dumps-Generation board. May 20 2020, 8:09 AM
