While developing T299781: [EPIC] Image suggestions backend, we collected image Wikidata QIDs from two properties, namely P18 (image) and P373 (Commons category). The dataset lives in the image_suggestions_wikidata_data Hive table.
We may gather additional image QIDs via the Structured Data on Commons (SDoC) datasets available in the Data Lake, namely the structured_data.commons_entity table.
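To make the idea concrete, here is a minimal sketch of extracting image-relevant (MID, property, QID) triples from an SDoC entity record. The record shape below (an `id` field plus a `statements` map of property to value list) is a simplified, hypothetical stand-in; the actual structured_data.commons_entity schema may differ, and the real extraction would run in Hive/Spark rather than plain Python.

```python
# Hypothetical, simplified SDoC entity shape; the real
# structured_data.commons_entity schema may differ.
DEPICTS = "P180"

def extract_item_claims(entity, properties=(DEPICTS,)):
    """Yield (MID, property, QID) triples for statements on the given
    properties whose value is a Wikidata item (a QID)."""
    mid = entity["id"]
    for prop, statements in entity.get("statements", {}).items():
        if prop not in properties:
            continue
        for statement in statements:
            value = statement.get("value")
            if value and value.startswith("Q"):
                yield (mid, prop, value)

entity = {
    "id": "M12345",
    "statements": {
        "P180": [{"value": "Q42"}],          # depicts -> a topic QID
        "P1163": [{"value": "image/jpeg"}],  # media type, not an item
    },
}
triples = list(extract_item_claims(entity))  # [("M12345", "P180", "Q42")]
```

Filtering to item-valued statements matters because many SDoC properties carry strings or quantities rather than QIDs.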
Update
As of 2023-01-30, the structured_data.commons_entity dataset contains statements with 48 distinct Wikidata properties.
We used the same sample as T311832: [SPIKE] Consider other Wikidata properties to gather additional image QIDs and queried the Wikimedia Commons Query Service for the properties we considered relevant (see observations below). Result:
| topics | section score | topics with values | total depicts values | total relevant property values |
| ------ | ------------- | ------------------ | -------------------- | ------------------------------ |
| 200    | > 10          | 48                 | 843                  | 863                            |
Observations
- We should reverse-traverse SDoC statements (MID, property, QID), because we need to look up images (MIDs) from a topic (QID)
- depicts statements can certainly increase the number of image QIDs
- some other properties look relevant:
  - Commons quality assessment can act as a filter for low-quality images
  - the implementation cost of adding these properties is relatively high
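The reverse-traversal point above amounts to building a topic-keyed index: instead of walking statements image-first, we index them as QID → set of MIDs. A minimal sketch with illustrative data (MIDs, properties, and QIDs below are made up):

```python
from collections import defaultdict

def build_reverse_index(triples):
    """Index SDoC statements by topic: QID -> set of image MIDs,
    so suggestions can look up images from a topic QID directly."""
    index = defaultdict(set)
    for mid, _prop, qid in triples:
        index[qid].add(mid)
    return index

triples = [
    ("M111", "P180", "Q42"),  # image M111 depicts topic Q42
    ("M222", "P180", "Q42"),
    ("M222", "P921", "Q64"),  # main subject
]
index = build_reverse_index(triples)
index["Q42"]  # -> {"M111", "M222"}
```

At production scale this index would be a join/group-by over the statements table rather than an in-memory dict, but the lookup direction is the same.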
Conclusion
- We don't currently have any images linked from SDoC statements
  - adding them would be a meaningful way to leverage structured data
- the depicts property may be worth the effort; the other properties likely are not
Sample dataset
Following the round 2 evaluation plan of T316149: [L] Create tool for manual evaluation of section-level image suggestions, for each target wiki we sampled 2k images from SDoC statements with the depicts, digital representation of, and main subject properties.
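The per-wiki sampling step can be sketched as follows. The row shape (wiki, MID, property) and function name are hypothetical; the actual sampling ran against the Data Lake, so this only illustrates the grouping-and-sampling logic with a fixed seed for reproducibility.

```python
import random

def sample_images_per_wiki(rows, k=2000, seed=42):
    """For each target wiki, sample up to k (MID, property) pairs.

    rows: iterable of (wiki, image MID, SDoC property) — assumed shape.
    """
    by_wiki = {}
    for wiki, mid, prop in rows:
        by_wiki.setdefault(wiki, []).append((mid, prop))
    rng = random.Random(seed)  # fixed seed => reproducible sample
    return {wiki: rng.sample(images, min(k, len(images)))
            for wiki, images in by_wiki.items()}

rows = [
    ("enwiki", "M1", "P180"),   # depicts
    ("enwiki", "M2", "P921"),   # main subject
    ("enwiki", "M3", "P6243"),  # digital representation of
    ("frwiki", "M4", "P180"),
]
sampled = sample_images_per_wiki(rows, k=2)
```

`min(k, len(images))` keeps the sample valid for wikis with fewer than k eligible images.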
The first impression is that depicts images are noisy for the section-level image suggestions use case: besides obviously wrong statements, the intuition is that a given image may indeed depict a section topic while being unrelated to the actual Wikipedia content where that section topic originates.