Structured Data on Commons reconciliation service accepts Commons category names as input
Closed, Invalid · Public · Feature

Description

Update (Sep 9, 2021): we're continuing to discuss this feature request on GitHub, as it is likely to fit better in OpenRefine's code base: 👉 https://github.com/OpenRefine/OpenRefine/issues/4143 👈

Feature summary (what you would like to be able to do and where):

Users of OpenRefine can take a simple list of Wikimedia Commons categories (one or multiple categories). The Structured Data on Commons reconciliation service then takes these categories and retrieves all file names and M-IDs from them.
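To make the request concrete, here is a rough Python sketch of the retrieval step. It assumes the standard MediaWiki Action API on Commons (list=categorymembers with cmtype=file) and relies on the M-ID of a file being "M" followed by its page ID. The function names and the User-Agent string are made up for illustration. This is not how the reconciliation service or OpenRefine does it.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://commons.wikimedia.org/w/api.php"


def category_query_params(category, continue_token=None):
    """Build parameters for a categorymembers query restricted to files."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,  # e.g. "Category:My_Category"
        "cmtype": "file",     # skip subcategories and ordinary pages
        "cmlimit": "500",
        "format": "json",
    }
    if continue_token:
        params["cmcontinue"] = continue_token
    return params


def rows_from_response(data):
    """Turn one API response into (file name, M-ID) pairs."""
    members = data.get("query", {}).get("categorymembers", [])
    # The M-ID of a Commons file is "M" followed by its page ID.
    return [(m["title"], "M{}".format(m["pageid"])) for m in members]


def fetch_category_files(category):
    """Page through the API and collect every file in one category (needs network)."""
    rows, token = [], None
    while True:
        query = urllib.parse.urlencode(category_query_params(category, token))
        request = urllib.request.Request(
            API_URL + "?" + query,
            headers={"User-Agent": "category-sketch/0.1 (illustrative example)"},
        )
        with urllib.request.urlopen(request) as response:
            data = json.load(response)
        rows.extend(rows_from_response(data))
        token = data.get("continue", {}).get("cmcontinue")
        if not token:
            return rows
```

Calling fetch_category_files("Category:My_Category") would return one (file name, M-ID) pair per file, which corresponds to one row per file in a new project.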

Benefits (why should this be implemented?):

Regular Wikimedia Commons contributors are quite used to taking Commons categories as input / starting point in various tools. Examples include AC/DC, the ISA Tool, and VisualFileChange.
GLAM files are usually well organized in categories, e.g. specific to the GLAM (example), to departments (example), to specific media file uploads (example). Making it possible to start with such categories will make it easier for GLAMs to contribute structured data with the help of the SDC reconciliation service and a tool like e.g. OpenRefine.

If we don't provide this as a functionality in the SDC reconciliation service, end users will first have to turn to another tool (e.g. PetScan) to retrieve lists of file names. This makes their workflow more convoluted (it adds an extra step) and requires them to learn to use yet another tool.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Aug 31 2021, 3:44 PM

I understand the need but intuitively I don't see how that fits in the reconciliation service. Am I understanding correctly that this would be a new importer in OpenRefine? You would create a new OpenRefine project by supplying a category, and then you would get a new project with each row corresponding to a file in that category?

This is likely to be work for Joey on the OpenRefine side of things. If listing the files in a category can be done via SPARQL, then maybe this could be made possible by having a SPARQL importer (which is a long-standing feature request): https://github.com/OpenRefine/OpenRefine/issues/1212

Another possibility would be to let people use the "Create project by fetching URL" functionality, and teach users to use it with URLs such as
https://common-recon-service.toolforge.org/fetch_category?name=Category:My_Category
If we manage to document it at the right place then I guess it can be workable (but it's not a super nice integration).

In that case we can indeed add this feature to the recon service, but it would just be another feature supported by the webservice outside of the reconciliation API specs.
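For illustration only, the URL in this suggestion could be built from a plain category name along the following lines. The fetch_category endpoint is the hypothetical one from the comment above and is not known to exist. The helper name is likewise made up.

```python
from urllib.parse import urlencode

# Hypothetical endpoint taken from the suggestion above; not an existing API.
FETCH_ENDPOINT = "https://common-recon-service.toolforge.org/fetch_category"


def fetch_category_url(category_name):
    """Return a URL a user could paste into "Create project by fetching URL"."""
    # Commons titles use underscores instead of spaces.
    title = category_name.strip().replace(" ", "_")
    if not title.startswith("Category:"):
        title = "Category:" + title
    # Keep ":" unescaped so the result matches the URL shape shown above.
    return FETCH_ENDPOINT + "?" + urlencode({"name": title}, safe=":")
```

A user would still have to know the URL pattern, which is the documentation burden mentioned above.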

Thanks for your suggestions @Pintoch 😄

IMHO it's absolutely not necessary for this to be a part of the reconciliation service. It could just as well be an importer. I can update the task description accordingly (and then it's probably more appropriate to move it to GitHub).

Am I understanding correctly that this would be a new importer in OpenRefine? You would create a new OpenRefine project by supplying a category, and then you would get a new project with each row corresponding to a file in that category?

From the hypothetical 'typical end user's' point of view, that sounds like the smoothest scenario. It would indeed be most convenient if they can 'feed' a category (or a number of categories; including the option to drill down the category tree for n levels) to OpenRefine, which would then create a project with a file in each row.

If listing the files in a category can be done via SPARQL, then maybe this could be made possible by having a SPARQL importer (which is a long-standing feature request): https://github.com/OpenRefine/OpenRefine/issues/1212

I would start with the assumption that our typical end user doesn't know SPARQL. So this sounds like something nice to have, but probably only for advanced users. I expect regular users to show up with just a list of category names.

Another possibility would be to let people use the "Create project by fetching URL" functionality, and teach users to use it with URLs such as
https://common-recon-service.toolforge.org/fetch_category?name=Category:My_Category
If we manage to document it at the right place then I guess it can be workable (but it's not a super nice integration).

In that case we can indeed add this feature to the recon service, but it would just be another feature supported by the webservice outside of the reconciliation API specs.

I agree that this could be doable, but it would also be a bit convoluted, and probably hard to understand. The 'feed OpenRefine some category names' scenario is by far the easiest for our end users IMO 😎

From the hypothetical 'typical end user's' point of view, that sounds like the smoothest scenario. It would indeed be most convenient if they can 'feed' a category (or a number of categories; including the option to drill down the category tree for n levels) to OpenRefine, which would then create a project with a file in each row.

I'm particularly fond of the pleasant way in which the ISA Tool does this. The user can specify many different categories there, and can indicate for each category how many levels deep the tool should 'dig up' files. It can be tried at https://isa.toolforge.org/campaigns/create (after logging in).
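The 'n levels deep per category' behaviour could be sketched roughly as below. This is only an illustration of the idea, not the ISA Tool's actual code. list_members is a hypothetical caller-supplied function that returns the files and subcategories of one category, so the traversal itself needs no network access.

```python
from collections import deque


def collect_files(root_depths, list_members):
    """Collect files from several category trees, each with its own depth limit.

    root_depths maps a starting category to the number of subcategory levels
    to descend (0 = only the category itself).
    list_members(category) must return (files, subcategories).
    """
    best_remaining = {}  # deepest remaining budget each category was visited with
    files, seen_files = [], set()
    queue = deque(root_depths.items())
    while queue:
        category, remaining = queue.popleft()
        # Skip categories already expanded with an equal or larger budget;
        # this also stops loops in the category graph.
        if best_remaining.get(category, -1) >= remaining:
            continue
        best_remaining[category] = remaining
        found, subcategories = list_members(category)
        for name in found:
            if name not in seen_files:  # a file can sit in several categories
                seen_files.add(name)
                files.append(name)
        if remaining > 0:
            queue.extend((sub, remaining - 1) for sub in subcategories)
    return files
```

Each returned file name would then become one row in the new project.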

The 'feed OpenRefine some category names' scenario is by far the easiest for our end users IMO 😎

I do get that, but because this feature is really specific to Commons we need to be a bit cautious about not splashing the UI with something that is only understandable by a fraction of our user base. So far the Wikibase integration has had a relatively minimal impact on the UI for those who do not use it (you only see an extension button on the top right of project screens; the rest of the UI only appears if you activate the extension).

Spinster changed the task status from Open to Stalled. · Sep 9 2021, 8:26 AM

Thanks for chiming in! I think we have many good arguments pro and con. It looks like this feature (if developed) should live in the OpenRefine code base, and not be part of the Commons reconciliation service, so I went ahead and created an issue on GitHub where we can discuss this further.

👉 https://github.com/OpenRefine/OpenRefine/issues/4143

I'm not closing/declining this Phab task yet, as we may still have some discussion here.

This issue was moved to the Commons Extension repository - this feature will be developed as part of the Commons Extension in OpenRefine: https://github.com/OpenRefine/CommonsExtension/issues/3