The Structured Data on Commons reconciliation service recognizes the most widely used Commons file name formats
Open, Medium, Public

Description

In the original grant application for Structured Data on Commons (SDC) support for OpenRefine, we wrote about the SDC reconciliation service:

[the reconciliation service] allows OpenRefine (and tools outside of OpenRefine) to take a list of file names from Wikimedia Commons and to convert these file names to their corresponding entity identifiers (“M numbers” or M-ids - the Wikimedia Commons equivalent of Q-ids). These M-ids are needed to perform further SDC operations.

Ideally, the Commons reconciliation service recognizes and reconciles the most common file name notation formats (with or without the File: prefix, with underscores or spaces in file names) that are produced as exports from the most widely used tools (PetScan, the Wikidata and Wikimedia Commons Query Services, ...).

| done? | what? | filename written as (example) | example export file |
| --- | --- | --- | --- |
| [x] | PetScan's CSV, TSV and JSON output (query) | Badende_vogel_bij_roze_bloem_Bloemen-_en_vogelschetsen_van_Keinen_(serietitel)_Keinen_kacho_gafu_(serietitel_op_object),_RP-P-2004-508D-9.jpg | |
| [ ] | PetScan's Plain text output (query) | File:Badende vogel bij roze bloem Bloemen- en vogelschetsen van Keinen (serietitel) Keinen kacho gafu (serietitel op object), RP-P-2004-508D-9.jpg | |
| [ ] | The Wikidata Query Service's output for Commons filenames (query) | https://commons.wikimedia.org/wiki/Special:FilePath/Mosaics%20%281953%29%20by%20Nel%20Klaassen%2C%20Peek%20%26%20Cloppenburg%20building%2C%20Hoogstraat%20%2850979423667%29.jpg | |
| [x] | The Wikimedia Commons Query Service output - entity URIs (query) | https://commons.wikimedia.org/entity/M93645431 | |
| [x] | Simple URLs of Commons file pages | https://commons.wikimedia.org/wiki/File:Mosaics_(1953)_by_Nel_Klaassen,_Peek_&_Cloppenburg_building,_Hoogstraat_(50979423667).jpg | |

Additionally, let's discuss and decide whether we indeed want to only allow a list of file names as input, or whether we want to provide more flexible options to end users. Categories, for instance: T290089: Structured Data on Commons reconciliation service accepts Commons category names as input

Event Timeline

Just wondering, @Husky - as inspiration: can we see somewhere what types/formats of input your Minefield tool recognizes and accepts? I vaguely remember that you made this pretty flexible.

Spinster renamed this task from Decide what can serve as "input" for the Structured Data on Commons reconciliation service to The Structured Data on Commons reconciliation service recognizes the most widely used Commons file name formats. Oct 20 2021, 6:05 PM
Spinster updated the task description.

I added a few ways in which file names can be written in exports from PagePile.

PagePile also has a Wikitext export which creates a Wikitable for the file names:

There, file names are included in the table as

[[:File:Badende vogel bij roze bloem Bloemen- en vogelschetsen van Keinen (serietitel) Keinen kacho gafu (serietitel op object), RP-P-2004-508D-9.jpg|]]
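For illustration, a wikitext link of this shape can be reduced to a plain file title with a small regular expression. This is only a sketch: the function name `file_name_from_wikilink` and the pattern are my own assumptions, not code from OpenRefine, PagePile or the reconciliation service.

```python
import re
from typing import Optional

# Matches [[:File:Some name.jpg|]] as well as [[File:Some name.jpg|label]].
# The leading colon and the pipe part are both treated as optional.
WIKILINK_FILE = re.compile(r"\[\[\s*:?\s*File:([^\]|]+?)\s*(?:\|[^\]]*)?\]\]")


def file_name_from_wikilink(cell: str) -> Optional[str]:
    """Return 'File:<name>' for the first wikitext file link in cell, else None."""
    match = WIKILINK_FILE.search(cell)
    if match is None:
        return None
    return "File:" + match.group(1).strip()
```

Applied to the table cell above, this would yield the same `File:...` string that PetScan's plain text output produces.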

However, while OpenRefine can process Wikitables during project creation, the file names appear not to be processed in this specific example (see empty first column):

[Screenshot: image.png (454 KB)]

@Spinster: the algorithm for 'normalizing' input in Minefield is roughly as follows:

  1. Separate input by newline
  2. Strip off anything until the wiki/ part. This means that any URL will work in Minefield, including links from Wikipedia, as long as there's /wiki/ somewhere.
  3. Replace 'Special:FilePath/' with File: if it's in the URL (the WD query service gives back media files in this format)
  4. Encode the remaining string (this part does not work for all files, see this bug). I would greatly appreciate a piece of JavaScript code that properly encodes filenames to something the API understands.
  5. Send over the string to the Commons API
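As a rough illustration of steps 1 to 4, here is a minimal Python sketch. It is not Minefield's actual code (which is JavaScript); the function names are hypothetical. For step 4 it assumes that decoding first and then re-encoding with `urllib.parse.quote` is acceptable, so that input which is already percent-encoded does not get encoded twice.

```python
from typing import List
from urllib.parse import quote, unquote


def normalize_line(line: str) -> str:
    """Steps 2 and 3: cut everything up to '/wiki/' and map Special:FilePath/ to File:."""
    text = line.strip()
    marker = "/wiki/"
    idx = text.find(marker)
    if idx != -1:
        # Any URL containing /wiki/ works, including links from Wikipedia.
        text = text[idx + len(marker):]
    return text.replace("Special:FilePath/", "File:")


def normalize_input(raw: str) -> List[str]:
    """Step 1: split on newlines, skipping blank lines, then normalize each line."""
    return [normalize_line(line) for line in raw.splitlines() if line.strip()]


def encode_title(title: str) -> str:
    """Step 4 (assumed approach): decode, then percent-encode once, keeping ':' readable."""
    return quote(unquote(title), safe=":")
```

Input without a `/wiki/` part, such as a bare file name, passes through steps 2 and 3 unchanged.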

The relevant piece of code for Minefield is here. Especially the getCommonsFilepage, encodePageTitle and getMidsForFilepages functions are relevant.
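For step 5, the sketch below shows one way the lookup could work, assuming the standard MediaWiki Action API on Commons (`action=query` with `formatversion=2`) and the fact that an M-id is the letter M followed by the file page's page ID. The helper names are hypothetical, the sample response in the usage note uses a made-up page ID, and no network request is made here.

```python
from typing import Dict, List
from urllib.parse import urlencode

API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def build_query_url(titles: List[str]) -> str:
    """Build an action=query URL for a batch of file page titles."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "titles": "|".join(titles),  # the API takes pipe-separated titles
    }
    return API_ENDPOINT + "?" + urlencode(params)


def mids_from_response(data: dict) -> Dict[str, str]:
    """Map each existing page title in a parsed API response to its M-id."""
    mids = {}
    for page in data.get("query", {}).get("pages", []):
        # Pages that do not exist carry no "pageid" and are skipped.
        if "pageid" in page:
            mids[page["title"]] = "M" + str(page["pageid"])
    return mids
```

A response such as `{"query": {"pages": [{"pageid": 12345, "ns": 6, "title": "File:Example.jpg"}]}}` (hypothetical values) would map `File:Example.jpg` to `M12345`. Because `urlencode` handles the escaping of the query string, the titles passed in should be decoded titles, not percent-encoded ones.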

Spinster updated the task description.

I added a few additional examples of types of input we'd like to support. Some additional comments:

  • https://commons.wikimedia.org/wiki/Special:FilePath/Mosaics%20%281953%29%20by%20Nel%20Klaassen%2C%20Peek%20%26%20Cloppenburg%20building%2C%20Hoogstraat%20%2850979423667%29.jpg is also the way the Wikimedia Commons Query Service exports full file paths (example query)
  • We also want input like M93645431 (bare M-ids) and https://commons.wikimedia.org/entity/M93645431 (full Commons entity URIs) to reconcile directly. Similarly, if I throw something like Q12709593 or http://www.wikidata.org/entity/Q12709593 at the Wikidata reconciliation service in OpenRefine, it will happily reconcile it as well. We want the same for Commons.
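A possible way to detect both forms before any file name handling is a single pattern check. This is a sketch only: the name `extract_mid` and the regular expression are assumptions, not the service's implementation.

```python
import re
from typing import Optional

# Accepts a bare M-id or a full Commons entity URI (http or https).
MID_PATTERN = re.compile(
    r"^(?:https?://commons\.wikimedia\.org/entity/)?(M[1-9][0-9]*)$"
)


def extract_mid(value: str) -> Optional[str]:
    """Return the M-id if value is a bare M-id or an entity URI, else None."""
    match = MID_PATTERN.match(value.strip())
    return match.group(1) if match else None
```

One design caveat: a file whose title is literally something like `M123` (without the File: prefix) would be read as an M-id under this check, so the service would have to decide which interpretation takes precedence.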

Change 735406 had a related patch set uploaded (by Eugene233; author: Eugene233):

[labs/tools/commons-recon-service@main] Add support for other file formats in the commons reconcile tool

https://gerrit.wikimedia.org/r/735406

Change 735406 merged by jenkins-bot:

[labs/tools/commons-recon-service@main] Add support for other file formats in the commons reconcile tool

https://gerrit.wikimedia.org/r/735406

Yay, great progress on this task!! Thank you :-)

Two small updates still to look at. Working with the test dataset assembled via T299135: Assemble various sets of interesting Commons files for testing SDC features in OpenRefine I have noticed that most of the variety of file path formats reconcile very nicely \o/ except for these cases:

  • File paths that are URL-encoded, also called percent-encoded (yeah, I needed to look up that name). Example: %20 for a space.
  • 'Bare' M-ids. We want these to reconcile to Commons files as well.

For ease of testing, the following paths did not reconcile yet in that test set:

https://commons.wikimedia.org/wiki/File:%D0%98%D1%81%D1%81%D0%BB%D0%B5%D0%B4%D0%BE%D0%B2%D0%B0%D0%BD%D0%B8%D1%8F_%D0%B8_%D0%B7%D0%B0%D0%BC%D0%B5%D1%82%D0%BA%D0%B8_%D0%BA%D0%BD%D1%8F%D0%B7%D1%8F_%D0%9C.%D0%90._%D0%9E%D0%B1%D0%BE%D0%BB%D0%B5%D0%BD%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D0%BF%D0%BE_%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%B8%D0%BC_%D0%B8_%D1%81%D0%BB%D0%B0%D0%B2%D1%8F%D0%BD%D1%81%D0%BA%D0%B8%D0%BC_%D0%B4%D1%80%D0%B5%D0%B2%D0%BD%D0%BE%D1%81%D1%82%D1%8F%D0%BC.tif
https://commons.wikimedia.org/wiki/File:%D8%A3%D9%84%D9%81_%D9%84%D9%8A%D9%84%D8%A9_%D9%88%D9%84%D9%8A%D9%84%D8%A9.djvu
https://commons.wikimedia.org/wiki/File:%D9%85%D9%88%D8%B2%D9%87_%D8%B3%D9%86%DA%AF.jpg
https://commons.wikimedia.org/wiki/File:%E4%B8%89%E6%9C%A8%E5%B1%B1%E6%A3%AE%E6%9E%97%E5%85%AC%E5%9C%92%E8%8C%B6%E5%AE%A4.jpg
https://commons.wikimedia.org/wiki/File:Ankomst_v%C3%A5gfront_specialfall.xcf
https://commons.wikimedia.org/wiki/File:La_Lib%C3%A9ration_de_Paris,_1944.ogv
https://commons.wikimedia.org/wiki/File:Lemonnier_-_L%27Hallali,_sd.pdf
https://commons.wikimedia.org/wiki/File:M%C3%BCnster,_Wolbeck,_Wolbecker_Tiergarten,_Naturwaldzelle_-Teppes_Viertel-_--_2014_--_7094-2.jpg
https://commons.wikimedia.org/wiki/File:PikiWiki_Israel_2504_People_of_Israel_%D7%A8%D7%91%D7%A7%D7%94_%D7%91%D7%9F_%D7%90%D7%91%D7%A8%D7%94%D7%9D.jpg
https://commons.wikimedia.org/wiki/File:W%C3%BCrfelzucker_--_2018_--_3564.jpg
https://commons.wikimedia.org/wiki/Special:FilePath/Mosaics%20%281953%29%20by%20Nel%20Klaassen%2C%20Peek%20%26%20Cloppenburg%20building%2C%20Hoogstraat%20%2850979423667%29.jpg
M57634624
M92164502
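To show what decoding these paths involves, here is a small sketch using Python's standard `urllib.parse.unquote`, which decodes percent-escapes as UTF-8 by default. The function name `title_from_file_url` is hypothetical, and the merged patches may handle this differently.

```python
from urllib.parse import unquote


def title_from_file_url(url: str) -> str:
    """Turn a Commons file URL into a readable 'File:...' title."""
    # Keep only the part after /wiki/ (the whole string if /wiki/ is absent).
    tail = url.split("/wiki/", 1)[-1]
    # Query service exports use Special:FilePath/ instead of File:.
    tail = tail.replace("Special:FilePath/", "File:", 1)
    # Decode percent-escapes, then use spaces instead of underscores.
    return unquote(tail).replace("_", " ")
```

For example, the `W%C3%BCrfelzucker` URL above becomes `File:Würfelzucker -- 2018 -- 3564.jpg`.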

Last fixes (URL encoded file paths and bare M-ids, see comment above) to be finished during current sprint :-)

Change 758508 had a related patch set uploaded (by Eugene233; author: Eugene233):

[labs/tools/commons-recon-service@main] The Structured Data on Commons reconciliation service recognizes the most widely used Commons file name formats

https://gerrit.wikimedia.org/r/758508

@Spinster Please what is the state of this task?

Thanks for the cleanup actions @Eugene233!

I just tested all five file name formats mentioned in this issue. Three of the five reconcile; two do not yet.

Let's keep this task open so that anyone who wants to help here can step up and help support these file name formats too.

Spinster lowered the priority of this task from High to Medium.
Spinster updated the task description.
Spinster moved this task from In progress to Backlog on the Reconciliation board.
Spinster added a subscriber: Eugene233.