@Gilles, @Tgr and @Krinkle's replies to the parent task have been a great help. The big usability problem with the current Mediarequests per-file API is that it uses upload.wikimedia.org URI paths as keys, while the most user-friendly approach would be File: paths: they match the wiki and the name the uploader originally intended, without obstacles such as the MD5 hash digits. My idea for fixing this issue: we allow users to pass File: page links in addition to upload.wikimedia.org paths.
Why allow both forms? About 0.12% of Mediarequests go to files that weren't uploaded from a specific wiki and therefore have no File: page associated. These are:
- upload.wikimedia.org/favicon.ico
- Images generated by the MediaWiki Math Extension (https://www.mediawiki.org/wiki/Extension:Math)
- Images generated by the MediaWiki Music Score Extension (https://www.mediawiki.org/wiki/Extension:Score)
Proposed solution
Here are two examples of File: media URLs:
commons.wikimedia.org/wiki/File:Libyan_Civil_War.png
en.wikisource.org/wiki/File:Speed_Limit_50_Minimum_5_sign.svg
Let's reconstruct the upload.wikimedia.org link from these links. So far we know that the file name is whatever comes after "File:". So:
Libyan_Civil_War.png
Speed_Limit_50_Minimum_5_sign.svg
We also know, as explained in the task, that we can get the two path segments before the file name (e.g. /3/3b/) by computing the MD5 hash of the full file name and taking its first hex digit for the first segment and its first two hex digits for the second:
Libyan_Civil_War.png => 90fa67e125817479499f05f2e1be227e => /9/90/Libyan_Civil_War.png
Speed_Limit_50_Minimum_5_sign.svg => 358558037555ad2fb32ef469ccdd1fe4 => /3/35/Speed_Limit_50_Minimum_5_sign.svg
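For reference, here is a minimal Python sketch of the two steps so far (pull the name out of the File: link, then build the hashed path). This is not the JS function from my notebook; the function names are made up for illustration, and percent-decoding the name and turning spaces into underscores is an assumption about how the names should be normalized before hashing.

```python
import hashlib
from urllib.parse import unquote

FILE_MARKER = "/wiki/File:"


def file_name_from_url(file_url: str) -> str:
    """Return whatever comes after 'File:' in a File: page link."""
    _, marker, name = file_url.partition(FILE_MARKER)
    if not marker or not name:
        raise ValueError(f"not a File: page link: {file_url!r}")
    # Assumption: names are hashed in their decoded, underscore form.
    return unquote(name).replace(" ", "_")


def hashed_path(file_name: str) -> str:
    """Build '/<1st hex digit>/<first 2 hex digits>/<file name>'."""
    digest = hashlib.md5(file_name.encode("utf-8")).hexdigest()
    return f"/{digest[0]}/{digest[:2]}/{file_name}"


if __name__ == "__main__":
    for url in (
        "commons.wikimedia.org/wiki/File:Libyan_Civil_War.png",
        "en.wikisource.org/wiki/File:Speed_Limit_50_Minimum_5_sign.svg",
    ):
        print(hashed_path(file_name_from_url(url)))
```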
Lastly, there's the trickiest part, which is getting the wiki namespace. As @Krinkle said, a lot of relatively old files have /wikipedia/ as their namespace even though they were uploaded from other project families. This means that, at the AQS level, we can first query Cassandra using the wiki specified in the File: link provided by the user...
Libyan_Civil_War.png => /wikipedia/commons/9/90/Libyan_Civil_War.png => ✅
Speed_Limit_50_Minimum_5_sign.svg => /wikisource/en/3/35/Speed_Limit_50_Minimum_5_sign.svg => ❌
... and if that returns no results, query the wikipedia namespace:
Speed_Limit_50_Minimum_5_sign.svg => /wikipedia/commons/3/35/Speed_Limit_50_Minimum_5_sign.svg => ✅
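The try-then-fall-back lookup could be sketched roughly like this in Python. Everything here is hypothetical: `has_data` stands in for the Cassandra query in AQS, the host is assumed to have the simple `<lang>.<family>.org` shape, `*.wikimedia.org` hosts are assumed to be stored under /wikipedia/ (as in the commons example), and the fallback is assumed to be /wikipedia/commons, following the example above.

```python
import hashlib
from typing import Callable, List, Optional


def hashed_path(file_name: str) -> str:
    """Build '/<1st hex digit>/<first 2 hex digits>/<file name>'."""
    digest = hashlib.md5(file_name.encode("utf-8")).hexdigest()
    return f"/{digest[0]}/{digest[:2]}/{file_name}"


def candidate_paths(host: str, file_name: str) -> List[str]:
    """Paths to try, in order: the wiki from the link, then the fallback."""
    labels = host.split(".")
    if len(labels) != 3:
        raise ValueError(f"unexpected host shape: {host!r}")
    lang, family, _tld = labels
    if family == "wikimedia":
        # Assumption: commons.wikimedia.org lives under /wikipedia/commons.
        family = "wikipedia"
    suffix = hashed_path(file_name)
    first_try = f"/{family}/{lang}{suffix}"
    fallback = f"/wikipedia/commons{suffix}"  # assumed fallback namespace
    # Drop the duplicate when both candidates are the same path.
    return list(dict.fromkeys([first_try, fallback]))


def resolve(
    host: str, file_name: str, has_data: Callable[[str], bool]
) -> Optional[str]:
    """Return the first candidate path that has_data accepts, else None."""
    for path in candidate_paths(host, file_name):
        if has_data(path):
            return path
    return None
```

In AQS the `has_data` callback would be the actual per-file query, so a miss on the first candidate costs one extra read.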
I've written a JS function that does the conversion and put it in an Observable notebook with a few hundred namespace 6 (File:) URLs; the conversions look like they work.
I think we could use this to either:
a) Not reload the whole per-file and top mediarequests datasets. This solution is a bit of a hack, but this way we don't have to turn the clunky /wikipedia/commons naming into a dimension or expose it to the user in any way.
b) Do the reloading work but ease into it by providing this conversion in AQS in the meantime.
I'm advocating for option a (not reloading) for the following reasons:
- Reloading both metrics will take about 4 months of monitored loading time
- Even though File: page names are way more user friendly than upload.wikimedia.org URIs, they don't cover all media file cases (the math, score, and favicon cases described above).
- More than one File: page can point to the same upload.wikimedia.org link. As an example, all of the following pages point to the same file. Using a File: URL as key seems problematic for this reason.
Please let me know of anything I'm missing!