Allow users to query mediarequests using a file page link
Open, High, Public

Description

So @Gilles, @Tgr and @Krinkle's replies to the parent task have been of great help. The big usability problem with the current state of the Mediarequests per file API is that we're using upload.wikimedia.org URI paths as keys, while the most user-friendly approach would be using File: paths, as they match the wiki and the name originally intended by the uploader, without any obstacles such as the md5 digits. My idea for fixing this issue is: we allow users to use File: page links, in addition to upload.wikimedia.org paths.

Why allow both forms? About 0.12% of Mediarequests are directed to files that weren't uploaded from a specific wiki, therefore not having a File: page associated. These are:

Proposed solution

Here are two examples of File: media URLs:

commons.wikimedia.org/wiki/File:Libyan_Civil_War.png
en.wikisource.org/wiki/File:Speed_Limit_50_Minimum_5_sign.svg

Let's reconstruct the upload.wikimedia.org link from these links. So far we know that the file name is whatever comes after "File:". So:

Libyan_Civil_War.png
Speed_Limit_50_Minimum_5_sign.svg

We also know, as explained in the task, that we can get the previous two bits (e.g. /3/3b/) by converting the full name to an MD5 hash and taking its first hex digit for the first value, and its first two hex digits for the second:

Libyan_Civil_War.png => 90fa67e125817479499f05f2e1be227e => /9/90/Libyan_Civil_War.png
Speed_Limit_50_Minimum_5_sign.svg => 358558037555ad2fb32ef469ccdd1fe4 => /3/35/Speed_Limit_50_Minimum_5_sign.svg
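
The two steps above can be sketched in a few lines of Python. This is only an illustration, not the code from the notebook: the helper name hash_path is made up, and it assumes the name is hashed as UTF-8 with spaces already replaced by underscores.

```python
import hashlib


def hash_path(file_name: str) -> str:
    """Return the '/x/xy/<name>' suffix for a file name (hypothetical helper)."""
    # Assumption: spaces are stored as underscores and the name is hashed as UTF-8.
    normalized = file_name.replace(" ", "_")
    digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
    # First hex digit, then the first two hex digits, then the name itself.
    return f"/{digest[0]}/{digest[:2]}/{normalized}"


print(hash_path("Libyan_Civil_War.png"))
```

Since the hash part is treated as opaque, a helper like this would be one more place to update if the scheme ever changed.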

Lastly, there's the trickiest part, which is getting the wiki namespace. As @Krinkle said, a lot of relatively old files have /wikipedia/ as their namespace even though they were uploaded from other project families. This means that, at the AQS level, we can do a first try of requesting from Cassandra the wiki specified in the File: link provided by the user...

Libyan_Civil_War.png              => /wikipedia/commons/9/90/Libyan_Civil_War.png           => ✅
Speed_Limit_50_Minimum_5_sign.svg => /wikisource/en/3/35/Speed_Limit_50_Minimum_5_sign.svg  => ❌

... and if that returns no results, query the wikipedia namespace:

Speed_Limit_50_Minimum_5_sign.svg => /wikipedia/commons/3/35/Speed_Limit_50_Minimum_5_sign.svg  => ✅
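
A rough sketch of that two-step lookup, assuming a plain dictionary as a stand-in for the Cassandra table. The names FAKE_STORE and lookup_with_fallback and the request counts are invented for illustration; this is not how AQS is actually structured.

```python
from typing import Dict, Optional, Tuple

# Stand-in for the Cassandra table: upload path -> request count (made-up numbers).
FAKE_STORE: Dict[str, int] = {
    "/wikipedia/commons/9/90/Libyan_Civil_War.png": 1200,
    "/wikipedia/commons/3/35/Speed_Limit_50_Minimum_5_sign.svg": 45,
}

COMMONS_PREFIX = "/wikipedia/commons"


def lookup_with_fallback(
    prefix: str, hashed_name: str, store: Dict[str, int]
) -> Optional[Tuple[str, int]]:
    """Try the wiki's own prefix first, then fall back to the Commons prefix."""
    for candidate in (prefix, COMMONS_PREFIX):
        path = candidate + hashed_name
        if path in store:
            return path, store[path]
    return None  # nothing stored under either prefix


print(lookup_with_fallback(
    "/wikisource/en", "/3/35/Speed_Limit_50_Minimum_5_sign.svg", FAKE_STORE
))
```

Note that a miss on the first prefix cannot tell a file that does not exist there apart from a file that exists but had no requests in the period.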

I've made a JS function to make the conversion and put it in an Observable notebook with a few hundred namespace 6 (File:) URLs and the conversions look like they work.

I think we could use this to either:

a) Not reload the whole per file and top mediarequests. This solution is a bit of a hack, but this way we don't have to deal with turning the clunky /wikipedia/commons naming to a dimension or exposing it in any way to the user.
b) Do the reloading work but ease into it by providing this conversion in AQS in the meantime.

I'm advocating for option a (not reloading) for the following reasons:

Please let me know of anything I'm missing!

Event Timeline

fdans moved this task from Incoming to Analytics Query Service on the Analytics board.

Old versions of images are also something not easily expressed as a file page URL. For now, you only see them when you visit the file page (when you upload a new version, it takes over the "normal" file URL, and the old version moves to an archive URL, which is only used when showing file history), so they won't make much difference. At some indeterminate point in the future we'll probably move to content-based URLs (T149847: RFC: Use content hash based image / thumb URLs) so at that point each new version of the file would result in a new URL. Since users would typically want the data for all file revisions merged (typically a new version is just some minor retouching or conversion), it's worth considering when designing the storage part of the API how that can be supported in the future.

I think it would be preferable to ask the MediaWiki API about the thumbnail URL prefix instead of trying to guess it, to avoid various edge cases:

  • as Timo said in the parent task, the hash part is considered opaque, and it might change in the future, and this would be one more place to update
  • without relying on the API you can't tell whether en.wikisource.org/wiki/File:Speed_Limit_50_Minimum_5_sign.svg refers to a file on Commons or a file on Wikisource that had 0 views in the requested period
  • users will probably expect file redirects to work
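
As a sketch of what asking the MediaWiki API could look like: the snippet below builds an action API request for prop=imageinfo with iiprop=url and redirects enabled, then reads the URL out of a response. The helper names are made up, no network call is made, and SAMPLE is a hand-written response in the assumed default JSON shape (pages keyed by page id), not captured output.

```python
import json
from typing import Optional
from urllib.parse import urlencode


def build_imageinfo_url(domain: str, file_name: str) -> str:
    """Build an action API URL asking for the file's real upload URL."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "iiprop": "url",
        "redirects": 1,  # follow file redirects
        "titles": f"File:{file_name}",
    }
    return f"https://{domain}/w/api.php?{urlencode(params)}"


def extract_upload_url(response_text: str) -> Optional[str]:
    """Return the first imageinfo URL in a response, or None if there is none."""
    pages = json.loads(response_text).get("query", {}).get("pages", {})
    for page in pages.values():
        for info in page.get("imageinfo", []):
            return info.get("url")
    return None


# Hand-written sample in the assumed response shape (not a captured response).
SAMPLE = json.dumps({"query": {"pages": {"-1": {
    "title": "File:Speed Limit 50 Minimum 5 sign.svg",
    "imagerepository": "shared",
    "imageinfo": [{"url": "https://upload.wikimedia.org/wikipedia/commons/3/35/"
                          "Speed_Limit_50_Minimum_5_sign.svg"}],
}}}})

print(build_imageinfo_url("en.wikisource.org", "Speed_Limit_50_Minimum_5_sign.svg"))
print(extract_upload_url(SAMPLE))
```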

As @Krinkle said, a lot of relatively old files have /wikipedia/ as their namespace even though they were uploaded from other project families.

He was talking about relatively old wikis; files on the same wiki always map to the same prefix.

I've made a JS function to make the conversion and put it in an Observable notebook with a few hundred namespace 6 (File:) URLs and the conversions look like they work.

The logic there is not quite right: it does not handle percent-encoded namespace separators, and would not handle : within the file name.
More importantly, the prefix logic is wrong; e.g. Wikisource would be like /wikisource/ru, not /wikisource/commons. I don't think you can get around hardcoding the domain -> prefix mapping somewhere, or using the MW API.
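
To make those two points concrete, here is a minimal sketch of parsing that tolerates a percent-encoded separator and extra colons, plus a hardcoded domain -> prefix table. The table holds only the examples mentioned in this task, the function name is hypothetical, and only the English "File" namespace name is recognised.

```python
from typing import Tuple
from urllib.parse import unquote, urlsplit

# Hand-maintained examples only; a real table would have to cover every wiki.
DOMAIN_TO_PREFIX = {
    "commons.wikimedia.org": "/wikipedia/commons",
    "en.wikisource.org": "/wikisource/en",
    "ru.wikisource.org": "/wikisource/ru",
}


def parse_file_page_url(url: str) -> Tuple[str, str]:
    """Split a file page URL into (domain, file name)."""
    # urlsplit only recognises the host after '//', so add it if missing.
    parts = urlsplit(url if "//" in url else "//" + url)
    title = unquote(parts.path)  # turns 'File%3AFoo.png' into 'File:Foo.png'
    if title.startswith("/wiki/"):
        title = title[len("/wiki/"):]
    # Split on the first ':' only, so colons inside the file name survive.
    namespace, sep, name = title.partition(":")
    if not sep or namespace.lower() != "file":
        raise ValueError(f"not a File: page URL: {url}")
    return parts.netloc.lower(), name.replace(" ", "_")


domain, name = parse_file_page_url(
    "en.wikisource.org/wiki/File%3ASpeed_Limit_50_Minimum_5_sign.svg"
)
print(DOMAIN_TO_PREFIX[domain], name)
```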

without relying on the API you can't tell whether en.wikisource.org/wiki/File:Speed_Limit_50_Minimum_5_sign.svg refers to a file on Commons or a file on Wikisource that had 0 views in the requested period

This was my concern. The File:Texas_FM_XXX.svg example isn't the best, since they all route to Commons. There are (correct me if I'm wrong) cases where for instance en.wikisource has a file under the same name as one on en.wikipedia, but they are completely different images. So yes, I assume you'd need to use the action API or something to get the correct storage path. I think the mediarequests API should accept a {project} parameter, similar to the pageviews APIs.

I have no ideas about math and music scores, but I do think this data would be useful. favicon.ico on the other hand I assume is of narrow interest.

Thanks for following up.

Old versions of images are also something not easily expressed as a file page URL.

It is worth keeping in mind that our core use case here is to provide (for GLAM folks and others) an easy way to see which images are used the most. This is in a "readers" context, so we are not very concerned with the use case of looking at older versions of a file through the Commons UI.

Since users would typically want the data for all file revisions merged (typically a new version is just some minor retouching or conversion), it's worth considering when designing the storage part of the API how that can be supported in the future.

Well, a convention on how you can identify prior versions would need to be set up, and the tally would probably need to be computed upon loading (or it could also be done in the API glue layer). Now, given that we have not seen any movement at all in media URLs for years, I am not sure it is worth thinking about these use cases quite yet.

without relying on the API you can't tell whether en.wikisource.org/wiki/File:Speed_Limit_50_Minimum_5_sign.svg refers to a file on Commons or a file on Wikisource that had 0 views in the requested period

I do not understand this comment, could you explain a bit more? I would think a file of commons would be requested as /wikipedia/commons/blah/File:Speed_Limit_50_Minimum_5_sign.svg/ so the combination of project+filename is unique

I think the mediarequests API should accept a {project} parameter, similar to the pageviews APIs.

I do not see how this would work. The bulk of the media displayed on all projects comes from Commons, so Commons is the common repository for all projects, right? (asking honestly). The API already supports a referrer parameter, which is more in line, I think, with the notion of "project-in-which-the-image-is-shown".

The logic there is not quite right: it does not handle percent-encoded namespace separators, and would not handle : within the file name.
More importantly, the prefix logic is wrong; e.g. Wikisource would be like /wikisource/ru, not /wikisource/commons. I don't think you can get around hardcoding the domain -> prefix mapping somewhere, or using the MW API.

@Tgr my bad, I didn't realize you had to hit "publish" in observable for every change to get published. Some of your concerns are already addressed, but this was just a quick prototype of how the approach would work.

There are (correct me if I'm wrong) cases where for instance en.wikisource has a file under the same name as one on en.wikipedia, but they are completely different images.

@MusikAnimal Yeah I see what you're saying: what happens when an image is uploaded from enwikisource, but then it's used from eswiki, therefore creating a File: page there. If we look for the upload path in /wikipedia/es it won't be there, and when we fall back to /wikipedia/commons, it won't be there either.

I see what you mean but this won't be solved by a project field like you suggest. Basically, if you don't know the original wiki that a file was uploaded from, unless the fallback to /wikipedia/commons brings results (and to be fair, 90.87% of mediarequests are to /wikipedia/commons files), then you need to use the upload.wikimedia.org path of the file. I just can't see a situation where you know which wiki the file was uploaded from but you don't know the File: path of your file, so adding a project field to solve this problem is ineffective.

I see what you mean but this won't be solved by a project field like you suggest. Basically, if you don't know the original wiki that a file was uploaded from, unless the fallback to /wikipedia/commons brings results (and to be fair, 90.87% of mediarequests are to /wikipedia/commons files), then you need to use the upload.wikimedia.org path of the file. I just can't see a situation where you know which wiki the file was uploaded from but you don't know the File: path of your file, so adding a project field to solve this problem is ineffective.

Sorry, maybe I'm misunderstanding what we're talking about! I thought we were discussing changing the API such that you (the consumer) wouldn't need to know the file upload path. If so, I think you'd need to tell the API what wiki you're referring to. Take for instance FooFighters-FooFighters.jpg, uploaded locally to enwiki (because it is non-free imagery). I uploaded a different image to testwiki under the same name: https://upload.wikimedia.org/wikipedia/test/0/0d/FooFighters-FooFighters.jpg (this might get deleted later, FYI). As you see the file paths are different, so the Mediarequests API presumably wouldn't know which one you wanted if it was given only the file name.

Commons is of course where the vast majority of media lives, and this is mainly what GLAM and most consumers care about. However I think it's really neat that we have stats on local uploads too. Say there was an editathon to upload missing music album artwork, it wouldn't happen on Commons.

I thought we were discussing changing the API such that you (the consumer) wouldn't need to know the file upload path

Not the full file upload path (as @bd808 pointed out, the md5 chunks should not get exposed), but maybe it is not far-fetched to ask users to enter a "/wikipedia/commons/SomeFile.png" path? (given that a "project" path and a file name uniquely identify a file)

Not the full file upload path (as bd808 pointed out, the md5 chunks should not get exposed), but maybe it is not far-fetched to ask users to enter a "/wikipedia/commons/SomeFile.png" path? (given that a "project" path and a file name uniquely identify a file)

That seems fine, though I as a consumer still need to know the domain to prefix mapping. Having a {project} parameter feels more natural to me, as I don't need to look up anything or have any prior knowledge, but I realize that's a big and perhaps breaking change to the already-deployed APIs.

without relying on the API you can't tell whether en.wikisource.org/wiki/File:Speed_Limit_50_Minimum_5_sign.svg refers to a file on Commons or a file on Wikisource that had 0 views in the requested period

I do not understand this comment, could you explain a bit more? I would think a file of commons would be requested as /wikipedia/commons/blah/File:Speed_Limit_50_Minimum_5_sign.svg/ so the combination of project+filename is unique

The combination of project + filename is unique, yes.
If you want to identify files by their file page links (as proposed in the task description) then you have to deal with the fact that file pages are not unique: if a file is uploaded to Commons, it will also have a file page on every other project which has no local file by that name.
The proposed solution was for something like en.wikisource.org/wiki/File:Speed_Limit_50_Minimum_5_sign.svg to fall back to the Commons file URL if the local file URL had no views, but it's entirely possible for the local file to exist and have no views.

Personally I think you should just use {project}/{filename} where project is where the file was uploaded (e.g. en.wikisource/Speed_Limit_50_Minimum_5_sign.svg). That does not create misleading expectations as much as using the file page URL would, matches how projects are designated in the other AQS endpoints, and does not put the burden on the client to somehow figure out the correct thumb URL prefix (of course that means the burden is on the API). It will exclude Math etc. but frankly no one will care. The raw URL could be exposed as a separate API endpoint, but I doubt there is demand for it - IMO being able to provide information about usage types (such as full image vs. size buckets, or PDF views by page, or how many people started viewing a video VS how many finished it) would be a much more valuable place to invest energies.

As you see the file paths are different, so the Mediarequests API presumably wouldn't know which one you wanted if it was given only the file name.

@MusikAnimal Yes, I'm sorry, I don't think I illustrated the point I was trying to make very well. But @Tgr's last comment helps. See:

Personally I think you should just use {project}/{filename} where project is where the file was uploaded (e.g. en.wikisource/Speed_Limit_50_Minimum_5_sign.svg). That does not create misleading expectations as much as using the file page URL would

This comment itself exposes the problem of this approach: the file mentioned has a File: page in Wikisource, so you would reasonably think that in the Mediarequests API, you would request the file as:

[...]/en.wikisource/Speed_Limit_50_Minimum_5_sign.svg/[...]

The problem is, that file wasn't uploaded from Wikisource. It's a Commons file. If you use @MusikAnimal's Mediaviews and try to get stats for it under en.wikisource, you won't find it (just tried, let me know otherwise). The only ways to know for sure which wiki the file was uploaded from are:

  1. You uploaded the file yourself, so you just know
  2. Getting the upload.wikimedia.org link from the file's metadata/headers (like @MusikAnimal is doing in Mediaviews)
  3. Getting the upload.wikimedia.org link by inspecting the file's thumbnail, "open image in a new tab...", etc.

This is a problem that won't be solved by adding a project parameter to the endpoint. I'm trying to mitigate it by adding the fallback to searching in Commons (i.e. "your image doesn't seem to be uploaded under the en.wikisource domain, but let me quickly see if it's in commons!").

Let me know if you don't think this is an important problem and that users should know for sure which project the file was uploaded from. But in that case we might as well use upload.wikimedia.org paths? Maybe having AQS deal with the md5?

Personally I think you should just use {project}/{filename}

That is what I am proposing, but I think "project" is not the right term. In the Wikimedia ecosystem, project has come to mean "the site in which things are occurring", so edits for project fr.wikipedia means edits that happen in frwiki. For images it does not work that way. 90% of image files are in Commons, so they would have a "commons" file path, but their usage might happen exclusively in "fr.wikipedia" and, I think, you would expect the project in this case to be "fr.wikipedia".

Our goal with this data is not to precisely represent views in every single case but rather to provide value for consumers such as GLAM. I think an acceptable compromise might be to just report views of media in commons. What do people think?

For some reason the bot is not adding the Patch-for-review tag:

https://gerrit.wikimedia.org/r/#/c/analytics/aqs/+/571968/

This change should remove most of the concerns with the current state of the endpoint. It adds more flexibility to the path parameter, allowing both file page URLs and upload.wikimedia.org paths.

Also, with this, it's even easier to query using upload paths, as direct links to thumbnails, previews, transcodings, timelines... will also work. We've basically ported to AQS the logic that refinery uses to extract the base path of any URI to a media file.
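
For readers unfamiliar with the idea, here is a rough sketch of what "extracting the base path" can mean. It is not the refinery or AQS code: the regex only covers the plain, thumb and transcoded layouts, the layout itself is an assumption, and base_path is a hypothetical name.

```python
import re
from typing import Optional
from urllib.parse import unquote, urlsplit

# Assumed layout: /<family>/<wiki>/[thumb/|transcoded/]<x>/<xy>/<name>[/<derived file>]
_UPLOAD_PATH = re.compile(
    r"^/(?P<family>[^/]+)/(?P<wiki>[^/]+)/"
    r"(?:(?:thumb|transcoded)/)?"
    r"(?P<h1>[0-9a-f])/(?P<h2>[0-9a-f]{2})/"
    r"(?P<name>[^/]+)"
    r"(?:/[^/]+)?$"
)


def base_path(url_or_path: str) -> Optional[str]:
    """Reduce a media URL or path to '/family/wiki/x/xy/name', or None."""
    path = unquote(urlsplit(url_or_path).path)
    match = _UPLOAD_PATH.match(path)
    # The two-digit directory must start with the one-digit directory.
    if match is None or not match["h2"].startswith(match["h1"]):
        return None
    return "/{family}/{wiki}/{h1}/{h2}/{name}".format(**match.groupdict())


print(base_path(
    "https://upload.wikimedia.org/wikipedia/commons/thumb/9/90/"
    "Libyan_Civil_War.png/120px-Libyan_Civil_War.png"
))
```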

I think an acceptable compromise might be to just report views of media in commons. What do people think?

We have the data on local uploads, so I think it'd be great to expose it in some way. It sounds like fdans' patch doesn't remove this functionality, though?

I think you are right that there could be confusion on project vs. referer in general but for Mediaviews I hope to alleviate this with documentation.

Thanks for all the hard work and soliciting input!

Swift bucket identifiers like wikipedia/commons/thumb are internal and should imho not be turned into public APIs as strings that users should somehow know, provide or crop out of (otherwise) standard values. I think it would be most natural for end-users to provide two inputs: wiki project (as canonical domain name, e.g. commons.wikimedia.org), and file name (as Example.png). Similar to page view API, user info APIs, and other APIs. These are inputs you can explain to a user, and a user can find and recognise their correct value.

The mapping of wiki projects to Swift identifiers can be generated by us internally and perhaps even applied at ingestion time (no need to store the Swift hashes at all), or if we can't change the storage anymore, it could be applied to the backend query by the API service at run-time.
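
A minimal sketch of the ingestion-time variant, assuming stored paths look like /family/wiki/x/xy/Name.ext. The table is an illustrative subset built from examples in this task, and to_public_key is a hypothetical name; the real mapping would be generated from configuration.

```python
from typing import Optional, Tuple

# Illustrative subset of a bucket -> project table; entries are examples only.
BUCKET_TO_PROJECT = {
    ("wikipedia", "commons"): "commons.wikimedia.org",
    ("wikipedia", "test"): "test.wikipedia.org",
    ("wikisource", "en"): "en.wikisource.org",
}


def to_public_key(upload_path: str) -> Optional[Tuple[str, str]]:
    """Map '/family/wiki/x/xy/Name.ext' to (project, file name), dropping the hash."""
    segments = upload_path.strip("/").split("/")
    if len(segments) != 5:
        return None  # anything that is not a plain original file is skipped here
    family, wiki, _h1, _h2, name = segments
    project = BUCKET_TO_PROJECT.get((family, wiki))
    return (project, name) if project else None


print(to_public_key("/wikipedia/commons/9/90/Libyan_Civil_War.png"))
```

With keys like these, the two inputs a user provides (wiki project and file name) match what is stored, and the Swift identifiers never appear in the public API.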

Specific UIs might want to offer some kind of "paste whatever" box that takes things like full file description page URLs, and thumbnail URLs (which may use curid instead of title if they came from their Watchlist, and have extra URL-encoding, or perhaps contain random advertisement or social-media-related query parameters, etc.). But that would be relatively easy for the browser JS to decode/normalise. Might not need support in the API service itself. (Also less ideal for caching).

Why allow both forms? About 0.12% of Mediarequests are directed to files that weren't uploaded from a specific wiki, therefore not having a File: page associated. These are:

These are not "media requests" by my definition. I don't think anyone would expect this to work through this API. Supporting Math/Score views would be a cool project, but also a big and complex project that is in my opinion unrelated to multimedia files from MediaWiki. The Math/Score files are generated based on the current input hash in wikitext (editing them changes the hash, so they have no stable identifier to get long-term counts for). They are generated by the software, which is currently configured to store them using the Swift protocol in a dedicated bucket that we happen to expose over upload.wikimedia.org. That is as far as they are related. Conceptually, I think these are closer to requests for load.php or /w/resources/…/oojs-ui/icons/foo.svg.

As @Krinkle said, a lot of relatively old files have /wikipedia/ as their namespace even though they were uploaded from other project families.

He was talking about relatively old wikis; files on the same wiki always map to the same prefix.

Yeah, the first time we created "special" wikis, they were wikipedia.org subdomains like commons.wikipedia.org and meta.wikipedia.org (these redirect to wikimedia.org now). But when the domains changed, we did not change the MySQL and Swift names; they never changed, including for new files on those wikis. There is a 1-1 mapping between Swift buckets and wiki projects. There is no overlap or multiple names for the same thing. Anyway, these buckets should be considered internal to WMF. Not something for API users to worry about.

https://en.wikisource.org/wiki/File:Speed_Limit_50_Minimum_5_sign.svg

Speed_Limit_50_Minimum_5_sign.svg => 358558037555ad2fb32ef469ccdd1fe4 => /3/35/Speed_Limit_50_Minimum_5_sign.svg

[…]

Speed_Limit_50_Minimum_5_sign.svg => /wikisource/en/3/35/Speed_Limit_50_Minimum_5_sign.svg  => ❌

... and if that returns no results, query the wikipedia namespace:

Speed_Limit_50_Minimum_5_sign.svg => /wikipedia/commons/3/35/Speed_Limit_50_Minimum_5_sign.svg  => ✅

I think I understand what happened here. When you view https://en.wikisource.org/wiki/File:Speed_Limit_50_Minimum_5_sign.svg in a browser, it shows a Commons file. Same if you embed [[File:Speed_Limit_50_Minimum_5_sign.svg]] syntax on that wiki. But, I am not sure there is a use case for the API to know about this. It would also be very hard to make that work in a way that we will not regret, I think.

For example:

  • If the file does not exist today, I query for project=en.wikisource.org. I expect to hear there are no views. Instead, I hear about unrelated views to an unrelated file on Commons.
  • If I upload the file tomorrow, and then query for project=en.wikisource.org, I probably expect yesterday's views to not have changed. But I think based on this proposal we would need to change what we show as soon as there is 1 view for a local file. This is also confusing.
  • If, the next week, I delete this file, I expect to still be able to see the views from last week. But it looks like, if there happens to be a file on Commons by the same name, do we start hiding the local views again? Or do we keep showing the old ones once the file has existed at least once?

Ultimately, I do not think these questions have good answers. The good news is, I do not think users need this magic fallback. I think most API consumers for this will be providing values from sources that describe files, not wikitext input. For example, I might want views for files in a Commons category (original wiki + file names), or files I uploaded on Meta-Wiki (original wiki + file names), or a file I found via search (original wiki + file name).

The use case of providing wikitext expansion in the context of a wiki (e.g. views for the file I get when I type [[File:Example.jpg]] on en.wikipedia.org) might be better handled in a dynamic interface. Different tools will have different ideas for how to make that work. That is more high-level than what I think the base API should provide.

Other examples that a tool might want to support:

But we don't need to support these for API input. We do not support these as input for the Page View API either, instead the consumer provides the wiki project and page names.

A good API I think provides a primitive with high-confidence in its results, accessed with parameters that have only 1 correct value and can be easily explained. The other parameter features for file description URLs or thumbnail URLs I think would be better to explore in a later version after we learn about any frequent requests/struggles there might be.

@Dominicbm, hi!
We on the Data Products team are reviewing this task now to see what we can do.
We realized that there might be some overlap between the functionality requested in this task and the Commons Impact Metrics data prototype that we presented at the GLAM conference recently.
Just for the record, I'll leave the links to those here:

We'd like to know if the prototype covers the use cases that motivated this task (and how much).
Your comments would be much appreciated!

Hi @mforns, sorry for the late reply. I am not sure how the Commons Impact Metrics project is going to affect the existing AQS APIs. For my own part, if I have the ability to query data at the category level instead of per-image, I will do that for my own work. That seems to be the main difference conceptually between the two. So Commons Impact Metrics might lower the prioritization for this task for me personally. However, as long as the mediarequests API is still maintained, this feature is still needed for it, because it is a major performance and usability issue to not be able to query a Commons file without first getting its path in a separate query.