Mediarequests returning "file not found" for filenames with specific characters
Closed, Resolved. Public. 5 Estimated Story Points. BUG REPORT

Authored By

	KAP_Jasa
	Oct 2 2023, 6:26 PM

Description

Local project: commons.wikimedia.org.

Testing file names with various characters on pageviews.wmcloud.org:

Debela griža, kot jo vidi zmaj.jpg is not correct.
Mbeya city-2.jpg is correct.
Mandelbox 4D OpenCL 569829516 32K.jpg is correct.
Three images of a paragraph from The Red Badge of Courage (Stephen Crane, 1895) to illustrate kerning, ligatures, hyphenation and microtypography.png is not correct.
Abox - Mod Kali-V3 x Sierpinski 3D OpenCL 46149707045 8K.jpg is correct.
File:Sign-1020906,_M7,_Co._Limerick,_Ireland.jpg is not correct.
Sign-1020906, M7, Co. Limerick, Ireland.jpg is not correct.
File:Bucuresti,_Romania_(Mister_Tolanesco_intr-o_relaxare_absoluta)(2).JPG is not correct.
File:)(_-_Flickr_-_Time.Captured..jpg is not correct.

The data are published via https://wikimedia.org/api/rest_v1/#/Pageviews%20data:

Debela griža, kot jo vidi zmaj.jpg is correct.
File:Sign-1020906,_M7,_Co._Limerick,_Ireland.jpg is correct.

Result: statistics are not available if the file name contains at least two of the following characters directly next to each other: ( ) - or space.

Details

	Subject	Repo	Branch	Lines +/-
	media-analytics: bump version	operations/deployment-charts	master	+1 -1
	Fix to deal with filepaths with some specific punctuation marks	generated-data-platform/aqs/media-analytics	main	+61 -5


Related Objects

Mentioned In: T350827: [Media Analytics] Request to Media analytics per file endpoints to files with special char fails
Mentioned Here: T247333: Image files with quotes do not resolve on the mediarequest API

Event Timeline

KAP_Jasa created this task.Oct 2 2023, 6:26 PM

Restricted Application added subscribers: MusikAnimal, Aklapper. Oct 2 2023, 6:26 PM

Cannot reproduce.

I only see an Uncaught DOMException: The operation is insecure. when going to that link.

So there must be another reason for the error? (I'm sorry, I am in way over my head here. I just noticed an error that keeps popping up for some files and thought it might be the letters. It keeps appearing; I can't see the Mediaviews Analysis for the file provided.)

Screen Shot 2023-10-02 at 21.20.26.png (695×1 px, 448 KB)

Well first, the file was only uploaded 22 hours ago, so the data might simply not yet be available. However I did some digging and it appears you might be right; The API does not appear to like characters with diacritics. I get a 404 when browsing to https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2Fc%2Fc4%2FDebela_gri%C5%BEa%2C_kot_jo_vidi_zmaj.jpg/daily/2023091100/2023100100 and seemingly the same for any file with diacritics in the name.

Other examples:

The last example, File:Catedral de la Encarnación, Málaga, España, 2023-05-19, DD 37-39 HDR.jpg, is a featured image and the file page has had a decent number of pageviews, so there should definitely be media requests showing up in Mediaviews Analysis.

This sounds eerily similar to T247333 but I think it's a different issue. Anyway, I'm tagging Pageviews-API and Data-Engineering so that this gets reported to the right people.
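For reference, the per-file URL quoted above can be assembled programmatically. The following is a minimal Python sketch, not the code used by Mediaviews Analysis or any other tool. The helper name `mediarequests_url` and its parameters are hypothetical, and the sketch assumes the storage path (including the `c/c4` directories) is already known. It percent-encodes the whole path, slashes included, as in the request above.

```python
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file"


def mediarequests_url(file_path, start, end, referer="all-referers",
                      agent="user", granularity="daily"):
    """Build a per-file mediarequests URL (hypothetical helper)."""
    # safe="" makes quote() encode "/" as %2F as well; non-ASCII
    # characters are encoded as their UTF-8 bytes (ž becomes %C5%BE).
    encoded = quote(file_path, safe="")
    return f"{BASE}/{referer}/{agent}/{encoded}/{granularity}/{start}/{end}"


url = mediarequests_url(
    "/wikipedia/commons/c/c4/Debela_griža,_kot_jo_vidi_zmaj.jpg",
    "2023091100", "2023100100")
print(url)
```

Note that with `safe=""`, `quote` also encodes parentheses as `%28` and `%29`; only letters, digits and `_.-~` are left untouched.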

In T347899#9217165, @Aklapper wrote:

Cannot reproduce.

I only see an Uncaught DOMException: The operation is insecure. when going to that link.

Do you have ad blockers or privacy extensions enabled, by chance? If so, that would explain the error. If you're not using any such browser extension, we have a bug of some sort. Could you share your browser version and OS? Thanks!

Do you have ad blockers or privacy extensions enabled, by chance?

Ah, yes. Sorry for the noise!

Dusan_Krehel renamed this task from Mediaviews analysis doesn't work for files with non-standard letters in the filename? to None result with some chars in the file name.Oct 2 2023, 10:30 PM

Dusan_Krehel triaged this task as Medium priority.

Dusan_Krehel removed a project: Pageviews-API.

Dusan_Krehel updated the task description.

Dusan_Krehel updated the task description. Oct 2 2023, 10:40 PM

Lokal_Profil subscribed.Oct 9 2023, 8:45 AM

In T347899#9217630, @MusikAnimal wrote:

Well first, the file was only uploaded 22 hours ago, so the data might simply not yet be available. However I did some digging and it appears you might be right; The API does not appear to like characters with diacritics. I get a 404 when browsing to https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2Fc%2Fc4%2FDebela_gri%C5%BEa%2C_kot_jo_vidi_zmaj.jpg/daily/2023091100/2023100100 and seemingly the same for any file with diacritics in the name.

Other examples:

https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2F8%2F80%2FKettenbr%C3%BCcke-Saaz4.jpg/daily/2015070100/2023100100

https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2F5%2F59%2FNi%C3%B1o(censura).jpg/daily/2015070100/2023100100

https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2F4%2F47%2FCatedral_de_la_Encarnaci%C3%B3n%2C_M%C3%A1laga%2C_Espa%C3%B1a%2C_2023-05-19%2C_DD_37-39_HDR.jpg/daily/2015070100/2023100100

The last example, File:Catedral de la Encarnación, Málaga, España, 2023-05-19, DD 37-39 HDR.jpg, is a featured image and the file page has had a decent number of pageviews, so there should definitely be media requests showing up in Mediaviews Analysis.

This sounds eerily similar to T247333 but I think it's a different issue. Anyway, I'm tagging Pageviews-API and Data-Engineering so that this gets reported to the right people.

Doesn't seem to be all diacritics though
https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/all-agents/%2Fwikipedia%2Fcommons%2Fe%2Fef%2FAB_Tacksfabriken_och_konf-fabr._AB_Viking._Trollh%25C3%25A4ttan_FiBs_serien_-_Nordiska_museet_-_NMAx.0001508.tif/monthly/20230101/20231001
gives a result despite containing an ä
At the same time, any file with a parenthesis in the name will fail.
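One detail worth noting in the URL above: the `ä` appears as `%25C3%25A4`, which is what percent-encoding the name twice produces (the `%` of `%C3%A4` itself becomes `%25`). The short Python sketch below only illustrates that encoding difference; it makes no claim about how the API stores or matches paths.

```python
from urllib.parse import quote, unquote

name = "Trollhättan"

# Encoding once turns ä into its UTF-8 bytes, percent-encoded.
once = quote(name, safe="")
# Encoding the result again escapes each "%" as "%25".
twice = quote(once, safe="")

print(once)
print(twice)
# Decoding the doubly encoded form once gives back the singly encoded form.
print(unquote(twice) == once)
```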

Dusan_Krehel updated the task description. Oct 15 2023, 2:38 PM

Raising to UBN as per duplicate task

MusikAnimal renamed this task from None result with some chars in the file name to Mediarequests returning "file not found" for filenames with specific characters.Oct 16 2023, 10:28 PM

MusikAnimal added a project: Data Products (Sprint 02).

MusikAnimal moved this task from Sprint Backlog to In Process on the Data Products (Sprint 02) board.

Sorry for all the noise! I didn't realize until after merging that the old task was assigned, etc.

Upon investigation, we concluded that this is a bug in the AQS 2.0 media analytics service. It is missing some logic from AQS 1.0 related to handling URL-encoded file paths and querying the database. We are already looking into approaches to fix it and will update here once it is fixed and released. Thank you!

Hi,

In the description of this ticket there is a list with some items and the text "is correct" or "is not correct". Does that mean that these items are working or not for you when you try to get the media request information?

does "is correct" mean that you get data for this filename?
does "is not correct" mean that you get a 404 for this filename?

Thank you!

In T347899#9256819, @Sfaci wrote:

Hi,

In the description of this ticket there is a list with some items and the text "is correct" or "is not correct". Does that mean that these items are working or not for you when you try to get the media request information?

From what I'm seeing in the list, those files don't have a special character in them. You can take a look at https://mvc.toolforge.org/index.php?category=Videos+by+Tagesschau+(ARD)&timespan=&rangestart=2023-08-01&rangeend=2023-09-01&limit=100

does "is correct" mean that you get data for this filename?

Yes

does "is not correct" mean that you get a 404 for this filename?

Yes

Just wondering, for example, why this item, File:)(_-_Flickr_-_Time.Captured..jpg, is included as "is not correct". I think we already understand the issue, and that case does not match the failure pattern (which involves a combination of certain punctuation marks, because not all of them are stored the same way in the datasets). When I request:

https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/all-agents/%2Fwikipedia%2Fcommons%2F0%2F00%2F)(_-_Flickr_-_Time.Captured..jpg/monthly/20230101/20231001

I get a good response:

    "items": [
        {
            "referer": "all-referers",
            "file_path": "/wikipedia/commons/0/00/)(_-_Flickr_-_Time.Captured..jpg",
            "granularity": "monthly",
            "timestamp": "2023010100",
            "agent": "all-agents",
            "requests": 16
        },
        {
            "referer": "all-referers",
            "file_path": "/wikipedia/commons/0/00/)(_-_Flickr_-_Time.Captured..jpg",
            "granularity": "monthly",
            "timestamp": "2023020100",
. . .
. . .

In some cases you are adding the prefix File:, but I think that is not part of the file path, right? In that case the one that exists is )(_-_Flickr_-_Time.Captured..jpg rather than File:)(_-_Flickr_-_Time.Captured..jpg. Is that the reason it was included as "is not correct"?

Please correct me if I'm wrong.
Thanks!!

I don't know that file (I didn't report it), so I can't say whether it was among the incorrect ones.

The ones that I reported are stuff like: https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2F0%2F0e%2FAngkor_-_Zentrum_des_K%C3%B6nigreichs_der_Khmer_(CC_BY-SA_4.0).webm/daily/2023080100/2023090100

(File:
https://commons.wikimedia.org/wiki/File:Angkor_-_Zentrum_des_K%C3%B6nigreichs_der_Khmer_(CC_BY-SA_4.0).webm )

And https://commons.wikimedia.org/wiki/File:Bundestagswahl_erkl%C3%A4rt_Erst-_und_Zweitstimme_von_Tagesschau.webm which doesn't have punctuation (the API req made is https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2Fe%2Fe0%2FBundestagswahl_erkl%C3%A4rt_Erst-_und_Zweitstimme_von_Tagesschau.webm/daily/2023080100/2023090100)

I don't know what you mean about adding File: to the path. Neither my tool nor Leon's adds that.

Ok! No worries!
Just waiting for some sample data to test some edge cases before pushing a fix for all this. We'll keep you posted.
Thanks!

Sfaci moved this task from In Process to BLOCKED on the Data Products (Sprint 02) board.Oct 18 2023, 3:13 PM

Sfaci moved this task from BLOCKED to In Process on the Data Products (Sprint 02) board.Oct 19 2023, 8:39 AM

Change 967168 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[generated-data-platform/aqs/media-analytics@main] Fix to deal with filepaths with some specific punctuation marks

https://gerrit.wikimedia.org/r/967168

gerritbot added a project: Patch-For-Review.Oct 19 2023, 11:05 AM

Sfaci reassigned this task from Sfaci to SGupta-WMF.Oct 19 2023, 11:06 AM

Sfaci moved this task from In Process to Ready for Code Review on the Data Products (Sprint 02) board.

Change 967168 merged by jenkins-bot:

[generated-data-platform/aqs/media-analytics@main] Fix to deal with filepaths with some specific punctuation marks

https://gerrit.wikimedia.org/r/967168

Sfaci reassigned this task from SGupta-WMF to EChukwukere-WMF.Oct 20 2023, 12:02 PM

Sfaci moved this task from Ready for Code Review to Ready for Testing on the Data Products (Sprint 02) board.

Maintenance_bot removed a project: Patch-For-Review.Oct 20 2023, 12:10 PM

Change 967438 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] media-analytics: bump version

https://gerrit.wikimedia.org/r/967438

gerritbot added a project: Patch-For-Review.Oct 20 2023, 12:20 PM

In T347899#9258560, @Sfaci wrote:
Just wondering, for example, why this item, File:)(_-_Flickr_-_Time.Captured..jpg, is included as "is not correct". I think we already understand the issue, and that case does not match the failure pattern (which involves a combination of certain punctuation marks, because not all of them are stored the same way in the datasets). When I request:

https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/all-agents/%2Fwikipedia%2Fcommons%2F0%2F00%2F)(_-_Flickr_-_Time.Captured..jpg/monthly/20230101/20231001

I get a good response:
    "items": [
        {
            "referer": "all-referers",
            "file_path": "/wikipedia/commons/0/00/)(_-_Flickr_-_Time.Captured..jpg",
            "granularity": "monthly",
            "timestamp": "2023010100",
            "agent": "all-agents",
            "requests": 16
        },
        {
            "referer": "all-referers",
            "file_path": "/wikipedia/commons/0/00/)(_-_Flickr_-_Time.Captured..jpg",
            "granularity": "monthly",
            "timestamp": "2023020100",
. . .
. . .
In some cases you are adding the prefix File:, but I think that is not part of the file path, right? In that case the one that exists is )(_-_Flickr_-_Time.Captured..jpg rather than File:)(_-_Flickr_-_Time.Captured..jpg. Is that the reason it was included as "is not correct"?

Please correct me if I'm wrong.
Thanks!!

For )(_-_Flickr_-_Time.Captured..jpg I think the underlying issue might be that it is unclear which characters are expected to be encoded. In my case (where I got 404s for files with parentheses) I now see that I had encoded the parentheses.

In T347899#9268234, @Lokal_Profil wrote:
For )(_-_Flickr_-_Time.Captured..jpg I think the underlying issue might be that it is unclear which characters are expected to be encoded. In my case (where I got 404s for files with parentheses) I now see that I had encoded the parentheses.

Sure! We misunderstood that at the beginning, which is why we are fixing it right now. The current issue is due to edge cases like this one. In any case, with the fix we have made, from now on you should be able to request files using either the encoded or the non-encoded form. In the tests I have been running, both work. For example, the following requests give the same response:

metrics/mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2Fb%2Fbd%2FTitan_%28moon%29.ogg/daily/2023080100/2023090100
metrics/mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2Fb%2Fbd%2FTitan_(moon).ogg/daily/2023080100/2023090100

{
    "items": [
        {
            "referer": "all-referers",
            "file_path": "/wikipedia/commons/b/bd/Titan_(moon).ogg",
            "granularity": "daily",
            "timestamp": "2023080100",
            "agent": "user",
            "requests": 1
        },
        {
            "referer": "all-referers",
            "file_path": "/wikipedia/commons/b/bd/Titan_(moon).ogg",
            "granularity": "daily",
            "timestamp": "2023080300",
            "agent": "user",
            "requests": 3
. . .
. . .

In the response you can see how the file_path is actually stored in the dataset. This is how it should have worked.
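The equivalence of the two request forms can be illustrated by decoding both path segments. This is only a sketch in Python of the general idea, not the actual fix merged into the media-analytics service; the `normalize` helper is hypothetical.

```python
from urllib.parse import unquote

# The two path segments from the requests above.
encoded = "%2Fwikipedia%2Fcommons%2Fb%2Fbd%2FTitan_%28moon%29.ogg"
plain = "%2Fwikipedia%2Fcommons%2Fb%2Fbd%2FTitan_(moon).ogg"


def normalize(segment):
    """Decode percent-escapes so both forms map to the same file path
    (hypothetical helper)."""
    return unquote(segment)


print(normalize(encoded))
print(normalize(encoded) == normalize(plain))
```

Both segments decode to the same string, which is the `file_path` value shown in the response.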

Test status: QA PASS

Tested response and data (compared with AQS 1.0: the two files were semantically identical).

All requests return status 200:

mediarequests/per-file/all-referers/all-agents/%2Fwikipedia%2Fcommons%2Fe%2Fef%2FAB_Tacksfabriken_och_konf-fabr._AB_Viking._Trollh%25C3%25A4ttan_FiBs_serien_-_Nordiska_museet_-_NMAx.0001508.tif/monthly/20230101/20231001

mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2Fb%2Fbd%2FTitan_(moon).ogg/daily/2023080100/2023090100

mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2F4%2F47%2FCatedral_de_la_Encarnaci%C3%B3n%2C_M%C3%A1laga%2C_Espa%C3%B1a%2C_2023-05-19%2C_DD_37-39_HDR.jpg/daily/2023080100/2023100100

mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2F0%2F0e%2FAngkor_-_Zentrum_des_K%C3%B6nigreichs_der_Khmer_(CC_BY-SA_4.0).webm/daily/2023080100/2023090100

mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2Fb%2Fbd%2FTitan_%28moon%29.ogg/daily/2023080100/2023090100
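A status check like the one above could be scripted along the following lines. This is a hedged Python sketch, not the QA team's tooling: `http_status`, `check_statuses`, the `fetch` parameter and the User-Agent string are hypothetical, and `BASE` is the production URL prefix quoted elsewhere in this task rather than the QA environment.

```python
import urllib.error
import urllib.request

# Production prefix seen in this task; the QA environment URL is not
# given here, so this is an assumption.
BASE = "https://wikimedia.org/api/rest_v1/metrics/"


def http_status(url):
    """Return the HTTP status code of a GET request to url."""
    req = urllib.request.Request(
        url, headers={"User-Agent": "mediarequests-check-example"})
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx; the code is still what we want.
        return err.code


def check_statuses(paths, fetch=http_status):
    """Map each relative request path to the status code it returns."""
    return {path: fetch(BASE + path) for path in paths}
```

Calling `check_statuses` with the request paths listed above returns a dictionary of status codes; passing a different `fetch` callable allows the logic to be exercised without network access.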

EChukwukere-WMF moved this task from Ready for Testing to Done on the Data Products (Sprint 02) board.Oct 20 2023, 5:02 PM

In T347899#9269170, @EChukwukere-WMF wrote:

Test status: QA PASS

Tested response and data (compared with AQS 1.0: the two files were semantically identical).

All requests return status 200:

mediarequests/per-file/all-referers/all-agents/%2Fwikipedia%2Fcommons%2Fe%2Fef%2FAB_Tacksfabriken_och_konf-fabr._AB_Viking._Trollh%25C3%25A4ttan_FiBs_serien_-_Nordiska_museet_-_NMAx.0001508.tif/monthly/20230101/20231001

mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2Fb%2Fbd%2FTitan_(moon).ogg/daily/2023080100/2023090100

mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2F4%2F47%2FCatedral_de_la_Encarnaci%C3%B3n%2C_M%C3%A1laga%2C_Espa%C3%B1a%2C_2023-05-19%2C_DD_37-39_HDR.jpg/daily/2023080100/2023100100

mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2F0%2F0e%2FAngkor_-_Zentrum_des_K%C3%B6nigreichs_der_Khmer_(CC_BY-SA_4.0).webm/daily/2023080100/2023090100

This one is returning 404 to me:
https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2F0%2F0e%2FAngkor_-_Zentrum_des_K%C3%B6nigreichs_der_Khmer_(CC_BY-SA_4.0).webm/daily/2023080100/2023090100

mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2Fb%2Fbd%2FTitan_%28moon%29.ogg/daily/2023080100/2023090100

@Ladsgroup I just checked again: in the QA test environment it's returning 200 and the right JSON response. I'm not sure the fix has been deployed yet.

@SGupta-WMF can advise further. It's looking good on my end.

Sfaci moved this task from Done to Sign Off on the Data Products (Sprint 02) board.Oct 22 2023, 12:40 PM

@Ladsgroup Keep in mind that these tests have been run locally to test the fix before deploying to production.
The fix is done and merged and these tests are showing that it's working fine, but the service hasn't been deployed yet. Hopefully we'll do that next Monday. We'll ping you through this ticket as soon as it's done.

Change 967438 merged by jenkins-bot:

[operations/deployment-charts@master] media-analytics: bump version

https://gerrit.wikimedia.org/r/967438

Maintenance_bot removed a project: Patch-For-Review.Oct 23 2023, 10:10 AM

• WDoranWMF moved this task from Sign Off to To Deploy on the Data Products (Sprint 02) board.Oct 24 2023, 12:06 PM

• WDoranWMF moved this task from To Deploy to Teleport to Sprint 03 on the Data Products (Sprint 02) board.Oct 24 2023, 12:11 PM

• WDoranWMF edited projects, added Data Products (Data Products (Sprint 03)); removed Data Products (Sprint 02).

• WDoranWMF moved this task from Sprint Backlog to To Deploy on the Data Products (Data Products (Sprint 03)) board.

• WDoranWMF edited projects, added Data Products (Sprint 02); removed Data Products (Data Products (Sprint 03)).Oct 24 2023, 12:17 PM

• WDoranWMF set Final Story Points to 5.

• WDoranWMF moved this task from Teleport to Sprint 03 to Done on the Data Products (Sprint 02) board.

• WDoranWMF set the point value for this task to 5.