Mediarequests returning "file not found" for filenames with specific characters
Closed, Resolved | Public | 5 Estimated Story Points | BUG REPORT

Description

Local project: commons.wikimedia.org.

Testing the characters on pageviews.wmcloud.org:

  • Debela griža, kot jo vidi zmaj.jpg is not correct.
  • Mbeya city-2.jpg is correct.
  • Mandelbox 4D OpenCL 569829516 32K.jpg is correct.
  • Three images of a paragraph from The Red Badge of Courage (Stephen Crane, 1895) to illustrate kerning, ligatures, hyphenation and microtypography.png is not correct.
  • Abox - Mod Kali-V3 x Sierpinski 3D OpenCL 46149707045 8K.jpg is correct.
  • File:Sign-1020906,_M7,_Co._Limerick,_Ireland.jpg is not correct.
  • Sign-1020906, M7, Co. Limerick, Ireland.jpg is not correct.
  • File:Bucuresti,_Romania_(Mister_Tolanesco_intr-o_relaxare_absoluta)(2).JPG is not correct.
  • File:)(_-_Flickr_-_Time.Captured..jpg is not correct.

The data are published over https://wikimedia.org/api/rest_v1/#/Pageviews%20data:

  • Debela griža, kot jo vidi zmaj.jpg is correct.
  • File:Sign-1020906,_M7,_Co._Limerick,_Ireland.jpg is correct.

Result: Statistics are not available if at least two of the following characters appear immediately next to each other in the file name: ( ) - or space.

Event Timeline

Aklapper removed a subscriber: MusikAnimal.

Cannot reproduce.

I only see an Uncaught DOMException: The operation is insecure. when going to that link.

So there must be another reason for the error? (I'm sorry, I am in way over my head here. I just noticed an error that keeps popping up for some files; I thought it might be the letters... It keeps appearing, and I can't see the Mediaviews Analysis for the file provided.)

Well, first, the file was only uploaded 22 hours ago, so the data might simply not be available yet. However, I did some digging and it appears you might be right: the API does not appear to like characters with diacritics. I get a 404 when browsing to https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2Fc%2Fc4%2FDebela_gri%C5%BEa%2C_kot_jo_vidi_zmaj.jpg/daily/2023091100/2023100100 and seemingly the same for any file with diacritics in the name.

Other examples:

The last example, File:Catedral de la Encarnación, Málaga, España, 2023-05-19, DD 37-39 HDR.jpg, is a featured image and the file page has had a decent number of pageviews, so there should definitely be media requests showing up in Mediaviews Analysis.

This sounds eerily similar to T247333 but I think it's a different issue. Anyway, I'm tagging Pageviews-API and Data-Engineering so that this gets reported to the right people.
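
For anyone who wants to reproduce the request above, here is a minimal sketch of how such a URL can be assembled. The helper name `mediarequests_url` is hypothetical (it is not part of any Wikimedia tool), and the sketch assumes the whole storage path, slashes included, has to be percent-encoded into a single URL segment, as in the 404 example.

```python
from urllib.parse import quote

API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file"


def mediarequests_url(file_path, start, end, referer="all-referers",
                      agent="user", granularity="daily"):
    """Build a per-file mediarequests URL (hypothetical helper).

    file_path is the storage path, e.g. "/wikipedia/commons/c/c4/Name.jpg".
    safe="" makes quote() encode "/" as %2F too, so the path stays one segment.
    """
    encoded = quote(file_path, safe="")
    return f"{API_ROOT}/{referer}/{agent}/{encoded}/{granularity}/{start}/{end}"


url = mediarequests_url(
    "/wikipedia/commons/c/c4/Debela_griža,_kot_jo_vidi_zmaj.jpg",
    "2023091100", "2023100100",
)
print(url)
```

The printed URL should match the one quoted above character for character, which at least rules out a client-side encoding mistake for this file.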

Cannot reproduce.

I only see an Uncaught DOMException: The operation is insecure. when going to that link.

Do you have ad blockers or privacy extensions enabled, by chance? If so, that would explain the error. If you're not using any such browser extension, we have a bug of some sort. Could you share your browser version and OS? Thanks!

Do you have ad blockers or privacy extensions enabled, by chance?

Ah, yes. Sorry for the noise!

Dusan_Krehel renamed this task from Mediaviews analysis doesn't work for files with non-standard letters in the filename? to None result with some chars in the file name. Oct 2 2023, 10:30 PM
Dusan_Krehel triaged this task as Medium priority.
Dusan_Krehel removed a project: Pageviews-API.
Dusan_Krehel updated the task description.

Doesn't seem to be all diacritics though
https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/all-agents/%2Fwikipedia%2Fcommons%2Fe%2Fef%2FAB_Tacksfabriken_och_konf-fabr._AB_Viking._Trollh%25C3%25A4ttan_FiBs_serien_-_Nordiska_museet_-_NMAx.0001508.tif/monthly/20230101/20231001
gives a result despite containing an ä
At the same time, any file with a parenthesis in the name will fail.
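
A side note on the URL above: `%25C3%25A4` is `ä` percent-encoded twice (`ä` becomes `%C3%A4`, and encoding that again gives `%25C3%25A4`), so it may not be directly comparable with the singly encoded examples. Below is a small sketch for spotting this with the standard library only; the helper name `decode_layers` is made up for illustration.

```python
from urllib.parse import quote, unquote


def decode_layers(segment, max_layers=5):
    """Count how many times a segment can be unquoted before it stops changing."""
    layers = 0
    while layers < max_layers:
        decoded = unquote(segment)
        if decoded == segment:
            break
        segment = decoded
        layers += 1
    return layers, segment


once = quote("Trollhättan", safe="")   # single percent-encoding
twice = quote(once, safe="")           # the "%" signs get encoded again
print(once)
print(twice)
print(decode_layers(twice))
```

This heuristic would misfire on a file name that legitimately contains a literal `%`, so treat the layer count as a hint rather than proof.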

MusikAnimal raised the priority of this task from Medium to Unbreak Now! Oct 16 2023, 10:26 PM
MusikAnimal added subscribers: Ladsgroup, Sfaci, SGupta-WMF and 3 others.

Raising to UBN as per duplicate task

MusikAnimal renamed this task from None result with some chars in the file name to Mediarequests returning "file not found" for filenames with specific characters. Oct 16 2023, 10:28 PM
MusikAnimal moved this task from Sprint Backlog to In Process on the Data Products (Sprint 02) board.

Sorry for all the noise! I didn't realize until after merging that the old task was assigned, etc.

Upon investigation, we concluded that this is a bug in the AQS 2.0 media analytics service. It is missing some logic from AQS 1.0 related to handling URL-encoded file paths and querying the database. We are already looking into approaches to fix it and will update here once the fix is released. Thank you!

Hi,

In the description of this ticket there is a list with some items and the text "is correct" or "is not correct". Does that mean that these items are working or not for you when you try to get the media request information?

  • does "is correct" mean that you get data for this filename?
  • does "is not correct" mean that you get a 404 for this filename?

Thank you!

Hi,

In the description of this ticket there is a list with some items and the text "is correct" or "is not correct". Does that mean that these items are working or not for you when you try to get the media request information?

From what I'm seeing in the list, those files don't have a special character in them. You can take a look at https://mvc.toolforge.org/index.php?category=Videos+by+Tagesschau+(ARD)&timespan=&rangestart=2023-08-01&rangeend=2023-09-01&limit=100

  • does "is correct" mean that you get data for this filename?

Yes

  • does "is not correct" mean that you get a 404 for this filename?

Yes

Just wondering, for example, why this item File:)(_-_Flickr_-_Time.Captured..jpg is included as "is not correct". I think we already understand the issue, and that case does not match the failure pattern (the failure involves a combination of certain punctuation marks, because not all of them are stored the same way in the datasets). When I request:

https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/all-agents/%2Fwikipedia%2Fcommons%2F0%2F00%2F)(_-_Flickr_-_Time.Captured..jpg/monthly/20230101/20231001

I get a good response:

    "items": [
        {
            "referer": "all-referers",
            "file_path": "/wikipedia/commons/0/00/)(_-_Flickr_-_Time.Captured..jpg",
            "granularity": "monthly",
            "timestamp": "2023010100",
            "agent": "all-agents",
            "requests": 16
        },
        {
            "referer": "all-referers",
            "file_path": "/wikipedia/commons/0/00/)(_-_Flickr_-_Time.Captured..jpg",
            "granularity": "monthly",
            "timestamp": "2023020100",
. . .
. . .

In some cases you are adding the prefix File:, but I think that is not part of the file path, right? In that case the one that exists is )(_-_Flickr_-_Time.Captured..jpg rather than File:)(_-_Flickr_-_Time.Captured..jpg. Is that the reason it is included as "is not correct"?

Please correct me if I'm wrong.
Thanks!!
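
To make the File: distinction concrete, here is a rough sketch of turning a page title into the bare file name. The helper `title_to_filename` is hypothetical, and it assumes the only differences are the `File:` namespace prefix and spaces versus underscores; the hashed directory part of the path (`0/00/` in the request above) is not derived here.

```python
def title_to_filename(title):
    """Strip a leading "File:" namespace prefix and use underscores for spaces."""
    prefix = "File:"
    if title.startswith(prefix):
        title = title[len(prefix):]
    return title.strip().replace(" ", "_")


print(title_to_filename("File:)(_-_Flickr_-_Time.Captured..jpg"))
print(title_to_filename("Sign-1020906, M7, Co. Limerick, Ireland.jpg"))
```

With this normalisation, the two Sign-1020906 entries in the task description would resolve to the same file name.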

Ok! No worries!
Just waiting for some sample data to test some edge cases before pushing a fix for all this. We'll keep you posted.
Thanks!

Change 967168 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[generated-data-platform/aqs/media-analytics@main] Fix to deal with filepaths with some specific punctuation marks

https://gerrit.wikimedia.org/r/967168

Change 967168 merged by jenkins-bot:

[generated-data-platform/aqs/media-analytics@main] Fix to deal with filepaths with some specific punctuation marks

https://gerrit.wikimedia.org/r/967168

Change 967438 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] media-analytics: bump version

https://gerrit.wikimedia.org/r/967438

For )(_-_Flickr_-_Time.Captured..jpg I think the underlying issue might be that it is unclear which characters are expected to be encoded and which are not. In my case (where I got 404s for files with parentheses) I now see that I had encoded the parentheses.

Sure! We misunderstood that at the beginning. That's why we are fixing this right now. The current issue is due to some edge cases like this. Anyway, with the fix we have made, from now on you should be able to request files using either the encoded or the non-encoded form. In the tests I have been running, both work. For example, the following requests give the same response:

  • metrics/mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2Fb%2Fbd%2FTitan_%28moon%29.ogg/daily/2023080100/2023090100
  • metrics/mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2Fb%2Fbd%2FTitan_(moon).ogg/daily/2023080100/2023090100
{
    "items": [
        {
            "referer": "all-referers",
            "file_path": "/wikipedia/commons/b/bd/Titan_(moon).ogg",
            "granularity": "daily",
            "timestamp": "2023080100",
            "agent": "user",
            "requests": 1
        },
        {
            "referer": "all-referers",
            "file_path": "/wikipedia/commons/b/bd/Titan_(moon).ogg",
            "granularity": "daily",
            "timestamp": "2023080300",
            "agent": "user",
            "requests": 3
. . .
. . .

In the response you can see how the file_path is really stored in the dataset. This is how it should have worked.
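
The equivalence of the two request forms can be pictured roughly like this. It is only an illustration of the idea, under the assumption that the lookup key is the fully decoded path; it is not the code in the media-analytics service, and `canonical_path` is a made-up name.

```python
from urllib.parse import unquote


def canonical_path(segment):
    """Decode a percent-encoded URL segment once to get the stored file_path."""
    return unquote(segment)


encoded = "%2Fwikipedia%2Fcommons%2Fb%2Fbd%2FTitan_%28moon%29.ogg"
plain = "%2Fwikipedia%2Fcommons%2Fb%2Fbd%2FTitan_(moon).ogg"

print(canonical_path(encoded))
print(canonical_path(encoded) == canonical_path(plain))
```

Both segments decode to the same string as the file_path field in the response above, which is why encoded and non-encoded parentheses can be treated as the same file.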

Test status: QA PASS

Tested response and data (compared with AQS 1.0: the two files were semantically identical).

All requests return status 200:

  • mediarequests/per-file/all-referers/all-agents/%2Fwikipedia%2Fcommons%2Fe%2Fef%2FAB_Tacksfabriken_och_konf-fabr._AB_Viking._Trollh%25C3%25A4ttan_FiBs_serien_-_Nordiska_museet_-_NMAx.0001508.tif/monthly/20230101/20231001
  • mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2Fb%2Fbd%2FTitan_(moon).ogg/daily/2023080100/2023090100
  • mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2F4%2F47%2FCatedral_de_la_Encarnaci%C3%B3n%2C_M%C3%A1laga%2C_Espa%C3%B1a%2C_2023-05-19%2C_DD_37-39_HDR.jpg/daily/2023080100/2023100100
  • mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2F0%2F0e%2FAngkor_-_Zentrum_des_K%C3%B6nigreichs_der_Khmer_(CC_BY-SA_4.0).webm/daily/2023080100/2023090100
  • mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2Fb%2Fbd%2FTitan_%28moon%29.ogg/daily/2023080100/2023090100
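
A check like the one above could be scripted along these lines. This is a sketch, not the QA team's actual procedure: `BASE` is assumed to be the public production endpoint (the results above may come from a different environment), the function names and User-Agent string are invented for the example, and the `fetch` parameter exists only so the logic can be exercised without network access.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Assumption: the public production endpoint; a QA environment would differ.
BASE = "https://wikimedia.org/api/rest_v1/metrics/"

PATHS = [
    "mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2Fb%2Fbd%2FTitan_(moon).ogg/daily/2023080100/2023090100",
    "mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2Fb%2Fbd%2FTitan_%28moon%29.ogg/daily/2023080100/2023090100",
]


def http_status(url):
    """Return the HTTP status code of a GET request, including error codes."""
    req = Request(url, headers={"User-Agent": "mediarequests-check/0.1 (example)"})
    try:
        with urlopen(req, timeout=30) as resp:
            return resp.status
    except HTTPError as err:
        return err.code


def check_paths(paths, fetch=http_status):
    """Map each relative path to the status code returned for BASE + path."""
    return {path: fetch(BASE + path) for path in paths}
```

Calling `check_paths(PATHS)` performs real HTTP requests, so the block deliberately does not run it on import; any entry whose status is not 200 would be worth reporting here.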

This one is returning 404 to me:
https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/user/%2Fwikipedia%2Fcommons%2F0%2F0e%2FAngkor_-_Zentrum_des_K%C3%B6nigreichs_der_Khmer_(CC_BY-SA_4.0).webm/daily/2023080100/2023090100

@Ladsgroup I just checked again now; in the QA test env it's returning 200 and the right JSON response. I'm not sure the fix has been deployed yet.

@SGupta-WMF can advise further. It's looking good on my end.

@Ladsgroup Keep in mind that these tests have been run locally to test the fix before deploying to production.
The fix is done and merged and these tests are showing that it's working fine, but the service hasn't been deployed yet. Hopefully we'll do that next Monday. We'll ping you through this ticket as soon as it's done.

Change 967438 merged by jenkins-bot:

[operations/deployment-charts@master] media-analytics: bump version

https://gerrit.wikimedia.org/r/967438

WDoranWMF set Final Story Points to 5.
WDoranWMF moved this task from Teleport to Sprint 03 to Done on the Data Products (Sprint 02) board.
WDoranWMF set the point value for this task to 5.

This change has been deployed, and 404 errors have greatly dropped off. Please update if you see any persisting examples of this behaviour.

Thanks. From my contacts, this seems to be fixed now.