0px by 0px image in search results breaks MediaSearch
Closed, Resolved | Public | Bug Report

Description

If there's a 0px by 0px image in the MediaSearch results (any tab except audio), then MediaSearch breaks

Examples:
https://commons.wikimedia.org/w/index.php?search=30bytes&title=Special:MediaSearch&go=Go&type=image

(the file 30bytes.gif is a 0px by 0px image)

It's fine if you're searching for audio files:
https://commons.wikimedia.org/w/index.php?search=Haunu&title=Special:MediaSearch&go=Go&type=audio

Note that currently only 12 of the 73267622 images on Commons are 0px wide, so this might not be very urgent.

I'd guess the problem is ImageHandler::validateThumbParams(), which fails if an image's width is zero.
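
To illustrate why a zero source width is awkward for thumbnailing in general, here is a minimal sketch. It is not MediaWiki's code: both function names are hypothetical, and it only shows that an aspect-ratio calculation has nothing to divide by when the source width is 0, so a validation step has to reject such files.

```python
def scale_height(src_width: int, src_height: int, dst_width: int) -> int:
    """Height that keeps the source aspect ratio at the requested width.

    Hypothetical helper; raises ZeroDivisionError when src_width is 0.
    """
    return round(src_height * dst_width / src_width)


def can_thumbnail(src_width: int, src_height: int, dst_width: int) -> bool:
    """Hypothetical guard: only positive dimensions can be scaled."""
    return src_width > 0 and src_height > 0 and dst_width > 0


# A 0px by 0px file fails the guard, so no thumbnail can be produced for it.
print(can_thumbnail(0, 0, 120))      # False
print(can_thumbnail(200, 100, 120))  # True
```

If the caller treats a failed check as a fatal error rather than skipping the file, one such file in a result page would be enough to break the whole page, which matches the behaviour described above.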

Event Timeline

Seddon triaged this task as High priority. (Nov 13 2021, 2:14 AM)

Assuming this task is about the SDAW-MediaSearch code project, thus adding that project tag so other people who don't know or don't care about team tags can also find this task when searching via projects.

Weirdly, this seems to be specific to the search term "test". Other search terms work fine

Seddon lowered the priority of this task from High to Low. (Nov 15 2021, 1:41 PM)
Cparle renamed this task from "Scroll for more results not functioning" to "0px by 0px image in search results breaks MediaSearch". (Nov 15 2021, 6:06 PM)
Cparle updated the task description.
Cparle updated the task description.

Testing notes (for verifying after the fix)

  • After an "Invalid search" in one of the filter tabs, the "Invalid search" message is displayed for all other tabs too, although the search might be valid there. So the order in which search queries are checked matters, e.g. 1) enter "sport club logo" in the Image tab, 2) check the other filter tabs.

@Cparle will look into the database to see how many of these files there are, and what file types they're associated with; and how often they turn up in search results. Based on that impact we will determine whether to invest in fixing this and what approach to take.

Options to fix this include:
- figure out why 0x0 images are being created, prevent that from happening, and fix the ones that exist, or;
- fix the way search handles 0x0 images so that it doesn't return an error

Cparle added a subscriber: matthiasmullie.

As far as I can see all of these are caused by 0px by 0px files

Also, I can't reproduce an invalid search on one tab causing an invalid search on another.

Here is the number of files with width=0 that we expect will produce this error if they show up in search results:

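select count(*),img_media_type,img_major_mime from image where img_width=0 group by img_media_type,img_major_mime;

+----------+----------------+----------------+
| count(*) | img_media_type | img_major_mime |
+----------+----------------+----------------+
|       10 | BITMAP         | image          |
|        2 | DRAWING        | image          |
|  1630840 | AUDIO          | application    |
|   709526 | AUDIO          | audio          |
|        1 | VIDEO          | application    |
|       34 | VIDEO          | video          |
|        2 | MULTIMEDIA     | application    |
|       66 | OFFICE         | application    |
|        1 | OFFICE         | image          |
+----------+----------------+----------------+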

Audio files do not trigger the problem, because we don't try to create thumbnails for them, so the total number of problematic files is 116 out of ~79M. We can expect the number of problematic searches to be much higher than the number of problematic files because a problematic file may be returned in many different searches.
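
The 116 figure follows from the query output above by dropping the AUDIO rows and summing the rest. A quick check, with the counts copied from that output:

```python
# Row counts copied from the width=0 query output: (count, media_type, major_mime).
rows = [
    (10, "BITMAP", "image"),
    (2, "DRAWING", "image"),
    (1630840, "AUDIO", "application"),
    (709526, "AUDIO", "audio"),
    (1, "VIDEO", "application"),
    (34, "VIDEO", "video"),
    (2, "MULTIMEDIA", "application"),
    (66, "OFFICE", "application"),
    (1, "OFFICE", "image"),
]

# Audio files never get thumbnails, so they cannot trigger the error.
problematic = sum(count for count, media_type, _ in rows if media_type != "AUDIO")
print(problematic)  # 116
```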

Neither @matthiasmullie nor I have figured out a way to work out what proportion of media searches have this problem using logstash. I suspect that if we really want to find out, we'd need to instrument the search page and involve analytics.

Thanks for this @Cparle and @matthiasmullie! I don't think putting more effort into working out what proportion of media searches have the problem would be worthwhile, because that will change depending on which files are 0x0. Is it possible to look into why/how we have 0x0 files in the first place?

https://commons.wikimedia.org/wiki/Special:UploadWizard doesn't prevent a user from uploading 0x0 files - at least I can upload a 0x0 file in my local version of commonswiki

Hmm. Are we thinking it would be more difficult to prevent UploadWizard from accepting 0x0 files, or more difficult to fix MediaSearch's handling of them? If it's the latter, @Sannita, maybe we should reach out to the community and see if there's a reason people want them to exist?

I'm not aware of any reasons why this should be ok, but I'll do some research and ask around.

Change 747096 had a related patch set uploaded (by Simone Cuomo; author: Simone Cuomo):

[mediawiki/extensions/MediaSearch@master] Filter out image with no width/height

https://gerrit.wikimedia.org/r/747096

Change 747096 merged by jenkins-bot:

[mediawiki/extensions/MediaSearch@master] Filter out image with no width/height

https://gerrit.wikimedia.org/r/747096
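
For readers who want a picture of what "filter out image with no width/height" can look like, here is a minimal sketch. It is not the merged change: the result shape (a list of dicts) and the function name are assumptions made for illustration, and the field names file_width and file_height are borrowed from the search index fields mentioned later in this task.

```python
from typing import Any, Dict, List


def drop_zero_sized(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only results that have a positive width and height.

    Hypothetical helper: a missing or null dimension is treated like 0.
    """
    kept = []
    for result in results:
        width = result.get("file_width") or 0
        height = result.get("file_height") or 0
        if width > 0 and height > 0:
            kept.append(result)
    return kept


# Made-up result list for illustration only.
results = [
    {"title": "30bytes.gif", "file_width": 0, "file_height": 0},
    {"title": "ok.png", "file_width": 10, "file_height": 20},
    {"title": "nodims.pdf"},
]
print([r["title"] for r in drop_zero_sized(results)])  # ['ok.png']
```

A filter like this would only make sense on tabs that render thumbnails. Audio results legitimately have a width of 0 and should not pass through it.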

Tested on commons betalabs
select img_name,img_media_type,img_major_mime from image where img_width=0 and img_media_type !="AUDIO" ;

+----------------------------------------------------------+----------------+----------------+
| img_name                                                 | img_media_type | img_major_mime |
+----------------------------------------------------------+----------------+----------------+
| 104-10001-10015_01.pdf                                   | OFFICE         | application    |
| 104-10001-10015_02.pdf                                   | OFFICE         | application    |
| 104-10001-10015_03.pdf                                   | OFFICE         | application    |
| 104-10001-10015_04.pdf                                   | OFFICE         | application    |
| 104-10001-10015_05.pdf                                   | OFFICE         | application    |
| 104-10001-10015_06.pdf                                   | OFFICE         | application    |
| 104-10001-10015_07.pdf                                   | OFFICE         | application    |
| ACDC_test_file_1.pdf                                     | OFFICE         | application    |
| ACDC_test_file_2.pdf                                     | OFFICE         | application    |
| Amsterdam_Museum_logo.pdf                                | OFFICE         | application    |
| Asdlkjadjklsajkl.pdf                                     | OFFICE         | application    |
| CK.pdf.pdf                                               | OFFICE         | application    |
| Cu31924022189587.pdf                                     | OFFICE         | application    |
| De-Jong_Koninkrijk_deel-06_tweede-helft_zw.pdf           | OFFICE         | application    |
| Describing_with_0-13613385481441342_2014-01-17_07-29.png | BITMAP         | image          |
| LastTest-_-_1.jpeg                                       | BITMAP         | image          |
| LastTest-_-_2.jpeg                                       | BITMAP         | image          |
| LastTest-_-_3.jpeg                                       | BITMAP         | image          |
| LastTest-_-_4.jpeg                                       | BITMAP         | image          |
| Strlogotest.svg                                          | DRAWING        | image          |
| Title_0.12128706848586002.png                            | BITMAP         | image          |
| Title_0.13385609312435276.png                            | BITMAP         | image          |
| Title_0.27637197550359593.png                            | BITMAP         | image          |
| Title_0.3097767931192519.png                             | BITMAP         | image          |
| Title_0.6233659238693904.png                             | BITMAP         | image          |
+----------------------------------------------------------+----------------+----------------+
25 rows in set (0.032 sec)

The issue is still present for the following searches:
(1) https://commons.wikimedia.beta.wmflabs.org/w/index.php?title=Special:MediaSearch&search=104-10001-10015&type=other
Even the more general search fails - it never finishes fetching results: https://commons.wikimedia.beta.wmflabs.org/w/index.php?title=Special:MediaSearch&search=104+&type=other

(2) https://commons.wikimedia.beta.wmflabs.org/w/index.php?title=Special:MediaSearch&search=Amsterdam+Museum+logo&type=other&filemime=pdf
The partial search for museum - https://commons.wikimedia.beta.wmflabs.org/w/index.php?title=Special:MediaSearch&search=museum&type=image - fails too.

(3) Search for de jong koninkrijk - https://commons.wikimedia.beta.wmflabs.org/w/index.php?title=Special:MediaSearch&search=de+jong+koninkrijk&type=other

(4) Search for LastTest-_-_3
https://commons.wikimedia.beta.wmflabs.org/w/index.php?title=Special:MediaSearch&search=LastTest-_-_3&type=other

These issues may turn out to be beta-specific.
E.g. for (1), the offending image is 104-10001-10015_01.pdf.
If we look at how it's indexed (https://commons.wikimedia.beta.wmflabs.org/wiki/File:104-10001-10015_01.pdf?action=cirrusDump), it has a file_resolution (and file_width & file_height) - that's not what the DB & the error message tell us, though, so there's some data corruption.

I just ran the same query on prod and checked a few dozen of the results, and the data in the search index seems to accurately reflect the data in DB (i.e. file_resolution = 0), in which case our workaround should suffice.

So basically, I'd go with "beta has some corrupt data, and this may well no longer be an issue on prod" for now.
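
The comparison described above (DB width versus indexed width) can be sketched as a small consistency check. This is only an illustration: the function name, the input shape (file name mapped to width) and the sample values are all made up, and in practice the two inputs would come from the image table and from the cirrusDump output respectively.

```python
from typing import Dict, List


def find_mismatches(db_widths: Dict[str, int], index_widths: Dict[str, int]) -> List[str]:
    """Return names of files whose indexed width differs from the DB width.

    Hypothetical helper; files missing from the index are skipped.
    """
    return sorted(
        name
        for name, db_width in db_widths.items()
        if name in index_widths and index_widths[name] != db_width
    )


# Made-up sample data: the DB says width 0 for all three files,
# but the index disagrees for one of them.
db_widths = {"a.pdf": 0, "b.png": 0, "c.svg": 0}
index_widths = {"a.pdf": 1240, "b.png": 0}
print(find_mismatches(db_widths, index_widths))  # ['a.pdf']
```

Files reported by such a check are the ones a filter based on indexed dimensions would miss, because the index claims a non-zero size while thumbnailing still sees 0.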

I agree that these are beta-specific issues. I will see if I can get production queries for comparison, to review the results in wmf.16.

Re-checked the cases on commons wmf.16 - all works as expected.