
Incorrect thumbnail being returned by drmrs, eqiad and esams
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What happens?:

  • The old version of the image is returned (most of shoulder not visible)
  • Other sizes of the image do get the correct version
  • Purge the image and try again, same result

What should have happened instead?:

  • A correct thumbnail of the new version should be returned by the url

Some users are apparently consistently seeing the correct version and others consistently the wrong version, which might hint at a caching data-center problem. Possibly esams?

NOTE: I found a more logical reason: retina vs non-retina screens getting different thumbnails served to them. For users with retina screens, the 100px thumbnail slot delivers the 200px image; for others, the 200px slot delivers the actual 200px image, while retina users get the 400px version there. This explains the difference in perception between people.

Of note: x-cache: cp3055 hit, cp3053 hit/5

Summary
URL: https://upload.wikimedia.org/wikipedia/commons/thumb/5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
Status: 200
Source: Network
Address: 2620:0:862:ed1a::2:b:443

Request
:method: GET
:scheme: https
:authority: upload.wikimedia.org
:path: /wikipedia/commons/thumb/5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
Cookie: <cookie removed>
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Host: upload.wikimedia.org
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.3 Safari/605.1.15
Accept-Language: en-GB,en;q=0.9
Accept-Encoding: gzip, deflate, br
Connection: keep-alive

Response
:status: 200
Timing-Allow-Origin: *
Accept-Ranges: bytes
Access-Control-Allow-Origin: *
Server-Timing: cache;desc="hit-front", host;desc="cp3053"
Report-To: { "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
Content-Length: 11263
Date: Sun, 05 Feb 2023 18:32:59 GMT
Access-Control-Expose-Headers: Age, Date, Content-Length, Content-Range, X-Content-Duration, X-Cache
Age: 2758
Content-Type: image/jpeg
ETag: 442f796c711de0f8e65ee483a2ad180c
Last-Modified: Thu, 14 Sep 2017 15:13:14 GMT
Server: ATS/9.1.4
Strict-Transport-Security: max-age=106384710; includeSubDomains; preload
x-client-ip: An IPv6 IP
permissions-policy: interest-cohort=(),ch-ua-arch=(self "intake-analytics.wikimedia.org"),ch-ua-bitness=(self "intake-analytics.wikimedia.org"),ch-ua-full-version-list=(self "intake-analytics.wikimedia.org"),ch-ua-model=(self "intake-analytics.wikimedia.org"),ch-ua-platform-version=(self "intake-analytics.wikimedia.org")
accept-ch: Sec-CH-UA-Arch,Sec-CH-UA-Bitness,Sec-CH-UA-Full-Version-List,Sec-CH-UA-Model,Sec-CH-UA-Platform-Version
x-cache-status: hit-front
nel: { "report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}
x-cache: cp3055 hit, cp3053 hit/5

Event Timeline

I was the original person who found this problem, which is/was being discussed at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#How_do_I_fix_non-uniform_image_scaling.
I was a little reluctant to open a ticket before we had full information, but here we are.

NOTE: I found a more logical reason: retina vs non-retina screens getting different thumbnails served to them. For users with retina screens, the 100px thumbnail slot delivers the 200px image; for others, the 200px slot delivers the actual 200px image, while retina users get the 400px version there. This explains the difference in perception between people.

Yes, this was discussed in the VPT thread, but it shouldn't affect what happens when the URL is retrieved manually, rather than depending on a browser to handle an <img srcset> or whatever. In particular, the wrongly retrieved image is:

jhawk@lrr /tmp % wget 'https://upload.wikimedia.org/wikipedia/commons/thumb/5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg'      
--2023-02-05 13:00:20--  https://upload.wikimedia.org/wikipedia/commons/thumb/5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
Resolving upload.wikimedia.org (upload.wikimedia.org)... 208.80.154.240
Connecting to upload.wikimedia.org (upload.wikimedia.org)|208.80.154.240|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11263 (11K) [image/jpeg]
Saving to: ‘200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg’

200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg 100%[====================================================================================>]  11.00K  --.-KB/s    in 0s      

2023-02-05 13:00:20 (96.8 MB/s) - ‘200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg’ saved [11263/11263]

jhawk@lrr /tmp % ls -ld 200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg                                                                                          
-rw-r--r--@ 1 jhawk  wheel  11263 Sep 14  2017 200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
jhawk@lrr /tmp % sha256sum 200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg                                                                                        
0c513efce857393aa97bb79296757529d375df27be03f9cf371e0c959d2a416b  200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
jhawk@lrr /tmp %

Of course, I don't have a SHA256 sum for the correct version of the image at this particular resolution, so I cannot tell you exactly what it should be.

It does sound like there are some people who retrieve that particular URL and get a different result, and I'm not sure what is going on there. What is the load balancing situation with upload.wikimedia.org?

This seems to depend on the IP I connect to:

$ curl --resolve upload.wikimedia.org:443:208.80.154.240 -s 'https://upload.wikimedia.org/wikipedia/commons/thumb/5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg' |exiftool - |egrep 'Width|Height'
Image Width                     : 200
Image Height                    : 300

$ curl --resolve upload.wikimedia.org:443:198.35.26.112 -s 'https://upload.wikimedia.org/wikipedia/commons/thumb/5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg' |exiftool - |egrep 'Width|Height'
Image Width                     : 200
Image Height                    : 250

This seems to depend on the IP I connect to:

Indeed. And it's not just one server. Assuming this is the full list, they seem to be about equally divided:

jhawk@lrr ~ % for d in codfw drmrs eqiad eqsin esams ulsfo; do echo -n $d: ; curl  -sI --resolve upload.wikimedia.org:443:$(host -t a upload-lb.$d.wikimedia.org | awk '{print $NF}') https://upload.wikimedia.org/wikipedia/commons/thumb/5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg | grep content-length; done
codfw:content-length: 18860
drmrs:content-length: 11263
eqiad:content-length: 11263
eqsin:content-length: 18860
esams:content-length: 11263
ulsfo:content-length: 18860
jhawk@lrr ~ %

(The 11263-byte image is the old/wrong one.)

Of note: This file was most recently updated on 5 July 2018. It appears that some of these cached images are almost five years old.

Ideally, some root cause analysis can be performed here so that whatever update failure is causing this problem gets fixed for all of the other improperly cached files that are not being reported by helpful editors, and so that this problem can be avoided in the future.

TheDJ renamed this task from "Incorrect thumbnail being returned for some users" to "Incorrect thumbnail being returned by drmrs, eqiad and esams". Feb 6 2023, 6:13 PM

The difference between the two groups of datacenters is the swift backend serving them. From what I understand, a bad thumbnail is stored in the eqiad swift datastore.

akosiaris claimed this task.
akosiaris subscribed.

The difference between the two groups of datacenters is the swift backend serving them. From what I understand, a bad thumbnail is stored in the eqiad swift datastore.

Indeed, that was it. I just deleted that thumbnail manually using swift delete and I can now see the picture correctly. Interestingly, that 200px thumbnail wasn't in the output of swift list, but was present in the output of swift stat.
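
For anyone unfamiliar with the swift CLI, the list/stat/delete discrepancy described above can be reproduced with something like the following (a sketch only; the container name is assumed from the usual wikipedia-commons-local-thumb.<shard> sharding convention, not copied from the actual session):

CONTAINER='wikipedia-commons-local-thumb.5d'
OBJECT='5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg'
# The container listing did not show the 200px entry...
swift list "$CONTAINER" --prefix '5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/'
# ...yet a direct stat on the exact object name still found it, so it was deleted by name:
swift stat "$CONTAINER" "$OBJECT"
swift delete "$CONTAINER" "$OBJECT"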

By the way, the 192px thumbnail was already there, with correct quality, and it doesn't require weird actions to regenerate (just an action=purge on the image's main page, that is https://commons.wikimedia.org/wiki/File:EPA_Deputy_Admin_Bob_Perciasepe.jpg in this case). Whatever uses those 200px would do itself a favor by switching to using the 192px variant.

I am gonna resolve this, feel free to reopen though.

I am gonna resolve this, feel free to reopen though.

On what basis? As others explained initially, this isn't about one image. It's about how we got here and whether we can identify the bug that led to this problem and resolve it. Is there any speculation about that? This was never about a single image.

By the way, the 192px thumbnail was already there, with correct quality, and it doesn't require weird actions to regenerate (just an action=purge on the image's main page, that is https://commons.wikimedia.org/wiki/File:EPA_Deputy_Admin_Bob_Perciasepe.jpg in this case).

Huh. I did not try that because I understood enwiki user Kusma to have done so unsuccessfully.
(Oops.)
That said, it seems like https://commons.wikimedia.org/w/thumb.php?f=EPA_Deputy_Admin_Bob_Perciasepe.jpg&w=100&action=purge should have generated a purge to the parent image...

Whatever uses those 200px would do itself a favor by switching to using the 192px variant.

Well, it's not so simple? https://en.wikipedia.org/wiki/Administrator_of_the_Environmental_Protection_Agency has 25 different images, all (most?) specifying 100px. The 200px get served as part of a srcset to hi-dpi browsers. I'm not sure if you're suggesting page designers shouldn't specify a 100px width for images, or that MediaWiki shouldn't select a 200px thumbnail in those cases, or that there's something specific about this image but not all images in general, or what.
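
One way to see that srcset from the command line is something like the following (a sketch; it assumes the image is still used on that page, and the grep pattern is only illustrative):

curl -s 'https://en.wikipedia.org/wiki/Administrator_of_the_Environmental_Protection_Agency' \
  | grep -o 'EPA_Deputy_Admin_Bob_Perciasepe\.jpg/[0-9]*px-[^" ]*' | sort -u

The output should show the 100px file used as the src plus the larger srcset candidates (typically the 1.5x/2x renditions at 150px and 200px), which is why hi-DPI browsers end up fetching the 200px thumbnail.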

I performed action=purge while trying to troubleshoot this problem, and it did not fix the problem.

How can regular editors who do not have access to swift fix this problem for individual images without filing a bug report?

I think this ticket should be reopened for root cause analysis. This invalid thumbnail was apparently five years old. There are presumably more out there.

I performed action=purge while trying to troubleshoot this problem, and it did not fix the problem.

I also purged, I always do this before reporting problems like this.

How can regular editors who do not have access to swift fix this problem for individual images without filing a bug report?

I don't think you can.

I think this ticket should be reopened for root cause analysis. This invalid thumbnail was apparently five years old. There are presumably more out there.

Looking at the symptoms, combined with what was stated about the difference between list and stat commands, I'm guessing that OpenStack Swift (the file-storage layer/API) itself was confused about the existence of this filename.

I'm guessing the delete for it failed at some point long, long ago, causing the file to still be present but not fully known to the system (the index). The new purges are not deleting this file (does not exist in the list), yet asking for it specifically still returns it, so a new version won't have to be generated. Something vague like that.

I'm not really sure how you would detect a problem like this actively, for every single thumbnail. That'd be very expensive I think. As long as this doesn't happen too often it is probably not worth investing more time into.

I am gonna resolve this, feel free to reopen though.

On what basis? As others explained initially, this isn't about one image. It's about how we got here and whether we can identify the bug that led to this problem and resolve it.

I must have missed the part about this impacting more images, I fear. Would you be so kind as to point it out to me? I did visit the VPT thread, but I fear I failed to find a list of images.

If this impacts more images, it's an issue worth investigating more, but I most certainly did not come up with something concrete while investigating that 1 image.

Is there any speculation about that? This was never about a single image.

My current speculation is on similar grounds as @TheDJ's. Somehow swift ended up thinking it did not have the image, but it did and was able to serve it. Given that the "ghost" image was from 5 years ago, the infra has changed since then, and no logs are around from that time, it wasn't feasible to dig deeper.

By the way, the 192px thumbnail was already there, with correct quality, and it doesn't require weird actions to regenerate (just an action=purge on the image's main page, that is https://commons.wikimedia.org/wiki/File:EPA_Deputy_Admin_Bob_Perciasepe.jpg in this case).

Huh. I did not try that because I understood enwiki user Kusma to have done so unsuccessfully.
(Oops.)
That said, it seems like https://commons.wikimedia.org/w/thumb.php?f=EPA_Deputy_Admin_Bob_Perciasepe.jpg&w=100&action=purge should have generated a purge to the parent image...

Yes it should. I went through a similar purge (as others did) as well, to no avail. It might be explained by the swift weirdness outlined above.

Whatever uses those 200px would do itself a favor by switching to using the 192px variant.

Well, it's not so simple? https://en.wikipedia.org/wiki/Administrator_of_the_Environmental_Protection_Agency has 25 different images, all (most?) specifying 100px. The 200px get served as part of a srcset to hi-dpi browsers. I'm not sure if you're suggesting page designers shouldn't specify a 100px width for images, or that MediaWiki shouldn't select a 200px thumbnail in those cases, or that there's something specific about this image but not all images in general, or what.

It never is simple. What I am simply pointing out is that we are pre-generating thumbnails for all the resolutions listed in e.g. https://commons.wikimedia.org/wiki/File:EPA_Deputy_Admin_Bob_Perciasepe.jpg. I wouldn't dare instruct page designers on how to design pages, just asking that they consider the pros and cons of those pre-generated thumbnails when doing so.

I performed action=purge while trying to troubleshoot this problem, and it did not fix the problem.

I did too. And as you point out it did not fix the problem.

How can regular editors who do not have access to swift fix this problem for individual images without filing a bug report?

They can't. And filing a bug report is the best path forward here. I think we addressed the problem with the individual image this task listed, which is why I resolved it.

I think this ticket should be reopened for root cause analysis. This invalid thumbnail was apparently five years old. There are presumably more out there.

If there are, please do provide a few examples and reopen the task (or, better, file a new one mentioning this one). I have no way of coming up with such examples, no logs (they've all been deleted after 5 years) and no way to test my speculation. I'd be happy to help figure out the bug and its root cause, but I need more examples to do so.

Looking at the symptoms, combined with what was stated about the difference between list and stat commands, I'm guessing that OpenStack Swift (the file-storage layer/API) itself was confused about the existence of this filename.

I'm guessing the delete for it failed at some point long, long ago, causing the file to still be present but not fully known to the system (the index). The new purges are not deleting this file (does not exist in the list), yet asking for it specifically still returns it, so a new version won't have to be generated. Something vague like that.

Similar theory from me as well, thanks for writing it down. And with such an old thumbnail, no way of proving the theory :-(

I'm not really sure how you would detect a problem like this actively, for every single thumbnail. That'd be very expensive I think. As long as this doesn't happen too often it is probably not worth investing more time into.

For what it's worth, I am of the same opinion.

Thanks for this

Just an update for this one. I've dug into it a bit today and I can say that there clearly was a window of time before Jan 14th during which at least some parts of the world would receive the older 375px thumbnail. This apparently got fixed after Jan 15th. The data I am looking at is highly sampled, so the picture I am forming is still incomplete, but it's enough to spend more time looking at higher-resolution data. I'll update once I have done so.

Prefatory note: The VPT discussion was archived and is now at
https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_203#How_do_I_fix_non-uniform_image_scaling?

Part of the problem is we don't really have any idea how many images this problem impacts, and how common it is. And without a plausible explanation of how this occurs, there's just not enough information to speculate about it. That's part of why I find it frustrating that this was closed (and indeed, that the error condition was corrected).

@akosiaris did you preserve the output of swift list and swift stat?
Do we understand what the condition really was?

I did spend a while auditing all the images linked from the same article, as well as images uploaded by the same person in 2018, and a bunch of other things, and was not able to find another example of this.

I'm not sure that the examples @Jonesey95 listed are the same – it seemed like those were conditions that were corrected by a purge, which this was not.

I guess I will try to sit down this weekend and pick a few million random images and check for consistent images on multiple servers.
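
Something like the following per-URL check could drive that (a sketch generalizing the for-loop above; comparing content-length is only a cheap proxy for identical content, a hash comparison would be stricter, and thumbnail-urls.txt is a hypothetical list of sampled thumbnail URLs):

check_url() {
  url="$1"; prev=""
  for dc in codfw drmrs eqiad eqsin esams ulsfo; do
    ip=$(host -t a "upload-lb.${dc}.wikimedia.org" | awk '{print $NF}')
    len=$(curl -sI --resolve "upload.wikimedia.org:443:${ip}" "$url" \
          | awk -F': ' 'tolower($1)=="content-length" {print $2}' | tr -d '\r')
    if [ -n "$prev" ] && [ "$len" != "$prev" ]; then echo "MISMATCH: $url"; return; fi
    prev="$len"
  done
}
while read -r u; do check_url "$u"; done < thumbnail-urls.txt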

I'm not really sure how you would detect a problem like this actively, for every single thumbnail. That'd be very expensive I think.

Well, we certainly don't need to detect this for every single thumbnail.
Ideally, we would detect this for a few, and then we would be able to see what they hold in common and reason about the source of the problem.
And then by applying that reasoning, we would be able to figure out what is going on.

As for expense, we certainly ought to be able to modify the purge code to detect and rectify this situation. Purging is already relatively expensive. Adding sufficient instrumentation to the purge to detect this problem should not meaningfully increase the expense. Of course, that's kind of a workaround rather than a detection, but nonetheless.
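
To give one crude detection heuristic (a sketch only, not existing purge code): any thumbnail whose Last-Modified predates the file's most recent upload (5 July 2018 in this case, per the file page) must be stale, and that can be checked with nothing but HEAD requests. The widths below are illustrative, and the date parsing assumes GNU date:

FILE='EPA_Deputy_Admin_Bob_Perciasepe.jpg'
UPLOADED=$(date -d '2018-07-05' +%s)   # latest upload date, taken from the file description page
for w in 100 200 400; do
  url="https://upload.wikimedia.org/wikipedia/commons/thumb/5/5d/${FILE}/${w}px-${FILE}"
  lm=$(curl -sI "$url" | awk -F': ' 'tolower($1)=="last-modified" {print $2}' | tr -d '\r')
  [ "$(date -d "$lm" +%s)" -lt "$UPLOADED" ] && echo "STALE ${w}px (Last-Modified: $lm)"
done

(The wrong 200px copy above carried Last-Modified: Thu, 14 Sep 2017, so it would have been flagged.)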

As long as this doesn't happen too often it is probably not worth investing more time into.

Well, we just don't have sufficient information to understand whether that is true.
(Which is why I think this should not have been closed.)

I'm not sure that the examples @Jonesey95 listed are the same – it seemed like those were conditions that were corrected by a purge, which this was not.

Even if those thumbnails were corrected by a purge, the purge was performed only because someone noticed a problem. The outdated thumbnails were often months old. I have zero doubt that one or more servers are delivering incorrect thumbnails to readers and editors every day.

This situation very much reminds me of T157670, a bug report created in 2017 and not yet addressed in any meaningful way by developers. That bug leaves categories out of date, sometimes for years, just as this failure to maintain up-to-date thumbnails delivers invalid images to readers.

As someone who cares about the accuracy of information that we deliver to readers, I find it massively frustrating that fundamental bugs like these are left to dangle in the wind while the Community Wishlist process takes up limited volunteer and staff time proposing and vetting dozens of potential new features. /soapbox

Thanks for this

Just an update for this one. I've dug into it a bit today and I can say that there clearly was a window of time before Jan 14th during which at least some parts of the world would receive the older 375px thumbnail. This apparently got fixed after Jan 15th. The data I am looking at is highly sampled, so the picture I am forming is still incomplete, but it's enough to spend more time looking at higher-resolution data. I'll update once I have done so.

Let me add some more information. For starters (it should go without saying, but it's good to have confirmation), people weren't imagining it: parts of the global Content Delivery Network (CDN for short) were indeed serving an old and different thumbnail. Judging from logs, the new thumbnail was generated on Dec 3rd and was being served by half of the CDN, while the other half was serving the old thumbnail (I don't have an exact timeframe for when that one was generated). On January 15th, this was noticed, and apparently a purge was done. The purge worked, and all parts of the CDN were serving the new thumbnail after that. @Jhawkinson is right, this isn't the exact same problem as the one described when this task was opened. It is still an issue of course, just of a different nature.

Part of the problem is we don't really have any idea how many images this problem impacts, and how common it is. And without a plausible explanation of how this occurs, there's just not enough information to speculate about it. That's part of why I find it frustrating that this was closed (and indeed, that the error condition was corrected).

@akosiaris did you preserve the output of swift list and swift stat?
Do we understand what the condition really was?

Let me say thanks for voicing your frustration without going into an open/close task war.

I closed this task as it was apparently about a very specific thumbnail, which I managed to track down and fix (without finding the reason it happened). I think that T327253 (eqiad vs codfw discrepancy specifically) and T168002 (old restricted task, I asked to unrestrict it) are related, so we can add this task as a data point to one of these tasks. However, neither is a perfect match for what we witnessed with the original problematic thumbnail. I am hopeful about T327253; it is actively being worked on by people closer to swift than yours truly, and some of the findings explain some of the behaviors we witnessed here (although maybe not the original one).

As far as the output of the 2 commands goes, no, I did not preserve it, but I can reconstruct a big part of it by rerunning these commands and a smaller part of it from memory (mostly the dates). I added it as a data point in T327253.

As for expense, we certainly ought to be able to modify the purge code to detect and rectify this situation. Purging is already relatively expensive. Adding sufficient instrumentation to the purge to detect this problem should not meaningfully increase the expense. Of course, that's kind of a workaround rather than a detection, but nonetheless.

As long as this doesn't happen too often it is probably not worth investing more time into.

Well, we just don't have sufficient information to understand whether that is true.
(Which is why I think this should not have been closed.)

I am not sure you can purge something that swift isn't aware of having, for what it's worth. If I remember correctly, mw asks swift for the thumbnails to purge.

But in any case, we don't have more data than this 1 specific instance of swift not being aware of having a file, while having it (T327253 started the other way around). If we find more, and preferably fresh ones (old ones are really hard to chase down/debug), we can reason about that and dig deeper.

I'm out of my depth here, so I'm speculating, but these items from the swift list look peculiar:

5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg.gif
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/200px-EPA_Deputy_Admin_Bob_Perciasepe.jpg.png
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/201px-EPA_Deputy_Admin_Bob_Perciasepe.jpg
5/5d/EPA_Deputy_Admin_Bob_Perciasepe.jpg/202px-EPA_Deputy_Admin_Bob_Perciasepe.jpg

Do we understand what leads to the creation of the last 4 of those?

Perhaps it is unrelated to these problems. (They all seem fine on visual inspection.)

We don't currently ever expire thumbnails, but are moving gradually towards doing so.

Do we understand what leads to the creation of the last 4 of those?

201px and 202px are probably just people (like me) retrieving a direct URL or a thumbnail in wikitext preview of the 200px version, and adding a pixel to the width to check how those versions render (to detect whether this is an issue with a specific thumbnail or with all the thumbnails). The GIF and PNG ones... are absolutely interesting... but valid. Someone requested a GIF version, so Thumbor will serve you up a GIF version and this gets stored in swift (this use case is mostly for formats like SVG, of which we serve up PNG variants upon request).

None of it seems suspicious to me.

Do we understand what leads to the creation of the last 4 of those?

201px and 202px are probably just people (like me) retrieving a direct URL or a thumbnail in wikitext preview of the 200px version, and adding a pixel to the width to check how those versions render (to detect whether this is an issue with a specific thumbnail or with all the thumbnails). The GIF and PNG ones... are absolutely interesting... but valid. Someone requested a GIF version, so Thumbor will serve you up a GIF version and this gets stored in swift (this use case is mostly for formats like SVG, of which we serve up PNG variants upon request).

None of it seems suspicious to me.

Agreed on all of the above.