
cache_upload: uncompressed images with Content-Encoding: gzip cause content decoding issues
Closed, ResolvedPublic

Description

It has been reported that at least one PNG image fails to load in esams with ERR_CONTENT_DECODING_FAILED: https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Relief_map_of_Serbia.png/272px-Relief_map_of_Serbia.png. All esams frontends are affected; the object is cached on cp3037's varnish-be.

The error occurs because the image body is not compressed, yet the response carries a Content-Encoding: gzip header.

curl --compressed fails with the following message:

curl: (61) Error while processing content unencoding: invalid stored block lengths
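
For reference, a check along these lines reproduces the symptom (the URL is from the report; the exact flags and the use of file(1) are illustrative):

  # Ask for gzip, dump the response headers to stderr, and let file(1) identify
  # the body. On an affected frontend the headers show "Content-Encoding: gzip"
  # while the body is identified as plain PNG image data.
  curl -s -H 'Accept-Encoding: gzip' -D /dev/stderr \
    'https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Relief_map_of_Serbia.png/272px-Relief_map_of_Serbia.png' \
    | file -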

Most likely, the problem is not related to esams-specific network conditions. esams is probably just the DC that happened to cache a bad copy.

It is not clear why we would try to gzip PNGs at all, even when the browser accepts gzip.

Using curl --compressed directly against varnish-be works fine.
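
A sketch of such a direct check; the backend address and port below are placeholders, not actual production values:

  # Hypothetical direct request to a backend cache node. With --compressed,
  # curl exits with code 61 if the body cannot be decoded as advertised by
  # Content-Encoding, and 0 if the response is handled cleanly.
  curl --compressed -sS -o /dev/null \
    -H 'Host: upload.wikimedia.org' \
    'http://BACKEND_HOST:BACKEND_PORT/wikipedia/commons/thumb/4/49/Relief_map_of_Serbia.png/272px-Relief_map_of_Serbia.png'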

Event Timeline

ema triaged this task as Medium priority. Oct 21 2016, 1:10 PM
ema moved this task from Backlog to Caching on the Traffic board.

After further investigation we noticed that the problem is not reproducible when forcing a cache miss by adding a random query parameter. We also tried purging the affected object from a varnish frontend and fetching it again; the issue could not be reproduced either.
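
Roughly what those two tests look like as commands (a sketch; the query parameter name is arbitrary, and whether an HTTP PURGE is accepted, and from which clients, depends on the VCL in place):

  # 1) Force a cache miss by appending a throwaway query parameter.
  curl --compressed -sS -o /dev/null \
    'https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Relief_map_of_Serbia.png/272px-Relief_map_of_Serbia.png?cachebust=1'

  # 2) Purge the object from a frontend, then fetch it again.
  curl -s -X PURGE \
    'https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Relief_map_of_Serbia.png/272px-Relief_map_of_Serbia.png'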

The two varnish backends in the path from esams to eqiad are cp1072 and cp3037. These are the scenarios we think might have triggered the problem:

  1. cp1072 temporarily emitted CE:gzip with this object and later stopped doing so (for the same cache object)
  2. cp3037 temporarily added CE:gzip on reception, and the object was later evicted/purged, cleaning things up
  3. cp3037 (whether or not it has evicted/purged the object since) temporarily added CE:gzip to its responses to several frontends
  4. several frontends all added CE:gzip to their cache objects on reception from cp3037, and then whatever triggered that went away

Scenario 4) seems unlikely, as it would imply multiple temporary issues across all frontends. Scenario 1) is also not the most likely one: we know for sure (because of the Age header) that cp1072 is still serving the same unevicted/unexpired cache object it was at the time the problem started, and its output is not bad now.
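
The Age check referred to above amounts to comparing the Age response header across requests; as a sketch (run against whichever cache layer is being examined):

  # Age reports how long the object has been in cache. If it keeps growing past
  # the point at which the problem started, the cache is still serving the same
  # unevicted/unexpired object.
  curl -s -o /dev/null -D - \
    'https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Relief_map_of_Serbia.png/272px-Relief_map_of_Serbia.png' \
    | grep -i '^Age:'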

ema renamed this task from ERR_CONTENT_DECODING_FAILED on certain png images from varnish-fe to cache_upload: uncompressed images with Content-Encoding: gzip cause content decoding issues. Oct 21 2016, 2:04 PM
ema updated the task description.

The specific repro URL for the Serbia map has been PURGEd now to clear up the issue for users, since we're not getting much debug value out of keeping it broken.

To be clearer about what was debugged on IRC: this was not a case of actual bad gzip encoding. The object contents in all affected caches were always the correct, uncompressed PNG data. The issue was simply that a Content-Encoding: gzip header was present on the objects (and thus on the outputs) in the affected frontends for unknown reasons, which caused clients to interpret the otherwise-correct PNG data as gzipped content and fail to decode what was never gzipped.
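
One way to confirm that the body was never actually gzipped, despite the header, is to look at the first bytes of the payload (a sketch):

  # A gzip stream starts with the magic bytes 1f 8b; a PNG starts with
  # 89 50 4e 47 0d 0a 1a 0a ("\x89PNG\r\n\x1a\n"). In this case the body began
  # with the PNG signature, i.e. it was the correct uncompressed image all along.
  curl -s -H 'Accept-Encoding: gzip' \
    'https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Relief_map_of_Serbia.png/272px-Relief_map_of_Serbia.png' \
    | head -c 8 | xxd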

BBlack claimed this task.

Closing for now as we haven't seen any further complaints about this. It may have been some temporary error condition we'll never reproduce. Will re-open if it happens again!

Reopening, another instance of this bug has been reported in T162035#3168304.

AFAIK we haven't had further reports since the resolution of T162035.