Page MenuHomePhabricator

Update prod custom varnish package for upstream 3.0.7 + deploy
Closed, DeclinedPublic

Description

Notable fixes we care about in the 3.0.6 -> 3.0.7 diffs:

  1. gzip + streaming + http/1.0 fix: https://www.varnish-cache.org/trac/ticket/1627
  2. fixed memory leak in bans (which is different than purges - this applies to our manual bans)

Event Timeline

BBlack raised the priority of this task from to Medium.
BBlack updated the task description. (Show Details)
BBlack added projects: acl*sre-team, Traffic.
BBlack subscribed.

FYI this is blocked up a little: the merge of the 3.0.6 -> 3.0.7 diffs onto our custom "-plus-wm" tree didn't go cleanly. There's a short but hairy conflict between our varnish-plus streaming+chunked stuff and the gzip+streaming+http/1.0 content-length bugfix, and it's going to take some digging to unravel how to merge them sanely without breaking anything.

Note T127294 was fixed by disabling do_gzip on the cache_misc cluster. I suspect that there was a bad interaction there with streamed pass-mode response from the backend using chunked+gzip. The ultimate trigger for making it worse was turning on nginx keepalives, but they're working fine elsewhere. I strongly suspect that https://www.varnish-cache.org/trac/ticket/1627 would've fixed this too, but we still don't have a 3.0.7 package that includes it...

Dropping some breadcrumbs in case I end up dropping this topic for a while again....

The 3.0.7 fix we care about the most, which is also the only one that creates a merge conflict onto our hacky 3.0-plus branches is this one for gzip: https://www.varnish-cache.org/trac/changeset/72981734a141a0a52172b85bae55f8877f69ff42

When merging that into our plus branch, this is the resulting merge conflict:

<<<<<<< HEAD
        if (!sp->wantbody)
                sp->wrk->res_mode &= ~RES_CHUNKED;

        if (!(sp->wrk->res_mode & RES_CHUNKED) &&
            sp->wrk->h_content_length != NULL &&
            sp->wantbody &&
            *phigh == -1) {
                http_Unset(sp->wrk->resp, H_Content_Length);
=======
        http_Unset(sp->wrk->resp, H_Content_Length);
        if (sp->wrk->res_mode & RES_LEN) {
                AN(sp->wrk->h_content_length);
>>>>>>> varnish-3.0.7

Or another way to think of the problem is that, prior to this new gzip patch, the diff from upstream to our plus branch in this little area is this:

+       if (!sp->wantbody)
+               sp->wrk->res_mode &= ~RES_CHUNKED;
+
        if (!(sp->wrk->res_mode & RES_CHUNKED) &&
-           sp->wrk->h_content_length != NULL) {
+           sp->wrk->h_content_length != NULL &&
+           sp->wantbody &&
+           *phigh == -1) {
                http_Unset(sp->wrk->resp, H_Content_Length);
                http_PrintfHeader(sp->wrk, sp->fd, sp->wrk->resp,
                    "Content-Length: %s", sp->wrk->h_content_length);

Because the gzip fix makes a real structural change to how the code operates, it's difficult to resolve this without fully understanding how everything related works, which is a whole of things to understand.

BBlack changed the task status from Open to Stalled.Feb 18 2016, 11:05 PM

I spent considerable time on this today, but I still wasn't able to understand the code well enough to be confident in a possible merge solution (although I came up with a few different ones that seem plausible). I think for now, perhaps we'll just leave do_gzip disabled on the misc cluster (not the end of the world) and hope that Varnish 4 moves us past all related issues. It's expected to land on maps and misc first, with one of those happening circa EOQ.

Change 273013 had a related patch set uploaded (by BBlack):
Revert "cache_misc: disable do_gzip"

https://gerrit.wikimedia.org/r/273013

Change 273013 merged by BBlack:
Revert "cache_misc: disable do_gzip"

https://gerrit.wikimedia.org/r/273013

At this point we're not investing further in varnish3 except for security fixes, so there's no point pursuing a real 3.0.7 package upgrade given the issues with the gzip/content-length part.