
https://integration.wikimedia.org/ci/api/json is corrupted when requested more than once in a row
Closed, ResolvedPublic

Description

Nodepool can't reach the Jenkins API for some reason. That has happened from time to time, but it is now a constant stream of errors. The first occurrence was at 2016-02-17 19:39:59:

2016-02-17 19:39:59,773 ERROR nodepool.NodePool: Unable to check status of gallium.wikimedia.org
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 1507, in reconfigureManagers
    info = config.jenkins_managers[t.name].getInfo()
  File "/usr/lib/python2.7/dist-packages/nodepool/jenkins_manager.py", line 137, in getInfo
    return self._client.get_info()
  File "/usr/lib/python2.7/dist-packages/jenkins/__init__.py", line 326, in get_info
    % self.server)
JenkinsException: Could not parse JSON info for server[https://integration.wikimedia.org/ci/]

The error is repeated every 10 seconds, as that check is a scheduled Nodepool task.

The URL being hit should be https://integration.wikimedia.org/ci/api/json, and it works fine using curl from labnodepool1001.eqiad.wmnet.
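
For context, here is a rough approximation of what the python-jenkins client does here (an illustrative sketch written for Python 3, not the actual library code shipped in /usr/lib/python2.7/dist-packages/jenkins/): it fetches api/json and wraps any JSON decoding failure into the generic JenkinsException shown in the traceback, which hides the underlying transport problem.

import json
import urllib.request


class JenkinsException(Exception):
    pass


def get_info(server):
    # Fetch <server>api/json and decode it. A corrupted body surfaces as the
    # generic "Could not parse JSON info" error seen in the traceback above.
    body = urllib.request.urlopen(server + 'api/json').read()
    try:
        return json.loads(body)
    except ValueError:
        raise JenkinsException(
            'Could not parse JSON info for server[%s]' % server)

# get_info('https://integration.wikimedia.org/ci/')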

Event Timeline

On gallium I do see the requests in /var/log/jenkins/access.log:

[18/Feb/2016:10:55:24 +0000] "GET /ci/api/json HTTP/1.1" 200 97014 "-" "Python-urllib/2.7"
[18/Feb/2016:10:55:35 +0000] "GET /ci/api/json HTTP/1.1" 200 97068 "-" "Python-urllib/2.7"

The returned JSON apparently cannot be decoded; the error in the Jenkins Python module is a ValueError exception.

$ curl --silent https://integration.wikimedia.org/ci/api/json|python -m json.tool
Invalid control character at: line 1 column 96213 (char 96212)
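
To see what actually sits at the reported offset, a quick throwaway script (my own, not from the task; the offset is taken from the json.tool error above and varies between responses) can dump the raw bytes around the first invalid control character:

import urllib.request

URL = 'https://integration.wikimedia.org/ci/api/json'
OFFSET = 96212  # "char 96212" from the json.tool error above; varies per response

body = urllib.request.urlopen(URL).read()
print(len(body))
# Print a window of raw bytes around the first invalid control character.
print(repr(body[OFFSET - 60:OFFSET + 60]))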

There is something borked in the part of the response that lists the Jenkins views:

{"name":"Labs","url":"https://integration.wikimedia.org/ci/view/Labs/"},
{"name":"MediaWiki","url":"https://integration.w<80>qW8^O^?^@^@pqW8^O^?^@^@M*F^@^@^@^@^@R*F^@^@^@^@^@^@^@^@^@^@^@^@^@IP(M^O^?^@^@e","url":"https:^@mW8^O^?^@^@ation.wikimedia.org/ci/view/Mobile/"},
{"zj(M^O^?^@^@w-tools","url":"https://integration.wikimedia.org/ci^@^@^@^@^@^@^@^@-tooüÿÿÿÿÿÿÿõz^@^@^@^@^@^@^P^@^@^@^@^@^@^@ :"https^@^@^@^@nteg^@^@^@^@on.w^@^@^@^@edia4qW8^O^?^@^@^@^@^@^@^@^@^@^@^@^@^@^@{"na^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ÿÿÿÿÿÿÿÿzj(M^O^?^@^@tion.wikzj(M^O^?^@^@rg/ci/viO*F^@^@^@^@^@ebG^@^@^@^@^@^H^@^@^@nameT*F^@^@^@^@^@id","url":"https://integration.wikimedia.org/ci/view/Parsoid/"},(^@^@^@0^@^@^@ÐsW8^O^?^@^@^PsW8^O^?^@^@":"https://integration.wikimedia.org/ci/view/Pywikibot/"},
{"name":"WMF","url":"https://integration.wikimedia.org/ci/view/WMF/"},

Mentioned in SAL [2016-02-18T11:12:00Z] <hashar> Jenkins: reloading configuration from disk. Some metadata are corrupted T127294

Mentioned in SAL [2016-02-18T11:27:05Z] <hashar> Jenkins web UI busy with 'jenkins.model.RunIdMigrator doMigrate' while it migrates build records. I did a bunch of cleanup yesterday. Jenkins runs jobs in the background just fine though. T127294

Mentioned in SAL [2016-02-18T11:34:49Z] <hashar> Hard restarting Jenkins T127294

The Jenkins service itself seems to be just fine. On gallium I have been running curl --silent 'http://127.0.0.1:8080/ci/api/json'|python -m json.tool and that passes just fine.

So the problem might be in one of the following (a quick comparison sketch follows the list):

  • Nginx (https termination)
  • misc varnishes
  • Apache on gallium
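
A minimal way to narrow down which of those layers corrupts the payload is to fetch the same resource both directly from the backend and through the full proxy chain, then check whether each body still decodes. This is a throwaway sketch of mine; the localhost URL only works when run on gallium itself:

import json
import urllib.request

DIRECT = 'http://127.0.0.1:8080/ci/api/json'               # Jenkins, bypassing the proxy layers
PROXIED = 'https://integration.wikimedia.org/ci/api/json'  # nginx + varnish in front

def decodes(url):
    body = urllib.request.urlopen(url).read()
    try:
        json.loads(body)
        return (True, len(body))
    except ValueError as exc:
        return (False, str(exc))

for url in (DIRECT, PROXIED):
    print(url, decodes(url))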

Some changes were made to Nginx yesterday by @BBlack as part of T107749: HTTP/1.1 keepalive for local nginx->varnish conns.

https://integration.wikimedia.org/ci/api/json started being corrupted at 2016-02-17 19:39:59, and there are a few related commits from around that time:

9c38a20 nginx keepalives: enable for upload (Wed Feb 17 22:39:24 2016 +0000) <Brandon Black>
5154c3e tlsproxy: set files ulimit at 2x conns (Wed Feb 17 22:13:28 2016 +0000) <Brandon Black>
a926147 enable tcp_tw_reuse on caches (Wed Feb 17 21:17:04 2016 +0000) <Brandon Black>
4103f6a nginx keepalives: enable for text (Wed Feb 17 19:59:51 2016 +0000) <Brandon Black>
59f52a9 tlsproxy: upstreams need distinct names (Wed Feb 17 19:25:48 2016 +0000) <Brandon Black>
fbeb033 nginx keepalives: enable for maps+misc (Wed Feb 17 19:05:17 2016 +0000) <Brandon Black>

What puzzles me is that it works fine from gallium but is borked from labnodepool1001.eqiad.wmnet and from my machine ...

curl with headers from my machine:

$ curl --silent --verbose  https://integration.wikimedia.org/ci/api/json |python -m json.tool>/dev/null
*   Trying 2620:0:862:ed1a::3:d...
* Connected to integration.wikimedia.org (2620:0:862:ed1a::3:d) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
* Server certificate: *.wikipedia.org
* Server certificate: GlobalSign Organization Validation CA - SHA256 - G2
* Server certificate: GlobalSign Root CA
> GET /ci/api/json HTTP/1.1
> Host: integration.wikimedia.org
> User-Agent: curl/7.43.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Server: nginx/1.9.4
< Date: Thu, 18 Feb 2016 12:25:12 GMT
< Content-Type: application/json;charset=UTF-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< Vary: X-Forwarded-Proto
< Strict-Transport-Security: max-age=31536000
< X-Content-Type-Options: nosniff
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
< X-Jenkins: 1.625.3
< X-Jenkins-Session: 5631497c
< Set-Cookie: JSESSIONID.5228d07a=1pe45ktk2tb8cbx4spczqtx11;Path=/ci;Secure;HttpOnly
< Backend-Timing: D=5878 t=1455798311935078
< Cache-Control: no-cache,must-revalidate
< X-Varnish: 559059089, 268932007
< Via: 1.1 varnish, 1.1 varnish
< Age: 0
< X-Cache: cp1069 pass(0), cp3020 frontend pass(0)
< Set-Cookie: WMF-Last-Access=18-Feb-2016;Path=/;HttpOnly;Expires=Mon, 21 Mar 2016 12:00:00 GMT
< X-Analytics: https=1;nocookies=1
< X-Client-IP: 2a01:e35:8be0:d0d0:90f3:540c:262e:3e1
< 
{ [462 bytes data]
* Connection #0 to host integration.wikimedia.org left intact
Invalid control character at: line 1 column 96213 (char 96212)

I noticed that after some amount of time the request passes just fine, but subsequent ones all fail. I gave it another try, forcing curl to use IPv4 (all the others went over IPv6). The first request passed; the second (and all the following ones) failed. Diff of the headers (a small reproduction sketch follows after the diff):

diff -u first-good.txt second-bad.txt 
--- first-good.txt	2016-02-18 13:27:12.000000000 +0100
+++ second-bad.txt	2016-02-18 13:27:27.000000000 +0100
@@ -11,7 +11,7 @@
 > 
 < HTTP/1.1 200 OK
 < Server: nginx/1.9.4
-< Date: Thu, 18 Feb 2016 12:26:11 GMT
+< Date: Thu, 18 Feb 2016 12:26:19 GMT
 < Content-Type: application/json;charset=UTF-8
 < Transfer-Encoding: chunked
 < Connection: keep-alive
@@ -21,16 +21,17 @@
 < Expires: Thu, 01 Jan 1970 00:00:00 GMT
 < X-Jenkins: 1.625.3
 < X-Jenkins-Session: 5631497c
-< Set-Cookie: JSESSIONID.5228d07a=1u4cnd15pwb5mrmqkpd5f3luk;Path=/ci;Secure;HttpOnly
-< Backend-Timing: D=16161 t=1455798371540989
+< Set-Cookie: JSESSIONID.5228d07a=1bqqxb4ybwnqu1ninnalrg55dt;Path=/ci;Secure;HttpOnly
+< Backend-Timing: D=15498 t=1455798379618953
 < Cache-Control: no-cache,must-revalidate
-< X-Varnish: 1037413737, 1489672389, 268932272
-< Via: 1.1 varnish, 1.1 varnish, 1.1 varnish
+< X-Varnish: 1755477751, 268932282
+< Via: 1.1 varnish, 1.1 varnish
 < Age: 0
-< X-Cache: cp1056 pass(0), cp3021 miss(0), cp3020 frontend miss(0)
+< X-Cache: cp1057 miss(0), cp3020 frontend pass(0)
 < Set-Cookie: WMF-Last-Access=18-Feb-2016;Path=/;HttpOnly;Expires=Mon, 21 Mar 2016 12:00:00 GMT
 < X-Analytics: https=1;nocookies=1
 < X-Client-IP: 88.190.13.13
 < 
-{ [444 bytes data]
+{ [484 bytes data]
 * Connection #0 to host integration.wikimedia.org left intact
+Invalid control character at: line 1 column 96210 (char 96209)
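
Here is the reproduction sketch for that pattern (my own illustration, using only the public URL): issue several requests in a row, each on a fresh connection as plain urllib does, and report whether each body decodes as JSON.

import json
import time
import urllib.request

URL = 'https://integration.wikimedia.org/ci/api/json'

for i in range(5):
    body = urllib.request.urlopen(URL).read()
    try:
        json.loads(body)
        status = 'OK'
    except ValueError as exc:
        status = 'corrupted: %s' % exc
    print('request %d: %d bytes, %s' % (i + 1, len(body), status))
    time.sleep(1)
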
hashar renamed this task from ERROR nodepool.NodePool: Unable to check status of gallium.wikimedia.org to https://integration.wikimedia.org/ci/api/json is corrupted when requested more than once in a row. Feb 18 2016, 12:30 PM
hashar added projects: Traffic, SRE.

Change 271559 had a related patch set uploaded (by BBlack):
cache_misc: disable do_gzip

https://gerrit.wikimedia.org/r/271559

Change 271559 merged by BBlack:
cache_misc: disable do_gzip

https://gerrit.wikimedia.org/r/271559

Fixed by disabling do_gzip on the misc cluster. With all of the random backend code on cache_misc and known deficiencies in do_gzip and varnish in general with certain kinds of backend responses, we're probably better off this way.

Nodepool has been happily reaching the Jenkins API since do_gzip was set to false. Note that using curl --compressed was always fine.
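
A quick way to mirror that curl observation (a throwaway check of mine, not part of the fix): request the API once advertising gzip support, as curl --compressed does, and once without, and see which body decodes cleanly.

import gzip
import json
import urllib.request

URL = 'https://integration.wikimedia.org/ci/api/json'

def fetch(accept_gzip):
    req = urllib.request.Request(URL)
    if accept_gzip:
        # curl --compressed adds this header and decompresses transparently.
        req.add_header('Accept-Encoding', 'gzip')
    resp = urllib.request.urlopen(req)
    body = resp.read()
    if resp.headers.get('Content-Encoding') == 'gzip':
        body = gzip.decompress(body)
    try:
        json.loads(body)
        return 'OK'
    except ValueError as exc:
        return 'corrupted: %s' % exc

print('with gzip:    %s' % fetch(True))
print('without gzip: %s' % fetch(False))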

I am wondering whether Varnish gzip handling could somehow involve ESI and end up gluing a plain response together with gzipped parts, getting confused along the way.

Regardless, I hope the real traffic (text / upload etc.) is not impacted.

Thank you @BBlack for the investigation and fast fix!

Change 273012 had a related patch set uploaded (by BBlack):
disable keepalives everywhere

https://gerrit.wikimedia.org/r/273012

Change 273012 merged by BBlack:
disable keepalives everywhere

https://gerrit.wikimedia.org/r/273012

Change 273013 had a related patch set uploaded (by BBlack):
Revert "cache_misc: disable do_gzip"

https://gerrit.wikimedia.org/r/273013

Change 273013 merged by BBlack:
Revert "cache_misc: disable do_gzip"

https://gerrit.wikimedia.org/r/273013

Verified on Nodepool; requests to Jenkins are all fine:

labnodepool1001:~$ grep -c JSON /var/log/nodepool/nodepool.log*
/var/log/nodepool/nodepool.log.2016-02-17:5584
/var/log/nodepool/nodepool.log.2016-02-18:1068
/var/log/nodepool/nodepool.log.2016-02-19:0
/var/log/nodepool/nodepool.log.2016-02-20:0
/var/log/nodepool/nodepool.log.2016-02-21:0
/var/log/nodepool/nodepool.log.2016-02-22:0
/var/log/nodepool/nodepool.log.2016-02-23:0
/var/log/nodepool/nodepool.log:0