Page MenuHomePhabricator

Vgutierrez (Valentín Gutiérrez)
Traffic Security Engineer

Today

  • Clear sailing ahead.

Tomorrow

  • Clear sailing ahead.

Monday

  • Clear sailing ahead.

User Details

User Since
Feb 12 2018, 9:51 AM (101 w, 4 d)
Availability
Available
IRC Nick
vgutierrez
LDAP User
Vgutierrez
MediaWiki User
Unknown

Recent Activity

Yesterday

Vgutierrez added a comment to T243591: varnishmtail panics on buster.

reported to upstream as https://github.com/google/mtail/issues/289

Fri, Jan 24, 2:43 PM · Operations, Traffic
Vgutierrez closed T243591: varnishmtail panics on buster, a subtask of T242093: Upgrade cache cluster to debian buster, as Resolved.
Fri, Jan 24, 2:28 PM · Patch-For-Review, Operations, Traffic
Vgutierrez closed T243591: varnishmtail panics on buster as Resolved.

Solved after rebuilding mtail 3.0.0~rc5 for buster

Fri, Jan 24, 2:28 PM · Operations, Traffic
Vgutierrez added a comment to T243591: varnishmtail panics on buster.

that panic comes from mtail itself, after all varnishmtail is running:

/usr/bin/varnishncsa -n frontend -c -b -F "${FMT}" | mtail -progs "${PROGS}" -logs /dev/stdin
Fri, Jan 24, 11:57 AM · Operations, Traffic
Vgutierrez triaged T243591: varnishmtail panics on buster as Medium priority.
Fri, Jan 24, 11:50 AM · Operations, Traffic
Vgutierrez created T243591: varnishmtail panics on buster.
Fri, Jan 24, 11:48 AM · Operations, Traffic

Thu, Jan 23

Vgutierrez added a comment to T242093: Upgrade cache cluster to debian buster.

on buster, systemd is not quite happy with trafficserver using /var/run:

Jan 23 17:15:40 cp4032 systemd[1]: /lib/systemd/system/trafficserver.service:8: PIDFile= references path below legacy directory /var/run/, updating /var/run/trafficserver/manager.lock → /run/trafficserver/manager.lock; please update the unit file accordingly.
Thu, Jan 23, 5:28 PM · Patch-For-Review, Operations, Traffic
Vgutierrez closed T238625: Remove nginx puppetization for cache text/text_ats as Resolved.
Thu, Jan 23, 4:25 PM · Traffic, Operations
Vgutierrez closed T236120: Get rid of nginx puppetization for cache upload as Resolved.
Thu, Jan 23, 4:25 PM · Traffic, Operations
Vgutierrez closed T236120: Get rid of nginx puppetization for cache upload, a subtask of T231433: Move cache upload cluster from nginx to ats-tls, as Resolved.
Thu, Jan 23, 4:25 PM · Traffic, Operations
Vgutierrez created P10249 (An Untitled Masterwork).
Thu, Jan 23, 3:13 PM
Vgutierrez closed T243506: buster installation issues on cache nodes, a subtask of T242093: Upgrade cache cluster to debian buster, as Resolved.
Thu, Jan 23, 1:19 PM · Patch-For-Review, Operations, Traffic
Vgutierrez closed T243506: buster installation issues on cache nodes as Resolved.

The final culprit were 3 syntax errors on netboot.cfg as part of https://gerrit.wikimedia.org/r/#/q/Id93d599c6ef0efc5caa2d8cccc83773644bd7ec6 as soon as they got fixed the installation worked as expected :)

Thu, Jan 23, 1:19 PM · Operations, Traffic
Vgutierrez triaged T243506: buster installation issues on cache nodes as Medium priority.
Thu, Jan 23, 11:45 AM · Operations, Traffic
Vgutierrez created T243506: buster installation issues on cache nodes.
Thu, Jan 23, 11:45 AM · Operations, Traffic

Wed, Jan 22

Vgutierrez triaged T243391: Upgrade ncredir cluster to buster as Medium priority.
Wed, Jan 22, 11:55 AM · Traffic, Operations
Vgutierrez created T243391: Upgrade ncredir cluster to buster.
Wed, Jan 22, 11:54 AM · Traffic, Operations

Tue, Jan 21

Vgutierrez created P10236 (An Untitled Masterwork).
Tue, Jan 21, 3:42 PM

Mon, Jan 20

Vgutierrez closed T242778: ATS strict round robin parent select policy doesn't work as expected, a subtask of T242620: ats-tls is having issues when varnish-fe goes away, as Resolved.
Mon, Jan 20, 4:08 PM · Patch-For-Review, Operations, Traffic
Vgutierrez closed T242778: ATS strict round robin parent select policy doesn't work as expected as Resolved.

Solved in 8.0.5-1wm12

Mon, Jan 20, 4:08 PM · Operations, Traffic
Vgutierrez added a comment to T242374: Set up git-driven static microsite for wikiworkshop.org.

Most of this has been configured now, the remaining slightly difficult bit is configuring an alternate SNI cert for the domain on our new ats-tls termination.

Mon, Jan 20, 12:33 PM · Patch-For-Review, Research, Operations, Traffic

Sun, Jan 19

Vgutierrez added a comment to T238305: servers freeze across the caching cluster.

Is there any action plan to investigate these issues?

Sun, Jan 19, 10:16 AM · Traffic, Operations

Thu, Jan 16

Vgutierrez triaged T242991: Analyze the impact of removing TLSv1/v1.1 on puppetmasters as Medium priority.
Thu, Jan 16, 4:21 PM · Operations, Traffic
Vgutierrez created T242991: Analyze the impact of removing TLSv1/v1.1 on puppetmasters.
Thu, Jan 16, 4:21 PM · Operations, Traffic
Vgutierrez closed T242321: Provide non-canonical-redirect service from every datacenter as Resolved.
Thu, Jan 16, 3:13 PM · Patch-For-Review, Operations, Traffic

Wed, Jan 15

Vgutierrez added a comment to T242778: ATS strict round robin parent select policy doesn't work as expected.

before enabling DNS resolution on cp3052:

vgutierrez@cp3052:~$ for port in {3120..3127}; do ss  "( dport = $port or sport = $port )" |wc -l; done
3
144
4
135
1
96
2
117

After:

vgutierrez@cp3052:~$ for port in {3120..3127}; do ss  "( dport = $port or sport = $port )" |wc -l; done
37
61
40
47
31
51
52
54
Wed, Jan 15, 10:23 AM · Operations, Traffic
Vgutierrez added a comment to T242778: ATS strict round robin parent select policy doesn't work as expected.

Apparently, when we disabled DNS resolution for parent requests to fix T232209 we introduced part of the issue. On short-lived ATS instances, enabling proxy.config.http.no_dns_just_forward_to_parent causes that traffic only hits odd parents, in our scenario, this means odd varnish-fe ports:

vgutierrez@cp3052:~$ for port in {3120..3127}; do ss  "( dport = $port or sport = $port )" |wc -l; done
1
128
3
111
2
125
1
141
Wed, Jan 15, 10:07 AM · Operations, Traffic

Tue, Jan 14

Vgutierrez created T242778: ATS strict round robin parent select policy doesn't work as expected.
Tue, Jan 14, 6:32 PM · Operations, Traffic
Vgutierrez closed T239141: Redirect all traffic for fixcopyright.wikimedia.org to https://policy.wikimedia.org/policy-landing/copyright/, a subtask of T238803: Retire fixcopyright.wikimedia.org, as Resolved.
Tue, Jan 14, 12:00 PM · Cleanup, Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), Patch-For-Review, Core Platform Team Workboards (Clinic Duty Team), fixcopyright.wikimedia.org, Wiki-Setup (Delete / Redirect), Traffic, Operations
Vgutierrez closed T239141: Redirect all traffic for fixcopyright.wikimedia.org to https://policy.wikimedia.org/policy-landing/copyright/ as Resolved.
vgutierrez@mw1321:~$ curl --resolve fixcopyright.wikimedia.org:443:$(dig +short mw1321.eqiad.wmnet) https://fixcopyright.wikimedia.org -v -o /dev/null 2>&1 |grep location:
location: https://policy.wikimedia.org/policy-landing/copyright/
Tue, Jan 14, 12:00 PM · fixcopyright.wikimedia.org, Traffic, Operations

Mon, Jan 13

Vgutierrez added a comment to T242620: ats-tls is having issues when varnish-fe goes away.

Apparently ATS solves our current issue about TTFB VS connect timeout on https://github.com/apache/trafficserver/pull/4028, this is currently backported to 7.x and 9.x, I'm checking with upstream what's the current blocker for 8,x.

Mon, Jan 13, 5:20 PM · Patch-For-Review, Operations, Traffic
Vgutierrez moved T242620: ats-tls is having issues when varnish-fe goes away from Triage to TLS on the Traffic board.
Mon, Jan 13, 3:58 PM · Patch-For-Review, Operations, Traffic
Vgutierrez updated the task description for T242620: ats-tls is having issues when varnish-fe goes away.
Mon, Jan 13, 3:58 PM · Patch-For-Review, Operations, Traffic
Vgutierrez created T242620: ats-tls is having issues when varnish-fe goes away.
Mon, Jan 13, 3:21 PM · Patch-For-Review, Operations, Traffic
Vgutierrez created P10136 (An Untitled Masterwork).
Mon, Jan 13, 11:25 AM

Thu, Jan 9

Vgutierrez moved T242321: Provide non-canonical-redirect service from every datacenter from Triage to TLS on the Traffic board.
Thu, Jan 9, 8:42 AM · Patch-For-Review, Operations, Traffic
Vgutierrez renamed T242321: Provide non-canonical-redirect service from every datacenter from Provide non-canonical-redirect from every datacenter to Provide non-canonical-redirect service from every datacenter.
Thu, Jan 9, 8:41 AM · Patch-For-Review, Operations, Traffic
Vgutierrez created T242321: Provide non-canonical-redirect service from every datacenter.
Thu, Jan 9, 8:40 AM · Patch-For-Review, Operations, Traffic

Wed, Jan 8

Vgutierrez moved T242093: Upgrade cache cluster to debian buster from Triage to Caching on the Traffic board.
Wed, Jan 8, 9:40 AM · Patch-For-Review, Operations, Traffic
Vgutierrez moved T242200: Docker registry needs cache to vary on Accept header value from Triage to Caching on the Traffic board.
Wed, Jan 8, 9:40 AM · Traffic, Operations

Tue, Jan 7

Vgutierrez triaged T242093: Upgrade cache cluster to debian buster as Medium priority.
Tue, Jan 7, 11:26 AM · Patch-For-Review, Operations, Traffic
Vgutierrez created T242093: Upgrade cache cluster to debian buster.
Tue, Jan 7, 11:18 AM · Patch-For-Review, Operations, Traffic

Thu, Jan 2

Vgutierrez closed T231286: Track TLS related ATS metrics in prometheus, a subtask of T221594: Puppetize ATS TLS configuration for incoming traffic, as Resolved.
Thu, Jan 2, 1:36 PM · Traffic, Operations
Vgutierrez closed T231286: Track TLS related ATS metrics in prometheus as Resolved.

We currently have 5 SSL/TLS related panels in the ATS instance drilldown

Thu, Jan 2, 1:36 PM · Traffic, Operations
Vgutierrez added a comment to T239386: memory leak on keyholder-proxy on buster/python 3.7.

@Volans I think we could close this task already as everything seems healthy on acmechief1001

Thu, Jan 2, 11:42 AM · Acme-chief, Traffic, Operations
Vgutierrez closed T241132: wikimedia.community domain name is not resolving an mx record as Resolved.
$ host -t mx wikimedia.community
wikimedia.community mail is handled by 10 mx1001.wikimedia.org.
wikimedia.community mail is handled by 50 mx2001.wikimedia.org.
Thu, Jan 2, 11:29 AM · Mail, Traffic, Operations, DNS

Dec 19 2019

Vgutierrez created T241145: Improve ATS backend connection reuse against origin servers.
Dec 19 2019, 1:57 PM · Performance-Team (Radar), Patch-For-Review, Traffic, Operations

Nov 30 2019

Vgutierrez added a comment to T238305: servers freeze across the caching cluster.

It was observed earlier in the traffic meeting that we're fairly certain that none of our R440 hosts have had this problem more than once, so this may be a "once per server" phenomenon, in which case it's also quite likely this can be pre-empted on the ones that haven't crashed yet by giving them a reboot (e.g. something deep has changed while the servers are live, and they stabilize once they've done a fresh boot with it, possibly a live update of some microcode or firmware?)

Nov 30 2019, 7:40 AM · Traffic, Operations
Vgutierrez updated the task description for T238305: servers freeze across the caching cluster.
Nov 30 2019, 7:38 AM · Traffic, Operations
Vgutierrez closed T239502: cp3057 crashed, a subtask of T238305: servers freeze across the caching cluster, as Resolved.
Nov 30 2019, 7:38 AM · Traffic, Operations
Vgutierrez closed T239502: cp3057 crashed as Resolved.

As with other occurrences of T238305, nothing on the console, nothing on the SEL and nothing weird on the logs.

Nov 30 2019, 7:37 AM · Traffic, Operations
Vgutierrez created T239502: cp3057 crashed.
Nov 30 2019, 7:28 AM · Traffic, Operations

Nov 29 2019

Vgutierrez added a subtask for T238305: servers freeze across the caching cluster: T239449: cp1087 reboot.
Nov 29 2019, 2:17 AM · Traffic, Operations
Vgutierrez added a parent task for T239449: cp1087 reboot: T238305: servers freeze across the caching cluster.
Nov 29 2019, 2:17 AM · Operations, Traffic
Vgutierrez updated the task description for T238305: servers freeze across the caching cluster.
Nov 29 2019, 2:17 AM · Traffic, Operations
Vgutierrez closed T239449: cp1087 reboot as Resolved.

Yep, this is most likely another occurrence of T238305

Nov 29 2019, 2:17 AM · Operations, Traffic
Vgutierrez moved T239449: cp1087 reboot from Triage to Hardware on the Traffic board.
Nov 29 2019, 2:15 AM · Operations, Traffic

Nov 28 2019

Vgutierrez added a comment to T238720: Deprecate and disable port 80 for one-off sites under canonical domains.

Hmm with HSTS the browser shouldn't even try port 80.

Nov 28 2019, 11:30 AM · Traffic, Operations
Vgutierrez created T239386: memory leak on keyholder-proxy on buster/python 3.7.
Nov 28 2019, 8:06 AM · Acme-chief, Traffic, Operations

Nov 27 2019

Vgutierrez closed T239310: cp3063 crashed, a subtask of T238305: servers freeze across the caching cluster, as Resolved.
Nov 27 2019, 7:15 AM · Traffic, Operations
Vgutierrez closed T239310: cp3063 crashed as Resolved.

As "expected" nothing on SEL or logs, this is yet another occurrence of T238305

Nov 27 2019, 7:15 AM · Traffic, Operations
Vgutierrez updated the task description for T238305: servers freeze across the caching cluster.
Nov 27 2019, 7:05 AM · Traffic, Operations
Vgutierrez created T239310: cp3063 crashed.
Nov 27 2019, 7:03 AM · Traffic, Operations

Nov 26 2019

Vgutierrez added a comment to T233274: ATS lua script reload doesn't work as expected.

I've debugged locally what we seen yesterday on production with the following lua script:

WEBSOCKET_SUPPORT = nil
function __init__(argtb)
    dofile("/etc/trafficserver/lua/config.lua")
    ts.debug("On init WEBSOCKET_SUPPORT = " .. tostring(WEBSOCKET_SUPPORT))
end
function __reload__()
    dofile("/etc/trafficserver/lua/config.lua")
    ts.debug("On reload WEBSOCKET_SUPPORT = " .. tostring(WEBSOCKET_SUPPORT))
end
function get_websocket_support()
    return WEBSOCKET_SUPPORT
end
function do_global_send_request()
    ts.debug("WEBSOCKET_SUPPORT = " .. tostring(WEBSOCKET_SUPPORT))
Nov 26 2019, 4:04 AM · Patch-For-Review, Operations, Traffic

Nov 25 2019

Vgutierrez added a comment to T238900: add TLS support for smokeping.wikimedia.org.

@Volans @ayounsi IMHO it doesn't make any sense to include smokeping.wm.o SNI on the librenms certificate, that would set a dependency between otherwise completely isolated services. So I'd say we continue with the 2 certs approach (https://gerrit.wikimedia.org/r/552398)

Nov 25 2019, 12:49 PM · netops, Operations, Traffic
Vgutierrez closed T239041: cp3053 is unreachable, a subtask of T238305: servers freeze across the caching cluster, as Resolved.
Nov 25 2019, 3:14 AM · Traffic, Operations
Vgutierrez closed T239041: cp3053 is unreachable as Resolved.

Nothing on the logs or on SEL

Nov 25 2019, 3:14 AM · Operations, Traffic, ops-esams
Vgutierrez added a parent task for T239041: cp3053 is unreachable: T238305: servers freeze across the caching cluster.
Nov 25 2019, 3:09 AM · Operations, Traffic, ops-esams
Vgutierrez added a subtask for T238305: servers freeze across the caching cluster: T239041: cp3053 is unreachable.
Nov 25 2019, 3:09 AM · Traffic, Operations
Vgutierrez updated the task description for T238305: servers freeze across the caching cluster.
Nov 25 2019, 3:09 AM · Traffic, Operations
Vgutierrez moved T239041: cp3053 is unreachable from Triage to Hardware on the Traffic board.
Nov 25 2019, 3:03 AM · Operations, Traffic, ops-esams
Vgutierrez triaged T239041: cp3053 is unreachable as Medium priority.
Nov 25 2019, 3:03 AM · Operations, Traffic, ops-esams
Vgutierrez created T239041: cp3053 is unreachable.
Nov 25 2019, 2:47 AM · Operations, Traffic, ops-esams
Vgutierrez added a comment to T238724: ATS logs aren't being rotated.

upon manual removal of the empty error.log file, trafficserver creates a new one (without issuing a reload):

-rw-r--r--  1 trafficserver trafficserver 2.8K Nov 25 02:17 error.log
-rw-r--r--  1 trafficserver trafficserver  44M Nov 25 02:14 error.log.1
Nov 25 2019, 2:18 AM · Operations, Traffic
Vgutierrez added a comment to T238724: ATS logs aren't being rotated.

Even though moving (mv) the file manually works as expected without a reload, the rotation triggered by logrotate isn't forcing ATS to open a new file:

-rw-r--r--  1 trafficserver trafficserver    0 Nov 24 06:25 error.log
-rw-r--r--  1 trafficserver trafficserver  11M Nov 25 01:44 error.log.1
Nov 25 2019, 1:49 AM · Operations, Traffic
Vgutierrez added a comment to T238901: Wikibase's Special:EntityData should not emit when responding with HTTP code 304.

last BAD_INCOMING_RESPONSE error on cp1075 was a few seconds before deploying this:

20191122.15h16m44s CONNECT:[0] could not connect [BAD_INCOMING_RESPONSE] to 10.2.2.1 for 'https://appservers-rw.discovery.wmnet/wiki/Q76150152?action=constraintsrdf&nocache=1574435804338
Nov 25 2019, 1:42 AM · MW-1.35-notes (1.35.0-wmf.5; 2019-11-05), User-Ladsgroup, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata

Nov 22 2019

Vgutierrez updated subscribers of T238900: add TLS support for smokeping.wikimedia.org.

@Volans @crusnov @ayounsi we need some clarification regarding TLS material on netmon boxes, right now they get access to librenms and netbox acme-chief managed certificates. Is netbox still needed there?

Nov 22 2019, 8:24 AM · netops, Operations, Traffic
Vgutierrez created T238900: add TLS support for smokeping.wikimedia.org.
Nov 22 2019, 8:06 AM · netops, Operations, Traffic
Vgutierrez added a parent task for T237319: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content): T236988: ats-be on the text cluster is experiencing broken connections.
Nov 22 2019, 7:58 AM · User-Ladsgroup, Wikidata, Wikidata-Campsite, Operations, Traffic, User-DannyS712
Vgutierrez added a subtask for T236988: ats-be on the text cluster is experiencing broken connections: T237319: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content).
Nov 22 2019, 7:58 AM · Operations, Traffic
Vgutierrez added a comment to T238724: ATS logs aren't being rotated.

built-in support for rotating the diags.log triggers the rotation of diags.log and manager.log, but it keeps error.log untouched:

vgutierrez@cp1075:/var/log/trafficserver$ ls -alh
total 945M
drwxr-x---  2 trafficserver adm           4.0K Nov 22 04:25 .
drwxr-xr-x 17 root          root          4.0K Nov 21 13:20 ..
-rw-r--r--  1 trafficserver trafficserver   27 Nov 22 04:25 .diags.log.meta
-rw-r--r--  1 trafficserver trafficserver   48 Aug 27 12:36 .error.log.meta
-rw-r--r--  1 trafficserver trafficserver   27 Nov 22 04:25 .manager.log.meta
-rw-r--r--  1 trafficserver trafficserver    0 Nov 22 04:25 diags.log
-rw-r--r--  1 trafficserver trafficserver 426K Nov 21 02:57 diags.log.20190827.08h56m54s-20191122.04h25m30s.old
-rw-r--r--  1 trafficserver trafficserver 944M Nov 22 04:34 error.log
-rw-r--r--  1 trafficserver trafficserver    0 Nov 22 04:25 manager.log
-rw-r--r--  1 trafficserver trafficserver 1.6K Nov 18 13:29 manager.log.20190827.08h56m51s-20191122.04h25m26s.old
Nov 22 2019, 4:34 AM · Operations, Traffic
Vgutierrez moved T238050: envoy overwrites the server header from Triage to TLS on the Traffic board.
Nov 22 2019, 4:04 AM · RESTBase, Traffic, Operations
Vgutierrez moved T238724: ATS logs aren't being rotated from Triage to Caching on the Traffic board.
Nov 22 2019, 4:04 AM · Operations, Traffic
Vgutierrez added a comment to T237319: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content).

And I can reproduce the issue as well with docker-registry.wikimedia.org/dev/stretch-php72-fpm-apache2.
Dockerfile:

FROM docker-registry.wikimedia.org/dev/stretch-php72-fpm-apache2
RUN mkdir -p /tmp/php
RUN sed -i s/runuser/www-data/ /etc/apache2/apache2.conf
COPY poc.php /var/www/html
Nov 22 2019, 3:00 AM · User-Ladsgroup, Wikidata, Wikidata-Campsite, Operations, Traffic, User-DannyS712
Vgutierrez added a comment to T237319: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content).

And using a dumb TCP client we can see the 4 excess bytes:

vgutierrez@mwdebug2001:~$ python3
[...]
>>> import socket
>>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> s.connect(('10.192.0.98', 80))
>>> s.send("GET /w/vgutierrez.php HTTP/1.1\r\nHost: en.wikipedia.org\r\n\r\n".encode())
58
>>> s.recv(1024)
b'HTTP/1.1 304 Not Modified\r\nDate: Fri, 22 Nov 2019 02:33:07 GMT\r\nServer: mwdebug2001.codfw.wmnet\r\n\r\ntest'
>>> s.close()
Nov 22 2019, 2:35 AM · User-Ladsgroup, Wikidata, Wikidata-Campsite, Operations, Traffic, User-DannyS712
Vgutierrez added a comment to T237319: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content).

please note that I cannot reproduce on mwdebug using curl -i, see the difference:

vgutierrez@mwdebug2001:~$ curl --resolve en.wikipedia.org:80:10.192.0.98 http://en.wikipedia.org/w/vgutierrez.php -v
* Added en.wikipedia.org:80:10.192.0.98 to DNS cache
* Hostname en.wikipedia.org was found in DNS cache
*   Trying 10.192.0.98...
* TCP_NODELAY set
* Connected to en.wikipedia.org (10.192.0.98) port 80 (#0)
> GET /w/vgutierrez.php HTTP/1.1
> Host: en.wikipedia.org
> User-Agent: curl/7.52.1
> Accept: */*
>
< HTTP/1.1 304 Not Modified
< Date: Fri, 22 Nov 2019 02:26:32 GMT
< Server: mwdebug2001.codfw.wmnet
<
* Excess found in a non pipelined read: excess = 4 url = /w/vgutierrez.php (zero-length body)
* Curl_http_done: called premature == 0
* Connection #0 to host en.wikipedia.org left intact
vgutierrez@mwdebug2001:~$ curl --resolve en.wikipedia.org:80:10.192.0.98 http://en.wikipedia.org/w/vgutierrez.php -i
HTTP/1.1 304 Not Modified
Date: Fri, 22 Nov 2019 02:26:35 GMT
Server: mwdebug2001.codfw.wmnet
Nov 22 2019, 2:27 AM · User-Ladsgroup, Wikidata, Wikidata-Campsite, Operations, Traffic, User-DannyS712

Nov 21 2019

Vgutierrez added a comment to T237319: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content).

so it looks like the apache fix doesn't fix every scenario. This small PoC triggers the issue:

<?php
http_response_code(304);
echo("test");
flush();
Nov 21 2019, 10:08 AM · User-Ladsgroup, Wikidata, Wikidata-Campsite, Operations, Traffic, User-DannyS712
Vgutierrez added a comment to T237319: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content).

The patch seems to be there, see https://salsa.debian.org/apache-team/apache2/blob/debian/2.4.25-3+deb9u7/modules%2Fproxy%2Fmod_proxy_fcgi.c#L663-675:

else if (status == HTTP_NOT_MODIFIED
         || status == HTTP_PRECONDITION_FAILED) {
    /* Special 'status' cases handled:
     * 1) HTTP 304 response MUST NOT contain
     *    a message-body, ignore it.
     * 2) HTTP 412 response.
     * The break is not added since there might
     * be more bytes to read from the FCGI
     * connection. Even if the message-body is
     * ignored (and the EOS bucket has already
     * been sent) we want to avoid subsequent
     * bogus reads. */
    ignore_body = 1;
Nov 21 2019, 9:03 AM · User-Ladsgroup, Wikidata, Wikidata-Campsite, Operations, Traffic, User-DannyS712
Vgutierrez added a comment to T237319: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content).

reproducing it with furl shows that php-fpm returns a 304 with a body, and it looks like apache2 2.4.25-3+deb9u7 is failing to handle that

Nov 21 2019, 8:54 AM · User-Ladsgroup, Wikidata, Wikidata-Campsite, Operations, Traffic, User-DannyS712
Vgutierrez added a comment to T237319: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content).

@Joe it seems pretty easy to trigger:

vgutierrez@mw1330:~$ curl --resolve www.wikidata.org:80:10.64.32.32 "www.wikidata.org/wiki/Special:EntityData/Q38646387.json" -o /dev/null -v -H 'if-modified-since: Sat, 30 Mar 2019 07:05:28 GMT'
* Added www.wikidata.org:80:10.64.32.32 to DNS cache
* Hostname www.wikidata.org was found in DNS cache
*   Trying 10.64.32.32...
* TCP_NODELAY set
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to www.wikidata.org (10.64.32.32) port 80 (#0)
> GET /wiki/Special:EntityData/Q38646387.json HTTP/1.1
> Host: www.wikidata.org
> User-Agent: curl/7.52.1
> Accept: */*
> if-modified-since: Sat, 30 Mar 2019 07:05:28 GMT
>
< HTTP/1.1 304 Not Modified
< Date: Thu, 21 Nov 2019 07:44:56 GMT
< Server: mw1330.eqiad.wmnet
< Expires: Thu, 21 Nov 2019 07:44:56 GMT
< Cache-Control: private, must-revalidate, max-age=0
< Vary: Accept-Encoding,Cookie,Authorization
<
* Excess found in a non pipelined read: excess = 13881 url = /wiki/Special:EntityData/Q38646387.json (zero-length body)
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Curl_http_done: called premature == 0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Connection #0 to host www.wikidata.org left intact
Nov 21 2019, 7:47 AM · User-Ladsgroup, Wikidata, Wikidata-Campsite, Operations, Traffic, User-DannyS712
Vgutierrez added a comment to T237319: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content).

Interesting enough, on the apache access log, that request looks OK:

2019-11-21T02:57:04	73916	10.64.0.62	proxy:unix:/run/php/fpm-www.sock|fcgi://localhost/304	14079	GET	http://www.wikidata.org/wiki/Special:EntityData/Q38646387.json?vgutierrez=1&RANDOM=14622	-	-	-	10.64.0.130	curl/7.52.1	-	-	-	10.64.0.62
Nov 21 2019, 6:38 AM · User-Ladsgroup, Wikidata, Wikidata-Campsite, Operations, Traffic, User-DannyS712
Vgutierrez added a comment to T237319: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content).

it looks like the issue is happening at mw servers as well, nginx (TLS termination for appservers-rw.discovery.wmnet) is screaming a lot with wikidata requests, for the request I've pasted in the previous comment:

2019/11/21 02:57:04 [error] 20871#20871: *165777371 upstream prematurely closed connection while reading upstream, client: 10.64.0.130, server: www.wikimedia.org, request: "GET /wiki/Special:EntityData/Q38646387.json?vgutierrez=1&RANDOM=14622 HTTP/1.1", upstream: "http://10.64.0.62:80/wiki/Special:EntityData/Q38646387.json?vgutierrez=1&RANDOM=14622", host: "www.wikidata.org"
Nov 21 2019, 4:12 AM · User-Ladsgroup, Wikidata, Wikidata-Campsite, Operations, Traffic, User-DannyS712
Vgutierrez added a comment to T237319: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content).

nice stap, playing a little bit with debug logging for a specific client ip I've been able to get the same information:

[Nov 21 02:57:06.975] {0x2b12a4d77700} DEBUG: <HttpTransact.cc:6371 (is_response_valid)> (http_trans) [2494107211] [is_response_valid] Response Error: Missing status code
[Nov 21 02:57:06.975] {0x2b12a4d77700} DEBUG: <HttpTransact.cc:3448 (handle_response_from_server)> (http_trans) [2494107211] [handle_response_from_server] (hrfs)
[Nov 21 02:57:06.975] {0x2b12a4d77700} DEBUG: <HttpTransact.cc:3606 (retry_server_connection_not_open)> (http_trans) [2494107211] [0] failed to connect [2] to 10.2.2.1
[Nov 21 02:57:06.975] {0x2b12a4d77700} DEBUG: <HttpTransact.cc:3627 (retry_server_connection_not_open)> (http_trans) [2494107211] [retry_server_connection_not_open] attempts now: 1, max: 3
[Nov 21 02:57:06.975] {0x2b12a4d77700} DEBUG: <HttpTransact.cc:3518 (handle_response_from_server)> (http_trans) [2494107211] [handle_response_from_server] Error. Retrying...
Nov 21 2019, 3:06 AM · User-Ladsgroup, Wikidata, Wikidata-Campsite, Operations, Traffic, User-DannyS712
Vgutierrez added a comment to T238781: dropped packets to phab1003 22280/tcp.

@Dzahn that's right, Removing the wss:// -> ws:// from ats-tls doesn't stop the requests from reaching varnish-fe as they are accepted as part of the catch-all remap rule.

Nov 21 2019, 1:34 AM · Operations, serviceops

Nov 20 2019

Vgutierrez added a comment to T237319: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content).

@Joe on the other hand ATS doesn't reap connections for hosts marked as down, and because ats-be uses KA it should have plenty of available connections against appserver-rw.discovery.wmnet. I believe this is the reason why we aren't seeing a broader affectation right now

Nov 20 2019, 10:56 AM · User-Ladsgroup, Wikidata, Wikidata-Campsite, Operations, Traffic, User-DannyS712
Vgutierrez added a comment to T237319: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content).

One real problem is: some of our requests are incredibly long and might overflow the timeouts in ATS-BE - in that case, pybal won't depool a backend but we might still get an error counter increased, correct?

Nov 20 2019, 10:47 AM · User-Ladsgroup, Wikidata, Wikidata-Campsite, Operations, Traffic, User-DannyS712
Vgutierrez added a comment to T237319: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content).

nope, a 5xx doesn't translate to BAD_INCOMING_RESPONSE, actually is specifically whitelisted:

case STATUS_CODE_SERVER_ERROR:
    TxnDebug("http_trans", "[is_response_valid] Response Error: Origin Server returned 500 - allowing");
    return true;
Nov 20 2019, 10:10 AM · User-Ladsgroup, Wikidata, Wikidata-Campsite, Operations, Traffic, User-DannyS712
Vgutierrez added a comment to T237319: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content).

I find this pretty worrisome for the following reasons:

  1. right now we have one remap rule that catches all the requests handled by appservers-rw. This means that ATS only tracks one counter of server connection retries for all the sites handled by appservers-rw.discovery.wmnet, wikipedia, wikidata, wikivoyage.
  2. if one mw server behind appservers-rw.discovery.wmnet misbehaves and triggers enough BAD_INCOMING_RESPONSE errors, it could trigger ATS to mark appservers-rw.discovery.wmnet as down
  3. At that point that ats-be instance would return 5xx (502/504) for every request that otherwise would be handled by appservers-rw.discovery.wmnet
Nov 20 2019, 9:57 AM · User-Ladsgroup, Wikidata, Wikidata-Campsite, Operations, Traffic, User-DannyS712
Vgutierrez added a comment to T237319: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content).

From ATS source code, it looks like ATS logs connect errors on the first attempt but it's configured to perform 3 attempts, so a log line indicating a connection error isn't the same as a 502 iff the request is retryable:

//////////////////////////////////////////
// on the first connect attempt failure //
// record the failue                   //
//////////////////////////////////////////
if (0 == s->current.attempts) {
  Log::error("CONNECT:[%d] could not connect [%s] to %s for '%s'", s->current.attempts,
             HttpDebugNames::get_server_state_name(conn_state),
             ats_ip_ntop(&s->current.server->dst_addr.sa, addrbuf, sizeof(addrbuf)), url_string ? url_string : "<none>");
}
vgutierrez@cp1075:/var/log/trafficserver$ sudo -i traffic_ctl config match connect_attempts_max_retries
proxy.config.http.connect_attempts_max_retries: 3
proxy.config.http.connect_attempts_max_retries_dead_server: 1
Nov 20 2019, 9:18 AM · User-Ladsgroup, Wikidata, Wikidata-Campsite, Operations, Traffic, User-DannyS712
Vgutierrez updated the task description for T238724: ATS logs aren't being rotated.
Nov 20 2019, 7:36 AM · Operations, Traffic