Page MenuHomePhabricator

Vgutierrez (Valentín Gutiérrez)
Traffic Security Engineer

Today

  • Clear sailing ahead.

Tomorrow

  • Clear sailing ahead.

Sunday

  • Clear sailing ahead.

User Details

User Since
Feb 12 2018, 9:51 AM (83 w, 4 d)
Availability
Available
IRC Nick
vgutierrez
LDAP User
Vgutierrez
MediaWiki User
Unknown

Recent Activity

Today

Vgutierrez closed T229091: acme-chief failing in puppet with "Cannot open input file" as Resolved.
Fri, Sep 20, 9:07 AM · Traffic, Operations
Vgutierrez closed T231849: Tune ATS SSL session cache as Resolved.
Fri, Sep 20, 9:06 AM · Operations, Traffic
Vgutierrez closed T233369: Blubberoid endpoint intermittently routing to MediaWiki backend as Resolved.

before:

vgutierrez@cp1075:~$ curl -H 'Host: blubberoid.wikimedia.org' -X POST -D - -H 'content-type: application/yaml' --data-binary @<(echo -e 'version: v4\nbase: docker-registry.wikimedia.org/foo\nvariants: { test: {} }') http://127.0.0.1:3128/v1/test
HTTP/1.1 404 Not Found
Date: Fri, 20 Sep 2019 04:44:28 GMT
Content-Type: text/html; charset=utf-8
Server: mw1324.eqiad.wmnet
X-Powered-By: PHP/7.2.22-1+0~20190902.26+debian9~1.gbpd64eb7+wmf1
Cache-Control: s-maxage=600
Backend-Timing: D=1267 t=1568954668293093
Vary: X-Seven
Age: 0
Transfer-Encoding: chunked
Connection: keep-alive
X-Cache-Int: cp1075 miss
X-ATS-Timestamp: 1568954668
Fri, Sep 20, 4:53 AM · Traffic, Operations, Release-Engineering-Team-TODO
Vgutierrez moved T233369: Blubberoid endpoint intermittently routing to MediaWiki backend from Triage to Caching on the Traffic board.
Fri, Sep 20, 4:19 AM · Traffic, Operations, Release-Engineering-Team-TODO
Vgutierrez moved T233373: Sections on some mobile pages are not collabsable from Triage to Caching on the Traffic board.
Fri, Sep 20, 4:19 AM · Operations, Traffic, MobileFrontend

Yesterday

Vgutierrez moved T232988: Use acme-chief provided OCSP stapling responses from Triage to TLS on the Traffic board.
Thu, Sep 19, 5:39 AM · Traffic, Operations, Acme-chief
Vgutierrez added a project to T232988: Use acme-chief provided OCSP stapling responses: Traffic.
Thu, Sep 19, 5:39 AM · Traffic, Operations, Acme-chief
Vgutierrez closed T232298: Investigate segfaults on ats-tls running on cp5001 as Resolved.
Thu, Sep 19, 5:35 AM · Patch-For-Review, Traffic, Operations
Vgutierrez closed T232298: Investigate segfaults on ats-tls running on cp5001 , a subtask of T231433: Move cache upload cluster from nginx to ats-tls, as Resolved.
Thu, Sep 19, 5:35 AM · Patch-For-Review, Traffic, Operations
Vgutierrez moved T233274: ATS lua script reload doesn't work as expected from Triage to Caching on the Traffic board.
Thu, Sep 19, 5:34 AM · Operations, Traffic
Vgutierrez triaged T233274: ATS lua script reload doesn't work as expected as Normal priority.
Thu, Sep 19, 5:34 AM · Operations, Traffic
Vgutierrez created T233274: ATS lua script reload doesn't work as expected.
Thu, Sep 19, 5:34 AM · Operations, Traffic

Wed, Sep 18

Vgutierrez closed T233205: Higher failed fetches error rate on some caching servers , a subtask of T231433: Move cache upload cluster from nginx to ats-tls, as Resolved.
Wed, Sep 18, 12:26 PM · Patch-For-Review, Traffic, Operations
Vgutierrez closed T233205: Higher failed fetches error rate on some caching servers as Resolved.

Solved by preventing Proxy-Connection from spreading across varnish-fe and ats-be, thanks for reporting the issue @jijiki

Wed, Sep 18, 12:26 PM · Operations, Traffic
Vgutierrez added a comment to T233205: Higher failed fetches error rate on some caching servers .

It looks like ats-tls setting Proxy-Connectionto Close is messing with varnish-fe<-->ats-be connections as it can be seen in https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=upload&var-server=All&var-layer=frontend&from=1568203523676&to=1568808323676&refresh=1m

Wed, Sep 18, 12:18 PM · Operations, Traffic
Vgutierrez moved T233205: Higher failed fetches error rate on some caching servers from Triage to Caching on the Traffic board.
Wed, Sep 18, 11:04 AM · Operations, Traffic

Tue, Sep 17

Vgutierrez closed T232724: ATS SSL session cache doesn't work, a subtask of T231433: Move cache upload cluster from nginx to ats-tls, as Resolved.
Tue, Sep 17, 6:02 AM · Patch-For-Review, Traffic, Operations
Vgutierrez closed T232724: ATS SSL session cache doesn't work as Resolved.

Issue looks solved after upgrading to 8.0.5-1wm7:

vgutierrez@cp4021:~$ openssl s_client -connect 127.0.0.1:443 -reconnect < /dev/null 2>1 |egrep -i "reconnect|reused"
drop connection and then reconnect
Reused, TLSv1.2, Cipher is ECDHE-ECDSA-CHACHA20-POLY1305
drop connection and then reconnect
Reused, TLSv1.2, Cipher is ECDHE-ECDSA-CHACHA20-POLY1305
drop connection and then reconnect
Reused, TLSv1.2, Cipher is ECDHE-ECDSA-CHACHA20-POLY1305
drop connection and then reconnect
Reused, TLSv1.2, Cipher is ECDHE-ECDSA-CHACHA20-POLY1305
drop connection and then reconnect
Reused, TLSv1.2, Cipher is ECDHE-ECDSA-CHACHA20-POLY1305
Tue, Sep 17, 6:02 AM · Traffic, Operations
Vgutierrez closed T232724: ATS SSL session cache doesn't work, a subtask of T231849: Tune ATS SSL session cache, as Resolved.
Tue, Sep 17, 6:02 AM · Operations, Traffic

Mon, Sep 16

Vgutierrez added a comment to T210411: Applayer services without TLS.

Please note that the docker-registry certificate is missing the public hostname: docker-registry.wikimedia.org

Mon, Sep 16, 3:05 PM · Patch-For-Review, serviceops, Operations, Traffic
Vgutierrez triaged T232988: Use acme-chief provided OCSP stapling responses as Normal priority.
Mon, Sep 16, 3:04 PM · Traffic, Operations, Acme-chief
Vgutierrez created T232988: Use acme-chief provided OCSP stapling responses.
Mon, Sep 16, 10:26 AM · Traffic, Operations, Acme-chief

Fri, Sep 13

Vgutierrez added a comment to T232724: ATS SSL session cache doesn't work.

Reported and proposed a solution to upstream in https://github.com/apache/trafficserver/pull/5935

Fri, Sep 13, 10:38 AM · Traffic, Operations

Thu, Sep 12

Vgutierrez added a comment to T232724: ATS SSL session cache doesn't work.

It looks like the culprit is https://github.com/apache/trafficserver/commit/03734d05e28af8a7b105a0579056c913fb5d1bc5, I've tested

https://gerrit.wikimedia.org/g/operations/debs/trafficserver/+/refs/changes/02/536302/3/debian/patches/0026-perform-quiet-ssl-shutdown.patch
``` on 8.0.5-1wm7 and it works:

willikins:~ vgutierrez$ openssl s_client -connect 127.0.0.1:3129 -reconnect < /dev/null 2>&1| egrep -i "reconnect|reused"
drop connection and then reconnect
Reused, TLSv1/SSLv3, Cipher is ECDHE-ECDSA-AES256-GCM-SHA384
drop connection and then reconnect
Reused, TLSv1/SSLv3, Cipher is ECDHE-ECDSA-AES256-GCM-SHA384
drop connection and then reconnect
Reused, TLSv1/SSLv3, Cipher is ECDHE-ECDSA-AES256-GCM-SHA384
drop connection and then reconnect
Reused, TLSv1/SSLv3, Cipher is ECDHE-ECDSA-AES256-GCM-SHA384
drop connection and then reconnect
Reused, TLSv1/SSLv3, Cipher is ECDHE-ECDSA-AES256-GCM-SHA384```

Thu, Sep 12, 6:36 PM · Traffic, Operations
Vgutierrez added a comment to T232724: ATS SSL session cache doesn't work.

After debugging why the eviction is triggered:

#0  ssl_rm_cached_session (ctx=0x188db10, sess=0x2ad5e802a910) at SSLUtils.cc:304
#1  0x00002ad5d30030f9 in remove_session_lock (ctx=0x188db10, c=0x2ad5e802a910, lck=<optimized out>) at ../ssl/ssl_sess.c:738
#2  0x00002ad5d3003eff in ssl_clear_bad_session (s=s@entry=0x2ad5e8028870) at ../ssl/ssl_sess.c:1032
#3  0x00002ad5d2ffefe6 in SSL_free (s=0x2ad5e8028870) at ../ssl/ssl_lib.c:1145
#4  0x0000000000669128 in SSLNetVConnection::clear (this=0x2ad618020bf0) at SSLNetVConnection.cc:920
#5  0x00000000006692e4 in SSLNetVConnection::free (this=0x2ad618020bf0, t=0x2ad5d8d6c010) at SSLNetVConnection.cc:954
#6  0x000000000068b4d3 in read_signal_and_update (event=104, vc=0x2ad618020bf0) at UnixNetVConnection.cc:102
#7  0x000000000068b2be in read_signal_done (event=25746192, nh=0x2ad5d8d6fcf0, vc=0x2ad618020bf0) at UnixNetVConnection.cc:144
#8  UnixNetVConnection::readSignalDone (this=0x2ad618020bf0, event=25746192, nh=0x2ad5d8d6fcf0) at UnixNetVConnection.cc:1041
#9  0x000000000066729a in SSLNetVConnection::net_read_io (this=0x2ad618020bf0, nh=0x2ad5d8d6fcf0, lthread=<optimized out>) at SSLNetVConnection.cc:696
#10 0x000000000067f6ef in NetHandler::process_ready_list (this=0x2ad5d8d6fcf0) at UnixNet.cc:395
#11 0x000000000067ff8c in NetHandler::waitForActivity (this=<optimized out>, timeout=<optimized out>) at UnixNet.cc:528
#12 0x000000000068006a in non-virtual thunk to NetHandler::waitForActivity(long) ()
#13 0x00000000006ae762 in EThread::execute_regular (this=0x2ad5d8d6c010) at UnixEThread.cc:272
#14 0x00000000006ad369 in spawn_thread_internal (a=0x17f0aa0) at Thread.cc:85
#15 0x00002ad5d39dd4a4 in start_thread (arg=0x2ad5d917d700) at pthread_create.c:456
#16 0x00002ad5d465bd0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
Thu, Sep 12, 5:52 PM · Traffic, Operations
Vgutierrez added a subtask for T231849: Tune ATS SSL session cache: T232724: ATS SSL session cache doesn't work.
Thu, Sep 12, 11:09 AM · Operations, Traffic
Vgutierrez added a parent task for T232724: ATS SSL session cache doesn't work: T231849: Tune ATS SSL session cache.
Thu, Sep 12, 11:09 AM · Traffic, Operations
Vgutierrez moved T232724: ATS SSL session cache doesn't work from Triage to TLS on the Traffic board.
Thu, Sep 12, 11:08 AM · Traffic, Operations
Vgutierrez updated the task description for T232724: ATS SSL session cache doesn't work.
Thu, Sep 12, 11:05 AM · Traffic, Operations
Vgutierrez added a comment to T232724: ATS SSL session cache doesn't work.

Enabling the session cache debug on a local instance shows this:

willikins:~ vgutierrez$ docker logs -f ats_ats_1 |fgrep timeout
[E. Mgmt] log ==> [TrafficManager] using root directory '/usr'
[Sep 12 11:00:15.807] {0x2ba5dff86080} DEBUG: <SSLUtils.cc:1600 (SSLInitServerContext)> (ssl.session_cache) ssl context=0x234dbc0: using session cache options, enabled=2, size=102400, num_buckets=256, skip_on_contention=0, timeout=30000, auto_clear=0
^C
willikins:~ vgutierrez$ docker logs -f ats_ats_1 |fgrep 4C2F274AD6F44519402A6B02E4867DD22A02BFB53067A5F505006367F7FB2CB1
[E. Mgmt] log ==> [TrafficManager] using root directory '/usr'
[Sep 12 11:01:14.215] {0x2ba5e78d7700} DEBUG: <SSLUtils.cc:285 (ssl_new_cached_session)> (ssl.session_cache.insert) ssl_new_cached_session session '4C2F274AD6F44519402A6B02E4867DD22A02BFB53067A5F505006367F7FB2CB1' and context 0x234dbc0
[Sep 12 11:01:14.215] {0x2ba5e78d7700} DEBUG: <SSLSessionCache.cc:108 (insertSession)> (ssl.session_cache.insert) SessionCache using bucket 5 (0x2347398): Inserting session '4C2F274AD6F44519402A6B02E4867DD22A02BFB53067A5F505006367F7FB2CB1' (hash: B12CFBF767630005).
[Sep 12 11:01:14.215] {0x2ba5e78d7700} DEBUG: <SSLSessionCache.cc:127 (insertSession)> (ssl.session_cache) Inserting session '4C2F274AD6F44519402A6B02E4867DD22A02BFB53067A5F505006367F7FB2CB1' to bucket 0x2347398.
[Sep 12 11:01:27.146] {0x2ba5e76d3700} DEBUG: <SSLUtils.cc:241 (ssl_get_cached_session)> (ssl.session_cache.get) ssl_get_cached_session cached session '4C2F274AD6F44519402A6B02E4867DD22A02BFB53067A5F505006367F7FB2CB1' context 0x234dbc0
[Sep 12 11:01:27.146] {0x2ba5e76d3700} DEBUG: <SSLSessionCache.cc:71 (getSession)> (ssl.session_cache.get) SessionCache looking in bucket 5 (0x2347398) for session '4C2F274AD6F44519402A6B02E4867DD22A02BFB53067A5F505006367F7FB2CB1' (hash: B12CFBF767630005).
[Sep 12 11:01:27.146] {0x2ba5e76d3700} DEBUG: <SSLSessionCache.cc:213 (getSession)> (ssl.session_cache) Looking for session with id '4C2F274AD6F44519402A6B02E4867DD22A02BFB53067A5F505006367F7FB2CB1' in bucket 0x2347398
[Sep 12 11:01:27.149] {0x2ba5e76d3700} DEBUG: <SSLUtils.cc:318 (ssl_rm_cached_session)> (ssl.session_cache.remove) ssl_rm_cached_session cached session '4C2F274AD6F44519402A6B02E4867DD22A02BFB53067A5F505006367F7FB2CB1'
Thu, Sep 12, 11:03 AM · Traffic, Operations
Vgutierrez triaged T232724: ATS SSL session cache doesn't work as High priority.
Thu, Sep 12, 10:58 AM · Traffic, Operations
Vgutierrez created T232724: ATS SSL session cache doesn't work.
Thu, Sep 12, 10:58 AM · Traffic, Operations
Vgutierrez added a comment to T219765: Implement server-side OCSP stapling.

After upgrading acme-chief on acmechief-test1001, a tiny storm of OCSP requests was generated:

Sep 12 08:23:24 acmechief-test1001 acme-chief-backend[457]: Missing/invalid DNS zone updater CMD timeout, using the default one: 60.00
Sep 12 08:23:25 acmechief-test1001 acme-chief-backend[457]: Number of certificates per status: Counter({'VALID': 48})
Sep 12 08:23:25 acmechief-test1001 acme-chief-backend[457]: Starting main loop...
Sep 12 08:23:25 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate apt / ec-prime256v1
Sep 12 08:23:25 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for apt / ec-prime256v1
Sep 12 08:23:25 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate apt / rsa-2048
Sep 12 08:23:25 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for apt / rsa-2048
Sep 12 08:23:25 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate archiva / ec-prime256v1
Sep 12 08:23:25 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for archiva / ec-prime256v1
Sep 12 08:23:25 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate archiva / rsa-2048
Sep 12 08:23:25 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for archiva / rsa-2048
Sep 12 08:23:25 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate cloudelastic / ec-prime256v1
Sep 12 08:23:25 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for cloudelastic / ec-prime256v1
Sep 12 08:23:25 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate cloudelastic / rsa-2048
Sep 12 08:23:26 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for cloudelastic / rsa-2048
Sep 12 08:23:26 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate dumps / ec-prime256v1
Sep 12 08:23:26 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for dumps / ec-prime256v1
Sep 12 08:23:26 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate dumps / rsa-2048
Sep 12 08:23:26 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for dumps / rsa-2048
Sep 12 08:23:26 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate gerrit / ec-prime256v1
Sep 12 08:23:26 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for gerrit / ec-prime256v1
Sep 12 08:23:26 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate gerrit / rsa-2048
Sep 12 08:23:26 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for gerrit / rsa-2048
Sep 12 08:23:26 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate icinga / ec-prime256v1
Sep 12 08:23:26 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for icinga / ec-prime256v1
Sep 12 08:23:26 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate icinga / rsa-2048
Sep 12 08:23:26 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for icinga / rsa-2048
Sep 12 08:23:26 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate idp / ec-prime256v1
Sep 12 08:23:26 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for idp / ec-prime256v1
Sep 12 08:23:26 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate idp / rsa-2048
Sep 12 08:23:27 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for idp / rsa-2048
Sep 12 08:23:27 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate ldap / ec-prime256v1
Sep 12 08:23:27 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for ldap / ec-prime256v1
Sep 12 08:23:27 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate ldap / rsa-2048
Sep 12 08:23:27 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for ldap / rsa-2048
Sep 12 08:23:27 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate ldap-codfw1dev / ec-prime256v1
Sep 12 08:23:27 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for ldap-codfw1dev / ec-prime256v1
Sep 12 08:23:27 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate ldap-codfw1dev / rsa-2048
Sep 12 08:23:27 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for ldap-codfw1dev / rsa-2048
Sep 12 08:23:27 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate ldap-labtest / ec-prime256v1
Sep 12 08:23:27 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for ldap-labtest / ec-prime256v1
Sep 12 08:23:27 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate ldap-labtest / rsa-2048
Sep 12 08:23:27 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for ldap-labtest / rsa-2048
Sep 12 08:23:27 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate librenms / ec-prime256v1
Sep 12 08:23:28 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for librenms / ec-prime256v1
Sep 12 08:23:28 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate librenms / rsa-2048
Sep 12 08:23:28 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for librenms / rsa-2048
Sep 12 08:23:28 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate lists / ec-prime256v1
Sep 12 08:23:28 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for lists / ec-prime256v1
Sep 12 08:23:28 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate lists / rsa-2048
Sep 12 08:23:28 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for lists / rsa-2048
Sep 12 08:23:28 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate mirrors / ec-prime256v1
Sep 12 08:23:28 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for mirrors / ec-prime256v1
Sep 12 08:23:28 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate mirrors / rsa-2048
Sep 12 08:23:28 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for mirrors / rsa-2048
Sep 12 08:23:28 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate mx / ec-prime256v1
Sep 12 08:23:28 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for mx / ec-prime256v1
Sep 12 08:23:28 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate mx / rsa-2048
Sep 12 08:23:28 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for mx / rsa-2048
Sep 12 08:23:28 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate netbox / ec-prime256v1
Sep 12 08:23:29 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for netbox / ec-prime256v1
Sep 12 08:23:29 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate netbox / rsa-2048
Sep 12 08:23:29 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for netbox / rsa-2048
Sep 12 08:23:29 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate non-canonical-redirect-1 / ec-prime256v1
Sep 12 08:23:29 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for non-canonical-redirect-1 / ec-prime256v1
Sep 12 08:23:29 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate non-canonical-redirect-1 / rsa-2048
Sep 12 08:23:29 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for non-canonical-redirect-1 / rsa-2048
Sep 12 08:23:29 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate non-canonical-redirect-2 / ec-prime256v1
Sep 12 08:23:29 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for non-canonical-redirect-2 / ec-prime256v1
Sep 12 08:23:29 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate non-canonical-redirect-2 / rsa-2048
Sep 12 08:23:29 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for non-canonical-redirect-2 / rsa-2048
Sep 12 08:23:29 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate non-canonical-redirect-3 / ec-prime256v1
Sep 12 08:23:29 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for non-canonical-redirect-3 / ec-prime256v1
Sep 12 08:23:29 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate non-canonical-redirect-3 / rsa-2048
Sep 12 08:23:30 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for non-canonical-redirect-3 / rsa-2048
Sep 12 08:23:30 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate non-canonical-redirect-4 / ec-prime256v1
Sep 12 08:23:30 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for non-canonical-redirect-4 / ec-prime256v1
Sep 12 08:23:30 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate non-canonical-redirect-4 / rsa-2048
Sep 12 08:23:30 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for non-canonical-redirect-4 / rsa-2048
Sep 12 08:23:30 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate non-canonical-redirect-5 / ec-prime256v1
Sep 12 08:23:30 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for non-canonical-redirect-5 / ec-prime256v1
Sep 12 08:23:30 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate non-canonical-redirect-5 / rsa-2048
Sep 12 08:23:30 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for non-canonical-redirect-5 / rsa-2048
Sep 12 08:23:30 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate non-canonical-redirect-6 / ec-prime256v1
Sep 12 08:23:30 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for non-canonical-redirect-6 / ec-prime256v1
Sep 12 08:23:30 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate non-canonical-redirect-6 / rsa-2048
Sep 12 08:23:30 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for non-canonical-redirect-6 / rsa-2048
Sep 12 08:23:30 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate tendril / ec-prime256v1
Sep 12 08:23:30 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for tendril / ec-prime256v1
Sep 12 08:23:30 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate tendril / rsa-2048
Sep 12 08:23:31 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for tendril / rsa-2048
Sep 12 08:23:31 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate unified / ec-prime256v1
Sep 12 08:23:31 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for unified / ec-prime256v1
Sep 12 08:23:31 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate unified / rsa-2048
Sep 12 08:23:31 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for unified / rsa-2048
Sep 12 08:23:31 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate wikibase / ec-prime256v1
Sep 12 08:23:31 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for wikibase / ec-prime256v1
Sep 12 08:23:31 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate wikibase / rsa-2048
Sep 12 08:23:31 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for wikibase / rsa-2048
Thu, Sep 12, 8:28 AM · Acme-chief
Vgutierrez committed rOSAC08bc9db90513: api: Allow acme-chief clients to fetch OCSP responses (authored by Vgutierrez).
api: Allow acme-chief clients to fetch OCSP responses
Thu, Sep 12, 7:20 AM
Vgutierrez committed rOSAC132920f50176: Release 0.21 (authored by Vgutierrez).
Release 0.21
Thu, Sep 12, 7:20 AM
Vgutierrez committed rOSAC7b968a435e8b: acme_chief: Provide OCSP responses (authored by Vgutierrez).
acme_chief: Provide OCSP responses
Thu, Sep 12, 7:20 AM
Vgutierrez committed rOSACfb9fc1c50b15: ocsp: Provide basic test coverage (authored by Vgutierrez).
ocsp: Provide basic test coverage
Thu, Sep 12, 7:20 AM
Vgutierrez committed rOSACe4530231b671: ocsp: Allow to load an existing OCSPResponse from disk (authored by Vgutierrez).
ocsp: Allow to load an existing OCSPResponse from disk
Thu, Sep 12, 7:20 AM
Vgutierrez committed rOSAC99f87dc61fb4: ocsp: Provide basic functionality to perform OCSP requests (authored by Vgutierrez).
ocsp: Provide basic functionality to perform OCSP requests
Thu, Sep 12, 7:20 AM
Vgutierrez committed rOSAC648adb5716cf: x509: Expose the OCSP URI of a Certificate as a property (authored by Vgutierrez).
x509: Expose the OCSP URI of a Certificate as a property
Thu, Sep 12, 7:20 AM

Tue, Sep 10

Vgutierrez added a comment to T232298: Investigate segfaults on ats-tls running on cp5001 .

So it looks like we are still leaking memory with ATS 8.0.5-1wm6:

(gdb) bt
#0  0x00002adfd1a16fff in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00002adfd1a1842a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00002adfcef53d0a in ink_abort (message_format=<optimized out>) at ink_error.cc:99
#3  0x00002adfcef59a6d in ats_malloc (size=128408) at ink_memory.cc:60
#4  0x00000000005c6b3f in ProcessManager::signalManager (this=0xf99760, msg_id=16, data_raw=0x2adffab3e010 "\003", data_len=-777949185) at ProcessManager.cc:245
#5  0x00000000006aa68b in RecMessageSend (msg=0x0) at RecProcess.cc:325
#6  0x000000000069e4bd in send_push_message () at P_RecCore.cc:123
#7  0x00000000006aa974 in sync_cont::sync (this=0xfc5fb0) at RecProcess.cc:194
#8  0x00000000006add11 in Continuation::handleEvent (this=<optimized out>, event=<optimized out>, data=<optimized out>) at ./I_Continuation.h:160
#9  EThread::process_event (this=0x2adfd3a46010, e=0xfd44e0, calling_code=2) at UnixEThread.cc:131
#10 0x00000000006ae520 in EThread::execute_regular (this=0x2adfd3a46010) at UnixEThread.cc:244
#11 0x00000000006ad369 in spawn_thread_internal (a=0xfd05b0) at Thread.cc:85
#12 0x00002adfd0ebb4a4 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00002adfd1accd0f in clone () from /lib/x86_64-linux-gnu/libc.so.6
Tue, Sep 10, 4:44 AM · Patch-For-Review, Traffic, Operations

Mon, Sep 9

Vgutierrez moved T232298: Investigate segfaults on ats-tls running on cp5001 from Triage to Caching on the Traffic board.
Mon, Sep 9, 9:08 AM · Patch-For-Review, Traffic, Operations
Vgutierrez triaged T232298: Investigate segfaults on ats-tls running on cp5001 as Normal priority.
Mon, Sep 9, 9:08 AM · Patch-For-Review, Traffic, Operations
Vgutierrez added a comment to T232298: Investigate segfaults on ats-tls running on cp5001 .

service log shows memory issues:

Sep 05 15:54:43 cp5001 traffic_server[12607]: Fatal: couldn't allocate 32768 bytes
Sep 08 03:24:10 cp5001 traffic_server[31479]: Fatal: couldn't allocate 65536 bytes

and it seems to be related to the memory leak described on https://github.com/apache/trafficserver/pull/5695

Mon, Sep 9, 9:08 AM · Patch-For-Review, Traffic, Operations
Vgutierrez added a comment to T232298: Investigate segfaults on ats-tls running on cp5001 .

Both crashes seems to be related:

crash-2019-09-05-155443.log
Thread 12661, [ET_NET 32]:
0    0x000000000049d9f0 crash_logger_invoke(int, siginfo_t*, void*) + 0x80
1    0x00002b8df29540e0 __restore_rt + (nil)
2    0x00002b8df34a5fff gsignal + 0xcf
3    0x00002b8df34a742a abort + 0x16a
4    0x00002b8df09e2d0a ink_abort(char const*, ...) + 0x9a
5    0x00002b8df09e8a6d ats_malloc + 0x2d
6    0x00000000005aa5ba HdrHeap::coalesce_str_heaps(int) + 0x5a
7    0x00000000005aa470 HdrHeap::allocate_str(int) + 0x220
8    0x00000000005aa832 HdrHeap::duplicate_str(char const*, int) + 0xa2
9    0x00000000005b2620 mime_field_name_set(HdrHeap*, MIMEHdrImpl*, MIMEField*, short, char const*, int, bool) + 0x40
10   0x000000000055c1a8 HpackDynamicTable::add_header_field(MIMEField const*) + 0x158
11   0x000000000055c6d6 encode_literal_header_field_with_indexed_name(unsigned char*, unsigned char const*, MIMEFieldWrapper const&, unsigned int, HpackIndexingTable&, HpackField) + 0x56
12   0x000000000055deb8 hpack_encode_header_block(HpackIndexingTable&, unsigned char*, unsigned long, HTTPHdr*, int) + 0x2f8
13   0x00000000005526af http2_encode_header_blocks(HTTPHdr*, unsigned char*, unsigned int, unsigned int*, HpackIndexingTable&, int) + 0x4f
14   0x00000000005670cb Http2ConnectionState::send_headers_frame(Http2Stream*) + 0x16b
15   0x0000000000554d9d Http2Stream::update_write_request(IOBufferReader*, long, bool) + 0x75d
16   0x0000000000555cae Http2Stream::do_io_write(Continuation*, long, IOBufferReader*, bool) + 0xce
17   0x000000000053b1c4 HttpTunnel::producer_run(HttpTunnelProducer*) + 0xb14
18   0x000000000053a506 HttpTunnel::tunnel_run(HttpTunnelProducer*) + 0x46
19   0x00000000004eeabd HttpSM::handle_api_return() + 0x51d
20   0x00000000004edfd3 HttpSM::state_api_callout(int, void*) + 0x1233
21   0x0000000000501570 HttpSM::set_next_state() + 0xa70
22   0x00000000004e772f HttpSM::call_transact_and_set_next_state(void (*)(HttpTransact::State*)) + 0x10f
23   0x00000000004ee5e0 HttpSM::handle_api_return() + 0x40
24   0x00000000004edfd3 HttpSM::state_api_callout(int, void*) + 0x1233
25   0x00000000004f364f HttpSM::state_read_server_response_header(int, void*) + 0x5af
26   0x00000000004e4162 HttpSM::main_handler(int, void*) + 0x1b2
27   0x000000000068995e UnixNetVConnection::net_read_io(NetHandler*, EThread*) + 0xace
28   0x000000000067f57f NetHandler::process_ready_list() + 0x9f
29   0x000000000067fe1c NetHandler::waitForActivity(long) + 0x69c
30   0x000000000067fefa non-virtual thunk to NetHandler::waitForActivity(long) + 0xa
31   0x00000000006ae5f2 EThread::execute_regular() + 0x4f2
32   0x00000000006ad1f9 Thread::start(char const*, void*, unsigned long, std::__1::function<void ()> const&) + 0x1a9
33   0x00002b8df294a4a4 start_thread + 0xc4
34   0x00002b8df355bd0f clone + 0x3f
35   0x0000000000000000 0x0 + 0x3f
Mon, Sep 9, 8:26 AM · Patch-For-Review, Traffic, Operations
Vgutierrez created T232298: Investigate segfaults on ats-tls running on cp5001 .
Mon, Sep 9, 8:12 AM · Patch-For-Review, Traffic, Operations

Sat, Sep 7

Vgutierrez added a comment to P9056 (An Untitled Masterwork).
willikins:~ vgutierrez$ sudo -i /usr/local/Cellar/mtr/0.93_1/sbin/mtr text-lb.eqiad.wikimedia.org --report-wide -z
Start: 2019-09-07T14:09:19+0000
HOST: willikins                                           Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS???    192.168.1.1                                  0.0%    10    1.7  13.6   1.6 114.0  35.3
  2. AS4788   58.27.111.134                                0.0%    10   30.9  50.3  28.8 183.7  49.2
  3. AS???    10.233.65.46                                 0.0%    10   32.9  83.1  31.1 310.7  89.9
  4. AS???    10.55.41.118                                 0.0%    10   37.0  66.1  36.5 219.1  57.8
  5. AS2914   ae-6.r21.kslrml02.my.bb.gin.ntt.net          0.0%    10   37.7  52.7  37.3 124.2  27.3
  6. AS2914   ae-1.r20.kslrml02.my.bb.gin.ntt.net          0.0%    10   41.5  56.4  38.6 158.3  36.8
  7. AS2914   ae-3.r25.tkokhk01.hk.bb.gin.ntt.net          0.0%    10   80.1  90.4  79.9 175.0  29.7
  8. AS2914   ae-7.r24.osakjp02.jp.bb.gin.ntt.net          0.0%    10  121.9 135.1 121.9 235.5  35.4
  9. AS2914   ae-4.r23.sttlwa01.us.bb.gin.ntt.net         70.0%    10  216.4 257.4 216.4 337.1  69.0
 10. AS2914   ae-29.r05.sttlwa01.us.bb.gin.ntt.net         0.0%    10  211.1 236.8 209.8 443.2  73.0
 11. AS2914   ae-1.a01.sttlwa01.us.bb.gin.ntt.net          0.0%    10  215.8 245.1 212.3 376.8  59.7
 12. AS2914   ae-0.cloudflare.sttlwa01.us.bb.gin.ntt.net   0.0%    10  335.2 347.2 327.7 440.0  35.9
 13. AS13335  108.162.244.36                               0.0%    10  230.8 246.6 229.1 375.5  45.7
 14. AS13335  108.162.244.36                               0.0%    10  230.3 244.4 229.3 309.4  26.2
 15. AS???    ???                                         100.0    10    0.0   0.0   0.0   0.0   0.0
 16. AS14907  text-lb.eqiad.wikimedia.org                  0.0%    10  460.6 420.9 380.3 542.1  50.2
Sat, Sep 7, 2:10 PM

Fri, Sep 6

Vgutierrez closed T232209: ats-tls is performing 3k DNS queries per second on cp5001, a subtask of T231433: Move cache upload cluster from nginx to ats-tls, as Resolved.
Fri, Sep 6, 1:27 PM · Patch-For-Review, Traffic, Operations
Vgutierrez closed T232209: ats-tls is performing 3k DNS queries per second on cp5001 as Resolved.
Fri, Sep 6, 1:27 PM · Traffic, Operations
Vgutierrez created T232209: ats-tls is performing 3k DNS queries per second on cp5001.
Fri, Sep 6, 11:01 AM · Traffic, Operations

Wed, Sep 4

Vgutierrez closed T231859: ATS-tls isn't enforcing the same list of curves as nginx during TLS handshake as Resolved.

Solved, now ATS has the same behaviour as nginx:

vgutierrez@cp5001:~$ openssl s_client -connect 127.0.0.1:443 -curves secp521r1 2>&1 </dev/null |egrep "Server Temp Key|Cipher"
Server Temp Key: DH, 2048 bits
New, SSLv3, Cipher is DHE-RSA-AES128-SHA
    Cipher    : DHE-RSA-AES128-SHA
vgutierrez@cp5001:~$ openssl s_client -connect 127.0.0.1:4443 -curves secp521r1 2>&1 </dev/null |egrep "Server Temp Key|Cipher" # nginx is currently listening on port 4443
Server Temp Key: DH, 2048 bits
New, SSLv3, Cipher is DHE-RSA-AES128-SHA
    Cipher    : DHE-RSA-AES128-SHA
Wed, Sep 4, 10:21 AM · Operations, Traffic

Tue, Sep 3

Vgutierrez triaged T231859: ATS-tls isn't enforcing the same list of curves as nginx during TLS handshake as Normal priority.
Tue, Sep 3, 9:17 AM · Operations, Traffic
Vgutierrez moved T231859: ATS-tls isn't enforcing the same list of curves as nginx during TLS handshake from Triage to TLS on the Traffic board.
Tue, Sep 3, 9:16 AM · Operations, Traffic
Vgutierrez created T231859: ATS-tls isn't enforcing the same list of curves as nginx during TLS handshake.
Tue, Sep 3, 9:16 AM · Operations, Traffic
Vgutierrez triaged T231849: Tune ATS SSL session cache as Normal priority.
Tue, Sep 3, 5:55 AM · Operations, Traffic
Vgutierrez moved T231849: Tune ATS SSL session cache from Triage to TLS on the Traffic board.
Tue, Sep 3, 5:55 AM · Operations, Traffic
Vgutierrez created T231849: Tune ATS SSL session cache.
Tue, Sep 3, 5:54 AM · Operations, Traffic

Mon, Sep 2

Vgutierrez updated subscribers of T231753: Downloading the original SVG of a file on Commons serves a truncated stream.

This look like a duplicate of T231422, or at least pretty similar issue, what do you think @ema?

Mon, Sep 2, 9:08 AM · Traffic, Operations, Commons

Fri, Aug 30

Vgutierrez closed T231513: Perform HTTPS redirect without crossing domain boundaries for non canonical domains as Resolved.
Fri, Aug 30, 6:29 AM · Operations, Traffic
Vgutierrez updated subscribers of T231422: Cannot download STL files due to network error.
Fri, Aug 30, 6:20 AM · Operations, Traffic
Vgutierrez added a comment to T231422: Cannot download STL files due to network error.

I can reproduce the issue, using HTTP/2 and HTTP/1.1 on eqsin:

willikins:~ vgutierrez$ curl https://upload.wikimedia.org/wikipedia/commons/d/de/65-msr-sandaled-foot-5.stl -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1568k    0 1568k    0     0   867k      0 --:--:--  0:00:01 --:--:--  867k
curl: (92) HTTP/2 stream 1 was not closed cleanly: INTERNAL_ERROR (err 2)
willikins:~ vgutierrez$ curl --http1.1 https://upload.wikimedia.org/wikipedia/commons/d/de/65-msr-sandaled-foot-5.stl -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2656k    0 2656k    0     0  1072k      0 --:--:--  0:00:02 --:--:-- 1072k
curl: (18) transfer closed with outstanding read data remaining
Fri, Aug 30, 6:19 AM · Operations, Traffic
Vgutierrez moved T231627: Move cache text cluster from nginx to ats-tls from Triage to TLS on the Traffic board.
Fri, Aug 30, 6:03 AM · Patch-For-Review, Traffic, Operations
Vgutierrez triaged T231627: Move cache text cluster from nginx to ats-tls as Normal priority.
Fri, Aug 30, 6:03 AM · Patch-For-Review, Traffic, Operations
Vgutierrez created T231627: Move cache text cluster from nginx to ats-tls.
Fri, Aug 30, 6:02 AM · Patch-For-Review, Traffic, Operations

Thu, Aug 29

Vgutierrez created T231533: Improve ATS prometheus metrics.
Thu, Aug 29, 9:43 AM · Operations, Traffic
Vgutierrez moved T231514: Enable HSTS for non canonical domains using the ncredir service from Triage to TLS on the Traffic board.
Thu, Aug 29, 4:31 AM · Operations, Traffic
Vgutierrez triaged T231514: Enable HSTS for non canonical domains using the ncredir service as Normal priority.
Thu, Aug 29, 4:31 AM · Operations, Traffic
Vgutierrez created T231514: Enable HSTS for non canonical domains using the ncredir service.
Thu, Aug 29, 4:31 AM · Operations, Traffic
Vgutierrez triaged T231513: Perform HTTPS redirect without crossing domain boundaries for non canonical domains as Normal priority.
Thu, Aug 29, 4:29 AM · Operations, Traffic
Vgutierrez moved T231513: Perform HTTPS redirect without crossing domain boundaries for non canonical domains from Triage to TLS on the Traffic board.
Thu, Aug 29, 4:29 AM · Operations, Traffic
Vgutierrez created T231513: Perform HTTPS redirect without crossing domain boundaries for non canonical domains .
Thu, Aug 29, 4:28 AM · Operations, Traffic

Wed, Aug 28

Zache awarded T230470: Could not reach wikipedia from domain wikipedia.fi a Barnstar token.
Wed, Aug 28, 1:39 PM · Traffic, Operations, DNS, Domains
Vgutierrez moved T231433: Move cache upload cluster from nginx to ats-tls from Triage to TLS on the Traffic board.
Wed, Aug 28, 11:39 AM · Patch-For-Review, Traffic, Operations
Vgutierrez triaged T231433: Move cache upload cluster from nginx to ats-tls as Normal priority.
Wed, Aug 28, 11:39 AM · Patch-For-Review, Traffic, Operations
Vgutierrez created T231433: Move cache upload cluster from nginx to ats-tls.
Wed, Aug 28, 11:39 AM · Patch-For-Review, Traffic, Operations
Vgutierrez closed T221594: Puppetize ATS TLS configuration for incoming traffic, a subtask of T220383: Evaluate ATS TLS stack, as Resolved.
Wed, Aug 28, 11:36 AM · Traffic, Operations
Vgutierrez closed T221594: Puppetize ATS TLS configuration for incoming traffic as Resolved.
Wed, Aug 28, 11:36 AM · Traffic, Operations
Vgutierrez closed T231287: Investigate HTTP/2 limits on trafficserver as Resolved.
Wed, Aug 28, 11:28 AM · Patch-For-Review, Operations, Traffic
Vgutierrez added a comment to T230470: Could not reach wikipedia from domain wikipedia.fi.

oh, it also works for https://wikipedia.fi:

willikins:~ vgutierrez$ curl https://wikipedia.fi -o /dev/null -v 2>&1 |fgrep -i Location
* successfully set certificate verify locations:
< location: https://fi.wikipedia.org/
Wed, Aug 28, 9:44 AM · Traffic, Operations, DNS, Domains
Vgutierrez closed T230470: Could not reach wikipedia from domain wikipedia.fi as Resolved.

https is happy now after adding wikipedia.fi as part of non-canonical-redirect-6 in https://gerrit.wikimedia.org/r/c/operations/puppet/+/532950/ :

willikins:~ vgutierrez$ curl https://www.wikipedia.fi -o /dev/null -v 2>&1 |fgrep -i Location
* successfully set certificate verify locations:
< location: https://fi.wikipedia.org/
Wed, Aug 28, 9:43 AM · Traffic, Operations, DNS, Domains
Vgutierrez triaged T231287: Investigate HTTP/2 limits on trafficserver as Normal priority.
Wed, Aug 28, 5:35 AM · Patch-For-Review, Operations, Traffic
Vgutierrez added a comment to T231287: Investigate HTTP/2 limits on trafficserver.

After some discussion with upstream developers, https://github.com/apache/trafficserver/pull/5888 has been submitted and it's been included in https://gerrit.wikimedia.org/r/532723

Wed, Aug 28, 5:34 AM · Patch-For-Review, Operations, Traffic

Tue, Aug 27

Vgutierrez added a comment to T231287: Investigate HTTP/2 limits on trafficserver.

Further analysis shows that actually ATS is rate limiting PRIORITY frames even when they are disabled:

proxy.config.http2.stream_priority_enabled: 0
proxy.config.http2.max_priority_frames_per_minute: 120
Tue, Aug 27, 2:43 PM · Patch-For-Review, Operations, Traffic
Vgutierrez updated the task description for T231287: Investigate HTTP/2 limits on trafficserver.
Tue, Aug 27, 11:29 AM · Patch-For-Review, Operations, Traffic
Vgutierrez added a comment to T231287: Investigate HTTP/2 limits on trafficserver.

Triggering the issue is relatively easy browsing https://maps.wikimedia.org with Chrome 76:

t=264968 [st=29427]  HTTP2_SESSION_RECV_GOAWAY
                     --> active_streams = 22
                     --> debug_data = ""
                     --> error_code = "11 (ENHANCE_YOUR_CALM)"
                     --> last_accepted_stream_id = 1207
                     --> unclaimed_streams = 0
Tue, Aug 27, 11:25 AM · Patch-For-Review, Operations, Traffic
Vgutierrez created T231287: Investigate HTTP/2 limits on trafficserver.
Tue, Aug 27, 9:48 AM · Patch-For-Review, Operations, Traffic
Vgutierrez created T231286: Track TLS related ATS metrics in prometheus.
Tue, Aug 27, 9:44 AM · Traffic, Operations
Vgutierrez closed T228135: ATS lacks the possibility of reporting SSL stats to an origin server via HTTP Headers as Resolved.
Tue, Aug 27, 9:38 AM · Traffic, Operations
Vgutierrez closed T228135: ATS lacks the possibility of reporting SSL stats to an origin server via HTTP Headers, a subtask of T220383: Evaluate ATS TLS stack, as Resolved.
Tue, Aug 27, 9:38 AM · Traffic, Operations
Vgutierrez closed T220383: Evaluate ATS TLS stack, a subtask of T170567: Support TLSv1.3, as Resolved.
Tue, Aug 27, 9:38 AM · Performance-Team (Radar), Goal, Patch-For-Review, Traffic, Operations
Vgutierrez closed T220383: Evaluate ATS TLS stack as Resolved.
Tue, Aug 27, 9:38 AM · Traffic, Operations
Vgutierrez closed T231262: TLS handshake issues with ATS 8.0.5-1wm2, a subtask of T220383: Evaluate ATS TLS stack, as Resolved.
Tue, Aug 27, 8:53 AM · Traffic, Operations
Vgutierrez closed T231262: TLS handshake issues with ATS 8.0.5-1wm2 as Resolved.

After disable the connection throttling, cp5001 behaves as expected and no longer drops connections:

vgutierrez@cp5001:~$ sudo -i traffic_ctl --run-root=/srv/trafficserver/tls metric match throttle
proxy.process.http.throttled_proxy_only 0
proxy.process.http.origin_connections_throttled_out 0
proxy.process.net.connections_throttled_in 0
proxy.process.net.connections_throttled_out 0
Tue, Aug 27, 8:53 AM · Traffic, Operations
Vgutierrez added a comment to T231262: TLS handshake issues with ATS 8.0.5-1wm2.

Further analysis of ats-tls metrics shows that connections were actually being dropped without being logged:

vgutierrez@cp5001:~$ sudo -i traffic_ctl --run-root=/srv/trafficserver/tls metric match throttle
proxy.process.http.throttled_proxy_only 0
proxy.process.http.origin_connections_throttled_out 0
proxy.process.net.connections_throttled_in 313323
proxy.process.net.connections_throttled_out 0
vgutierrez@cp5001:~$ sudo -i traffic_ctl --run-root=/srv/trafficserver/tls config get proxy.config.net.connections_throttle
proxy.config.net.connections_throttle: 30000
Tue, Aug 27, 7:36 AM · Traffic, Operations
Vgutierrez added a comment to T231262: TLS handshake issues with ATS 8.0.5-1wm2.

from a tcpdump capture, it looks like ATS is actually dropping connections:

1007 153.027859    127.0.0.1 → 127.0.0.1    TCP 74 60211 → 443 [SYN] Seq=0 Win=43690 Len=0 MSS=65495 SACK_PERM=1 TSval=3584220 TSecr=0 WS=512
1008 153.027874    127.0.0.1 → 127.0.0.1    TCP 74 443 → 60211 [SYN, ACK] Seq=0 Ack=1 Win=43690 Len=0 MSS=65495 SACK_PERM=1 TSval=3584220 TSecr=3584220 WS=512
1009 153.027881    127.0.0.1 → 127.0.0.1    TCP 66 60211 → 443 [ACK] Seq=1 Ack=1 Win=44032 Len=0 TSval=3584220 TSecr=3584220
1010 153.027983    127.0.0.1 → 127.0.0.1    TLSv1 242 Client Hello
1011 153.027995    127.0.0.1 → 127.0.0.1    TCP 66 443 → 60211 [ACK] Seq=1 Ack=177 Win=45056 Len=0 TSval=3584220 TSecr=3584220
1012 153.028014    127.0.0.1 → 127.0.0.1    TCP 66 443 → 60211 [RST, ACK] Seq=1 Ack=177 Win=45056 Len=0 TSval=3584220 TSecr=3584220
Tue, Aug 27, 7:11 AM · Traffic, Operations
Vgutierrez added a comment to T231262: TLS handshake issues with ATS 8.0.5-1wm2.

Further testing shows that the issue is apparently not related to OCSP stapling:

vgutierrez@cp5001:~$ openssl s_client -connect 127.0.0.1:443 < /dev/null
CONNECTED(00000003)
write:errno=104
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 176 bytes
Verification: OK
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : 0000
    Session-ID:
    Session-ID-ctx:
    Master-Key:
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    Start Time: 1566878252
    Timeout   : 7200 (sec)
    Verify return code: 0 (ok)
    Extended master secret: no
---
Tue, Aug 27, 4:01 AM · Traffic, Operations
Vgutierrez moved T231262: TLS handshake issues with ATS 8.0.5-1wm2 from Triage to TLS on the Traffic board.
Tue, Aug 27, 3:41 AM · Traffic, Operations
Vgutierrez triaged T231262: TLS handshake issues with ATS 8.0.5-1wm2 as Normal priority.
Tue, Aug 27, 3:40 AM · Traffic, Operations
Vgutierrez created T231262: TLS handshake issues with ATS 8.0.5-1wm2.
Tue, Aug 27, 3:40 AM · Traffic, Operations

Mon, Aug 26

Vgutierrez created T231200: CI performance issues.
Mon, Aug 26, 1:44 PM · Patch-For-Review, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (201908), Continuous-Integration-Infrastructure