Vgutierrez (Valentín Gutiérrez)
Traffic Security Engineer

User Details

User Since
Feb 12 2018, 9:51 AM (91 w, 3 d)
Availability
Available
IRC Nick
vgutierrez
LDAP User
Vgutierrez
MediaWiki User
Unknown

Recent Activity

Today

Vgutierrez closed T238032: cp3065 crashed as Resolved.

Tracking the issue on the parent task: T238305

Thu, Nov 14, 11:32 AM · ops-esams, Operations, Traffic
Vgutierrez closed T238032: cp3065 crashed, a subtask of T238305: servers freeze across the caching cluster, as Resolved.
Thu, Nov 14, 11:32 AM · Operations, Traffic
Vgutierrez closed T238289: cp1077 is unreachable, a subtask of T238305: servers freeze across the caching cluster, as Resolved.
Thu, Nov 14, 11:32 AM · Operations, Traffic
Vgutierrez closed T238289: cp1077 is unreachable as Resolved.

Tracking the issue on the parent task: T238305

Thu, Nov 14, 11:32 AM · Traffic, ops-eqiad, Operations
Vgutierrez added a comment to T238307: ats-tls shows spikes on H/2 recv settings bad param errors.

It looks like our ATS build is missing https://github.com/apache/trafficserver/pull/5636

Thu, Nov 14, 9:58 AM · Patch-For-Review, Operations, Traffic
Vgutierrez added a comment to T238307: ats-tls shows spikes on H/2 recv settings bad param errors.

from cp5007:

vgutierrez@cp5007:~$ journalctl -u trafficserver-tls --since="7days ago" |grep "settings bad param" |cut -f1-2 -d' ' |uniq -c
     21 Nov 07
     55 Nov 08
     47 Nov 09
     22 Nov 10
    809 Nov 11
   6019 Nov 12
   7201 Nov 13
   2045 Nov 14
Thu, Nov 14, 9:57 AM · Patch-For-Review, Operations, Traffic
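The per-day tally produced by the shell pipeline above can also be sketched in Python. This is a minimal illustration, assuming each journal line starts with a "Mon DD" prefix as in the output shown; the function name count_matches_per_day and the sample lines are hypothetical and only shaped like journal output.

```python
from collections import Counter


def count_matches_per_day(lines, needle="settings bad param"):
    """Count lines containing `needle`, keyed by the first two space-separated fields."""
    counts = Counter()
    for line in lines:
        if needle in line:
            # Mirrors `cut -f1-2 -d' '`: keep the month and day fields only.
            fields = line.split(" ")
            counts[" ".join(fields[:2])] += 1
    return counts


# Hypothetical sample lines (not real log output).
sample = [
    "Nov 13 10:00:01 host example: settings bad param",
    "Nov 13 10:00:02 host example: unrelated message",
    "Nov 14 09:00:00 host example: settings bad param",
    "Nov 14 09:00:05 host example: settings bad param",
]

per_day = count_matches_per_day(sample)
for day, n in per_day.items():
    print(f"{n:7d} {day}")
```

Unlike `uniq -c`, the Counter does not depend on matching lines being adjacent, so it also works on unsorted input.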
Vgutierrez created T238307: ats-tls shows spikes on H/2 recv settings bad param errors.
Thu, Nov 14, 9:56 AM · Patch-For-Review, Operations, Traffic
Vgutierrez triaged T238305: servers freeze across the caching cluster as High priority.
Thu, Nov 14, 9:38 AM · Operations, Traffic
Vgutierrez moved T238289: cp1077 is unreachable from Triage to Hardware on the Traffic board.
Thu, Nov 14, 9:38 AM · Traffic, ops-eqiad, Operations
Vgutierrez moved T238305: servers freeze across the caching cluster from Triage to Hardware on the Traffic board.
Thu, Nov 14, 9:38 AM · Operations, Traffic
Vgutierrez added a parent task for T237348: cp3057 is unreachable: T238305: servers freeze across the caching cluster.
Thu, Nov 14, 9:37 AM · ops-esams, Operations, Traffic
Vgutierrez added a parent task for T238032: cp3065 crashed: T238305: servers freeze across the caching cluster.
Thu, Nov 14, 9:37 AM · ops-esams, Operations, Traffic
Vgutierrez added a parent task for T238289: cp1077 is unreachable: T238305: servers freeze across the caching cluster.
Thu, Nov 14, 9:37 AM · Traffic, ops-eqiad, Operations
Vgutierrez added subtasks for T238305: servers freeze across the caching cluster: T238289: cp1077 is unreachable, T238032: cp3065 crashed, T237348: cp3057 is unreachable.
Thu, Nov 14, 9:37 AM · Operations, Traffic
Vgutierrez created T238305: servers freeze across the caching cluster.
Thu, Nov 14, 9:36 AM · Operations, Traffic
Vgutierrez added a comment to T238289: cp1077 is unreachable.

Nothing in the logs either; this looks awfully similar to T237348 and T238032

Thu, Nov 14, 4:07 AM · Traffic, ops-eqiad, Operations
Vgutierrez created T238289: cp1077 is unreachable.
Thu, Nov 14, 3:48 AM · Traffic, ops-eqiad, Operations

Yesterday

Vgutierrez closed T237425: ats-tls-restart failed on cp4027 as Resolved.

This has been fixed with the conditional reload:

vgutierrez@cp5007:~$ sudo -i ats-tls-restart
eqsin/cache_text/nginx/cp5007.eqsin.wmnet: pooled changed yes => no
eqsin/cache_text/nginx/cp5007.eqsin.wmnet: pooled changed no => yes
Wed, Nov 13, 10:56 AM · Operations, Traffic

Tue, Nov 12

Vgutierrez closed T238050: envoy overwrites the server header as Resolved.

Fixed with 2428105:

(eqsin) $ curl -v https://en.wikipedia.org/api/rest_v1/page/summary/Tremont_Street_Subway 2>&1 |grep server:
< server: restbase1017
Tue, Nov 12, 11:10 AM · Traffic, Operations
Vgutierrez created T238053: puppet-compiler fails to compile production catalog for restbase2014.
Tue, Nov 12, 10:21 AM · puppet-compiler, Operations
Vgutierrez triaged T238050: envoy overwrites the server header as Normal priority.
Tue, Nov 12, 9:50 AM · Traffic, Operations
Vgutierrez created T238050: envoy overwrites the server header.
Tue, Nov 12, 9:50 AM · Traffic, Operations
Vgutierrez triaged T238038: Start warning and deprecation process for all legacy TLS as Normal priority.
Tue, Nov 12, 5:32 AM · Patch-For-Review, Operations, Traffic
Vgutierrez moved T238038: Start warning and deprecation process for all legacy TLS from Triage to TLS on the Traffic board.
Tue, Nov 12, 5:19 AM · Patch-For-Review, Operations, Traffic
Vgutierrez created T238038: Start warning and deprecation process for all legacy TLS.
Tue, Nov 12, 5:19 AM · Patch-For-Review, Operations, Traffic
Vgutierrez added a comment to T238034: Enable QUIC support on Wikimedia servers.

We should consider QUIC and HTTP/3 adoption carefully as it implies a switch from TCP to UDP, and that could open new (D)DoS vectors and render unusable some mitigation techniques.

Tue, Nov 12, 4:27 AM · Operations, Traffic, HTTPS
Vgutierrez created T238036: scs-c1-eqiad CPU usage over 85%.
Tue, Nov 12, 4:02 AM · Operations, netops
Vgutierrez created T238032: cp3065 crashed.
Tue, Nov 12, 2:29 AM · ops-esams, Operations, Traffic

Mon, Nov 11

Vgutierrez updated the task description for T231627: Move cache text cluster from nginx to ats-tls.
Mon, Nov 11, 5:40 AM · Traffic, Operations

Fri, Nov 8

Vgutierrez closed T236482: Add ats-tls status and availability graphs to frontend-traffic as Resolved.

ats-tls availability panel is now ready: https://grafana.wikimedia.org/d/000000479/frontend-traffic?refresh=1m&panelId=13&fullscreen&orgId=1

Fri, Nov 8, 3:28 AM · Operations, observability, Traffic
Vgutierrez updated the task description for T231627: Move cache text cluster from nginx to ats-tls.
Fri, Nov 8, 3:20 AM · Traffic, Operations
Vgutierrez updated the task description for T231627: Move cache text cluster from nginx to ats-tls.
Fri, Nov 8, 2:56 AM · Traffic, Operations

Thu, Nov 7

Vgutierrez added a comment to T237117: Update webrequest_128 dataset in turnilo to include TLS fields once available.

I'm already loving the data, thanks @JAllemandou <3

Thu, Nov 7, 9:51 AM · Analytics-Kanban, observability, Operations, Analytics, Traffic
Vgutierrez added a comment to T237117: Update webrequest_128 dataset in turnilo to include TLS fields once available.

@BBlack I'm seeing some "nil" values on the TLS KeyExchange field when AES128-SHA is being used, should we rule those out at varnish level?

Thu, Nov 7, 9:41 AM · Analytics-Kanban, observability, Operations, Analytics, Traffic

Wed, Nov 6

Vgutierrez added a comment to T236482: Add ats-tls status and availability graphs to frontend-traffic.

Added ats-tls status panel: https://grafana.wikimedia.org/d/000000479/frontend-traffic?refresh=1m&panelId=12&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4

Wed, Nov 6, 5:09 AM · Operations, observability, Traffic
Vgutierrez updated the task description for T231627: Move cache text cluster from nginx to ats-tls.
Wed, Nov 6, 3:23 AM · Traffic, Operations
Vgutierrez added a comment to T237425: ats-tls-restart failed on cp4027.

this is a known issue caused by update-ocsp-all

Wed, Nov 6, 1:13 AM · Operations, Traffic

Tue, Nov 5

Vgutierrez closed T237348: cp3057 is unreachable as Resolved.

Closing, as it seems it was a hiccup and the server has been running without issues for the last ~5 hours

Tue, Nov 5, 8:24 AM · ops-esams, Operations, Traffic
Vgutierrez updated the task description for T231627: Move cache text cluster from nginx to ats-tls.
Tue, Nov 5, 4:42 AM · Traffic, Operations
Vgutierrez updated the task description for T231627: Move cache text cluster from nginx to ats-tls.
Tue, Nov 5, 4:25 AM · Traffic, Operations
Vgutierrez added a comment to T237348: cp3057 is unreachable.

racadm getsel doesn't show anything useful either

Tue, Nov 5, 3:23 AM · ops-esams, Operations, Traffic
Vgutierrez added a comment to T237348: cp3057 is unreachable.

The server came back online after a power cycle; nothing in /var/log/kern.log or /var/log/syslog explains why it went down.

Tue, Nov 5, 3:17 AM · ops-esams, Operations, Traffic
Vgutierrez triaged T237348: cp3057 is unreachable as Normal priority.
Tue, Nov 5, 2:56 AM · ops-esams, Operations, Traffic
Vgutierrez moved T237348: cp3057 is unreachable from Triage to Hardware on the Traffic board.
Tue, Nov 5, 2:55 AM · ops-esams, Operations, Traffic
Vgutierrez created T237348: cp3057 is unreachable.
Tue, Nov 5, 2:55 AM · ops-esams, Operations, Traffic

Mon, Nov 4

Vgutierrez updated the task description for T231627: Move cache text cluster from nginx to ats-tls.
Mon, Nov 4, 5:05 AM · Traffic, Operations
Vgutierrez updated the task description for T231627: Move cache text cluster from nginx to ats-tls.
Mon, Nov 4, 4:50 AM · Traffic, Operations
Vgutierrez changed the status of T234803: Provide an easy way of picking the traffic serving TLS certificate used by ATS, a subtask of T230687: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users, from Open to Stalled.
Mon, Nov 4, 3:56 AM · Operations, Traffic, Acme-chief
Vgutierrez changed the status of T234803: Provide an easy way of picking the traffic serving TLS certificate used by ATS from Open to Stalled.

I'm marking this task as stalled; it will be resolved as soon as T231627 is completed

Mon, Nov 4, 3:56 AM · Operations, Traffic
Vgutierrez closed T236755: Enforce POST size limit on ats-tls, a subtask of T231627: Move cache text cluster from nginx to ats-tls, as Resolved.
Mon, Nov 4, 3:51 AM · Traffic, Operations
Vgutierrez closed T236755: Enforce POST size limit on ats-tls as Resolved.
Mon, Nov 4, 3:51 AM · Traffic, Operations

Fri, Nov 1

Vgutierrez added a comment to T233661: Publish tls related info to webrequest via varnish.

Hmm, maybe it would make sense to store some TLS data, like the ciphersuite, version or elliptic curve, as integers, assuming that all of them have constants assigned in OpenSSL. That would reduce the storage requirements significantly

Fri, Nov 1, 8:15 AM · Patch-For-Review, Analytics-Kanban, observability, Operations, Analytics, Traffic
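The idea of storing TLS fields as integers can be sketched as a simple lookup. This is only an illustration: the values below are code points from the IANA TLS registries, used here as an assumption about what the integer encoding could be (the comment above speaks of OpenSSL constants, which are not shown in the source). The table names and the function encode_tls_fields are hypothetical.

```python
# IANA TLS registry code points; assumed here as the integer encoding.
TLS_VERSIONS = {"TLSv1.2": 0x0303, "TLSv1.3": 0x0304}
NAMED_GROUPS = {"prime256v1": 23, "secp384r1": 24, "X25519": 29}
CIPHER_SUITES = {
    "ECDHE-ECDSA-AES128-GCM-SHA256": 0xC02B,
    "ECDHE-ECDSA-CHACHA20-POLY1305": 0xCCA9,
    "TLS_AES_128_GCM_SHA256": 0x1301,
}


def encode_tls_fields(version, group, cipher):
    """Return (version, group, cipher) as integers; raises KeyError on unknown names."""
    return (TLS_VERSIONS[version], NAMED_GROUPS[group], CIPHER_SUITES[cipher])


encoded = encode_tls_fields("TLSv1.2", "X25519", "ECDHE-ECDSA-CHACHA20-POLY1305")
print(encoded)
```

Each field then fits in a 16-bit integer instead of a variable-length string, at the cost of needing the lookup tables again when the data is read back.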

Thu, Oct 31

Vgutierrez created T236988: ats-be on the text cluster is experiencing broken connections.
Thu, Oct 31, 9:31 AM · Operations, Traffic
Vgutierrez updated the task description for T231627: Move cache text cluster from nginx to ats-tls.
Thu, Oct 31, 5:26 AM · Traffic, Operations
Vgutierrez updated the task description for T231627: Move cache text cluster from nginx to ats-tls.
Thu, Oct 31, 3:22 AM · Traffic, Operations
Vgutierrez updated the task description for T231627: Move cache text cluster from nginx to ats-tls.
Thu, Oct 31, 3:03 AM · Traffic, Operations
Vgutierrez closed T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled as Resolved.

This has been successfully mitigated by 9002f6dfc959fccf527b7c7a3778947496858695

Thu, Oct 31, 1:35 AM · Traffic, Operations
Vgutierrez closed T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled, a subtask of T231627: Move cache text cluster from nginx to ats-tls, as Resolved.
Thu, Oct 31, 1:35 AM · Traffic, Operations

Wed, Oct 30

Vgutierrez updated the task description for T231627: Move cache text cluster from nginx to ats-tls.
Wed, Oct 30, 9:55 AM · Traffic, Operations
Vgutierrez updated the task description for T231627: Move cache text cluster from nginx to ats-tls.
Wed, Oct 30, 9:44 AM · Traffic, Operations
Vgutierrez added a comment to T233661: Publish tls related info to webrequest via varnish.

@JAllemandou it's currently split like this:

-   VCL_Log        CP-TLS-Version: TLSv1.2
-   VCL_Log        CP-TLS-Session-Reused: 0
-   VCL_Log        CP-Key-Exchange: X25519
-   VCL_Log        CP-Auth: ECDSA
-   VCL_Log        CP-Cipher: CHACHA20-POLY1305-SHA256
-   VCL_Log        CP-Full-Cipher: ECDHE-ECDSA-CHACHA20-POLY1305
Wed, Oct 30, 9:21 AM · Patch-For-Review, Analytics-Kanban, observability, Operations, Analytics, Traffic
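For a consumer of these fields, the split shown above could be turned into a dictionary roughly as follows. This is a minimal sketch, assuming the lines keep the "VCL_Log  Key: value" shape from the comment; the function name parse_vcl_log is hypothetical.

```python
def parse_vcl_log(lines):
    """Parse 'VCL_Log  Key: value' lines into a dict, ignoring anything else."""
    fields = {}
    for line in lines:
        # Drop the leading '-' marker and padding, then split off the record tag.
        tokens = line.lstrip("- ").split(None, 1)
        if len(tokens) != 2 or tokens[0] != "VCL_Log":
            continue
        key, sep, value = tokens[1].partition(": ")
        if sep:
            fields[key] = value.strip()
    return fields


# Lines copied from the comment above.
sample = [
    "-   VCL_Log        CP-TLS-Version: TLSv1.2",
    "-   VCL_Log        CP-TLS-Session-Reused: 0",
    "-   VCL_Log        CP-Key-Exchange: X25519",
    "-   VCL_Log        CP-Auth: ECDSA",
    "-   VCL_Log        CP-Cipher: CHACHA20-POLY1305-SHA256",
    "-   VCL_Log        CP-Full-Cipher: ECDHE-ECDSA-CHACHA20-POLY1305",
]

tls_fields = parse_vcl_log(sample)
for key, value in tls_fields.items():
    print(f"{key} = {value}")
```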

Tue, Oct 29

Vgutierrez reopened T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled, a subtask of T231627: Move cache text cluster from nginx to ats-tls, as Open.
Tue, Oct 29, 7:44 AM · Traffic, Operations
Vgutierrez reopened T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled as "Open".

Reopening because the issue hasn't been solved, as can be seen here: https://grafana.wikimedia.org/d/ivPJtZAWz/t236458?orgId=1&from=1571989146430&to=now

Tue, Oct 29, 7:44 AM · Traffic, Operations
Vgutierrez created T236755: Enforce POST size limit on ats-tls.
Tue, Oct 29, 5:54 AM · Traffic, Operations
Vgutierrez created T236754: Discarded VCL files stuck in auto/busy state cause high number of backend probe requests.
Tue, Oct 29, 3:57 AM · Operations, Traffic

Fri, Oct 25

Vgutierrez closed T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled, a subtask of T231627: Move cache text cluster from nginx to ats-tls, as Resolved.
Fri, Oct 25, 9:47 AM · Traffic, Operations
Vgutierrez closed T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled as Resolved.
Fri, Oct 25, 9:47 AM · Traffic, Operations
Vgutierrez updated the task description for T231627: Move cache text cluster from nginx to ats-tls.
Fri, Oct 25, 9:19 AM · Traffic, Operations
Vgutierrez updated the task description for T231627: Move cache text cluster from nginx to ats-tls.
Fri, Oct 25, 8:19 AM · Traffic, Operations
Vgutierrez moved T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled from Triage to TLS on the Traffic board.
Fri, Oct 25, 5:31 AM · Traffic, Operations
Vgutierrez triaged T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled as Normal priority.
Fri, Oct 25, 5:31 AM · Traffic, Operations
Vgutierrez added a comment to T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled.

I'm tracking used TCP sockets on eqsin text nodes in https://grafana.wikimedia.org/d/ivPJtZAWz/t236458?orgId=1&from=now-1h&to=now. I manually applied an SSL handshake timeout on cp5007 at 05:03 UTC

Fri, Oct 25, 5:15 AM · Traffic, Operations
Vgutierrez created T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled.
Fri, Oct 25, 5:00 AM · Traffic, Operations

Thu, Oct 24

Vgutierrez updated the task description for T236217: rack/setup/install dns300[12].
Thu, Oct 24, 11:36 AM · Traffic, DNS, Operations, ops-esams
Vgutierrez updated the task description for T236294: rack/setup/install lvs300[567].
Thu, Oct 24, 11:36 AM · Traffic, Operations, ops-esams
Vgutierrez updated the task description for T233242: rack/setup/install cp30[50-65].esams.wmnet.
Thu, Oct 24, 11:35 AM · Traffic, ops-esams, Operations
Vgutierrez updated the task description for T236217: rack/setup/install dns300[12].
Thu, Oct 24, 6:08 AM · Traffic, DNS, Operations, ops-esams
Vgutierrez updated the task description for T236217: rack/setup/install dns300[12].
Thu, Oct 24, 6:05 AM · Traffic, DNS, Operations, ops-esams
Vgutierrez updated the task description for T236216: rack/setup/install ganeti300[123].
Thu, Oct 24, 6:03 AM · Operations, ops-esams
Vgutierrez updated the task description for T236216: rack/setup/install ganeti300[123].
Thu, Oct 24, 5:58 AM · Operations, ops-esams
Vgutierrez updated the task description for T236294: rack/setup/install lvs300[567].
Thu, Oct 24, 5:57 AM · Traffic, Operations, ops-esams
Vgutierrez updated the task description for T236294: rack/setup/install lvs300[567].
Thu, Oct 24, 5:52 AM · Traffic, Operations, ops-esams
Vgutierrez updated the task description for T233242: rack/setup/install cp30[50-65].esams.wmnet.
Thu, Oct 24, 5:51 AM · Traffic, ops-esams, Operations
Vgutierrez updated the task description for T233242: rack/setup/install cp30[50-65].esams.wmnet.
Thu, Oct 24, 5:45 AM · Traffic, ops-esams, Operations
Vgutierrez updated the task description for T233242: rack/setup/install cp30[50-65].esams.wmnet.
Thu, Oct 24, 5:39 AM · Traffic, ops-esams, Operations
Vgutierrez updated the task description for T233242: rack/setup/install cp30[50-65].esams.wmnet.
Thu, Oct 24, 5:31 AM · Traffic, ops-esams, Operations

Wed, Oct 23

Vgutierrez added a comment to T233274: ATS lua script reload doesn't work as expected.

The solution proposed in https://gerrit.wikimedia.org/r/543022 doesn't work as expected due to a bug in ATS: after a config reload, the Lua script loses the argtb

Wed, Oct 23, 10:35 AM · Patch-For-Review, Operations, Traffic
Vgutierrez reopened T233274: ATS lua script reload doesn't work as expected as "Open".
Wed, Oct 23, 10:34 AM · Patch-For-Review, Operations, Traffic
Vgutierrez added a comment to T234803: Provide an easy way of picking the traffic serving TLS certificate used by ATS.

Notes from IRC, etc:
The current patch (merging shortly: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/541220/ ) gets us part of the way there, enough to deploy dual commercial certs again across ATS + nginx. What we need to do beyond that in the near to medium term is:

  • Reduce some of the duplication in the hieradata (the common SNI-check blocks, etc)
  • Cover deployment of the lets-encrypt / acme_chief case to all nodes based on it being in the defined set of cert options (currently it's only deployed in eqsin; turning it on elsewhere via ucv hieradata wouldn't work)
  • Fix the issues around ATS not handling a switch of directory paths with a simple reload, since commercial and LE cert options currently have distinct directory paths. This means we'll have to create some universal paths for ATS's access to the unified cert/key/ocsp which are symlinks out to the appropriate commercial or legacy paths.
Wed, Oct 23, 9:46 AM · Operations, Traffic
Vgutierrez added a comment to T234887: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe.

I've depooled cp5007 to conduct some experiments. I captured the varnish-fe traffic with the tcpdump filter "portrange 3120-3127" and triggered a failed fetch with the following request:

curl --resolve en.wikipedia.org:443:127.0.0.1 -X POST -d "test" -H "Content-Length: 10" "https://en.wikipedia.org/w/api.php?action=cspreport&format=json&reportonly=1&vgutierrez=1&host=cp5007&h2=1&random=$RANDOM" -v
Wed, Oct 23, 5:04 AM · Patch-For-Review, Traffic, Operations

Tue, Oct 22

Vgutierrez added a comment to T234887: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe.

I highly suspect that's related to stricter timeouts on ats-be compared to varnish-be and ats-tls; that should be solved soon, once https://gerrit.wikimedia.org/r/c/operations/puppet/+/541524 is merged

Tue, Oct 22, 1:05 PM · Patch-For-Review, Traffic, Operations
Ladsgroup awarded T236120: Get rid of nginx puppetization for cache upload a Like token.
Tue, Oct 22, 10:03 AM · Traffic, Operations
Vgutierrez closed T233274: ATS lua script reload doesn't work as expected as Resolved.
Tue, Oct 22, 7:56 AM · Patch-For-Review, Operations, Traffic
Vgutierrez created T236120: Get rid of nginx puppetization for cache upload.
Tue, Oct 22, 5:54 AM · Traffic, Operations
Vgutierrez closed T231433: Move cache upload cluster from nginx to ats-tls, a subtask of T170567: Support TLSv1.3, as Resolved.
Tue, Oct 22, 5:54 AM · Performance-Team (Radar), Goal, Patch-For-Review, Operations, Traffic
Vgutierrez closed T231433: Move cache upload cluster from nginx to ats-tls as Resolved.
Tue, Oct 22, 5:54 AM · Traffic, Operations
Vgutierrez updated the task description for T231433: Move cache upload cluster from nginx to ats-tls.
Tue, Oct 22, 5:47 AM · Traffic, Operations
Vgutierrez updated the task description for T231433: Move cache upload cluster from nginx to ats-tls.
Tue, Oct 22, 5:24 AM · Traffic, Operations
Vgutierrez updated the task description for T231433: Move cache upload cluster from nginx to ats-tls.
Tue, Oct 22, 5:16 AM · Traffic, Operations
Vgutierrez updated the task description for T231433: Move cache upload cluster from nginx to ats-tls.
Tue, Oct 22, 5:05 AM · Traffic, Operations