Vgutierrez (Valentín Gutiérrez)
Staff Site Reliability Engineer, Traffic Team

Projects (6)

User Details

User Since
Feb 12 2018, 9:51 AM (319 w, 3 d)
Availability
Busy until Jun 2.
IRC Nick
vgutierrez
LDAP User
Vgutierrez
MediaWiki User
VGutiérrez (WMF) [ Global Accounts ]

Recent Activity

Feb 12 2024

Vgutierrez triaged T357257: Use IPIP encapsulation on lvs<-->upload cluster as Medium priority.
Feb 12 2024, 8:15 AM · Traffic
Vgutierrez moved T357258: Release tcp-mss-clamper for bullseye from Backlog to Traffic team actively servicing on the Traffic board.
Feb 12 2024, 7:26 AM · Traffic
Vgutierrez triaged T357258: Release tcp-mss-clamper for bullseye as Medium priority.
Feb 12 2024, 7:25 AM · Traffic
Vgutierrez created T357258: Release tcp-mss-clamper for bullseye.
Feb 12 2024, 7:25 AM · Traffic
Vgutierrez created T357257: Use IPIP encapsulation on lvs<-->upload cluster.
Feb 12 2024, 7:24 AM · Traffic

Feb 7 2024

Vgutierrez added a comment to T356025: A poor internet connection should not result in a HTTP 503 error.

Sadly, varnish is not able to distinguish a client that goes away earlier than expected (because of poor Internet access) and triggers a backend fetch error from an actual backend fetch error, where the client connection is healthy but varnish is unable to reach the backend server.

Feb 7 2024, 7:05 AM · SRE, Traffic

Feb 6 2024

Vgutierrez added a comment to T356792: Problem downloading large files from analytics.wikimedia.org.

https://github.com/wikimedia/operations-puppet/blob/1a6c9d13ee7a499ee7a28e47449774a6a6dcdccc/modules/envoyproxy/manifests/tls_terminator.pp#L147 could be the culprit.

Feb 6 2024, 6:23 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)
Vgutierrez moved T356792: Problem downloading large files from analytics.wikimedia.org from Backlog to Radar/Not for service by Traffic on the Traffic board.

@BTullis it's origin related:

Feb 6 2024, 6:02 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)
Vgutierrez added a comment to T356792: Problem downloading large files from analytics.wikimedia.org.

I can reproduce via text@drmrs, I'll take a look ASAP :)

Feb 6 2024, 5:57 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)

Feb 4 2024

Vgutierrez triaged T356598: Track Linux key retention service as Medium priority.
Feb 4 2024, 2:43 PM · Traffic
Vgutierrez created T356598: Track Linux key retention service.
Feb 4 2024, 2:42 PM · Traffic

Jan 31 2024

Vgutierrez added a comment to T354424: HAProxy 2.6.16/2.8.5 CPU spikes on cp3066.

Fix already released on HAProxy 2.9: https://www.mail-archive.com/haproxy@formilux.org/msg44547.html

Jan 31 2024, 5:12 PM · Upstream, Traffic

Jan 30 2024

Vgutierrez added a comment to T355905: Restarting fifo-log-demux should not restart nginx.

IIRC that was done to smooth the reimage process and first puppet run on various roles using fifo-log-demux.

Jan 30 2024, 8:52 AM · Traffic

Jan 23 2024

Vgutierrez renamed T354424: HAProxy 2.6.16/2.8.5 CPU spikes on cp3066 from HAProxy 2.6.16 CPU spikes on cp3066 to HAProxy 2.6.16/2.8.5 CPU spikes on cp3066.
Jan 23 2024, 3:40 PM · Upstream, Traffic

Jan 22 2024

Vgutierrez added a comment to T354424: HAProxy 2.6.16/2.8.5 CPU spikes on cp3066.

as suggested by Willy Tarreau on https://github.com/haproxy/haproxy/issues/2403#issuecomment-1900111538 this issue could be easier to debug on HAProxy 2.8

Jan 22 2024, 10:48 AM · Upstream, Traffic

Jan 18 2024

Vgutierrez created T355359: ipip-multiqueue-optimizer won't start on server reboot.
Jan 18 2024, 7:55 PM · Traffic
Vgutierrez triaged T355308: WikiFunctions: Domain Verification for Google Search Console as Medium priority.
Jan 18 2024, 11:52 AM · Traffic

Jan 12 2024

Vgutierrez added a comment to T346350: Add revision ID to X-Analytics header.
  1. Only the app servers know the revision ID of the page that's being requested therefore the app servers have to propagate the information in the header

Tagged SRE Traffic per https://office.wikimedia.org/wiki/Team_interfaces/SRE_-_Traffic/Request

@KOfori: The request for your team is to add revision ID of wiki pages to X-Analytics header, ideally earlier in January.

Jan 12 2024, 9:26 AM · MW-1.42-notes (1.42.0-wmf.17; 2024-02-06), Moderator-Tools-Team (Kanban), Traffic, good first task, Data Products, Product-Analytics, Automoderator

Jan 11 2024

Vgutierrez created T354888: go mod error attempting to install blubber via go tools.
Jan 11 2024, 5:12 PM · Release-Engineering-Team, Release Pipeline (Blubber)
Vgutierrez closed T353657: tcp-mss-clamper doesn't work on bullseye / kernel 5.10 as Resolved.
Jan 11 2024, 9:43 AM · Traffic

Jan 10 2024

Vgutierrez triaged T354718: Show a better error page when returning an HTTP 429, not the "Our servers are currently under maintenance" one for 5xxs as Medium priority.
Jan 10 2024, 2:20 PM · Traffic, SRE
Vgutierrez updated subscribers of T354721: prometheus-lvs-realserver-mss crashed on ncredir2002.

apparently get_mss failed to get|capture a SYN/ACK:

if synack is None or synack[TCP] is None:
    print(f"[!] Unexpected answer: {synack}", file=sys.stderr)
    return None
Jan 10 2024, 8:43 AM · Traffic
Vgutierrez triaged T354721: prometheus-lvs-realserver-mss crashed on ncredir2002 as Medium priority.
Jan 10 2024, 8:28 AM · Traffic
Vgutierrez created T354721: prometheus-lvs-realserver-mss crashed on ncredir2002.
Jan 10 2024, 8:27 AM · Traffic

Jan 5 2024

Vgutierrez added a project to T354424: HAProxy 2.6.16/2.8.5 CPU spikes on cp3066: Upstream.
Jan 5 2024, 3:46 PM · Upstream, Traffic
Vgutierrez triaged T354424: HAProxy 2.6.16/2.8.5 CPU spikes on cp3066 as Medium priority.
Jan 5 2024, 3:46 PM · Upstream, Traffic
Vgutierrez created T354424: HAProxy 2.6.16/2.8.5 CPU spikes on cp3066.
Jan 5 2024, 3:46 PM · Upstream, Traffic
Vgutierrez added a comment to T353279: CI on gitlab for eBPF / networking heavy projects.

https://gitlab.wikimedia.org/repos/sre/tcp-mss-clamper/-/merge_requests/11 performs integration tests using qemu to spawn a bullseye and a bookworm kernel and then performs basic eBPF tasks; this should be enough for now.

Jan 5 2024, 3:28 PM · Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic

Dec 20 2023

Vgutierrez added a project to T353760: pybal_monitor_down_results_total metric only created when PyBal goes down: PyBal.
Dec 20 2023, 4:23 PM · PyBal, Traffic
Vgutierrez updated subscribers of T353779: sre.dns.roll-restart-reboot-wikimedia-dns cookbook sometimes cannot remove downtime.
Dec 20 2023, 3:50 PM · Patch-For-Review, Traffic

Dec 18 2023

Vgutierrez closed T352876: cp4037 reimage for cookbook getting stuck at PXE boot as Resolved.
Dec 18 2023, 4:13 PM · Traffic, DC-Ops
Vgutierrez closed T352876: cp4037 reimage for cookbook getting stuck at PXE boot, a subtask of T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting, as Resolved.
Dec 18 2023, 4:13 PM · Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad
Vgutierrez triaged T353657: tcp-mss-clamper doesn't work on bullseye / kernel 5.10 as Medium priority.
Dec 18 2023, 4:03 PM · Traffic
Vgutierrez created T353657: tcp-mss-clamper doesn't work on bullseye / kernel 5.10.
Dec 18 2023, 4:03 PM · Traffic
Vgutierrez closed T352242: Provide second acmechief server configured for Puppet 7 in eqiad as Resolved.

@MoritzMuehlenhoff Sorry about that, acmechief1002 is now ready for service :)

Dec 18 2023, 9:16 AM · Acme-chief, Traffic
Vgutierrez reopened T352242: Provide second acmechief server configured for Puppet 7 in eqiad as "Open".

There's one active alert, is that known/expected?

FILE_AGE CRITICAL: File not found - /var/lib/acme-chief/certs/.rsync.status

Dec 18 2023, 9:02 AM · Acme-chief, Traffic

Dec 13 2023

Vgutierrez added a comment to T353279: CI on gitlab for eBPF / networking heavy projects.

Docker is definitely not a valid option here since we need to test against several kernels (at least 5.10 and 6.1)

Dec 13 2023, 8:17 PM · Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic
Vgutierrez triaged T352876: cp4037 reimage for cookbook getting stuck at PXE boot as Medium priority.

@Papaul I see that you triggered the cookbook last week. Are you stuck with something? Do you need help from our side? It would be great to get this host back to production before the break.

Dec 13 2023, 12:09 PM · Traffic, DC-Ops
Vgutierrez added a comment to T353279: CI on gitlab for eBPF / networking heavy projects.

@thcipriani being able to run privileged containers seems to be enough, at least for basic eBPF tests (not sure about IPVS setups yet), https://gitlab.wikimedia.org/repos/sre/tcp-mss-clamper/-/merge_requests/10 leverages docker-compose to spawn two containers and validate that tcp-mss-clamper (small eBPF program that performs TCP MSS clamping) is working as expected

Dec 13 2023, 11:32 AM · Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic
Vgutierrez added a comment to T353279: CI on gitlab for eBPF / networking heavy projects.

we currently perform manual tests on developer machines (far from optimal). So if we can spawn our own runner we could run docker containers in privileged mode there? It could be easier and faster than spawning VMs per CI execution

Dec 13 2023, 9:01 AM · Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic

Dec 12 2023

Vgutierrez updated the task description for T353279: CI on gitlab for eBPF / networking heavy projects.
Dec 12 2023, 5:00 PM · Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic
Vgutierrez triaged T353279: CI on gitlab for eBPF / networking heavy projects as Medium priority.
Dec 12 2023, 4:57 PM · Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic
Vgutierrez created T353279: CI on gitlab for eBPF / networking heavy projects.
Dec 12 2023, 4:56 PM · Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic
Vgutierrez added a comment to T352744: OpenSSL 3.x performance issues.

HAProxy 2.9 has been released, introducing AWS-LC support and with some interesting mentions of OpenSSL in its release notes:

Dec 12 2023, 9:36 AM · SRE-swift-storage, Traffic
Vgutierrez added a comment to T352744: OpenSSL 3.x performance issues.

I'm wondering though if we reproduced this with the pilot bookworm cp installation?

The pilot cp bookworm installation on cp4052 (upload@ulsfo) didn't experience the issues described on the OpenSSL GH issue, IMHO kinda expected considering the patterns of traffic in that cluster/DC and the nature of the performance issue

Dec 12 2023, 9:30 AM · SRE-swift-storage, Traffic
Vgutierrez added a comment to T346463: Identify and label prefetch proxy data in our traffic.

VCL patch submitted by @Ottomata (https://gerrit.wikimedia.org/r/c/operations/puppet/+/981352) looks good to me, and @elukey's CR to add a new WebRequest field also looks good (https://gerrit.wikimedia.org/r/c/operations/puppet/+/980912). In terms of deployment, updating the VCL is slightly easier than updating the varnishkafka JSON format: VCL updates don't require restarting varnish, whereas the varnishkafka change requires a varnishkafka restart to be applied.

Dec 12 2023, 9:19 AM · Traffic, Movement-Insights, Data-Engineering

Dec 11 2023

Vgutierrez closed T351069: Enable IPIP encapsulation for ncredir, a subtask of T348837: Investigate IPVS IPIP encapsulation support, as Resolved.
Dec 11 2023, 4:16 PM · Patch-For-Review, SRE, Traffic
Vgutierrez closed T351069: Enable IPIP encapsulation for ncredir as Resolved.
Dec 11 2023, 4:16 PM · Patch-For-Review, SRE, Traffic
Vgutierrez added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.

@akosiaris as mentioned on the meeting we need the following questions answered:

  • Is it OK to clamp all egress traffic on a k8s node?
  • IPIP encapsulation needs rp filtering disabled on the ipip / ip6ip6 interface in order to work, is that something calico supports?
Dec 11 2023, 1:23 PM · serviceops, Traffic

Dec 7 2023

bking awarded T306181: intake-analytics is responsible for up to a 85% of varnish backend fetch errors a Burninate token.
Dec 7 2023, 2:54 PM · Patch-For-Review, Data-Engineering, SRE, Traffic
Vgutierrez created T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.
Dec 7 2023, 8:41 AM · serviceops, Traffic
Vgutierrez closed T348837: Investigate IPVS IPIP encapsulation support as Resolved.
Dec 7 2023, 8:35 AM · Patch-For-Review, SRE, Traffic
Vgutierrez closed T348837: Investigate IPVS IPIP encapsulation support, a subtask of T332027: Replace current L4LB with with Katran-based alternative, as Resolved.
Dec 7 2023, 8:35 AM · Traffic

Dec 5 2023

Vgutierrez updated the task description for T351069: Enable IPIP encapsulation for ncredir.
Dec 5 2023, 5:49 PM · Patch-For-Review, SRE, Traffic
Vgutierrez updated the task description for T342154: Upgrade Traffic hosts to bookworm.
Dec 5 2023, 10:08 AM · Patch-For-Review, Traffic
Vgutierrez triaged T352744: OpenSSL 3.x performance issues as High priority.
Dec 5 2023, 10:07 AM · SRE-swift-storage, Traffic
Vgutierrez created T352744: OpenSSL 3.x performance issues.
Dec 5 2023, 10:06 AM · SRE-swift-storage, Traffic
Vgutierrez added a comment to T352242: Provide second acmechief server configured for Puppet 7 in eqiad.

I don't think it's a problem of load, as our puppetization doesn't balance Puppet API requests between different acme-chief hosts, but as @MoritzMuehlenhoff mentions in the description: if we have some kind of incident on codfw, puppet 7 hosts lose the ability to fetch/refresh acme-chief TLS material.

Dec 5 2023, 7:36 AM · Acme-chief, Traffic
Vgutierrez updated the task description for T351069: Enable IPIP encapsulation for ncredir.
Dec 5 2023, 6:58 AM · Patch-For-Review, SRE, Traffic

Dec 4 2023

Vgutierrez updated the task description for T351069: Enable IPIP encapsulation for ncredir.
Dec 4 2023, 4:20 PM · Patch-For-Review, SRE, Traffic
Vgutierrez added a comment to T351930: HTTP 504 connection timeout error accessing MW API on Beta cluster.

A quick check on deployment-parsoid12 tells that ferm rules on that instance are at fault (host FW is dropping traffic towards port 80):

Dec  4 15:38:53 deployment-parsoid12 ulogd[15249]: [fw-in-drop] IN=eth0 OUT= MAC=fa:16:3e:db:ed:c8:fa:16:3e:ae:f5:88:08:00 SRC=<REDACTED> DST=172.16.4.125 LEN=60 TOS=00 PREC=0x00 TTL=49 ID=36184 DF PROTO=TCP SPT=54462 DPT=80 SEQ=3181076532 ACK=0 WINDOW=64240 SYN URGP=0 MARK=0
Dec 4 2023, 3:43 PM · Abstract Wikipedia team, WikiLambda, Traffic, Beta-Cluster-reproducible, Beta-Cluster-Infrastructure
Vgutierrez lowered the priority of T351930: HTTP 504 connection timeout error accessing MW API on Beta cluster from Unbreak Now! to High.

@daniel what's timing out is parsoid-external-ci-access.beta.wmflabs.org:

$ curl --http1.1 https://sr.wikipedia.beta.wmflabs.org/api/rest_v1/page/html/RESTBase_Testing_Page -v -o /dev/null -s 2>&1 |grep HTTP/1.1
* using HTTP/1.1
> GET /api/rest_v1/page/html/RESTBase_Testing_Page HTTP/1.1
< HTTP/1.1 200 OK
Dec 4 2023, 11:56 AM · Abstract Wikipedia team, WikiLambda, Traffic, Beta-Cluster-reproducible, Beta-Cluster-Infrastructure

Nov 30 2023

Vgutierrez added a comment to T351930: HTTP 504 connection timeout error accessing MW API on Beta cluster.

api.php is currently handled by deployment-mediawiki11 and that instance is unreachable ATM:

vgutierrez@deployment-cache-text08:~$ sudo cat /etc/trafficserver/remap.config |fgrep api.php
regex_map http://(.*)/w/api.php http://deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud/w/api.php @plugin=/usr/lib/trafficserver/modules/tslua.so @pparam=/etc/trafficserver/lua/rb-mw-mangling.lua
vgutierrez@deployment-cache-text08:~$ nc -w 3 -zv deployment-mediawiki11 80
nc: connect to deployment-mediawiki11 (172.16.3.203) port 80 (tcp) timed out: Operation now in progress
Nov 30 2023, 2:45 PM · Abstract Wikipedia team, WikiLambda, Traffic, Beta-Cluster-reproducible, Beta-Cluster-Infrastructure
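
The `nc -w 3 -zv` probe above can be reproduced from a script. The following is a minimal sketch using only the Python standard library; the function name `port_reachable` is an illustrative choice and not part of any existing tooling.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name resolution failures.
        return False
```

Calling `port_reachable("deployment-mediawiki11", 80)` from the cache host would be the rough equivalent of the `nc` check; unlike `nc -v`, this sketch does not report why the connection failed.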

Nov 29 2023

Vgutierrez updated the task description for T351069: Enable IPIP encapsulation for ncredir.
Nov 29 2023, 12:34 PM · Patch-For-Review, SRE, Traffic
Vgutierrez updated the task description for T351069: Enable IPIP encapsulation for ncredir.
Nov 29 2023, 12:26 PM · Patch-For-Review, SRE, Traffic
Vgutierrez updated the task description for T351069: Enable IPIP encapsulation for ncredir.
Nov 29 2023, 12:24 PM · Patch-For-Review, SRE, Traffic
Vgutierrez closed T352160: RP filtering drops requests incoming via IPIP tunnels on ncredir realservers, a subtask of T351069: Enable IPIP encapsulation for ncredir, as Resolved.
Nov 29 2023, 12:24 PM · Patch-For-Review, SRE, Traffic
Vgutierrez closed T352160: RP filtering drops requests incoming via IPIP tunnels on ncredir realservers as Resolved.
Nov 29 2023, 12:24 PM · Patch-For-Review, SRE, Traffic
Vgutierrez closed T352249: ipip-multiqueue-optimizer should unload eBPF programs on service stop as Resolved.
Nov 29 11:04:34 lvs4008 ipip-multiqueue-optimizer[1925397]: 2023/11/29 11:04:34 Attaching IPIP Multiqueue Optimizer on vlan1201
Nov 29 11:04:34 lvs4008 ipip-multiqueue-optimizer[1925397]: 2023/11/29 11:04:34 Attaching IPIP Multiqueue Optimizer on enp175s0f0np0
Nov 29 11:04:43 lvs4008 ipip-multiqueue-optimizer[1925397]: 2023/11/29 11:04:43 Exiting
Nov 29 11:04:43 lvs4008 ipip-multiqueue-optimizer[1925397]: 2023/11/29 11:04:43 Detaching IPIP Multiqueue Optimizer from enp175s0f0np0
Nov 29 11:04:43 lvs4008 systemd[1]: Stopping eBPF based IPIP Multiqueue Optimizer...
Nov 29 11:04:43 lvs4008 ipip-multiqueue-optimizer[1925397]: 2023/11/29 11:04:43 Detaching IPIP Multiqueue Optimizer from vlan1201

working as expected now:

vgutierrez@lvs4008:~$ sudo -i bpftool prog |grep ipip
23278: sched_cls  name ipip_optimizer  tag 575f8397462edb4d  gpl
23279: sched_cls  name ipip_optimizer  tag 575f8397462edb4d  gpl
vgutierrez@lvs4008:~$ sudo -i systemctl stop ipip-multiqueue-optimizer.service 
vgutierrez@lvs4008:~$ sudo -i bpftool prog |grep ipip
[no output]
Nov 29 2023, 11:06 AM · Traffic
Vgutierrez triaged T352249: ipip-multiqueue-optimizer should unload eBPF programs on service stop as Medium priority.
Nov 29 2023, 9:18 AM · Traffic
Vgutierrez created T352249: ipip-multiqueue-optimizer should unload eBPF programs on service stop.
Nov 29 2023, 9:18 AM · Traffic

Nov 28 2023

Vgutierrez updated the task description for T352160: RP filtering drops requests incoming via IPIP tunnels on ncredir realservers.
Nov 28 2023, 11:59 AM · Patch-For-Review, SRE, Traffic
Vgutierrez triaged T352160: RP filtering drops requests incoming via IPIP tunnels on ncredir realservers as High priority.
Nov 28 2023, 10:44 AM · Patch-For-Review, SRE, Traffic
Vgutierrez created T352160: RP filtering drops requests incoming via IPIP tunnels on ncredir realservers.
Nov 28 2023, 10:43 AM · Patch-For-Review, SRE, Traffic
Vgutierrez updated the task description for T351069: Enable IPIP encapsulation for ncredir.
Nov 28 2023, 10:41 AM · Patch-For-Review, SRE, Traffic
Vgutierrez closed T352143: Firewall rules prevent IPIP/IP6IP6 encapsulated traffic from reaching realservers, a subtask of T351069: Enable IPIP encapsulation for ncredir, as Resolved.
Nov 28 2023, 10:40 AM · Patch-For-Review, SRE, Traffic
Vgutierrez closed T352143: Firewall rules prevent IPIP/IP6IP6 encapsulated traffic from reaching realservers as Resolved.
Nov 28 2023, 10:40 AM · SRE, Traffic
Vgutierrez added a comment to T352143: Firewall rules prevent IPIP/IP6IP6 encapsulated traffic from reaching realservers.

using the syntax on the good old iptables, this should work:

iptables -A INPUT -s 172.16.0.0/10 -p ipencap -j ACCEPT
ip6tables -A INPUT -s 0100::/64 -p ipv6 -j ACCEPT
Nov 28 2023, 8:45 AM · SRE, Traffic
Vgutierrez triaged T352143: Firewall rules prevent IPIP/IP6IP6 encapsulated traffic from reaching realservers as High priority.
Nov 28 2023, 8:37 AM · SRE, Traffic
Vgutierrez created T352143: Firewall rules prevent IPIP/IP6IP6 encapsulated traffic from reaching realservers.
Nov 28 2023, 8:37 AM · SRE, Traffic

Nov 27 2023

Vgutierrez updated the task description for T351069: Enable IPIP encapsulation for ncredir.
Nov 27 2023, 8:29 PM · Patch-For-Review, SRE, Traffic
Vgutierrez added a comment to T351069: Enable IPIP encapsulation for ncredir.

thx @ayounsi we will go with option 1:

  • IPv4: 1500 - 20 (IP) - 20 (IP) - 20 (TCP) = 1440 bytes
  • IPv6: 1500 - 40 (IPv6) - 40 (IPv6) - 20 (TCP) = 1400 bytes
Nov 27 2023, 3:08 PM · Patch-For-Review, SRE, Traffic
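
The arithmetic in the comment above can be written as a small helper. This is only a sketch: it assumes fixed-size headers (no IP options, no IPv6 extension headers, no TCP options), and the function and constant names are illustrative.

```python
# Header sizes in bytes, assuming no options or extension headers.
IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20


def clamped_mss(mtu: int, ip_header: int) -> int:
    """Return the largest TCP MSS that still fits after IPIP/IP6IP6 encapsulation.

    The IP header is counted twice: once for the inner packet and once for
    the outer (tunnel) header added in front of it.
    """
    return mtu - 2 * ip_header - TCP_HEADER


print("IPv4:", clamped_mss(1500, IPV4_HEADER))
print("IPv6:", clamped_mss(1500, IPV6_HEADER))
```

With an MTU of 1500 this yields the 1440 and 1400 byte values listed in the comment.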

Nov 24 2023

Vgutierrez added a comment to T351069: Enable IPIP encapsulation for ncredir.

@ayounsi what would be the required TCP MSS clamping values? Per https://phabricator.wikimedia.org/T348837#9256494 it seems that ~1400 bytes for both IPv4/IPv6 should be OK?

Nov 24 2023, 4:23 PM · Patch-For-Review, SRE, Traffic
Vgutierrez added a comment to T351876: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB).

nginx didn't enforce a timeout for the whole request but just a timeout (180s) between reads from the server, so that won't be enough. To mimic the behavior you need to set the response timeout to 0 and stream_idle_timeout to 180s (dunno if the latter is supported by our puppetization)

Nov 24 2023, 11:34 AM · SRE-swift-storage, Traffic, Commons
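
The difference between a whole-response timeout and an idle timeout can be shown with a toy model. This is a sketch for illustration only, not Envoy's or nginx's implementation; the function and parameter names are hypothetical, and a value of 0 or None is taken to mean "disabled".

```python
from typing import Optional, Sequence


def transfer_completes(
    chunk_times: Sequence[float],
    response_timeout: Optional[float],
    idle_timeout: Optional[float],
) -> bool:
    """Decide whether a streamed response survives both timeouts.

    chunk_times: seconds since the request started at which each chunk
                 arrives, in ascending order; the last entry ends the transfer.
    response_timeout: limit on the whole response; 0 or None disables it.
    idle_timeout: limit on the gap between consecutive chunks; 0 or None disables it.
    """
    previous = 0.0
    for t in chunk_times:
        if idle_timeout and t - previous > idle_timeout:
            return False  # too long without receiving any data
        if response_timeout and t > response_timeout:
            return False  # deadline for the whole response exceeded
        previous = t
    return True


# A large file that keeps streaming a chunk every 10 seconds for 5 minutes.
steady = [10.0 * i for i in range(1, 31)]
print(transfer_completes(steady, response_timeout=65, idle_timeout=None))
print(transfer_completes(steady, response_timeout=0, idle_timeout=180))
```

In this model a steady but long download is cut by the 65s whole-response limit, while it completes when only the 180s idle limit applies.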
Vgutierrez added a comment to T351876: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB).

envoy sets an upstream response timeout by default at 65s (https://github.com/wikimedia/operations-puppet/blob/397c454bbad404c9667c6f63f86e993b1970af8a/modules/envoyproxy/manifests/tls_terminator.pp#L147); it needs to be adjusted properly to allow the transfer of big files

Nov 24 2023, 11:25 AM · SRE-swift-storage, Traffic, Commons
Vgutierrez updated subscribers of T351876: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB).

The file is over the CDN size threshold (1G) so it will hit swift every time that it needs to be fetched. Could it be related to the work done by @MatthewVernon on T317616?

Nov 24 2023, 11:10 AM · SRE-swift-storage, Traffic, Commons

Nov 23 2023

Vgutierrez reopened T320636: smart-data-dump fails occasionally due to facter timeouts as "Open".

All the impacted hosts are indeed running puppet 7:

vgutierrez@cumin1001:~$ sudo -i cumin '*' 'journalctl -u export_smart_data_dump.service --since=-7d |grep -q timeout && echo impacted || echo healthy && puppet --version'
[...]
===== NODE GROUP =====                                                                    
(117) an-conf1003.eqiad.wmnet,an-presto[1001,1010-1013,1015].eqiad.wmnet,an-test-worker1001.eqiad.wmnet,an-worker[1081,1091,1100,1105-1106,1109,1113,1116,1120-1121,1128-1129,1131,1134-1135,1142,1156].eqiad.wmnet,analytics[1074-1075].eqiad.wmnet,backup[2006-2007].codfw.wmnet,backup1001.eqiad.wmnet,bast2003.wikimedia.org,cassandra-dev2001.codfw.wmnet,clouddb1015.eqiad.wmnet,clouddumps1001.wikimedia.org,cloudelastic[1007,1010].wikimedia.org,cloudgw2003-dev.codfw.wmnet,cloudservices2004-dev.codfw.wmnet,cloudvirt[2002,2005]-dev.codfw.wmnet,cloudvirt[1033-1034,1036-1038,1043,1048,1053,1062,1064].eqiad.wmnet,cloudvirt-wdqs1003.eqiad.wmnet,cp4037.ulsfo.wmnet,dbprov2003.codfw.wmnet,dbstore1007.eqiad.wmnet,dumpsdata1007.eqiad.wmnet,ganeti[2010,2012].codfw.wmnet,ganeti[1011,1021,1023,1025,1032].eqiad.wmnet,ganeti[5005,5007].eqsin.wmnet,ganeti4005.ulsfo.wmnet,ganeti-test2002.codfw.wmnet,ganeti-test[1001-1002].eqiad.wmnet,gitlab-runner[2002,2004].codfw.wmnet,graphite2004.codfw.wmnet,kafka-jumbo[1011-1012].eqiad.wmnet,kafka-main[2001,2005].codfw.wmnet,krb2002.codfw.wmnet,kubernetes[2012-2013,2028,2032,2034-2035,2037,2046,2052-2053,2055].codfw.wmnet,kubernetes[1007,1017,1024,1028,1037,1039,1045,1051].eqiad.wmnet,kubestage2002.codfw.wmnet,kubestage1004.eqiad.wmnet,logstash2001.codfw.wmnet,lvs1013.eqiad.wmnet,ml-cache2001.codfw.wmnet,ml-cache1003.eqiad.wmnet,ml-serve[2005-2006].codfw.wmnet,ml-serve[1002,1006].eqiad.wmnet,ms-backup2001.codfw.wmnet,ms-be2051.codfw.wmnet,pki[1001-1002].eqiad.wmnet,puppetdb2003.codfw.wmnet,puppetdb1003.eqiad.wmnet,puppetserver1002.eqiad.wmnet,rdb1014.eqiad.wmnet,relforge1003.eqiad.wmnet,sessionstore2001.codfw.wmnet,titan1001.eqiad.wmnet,wdqs1009.eqiad.wmnet
----- OUTPUT of 'journalctl -u ex...puppet --version' -----                               
impacted                                                                                  
7.23.0
Nov 23 2023, 9:45 AM · Puppet (Puppet 7.0), SRE Observability (FY2022/2023-Q2), Observability-Alerting
Vgutierrez reopened T251293: Facter is slow on a few hosts as "Open".

It looks like we are having some issues with the raid fact:

Nov 22 23:47:01 cp4037 smart-data-dump[748598]: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status 1.
                                                Traceback (most recent call last):
                                                  File "/usr/local/sbin/smart-data-dump", line 124, in _check_output
                                                    return subprocess.check_output(cmd, stderr=stderr) \
                                                  File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
                                                    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
                                                  File "/usr/lib/python3.9/subprocess.py", line 528, in run
                                                    raise CalledProcessError(retcode, process.args,
                                                subprocess.CalledProcessError: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status 1.
Nov 22 23:47:01 cp4037 smart-data-dump[748598]: Traceback (most recent call last):
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/local/sbin/smart-data-dump", line 460, in <module>
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     sys.exit(main())
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/local/sbin/smart-data-dump", line 429, in main
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     raid_drivers = get_raid_drivers()
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/local/sbin/smart-data-dump", line 136, in get_raid_drivers
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     raw_output = _check_output(command, timeout=120, stderr=subprocess.DEVNULL)
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/local/sbin/smart-data-dump", line 124, in _check_output
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     return subprocess.check_output(cmd, stderr=stderr) \
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/lib/python3.9/subprocess.py", line 528, in run
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     raise CalledProcessError(retcode, process.args,
Nov 22 23:47:01 cp4037 smart-data-dump[748598]: subprocess.CalledProcessError: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status 1.
Nov 23 2023, 9:36 AM · Infrastructure-Foundations, Puppet, SRE
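
The traceback above shows the script dying on an unhandled `subprocess.CalledProcessError`. GNU `timeout` exits with status 124 when it stops a command for exceeding the time limit, so the exit status 1 in the log hints that `raid.rb` itself failed rather than hitting the 120s limit. The sketch below shows one way such a wrapper could tell the two cases apart and keep running; `run_fact` is a hypothetical helper, not the actual `smart-data-dump` code.

```python
import subprocess
import sys
from typing import List, Optional

# Exit status GNU timeout returns when the time limit is reached.
TIMEOUT_EXIT_STATUS = 124


def run_fact(cmd: List[str]) -> Optional[bytes]:
    """Run cmd and return its stdout, or None if it fails or times out."""
    try:
        return subprocess.check_output(cmd, stderr=subprocess.DEVNULL)
    except subprocess.CalledProcessError as exc:
        if exc.returncode == TIMEOUT_EXIT_STATUS:
            reason = "timed out"
        else:
            reason = f"failed with exit status {exc.returncode}"
        # Report the problem instead of letting the exception propagate.
        print(f"[!] {cmd[0]} {reason}", file=sys.stderr)
        return None
```

A caller would then treat a `None` result as "no RAID information available" and decide whether to skip that part of the dump or exit cleanly.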

Nov 22 2023

Vgutierrez updated the task description for T351069: Enable IPIP encapsulation for ncredir.
Nov 22 2023, 9:25 AM · Patch-For-Review, SRE, Traffic
Vgutierrez moved T317616: Revisit CDN<-->Swift communication from Backlog to Radar/Not for service by Traffic on the Traffic board.
Nov 22 2023, 9:23 AM · SRE-swift-storage, SRE, Traffic
Vgutierrez added a comment to T351788: Use standard tls version and ciphers for rsyslog.

Basically we should be using >=TLSv1.2, P-256 as the preferred curve and the following ciphersuites:

'TLS_AES_256_GCM_SHA384',
'TLS_AES_128_GCM_SHA256',
'TLS_CHACHA20_POLY1305_SHA256',
'ECDHE-ECDSA-AES256-GCM-SHA384',
'ECDHE-ECDSA-AES128-GCM-SHA256',
'ECDHE-ECDSA-CHACHA20-POLY1305',
'ECDHE-RSA-AES256-GCM-SHA384',
'ECDHE-RSA-AES128-GCM-SHA256',
'ECDHE-RSA-CHACHA20-POLY1305',
Nov 22 2023, 9:17 AM · Observability-Logging
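
To make the recommendation above concrete, here is a sketch of the same policy expressed with Python's standard `ssl` module. It is not rsyslog configuration, and the helper name `build_server_context` is illustrative. It assumes the underlying OpenSSL build supports these suites; the three TLS 1.3 suites are left at the OpenSSL defaults because `set_ciphers()` only controls TLS 1.2 and earlier.

```python
import ssl

# TLS 1.2 suites from the list above, in OpenSSL naming.
TLS12_CIPHERS = [
    "ECDHE-ECDSA-AES256-GCM-SHA384",
    "ECDHE-ECDSA-AES128-GCM-SHA256",
    "ECDHE-ECDSA-CHACHA20-POLY1305",
    "ECDHE-RSA-AES256-GCM-SHA384",
    "ECDHE-RSA-AES128-GCM-SHA256",
    "ECDHE-RSA-CHACHA20-POLY1305",
]


def build_server_context() -> ssl.SSLContext:
    """Build a server-side context that refuses anything older than TLSv1.2."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Restrict the TLS 1.2 suites; TLS 1.3 suites keep the OpenSSL defaults.
    ctx.set_ciphers(":".join(TLS12_CIPHERS))
    # prime256v1 is OpenSSL's name for the P-256 curve.
    ctx.set_ecdh_curve("prime256v1")
    return ctx
```

A certificate and key would still need to be loaded with `load_cert_chain()` before the context can serve connections.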
Vgutierrez closed T351655: acme-chief service started on a passive node after reimage as Resolved.

Fixed by masking the systemd service before acme-chief package is installed on passive hosts

Nov 22 2023, 9:13 AM · Traffic, Acme-chief

Nov 21 2023

Vgutierrez added a comment to T351710: ossl rsyslog errors post-migration.

nice, but please set a sane TLS configuration :) ideally nothing lower than TLSv1.2 and solid ciphersuites

Nov 21 2023, 4:11 PM · SRE Observability (FY2023/2024-Q4), User-fgiunchedi, Patch-For-Review, Cloud-VPS, SRE, observability
Vgutierrez added a comment to T351710: ossl rsyslog errors post-migration.

@fgiunchedi seems like a mismatch on configured curves between clients and servers, could I suggest providing a more detailed TLS configuration for both rsyslog servers and clients?

Nov 21 2023, 3:00 PM · SRE Observability (FY2023/2024-Q4), User-fgiunchedi, Patch-For-Review, Cloud-VPS, SRE, observability

Nov 20 2023

Vgutierrez updated subscribers of T350353: Parsoid instance on beta not accesible from restbase CI/dev envs.

as mentioned on IRC:

<vgutierrez> it looks like profile::tlsproxy::envoy::ssl_provider should be set to acme for deployment-parsoid12 (after configuring acme-chief there to issue the expected certificate)
Nov 20 2023, 5:31 PM · RESTBase Sunsetting, RESTBase, Beta-Cluster-Infrastructure
Vgutierrez triaged T351655: acme-chief service started on a passive node after reimage as High priority.
Nov 20 2023, 4:30 PM · Traffic, Acme-chief
Vgutierrez created T351655: acme-chief service started on a passive node after reimage.
Nov 20 2023, 4:29 PM · Traffic, Acme-chief
Vgutierrez added a comment to T350353: Parsoid instance on beta not accesible from restbase CI/dev envs.

hmm there must have been some change impacting the kind of certificate used for that endpoint. Right now it's using a WMF PKI issued cert:

---
Certificate chain
 0 s:CN = parsoid.svc.deployment-prep.eqiad1.wikimedia.cloud
   i:C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = discovery
   a:PKEY: id-ecPublicKey, 256 (bit); sigalg: ecdsa-with-SHA512
   v:NotBefore: Nov 19 12:24:00 2023 GMT; NotAfter: Dec 17 12:24:00 2023 GMT
 1 s:C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = discovery
   i:C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = WMF_TEST_CA
   a:PKEY: id-ecPublicKey, 521 (bit); sigalg: ecdsa-with-SHA512
   v:NotBefore: Jan 28 12:02:00 2022 GMT; NotAfter: Jan 27 12:02:00 2027 GMT
---
Nov 20 2023, 11:47 AM · RESTBase Sunsetting, RESTBase, Beta-Cluster-Infrastructure

Nov 13 2023

Vgutierrez triaged T351069: Enable IPIP encapsulation for ncredir as Medium priority.
Nov 13 2023, 9:58 AM · Patch-For-Review, SRE, Traffic
Vgutierrez created T351069: Enable IPIP encapsulation for ncredir.
Nov 13 2023, 9:58 AM · Patch-For-Review, SRE, Traffic