sadly varnish is not able to distinguish between a client that goes away earlier than expected (due to poor Internet access), triggering a backend fetch error, and an actual backend fetch error where the client connection is healthy but varnish is unable to reach the backend server.
Apr 15 2024
Apr 11 2024
Apr 9 2024
Apr 8 2024
Feb 12 2024
Feb 7 2024
Feb 6 2024
@BTullis it's origin related:
I can reproduce via text@drmrs, I'll take a look ASAP :)
Feb 4 2024
Jan 31 2024
Fix already released on HAProxy 2.9: https://www.mail-archive.com/haproxy@formilux.org/msg44547.html
Jan 30 2024
IIRC that was done to smooth the reimage process and first puppet run on various roles using fifo-log-demux.
Jan 23 2024
Jan 22 2024
as suggested by Willy Tarreau on https://github.com/haproxy/haproxy/issues/2403#issuecomment-1900111538 this issue could be easier to debug on HAProxy 2.8
Jan 18 2024
Jan 12 2024
In T346350#9430239, @mpopov wrote:
In T346350#9175586, @phuedx wrote:
- Only the app servers know the revision ID of the page that's being requested; therefore the app servers have to propagate the information in the header
Tagged SRE Traffic per https://office.wikimedia.org/wiki/Team_interfaces/SRE_-_Traffic/Request
@KOfori: The request for your team is to add revision ID of wiki pages to X-Analytics header, ideally earlier in January.
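For illustration only: X-Analytics is a semicolon-separated list of key=value pairs, so propagating the revision ID boils down to appending one more pair. A minimal Python sketch, where the field name "rev_id" is hypothetical (the actual field name was still to be agreed on in the task):

def add_to_x_analytics(header: str, key: str, value: str) -> str:
    # X-Analytics carries key=value pairs separated by ';'
    pairs = [kv for kv in header.split(";") if kv]
    pairs.append(f"{key}={value}")
    return ";".join(pairs)

print(add_to_x_analytics("ns=0;page_id=123", "rev_id", "456789"))
# ns=0;page_id=123;rev_id=456789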
Jan 11 2024
Jan 10 2024
apparently get_mss failed to capture a SYN/ACK:
if synack is None or synack[TCP] is None:
    print(f"[!] Unexpected answer: {synack}", file=sys.stderr)
    return None
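For context, a minimal sketch of what such a SYN/ACK MSS probe can look like with scapy; the real get_mss implementation isn't shown here, so the function name, destination port and option handling below are assumptions:

from scapy.all import IP, TCP, sr1

def probe_mss(dst: str, dport: int = 443):
    # Send a bare SYN and wait for the SYN/ACK
    syn = IP(dst=dst) / TCP(dport=dport, flags="S", options=[("MSS", 1460)])
    synack = sr1(syn, timeout=2, verbose=0)
    # Same failure mode as above: no answer at all, or an answer without a TCP layer
    if synack is None or synack.getlayer(TCP) is None:
        print(f"[!] Unexpected answer: {synack}")
        return None
    # The server's advertised MSS lives in the TCP options of the SYN/ACK
    for name, value in synack[TCP].options:
        if name == "MSS":
            return value
    return None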
Jan 5 2024
https://gitlab.wikimedia.org/repos/sre/tcp-mss-clamper/-/merge_requests/11 performs integration tests using qemu to spawn a bullseye and a bookworm kernel and then perform basic eBPF tasks; this should be enough for now.
Dec 20 2023
Dec 18 2023
@MoritzMuehlenhoff Sorry about that, acmechief1002 is now ready for service :)
In T352242#9412109, @MoritzMuehlenhoff wrote:There's one active alert, is that known/expected?
FILE_AGE CRITICAL: File not found - /var/lib/acme-chief/certs/.rsync.status
Dec 13 2023
Docker is definitely not a valid option here since we need to test against several kernels (at least 5.10 and 6.1)
@Papaul I see that you triggered the cookbook last week. Are you stuck with something? Do you need help from our side? It would be great to get this host back to production before the break
@thcipriani being able to run privileged containers seems to be enough, at least for basic eBPF tests (not sure about IPVS setups yet); https://gitlab.wikimedia.org/repos/sre/tcp-mss-clamper/-/merge_requests/10 leverages docker-compose to spawn two containers and validate that tcp-mss-clamper (a small eBPF program that performs TCP MSS clamping) is working as expected
we currently perform manual tests on developer machines (far from optimal). So if we can spawn our own runner, we could run docker containers in privileged mode there? It could be easier and faster than spawning VMs per CI execution
Dec 12 2023
HAProxy 2.9 has been released, introducing AWS-LC support and with some interesting mentions of OpenSSL in its release notes:
In T352744#9398828, @MoritzMuehlenhoff wrote:I'm wondering though if we reproduced this with the pilot bookworm cp installation?
The pilot cp bookworm installation on cp4052 (upload@ulsfo) didn't experience the issues described in the OpenSSL GH issue, which is IMHO kinda expected considering the traffic patterns in that cluster/DC and the nature of the performance issue
VCL patch submitted by @Ottomata (https://gerrit.wikimedia.org/r/c/operations/puppet/+/981352) looks good to me, @elukey CR to add a new WebRequest field also looks good (https://gerrit.wikimedia.org/r/c/operations/puppet/+/980912). In terms of deployment, updating the VCL is slightly easier than updating the varnishkafka JSON format, as VCL updates don't require restarting varnish while the varnishkafka change requires a varnishkafka restart to be applied.
Dec 11 2023
@akosiaris as mentioned on the meeting we need the following questions answered:
- Is it OK to clamp all egress traffic on a k8s node?
- IPIP encapsulation needs rp filtering disabled on the ipip / ip6ip6 interface in order to work; is that something calico supports?
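For reference, a minimal sketch of what "rp filtering disabled on the tunnel interface" means at the sysctl level; the interface name "ipip0" is an assumption, and in practice this would be handled by configuration management / calico rather than an ad-hoc script:

from pathlib import Path

def disable_rp_filter(iface: str = "ipip0") -> None:
    # 0 = no source validation (reverse-path filtering off) on this interface; requires root
    Path(f"/proc/sys/net/ipv4/conf/{iface}/rp_filter").write_text("0\n")

if __name__ == "__main__":
    disable_rp_filter()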
Dec 7 2023
Dec 5 2023
I don't think it's a problem of load, as our puppetization doesn't balance Puppet API requests between different acme-chief hosts, but as @MoritzMuehlenhoff mentions in the description: if we have some kind of incident in codfw, puppet 7 hosts lose the ability to fetch/refresh acme-chief TLS material.
Dec 4 2023
A quick check on deployment-parsoid12 shows that ferm rules on that instance are at fault (host FW is dropping traffic towards port 80):
Dec 4 15:38:53 deployment-parsoid12 ulogd[15249]: [fw-in-drop] IN=eth0 OUT= MAC=fa:16:3e:db:ed:c8:fa:16:3e:ae:f5:88:08:00 SRC=<REDACTED> DST=172.16.4.125 LEN=60 TOS=00 PREC=0x00 TTL=49 ID=36184 DF PROTO=TCP SPT=54462 DPT=80 SEQ=3181076532 ACK=0 WINDOW=64240 SYN URGP=0 MARK=0
@daniel what's timing out is parsoid-external-ci-access.beta.wmflabs.org:
$ curl --http1.1 https://sr.wikipedia.beta.wmflabs.org/api/rest_v1/page/html/RESTBase_Testing_Page -v -o /dev/null -s 2>&1 |grep HTTP/1.1
* using HTTP/1.1
> GET /api/rest_v1/page/html/RESTBase_Testing_Page HTTP/1.1
< HTTP/1.1 200 OK
Nov 30 2023
api.php is currently handled by deployment-mediawiki11 and that instance is unreachable ATM:
vgutierrez@deployment-cache-text08:~$ sudo cat /etc/trafficserver/remap.config |fgrep api.php
regex_map http://(.*)/w/api.php http://deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud/w/api.php @plugin=/usr/lib/trafficserver/modules/tslua.so @pparam=/etc/trafficserver/lua/rb-mw-mangling.lua
vgutierrez@deployment-cache-text08:~$ nc -w 3 -zv deployment-mediawiki11 80
nc: connect to deployment-mediawiki11 (172.16.3.203) port 80 (tcp) timed out: Operation now in progress
Nov 29 2023
Nov 29 11:04:34 lvs4008 ipip-multiqueue-optimizer[1925397]: 2023/11/29 11:04:34 Attaching IPIP Multiqueue Optimizer on vlan1201
Nov 29 11:04:34 lvs4008 ipip-multiqueue-optimizer[1925397]: 2023/11/29 11:04:34 Attaching IPIP Multiqueue Optimizer on enp175s0f0np0
Nov 29 11:04:43 lvs4008 ipip-multiqueue-optimizer[1925397]: 2023/11/29 11:04:43 Exiting
Nov 29 11:04:43 lvs4008 ipip-multiqueue-optimizer[1925397]: 2023/11/29 11:04:43 Detaching IPIP Multiqueue Optimizer from enp175s0f0np0
Nov 29 11:04:43 lvs4008 systemd[1]: Stopping eBPF based IPIP Multiqueue Optimizer...
Nov 29 11:04:43 lvs4008 ipip-multiqueue-optimizer[1925397]: 2023/11/29 11:04:43 Detaching IPIP Multiqueue Optimizer from vlan1201
working as expected now:
vgutierrez@lvs4008:~$ sudo -i bpftool prog |grep ipip
23278: sched_cls name ipip_optimizer tag 575f8397462edb4d gpl
23279: sched_cls name ipip_optimizer tag 575f8397462edb4d gpl
vgutierrez@lvs4008:~$ sudo -i systemctl stop ipip-multiqueue-optimizer.service
vgutierrez@lvs4008:~$ sudo -i bpftool prog |grep ipip
[no output]
Nov 28 2023
using good old iptables syntax, this should work:
iptables -A INPUT -s 172.16.0.0/10 -p ipencap -j ACCEPT
ip6tables -A INPUT -s 0100::/64 -p ipv6 -j ACCEPT
Nov 27 2023
thx @ayounsi we will go with option 1:
- IPv4: 1500 - 20 (IP) - 20 (IP) - 20 (TCP) = 1440 bytes
- IPv6: 1500 - 40 (IPv6) - 40 (IPv6) - 20 (TCP) = 1400 bytes
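Quick sanity check of that arithmetic in Python (nothing beyond the header sizes above is assumed):

ETH_MTU = 1500
IPV4_HDR, IPV6_HDR, TCP_HDR = 20, 40, 20

# IPv4-in-IPv4: two IPv4 headers plus the TCP header
mss_v4 = ETH_MTU - 2 * IPV4_HDR - TCP_HDR
# IPv6-in-IPv6: two IPv6 headers plus the TCP header
mss_v6 = ETH_MTU - 2 * IPV6_HDR - TCP_HDR

print(mss_v4, mss_v6)  # 1440 1400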
Nov 24 2023
@ayounsi what would be the required TCP MSS clamping values? Per https://phabricator.wikimedia.org/T348837#9256494 it seems that around ~1400 bytes for both IPv4/IPv6 should be ok?
nginx didn't enforce a timeout for the whole request but just a timeout (180s) between reads from the server, so that won't be enough. To mimic the behavior you need to set the response timeout to 0 and stream_idle_timeout to 180s (dunno if the latter is supported by our puppetization)
envoy sets an upstream response timeout of 65s by default (https://github.com/wikimedia/operations-puppet/blob/397c454bbad404c9667c6f63f86e993b1970af8a/modules/envoyproxy/manifests/tls_terminator.pp#L147); it needs to be adjusted properly to allow the transfer of big files
The file is over the CDN size threshold (1G) so it will hit swift every time that it needs to be fetched. Could it be related to the work done by @MatthewVernon on T317616?
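A quick way to confirm the size claim (hedged sketch; the URL is a placeholder and 1G is taken as 1024**3 bytes here):

import urllib.request

CDN_SIZE_THRESHOLD = 1024 ** 3  # 1 GiB, per the threshold mentioned above

def exceeds_cdn_threshold(url: str) -> bool:
    # HEAD request: fetch only the headers and compare Content-Length to the threshold
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        length = int(resp.headers.get("Content-Length", 0))
    return length > CDN_SIZE_THRESHOLD

print(exceeds_cdn_threshold("https://upload.wikimedia.org/path/to/big-file"))  # placeholder URL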
Nov 23 2023
All the impacted hosts are indeed running puppet 7:
vgutierrez@cumin1001:~$ sudo -i cumin '*' 'journalctl -u export_smart_data_dump.service --since=-7d |grep -q timeout && echo impacted || echo healthy && puppet --version'
[...]
===== NODE GROUP =====
(117) an-conf1003.eqiad.wmnet,an-presto[1001,1010-1013,1015].eqiad.wmnet,an-test-worker1001.eqiad.wmnet,an-worker[1081,1091,1100,1105-1106,1109,1113,1116,1120-1121,1128-1129,1131,1134-1135,1142,1156].eqiad.wmnet,analytics[1074-1075].eqiad.wmnet,backup[2006-2007].codfw.wmnet,backup1001.eqiad.wmnet,bast2003.wikimedia.org,cassandra-dev2001.codfw.wmnet,clouddb1015.eqiad.wmnet,clouddumps1001.wikimedia.org,cloudelastic[1007,1010].wikimedia.org,cloudgw2003-dev.codfw.wmnet,cloudservices2004-dev.codfw.wmnet,cloudvirt[2002,2005]-dev.codfw.wmnet,cloudvirt[1033-1034,1036-1038,1043,1048,1053,1062,1064].eqiad.wmnet,cloudvirt-wdqs1003.eqiad.wmnet,cp4037.ulsfo.wmnet,dbprov2003.codfw.wmnet,dbstore1007.eqiad.wmnet,dumpsdata1007.eqiad.wmnet,ganeti[2010,2012].codfw.wmnet,ganeti[1011,1021,1023,1025,1032].eqiad.wmnet,ganeti[5005,5007].eqsin.wmnet,ganeti4005.ulsfo.wmnet,ganeti-test2002.codfw.wmnet,ganeti-test[1001-1002].eqiad.wmnet,gitlab-runner[2002,2004].codfw.wmnet,graphite2004.codfw.wmnet,kafka-jumbo[1011-1012].eqiad.wmnet,kafka-main[2001,2005].codfw.wmnet,krb2002.codfw.wmnet,kubernetes[2012-2013,2028,2032,2034-2035,2037,2046,2052-2053,2055].codfw.wmnet,kubernetes[1007,1017,1024,1028,1037,1039,1045,1051].eqiad.wmnet,kubestage2002.codfw.wmnet,kubestage1004.eqiad.wmnet,logstash2001.codfw.wmnet,lvs1013.eqiad.wmnet,ml-cache2001.codfw.wmnet,ml-cache1003.eqiad.wmnet,ml-serve[2005-2006].codfw.wmnet,ml-serve[1002,1006].eqiad.wmnet,ms-backup2001.codfw.wmnet,ms-be2051.codfw.wmnet,pki[1001-1002].eqiad.wmnet,puppetdb2003.codfw.wmnet,puppetdb1003.eqiad.wmnet,puppetserver1002.eqiad.wmnet,rdb1014.eqiad.wmnet,relforge1003.eqiad.wmnet,sessionstore2001.codfw.wmnet,titan1001.eqiad.wmnet,wdqs1009.eqiad.wmnet
----- OUTPUT of 'journalctl -u ex...puppet --version' -----
impacted
7.23.0
It looks like we are having some issues with the raid fact:
Nov 22 23:47:01 cp4037 smart-data-dump[748598]: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/usr/local/sbin/smart-data-dump", line 124, in _check_output
    return subprocess.check_output(cmd, stderr=stderr) \
  File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status 1.
Nov 22 23:47:01 cp4037 smart-data-dump[748598]: Traceback (most recent call last):
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/local/sbin/smart-data-dump", line 460, in <module>
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     sys.exit(main())
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/local/sbin/smart-data-dump", line 429, in main
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     raid_drivers = get_raid_drivers()
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/local/sbin/smart-data-dump", line 136, in get_raid_drivers
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     raw_output = _check_output(command, timeout=120, stderr=subprocess.DEVNULL)
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/local/sbin/smart-data-dump", line 124, in _check_output
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     return subprocess.check_output(cmd, stderr=stderr) \
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/lib/python3.9/subprocess.py", line 528, in run
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     raise CalledProcessError(retcode, process.args,
Nov 22 23:47:01 cp4037 smart-data-dump[748598]: subprocess.CalledProcessError: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status 1.
Nov 22 2023
Basically we should be using >=TLSv1.2, P-256 as the preferred curve and the following ciphersuites:
'TLS_AES_256_GCM_SHA384', 'TLS_AES_128_GCM_SHA256', 'TLS_CHACHA20_POLY1305_SHA256'
'ECDHE-ECDSA-AES256-GCM-SHA384', 'ECDHE-ECDSA-AES128-GCM-SHA256', 'ECDHE-ECDSA-CHACHA20-POLY1305',
'ECDHE-RSA-AES256-GCM-SHA384', 'ECDHE-RSA-AES128-GCM-SHA256', 'ECDHE-RSA-CHACHA20-POLY1305',
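A quick way to check what protocol/ciphersuite a given endpoint actually negotiates against this policy (hedged sketch using the Python ssl module; the hostname is just an example):

import socket
import ssl

def negotiated_params(hostname: str, port: int = 443):
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # enforce >= TLSv1.2
    with socket.create_connection((hostname, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            # cipher() returns (ciphersuite name, protocol, secret bits)
            return tls.version(), tls.cipher()

print(negotiated_params("en.wikipedia.org"))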
Fixed by masking the systemd service before the acme-chief package is installed on passive hosts