sadly varnish is not able to distinguish between a client that goes away earlier than expected (due to poor Internet access), triggering a backend fetch error, and an actual backend fetch error where the client connection is healthy but varnish is unable to reach the backend server.
Apr 15 2024
Apr 11 2024
Apr 9 2024
Apr 8 2024
Feb 12 2024
Feb 7 2024
Feb 6 2024
@BTullis it's origin related:
I can reproduce via text@drmrs, I'll take a look ASAP :)
Feb 4 2024
Jan 31 2024
Fix already released on HAProxy 2.9: https://www.mail-archive.com/haproxy@formilux.org/msg44547.html
Jan 30 2024
IIRC that was done to smooth the reimage process and first puppet run on various roles using fifo-log-demux.
Jan 23 2024
Jan 22 2024
as suggested by Willy Tarreau on https://github.com/haproxy/haproxy/issues/2403#issuecomment-1900111538 this issue could be easier to debug on HAProxy 2.8
Jan 18 2024
Jan 12 2024
In T346350#9430239, @mpopov wrote:
In T346350#9175586, @phuedx wrote:
- Only the app servers know the revision ID of the page that's being requested; therefore the app servers have to propagate the information in the header
Tagged SRE Traffic per https://office.wikimedia.org/wiki/Team_interfaces/SRE_-_Traffic/Request
@KOfori: The request for your team is to add revision ID of wiki pages to X-Analytics header, ideally earlier in January.
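For illustration only: X-Analytics is a semicolon-separated list of key=value pairs, so propagating the revision ID boils down to appending one more pair. A minimal Python sketch, where the field name "rev_id" is hypothetical (the actual field name was still to be agreed on in the task):

def add_to_x_analytics(header: str, key: str, value: str) -> str:
    # X-Analytics carries key=value pairs separated by ';'
    pairs = [kv for kv in header.split(";") if kv]
    pairs.append(f"{key}={value}")
    return ";".join(pairs)

print(add_to_x_analytics("ns=0;page_id=123", "rev_id", "456789"))
# ns=0;page_id=123;rev_id=456789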
Jan 11 2024
Jan 10 2024
apparently get_mss failed to capture a SYN/ACK:
if synack is None or synack[TCP] is None:
    print(f"[!] Unexpected answer: {synack}", file=sys.stderr)
    return None
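For context, a minimal sketch of what such a SYN/ACK MSS probe can look like with scapy; the real get_mss implementation isn't shown here, so the function name, destination port and option handling below are assumptions:

from scapy.all import IP, TCP, sr1

def probe_mss(dst: str, dport: int = 443):
    # Send a bare SYN and wait for the SYN/ACK
    syn = IP(dst=dst) / TCP(dport=dport, flags="S", options=[("MSS", 1460)])
    synack = sr1(syn, timeout=2, verbose=0)
    # Same failure mode as above: no answer at all, or an answer without a TCP layer
    if synack is None or synack.getlayer(TCP) is None:
        print(f"[!] Unexpected answer: {synack}")
        return None
    # The server's advertised MSS lives in the TCP options of the SYN/ACK
    for name, value in synack[TCP].options:
        if name == "MSS":
            return value
    return None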
Jan 5 2024
https://gitlab.wikimedia.org/repos/sre/tcp-mss-clamper/-/merge_requests/11 performs integration tests using qemu to spawn a bullseye and a bookworm kernel and then perform basic eBPF tasks; this should be enough for now.
Dec 20 2023
Dec 18 2023
@MoritzMuehlenhoff Sorry about that, acmechief1002 is now ready for service :)
In T352242#9412109, @MoritzMuehlenhoff wrote:There's one active alert, is that known/expected?
FILE_AGE CRITICAL: File not found - /var/lib/acme-chief/certs/.rsync.status
Dec 13 2023
Docker is definitely not a valid option here since we need to test against several kernels (at least 5.10 and 6.1)
@Papaul I see that you triggered the cookbook last week. Are you stuck with something? Do you need help from our side? It would be great to get this host back to production before the break
@thcipriani being able to run privileged containers seems to be enough, at least for basic eBPF tests (not sure about IPVS setups yet); https://gitlab.wikimedia.org/repos/sre/tcp-mss-clamper/-/merge_requests/10 leverages docker-compose to spawn two containers and validate that tcp-mss-clamper (a small eBPF program that performs TCP MSS clamping) is working as expected
we currently perform manual tests on developer machines (far from optimal). So if we can spawn our own runner, we could run docker containers in privileged mode there? It could be easier and faster than spawning VMs per CI execution
Dec 12 2023
HAProxy 2.9 has been released, introducing AWS-LC support and with some interesting mentions of OpenSSL in its release notes:
In T352744#9398828, @MoritzMuehlenhoff wrote:I'm wondering though if we reproduced this with the pilot bookworm cp installation?
The pilot cp bookworm installation on cp4052 (upload@ulsfo) didn't experience the issues described in the OpenSSL GH issue, which is IMHO kinda expected considering the traffic patterns in that cluster/DC and the nature of the performance issue
VCL patch submitted by @Ottomata (https://gerrit.wikimedia.org/r/c/operations/puppet/+/981352) looks good to me, @elukey CR to add a new WebRequest field also looks good (https://gerrit.wikimedia.org/r/c/operations/puppet/+/980912). In terms of deployment, updating the VCL is slightly easier than updating the varnishkafka JSON format, as VCL updates don't require restarting varnish while the varnishkafka change requires a varnishkafka restart to be applied.
Dec 11 2023
@akosiaris as mentioned on the meeting we need the following questions answered:
- Is it OK to clamp all egress traffic on a k8s node?
- IPIP encapsulation needs rp filtering disabled on the ipip / ip6ip6 interface in order to work; is that something calico supports?
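For reference, a minimal sketch of what "rp filtering disabled on the tunnel interface" means at the sysctl level; the interface name "ipip0" is an assumption, and in practice this would be handled by configuration management / calico rather than an ad-hoc script:

from pathlib import Path

def disable_rp_filter(iface: str = "ipip0") -> None:
    # 0 = no source validation (reverse-path filtering off) on this interface; requires root
    Path(f"/proc/sys/net/ipv4/conf/{iface}/rp_filter").write_text("0\n")

if __name__ == "__main__":
    disable_rp_filter()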
Dec 7 2023
Dec 5 2023
I don't think it's a problem of load, as our puppetization doesn't balance Puppet API requests between different acme-chief hosts, but as @MoritzMuehlenhoff mentions in the description: if we have some kind of incident in codfw, puppet 7 hosts lose the ability to fetch/refresh acme-chief TLS material.
Dec 4 2023
A quick check on deployment-parsoid12 shows that ferm rules on that instance are at fault (host FW is dropping traffic towards port 80):
Dec 4 15:38:53 deployment-parsoid12 ulogd[15249]: [fw-in-drop] IN=eth0 OUT= MAC=fa:16:3e:db:ed:c8:fa:16:3e:ae:f5:88:08:00 SRC=<REDACTED> DST=172.16.4.125 LEN=60 TOS=00 PREC=0x00 TTL=49 ID=36184 DF PROTO=TCP SPT=54462 DPT=80 SEQ=3181076532 ACK=0 WINDOW=64240 SYN URGP=0 MARK=0
@daniel what's timing out is parsoid-external-ci-access.beta.wmflabs.org:
$ curl --http1.1 https://sr.wikipedia.beta.wmflabs.org/api/rest_v1/page/html/RESTBase_Testing_Page -v -o /dev/null -s 2>&1 |grep HTTP/1.1
* using HTTP/1.1
> GET /api/rest_v1/page/html/RESTBase_Testing_Page HTTP/1.1
< HTTP/1.1 200 OK
Nov 30 2023
api.php is currently handled by deployment-mediawiki11 and that instance is unreachable ATM:
vgutierrez@deployment-cache-text08:~$ sudo cat /etc/trafficserver/remap.config |fgrep api.php
regex_map http://(.*)/w/api.php http://deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud/w/api.php @plugin=/usr/lib/trafficserver/modules/tslua.so @pparam=/etc/trafficserver/lua/rb-mw-mangling.lua
vgutierrez@deployment-cache-text08:~$ nc -w 3 -zv deployment-mediawiki11 80
nc: connect to deployment-mediawiki11 (172.16.3.203) port 80 (tcp) timed out: Operation now in progress
Nov 29 2023
Nov 29 11:04:34 lvs4008 ipip-multiqueue-optimizer[1925397]: 2023/11/29 11:04:34 Attaching IPIP Multiqueue Optimizer on vlan1201
Nov 29 11:04:34 lvs4008 ipip-multiqueue-optimizer[1925397]: 2023/11/29 11:04:34 Attaching IPIP Multiqueue Optimizer on enp175s0f0np0
Nov 29 11:04:43 lvs4008 ipip-multiqueue-optimizer[1925397]: 2023/11/29 11:04:43 Exiting
Nov 29 11:04:43 lvs4008 ipip-multiqueue-optimizer[1925397]: 2023/11/29 11:04:43 Detaching IPIP Multiqueue Optimizer from enp175s0f0np0
Nov 29 11:04:43 lvs4008 systemd[1]: Stopping eBPF based IPIP Multiqueue Optimizer...
Nov 29 11:04:43 lvs4008 ipip-multiqueue-optimizer[1925397]: 2023/11/29 11:04:43 Detaching IPIP Multiqueue Optimizer from vlan1201
working as expected now:
vgutierrez@lvs4008:~$ sudo -i bpftool prog |grep ipip
23278: sched_cls name ipip_optimizer tag 575f8397462edb4d gpl
23279: sched_cls name ipip_optimizer tag 575f8397462edb4d gpl
vgutierrez@lvs4008:~$ sudo -i systemctl stop ipip-multiqueue-optimizer.service
vgutierrez@lvs4008:~$ sudo -i bpftool prog |grep ipip
[no output]
Nov 28 2023
using good old iptables syntax, this should work:
iptables -A INPUT -s 172.16.0.0/10 -p ipencap -j ACCEPT
ip6tables -A INPUT -s 0100::/64 -p ipv6 -j ACCEPT
Nov 27 2023
thx @ayounsi we will go with option 1:
- IPv4: 1500 - 20 (IP) - 20 (IP) - 20 (TCP) = 1440 bytes
- IPv6: 1500 - 40 (IPv6) - 40 (IPv6) - 20 (TCP) = 1400 bytes
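Quick sanity check of that arithmetic in Python (nothing beyond the header sizes above is assumed):

ETH_MTU = 1500
IPV4_HDR, IPV6_HDR, TCP_HDR = 20, 40, 20

# IPv4-in-IPv4: two IPv4 headers plus the TCP header
mss_v4 = ETH_MTU - 2 * IPV4_HDR - TCP_HDR
# IPv6-in-IPv6: two IPv6 headers plus the TCP header
mss_v6 = ETH_MTU - 2 * IPV6_HDR - TCP_HDR

print(mss_v4, mss_v6)  # 1440 1400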
Nov 24 2023
@ayounsi what would be the required TCP MSS clamping values? Per https://phabricator.wikimedia.org/T348837#9256494 it seems that around ~1400 bytes for both IPv4/IPv6 should be ok?
nginx didn't enforce a timeout for the whole request but just a timeout (180s) between reads from the server, so that won't be enough. To mimic the behavior you need to set the response timeout to 0 and stream_idle_timeout to 180s (dunno if the latter is supported by our puppetization)
envoy sets an upstream response timeout of 65s by default (https://github.com/wikimedia/operations-puppet/blob/397c454bbad404c9667c6f63f86e993b1970af8a/modules/envoyproxy/manifests/tls_terminator.pp#L147); it needs to be adjusted properly to allow the transfer of big files
The file is over the CDN size threshold (1G) so it will hit swift every time that it needs to be fetched. Could it be related to the work done by @MatthewVernon on T317616?
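A quick way to confirm the size claim (hedged sketch; the URL is a placeholder and 1G is taken as 1024**3 bytes here):

import urllib.request

CDN_SIZE_THRESHOLD = 1024 ** 3  # 1 GiB, per the threshold mentioned above

def exceeds_cdn_threshold(url: str) -> bool:
    # HEAD request: fetch only the headers and compare Content-Length to the threshold
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        length = int(resp.headers.get("Content-Length", 0))
    return length > CDN_SIZE_THRESHOLD

print(exceeds_cdn_threshold("https://upload.wikimedia.org/path/to/big-file"))  # placeholder URL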
Nov 23 2023
All the impacted hosts are indeed running puppet 7:
vgutierrez@cumin1001:~$ sudo -i cumin '*' 'journalctl -u export_smart_data_dump.service --since=-7d |grep -q timeout && echo impacted || echo healthy && puppet --version'
[...]
===== NODE GROUP =====
(117) an-conf1003.eqiad.wmnet,an-presto[1001,1010-1013,1015].eqiad.wmnet,an-test-worker1001.eqiad.wmnet,an-worker[1081,1091,1100,1105-1106,1109,1113,1116,1120-1121,1128-1129,1131,1134-1135,1142,1156].eqiad.wmnet,analytics[1074-1075].eqiad.wmnet,backup[2006-2007].codfw.wmnet,backup1001.eqiad.wmnet,bast2003.wikimedia.org,cassandra-dev2001.codfw.wmnet,clouddb1015.eqiad.wmnet,clouddumps1001.wikimedia.org,cloudelastic[1007,1010].wikimedia.org,cloudgw2003-dev.codfw.wmnet,cloudservices2004-dev.codfw.wmnet,cloudvirt[2002,2005]-dev.codfw.wmnet,cloudvirt[1033-1034,1036-1038,1043,1048,1053,1062,1064].eqiad.wmnet,cloudvirt-wdqs1003.eqiad.wmnet,cp4037.ulsfo.wmnet,dbprov2003.codfw.wmnet,dbstore1007.eqiad.wmnet,dumpsdata1007.eqiad.wmnet,ganeti[2010,2012].codfw.wmnet,ganeti[1011,1021,1023,1025,1032].eqiad.wmnet,ganeti[5005,5007].eqsin.wmnet,ganeti4005.ulsfo.wmnet,ganeti-test2002.codfw.wmnet,ganeti-test[1001-1002].eqiad.wmnet,gitlab-runner[2002,2004].codfw.wmnet,graphite2004.codfw.wmnet,kafka-jumbo[1011-1012].eqiad.wmnet,kafka-main[2001,2005].codfw.wmnet,krb2002.codfw.wmnet,kubernetes[2012-2013,2028,2032,2034-2035,2037,2046,2052-2053,2055].codfw.wmnet,kubernetes[1007,1017,1024,1028,1037,1039,1045,1051].eqiad.wmnet,kubestage2002.codfw.wmnet,kubestage1004.eqiad.wmnet,logstash2001.codfw.wmnet,lvs1013.eqiad.wmnet,ml-cache2001.codfw.wmnet,ml-cache1003.eqiad.wmnet,ml-serve[2005-2006].codfw.wmnet,ml-serve[1002,1006].eqiad.wmnet,ms-backup2001.codfw.wmnet,ms-be2051.codfw.wmnet,pki[1001-1002].eqiad.wmnet,puppetdb2003.codfw.wmnet,puppetdb1003.eqiad.wmnet,puppetserver1002.eqiad.wmnet,rdb1014.eqiad.wmnet,relforge1003.eqiad.wmnet,sessionstore2001.codfw.wmnet,titan1001.eqiad.wmnet,wdqs1009.eqiad.wmnet
----- OUTPUT of 'journalctl -u ex...puppet --version' -----
impacted
7.23.0
It looks like we are having some issues with the raid fact:
Nov 22 23:47:01 cp4037 smart-data-dump[748598]: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/usr/local/sbin/smart-data-dump", line 124, in _check_output
    return subprocess.check_output(cmd, stderr=stderr) \
  File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status 1.
Nov 22 23:47:01 cp4037 smart-data-dump[748598]: Traceback (most recent call last):
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/local/sbin/smart-data-dump", line 460, in <module>
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     sys.exit(main())
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/local/sbin/smart-data-dump", line 429, in main
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     raid_drivers = get_raid_drivers()
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/local/sbin/smart-data-dump", line 136, in get_raid_drivers
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     raw_output = _check_output(command, timeout=120, stderr=subprocess.DEVNULL)
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/local/sbin/smart-data-dump", line 124, in _check_output
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     return subprocess.check_output(cmd, stderr=stderr) \
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/lib/python3.9/subprocess.py", line 528, in run
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     raise CalledProcessError(retcode, process.args,
Nov 22 23:47:01 cp4037 smart-data-dump[748598]: subprocess.CalledProcessError: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status 1.
Nov 22 2023
Basically we should be using >=TLSv1.2, P-256 as the preferred curve and the following ciphersuites:
'TLS_AES_256_GCM_SHA384', 'TLS_AES_128_GCM_SHA256', 'TLS_CHACHA20_POLY1305_SHA256'
'ECDHE-ECDSA-AES256-GCM-SHA384', 'ECDHE-ECDSA-AES128-GCM-SHA256', 'ECDHE-ECDSA-CHACHA20-POLY1305',
'ECDHE-RSA-AES256-GCM-SHA384', 'ECDHE-RSA-AES128-GCM-SHA256', 'ECDHE-RSA-CHACHA20-POLY1305',
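A quick way to check what protocol/ciphersuite a given endpoint actually negotiates against this policy (hedged sketch using the Python ssl module; the hostname is just an example):

import socket
import ssl

def negotiated_params(hostname: str, port: int = 443):
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # enforce >= TLSv1.2
    with socket.create_connection((hostname, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            # cipher() returns (ciphersuite name, protocol, secret bits)
            return tls.version(), tls.cipher()

print(negotiated_params("en.wikipedia.org"))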
Fixed by masking the systemd service before the acme-chief package is installed on passive hosts