This just happened on cp2023 too.
@Gilles To see whether, and to what extent, ats-tls is also responsible for some of the performance degradation, you can query Hadoop and check the SSL timings for cp3064. Two interesting events are Nov 12 2:54 PM (new TLS certs deployed) and Fri, Nov 15 5:07 AM (cp3064 switched from nginx to ats-tls): https://phabricator.wikimedia.org/T231627#5666181
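For example, something along these lines should pull daily TLS handshake medians out of Hadoop. This is only a sketch: it assumes the navtiming events land in an event.navigationtiming Hive table, with the frontend host recorded in recvfrom and the usual Navigation Timing fields available; adjust names to the real schema:

$ hive -e "
    SELECT day,
           percentile_approx(event.connectend - event.secureconnectionstart, 0.5) AS tls_p50_ms
    FROM event.navigationtiming
    WHERE year = 2019 AND month = 11
      AND recvfrom = 'cp3064.esams.wmnet'
      AND event.secureconnectionstart > 0  -- 0 means no TLS timing was recorded
    GROUP BY day
    ORDER BY day;
"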
As per an IRC conversation with @Gilles, we do have frontend servers tagged in the navtiming Hadoop data. It would be very useful to have that information in Graphite too, and to add the cache frontends as a dropdown to https://grafana.wikimedia.org/d/000000143/navigation-timing
Interesting. I've observed the request failing as described in this task by using the Chromium developer tools, copied it as a curl command, and tried it against cp1075. The dashboard did get deleted. Private info replaced with 'blah':
This is now done:
$ curl -v https://noc.wikimedia.org/Potato -H "X-Wikimedia-Debug: mwdebug1001.eqiad.wmnet" 2>&1 | egrep "(x-cache|server):"
< server: mwdebug1001.eqiad.wmnet
< x-cache: cp3052 pass, cp3054 pass
By going through SAL and the IRC logs on #wikimedia-operations, I've reconstructed the events as follows. There are some parts I don't understand, so please fill in the gaps.
Mon, Nov 18
There's been a decrease in local backend hit rate on ats-be compared to varnish-be. While on 2019-11-11 (before the reimages to ATS) the local hit rate was about 3.5%, today it is 1.3%:
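For reference, a quick way to eyeball the local hit rate on an ats-be host directly is to look at its raw cache counters. A sketch, assuming the stock ATS metric names and the usual "name value" per-line traffic_ctl output:

$ traffic_ctl metric match 'proxy.process.http.cache_(hit|miss)' |
    awk '/hit/ {h+=$2} /miss/ {m+=$2} END {printf "local hitrate: %.2f%%\n", 100*h/(h+m)}'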
Fri, Nov 15
I cannot reproduce with URLs such as https://upload.wikimedia.org/wikipedia/commons/thumb/6/6b/Kitagawa_Utamaro_-_Toji_san_bijin_%28Three_Beauties_of_the_Present_Day%29From_Bijin-ga_%28Pictures_of_Beautiful_Women%29,_published_by_Tsutaya_Juzaburo_-_Google_Art_Project.jpg/200px-Kitagawa_Utamaro_-_Toji_san_bijin_%28Three_Beauties_of_the_Present_Day%29From_Bijin-ga_%28Pictures_of_Beautiful_Women%29,_published_by_Tsutaya_Juzaburo_-_Google_Art_Project.jpg
@awight: is there anything to do here or can we close the task?
Thu, Nov 14
TLS termination configured on port 7443:
$ curl -v https://debmonitor.wikimedia.org:7443/login/ --resolve debmonitor.wikimedia.org:7443:10.64.32.62 2>&1 | grep '< HTTP'
< HTTP/2 200
Wed, Nov 13
Tue, Nov 12
Perhaps interestingly, or maybe entirely unrelated: a couple of hours before crashing, the host had a spike in cache write errors:
I thought we had already addressed the issue with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/520425/. Evidently there's something wrong with that patch. To be continued!
Nov 05 15:22:48 cp4027 systemd: Starting trafficserver-tls.service...
Nov 05 15:22:50 cp4027 update-ocsp-all: touch: cannot touch '/srv/trafficserver/tls/etc/ssl_multicert.config': Read-only file system
Nov 05 15:22:50 cp4027 systemd: trafficserver-tls.service: Unit cannot be reloaded because it is inactive.
Nov 05 15:22:50 cp4027 update-ocsp-all: run-parts: /etc/update-ocsp.d/hooks/trafficserver-tls-ocsp exited with return code 99
I fixed the touch issue with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/550475/, but update-ocsp-all still tries to reload trafficserver-tls.service, which fails because the unit is inactive. @Vgutierrez: any ideas on how to tackle this?
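One option, sketched under the assumption that we control the hook script and that guarding on unit state is acceptable: have the hook skip the reload when the unit isn't active, e.g.

#!/bin/sh
# Hypothetical version of /etc/update-ocsp.d/hooks/trafficserver-tls-ocsp:
# only ask systemd for a reload if the unit is actually running, so OCSP
# refreshes on a host with ats-tls stopped don't make run-parts fail.
if systemctl is-active --quiet trafficserver-tls.service; then
    systemctl reload trafficserver-tls.service
fi

The trade-off is that a stopped unit would only pick up fresh staples when it next starts, which should be harmless if the staple files are read at startup.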
Mon, Nov 11
The functionality is now deployed to production; a brief illustration follows.
Fri, Nov 8
Notice that debug servers aren't pooled in etcd like regular production ones, so mwdebug1002 is still serving debug traffic:
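To double-check that, a confctl query from a cumin host should return an object for a regular pooled server and nothing for mwdebug hosts. A sketch, assuming conftool's usual selector syntax (mw1261 is just a hypothetical pooled appserver):

$ sudo confctl select 'name=mwdebug1002.eqiad.wmnet' get   # no conftool object in etcd
$ sudo confctl select 'name=mw1261.eqiad.wmnet' get        # returns the pooled state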
Thu, Nov 7
Wed, Nov 6
Tue, Nov 5
This time the host did boot properly after the reimage. Also, the initramfs size is now in line with that of other cp5 systems:
As an update, cp5012 is currently being reimaged (at the "Started first puppet run" phase). The initramfs looks like this right now:
Mon, Nov 4
Wed, Oct 30
I've just observed the issue again with cp5008:
Tue, Oct 29
It is done, yes. Thanks @Ottomata!
On cache_text we have a fairly significant number of VCL files stuck in the "auto/busy" state after having been discarded by our reload script. As an example, right now we have 10 VCLs in that state on cp3050 (text), and only 2 on cp3057 (upload). They can be seen with varnishadm -n frontend vcl.list. Each VCL file keeps running all its probes, causing the requests mentioned in this ticket. The issue seems to be known upstream but "timed out": https://github.com/varnishcache/varnish-cache/issues/2228
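For the record, counting the stuck VCLs on a host boils down to something like this (a sketch; the exact vcl.list column layout varies between Varnish versions, so match loosely):

$ sudo varnishadm -n frontend vcl.list | grep -c 'auto/busy'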