- User Since: Sep 29 2015, 8:49 PM (417 w, 5 d)
Feb 9 2022
Although we did briefly discuss the results of this experiment within Traffic, I don't think we ever publicly disclosed our analysis.
Feb 1 2022
We should probably add a Conflicts: libvarnishapi1 to our varnish 6 packaging, or whatever relationship magic is the right one to ensure that if varnish 6 is installed, libvarnishapi1 is not.
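One plausible way to express that relationship (a sketch, assuming the binary package is named varnish and that a plain Conflicts is the right strength; Breaks might be preferable depending on upgrade paths) would be a debian/control stanza along these lines:

```
# debian/control, binary package stanza for the varnish 6 packaging
Package: varnish
Conflicts: libvarnishapi1
```

With this, dpkg/apt will refuse to have varnish 6 and libvarnishapi1 installed at the same time.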
Jan 31 2022
Apparently deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud is gone and has been replaced by deployment-mediawiki12.
Jan 7 2022
Smoke testing of 6.0.9 is fine on deployment-prep, I'll start upgrading production nodes next week.
Dec 8 2021
Many of the assumptions made when this task was created have changed since the migration to ATS for cache backends (no more IPSec, the difference between Tier1 and Tier2 DCs is now gone, ...). We are now in a world where all backend caches access the origins via TLS, which I think largely covers what we wanted to achieve here. @BBlack: I'm marking the task as resolved, but of course feel free to reopen / create other tasks as needed if you think that anything is missing.
Dec 6 2021
In the last 24 hours we had just one overrun on 4 nodes:
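A quick way to tally overruns per host from journal output would be something like the following hypothetical helper (the log format and the position of the hostname field are assumptions for illustration, not the production format):

```python
# Hypothetical helper: tally VSL "overrun" events per cache host from
# syslog/journal-style lines. The line format is an assumption.
from collections import Counter

def count_overruns(lines):
    """Map hostname -> number of log lines mentioning an overrun."""
    tally = Counter()
    for line in lines:
        if "overrun" in line.lower():
            # assumed format: "<timestamp> <host> <unit>: <message>"
            tally[line.split()[1]] += 1
    return tally

sample = [
    "Dec06 cp3062 varnishmtail[123]: Log overrun",
    "Dec06 cp3064 varnishmtail[456]: Log overrun",
    "Dec06 cp3062 varnishmtail[123]: started",
]
print(count_overruns(sample))  # Counter({'cp3062': 1, 'cp3064': 1})
```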
Nov 19 2021
The situation has improved significantly, we are now processing up to 13K lines per second vs the ~8K plateau from last week:
Nov 18 2021
After setting cache::single_backend_fqdn: cp4021.ulsfo.wmnet in hiera, cp4021 is now gone from the list of cache backends on all upload@ulsfo nodes, see for instance cp4022:
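For reference, the hiera override described above would look like this (the hieradata file path is host-specific and not shown here):

```yaml
# hieradata override for cp4021
cache::single_backend_fqdn: cp4021.ulsfo.wmnet
```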
Nov 8 2021
This is a recurring problem, see for example T273599 T222072 T295253. T222075 has ideas on how to tackle the issue. I've tried to access the instance to free up some space but I don't seem to have the permissions to do so.
Oct 28 2021
The optimizations to varnishxcache.mtail and varnishreqstats.mtail paid off, time spent in tryBacktrack has decreased significantly:
Oct 22 2021
I upgraded varnish to 6.0.8 everywhere (see T292290) and forgot about restarting the service on deployment-cache-upload06. It should be fixed now, thanks @Majavah.
Oct 21 2021
I've tried using a separate mtail instance with a subset of the scripts used by the production instance, namely:
By giving cp3062 a very large amount of vsl_space (3G instead of the default 80M), the issue happens less often but still does happen. On all text@esams nodes there has been no overrun between 00:44 and 04:46 (when esams traffic is at its lowest), while on cp3062 the last overrun happened at 21:56 and the first one this EU morning at 06:38. Bumping vsl_space alone does not fix the issue.
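For context, vsl_space is a standard varnishd runtime parameter, so it can be set at startup or inspected/changed on a running instance roughly like this (a sketch; the exact startup options used in production are not shown here):

```
# At startup, as an extra varnishd option:
varnishd ... -p vsl_space=480M

# On a running instance:
varnishadm param.show vsl_space
varnishadm param.set vsl_space 480M
```

Note that the VSL segments are backed by files under /var/lib/varnish, which is why the size of that filesystem caps how far vsl_space can be raised.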
Oct 20 2021
Trying the lowest-hanging fruit first, namely raising vsl_space. I first tried setting it to 512M as mentioned in the SAL entry above, but that failed due to the size of /var/lib/varnish (512M, but it needs space for other things too). We now have vsl_space=480M on cp3062; let's see if that changes anything at all.
All hosts upgraded.
Oct 19 2021
Caches have now filled up. Response start looks good on cp3060 compared to one week ago:
Oct 18 2021
Setting priority to low for now: these seem to be isolated, sporadic crashes, and systemd took care of the restarts as expected, so there was no production impact.
Oct 13 2021
Trying another reimage as follows:
Runbook and updated dashboard link are shown correctly. Closing.
@bd808: looks like we're all set!
Oct 12 2021
Runbook created: https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop
Oct 11 2021
Heads up Performance-Team: as with all Varnish upgrades, this may have an impact (positive or negative) on performance, so you may want to keep the upgrade process on your radar. Beta has been running Varnish 6.0.8 for a few days now without obvious issues; I'll upgrade one prod text node in ulsfo today and then carry on relatively quickly unless things break.
Oct 6 2021
After lowering the amount of memory used for the ATS backend ram cache, there's now some more available on the system:
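Assuming the standard ATS knob for this, the reduction would be a records.config change along these lines (the value shown is illustrative, not the one deployed):

```
# records.config: cap the ATS backend RAM cache at 2 GiB (example value)
CONFIG proxy.config.cache.ram_cache.size INT 2147483648
```

An ATS restart (or a config reload, depending on the setting) is needed for the new size to take effect.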
Oct 5 2021
Preliminary testing in beta looks good, uploading the package to the archive.