Mon, Sep 18
Tue, Sep 12
Mon, Sep 11
Fri, Sep 8
Thu, Sep 7
Tue, Sep 5
On September 1st:
Mon, Sep 4
Thanks @elukey! Yeah, cp4024 might be having hardware issues. The system was down yesterday at around 09:00 UTC. I power-cycled it and it came back online fine, but after a few hours it started showing the lockups mentioned in this task description.
Fri, Sep 1
We've been using vsthrottle in prod for a while now, closing.
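For context, vsthrottle rate-limits requests per key (e.g. per client IP) using token buckets. A minimal Python sketch of the underlying idea, assuming the vmod's `is_denied(key, limit, period)` semantics (this class is illustrative, not the vmod's actual C implementation):

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-key token bucket in the style of vmod_vsthrottle:
    allow at most `limit` calls per `period` seconds for each key."""

    def __init__(self, limit, period):
        self.limit = float(limit)
        self.rate = limit / period           # tokens replenished per second
        self.tokens = defaultdict(lambda: float(limit))
        self.last = {}

    def is_denied(self, key, now=None):
        now = time.monotonic() if now is None else now
        # Refill tokens proportionally to the time elapsed since the
        # previous call for this key, capped at the bucket size.
        elapsed = now - self.last.get(key, now)
        self.last[key] = now
        bucket = min(self.limit, self.tokens[key] + elapsed * self.rate)
        if bucket < 1.0:
            self.tokens[key] = bucket
            return True                      # over the limit: deny
        self.tokens[key] = bucket - 1.0      # consume one token: allow
        return False
```

With `limit=2, period=1.0`, two calls in the same instant are allowed, the third is denied, and a second later the bucket has refilled.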
Thu, Aug 31
Wed, Aug 30
Tue, Aug 29
Looks good, thanks @Cmjohnson!
Aug 11 2017
Aug 9 2017
Aug 5 2017
A more general patch has been submitted by Julian Anastasov http://archive.linuxvirtualserver.org/html/lvs-devel/2017-08/msg00001.html \o/
Aug 2 2017
@Cmjohnson any news?
Aug 1 2017
I've sent a patch upstream covering the virtual service removal case: http://archive.linuxvirtualserver.org/html/lvs-devel/2017-07/msg00016.html
Jul 31 2017
Jul 28 2017
We do have occasional backend fetch failures. Closing, as this looks like a transient error.
Jul 27 2017
+1 on upgrading to stretch. However, we'll probably end up in a similar situation on stretch when upgrading to newer kernels, so it might still make sense to keep this ticket open to track ipvsadm backporting efforts as needed?
Jul 26 2017
Jul 25 2017
Jul 24 2017
Closing, the problem is known and there's no perfect solution (but one nginx reload a day is much better than one every hour!).
Jul 23 2017
Jul 22 2017
Jul 21 2017
The last overrun was logged about half an hour ago.
As of yesterday, varnish 4.1.7-1wm1 is deployed on all cache hosts. It includes our patch adding two counters, one for short-lived object creation and another for uncacheable objects. I've added both counters to the varnish-transient-storage-usage dashboard.
Jul 20 2017
@Cmjohnson please replace the disk (sda) whenever you've got the chance!
So Kconfig says that PERF_EVENTS_INTEL_CSTATE provides perf events for power/C-state monitoring. Since I don't think we need it, we could blacklist it together with intel-rapl-perf (PERF_EVENTS_INTEL_RAPL) while we're at it.
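If we go that route, blacklisting should just be a small modprobe.d fragment. A sketch, assuming the module names are intel_cstate and intel_rapl_perf (as reported by lsmod on recent kernels) and with a made-up filename:

```
# /etc/modprobe.d/blacklist-perf-power.conf (filename is hypothetical)
blacklist intel_cstate
blacklist intel_rapl_perf
```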
Reopening, I've just seen this happening again on cp1066. This is what systemd-analyze blame reported after a slow but successful boot:
So as @MoritzMuehlenhoff mentioned on IRC the mgmt issues might have been due to T171041.
Jul 19 2017
So it looks like the varnishstatsd overruns occur mostly in ulsfo:
Icinga external commands include SCHEDULE_SVC_DOWNTIME, which seems handy. We could perhaps try writing a script that issues a SCHEDULE_SVC_DOWNTIME for the IPsec service of each host defined in the role::ipsec targets array?
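A rough sketch of what such a script could look like. The external command syntax is the documented Icinga/Nagios one; the service description ("IPsec"), the command-pipe path, the host list source, and `fixed=1`/`trigger_id=0` are all assumptions to be checked against our Icinga config:

```python
import time

def downtime_cmd(host, service, start, duration, author, comment):
    """Build an Icinga SCHEDULE_SVC_DOWNTIME external command line.

    Documented format:
    [ts] SCHEDULE_SVC_DOWNTIME;host;service;start;end;fixed;trigger_id;duration;author;comment
    """
    end = start + duration
    return ("[%d] SCHEDULE_SVC_DOWNTIME;%s;%s;%d;%d;1;0;%d;%s;%s"
            % (start, host, service, start, end, duration, author, comment))

def schedule_ipsec_downtime(hosts, duration=3600,
                            cmdfile="/var/lib/icinga/rw/icinga.cmd"):
    """Write one downtime command per host to the Icinga command pipe.

    `hosts` would come from the role::ipsec targets array; the command
    file path here is the Debian default and may differ in our setup.
    """
    now = int(time.time())
    with open(cmdfile, "w") as f:
        for host in hosts:
            f.write(downtime_cmd(host, "IPsec", now, duration,
                                 "ops", "scheduled maintenance") + "\n")
```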
Incidentally, while looking at entirely different stuff on esams recdns hosts, I've noticed that the vast majority of our DNS traffic there is due to cache hosts continuously asking for the A record of statsd.eqiad.wmnet.
Yeah, all other daemons have been fixed but varnishstatsd seems to still be affected by this issue.