Looks like a case of the controller freaking out. I've now updated its firmware to 6.60; after a reboot the raid is clean.
Fri, Oct 19
Tried 6MB per thread now: we're ingesting about 30MB/s of udp traffic, so with 4 statsd-proxy threads each one should be able to buffer its share of the bandwidth (7.5MB/s) for ~1s.
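For reference, the back-of-the-envelope arithmetic behind the ~1s figure (numbers taken from above):

```python
# Rough check of the per-thread buffer headroom (numbers from the comment above).
ingest_mb_s = 30     # total udp statsd traffic
threads = 4          # statsd-proxy threads
buffer_mb = 6        # receive buffer being tried per thread

per_thread_mb_s = ingest_mb_s / threads        # 7.5 MB/s per thread
headroom_s = buffer_mb / per_thread_mb_s       # ~0.8s of traffic per buffer
print(f"{per_thread_mb_s:.1f} MB/s per thread, ~{headroom_s:.1f}s of headroom")
```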
Setting a 2MB socket receive buffer has helped get the errors down to ~0; unfortunately neither statsd-proxy nor statsite supports setting the SO_RCVBUF socket option via configuration, so I did this to temporarily set the buffer to 2MB and then back to its default:
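(A sketch of one way to do it, assuming the change goes through the kernel's default receive-buffer size, net.core.rmem_default, with a daemon restart to pick it up; the exact snippet may have differed.)

```python
# Hypothetical sketch: temporarily raise the default SO_RCVBUF for new sockets
# to 2MB, restart the daemon so its UDP socket inherits it, then restore the
# previous default. Run as root; the restart step is an assumption.
RMEM_DEFAULT = "/proc/sys/net/core/rmem_default"

with open(RMEM_DEFAULT) as f:
    previous = f.read().strip()

with open(RMEM_DEFAULT, "w") as f:
    f.write(str(2 * 1024 * 1024))    # 2MB default for newly created sockets

# ... restart statsd-proxy / statsite here so it re-creates its udp socket ...

with open(RMEM_DEFAULT, "w") as f:
    f.write(previous)                # back to the original default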
In this case the controller freaked out; after a reboot the raids are clean:
Thu, Oct 18
We've been observing periodic elevated (>500/s) udp inerrors / buffer errors on graphite1001 since yesterday, _after_ having switched statsd traffic to graphite1004 in T196484. The only statsd client still sending traffic to graphite1001 in this case is ores, and errors are still elevated even with modest traffic (compared to the firehose of all udp statsd traffic).
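For context, these counters typically come from the Udp line in /proc/net/snmp (InErrors / RcvbufErrors); a quick sketch of sampling them as a per-second rate:

```python
# Sample the kernel's UDP error counters twice, one second apart, to get a rate.
import time

def udp_counters():
    with open("/proc/net/snmp") as f:
        header, values = [line.split()[1:] for line in f if line.startswith("Udp:")]
    return dict(zip(header, map(int, values)))

before = udp_counters()
time.sleep(1)
after = udp_counters()
for key in ("InErrors", "RcvbufErrors"):
    print(f"{key}: {after[key] - before[key]}/s")
```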
I'm +1 on dateext going forward; it's likely not worth going back and changing all existing logrotate configs.
Wed, Oct 17
Agreed on the behaviors we want. On the behaviors that are desirable (i.e. what to do when the engine is down or unresponsive), I think we should stick to what the cookie (or its absence) instructs apache to do. The rationale is that detecting engine down/unresponsive might paper over problems with the engine itself. The other side effect is that the php7 choice would live in two places: mediawiki declaratively, and apache "at runtime" depending on the state of the engines at the time of the request, which IMO will make debugging harder.
Mon, Oct 15
Checked now; indeed testwikidatawiki is now in group0, not group1. Resolving.
I don't think we've seen a recurrence of this; resolving as invalid. Also, generic systemd unit monitoring should help catch cases like this.
I don't think we've seen a recurrence of this; also, logstash now has monitoring for udp packet loss, which I'm assuming would also show up if the logstash services were down.
We're doing well space-wise now:
We have one year of librenms data in Graphite already; I'm declining this since we'll eventually reach the librenms retention anyway (2yrs IIRC).
Not really "logstash" but using Wikimedia-Logstash for logging-related tasks
I ran @Krinkle's script to audit grafana dashboards (https://gist.github.com/Krinkle/b5ceff5156c1f4cf3568e373cc135bad) to gauge where we're still querying the servers hierarchy; full results are at P7680.
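The gist has the actual script; as a rough illustration of the kind of audit (the Grafana URL, token, and the simple panel walk here are placeholders), something along these lines flags Graphite targets still using the servers. prefix:

```python
# Illustrative only: list dashboards via the Grafana HTTP API and print any
# Graphite panel target that still queries the servers.* hierarchy.
import json
import urllib.request

GRAFANA = "https://grafana.example.org"            # placeholder URL
HEADERS = {"Authorization": "Bearer <api-token>"}  # placeholder token

def get(path):
    req = urllib.request.Request(GRAFANA + path, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

for dash in get("/api/search?type=dash-db"):
    dashboard = get("/api/dashboards/uid/" + dash["uid"])["dashboard"]
    for panel in dashboard.get("panels", []):        # ignores nested rows for brevity
        for target in panel.get("targets", []) or []:
            query = target.get("target", "")
            if "servers." in query:
                print(dash["title"], "->", query)
```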
The most similar task is likely T88997: Improve graphite failover and related. As far as graphite goes, sending carbon line-oriented traffic is already active-active, in the sense that traffic can be sent to any graphite frontend in codfw/eqiad and it'll be mirrored to the other datacenter.
The prometheus.svc endpoint in eqiad and codfw is backed by two independent Prometheus servers scraping the same targets. What I suspect has happened is that one of the two servers caught workers in state closing or logging while the other didn't. This also suggests to me that the exporter doesn't report all the metrics it knows about all the time, which leads me to believe that mod_status behaves that way (i.e. when no workers are in state closing they are not reported at all).
I did a quick audit in eqiad (for starters) to preview how we'd be affected by the alert, in this way:
I was looking at T206704: Enable access from icinga1001 to mgmt interfaces, and the einsteinium/tegmen addresses will likely be found in other places in the router configuration too (including the pfw, as @Volans pointed out) that will need updating.
Wed, Oct 10
With the latest patch in to log exceptions I think we're good to resolve this?
I'm resolving this since this work is happening as part of T205870: Provision >= 50% of statsd/Graphite-only metrics in Prometheus
Tue, Oct 9
Mon, Oct 8
Looks like we have a way forward! Resolving in favor of T206454: Setup Kafka cluster, producers and consumers for logging pipeline to track the actual Kafka setup work.
Fri, Oct 5
Data transfer and CNAME flip completed. I've documented the data transfer itself at https://wikitech.wikimedia.org/wiki/Prometheus#Sync_data_from_an_existing_Prometheus_host. Once the TTL expires bast4001 should no longer receive queries; this can be verified by looking at /var/log/apache2/other_vhosts_access.log.
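A quick way to check, assuming Apache's default vhost_combined log format (vhost first, client address second):

```python
# Count remaining clients seen in the access log; anything still showing up
# after the TTL has expired is pointing at the old name.
from collections import Counter

clients = Counter()
with open("/var/log/apache2/other_vhosts_access.log") as f:
    for line in f:
        fields = line.split()
        if len(fields) > 1:
            clients[fields[1]] += 1   # second field is the client address

for addr, hits in clients.most_common(10):
    print(hits, addr)
```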
Thu, Oct 4
All done! 3.8.7-1 is live
Wed, Oct 3
Tue, Oct 2
Also worth taking into consideration: services moving to k8s have statsd_exporter listening on localhost, so for those there's no deployment needed, only writing the statsd -> prometheus mapping rules for statsd_exporter to use.
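For illustration, a hypothetical mapping rule of the kind that would need writing per service (metric names made up):

```yaml
# Example statsd_exporter mapping: turn a dotted statsd metric into a labelled
# Prometheus metric. "myservice" and the label are placeholders.
mappings:
  - match: "myservice.*.request_duration"
    name: "myservice_request_duration"
    labels:
      handler: "$1"
```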
The packaging part has been done already as part of T204266: Backport prometheus haproxy exporter for Jessie. What's left in this case, I believe, is the puppetization to add haproxy-exporter to the dbproxy hosts and the related job in Prometheus.
Adding Thumbor too since I'm sure it'll be affected as well. Re: the swift space concerns, I don't think it'll be a problem unless the rasterized SVGs take up a lot of space, which I don't think is the case. Thanks for the heads up!
I bounced logstash on deployment-logstash2 and it looks like logs are flowing again. logstash-plain.log wasn't being written to before the restart, which is a little worrying in itself and makes it non-obvious to understand what's wrong.
At yesterday's monitoring/logging meeting we discussed this and concluded that, for good hygiene and decoupling, it makes sense to spin up a new Kafka cluster for logging purposes. What's left is to decide which hardware we're going to run Kafka on, which in turn boils down to a budget question; see also T203169: Logstash hardware expansion.