Host came back clean; I've updated the HW RAID firmware while I was at it
Tor has been retired in T243288: Retire the Tor relay
The old hosts have finally been decom'd!
We have Icinga alerts for MediaWiki error rates nowadays, based on Prometheus metrics (via logstash -> statsd -> prometheus)
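For illustration, a minimal sketch of how an Icinga-style check could poll one of those Prometheus metrics over the HTTP query API; the Prometheus URL, metric name and thresholds below are made-up placeholders, not the production ones:

```python
#!/usr/bin/env python3
"""Sketch of an Icinga-style check querying Prometheus over HTTP.

The URL, metric name and thresholds are hypothetical placeholders.
"""
import sys
import requests

PROMETHEUS = "http://prometheus.example.org/api/v1/query"  # placeholder URL
QUERY = "sum(rate(mediawiki_errors_total[5m]))"            # placeholder metric
WARN, CRIT = 10, 50                                        # made-up thresholds (errors/sec)

resp = requests.get(PROMETHEUS, params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
value = float(result[0]["value"][1]) if result else 0.0

# Standard Icinga/Nagios exit codes: 0 OK, 1 WARNING, 2 CRITICAL
if value >= CRIT:
    print(f"CRITICAL - error rate {value:.1f}/s")
    sys.exit(2)
if value >= WARN:
    print(f"WARNING - error rate {value:.1f}/s")
    sys.exit(1)
print(f"OK - error rate {value:.1f}/s")
sys.exit(0)
```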
With the standard partman recipes implemented essentially everywhere, we also get (software) RAID by default. I'm going to boldly resolve the task, but please reopen if needed!
Push Gateway implementation at T249311: Deploy Prometheus Push Gateway
Boldly declining as we're still using nginx, but it is on its way out (frontend caches are already off nginx; internal usage should be replaced with Envoy)
Looks like this is no longer an issue; I checked cloudmetrics* (ex-labmon) alerts and found no UNKNOWNs
Boldly resolving this task: with the logging pipeline in production we can either tap into the Kafka log stream pre-Logstash or inject messages back into Kafka post-Logstash, after processing
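As a rough illustration of the first option, a minimal consumer sketch tapping the pre-Logstash Kafka stream; the broker and topic names are placeholders, not the actual logging pipeline ones:

```python
"""Sketch of tapping the Kafka log stream before Logstash.

Broker and topic names are hypothetical placeholders.
"""
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "rsyslog-info",                                            # placeholder topic
    bootstrap_servers=["kafka-logging.example.org:9092"],      # placeholder broker
    group_id="log-tap-example",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="latest",
)

for message in consumer:
    event = message.value
    # Each event is a structured log record as produced upstream of Logstash.
    print(event.get("host"), event.get("message"))
```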
AFAICT we've been running all Elasticsearch checks in all clusters and we're OK with it; boldly resolving!
We have the logging pipeline in production now; in other words, applications send logs either to the local syslog Unix socket / journald or to localhost UDP
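For example, a minimal sketch of an application emitting logs both ways via Python's stdlib; the /dev/log path and UDP port 514 are the usual syslog defaults, not necessarily our exact settings, and a real application would pick just one of the two handlers:

```python
"""Sketch of an application logging to the local syslog socket or to localhost UDP.

Socket path and port are standard defaults, shown as assumptions.
"""
import logging
from logging.handlers import SysLogHandler

log = logging.getLogger("myapp")
log.setLevel(logging.INFO)

# Option 1: local syslog Unix socket (picked up by rsyslog/journald)
log.addHandler(SysLogHandler(address="/dev/log"))

# Option 2: localhost UDP (SysLogHandler uses UDP by default for (host, port) tuples)
log.addHandler(SysLogHandler(address=("localhost", 514)))

log.info("application started")
```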
The Graphite version on WMCS has caught up and sortByTotal is available (tested on grafana-labs' Explore function)
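For reference, sortByTotal can be exercised directly against the Graphite render API, roughly like this; the host and metric names are placeholders:

```python
"""Sketch of calling sortByTotal via the Graphite render API.

Graphite host and metric pattern are hypothetical placeholders.
"""
import requests

resp = requests.get(
    "https://graphite.example.org/render",              # placeholder Graphite host
    params={
        # sortByTotal orders the matched series by the sum of their datapoints
        "target": "sortByTotal(instances.*.cpu.user)",  # placeholder metric pattern
        "from": "-1h",
        "format": "json",
    },
    timeout=10,
)
resp.raise_for_status()
for series in resp.json():
    print(series["target"])
```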
AFAIK this hasn't recurred, but we might not have had Phatality deployments since then, @mmodell?
Resolving in favor of T257016: Fix paniclog alert to only send mails once
Taking over this issue to provide access to Thanos instead, which offers a unified query interface.
Boldly resolving; it is indeed the case that a blocked Logstash output exerts backpressure on the whole pipeline. Pending items are T176335: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable and T255243: Increase logging pipeline ingestion capacity
This is complete; it was indeed related to the PDU upgrades
Thu, Jul 2
Resolving in favor of T255243: Increase logging pipeline ingestion capacity
Option 1 seems attractive to me because it is the proverbial nail in the coffin for the issue of Logstash-derived metrics being unreliable in the face of Kafka consumer lag. OTOH it is unclear to me how much of an effort it'd be to get there (?)
This is done; thanks @herron for putting the new VMs in service
We are on the Kafka pipeline for MW logs that were previously sent to Logstash over the network. udp2log is still in place due to the high volume of logs, but yes, eventually we'd like to deprecate udp2log too and move everything to Kafka.
Wed, Jun 24
I've switched both the SRE and WMCS contacts to the VO-specific contacts and verified they work as expected. The previous failure was due to using email instead of address1.