Upstream PR by yours truly https://github.com/librenms/librenms/pull/12346
I was able to reproduce the error this morning when the session needed refreshing (and thus a new ticket):
Mon, Nov 23
This is complete! I've left out trying memcached since the in-memory caching seems to work well for now
Thu, Nov 19
Ran into an error just now; I'm not sure why CORS wouldn't allow this, and the problem might not be on the IDP side?
The full list of hosts is the logging-eqiad Kafka cluster on TCP port 9093, namely:
Different error message this time in the browser's console, but something has definitely changed!
Wed, Nov 18
This was me manually failing sdb
Tue, Nov 17
Update: the host isn't coming back (both mgmt and ssh). Given that we'll need a BBU for ms-be1030 (T268036) too, I'd say let's order some (?). The host will be fully decommissioned in maybe 8-10 weeks, but I think we can keep the BBU anyway post-decom
This host went down earlier today; it is missing a HW RAID firmware upgrade, so I'll apply that just in case. This looks like a BBU needing replacement, though
This is fixed now \o/
I can confirm that the short link above works now, thanks for reaching out!
Indeed the underlying disk had failed and was marked as such by the controller. To be clear, in this case IIRC we already have the disk name from hpssacli, and I think we should use it to report at least device_smart_healthy without requiring smartctl to return valid output
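A minimal sketch of the idea above, assuming we parse hpssacli's `physicaldrive` lines; the sample output format and the regex are assumptions, and the real hpssacli output may differ:

```python
import re

# Hypothetical sample of hpssacli controller config output;
# the real format may differ slightly per controller/firmware.
SAMPLE = """\
   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 2 TB, OK)
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 2 TB, Failed)
"""

DRIVE_RE = re.compile(r"physicaldrive (\S+) \(.*?, (\w+)\)")

def drive_health(output):
    """Map each physical drive name to True (OK) or False, usable for
    reporting device_smart_healthy without valid smartctl output."""
    return {m.group(1): m.group(2) == "OK" for m in DRIVE_RE.finditer(output)}

print(drive_health(SAMPLE))  # e.g. {'1I:1:1': True, '1I:1:2': False}
```

The drive name from the controller would then be attached as a label to the `device_smart_healthy` metric even when smartctl can't query the device.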
Thank you @Papaul ! SSD is rebuilding
Mon, Nov 16
Nevermind, let's follow up on T267870: ms-be1022 smart storage battery failure; disk sdb possibly bad
@Papaul it looks like the SSD is busted on this host. The host is OOW I think, so we'll need a replacement SSD, thanks!
Once the PDUs are installed, please let Observability know. At a minimum we'd need to test LibreNMS discovery and add their SNMP MIB to snmp_exporter for pulling power data into Prometheus
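For reference, the snmp_exporter side would roughly be a generator module like the sketch below; the module name and OID are placeholders until we have the vendor MIB:

```yaml
# Hypothetical snmp_exporter generator.yml entry for the new PDUs.
# Module name, OID subtree, and community are placeholders.
modules:
  pdu_power:
    walk:
      - 1.3.6.1.4.1.318.1.1.12   # example: APC rPDU enterprise subtree
    auth:
      community: public
```

Running the generator against the vendor MIB would then produce the snmp.yml module that Prometheus scrapes via snmp_exporter.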
Tue, Nov 10
The patch is live. Unfortunately, due to how our Prometheus puppetization works, we're now scraping Gerrit metrics from all sites. Given that the number of metrics is not high (~2k ballpark) we're OK for now, but we definitely need to tune the configuration so that only Prometheus hosts in eqiad scrape Gerrit metrics
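The end state would be something like the scrape config sketch below, deployed only to the eqiad Prometheus hosts; the job name and target are illustrative:

```yaml
# Hypothetical scrape config, rendered only on Prometheus hosts in
# eqiad so the other sites never pick up the Gerrit job.
scrape_configs:
  - job_name: gerrit
    scheme: https
    metrics_path: /metrics
    static_configs:
      - targets:
          - gerrit.example.org:443   # placeholder target
```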
I think it is fair to say that if no disks are detected, that's always an error condition (?). In that case, a simple(r) solution would be to exit non-zero when no disks are detected, so the systemd service/timer fails loudly.
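A minimal sketch of that behavior, assuming the check is a Python script; `detect_disks()` is a hypothetical stand-in for however the real script enumerates disks:

```python
import sys

def detect_disks():
    # Hypothetical: enumerate disks (e.g. by parsing controller output).
    # Returning an empty list simulates the "no disks detected" case.
    return []

def main():
    disks = detect_disks()
    if not disks:
        # No disks is always an error: exit non-zero so the
        # systemd service/timer fails loudly.
        print("error: no disks detected", file=sys.stderr)
        return 1
    for disk in disks:
        print(disk)
    return 0

# entry point would be: sys.exit(main())
```

systemd marks the unit failed on any non-zero exit status, so the timer's failure state becomes visible without any extra alerting logic in the script itself.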
Mon, Nov 9
Status update: alerts can now be acknowledged via silences by prefixing the silence text with ACK!
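A sketch of what such an acknowledging silence could look like as an Alertmanager v2 silence payload (the kind POSTed to `/api/v2/silences`); the matcher, author, and duration are illustrative assumptions, not our actual tooling:

```python
import json
from datetime import datetime, timedelta, timezone

def ack_silence(alertname, author, note, hours=4):
    """Build a hypothetical Alertmanager silence payload whose text starts
    with ACK!, which (per the workflow above) marks the alert acknowledged."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": "alertname", "value": alertname, "isRegex": False},
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": author,
        "comment": "ACK! " + note,  # the ACK! prefix is what matters
    }

silence = ack_silence("SwiftLowObjectAvailability", "filippo", "investigating")
print(json.dumps(silence, indent=2))
```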
I was looking at the "Prometheus jobs down" alert today and idp shows up there, I'm assuming because its Prometheus endpoint was removed in Ia4b089af. Please remove the IDP Prometheus job as well, thanks!
Status update: query-frontend is now serving queries with in-memory caching (1GB to start with)
Thu, Nov 5
This is live now, LDAP users are synced daily to Grafana.
Wed, Nov 4
Reopening, this is alerting again
Noticed this today in the alerts.w.o and IDP interaction; we're missing CORS:
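For the cross-origin requests from alerts.w.o to succeed, the IDP responses would need CORS headers along these lines; the exact origin value and whether credentials are needed are assumptions:

```
Access-Control-Allow-Origin: https://alerts.wikimedia.org
Access-Control-Allow-Credentials: true
```

Without the first header, the browser blocks the response even though the request itself reaches the IDP, which would match the console error above.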
Tue, Nov 3
Mon, Nov 2
This is now a case with VO support. They'll be following up with their transactional email provider.
Host is fully in service now
Thu, Oct 29
Wed, Oct 28
Tue, Oct 27
IIRC this was fixed, boldly resolving
This is complete!
From my POV as Swift maintainer I'm ok to go ahead with testing etc.
Mon, Oct 26
Oct 22 2020
Unfortunately reopening: we've been seeing failures (e.g. systemd, ssh) during the latest codfw rebalances