
grafana.wmcloud.org offline following cloud-wide outage
Closed, Resolved | Public | Bug Report

Description

[23:26]  <   JJMC89> FYI, getting 503s for grafana.wmcloud.org
[23:28]  <    bd808> JJMC89: I'll see if the why is obvious...
[23:32]  <    bd808> !log metricsinfra grafana.wmcloud.org offline with db connection error. Investigating.
[23:35]  <    bd808> !log metricsinfra metricsinfra-db-1.trove.eqiad1.wikimedia.cloud not responsive to ssh
[23:37]  <    bd808> !log metricsinfra metricsinfra-db-1.trove.eqiad1.wikimedia.cloud restarted via Horizon
[23:41]  <    bd808> andrewbogott: do you know how to troubleshoot a trove db instance? the metricsinfra-db-1 instance in the metricsinfra project is not talking with the grafana process on metricsinfra-grafana-1.metricsinfra.eqiad1.wikimedia.cloud. Restarting the trove db via horizon didn't seem to do anything useful.

The error recorded by grafana-server on metricsinfra-grafana-1.metricsinfra.eqiad1.wikimedia.cloud is:

2023-02-13T23:41:01.41+0000 lvl=eror msg="failed to determine the status of alerting engine. Enable either legacy or unified alerting explicitly and try again" err="failed to verify if the 'alert' table exists: dial tcp 172.16.3.253:3306: connect: connection refused"
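
The "connection refused" in the error above generally means the host answered but nothing was accepting connections on port 3306 at that moment, as opposed to a timeout, where no answer comes back at all. A minimal sketch for telling those cases apart from the Grafana host follows; the helper name check_tcp and the 3-second timeout are illustrative assumptions, not part of any Wikimedia tooling.

```python
import socket


def check_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a plain TCP connect and classify the outcome.

    Hypothetical helper for illustration. Returns one of
    "open", "refused", "timeout", or "error: <details>".
    """
    try:
        # create_connection performs the TCP handshake only; no MySQL
        # protocol traffic is sent.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied with a reset: reachable, but no listener.
        return "refused"
    except socket.timeout:
        # No reply within the timeout: host down or packets dropped.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. no route to host or DNS failure.
        return f"error: {exc}"
```

Calling check_tcp("172.16.3.253", 3306) from metricsinfra-grafana-1 would be expected to return "refused" while the database is down and "open" once it is listening again. This only checks the TCP layer, so "open" does not prove that the database accepts logins.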

Event Timeline

taavi assigned this task to Andrew.
<andrewbogott> bd808: I didn't do anything smart but I think grafana is talking to metricsinfra-db-1 again. (I just did 'restart instance,' probably the same thing you did)