T127066 is closed, but the issue persists. In particular, catscan2 seems to go down every few hours. To reiterate: it worked fine for months and started having trouble around the time of the last DB server crash, though I do not know whether that was a coincidence.
Whatever changed on Labs, PLEASE FIX IT. We can't keep restarting our tools manually all the time.
Checking the heartbeat table, I find an "unknown" shard with a very high lag, plus some lag on s3. Might one of those be the cause?
MariaDB [wikidatawiki_p]> SELECT * FROM heartbeat_p.heartbeat;
+---------+----------------------------+---------+
| shard   | last_updated               | lag     |
+---------+----------------------------+---------+
| s6      | 2016-02-24T11:02:39.501070 |       0 |
| unknown | 2016-02-11T11:14:01.219140 | 1122517 |
| s7      | 2016-02-24T11:02:39.501110 |       0 |
| s3      | 2016-02-24T10:50:39.501080 |     719 |
| s4      | 2016-02-24T11:02:15.501190 |      23 |
| s1      | 2016-02-24T11:02:36.000590 |       2 |
| s5      | 2016-02-24T11:02:38.500810 |       0 |
| s2      | 2016-02-24T11:02:39.000970 |       0 |
+---------+----------------------------+---------+
8 rows in set (0.01 sec)
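
In case it helps anyone watching for this, here is a minimal sketch of how the rows above could be filtered for lagging shards. The rows are copied from the query output; the function name `lagging_shards` and the 60-second threshold are my own assumptions, not anything Labs provides. It also assumes the lag column is in seconds, which matches the timestamps (s3 was last updated 12 minutes before the others and shows 719).

```python
# Rows copied from the heartbeat_p.heartbeat output above:
# (shard, last_updated, lag). Lag is assumed to be in seconds.
ROWS = [
    ("s6", "2016-02-24T11:02:39.501070", 0),
    ("unknown", "2016-02-11T11:14:01.219140", 1122517),
    ("s7", "2016-02-24T11:02:39.501110", 0),
    ("s3", "2016-02-24T10:50:39.501080", 719),
    ("s4", "2016-02-24T11:02:15.501190", 23),
    ("s1", "2016-02-24T11:02:36.000590", 2),
    ("s5", "2016-02-24T11:02:38.500810", 0),
    ("s2", "2016-02-24T11:02:39.000970", 0),
]


def lagging_shards(rows, threshold=60):
    """Return (shard, lag) pairs whose lag exceeds threshold, worst first.

    The name and the default threshold are hypothetical choices for
    illustration only.
    """
    flagged = [(shard, lag) for shard, _updated, lag in rows if lag > threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for shard, lag in lagging_shards(ROWS):
        print(f"{shard}: {lag}")
```

With the data above this flags only "unknown" and s3; s4 at 23 stays under the assumed threshold.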