
Occasional "Cannot access the database: Unknown error" in Wikimedia production
Closed, Invalid (Public)

Description

https://logstash.wikimedia.org/goto/da55900c7dc6fddf0f70ec4ca324d4eb
There is a slow but steady trickle of "Cannot access the database: Unknown error" errors in production (roughly 20-30 a day, with a big spike around Nov 23 15:00, which might be something separate). They are fairly randomly distributed across appservers and wikis, and seemingly across DB servers too, although there is no easy way to tally that. This seems to have been going on for a long time:

dberrors.png (268×1 px, 35 KB)
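The distribution across appservers, wikis and DB servers could be checked by tallying exported log records per field. The following is only a minimal sketch, assuming the Logstash results are exported as JSON lines; the field names `host`, `wiki` and `db_server` and the sample records are hypothetical and would need to be replaced with whatever the real log entries contain.

```python
import json
from collections import Counter


def tally_errors(lines, fields=("host", "wiki", "db_server")):
    """Count how often each value of each field occurs in JSON-lines records.

    The default field names are hypothetical placeholders, not the actual
    Logstash schema.
    """
    counts = {field: Counter() for field in fields}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        record = json.loads(line)
        for field in fields:
            # Records lacking a field are tallied under "unknown" so that
            # gaps in the data stay visible instead of being dropped.
            counts[field][record.get(field, "unknown")] += 1
    return counts


# Made-up sample records, for illustration only.
sample = [
    '{"host": "mw1001", "wiki": "enwiki", "db_server": "db-a"}',
    '{"host": "mw1002", "wiki": "dewiki", "db_server": "db-b"}',
    '{"host": "mw1001", "wiki": "frwiki"}',
]

if __name__ == "__main__":
    for field, counter in tally_errors(sample).items():
        print(field, counter.most_common())
```

If one host, wiki or DB server dominated its tally, that would argue against the errors being randomly distributed.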

Event Timeline

Without a more concrete error it is hard to troubleshoot from the DB side :-(
If it is randomly distributed and/or not specific to either app servers or DB servers, it might be small network glitches? Again, without a more concrete error it is unfortunately hard to debug further.

Let's talk on IRC? We can compare notes on timestamps, servers, db names etc.

I found 3 entries in the past 30 days. The entries this month, and the entries seen before, do not appear to have anything in common, nor do they affect a specific wiki, user group, content type, server, etc. The error is presumed random and can be filtered out under general OOM/Timeout.

If these become in any way common, or are encountered consistently in particular scenarios, we can investigate.