Page MenuHomePhabricator

wikimedia/rdbms should not log warnings about recoverable issues as DBConnection ERROR in Logstash
Closed, ResolvedPublic

Description

It seems we currently log recoverable (and succesfully recovered) db connection issues with severity ERROR to Logstash. This is affecting AHT Team among others because their code paths are matching in the trace.

Actively excluding/ignoring that channel is a current workaround, and depending on the ansers to the below questions, we could accept that as the best practice and recommend it to everyone. However if that's the case and we want almost nobody to see these, then we could also do the simpler and less wasteful/disruptive thing and lower their severity to INFO instead and naturally get the same behaviour since team dashboards naturally hide DEBUG/INFO by default until a developer opens that gate during an investigation.

Work

Based on examples in this log, I have looked for the same reqId to see if an actual uncaught error happened later down the line and it seems this was never the case in the 24 hour period that I looked at. There were no messages in any other channels logged from the production traffic that logged an connection error to DBConnection. As such, I'm assuming that these all ended up re-tried and succeeding.

  • Question: Is this assumption correct? And thus is it okay for developers to ignore these messages currently as anything serious would end up re-logged higher in the stack?

If the answer is Yes, then I recommend we lower the severity to INFO and perhaps improve the message to clearly describe that as being about an "internal" connection, and that merely an "attempt" has failed. The logical connection request itself did not fail, and the INFO severity will mean it is hidden from view by default, thus scoping it to basically maintainers of Rdbms when diagnosing an incident related to that in the future. But not for general purposes of awareness about effective connection failures.

  • Code: Update message text and severity.
    • Describe as internal attempt under INFO.
    • Or, get rid of it entirely and have the higher level be responsible for logging it in a more detailed manner based on its state. For example, the higher level part of Rdbms will be able to conclude it as "This was the last attempt and we can't retry, this a real issue now" (which we already log and thus don't need this one as well) or as "We're going to try again, here is some INFO stuff about a now-past attempt that failed".
    • Or, if we want to generically log every attempt the same way for debugging, then perhaps move it to happen before instead of afterwards, e.g. stating "Making a connection to X Y Z" under DEBUG severity. But afaik we do this already, and so don't need it.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Krinkle edited projects, added Developer Productivity; removed Platform Engineering.
Krinkle updated the task description. (Show Details)
Krinkle closed this task as Resolved.EditedApr 13 2022, 10:06 PM
Krinkle assigned this task to Ladsgroup.

Not sure which patch exactly, but this appears to no longer be an issue based on the last 2h hours of logs, eg. connection errors are still logged under DBConnection with level ERROR and under exception as fatal, but that by itself is fine and useful given the former has more detailed and db-related context. What I no longer see, whicih was the issue, is such entries that don't then result in a fatal. (e.g. log as ERROR but then also reconnect).

Loosely related future work: T305016: Think about rdbms reconnection logic