From a quick review with @Ladsgroup in CR and in-meeting with @aaron and @tstarling, it seems like we might not need as much reconnection logic and could potentially lean more towards failing hard/early.
How it's doing today
Three relevant scenarios.
- Connection loss during a transaction or other stateful operation. These are not recoverable and result in failure.
  This is basically the only reasonable outcome and would stay, although possibly with less logic if we make it a more common outcome.
  In the past 7 days, 1 unrecoverable connection loss. It came from a job runner, which would have retried. I believe these generally also throw an uncaught DBQueryDisconnectedError exception.
- Connection loss from stateless read handle or auto-commit write handle. These are recoverable.
  In the past 7 days, 102 successfully recovered connection losses.
  By server group:
  - 48 jobrunners (47 videoscaler jobs, 1 misc job).
  - 54 other:
    - 51 mwmaint scripts (mostly updateSpecialPages and GrowthExperiments/refreshLinkRecommendations).
    - 3 snapshot1008.
    - 0 parsoid.
    - 0 appserver.
    - 0 api_appserver.
- Connection loss where reconnection fails.
  In the past 7 days, 0 recoverable connection losses failed to reconnect. These would be logged as:
  `lost connection to {db_server} with error {errno}; reconnection failed`
Historical context
Tim mentioned that this code started with just the ping() query being used strategically before a write query when we know a lot of time has passed since we last used the db handle for writes. For example, the updateSpecialPages.php maintenance script spends many hours gathering data from replica databases, after which it performs a ping to ensure that the db write handle is okay (auto-reconnecting if not), and then performs the write.
This meant that no reconnection logic had to exist elsewhere, and any failed writes would always simply be failures.
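For illustration, a minimal sketch of that pattern (assuming an ILoadBalancer handle $lb; the query, table, and variables are placeholders, not the actual updateSpecialPages.php code):

```php
// Long-running maintenance flow; only IDatabase::ping() and the general
// shape are taken from the description above, the rest is hypothetical.
$dbr = $lb->getConnection( DB_REPLICA );
$rows = $dbr->query( $expensiveSelect, __METHOD__ ); // may take hours

// The write handle has been idle all this time and may have been dropped
// by the mysql server or kernel. ping() checks the connection and
// transparently reconnects if needed, so the write below is safe.
$dbw = $lb->getConnection( DB_PRIMARY );
$dbw->ping();
$dbw->insert( $cacheTable, $computedRows, __METHOD__ );
```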
This later came up in other contexts as well, where the time isn't spent in the database per se but, for example, on disk (for snapshots/dump generation), on the CPU, or on Swift/NFS (during video transcoding or chunked-upload processing jobs). At some point this logic was generalised to accommodate those.
Tim emphasises that this was never intended for web requests, and the above data fortunately indicates that we don't seem to be relying on it in production at all right now, not even for a small fraction of requests. This suggests that while removing it would likely have been an acceptable loss (especially with the automatic retry we have at the HTTP level for idempotent requests), it appears to be no loss at all. Granted, this was sampled from a single random week, so it's possible that it does happen in rare cases. However, it's consistent with our expectation that connection losses stem mainly from sleeper connections, where the mysql server or kernel disposes of an unused connection after the TCP socket has been idle for a certain period. That idle timeout is presumably longer than 60 seconds (GET request timeout) and perhaps also longer than 3 minutes (POST request timeout), which would explain why these losses don't happen under normal circumstances on app servers.
Numbers and sources for HTTP timeouts documented at https://wikitech.wikimedia.org/wiki/HTTP_timeouts.
How it works
- If a query fails, we determine whether it's due to a connection loss.
- No matter what, we reconnect. (Really!)
- If it is safe to do so, we repeat the last query and silently move on (logging a diagnostic warning).
- If it is not safe, we mark the Database object as being in a dirty/unsafe state and throw an error.
- The error is usually fatal. However, callers are allowed to catch it and call rollback(), which is effectively a no-op but resets the Database object's state and permits the caller to run new queries, as sketched below. If a caller catches the error but fails to acknowledge it with a rollback, the db handle remains dirty and throws on the next query (e.g. T281451, T293859).
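To make that contract concrete, here is a sketch of the caller side. DBQueryDisconnectedError and rollback() are real Rdbms names; $dbw, the update, and the surrounding context are placeholders:

```php
use Wikimedia\Rdbms\DBQueryDisconnectedError;

try {
	$dbw->update( 'page', [ 'page_touched' => $now ], [ 'page_id' => $id ], __METHOD__ );
} catch ( DBQueryDisconnectedError $e ) {
	// The connection itself has already been re-established, but session
	// state (transaction, locks, temp tables) is gone and the handle is
	// flagged as dirty. rollback() is effectively a no-op here, but it
	// resets that flag so that new queries are permitted again.
	$dbw->rollback( __METHOD__ );
}
```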
In a web context, I believe such continuation would mainly be for rendering the error page about the connection loss, which could involve further queries. Per the above data, though, this doesn't actually happen. Likewise, it would be used between deferred updates, but again we don't see that right now.
Outside a web context, this is used between jobs when they run in batches (runJobs.php; see the sketch below). That's the only scenario I can think of where it's reasonable to want the process to continue after knowingly having lost state, since each job is separate and there is no benefit to stopping the process. That said, we don't use this at WMF, and for third parties runJobs is generally run in a loop or regularly on page views. Hence, it might be reasonable to always fail fatally here and not support recovery from the dirty state (CGI for the win?). From a library perspective, I suppose it would be important to make sure that one can at least dispose of the state and start fresh from LBFactory. Afaik we don't need to do anything to support that, it just works, but it's something to think about.
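For the batch-runner case, that continuation would look roughly like the following (a sketch building on the previous one; $jobs and runOneJob() are hypothetical stand-ins, not the actual runJobs.php structure):

```php
foreach ( $jobs as $job ) {
	try {
		runOneJob( $job, $dbw );
	} catch ( DBQueryDisconnectedError $e ) {
		// Acknowledge the lost session state and reset the handle; each
		// job is independent, so there is no reason to abort the whole
		// process over a connection loss between jobs.
		$dbw->rollback( __METHOD__ );
	}
}
```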
Code in question
Notable code and complexity that may be related:
- Database::canRecoverFromDisconnect (source)
- Database::replaceLostConnection (source)
- DatabaseMysqlBase::isKnownStatementRollbackError (source)
Proposal
No proposal at this time. This is just to capture the above conversations and to think about whether we can do anything to simplify things, e.g. in the context of T299691: Break down monster class: Database.