
Think about rdbms reconnection logic
Open, Low, Public

Description

From a quick review with @Ladsgroup in CR and in-meeting with @aaron and @tstarling, it seems like we might not need as much reconnection logic and could potentially lean more towards failing hard/early.

How it's doing today

Three relevant scenarios.

  • Connection loss during a transaction or other stateful operation. These are not recoverable and result in failure.

    This is basically the only reasonable outcome and would stay, although possibly with less logic if we make it a more common outcome.

channel:DBConnection, level:WARNING
Silent reconnection to {db_server} could not be attempted: {error}

1 unrecoverable connection loss in 7 days. It came from a job runner, which would have retried. I believe these generally also throw an uncaught DBQueryDisconnectedError exception.

  • Connection loss from stateless read handle or auto-commit write handle. These are recoverable.

channel:DBConnection, level:WARNING
lost connection to {db_server} with error {errno}; reconnected

In the past 7 days, 102 successfully recovered connection losses.

By server group:

  • 48 jobrunners (47 videoscaler jobs, 1 misc job).
  • 54 other:
    • 51 mwmaint scripts (mostly updateSpecialPages and GrowthExperiments/refreshLinkRecommendations).
    • 3 snapshot1008.
  • 0 parsoid.
  • 0 appserver.
  • 0 api_appserver.

In the past 7 days, 0 recoverable connection losses failed to reconnect. Would be logged as:

lost connection to {db_server} with error {errno}; reconnection failed

Historical context

Tim mentioned that this code started with just the ping() query being strategically used before a write query when we know that much time has passed since we last used the db handle for writes. For example, in the updateSpecialPages.php maintenance script we spend many hours gathering data from replica databases, after which we perform a ping to ensure the db write handle is okay (auto-reconnecting if needed), and then perform the write.

This meant that no reconnection logic had to exist elsewhere, and any failed writes would always simply be failures.
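As a rough illustration of that original pattern (the helper name and target table here are illustrative, not the actual updateSpecialPages.php code):

  // Sketch of the historical ping-before-write pattern described above.
  $dbr = wfGetDB( DB_REPLICA );
  $dbw = wfGetDB( DB_PRIMARY );

  // Hours of data gathering from the replica, during which the write handle
  // sits idle and may be dropped by the server or kernel.
  $rows = computeSpecialPageRows( $dbr ); // hypothetical helper

  // ping() verifies the write handle and transparently reconnects it if the
  // idle connection was lost, so no reconnection logic is needed elsewhere.
  $dbw->ping();
  $dbw->insert( 'querycache', $rows, __METHOD__ );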

This later came up in other contexts as well, where the time isn't spent in the database per se but, for example, on disk (for snapshots/dump generation), on the CPU, or on Swift/NFS (during video transcoding or chunked upload processing jobs). At some point this logic was generalised to accommodate those.

Tim emphasises that this was never intended for web requests, and fortunately the above data indicates that we don't seem to be relying on it at all in production right now, not even for a small fraction of requests. This suggests that, while a connection loss there would likely have been an acceptable loss (especially with the automatic retry we have at the HTTP level for idempotent requests), in practice it is not a loss at all. Granted, this was just sampled from a random week, so it's possible that it does happen in rare cases. However, it's consistent with our expectation that connection losses stem mainly from sleeper connections, where the mysql server or kernel disposes of an unused connection after the TCP socket has been idle for a certain period of time. This idle timeout is presumably longer than 60 seconds (the GET request timeout) and perhaps also longer than 3 minutes (the POST request timeout), which explains why connection losses would not happen under normal circumstances on app servers.

Numbers and sources for HTTP timeouts documented at https://wikitech.wikimedia.org/wiki/HTTP_timeouts.

How it works

  • If a query fails, we determine if it's due to a connection loss.
  • No matter what, we reconnect. (Really!)
  • If it is safe to do so, we repeat the last query and silently move on (logging a diagnostic warning).
  • If not, we mark the Database object as being in an unsafe state and throw an error.
  • The error is usually fatal, although callers are allowed to catch it and call rollback(), which is effectively a no-op but will reset the Database object state and permit the caller to run new queries. If a caller catches the error but fails to acknowledge it with a rollback, the db handle remains dirty and throws on the next query (e.g. T281451, T293859). A schematic sketch of this flow follows below.
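To make the above flow concrete, here is a schematic sketch. This is not the actual Database code (which spreads the logic across executeQuery(), canRecoverFromDisconnect() and replaceLostConnection()); the callables and the error type are stand-ins.

  use Psr\Log\LoggerInterface;

  // Schematic recovery flow, assuming $doQuery throws a RuntimeException on
  // connection loss (the real code inspects the errno of the failed query).
  function runQueryWithRecovery(
      callable $doQuery, callable $reconnect, bool $hasSessionState, LoggerInterface $logger
  ) {
      try {
          return $doQuery();
      } catch ( RuntimeException $e ) {
          // No matter what, reconnect so the handle is usable again.
          $reconnect();
          if ( !$hasSessionState ) {
              // Stateless read / auto-commit write handle: retry and log.
              $logger->warning( 'lost connection; reconnected' );
              return $doQuery();
          }
          // A transaction or other session state was lost: fail hard. In the
          // real class the handle stays dirty until rollback() is called.
          throw $e;
      }
  }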

In a web context, I believe such continuation would mainly be for rendering the error page about a connection loss, which could involve further queries; per the above data, though, this doesn't actually happen. Likewise, it would be used between deferred updates, but again we don't see that right now.

Outside a web context, this is used between jobs when they run in batches (runJobs.php). That's the only scenario I can think of where it's reasonable to want the process to continue after knowingly having lost state, since each job is separate and there is no benefit to stopping the process. However, we don't use this at WMF, and for third parties runJobs is generally run in a loop or regularly on page views. Hence, it might be reasonable to always fatal here and not support a recovery from dirty state (CGI for the win?). From a library perspective, I suppose it would be important to make sure that one can at least dispose of the state and start fresh from LBFactory. Afaik we don't need to do anything to support that, it just works, but it's something to think about.
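For reference, a hedged sketch of that caller-side pattern between batch items (an illustrative loop, not the actual runJobs.php/JobRunner code):

  use Wikimedia\Rdbms\DBError;

  $dbw = wfGetDB( DB_PRIMARY );
  foreach ( $jobs as $job ) { // $jobs: some iterable of Job objects
      try {
          $job->run();
      } catch ( DBError $e ) {
          // Acknowledge the lost session state: rollback() is effectively a
          // no-op on the server here, but it resets the Database object so
          // the remaining, independent jobs can keep issuing queries.
          $dbw->rollback( __METHOD__ );
          // ...record the job as failed/retryable...
      }
  }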

Code in question

Notable code and complexity that may be related:

  • Database::canRecoverFromDisconnect (source)
  • Database::replaceLostConnection (source)
  • DatabaseMysqlBase::isKnownStatementRollbackError (source)

Proposal

No proposal at this time. Just capturing the above conversations and thinking about whether we can do anything to simplify things, e.g. in the context of T299691: Break down monster class: Database.

Event Timeline


I thought about this a bit. It's a bit hard to avoid scope creep and still keep the bigger picture in mind. This ties into T302880: Reduce connections opened by mediawiki, T299691: Break down monster class: Database, T298485: MW scripts should reload the database config and more.

My thinking is that Database is trying to juggle two different worlds: one handling the db in a request context, and one handling the db in jobs/maint scripts/videoscalers. The latter group has completely different needs and differs drastically from what a request needs; we prioritized requests (understandably), leading to all sorts of bugs, including dbs keeping connections to depooled hosts.

Here are some random suggestions:

  • Decide whether to create a new Database class for jobs/maint or stick to MaintainableDatabase.
    • If the latter, then make sure wfGetDB() stops returning MaintainableDatabase
  • Move all of reconnect logic to that class.
    • If you ask me, these long-running systems hold the connection for way too long. If you do "show processlist" on any db, it's full of jobrunner/mwmaint connections in Sleep all the time. This can lead to an outage if the load on appservers goes up. In jobs/maint we can simply close the goddamn connection once a query is done and reconnect if there is a new query (a rough sketch of this follows below).
  • Remove implicit transaction logic from that class.
  • Possibly build a way that it would automatically reload the config.
  • And/Or instead of holding one DB (host), hold multiple and do round-robin on the DBs (and other magic we can implement). -...

How does that sound?
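To illustrate the "close the connection once a query is done" idea from the list above, a minimal sketch. The wrapper is hypothetical; getConnection()/closeConnection() are real ILoadBalancer methods, but real code would need more careful handle lifecycle handling:

  use MediaWiki\MediaWikiServices;

  // Hypothetical wrapper: hold a primary handle only for the duration of a
  // write, then close it so a long-running script doesn't sit in "Sleep".
  function withFreshPrimary( callable $write ) {
      $lb = MediaWikiServices::getInstance()->getDBLoadBalancer();
      $dbw = $lb->getConnection( DB_PRIMARY ); // opens (or reopens) lazily
      try {
          return $write( $dbw );
      } finally {
          // Close the underlying connection; the next call opens a fresh one.
          $lb->closeConnection( $dbw );
      }
  }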

  • Move all of reconnect logic to that class.
    • If you ask me, these long-running systems hold the connection for way too long. If you do "show processlist" on any db, it's full of jobrunner/mwmaint connections in Sleep all the time. This can lead to an outage if the load on appservers goes up. In jobs/maint we can simply close the goddamn connection once a query is done and reconnect if there is a new query.
  • Remove implicit transaction logic from that class.

I think this wouldn't really help with complexity. There's not much interaction between implicit transactions and automatic connection recovery, so you would have about the same amount of code. Dropping connections early and relying more on automatic connection recovery tips the balance towards performance and away from simplicity.

  • Possibly build a way that it would automatically reload the config.

One way to allow both connection recovery and configuration reload would be to have a checkpoint/restart feature for maintenance scripts and jobs. The outer loop of compatible scripts and jobs would be factored out into a controller system. Connection loss and config change would be treated as recoverable errors leading to shutdown and restart of the worker process. So Database/LoadBalancer would be simpler, instead you would have increased complexity in maintenance scripts and jobs.

  • Move all of reconnect logic to that class.
    • If you ask me, these long-running systems hold the connection for way too long. If you do "show processlist" on any db, it's full of jobrunner/mwmaint connections in Sleep all the time. This can lead to an outage if the load on appservers goes up. In jobs/maint we can simply close the goddamn connection once a query is done and reconnect if there is a new query.
  • Remove implicit transaction logic from that class.

I think this wouldn't really help with complexity. There's not much interaction between implicit transactions and automatic connection recovery, so you would have about the same amount of code. Dropping connections early and relying more on automatic connection recovery tips the balance towards performance and away from simplicity.

  • Possibly build a way that it would automatically reload the config.

One way to allow both connection recovery and configuration reload would be to have a checkpoint/restart feature for maintenance scripts and jobs. The outer loop of compatible scripts and jobs would be factored out into a controller system. Connection loss and config change would be treated as recoverable errors leading to shutdown and restart of the worker process. So Database/LoadBalancer would be simpler, instead you would have increased complexity in maintenance scripts and jobs.

True, but my point is to shift the complexity to a dedicated class handling db connections in maint scripts (we can actually consider jobs the same as web requests). In other words, I think we basically need to have two systems of handling connections: one rather simple, handling connections in webrequests and jobs (without reconnection logic, config reload logic, or a lot of other issues), and the other class being more complicated, handling connections in maint scripts only. That would simplify the code in the common case (webrequests and jobs) a lot, but it also avoids copy-pasting a lot of logic into every maint script.

That being said, we don't have that many long-running maint scripts, so we can copy-paste without introducing too much tech debt.

Just passing by to provide one use case that I believe was worked around (but I am not sure if fully fixed) some years ago: closing connections. Videoscalers used to have (or try to have) long open idle connections while they were running (T97641). I remember asking someone to avoid that behavior, and it was solved in some way. Mentioning it in case you want to take this behavior into account for reconnection needs. Apologies if this is off topic, I just wanted to provide more info.

One way to allow both connection recovery and configuration reload would be to have a checkpoint/restart feature for maintenance scripts and jobs. The outer loop of compatible scripts and jobs would be factored out into a controller system. Connection loss and config change would be treated as recoverable errors leading to shutdown and restart of the worker process. So Database/LoadBalancer would be simpler, instead you would have increased complexity in maintenance scripts and jobs.

I've never actually used forking in PHP, so I could be way off, but that seems like a neat solution here. There could be a Maintenance subclass with forking support via a generator (the logic largely already exists in ForkController); then typically the only things to change would be the parent class, replacing some foreach loop with the generator call, and providing some mechanism (a socket?) for sending data (like the number of rows updated and the last seen ID) back.

Of course that wouldn't reload the configuration, duh. It could somewhat help with the "long-running process keeping DB connection open" issue, but then the script might try to reconnect to the depooled DB, which is probably bad.
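A very rough sketch of that idea, using plain pcntl/stream calls and hypothetical names rather than the real ForkController API (in practice one would also close DB handles before forking, as ForkController does, so parent and child don't share sockets):

  // Hypothetical batch runner: each batch runs in a freshly forked child,
  // which reports the next start ID back to the parent over a socket pair.
  function runBatchesInChildren( callable $batch, int $startId = 0 ) {
      while ( true ) {
          $pair = stream_socket_pair(
              STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP );
          $pid = pcntl_fork();
          if ( $pid === 0 ) {
              // Child: process one batch and send back the next start ID,
              // or null when there is nothing left to do.
              fclose( $pair[0] );
              fwrite( $pair[1], json_encode( $batch( $startId ) ) );
              fclose( $pair[1] );
              exit( 0 );
          }
          // Parent: wait for the child, then read its reply.
          fclose( $pair[1] );
          pcntl_waitpid( $pid, $status );
          $next = json_decode( stream_get_contents( $pair[0] ), true );
          fclose( $pair[0] );
          if ( $next === null ) {
              break; // done (or the child died without replying)
          }
          $startId = $next;
      }
  }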

Dump from convo with @Ladsgroup:

For the T298485 use case, we want the outcome that as a DBA you can do "something" and then within 10 minutes no longer have any open connections, or only connections that we are fine to quit and that won't come back after that final quit.

Right now the reconnect logic is in the Database class, and as the interface for ongoing scripts is DBConnRef, it is currently non-trivial to respond to a connection loss by connecting to something other than the same host, as that information lives in the LB/LBF objects.

If we refactor this and allow Rdbms to respond to a connection loss by essentially depooling the host from the in-process perspective and connecting to a different replica, the main concern I can think of is ChronologyProtector. The good thing is that this is mainly for web use cases, and the above comments and analysis indicate that we don't need reconnect logic on web requests anyway, so we can simply exclude all this from web requests. We may still want the new connection to start with a waitFor to the last-seen read/write positions just in case, which seems fine.

If we consider the question of what to do upon connection loss solved, in theory, the next question is how we get the maintenance script to yield and/or trigger a connection loss on purpose, such that it doesn't interrupt any actual writes or (short) atomic transactions on the otherwise fairly "idle" auto-commit connections from mwmaint scripts. Amir's answer is that we can quit processes from the mysql server process list, quitting only the ones that are idle / in sleep mode, and repeat this until there's nothing left. Long-running scripts tend to spend a fair amount of time between writes, so this should work in practice even if not in theory. We can have some sort of fallback where maybe after X minutes we quit the connection anyway, or report to the operator that a particular connection is in a long transaction and/or is ending/starting transactions with insufficient time in between, and then let the operator decide, as is the case today.
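For illustration only, a hedged sketch of that operator-side loop in ad-hoc PHP (hypothetical host, credentials and db user; in practice a DBA would presumably use existing tooling such as pt-kill):

  // Kill only idle ("Sleep") connections from the MediaWiki user, then wait
  // and repeat until none remain or a ~10 minute deadline passes.
  $mysqli = new mysqli( 'db-host.example', 'admin', 'secret' ); // hypothetical
  $deadline = time() + 600;
  do {
      $idle = $mysqli->query(
          "SELECT ID FROM information_schema.PROCESSLIST
           WHERE USER = 'wikiuser' AND COMMAND = 'Sleep'" );
      foreach ( $idle as $row ) {
          $mysqli->query( 'KILL ' . (int)$row['ID'] );
      }
      sleep( 10 );
      $remaining = (int)$mysqli->query(
          "SELECT COUNT(*) AS n FROM information_schema.PROCESSLIST
           WHERE USER = 'wikiuser'" )->fetch_assoc()['n'];
  } while ( $remaining > 0 && time() < $deadline );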

Change 854956 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: deprecate Database::wasErrorReissuable

https://gerrit.wikimedia.org/r/854956

Change 854957 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: remove lock timeouts from Database::isKnownStatementRollbackError

https://gerrit.wikimedia.org/r/854957

Change 854956 merged by jenkins-bot:

[mediawiki/core@master] rdbms: deprecate Database::wasErrorReissuable

https://gerrit.wikimedia.org/r/854956

Change 854957 merged by jenkins-bot:

[mediawiki/core@master] rdbms: remove lock timeouts from Database::isKnownStatementRollbackError

https://gerrit.wikimedia.org/r/854957