
DispatchChanges: Avoid long-lasting connections to the master DB
Closed, Resolved (Public)

Description

Status Quo:

  • We have several (about 5 to 10) instances of dispatchChanges.php running as cron jobs concurrently.
  • dispatchChanges.php picks the most lagged client by looking at wb_changes_dispatch, grabs a lock representing that client, and pushes updates to it. On success, wb_changes_dispatch is updated and the lock is released.
  • Locking is done using the Database::lock method, which translates to MySQL's GET_LOCK. The lock is bound to the DB connection (see the sketch below this list).
  • Each run of dispatchChanges.php may make several dispatch passes, picking and locking a client wiki each time.
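
For readers not familiar with the script, one dispatch pass looks roughly like this. This is a simplified sketch, not the actual dispatchChanges.php code: selectMostLaggedClient(), dispatchChangesTo() and updateDispatchState() are hypothetical helpers, the lock name is made up, and $dbw stands for the Database handle on the repo master.

```php
// Simplified sketch of one dispatch pass (hypothetical helpers, not real code).
$clientWiki = selectMostLaggedClient( $dbw ); // reads wb_changes_dispatch

// Database::lock() issues GET_LOCK() on MySQL. The lock only exists as long as
// this connection does, so the connection must stay open for the whole pass.
if ( $dbw->lock( "wikibase.dispatch.$clientWiki", __METHOD__, 10 ) ) {
	try {
		dispatchChangesTo( $clientWiki );
		updateDispatchState( $dbw, $clientWiki ); // update wb_changes_dispatch on success
	} finally {
		$dbw->unlock( "wikibase.dispatch.$clientWiki", __METHOD__ ); // RELEASE_LOCK()
	}
}
```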

The Problem:

  • Each of the (roughly 10) dispatchChanges.php instances holds a connection to the master DB for several minutes. (How much of a problem is this, really?)
  • We cannot close the connection and re-connect, since we would lose the lock.

Ideas:

  • Rely on information written to wb_changes_dispatch. But how do we prevent starvation due to stale locks left by a failed dispatch pass or a dead process? High timeout values cause starvation, low timeout values cause false negatives (see the sketch after this list).
  • Use some other locking mechanism. We have the LockManager base class with several implementations, including Memc, Redis, file system, and MySQL.
  • Put wb_changes_dispatch on a separate DB system. But running a separate server for a single DB table with 1000 rows and 10 concurrent connections seems overkill.
NOTE: Switching from one locking mechanism to another requires careful coordination. Execution of dispatchChanges.php must be halted and then re-started with the new configuration, without causing more than a few minutes of lag for client wikis.
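
To make the timeout trade-off from the first idea concrete, a purely row-based lock would look roughly like the sketch below. This is hypothetical code, not something that exists today; the column names follow wb_changes_dispatch, while $dbw, $clientWiki and $ownerToken are assumed to come from the surrounding script.

```php
// Hypothetical row-based locking via wb_changes_dispatch (illustration only).
$timeout = 900; // seconds. Too high: a stale lock starves the client for that long.
                // Too low: a slow but healthy pass is falsely declared dead.

$dbw->startAtomic( __METHOD__ );
$row = $dbw->selectRow(
	'wb_changes_dispatch',
	[ 'chd_lock', 'chd_touched' ],
	[ 'chd_site' => $clientWiki ],
	__METHOD__,
	[ 'FOR UPDATE' ]
);
$age    = time() - (int)wfTimestamp( TS_UNIX, $row->chd_touched );
$isFree = $row->chd_lock === null || $age > $timeout;
if ( $isFree ) {
	$dbw->update(
		'wb_changes_dispatch',
		[ 'chd_lock' => $ownerToken, 'chd_touched' => $dbw->timestamp() ],
		[ 'chd_site' => $clientWiki ],
		__METHOD__
	);
}
$dbw->endAtomic( __METHOD__ );
```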


Event Timeline

Holding connections on the master: if there are 5-10 jobs running, it shouldn't be a big deal, as I assume only 10 connections (max) will be just connected (but not doing anything, right?).
It can be a problem if the master is under heavy stress; however, if the lock isn't doing anything but keeping the connection open, it isn't a massive deal.

@Manuel, @daniel Actually it is a problem, because masters have a limit of #CPUs or 32 active threads on the connection pool, which means half of those connections are reserved but doing nothing, so you are limiting the master's throughput to 50% of its real capacity. Also, keeping connections open for 15 minutes means that in case of a master problem, those connections do not pick up new configuration (like a master failover) or a new MediaWiki version, which limits high availability: if there is a problem with the master, it cannot be easily failed over. All connections should be short-lived, or at least easy to kill.

The wiki master is a SPOF, and it should not be used for coordination. For example, as @aaron will be able to tell you, the real master will not be easily accessible from a remote datacenter, which means this functionality, as it is now, limits cross-DC scalability.

There is absolutely no reason to do this on a master, when it could easily be done on a separate server that doesn't interact with other functionality.

I would favor doing the locking using an alternative mechanism for which we already have infrastructure. Setting up a separate Maria server just to serve 10 mostly idle connections seems overkill.

However, I'm not sure the available locking mechanisms based on Memc and Redis have sufficient protection against stale (orphan) locks. Perhaps @aaron can shed some light on this.

Using memcached is probably not feasible, since locks may be dropped at any time. If Redis turns out not to be an option, we may consider using a more powerful distributed locking service. Running another service just for this seems to be a bit of overkill, but if the alternative is to set up a separate Maria host for a single 1000-row table, it may still be the better option.
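
For illustration, a TTL-based Redis lock at least bounds how long a stale lock can linger. This is a minimal sketch using the phpredis client directly, not MediaWiki's LockManager; the host, key name, token and TTL are made up.

```php
// Minimal TTL-based Redis lock sketch (phpredis client, not MediaWiki's LockManager).
$redis = new Redis();
$redis->connect( 'redis.example.internal', 6379 ); // made-up host

$key   = "wikibase:dispatch:lock:$clientWiki";
$token = uniqid( '', true ); // identifies this dispatcher instance

// NX: only set if not already held; EX: expire after 300 s. If a dispatcher dies,
// the lock disappears after the TTL instead of lingering forever, but the TTL must
// be longer than the longest legitimate dispatch pass, or we get false negatives.
if ( $redis->set( $key, $token, [ 'nx', 'ex' => 300 ] ) ) {
	try {
		dispatchChangesTo( $clientWiki ); // hypothetical helper
	} finally {
		// Release only if we still own the lock. Note: this check-then-delete is not
		// atomic; a real implementation would use a small Lua script for it.
		if ( $redis->get( $key ) === $token ) {
			$redis->del( $key );
		}
	}
}
```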

Two options that come to mind:

@Marostegui, @daniel Actually it is a problem, because masters have a limit of #CPUs or 32 active threads on the connection pool, which means half of those connections are reserved but doing nothing, so you are limiting the master's throughput to 50% of its real capacity.

I don't quite understand - why would an idle connection be hogging CPU threads? These connections are not active for 15 minutes, just open... Sure, they count towards the connection limit, but that's higher than just 32, no?

This comment was removed by daniel.

max_connections is 5000; the maximum of active threads enforced on the connection pool is 32. No idle connections should be kept open, and a typical connection should take less than 1 second, otherwise it risks getting killed by the watchdog looking for idle connections (and we are not going to make an exception for Wikidata). Extra connections make other connections take longer to connect, which increases the timeout error rate for regular connections.

Even if your connections don't count towards the limit, they hold an open transaction (the lock), which creates issues with metadata locking on things like ALTER TABLE and causes purge issues.

How often would locks be dropped? Using ScopedLock would handle exceptions in non-lock code. The shutdown handler usually catches SIGINT. I guess there are still fatal errors, though I'd hope that sort of thing would be rare. In that case, the redis lock manager used by our FileBackend instances could be reused.

There is also a flip side to automatically dropping the lock on connection loss: the loss can happen (possibly due to the wait_timeout or net_*_timeout options) while the connections to the DBs actually being updated stay alive. In that case, multiple threads could run against the same client wiki. Not sure if that actually happens, though; there would be reconnection events in Logstash if it did.
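
If the FileBackend locking infrastructure were reused, the call site could look roughly like the sketch below, assuming MediaWiki's LockManagerGroup/ScopedLock API; the lock manager name and the lock "path" are made up.

```php
// Rough sketch of reusing the FileBackend locking infrastructure. The lock manager
// name 'redisLockManager' is whatever $wgLockManagers happens to define.
$lockManager = LockManagerGroup::singleton()->get( 'redisLockManager' );

$status = Status::newGood();
$lock = ScopedLock::factory(
	$lockManager,
	[ "wikibase-dispatch:$clientWiki" ], // one "path" per client wiki
	LockManager::LOCK_EX,
	$status
);

if ( $lock === null ) {
	return; // held by another dispatcher, or the lock backend is unavailable; skip
}

dispatchChangesTo( $clientWiki ); // hypothetical helper

// The lock is released when $lock goes out of scope, including when an exception
// is thrown, which makes stale locks much less likely than with GET_LOCK.
ScopedLock::release( $lock );
```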

How often would locks be dropped? Using ScopedLock would handle exceptions in non-lock code. The shutdown handler usually catches SIGINT. I guess there are still fatal errors, though I'd hope that sort of thing would be rare. In that case, the redis lock manager used by our FileBackend instances could be reused.

Reusing the code from FileBackend sounds like it's worth looking into, thanks @aaron!

Another example of why long-running connections are a problem: I am depooling es1017 for important maintenance. I have depooled it, so I expect connections to finish within a few seconds, with the exception of wikiadmin's known long-running queries, but I just see 2 sleeping connections:

| 2188112525 | wikiuser        | 10.64.32.39:47222 | wikidatawiki       | Sleep   |     1138 |                                                                             | NULL             |    0.000 |
| 2188171306 | wikiuser        | 10.64.16.64:43004 | wikidatawiki       | Sleep   |      584 |                                                                             | NULL             |    0.000 |

Should I kill them? Should I not be able to depool the server, even if I was in an emergency? I do not mind long running connections, if I know I can kill them at any time.

I also do not want to make you work more than necessary. If you only need 1000 rows and there is no private data, I can give you access to a misc server shared with other resources; no need to have a dedicated server.

@jcrespo a misc server would be fine, no private data there.

We'll need to add a config variable to allow wb_changes_dispatch to live on a separate server, though. I should file a ticket for that later.
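
A rough idea of what such a setting could look like; the name 'dispatchingExternalCluster' and the cluster 'extension1' are placeholders, not existing configuration.

```php
// Hypothetical config knob: let wb_changes_dispatch live on a separate (misc) cluster.
$wgWBRepoSettings['dispatchingExternalCluster'] = 'extension1'; // placeholder name

// The dispatch coordinator would then take its connection from that external
// cluster instead of the wiki's own master:
$lbFactory = wfGetLBFactory();
$lb  = $lbFactory->getExternalLB( $wgWBRepoSettings['dispatchingExternalCluster'] );
$dbw = $lb->getConnection( DB_MASTER, [], 'wikidatawiki' );
```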

Anyway, @Ladsgroup will be exploring Redis-based locks; see T151993: Implement ChangeDispatchCoordinator based on RedisLockManager. With that, we'd no longer need the MySQL locks at all.

Another example of why long-running connections are a problem: I am depooling es1017 for important maintenance. I have depooled it, so I expect connections to finish within a few seconds, with the exception of wikiadmin's known long-running queries, but I just see 2 sleeping connections:

| 2188112525 | wikiuser        | 10.64.32.39:47222 | wikidatawiki       | Sleep   |     1138 |                                                                             | NULL             |    0.000 |
| 2188171306 | wikiuser        | 10.64.16.64:43004 | wikidatawiki       | Sleep   |      584 |                                                                             | NULL             |    0.000 |

Should I kill them? Should I not be able to depool the server, even if I was in an emergency? I do not mind long running connections, if I know I can kill them at any time.

Hm, these are both job runners, jobs (probably) shouldn't run for so long. I wonder what's causing this.

Hm, these are both job runners, jobs (probably) shouldn't run for so long. I wonder what's causing this.

Separate issue then, but consider this a heads-up.

daniel moved this task from Revisit to Project on the User-Daniel board.

Okay, it's up and running. What's next?

For the historical record: this chart shows the number of errors from the master going read-only due to lagged replicas:

[chart: errors of the master going read-only, over time; image.png, 91 KB]

Guess when Redis dispatching was deployed.
On ordinary days we had around 1,700 errors of the master going read-only. Now it's zero.

So do you think this had something to do with reports like T123867 and T164191? This is highly surprising: I was expecting low to no impact on master or replication performance, but zero is highly suspicious. Was this expected? Couldn't this be related to a monitoring bug due to T161908#3245451 instead?

Yes. Strangely, I removed all selectors in Logstash, added them back one by one, and made some changes.
Then it worked properly. This is the correct result:

[chart: corrected counts of read-only errors after fixing the Logstash selectors; image.png, 143 KB]

Since the deployment we still get read-only errors, but they have dropped to about 80% of the original volume, comparing the last 24 hours with the same period last week. That's not super bad. I will continue monitoring this over a full week to see whether it's having a real impact or not.

With the current state, we still have the same number of connections to the master DBs, but we don't use GET_LOCK etc. on them anymore. The queries are also unchanged (apart from the GET_LOCK etc.).

Because of this, it would surprise me if this change alone improved the situation (it was mostly a prerequisite for that).

With the current state, we still have the same number of connections to the master DBs, but we don't use GET_LOCK etc. on them anymore.

And that alone is a huge win for me. Sadly, we do not yet have metrics on connection time performance, where I would expect the biggest win.

With the deployment of the dispatching changes, in the last 24 hours we had around 1,800 cases of going read-only, while on the exact same day last week it was 3,482. This is completely tangible.

1978Gage2001 moved this task from Triage to In progress on the DBA board.

I don't see huge issues on the current master. I would mark this as resolved and at some point think about the epic parent (T108944) to improve the infrastructure.

That's another topic, I think we can call this resolved.

Marostegui assigned this task to Ladsgroup.