
Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections
Closed, ResolvedPublic

Description

I was looking at the logs of the maintenance scripts running on the terbium replacement machine and found that they are filled with output like that in P2293.

I traced the problem to the cronjob for the Wikibase dispatchChanges script:

php /srv/mediawiki-staging/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki --max-time 1600 --batch-size 275 --dispatch-interval 25 --lock-grace-interval 200

which is supposed to run every 3 minutes. In practice the script takes far longer than that to complete, so runs pile up and we end up with at least 8 concurrent executions of the script, all hitting the database heavily.

To make things worse, by the time a script instance approaches its timeout it typically has more than 500 open database connections.

I don't think this is acceptable by any measure, so I propose the following:

  • Make the cronjob run with a lock file, so that no more than one instance can run in parallel
  • Make the cronjob run a bit less often
  • The logic behind this script should be revisited, and perhaps these changes should be dispatched via our job queue instead of via a maintenance script

In general I don't think it's sensible to allow this many connections to be opened at once; I suspect this is a bug in the code.

Related Objects

Event Timeline

Joe created this task. Nov 9 2015, 12:14 PM
Joe raised the priority of this task to Unbreak Now!.
Joe updated the task description.
Joe added projects: Operations, DBA, Wikibase-Quality.
Joe added a subscriber: Joe.
Restricted Application added a project: Wikidata. Nov 9 2015, 12:14 PM
Restricted Application added a subscriber: Aklapper.

Change 252179 had a related patch set uploaded (by Giuseppe Lavagetto):
maintenance: run at most one wikidata dispatchChanges instance at a time

https://gerrit.wikimedia.org/r/252179

Probably related to T107072, too.

Change 252179 merged by Giuseppe Lavagetto:
maintenance: run at most three wikidata dispatchChanges instance at a time

https://gerrit.wikimedia.org/r/252179

To summarize the latest issues identified:

  • As this creates one connection per wiki, it ends up opening 1800 connections to s3 servers (actual measurement)
  • Most of these connections are idle, not doing actual work, which makes them unnecessary while still creating the same issues as a long-running query (running out of available connections, creating large temporary structures that are impossible or very difficult to defragment)
  • The cron jobs run as wikiadmin, which is deliberately not limited in execution time, so there is no protection against them overwhelming a database server
  • Connection exhaustion has been observed over the past months on several servers, especially on s3 (see the ticket above); this has probably contributed to it
  • Connections should only take seconds to do their work and then disconnect; otherwise, if the server is depooled or lags for other reasons, you end up working with stale data and blocking other maintenance workflows
daniel added a comment. Edited Nov 10 2015, 11:23 AM

This script should open one connection per db cluster, not one per wiki. If it opens one per wiki, that's a major bug. The change dispatcher uses the standard functionality provided by the LoadBalancer class, so the connections should be pooled. Maybe it fails to release the connection in some error case, leading to more error cases? I'll try to look into it, but I'm not on my regular work schedule this week.
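
For reference, the usage pattern I have in mind looks roughly like this; this is just an illustrative sketch of the core API, not the actual dispatcher code, and $clientWikiId is a placeholder for a client wiki's database name:

// Illustrative sketch only, not the Wikibase code.
$lb = wfGetLB( $clientWikiId );
$db = $lb->getConnection( DB_MASTER, array(), $clientWikiId );

// ... do a short piece of work against $db ...

// Hand the connection back to the LoadBalancer's foreign-connection pool,
// so a later getConnection() call for another wiki on the same cluster can
// reuse the same underlying handle (switching databases with USE).
$lb->reuseConnection( $db );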

As to the performance issues:

Regarding the PID lock: running one instance will not dispatch changes fast enough. The lag will build up.

the connections should be pooled

We only pool connections on the server side - there is no such thing as client-side connection pooling in MediaWiki right now.

aaron added a subscriber: aaron. Nov 10 2015, 11:38 AM

See "+channel:DBPerformance +message:"*connections made*"" at logstash.wikimedia.org. I see lots of concurrent connections from mw1152 scripts (if reuseConnection was called properly I'd assume there would be ~7 or so, since going from DB to DB on the same shard is just a matter of USE clauses).

I'll try to look into whether the problem is in core or not, though I need to spend time on some other tasks too.

We only pool connections on the server side - there is no such thing as client-side connection pooling in MediaWiki right now.

I'm confused - what is it that LoadBalancer::reuseConnection does, then? From a brief look at the code, it seems to manage leases of pooled connections in $this->mConns['foreignUsed'] and $this->mConns['foreignFree'], and to keep a count in "foreignPoolRefCount". Am I missing something?

We can go into details of terminology, but reusing connections != a pool. There are many differences, some of which require persistent connections. But the key factor (especially in this case) is that a pool of connections has an upper limit on concurrent connections, and we do not have one.

daniel added a comment. Edited Nov 10 2015, 5:55 PM

@jcrespo My point is that LoadBalancer::reuseConnection should make sure that I get an existing connection, not a new one, when I ask for a connection to a wiki on the same cluster, if there already is a connection to that cluster. If used correctly, this means that the maximum number of connections would be the number of clusters times two (one master and one slave per cluster), right?

It's important for me to understand whether consistent use of LB::getConnection and LB::reuseConnection *should* resolve the problem. Because that was my assumption when I wrote the script. So, was my assumption wrong, or is the implementation faulty in some way?

The relevant class is SqlChangeDispatchCoordinator. The critical methods are engageClientLock() and releaseClientLock(). Do you see any fundamental problems with these?

The idea is to use MySQL's global lock feature to avoid multiple dispatcher processes dispatching to the same wiki at the same time. The lock is currently engaged on the target wiki's database, but there is actually no compelling reason to do that. We could just as well place the locks for all client wikis on the wikidata master db. Then there should be no reason to connect to the client database at all (assuming the job queue is not using mysql).
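
To illustrate the idea (just a sketch, not the actual SqlChangeDispatchCoordinator code; the lock name and the 5-second timeout are made up):

// Sketch of taking a named MySQL lock for one client wiki.
$lb = wfGetLB( $clientWikiId );
$db = $lb->getConnection( DB_MASTER, array(), $clientWikiId ); // the target wiki's DB, as in the current scheme
$lockName = "Wikibase.$clientWikiId.dispatchChanges";

// DatabaseBase::lock() wraps MySQL's GET_LOCK() and returns false on timeout.
if ( $db->lock( $lockName, __METHOD__, 5 ) ) {
    // ... select and dispatch pending changes for $clientWikiId ...
    $db->unlock( $lockName, __METHOD__ ); // RELEASE_LOCK()
} else {
    // Another dispatcher process holds the lock; skip this wiki for now.
}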

Change 252267 had a related patch set uploaded (by Aaron Schulz):
Make getLaggedSlaveMode() use reuseConnection() as needed

https://gerrit.wikimedia.org/r/252267

Change 252267 merged by jenkins-bot:
Make getLaggedSlaveMode() use reuseConnection() as needed

https://gerrit.wikimedia.org/r/252267

The Wikidata team will check if this is resolved in Wikidata-Sprint-2015-11-17.

jcrespo added a comment. Edited Nov 18 2015, 10:30 AM

I am thinking of killing and banning these connections right now from s3 because they are breaking our database servers. The script is still creating 1000-2000 connections to db1035 (mostly blocking regular traffic for 900 wikis, as we only allow concurrent execution of 32 threads).

@daniel, could you please make sure to sync up with the Services team to see how T114443 could lay the groundwork for solving this problem better? @mobrovac, could you please act as liaison / point-person?

aude added a comment. Nov 18 2015, 10:41 AM

from irc:

05:33 < aude> jynus: would it be worth to backport and deploy https://gerrit.wikimedia.org/r/#/c/252267/ ?
05:34 < aude> would be really good to know that this helps or not (or how much)
05:34 < jynus> I do not think that will work at all
ori added a comment. Nov 18 2015, 10:49 AM
  • The cron jobs run as wikiadmin, which is deliberately not limited in execution time, so there is no protection against them overwhelming a database server

Ok, so it sounds like we have:

  • A very severe problem.
  • Agreement on how to mitigate it.

@Lydia_Pintscher, could you please make this a priority? Especially since the sprint goal of checking if this has been resolved has just been accomplished for you by Jaime.

daniel added a comment. Edited Nov 18 2015, 11:11 AM

@ori We do not have agreement on how to fix this. Dispatching via the job queue would certainly be good, but it would change nothing regarding the number of databases the dispatch code needs to talk to. As far as I can see, it may help to protect the server, but it would do so by breaking the dispatch process. Which is a nicer failure mode, but not a solution to the problem.

As far as I know, we are using core methods that are supposed to facilitate connection re-use. This was broken in core, and was fixed by Aaron. Now Jynus says that he doesn't believe that fix is going to help our situation. I'm curious about why he believes that.

It would also be good to get an answer to the question I asked in T118162#1796490. Basically, I see three possibilities: 1) I am misunderstanding what the relevant core methods in LoadBalancer do, 2) I'm using them wrong, or 3) they are broken. It would be very useful to get some feedback at least on the question whether SqlChangeDispatchCoordinator::engageClientLock() and SqlChangeDispatchCoordinator::releaseClientLock() are doing something obviously wrong.

One possible fix to the connection issue is to change the locking mechanism. That would not be very hard to code, but annoying to deploy - we'd need to stop all dispatch scripts to make sure no old code runs in parallel with new code. This would also mean side-stepping the problem without understanding the cause. And I would really like to understand the cause.

Now Jynus says that he doesn't believe that fix is going to help our situation. I'm curious about why he believes that.

Connection re-use does not work if you open 900 connections at the same time every 3 minutes. I have been saying that for a long time. If a DBA has to explain connection reuse vs pool of connections... :-/

We could just as well place the locks for all client wikis on the wikidata master db. Then there should be no reason to connect to the client database at all (assuming the job queue is not using mysql).

Please don't. You are moving the problem from one server to another (and a more critical server). A good start would be to close connections as soon as they are not needed anymore, instead of idling them. Whatever you do, please do not test it in production.

Change 253889 had a related patch set uploaded (by Aude):
Close database connections in SqlChangeDispatchCoordinator

https://gerrit.wikimedia.org/r/253889

Change 253889 abandoned by Aude:
Close database connections in SqlChangeDispatchCoordinator

https://gerrit.wikimedia.org/r/253889

Change 253898 had a related patch set uploaded (by Daniel Kinzler):
ChangeDispatcher should use locks on the local DB.

https://gerrit.wikimedia.org/r/253898

Regarding https://gerrit.wikimedia.org/r/253898: this should fix any issues with connections to client wiki dbs. However, it does so at the cost of changing our locking mechanism, which means it cannot be deployed easily.

The original issue however is that the connection re-use mechanism in core doesn't work, or didn't work, or was never intended to work in the way I understood it. I think it's important to resolve that issue, because if we don't, it will hit again, triggered by some other code.

As @ori mentioned, the Services team is working with Analytics on the EventBus project, whose aim is basically to replace the jobrunners and enable efficient event propagation and delivery inside the production cluster (but also targeted at third-party users). In that context, our team plans to build a change-propagation system (cf. T102476, T117933 and related tasks for more context) on top of it. And from the looks of it, that's exactly what you are trying to achieve with this script, isn't it?

Connection re-use does not work if you open 900 connections at the same time every 3 minutes. I have been saying that for a long time. If a DBA has to explain connection reuse vs pool of connections... :-/

There seems to be a fundamental misunderstanding here. The script is not designed to open 900 connections, nor should it be possible for this script to open 900 connections, if the code in core works as I understand it. I suspect that we are talking about "connection re-use" at a completely different level. I'm talking about re-use inside the same PHP invocation, not between requests. Do you know the relevant code in the LoadBalancer class? Can you tell me in what respect I am understanding or using it wrong? Maybe Aaron could shed some light on this.

We could just as well place the locks for all client wikis on the wikidata master db. Then there should be no reason to connect to the client database at all (assuming the job queue is not using mysql).

Please don't. You are moving the problem from one server to another (and a more critical server). A good start would be to close connections as soon as they are not needed anymore, instead of idling them. Whatever you do, please do not test it in production.

The script shouldn't really leave the connection idle; it's actually bound by SQL query speed on the repo's master DB. No queries are run against other databases. I do not see how we would "shift the problem" - the problem is too many connections. What I suggest is to use one connection for everything.

We were connecting to the client databases just to obtain a named lock on that database. The connection then stayed open (not so great), but (according to how I understand LoadBalancer), we would use the same connection for all wikis on the same cluster - so we may end up with 10 or so open connections per script. Not ideal, but not catastrophic. Unless the connection wasn't getting re-used, and we ended up creating a new one for every wiki. To me, that seems to be the problem - a problem caused by the bug in core that Aaron fixed.

We could start to forcibly close the connection after every query, even if it's just a second or less until we are going to fire the next one. But we'd have to discuss why we should do that in this case, and not in others. Explicitly closing connections is generally Not Done in MediaWiki - we keep the same connection(s) alive for the duration of the request. As far as I can see, DatabaseBase::close and LoadBalancer::closeConnection are never called in any regular maintenance script or during web requests.

If the "request" (script run) lasts 10 minutes, that can of course be problematic, especially if there are many such scripts, and the connection isn't actually used. But the connection to the repo database *is* used. And we don't have hundreds of script instances.

Closing the connection to slave DBs (instead of using the same connection for all wikis on the same master), as Aude suggests, would work around the issue of connection re-use not working properly. My patch also works around the problem, by using the connection to the repo wiki for everything (which requires a change to the locking logic). But with LoadBalancer working correctly, neither should be needed to avoid opening hundreds of connections. That's either a core bug (probably the one Aaron fixed), or a fundamental misunderstanding on my part about how LoadBalancer is supposed to behave.
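
For illustration, the two work-arounds sketched with core's API ($clientWikiId is again a placeholder; this is not the actual patch code):

// (a) Close the foreign connection as soon as it is no longer needed,
//     instead of returning it to the pool and leaving it idle.
$lb = wfGetLB( $clientWikiId );
$db = $lb->getConnection( DB_MASTER, array(), $clientWikiId );
// ... take the named lock / run the short query ...
$lb->closeConnection( $db ); // actually closes the handle

// (b) Roughly what my patch does: take all per-wiki locks on the repo's
//     master connection, so no client-wiki connection is needed at all.
$repoDb = wfGetLB()->getConnection( DB_MASTER );
$repoDb->lock( "Wikibase.$clientWikiId.dispatchChanges", __METHOD__ );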

There is a lot that can and should be improved about the dispatching process. But the massive problem we are seeing is not caused by the script working as designed. It's caused by the script misbehaving due to, as far as I can tell, a core bug. Which, I think, is fixed.

You are opening one connection per wiki. That is wrong. Locking per wiki serves no purpose.

@mobrovac Yes, and I'll be very happy to replace the entire change dispatching machinery with an actual event bus. I suspect it'll be a few months until we can do that, though. I'm in touch with Gabriel about this, though I haven't looked into the details. I did explain to him how our tracking/subscription system works, though, and I assume he's targeting this as a major use case for the event bus.

I'm not sure I can contribute much to the EventBus discussion right now, which seems to be mostly about implementation details.

| 206189024 | wikiadmin       | 10.64.32.13:49847  | ruwikibooks          | Sleep   |
| 206189066 | wikiadmin       | 10.64.32.13:49848  | sahwiki              | Sleep   |
| 206189112 | wikiadmin       | 10.64.32.13:49850  | fiwikinews           | Sleep   |
| 206189130 | wikiadmin       | 10.64.32.13:49852  | jawikinews           | Sleep   |
| 206189143 | wikiadmin       | 10.64.32.13:49853  | extwiki              | Sleep   |
| 206189185 | wikiadmin       | 10.64.32.13:49854  | eswikiquote          | Sleep   |
| 206189249 | wikiadmin       | 10.64.32.13:49855  | emlwiki              | Sleep   |
| 206189352 | wikiadmin       | 10.64.32.13:49856  | elwikiquote          | Sleep   |
| 206189416 | wikiadmin       | 10.64.32.13:49857  | eewiki               | Sleep   |
| 206201258 | wikiadmin       | 10.64.32.13:49862  | iswikisource         | Sleep   |
| 206201331 | wikiadmin       | 10.64.32.13:49865  | idwikisource         | Sleep   |
| 206201543 | wikiadmin       | 10.64.32.13:49868  | tewiki               | Sleep   |
| 206201727 | wikiadmin       | 10.64.32.13:49869  | hiwikibooks          | Sleep   |
| 206201784 | wikiadmin       | 10.64.32.13:49870  | ocwikibooks          | Sleep   |
| 206201872 | wikiadmin       | 10.64.32.13:49874  | ttwiki               | Sleep   |
| 206201886 | wikiadmin       | 10.64.32.13:49875  | svwikiquote          | Sleep   |
| 206201994 | wikiadmin       | 10.64.32.13:49877  | sdwiki               | Sleep   |
| 206202023 | wikiadmin       | 10.64.32.13:49878  | oswiki               | Sleep   |
| 206202074 | wikiadmin       | 10.64.32.13:49879  | orwiki               | Sleep   |
| 206202103 | wikiadmin       | 10.64.32.13:49880  | tnwiki               | Sleep   |
1060 rows in set (0.00 sec)

1060 connections is the low point; when we reach 3000 we are close to exhausting them.

Change 253934 had a related patch set uploaded (by Ori.livneh):
Make getLaggedSlaveMode() use reuseConnection() as needed

https://gerrit.wikimedia.org/r/253934

Change 253934 merged by Ori.livneh:
Make getLaggedSlaveMode() use reuseConnection() as needed

https://gerrit.wikimedia.org/r/253934

jcrespo closed this task as Resolved. Nov 18 2015, 6:04 PM
jcrespo claimed this task.

Resolving, and hopefully future issues can be resolved in further tickets. I am OK with the current state.

ori added a comment. Nov 18 2015, 8:13 PM

Wikidata folks (especially @daniel and @Lydia_Pintscher): Just a quick note to apologize for my conduct on this task. I think I was quick to escalate things, and I did it in a manner that probably only succeeded at raising the stress level of everyone involved. I'll treat it as a learning experience. :)

@ori: Don't worry. We have it fixed and will do some more to make it better. I understand it sucks when these things break so badly.