
Connections to all db servers for wikidata as wikiadmin from snapshot, terbium
Open, Needs Triage, Public

Description

Having long-running connections to all hosts is a huge issue for availability, and it prevents me from depooling servers for regular maintenance. In particular, the wikiadmin user is the most problematic, as it is not checked for resource usage (maximum duration of queries, number of connections, etc.)

These threads run /* FetchText::doGetText */ and then sit idle for seconds at a time, with all the issues that can create.

If dump hosts are too slow, please say so and we can check options (especially now that we have new servers), but creating random connections to any host is a problem. Let's see why this is happening and propose a proper solution:

If these create only light-weight queries, let's disconnect and reconnect after some number of seconds. If dump hosts are too slow, let's give them better resources. If there is a problem with the connection framework, let's fix it with the addition of a proxy/persistent connection manager.

Event Timeline

jcrespo created this task. Jun 20 2016, 10:25 AM
jcrespo renamed this task from Connections to all db servers for wikidata as wikiadmin from snapshot1001.eqiad.wmnet to Connections to all db servers for wikidata as wikiadmin from snapshot, terbium. Jun 20 2016, 2:07 PM

This is not only happening for dumps; terbium is also wrongly using main DBs (which are still in testing) for long-running queries, which causes long periods of connection issues: https://logstash.wikimedia.org/#dashboard/temp/AVVuJmxb_LTxu7wlh30V

root@terbium:~$ lsof | grep 43894
php5      32192          www-data    8u     IPv4         1620677566       0t0        TCP terbium.eqiad.wmnet:43894->db1092.eqiad.wmnet:mysql (ESTABLISHED)
root@terbium:~$ ps aux | grep 32192
root     16294  0.0  0.0  11864   916 pts/5    S+   14:06   0:00 grep 32192
www-data 32192  6.3  0.1 333200 53700 ?        S    13:57   0:34 php5 /srv/mediawiki-staging/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki --max-time 540 --batch-size 275 --dispatch-interval 25 --lock-grace-interval 200
root@terbium:~$ lsof | grep 52769
php5       9169          www-data    8u     IPv4         1620971979       0t0        TCP terbium.eqiad.wmnet:52769->db1092.eqiad.wmnet:mysql (ESTABLISHED)
root@terbium:~$ ps aux | grep 9169
www-data  9169  7.5  0.1 330380 50984 ?        S    14:03   0:16 php5 /srv/mediawiki-staging/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki --max-time 540 --batch-size 275 --dispatch-interval 25 --lock-grace-interval 200
root     22859  0.0  0.0  11868   916 pts/5    S+   14:06   0:00 grep 9169

I sure expect the dump db hosts to be used; if they are not, that's a bug somewhere in the MW code. This used to work properly, AFAIR.

In the meantime closing and reopening connections after some number of seconds or some small number of requests is fine. I can look into that.

@ArielGlenn: let's identify the reason why this is happening before changing things; we may implement proxying/etcd config before changing any logic. The important part here and now is that probably some MediaWiki class is not using the dump role, maybe related to Wikidata?

jcrespo added a subscriber: hoo. Jun 21 2016, 10:35 AM

With 'dump', I sometimes mean 'vslow', too (e.g. for terbium).

Clarification: is this *only* happening for wikidata, or do you notice this for other wiki dumps too?

jcrespo added a comment. Edited Jun 21 2016, 3:21 PM

Right now my only worries are about wikidata, because these create a large amount of connection errors and are very visible. I can check on other shards, but even if they show the same pattern, it would be very low priority (no infrastructure issues there).

The only errors I am getting are:

{
  "_index": "logstash-2016.06.21",
  "_type": "mediawiki",
  "_id": "AVVzitOBiAuaWDjhzkJI",
  "_score": null,
  "_source": {
    "message": "Connection error: Unknown error (10.64.48.26)",
    "@version": 1,
    "@timestamp": "2016-06-21T15:18:45.000Z",
    "type": "mediawiki",
    "host": "snapshot1001",
    "level": "ERROR",
    "tags": [
      "syslog",
      "es",
      "es"
    ],
    "channel": "wfLogDBError",
    "normalized_message": "Connection error: {last_error} ({db_server})",
    "wiki": "wikidatawiki",
    "mwversion": "1.28.0-wmf.4",
    "reqId": "54343ee7ae0333671a9f0db1",
    "method": "LoadBalancer::reportConnectionError",
    "last_error": "Unknown error",
    "db_server": "10.64.48.26"
  },
  "sort": [
    1466522325000
  ]
}

Not to other wikis.

10.64.48.26 is db1071, a regular-traffic db.

hoo added a comment. Jun 22 2016, 8:56 AM

This is not only happening for dumps; terbium is also wrongly using main DBs (which are still in testing) for long-running queries, which causes long periods of connection issues: https://logstash.wikimedia.org/#dashboard/temp/AVVuJmxb_LTxu7wlh30V

root@terbium:~$ lsof | grep 43894
php5      32192          www-data    8u     IPv4         1620677566       0t0        TCP terbium.eqiad.wmnet:43894->db1092.eqiad.wmnet:mysql (ESTABLISHED)
root@terbium:~$ ps aux | grep 32192
root     16294  0.0  0.0  11864   916 pts/5    S+   14:06   0:00 grep 32192
www-data 32192  6.3  0.1 333200 53700 ?        S    13:57   0:34 php5 /srv/mediawiki-staging/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki --max-time 540 --batch-size 275 --dispatch-interval 25 --lock-grace-interval 200
root@terbium:~$ lsof | grep 52769
php5       9169          www-data    8u     IPv4         1620971979       0t0        TCP terbium.eqiad.wmnet:52769->db1092.eqiad.wmnet:mysql (ESTABLISHED)
root@terbium:~$ ps aux | grep 9169
www-data  9169  7.5  0.1 330380 50984 ?        S    14:03   0:16 php5 /srv/mediawiki-staging/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki --max-time 540 --batch-size 275 --dispatch-interval 25 --lock-grace-interval 200
root     22859  0.0  0.0  11868   916 pts/5    S+   14:06   0:00 grep 9169

These jobs need access to the main DBs as well as to the master (these jobs are used to actively dispatch incoming changes to various Wikipedias). Currently we let these scripts run for about 9 minutes… if needed, we can cut that down (they run on Zend, so the performance penalty of restarts is rather low).

These jobs need access to the main DBs

What is a "main db" and what is the difference with a 'vslow' slave? You are accessing a testing slave, that will be put down at any moment (or will block & kill terbium traffic).

hoo added a comment. Jun 22 2016, 9:01 AM

Regarding the dump scripts running on the snapshot hosts: if needed, I can try to make these use the "dump" servers, although that is going to significantly slow them down (given we use queries that the API and the UI also use). Reloading the database configuration is sadly not going to be easy, given we have these definitions in PHP.

As I said:

If these create only light-weight queries, let's disconnect and reconnect after some number of seconds. If dump hosts are too slow, let's give them better resources. If there is a problem with the connection framework, let's fix it with the addition of a proxy/persistent connection manager.

By using the wrong hosts you are creating connection issues for end users.

hoo added a comment. Jun 22 2016, 9:43 AM

As I said:

If these create only light-weight queries, let's disconnect and reconnect after some number of seconds. If dump hosts are too slow, let's give them better resources. If there is a problem with the connection framework, let's fix it with the addition of a proxy/persistent connection manager.

By using the wrong hosts you are creating connection issues for end users.

Why do testing slaves have weight? If they have weight, you need to expect MediaWiki to connect to them. As said, we can (rather easily) reduce the run-time of the terbium scripts. Regarding the dumps: I'm in a conversation with Daniel about how to fix this… but it's not easy, given that it requires a significant change in MediaWiki (also architecture-wise).

daniel added a subscriber: daniel. Edited Jun 22 2016, 9:50 AM

My 2¢

@jcrespo wrote

If these create light-weight queries only, lets disconnect and connect after some amount of seconds.

Teaching LoadBalancer to throw away connections after a minute or so would be easy. But would that be sufficient? Reconnecting would use the old configuration that was valid when the process started. Reconfiguring on the fly is a lot more tricky, as @hoo pointed out.
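
Roughly, something like this inside the long-running loop (a sketch only, using 1.28-era globals; $batches, processBatch() and $lastReconnect are hypothetical stand-ins for whatever the actual maintenance script does):

$maxConnectionAge = 60; // seconds
$lastReconnect = time();

foreach ( $batches as $batch ) {
	if ( time() - $lastReconnect > $maxConnectionAge ) {
		// Close every open connection held by the load balancer; the next
		// wfGetDB() call reconnects, but with the old configuration that
		// was read at process start.
		wfGetLB()->closeAll();
		$lastReconnect = time();
	}
	$dbr = wfGetDB( DB_SLAVE, 'dump' );
	processBatch( $dbr, $batch );
}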

If dump hosts are too slow, let's give them better resources.

Yes, please. When dump scripts run for weeks instead of days, they tend to fail at some point. I agree that we shouldn't hit the web-facing databases for making dumps. But if that means we get no dumps at all, that's not good either...

If there is a problem with the connection framework, let'x fix it with the addition of a proxy/persistent connections manager.

My impression is that what we want here is the opposite of persistent connections. Making it possible to re-configure MediaWiki mid-air would certainly be nice; it's something I have been working on with respect to dependency injection and unit testing.

jcrespo added a comment. Edited Jun 22 2016, 9:57 AM

Let me show you the weight of these servers:

	's5' => array(
		'vslow' => array(
			'db1045' => 1,
		),
		'dump' => array(
			'db1045' => 1,
		),
		'api' => array(
			'db1070' => 1,
			'db1071' => 1,
		),
		'watchlist' => array(
			'db1026' => 1,
		),
		'recentchanges' => array(
			'db1026' => 1,
		),
		'recentchangeslinked' => array(
			'db1026' => 1,
		),
		'contributions' => array(
			'db1026' => 1,
		),
		'logpager' => array(
			'db1026' => 1,
		),
	),

Do you see db1071 anywhere? Not on dump, not on vslow (and I just added it to api; it was not there before, when the issue started happening). So yes, while the new servers are being tested I only expect short-lived connections (the ones created by non-api end-user requests) to go to them. This separation is important for performance and HA reasons. By not following MediaWiki standards you are threatening the reliability of the site, for wikidatawiki and for dewiki users (who are already not happy).

There is no need to change any MediaWiki architecture. terbium jobs must use the 'vslow' role. If you think one server is not enough, please say so and we can change the config to add more servers to vslow (which we certainly can do now that we have more servers), but do not violate MediaWiki's contract of roles by sending long-running connections to servers that are not ready for them.

The same thing applies to dumps: if more servers are needed, the configuration is changed; we do not hardcode the wrong ones.

These changes literally only need one parameter change: loadbalancer->get(SLAVE, 'dump')
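
For reference, a minimal sketch of what that parameter change looks like in code (illustrative only, using the generic MediaWiki entry points rather than the actual Wikidata code):

// Before: the connection goes to whatever replica the default/generic
// group resolves to.
$dbr = wfGetDB( DB_SLAVE );

// After: the connection sticks to the replicas configured under the
// 'dump' group (db1045 in the s5 snippet above).
$dbr = wfGetDB( DB_SLAVE, 'dump' );

// Equivalent when a LoadBalancer instance is already at hand:
$dbr = wfGetLB()->getConnection( DB_SLAVE, 'dump' );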

@jcrespo

These changes literally only need one parameter change: loadbalancer->get(SLAVE, 'dump')

That sounds straightforward enough.

However, there are potentially several dozen places where we call LoadBalancer::getConnection (we try not to hog the connection, but only get it from the LB when we need it - so we do that often). We'd have to somehow loop this parameter through to all the places where we get the connection. Or is it sufficient to specify the group on the first call? Will the LB return a connection that was created for a specific group, if later no group is specified in the call to getConnection?

@jcrespo: is this triggered only by dispatchChanges.php, or also by dumpJson.php or dumpRdf.php?

jcrespo added a comment. Edited Jun 22 2016, 10:46 AM

However, there are potentially several dozen places where we call LoadBalancer::getConnection (we try not to hog the connection, but only get it from the LB when we need it - so we do that often). We'd have to somehow loop this parameter through to all the places where we get the connection. Or is it sufficient to specify the group on the first call? Will the LB return a connection that was created for a specific group, if later no group is specified in the call to getConnection?

I cannot say off the top of my head; I will have to check. We may want to loop in the Performance team, as they gave good tips last time we had an issue with connection handling and thumbnails. I really do not think this is a general problem in all of Wikidata, just in a couple of scripts: some scripts stay connected for 9 minutes, mostly idle, and that is the main issue.

However, from what you say, you may want to refactor the connection handling (Wikidata's, not MediaWiki's) into something more abstract-factory-like, as the role can be 'dump' now and 'wikidata-special-dump' tomorrow (that is not actually happening, but it could).

We could even have a 'wikidata' role, and reserve it 100% for wikidata, non-mediawiki operations!

@jcrespo How much fire is this? Does it need fixing while we are at Wikimania, or is it OK if we handle it after that?

This is not an emergency (I handled that), but it should definitely be 'high'. I suspect it is what is causing queries such as dumps to fail or go slow. T138291 has a different root cause, but I assume it is related to this.

There have been dewiki bot (API) users complaining recently, but that was when this was combined with being low on servers a few weeks ago.

@jcrespo if you indeed do not see this on any other shards, it's probably not anywhere in the code I write/run, which is why I asked whether you've seen it elsewhere besides wikidata.

Also, I'm happy to look at the Wikidata-specific code and help make sure that the right db is used for these jobs.

Beyond that, I would like the ability to tell the LB to drop the current connection AND config and re-read them. This would be very handy in general.

I depooled db1109 two days ago, and I still cannot put it under maintenance.

I still see it on sections other than s8; see T143870.

If the code doesn't work, a special MediaWiki configuration should be set up so that dump hosts only know about dump DBs; I can see that working fine.

I would like the ability to tell the LB to drop the current connection AND config and re-read them.

That functionality was added when we had issues with commons transcoding leaving db connections open, AFAIK

hoo added a comment. Edited Apr 25 2018, 9:56 AM

@jcrespo I can only talk about the Wikidata side of things; we are working on this in two ways:

  • We are changing the script invocation so that the scripts don't run for several days anymore (the first part of that is already awaiting review) - T190513
  • We plan to implement T147169#3660704 shortly.

For db1109, I guess our scripts will take up to one more day before finishing and closing the connection (I checked the progress on snapshot1007). If this issue is very pressing on your end, kill the connections; our scripts will recover (by restarting from the beginning, which is awful, but T190513 will also address this).

@Ariel just told me that we should not restart the dumps this week, so that they don't run into the weekend, to give room for planned maintenance.

hoo added a subscriber: Ariel. Apr 25 2018, 9:59 AM
Ariel removed a subscriber: Ariel. May 25 2018, 9:18 PM

From another thread, from TimS:

It sounds like the snapshot hosts were in fact needing and using the host. It's not LoadBalancer's fault if some maintenance script calls wfGetDB() without specifying a query group. Throwing an exception is the correct thing to do if the MW configuration is incorrect, since it allows the maintenance script to terminate and be restarted with different configuration.

hoo added a comment. Oct 24 2018, 9:41 PM

From another thread, from TimS:

It sounds like the snapshot hosts were in fact needing and using the host. It's not LoadBalancer's fault if some maintenance script calls wfGetDB() without specifying a query group. Throwing an exception is the correct thing to do if the MW configuration is incorrect, since it allows the maintenance script to terminate and be restarted with different configuration.

Given we set $wgDBDefaultGroup* to dump for these scripts, LoadBalancer should only ever use the "default group" in case the dump host(s) are not available. Given this (probably) didn't happen here, we're almost certainly facing some other problem.

* $wgDBDefaultGroup configures the DB group to use by default (if no other group is explicitly given), so wfGetDB (and others) will not connect to the default hosts.
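
For illustration, a simplified sketch of that setup (not the exact wmf-config code; $wmgUseDumpDbGroup is a hypothetical switch):

// Force maintenance scripts on the dump/snapshot hosts onto the 'dump'
// query group unless a group is passed explicitly.
if ( $wmgUseDumpDbGroup ) { // hypothetical flag set only on the dump hosts
	$wgDBDefaultGroup = 'dump';
}

// With this set, a plain wfGetDB( DB_SLAVE ) resolves to the 'dump'
// group's replicas; the generic pool is only used as a fallback if no
// host in that group is available.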

hoo added a comment. Edited Oct 24 2018, 9:42 PM

One thing we could possibly do next: add a hook in getConnection (or somewhere close) that lets us kill the connection attempt (or the entire script) in case an unwanted replica is selected. This is not very nice, though… :S
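
Purely as a sketch (this hook does not exist; its name and signature are made up here for illustration):

$wgHooks['LoadBalancerReplicaSelected'][] = function ( $serverName ) {
	// Hosts the dump scripts are expected to use (per the s5 config above).
	$allowed = [ 'db1045' ];
	if ( !in_array( $serverName, $allowed, true ) ) {
		// Fail loudly instead of opening a long-running connection to a
		// replica that is not meant for dump traffic.
		throw new Exception( "Refusing to connect to unexpected replica $serverName" );
	}
	return true;
};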

jcrespo added a comment. Edited May 30 2019, 2:31 PM

Because of this issue, or T143870, and/or long-running connections due to the MW connection handler, there was a connection issue at https://logstash.wikimedia.org/goto/286304e84262d2fe3335acd5eed135bb and there is likely to be another one soon. This may or may not create issues on wikidata dumps/exports, depending on whether retries are done with the latest configuration.

Because of hardware maintenance, we cannot wait long to depool the servers; only the few minutes it takes for all web requests to finish.

Retries of wikidata entity dumps rerun the particular batch via a MediaWiki maintenance script, which sets up its configuration from scratch.