
Some mw snapshot hosts are accessing main db servers
Closed, Resolved · Public

Description

From: T143862

For starters, snapshot1006.eqiad.wmnet is accessing a non-dump host; this is a bug, but I do not think it is the problem here.

The idea is that if the current dump db host is not enough for the dumps, we can set up more (I have said many times that you should ask for resources and I am happy to provide them; in fact, the dump hosts are scheduled for renewal with some of the most expensive servers we have bought). But no interactive query should go to the regular servers: it risks being blocked at any time by the firewall configuration/grant system, because those connections are not supposed to happen there. I have no issue reserving new servers for dumps, or even several of them if necessary, but it has to be documented and configured in MediaWiki.

Also, as a secondary thing, regarding resources, I was asked if new dump servers could be on codfw, as eqiad is more costly in rack space.

So please: 1) identify those cases, 2) ask for more resources if needed, and 3) do not hide such configuration.
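
One way to start on item 1, spotting such connections on a given replica, could be a loop like the following (just a sketch; db1077 is only an example target, taken from the discussion below):

# Resolve the snapshot hosts and look for their connections on a replica
# that is not in the dump group.
for snap in snapshot1005 snapshot1006 snapshot1007; do
  ip=$(dig +short "${snap}.eqiad.wmnet")
  echo "== ${snap} (${ip}) =="
  mysql -h db1077 -e "SHOW PROCESSLIST" | grep -F "${ip}" || echo "no connections"
done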

Event Timeline

I camped on one of the s3 slave dbs and watched for any queries from snapshot1005/6 but nada. Can you grab an example for me? Or give me a cheater's way to grab an example. I'd very much like to find that bug and squash it.

I use getSlaveServer --group=dump for anything that does not go through MediaWiki, and the dumps themselves (dumpBackup and its ilk) should all be using the dump group consistently. So any hint as to what to look at would be welcome.
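
For reference, that kind of lookup could be sketched as follows (the MWScript.php path is copied from the dumpBackup invocation quoted later in this task, and enwiki is just an example wiki; --group behaviour as described above):

# Ask MediaWiki which replica the 'dump' query group resolves to for a wiki,
# then have the non-MediaWiki tooling connect to that host directly.
php /srv/mediawiki/multiversion/MWScript.php getSlaveServer.php --wiki=enwiki --group=dump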

ArielGlenn triaged this task as Medium priority.

Could it be a reverse dns-caching bug? I will check the statistics to give you suspected IPs.
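
A quick round-trip check of the forward and reverse entries could look like this (a sketch, using the snapshot hostnames from this task):

# If reverse DNS (or a cached copy of it) were stale, this round trip would
# not map each snapshot host back to itself.
for snap in snapshot1005 snapshot1006 snapshot1007; do
  ip=$(dig +short "${snap}.eqiad.wmnet")
  echo "${snap} -> ${ip} -> $(dig +short -x "${ip}")"
done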

I've claimed this bug for now for anything that is not related to wikidata dumps out of cron. That means any queries not coming from snapshot1007. When I've got that tracked down, if there are still connections from snapshot1007, I can pass it off to other folks.

These are the stats since the last restart for db1077 (s3); I can check other hosts, too:

$ mysql -h db1077 sys -e "SELECT * FROM host_summary_by_statement_type \
WHERE host IN ('$(dig +short snapshot1005.eqiad.wmnet)', '$(dig +short snapshot1006.eqiad.wmnet)', '$(dig +short snapshot1007.eqiad.wmnet)')"
+--------------+-------------------+-----------+---------------+-------------+--------------+-----------+---------------+---------------+------------+
| host         | statement         | total     | total_latency | max_latency | lock_latency | rows_sent | rows_examined | rows_affected | full_scans |
+--------------+-------------------+-----------+---------------+-------------+--------------+-----------+---------------+---------------+------------+
| 10.64.0.9    | select            |  22155259 | 1.00 h        | 149.29 ms   | 14.88 m      |  20396125 |             0 |             0 |       2193 |
| 10.64.0.9    | set_option        |     10024 | 582.47 ms     | 1.52 ms     | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.0.9    | Init DB           |     10024 | 287.63 ms     | 706.32 us   | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.0.9    | show_slave_status |      3050 | 229.71 ms     | 829.48 us   | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.0.9    | Quit              |     10017 | 59.92 ms      | 873.21 us   | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.32.149 | select            | 163507417 | 7.91 h        | 462.07 ms   | 2.06 h       | 137461641 |             0 |             0 |      13345 |
| 10.64.32.149 | set_option        |     36580 | 2.07 s        | 2.02 ms     | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.32.149 | Init DB           |     36580 | 1.10 s        | 867.45 us   | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.32.149 | show_slave_status |      4890 | 345.70 ms     | 1.06 ms     | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.32.149 | Quit              |     36482 | 231.84 ms     | 493.04 us   | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.48.141 | select            | 167326936 | 8.15 h        | 603.08 ms   | 2.09 h       | 133783142 |             0 |             0 |      18903 |
| 10.64.48.141 | set_option        |     59589 | 3.31 s        | 1.26 ms     | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.48.141 | Init DB           |     59590 | 1.76 s        | 2.09 ms     | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.48.141 | Quit              |     59529 | 376.37 ms     | 879.36 us   | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.48.141 | show_slave_status |      4081 | 284.61 ms     | 1.01 ms     | 0 ps         |         0 |             0 |             0 |          0 |
+--------------+-------------------+-----------+---------------+-------------+--------------+-----------+---------------+---------------+------------+

Note that show_slave_status or a small number of connections may be normal for load-balancer reasons; let's just make sure there are no real queries going on.

db1077 was the host I was watching, so that's perfect.

I really want to know what those selects were. Can we get a full process entry on one of those from snapshot1005 or 1006?

I think so, it is just that it is not very immediate, because 149.29 ms queries are not caught by the current monitoring, and the new one is still WIP (its backend is active, but none of the frontend/aggregation is). I will update here once I know more.
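
In the meantime, one possible way to pull example statements straight from the replica would be something like this (a sketch; it assumes the events_statements_history consumer is enabled in performance_schema, and it uses the client IPs from the stats above):

# Most recent statements per thread, filtered to the snapshot client IPs,
# slowest first (TIMER_WAIT is in picoseconds).
mysql -h db1077 performance_schema -e "
  SELECT t.PROCESSLIST_HOST AS client, h.CURRENT_SCHEMA AS db,
         ROUND(h.TIMER_WAIT/1e12, 3) AS secs, h.SQL_TEXT
  FROM events_statements_history h
  JOIN threads t USING (THREAD_ID)
  WHERE t.PROCESSLIST_HOST IN ('10.64.0.9', '10.64.32.149', '10.64.48.141')
    AND h.SQL_TEXT IS NOT NULL
  ORDER BY h.TIMER_WAIT DESC LIMIT 20"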

On snapshot1007, dumpsgen 4973 /usr/bin/php5 /srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=zhwiki is accessing a main-traffic (non-dump) server, and it keeps doing so after the server has been depooled for over an hour. FYI, the server has to be depooled for maintenance, and no regard will be given to ongoing backups, as this bug was filed as a blocker more than a year ago; if it breaks backups, I think the warning was given quite some time in advance.

As far as I knew, this was a wikidata-only issue, though I have periodically checked this task to see if there is any more information.

The zhwiki job, so that we have the information here, was an abstracts dump; these are broken into smallish pieces and each one is done separately. When I went to look later for that process it was already gone, and the dump was continuing, presumably using the proper host, since it was a new process starting out with a fresh read of the db config files and a new db connection. We do this both so that no single query takes a ridiculous amount of time, and so that if a db server needs to go away for maintenance, it's not a problem.

If the queries are still taking too long, I can tune the number of revisions requested down some.

I don't really want to close the db connection, reread the config, and open a new connection for each query; that would be horribly inefficient. If the ability to reread the config is available as mentioned here T138208#4156583 then we could do that on error. In any case, you should not be at all concerned about pulling out boxes for maintenance.

The only case where we might have jobs that take a long time and are hard to split up is mysqldump of some large table; if the db server on the back end of that is pulled out for maintenance, the job will have to run again, but I don't really see a way to split that sort of job into smaller pieces nicely.

Krinkle renamed this task from Some dump hosts are accessing main traffic servers to Some mw snapshot hosts are accessing main db servers. Jun 4 2018, 10:31 PM

Just for the record, I observed this again on db1126.
Further, when a host gets depooled, the threads keep connecting there:

[10:31:58]  <+logmsgbot>	!log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126 - schema change', diff saved to https://phabricator.wikimedia.org/P9732 and previous config saved to /var/cache/conftool/dbconfig/20191125-093157-marostegui.json
# mysql.py -hdb1126 -e "show processlist" | grep -i wikiadmin
5791670571	wikiadmin	10.64.32.149:44306	wikidatawiki	Sleep	0		NULL	0.000
5791670585	wikiadmin	10.64.32.149:44308	wikidatawiki	Sleep	9		NULL	0.000
5791670699	wikiadmin	10.64.32.149:44346	wikidatawiki	Sleep	1		NULL	0.000
5791670707	wikiadmin	10.64.32.149:44352	wikidatawiki	Sleep	0		NULL	0.000
5791670723	wikiadmin	10.64.32.149:44362	wikidatawiki	Sleep	1		NULL	0.000
5791670756	wikiadmin	10.64.32.149:44372	wikidatawiki	Sleep	0		NULL	0.000
5791670765	wikiadmin	10.64.32.149:44382	wikidatawiki	Sleep	0		NULL	0.000
5791670776	wikiadmin	10.64.32.149:44388	wikidatawiki	Sleep	1		NULL	0.000
5791670789	wikiadmin	10.64.32.149:44394	wikidatawiki	Sleep	1		NULL	0.000
5791671167	wikiadmin	10.64.32.149:44436	wikidatawiki	Sleep	0		NULL	0.000
5791671541	wikiadmin	10.64.32.149:44460	wikidatawiki	Sleep	3		NULL	0.000
5791672964	wikiadmin	10.64.32.149:44534	wikidatawiki	Sleep	6		NULL	0.000
5791673160	wikiadmin	10.64.32.149:44558	wikidatawiki	Sleep	0		NULL	0.000
5791673287	wikiadmin	10.64.32.149:44602	wikidatawiki	Sleep	0		NULL	0.000
5791673340	wikiadmin	10.64.32.149:44616	wikidatawiki	Sleep	0		NULL	0.000
5791673416	wikiadmin	10.64.32.149:44642	wikidatawiki	Sleep	0		NULL	0.000
5791673601	wikiadmin	10.64.32.149:44686	wikidatawiki	Sleep	3		NULL	0.000
5791674979	wikiadmin	10.64.32.149:44756	wikidatawiki	Sleep	0		NULL	0.000
5791676211	wikiadmin	10.64.32.149:44788	wikidatawiki	Sleep	2		NULL	0.000
# host 10.64.32.149
149.32.64.10.in-addr.arpa domain name pointer snapshot1006.eqiad.wmnet.

Snapshot1006 was running regular wikidata dumps. We don't flush LB config after every query for obvious reasons, though page content fetchers should fail and restart with a new config if the connected db server becomes unavailable. Unavailable in this case means it doesn't serve queries and connections are terminated. I don't think there's a facility in MediaWiki to fail a connection if the host has been depooled in etcd/LB config but continues to respond to queries.

There will be 27 of those processes all keeping their existing connections until they complete a batch (could be as long as 12 hours) or something bad happens at the db server end.

Thanks for the explanation. However, db1126 isn't configured as vslow,dumps for s8 (wikidata), which is the whole point of this ticket and the reason I was reporting it: sort of an ack that this keeps happening and that I actively ran into it :-)

Sure thing! I'm just not sure of the way forward right now.

Just to clarify, what is/was it pooled as, and when did it change from vslow?

It was never a vslow host.

Ok, that's new and very undesirable behavior. In the past it was always the case that for xml/sql dumps, connections might remain open to vslow hosts that were then depooled or recategorized, but never to non-vslow hosts... unless, I suppose, no vslow hosts were available. I'll need to see if something has broken or changed in the maintenance scripts or in setting up db connections.

I actually checked, and the last time db1087 (the current vslow host) was depooled was about 4 months ago: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/520844/
Now that we don't use gerrit anymore for depooling hosts, it is harder to track whether a host got depooled via dbctl, but I also checked SAL and did some phabricator searches and couldn't find any recent depooling of db1087.
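
For the record, the current pooling and group assignment of a host can be dumped with dbctl, e.g. (a sketch; the output shape follows the dbctl example later in this task, and group entries such as vslow/dump are expected to appear under the host's section when configured):

# Show how db1087 is currently pooled in its section.
dbctl instance db1087 get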

Thanks for checking. That makes me even more suspicious that something about lookups for db replicas in a group has changed, or something in how connections are established. I'll be checking into it, but please do let me know in the meantime if you see any more of these from any of snapshot1005,6,7,9. (Not 8, that's where 'misc' jobs run and that's another can of worms.)

Will do - thanks for looking into it.

Do we know what queries these clients were running? A first pass through the relevant MediaWiki code doesn't show any good suspects.

No, I wasn't able to see any query running, just the threads connected, as shown at T143870#5688483.
Tendril also doesn't report any slow query during that timeframe.

Change 553703 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] make BackupDumper use vslow/dumps dbservers even for short queries

https://gerrit.wikimedia.org/r/553703

Change 553703 merged by jenkins-bot:
[mediawiki/core@master] make BackupDumper use vslow/dumps dbservers even for short queries

https://gerrit.wikimedia.org/r/553703

Let's see what happens once this is in production.

jcrespo raised the priority of this task from Medium to Unbreak Now!. Dec 4 2019, 5:15 PM

Mentioned in SAL (#wikimedia-operations) [2019-12-04T17:15:13Z] <jynus> killing dump threads on db1118 T143870

jcrespo lowered the priority of this task from Unbreak Now! to High. Dec 4 2019, 5:22 PM

I am seeing db1118 serving dumps. This is a high-throughput main-traffic enwiki replica. I thought at first this was the cause of an outage, but it was unrelated. However, it seems quite worrying.

For what it's worth, confirming that db1118 is not configured as vslow/dumps:

root@cumin1001:~#  dbctl instance db1118 get
{
    "db1118": {
        "host_ip": "10.64.16.12",
        "note": "",
        "port": 3306,
        "sections": {
            "s1": {
                "percentage": 100,
                "pooled": true,
                "weight": 500
            }
        }
    },
    "tags": "datacenter=eqiad"
}
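
A related sketch for listing which hosts actually carry the dump/vslow groups for s1, straight from the committed config (the key names are assumed to follow the MediaWiki dbconfig layout and may need adjusting):

# Group loads for section s1 in the eqiad config, as seen by MediaWiki.
dbctl config get | jq '.eqiad.groupLoadsBySection.s1'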

Do either of you have any of the queries run? And which host(s) were they from? If this is new and different from the issue patched above, I should look into it.

Can this be causing the spikes described in T229407: Spikes in DB traffic and rows/s reads when reading from new terms store as well?
Some graphs to look at:

(Three graph screenshots attached, showing the DB traffic and rows-read spikes.)

I couldn't attribute this to any other maintenance script or requests; I only know there's something that tries to render the HTML of every page in Wikidata incrementally. Querying ParserCache says it's incrementally going up; yesterday it was around Q50 million, today it was around Q54 million. Can something in the dumps be causing this?

We don't try to render anything. Not for wikidata entity dumps nor the xml dumps. So I don't have any good ideas about that.

The patch was included in wmf.10, which is now on all wikis. Unfortunately the only dumps running now are wikidata page-meta-history, a different phase than the one where queries were seen. But I'll ask folks to keep an eye out around the 20th-23rd and let me know if anything shows up.

The new run has started, so this is the time to see if there are queries coming from snapshot1005,6,7,9 to non-vslow db servers.

I have quickly checked and so far I have seen the following on s2

I see connections (sleeping) from snapshot1006 and 1007 to the following hosts:
db1105
db1076
db1074

I have not been able to tell if they are doing anything other than just sleeping, so I am not sure whether that is expected (just making a connection without actually doing anything). They were using the wikiadmin user.

Example:

47609741	wikiadmin	10.64.48.141:51810	eowiki	Sleep	0		NULL	0.000
db1076.eqiad.wmnet:3306
82225721	wikiadmin	10.64.48.141:35172	bgwiki	Sleep	0		NULL	0.000
82241978	wikiadmin	10.64.0.9:35306	bgwiktionary	Sleep	0		NULL	0.000
db1074.eqiad.wmnet:3306
2365240797	wikiadmin	10.64.48.141:44984	cswiki	Sleep	0		NULL	0.000

The vslow for s2 has lots more connections:

110240481	wikiadmin	10.64.48.141:54252	cswiki	Sleep	44		NULL	0.000
110240482	wikiadmin	10.64.48.141:54254	cswiki	Sleep	12		NULL	0.000
110241630	wikiadmin	10.64.0.9:49548	bgwiktionary	Sleep	24		NULL	0.000
110241631	wikiadmin	10.64.0.9:49550	bgwiktionary	Sleep	24		NULL	0.000
110242068	wikiadmin	10.64.48.141:57470	eowiki	Sleep	12		NULL	0.000
110242070	wikiadmin	10.64.48.141:57472	eowiki	Sleep	12		NULL	0.000
110242225	wikiadmin	10.64.48.141:57934	bgwiki	Sleep	8		NULL	0.000
110242226	wikiadmin	10.64.48.141:57936	bgwiki	Sleep	7		NULL	0.000

Is that behaviour expected? I have not checked the rest of the section, because if this is expected there is no point in checking whether it is happening on the rest of the servers.
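
For completeness, a rough way to survey more replicas of the section would be a loop like this (a sketch; the host list is just the three from above and would need extending to the whole section):

# Look for connections from the snapshot client IPs on each replica.
snap_ips='10\.64\.0\.9|10\.64\.32\.149|10\.64\.48\.141'
for db in db1074 db1076 db1105; do
  echo "== ${db} =="
  mysql.py -h${db} -e "show processlist" | grep -E "${snap_ips}" || echo "none"
done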

Vslow should be getting all of them, so there may be some remaining place where short queries, or bare connections, still go elsewhere. I'll take another pass through the code.

Aklapper subscribed.

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)

LSobanski subscribed.

Removing the DBA tag and subscribing myself instead. Once there are specific actions for DBA please re-add us and/or @mention me.

Is this still an issue, after the changes to LBFactory and so on that remove depooled dbs from the list available for connections? @Ladsgroup you would know best I think.

I haven't seen this issue in a long time, but it could possibly still be happening, just less frequently.

Slightly unrelated: the Wikidata dumps (not the XML ones) don't refresh the config, and that's a way bigger problem.

Ladsgroup claimed this task.

I'm calling it done; let's reopen if it happens again. Hopefully these patches fixed the problem for good.