
Some mw snapshot hosts are accessing main db servers
Open, Normal, Public

Description

From: T143862

For starters, snapshot1006.eqiad.wmnet is accessing a non-dump host; this is a bug, but I do not think it is the problem here.

The idea is that if the current dump db host is not enough to dump values, we can set up more (I have said many times that you should ask for resources, and I am happy to provide them; in fact, the dump hosts are scheduled for renewal with some of the most expensive servers we have bought). But no dump query (anything that is not a regular interactive query) should go to the regular servers: such traffic risks being blocked at any time by the firewall configuration/grant system, because those servers are not supposed to receive connections from the snapshot hosts. I do not have any issue reserving new servers for dumps, or even several of them if necessary, but it has to be documented and configured in MediaWiki.

Also, as a secondary point regarding resources, I was asked whether new dump servers could go in codfw, as eqiad is more costly in rack space.

So please 1) identify those cases 2) ask for more resources if needed 3) do not hide such configuration.

Event Timeline

jcrespo created this task. Aug 25 2016, 8:59 AM

I camped on one of the s3 slave dbs and watched for any queries from snapshot1005/6, but nada. Can you grab an example for me, or give me a cheater's way to grab one? I'd very much like to find that bug and squash it.

I use getSlaveServer --group=dump for anything that does not go through MediaWiki, and the dumps themselves (dumpBackup and its ilk) should all be using the dump group consistently. So any hint as to what to look at would be welcome.
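For reference, a minimal sketch of that pattern, assuming the multiversion wrapper path quoted later in this task; the wiki name, table, and output file are placeholders:

$ # getSlaveServer.php prints the hostname of a replica picked from the
$ # given query group; non-MediaWiki tooling then connects to it directly.
$ DB_HOST=$(php /srv/mediawiki/multiversion/MWScript.php getSlaveServer.php \
    --wiki=zhwiki --group=dump)
$ mysqldump -h "$DB_HOST" zhwiki page > page.sql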

ArielGlenn triaged this task as Normal priority.

Could it be a reverse DNS caching bug? I will check the statistics to give you suspected IPs.

I've claimed this bug for now for anything that is not related to wikidata dumps out of cron. That means any queries not coming from snapshot1007. When I've got that tracked down, if there are still connections from snapshot1007, I can pass it off to other folks.

These are the stats since the last restart for db1077 (s3); I can check other hosts, too:

$ mysql -h db1077 sys -e "SELECT * FROM host_summary_by_statement_type \
WHERE host IN ('$(dig +short snapshot1005.eqiad.wmnet)', '$(dig +short snapshot1006.eqiad.wmnet)', '$(dig +short snapshot1007.eqiad.wmnet)')"
+--------------+-------------------+-----------+---------------+-------------+--------------+-----------+---------------+---------------+------------+
| host         | statement         | total     | total_latency | max_latency | lock_latency | rows_sent | rows_examined | rows_affected | full_scans |
+--------------+-------------------+-----------+---------------+-------------+--------------+-----------+---------------+---------------+------------+
| 10.64.0.9    | select            |  22155259 | 1.00 h        | 149.29 ms   | 14.88 m      |  20396125 |             0 |             0 |       2193 |
| 10.64.0.9    | set_option        |     10024 | 582.47 ms     | 1.52 ms     | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.0.9    | Init DB           |     10024 | 287.63 ms     | 706.32 us   | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.0.9    | show_slave_status |      3050 | 229.71 ms     | 829.48 us   | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.0.9    | Quit              |     10017 | 59.92 ms      | 873.21 us   | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.32.149 | select            | 163507417 | 7.91 h        | 462.07 ms   | 2.06 h       | 137461641 |             0 |             0 |      13345 |
| 10.64.32.149 | set_option        |     36580 | 2.07 s        | 2.02 ms     | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.32.149 | Init DB           |     36580 | 1.10 s        | 867.45 us   | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.32.149 | show_slave_status |      4890 | 345.70 ms     | 1.06 ms     | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.32.149 | Quit              |     36482 | 231.84 ms     | 493.04 us   | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.48.141 | select            | 167326936 | 8.15 h        | 603.08 ms   | 2.09 h       | 133783142 |             0 |             0 |      18903 |
| 10.64.48.141 | set_option        |     59589 | 3.31 s        | 1.26 ms     | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.48.141 | Init DB           |     59590 | 1.76 s        | 2.09 ms     | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.48.141 | Quit              |     59529 | 376.37 ms     | 879.36 us   | 0 ps         |         0 |             0 |             0 |          0 |
| 10.64.48.141 | show_slave_status |      4081 | 284.61 ms     | 1.01 ms     | 0 ps         |         0 |             0 |             0 |          0 |
+--------------+-------------------+-----------+---------------+-------------+--------------+-----------+---------------+---------------+------------+

Note that show_slave_status or a small number of connections may be normal for load balancer reasons; let's just make sure there are no real queries going on.
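To map the IPs in that table back to hosts (and to spot a stale reverse-DNS cache, per the earlier question), a quick reverse lookup closes the loop; a minimal sketch:

$ for ip in 10.64.0.9 10.64.32.149 10.64.48.141; do
    echo -n "$ip -> "; dig +short -x "$ip"
  done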

ArielGlenn added a comment. Edited Aug 25 2016, 12:48 PM

db1077 was the host I was watching, so that's perfect.

I really want to know what those selects were. Can we get a full process entry on one of those from snapshot1005 or 1006?

I think so; it is just not very immediate, because 149.29 ms queries are not caught by the current monitoring, and the new monitoring is still a WIP (its backend is active, but none of the frontend/aggregation is). I will update here once I know more.
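One way to catch such short-lived statements after the fact, assuming performance_schema is enabled with the statement-history consumers on (a sketch, not the monitoring referred to above; note PROCESSLIST_HOST may show a resolved name rather than an IP depending on skip_name_resolve):

$ # TIMER_WAIT is in picoseconds, hence the division to get seconds
$ mysql -h db1077 performance_schema -e "
    SELECT t.PROCESSLIST_HOST, s.SQL_TEXT, s.TIMER_WAIT/1e12 AS seconds
    FROM events_statements_history s
    JOIN threads t ON s.THREAD_ID = t.THREAD_ID
    WHERE t.PROCESSLIST_HOST LIKE '10.64.0.9%'
    ORDER BY s.EVENT_ID DESC LIMIT 20"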

jcrespo moved this task from Triage to Backlog on the DBA board. Aug 31 2016, 12:45 PM

dumpsgen 4973 /usr/bin/php5 /srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=zhwiki on snapshot1007 is accessing a main traffic (non-dump) server, and keeps doing so after the server has been depooled for over an hour. FYI, the server has to be depooled for maintenance, and no allowance will be made for ongoing backups; this bug was filed as a blocker more than a year ago, so if maintenance breaks backups, I think it was warned quite some time in advance.

As far as I knew, this was a wikidata-only issue, though I have periodically checked this task to see if there is any more information.

The zhwiki job, so that we have the information here, was an abstracts dump; these are broken into smallish pieces and each piece is done separately. When I went to look later for that process, it was already gone, and the dump was continuing, presumably using the proper host, since each piece is a new process that starts out with a fresh read of the db config files and a new db connection. We do this both so that no single query takes a ridiculous amount of time, and so that if a db server needs to go away for maintenance, it's not a problem.
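A sketch of that chunking pattern; the page-ID ranges are illustrative, and --current stands in for the abstracts-plugin flags actually used. Each invocation being a separate process is what gives the fresh config read and connection:

$ for range in "1:50000" "50001:100000"; do
    start=${range%:*}; end=${range#*:}
    # each run is a separate process: fresh db config read, new connection
    /usr/bin/php5 /srv/mediawiki/multiversion/MWScript.php dumpBackup.php \
        --wiki=zhwiki --current --start=$start --end=$end
  done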

If the queries are still taking too long, I can tune the number of revisions requested down some.

I don't really want to close the db connection, reread the config, and open a new connection for each query; that would be horribly inefficient. If the ability to reread the config is available, as mentioned in T138208#4156583, then we could do that on error. In any case, you should not be at all concerned about pulling out boxes for maintenance.

The only case where we might have jobs that take a long time and are hard to split up is a mysqldump of some large table; if a db server on the back end of that is pulled out for maintenance, the job will have to run again, but I don't really see a way to split that sort of job into smaller pieces nicely.

Krinkle renamed this task from "Some dump hosts are accessing main traffic servers" to "Some mw snapshot hosts are accessing main db servers". Jun 4 2018, 10:31 PM