Page MenuHomePhabricator

Move labs pdns database off of m5-master
Closed, ResolvedPublic

Description

We have two pdns hosts for labs: labservices1001 (labs-ns2) and holmium (labs-ns3). Both use a common database on m5-master. Now that each is writing to the database, we may have race conditions. Better would be to have each server use a database running on the same server as the pdns service.

Roll-out steps:

Proposed steps:

  • Get mysql running on holmium and labservices1001
  • Create replicated pdns dbs on holmium and labservices, each replicating from the pdns db on m5-master
  • disable puppet on labnet1002 and turn off the nova API
  • disable puppet on labservices1001 and stop all designate services
  • Now we should only be reading from pdns dbs, not writing to them.
  • merge https://gerrit.wikimedia.org/r/#/c/286670/1 and apply on Holmium
  • dig tests @ labs-ns1.wikimedia.org and labs-recursor1.wikimedia.org
  • enable puppet on labservices1001, apply, immediately kill designate services again and disable puppet
  • dig tests @ labs-ns0.wikimedia.org and labs-recursor0.wikimedia.org
  • Powerdns db-reading and resolving work! Next, stop, Designate and writing
  • disable db replication, make pdns dbs read/write on both holmium and labservices1001
  • enable and run puppet on labservices1001 (thus turning designate on)
  • create a new record via Horizon, dig test @ labs-ns0.wikimedia.org and labs-recursor0.wikimedia.org and labs-ns1.wikimedia.org and labs-recursor1.wikimedia.org
  • create a new domain (via designate commandline), create a new record on that domain, dig test
  • enable nova API, create new instance and dig test
  • Declare victory
  • (much later) drop the pdns database on m5-master

Details

Related Gerrit Patches:
operations/puppet : productionRemoving references to pnds database on s5
operations/puppet : productionSplit centralized labs pdns database into two different local DBs.
operations/puppet : productionFix grants for labs pdns
operations/puppet : productionAdd grants to pdns mysql for localhost
operations/puppet : productionCorrecting typo on character_set_filesystem
operations/puppet : productionCorrect wrong collation from utf8 to unicode CI
operations/puppet : productionAdd mysql to labs dns servers

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Andrew added a comment.EditedMar 3 2016, 3:18 PM

(roll-out steps moved to description)

Andrew added a project: DBA.Mar 3 2016, 3:19 PM
Niharika removed a subscriber: Niharika.Mar 3 2016, 4:26 PM
jcrespo moved this task from Triage to Backlog on the DBA board.Mar 4 2016, 3:43 PM
Andrew added a comment.Mar 7 2016, 2:10 PM

I tidied up both servers a bit -- there should be plenty of space for db servers now.

how much memory can you dedicate to mysql? The on disk size is 130MB, so it should be lower than that.

Andrew added a comment.Mar 7 2016, 2:18 PM

The memory metrics for labservices1001 were very noisy until last week (I suspect because of the ridiculously verbose logs that were getting written), but free memory never dropped below 1G. Right now there's lots of memory, 7G free, which I am hoping is the new normal.

Andrew added a comment.Mar 7 2016, 2:18 PM

(Oh, and the picture is much rosier on Holmium since it isn't doing much.)

Andrew reassigned this task from Andrew to jcrespo.Mar 7 2016, 6:26 PM

So, 100MB ok for the buffer pool?

jcrespo moved this task from Backlog to In progress on the DBA board.Mar 17 2016, 12:15 PM
jcrespo moved this task from In progress to Next on the DBA board.Mar 23 2016, 2:13 PM
jcrespo moved this task from Next to In progress on the DBA board.Apr 1 2016, 9:39 AM

Change 280863 had a related patch set uploaded (by Jcrespo):
[WIP]Add mysql to labs dns servers

https://gerrit.wikimedia.org/r/280863

Please check that this is intended (should mysql be installed by labs::dns, or is it independent?)

https://puppet-compiler.wmflabs.org/2270/

Basically, please review https://gerrit.wikimedia.org/r/#/c/280863/ (it is WIP because I have to figure out how to handle grants - or what those are). But please review the pure labs parts, I can be 100% responsible for the mysql installation and configuration.

Andrew added a comment.Apr 1 2016, 1:53 PM

The patch and change looks fine to me.

Jaime, will you have time to work on this this week? I'm still hopeful that this will stamp out some of our existing DNS issues.

Change 280863 merged by Jcrespo:
Add mysql to labs dns servers

https://gerrit.wikimedia.org/r/280863

Change 285608 had a related patch set uploaded (by Jcrespo):
Correct wrong collation from utf8 to unicode CI

https://gerrit.wikimedia.org/r/285608

Change 285608 merged by Jcrespo:
Correct wrong collation from utf8 to unicode CI

https://gerrit.wikimedia.org/r/285608

Change 285609 had a related patch set uploaded (by Jcrespo):
Correcting typo on character_set_filesystem

https://gerrit.wikimedia.org/r/285609

Change 285609 merged by Jcrespo:
Correcting typo on character_set_filesystem

https://gerrit.wikimedia.org/r/285609

jcrespo added a comment.EditedApr 27 2016, 9:17 AM

@Andrew: @ holmium:

ps aux | grep mysql
mysql     1259  0.0  0.3 492676 50004 ?        Ssl  Mar17  33:59 /usr/sbin/mysqld

What do we do with the existing (unpuppetized?) already running mysql instance?

Server version: 5.5.47-0ubuntu0.14.04.1 (Ubuntu)

Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:root@localhost [(none)]> SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| pdns               |
| performance_schema |
+--------------------+
4 rows in set (0.02 sec)

Maybe this was the original database and it was never deleted?

Holmium and the others will page on stop, so make sure to downtime if you stop a service.

What do we do with the existing (unpuppetized?) already running mysql instance?

I'm pretty sure that that's unused leftovers from a previous implementation. Can you verify that it's not getting any traffic?

Actually, I see 4 connections from localhost (although using the public interface) from /usr/sbin/pdns_server-instance --daemon --guardian=yes

I've been looking for a while and can't figure out why pdns would be hitting localhost mysql. There is a localhost mysql config in /etc/powerdns/pdns.d but I don't know why those configs would be getting loaded.

In any case... the configs are the same between labservices1001 and holmium, and labservices1001 didn't have that local mysql running. So I stopped the service on holmium, and can't detect any ill effects. So, let's wipe it out.

Change 285907 had a related patch set uploaded (by Jcrespo):
Add grants to pdns mysql for localhost

https://gerrit.wikimedia.org/r/285907

Change 285907 merged by Jcrespo:
Add grants to pdns mysql for localhost

https://gerrit.wikimedia.org/r/285907

FYI, labtestservices2001:

Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns: Traceback (most recent call last):
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns:   File "/usr/local/bin/labs-ip-alias-dump.py", line 38, in <module>
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns:     for server in client.servers.list():
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns:   File "/usr/lib/python2.7/dist-packages/novaclient/v1_1/servers.py", line 587, in list
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns:     return self._list("/servers%s%s" % (detail, query_string), "servers")
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns:   File "/usr/lib/python2.7/dist-packages/novaclient/base.py", line 64, in _list
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns:     _resp, body = self.api.client.get(url)
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns:   File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 283, in get
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns:     return self._cs_request(url, 'GET', **kwargs)
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns:   File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 249, in _cs_request
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns:     self.authenticate()
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns:   File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 382, in authenticate
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns:     auth_url = self._v2_auth(auth_url)
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns:   File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 469, in _v2_auth
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns:     return self._authenticate(url, body)
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns:   File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 482, in _authenticate
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns:     **kwargs)
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns:   File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 242, in _time_request
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns:     resp, body = self.request(url, method, **kwargs)
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns:   File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 236, in request
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns:     raise exceptions.from_response(resp, body, url, method)
Notice: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns: novaclient.exceptions.Unauthorized: The request you have made requires authentication. (HTTP 401)
Error: /usr/local/bin/labs-ip-alias-dump.py returned 1 instead of one of [0]
Error: /Stage[main]/Dnsrecursor::Labsaliaser/Exec[/usr/local/bin/labs-ip-alias-dump.py]/returns: change from notrun to 0 failed: /usr/local/bin/labs-ip-alias-dump.py returned 1 instead of one of [0]
Notice: /Stage[main]/Dnsrecursor/Service[pdns-recursor]: Dependency Exec[/usr/local/bin/labs-ip-alias-dump.py] has failures: true
Warning: /Stage[main]/Dnsrecursor/Service[pdns-recursor]: Skipping because of failed dependencies

Doesn't affect me.

Mentioned in SAL [2016-04-28T09:23:57Z] <jynus> removing unused mysql-server-5.5 from holmium (keeping database just in case) T128737

We have now running 3 new mysql instances replication pdsn from db1009: https://tendril.wikimedia.org/host/view/holmium.wikimedia.org/3306

Waiting for you for switchover and db1009 permissions and data cleanup.

Thanks for pointing that error out @jcrespo, I have left a comment on https://gerrit.wikimedia.org/r/#/c/280768/

Is it possible apparmor is killing mysql on holmium?

Apr 29 00:20:09 holmium kernel: [ 8382.403156] type=1400 audit(1461889209.191:68): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/sbin/mysqld" pid=682 comm="apparmor_parser"
Apr 29 00:20:09 holmium kernel: [ 8382.409694] type=1400 audit(1461889209.195:69): apparmor="DENIED" operation="open" profile="/usr/sbin/mysqld" name="/etc/my.cnf" pid=684 comm="mysqld" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Apr 29 00:20:09 holmium kernel: [ 8382.409724] type=1400 audit(1461889209.195:70): apparmor="DENIED" operation="open" profile="/usr/sbin/mysqld" name="/etc/my.cnf" pid=684 comm="mysqld" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Apr 29 00:20:09 holmium kernel: [ 8382.419439] type=1400 audit(1461889209.207:71): apparmor="DENIED" operation="open" profile="/usr/sbin/mysqld" name="/etc/my.cnf" pid=694 comm="mysqld" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Apr 29 00:20:09 holmium kernel: [ 8382.419472] type=1400 audit(1461889209.207:72): apparmor="DENIED" operation="open" profile="/usr/sbin/mysqld" name="/etc/my.cnf" pid=694 comm="mysqld" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Apr 29 00:20:09 holmium kernel: [ 8382.424724] init: mysql main process (694) terminated with status 1
Apr 29 00:20:09 holmium kernel: [ 8382.424744] init: mysql main process ended, respawning
Apr 29 00:20:10 holmium kernel: [ 8383.438986] init: mysql post-start process (695) terminated with status 1
Apr 29 00:20:10 holmium kernel: [ 8383.473266] init: mysql main process (730) terminated with status 1
Apr 29 00:20:10 holmium kernel: [ 8383.473287] init: mysql respawning too fast, stopped

alerted this evening and I'm silencing it since it seems to not yet be in service here

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=holmium&service=mysqld+processes

@chasemp: Yes, that is caused by apparmor: holmium used to have the mysql-5.5 package from Ubuntu installed and it was removed with "dpkg --rc", which left /etc/apparmor.d/usr.sbin.mysqld around, which the WMF-packaged mysql uses an empty /etc/apparmor.d/usr.sbin.mysqld (that package is also used on jessie which has incomplete apparmor support). I'll purge the lingering mysql packages after doublechecking with Jaime later.

Andrew added a comment.May 3 2016, 2:02 PM

I propose that we do the final switchover for this from 14:00 to 15:00 UTC on Thursday, 2015-05-05.

Change 286670 had a related patch set uploaded (by Andrew Bogott):
Split centralized labs pdns database into two different local DBs.

https://gerrit.wikimedia.org/r/286670

Change 286671 had a related patch set uploaded (by Jcrespo):
[WIP] Fix grants for labs pdns

https://gerrit.wikimedia.org/r/286671

Andrew updated the task description. (Show Details)May 3 2016, 5:12 PM

Change 286671 merged by Jcrespo:
Fix grants for labs pdns

https://gerrit.wikimedia.org/r/286671

Mentioned in SAL [2016-05-05T13:55:45Z] <andrewbogott> downtimed labservices1001, holmium, labcontrol1001 for one hour, disabled puppet as per https://phabricator.wikimedia.org/T128737

Change 286670 merged by Andrew Bogott:
Split centralized labs pdns database into two different local DBs.

https://gerrit.wikimedia.org/r/286670

Andrew updated the task description. (Show Details)May 5 2016, 2:00 PM
Andrew updated the task description. (Show Details)May 5 2016, 2:10 PM
Andrew added a comment.May 5 2016, 3:10 PM

This required the emergency application of https://gerrit.wikimedia.org/r/#/c/287093/ (axfr on udp, who knew?) but otherwise went as expected.

Andrew updated the task description. (Show Details)May 5 2016, 4:18 PM
Andrew added a comment.May 5 2016, 8:42 PM

Remaining tasks:

  • wait a while
  • verify that there's no longer any traffic to the 'pdns' database on m5-master
  • clean up

At 6:30 today:

MariaDB  m5-master (none) > select * FROM information_schema.table_statistics WHERE table_schema='pdns';
+--------------+-----------------+------------+--------------+------------------------+
| TABLE_SCHEMA | TABLE_NAME      | ROWS_READ  | ROWS_CHANGED | ROWS_CHANGED_X_INDEXES |
+--------------+-----------------+------------+--------------+------------------------+
| pdns         | migrate_version |         72 |            6 |                      6 |
| pdns         | records         | 2757731037 |    981672758 |             3926708795 |
| pdns         | domains         |  137672435 |       584324 |                1168648 |
| pdns         | supermasters    |         26 |           10 |                     10 |
| pdns         | records_backup  |          0 |         1951 |                   9755 |
+--------------+-----------------+------------+--------------+------------------------+

MariaDB  m5-master (none) > select * FROM information_schema.user_statistics WHERE user='pdns';
+------+-------------------+------------------------+----------------+-------------------+-------------------+----------------+---------------+----------------------+------------+-----------+--------------+---------------+--------------+-----------------+-----------------+----------------+---------------------+-----------------------+--------------------+------------------+---------------+---------------+
| USER | TOTAL_CONNECTIONS | CONCURRENT_CONNECTIONS | CONNECTED_TIME | BUSY_TIME         | CPU_TIME          | BYTES_RECEIVED | BYTES_SENT    | BINLOG_BYTES_WRITTEN | ROWS_READ  | ROWS_SENT | ROWS_DELETED | ROWS_INSERTED | ROWS_UPDATED | SELECT_COMMANDS | UPDATE_COMMANDS | OTHER_COMMANDS | COMMIT_TRANSACTIONS | ROLLBACK_TRANSACTIONS | DENIED_CONNECTIONS | LOST_CONNECTIONS | ACCESS_DENIED | EMPTY_QUERIES |
+------+-------------------+------------------------+----------------+-------------------+-------------------+----------------+---------------+----------------------+------------+-----------+--------------+---------------+--------------+-----------------+-----------------+----------------+---------------------+-----------------------+--------------------+------------------+---------------+---------------+
| pdns |            741751 |                      0 |      203313477 | 957654.0238427534 | 925095.7790646543 |   524870285987 | 1308882186177 |         111242720999 | 2895200982 | 774396694 |    490829659 |     490832599 |      7518986 |      3561408985 |       491808711 |         807996 |          4545742401 |                 27827 |            4887762 |             1196 |       4887759 |    2923588959 |
+------+-------------------+------------------------+----------------+-------------------+-------------------+----------------+---------------+----------------------+------------+-----------+--------------+---------------+--------------+-----------------+-----------------+----------------+---------------------+-----------------------+--------------------+------------------+---------------+---------------+
1 row in set (0.00 sec)
Andrew added a comment.May 6 2016, 1:52 PM

@jcrespo -- I want to make sure I understand your last comment. Those stats are cumulative since the db was created, right? So you're establishing a baseline and then will compare those stats at a later date to make sure there's been no additional activity? Or do they indicate that there's lots of activity /right now/?

Those stats are cumulative since the db was created, right? So you're establishing a baseline and then will compare those stats at a later date to make sure there's been no additional activity?

Correct, we will do another measurement when you think it is appropriate, now, for example?

Andrew added a comment.May 6 2016, 3:19 PM

Now is good -- or tomorrow :)

Today at 8:30:

MariaDB  m5-master (none) > select * FROM information_schema.table_statistics WHERE table_schema='pdns';
+--------------+-----------------+------------+--------------+------------------------+
| TABLE_SCHEMA | TABLE_NAME      | ROWS_READ  | ROWS_CHANGED | ROWS_CHANGED_X_INDEXES |
+--------------+-----------------+------------+--------------+------------------------+
| pdns         | migrate_version |         72 |            6 |                      6 |
| pdns         | records         | 2757731037 |    981672758 |             3926708795 |
| pdns         | domains         |  137672435 |       584324 |                1168648 |
| pdns         | supermasters    |         26 |           10 |                     10 |
| pdns         | records_backup  |          0 |         1951 |                   9755 |
+--------------+-----------------+------------+--------------+------------------------+
5 rows in set (0.00 sec)

MariaDB  m5-master (none) > select * FROM information_schema.user_statistics WHERE user='pdns';
+------+-------------------+------------------------+----------------+-------------------+-------------------+----------------+---------------+----------------------+------------+-----------+--------------+---------------+--------------+-----------------+-----------------+----------------+---------------------+-----------------------+--------------------+------------------+---------------+---------------+
| USER | TOTAL_CONNECTIONS | CONCURRENT_CONNECTIONS | CONNECTED_TIME | BUSY_TIME         | CPU_TIME          | BYTES_RECEIVED | BYTES_SENT    | BINLOG_BYTES_WRITTEN | ROWS_READ  | ROWS_SENT | ROWS_DELETED | ROWS_INSERTED | ROWS_UPDATED | SELECT_COMMANDS | UPDATE_COMMANDS | OTHER_COMMANDS | COMMIT_TRANSACTIONS | ROLLBACK_TRANSACTIONS | DENIED_CONNECTIONS | LOST_CONNECTIONS | ACCESS_DENIED | EMPTY_QUERIES |
+------+-------------------+------------------------+----------------+-------------------+-------------------+----------------+---------------+----------------------+------------+-----------+--------------+---------------+--------------+-----------------+-----------------+----------------+---------------------+-----------------------+--------------------+------------------+---------------+---------------+
| pdns |            741751 |                      0 |      203313477 | 957654.0238427534 | 925095.7790646543 |   524870285987 | 1308882186177 |         111242720999 | 2895200982 | 774396694 |    490829659 |     490832599 |      7518986 |      3561408985 |       491808711 |         807996 |          4545742401 |                 27827 |            4887762 |             1196 |       4887759 |    2923588959 |
+------+-------------------+------------------------+----------------+-------------------+-------------------+----------------+---------------+----------------------+------------+-----------+--------------+---------------+--------------+-----------------+-----------------+----------------+---------------------+-----------------------+--------------------+------------------+---------------+---------------+
1 row in set (0.00 sec)

No connections or user reads and writes where done in the past 3 days (I also tested that the counters are really counting by reading rows afterwards).

labtestservices2001 is still replicating from m5-master. Next steps?

Andrew added a comment.EditedMay 9 2016, 5:09 PM

labtestservices uses a totally different openstack install (with different domains and hosts and so on) so it never really made sense to replicate from m5-. So, I would say...

  1. stop replication
  2. drop all records on labtestservices but leave tables as they are
  3. drop the pdns database on m5-master
  4. I'll take it from there :)

Change 287922 had a related patch set uploaded (by Jcrespo):
[WIP] Removing references to pnds database on s5

https://gerrit.wikimedia.org/r/287922

@Andrew there are some references to m5-master + pdns on puppet, please have a look at them up here.

Change 287922 merged by Jcrespo:
Removing references to pnds database on s5

https://gerrit.wikimedia.org/r/287922

jcrespo closed this task as Resolved.May 10 2016, 2:59 PM

Pdns database has been dropped and all account with grants to it