
Parametrize wmf-pt-kill so it can connect to different sockets
Closed, Resolved (Public)

Description

We run wmf-pt-kill on the labsdb hosts to kill long queries.
Right now the service runs as a daemon with the following options:

wmf-pt-+ 28800  0.0  0.0 145144 66488 ?        Ss   Aug16   0:17 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 3600 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /run/mysqld/mysqld.sock F=/dev/null

This was originally packaged at T203674.
The most recent package was built a few months ago at T248843 for Buster and the new pt-kill version.

The new clouddb hosts will run multi-instance, so the default socket (-S /run/mysqld/mysqld.sock) won't work: each MySQL instance will have its own socket location, as on our normal multi-instance hosts:

root@db1099:/run/mysqld# ls
mysqld.s1.sock	mysqld.s8.sock

We should change the wmf-pt-kill puppet code to accept a socket location, maybe based on the hiera files. For instance, this is the hiera file of a multi-instance host:

cat hieradata/hosts/db1099.yaml
# db1099
# Buffer pool sizes/instance enabled
profile::mariadb::core::multiinstance::num_instances: 2
profile::mariadb::core::multiinstance::s1: '185G'
profile::mariadb::core::multiinstance::s8: '185G'

Maybe wmf-pt-kill can use those s1 and s8 options and attach itself to the matching sockets, as they are named mysqld.sX.sock. The new clouddb hosts will probably need this sort of file to run multi-instance anyway.
If there is no hiera file and/or no multi-instance configuration in it, wmf-pt-kill should just assume the default socket (like we do now), as both systems will live in parallel for some time (the old single-instance and the new multi-instance system).
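
A minimal sketch of that selection logic, in Python rather than puppet, purely to illustrate the idea. It assumes that every key under profile::mariadb::core::multiinstance:: other than num_instances names a section; the function and constant names are hypothetical and this is not the puppet code that was eventually merged.

```python
from typing import List, Mapping

# Default socket used by the single-instance hosts (from the task description).
DEFAULT_SOCKET = "/run/mysqld/mysqld.sock"
# Hiera key prefix seen in hieradata/hosts/db1099.yaml.
PREFIX = "profile::mariadb::core::multiinstance::"


def sockets_from_hiera(hiera: Mapping[str, object]) -> List[str]:
    """Return the sockets pt-kill should attach to for one host.

    Assumption: every key under PREFIX except num_instances is a
    section name such as s1 or s8.
    """
    sections = sorted(
        key[len(PREFIX):]
        for key in hiera
        if key.startswith(PREFIX) and key != PREFIX + "num_instances"
    )
    if not sections:
        # No multi-instance configuration: keep today's behaviour.
        return [DEFAULT_SOCKET]
    return [f"/run/mysqld/mysqld.{section}.sock" for section in sections]


db1099 = {
    PREFIX + "num_instances": 2,
    PREFIX + "s1": "185G",
    PREFIX + "s8": "185G",
}
print(sockets_from_hiera(db1099))
print(sockets_from_hiera({}))
```

With the db1099 data this yields the two per-section sockets, and with an empty mapping it falls back to the default socket, which is the behaviour the old single-instance hosts need.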

Maybe @Kormat can take the lead on this and discuss some different approaches with cloud-services-team?

Event Timeline

While some package changes could be needed, one thing to note is that some of the work could be done with just puppet. When we created the prometheus-mysqld-exporter multiplexer for multi-instance db hosts, we just used the original package and created an additional 'prometheus-mysqld-exporter@' systemd unit and its configuration.

Checking all the changes in that deploy may be useful for this work (even if some things were corrected afterwards):

https://gerrit.wikimedia.org/r/c/operations/puppet/+/364396

@Bstorm keep in mind that while it is important to have this running (puppet keeps reporting failures as the daemon cannot be started), it shouldn't be a blocker for getting some users on the new hosts.
We can run pt-kill in a screen (like we used to do in the past). It is not ideal, but it shouldn't stop us from moving some beta-testers to these new hosts.

Jaime's approach might be easier to implement, but I would leave that up to @Bstorm and @Kormat.

I'll give @jcrespo's approach a go today. Looking at the package code, it would likely be pretty simple to use puppet to disable the unit that the package installs and create the systemd units necessary to inject the right info.
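
As a rough illustration of the "right info" each per-instance unit would need, the sketch below builds the wmf-pt-kill argument list for one section. The flags are copied from the process listing in the description; the function names are hypothetical, and the /run/mysqld/mysqld.sX.sock path pattern is an assumption based on the db1099 listing. It is a sketch only, not the content of the actual units.

```python
from typing import List


def instance_socket(section: str) -> str:
    """Socket path for one section, assuming the mysqld.sX.sock naming."""
    return f"/run/mysqld/mysqld.{section}.sock"


def pt_kill_args(socket: str, busy_time: int) -> List[str]:
    """Build the wmf-pt-kill command line for a single MariaDB instance.

    Only the socket and the busy time vary between instances; every other
    flag is taken verbatim from the currently running daemon.
    """
    return [
        "/usr/bin/wmf-pt-kill",
        "--daemon", "--print", "--kill",
        "--victims", "all",
        "--interval", "10",
        "--busy-time", str(busy_time),
        "--match-command", "Query|Execute",
        "--match-user", "^[spu][0-9]",
        "--log", "/var/log/wmf-pt-kill/wmf-pt-kill.log",
        "-S", socket,
        "F=/dev/null",
    ]


# One command line per section on a hypothetical two-instance host.
for section in ("s1", "s8"):
    print(" ".join(pt_kill_args(instance_socket(section), 3600)))
```

Keeping the socket and busy time as the only parameters mirrors the prometheus-mysqld-exporter@ approach mentioned above, where one template is instantiated once per section.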

Change 654890 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: fix up wmf-pt-kill service on multiinstance replicas

https://gerrit.wikimedia.org/r/654890

Change 654890 merged by Bstorm:
[operations/puppet@production] wikireplicas: fix up wmf-pt-kill service on multiinstance replicas

https://gerrit.wikimedia.org/r/654890

That patch was safe on the old servers (no change). On the multi-instance hosts I see the error: Access denied for user 'wmf-pt-kill'@'localhost'. That sounds like it is pretty close to working.

Yep, the user is not created. Creating the grant using the info in modules/role/templates/mariadb/grants/wiki-replicas.sql since that appears to be what is in the existing replicas.

I believe all replicas pass puppet now (after creating that grant). @Marostegui if you can check that the software is doing what it should be doing now, I think this can be closed.

Marostegui assigned this task to Bstorm.

Thanks a lot Brooke for working on this.
Looks like it is working fine, pt-kill is up everywhere and with the correct values depending on the role:

8 hosts will be targeted:
clouddb[1013-1020].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(1) clouddb1017.eqiad.wmnet
----- OUTPUT of 'ps aux | grep "w...l | grep -v grep' -----
wmf-pt-+  2311  0.0  0.0  36296 18424 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 10800 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s1.sock F=/dev/null
wmf-pt-+  2413  0.0  0.0  36396 18812 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 10800 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s3.sock F=/dev/null
===== NODE GROUP =====
(1) clouddb1013.eqiad.wmnet
----- OUTPUT of 'ps aux | grep "w...l | grep -v grep' -----
wmf-pt-+ 19884  0.0  0.0  36396 18428 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s1.sock F=/dev/null
wmf-pt-+ 22239  0.0  0.0  36304 18820 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s3.sock F=/dev/null
===== NODE GROUP =====
(1) clouddb1018.eqiad.wmnet
----- OUTPUT of 'ps aux | grep "w...l | grep -v grep' -----
wmf-pt-+ 22389  0.0  0.0  36308 18572 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 10800 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s2.sock F=/dev/null
wmf-pt-+ 22498  0.0  0.0  36400 18756 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 10800 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s7.sock F=/dev/null
===== NODE GROUP =====
(1) clouddb1016.eqiad.wmnet
----- OUTPUT of 'ps aux | grep "w...l | grep -v grep' -----
wmf-pt-+  5474  0.0  0.0  36400 19020 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s8.sock F=/dev/null
wmf-pt-+  5550  0.0  0.0  36308 18584 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s5.sock F=/dev/null
===== NODE GROUP =====
(1) clouddb1019.eqiad.wmnet
----- OUTPUT of 'ps aux | grep "w...l | grep -v grep' -----
wmf-pt-+ 17265  0.0  0.0  36308 18828 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 10800 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s4.sock F=/dev/null
wmf-pt-+ 17352  0.0  0.0  36296 18656 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 10800 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s6.sock F=/dev/null
===== NODE GROUP =====
(1) clouddb1014.eqiad.wmnet
----- OUTPUT of 'ps aux | grep "w...l | grep -v grep' -----
wmf-pt-+  4065  0.0  0.0  36304 18644 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s2.sock F=/dev/null
wmf-pt-+  4170  0.0  0.0  36408 18908 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s7.sock F=/dev/null
===== NODE GROUP =====
(1) clouddb1015.eqiad.wmnet
----- OUTPUT of 'ps aux | grep "w...l | grep -v grep' -----
wmf-pt-+  4822  0.0  0.0  36304 18908 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s4.sock F=/dev/null
wmf-pt-+  4910  0.0  0.0  36304 18784 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s6.sock F=/dev/null
===== NODE GROUP =====
(1) clouddb1020.eqiad.wmnet
----- OUTPUT of 'ps aux | grep "w...l | grep -v grep' -----
wmf-pt-+  8872  0.0  0.0  36300 18680 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 10800 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s8.sock F=/dev/null
wmf-pt-+  8959  0.0  0.0  36304 18952 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 10800 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s5.sock F=/dev/null
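
For anyone wanting to repeat this check, the listing can be reduced to (socket, busy-time) pairs with a few lines of Python. This is a hypothetical helper and not part of the patch; the two sample lines are copied from the clouddb1015 output above.

```python
import re
from typing import List, Tuple

BUSY_RE = re.compile(r"--busy-time (\d+)")
SOCKET_RE = re.compile(r"\s-S (\S+)")


def summarize(ps_lines: List[str]) -> List[Tuple[str, int]]:
    """Extract (socket, busy_time) from wmf-pt-kill lines of `ps aux` output."""
    result = []
    for line in ps_lines:
        busy = BUSY_RE.search(line)
        sock = SOCKET_RE.search(line)
        if busy and sock:
            # Lines without both options are ignored (not wmf-pt-kill).
            result.append((sock.group(1), int(busy.group(1))))
    return result


SAMPLE = [
    "wmf-pt-+  4822  0.0  0.0  36304 18908 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s4.sock F=/dev/null",
    "wmf-pt-+  4910  0.0  0.0  36304 18784 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s6.sock F=/dev/null",
]

for socket, busy_time in summarize(SAMPLE):
    print(socket, busy_time)
```

Each host should report exactly two sockets, both with the same busy time (300 or 10800 in the output above).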

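I tested with my own user on clouddb1015:3316 and it seems to be working fine, so I assume it is also working fine everywhere. The daemons on that host run with --busy-time 300, so a 600-second sleep should be killed before it finishes: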

u15343@clouddb1015.eqiad.wmnet[frwiki_p]> select sleep(600);
ERROR 2013 (HY000): Lost connection to MySQL server during query

Thanks a lot for getting this sorted!

Note: I also tested the old hosts and a stop/start on them for each service.