Parametrize wmf-pt-kill so it can connect to different sockets
Closed, Resolved, Public

Assigned To

Authored By

	• Marostegui
	Aug 17 2020, 6:26 AM

Description

We run wmf-pt-kill on the labsdb hosts to kill long queries.
Right now the service runs as a daemon with the following options:

wmf-pt-+ 28800  0.0  0.0 145144 66488 ?        Ss   Aug16   0:17 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 3600 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /run/mysqld/mysqld.sock F=/dev/null

This was originally packaged at T203674.
The most recent package was built a few months ago at T248843 for Buster and the new pt-kill version.

The new clouddb hosts will run multi-instance, so the default socket (-S /run/mysqld/mysqld.sock) won't work: each MySQL instance will have its own socket location, as on our normal multi-instance hosts:

root@db1099:/run/mysqld# ls
mysqld.s1.sock	mysqld.s8.sock

We should change the wmf-pt-kill puppet code to accept a socket location, maybe based on the hiera files. For instance, this is a multi-instance hiera file:

cat hieradata/hosts/db1099.yaml
# db1099
# Buffer pool sizes/instance enabled
profile::mariadb::core::multiinstance::num_instances: 2
profile::mariadb::core::multiinstance::s1: '185G'
profile::mariadb::core::multiinstance::s8: '185G'

Maybe wmf-pt-kill can use those s1 and s8 options and attach itself to the matching sockets, as they are called mysqld.sX.sock. The new clouddb hosts will probably need this sort of file to run multi-instance anyway.
If there is no hiera file and/or no multi-instance configuration in it, wmf-pt-kill should just assume the default socket (like we do now), as both systems will live in parallel for some time (the old single-instance system and the new multi-instance one).
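
The fallback described above could be sketched roughly as follows. This is only a Python illustration of the idea, not the puppet change itself: the function name `sockets_for_host` and passing the hiera data as a plain dict are assumptions made for the example. The key prefix, the default socket and the `mysqld.sX.sock` naming come from the db1099 example above.

```python
DEFAULT_SOCKET = "/run/mysqld/mysqld.sock"
PREFIX = "profile::mariadb::core::multiinstance::"

# Hiera data for db1099 as shown above, loaded into a dict (assumed shape).
db1099 = {
    PREFIX + "num_instances": 2,
    PREFIX + "s1": "185G",
    PREFIX + "s8": "185G",
}


def sockets_for_host(hiera):
    """Return {instance_name: socket_path} for one host (hypothetical helper)."""
    sections = sorted(
        key[len(PREFIX):]
        for key in hiera
        if key.startswith(PREFIX) and key != PREFIX + "num_instances"
    )
    if not sections:
        # No multi-instance keys: keep the current single-instance behaviour.
        return {"default": DEFAULT_SOCKET}
    return {s: f"/run/mysqld/mysqld.{s}.sock" for s in sections}


if __name__ == "__main__":
    print(sockets_for_host(db1099))
    print(sockets_for_host({}))
```

With the db1099 data this yields one socket per section (s1 and s8); with an empty dict it falls back to the default socket, which is what the old single-instance hosts need.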

Maybe @Kormat can take the lead on this and discuss some different approaches with cloud-services-team?

Details

	Subject	Repo	Branch	Lines +/-
	wikireplicas: fix up wmf-pt-kill service on multiinstance replicas	operations/puppet	production	+33 -3


Related Objects

Status	Assigned	Task
Resolved	• Marostegui	T233766 labsdb1011 mariadb crashed
		Restricted Task
		Restricted Task
		Unknown Object (Task)
Resolved	RobH	T260441 (Need By: ASAP) rack/setup/install clouddb10[13-20]
Open	None	T215858 Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema
Resolved	fnegri	T280152 Mitigate breaking changes from the new Wiki Replicas architecture
Resolved	• Bstorm	T260389 Redesign and rebuild the wikireplicas service using a multi-instance architecture
Resolved	• Bstorm	T260511 Parametrize wmf-pt-kill so it can connect to different sockets
Resolved	• Bstorm	T274044 Fix systemd and possibly logrotate around the wmf-pt-kill service for multi-instance wikireplicas

Event Timeline

• Marostegui created this task. Aug 17 2020, 6:26 AM

Restricted Application edited projects, added cloud-services-team (Kanban); removed cloud-services-team. Aug 17 2020, 6:26 AM

Restricted Application added a subscriber: Aklapper.

• Marostegui triaged this task as Medium priority. Aug 17 2020, 6:26 AM

• Marostegui added a parent task: T215858: Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema.

• Marostegui removed a parent task: T215858: Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema.

• Marostegui added a project: DBA.

• Marostegui moved this task from Triage to Backlog on the DBA board.

• Kormat added a project: User-Kormat. Aug 18 2020, 10:43 AM

• Kormat moved this task from Unsorted 💣 to Back Burner 🏛️ on the User-Kormat board. Aug 24 2020, 9:09 AM

• Marostegui moved this task from Backlog to Ready on the DBA board. Oct 30 2020, 6:50 AM

• Marostegui added a project: Data-Services. Nov 17 2020, 3:40 PM

While some package changes could be needed, one thing to note is that some of the work could be done with just puppet. When we created the prometheus-mysqld-exporter multiplexer for multi-instance db hosts, we just used the original package and created an additional 'prometheus-mysqld-exporter@' systemd unit and its configuration.

Checking all the changes in that deploy may be useful for this work (even if there were some things that were corrected afterwards):

https://gerrit.wikimedia.org/r/c/operations/puppet/+/364396

• Marostegui mentioned this in T268312: Deploy labsdbuser and views to new clouddb hosts. Dec 1 2020, 5:51 AM

bd808 moved this task from Backlog to Wiki replicas on the Data-Services board. Dec 5 2020, 12:11 AM

@Bstorm keep in mind that while it is important to have this running (puppet keeps reporting failures as the daemon cannot be started), it shouldn't be a blocker for getting some users on the new hosts.
We can run pt-kill in a screen session (like we used to do in the past). It is not ideal, but it shouldn't stop us from moving some beta-testers to these new hosts.

Jaime's approach might be easier to implement, but I would leave that up to @Bstorm and @Kormat.

• Marostegui added a parent task: T260389: Redesign and rebuild the wikireplicas service using a multi-instance architecture. Dec 16 2020, 9:29 AM

I'll give @jcrespo's approach a go today. Looking at the package code, it would likely be pretty simple to use puppet to disable the unit that the package installs and create the systemd units necessary to inject the right info.
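
As an illustration of what "injecting the right info" per instance could look like, here is a rough Python sketch that builds one wmf-pt-kill command line per section. It is not the content of the puppet patch: the helper names `pt_kill_argv` and `commands_per_section` are made up for the example, and the flags, log path and default busy-time are simply copied from the process listing in the task description.

```python
import shlex

DEFAULT_SOCKET = "/run/mysqld/mysqld.sock"


def pt_kill_argv(socket, busy_time=3600):
    """Build the wmf-pt-kill argument list for one MariaDB socket."""
    return [
        "/usr/bin/wmf-pt-kill",
        "--daemon", "--print", "--kill",
        "--victims", "all",
        "--interval", "10",
        "--busy-time", str(busy_time),
        "--match-command", "Query|Execute",
        "--match-user", "^[spu][0-9]",
        "--log", "/var/log/wmf-pt-kill/wmf-pt-kill.log",
        "-S", socket,
        "F=/dev/null",
    ]


def commands_per_section(sections, busy_time=3600):
    """Return {name: shell-quoted command}; one entry per section."""
    if not sections:
        # Single-instance host: one daemon on the default socket.
        return {"default": shlex.join(pt_kill_argv(DEFAULT_SOCKET, busy_time))}
    return {
        s: shlex.join(pt_kill_argv(f"/run/mysqld/mysqld.{s}.sock", busy_time))
        for s in sections
    }


if __name__ == "__main__":
    for name, cmd in commands_per_section(["s1", "s8"]).items():
        print(name, cmd)
```

A templated systemd unit (in the style of 'prometheus-mysqld-exporter@') would presumably carry only the part that varies, i.e. the section name in the socket path, while everything else stays identical across instances.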

LSobanski subscribed. Jan 7 2021, 3:41 PM

Change 654890 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: fix up wmf-pt-kill service on multiinstance replicas

https://gerrit.wikimedia.org/r/654890

gerritbot added a project: Patch-For-Review. Jan 7 2021, 3:54 PM

Change 654890 merged by Bstorm:
[operations/puppet@production] wikireplicas: fix up wmf-pt-kill service on multiinstance replicas

https://gerrit.wikimedia.org/r/654890

That patch was safe on the old servers (no change). On the multi-instance hosts I see the error: Access denied for user 'wmf-pt-kill'@'localhost'. That sounds like it is pretty close to working.

RhinosF1 subscribed. Jan 11 2021, 10:24 PM

Yep, the user is not created. Creating the grant using the info in modules/role/templates/mariadb/grants/wiki-replicas.sql since that appears to be what is in the existing replicas.

I believe all replicas pass puppet now (after creating that grant). @Marostegui if you can check that the software is doing what it should be doing now, I think this can be closed.

Maintenance_bot removed a project: Patch-For-Review. Jan 11 2021, 11:10 PM

Thanks a lot Brooke for working on this.
Looks like it is working fine, pt-kill is up everywhere and with the correct values depending on the role:

8 hosts will be targeted:
clouddb[1013-1020].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(1) clouddb1017.eqiad.wmnet
----- OUTPUT of 'ps aux | grep "w...l | grep -v grep' -----
wmf-pt-+  2311  0.0  0.0  36296 18424 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 10800 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s1.sock F=/dev/null
wmf-pt-+  2413  0.0  0.0  36396 18812 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 10800 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s3.sock F=/dev/null
===== NODE GROUP =====
(1) clouddb1013.eqiad.wmnet
----- OUTPUT of 'ps aux | grep "w...l | grep -v grep' -----
wmf-pt-+ 19884  0.0  0.0  36396 18428 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s1.sock F=/dev/null
wmf-pt-+ 22239  0.0  0.0  36304 18820 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s3.sock F=/dev/null
===== NODE GROUP =====
(1) clouddb1018.eqiad.wmnet
----- OUTPUT of 'ps aux | grep "w...l | grep -v grep' -----
wmf-pt-+ 22389  0.0  0.0  36308 18572 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 10800 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s2.sock F=/dev/null
wmf-pt-+ 22498  0.0  0.0  36400 18756 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 10800 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s7.sock F=/dev/null
===== NODE GROUP =====
(1) clouddb1016.eqiad.wmnet
----- OUTPUT of 'ps aux | grep "w...l | grep -v grep' -----
wmf-pt-+  5474  0.0  0.0  36400 19020 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s8.sock F=/dev/null
wmf-pt-+  5550  0.0  0.0  36308 18584 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s5.sock F=/dev/null
===== NODE GROUP =====
(1) clouddb1019.eqiad.wmnet
----- OUTPUT of 'ps aux | grep "w...l | grep -v grep' -----
wmf-pt-+ 17265  0.0  0.0  36308 18828 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 10800 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s4.sock F=/dev/null
wmf-pt-+ 17352  0.0  0.0  36296 18656 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 10800 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s6.sock F=/dev/null
===== NODE GROUP =====
(1) clouddb1014.eqiad.wmnet
----- OUTPUT of 'ps aux | grep "w...l | grep -v grep' -----
wmf-pt-+  4065  0.0  0.0  36304 18644 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s2.sock F=/dev/null
wmf-pt-+  4170  0.0  0.0  36408 18908 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s7.sock F=/dev/null
===== NODE GROUP =====
(1) clouddb1015.eqiad.wmnet
----- OUTPUT of 'ps aux | grep "w...l | grep -v grep' -----
wmf-pt-+  4822  0.0  0.0  36304 18908 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s4.sock F=/dev/null
wmf-pt-+  4910  0.0  0.0  36304 18784 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s6.sock F=/dev/null
===== NODE GROUP =====
(1) clouddb1020.eqiad.wmnet
----- OUTPUT of 'ps aux | grep "w...l | grep -v grep' -----
wmf-pt-+  8872  0.0  0.0  36300 18680 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 10800 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s8.sock F=/dev/null
wmf-pt-+  8959  0.0  0.0  36304 18952 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 10800 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s5.sock F=/dev/null
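
For anyone who wants to repeat this check without reading the listing by eye, a small Python sketch like the one below could pull the socket and busy-time out of each `ps` line. The function name `summarize` is hypothetical, and the two sample lines are copied from the clouddb1013 output above.

```python
import re

SOCKET_RE = re.compile(r"-S (\S+)")
BUSY_RE = re.compile(r"--busy-time (\d+)")

# Two lines copied from the clouddb1013 output above.
sample = [
    "wmf-pt-+ 19884  0.0  0.0  36396 18428 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s1.sock F=/dev/null",
    "wmf-pt-+ 22239  0.0  0.0  36304 18820 ?        Ss   Jan11   0:02 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /var/run/mysqld/mysqld.s3.sock F=/dev/null",
]


def summarize(ps_lines):
    """Return a list of (socket, busy_time) for every wmf-pt-kill process line."""
    found = []
    for line in ps_lines:
        if "/usr/bin/wmf-pt-kill" not in line:
            continue  # skip headers and unrelated processes
        sock = SOCKET_RE.search(line)
        busy = BUSY_RE.search(line)
        if sock and busy:
            found.append((sock.group(1), int(busy.group(1))))
    return found


if __name__ == "__main__":
    for sock, busy in summarize(sample):
        print(sock, busy)
```

On each clouddb host the result should contain one entry per section socket, with the same busy-time on both entries.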

I tested with my own user on clouddb1015:3316 and it seems to be working fine, so I assume it is also working fine everywhere:

u15343@clouddb1015.eqiad.wmnet[frwiki_p]> select sleep(600);
ERROR 2013 (HY000): Lost connection to MySQL server during query
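
Why this query gets killed can be modelled with a simplified Python sketch of the three filters in the command line above (command, user, busy time). This is an assumed approximation for illustration only, not pt-kill's actual implementation; `ProcessRow` and `is_victim` are hypothetical names, and the 300 second threshold is the --busy-time shown for clouddb1015.

```python
import re
from dataclasses import dataclass


@dataclass
class ProcessRow:
    """A minimal stand-in for one row of the MariaDB process list."""
    user: str
    command: str
    time: int  # seconds the statement has been running


def is_victim(row, busy_time, match_command=r"Query|Execute",
              match_user=r"^[spu][0-9]"):
    """Return True if the row passes all three filters (simplified model)."""
    return (
        re.search(match_command, row.command) is not None
        and re.search(match_user, row.user) is not None
        and row.time > busy_time
    )


if __name__ == "__main__":
    # The test user above, running select sleep(600), once it passes 300 s.
    print(is_victim(ProcessRow("u15343", "Query", 310), busy_time=300))
    # The same query earlier on, still under the threshold.
    print(is_victim(ProcessRow("u15343", "Query", 290), busy_time=300))
```

Under this model the sleep(600) from a `u…` user becomes a victim once it has run for more than 300 seconds, while users whose names do not match ^[spu][0-9] are never matched.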

Thanks a lot for getting this sorted!

Note: I also tested the old hosts and a stop/start on them for each service.

• jcrespo awarded a token. Jan 12 2021, 9:11 AM

• Bstorm added a subtask: T274044: Fix systemd and possibly logrotate around the wmf-pt-kill service for multi-instance wikireplicas. Feb 7 2021, 12:20 AM

• Bstorm closed subtask T274044: Fix systemd and possibly logrotate around the wmf-pt-kill service for multi-instance wikireplicas as Resolved. Aug 20 2021, 4:18 PM
