
Maintain-views and maintain-meta_p scripts shouldn't run if mysql_upgrade is running
Closed, ResolvedPublic

Description

Jaime pointed out in T181925#3880506 that we shouldn't be running DDLs while mysql_upgrade is in progress.

Please don't run actions involving DDLs while mysql_upgrade is in progress. How do you know if mysql_upgrade is running? In most cases it will be run immediately after a reboot, for one or a few hours:

$ date; mysql -e "SHOW STATUS like 'uptime'"
Sat Jan  6 14:54:44 UTC 2018
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Uptime        | 79428 |
+---------------+-------+

The maintain-views and maintain-meta_p scripts do DDL operations and are run whenever we create or delete a wiki and need to add or remove it from the wiki-replicas. In order to avoid running them when a mysql_upgrade is in progress, it would be useful to have some programmatic way of detecting that an upgrade is in progress and have the scripts fail if that's the case.

@jcrespo Could you help us figure out how we can better detect this so we can script it in? Thank you!

Event Timeline

Is mysql-upgrade going to ensure it doesn't run while anything else is doing DDL?

Is mysql-upgrade going to ensure it doesn't run while anything else is doing DDL

Yes, because we are the only people, other than the maintain* scripts, that perform DDLs on production, and either we make sure those don't happen, or we stop replication to avoid replicating those. There are no longer user databases on the wikireplicas, so other users do not have permissions to run anything other than SELECTs.

Could you help us figure out how we can better detect this so we can script it in? Thank you!

I don't think there is an easy way to detect it without server access (those scripts, I think, are supposed to be run remotely). mysql_upgrade does very quick ANALYZE and other operations, so it wouldn't be easily detectable just with SHOW PROCESSLIST. I would advise two things:

  • Check the uptime; if it is lower than, say, 18000 seconds (5 hours after boot), abort and say "server just rebooted, please wait some hours/coordinate with DBAs before performing DDLs on the server". 5 hours should be more than enough to run those- it should normally take only 1-2 hours.
  • Reduce the innodb lock/lock wait timeout (I gave that advice some time ago) to detect metadata locks and catch those exceptions with a more reasonable explanation: "execution failed due to locking, please check ongoing operations on the server". A manual examination of operations could then indicate that a mysql_upgrade or another DDL from a schema change is ongoing; check the processlist and look for ongoing ALTER commands.
  • Note that if you do a DDL and there is a long-running select from a user, it will still get locked. We only perform schema changes/maintenance on depooled hosts because of that- we just had a conflict on the depooled host.

Could you help us figure out how we can better detect this so we can script it in? Thank you!

I don't think there is an easy way to detect it without server access (those scripts, I think, are supposed to be run remotely).

Both scripts are run directly on the hosts they are modifying the dbs for (i.e. labsdb1009, labsdb1010, labsdb1011 today).

mysql_upgrade does very quick ANALYZE and other operations, so it wouldn't be easily detectable just with SHOW PROCESSLIST. I would advise two things:

  • Check the uptime; if it is lower than, say, 18000 seconds (5 hours after boot), abort and say "server just rebooted, please wait some hours/coordinate with DBAs before performing DDLs on the server". 5 hours should be more than enough to run those- it should normally take only 1-2 hours.

This should be possible to script.
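
A minimal sketch of that uptime guard, assuming the scripts connect with pymysql and read credentials from a defaults file (both assumptions on my part; the 18000-second threshold is the one suggested above):

# Sketch only: abort early if the server was rebooted recently.
import os
import sys
import pymysql

UPTIME_THRESHOLD = 18000  # 5 hours, per the advice above

def abort_if_recently_rebooted(host):
    conn = pymysql.connect(
        host=host, read_default_file=os.path.expanduser("~/.my.cnf"))
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW GLOBAL STATUS LIKE 'Uptime'")
            _, uptime = cur.fetchone()
            if int(uptime) < UPTIME_THRESHOLD:
                sys.exit("Server just rebooted, please wait some hours or "
                         "coordinate with DBAs before performing DDLs on it.")
    finally:
        conn.close()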

  • Reduce the innodb lock/lock wait timeout (I gave that advice some time ago) to detect metadata locks and catch those exceptions with a more reasonable explanation: "execution failed due to locking, please check ongoing operations on the server".

This should also be possible to script.
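
A possible shape for that, again assuming pymysql; lock_wait_timeout and innodb_lock_wait_timeout are standard MySQL/MariaDB session variables and 1205 is the lock-wait-timeout error code, but the 10-second value is only illustrative:

# Sketch only: fail fast on metadata/InnoDB locks instead of queueing
# behind mysql_upgrade or another ongoing DDL.
import pymysql

def run_ddl(conn, statement):
    with conn.cursor() as cur:
        cur.execute("SET SESSION lock_wait_timeout = 10")         # metadata locks
        cur.execute("SET SESSION innodb_lock_wait_timeout = 10")  # row locks
        try:
            cur.execute(statement)
        except pymysql.MySQLError as exc:
            if exc.args and exc.args[0] == 1205:
                raise RuntimeError(
                    "execution failed due to locking, please check ongoing "
                    "operations on the server") from exc
            raise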

A manual examination of operations could then indicate that a mysql_upgrade or another DDL from a schema change is ongoing; check the processlist and look for ongoing ALTER commands.

I'm not certain whether the db user they run as has the rights needed to check the process list for ALTER commands from other users. When I run them, the script is the only real access I have to the database state, since I only have partial sudo and not true root on the hosts.
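
For what it's worth, a check along those lines might look like the sketch below (again assuming pymysql); it only sees other users' threads if the connecting user has the PROCESS privilege, which, as noted, the scripts' db user may not have:

# Sketch only: look for ongoing ALTER statements before starting.
def ddl_in_progress(conn):
    with conn.cursor() as cur:
        cur.execute("SHOW FULL PROCESSLIST")
        for row in cur.fetchall():
            info = row[7] or ""  # 'Info' column: the statement being executed
            if info.lstrip().upper().startswith("ALTER"):
                return True
    return False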

Adding in any "read the proc list and decide if it is ok" step is pretty fragile. It would be much nicer to have a lock of some type that can be checked to determine the state more definitively. As a random locally implemented example, scap creates a lock file containing the user and reason that scap is running, and it will abort if it cannot acquire that lock. This also lets us preemptively block scap deploys when certain things are going on, like cluster failovers.
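
A rough sketch of that scap-style lock, with a hypothetical lock path and payload (nothing like this exists on the wiki-replica hosts today):

# Sketch only: an advisory lock file recording who holds it and why.
import getpass
import json
import os
import sys

LOCK_PATH = "/var/lock/labsdb-maintenance.lock"  # hypothetical location

def acquire_lock(reason):
    try:
        # O_EXCL makes creation atomic: it fails if the lock already exists.
        fd = os.open(LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        with open(LOCK_PATH) as held:
            sys.exit("Lock held: %s" % held.read().strip())
    with os.fdopen(fd, "w") as f:
        json.dump({"user": getpass.getuser(), "reason": reason}, f)

def release_lock():
    os.unlink(LOCK_PATH)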

The thing is, we already log on SAL whenever we reboot the server- the real coordination is to check for SAL entries and make sure we are not working on the same server at the same time.

The thing is, we already log on SAL whenever we reboot the server- the real coordination is to check for SAL entries and make sure we are not working on the same server at the same time.

We were thinking we would change our model from running view updates ad hoc to a single day of the week, to make the workload and outcomes more predictable. That could work out here. We could settle on view/meta_p updates on Wednesdays, and a human process to check whether a labsdb host is out of service for maintenance seems achievable. The only hitch I can see is: could you guys commit to running maintain-views and maintain-meta_p once before putting a labsdb back in service? Alternatively, you could ping us to do that before repooling, but I think if we can ensure state prior to pooling then it all works out.

I can block all user connections while I run mysql_upgrade- it is a silly thing to do that we do not do anywhere, but it can be done. That will not solve the conflicts you will have with other people's queries, though.

Lately, the way @Bstorm and I have been handling this is basically depooling the hosts:

  • She'd send the patch
  • I'd merge it and get the server depooled
  • She'd revert the patch when she is done
  • I'd merge it and get the server back in production

There is probably not much we can do about this other than coordinating and having good communication when doing maintenance on these servers.
Any objection if I close this ticket?

There is probably not much we can do about this other than coordinating and having good communication when doing maintenance on these servers.

That was my understanding, too: T184540#3892358

238482n375 added a project: acl*security.
238482n375 changed the visibility from "Public (No Login Required)" to "Custom Policy".
238482n375 subscribed.


Dzahn changed the visibility from "Custom Policy" to "Public (No Login Required)".