Page MenuHomePhabricator

labsdb1011 missing tools user accounts
Closed, ResolvedPublic

Description

While debugging account access to the wiki replicas, I found that the mysql accounts on labsdb1011 are not in sync with the expected metadata state. This could be related to the recent reclone T235016.

Is there a process we can use to resync the user accounts on labsdb1011? I've found a few accounts missing but haven't done a full audit yet.

Here's some information on an example with my account:

Metadata user status
$ mysql -h m5-master.eqiad.wmnet -u labsdbaccounts -p -e "USE labsdbaccounts; SELECT * FROM account WHERE mysql_username='u21436'\G"
Enter password:
*************************** 1. row ***************************
            id: 19201
mysql_username: u21436
          type: user
      username: jeh
Metadata account status
$ mysql -h m5-master.eqiad.wmnet -u labsdbaccounts -p -e "USE labsdbaccounts; SELECT * FROM labsdbaccounts.account_host WHERE account_id=19201\G"
Enter password:
*************************** 1. row ***************************
        id: 134314
account_id: 19201
  hostname: 172.16.7.153
    status: present
*************************** 2. row ***************************
        id: 134313
account_id: 19201
  hostname: labsdb1009.eqiad.wmnet
    status: present
*************************** 3. row ***************************
        id: 134316
account_id: 19201
  hostname: labsdb1010.eqiad.wmnet
    status: present
*************************** 4. row ***************************
        id: 134315
account_id: 19201
  hostname: labsdb1011.eqiad.wmnet
    status: present
Labsdb account status
$ mysql -h 172.16.7.153 -u labsdbadmin -p -e 'SELECT COUNT(User) from mysql.user where User = "u21436"\G'
*************************** 1. row ***************************
COUNT(User): 1

$ mysql -h labsdb1009.eqiad.wmnet -u labsdbadmin -p -e 'SELECT COUNT(User) from mysql.user where User = "u21436"\G'
*************************** 1. row ***************************
COUNT(User): 1

$ mysql -h labsdb1010.eqiad.wmnet -u labsdbadmin -p -e 'SELECT COUNT(User) from mysql.user where User = "u21436"\G'
*************************** 1. row ***************************
COUNT(User): 1

$ mysql -h labsdb1011.eqiad.wmnet -u labsdbadmin -p -e 'SELECT COUNT(User) from mysql.user where User = "u21436"\G'
*************************** 1. row ***************************
COUNT(User): 0
Labsdb user table count
$ mysql -h labsdb1009.eqiad.wmnet -u labsdbadmin -p -e 'SELECT COUNT(User) from mysql.user\G'
*************************** 1. row ***************************
COUNT(User): 3784

$ mysql -h labsdb1010.eqiad.wmnet -u labsdbadmin -p -e 'SELECT COUNT(User) from mysql.user\G'
*************************** 1. row ***************************
COUNT(User): 3778

$ mysql -h labsdb1011.eqiad.wmnet -u labsdbadmin -p -e 'SELECT COUNT(User) from mysql.user\G'
*************************** 1. row ***************************
COUNT(User): 3543

Event Timeline

I noticed there's a harvest function in maintain-dbusers https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/wmcs/nfs/maintain-dbusers.py#19

It looks like it was intended for bootstraping databases, but I'm curious if we could use that here.

It looks like it was intended for bootstraping databases, but I'm curious if we could use that here.

I believe the current implementation of harvest works in the opposite direction as what would be needed to fix the labsdb1011 auth tables. It populates the labsdbaccounts database on m5 which tracks account provisioning status based on the current filesystem ($HOME/replica.my.cnf) and remote database auth table status (labsdb10{09,10,11}, clouddb1001.clouddb).

It seems like it should be relatively straight forward to add a repair mode that could be used to repopulate a database's auth tables using the data from the labsdbaccounts database. Maybe something like maintain-dbusers --repair labsdb1011.eqiad.wmnet to walk through the rows from SELECT * FROM labsdbaccounts.account_host WHERE hostname='labsdb1011.eqiad.wmnet' AND status='present', check for the associated GRANT on the remote db, and create if not found?

Related, it would be nice to fix maintain-dbusers --delete [USER] so that it reports a missing GRANT but continues when trying to delete an account where the labsdbaccounts database thinks it exists but the account is missing on the remote database. I stumbled on that while working on T235382: MariaDB User s54171, access denied on replicas..

I'm interested if @Marostegui has any thoughts because I think this is caused by cloning labsdb1012, which doesn't currently get the user databases. That system wasn't included in the rotation on purpose, but if we are going to use it in the future for cloning like this, we may want to include it.

A repair function seems sensible if there isn't a clean way to copy the user database. I do not know if that would/could copy over max_user_connections and similar things as well, but we could try to capture it.

I believe the current implementation of harvest works in the opposite direction as what would be needed to fix the labsdb1011 auth tables. It populates the labsdbaccounts database on m5 which tracks account provisioning status based on the current filesystem ($HOME/replica.my.cnf) and remote database auth table status (labsdb10{09,10,11}, clouddb1001.clouddb).

Harvest has two functions:

  1. harvest_cnf_files finds tools and users replica.my.cnf file and insert/update m5's labsdbaccounts.account table
  2. harvest_replica_accts Determine the state of users defined in m5's labsdbaccounts.account table on each replica, then update the users state to either absent|present in labsdbaccounts.account_host table for each replica.

It seems like it should be relatively straight forward to add a repair mode that could be used to repopulate a database's auth tables using the data from the labsdbaccounts database. Maybe something like maintain-dbusers --repair labsdb1011.eqiad.wmnet to walk through the rows from SELECT * FROM labsdbaccounts.account_host WHERE hostname='labsdb1011.eqiad.wmnet' AND status='present', check for the associated GRANT on the remote db, and create if not found?

I think if we reuse the harvest_replica_accts function to determine the state of each user, the create_accounts function would pick this up on the next run and create the absent users.

Harvest has two functions:

  1. harvest_cnf_files finds tools and users replica.my.cnf file and insert/update m5's labsdbaccounts.account table
  2. harvest_replica_accts Determine the state of users defined in m5's labsdbaccounts.account table on each replica, then update the users state to either absent|present in labsdbaccounts.account_host table for each replica.

I think if we reuse the harvest_replica_accts function to determine the state of each user, the create_accounts function would pick this up on the next run and create the absent users.

That seems ideal then. I guess I did not read the source code carefully enough. :)

A repair function seems sensible if there isn't a clean way to copy the user database. I do not know if that would/could copy over max_user_connections and similar things as well, but we could try to capture it.

I forgot about that magic already! Maybe we should open Yet Another Phabricator Task (YAPT) about adding tracking of custom connection limits into the labsdbaccounts tables and applying that via maintain-dbusers?

My only worry about the harvest functions was that the last time they were given any attention, we were discussing removing them. I have made updates to maintain-dbusers, but I haven't touched those functions. A quick check makes me suspect it wasn't updated when we moved to the role system vs. direct grants (as it uses "SHOW GRANTS" to determine access). That would fail on basically everyone right now. Checking that...

Ok, that's wrong. Show grants will show the role and limits :) That's good.

Change 543706 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] toolforge: harvest replicas for current user account state

https://gerrit.wikimedia.org/r/543706

I tested the above patch with my user account and verified it's working as expected.

example command
maintain-dbusers --account-type user harvest-replicas
new account entry
$ mysql -h labsdb1011.eqiad.wmnet -u labsdbadmin -p -e 'SELECT COUNT(User) from mysql.user where User = "u21436"\G'
*************************** 1. row ***************************
COUNT(User): 1

and I'm able to connect from toolforge now :)

mysql --defaults-file=$HOME/replica.my.cnf -h enwiki.analytics.db.svc.eqiad.wmflabs enwiki_p
...

MariaDB [enwiki_p]>

Change 543706 merged by Jhedden:
[operations/puppet@production] toolforge: harvest replicas for current user account state

https://gerrit.wikimedia.org/r/543706

Changes pushed. Full list of accounts that were out of sync and created at: https://phabricator.wikimedia.org/P9366

Mentioned in SAL (#wikimedia-cloud) [2019-10-16T21:59:37Z] <jeh> resync wiki replica tool and user accounts T235697

I'm interested if @Marostegui has any thoughts because I think this is caused by cloning labsdb1012, which doesn't currently get the user databases. That system wasn't included in the rotation on purpose, but if we are going to use it in the future for cloning like this, we may want to include it.

A repair function seems sensible if there isn't a clean way to copy the user database. I do not know if that would/could copy over max_user_connections and similar things as well, but we could try to capture it.

Definitely caused by the labsdb1011 recloning. I didn't think about this because I thought labsdb1012 was getting the normal users like any other labsdb host. Sorry about it.

If possible, I think we should definitely include labsdb1012 in that setup to avoid this kind of issues in the future in case we have to reclone another host or place labsdb1012 temporarily to serve on web or analytics service (ie: HW issue on another host makes it unavailable for a long period of time).

If possible, I think we should definitely include labsdb1012 in that setup to avoid this kind of issues in the future in case we have to reclone another host or place labsdb1012 temporarily to serve on web or analytics service (ie: HW issue on another host makes it unavailable for a long period of time).

Just in case that isn't just an easy drop-in, I'll make a new task for it. Thanks for the thoughts!