labsdb1011 missing tools user accounts
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	• JHedden
	Oct 16 2019, 7:12 PM

Description

While debugging account access to the wiki replicas, I found that the mysql accounts on labsdb1011 are not in sync with the expected metadata state. This could be related to the recent reclone T235016.

Is there a process we can use to resync the user accounts on labsdb1011? I've found a few accounts missing but haven't done a full audit yet.

Here's some information on an example with my account:

Metadata user status

$ mysql -h m5-master.eqiad.wmnet -u labsdbaccounts -p -e "USE labsdbaccounts; SELECT * FROM account WHERE mysql_username='u21436'\G"
Enter password:
*************************** 1. row ***************************
            id: 19201
mysql_username: u21436
          type: user
      username: jeh

Metadata account status

$ mysql -h m5-master.eqiad.wmnet -u labsdbaccounts -p -e "USE labsdbaccounts; SELECT * FROM labsdbaccounts.account_host WHERE account_id=19201\G"
Enter password:
*************************** 1. row ***************************
        id: 134314
account_id: 19201
  hostname: 172.16.7.153
    status: present
*************************** 2. row ***************************
        id: 134313
account_id: 19201
  hostname: labsdb1009.eqiad.wmnet
    status: present
*************************** 3. row ***************************
        id: 134316
account_id: 19201
  hostname: labsdb1010.eqiad.wmnet
    status: present
*************************** 4. row ***************************
        id: 134315
account_id: 19201
  hostname: labsdb1011.eqiad.wmnet
    status: present

Labsdb account status

$ mysql -h 172.16.7.153 -u labsdbadmin -p -e 'SELECT COUNT(User) from mysql.user where User = "u21436"\G'
*************************** 1. row ***************************
COUNT(User): 1

$ mysql -h labsdb1009.eqiad.wmnet -u labsdbadmin -p -e 'SELECT COUNT(User) from mysql.user where User = "u21436"\G'
*************************** 1. row ***************************
COUNT(User): 1

$ mysql -h labsdb1010.eqiad.wmnet -u labsdbadmin -p -e 'SELECT COUNT(User) from mysql.user where User = "u21436"\G'
*************************** 1. row ***************************
COUNT(User): 1

$ mysql -h labsdb1011.eqiad.wmnet -u labsdbadmin -p -e 'SELECT COUNT(User) from mysql.user where User = "u21436"\G'
*************************** 1. row ***************************
COUNT(User): 0

Labsdb user table count

$ mysql -h labsdb1009.eqiad.wmnet -u labsdbadmin -p -e 'SELECT COUNT(User) from mysql.user\G'
*************************** 1. row ***************************
COUNT(User): 3784

$ mysql -h labsdb1010.eqiad.wmnet -u labsdbadmin -p -e 'SELECT COUNT(User) from mysql.user\G'
*************************** 1. row ***************************
COUNT(User): 3778

$ mysql -h labsdb1011.eqiad.wmnet -u labsdbadmin -p -e 'SELECT COUNT(User) from mysql.user\G'
*************************** 1. row ***************************
COUNT(User): 3543

Details

	Subject	Repo	Branch	Lines +/-
	toolforge: harvest replicas for current user account state	operations/puppet	production	+9 -1

Customize query in gerrit

Related Objects
Search...

		Status	Subtype	Assigned	Task
		Resolved		• JHedden	T235697 labsdb1011 missing tools user accounts
		Resolved		• Bstorm	T235791 Add labsdb1012 to maintain-dbusers

Event Timeline

• JHedden created this task.Oct 16 2019, 7:12 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 16 2019, 7:12 PM

I noticed there's a harvest function in maintain-dbusers https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/wmcs/nfs/maintain-dbusers.py#19

It looks like it was intended for bootstraping databases, but I'm curious if we could use that here.

Related task T235382

In T235697#5581757, @JHedden wrote:

It looks like it was intended for bootstraping databases, but I'm curious if we could use that here.

I believe the current implementation of harvest works in the opposite direction as what would be needed to fix the labsdb1011 auth tables. It populates the labsdbaccounts database on m5 which tracks account provisioning status based on the current filesystem ($HOME/replica.my.cnf) and remote database auth table status (labsdb10{09,10,11}, clouddb1001.clouddb).

It seems like it should be relatively straight forward to add a repair mode that could be used to repopulate a database's auth tables using the data from the labsdbaccounts database. Maybe something like maintain-dbusers --repair labsdb1011.eqiad.wmnet to walk through the rows from SELECT * FROM labsdbaccounts.account_host WHERE hostname='labsdb1011.eqiad.wmnet' AND status='present', check for the associated GRANT on the remote db, and create if not found?

Related, it would be nice to fix maintain-dbusers --delete [USER] so that it reports a missing GRANT but continues when trying to delete an account where the labsdbaccounts database thinks it exists but the account is missing on the remote database. I stumbled on that while working on T235382: MariaDB User s54171, access denied on replicas..

I think it is pretty likely that T235413: Access denied to Toolforge DB since last Friday is related as well.

I'm interested if @Marostegui has any thoughts because I think this is caused by cloning labsdb1012, which doesn't currently get the user databases. That system wasn't included in the rotation on purpose, but if we are going to use it in the future for cloning like this, we may want to include it.

A repair function seems sensible if there isn't a clean way to copy the user database. I do not know if that would/could copy over max_user_connections and similar things as well, but we could try to capture it.

In T235697#5581859, @bd808 wrote:

I believe the current implementation of harvest works in the opposite direction as what would be needed to fix the labsdb1011 auth tables. It populates the labsdbaccounts database on m5 which tracks account provisioning status based on the current filesystem ($HOME/replica.my.cnf) and remote database auth table status (labsdb10{09,10,11}, clouddb1001.clouddb).

Harvest has two functions:

harvest_cnf_files finds tools and users replica.my.cnf file and insert/update m5's labsdbaccounts.account table
harvest_replica_accts Determine the state of users defined in m5's labsdbaccounts.account table on each replica, then update the users state to either absent|present in labsdbaccounts.account_host table for each replica.

It seems like it should be relatively straight forward to add a repair mode that could be used to repopulate a database's auth tables using the data from the labsdbaccounts database. Maybe something like maintain-dbusers --repair labsdb1011.eqiad.wmnet to walk through the rows from SELECT * FROM labsdbaccounts.account_host WHERE hostname='labsdb1011.eqiad.wmnet' AND status='present', check for the associated GRANT on the remote db, and create if not found?

I think if we reuse the harvest_replica_accts function to determine the state of each user, the create_accounts function would pick this up on the next run and create the absent users.

In T235697#5582077, @JHedden wrote:

Harvest has two functions:

harvest_cnf_files finds tools and users replica.my.cnf file and insert/update m5's labsdbaccounts.account table

harvest_replica_accts Determine the state of users defined in m5's labsdbaccounts.account table on each replica, then update the users state to either absent|present in labsdbaccounts.account_host table for each replica.

I think if we reuse the harvest_replica_accts function to determine the state of each user, the create_accounts function would pick this up on the next run and create the absent users.

That seems ideal then. I guess I did not read the source code carefully enough. :)

In T235697#5581884, @Bstorm wrote:

A repair function seems sensible if there isn't a clean way to copy the user database. I do not know if that would/could copy over max_user_connections and similar things as well, but we could try to capture it.

I forgot about that magic already! Maybe we should open Yet Another Phabricator Task (YAPT) about adding tracking of custom connection limits into the labsdbaccounts tables and applying that via maintain-dbusers?

My only worry about the harvest functions was that the last time they were given any attention, we were discussing removing them. I have made updates to maintain-dbusers, but I haven't touched those functions. A quick check makes me suspect it wasn't updated when we moved to the role system vs. direct grants (as it uses "SHOW GRANTS" to determine access). That would fail on basically everyone right now. Checking that...

Ok, that's wrong. Show grants will show the role and limits :) That's good.

Change 543706 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] toolforge: harvest replicas for current user account state

https://gerrit.wikimedia.org/r/543706

gerritbot added a project: Patch-For-Review.Oct 16 2019, 9:27 PM

I tested the above patch with my user account and verified it's working as expected.

example command

maintain-dbusers --account-type user harvest-replicas

new account entry

$ mysql -h labsdb1011.eqiad.wmnet -u labsdbadmin -p -e 'SELECT COUNT(User) from mysql.user where User = "u21436"\G'
*************************** 1. row ***************************
COUNT(User): 1

and I'm able to connect from toolforge now :)

mysql --defaults-file=$HOME/replica.my.cnf -h enwiki.analytics.db.svc.eqiad.wmflabs enwiki_p
...

MariaDB [enwiki_p]>

Change 543706 merged by Jhedden:
[operations/puppet@production] toolforge: harvest replicas for current user account state

https://gerrit.wikimedia.org/r/543706

Changes pushed. Full list of accounts that were out of sync and created at: https://phabricator.wikimedia.org/P9366

Mentioned in SAL (#wikimedia-cloud) [2019-10-16T21:59:37Z] <jeh> resync wiki replica tool and user accounts T235697

bd808 mentioned this in T235413: Access denied to Toolforge DB since last Friday.Oct 16 2019, 10:00 PM

Maintenance_bot removed a project: Patch-For-Review.Oct 16 2019, 10:10 PM

• Bstorm awarded a token.Oct 16 2019, 11:10 PM

In T235697#5581884, @Bstorm wrote:

I'm interested if @Marostegui has any thoughts because I think this is caused by cloning labsdb1012, which doesn't currently get the user databases. That system wasn't included in the rotation on purpose, but if we are going to use it in the future for cloning like this, we may want to include it.

A repair function seems sensible if there isn't a clean way to copy the user database. I do not know if that would/could copy over max_user_connections and similar things as well, but we could try to capture it.

Definitely caused by the labsdb1011 recloning. I didn't think about this because I thought labsdb1012 was getting the normal users like any other labsdb host. Sorry about it.

If possible, I think we should definitely include labsdb1012 in that setup to avoid this kind of issues in the future in case we have to reclone another host or place labsdb1012 temporarily to serve on web or analytics service (ie: HW issue on another host makes it unavailable for a long period of time).

• JHedden claimed this task.Oct 17 2019, 12:13 PM

• JHedden closed this task as Resolved.Oct 17 2019, 1:33 PM

In T235697#5582876, @Marostegui wrote:

If possible, I think we should definitely include labsdb1012 in that setup to avoid this kind of issues in the future in case we have to reclone another host or place labsdb1012 temporarily to serve on web or analytics service (ie: HW issue on another host makes it unavailable for a long period of time).

Just in case that isn't just an easy drop-in, I'll make a new task for it. Thanks for the thoughts!

• JHedden mentioned this in T235906: Cannot access replica databases.Oct 18 2019, 7:19 PM

• Bstorm closed subtask T235791: Add labsdb1012 to maintain-dbusers as Resolved.Oct 22 2019, 9:38 PM

labsdb1011 missing tools user accountsClosed, ResolvedPublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

labsdb1011 missing tools user accounts
Closed, ResolvedPublic
Actions

Related Objects
Search...