cleanup mariadb datadir situation on cloud VPS hosts using simplelamp role
Closed, Resolved (Public)

Description

The puppet role simplelamp2 uses /srv/sqldata as the default path for the data dir:

Stdlib::Unixpath $datadir = lookup('profile::mariadb::generic_server::datadir', {'default_value' => '/srv/sqldata'}),

while MariaDB's own default data dir is /var/lib/mysql.

However, if the server is not restarted manually after puppet runs, this setting is not applied yet. When the VM is eventually rebooted, the setting takes effect, mariadb can no longer find its databases and refuses to start, causing the problems described in T321763.

Possible fixes are:

  • change the default dir to /var/lib/mysql and never use /srv
  • add a service restart to puppet code
  • set correct datadir values for projects/instances in Hiera (repo or Horizon)

But we need to make sure we also fix this for all existing systems using the role, not just for future systems.


Project: glampipe - this project has been deleted. It is still listed here because of T334127.

Project: openocr - one instance, "api.openocr.eqiad1.wikimedia.cloud", that I can't connect to. The role would only be applied to instances called test*, so skipping.

Project: reading-web-staging

  • pixel.reading-web-staging.eqiad1.wikimedia.cloud - mysql is not running, but there is data in /srv/sqldata and /var/lib/mysql/ - should get Hiera setting

Project: signwriting

  • signwriting-swis-2022.signwriting.eqiad1.wikimedia.cloud - mysql is not running, but there is data in /srv/sqldata and /var/lib/mysql/ - should get Hiera setting
  • signwriting-swserver-2022.signwriting.eqiad1.wikimedia.cloud - mysql is not running, but there is data in /srv/sqldata and /var/lib/mysql/ - should get Hiera setting

Project: vuessr

  • prototype1.vuessr.eqiad1.wikimedia.cloud - mysql is not running, but there is data in /srv/sqldata and /var/lib/mysql/ - should get Hiera setting

Project: wikipathways

  • data.wikipathways.eqiad1.wikimedia.cloud - mysql is not running, but there is data in /srv/sqldata and /var/lib/mysql/ - should get Hiera setting
  • wikipathways-dev.wikipathways.eqiad1.wikimedia.cloud - cannot connect to instance

Project: wikisp

  • mars.wikisp.eqiad1.wikimedia.cloud - mysql IS running - runtime datadir and config datadir are both /srv/sqldata - should get Hiera setting

Project: wikispeech

  • producer.wikispeech.eqiad1.wikimedia.cloud - mysql IS running - runtime datadir and config datadir are both /srv/sqldata - should get Hiera setting

Project: wildcat

  • dannyb.wildcat.eqiad1.wikimedia.cloud - mysql is not running, but there is data in /srv/sqldata and /var/lib/mysql/ - should get Hiera setting

Project: wmf-research-tools

  • knowledge-gap-index-tool.wmf-research-tools.eqiad1.wikimedia.cloud - mysql is not running, but there is data in /srv/sqldata and /var/lib/mysql/ - should get Hiera setting

Event Timeline

the role includes profile::mariadb::generic_server which is, in production, used by:

  • phabricator
  • parsoid/testreduce
  • VRTS

but NOT by "production databases".

Dzahn updated the task description.

Change 888800 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] simplelamp2: change default mariadb datadir to /var/lib/mysql/

https://gerrit.wikimedia.org/r/888800

@taavi any other tag that should be added to this?

Jelto triaged this task as Medium priority.
Jelto moved this task from Incoming to Backlog on the collaboration-services board.

Mentioned in SAL (#wikimedia-cloud) [2023-03-13T04:10:18Z] <Deus> ceres-01: Change datadir to /var/lib/mysql (T329571)

Getting back to this to fix it for real.

So here is the list of users of the role:

https://openstack-browser.toolforge.org/puppetclass/role::simplelamp2

I have to go through each instance in these projects and determine:

  • is mysql running and can I connect to it as root
  • what is the current datadir configured in /etc/my.cnf
  • what is the current datadir when asking for running config on the db server itself (to check if they have restarted the service and it matches)
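The comparison in the last two checks can be sketched in Python. This is only an illustration, not the procedure actually used on the instances: the function names are hypothetical, the parser assumes the setting sits in a [mysqld] section of a my.cnf-style file, and the runtime value is assumed to have been obtained separately (for example from `SELECT @@datadir;` on the running server).

```python
import re
from typing import Optional

# Assumption: datadir is set in a [mysqld] section of a my.cnf-style file.
_SECTION = re.compile(r"^\[(?P<name>[^\]]+)\]$")
_DATADIR = re.compile(r"^datadir\s*=\s*(?P<path>\S+)$")


def configured_datadir(cnf_text: str, section: str = "mysqld") -> Optional[str]:
    """Return the last datadir set in the given section, or None if unset."""
    current = None
    found = None
    for raw in cnf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop '#' comments and whitespace
        if not line:
            continue
        match = _SECTION.match(line)
        if match:
            current = match.group("name").strip()
            continue
        match = _DATADIR.match(line)
        if match and current == section:
            found = match.group("path")
    return found


def normalize(path: str) -> str:
    """Strip trailing slashes so '/var/lib/mysql/' equals '/var/lib/mysql'."""
    return path.rstrip("/") or "/"


def classify(configured: Optional[str], runtime: Optional[str]) -> str:
    """Compare the config-file datadir with what the running server reports.

    runtime is None when the server is not running or cannot be queried.
    """
    if runtime is None:
        return "not running"
    if configured is None:
        return "running, no datadir in config"
    if normalize(configured) == normalize(runtime):
        return "running, config matches"
    return "running, restart pending"
```

An instance classified as "running, restart pending" is the risky case from the task description: the config file already points at a different datadir than the one the server is actually using.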

Then I can fix it by:

  • those that actually use the /srv/ path, make that an explicit setting in their project Hiera (in this case in Horizon, not repo, so users can edit it themselves)
  • those that use the /var/lib path, do nothing
  • merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/888800
  • double check things are ok after merge

This should be a noop for all existing users. For future users, the role should no longer break and cause these unexpected issues at the first service restart.

Dzahn changed the task status from Open to In Progress. Apr 5 2023, 6:27 PM
Dzahn updated the task description.
Dzahn added a project: Cloud-VPS.

@taavi Could you please do the following edits for me? In these projects, edit project-wide Hiera data in Horizon:

reading-web-staging
signwriting
vuessr
wikipathways
wikisp
wikispeech
wildcat
wmf-research-tools

In each of them add the following key/value:

profile::simplelamp2::database_datadir: '/srv/sqldata'

Right now this will do nothing as this profile does not exist yet. I will then merge a change that refactors the role to include a new profile::simplelamp2 so that we can do a Hiera lookup for the datadir there.
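Why pinning the key first makes the later default change safe can be shown with a toy model in Python. It is a simplification, not Hiera's actual lookup implementation: a plain dict stands in for the project-wide Hiera data, the function names are hypothetical, and the old and new defaults are passed in as plain strings.

```python
from typing import Mapping

# Key name taken from the request above.
KEY = "profile::simplelamp2::database_datadir"


def effective_datadir(project_hiera: Mapping[str, str], default: str) -> str:
    """Mimic a lookup with a default value: a project-level key wins if present."""
    return project_hiera.get(KEY, default)


def default_change_is_noop(
    project_hiera: Mapping[str, str], old_default: str, new_default: str
) -> bool:
    """True if changing the default leaves this project's datadir unchanged."""
    before = effective_datadir(project_hiera, old_default)
    after = effective_datadir(project_hiera, new_default)
    return before == after


if __name__ == "__main__":
    pinned = {KEY: "/srv/sqldata"}
    print(default_change_is_noop(pinned, "/srv/sqldata", "/var/lib/mysql"))
    print(default_change_is_noop({}, "/srv/sqldata", "/var/lib/mysql"))
```

In this model a project with the key pinned keeps /srv/sqldata across the default change, while a project without it silently moves to the new default, which is why the key goes in before the merge.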

I will disable puppet on all affected machines and re-enable one by one to do this carefully and check each one.

Temporarily assigned to you for the step above. Please assign it back to me when it's done or if there are problems with it. Thank you!

Configured that Hiera key in every project where role::simplelamp2 is assigned.

Change 888800 merged by Dzahn:

[operations/puppet@production] simplelamp2: change default mariadb datadir to /var/lib/mysql/

https://gerrit.wikimedia.org/r/888800

Thank you @taavi! I merged the change above, which refactors the role into role/profile so that we could add a Hiera lookup for the datadir value. With the Hiera keys you added for me, this was a noop on every instance that uses it. I checked them all and re-enabled puppet one by one; nothing changed.

It also fixes this for every future user of simplelamp2: they should simply get /var/lib/mysql as the default value.

Additionally, I am going to ask existing users if they want help with their DB setup. Only 2 of them had a running DB, but I am not sure whether the others were broken due to me or just not used in the first place.

Hey @Galahad, since you are on this ticket, I am telling you here. On your instance "mars" I saw that mysql is running and that both the runtime config and the config file say the datadir is /srv/sqldata. Nothing changed here, except that the new default is now /var/lib/mysql, to avoid the issues from the past, and your project has a Hiera setting that overrides it to /srv/sqldata. If you are fine with that, there is nothing you have to do. If you want to change it, you now can in your project Hiera. Please let me know if you would like any help with the databases or the data dir in your project.

I sent an email to some admins of each of the other affected projects to ask whether they even use mysql there and want any help with it.

I got responses from users:

signwriting - confirmed their sites running on this are up and fine. since mysql wasn't running we can safely assume it's not used.

wmf-research-tools - confirmed isn't using mysql and no immediate plans to do so.

wildcat - said it is planned to store data locally, but they aren't doing so yet

wikispeech - said they are happy to use the default settings and the server is not yet in use, so changing that, even if it means losing the old database, is fine

wikipathways - confirmed they are not moving mysql on that host

pixel - confirmed they are using mysql but it's inside a docker container, so not the one I was looking at

@taavi For the 6 above, we can remove the Hiera key again, so they will use defaults in the future. Thank you.

Mentioned in SAL (#wikimedia-cloud) [2023-04-12T19:26:50Z] <mutante> - vrts-1001 - editing /etc/my.cnf to set mariadb datadir to /var/lib/mysql instead of /srv/sqldata and restart service, issue like T329571

Change 908331 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] vrts: do not use /srv/sqldata as mariadb datadir (cloud, devtools)

https://gerrit.wikimedia.org/r/908331

Change 908331 merged by Dzahn:

[operations/puppet@production] vrts: do not use /srv/sqldata as mariadb datadir (cloud, devtools)

https://gerrit.wikimedia.org/r/908331

In T334971 the Hiera override was removed again from 6 of the projects.

This leaves only the special cases "vuessr" and "wikisp". The former had no mysql running and the latter was the only one actually using the /srv/sqldata path.

Did not get a response via email from these.

With that this is as good as it gets and I did what I could do.

The new default is /var/lib/mysql/ and there should be no more surprises for any future users.

Change 909788 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] mariadb::generic_server: change default datadir path

https://gerrit.wikimedia.org/r/909788

Change 909787 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phorge: add parameter for db_datadir and use default path

https://gerrit.wikimedia.org/r/909787

Change 909786 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: add parameter for db_datadir in cloud and use default path

https://gerrit.wikimedia.org/r/909786

Change 909787 merged by Dzahn:

[operations/puppet@production] phorge: add parameter for db_datadir and use default path

https://gerrit.wikimedia.org/r/909787

Change 909786 merged by Dzahn:

[operations/puppet@production] phabricator: add parameter for db_datadir in cloud and use default path

https://gerrit.wikimedia.org/r/909786

Change 909788 merged by Dzahn:

[operations/puppet@production] mariadb::generic_server: change default datadir path

https://gerrit.wikimedia.org/r/909788

As the final step, the default dir in mariadb::generic_server has been changed, and I double-checked that this wasn't a change for anything using it.

The only prod user was testreduce and all the cloud VPS projects had been checked. Either they already use the default or they have a Hiera setting that overrides it.

https://puppet-compiler.wmflabs.org/output/909788/40894/

This resolves the task.