Page MenuHomePhabricator

Automate maintain-views workflow
Open, MediumPublic

Description

This process is at

https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas#Updating_views

It's messy but should be possible to automate. We might want to move the pooling/depooling process out of hiera first though.

Event Timeline

Change 760880 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/cookbooks@master] Add cookbooks for running maintain-views

https://gerrit.wikimedia.org/r/760880

I have created a first pass at "automating" this, by creating a cookbook which prompts the user for every action to take. Once that procedure is shown to work, we can start making the functions run with no user actions, and make the confirmation steps programmatic as well.

Here's what a run of the current cookbook looks like:

(venv) cookbooks $ cookbook -d sre.wikireplicas.update-views
DRY-RUN: Executing cookbook sre.wikireplicas.update-views with args: []
DRY-RUN: START - Cookbook sre.wikireplicas.update-views updating wikireplica views
Eventually, the depooling and repooling will be done programmatically
(see https://phabricator.wikimedia.org/T297026).

For now, depooling and repooling require puppet patches.
It's a good idea to make the first puppet patch to depool the "web"
host in advance and to get it reviewed.

For instructions, see the depool_host function of
cookbooks/sre/wikireplicas/update-views.py.

If you don't have the puppet patch ready, you'll have to wait for somebody to review it,
during this cookbook, or or self-review it as you execute these steps.

Once you understand one depool/repool puppet patch, the rest are fairly straightforward:
you'll undo the first puppet patch to repool the "web" host, then make the equivalent
change on the "analytics" host.

Before you continue, consider getting a +1 on the first patch,
so you know you're on the right track.
==> Ready to proceed?
Type "go" to proceed or "abort" to interrupt the execution
> go
Go post in #wikimedia-data-persistence that you are updating the wikireplicas views.
Include the task number.
==> Ready to proceed?
Type "go" to proceed or "abort" to interrupt the execution
> go
If maintain-views.yaml is being updated, for example to change a customviews entry,
there will be a gerrit.wikimedia.org/r/c/operations/puppet puppet patch.
==> Are there puppet changes to be merged?
> no
Now to figure out which section is changing. We'll use this to determine which hosts to update.
To determine which section you are updating, refer to https://noc.wikimedia.org/db.php
==> Which section are you updating?
> s1
Now look in hieradata/hosts/clouddb10*.yaml for s1 to find which hosts are affected.
For this next prompt, enter both hosts separated by a plus sign (+)
==> What pair of hosts are you updating?
> clouddb1013+clouddb1017
Now to update host clouddb1013.
Make and review a patch to depool the host clouddb1013, if you haven't already.

It should be doable by uncommenting the lines in either hieradata/hosts/dbproxy1019.yaml
or hieradata/hosts/dbproxy1018.yaml for
profile::mariadb::proxy::multiinstance_replicas::section_overrides:
that correspond to the section you are modifying for the host that will remain pooled.

By setting the other host as the override for that section,
you depool the host you are updating.
Submit the patch to operations/puppet now.
Log on to puppetmaster1001 and run puppet-merge.
==> Has the puppet patch been merged?
Type "go" to proceed or "abort" to interrupt the execution
> go
Log on to the dbproxy host you modifed and run puppet agent.
Ensure you are on the right dbproxy host.
==> Does the dbproxy host have the latest puppet commit?
Type "go" to proceed or "abort" to interrupt the execution
> go
Reload haproxy to apply the change. Run the following:
    sudo systemctl reload haproxy
Then check that the host has been depooled by inspecting the output of:
    echo "show stat" | socat /run/haproxy/haproxy.sock stdio
Look for the line starting with mariadb-s1
There should be one for the other host, which is currently pooled, and none for clouddb1013
==> Has the host been depooled successfully?
Type "go" to proceed or "abort" to interrupt the execution
> go
Now ssh to host clouddb1013 to close any lingering database connections.
Connect to mariadb with the following command:
    sudo mysql -S /var/run/mysqld/mysqld.s1.sock
Check the processes list in the User column
for anything started by users (u####) and services (s####):
    MariaDB [(none)]> show processlist;
Quit any remaining "userspace" connections using kill <id>
==> Have all the user connections been closed?
Type "go" to proceed or "abort" to interrupt the execution
> go
Now is a good time to run a sql statement that you expect to fail,
with the views not having the change applied yet.
Run your test sql statement now.
==> Does your sql statement reflect the expected pre-update state?
Type "go" to proceed or "abort" to interrupt the execution
> go
Now it's time to run maintain-views itself:
    sudo maintain-views
You can update a subset of databases and tables with --database and --table,
 and you can test it before applying changes with --dry-run.
Run the maintain-views script now.
==> Did the maintain-views script succeed?
Type "go" to proceed or "abort" to interrupt the execution
> go
Now re-run the sql statement that failed before, to test the views have been updated.
==> Does your sql statement reflect the expected outcome?
Type "go" to proceed or "abort" to interrupt the execution
> go
To repool the host, you can make a new patch that comments out the overrides,
or just revert the old one.
Submit the patch to operations/puppet now.
Log on to puppetmaster1001 and run puppet-merge.
==> Has the puppet patch been merged?
Type "go" to proceed or "abort" to interrupt the execution
> go
Log on to the dbproxy host you modifed and run puppet agent.
Ensure you are on the right dbproxy host.
==> Does the dbproxy host have the latest puppet commit?
Type "go" to proceed or "abort" to interrupt the execution
> go
Reload haproxy to apply the change. Run the following:
    sudo systemctl reload haproxy
Then check that the host has been depooled by inspecting the output of:
    echo "show stat" | socat /run/haproxy/haproxy.sock stdio
Look for the line starting with mariadb-s1
There should be two, one for clouddb1013 and one for the other host.
==> Has the host been repooled successfully?
Type "go" to proceed or "abort" to interrupt the execution
> go
Now to update host clouddb1017.
Make and review a patch to depool the host clouddb1017, if you haven't already.

It should be doable by uncommenting the lines in either hieradata/hosts/dbproxy1019.yaml
or hieradata/hosts/dbproxy1018.yaml for
profile::mariadb::proxy::multiinstance_replicas::section_overrides:
that correspond to the section you are modifying for the host that will remain pooled.

By setting the other host as the override for that section,
you depool the host you are updating.
Submit the patch to operations/puppet now.
Log on to puppetmaster1001 and run puppet-merge.
==> Has the puppet patch been merged?
Type "go" to proceed or "abort" to interrupt the execution
> go
Log on to the dbproxy host you modifed and run puppet agent.
Ensure you are on the right dbproxy host.
==> Does the dbproxy host have the latest puppet commit?
Type "go" to proceed or "abort" to interrupt the execution
> go
Reload haproxy to apply the change. Run the following:
    sudo systemctl reload haproxy
Then check that the host has been depooled by inspecting the output of:
    echo "show stat" | socat /run/haproxy/haproxy.sock stdio
Look for the line starting with mariadb-s1
There should be one for the other host, which is currently pooled, and none for clouddb1017
==> Has the host been depooled successfully?
Type "go" to proceed or "abort" to interrupt the execution
> go
Now ssh to host clouddb1017 to close any lingering database connections.
Connect to mariadb with the following command:
    sudo mysql -S /var/run/mysqld/mysqld.s1.sock
Check the processes list in the User column
for anything started by users (u####) and services (s####):
    MariaDB [(none)]> show processlist;
Quit any remaining "userspace" connections using kill <id>
==> Have all the user connections been closed?
Type "go" to proceed or "abort" to interrupt the execution
> go
Now is a good time to run a sql statement that you expect to fail,
with the views not having the change applied yet.
Run your test sql statement now.
==> Does your sql statement reflect the expected pre-update state?
Type "go" to proceed or "abort" to interrupt the execution
> go
Now it's time to run maintain-views itself:
    sudo maintain-views
You can update a subset of databases and tables with --database and --table,
 and you can test it before applying changes with --dry-run.
Run the maintain-views script now.
==> Did the maintain-views script succeed?
Type "go" to proceed or "abort" to interrupt the execution
> go
Now re-run the sql statement that failed before, to test the views have been updated.
==> Does your sql statement reflect the expected outcome?
Type "go" to proceed or "abort" to interrupt the execution
> go
To repool the host, you can make a new patch that comments out the overrides,
or just revert the old one.
Submit the patch to operations/puppet now.
Log on to puppetmaster1001 and run puppet-merge.
==> Has the puppet patch been merged?
Type "go" to proceed or "abort" to interrupt the execution
> go
Log on to the dbproxy host you modifed and run puppet agent.
Ensure you are on the right dbproxy host.
==> Does the dbproxy host have the latest puppet commit?
Type "go" to proceed or "abort" to interrupt the execution
> go
Reload haproxy to apply the change. Run the following:
    sudo systemctl reload haproxy
Then check that the host has been depooled by inspecting the output of:
    echo "show stat" | socat /run/haproxy/haproxy.sock stdio
Look for the line starting with mariadb-s1
There should be two, one for clouddb1017 and one for the other host.
==> Has the host been repooled successfully?
Type "go" to proceed or "abort" to interrupt the execution
> go
DRY-RUN: END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) updating wikireplica views

We might want to move the pooling/depooling process out of hiera first though.

I think we can disable puppet temporarily using spicerack with puppet.disabled(reason):, then remove the server we want to depool out of /etc/haproxy/conf.d/multi-db-replicas.cfg using something like

cat /etc/haproxy/conf.d/multi-db-replicas.cfg | grep -v clouddb1014

and restart haproxy.

Then when puppet is re-enabled, it will add the server back to the multi-db-replicas.cfg file and the next haproxy restart will repool it.

We might want to move the pooling/depooling process out of hiera first though.

I think we can disable puppet temporarily using spicerack with puppet.disabled(reason):, then remove the server we want to depool out of /etc/haproxy/conf.d/multi-db-replicas.cfg using something like

cat /etc/haproxy/conf.d/multi-db-replicas.cfg | grep -v clouddb1014

and restart haproxy.

Then when puppet is re-enabled, it will add the server back to the multi-db-replicas.cfg file and the next haproxy restart will repool it.

FYI I split this part to T300427 and left a suggestion of my own there

@Majavah ah yes, I remember that now, I forgot where it was. Thanks for the link

I have not had time to look at this, but thank you for working on it!

I have started to iterate on the cookbook at https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/760880 again, and I'm not sure how to get the dbproxy hosts in a consistent order (the order itself is arbitrary, but the instructions should tell the user to make one patch first consistently. I'm choosing "web" first).

There are 2 hosts: dbproxy1018 and dbproxy1019, and the "web" host that I'm choosing first is dbproxy1019. I could hardcode dbproxy1019 into the cookbook, but ideally I could query for the host with the following in hiera:

# hieradata/hosts/dbproxy1019.yaml
profile::mariadb::proxy::multiinstance_replicas::replica_type: 'web'

Does anybody know how to query for dbproxy1019 from cumin?

Perhaps there is a dns record for this?

I have started to iterate on the cookbook at https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/760880 again, and I'm not sure how to get the dbproxy hosts in a consistent order (the order itself is arbitrary, but the instructions should tell the user to make one patch first consistently. I'm choosing "web" first).

There are 2 hosts: dbproxy1018 and dbproxy1019, and the "web" host that I'm choosing first is dbproxy1019. I could hardcode dbproxy1019 into the cookbook, but ideally I could query for the host with the following in hiera:

# hieradata/hosts/dbproxy1019.yaml
profile::mariadb::proxy::multiinstance_replicas::replica_type: 'web'

Does anybody know how to query for dbproxy1019 from cumin?

Completely untested (as I don't have access), but you should be able to use a query like

P{C:profile::mariadb::proxy::multiinstance_replicas%replica_type = web}

Although I'd consider hardcoding them for now and instead spending time on automating the depool step instead.

Change 760880 merged by jenkins-bot:

[operations/cookbooks@master] Add cookbooks for running maintain-views

https://gerrit.wikimedia.org/r/760880

Change 773670 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/cookbooks@master] sre.wikireplicas.update-views: add more options

https://gerrit.wikimedia.org/r/773670

Change 989130 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/cookbooks@master] wikireplicas: update-views: always run on all hosts

https://gerrit.wikimedia.org/r/989130

Change 989130 merged by jenkins-bot:

[operations/cookbooks@master] wikireplicas: update-views: always run on all hosts

https://gerrit.wikimedia.org/r/989130

fnegri triaged this task as Medium priority.Jul 15 2024, 9:16 AM