
Move Superset and Turnilo to an-tool1010
Closed, Resolved · Public

Description

In T268146 the dcops team racked an-tool1010, a physical node to host Turnilo and Superset (currently running on VMs).

Some things to remember:

  • Superset uses a database on an-coord1001, so moving it to another host should be done carefully.
  • We need to update the scap configs in both gerrit repositories to be able to deploy to the new node.
  • At the end we need to decommission the two VMs on Ganeti (an-tool1007 and analytics-tool1004).
  • Both backends have config in ATS for external domains.

@razzi interested? :)

Event Timeline

fdans triaged this task as Medium priority.
fdans moved this task from Incoming to Operational Excellence on the Analytics board.

Change 644672 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] superset: add cached to an-tool1010

https://gerrit.wikimedia.org/r/644672

My thought for next steps here is to install superset on an-tool1010, using the existing database at an-coord1001, and testing that caching works as expected.

> My thought for next steps here is to install superset on an-tool1010, using the existing database at an-coord1001, and testing that caching works as expected.

If possible I'd avoid it, because there would be two Superset instances trying to write to the same database. One safer idea could be to tweak the Superset staging instance on an-tool1005 (we co-locate Turnilo and Superset there, without any external domain) to use the memcached on an-tool1010. Once we know what config works, it should be quick and easy to:

  1. Stop superset on analytics-tool1004.
  2. Deploy it to an-tool1010 (with the memcached settings).
  3. Move Varnish/ATS settings to the new backend.

What do you think?
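
For context, the "memcached settings" above would live in superset_config.py. A minimal sketch of what they might look like, assuming memcached listens on its default port 11211 on an-tool1010; the key prefixes and TTLs here are illustrative, not the actual puppet-managed values:

```
# superset_config.py (sketch; not the actual puppet-managed file)
from cachelib import MemcachedCache

MEMCACHED_SERVERS = ["an-tool1010.eqiad.wmnet:11211"]  # assumed default memcached port

# Flask-Caching backend used for chart/dashboard caching.
CACHE_CONFIG = {
    "CACHE_TYPE": "memcached",
    "CACHE_MEMCACHED_SERVERS": MEMCACHED_SERVERS,
    "CACHE_DEFAULT_TIMEOUT": 86400,   # illustrative TTL, one day
    "CACHE_KEY_PREFIX": "superset_",  # illustrative prefix
}

# Results backend for query results; keys with a prefix like this one are
# what a "superset_result*" check against memcached would look for.
RESULTS_BACKEND = MemcachedCache(
    servers=MEMCACHED_SERVERS,
    key_prefix="superset_results",
    default_timeout=86400,
)
```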

Just talked with Razzi: we decided to do this in a slightly different order, to allow him to make progress and to decouple the memcached work from the an-tool1010 move.

Razzi has already confirmed that memcached configs work on staging superset on an-tool1005. Next steps:

Set up memcached for Superset prod.

  • Run memcached on an-tool1010.
  • Configure superset on analytics-tool1004 to use memcached on an-tool1010.

Move turnilo and superset to an-tool1010

  • Schedule short downtime for superset and turnilo.
  • Stop superset and turnilo during downtime.
  • Merge puppet patch to set up superset and turnilo on an-tool1010 (with memcached settings).
  • Merge puppet patch to route Superset and Turnilo traffic in ATS/Varnish to an-tool1010 (downtime over); a quick smoke test of the new backend before this step is sketched below.
  • Decom analytics-tool1004.

@elukey, does this plan sound ok with you?
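
One way to sanity-check the new backend before the ATS/Varnish switch in the plan above is to hit an-tool1010 directly with the public Host header. A rough sketch; the listen port and scheme are assumptions rather than values taken from puppet:

```
# Quick smoke test of the new backend before routing production traffic to it.
# The listen port and scheme are assumptions, not values taken from puppet.
import requests

BACKEND = "http://an-tool1010.eqiad.wmnet:9080"  # assumed Superset listen port

resp = requests.get(
    f"{BACKEND}/health",                         # Superset's health endpoint
    headers={"Host": "superset.wikimedia.org"},  # public domain normally served via ATS/Varnish
    timeout=10,
)
print(resp.status_code, resp.text)
```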

Yep! Looks good, my only suggestion is/was to avoid having multiple instances of superset using the db at the same time, so that we don't end up in "multi-writes" scenarios.

One question for Razzi: can you add some detail about what test was done to confirm that the memcached settings work? I am asking because sometimes, when I was testing dashboards, I didn't notice problems that the rest of the team did, so before pulling the trigger I'd also ask whoever has time to quickly test the memcached setup, spot anomalies, and report back here if anything weird is found. I'd also like to know whether there are performance improvements with the new setup (curious about the results!).

@elukey: I confirmed that memcached was working based on the presence of superset_result keys in memcached.

Currently the Presto dashboards are not loading on staging due to an SSL error; I'm thinking that's worth fixing now, since we should test the Presto <-> memcached functionality. Once that's fixed, we can take some timing measurements on staging with and without the cache.
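
For reference, the simplest connectivity check against the new memcached instance is a set/get round trip with pylibmc (the client installed for Superset in a deploy patch further down); confirming the actual Superset integration is what the superset_result keys above show. A sketch, with an arbitrary test key:

```
# Minimal set/get round trip against the new memcached instance.
# The test key is arbitrary; the hostname is the one from this task.
import pylibmc

mc = pylibmc.Client(["an-tool1010.eqiad.wmnet"], binary=True)
mc.set("superset_migration_smoke_test", "ok", time=60)
assert mc.get("superset_migration_smoke_test") == "ok"
print("memcached round trip OK")
```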

> @elukey: I confirmed that memcached was working based on the presence of superset_result keys in memcached.

Perfect, this is great! The tests that I would add are something like:

  • Change a dashboard and see if it reloads correctly (i.e. the cached result does not mask the change). Staging uses a different db, so we can change stuff as we want. That db is also not in sync with the production one, so we could dump superset_production and load it into superset_staging, to have a good replica of all the dashboards in production. I can show you how to do it, and add docs if needed (I do it every time I upgrade superset on staging).
  • See how much difference there is between loading a dashboard hitting memcached vs not hitting it (see the timing sketch after this list). This is a little tricky since the Druid cache will also play a role, but we can try to set up something together.
  • Ask on analytics-internal@ for a volunteer to test superset and see if anything looks weird, just to make sure that we are not missing a bug introduced by memcached.
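
A rough way to do the timing comparison from the list above: load the same dashboard twice and compare the cold vs. warm response times, keeping the Druid-cache caveat in mind. The staging URL and session cookie below are placeholders, not real values:

```
# Rough cold-vs-warm timing for a dashboard request. The staging URL and
# session cookie are placeholders; the second request should benefit from
# memcached (and possibly the Druid cache, so read the numbers with care).
import time
import requests

URL = "https://superset-staging.example/superset/dashboard/1/"  # hypothetical staging URL
session = requests.Session()
session.cookies.set("session", "<logged-in session cookie>")    # placeholder

def timed_get(url: str) -> float:
    start = time.monotonic()
    session.get(url, timeout=120)
    return time.monotonic() - start

print("cold load (likely cache miss): %.2fs" % timed_get(URL))
print("warm load (likely cache hit):  %.2fs" % timed_get(URL))
```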

> Currently the Presto dashboards are not loading on staging due to an SSL error; I'm thinking that's worth fixing now, since we should test the Presto <-> memcached functionality. Once that's fixed, we can take some timing measurements on staging with and without the cache.

Fixed, the reason was that staging wasn't updated after the switch in TLS certs that I did a month ago for the presto cluster. Basically it was using the old self-signed CA cert to verify Presto's TLS connection, I forced it to use the puppet CA one and it worked fine :)
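
For the record, the CA bundle is simply what the Presto client is told to verify the TLS connection against. A standalone sketch of the equivalent check with PyHive, where the coordinator host, port, username, and CA path are assumptions rather than the cluster's actual values:

```
# Standalone check that Presto's TLS endpoint verifies against the intended CA.
# Coordinator host, port, username, and CA path are assumptions for illustration.
from pyhive import presto

conn = presto.connect(
    host="presto-coordinator.example.wmnet",  # hypothetical coordinator
    port=8281,                                # assumed TLS port
    protocol="https",
    username="superset",
    requests_kwargs={"verify": "/path/to/puppet_ca.pem"},  # Puppet CA instead of the old self-signed one
)
cursor = conn.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())
```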

Change 650179 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] role::analytics_cluster::ui::dashboards: Add superset to an-tool1010

https://gerrit.wikimedia.org/r/650179

Change 647387 had a related patch set uploaded (by Razzi; owner: Razzi):
[analytics/superset/deploy@master] Install pylibmc and update wheels for superset

https://gerrit.wikimedia.org/r/647387

For superset, the following three patches should be all we need to move traffic over with a short window of downtime. They can be deployed whenever, though once the third is deployed we need to be careful, as we'll have multiple clients connecting to the same database.

Change 650526 had a related patch set uploaded (by Razzi; owner: Razzi):
[analytics/superset/deploy@master] Add an-tool1010.eqiad.wmnet to scap/targets

https://gerrit.wikimedia.org/r/650526

Change 647387 merged by Razzi:
[analytics/superset/deploy@master] Install pylibmc and update wheels for superset

https://gerrit.wikimedia.org/r/647387

Change 650526 merged by Razzi:
[analytics/superset/deploy@master] Add an-tool1010.eqiad.wmnet to scap/targets

https://gerrit.wikimedia.org/r/650526

Change 650179 merged by Razzi:
[operations/puppet@production] role::analytics_cluster::ui::dashboards: Add superset to an-tool1010

https://gerrit.wikimedia.org/r/650179

Superset is now running on an-tool1010, so analytics-tool1004 can be decommissioned.

Next up is to migrate turnilo.

Spoke with @elukey and we're thinking of leaving turnilo on an-tool1007 for now, rather than co-locating it with superset, so that issues with either service won't affect the other. If we go that route, all that's left for this ticket is to decommission analytics-tool1004. @Ottomata what do you think?

Turnilo on an-tool1007 consumes very few resources, and I'd rather avoid co-location since, afaik, we don't have any performance issues reported by Turnilo users. I am worried that Superset could affect Turnilo's performance if co-located, even if the host is beefy enough for both. If we need to co-locate for better performance in the future, it will still be possible :)

cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: analytics-tool1004.eqiad.wmnet

  • analytics-tool1004.eqiad.wmnet (WARN)
    • Failed downtime host on Icinga (likely already removed)
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

Here's the error from attempting to decommission analytics-tool1004:

Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
Generating the DNS records from Netbox data. It will take a couple of minutes.
----- OUTPUT of 'cd /tmp && runus...e asset tag one"' -----
2021-01-08 16:28:28,779 [INFO] Gathering devices, interfaces, addresses and prefixes from Netbox
2021-01-08 16:30:44,827 [ERROR] Failed to run
Traceback (most recent call last):
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 687, in main
    batch_status, ret_code = run_commit(args, config, tmpdir)
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 591, in run_commit
    netbox.collect()
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 156, in collect
    self._collect_device(device, True)
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 198, in _collect_device
    if self.addresses[primary.id].dns_name:
KeyError: 4137
================
PASS |                                                                                  |   0% (0/1) [02:16<?, ?hosts/s]
FAIL |█████████████████████████████████████████████████████████████████████████| 100% (1/1) [02:16<00:00, 136.93s/hosts]
100.0% (1/1) of nodes failed to execute command 'cd /tmp && runus...e asset tag one"': netbox1001.wikimedia.org
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'cd /tmp && runus...e asset tag one"'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Failed to run the sre.dns.netbox cookbook
Traceback (most recent call last):
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/decommission.py", line 350, in run
    dns_netbox_run(dns_netbox_args, spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/dns/netbox.py", line 73, in run
    results = netbox_host.run_sync(command, is_safe=True)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 475, in run_sync
    batch_sleep=batch_sleep, is_safe=is_safe)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 637, in _execute
    raise RemoteExecutionError(ret, 'Cumin execution failed')
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
**Failed to run the sre.dns.netbox cookbook**: Cumin execution failed (exit_code=2)
ERROR: some step failed, check the task updates.
Updated Phabricator task T268219
END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
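
Side note on the failure itself: the traceback ends in a plain dict KeyError, i.e. the device still referenced a primary-IP id (4137) that was no longer present in the address map the script had just collected, presumably because the VM's addresses had been deleted moments earlier. A generic illustration of that failure mode (not a patch to netbox-extras):

```
# Generic illustration of the failure mode, not a patch to netbox-extras:
# the device still references a primary-IP id that is missing from the
# freshly collected address map, so a direct dict lookup raises KeyError.
addresses = {4135: "an-tool1010.eqiad.wmnet"}  # id -> dns_name; 4137 already gone
primary_ip_id = 4137

# What the script effectively does:
# addresses[primary_ip_id]        -> KeyError: 4137

# Defensive variant that tolerates the stale reference:
dns_name = addresses.get(primary_ip_id)
print(dns_name)  # None instead of aborting the whole decommission run
```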

Change 655634 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove analytics-tool1004 from puppet (decommed node)

https://gerrit.wikimedia.org/r/655634

Change 655634 merged by Elukey:
[operations/puppet@production] Remove analytics-tool1004 from puppet (decommed node)

https://gerrit.wikimedia.org/r/655634

Change 666486 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] Remove unused analytics_cluster::superset

https://gerrit.wikimedia.org/r/666486

Change 666486 merged by Razzi:
[operations/puppet@production] Remove unused analytics_cluster::superset

https://gerrit.wikimedia.org/r/666486

Ok, this is done, with one last bit of cleanup: I'd like to rename role::analytics_cluster::ui::dashboards to role::analytics_cluster::ui::superset, since it only hosts Superset, not Turnilo (currently there are comments like "will eventually host turnilo as well"). Then I'll close this.

Change 701066 had a related patch set uploaded (by Razzi; author: Razzi):
[operations/puppet@production] superset: rename analytics_cluster::ui::{dashboards,superset}

https://gerrit.wikimedia.org/r/701066

Change 701066 merged by Razzi:
[operations/puppet@production] superset: rename analytics_cluster::ui::{dashboards,superset}

https://gerrit.wikimedia.org/r/701066

Change 701327 had a related patch set uploaded (by Elukey; author: Elukey):
[labs/private@master] Rename superset's hiera config

https://gerrit.wikimedia.org/r/701327

Change 701327 merged by Elukey:
[labs/private@master] Rename superset's hiera config

https://gerrit.wikimedia.org/r/701327