
Move Superset and Turnilo to an-tool1010
Closed, Resolved · Public

Description

In T268146 the dcops team racked an-tool1010, a physical node to host Turnilo and Superset (currently running on VMs).

Some things to remember:

  • Superset uses a database on an-coord1001, so moving it to another host should be done carefully.
  • We need to update the scap configs in both gerrit repositories to be able to deploy to the new node.
  • At the end we need to decommission the two VMs on Ganeti (an-tool1007 and analytics-tool1004).
  • Both backends have config in ATS for external domains.

@razzi interested? :)

Event Timeline

fdans triaged this task as Medium priority.
fdans moved this task from Incoming to Operational Excellence on the Analytics board.

Change 644672 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] superset: add cached to an-tool1010

https://gerrit.wikimedia.org/r/644672

My thought for next steps here is to install superset on an-tool1010, using the existing database at an-coord1001, and testing that caching works as expected.

> My thought for next steps here is to install superset on an-tool1010, using the existing database at an-coord1001, and testing that caching works as expected.

If possible I'd avoid it, because there would be two Superset instances trying to write to the same database. One safer idea could be to tweak the Superset staging instance on an-tool1005 (we co-locate Turnilo and Superset there, without any external domain) to use the memcached on an-tool1010. Once we know what config works, it should be quick and easy to:

  1. Stop superset on analytics-tool1004.
  2. Deploy it to an-tool1010 (with the memcached settings).
  3. Move Varnish/ATS settings to the new backend.

What do you think?
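
For context, the "memcached settings" above would live in superset_config.py. A minimal sketch of what they might look like, assuming memcached listens on its default port 11211 on an-tool1010; the key prefixes and TTLs here are illustrative, not the actual puppet-managed values:

```
# superset_config.py (sketch; not the actual puppet-managed file)
from cachelib import MemcachedCache

MEMCACHED_SERVERS = ["an-tool1010.eqiad.wmnet:11211"]  # assumed default memcached port

# Flask-Caching backend used for chart/dashboard caching.
CACHE_CONFIG = {
    "CACHE_TYPE": "memcached",
    "CACHE_MEMCACHED_SERVERS": MEMCACHED_SERVERS,
    "CACHE_DEFAULT_TIMEOUT": 86400,   # illustrative TTL, one day
    "CACHE_KEY_PREFIX": "superset_",  # illustrative prefix
}

# Results backend for query results; keys with a prefix like this one are
# what a "superset_result*" check against memcached would look for.
RESULTS_BACKEND = MemcachedCache(
    servers=MEMCACHED_SERVERS,
    key_prefix="superset_results",
    default_timeout=86400,
)
```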

Just talked with Razzi: we decided to do this in a slightly different order, to allow him to make progress and to decouple the memcached work from the an-tool1010 move.

Razzi has already confirmed that memcached configs work on staging superset on an-tool1005. Next steps:

Set up memcached for Superset prod.

  • Run memcached on an-tool1010.
  • Configure superset on analytics-tool1004 to use memcached on an-tool1010.

Move turnilo and superset to an-tool1010

  • Schedule short downtime for superset and turnilo.
  • Stop superset and turnilo during downtime.
  • Merge puppet patch to set up superset and turnilo on an-tool1010 (with memcached settings).
  • Merge puppet patch to route Superset and Turnilo traffic in ATS/Varnish to an-tool1010 (downtime over); a quick smoke test of the new backend before this step is sketched below.
  • Decom analytics-tool1004.

@elukey, does this plan sound ok with you?
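
One way to sanity-check the new backend before the ATS/Varnish switch in the plan above is to hit an-tool1010 directly with the public Host header. A rough sketch; the listen port and scheme are assumptions rather than values taken from puppet:

```
# Quick smoke test of the new backend before routing production traffic to it.
# The listen port and scheme are assumptions, not values taken from puppet.
import requests

BACKEND = "http://an-tool1010.eqiad.wmnet:9080"  # assumed Superset listen port

resp = requests.get(
    f"{BACKEND}/health",                         # Superset's health endpoint
    headers={"Host": "superset.wikimedia.org"},  # public domain normally served via ATS/Varnish
    timeout=10,
)
print(resp.status_code, resp.text)
```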

Yep! Looks good, my only suggestion is/was to avoid having multiple instances of superset using the db at the same time, so that we don't end up in "multi-writes" scenarios.

One question for Razzi: can you add some detail about what test was done to confirm that the memcached settings work? I am asking because sometimes, when I was testing dashboards, I didn't notice problems that the rest of the team did, so before pulling the trigger I'd also ask whoever has time to quickly test the memcached setup, spot anomalies, and report back here if anything weird is found. I'd also like to know whether there are performance improvements with the new setup (curious about the results!).

@elukey: I confirmed that memcached was working based on the presence of superset_result keys in memcached.

Currently the Presto dashboards are not loading on staging due to an SSL error; I'm thinking that's worth fixing now, since we should test the Presto <-> memcached functionality. Once that's fixed, we can take some timing measurements on staging with and without the cache.
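
For reference, the simplest connectivity check against the new memcached instance is a set/get round trip with pylibmc (the client installed for Superset in a deploy patch further down); confirming the actual Superset integration is what the superset_result keys above show. A sketch, with an arbitrary test key:

```
# Minimal set/get round trip against the new memcached instance.
# The test key is arbitrary; the hostname is the one from this task.
import pylibmc

mc = pylibmc.Client(["an-tool1010.eqiad.wmnet"], binary=True)
mc.set("superset_migration_smoke_test", "ok", time=60)
assert mc.get("superset_migration_smoke_test") == "ok"
print("memcached round trip OK")
```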

> @elukey: I confirmed that memcached was working based on the presence of superset_result keys in memcached.

Perfect, this is great! The tests that I would add are something like:

  • Change a dashboard and see if it reloads correctly (i.e. the cached result does not mask the change). Staging uses a different db, so we can change stuff as we want. That db is also not in sync with the production one, so we could dump superset_production and load it into superset_staging, to have a good replica of all the dashboards in production. I can show you how to do it, and add docs if needed (I do it every time I upgrade superset on staging).
  • See how much difference there is between loading a dashboard hitting memcached vs not hitting it (see the timing sketch after this list). This is a little tricky since the Druid cache will also play a role, but we can try to set up something together.
  • Ask on analytics-internal@ for a volunteer to test superset and see if anything looks weird, just to make sure that we are not missing a bug introduced by memcached.
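
A rough way to do the timing comparison from the list above: load the same dashboard twice and compare the cold vs. warm response times, keeping the Druid-cache caveat in mind. The staging URL and session cookie below are placeholders, not real values:

```
# Rough cold-vs-warm timing for a dashboard request. The staging URL and
# session cookie are placeholders; the second request should benefit from
# memcached (and possibly the Druid cache, so read the numbers with care).
import time
import requests

URL = "https://superset-staging.example/superset/dashboard/1/"  # hypothetical staging URL
session = requests.Session()
session.cookies.set("session", "<logged-in session cookie>")    # placeholder

def timed_get(url: str) -> float:
    start = time.monotonic()
    session.get(url, timeout=120)
    return time.monotonic() - start

print("cold load (likely cache miss): %.2fs" % timed_get(URL))
print("warm load (likely cache hit):  %.2fs" % timed_get(URL))
```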

> Currently the Presto dashboards are not loading on staging due to an SSL error; I'm thinking that's worth fixing now, since we should test the Presto <-> memcached functionality. Once that's fixed, we can take some timing measurements on staging with and without the cache.

Fixed, the reason was that staging wasn't updated after the switch in TLS certs that I did a month ago for the presto cluster. Basically it was using the old self-signed CA cert to verify Presto's TLS connection, I forced it to use the puppet CA one and it worked fine :)
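
For the record, the CA bundle is simply what the Presto client is told to verify the TLS connection against. A standalone sketch of the equivalent check with PyHive, where the coordinator host, port, username, and CA path are assumptions rather than the cluster's actual values:

```
# Standalone check that Presto's TLS endpoint verifies against the intended CA.
# Coordinator host, port, username, and CA path are assumptions for illustration.
from pyhive import presto

conn = presto.connect(
    host="presto-coordinator.example.wmnet",  # hypothetical coordinator
    port=8281,                                # assumed TLS port
    protocol="https",
    username="superset",
    requests_kwargs={"verify": "/path/to/puppet_ca.pem"},  # Puppet CA instead of the old self-signed one
)
cursor = conn.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())
```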

Change 650179 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] role::analytics_cluster::ui::dashboards: Add superset to an-tool1010

https://gerrit.wikimedia.org/r/650179

Change 647387 had a related patch set uploaded (by Razzi; owner: Razzi):
[analytics/superset/deploy@master] Install pylibmc and update wheels for superset

https://gerrit.wikimedia.org/r/647387

For superset, the following three patches should be all we need to move traffic over with a short window of downtime. They can be deployed whenever, though once the third is deployed we need to be careful, as we'll have multiple clients connecting to the same database.

Change 650526 had a related patch set uploaded (by Razzi; owner: Razzi):
[analytics/superset/deploy@master] Add an-tool1010.eqiad.wmnet to scap/targets

https://gerrit.wikimedia.org/r/650526

Change 647387 merged by Razzi:
[analytics/superset/deploy@master] Install pylibmc and update wheels for superset

https://gerrit.wikimedia.org/r/647387

Change 650526 merged by Razzi:
[analytics/superset/deploy@master] Add an-tool1010.eqiad.wmnet to scap/targets

https://gerrit.wikimedia.org/r/650526

Change 650179 merged by Razzi:
[operations/puppet@production] role::analytics_cluster::ui::dashboards: Add superset to an-tool1010

https://gerrit.wikimedia.org/r/650179

Superset is now running on an-tool1010, so analytics-tool1004 can be decommissioned.

Next up is to migrate turnilo.

Spoke with @elukey and we're thinking of leaving turnilo on an-tool1007 for now, rather than co-locating it with superset, so that issues with either service won't affect the other. If we go that route, all that's left for this ticket is to decommission analytics-tool1004. @Ottomata what do you think?

Turnilo on an-tool1007 consumes very few resources, and I'd rather avoid co-location since, afaik, we don't have any performance issues reported by Turnilo users. I am worried that Superset could affect Turnilo's performance if co-located, even if the host is beefy enough for both. If we need to co-locate for better performance in the future, it will still be possible :)

cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: analytics-tool1004.eqiad.wmnet

  • analytics-tool1004.eqiad.wmnet (WARN)
    • Failed downtime host on Icinga (likely already removed)
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

Here's the error from attempting to decommission analytics-tool1004:

Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
Generating the DNS records from Netbox data. It will take a couple of minutes.
----- OUTPUT of 'cd /tmp && runus...e asset tag one"' -----
2021-01-08 16:28:28,779 [INFO] Gathering devices, interfaces, addresses and prefixes from Netbox
2021-01-08 16:30:44,827 [ERROR] Failed to run
Traceback (most recent call last):
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 687, in main
    batch_status, ret_code = run_commit(args, config, tmpdir)
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 591, in run_commit
    netbox.collect()
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 156, in collect
    self._collect_device(device, True)
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 198, in _collect_device
    if self.addresses[primary.id].dns_name:
KeyError: 4137
================
PASS |                                                                                  |   0% (0/1) [02:16<?, ?hosts/s]
FAIL |█████████████████████████████████████████████████████████████████████████| 100% (1/1) [02:16<00:00, 136.93s/hosts]
100.0% (1/1) of nodes failed to execute command 'cd /tmp && runus...e asset tag one"': netbox1001.wikimedia.org
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'cd /tmp && runus...e asset tag one"'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Failed to run the sre.dns.netbox cookbook
Traceback (most recent call last):
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/decommission.py", line 350, in run
    dns_netbox_run(dns_netbox_args, spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/dns/netbox.py", line 73, in run
    results = netbox_host.run_sync(command, is_safe=True)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 475, in run_sync
    batch_sleep=batch_sleep, is_safe=is_safe)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 637, in _execute
    raise RemoteExecutionError(ret, 'Cumin execution failed')
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
**Failed to run the sre.dns.netbox cookbook**: Cumin execution failed (exit_code=2)
ERROR: some step failed, check the task updates.
Updated Phabricator task T268219
END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
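
Side note on the failure itself: the traceback ends in a plain dict KeyError, i.e. the device still referenced a primary-IP id (4137) that was no longer present in the address map the script had just collected, presumably because the VM's addresses had been deleted moments earlier. A generic illustration of that failure mode (not a patch to netbox-extras):

```
# Generic illustration of the failure mode, not a patch to netbox-extras:
# the device still references a primary-IP id that is missing from the
# freshly collected address map, so a direct dict lookup raises KeyError.
addresses = {4135: "an-tool1010.eqiad.wmnet"}  # id -> dns_name; 4137 already gone
primary_ip_id = 4137

# What the script effectively does:
# addresses[primary_ip_id]        -> KeyError: 4137

# Defensive variant that tolerates the stale reference:
dns_name = addresses.get(primary_ip_id)
print(dns_name)  # None instead of aborting the whole decommission run
```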

Change 655634 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove analytics-tool1004 from puppet (decommed node)

https://gerrit.wikimedia.org/r/655634

Change 655634 merged by Elukey:
[operations/puppet@production] Remove analytics-tool1004 from puppet (decommed node)

https://gerrit.wikimedia.org/r/655634

Change 666486 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] Remove unused analytics_cluster::superset

https://gerrit.wikimedia.org/r/666486

Change 666486 merged by Razzi:
[operations/puppet@production] Remove unused analytics_cluster::superset

https://gerrit.wikimedia.org/r/666486

Ok, this is done, with one last bit of cleanup: I'd like to rename role::analytics_cluster::ui::dashboards to role::analytics_cluster::ui::superset, since it only hosts Superset, not Turnilo (currently there are comments like "will eventually host turnilo as well"). Then I'll close this.

Change 701066 had a related patch set uploaded (by Razzi; author: Razzi):
[operations/puppet@production] superset: rename analytics_cluster::ui::{dashboards,superset}

https://gerrit.wikimedia.org/r/701066

Change 701066 merged by Razzi:
[operations/puppet@production] superset: rename analytics_cluster::ui::{dashboards,superset}

https://gerrit.wikimedia.org/r/701066

Change 701327 had a related patch set uploaded (by Elukey; author: Elukey):
[labs/private@master] Rename superset's hiera config

https://gerrit.wikimedia.org/r/701327

Change 701327 merged by Elukey:
[labs/private@master] Rename superset's hiera config

https://gerrit.wikimedia.org/r/701327