
Review an-coord1001's usage and failover plans
Open, Needs Triage, Public

Description

This task is a high level placeholder to remind us to check the following (for an-coord1001):

  • Are 32G of RAM still enough? The current usage seems to indicate that the host is too crowded. We run oozie (2G JVM heap), presto (4G JVM heap), hive (server 8G JVM heap, metastore 4G heap), and mariadb (4G of innodb buffer usage plus some more for mariadb's internals).
  • What is our plan if the host goes down and requires extended maintenance (say days)? Are we prepared to move the daemons and mariadb elsewhere? Creating a temporary VM in ganeti doesn't seem possible; we'd need something like 32G of RAM (that would be a big ask).

Maybe we could think about moving some daemons elsewhere; for example, oozie could be placed on an-scheduler1001 (which currently doesn't do anything, waiting for airflow). Another possibility could be to create separate VMs for Hive (an-hive-server/an-hive-metastore/an-presto-coord).

Event Timeline

elukey created this task. Jul 8 2020, 8:51 AM

I put some thinking into the host failure use case (hw failure, an-coord1001 down for days), and into the cost of finding a workaround until a replacement arrives. This is a high level list of things:

  • Multiple services would be impaired all at once, namely the Druid clusters (probably only degradation, Druid uses the db for some things), Superset (down completely), Airflow (the Search instance, but possibly ours too if we have one in the future), Hive, Presto, Oozie.
  • We'd probably try to move daemons around, like oozie/hive/presto to an-launcher1002 (puppet changes, etc.), and we'd need to find a host capable of running a DB (a ganeti VM might be enough, the meta db is not that big), adding the cost of creating a VM and configuring it properly with mariadb etc.
  • We'd need to pull the most recent backup, load it on the new db host/vm, and point all the services to it.
  • We'd also need to completely stop most of our jobs, since without hive/oozie we can't do much.
  • People using the cluster would be severely impaired since working on hadoop without hive/presto/etc.. is not super easy.

The optimistic estimate for all of the above is 3-4 hours of SRE work, but it could easily end up being 6-8 (or even more) if we encounter issues. Then there is the time to roll back to the previous config once the hw replacement arrives.

There could be some incremental optimizations that would alleviate the scenario above:

  1. Move oozie to an-launcher1002, and add TLS configs to connect to an-coord1001's mariadb.
  2. Move all the db clients (superset/airflow/druid/hive/etc.) to a dbproxy DNS/VIP (set up beforehand) that decides which of an-coord1001 and db1108 is the master to send writes to. This needs a chat with Data Persistence; I have already scheduled a meeting with Manuel.
  3. Think about moving the hive/presto daemons to separate VMs, using TLS to connect to dbproxy/an-coord1001.

If we do all 3, then a failure of an-coord1001 would simply result in dbproxy routing requests to db1108, without requiring any effort from Analytics to restore the service (all db clients would probably only experience a brief unavailability). If we want to avoid under-using an-coord1001, we could either move matomo's db to it (and increase the innodb buffer memory used, etc.) or leave hive/presto on an-coord1001 and create VMs only if needed (an-coord1001 down, etc.).

@Ottomata: thoughts? Too pessimistic? :D

All for it! Ultimately everything should be as isolated as possible, and the DB on an-coord1001 is a big SPOF. I'd prioritize 2., trying to get the db replicated and in some kind of hot failover mode.

Although, I guess if we can move all the daemons to other hosts and/or VMs, then we can rename and repurpose both an-coord1001 and db1108 to I dunno, an-db100[12] :p

I had a chat with Manuel this morning, and one possibility would be the following:

  • Add dbproxy hosts (very lightweight) that basically run haproxy (usually there are two for HA). The puppetization is already handled by Data Persistence, but we'd need to maintain the hosts. A couple of Ganeti instances should probably suffice.
  • Set the dbproxy hosts up with a master/standby configuration, in our case an-coord1001:3306 and db1108:3322. The dbproxies would listen on port 3306 and proxy connections (at the TCP level) to the right host, depending on some rules (see the sketch after this list).
  • The standard dbproxy setting is to have a rule that fails over away from the master, but avoids failing back automatically. This lets the db operator decide how to recover, since if a failover happens (and the standby is read/write enabled) it means a split brain with respect to the master's state, and hence a master rebuild is necessary before rolling back.
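
As a rough illustration of the idea (not the actual Data Persistence puppetization; hostnames and ports are the ones from the list above, everything else is an assumption), the haproxy side could look roughly like this:

# Hedged sketch only: TCP-level proxying with db1108 as a backup server.
# The real dbproxy setup also avoids automatic failback, which needs extra
# settings not shown here.
listen analytics-meta
    bind 0.0.0.0:3306
    mode tcp
    option tcpka
    server an-coord1001 an-coord1001.eqiad.wmnet:3306 check
    server db1108 db1108.eqiad.wmnet:3322 check backup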

The other solution could be to avoid dbproxies and keep db1108 as a read-only replica, relying on a manual failover procedure if problems occur. For example, if an-coord1001 goes down badly, we could do the following (sketched after the list):

  1. Stop replication from an-coord1001 on db1108.
  2. Set the meta-db replica on db1108 as read-write.
  3. Instruct all the clients to move their config to db1108:3302.
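
A rough sketch of what steps 1 and 2 could look like on db1108 (the socket path is illustrative, not the real one; step 3 would be a puppet change for all the clients):

# Hedged sketch: stop replicating from an-coord1001 and make the replica writable.
sudo mysql --socket /run/mysqld/mysqld.analytics_meta.sock -e "STOP SLAVE;"
sudo mysql --socket /run/mysqld/mysqld.analytics_meta.sock -e "SET GLOBAL read_only = 0;"
# Then repoint superset/airflow/druid/hive/oozie to db1108 and restart them.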

Puppet could be set up to make the above 3 steps really easy, and we could pre-set TLS for all clients to get encryption as well (so connections that are localhost-only today wouldn't be a concern when moved to a remote host).

I like the manual failover solution more; it seems a good first step towards better handling the SPOF. If we see that it doesn't work, we can always think about adding dbproxies.

I like the manual failover solution more; it seems a good first step towards better handling the SPOF

I agree with that - incremental steps look good :).

if we can move all the daemons to other hosts and/or VMs, then we can rename and repurpose both an-coord1001 and db1108 to I dunno, an-db100[12] :p

I like that idea as well: decoupling DBs from daemons as the number of daemons using the DBs grows feels good (and therefore making the nodes explicitly DBs :).

About daemon relocation, we could separate by functional role:

  • scheduling (oozie, airflow - on an-scheduler1001) - Can be small IMO as long as airflow jobs are tiny (oozie uses yarn for scaling).
  • querying (hive-metastore, hive-server, presto-coord - on an-query-coord (making up a name now, TBD obviously)) - Should be large-ish, as managing many queries over a lot of data across many machines from many users (including prod) should happen without too many problems (I'm a many-ac) ;)

Change 612821 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::database::meta: enable TLS

https://gerrit.wikimedia.org/r/612821

The high level list of things to do is:

  • Add TLS support to an-coord1001 and db1108
  • Think about a global setting to gather the connection coordinates (host/port) of the oozie/hive databases (we have them in various profiles now)
  • Add TLS settings to Oozie and possibly move it to an-scheduler1002
  • Add TLS to Superset's config
  • Add TLS to Airflow's config
  • Write some documentation on Wikitech about how to failover from an-coord1001 to db1108 if needed

Change 613619 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::oozie::server: allow to specific jdbc host:port

https://gerrit.wikimedia.org/r/613619

Change 613620 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hive::client: allow to specify jdbc host:port

https://gerrit.wikimedia.org/r/613620

Change 613619 merged by Elukey:
[operations/puppet@production] profile::oozie::server: allow to specific jdbc host:port

https://gerrit.wikimedia.org/r/613619

Change 613620 merged by Elukey:
[operations/puppet@production] profile::hive::client: allow to specify jdbc host:port

https://gerrit.wikimedia.org/r/613620

Change 612821 merged by Elukey:
[operations/puppet@production] profile::analytics::database::meta: enable TLS

https://gerrit.wikimedia.org/r/612821

I was able to add TLS support to the analytics-meta test instance on analytics1030, and I tested connections with/without TLS, all good. Hive and Oozie seem to work fine. Next step is to restart mariadb on an-coord1001 (will do it on Monday).
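
For the record, a quick way to check whether a client connection actually negotiated TLS (hostname and CA path are the ones from the test setup; exact client flags may vary by mariadb/mysql client version):

# Hedged sketch: an empty Ssl_cipher value means the connection is not encrypted.
mysql -h analytics1030.eqiad.wmnet --ssl --ssl-ca=/etc/ssl/certs/Puppet_Internal_CA.pem \
    -e "SHOW SESSION STATUS LIKE 'Ssl_cipher';"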

I also merged some patches to ease a switch from an-coord1001 to db1108. I think that for the moment a global puppet config is not needed (it's not as straightforward as I thought/hoped); I'll add all the steps to follow in the docs.

The main remaining problem for Druid/Hive/Oozie etc. is that in order to use TLS they'd need a truststore to validate the Puppet CA-signed TLS certificates, and we don't have one.
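
One way to build such a truststore from the Puppet CA certificate would be something like the following (the truststore path and password are illustrative, not an existing artifact):

# Hedged sketch: import the Puppet CA cert into a JKS truststore that the
# JVM-based daemons (hive/oozie/druid) could be pointed at.
keytool -importcert -noprompt -alias puppet-ca \
    -file /etc/ssl/certs/Puppet_Internal_CA.pem \
    -keystore /etc/hive/truststore.jks -storepass changeit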

Change 614758 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] superset: enable TLS to connect to Mysql

https://gerrit.wikimedia.org/r/614758

Change 614758 merged by Elukey:
[operations/puppet@production] superset: enable TLS to connect to Mysql

https://gerrit.wikimedia.org/r/614758

Change 614764 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::search::airflow: use TLS with mysql

https://gerrit.wikimedia.org/r/614764

Change 614764 merged by Elukey:
[operations/puppet@production] profile::analytics::search::airflow: use TLS with mysql

https://gerrit.wikimedia.org/r/614764

elukey added a comment (edited). Jul 21 2020, 7:35 AM

Added TLS for Airflow and Superset. Also tested TLS for hive-metastore/oozie -> mysql (ref: https://aws.amazon.com/premiumsupport/knowledge-center/ssl-hive-emr-metastore-rds-mysql/), but I keep getting handshake failures in the Hadoop test cluster. I am wondering if it is due to the old openssl used by CDH 5.x. This is what I am using (the option is the same for oozie/hive):

<property>
    <name>oozie.service.JPAService.jdbc.url</name>
    <value>jdbc:mysql://localhost:3306/oozie?useSSL=true&amp;serverSslCert=/etc/ssl/certs/Puppet_Internal_CA.pem</value>
    <description>
        JDBC URL.
    </description>
</property>

I of course tried swapping localhost with analytics1030 (the test coordinator's hostname), and also adding TLS 1.0 compatibility to the mariadb config; same error:

Caused by: javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure
        at sun.security.ssl.Alerts.getSSLException(Alerts.java:198)
        at sun.security.ssl.Alerts.getSSLException(Alerts.java:159)
        at sun.security.ssl.SSLSocketImpl.recvAlert(SSLSocketImpl.java:2041)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1145)
        at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1388)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1416)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1400)
        at com.mysql.jdbc.ExportControlled.transformSocketToSSLSocket(ExportControlled.java:160)

Edit: used -Djavax.net.debug=all and it seems indeed that the cipher suites are not compatible between oozie/hive and mariadb.

Edit 2: after a lot of tests, I found a working config: useSSL=true&amp;enabledTLSProtocols=TLSv1.2
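
Putting the pieces together, the JDBC URL in the test config ends up looking something like this (hostname and database name are just the ones from the test setup):

jdbc:mysql://analytics1030.eqiad.wmnet:3306/oozie?useSSL=true&amp;serverSslCert=/etc/ssl/certs/Puppet_Internal_CA.pem&amp;enabledTLSProtocols=TLSv1.2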

elukey added a comment. Aug 4 2020, 2:33 PM

Getting back to this: memory usage on an-coord1001 seems high, and for the sake of availability (not having all the daemons concentrated in one place) I'd do the following:

  • move oozie to an-scheduler1002 (sadly without TLS to an-coord1001)
  • move the Presto coordinator to a dedicated VM

This would leave hive and mariadb on an-coord1001.

Change 618339 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move oozie server to an-scheduler1001

https://gerrit.wikimedia.org/r/618339

Change 618339 abandoned by Elukey:
[operations/puppet@production] Move oozie server to an-scheduler1001

Reason:
not needed anymore

https://gerrit.wikimedia.org/r/618339

Change 632896 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::analytics::worker: enable TLS for conns to mysql

https://gerrit.wikimedia.org/r/632896

Change 632896 merged by Elukey:
[operations/puppet@production] role::druid::analytics::worker: enable TLS for conns to mysql

https://gerrit.wikimedia.org/r/632896

elukey added a comment (edited). Thu, Oct 8, 9:58 AM

Today I had a chat with Joseph and some ideas came up. Rather than creating specialized hosts (like an-query100x, an-scheduler100x, etc.), we could simply rename an-scheduler1002 to an-coord1002 and keep two "coordinators" in a sort of active/active mode.

For example:

  • an-coord1001 could run oozie, the mariadb instance (meta database for oozie/hue/etc.), and if needed airflow (or any other tool we choose).
  • an-coord1002 could run the hive server and metastore, and the presto query coordinator.

In case of hw failure of an-coord1001 (say days to recover) we could do something like the following:

  • fail over the databases to db1108 (our replica)
  • move oozie/airflow/etc. to an-coord1002

The total time required would probably be about an hour of work, which is not bad. Something similar could happen if an-coord1002 failed. The only caveat is that we should run "fire drills" once in a while to verify that all the services could run on one host without issues. Long term we could also move the database off an-coord1001 to a dedicated host (like db1109), to make the two an-coord nodes completely stateless.

The alternative could be to have different hosts like:

  • an-query1001 for hive/presto/etc..
  • an-scheduler1001 for oozie/airflow/etc..

We could in theory do the same service failover described above with the two coordinators anyway, and we'd have more "specialized" nodes (load/traffic-wise), but there would be more nodes to manage (and possibly a bit more ops work).

We never really liked the word "coordinator" for what an-coord does, but it is true that it would be convenient to keep the same naming for consistency (and coordinator is generic enough to cover different meanings, etc.).

Change 632909 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Enable TLS between Druid clusters and Mariadb on an-coord1001

https://gerrit.wikimedia.org/r/632909

Change 632909 merged by Elukey:
[operations/puppet@production] Enable TLS between Druid clusters and Mariadb on an-coord1001

https://gerrit.wikimedia.org/r/632909

Nuria added a comment. Thu, Oct 8, 9:54 PM

+1 to the active/active plan

Change 633516 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove lvmbackups of Analytics meta from the Hadoop Standby master

https://gerrit.wikimedia.org/r/633516

Change 633516 merged by Elukey:
[operations/puppet@production] Remove lvmbackups of Analytics meta from the Hadoop Standby master

https://gerrit.wikimedia.org/r/633516

Change 633519 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::coordinator: avoid lvm backups to an-master1002

https://gerrit.wikimedia.org/r/633519

Change 633519 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::coordinator: avoid lvm backups to an-master1002

https://gerrit.wikimedia.org/r/633519

Change 633545 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Clean up Analytics lvm-based backup not used anymore

https://gerrit.wikimedia.org/r/633545

Change 633545 merged by Elukey:
[operations/puppet@production] Clean up Analytics lvm-based backup not used anymore

https://gerrit.wikimedia.org/r/633545

Change 635000 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Assing role::analytics_cluster::coordinator::query to an-coord1002

https://gerrit.wikimedia.org/r/635000

There is a major problem with the plan of having two coordinators, namely the fact that we hardcode an-coord1001 all over the oozie properties in refinery. This means that a change of location for the hive server 2 implies a roll restart of all oozie coordinators...

After reading https://docs.cloudera.com/documentation/enterprise/latest/topics/admin_ha_hiveserver2.html I had an idea about a possible way forward, that may also be applied to other services like oozie.

We have the following hive configs all over the place:

#refinery
hive/an-coord1001.eqiad.wmnet@WIKIMEDIA
jdbc:hive2://an-coord1001.eqiad.wmnet:10000/default

In puppet we have a variation of the above, but every time we have to specify 1) where hive runs and 2) its kerberos service principal. If we want to have an active/standby set of hosts, like an-coord1001/an-coord1002, we could do something like the following:

  1. Create a DNS CNAME for each hive service, like hive-analytics.eqiad.wmnet and hive-test-analytics.eqiad.wmnet
  2. Create a new kerberos principal, hive/hive-analytics.eqiad.wmnet@WIKIMEDIA, and add its credentials to both keytabs for hive/an-coord1001 and hive/an-coord1002
  3. Replace all jdbc strings with jdbc:hive2://hive-analytics.eqiad.wmnet:10000/default (or test), as in the example after this list
  4. Replace all hive principal configs with hive/hive-analytics.eqiad.wmnet@WIKIMEDIA (or test)
  5. Roll restart all the services using it, and restart all the oozie coordinators
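
For comparison, the refinery snippet above would then become something like this (using the hive-analytics name proposed here; the final name was still being discussed):

#refinery
hive/hive-analytics.eqiad.wmnet@WIKIMEDIA
jdbc:hive2://hive-analytics.eqiad.wmnet:10000/default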

At this point we'd have, in theory, a transparent way to fail over, since moving hive from an-coord1001 to an-coord1002, for example, would only mean changing the DNS CNAME (assuming that DNS TTLs are respected; this needs to be tested).

I like! Could we do the same, but use LVS instead of DNS CNAMEs? It might be easier to do failovers with LVS than waiting for DNS TTLs to expire.

Also, instead of hive-analytics and hive-test-analytics, can we do analytics-hive and analytics-test-hive? Presto is set up to refer to our Hive as analytics_hive, and I'm pretty sure hadoop's HA name is analytics-hadoop.

@Ottomata LVS inside the analytics vlan is problematic :(

+1 for naming, I don't have any particular preference

LVS inside the analytics vlan is problematic :(

Oh right, we wanted to do that for druid long ago but couldn't. But why should it be problematic? I don't remember why it didn't work, but maybe we can work with Traffic and fix it.

We could, but I think that for a first step a DNS CNAME should be good; we want to get to a reliable active/standby status, and LVS is more active/active in my mind. We could do DNS first, then LVS; I think setting up the new name is the most annoying part (if we then want to replace the CNAME with an LVS VIP, it should be a quick DNS change and a few other things). What do you think?

Change 635844 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hive::client: refactor code to have settings only in one place

https://gerrit.wikimedia.org/r/635844

Change 635850 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Add CNAME analytics-test-hive

https://gerrit.wikimedia.org/r/635850

Change 635850 merged by Elukey:
[operations/dns@master] Add CNAME analytics-test-hive

https://gerrit.wikimedia.org/r/635850

Change 635844 merged by Elukey:
[operations/puppet@production] profile::hive::client: refactor code to have settings only in one place

https://gerrit.wikimedia.org/r/635844

Change 635861 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::oozie::client: make config global

https://gerrit.wikimedia.org/r/635861

Change 635861 merged by Elukey:
[operations/puppet@production] profile::oozie::client: make config global

https://gerrit.wikimedia.org/r/635861

Change 635955 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hive: set new kerberos principal for Hadoop test

https://gerrit.wikimedia.org/r/635955

Change 635955 merged by Elukey:
[operations/puppet@production] hive: set new kerberos principal for Hadoop test

https://gerrit.wikimedia.org/r/635955

Change 635961 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hive: change metastore and server hostnames for Hadoop test

https://gerrit.wikimedia.org/r/635961

Change 635961 merged by Elukey:
[operations/puppet@production] hive: change metastore and server hostnames for Hadoop test

https://gerrit.wikimedia.org/r/635961

Summary of actions done:

  • created a dns CNAME analytics-test-hive.eqiad.wmnet -> an-test-coord1001.eqiad.wmnet
  • created the kerberos principal hive/analytics-test-hive.eqiad.wmnet@WIKIMEDIA on krb1001
  • executed the following on krb1001:
kadmin.local ktadd -norandkey -k /srv/kerberos/keytabs/an-test-coord1001.eqiad.wmnet/hive/hive.keytab hive/analytics-test-hive.eqiad.wmnet@WIKIMEDIA

root@krb1001:/home/elukey# klist -ekt /srv/kerberos/keytabs/an-test-coord1001.eqiad.wmnet/hive/hive.keytab
Keytab name: FILE:/srv/kerberos/keytabs/an-test-coord1001.eqiad.wmnet/hive/hive.keytab
KVNO Timestamp           Principal
---- ------------------- ------------------------------------------------------
   1 10/09/2020 11:09:28 hive/an-test-coord1001.eqiad.wmnet@WIKIMEDIA (aes256-cts-hmac-sha1-96)
   1 10/12/2020 05:58:27 hive/an-test-coord1001.eqiad.wmnet@WIKIMEDIA (aes256-cts-hmac-sha1-96)
   1 10/23/2020 07:53:31 hive/analytics-test-hive.eqiad.wmnet@WIKIMEDIA (aes256-cts-hmac-sha1-96)
  • Deployed the new keytab on an-test-coord1001
  • Set hive server and metastore kerberos principals to hive/analytics-test-hive.eqiad.wmnet@WIKIMEDIA in the global config (to update all hive-site.xmls)
  • Restarted hive metastore and server daemons

The first tests look good, everything seems to work fine. I am blocked on T266322 since webrequest_load doesn't seem to be working; after that I'll probably be able to do more testing. One thing that I want to try is changing the DNS CNAME to something like an-coord1002.eqiad.wmnet (a new host with nothing running on it) to see what happens to oozie/hive-cli/etc. (in theory I expect everything to fail after the DNS TTL expires, in practice there may be surprises).
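
A simple way to watch that test from a client node would be something like the following (record name taken from the summary above; the answer line is just the expected shape, not actual zone data):

# Hedged sketch: check what the CNAME points to and the TTL (second column,
# in seconds) that clients will cache.
dig +noall +answer analytics-test-hive.eqiad.wmnet CNAME
# analytics-test-hive.eqiad.wmnet. 300 IN CNAME an-test-coord1001.eqiad.wmnet.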