
Review an-coord1001's usage and failover plans
Open, Needs Triage, Public

Description

This task is a high level placeholder to remind us to check the following (for an-coord1001):

  • Are 32G of RAM still enough? The current usage seems to indicate that the host is too crowded. We run oozie (2G JVM heap), presto (4G JVM heap), hive (server 8G JVM heap, metastore 4G heap), and mariadb (4G of innodb buffer usage plus some more for mariadb's internals).
  • What is our plan if the host goes down and requires extended maintenance (say days)? Are we prepared to move the daemons and mariadb elsewhere? Creating a temporary VM in ganeti doesn't seem possible; we'd need something like 32G of RAM (that would be a big ask).

Maybe we could think about moving some daemons elsewhere; for example, oozie could be placed on an-scheduler1001 (which currently doesn't do anything, waiting for airflow). Another possibility could be to create separate VMs for Hive (an-hive-server/an-hive-metastore/an-presto-coord).

Event Timeline

elukey created this task. Jul 8 2020, 8:51 AM

I put some thinking into the host failure use case (hw failure, an-coord1001 down for days), and into the cost of finding a workaround until a replacement arrives. This is a high level list of things:

  • Multiple services would be impaired all at once, namely the Druid clusters (probably only degradation, Druid uses the db for some things), Superset (down completely), Airflow (the Search instance, but possibly ours too if we have one in the future), Hive, Presto, Oozie.
  • We'd probably try to move daemons around, like oozie/hive/presto to an-launcher1002 (puppet changes, etc.), and we'd need to find a host capable of running a DB (a ganeti VM might be enough, the meta db is not that big), adding the cost of creating a VM and configuring it properly with mariadb etc.
  • We'd need to pull the most recent backup, load it on the new db host/vm, and point all the services to it.
  • We'd also need to completely stop most of our jobs, since without hive/oozie we can't do much.
  • People using the cluster would be severely impaired since working on hadoop without hive/presto/etc.. is not super easy.

The optimistic estimate for all of the above is 3-4 hours of SRE work, but it could easily end up being 6-8 (or even more) if we encounter issues. Then there is the time to roll back to the previous config once the hw replacement arrives.

There could be some incremental optimizations that would alleviate the scenario above:

  1. Move oozie to an-launcher1002, and add TLS configs to connect to an-coord1001's mariadb.
  2. Move all the db clients (superset/airflow/druid/hive/etc.) to a dbproxy DNS/VIP (set up beforehand) that decides which of an-coord1001 and db1108 is the master to send writes to. This needs a chat with Data Persistence; I have already scheduled a meeting with Manuel.
  3. Think about moving the hive/presto daemons to separate VMs, using TLS to connect to dbproxy/an-coord1001.

If we do all 3, then a failure of an-coord1001 would simply result in dbproxy routing requests to db1108, without requiring any effort from Analytics to restore the service (all db clients would probably only experience a brief unavailability). If we want to avoid under-using an-coord1001, we could either move matomo's db to it (and increase the innodb buffer memory used, etc.) or leave hive/presto on an-coord1001 and create VMs only if needed (an-coord1001 down, etc.).

@Ottomata: thoughts? Too pessimistic? :D

All for it! Ultimately everything should be as isolated as possible, and the DB on an-coord1001 is a big SPOF. I'd prioritize 2., trying to get the db replicated and in some kind of hot failover mode.

Although, I guess if we can move all the daemons to other hosts and/or VMs, then we can rename and repurpose both an-coord1001 and db1108 to I dunno, an-db100[12] :p

I had a chat with Manuel this morning, and one possibility would be the following:

  • Add dbproxy hosts (very lightweight) that basically run haproxy (usually there are two for HA). The puppetization is already handled by Data Persistence, but we'd need to maintain the hosts. A couple of Ganeti instances should probably suffice.
  • Set the dbproxy hosts up with a master/standby configuration, in our case an-coord1001:3306 and db1108:3322. The dbproxies would listen on port 3306 and proxy connections (at the TCP level) to the right host, depending on some rules (see the sketch after this list).
  • The standard dbproxy setting is to have a rule that fails over away from the master, but avoids failing back automatically. This lets the db operator decide how to recover, since if a failover happens (and the standby is read/write enabled) it means a split brain with respect to the master's state, and hence a master rebuild is necessary before rolling back.
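
As a rough illustration of the idea (not the actual Data Persistence puppetization; hostnames and ports are the ones from the list above, everything else is an assumption), the haproxy side could look roughly like this:

# Hedged sketch only: TCP-level proxying with db1108 as a backup server.
# The real dbproxy setup also avoids automatic failback, which needs extra
# settings not shown here.
listen analytics-meta
    bind 0.0.0.0:3306
    mode tcp
    option tcpka
    server an-coord1001 an-coord1001.eqiad.wmnet:3306 check
    server db1108 db1108.eqiad.wmnet:3322 check backup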

The other solution could be to avoid dbproxies and keep db1108 as a read-only replica, relying on a manual failover procedure if problems occur. For example, if an-coord1001 goes down badly, we could do the following (sketched after the list):

  1. Stop replication from an-coord1001 on db1108.
  2. Set the meta-db replica on db1108 as read-write.
  3. Instruct all the clients to move their config to db1108:3302.
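
A rough sketch of what steps 1 and 2 could look like on db1108 (the socket path is illustrative, not the real one; step 3 would be a puppet change for all the clients):

# Hedged sketch: stop replicating from an-coord1001 and make the replica writable.
sudo mysql --socket /run/mysqld/mysqld.analytics_meta.sock -e "STOP SLAVE;"
sudo mysql --socket /run/mysqld/mysqld.analytics_meta.sock -e "SET GLOBAL read_only = 0;"
# Then repoint superset/airflow/druid/hive/oozie to db1108 and restart them.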

Puppet could be set up to make the above 3 steps really easy, and we could pre-set TLS for all clients to get encryption as well (so connections that are localhost-only today wouldn't be a concern when moved to a remote host).

I like the manual failover solution more; it seems a good first step towards better handling the SPOF. If we see that it doesn't work, we can always think about adding dbproxies.

I like the manual failover solution more; it seems a good first step towards better handling the SPOF

I agree with that - incremental steps look good :).

if we can move all the daemons to other hosts and/or VMs, then we can rename and repurpose both an-coord1001 and db1108 to I dunno, an-db100[12] :p

I like that idea as well: decoupling DBs from daemons as the number of daemons using the DBs grows feels good (and therefore making the nodes explicitly DBs :).

About daemon relocation, we could separate by functional role:

  • scheduling (oozie, airflow - on an-scheduler1001) - Can be small IMO as long as airflow jobs are tiny (oozie uses yarn for scaling).
  • querying (hive-metastore, hive-server, presto-coord - on an-query-coord (making up a name now, TBD obviously)) - Should be large-ish, as managing many queries over a lot of data across many machines from many users (including prod) should happen without too many problems (I'm a many-ac) ;)

Change 612821 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::database::meta: enable TLS

https://gerrit.wikimedia.org/r/612821

The high level list of things to do is:

  • Add TLS support to an-coord1001 and db1108
  • Think about a global setting to gather the connection coordinates (host/port) of the oozie/hive databases (we have them in various profiles now)
  • Add TLS settings to Oozie and possibly move it to an-scheduler1002
  • Add TLS to Superset's config
  • Add TLS to Airflow's config
  • Write some documentation on Wikitech about how to failover from an-coord1001 to db1108 if needed

Change 613619 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::oozie::server: allow to specific jdbc host:port

https://gerrit.wikimedia.org/r/613619

Change 613620 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hive::client: allow to specify jdbc host:port

https://gerrit.wikimedia.org/r/613620

Change 613619 merged by Elukey:
[operations/puppet@production] profile::oozie::server: allow to specific jdbc host:port

https://gerrit.wikimedia.org/r/613619

Change 613620 merged by Elukey:
[operations/puppet@production] profile::hive::client: allow to specify jdbc host:port

https://gerrit.wikimedia.org/r/613620

Change 612821 merged by Elukey:
[operations/puppet@production] profile::analytics::database::meta: enable TLS

https://gerrit.wikimedia.org/r/612821

I was able to add TLS support to the analytics-meta test instance on analytics1030, and I tested connections with/without TLS, all good. Hive and Oozie seem to work fine. Next step is to restart mariadb on an-coord1001 (will do it on Monday).
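
For the record, a quick way to check whether a client connection actually negotiated TLS (hostname and CA path are the ones from the test setup; exact client flags may vary by mariadb/mysql client version):

# Hedged sketch: an empty Ssl_cipher value means the connection is not encrypted.
mysql -h analytics1030.eqiad.wmnet --ssl --ssl-ca=/etc/ssl/certs/Puppet_Internal_CA.pem \
    -e "SHOW SESSION STATUS LIKE 'Ssl_cipher';"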

I also merged some patches to ease a switch from an-coord1001 to db1108. I think that for the moment a global puppet config is not needed (it's not as straightforward as I thought/hoped); I'll add all the steps to follow in the docs.

The main remaining problem for Druid/Hive/Oozie etc. is that in order to use TLS they'd need a truststore to validate the Puppet CA-signed TLS certificates, and we don't have one.
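
One way to build such a truststore from the Puppet CA certificate would be something like the following (the truststore path and password are illustrative, not an existing artifact):

# Hedged sketch: import the Puppet CA cert into a JKS truststore that the
# JVM-based daemons (hive/oozie/druid) could be pointed at.
keytool -importcert -noprompt -alias puppet-ca \
    -file /etc/ssl/certs/Puppet_Internal_CA.pem \
    -keystore /etc/hive/truststore.jks -storepass changeit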

Change 614758 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] superset: enable TLS to connect to Mysql

https://gerrit.wikimedia.org/r/614758

Change 614758 merged by Elukey:
[operations/puppet@production] superset: enable TLS to connect to Mysql

https://gerrit.wikimedia.org/r/614758

Change 614764 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::search::airflow: use TLS with mysql

https://gerrit.wikimedia.org/r/614764

Change 614764 merged by Elukey:
[operations/puppet@production] profile::analytics::search::airflow: use TLS with mysql

https://gerrit.wikimedia.org/r/614764

elukey added a comment (edited). Jul 21 2020, 7:35 AM

Added TLS for Airflow and Superset. Also tested TLS for hive-metastore/oozie -> mysql (ref: https://aws.amazon.com/premiumsupport/knowledge-center/ssl-hive-emr-metastore-rds-mysql/), but I keep getting handshake failures in the Hadoop test cluster. I am wondering if it is due to the old openssl used by CDH 5.x. This is what I am using (the option is the same for oozie/hive):

<property>
    <name>oozie.service.JPAService.jdbc.url</name>
    <value>jdbc:mysql://localhost:3306/oozie?useSSL=true&amp;serverSslCert=/etc/ssl/certs/Puppet_Internal_CA.pem</value>
    <description>
        JDBC URL.
    </description>
</property>

I of course tried swapping localhost with analytics1030 (the test coordinator's hostname), and also adding TLS 1.0 compatibility to the mariadb config; same error:

Caused by: javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure
        at sun.security.ssl.Alerts.getSSLException(Alerts.java:198)
        at sun.security.ssl.Alerts.getSSLException(Alerts.java:159)
        at sun.security.ssl.SSLSocketImpl.recvAlert(SSLSocketImpl.java:2041)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1145)
        at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1388)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1416)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1400)
        at com.mysql.jdbc.ExportControlled.transformSocketToSSLSocket(ExportControlled.java:160)

Edit: used -Djavax.net.debug=all and it seems indeed that the cipher suites are not compatible between oozie/hive and mariadb.

Edit 2: after a lot of tests, I found a working config: useSSL=true&amp;enabledTLSProtocols=TLSv1.2
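
Putting the pieces together, the JDBC URL in the test config ends up looking something like this (hostname and database name are just the ones from the test setup):

jdbc:mysql://analytics1030.eqiad.wmnet:3306/oozie?useSSL=true&amp;serverSslCert=/etc/ssl/certs/Puppet_Internal_CA.pem&amp;enabledTLSProtocols=TLSv1.2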

elukey added a comment. Aug 4 2020, 2:33 PM

Getting back to this: memory usage on an-coord1001 seems high, and for the sake of availability (not having all the daemons concentrated in one place) I'd do the following:

  • move oozie to an-scheduler1002 (sadly without TLS to an-coord1001)
  • move the Presto coordinator to a dedicated VM

This would leave hive and mariadb on an-coord1001.

Change 618339 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move oozie server to an-scheduler1001

https://gerrit.wikimedia.org/r/618339

Change 618339 abandoned by Elukey:
[operations/puppet@production] Move oozie server to an-scheduler1001

Reason:
not needed anymore

https://gerrit.wikimedia.org/r/618339

Change 632896 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::analytics::worker: enable TLS for conns to mysql

https://gerrit.wikimedia.org/r/632896

Change 632896 merged by Elukey:
[operations/puppet@production] role::druid::analytics::worker: enable TLS for conns to mysql

https://gerrit.wikimedia.org/r/632896

elukey added a comment (edited). Thu, Oct 8, 9:58 AM

Today I had a chat with Joseph and some ideas came up. Rather than creating specialized hosts (like an-query100x, an-scheduler100x, etc.), we could simply rename an-scheduler1002 to an-coord1002 and keep two "coordinators" in a sort of active/active mode.

For example:

  • an-coord1001 could run oozie, the mariadb instance (meta database for oozie/hue/etc.), and if needed airflow (or any other tool we choose).
  • an-coord1002 could run the hive server and metastore, and the presto query coordinator.

In case of hw failure of an-coord1001 (say days to recover) we could do something like the following:

  • fail over the databases to db1108 (our replica)
  • move oozie/airflow/etc. to an-coord1002

The total time required would probably be about an hour of work, which is not bad. Something similar could happen if an-coord1002 failed. The only caveat is that we should run "fire drills" once in a while to verify that all the services could run on one host without issues. Long term we could also move the database off an-coord1001 to a dedicated host (like db1109), to make the two an-coord nodes completely stateless.

The alternative could be to have different hosts like:

  • an-query1001 for hive/presto/etc..
  • an-scheduler1001 for oozie/airflow/etc..

We could in theory do the same service failover described above with the two coordinators anyway, and we'd have more "specialized" nodes (load/traffic-wise), but there would be more nodes to manage (and possibly a bit more ops work).

We never really liked the word "coordinator" for what an-coord does, but it is true that it would be convenient to keep the same naming for consistency (and coordinator is generic enough to cover different meanings, etc.).

Change 632909 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Enable TLS between Druid clusters and Mariadb on an-coord1001

https://gerrit.wikimedia.org/r/632909

Change 632909 merged by Elukey:
[operations/puppet@production] Enable TLS between Druid clusters and Mariadb on an-coord1001

https://gerrit.wikimedia.org/r/632909

Nuria added a comment. Thu, Oct 8, 9:54 PM

+1 to the active/active plan

Change 633516 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove lvmbackups of Analytics meta from the Hadoop Standby master

https://gerrit.wikimedia.org/r/633516

Change 633516 merged by Elukey:
[operations/puppet@production] Remove lvmbackups of Analytics meta from the Hadoop Standby master

https://gerrit.wikimedia.org/r/633516

Change 633519 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::coordinator: avoid lvm backups to an-master1002

https://gerrit.wikimedia.org/r/633519

Change 633519 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::coordinator: avoid lvm backups to an-master1002

https://gerrit.wikimedia.org/r/633519

Change 633545 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Clean up Analytics lvm-based backup not used anymore

https://gerrit.wikimedia.org/r/633545

Change 633545 merged by Elukey:
[operations/puppet@production] Clean up Analytics lvm-based backup not used anymore

https://gerrit.wikimedia.org/r/633545

Change 635000 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Assing role::analytics_cluster::coordinator::query to an-coord1002

https://gerrit.wikimedia.org/r/635000

There is a major problem with the plan of having two coordinators, namely the fact that we hardcode an-coord1001 all over the oozie properties in refinery. This means that a change of location for the hive server 2 implies a roll restart of all oozie coordinators...

After reading https://docs.cloudera.com/documentation/enterprise/latest/topics/admin_ha_hiveserver2.html I had an idea about a possible way forward, that may also be applied to other services like oozie.

We have the following hive configs all over the place:

#refinery
hive/an-coord1001.eqiad.wmnet@WIKIMEDIA
jdbc:hive2://an-coord1001.eqiad.wmnet:10000/default

In puppet we have a variation of the above, but every time we have to specify 1) where hive runs and 2) its kerberos service principal. If we want to have an active/standby set of hosts, like an-coord1001/an-coord1002, we could do something like the following:

  1. Create a DNS CNAME for each hive service, like hive-analytics.eqiad.wmnet and hive-test-analytics.eqiad.wmnet
  2. Create a new kerberos principal, hive/hive-analytics.eqiad.wmnet@WIKIMEDIA, and add its credentials to both keytabs for hive/an-coord1001 and hive/an-coord1002
  3. Replace all jdbc strings with jdbc:hive2://hive-analytics.eqiad.wmnet:10000/default (or test), as in the example after this list
  4. Replace all hive principal configs with hive/hive-analytics.eqiad.wmnet@WIKIMEDIA (or test)
  5. Roll restart all the services using it, and restart all the oozie coordinators
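
For comparison, the refinery snippet above would then become something like this (using the hive-analytics name proposed here; the final name was still being discussed):

#refinery
hive/hive-analytics.eqiad.wmnet@WIKIMEDIA
jdbc:hive2://hive-analytics.eqiad.wmnet:10000/default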

At this point we'd have, in theory, a transparent way to fail over, since moving hive from an-coord1001 to an-coord1002, for example, would only mean changing the DNS CNAME (assuming that DNS TTLs are respected; this needs to be tested).

I like! Could we do the same, but use LVS instead of DNS CNAMEs? It might be easier to do failovers with LVS than waiting for DNS TTLs to expire.

Also, instead of hive-analytics and hive-test-analytics, can we do analytics-hive and analytics-test-hive? Presto is set up to refer to our Hive as analytics_hive, and I'm pretty sure hadoop's HA name is analytics-hadoop.

@Ottomata LVS inside the analytics vlan is problematic :(

+1 for naming, I don't have any particular preference

LVS inside the analytics vlan is problematic :(

Oh right, we wanted to do that for druid long ago but couldn't. But why should it be problematic? I don't remember why it didn't work, but maybe we can work with Traffic and fix it.

We could, but I think that for a first step a DNS CNAME should be good; we want to get to a reliable active/standby status, and LVS is more active/active in my mind. We could do DNS first, then LVS; I think setting up the new name is the most annoying part (if we then want to replace the CNAME with an LVS VIP, it should be a quick DNS change and a few other things). What do you think?

Change 635844 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hive::client: refactor code to have settings only in one place

https://gerrit.wikimedia.org/r/635844

Change 635850 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Add CNAME analytics-test-hive

https://gerrit.wikimedia.org/r/635850

Change 635850 merged by Elukey:
[operations/dns@master] Add CNAME analytics-test-hive

https://gerrit.wikimedia.org/r/635850

Change 635844 merged by Elukey:
[operations/puppet@production] profile::hive::client: refactor code to have settings only in one place

https://gerrit.wikimedia.org/r/635844

Change 635861 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::oozie::client: make config global

https://gerrit.wikimedia.org/r/635861

Change 635861 merged by Elukey:
[operations/puppet@production] profile::oozie::client: make config global

https://gerrit.wikimedia.org/r/635861

Change 635955 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hive: set new kerberos principal for Hadoop test

https://gerrit.wikimedia.org/r/635955

Change 635955 merged by Elukey:
[operations/puppet@production] hive: set new kerberos principal for Hadoop test

https://gerrit.wikimedia.org/r/635955

Change 635961 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hive: change metastore and server hostnames for Hadoop test

https://gerrit.wikimedia.org/r/635961

Change 635961 merged by Elukey:
[operations/puppet@production] hive: change metastore and server hostnames for Hadoop test

https://gerrit.wikimedia.org/r/635961

Summary of actions done:

  • created a dns CNAME analytics-test-hive.eqiad.wmnet -> an-test-coord1001.eqiad.wmnet
  • created the kerberos principal hive/analytics-test-hive.eqiad.wmnet@WIKIMEDIA on krb1001
  • executed the following on krb1001:
kadmin.local ktadd -norandkey -k /srv/kerberos/keytabs/an-test-coord1001.eqiad.wmnet/hive/hive.keytab hive/analytics-test-hive.eqiad.wmnet@WIKIMEDIA

root@krb1001:/home/elukey# klist -ekt /srv/kerberos/keytabs/an-test-coord1001.eqiad.wmnet/hive/hive.keytab
Keytab name: FILE:/srv/kerberos/keytabs/an-test-coord1001.eqiad.wmnet/hive/hive.keytab
KVNO Timestamp           Principal
---- ------------------- ------------------------------------------------------
   1 10/09/2020 11:09:28 hive/an-test-coord1001.eqiad.wmnet@WIKIMEDIA (aes256-cts-hmac-sha1-96)
   1 10/12/2020 05:58:27 hive/an-test-coord1001.eqiad.wmnet@WIKIMEDIA (aes256-cts-hmac-sha1-96)
   1 10/23/2020 07:53:31 hive/analytics-test-hive.eqiad.wmnet@WIKIMEDIA (aes256-cts-hmac-sha1-96)
  • Deployed the new keytab on an-test-coord1001
  • Set hive server and metastore kerberos principals to hive/analytics-test-hive.eqiad.wmnet@WIKIMEDIA in the global config (to update all hive-site.xmls)
  • Restarted hive metastore and server daemons

The first tests look good, everything seems to work fine. I am blocked on T266322 since webrequest_load doesn't seem to be working; after that I'll probably be able to do more testing. One thing that I want to try is changing the DNS CNAME to something like an-coord1002.eqiad.wmnet (a new host with nothing running on it) to see what happens to oozie/hive-cli/etc. (in theory I expect everything to fail after the DNS TTL expires, in practice there may be surprises).
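
A simple way to watch that test from a client node would be something like the following (record name taken from the summary above; the answer line is just the expected shape, not actual zone data):

# Hedged sketch: check what the CNAME points to and the TTL (second column,
# in seconds) that clients will cache.
dig +noall +answer analytics-test-hive.eqiad.wmnet CNAME
# analytics-test-hive.eqiad.wmnet. 300 IN CNAME an-test-coord1001.eqiad.wmnet.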