
Move oozie's hive2 actions to analytics-hive.eqiad.wmnet
Closed, Resolved | Public | 8 Estimated Story Points

Description

As described in the parent task, we'd like to move all oozie bundle/coordinator settings to:

hive_principal='hive/analytics-hive.eqiad.wmnet@WIKIMEDIA'
hive2_jdbc_url='jdbc:hive2://analytics-hive.eqiad.wmnet:10000/default'

The CNAME currently points to an-coord1002, where a hive-server2 is running. In this way regular clients with "old" credentials will not be disrupted. This will allow a slow and careful migration while we transition to the new creds (to see if anything pops up).
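As a rough illustration of what the per-job settings change amounts to, here is a minimal Python sketch that rewrites only the two keys named above in a properties-style file. The old hostname (an-coord1001.eqiad.wmnet) is assumed from the patch titles later in this task, and `migrate_properties` is a hypothetical helper, not part of refinery.

```python
import re

# Assumption: the old values reference an-coord1001, as the later patch titles suggest.
OLD_HOST = "an-coord1001.eqiad.wmnet"
NEW_HOST = "analytics-hive.eqiad.wmnet"

# Only the two settings named in the task description are rewritten.
HIVE2_KEYS = ("hive_principal", "hive2_jdbc_url")


def migrate_properties(text: str) -> str:
    """Point the hive2 settings at the CNAME, leaving every other line untouched."""
    out = []
    for line in text.splitlines():
        match = re.match(r"^(\s*)([\w.]+)(\s*=\s*)(.*)$", line)
        if match and match.group(2) in HIVE2_KEYS:
            indent, key, sep, value = match.groups()
            line = f"{indent}{key}{sep}{value.replace(OLD_HOST, NEW_HOST)}"
        out.append(line)
    return "\n".join(out)
```

Other keys that mention the old host (for example a metastore URI) are deliberately left alone, since this task only covers the hive2 actions.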

Event Timeline

Change 641440 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Migrate browser general job to use cname credentials

https://gerrit.wikimedia.org/r/641440

Change 641440 merged by Mforns:
[analytics/refinery@master] Migrate browser general job to use cname credentials

https://gerrit.wikimedia.org/r/641440

elukey triaged this task as High priority. (Nov 23 2020, 10:31 AM)
elukey added a project: Analytics.
elukey moved this task from Incoming to Ops Week on the Analytics board.

Change 643013 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/refinery@master] Move webrequest_load to analytics-hive hive2 credentials

https://gerrit.wikimedia.org/r/643013

Change 643013 merged by Elukey:
[analytics/refinery@master] Move webrequest_load to analytics-hive hive2 credentials

https://gerrit.wikimedia.org/r/643013

Change 643762 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/refinery@master] oozie: move all hive2 actions settings to analytics-hive.eqiad.wmnet

https://gerrit.wikimedia.org/r/643762

Change 643823 had a related patch set uploaded (by Elukey; owner: Elukey):
[wikimedia/discovery/analytics@master] oozie: replace occurrences of an-coord1001 with analytics-hive

https://gerrit.wikimedia.org/r/643823

Change 643762 abandoned by Elukey:
[analytics/refinery@master] oozie: move all hive2 actions settings to analytics-hive.eqiad.wmnet

Reason:
Needs to be split into multiple steps

https://gerrit.wikimedia.org/r/643762

While reviewing my change, I realized that there is something more problematic for the Hive Metastore, which we should probably resolve sooner rather than later. Long post, sorry in advance :D

From various guides that I found, there seem to be the following ways to make Hive HA:

  • Hive server 2 - use the same kerberos principal for two or more hive servers, and configure clients to point to one of them. This is what we do with analytics-hive.eqiad.wmnet and its related kerberos principal, hive/analytics-hive.eqiad.wmnet@WIKIMEDIA.
  • Hive metastore - store the session tokens on the DB with hive.cluster.delegation.token.store.class set to DBTokenStore, and then configure clients to have multiple thrift connection options, like thrift://an-coord1001.eqiad.wmnet:9083,thrift://an-coord1002.eqiad.wmnet:9083. I assume that the kerberos principal is always a hostname-based one (so hive/an-coord1001.eqiad.wmnet@WIKIMEDIA, ..).

The problem for the metastore is that in all clients we need to specify the principal to use, for example in oozie for Spark actions:

hive_principal           = hive/an-coord1001.eqiad.wmnet@WIKIMEDIA
hive_metastore_uri  = thrift://an-coord1001.eqiad.wmnet:9083

This could become something like:

hive_principal            = hive/_HOST@WIKIMEDIA
hive_metastore_uris  = thrift://an-coord1001.eqiad.wmnet:9083,thrift://an-coord1002.eqiad.wmnet:9083

But it relies on multiple assumptions related to clients (not only oozie):

  • a client supports multiple metastore uris (namely doing some sort of try/failover)
  • hive/_HOST@WIKIMEDIA is supported
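To make the two assumptions concrete, here is a minimal Python sketch of what a client would have to do: walk the comma-separated URI list in order, and substitute `_HOST` in the principal with whichever host it ends up using. The function names are hypothetical, and the plain TCP probe only stands in for a real Kerberos-authenticated Thrift connection.

```python
import socket
from urllib.parse import urlparse


def parse_metastore_uris(uris: str):
    """Split a comma-separated list of thrift:// URIs into (host, port) pairs."""
    endpoints = []
    for raw in uris.split(","):
        parsed = urlparse(raw.strip())
        if parsed.scheme != "thrift" or not parsed.hostname or not parsed.port:
            raise ValueError(f"not a thrift://host:port URI: {raw!r}")
        endpoints.append((parsed.hostname, parsed.port))
    return endpoints


def expand_principal(principal: str, host: str) -> str:
    """Assumption: _HOST is replaced with the name of the server being contacted."""
    return principal.replace("_HOST", host)


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Cheap reachability check; a real client would open an authenticated session."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_metastore(uris: str, principal: str, probe=tcp_probe):
    """Return (uri, principal) for the first endpoint that answers the probe."""
    for host, port in parse_metastore_uris(uris):
        if probe(host, port):
            return f"thrift://{host}:{port}", expand_principal(principal, host)
    raise ConnectionError(f"no metastore reachable in {uris!r}")
```

A client that lacks either piece (the in-order retry or the `_HOST` expansion) would break with the multi-URI settings, which is the risk listed above.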

I would apply the same mechanism as for the hive server, namely using analytics-hive.eqiad.wmnet for both the principal and the thrift URI. It should work fine, but I didn't find any trace of this in any guide/tutorial, so I am wondering if I am missing something. In theory it shouldn't be a problem, but..

I think I found a good compromise for our use case. We could have a config for each of the following use cases:

  • on clients, hive-site.xml and other configs (oozie properties, spark, etc..) should all point to thrift://analytics-hive.eqiad.wmnet:9083. In this way we control which instance is active, and there is a single, simple config to give to people.
  • on coordinators, hive-site.xml should use the multi-thrift-URI scheme, like thrift://an-coord1001.eqiad.wmnet:9083,thrift://an-coord1002.eqiad.wmnet:9083. The idea that I have is that the hive server knows how to use multiple metastores, so having both listed is better for availability (for example, say that one metastore gets temporarily overloaded, leading to timeouts).
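The two profiles proposed in this comment can be sketched as a small Python helper that renders the `hive.metastore.uris` property for hive-site.xml depending on the role of the host. This only illustrates the proposal as written here, not the puppet code that was eventually merged; the role names and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Values taken from the comment above.
CNAME_URI = "thrift://analytics-hive.eqiad.wmnet:9083"
COORDINATOR_URIS = [
    "thrift://an-coord1001.eqiad.wmnet:9083",
    "thrift://an-coord1002.eqiad.wmnet:9083",
]


def metastore_uris_for(role: str) -> str:
    """Clients get the single CNAME; coordinators get every metastore listed."""
    if role == "client":
        return CNAME_URI
    if role == "coordinator":
        return ",".join(COORDINATOR_URIS)
    raise ValueError(f"unknown role: {role!r}")


def render_property(role: str) -> str:
    """Render the hive.metastore.uris <property> element as a string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "hive.metastore.uris"
    ET.SubElement(prop, "value").text = metastore_uris_for(role)
    return ET.tostring(prop, encoding="unicode")
```

Keeping the role-to-URI mapping in one place means a failover only needs the CNAME to move; no client config has to change.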

Change 644239 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hive::client: make DBTokenStore the default

https://gerrit.wikimedia.org/r/644239

Change 644239 merged by Elukey:
[operations/puppet@production] profile::hive::client: make DBTokenStore the default

https://gerrit.wikimedia.org/r/644239

Change 643823 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] oozie: replace occurrences of an-coord1001 with analytics-hive

https://gerrit.wikimedia.org/r/643823

Change 644353 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] analytics: Replace an-coord1001 with analytics-hive

https://gerrit.wikimedia.org/r/644353

Moved from ops week to ops excellence :)

Change 644353 abandoned by Razzi:
[operations/puppet@production] analytics: Replace an-coord1001 with analytics-hive

Reason:
Should have made these changes to refinery

https://gerrit.wikimedia.org/r/644353

Change 644541 had a related patch set uploaded (by Razzi; owner: Razzi):
[analytics/refinery@master] Switch an-coord1001 to analytics-hive

https://gerrit.wikimedia.org/r/644541

Change 644541 abandoned by Razzi:
[analytics/refinery@master] Switch an-coord1001 to analytics-hive

Reason:

https://gerrit.wikimedia.org/r/644541

Change 645039 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move the hive metastore to analytics-test-hive in Hadoop test

https://gerrit.wikimedia.org/r/645039

Change 645039 merged by Elukey:
[operations/puppet@production] Move the hive metastore to analytics-test-hive in Hadoop test

https://gerrit.wikimedia.org/r/645039

Change 645057 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hive: force the hive server to use the local metastore

https://gerrit.wikimedia.org/r/645057

Change 645057 merged by Elukey:
[operations/puppet@production] hive: force the hive server to use the local metastore

https://gerrit.wikimedia.org/r/645057

Change 647273 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add a second Hive Metastore on an-coord1002

https://gerrit.wikimedia.org/r/647273

Change 647273 merged by Elukey:
[operations/puppet@production] Add a second Hive Metastore on an-coord1002

https://gerrit.wikimedia.org/r/647273

Change 647612 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/refinery@master] oozie: Replace all references of an-coord1001 with analytics-hive

https://gerrit.wikimedia.org/r/647612

Change 647612 merged by Joal:
[analytics/refinery@master] oozie: Replace all references of an-coord1001 with analytics-hive

https://gerrit.wikimedia.org/r/647612

Change 650077 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hive: remove analytics-replicated-hive config

https://gerrit.wikimedia.org/r/650077

Change 650077 merged by Elukey:
[operations/puppet@production] hive: remove analytics-replicated-hive config

https://gerrit.wikimedia.org/r/650077

Change 651456 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Failover analytics-hive.eqiad.wmnet to an-coord1001

https://gerrit.wikimedia.org/r/651456

Change 651456 merged by Elukey:
[operations/dns@master] Failover analytics-hive.eqiad.wmnet to an-coord1001

https://gerrit.wikimedia.org/r/651456

The first failover from an-coord1002 to an-coord1001 happened via a DNS CNAME change, no issues raised \o/
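For anyone wanting to check which coordinator the CNAME currently points to after a failover like this, a minimal Python sketch follows. `active_coordinator` is a hypothetical helper; it relies on `socket.gethostbyname_ex`, whose first return value is the canonical hostname, and what it reports depends on the local resolver setup.

```python
import socket


def active_coordinator(alias="analytics-hive.eqiad.wmnet",
                       resolver=socket.gethostbyname_ex):
    """Return the canonical hostname that the alias resolves to."""
    # gethostbyname_ex returns (canonical_name, alias_list, ip_address_list).
    canonical, _aliases, _addresses = resolver(alias)
    return canonical
```

The `resolver` parameter exists only so the lookup can be stubbed out in tests or on hosts without access to the internal DNS.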

This task is done finally!

elukey moved this task from Next Up to Done on the Analytics-Kanban board.
elukey set the point value for this task to 8.

Change 651786 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/dns@master] Failover analytics-hive.eqiad.wmnet to an-coord1001

https://gerrit.wikimedia.org/r/651786

Change 651786 merged by Ottomata:
[operations/dns@master] Failover analytics-hive.eqiad.wmnet to an-coord1001

https://gerrit.wikimedia.org/r/651786

Mentioned in SAL (#wikimedia-analytics) [2020-12-23T15:53:00Z] <ottomata> point analytics-hive.eqiad.wmnet back at an-coord1001 - T268028 T270768

Nice, I just did my first failover too (due to T270768). Back at an-coord1001 now. :)