Deploy search platform airflow service
Closed, ResolvedPublic

Description

Airflow is a workflow orchestration system that the Search Platform team is trialing as a replacement for Oozie. Per discussion with the Analytics team, we think a Ganeti VM in the analytics network is the right way forward for deploying this instance.

  • set up/install the OS on airflow1001
  • provision mariadb (tbd, may reuse an-coord1001 mariadb)
  • provision spark and hadoop client configuration
  • deploy search/airflow and wikimedia/discovery/analytics repositories via scap
  • initialize databases and verify proper operation
  • deploy airflow systemd units
  • add kerberos credentials

Details

Event Timeline

EBernhardson renamed this task from "Deploy airflow to analytics cluster" to "Deploy search platform airflow service". Oct 22 2019, 6:07 PM
EBernhardson updated the task description.
EBernhardson moved this task from needs triage to Ops / SRE on the Discovery-Search board.

About the database: we could run MariaDB on the VM, but in my experience, if the DB usage is even a bit above the norm, the overall performance of the VM suffers a lot (as used to happen with Matomo). I'd start with an-coord1001, and then we may move the DB out if we see that usage is too high (I don't expect that, but it might happen). We'll also create a user that can modify only one database, so there shouldn't be a risk of inadvertently affecting the others. @Ottomata what do you think?
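
To illustrate the "modify only one database" idea, here is a minimal Python sketch that renders GRANT statements scoped to a single schema. The helper names, the database name and the client address are hypothetical; only the account name search_airflow appears later in this task. It emits one statement per client host, because MariaDB accounts are identified as 'user'@'host'.

```python
from typing import List, Sequence


def quote_ident(name: str) -> str:
    """Backtick-quote a MySQL/MariaDB identifier, doubling embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def quote_string(value: str) -> str:
    """Single-quote a string literal, escaping backslashes and single quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def single_db_grants(user: str, database: str, hosts: Sequence[str]) -> List[str]:
    """Build one GRANT per client host, scoped to a single database.

    `db`.* limits the privileges to that schema, so the account cannot
    modify other databases on the shared server. ALL PRIVILEGES does not
    include GRANT OPTION, so the account cannot hand out access either.
    """
    target = f"{quote_ident(database)}.*"
    return [
        f"GRANT ALL PRIVILEGES ON {target} TO {quote_string(user)}@{quote_string(host)};"
        for host in hosts
    ]


if __name__ == "__main__":
    # Hypothetical values: the database name and client address are placeholders.
    for stmt in single_db_grants("search_airflow", "search_airflow", ["192.0.2.10"]):
        print(stmt)
```

Creating the account itself (CREATE USER ... IDENTIFIED BY ...) is left out of the sketch.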

Change 544989 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] airflow: Initial deployment for search platform

https://gerrit.wikimedia.org/r/544989

Change 544990 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] airflow: Run webserver and scheduler processes

https://gerrit.wikimedia.org/r/544990

Change 552304 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] Allow analytics-search-users to manage search/airflow instances

https://gerrit.wikimedia.org/r/552304

Change 544989 merged by Elukey:
[operations/puppet@production] airflow: Initial deployment for search platform

https://gerrit.wikimedia.org/r/544989

Change 552872 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] airflow: move hiera config under role and add missing params

https://gerrit.wikimedia.org/r/552872

Change 552872 merged by Elukey:
[operations/puppet@production] airflow: move hiera config under role and add missing params

https://gerrit.wikimedia.org/r/552872

Change 552878 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::search::airflow: fix file resource and add deps

https://gerrit.wikimedia.org/r/552878

Change 552878 merged by Elukey:
[operations/puppet@production] profile::analytics::search::airflow: fix file resource and add deps

https://gerrit.wikimedia.org/r/552878

Change 552882 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::search::airflow: fix directory ensure

https://gerrit.wikimedia.org/r/552882

Change 552882 merged by Elukey:
[operations/puppet@production] profile::analytics::search::airflow: fix directory ensure

https://gerrit.wikimedia.org/r/552882

Change 552304 merged by Dzahn:
[operations/puppet@production] Allow analytics-search-users to manage search/airflow instances

https://gerrit.wikimedia.org/r/552304

@EBernhardson Your user (and those for dcausse, gehel, bearloga, chelsyx) have been created on an-airflow1001.

[an-airflow1001:~] $ id ebernhardson
uid=3088(ebernhardson) gid=500(wikidev) groups=500(wikidev),816(airflow-search-admins)

[an-airflow1001:~] $ grep airflow /etc/group
airflow:x:1001:
airflow-search-admins:x:816:ebernhardson,dcausse,gehel,bearloga,chelsyx

[an-airflow1001:~] $ sudo cat /etc/sudoers.d/airflow-search-admins 
# This file is managed by Puppet!

%airflow-search-admins ALL = NOPASSWD: /usr/sbin/service airflow-webserver *
%airflow-search-admins ALL = NOPASSWD: /usr/sbin/service airflow-scheduler *
%airflow-search-admins ALL = NOPASSWD: /bin/systemctl start airflow-scheduler
%airflow-search-admins ALL = NOPASSWD: /bin/systemctl restart airflow-scheduler
%airflow-search-admins ALL = NOPASSWD: /bin/systemctl stop airflow-scheduler
%airflow-search-admins ALL = NOPASSWD: /bin/systemctl start airflow-webserver
%airflow-search-admins ALL = NOPASSWD: /bin/systemctl stop airflow-webserver
%airflow-search-admins ALL = NOPASSWD: /bin/systemctl restart airflow-webserver
%airflow-search-admins ALL = (airflow) NOPASSWD: /srv/deployment/search/airflow/venv/bin/airflow *
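
The rules above follow a regular pattern: a `service` rule per unit, one `systemctl` rule per action and unit, and one rule letting admins run the Airflow CLI as the airflow user. The following Python sketch renders an equivalent set of rules; it is a hypothetical illustration, not the Puppet template that actually produced the file, and its line order differs slightly.

```python
from typing import Sequence


def sudoers_rules(
    group: str,
    services: Sequence[str],
    actions: Sequence[str],
    airflow_bin: str,
    run_as: str = "airflow",
) -> str:
    """Render sudoers rules like the ones above for a group of Airflow admins."""
    lines = ["# This file is managed by Puppet!", ""]
    # Legacy `service` wrapper: any subcommand for the listed services.
    for service in services:
        lines.append(f"%{group} ALL = NOPASSWD: /usr/sbin/service {service} *")
    # systemctl: one explicit rule per (action, service) pair, no wildcard.
    for service in services:
        for action in actions:
            lines.append(f"%{group} ALL = NOPASSWD: /bin/systemctl {action} {service}")
    # The airflow CLI may only be run as the service user, not as root.
    lines.append(f"%{group} ALL = ({run_as}) NOPASSWD: {airflow_bin} *")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(sudoers_rules(
        "airflow-search-admins",
        ["airflow-webserver", "airflow-scheduler"],
        ["start", "stop", "restart"],
        "/srv/deployment/search/airflow/venv/bin/airflow",
    ), end="")
```

Listing each systemctl action explicitly, rather than using a wildcard, presumably keeps admins from running other systemctl subcommands on these units through sudo.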

Change 544990 merged by Dzahn:
[operations/puppet@production] airflow: Run webserver and scheduler processes

https://gerrit.wikimedia.org/r/544990

The systemd units have been created on an-airflow1001 now.

But the services fail to start. In the case of the webserver it is:

PermissionError: [Errno 13] Permission denied: '/etc/airflow/unittests.cfg'

/etc/airflow is owned by root:airflow with mode 440.

In the case of the scheduler, I only see that it exits with code 1 so far.
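
Why mode 440 breaks the webserver: assuming the service runs as the airflow user and gets access through the airflow group (the owner is root, so the group bits apply), 440 on a directory grants read but neither write, which is needed to create unittests.cfg, nor search/execute, which is needed to open any file inside it. A small Python check using the standard `stat` constants makes this concrete; the function name is hypothetical.

```python
import stat
from typing import Dict, List, Sequence

# Permission bits that apply to members of the directory's group.
GROUP_BITS: Dict[str, int] = {
    "read": stat.S_IRGRP,    # list directory entries
    "write": stat.S_IWGRP,   # create, rename, or delete entries
    "search": stat.S_IXGRP,  # traverse the directory / open files inside it
}


def missing_group_bits(
    mode: int, needed: Sequence[str] = ("read", "write", "search")
) -> List[str]:
    """Return which of the needed group permission bits are absent from mode."""
    return [name for name in needed if not mode & GROUP_BITS[name]]


if __name__ == "__main__":
    # /etc/airflow as reported: owner root, group airflow, mode 440.
    print(missing_group_bits(0o440))
    # To check a live directory instead:
    # import os; print(missing_group_bits(os.stat("/etc/airflow").st_mode))
```

For 0o440 the sketch reports `['write', 'search']`; the chown logged in SAL for this task works around this by making airflow the owner instead.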

Change 553384 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] airflow: move parameters, use lookup, style changes

https://gerrit.wikimedia.org/r/553384

Mentioned in SAL (#wikimedia-operations) [2019-11-27T19:00:33Z] <mutante> an-airflow1001: cd /etc/ ; chown airflow airflow; systemctl start airflow-webserver to let airflow write unittests.cfg (it tries to write this on first start and did not have permissions to do so) T236180

Change 553392 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] airflow: fix ensure => directory

https://gerrit.wikimedia.org/r/553392

Change 553392 merged by Dzahn:
[operations/puppet@production] airflow: fix ensure => directory

https://gerrit.wikimedia.org/r/553392

Change 553397 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] airflow: require_package python3-mysqldb

https://gerrit.wikimedia.org/r/553397

Change 553397 merged by Dzahn:
[operations/puppet@production] airflow: require_package python3-mysqldb

https://gerrit.wikimedia.org/r/553397

After the above deployments things are looking mostly in order, but stuck on a mariadb access error:

sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (1698, "Access denied for user 'search_airflow'@'2620:0:861:106:10:64:36:119'")

@elukey When you have a chance, could you double check the grants on an-coord1001?

Apologies for the delay, I didn't see the ping :(

I added grants for the IPv6 address; previously they covered only IPv4. Can you retry and see if it works?
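
The original failure follows from how MariaDB matches accounts: a grant is for 'user'@'host', and the connecting address must match one of the granted host patterns, so a grant for the VM's IPv4 address does not cover a connection arriving over IPv6. This Python sketch is a deliberately simplified matcher (only the `%` and `_` wildcards; no netmasks or name resolution). 10.64.36.119 is a hypothetical IPv4 grant; the IPv6 address is the one from the error message.

```python
import re
from typing import Iterable


def host_pattern_matches(pattern: str, client_host: str) -> bool:
    """Simplified account-host matching.

    '%' matches any run of characters and '_' matches exactly one; everything
    else is literal. Netmask syntax, hostname resolution, and case folding
    are deliberately ignored.
    """
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch) for ch in pattern
    )
    return re.fullmatch(regex, client_host) is not None


def account_allows(grant_hosts: Iterable[str], client_host: str) -> bool:
    """True if any granted host pattern matches the connecting address."""
    return any(host_pattern_matches(p, client_host) for p in grant_hosts)


if __name__ == "__main__":
    client_v6 = "2620:0:861:106:10:64:36:119"  # address from the error above
    ipv4_only = ["10.64.36.119"]               # hypothetical IPv4-only grant
    print(account_allows(ipv4_only, client_v6))                # False
    print(account_allows(ipv4_only + [client_v6], client_v6))  # True
```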

The authentication now completes, but we have further problems to investigate:

Exception: Global variable explicit_defaults_for_timestamp needs to be on (1) for mysql

Airflow has a section in their FAQ on this:

This means explicit_defaults_for_timestamp is disabled in your mysql server and you need to enable it by:

1. Set explicit_defaults_for_timestamp = 1 under the mysqld section in your my.cnf file.
2. Restart the MySQL server.
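
Step 1 can be verified before restarting. This Python sketch, using only the standard `configparser` module, reports whether a my.cnf enables the option in its [mysqld] section; the function name is hypothetical, and it does not follow `!include`/`!includedir` directives. Because MariaDB treats the variable as non-dynamic, the restart in step 2 is still required for the value to take effect.

```python
import configparser

TRUTHY = {"1", "on", "true"}


def timestamp_flag_enabled(my_cnf_text: str) -> bool:
    """Return True if [mysqld] enables explicit_defaults_for_timestamp.

    Simplifications: !include/!includedir directives are not followed, and
    only 1/ON/TRUE or a bare option name count as enabled.
    """
    parser = configparser.ConfigParser(
        allow_no_value=True,  # bare flags such as "skip-name-resolve"
        strict=False,         # my.cnf may repeat sections/options; last one wins
        interpolation=None,   # '%' has no special meaning in my.cnf
    )
    parser.read_string(my_cnf_text)
    if not parser.has_section("mysqld"):
        return False
    enabled = False
    for key, value in parser.items("mysqld"):
        # Option names accept '-' and '_' interchangeably.
        if key.replace("-", "_") == "explicit_defaults_for_timestamp":
            enabled = value is None or value.strip().lower() in TRUTHY
    return enabled


if __name__ == "__main__":
    print(timestamp_flag_enabled("[mysqld]\nexplicit_defaults_for_timestamp = 1\n"))
```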

I'm not sure that's viable here, though. We may need to run MySQL on the an-airflow1001 instance itself, as I'm not comfortable changing defaults that other applications on an-coord1001 rely on.

Change 554215 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] airflow: add a local mariadb server

https://gerrit.wikimedia.org/r/554215

explicit_defaults_for_timestamp = 1 is becoming the standard in MySQL (it is ON by default as of MySQL 8.0).

Nevertheless, here is a change to install a local MariaDB server for testing purposes.

From https://mariadb.com/kb/en/library/server-system-variables/#explicit_defaults_for_timestamp:

explicit_defaults_for_timestamp
Description: This option causes CREATE TABLE to create all TIMESTAMP columns as NULL with the DEFAULT NULL attribute. Without this option, TIMESTAMP columns are NOT NULL and have implicit DEFAULT clauses. The old behavior is deprecated.
Commandline: --explicit-defaults-for-timestamp[={0|1}]
Scope: Global
Dynamic: No
Data Type: boolean
Default Value: OFF
Introduced: MariaDB 10.1.8
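
To make the description concrete, here is a simplified Python model of how CREATE TABLE resolves TIMESTAMP columns with the option off versus on, based on the MySQL/MariaDB documentation for this variable. It is an illustration under stated assumptions: it ignores user-written ON UPDATE clauses, sql_mode settings such as NO_ZERO_DATE, and ALTER TABLE; the class and function names are hypothetical. Because the option only changes how new column definitions are resolved, existing columns keep the definitions they were created with.

```python
from typing import List, NamedTuple, Optional, Sequence


class TimestampColumn(NamedTuple):
    name: str
    null_attr: Optional[str] = None  # "NULL", "NOT NULL", or None if unspecified
    default: Optional[str] = None    # explicit DEFAULT expression, if any


def effective_definitions(
    columns: Sequence[TimestampColumn], explicit_defaults_for_timestamp: bool
) -> List[str]:
    """Resolve how CREATE TABLE would define each TIMESTAMP column.

    `columns` lists only the table's TIMESTAMP columns, in table order, so
    index 0 is the first TIMESTAMP column. ON UPDATE clauses are not modeled.
    """
    definitions = []
    for index, col in enumerate(columns):
        if explicit_defaults_for_timestamp:
            # ON: nullable unless NOT NULL is written out explicitly.
            nullable = col.null_attr != "NOT NULL"
        else:
            # OFF (legacy): NOT NULL unless NULL is written out explicitly.
            nullable = col.null_attr == "NULL"
        parts = [col.name, "TIMESTAMP", "NULL" if nullable else "NOT NULL"]
        if col.default is not None:
            parts.append(f"DEFAULT {col.default}")
        elif nullable:
            parts.append("DEFAULT NULL")
        elif not explicit_defaults_for_timestamp:
            # Legacy implicit defaults, decided by column position.
            if index == 0:
                parts.append("DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP")
            else:
                parts.append("DEFAULT '0000-00-00 00:00:00'")
        # ON + NOT NULL + no DEFAULT: the column gets no default at all.
        definitions.append(" ".join(parts))
    return definitions


if __name__ == "__main__":
    cols = [TimestampColumn("created"), TimestampColumn("updated")]
    for flag in (False, True):
        print(flag, effective_definitions(cols, flag))
```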

And just to confirm, it is off now:

MariaDB [(none)]> show variables like 'explicit_defaults_for_timestamp';
+---------------------------------+-------+
| Variable_name                   | Value |
+---------------------------------+-------+
| explicit_defaults_for_timestamp | OFF   |
+---------------------------------+-------+
1 row in set (0.01 sec)
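
Airflow refuses to proceed while this reports OFF. Here is a Python sketch of an equivalent pre-flight check that works on the (Variable_name, Value) rows of the query above; the function name and error text are hypothetical, not Airflow's own code. The live-server usage in the comments assumes the python3-mysqldb driver installed earlier in this task, with placeholder host and credentials.

```python
from typing import Dict, Iterable, Tuple, Union

Cell = Union[str, bytes]


def require_explicit_timestamp_defaults(rows: Iterable[Tuple[Cell, Cell]]) -> bool:
    """Fail unless SHOW VARIABLES reports explicit_defaults_for_timestamp as ON.

    `rows` is the (Variable_name, Value) result set of
    SHOW VARIABLES LIKE 'explicit_defaults_for_timestamp'.
    """
    values: Dict[str, str] = {}
    for name, value in rows:
        # Some drivers return bytes for these columns; normalise to str.
        if isinstance(name, bytes):
            name = name.decode()
        if isinstance(value, bytes):
            value = value.decode()
        values[name] = value
    value = values.get("explicit_defaults_for_timestamp")
    if value is None or value.upper() not in ("ON", "1"):
        raise RuntimeError(
            "explicit_defaults_for_timestamp must be ON (1); "
            f"server reports {value!r}"
        )
    return True


if __name__ == "__main__":
    # Rows as in the output above.
    try:
        require_explicit_timestamp_defaults([("explicit_defaults_for_timestamp", "OFF")])
    except RuntimeError as err:
        print(err)
    # Against a live server (hypothetical host/credentials):
    #   import MySQLdb
    #   conn = MySQLdb.connect(host="<db-host>", user="search_airflow", passwd="<secret>")
    #   cur = conn.cursor()
    #   cur.execute("SHOW VARIABLES LIKE 'explicit_defaults_for_timestamp'")
    #   require_explicit_timestamp_defaults(cur.fetchall())
```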

The only explicit warning that I can see in https://docs.cloudera.com/documentation/enterprise/5-16-x/topics/cm_ig_mysql.html is related to Cloudera Navigator, which we don't really use.

The risk I can see is that an existing tool relying on implicit defaults could be affected, but IIUC this only applies to newly created tables, so in theory we should be OK. I'd be inclined to test it on Analytics Meta, which would also make us more resilient to future MariaDB upgrades, but at the same time it could be less problematic to just add a MariaDB instance on an-airflow1001.

@Ottomata any preference?

Let's do it on Analytics Meta. I think it will be fine.

fwiw, somebody on the MySQL support channel told me that for "explicit_defaults_for_timestamp", "scope is system and session. You can set it per session".

Hmm, I tried setting it on MariaDB 10.1.39, but it was rejected:

SET SESSION explicit_defaults_for_timestamp=1;
ERROR 1238 (HY000): Variable 'explicit_defaults_for_timestamp' is a read only variable

Looking at the docs, it seems MySQL [1] allows it to be set dynamically, but MariaDB [2] doesn't.

[1] https://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_explicit_defaults_for_timestamp
[2] https://mariadb.com/kb/en/library/server-system-variables/#explicit_defaults_for_timestamp

Change 554354 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] analytics meta db: explicit_defaults_for_timestamp=on

https://gerrit.wikimedia.org/r/554354

Change 553384 merged by Dzahn:
[operations/puppet@production] airflow: move parameters, use lookup, style changes

https://gerrit.wikimedia.org/r/553384

Change 554354 merged by Elukey:
[operations/puppet@production] analytics meta db: explicit_defaults_for_timestamp=on

https://gerrit.wikimedia.org/r/554354

Mentioned in SAL (#wikimedia-analytics) [2019-12-04T11:36:44Z] <elukey> restart mariadb on analytics1030 (hadoop test coordinator) to test explicit_defaults_for_timestamp - T236180

Mentioned in SAL (#wikimedia-analytics) [2019-12-05T09:34:18Z] <elukey> stop oozie/hive-*; restart mariadb; restart oozie/hive-* on an-coord1001 to pick up explicit_defaults_for_timestamp - T236180

MariaDB [(none)]> show variables like 'explicit_defaults_for_timestamp';
+---------------------------------+-------+
| Variable_name                   | Value |
+---------------------------------+-------+
| explicit_defaults_for_timestamp | ON    |
+---------------------------------+-------+
1 row in set (0.00 sec)

@EBernhardson ready to go!

Change 556034 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] java::analytics: Make java 8 the default on buster

https://gerrit.wikimedia.org/r/556034

Change 556034 merged by Ottomata:
[operations/puppet@production] java::analytics: Make java 8 the default on buster

https://gerrit.wikimedia.org/r/556034

Change 556037 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Fix jdk-8 path for alternatives::select in profile::java::analytics

https://gerrit.wikimedia.org/r/556037

Change 556037 merged by Ottomata:
[operations/puppet@production] Fix jdk-8 path for alternatives::select in profile::java::analytics

https://gerrit.wikimedia.org/r/556037

Change 556719 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::search::airflow: add analytics cluster users

https://gerrit.wikimedia.org/r/556719

Change 556719 merged by Elukey:
[operations/puppet@production] role::search::airflow: add analytics cluster users

https://gerrit.wikimedia.org/r/556719

Change 556722 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::search::airflow: fix the scheduler's syslog id

https://gerrit.wikimedia.org/r/556722

Change 556722 merged by Elukey:
[operations/puppet@production] profile::analytics::search::airflow: fix the scheduler's syslog id

https://gerrit.wikimedia.org/r/556722

Change 556744 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::search::airflow: add kerberos config and keytab

https://gerrit.wikimedia.org/r/556744

Change 556744 merged by Elukey:
[operations/puppet@production] role::search::airflow: add kerberos config and keytab

https://gerrit.wikimedia.org/r/556744

I don't quite have this fully working from an-airflow yet. I believe it is initialized, but not yet fully operational. I'll be testing the Kerberos integration this week, which should be the last step.
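
For the Kerberos step, Airflow is switched over through airflow.cfg. This Python sketch renders the relevant fragment with `configparser`. The option names assume Airflow 1.10's `[core] security` and `[kerberos]` settings and should be checked against the deployed version's default config; the principal and keytab path are placeholders, not the values used on an-airflow1001. The `[kerberos]` options are read by the ticket-renewer process (`airflow kerberos` in 1.10), which periodically runs kinit with the keytab.

```python
import configparser
import io


def kerberos_airflow_cfg(
    principal: str,
    keytab: str,
    ccache: str = "/tmp/airflow_krb5_ccache",
    reinit_frequency: int = 3600,
) -> str:
    """Render an airflow.cfg fragment that switches Airflow to Kerberos.

    Option names assume Airflow 1.10's [core] and [kerberos] sections;
    verify them against the installed version's default config.
    """
    cfg = configparser.ConfigParser(interpolation=None)
    cfg["core"] = {"security": "kerberos"}
    cfg["kerberos"] = {
        "principal": principal,                     # identity used by the renewer
        "keytab": keytab,                           # path to the keytab file
        "ccache": ccache,                           # Kerberos credential cache file
        "reinit_frequency": str(reinit_frequency),  # seconds between renewals
        "kinit_path": "kinit",
    }
    out = io.StringIO()
    cfg.write(out)
    return out.getvalue()


if __name__ == "__main__":
    # Placeholder principal and keytab path, for illustration only.
    print(kerberos_airflow_cfg("airflow/host.example@EXAMPLE.REALM", "/path/to/airflow.keytab"), end="")
```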

Change 558687 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] airflow: Enable kerberos configuration

https://gerrit.wikimedia.org/r/558687

Change 558687 merged by Elukey:
[operations/puppet@production] airflow: Enable kerberos configuration

https://gerrit.wikimedia.org/r/558687

Change 554215 abandoned by Dzahn:
airflow: add a local mariadb server

Reason:
comments above

https://gerrit.wikimedia.org/r/554215

TJones claimed this task.