
Deploy a PostgreSQL service for Airflow to use
Closed, ResolvedPublic3 Estimated Story Points

Description

We have discovered an incompatibility between our next version of Airflow and MariaDB:

https://airflow.apache.org/docs/apache-airflow/stable/howto/set-up-database.html#choosing-database-backend

(Screenshot attachment: the supported database backends section of the Airflow documentation linked above.)

Whilst Airflow has been working with MariaDB up until now, the upgrade of the database to the new version failed.

Therefore, we need to choose a different back-end from the list of supported databases.
The only options available to us from a licensing perspective are:

  • MySQL
  • SQLite
  • PostgreSQL

SQLite is not really suitable for a production deployment of Airflow, so that leaves us with MySQL or PostgreSQL.

Of these two options, the initial assessment is that PostgreSQL would be preferred over MySQL.

Therefore, the aim of this ticket is to:

  • Choose a deployment scenario for PostgreSQL
  • Deploy a PostgreSQL service according to the chosen deployment scenario
  • Ensure that we have appropriate availability guarantees in place for PostgreSQL - (i.e. replication, failover, backup, restore, monitoring)

Note that we have a related ticket that investigates whether or not we can migrate existing data or whether we would need to start again with a fresh database: T319502: Consider migrating the Airflow MariaDB databases to Postgres
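
For reference, switching backends ultimately comes down to pointing Airflow's metadata database connection string at the new PostgreSQL service. A minimal sketch, assuming Airflow 2.x, with placeholder host, database, user and password values (the real values will come from our deployment configuration):

# All values below are illustrative placeholders, not the production settings.
export AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow_user:CHANGEME@an-db1001.eqiad.wmnet:5432/airflow_db"
airflow db check   # confirms that Airflow can reach the configured metadata database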

Event Timeline

EChetty renamed this task from Evaluate Migration to Postgress For Airflow to Evaluate migration to PostgreSQL for Airflow. Oct 5 2022, 4:18 PM
EChetty assigned this task to Antoine_Quhen.
EChetty set the point value for this task to 3. Oct 10 2022, 4:10 PM

Deployment scenarios

1 Physical server pair to serve all Airflow instances

This involves using two physical servers that are already deployed but were previously earmarked for hosting the Analytics MySQL meta databases currently hosted on an-coord100[1-2].

The ticket for this planned migration work is: T284150: Bring an-mariadb100[12] into service

Pros
  • High performance option
  • Instant hardware availability (the servers are already racked and in the insetup Puppet role)
  • Ready for HA with a pair of servers
Cons
  • Introduces further delay to the splitting of MariaDB from the an-coord role

2 Virtual server pair to serve all Airflow instances

This would require us to create a new pair of VMs (primary and standby) to host PostgreSQL. Due to the nature of Ganeti, we would be limited to a lower server specification.

Pros
  • Low waiting time to deploy VMs
  • No impact on the plan to split MariaDB from the an-coord role
  • Ready for HA with a pair of servers
Cons
  • Lower performance/capacity option

3 PostgreSQL instance per Airflow instance

This would require adding a PostgreSQL instance to each Airflow VM.

Pros
  • No shared database backend across Airflow instances
Cons
  • Lower performance option
  • PostgreSQL shares resources with the machine hosting the Airflow scheduler, web server and executor
  • Not HA ready

Going with Option 1

As stated in T319440#8320425 we have met to discuss these options and decided to use the existing an-db100[1-2] servers for the new shared PostgreSQL instance. That's option 1 above.
We will replace an-db100[1-2] with an-mariadb100[1-2] (being purchased in T319437) so that the existing plan to extract the Analytics MySQL meta database from the an-coord role is not negatively impacted by this decision.

We did like the decoupling of Airflow instances inherent in option 3, and if we already had a way to run PostgreSQL in containers (on Kubernetes) then we might well have taken this option. However, we didn't feel that running at least five new PostgreSQL instances on the Airflow VMs was a worthwhile trade-off for this decoupling. Perhaps when we have support for PersistentVolumeClaims in Kubernetes this will become a more suitable option, but for now option 1 seems like the best pragmatic decision.

I will get on with creating patches to deploy PostgreSQL to an-db100[1-2].

Thanks for all the notes!!
One question: I don't understand what replacing an-db100[1-2] with an-mariadb100[1-2] means. Does it mean that we will use the already provisioned machines an-db100[1-2] for PostgreSQL and then purchase 2 new machines, an-mariadb100[1-2], for all the MariaDB databases? (Sorry, I don't have permissions for T319437.)

Yes, that's exactly right @mforns. We previously purchased an-db100[1-2] and they are ready for use now. They are in the insetup role in puppet, which means we can just apply a new role to them and start customizing.
They were ready for the project to move MariaDB away from an-coord100[1-2] in T284150: Bring an-mariadb100[12] into service, but that project got de-prioritized and as such these servers have been unused since being racked.

By purchasing a new pair for that project (an-mariadb100[1-2]), we can still proceed with migrating MariaDB away from an-coord100[1-2], but at a more convenient time. In the meantime we can use the existing hardware (an-db100[1-2]) for PostgreSQL.

I think that this ticket and T319502: Consider migrating the Airflow MariaDB databases to Postgres are probably duplicates and should be merged. @EChetty?

It seems the above ticket is more about trying to migrate the data to Postgres.

Change 844460 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Add postgresql replication password for new an-db servers

https://gerrit.wikimedia.org/r/844460

Change 844460 merged by Btullis:

[labs/private@master] Add postgresql replication password for new an-db servers

https://gerrit.wikimedia.org/r/844460

Change 843502 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add initial manifests for postgresql on an-db100[1-2]

https://gerrit.wikimedia.org/r/843502

BTullis renamed this task from Evaluate migration to PostgreSQL for Airflow to Deploy a PostgreSQL service for Airflow to use. Oct 19 2022, 1:11 PM
BTullis updated the task description.
BTullis triaged this task as High priority. Oct 19 2022, 1:14 PM

I now have a puppet change that will allow us to:

  • deploy PostgreSQL to an-db100[1-2]
  • configure replication from a primary (an-db1001) to a replica (an-db1002) server
  • configure backups of the PostgreSQL data, both on each host and also to Bacula

I will create the actual Airflow databases, usernames, etc. in a subsequent commit, once this has been deployed.
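
As an illustration of what that amounts to on the primary, a minimal sketch (the role and database names and the password here are placeholders; the real values will come from the puppet change and the private repo):

# Run as the postgres superuser on the primary; names and credentials are illustrative
sudo -u postgres psql -c "CREATE ROLE airflow_example LOGIN PASSWORD 'CHANGEME';"
sudo -u postgres psql -c "CREATE DATABASE airflow_example OWNER airflow_example;"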

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-db1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-db1001.eqiad.wmnet with OS bullseye completed:

  • an-db1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202210191339_btullis_2240267_an-db1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-db1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-db1002.eqiad.wmnet with OS bullseye completed:

  • an-db1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202210191601_btullis_2267084_an-db1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 843502 merged by Btullis:

[operations/puppet@production] Add postgresql to an-db100[1-2]

https://gerrit.wikimedia.org/r/843502

OK, this first change is now deployed, so we have a PostgreSQL 13 service running on an-db100[1-2]:

btullis@an-db1001:~$ sudo su - postgres
postgres@an-db1001:~$ psql
psql (13.8 (Debian 13.8-0+deb11u1))
Type "help" for help.

postgres=#

I will check the replication and monitoring aspects. @jcrespo is helping me to check the backup elements.

Once these are working, I will make another CR to add the various Airflow databases.

I have initiated the replication with the following command.

btullis@cumin1001:~$ sudo cookbook sre.postgresql.postgres-init --replica an-db1002.eqiad.wmnet --task-id T319440 --reason "Initial setup of postgres replication"
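
A minimal sketch of how to confirm that streaming replication is healthy afterwards (output trimmed; exact columns depend on the PostgreSQL version, 13 in our case):

# On the primary (an-db1001): one row per connected standby, state should be 'streaming'
sudo -u postgres psql -c "SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;"
# On the replica (an-db1002): returns 't' while the server is acting as a standby
sudo -u postgres psql -tAc "SELECT pg_is_in_recovery();"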

One thing, regarding "Ensure that we have appropriate availability guarantees in place for PostgreSQL (i.e. replication, failover, backup, restore, monitoring)":

The Data Persistence team does not support backups of Postgres, and in fact there are huge issues with the current WMF Postgres backup workflow (T316655). Given that we have not been involved in this decision, we won't be able to support this; it will be Data Engineering's responsibility to handle monitoring, backups and testing. We have proper tooling for automating, monitoring and backing up MySQL or MariaDB hosts, due to MediaWiki needs, but that is not true for Postgres. If backups fail to recover, it will be solely the DE team's responsibility.

Having said that, external file backups ran yesterday at 4am:

root@backup1001:~$ check_bacula.py an-db1001.eqiad.wmnet-Monthly-1st-Thu-productionEqiad-data-engineering-postgres
id: 479917, ts: 2022-10-21 04:05:00, type: F, status: T, bytes: 3488
✔️

A restore job was just run and sent to an-db1001:/var/tmp/bacula-restores. Please check that everything that should be there is there and can be restored.

Please delete the restore files when done.

Thanks Jaime.

This is fully understood and we appreciate the clarity. Apologies if you feel that we should have involved the Data Persistence team more in the decision-making process.
Sadly, the requirement to switch Airflow from MariaDB to PostgreSQL came to us rather out of the blue, so this is something of a pragmatic decision on our part, but we will take full responsibility for the availability guarantees of the service.


I've checked the restored files and I'm happy that they can be reloaded to postgres. I've subsequently deleted an-db1001:/var/tmp/bacula-restores as requested.
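
For anyone repeating this check later, a rough sketch of what it involves, assuming the Bacula job contains plain SQL dumps (the dump filename below is illustrative, not the real path inside the restore directory):

# Load the restored dump into a throwaway database, then discard it
sudo -u postgres createdb restore_test
sudo -u postgres psql -d restore_test -f /var/tmp/bacula-restores/airflow_example.sql   # illustrative filename
sudo -u postgres dropdb restore_test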

Great!

The main issue I wanted to surface is that there are ongoing issues with the Postgres backup workflow, and I wanted you and your team to be aware of those! We are not ready to guarantee that those backups would work. We can provide that guarantee for other technologies after long-running (multi-year) projects to prepare and build them properly.

Change 845559 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Add dummy passwords for the airflow database users

https://gerrit.wikimedia.org/r/845559

Change 845560 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a simple mechanism for creating postgresql users and databases

https://gerrit.wikimedia.org/r/845560

Change 845559 merged by Btullis:

[labs/private@master] Add dummy passwords for the airflow database users

https://gerrit.wikimedia.org/r/845559

Change 845560 merged by Btullis:

[operations/puppet@production] Add a simple mechanism for creating postgresql users and databases

https://gerrit.wikimedia.org/r/845560

Change 849122 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Open up the postrges service to the analytics vlans

https://gerrit.wikimedia.org/r/849122

Change 849122 merged by Btullis:

[operations/puppet@production] Open up the postrges service to the analytics vlans

https://gerrit.wikimedia.org/r/849122

Change 849500 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a postgres user with an IPv6 RFC 4193 host match

https://gerrit.wikimedia.org/r/849500

Change 849500 merged by Btullis:

[operations/puppet@production] Add a postgres user with an IPv6 RFC 4193 host match

https://gerrit.wikimedia.org/r/849500

Change 850185 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a postgres user with our IPv6 network address

https://gerrit.wikimedia.org/r/850185

Change 850185 merged by Btullis:

[operations/puppet@production] Add a postgres user with our IPv6 network address

https://gerrit.wikimedia.org/r/850185
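
For context, the access-control changes above boil down to pg_hba.conf host rules of roughly the following shape; the database, user and address values here are documentation placeholders, not the ones actually deployed, and puppet manages the real entries:

# Entries in /etc/postgresql/13/main/pg_hba.conf (Debian default path) take roughly this form.
# Database, user and address values below are placeholders only.
host    airflow_example    airflow_example    192.0.2.0/24     md5
host    airflow_example    airflow_example    2001:db8::/64    md5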

Change 851690 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a postgresql database for the airflow development

https://gerrit.wikimedia.org/r/851690

Change 851690 merged by Btullis:

[operations/puppet@production] Add a postgresql database for the airflow development

https://gerrit.wikimedia.org/r/851690

BTullis moved this task from Ready to Deploy to Done on the Data Pipelines (Sprint 04) board.

I am happy with this service now, although there will be some follow-up in terms of monitoring.
We have an Airflow-specific database check, which will fire if the database is down. Similarly, we have a systemd alert, which will fire if PostgreSQL is down.

It would be nice to ensure that we have stats in Prometheus and any relevant Alertmanager alerts, but we can come back to that.
So I'm marking this ticket as resolved.
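
For reference, the existing alerts correspond to liveness checks along these lines (a minimal sketch; the service name is Debian's default unit for a PostgreSQL 13 'main' cluster, and the exact probes used by Icinga/Alertmanager may differ):

# Is the Debian-packaged PostgreSQL 13 cluster running?
systemctl is-active postgresql@13-main
# Can we actually run a trivial query against it?
sudo -u postgres psql -tAc "SELECT 1;"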

Change 891804 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Ensure that the airflow database names match existing conventions

https://gerrit.wikimedia.org/r/891804

Change 891804 merged by Btullis:

[operations/puppet@production] Ensure that the airflow database names match existing conventions

https://gerrit.wikimedia.org/r/891804