
Deploy a PostgreSQL service for Airflow to use
Closed, ResolvedPublic3 Estimated Story Points

Description

We have discovered an incompatibility between our next version of Airflow and MariaDB:

https://airflow.apache.org/docs/apache-airflow/stable/howto/set-up-database.html#choosing-database-backend

(Screenshot attachment: the supported database backends section of the Airflow documentation linked above.)

Whilst Airflow has been working with MariaDB up until now, the upgrade of the database to the new version failed.

Therefore, we need to choose a different back-end from the list of supported databases.
The only options available to us from a licensing perspective are:

  • MySQL
  • SQLite
  • PostgreSQL

SQLite is not really suitable for a production deployment of Airflow, so that leaves us with MySQL or PostgreSQL.

Of these two options, the initial assessment is that PostgreSQL would be preferred over MySQL.

Therefore, the aim of this ticket is to:

  • Choose a deployment scenario for PostgreSQL
  • Deploy a PostgreSQL service according to the chosen deployment scenario
  • Ensure that we have appropriate availability guarantees in place for PostgreSQL - (i.e. replication, failover, backup, restore, monitoring)

Note that we have a related ticket that investigates whether or not we can migrate existing data or whether we would need to start again with a fresh database: T319502: Consider migrating the Airflow MariaDB databases to Postgres
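
For reference, switching backends ultimately comes down to pointing Airflow's metadata database connection string at the new PostgreSQL service. A minimal sketch, assuming Airflow 2.x, with placeholder host, database, user and password values (the real values will come from our deployment configuration):

# All values below are illustrative placeholders, not the production settings.
export AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow_user:CHANGEME@an-db1001.eqiad.wmnet:5432/airflow_db"
airflow db check   # confirms that Airflow can reach the configured metadata database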

Event Timeline

EChetty renamed this task from Evaluate Migration to Postgress For Airflow to Evaluate migration to PostgreSQL for Airflow. Oct 5 2022, 4:18 PM
EChetty assigned this task to Antoine_Quhen.
EChetty set the point value for this task to 3. Oct 10 2022, 4:10 PM

Deployment scenarios

1 Physical server pair to serve all Airflow instances

This involves using two physical servers that are already deployed but were previously earmarked for hosting the Analytics MySQL meta databases currently hosted on an-coord100[1-2].

The ticket for this planned migration work is: T284150: Bring an-mariadb100[12] into service

Pros
  • High performance option
  • Instant hardware availability (the servers are already racked and in the insetup Puppet role)
  • Ready for HA with a pair of servers
Cons
  • Introduces further delay to the splitting of MariaDB from the an-coord role

2 Virtual server pair to serve all Airflow instances

This would require us to create a new pair of VMs (primary and standby) to host PostgreSQL. Due to the nature of Ganeti, we would be limited to a lower server specification.

Pros
  • Low waiting time to deploy VMs
  • No impact on the plan to split MariaDB from the an-coord role
  • Ready for HA with a pair of servers
Cons
  • Lower performance/capacity option

3 PostgreSQL instance per Airflow instance

This would require adding a PostgreSQL instance to each Airflow VM.

Pros
  • No shared database backend across Airflow instances
Cons
  • Lower performance option
  • PostgreSQL shares resources with the machine hosting the Airflow scheduler, web server and executor
  • Not HA ready

Going with Option 1

As stated in T319440#8320425 we have met to discuss these options and decided to use the existing an-db100[1-2] servers for the new shared PostgreSQL instance. That's option 1 above.
We will replace an-db100[1-2] with an-mariadb100[1-2] (being purchased in T319437) so that the existing plan to extract the Analytics MySQL meta database from the an-coord role is not negatively impacted by this decision.

We did like the decoupling of Airflow instances inherent in option 3, and if we already had a way to run PostgreSQL in containers (on Kubernetes) then we might well have taken this option. However, we didn't feel that running at least five new PostgreSQL instances on the Airflow VMs was a worthwhile trade-off for this decoupling. Perhaps when we have support for PersistentVolumeClaims in Kubernetes this will become a more suitable option, but for now option 1 seems like the best pragmatic decision.

I will get on with creating patches to deploy PostgreSQL to an-db100[1-2].

Thanks for all the notes!!
One question: I don't understand what replacing an-db100[1-2] with an-mariadb100[1-2] means. Does it mean that we will use the already provisioned machines an-db100[1-2] for PostgreSQL and then purchase 2 new machines, an-mariadb100[1-2], for all the MariaDB databases? (Sorry, I don't have permissions for T319437.)

Yes, that's exactly right @mforns. We previously purchased an-db100[1-2] and they are ready for use now. They are in the insetup role in puppet, which means we can just apply a new role to them and start customizing.
They were ready for the project to move MariaDB away from an-coord100[1-2] in T284150: Bring an-mariadb100[12] into service, but that project got de-prioritized and as such these servers have been unused since being racked.

By purchasing a new pair for that project (an-mariadb100[1-2]), we can still proceed with migrating MariaDB away from an-coord100[1-2], but at a more convenient time. In the meantime we can use the existing hardware (an-db100[1-2]) for PostgreSQL.

I think that this ticket and T319502: Consider migrating the Airflow MariaDB databases to Postgres are probably duplicates and should be merged. @EChetty?

It seems the above ticket is more about trying to migrate the data to Postgres.

Change 844460 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Add postgresql replication password for new an-db servers

https://gerrit.wikimedia.org/r/844460

Change 844460 merged by Btullis:

[labs/private@master] Add postgresql replication password for new an-db servers

https://gerrit.wikimedia.org/r/844460

Change 843502 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add initial manifests for postgresql on an-db100[1-2]

https://gerrit.wikimedia.org/r/843502

BTullis renamed this task from Evaluate migration to PostgreSQL for Airflow to Deploy a PostgreSQL service for Airflow to use. Oct 19 2022, 1:11 PM
BTullis updated the task description.
BTullis triaged this task as High priority. Oct 19 2022, 1:14 PM

I now have a puppet change that will allow us to:

  • deploy PostgreSQL to an-db100[1-2]
  • configure replication from a primary (an-db1001) to a replica (an-db1002) server
  • configure backups of the PostgreSQL data, both on each host and also to Bacula

I will create the actual Airflow databases, usernames, etc. in a subsequent commit, once this has been deployed.
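
As an illustration of what that amounts to on the primary, a minimal sketch (the role and database names and the password here are placeholders; the real values will come from the puppet change and the private repo):

# Run as the postgres superuser on the primary; names and credentials are illustrative
sudo -u postgres psql -c "CREATE ROLE airflow_example LOGIN PASSWORD 'CHANGEME';"
sudo -u postgres psql -c "CREATE DATABASE airflow_example OWNER airflow_example;"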

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-db1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-db1001.eqiad.wmnet with OS bullseye completed:

  • an-db1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202210191339_btullis_2240267_an-db1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-db1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-db1002.eqiad.wmnet with OS bullseye completed:

  • an-db1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202210191601_btullis_2267084_an-db1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 843502 merged by Btullis:

[operations/puppet@production] Add postgresql to an-db100[1-2]

https://gerrit.wikimedia.org/r/843502

OK, this first change is now deployed, so we have a PostgreSQL 13 service running on an-db100[1-2]:

btullis@an-db1001:~$ sudo su - postgres
postgres@an-db1001:~$ psql
psql (13.8 (Debian 13.8-0+deb11u1))
Type "help" for help.

postgres=#

I will check the replication and monitoring aspects. @jcrespo is helping me to check the backup elements.

Once these are working, I will make another CR to add the various Airflow databases.

I have initiated the replication with the following command.

btullis@cumin1001:~$ sudo cookbook sre.postgresql.postgres-init --replica an-db1002.eqiad.wmnet --task-id T319440 --reason "Initial setup of postgres replication"
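
A minimal sketch of how to confirm that streaming replication is healthy afterwards (output trimmed; exact columns depend on the PostgreSQL version, 13 in our case):

# On the primary (an-db1001): one row per connected standby, state should be 'streaming'
sudo -u postgres psql -c "SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;"
# On the replica (an-db1002): returns 't' while the server is acting as a standby
sudo -u postgres psql -tAc "SELECT pg_is_in_recovery();"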

One thing, regarding "Ensure that we have appropriate availability guarantees in place for PostgreSQL (i.e. replication, failover, backup, restore, monitoring)":

The Data Persistence team does not support backups of Postgres, and in fact there are huge issues with the current WMF Postgres backup workflow (T316655). Given that we have not been involved in this decision, we won't be able to support this; it will be Data Engineering's responsibility to handle monitoring, backups and testing. We have proper tooling for automating, monitoring and backing up MySQL or MariaDB hosts, due to MediaWiki needs, but that is not true for Postgres. If backups fail to recover, it will be solely the DE team's responsibility.

Having said that, external file backups ran yesterday at 4am:

root@backup1001:~$ check_bacula.py an-db1001.eqiad.wmnet-Monthly-1st-Thu-productionEqiad-data-engineering-postgres
id: 479917, ts: 2022-10-21 04:05:00, type: F, status: T, bytes: 3488
✔️

A restore job was just run and sent to an-db1001:/var/tmp/bacula-restores. Please check that everything that should be there is there and can be restored.

Please delete the restore files when done.

Thanks Jaime.

This is fully understood and we appreciate the clarity. Apologies if you feel that we should have involved the Data Persistence team more in the decision-making process.
Sadly, the requirement to switch Airflow from MariaDB to PostgreSQL came to us rather out of the blue, so this is something of a pragmatic decision on our part, but we will take full responsibility for the availability guarantees of the service.


I've checked the restored files and I'm happy that they can be reloaded to postgres. I've subsequently deleted an-db1001:/var/tmp/bacula-restores as requested.
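
For anyone repeating this check later, a rough sketch of what it involves, assuming the Bacula job contains plain SQL dumps (the dump filename below is illustrative, not the real path inside the restore directory):

# Load the restored dump into a throwaway database, then discard it
sudo -u postgres createdb restore_test
sudo -u postgres psql -d restore_test -f /var/tmp/bacula-restores/airflow_example.sql   # illustrative filename
sudo -u postgres dropdb restore_test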

Great!

The main issue I wanted to surface is that there are ongoing issues with the Postgres backup workflow, and I wanted you and your team to be aware of those! We are not ready to guarantee that those backups would work. We can provide that guarantee for other technologies after long-running (multi-year) projects to prepare and build them properly.

Change 845559 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Add dummy passwords for the airflow database users

https://gerrit.wikimedia.org/r/845559

Change 845560 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a simple mechanism for creating postgresql users and databases

https://gerrit.wikimedia.org/r/845560

Change 845559 merged by Btullis:

[labs/private@master] Add dummy passwords for the airflow database users

https://gerrit.wikimedia.org/r/845559

Change 845560 merged by Btullis:

[operations/puppet@production] Add a simple mechanism for creating postgresql users and databases

https://gerrit.wikimedia.org/r/845560

Change 849122 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Open up the postrges service to the analytics vlans

https://gerrit.wikimedia.org/r/849122

Change 849122 merged by Btullis:

[operations/puppet@production] Open up the postrges service to the analytics vlans

https://gerrit.wikimedia.org/r/849122

Change 849500 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a postgres user with an IPv6 RFC 4193 host match

https://gerrit.wikimedia.org/r/849500

Change 849500 merged by Btullis:

[operations/puppet@production] Add a postgres user with an IPv6 RFC 4193 host match

https://gerrit.wikimedia.org/r/849500

Change 850185 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a postgres user with our IPv6 network address

https://gerrit.wikimedia.org/r/850185

Change 850185 merged by Btullis:

[operations/puppet@production] Add a postgres user with our IPv6 network address

https://gerrit.wikimedia.org/r/850185
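
For context, the access-control changes above boil down to pg_hba.conf host rules of roughly the following shape; the database, user and address values here are documentation placeholders, not the ones actually deployed, and puppet manages the real entries:

# Entries in /etc/postgresql/13/main/pg_hba.conf (Debian default path) take roughly this form.
# Database, user and address values below are placeholders only.
host    airflow_example    airflow_example    192.0.2.0/24     md5
host    airflow_example    airflow_example    2001:db8::/64    md5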

Change 851690 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a postgresql database for the airflow development

https://gerrit.wikimedia.org/r/851690

Change 851690 merged by Btullis:

[operations/puppet@production] Add a postgresql database for the airflow development

https://gerrit.wikimedia.org/r/851690

BTullis moved this task from Ready to Deploy to Done on the Data Pipelines (Sprint 04) board.

I am happy with this service now, although there will be some follow-up in terms of monitoring.
We have an Airflow-specific database check, which will fire if the database is down. Similarly, we have a systemd alert, which will fire if PostgreSQL is down.

It would be nice to ensure that we have stats in Prometheus and any relevant Alertmanager alerts, but we can come back to that.
So I'm marking this ticket as resolved.
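
For reference, the existing alerts correspond to liveness checks along these lines (a minimal sketch; the service name is Debian's default unit for a PostgreSQL 13 'main' cluster, and the exact probes used by Icinga/Alertmanager may differ):

# Is the Debian-packaged PostgreSQL 13 cluster running?
systemctl is-active postgresql@13-main
# Can we actually run a trivial query against it?
sudo -u postgres psql -tAc "SELECT 1;"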

Change 891804 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Ensure that the airflow database names match existing conventions

https://gerrit.wikimedia.org/r/891804

Change 891804 merged by Btullis:

[operations/puppet@production] Ensure that the airflow database names match existing conventions

https://gerrit.wikimedia.org/r/891804