
Create a dedicated postgresql+postgis cluster for maps
Open, Needs Triage, Public

Description

Currently, postgresql+postgis is implemented on bare metal, with each maps application server hosting its own postgres replica. Once kartotherian is fully containerized, this deployment won't make sense any more, and postgres should be moved into a dedicated cluster behind a read-write and a read-only, load-balanced discovery URL.
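
To make the target topology concrete, here is a sketch of what the two discovery endpoints could look like (the names are hypothetical, nothing has been decided yet):

  # Hypothetical discovery endpoints; names are illustrative only.
  maps-db-rw.discovery.wmnet:5432   # read-write, points at the primary
  maps-db-ro.discovery.wmnet:5432   # read-only, load-balanced across the replicas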

Event Timeline

@awight, on Tegola (which is running on k8s) we already have envoy doing the load balancing; details can be found at tegola-vector-tiles. I reckon it makes sense to use the same solution for kartotherian, do you agree?

@jijiki I am looking into the same problem in T378944, and I have a doubt: what happens if one of the maps nodes goes down due to hardware failure or maintenance? Is the envoy TCP load balancer going to remove it from rotation, or will it keep erroring out periodically until the node is finally depooled? I was wondering whether creating something like maps-db.discovery.wmnet:5432 with LVS could be a more long-term solution (adding only the read replicas to the pool). What do you think?

My understanding is that, by default, the envoy LB configuration does not do any active probing of the TCP proxy's endpoint set. I am wondering if we should expand the mesh's tcp proxy config with health checks, as described in:

https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/health_checking#per-cluster-member-health-check-config

From https://www.envoyproxy.io/docs/envoy/v1.23.12/api-v3/config/cluster/v3/cluster.proto:

If no configuration is specified no health checking will be done and all cluster members will be considered healthy at all times.
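
For reference, a connect-only TCP health check on an envoy cluster would look roughly like the sketch below. The cluster name matches the one in the logs later in this task; the timeout/interval/threshold values are assumptions chosen to illustrate the knobs, and the real config is templated by the mesh.configuration module:

  # Envoy cluster with a connect-only TCP health check (illustrative values).
  clusters:
    - name: maps_postgres
      type: STRICT_DNS
      connect_timeout: 1s
      health_checks:
        - timeout: 5s              # fail the probe if the connect takes >5s
          interval: 1s             # wait 1s between probes
          unhealthy_threshold: 3   # eject after 3 consecutive failures
          healthy_threshold: 1     # re-add after 1 successful connect
          tcp_health_check: {}     # no send/receive payload = plain TCP connect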

Change #1098512 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] modules: add health checks to the mesh's _tcp_cluster config

https://gerrit.wikimedia.org/r/1098512

Change #1098511 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] modules: add mesh.configuration 1.11.0

https://gerrit.wikimedia.org/r/1098511

Change #1098530 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] charts: update tegola-vector-tiles to mesh.configuration:1.11.0

https://gerrit.wikimedia.org/r/1098530

Change #1098531 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: add health checks to Tegola's postgres TCP proxy

https://gerrit.wikimedia.org/r/1098531

> @awight, on Tegola (which is running on k8s) we already have envoy doing the load balancing; details can be found at tegola-vector-tiles. I reckon it makes sense to use the same solution for kartotherian, do you agree?

To keep archives happy - we are inclined to keep using the tcp-proxy set in k8s for tegola, since the kartotherian use case is the same. I filed some patches to add some extra functionality like health checks of maps replicas, that I'll try to rollout for Tegola first and then we'll reuse everything for Kartotherian.

Change #1098511 merged by jenkins-bot:

[operations/deployment-charts@master] modules: add mesh.configuration 1.11.0

https://gerrit.wikimedia.org/r/1098511

Change #1098512 merged by jenkins-bot:

[operations/deployment-charts@master] modules: add health checks to the mesh's _tcp_cluster config

https://gerrit.wikimedia.org/r/1098512

Change #1098530 merged by Elukey:

[operations/deployment-charts@master] charts: update tegola-vector-tiles to mesh.configuration:1.11.0

https://gerrit.wikimedia.org/r/1098530

Change #1098531 merged by Elukey:

[operations/deployment-charts@master] services: add health checks to Tegola's postgres TCP proxy

https://gerrit.wikimedia.org/r/1098531

Deployed to Tegola staging; this is what I got from envoy's logs:

{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.48.6","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","add_healthy_event":{"first_check":true},"timestamp":"2024-11-28T15:20:30.249Z"}
[2024-11-28 15:20:30.265][1][info][config] [source/server/listener_manager_impl.cc:841] all dependencies initialized. starting workers
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","add_healthy_event":{"first_check":true},"timestamp":"2024-11-28T15:20:30.719Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.16.27","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","add_healthy_event":{"first_check":true},"timestamp":"2024-11-28T15:20:30.872Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.12","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","add_healthy_event":{"first_check":true},"timestamp":"2024-11-28T15:20:30.944Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.16.6","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","add_healthy_event":{"first_check":true},"timestamp":"2024-11-28T15:20:31.200Z"}

It looks good so far! I'll leave it running for a couple of days before hitting production to see if anything weird comes up.

Tested staging with T344324#9826584 (used previously for other Tegola work) and it seems to be working nicely.

After a chat with Janis, this may be a good test:

  1. On one maps replica eqiad node, we execute: iptables -A INPUT -s 10.64.75.250 -p tcp --destination-port 5432 -j DROP
  2. We check the logs of Tegola's envoy container in staging to verify that the replica is correctly removed from service (see the log-filtering sketch after this list).
  3. We execute iptables -D INPUT -s 10.64.75.250 -p tcp --destination-port 5432 -j DROP to remove the rule.
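
For step 2, something like this should surface the relevant events from the proxy logs (pod and container names are assumptions, adjust to the actual staging deployment):

  kubectl -n tegola logs <tegola-pod> -c <envoy-container> \
    | jq -c 'select(.health_check_failure_event or .eject_unhealthy_event or .add_healthy_event)'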

Before the test I checked the envoy logs and found that we already had a slow replica that had been ejected and re-added:

# First failures + eject.
# Timings seem to confirm the config: the first 5s timeout completed at
# 17:44:46.7, and each following probe started after a 1s interval and timed
# out 5s later (17:44:52.7, 17:44:58.7), i.e. events 6s apart; on the third
# consecutive timeout maps1006 was ejected.
#
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:44:46.738Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:44:52.739Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","eject_unhealthy_event":{"failure_type":"NETWORK_TIMEOUT"},"timestamp":"2024-11-28T17:44:58.739Z"}

# Some other failures, not leading to eject
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:44:58.739Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:45:04.741Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:45:10.742Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","add_healthy_event":{"first_check":false},"timestamp":"2024-11-28T17:45:15.745Z"}

# Another round of 3 timeouts in a row, leading to eject
#
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:56:28.055Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:56:34.056Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","eject_unhealthy_event":{"failure_type":"NETWORK_TIMEOUT"},"timestamp":"2024-11-28T17:56:40.056Z"}

# More failures; the host is not added back until the add_healthy_event
#
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:56:40.056Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:56:46.058Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:56:52.057Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:56:58.057Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:57:04.056Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:57:10.056Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:57:16.056Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:57:22.056Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:57:28.055Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:57:34.056Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:57:40.057Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","add_healthy_event":{"first_check":false},"timestamp":"2024-11-28T17:57:45.058Z"}

Change #1099649 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: add tcp health checks to Tegola's eqiad/codfw configs

https://gerrit.wikimedia.org/r/1099649

Change #1099649 merged by Elukey:

[operations/deployment-charts@master] services: add tcp health checks to Tegola's eqiad/codfw configs

https://gerrit.wikimedia.org/r/1099649

Deployed to Tegola production, and I think everything looks good. I didn't find metrics to add to the Grafana dashboard, but we'll see in the future. The setup is now safer in my opinion, and we can re-use it for kartotherian.
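
One option for the dashboard gap: envoy exposes per-cluster health-check counters (cluster.<name>.health_check.attempt/.success/.failure) and membership gauges (membership_healthy, membership_total) on its admin interface, so we could check whether those are scraped. A quick manual look, assuming the usual admin port:

  curl -s http://localhost:9901/stats | grep -E 'maps_postgres\.(health_check|membership)'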

Not sure if we want to keep this task open for a long-term refactor of the postgres cluster (after Kartotherian runs on k8s).

> Not sure if we want to keep this task open for a long-term refactor of the postgres cluster (after Kartotherian runs on k8s).

I think that's a good idea. I would just change "maps" to "OSM", because those machines will be entirely dedicated to serving an OSM replica for our map data source.

We should also explore whether we actually need all the machines we are currently using; we could probably get by with just 2 or 3 per DC.