
Create a dedicated postgresql+postgis cluster for maps
Open, Needs Triage, Public

Description

Currently, postgresql+postgis is implemented on bare metal, with each maps application server hosting its own postgres replica. Once kartotherian is fully containerized, this deployment won't make sense any more, and postgres should be moved into a dedicated cluster behind a read-write and a read-only, load-balanced discovery URL.
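
To make the target topology concrete, here is a sketch of what the two discovery endpoints could look like (the names are hypothetical, nothing has been decided yet):

  # Hypothetical discovery endpoints; names are illustrative only.
  maps-db-rw.discovery.wmnet:5432   # read-write, points at the primary
  maps-db-ro.discovery.wmnet:5432   # read-only, load-balanced across the replicas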

Event Timeline

@awight, on Tegola (which is running on k8s) we already have envoy doing the load balancing; details can be found at tegola-vector-tiles. I reckon it makes sense to use the same solution for kartotherian, do you agree?

@jijiki I am looking into the same problem in T378944, and I have a doubt: what happens if one of the maps nodes goes down due to hardware failure or maintenance? Is the envoy TCP load balancer going to remove it from rotation, or will it keep erroring out periodically until the node is finally depooled? I was wondering whether creating something like maps-db.discovery.wmnet:5432 with LVS could be a more long-term solution (adding only the read replicas to the pool). What do you think?

My understanding is that, by default, the envoy LB configuration does not do any active probing of the TCP proxy's endpoint set. I am wondering if we should expand the mesh's tcp proxy config with health checks, as described in:

https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/health_checking#per-cluster-member-health-check-config

From https://www.envoyproxy.io/docs/envoy/v1.23.12/api-v3/config/cluster/v3/cluster.proto:

If no configuration is specified no health checking will be done and all cluster members will be considered healthy at all times.
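
For reference, a connect-only TCP health check on an envoy cluster would look roughly like the sketch below. The cluster name matches the one in the logs later in this task; the timeout/interval/threshold values are assumptions chosen to illustrate the knobs, and the real config is templated by the mesh.configuration module:

  # Envoy cluster with a connect-only TCP health check (illustrative values).
  clusters:
    - name: maps_postgres
      type: STRICT_DNS
      connect_timeout: 1s
      health_checks:
        - timeout: 5s              # fail the probe if the connect takes >5s
          interval: 1s             # wait 1s between probes
          unhealthy_threshold: 3   # eject after 3 consecutive failures
          healthy_threshold: 1     # re-add after 1 successful connect
          tcp_health_check: {}     # no send/receive payload = plain TCP connect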

Change #1098512 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] modules: add health checks to the mesh's _tcp_cluster config

https://gerrit.wikimedia.org/r/1098512

Change #1098511 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] modules: add mesh.configuration 1.11.0

https://gerrit.wikimedia.org/r/1098511

Change #1098530 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] charts: update tegola-vector-tiles to mesh.configuration:1.11.0

https://gerrit.wikimedia.org/r/1098530

Change #1098531 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: add health checks to Tegola's postgres TCP proxy

https://gerrit.wikimedia.org/r/1098531

> @awight, on Tegola (which is running on k8s) we already have envoy doing the load balancing; details can be found at tegola-vector-tiles. I reckon it makes sense to use the same solution for kartotherian, do you agree?

To keep archives happy - we are inclined to keep using the tcp-proxy set in k8s for tegola, since the kartotherian use case is the same. I filed some patches to add some extra functionality like health checks of maps replicas, that I'll try to rollout for Tegola first and then we'll reuse everything for Kartotherian.

Change #1098511 merged by jenkins-bot:

[operations/deployment-charts@master] modules: add mesh.configuration 1.11.0

https://gerrit.wikimedia.org/r/1098511

Change #1098512 merged by jenkins-bot:

[operations/deployment-charts@master] modules: add health checks to the mesh's _tcp_cluster config

https://gerrit.wikimedia.org/r/1098512

Change #1098530 merged by Elukey:

[operations/deployment-charts@master] charts: update tegola-vector-tiles to mesh.configuration:1.11.0

https://gerrit.wikimedia.org/r/1098530

Change #1098531 merged by Elukey:

[operations/deployment-charts@master] services: add health checks to Tegola's postgres TCP proxy

https://gerrit.wikimedia.org/r/1098531

Deployed to Tegola staging; this is what I got from envoy's logs:

{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.48.6","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","add_healthy_event":{"first_check":true},"timestamp":"2024-11-28T15:20:30.249Z"}
[2024-11-28 15:20:30.265][1][info][config] [source/server/listener_manager_impl.cc:841] all dependencies initialized. starting workers
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","add_healthy_event":{"first_check":true},"timestamp":"2024-11-28T15:20:30.719Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.16.27","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","add_healthy_event":{"first_check":true},"timestamp":"2024-11-28T15:20:30.872Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.12","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","add_healthy_event":{"first_check":true},"timestamp":"2024-11-28T15:20:30.944Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.16.6","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","add_healthy_event":{"first_check":true},"timestamp":"2024-11-28T15:20:31.200Z"}

It looks good so far! I'll leave it running for a couple of days before hitting production to see if anything weird comes up.

Tested staging with T344324#9826584 (used previously for other Tegola work) and it seems to be working nicely.

After a chat with Janis, this may be a good test:

  1. On one maps replica eqiad node, we execute: iptables -A INPUT -s 10.64.75.250 -p tcp --destination-port 5432 -j DROP
  2. We check the logs of Tegola's envoy container in staging to verify that the replica is correctly removed from service (see the log-filtering sketch after this list).
  3. We execute iptables -D INPUT -s 10.64.75.250 -p tcp --destination-port 5432 -j DROP to remove the rule.
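
For step 2, something like this should surface the relevant events from the proxy logs (pod and container names are assumptions, adjust to the actual staging deployment):

  kubectl -n tegola logs <tegola-pod> -c <envoy-container> \
    | jq -c 'select(.health_check_failure_event or .eject_unhealthy_event or .add_healthy_event)'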

Before the test I checked the envoy logs and found that we already had a slow replica that had been ejected and re-added:

# First failures + eject.
# Timings seem to confirm the config: the first 5s timeout completed at
# 17:44:46.7, and each following probe started after a 1s interval and timed
# out 5s later (17:44:52.7, 17:44:58.7), i.e. events 6s apart; on the third
# consecutive timeout maps1006 was ejected.
#
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:44:46.738Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:44:52.739Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","eject_unhealthy_event":{"failure_type":"NETWORK_TIMEOUT"},"timestamp":"2024-11-28T17:44:58.739Z"}

# Some other failures, not leading to eject
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:44:58.739Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:45:04.741Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:45:10.742Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","add_healthy_event":{"first_check":false},"timestamp":"2024-11-28T17:45:15.745Z"}

# Another round of 3 timeouts in a row, leading to eject
#
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:56:28.055Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:56:34.056Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","eject_unhealthy_event":{"failure_type":"NETWORK_TIMEOUT"},"timestamp":"2024-11-28T17:56:40.056Z"}

# More failures; the host is not added back until the add_healthy_event
#
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:56:40.056Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:56:46.058Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:56:52.057Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:56:58.057Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:57:04.056Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:57:10.056Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:57:16.056Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:57:22.056Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:57:28.055Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:57:34.056Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","health_check_failure_event":{"failure_type":"NETWORK_TIMEOUT","first_check":false},"timestamp":"2024-11-28T17:57:40.057Z"}
{"health_checker_type":"TCP","host":{"socket_address":{"protocol":"TCP","address":"10.64.0.18","resolver_name":"","ipv4_compat":false,"port_value":5432}},"cluster_name":"maps_postgres","add_healthy_event":{"first_check":false},"timestamp":"2024-11-28T17:57:45.058Z"}

Change #1099649 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: add tcp health checks to Tegola's eqiad/codfw configs

https://gerrit.wikimedia.org/r/1099649

Change #1099649 merged by Elukey:

[operations/deployment-charts@master] services: add tcp health checks to Tegola's eqiad/codfw configs

https://gerrit.wikimedia.org/r/1099649

Deployed to Tegola production, and I think everything looks good. I didn't find metrics to add to the Grafana dashboard, but we'll see in the future. The setup is now safer in my opinion, and we can re-use it for kartotherian.
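
One option for the dashboard gap: envoy exposes per-cluster health-check counters (cluster.<name>.health_check.attempt/.success/.failure) and membership gauges (membership_healthy, membership_total) on its admin interface, so we could check whether those are scraped. A quick manual look, assuming the usual admin port:

  curl -s http://localhost:9901/stats | grep -E 'maps_postgres\.(health_check|membership)'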

Not sure if we want to keep this task open for a long-term refactor of the postgres cluster (after Kartotherian runs on k8s).

> Not sure if we want to keep this task open for a long-term refactor of the postgres cluster (after Kartotherian runs on k8s).

I think that's a good idea. I would just change "maps" to "OSM", because those machines will be entirely dedicated to serving an OSM replica for our map data source.

We should also explore whether we actually need all the machines we are currently using; we could probably get by with just 2 or 3 per DC.