
New Service Request geoshapes
Open, Medium, Public

Description

Description: Service to generate geometric shapes from OSM PostgreSQL data and WDQS queries
Timeline: Q4
Diagram: Data flow diagram, Maps backend diagram, and Maps deployment diagram
Technologies: nodejs
Point person: @MSantos

Background

As part of T263854: [Maps] Modernize Vector Tile Infrastructure, the geoshapes service will be extracted as a standalone service.

The service currently runs on bare metal on maps20xx.codfw.wmnet and maps10xx.eqiad.wmnet, and geoshapes accesses the PostgreSQL DB available on those hosts.

Acceptance Criteria
  • Extract geoshapes into its own service/repo
  • Enable PG connections from k8s cluster to maps clusters
  • Enabling the deployment-pipeline to generate the OCI (docker) container T302967
    • Creating the helm chart itself in deployment-charts
      • Benchmark in a local env (if possible; we want coarse data, since we'll have to fine-tune under real traffic anyway)
  • Submit the helmfile.d/services stanzas for review and get them merged
  • Creation of k8s namespaces/token (SRE side, open up a task and we will get it done)
    • Do the actual deployment
    • Set up LVS, DNS and discovery (that's strictly on SRE side)
    • Set up the traffic layer to send traffic to the service
    • Acceptance tests
  • Set up grafana dashboards

Event Timeline

Thanks for this task!

So I've studied the diagrams a bit; they are helpful.

The deployment pipeline definitely supports nodejs (service-runner, in fact) apps. As soon as the code is split into its own repo, we can enable the pipeline on it and get OCI (docker) images. After that we can cooperate on the helm chart creation (I don't expect surprises there, we've done this before).
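For reference, enabling the pipeline usually means adding a `.pipeline/` config to the new repo. A rough sketch of what the Blubber config could look like (the base image name, variant names, and entrypoint below are illustrative, not the actual geoshapes values; exact keys should be checked against the current Blubber docs):

```yaml
# .pipeline/blubber.yaml (sketch; all values illustrative)
version: v4
base: docker-registry.wikimedia.org/nodejs-slim
lives:
  in: /srv/service
variants:
  build:
    # Install dependencies from the lockfile in the build variant.
    node:
      requirements: [package.json, package-lock.json]
  production:
    # Copy the built tree into a lean production image.
    copies: [build]
    node:
      env: production
    entrypoint: [node, server.js]
```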

Regarding the question:

How can this new service be moved to the k8s and still connect to the PostgreSQL DBs in maps clusters?

As long as the PostgreSQL DB is exposed on a TCP port on one of the nodes of the cluster, we can just connect to it; we do the same thing from MediaWiki to the MySQL DBs. However, if we want high availability, things quickly become more complicated, as we will have to abstract away the current status quo. IIRC (correct me please if I am out of sync), there are N PostgreSQL DBs: one read-write main and N-1 read-only replicas, with each node talking to its local DB. I don't think we currently have much expertise in this (Postgres isn't really well supported in WMF), so we will have to figure out what to do here (MediaWiki handles this internally). Connection parameters (endpoint, port, user, db, password) can be supplied to the software via environment variables or specified in a config file; both are OK.

Note that there is one interesting connection in the diagrams that we will need to support specifically, and that is talking to WDQS. We want to make sure we use the internal endpoint of the service (that is, wdqs-internal.discovery.wmnet) for maintainability purposes (e.g. easy depooling of a DC) and separation of concerns.
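A small sketch of what pointing SPARQL queries at the internal endpoint could look like; the helper name, header values, and query are illustrative, only the wdqs-internal.discovery.wmnet hostname comes from the discussion above:

```javascript
// Sketch: build an HTTP request against the internal WDQS endpoint
// rather than the public query.wikidata.org. Helper and headers are
// illustrative, not the service's actual code.
const WDQS_INTERNAL = 'https://wdqs-internal.discovery.wmnet/sparql';

function buildSparqlRequest(sparql, endpoint = WDQS_INTERNAL) {
  const url = new URL(endpoint);
  url.searchParams.set('format', 'json');
  return {
    url: url.toString(),
    method: 'POST',
    headers: {
      'Content-Type': 'application/x-www-form-urlencoded',
      // Identify the calling service; WMF endpoints expect a UA.
      'User-Agent': 'geoshapes (example)',
    },
    body: 'query=' + encodeURIComponent(sparql),
  };
}
```

Keeping the endpoint in one constant (or, better, in config) also makes the depooling scenario above a one-line change.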

Thanks for the Q4 timeline, it's pretty useful.

@akosiaris and @jijiki how can we move forward with this?

For context:


Hi @MSantos. @jijiki isn't available currently, but I'll do my best to help.

Nice work on extracting geoshapes into its own repo and service; that well-structured README is a breath of fresh air.

Next steps would indeed be (I am using indentation for dependencies here, some things can happen in parallel):

  • Enabling the deployment-pipeline to generate the OCI (docker) container
    • Creating the helm chart itself in deployment-charts
      • Benchmark in a local env (if possible; we want coarse data, since we'll have to fine-tune under real traffic anyway)
  • Submit the helmfile.d/services stanzas for review and get them merged
  • Creation of k8s namespaces/token (SRE side, open up a task and we will get it done)
    • Do the actual deployment
      • Set up LVS, DNS and discovery (that's strictly on SRE side)
        • Set up the traffic layer to send traffic to the service (if needed). This is a bit unclear to me currently. I am not sure from the diagrams whether the user's browser will need to talk to geoshapes (via the edge traffic layers) or kartotherian will talk to geoshapes, or both.
          • Acceptance tests
  • Set up grafana dashboards
  • Party?

> Set up the traffic layer to send traffic to the service (if needed). This is a bit unclear to me currently. I am not sure from the diagrams whether the user's browser will need to talk to geoshapes (via the edge traffic layers) or kartotherian will talk to geoshapes, or both.

With the extraction, kartotherian and geoshapes shouldn't be related anymore. So we should set up the traffic layer to send geoshapes' endpoint traffic to the new service.

> With the extraction, kartotherian and geoshapes shouldn't be related anymore. So we should set up the traffic layer to send geoshapes' endpoint traffic to the new service.

So maps.wikimedia.org/geoshape specifically should be routed to the new service? Or can we create a new DNS entry for this, e.g. geoshapes.wikimedia.org? The former isn't as easy as the latter and might not even be desirable on Traffic's side, hence my question.

> So maps.wikimedia.org/geoshape specifically should be routed to the new service? Or can we create a new DNS entry for this, e.g. geoshapes.wikimedia.org? The former isn't as easy as the latter and might not even be desirable on Traffic's side, hence my question.

We can go with the new DNS route for geoshapes; it may even be better on the application side, allowing a proper switch with the possibility of rollback.

> We can go with the new DNS route for geoshapes; it may even be better on the application side, allowing a proper switch with the possibility of rollback.

Perfect! Thanks for accommodating this! And yes, I have a similar opinion. I expect that being able to change a config in the app to switch the hostname of the geoshapes endpoint will make the transition (and rollback, if needed) easier, and put it squarely in the hands of your team instead of needing cross-team coordination to roll back edge cache configuration.
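The config-driven switch described above could look roughly like this; the flag and config key names are hypothetical, only the two hostnames come from the thread:

```javascript
// Sketch: select the geoshapes base URL from configuration so that
// switching (and rolling back) is a config change in the app, not an
// edge cache change. Flag and key names are illustrative.
function geoshapesBaseUrl(config) {
  if (config.useStandaloneGeoshapes) {
    // New standalone service behind its own DNS entry.
    const host = config.geoshapesHost || 'geoshapes.wikimedia.org';
    return `https://${host}/`;
  }
  // Old path served by the maps cluster; this branch is the rollback.
  return 'https://maps.wikimedia.org/geoshape';
}
```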

@akosiaris the initial geoshapes deployment-charts patch is created and ready to move forward: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/768678

I was able to benchmark it locally with Minikube, but the metrics seem to be bogus; I'll paste them here anyway so we can discuss how to move forward:

Results:
For some reason, the Minikube metrics-server was returning bogus metrics for CPU (usage equal to 0 even though the pod requests at least 1 core).

Screen Shot 2022-03-16 at 12.43.42 PM.png (560×2 px, 494 KB)

With that, I believe the memory benchmark isn't accurate either, but here is the data collected: 812Mi
Screen Shot 2022-03-16 at 12.44.14 PM.png (502×1 px, 259 KB)

The created service pod nonetheless stabilized at around ~74 req/s.

total_requests_per_second_1647595314.png (350×1 px, 47 KB)

This is a good sign, since the current production service handles ~20 req/s across both clusters.

Screen Shot 2022-03-18 at 10.23.34 AM.png (508×1 px, 212 KB)
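Even treating these numbers as coarse, they suggest starting values for the chart's container resources, to be tuned under real traffic. The figures below are illustrative only, loosely derived from the 1-core request and the ~812Mi observation above, not agreed-upon limits:

```yaml
# Sketch of per-container resources for the geoshapes chart values
# (illustrative starting point, not measured or reviewed limits):
resources:
  requests:
    cpu: 1
    memory: 850Mi
  limits:
    cpu: 2
    memory: 1200Mi
```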

Now, how can we proceed?

@MSantos, my current understanding is that we are pausing work on this. Should we set it to Stalled?

FWIW, there has been parallel work in T216826: Move Kartotherian to Kubernetes to containerize the whole kartotherian service, which currently includes the geoshapes code. A draft helm chart is ready for review, and it could serve just geoshapes, or just tiles, with minor tweaks.