
Determining the plan for the maps-test cluster
Closed, ResolvedPublic

Description

The maps-test cluster is running on old hardware that needs to be replaced (it has reached its end of life). We have 2 options:

  1. replace the hardware, re-image the servers, keep things as they are with regard to map data and configuration
  2. destroy the maps-test cluster, create a brand new test cluster on our cloud infrastructure

In more detail:

replace hardware in current cluster

  • This is fairly easy and has a hardware cost attached to it, but not much cost in terms of the limited engineering resources of the Maps team.
  • Having a maps test cluster using real hardware and in the production zone is a unicorn, as almost all other applications run tests on cloud (wmflabs) infrastructure.

Note: the hardware cost for updating the test cluster has already been budgeted for.

create a new test cluster on cloud (wmflabs)

  • Almost all of our applications have test environments on cloud, which allows for more experimentation, is isolated from production, and uses fewer physical resources.
  • The current maximum disk size we can get on cloud does not allow us to load the full OSM dataset, and cloud provides lower performance than dedicated hardware.

additional information:

  • Lower performance is not really an issue for us, as the test cluster will see much less traffic than production (obviously).
  • We will not be able to run performance tests representative of production, but this isn't an issue for any other application.
  • We deploy small / incremental enough changes that we should be able to spot issues fast enough on production.
  • Not having a full dataset means that we need to work with different OSM dumps, which might expose slightly different behaviours.
  • We won't be able to test map styles as effectively as before if we move the maps test cluster to cloud, as different regions of the globe expose different mapping characteristics.
  • Moving to cloud probably requires some changes (application code, puppet, ...), but should not be too hard. My (@Gehel) time is limited, but maybe @Pnorman can help.
  • Moving to cloud could run into additional unforeseen issues that we don't have the engineering resources to 'fix'.

Event Timeline

Gehel created this task. Jul 26 2017, 2:29 PM
Restricted Application added a project: Discovery. Jul 26 2017, 2:29 PM
Restricted Application added a subscriber: Aklapper.
debt triaged this task as High priority. Jul 26 2017, 2:33 PM
debt renamed this task from What is the plan for the maps-test cluster to Determining the plan for the maps-test cluster. Jul 26 2017, 2:46 PM
debt updated the task description.
debt added a subscriber: EBjune. Edited Jul 27 2017, 7:05 PM

The current thinking is that we'll want to move to wmflabs/test...largest instance: 8CPU, 16GB RAM and 160GB disk

Gehel moved this task from Backlog to Needs review on the Maps-Sprint board. Jul 28 2017, 10:22 AM

After some discussions with the maps team, we think that we should move forward and relocate the maps test cluster to labs. We will need to work with only a subset of the full data, but we will be able to give @Pnorman full access to that cluster, which will help a lot in getting him more involved in the low-level inner workings of maps.

chasemp added a subscriber: chasemp. Aug 7 2017, 8:00 PM

Let's chat folks! We can make special disk arrangements potentially and maybe work something out. I'm not sure how much is already budgeted here that we would reallocate for a "Cloud" solution but there are options.

Gehel added a comment. Aug 7 2017, 8:18 PM

We definitely don't need to have the same sizing as the current maps-test cluster (we expect a lot less traffic than production, and we can do with a subset of data). What we want to do is replicate the functionality.

Nothing final, but a few ideas (@Pnorman should have a look at them and tell me where I'm completely wrong).

We can split the current monolithic deployment into multiple components to make use of smaller, easier-to-size VMs:

  • postgresql master + slave -> fairly large disk (depending on the size of the subset we want to use)
  • 2x cassandra nodes -> fairly large disk (depending on the size of the subset we want to use)
  • 2x nodejs (tilerator + kartotherian) -> fairly small VMs, probably single CPU, needing very little disk and RAM
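
To make the proposed split easier to reason about, here is a minimal sketch of the layout above as data. The role names, the `VmRole` type and the "large"/"small" labels are illustrative only and are not taken from puppet or any existing config; the counts and notes come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmRole:
    """One kind of VM in the proposed test cluster (hypothetical structure)."""
    name: str   # illustrative label, not an actual puppet role name
    count: int  # number of VMs of this kind
    disk: str   # qualitative size only: "large" or "small"
    notes: str


# Counts and notes mirror the list above; nothing here is final.
PROPOSED_LAYOUT = [
    VmRole("postgresql", 2, "large", "master + slave; disk depends on the data subset"),
    VmRole("cassandra", 2, "large", "disk depends on the data subset"),
    VmRole("nodejs", 2, "small", "tilerator + kartotherian; probably single CPU"),
]


def total_vms(layout):
    """Total number of VMs across all roles."""
    return sum(role.count for role in layout)


def roles_needing_large_disk(layout):
    """Names of the roles whose sizing depends on the chosen data subset."""
    return [role.name for role in layout if role.disk == "large"]


if __name__ == "__main__":
    print("VMs needed:", total_vms(PROPOSED_LAYOUT))
    print("Disk-heavy roles:", roles_needing_large_disk(PROPOSED_LAYOUT))
```

Only the postgresql and cassandra VMs need their disk sized against the data subset, which is the main point of splitting things up.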

bd808 added a subscriber: bd808. Aug 7 2017, 8:29 PM

Things that the cloud-services-team would like to know:

  • Is there any FY17/18 hardware budget already reserved for refreshing the current prod machines?
  • What disk and I/O needs will your postgres and cassandra nodes have?
    • Creating custom sized images is possible; we just need to understand how the resources will be used.
  • Will this project provide services to the Beta Cluster and/or the general Wikimedia technical community via internal access in Cloud VPS/Toolforge?
  • Will this project be able to replace any or all of the services currently provided by the maps Cloud VPS project?

MaxSem added a subscriber: MaxSem. Aug 7 2017, 9:55 PM

  • Will this project be able to replace any or all of the services currently provided by the maps Cloud VPS project?

People using services provided by the maps project are encouraged to use the production tileserver instead.

What disk and I/O needs will your postgres and cassandra nodes have?

To load the full planet I'd generally specify 1TB of SSD storage. For an extract, it depends on how big. I'd guess 50-100GB with reasonable performance (>1k iops)

On a test cluster the load will be erratic, and won't need the parallel power of production.

Gehel added a comment. Aug 8 2017, 8:52 AM

Things that the cloud-services-team would like to know:

  • Is there any FY17/18 hardware budget already reserved for refreshing the current prod machines?

Yes, there is, but I'm not sure how much that represents.

  • What disk and I/O needs will your postgres and cassandra nodes have?
    • Creating custom sized images is possible; we just need to understand how the resources will be used.

To load the full planet I'd generally specify 1TB of SSD storage. For an extract, it depends on how big. I'd guess 50-100GB with reasonable performance (>1k iops)

So for the full dataset, it would probably be 2x 1TB for postgres and 2x 500GB for Cassandra. If it is easy to get that on a VM, great! If not, we can work with a subset of data, at least until the maps team grows again.
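
As a rough sanity check on those numbers, the arithmetic can be sketched as below. This assumes 1TB = 1000GB and uses the 160GB largest-instance disk mentioned earlier in this task as the limit; the constant and function names are made up for illustration.

```python
GB_PER_TB = 1000  # assumption: decimal units

# Largest standard instance disk mentioned earlier in this task.
LARGEST_STANDARD_DISK_GB = 160

# Per-node disk for the full dataset, as estimated above.
FULL_DATASET_GB = {
    "postgres": [1 * GB_PER_TB, 1 * GB_PER_TB],
    "cassandra": [500, 500],
}

# Guess quoted above for an extract (low and high end).
EXTRACT_GB_RANGE = (50, 100)


def total_gb(disks):
    """Sum of all per-node disk sizes, in GB."""
    return sum(sum(sizes) for sizes in disks.values())


def nodes_over_limit(disks, limit_gb):
    """Number of nodes per service whose disk need exceeds limit_gb."""
    return {svc: sum(1 for s in sizes if s > limit_gb) for svc, sizes in disks.items()}


def fits(size_gb, limit_gb):
    """True if a disk of size_gb fits within limit_gb."""
    return size_gb <= limit_gb


if __name__ == "__main__":
    print("Full dataset total (GB):", total_gb(FULL_DATASET_GB))
    print("Nodes over the standard limit:",
          nodes_over_limit(FULL_DATASET_GB, LARGEST_STANDARD_DISK_GB))
    print("High-end extract fits on a standard disk:",
          fits(EXTRACT_GB_RANGE[1], LARGEST_STANDARD_DISK_GB))
```

In other words, every node of a full-dataset setup would need a custom-sized disk, while an extract in the guessed range would fit on a standard instance.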

  • Will this project provide services to the Beta Cluster and/or the general Wikimedia technical community via internal access in Cloud VPS/Toolforge?

As @MaxSem said, no, this is expected to be used to test the maps service itself, not to provide a stable service that could be reused by others. Production is available for that.

  • Will this project be able to replace any or all of the services currently provided by the maps Cloud VPS project?

That project is not related to the Kartotherian / Tilerator project. The maps-team project is (not sure why the naming confusion, probably historical reasons). In that project, the following VMs will be decommissioned:

  • maps-cleartables.maps-team.eqiad.wmflabs

We might be able to decommission some of the other VMs in that project, but I'm not entirely sure what they are here for. So this will need some digging.

debt moved this task from Needs review to Done on the Maps-Sprint board. Aug 24 2017, 7:05 PM
debt closed this task as Resolved. Aug 31 2017, 7:07 PM
debt claimed this task.

Plan has been established and work is progressing.