
Site: eqiad : 2 VMs requested for DSE Kubernetes Cluster control plane servers
Open, Needs Triage, Public

Description

Cloud VPS Project Tested: n/a
Site/Location: eqiad
Number of systems: 2
Service: dse-k8s-ctrl
Networking Requirements: internal IP - no special requirements
Processor Requirements: 2 vCPUs
Memory: 4 GB
Disks: 20 GB
Other Requirements: none
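
For illustration, the fields above map fairly directly onto the sre.ganeti.makevm invocation quoted further down in this task. The following is only a sketch: the helper name makevm_command and its parameter names are hypothetical, and it uses only the flags that appear in that quoted command. The hostname and row (dse-k8s-ctrl1001, eqiad_C) are the ones from the quoted command; the second VM's name and row are not stated in this task, so they are left out.

```python
import shlex


def makevm_command(hostname, location, vcpus, memory_gb, disk_gb, network="private"):
    """Build the argv for one sre.ganeti.makevm run.

    Hypothetical helper: it only assembles the flags seen in the command
    quoted in this task and does not execute anything.
    """
    return [
        "sudo", "cookbook", "sre.ganeti.makevm",
        "--vcpus", str(vcpus),
        "--memory", str(memory_gb),
        "--disk", str(disk_gb),
        "--network", network,
        location,   # Ganeti cluster and row, e.g. eqiad_C
        hostname,   # short hostname, without the domain
    ]


# Values taken from the request fields: 2 vCPUs, 4 GB memory, 20 GB disk.
request = {"vcpus": 2, "memory_gb": 4, "disk_gb": 20}
cmd = makevm_command("dse-k8s-ctrl1001", "eqiad_C", **request)
print(shlex.join(cmd))
```

Printing the command instead of running it keeps the sketch safe to try anywhere; the cookbook itself only exists on the cumin hosts.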

This request is for a pair of servers to host the control plane for a new Kubernetes Cluster.

The design document for this new cluster is here.

Event Timeline

FYI: The design document isn't accessible, and from the tickets alone it's unclear what this is about.

Thanks @Dzahn - I'm seeking to relax the permissions on the document, but I've added you specifically for now. There's no reason it can't be public, as far as I'm aware.

In summary, the machine learning and data engineering teams are looking to build a new Kubernetes cluster in eqiad.
This cluster is intended to run Kubeflow for ML and a number of other user-generated workloads like Spark.
The prefix dse indicates the ultimate aim of making this a shared computing resource for the data science and engineering teams, although we're only taking small steps in that direction at the moment.

We've created a project in Phabricator: DSE-Kubernetes-Cluster with a workboard and have begun to populate the backlog with tickets corresponding to the steps outlined here: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New

We're also working with serviceops to try to ensure that we follow best practices regarding cluster configuration.

Thank you @BTullis for all the details. Now I know what DSE means. If the doc could be public, even better. The project description at https://phabricator.wikimedia.org/project/profile/5959/ is also helpful for the casual observer, though.

dzahn@cumin1001:~$ sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 4 --disk 20 --network private eqiad_C dse-k8s-ctrl1001
Ready to create Ganeti VM dse-k8s-ctrl1001.eqiad.wmnet in the ganeti01.svc.eqiad.wmnet cluster on row C with 2 vCPUs, 4GB of RAM, 20GB of disk in the private network.

Allocated IPv4 10.64.32.177/22
Set DNS name of IP 10.64.32.177/22 to dse-k8s-ctrl1001.eqiad.wmnet
Allocated IPv6 2620:0:861:103:10:64:32:177/64 with DNS name dse-k8s-ctrl1001.eqiad.wmnet
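
As an aside, the IPv6 address in the output above appears to embed the IPv4 octets as its last four groups (10.64.32.177 becomes ...:10:64:32:177). The sketch below reproduces that pattern under stated assumptions: the function name mapped_ipv6 is hypothetical, the /64 prefix is read off the single log line above, and the cookbook's actual allocation logic is not shown in this task.

```python
import ipaddress


def mapped_ipv6(ipv4, v6_prefix):
    """Embed the decimal IPv4 octets as the last four IPv6 groups.

    Hypothetical helper inferred from one log line; not the cookbook's code.
    """
    ipaddress.IPv4Address(ipv4)  # raises ValueError on malformed input
    net = ipaddress.IPv6Network(v6_prefix)
    suffix = 0
    for octet in ipv4.split("."):
        # Reuse the decimal digits as a hex group: "177" -> 0x177
        suffix = (suffix << 16) | int(octet, 16)
    return ipaddress.IPv6Address(int(net.network_address) | suffix)


addr = mapped_ipv6("10.64.32.177", "2620:0:861:103::/64")
print(addr)  # 2620:0:861:103:10:64:32:177
```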

@BTullis I tried to create one for you but the cookbook failed at the DNS update step:

FAIL ...100% (14/14) [00:23<00:00,  1.68s/hosts]
100.0% (14/14) of nodes failed to execute command 'cd /srv/authdns/...nippets --deploy': authdns[1001,2001].wikimedia.org,dns[1001-1002,2001-2002,3001-3002,4001-4002,5001-5002,6001-6002].wikimedia.org
0.0% (0/14) success ratio (< 100.0% threshold) for command: 'cd /srv/authdns/...nippets --deploy'. Aborting.
0.0% (0/14) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Failed to run spicerack.remote.RemoteHosts.run_sync: Cumin execution failed (exit_code=2)
==> What do you want to do? "retry" the last command, manually fix the issue and "skip" the last command to continue the execution or completely "abort" the execution.

I picked "abort", hit the same issue when the cookbook tried to remove the DNS name, and picked "abort" one more time. So:

END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-ctrl1001.eqiad.wmnet on all recursors

https://phabricator.wikimedia.org/P30044

Ah, I think that this has now been addressed by the Infrastructure-Foundations team and a fix should be deployed on Monday.

Apologies for being vague @Dzahn - I'm happy to create these VMs myself (but equally happy if you would prefer to carry on doing so).

I was mainly seeking a thumbs-up from SRE and creating a ticket for the record.

T311290 has been named as the reason for that issue with the cookbook. Should be fixed already.