Create separate k8s cluster for admin-owned applications
Open, Medium, Public

Description

Infrastructure apps required to run the Toolforge/Cloud VPS platform should not co-exist with user-created applications.

Dogfooding our infrastructure is a noble concept, but it has downsides: it requires more complex resource isolation between workloads, which is hard to achieve at the moment.

Currently, user-owned and admin-owned apps share the same cluster, which creates situations like the following:

Subject: ** PROBLEM alert - Toolforge/Toolforge Home Page is CRITICAL **
Date: Sat, 23 Mar 2019 18:08:50 +0000
From: shinken <shinken@shinken-02.shinken.eqiad.wmflabs>
To: gtirloni@wikimedia.org

Notification Type: PROBLEM

Service: Toolforge Home Page
Host: Toolforge
Address: tools.wmflabs.org
State: CRITICAL

Date/Time: Sat 23 Mar 18:08:49 UTC 2019

Notes URLs: 
Additional Info:

CRITICAL - Socket timeout after 10 seconds

This happens because a tool is running on the same node as tools.admin and is using all of that node's CPU resources. During that time, users cannot access the Toolforge home page, which could prevent them from seeking help or checking the status of the system.

Separating workloads like these is also common practice in many places, including at WMF (WMCS and Production servers).

List of admin-owned apps:

  • Toolforge Home Page
  • ...