Infrastructure apps required to run the Toolforge/Cloud VPS platform should not co-exist with user-created applications.
Dogfooding our infrastructure is a noble concept, but it has downsides: chiefly, it requires more complex resource isolation between workloads, which is currently hard to achieve.
Today, user-owned and admin-owned apps share the same cluster, which creates situations like the following:
Subject: ** PROBLEM alert - Toolforge/Toolforge Home Page is CRITICAL **
Date: Sat, 23 Mar 2019 18:08:50 +0000
From: shinken <shinken@shinken-02.shinken.eqiad.wmflabs>
To: gtirloni@wikimedia.org

Notification Type: PROBLEM
Service: Toolforge Home Page
Host: Toolforge
Address: tools.wmflabs.org
State: CRITICAL
Date/Time: Sat 23 Mar 18:08:49 UTC 2019
Notes URLs:
Additional Info: CRITICAL - Socket timeout after 10 seconds
This happens because a tool is running on the same node as tools.admin and is using all of that node's CPU resources. During that time, users cannot access the Toolforge home page, which could prevent them from seeking help or checking the status of the system.
Separating workloads such as these is also common practice in many places, including at WMF (WMCS and Production servers).
List of admin-owned apps:
- Toolforge Home Page
- ...