The server at https://tools-static.wmflabs.org/ broke around 2017-09-13T13:24. Investigation showed that the nginx-related Puppet configuration had changed recently (rOPUPfb85f58). @BBlack confirmed that a more recent version of nginx was needed to be compatible with the new configuration. A simple apt-get install nginx-common fixed the server by upgrading from 1.11.3-1+wmf2 to 1.11.10-1+wmf3.
This highlights a couple of issues that we should find better ways to address:
- Toolforge and Cloud VPS broadly depend on unattended-upgrades to keep system packages up to date. This dependency can cause problems both by upgrading things that should not be upgraded (T159254) and, as was seen here, by not upgrading things that should be upgraded. It would be useful to have some standard practices and/or monitoring systems that make it easier for any Cloud VPS tenant to know when packages need upgrading because of security or required functionality changes.
- Puppet changes to shared components (apache, nginx, Puppet, HHVM, Kubernetes, etc.) which are used in WMF's main server clusters, Cloud VPS / Toolforge infrastructure, and other Cloud VPS projects could be announced better. It's unreasonable to expect all such changes to be reviewed by everyone who might be impacted, but it would be nice to find a lightweight way of making others aware of changes which could cause a loss of service if packages are not up to date or when other manual interventions are needed.
These issues are separate but, at least in my mind, related. Loss of communication was a risk identified in the process of separating the cloud-services-team from the main SRE team. I don't think that this separation is the root cause of either of these issues, but it does exacerbate any signaling gaps that existed before the split. Communication lag and tooling differences have long caused problems for some Cloud VPS tenants who are tracking upstream changes from production closely. If we can find communication methods that scale, we will be better able to serve both inter-team needs and the needs of our users.
In the case of the nginx upgrade issue, there was even intra-team confusion and/or tooling failure. A prior merge of the same patch (rOPUP1811def) on 2017-07-13 caused https://tools.wmflabs.org/ to fail. At that time nginx was manually upgraded on the tools-proxy-* and project-proxy nodes, but the tools-static-* hosts were not upgraded. This shows that there was a lack of knowledge (by me) of where the nginx deployments critical to Cloud Services operation are located, and that we do not have any functional signaling mechanism for out-of-sync package versions within Toolforge and likely other Cloud VPS projects.
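As a starting point for discussion, the kind of out-of-sync signal described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `find_version_drift`, the shape of the `inventory` mapping, and the individual hostnames are hypothetical, and how the per-host package versions would be collected (for example from `dpkg-query -W` output on each instance) is left open. It does not compare versions for ordering; it only reports packages whose installed version differs between hosts.

```python
from collections import defaultdict


def find_version_drift(inventory):
    """Return {package: {version: [hosts]}} for every package that is
    installed at more than one distinct version across the given hosts.

    `inventory` is a hypothetical mapping of host -> {package: version}.
    """
    by_package = defaultdict(lambda: defaultdict(list))
    for host, packages in inventory.items():
        for package, version in packages.items():
            by_package[package][version].append(host)

    # Keep only packages where the hosts disagree about the version.
    return {
        package: {version: sorted(hosts) for version, hosts in versions.items()}
        for package, versions in by_package.items()
        if len(versions) > 1
    }


# Illustrative data only: the hostnames are made up, and the versions are
# the two nginx-common versions mentioned in this task.
inventory = {
    "tools-proxy-01": {"nginx-common": "1.11.10-1+wmf3"},
    "tools-proxy-02": {"nginx-common": "1.11.10-1+wmf3"},
    "tools-static-10": {"nginx-common": "1.11.3-1+wmf2"},
}

for package, versions in sorted(find_version_drift(inventory).items()):
    print(package)
    for version, hosts in sorted(versions.items()):
        print(f"  {version}: {', '.join(hosts)}")
```

A report like this would not say which version is correct, only that hosts sharing a package disagree, which is the signal that was missing when the tools-static-* hosts were left behind.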
As written, this ticket is not directly actionable, but it can serve as a place to gather relevant discussion of the broad issues. We should fork off subtasks as more actionable issues are uncovered by the discussion.