Change Details

Since February 17 around 20:00Z, we're seeing a steep increase in the number of 5xx responses we get from any datacenter. Here is the graph for eqiad, where network/connectivity issues can't be the cause of such errors: {F3410964} Looking at the SAL, I just see a couple of deployments around that time. This needs investigation as some users are reporting more frequent timeouts when navigating our sites. Note that there was a second step up on February 23rd around 16:00Z, which made things even more worrying. It's important to note that this does /not/ come from upload caches so it might be Mediawiki-related. **Findings so far** - The surge comes from text caches - The surge is not tied to a specific datacenter and/or varnish backend, so we can exclude any hardware issue to a specific machine.