
Beta cluster is down
Closed, ResolvedPublic

Description

Beta cluster is down.

Screen Shot 2017-10-23 at 1.31.11 PM.png (422×1 px, 33 KB)

Screen Shot 2017-10-23 at 1.30.19 PM.png (518×1 px, 51 KB)

Cannot find any other report on this, feel free to mark it as a duplicate if there is one.

Event Timeline

Investigation is happening in the -releng IRC channel.

I've only looked at deployment-cache-text04. Puppet had been broken there for ages due to several puppet changes.

Horizon has a prefix defined for deployment-cache-text* but there were also some puppet settings local to just 04. I moved all node-specific settings to the prefix and resolved one puppet issue (caused by a rename of a role to a profile).

Now the catalog compiles, but puppet fails while setting up an LE cert; after that varnish fails to start.

Adding SRE to ask for assistance with diagnosing/resolving this.

Since varnish is down, Puppet fails to trigger the letsencrypt certificate renewal since that attempts to reach http://beta.wmflabs.org/ :/

Puppet fails with:

   Notice: tlsproxy::localssl instance unified with server name beta.wmflabs.org is the default server.
/usr/local/sbin/acme_tiny.py --account-key /etc/acme/acct/acct.key --csr /etc/acme/csr/beta_wmflabs_org.pem --acme-dir /var/acme/challenge<< failed, exit code 1, stderr:

couldn't download http://beta.wmflabs.org/.well-known/acme-challenge/XXXXX
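For anyone unfamiliar with the failure mode: the HTTP-01 challenge requires the CA to fetch a token over plain HTTP, so a dead port 80 breaks issuance entirely. A self-contained sketch of that round trip, assuming the layout implied by the acme_tiny invocation (tokens written into /var/acme/challenge, served under /.well-known/acme-challenge/); a temp dir and python3's http.server stand in for the real paths and for nginx:

```shell
# Stand-in challenge directory (real one: /var/acme/challenge, assumed).
challenge_dir=$(mktemp -d)
echo 'token-keyauth' > "$challenge_dir/test-token"

# Serve the challenge directory, as nginx does for /.well-known/acme-challenge/
(cd "$challenge_dir" && exec python3 -m http.server 8753) >/dev/null 2>&1 &
server_pid=$!
sleep 1

# The CA-side fetch; when this step fails (as in the puppet run above),
# issuance aborts with "couldn't download ...".
token=$(curl -s http://127.0.0.1:8753/test-token)
echo "$token"

kill "$server_pid"
```

If the equivalent curl against the real host fails, the cert can never renew, regardless of how healthy the rest of the stack is.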

http://beta.wmflabs.org/ is not served so I have hacked up the nginx config:

/etc/nginx/sites-enabled/unified
server {
        listen 0.0.0.0:80;
        server_name beta.wmflabs.org;
        include /etc/acme/challenge-nginx.conf;
        access_log /var/log/nginx/betahttp.log;
        error_log /var/log/nginx/betahttp.log;
}
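For context, the included /etc/acme/challenge-nginx.conf presumably contains something along these lines (contents assumed, not verified against the real file):

```
# Assumed shape: map the well-known ACME path onto the directory
# acme_tiny writes challenge tokens into.
location /.well-known/acme-challenge/ {
    alias /var/acme/challenge/;
    default_type text/plain;
}
```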

Restarted nginx. I can reach it from the internal network but not from outside:

$ curl --verbose http://beta.wmflabs.org/
*   Trying 208.80.155.135...
* TCP_NODELAY set
* connect to 208.80.155.135 port 80 failed: Connection refused
* Failed to connect to beta.wmflabs.org port 80: Connection refused
* Closing connection 0
curl: (7) Failed to connect to beta.wmflabs.org port 80: Connection refused

tcpdump on the instance shows no traffic on port 80.

root@deployment-cache-upload04 does not have any iptables rules.

The instance security group https://horizon.wikimedia.org/project/instances/64c9625a-dace-4e69-a31c-c3b5401e3366/ has ALLOW 80:80/tcp from 0.0.0.0/0

So I guess something else is blocking traffic to 208.80.155.135:80.
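That conclusion follows from TCP semantics: "Connection refused" means something answered the SYN with a RST, and since tcpdump on the instance saw no traffic at all, the RST must come from a hop in front of the instance. A silently dropping firewall (e.g. a security-group drop) would instead make curl hang until timeout. Local demo, using an arbitrary port assumed to be closed:

```shell
# No listener on 47123 (assumed), so the kernel answers with a RST and
# curl fails immediately instead of hanging.
curl -s --max-time 5 http://127.0.0.1:47123/ >/dev/null
rc=$?
echo "curl exit code: $rc"   # 7 = couldn't connect (connection refused)
```

Exit code 7 matches the curl: (7) in the trace above; a timeout would be exit code 28.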

I give up for tonight. I wake up in six hours.

@Krenair / @BBlack are looking into it. They both know about Letsencrypt/Varnish.

This comment was removed by Paladox.

sudo iptables -A INPUT -p tcp --dport 80 -j ACCEPT

Please don't. It will conflict with / be reverted by ferm, or the ferm service will be stopped, leading to more manual things to fix later. We can just add a normal ferm rule to open port 80, like we do for everything else.
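For reference, opening a port the usual way means a ferm::service resource in puppet; a sketch only, with the resource title and placement assumed:

```
# Opens TCP/80 via ferm, managed by puppet instead of ad-hoc iptables.
# Title 'varnish-http' is illustrative, not the real resource name.
ferm::service { 'varnish-http':
    proto => 'tcp',
    port  => '80',
}
```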

Between the three of us it's been brought back up.

HTTPS should now work again too. Need to commit hieradata/labs/deployment-prep/host/deployment-cache-text04.yaml on the puppetmaster:

profile::cache::base::varnish_version: 5
nginx::variant: extras
cache::lua_support: true
cluster: cache_text
cache::cluster: text
profile::cache::ssl::unified::le_subjects:
    - beta.wmflabs.org
    - www.wikimedia.beta.wmflabs.org
    - www.wikipedia.beta.wmflabs.org
    - www.wikibooks.beta.wmflabs.org
    - www.wiktionary.beta.wmflabs.org
    - commons.wikimedia.beta.wmflabs.org
    - commons.m.wikimedia.beta.wmflabs.org
    - deployment.wikimedia.beta.wmflabs.org
    - deployment.m.wikimedia.beta.wmflabs.org
    - en.wikibooks.beta.wmflabs.org
    - en.m.wikibooks.beta.wmflabs.org
    - en.wikinews.beta.wmflabs.org
    - en.m.wikinews.beta.wmflabs.org
    - en.wikiquote.beta.wmflabs.org
    - en.m.wikiquote.beta.wmflabs.org
    - en.wikisource.beta.wmflabs.org
    - en.m.wikisource.beta.wmflabs.org
    - en.wikiversity.beta.wmflabs.org
    - en.m.wikiversity.beta.wmflabs.org
    - en.wikivoyage.beta.wmflabs.org
    - en.m.wikivoyage.beta.wmflabs.org
    - en.wiktionary.beta.wmflabs.org
    - en.m.wiktionary.beta.wmflabs.org
    - login.wikimedia.beta.wmflabs.org
    - login.m.wikimedia.beta.wmflabs.org
    - meta.wikimedia.beta.wmflabs.org
    - meta.m.wikimedia.beta.wmflabs.org
    - test.wikimedia.beta.wmflabs.org
    - test.m.wikimedia.beta.wmflabs.org
    - wikidata.beta.wmflabs.org
    - m.wikidata.beta.wmflabs.org
    - zero.wikimedia.beta.wmflabs.org
    - zero.m.wikimedia.beta.wmflabs.org
    - aa.wikipedia.beta.wmflabs.org
    - aa.m.wikipedia.beta.wmflabs.org
    - aa.zero.wikipedia.beta.wmflabs.org
    - ar.wikipedia.beta.wmflabs.org
    - ar.m.wikipedia.beta.wmflabs.org
    - ar.zero.wikipedia.beta.wmflabs.org
    - ca.wikipedia.beta.wmflabs.org
    - ca.m.wikipedia.beta.wmflabs.org
    - ca.zero.wikipedia.beta.wmflabs.org
    - de.wikipedia.beta.wmflabs.org
    - de.m.wikipedia.beta.wmflabs.org
    - de.zero.wikipedia.beta.wmflabs.org
    - de.wiktionary.beta.wmflabs.org
    - de.m.wiktionary.beta.wmflabs.org
    - en-rtl.wikipedia.beta.wmflabs.org
    - en-rtl.m.wikipedia.beta.wmflabs.org
    - en-rtl.zero.wikipedia.beta.wmflabs.org
    - en.wikipedia.beta.wmflabs.org
    - en.m.wikipedia.beta.wmflabs.org
    - en.zero.wikipedia.beta.wmflabs.org
    - eo.wikipedia.beta.wmflabs.org
    - eo.m.wikipedia.beta.wmflabs.org
    - eo.zero.wikipedia.beta.wmflabs.org
    - es.wikipedia.beta.wmflabs.org
    - es.m.wikipedia.beta.wmflabs.org
    - es.zero.wikipedia.beta.wmflabs.org
    - fa.wikipedia.beta.wmflabs.org
    - fa.m.wikipedia.beta.wmflabs.org
    - fa.zero.wikipedia.beta.wmflabs.org
    - he.wikipedia.beta.wmflabs.org
    - he.m.wikipedia.beta.wmflabs.org
    - he.zero.wikipedia.beta.wmflabs.org
    - he.wiktionary.beta.wmflabs.org
    - he.m.wiktionary.beta.wmflabs.org
    - hi.wikipedia.beta.wmflabs.org
    - hi.m.wikipedia.beta.wmflabs.org
    - hi.zero.wikipedia.beta.wmflabs.org
    - ja.wikipedia.beta.wmflabs.org
    - ja.m.wikipedia.beta.wmflabs.org
    - ja.zero.wikipedia.beta.wmflabs.org
    - ko.wikipedia.beta.wmflabs.org
    - ko.m.wikipedia.beta.wmflabs.org
    - ko.zero.wikipedia.beta.wmflabs.org
    - nl.wikipedia.beta.wmflabs.org
    - nl.m.wikipedia.beta.wmflabs.org
    - nl.zero.wikipedia.beta.wmflabs.org
    - ru.wikipedia.beta.wmflabs.org
    - ru.m.wikipedia.beta.wmflabs.org
    - ru.zero.wikipedia.beta.wmflabs.org
    - simple.wikipedia.beta.wmflabs.org
    - simple.m.wikipedia.beta.wmflabs.org
    - simple.zero.wikipedia.beta.wmflabs.org
    - sq.wikipedia.beta.wmflabs.org
    - sq.m.wikipedia.beta.wmflabs.org
    - sq.zero.wikipedia.beta.wmflabs.org
    - uk.wikipedia.beta.wmflabs.org
    - uk.m.wikipedia.beta.wmflabs.org
    - uk.zero.wikipedia.beta.wmflabs.org
    - zh.wikipedia.beta.wmflabs.org
    - zh.m.wikipedia.beta.wmflabs.org
    - zh.zero.wikipedia.beta.wmflabs.org
    - commons.wikipedia.beta.wmflabs.org

upload.beta.wmflabs.org refuses SSL connections right now; I see that it's not on that list.
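Each le_subjects entry ends up as a SAN on the unified cert, so a hostname missing from the list simply isn't covered. One quick way to confirm that is to list the SANs with openssl; a self-contained sketch using a throwaway cert (on the real host you would point openssl at the actual cert file instead, path not shown here):

```shell
# Generate a throwaway cert with two SANs (requires OpenSSL 1.1.1+ for
# -addext/-ext), then list them the way you would for the real cert.
tmp=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout "$tmp/key.pem" -out "$tmp/cert.pem" \
  -subj '/CN=beta.wmflabs.org' \
  -addext 'subjectAltName=DNS:beta.wmflabs.org,DNS:upload.beta.wmflabs.org' \
  2>/dev/null
sans=$(openssl x509 -in "$tmp/cert.pem" -noout -ext subjectAltName)
echo "$sans"
```

If a name doesn't show up in that output on the live cert, TLS for it will fail exactly as observed.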

I guess we only fixed the text cache. Puppet fails on deployment-cache-upload04.deployment-prep.eqiad.wmflabs :(

Error: /Stage[main]/Nginx/Package[nginx-full]/ensure: change from absent to present failed
Job for nginx.service failed. See 'systemctl status nginx.service' and 'journalctl -xn' for details.
invoke-rc.d: initscript nginx, action "start" failed.
dpkg: error processing package nginx-full (--configure):
 subprocess installed post-installation script returned error exit status 1
Errors were encountered while processing:
 nginx-full
nginx: [emerg] unknown directive "lua_shared_dict" in /etc/nginx/sites-enabled/tlsproxy-prometheus:3
nginx: configuration file /etc/nginx/nginx.conf test failed

That is the same issue as T174746.

I have applied a similar configuration in hiera for deployment-cache-upload04

While installing nginx-extras, the service failed to restart, which blocked puppet:

nginx: [emerg] unknown directive "lua_shared_dict" in /etc/nginx/sites-enabled/tlsproxy-prometheus:3
nginx: configuration file /etc/nginx/nginx.conf test failed

I have deleted /etc/nginx/sites-enabled/tlsproxy-prometheus and puppet then updated the nginx configuration:

--- /etc/nginx/nginx.conf	2017-09-14 15:36:09.894913804 +0000
+++ /tmp/puppet-file20171024-1587-1lmgkgv	2017-10-24 07:55:50.753855825 +0000
@@ -5,6 +5,8 @@
 worker_cpu_affinity 01 10;
 worker_rlimit_nofile 262144;
 
+load_module modules/ndk_http_module.so;
+load_module modules/ngx_http_lua_module.so;
 
 error_log  /var/log/nginx/error.log;
 pid        /run/nginx.pid;
@@ -77,6 +79,7 @@
     }
 
 
+    lua_package_path '/etc/nginx/lua/?.lua;;';

The LE acme challenge fails though :/

ValueError: Wrote file to /var/acme/challenge/xxx, but couldn't download http://upload.beta.wmflabs.org/.well-known/acme-challenge/xxx

In profile::cache::ssl::unified I have commented out the tlsproxy::localssl { 'unified': ... } to get the Varnish conf updated, e.g.:

-	new cache_local = vslp.vslp();
+	new cache_local = directors.shard();

This way varnish managed to listen on port 80.
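For context on that diff: varnish 5 dropped the external vslp vmod in favour of the built-in shard director from libvmod_directors, which is why the generated VCL changes. A minimal sketch of the new form (backend names assumed):

```
vcl 4.0;
import directors;

sub vcl_init {
    # shard() replaces vslp.vslp() in varnish 5
    new cache_local = directors.shard();
    # backends would be registered here, e.g.:
    # cache_local.add_backend(be_deployment_cache_upload04);
}
```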

So at least http://upload.beta.wmflabs.org/.well-known/acme-challenge/XXX responds. But that is a 301 redirect to the HTTPS variant, which refused connections; I had to restart nginx.

Eventually it did:

Notice: tlsproxy::localssl instance unified with server name beta.wmflabs.org is the default server.
Notice: /Stage[main]/Profile::Cache::Ssl::Unified/Tlsproxy::Localssl[unified]/Notify[tlsproxy localssl default_server]/message: defined 'message' as 'tlsproxy::localssl instance unified with server name beta.wmflabs.org is the default server.'
Notice: /Stage[main]/Profile::Cache::Ssl::Unified/Tlsproxy::Localssl[unified]/Letsencrypt::Cert::Integrated[beta.wmflabs.org]/Exec[acme-setup-acme-beta_wmflabs_org]/returns: executed successfully

The cert was eventually generated and puppet updated the varnish config. https://commons.wikimedia.beta.wmflabs.org/wiki/File:2016-02-01_005-IMG_8105-3.jpg serves a thumbnail \o/

Next error:

Notice: /Stage[main]/Cacheproxy::Instance_pair/Varnish::Instance[upload-backend]/Exec[retry-load-new-vcl-file]/returns
Command failed with error code 106
Message from VCC-compiler:
Could not load VMOD std
    File name: libvmod_std.so
    dlerror: /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_std.so: undefined symbol: WS_ReserveLumps
('/etc/varnish/wikimedia-common_upload-backend.inc.vcl' Line 4 Pos 8)
import std;
-------###-

Running VCC-compiler failed, exited with 2
VCL compilation failed
Error: vcl.load vcl-83f4e4ad-31ec-4afb-a067-6db7bd8d9ac3 /etc/varnish/wikimedia_upload-backend.vcl failed/etc/varnish/wikimedia_upload-backend.vcl vcl-83f4e4ad-31ec-4afb-a067-6db7bd8d9ac3 reload failed
Error: /usr/share/varnish/reload-vcl  && (rm /var/tmp/reload-vcl-failed; true) returned 1 instead of one of [0]
Error: /Stage[main]/Cacheproxy::Instance_pair/Varnish::Instance[upload-backend]/Exec[retry-load-new-vcl-file]/returns: change from notrun to 0 failed: /usr/share/varnish/reload-vcl  && (rm /var/tmp/reload-vcl-failed; true) returned 1 instead of one of [0]
# dpkg -S /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_std.so
varnish: /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_std.so

# apt-cache policy varnish
varnish:
  Installed: 5.1.3-1wm1
  Candidate: 5.1.3-1wm1
  Version table:
 *** 5.1.3-1wm1 0
       1001 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/experimental amd64 Packages

I have restarted varnish-frontend and varnish (presumably the running varnishd predated the 5.1.3 package upgrade, so the new on-disk vmods referenced symbols like WS_ReserveLumps that it did not export) and it passed:

Notice: /Stage[main]/Cacheproxy::Instance_pair/Varnish::Instance[upload-backend]/Exec[retry-load-new-vcl-file]/returns: executed successfully

Change 386077 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] beta: hieradata for varnish caches

https://gerrit.wikimedia.org/r/386077

hashar triaged this task as Medium priority. Oct 24 2017, 8:39 AM

Status

https://gerrit.wikimedia.org/r/#/c/386077/4 cherry picked on the beta cluster puppetmaster

Puppet and Varnish are now happy on both deployment-cache-text04 and deployment-cache-upload04.

So I guess we want to review the Gerrit patch. It is probably a good idea to have all the hiera settings in puppet.git.

Change 386077 merged by Dzahn:
[operations/puppet@production] beta: hieradata for varnish caches

https://gerrit.wikimedia.org/r/386077

Are we okay to close this now? Do we want to look into what caused the initial varnish upgrade?

That's a good question (re what caused the varnish upgrade), so I guess we should figure that out. The timing seems oddly non-deterministic (from my understanding).

greg claimed this task.

Well, let's close this and make a follow-up task.