Beta cluster is down.
I cannot find any other report on this; feel free to mark it as a duplicate if there is one.
Ryasmeen | |
Oct 23 2017, 8:32 PM |
F10389477: Screen Shot 2017-10-23 at 1.30.19 PM.png | |
Oct 23 2017, 8:32 PM |
F10389472: Screen Shot 2017-10-23 at 1.31.11 PM.png | |
Oct 23 2017, 8:32 PM |
Project | Branch | Lines +/- | Subject
---|---|---|---
operations/puppet | production | +112 -0 | beta: hieradata for varnish caches
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | hashar | T178841 Beta cluster is down
Declined | | None | T179197 Investigate what caused the unattended varnish upgrade in Beta Cluster
I've only looked at deployment-cache-text04. Puppet had been broken there for ages due to several puppet changes.
Horizon has a prefix defined for deployment-cache-text* but there were also some puppet settings local to just 04. I moved all node-specific settings to the prefix, and resolved one puppet issue (caused by a rename of a role to a profile.)
Now the catalog compiles, but puppet fails while setting up an LE (Let's Encrypt) cert; after that varnish fails to start.
Since varnish is down, Puppet fails to trigger the letsencrypt certificate renewal since that attempts to reach http://beta.wmflabs.org/ :/
Puppet fails with:
Notice: tlsproxy::localssl instance unified with server name beta.wmflabs.org is the default server.
/usr/local/sbin/acme_tiny.py --account-key /etc/acme/acct/acct.key --csr /etc/acme/csr/beta_wmflabs_org.pem --acme-dir /var/acme/challenge<< failed, exit code 1, stderr:
couldn't download http://beta.wmflabs.org/.well-known/acme-challenge/XXXXX
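The failing step is the HTTP-01 self-check: the challenge file is written under the `--acme-dir` directory and then has to be downloadable over plain HTTP from the public hostname. A minimal Python sketch of that check, handy for testing a host by hand before re-running puppet, could look like this. The function names and the injectable `fetch` parameter are illustrative assumptions, not part of acme_tiny.py.

```python
from urllib.request import urlopen


def challenge_url(domain: str, token: str) -> str:
    """Build the well-known HTTP-01 challenge URL for a domain and token."""
    return f"http://{domain}/.well-known/acme-challenge/{token}"


def challenge_is_served(domain, token, expected, fetch=None) -> bool:
    """Return True if the challenge URL serves exactly the expected body.

    `fetch` is a hypothetical hook (url -> str) so the check can be
    exercised without touching the network; by default it uses urlopen.
    """
    if fetch is None:
        def fetch(url):
            with urlopen(url, timeout=10) as response:
                return response.read().decode("utf8")
    try:
        return fetch(challenge_url(domain, token)).strip() == expected
    except OSError:
        # Covers connection refused, timeouts, and urllib's URLError/HTTPError.
        return False
```

With port 80 closed or nothing serving `beta.wmflabs.org`, the default fetch raises and the function returns `False`, which corresponds to the "couldn't download" error above.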
http://beta.wmflabs.org/ is not served so I have hacked up the nginx config:
server {
    listen 0.0.0.0:80;
    server_name beta.wmflabs.org;
    include /etc/acme/challenge-nginx.conf;
    access_log /var/log/nginx/betahttp.log;
    error_log /var/log/nginx/betahttp.log;
}
Restarted nginx. I can reach it from the internal network but not from outside:
$ curl --verbose http://beta.wmflabs.org/
*   Trying 208.80.155.135...
* TCP_NODELAY set
* connect to 208.80.155.135 port 80 failed: Connection refused
* Failed to connect to beta.wmflabs.org port 80: Connection refused
* Closing connection 0
curl: (7) Failed to connect to beta.wmflabs.org port 80: Connection refused
tcpdump on the instance shows no traffic on port 80.
root@deployment-cache-upload04 does not have any iptables rules
The instance security group https://horizon.wikimedia.org/project/instances/64c9625a-dace-4e69-a31c-c3b5401e3366/ has ALLOW 80:80/tcp from 0.0.0.0/0
So I guess something else is blocking traffic to 208.80.155.135:80
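To compare the internal and external view of the port quickly, a small probe that distinguishes "refused" from "timed out" can be run from both sides. This is only a sketch; `probe_tcp` is a hypothetical helper name and the 3 second timeout is an arbitrary assumption.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connect and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # An RST came back: something answered, but nothing accepts on that port.
        return "refused"
    except socket.timeout:
        # No answer at all: packets are more likely being dropped on the way.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
```

Running `probe_tcp("beta.wmflabs.org", 80)` from a cluster instance and from an outside machine would show whether the two paths disagree, as the curl output above suggests.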
I give up for tonight. I wake up in six hours.
sudo iptables -A INPUT -p tcp --dport 80 -j ACCEPT
Please don't. It will just conflict with or be reverted by ferm, or the ferm service will end up stopped, leading to more manual things to fix later. We can just add a normal ferm rule to open port 80, like we do for everything else.
HTTPS should now work again too. Need to commit hieradata/labs/deployment-prep/host/deployment-cache-text04.yaml on the puppetmaster:
profile::cache::base::varnish_version: 5
nginx::variant: extras
cache::lua_support: true
cluster: cache_text
"cache::cluster": text
"profile::cache::ssl::unified::le_subjects":
  - beta.wmflabs.org
  - www.wikimedia.beta.wmflabs.org
  - www.wikipedia.beta.wmflabs.org
  - www.wikibooks.beta.wmflabs.org
  - www.wiktionary.beta.wmflabs.org
  - commons.wikimedia.beta.wmflabs.org
  - commons.m.wikimedia.beta.wmflabs.org
  - deployment.wikimedia.beta.wmflabs.org
  - deployment.m.wikimedia.beta.wmflabs.org
  - en.wikibooks.beta.wmflabs.org
  - en.m.wikibooks.beta.wmflabs.org
  - en.wikinews.beta.wmflabs.org
  - en.m.wikinews.beta.wmflabs.org
  - en.wikiquote.beta.wmflabs.org
  - en.m.wikiquote.beta.wmflabs.org
  - en.wikisource.beta.wmflabs.org
  - en.m.wikisource.beta.wmflabs.org
  - en.wikiversity.beta.wmflabs.org
  - en.m.wikiversity.beta.wmflabs.org
  - en.wikivoyage.beta.wmflabs.org
  - en.m.wikivoyage.beta.wmflabs.org
  - en.wiktionary.beta.wmflabs.org
  - en.m.wiktionary.beta.wmflabs.org
  - login.wikimedia.beta.wmflabs.org
  - login.m.wikimedia.beta.wmflabs.org
  - meta.wikimedia.beta.wmflabs.org
  - meta.m.wikimedia.beta.wmflabs.org
  - test.wikimedia.beta.wmflabs.org
  - test.m.wikimedia.beta.wmflabs.org
  - wikidata.beta.wmflabs.org
  - m.wikidata.beta.wmflabs.org
  - zero.wikimedia.beta.wmflabs.org
  - zero.m.wikimedia.beta.wmflabs.org
  - aa.wikipedia.beta.wmflabs.org
  - aa.m.wikipedia.beta.wmflabs.org
  - aa.zero.wikipedia.beta.wmflabs.org
  - ar.wikipedia.beta.wmflabs.org
  - ar.m.wikipedia.beta.wmflabs.org
  - ar.zero.wikipedia.beta.wmflabs.org
  - ca.wikipedia.beta.wmflabs.org
  - ca.m.wikipedia.beta.wmflabs.org
  - ca.zero.wikipedia.beta.wmflabs.org
  - de.wikipedia.beta.wmflabs.org
  - de.m.wikipedia.beta.wmflabs.org
  - de.zero.wikipedia.beta.wmflabs.org
  - de.wiktionary.beta.wmflabs.org
  - de.m.wiktionary.beta.wmflabs.org
  - en-rtl.wikipedia.beta.wmflabs.org
  - en-rtl.m.wikipedia.beta.wmflabs.org
  - en-rtl.zero.wikipedia.beta.wmflabs.org
  - en.wikipedia.beta.wmflabs.org
  - en.m.wikipedia.beta.wmflabs.org
  - en.zero.wikipedia.beta.wmflabs.org
  - eo.wikipedia.beta.wmflabs.org
  - eo.m.wikipedia.beta.wmflabs.org
  - eo.zero.wikipedia.beta.wmflabs.org
  - es.wikipedia.beta.wmflabs.org
  - es.m.wikipedia.beta.wmflabs.org
  - es.zero.wikipedia.beta.wmflabs.org
  - fa.wikipedia.beta.wmflabs.org
  - fa.m.wikipedia.beta.wmflabs.org
  - fa.zero.wikipedia.beta.wmflabs.org
  - he.wikipedia.beta.wmflabs.org
  - he.m.wikipedia.beta.wmflabs.org
  - he.zero.wikipedia.beta.wmflabs.org
  - he.wiktionary.beta.wmflabs.org
  - he.m.wiktionary.beta.wmflabs.org
  - hi.wikipedia.beta.wmflabs.org
  - hi.m.wikipedia.beta.wmflabs.org
  - hi.zero.wikipedia.beta.wmflabs.org
  - ja.wikipedia.beta.wmflabs.org
  - ja.m.wikipedia.beta.wmflabs.org
  - ja.zero.wikipedia.beta.wmflabs.org
  - ko.wikipedia.beta.wmflabs.org
  - ko.m.wikipedia.beta.wmflabs.org
  - ko.zero.wikipedia.beta.wmflabs.org
  - nl.wikipedia.beta.wmflabs.org
  - nl.m.wikipedia.beta.wmflabs.org
  - nl.zero.wikipedia.beta.wmflabs.org
  - ru.wikipedia.beta.wmflabs.org
  - ru.m.wikipedia.beta.wmflabs.org
  - ru.zero.wikipedia.beta.wmflabs.org
  - simple.wikipedia.beta.wmflabs.org
  - simple.m.wikipedia.beta.wmflabs.org
  - simple.zero.wikipedia.beta.wmflabs.org
  - sq.wikipedia.beta.wmflabs.org
  - sq.m.wikipedia.beta.wmflabs.org
  - sq.zero.wikipedia.beta.wmflabs.org
  - uk.wikipedia.beta.wmflabs.org
  - uk.m.wikipedia.beta.wmflabs.org
  - uk.zero.wikipedia.beta.wmflabs.org
  - zh.wikipedia.beta.wmflabs.org
  - zh.m.wikipedia.beta.wmflabs.org
  - zh.zero.wikipedia.beta.wmflabs.org
  - commons.wikipedia.beta.wmflabs.org
upload.beta.wmflabs.org refuses SSL connections right now, I see that it's not on that list
I guess we only fixed the text cache. Puppet fails on deployment-cache-upload04.deployment-prep.eqiad.wmflabs :(
Error: /Stage[main]/Nginx/Package[nginx-full]/ensure: change from absent to present failed
Job for nginx.service failed. See 'systemctl status nginx.service' and 'journalctl -xn' for details.
invoke-rc.d: initscript nginx, action "start" failed.
dpkg: error processing package nginx-full (--configure):
 subprocess installed post-installation script returned error exit status 1
Errors were encountered while processing:
 nginx-full
nginx: [emerg] unknown directive "lua_shared_dict" in /etc/nginx/sites-enabled/tlsproxy-prometheus:3
nginx: configuration file /etc/nginx/nginx.conf test failed
That is the same issue as T174746
I have applied a similar configuration in hiera for deployment-cache-upload04
While installing nginx-extra, the service failed to restart which blocks puppet:
nginx: [emerg] unknown directive "lua_shared_dict" in /etc/nginx/sites-enabled/tlsproxy-prometheus:3
nginx: configuration file /etc/nginx/nginx.conf test failed
I have deleted /etc/nginx/sites-enabled/tlsproxy-prometheus and puppet tweaked some configuration file:
--- /etc/nginx/nginx.conf 2017-09-14 15:36:09.894913804 +0000
+++ /tmp/puppet-file20171024-1587-1lmgkgv 2017-10-24 07:55:50.753855825 +0000
@@ -5,6 +5,8 @@
 worker_cpu_affinity 01 10;
 worker_rlimit_nofile 262144;
+load_module modules/ndk_http_module.so;
+load_module modules/ngx_http_lua_module.so;
 error_log /var/log/nginx/error.log;
 pid /run/nginx.pid;
@@ -77,6 +79,7 @@
 }
+ lua_package_path '/etc/nginx/lua/?.lua;;';
The LE acme challenge fails though :/
ValueError: Wrote file to /var/acme/challenge/xxx, but couldn't download http://upload.beta.wmflabs.org/.well-known/acme-challenge/xxx
In profile::cache::ssl::unified I have commented out the tlsproxy::localssl { 'unified': ... } to get the Varnish conf updated eg:
- new cache_local = vslp.vslp();
+ new cache_local = directors.shard();
This way varnish managed to listen on port 80.
So at least http://upload.beta.wmflabs.org/.well-known/acme-challenge/XXX responds, but with a 301 TLS redirect to the https variant. That refused connections, so I had to restart nginx.
Eventually it did:
Notice: tlsproxy::localssl instance unified with server name beta.wmflabs.org is the default server.
Notice: /Stage[main]/Profile::Cache::Ssl::Unified/Tlsproxy::Localssl[unified]/Notify[tlsproxy localssl default_server]/message: defined 'message' as 'tlsproxy::localssl instance unified with server name beta.wmflabs.org is the default server.'
Notice: /Stage[main]/Profile::Cache::Ssl::Unified/Tlsproxy::Localssl[unified]/Letsencrypt::Cert::Integrated[beta.wmflabs.org]/Exec[acme-setup-acme-beta_wmflabs_org]/returns: executed successfully
The cert was eventually generated and puppet has updated the varnish config. https://commons.wikimedia.beta.wmflabs.org/wiki/File:2016-02-01_005-IMG_8105-3.jpg serves a thumbnail \o/
Next error:
Notice: /Stage[main]/Cacheproxy::Instance_pair/Varnish::Instance[upload-backend]/Exec[retry-load-new-vcl-file]/returns Command failed with error code 106
Message from VCC-compiler:
Could not load VMOD std
 File name: libvmod_std.so
 dlerror: /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_std.so: undefined symbol: WS_ReserveLumps
('/etc/varnish/wikimedia-common_upload-backend.inc.vcl' Line 4 Pos 8)
import std;
-------###-
Running VCC-compiler failed, exited with 2
VCL compilation failed
Error: vcl.load vcl-83f4e4ad-31ec-4afb-a067-6db7bd8d9ac3 /etc/varnish/wikimedia_upload-backend.vcl failed/etc/varnish/wikimedia_upload-backend.vcl vcl-83f4e4ad-31ec-4afb-a067-6db7bd8d9ac3 reload failed
Error: /usr/share/varnish/reload-vcl && (rm /var/tmp/reload-vcl-failed; true) returned 1 instead of one of [0]
Error: /Stage[main]/Cacheproxy::Instance_pair/Varnish::Instance[upload-backend]/Exec[retry-load-new-vcl-file]/returns: change from notrun to 0 failed: /usr/share/varnish/reload-vcl && (rm /var/tmp/reload-vcl-failed; true) returned 1 instead of one of [0]
# dpkg -S /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_std.so
varnish: /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_std.so
# apt-cache policy varnish
varnish:
  Installed: 5.1.3-1wm1
  Candidate: 5.1.3-1wm1
  Version table:
 *** 5.1.3-1wm1 0
       1001 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/experimental amd64 Packages
I have restarted varnish-frontend and varnish and it passed:
Notice: /Stage[main]/Cacheproxy::Instance_pair/Varnish::Instance[upload-backend]/Exec[retry-load-new-vcl-file]/returns: executed successfully
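When the same check has to be repeated on several cache hosts, the installed and candidate versions can be pulled out of the `apt-cache policy` output shown above with a few lines of Python. This is a rough sketch; `parse_apt_policy` is a hypothetical helper and it only looks at the `Installed:` and `Candidate:` fields.

```python
import re


def parse_apt_policy(output: str) -> dict:
    """Extract the Installed and Candidate versions from `apt-cache policy` text.

    Returns a dict with the keys "installed" and/or "candidate"; a key is
    omitted when the corresponding field is not present in the output.
    """
    versions = {}
    for field in ("Installed", "Candidate"):
        match = re.search(rf"\b{field}:\s*(\S+)", output)
        if match:
            versions[field.lower()] = match.group(1)
    return versions
```

The installed version alone does not say whether the running daemons were started before or after the package upgrade; comparing it across hosts only helps spot packages that moved unexpectedly.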
Change 386077 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] beta: hieradata for varnish caches
Mentioned in SAL (#wikimedia-releng) [2017-10-24T08:35:44Z] <hashar> beta: cherry pick https://gerrit.wikimedia.org/r/#/c/386077/4 "hieradata for varnish caches" - T178841
Status
https://gerrit.wikimedia.org/r/#/c/386077/4 cherry picked on the beta cluster puppetmaster
Puppet and Varnish are now happy on both deployment-cache-text04 and deployment-cache-upload04.
So I guess we want to review the Gerrit patch. It is probably a good idea to have all the hiera settings in puppet.git.
Change 386077 merged by Dzahn:
[operations/puppet@production] beta: hieradata for varnish caches
Are we okay to close this now? Do we want to look into what caused the initial varnish upgrade?
That's a good question (re what caused the varnish upgrade) so I guess we should figure that out. The timing seems oddly non-deterministic (from my understanding).