
Beta needs to be upgraded to Varnish 6
Closed, ResolvedPublic

Description

deployment-cache-text06.deployment-prep.eqiad.wmflabs seems to be non-functional, breaking Beta:

https://en.wikipedia.beta.wmflabs.org/

And deployment-cache-upload06.deployment-prep.eqiad.wmflabs is running into some VCL errors:

Command '['/usr/bin/varnishadm', '-n', 'frontend', 'vcl.load', 'vcl-7880e47a-623b-47e8-8c90-c5e4cd455c5b', '/etc/varnish/wikimedia_upload-frontend.vcl']' returned non-zero exit status 2.

Event Timeline

Restricted Application added a subscriber: Aklapper.
jijiki triaged this task as Medium priority. Nov 10 2020, 4:07 PM

I just tried to fix it by installing Varnish 6, but (obviously) it isn't that simple.

$ sudo apt-get remove varnish varnish-modules libvmod-netmapper
$ sudo apt-get install varnish
...
Setting up varnish (6.0.6-1wm2) ...
...

$ sudo puppet agent --enable
$ sudo puppet agent --test
...
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Class[Confd]: parameter 'srv_dns' expects a match for Stdlib::Fqdn = Pattern[/\A(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\z/], got '' (file: /etc/puppet/modules/confd/manifests/file.pp, line: 39, column: 5) (file: /etc/puppet/modules/profile/manifests/cache/varnish/frontend.pp, line: 87) on node deployment-cache-text06.deployment-prep.eqiad.wmflabs

Also, syslog has:

Nov 10 18:40:49 deployment-cache-text06 confd[3851]: 2020-11-10T18:40:49Z deployment-cache-text06 /usr/bin/confd[3851]: INFO SRV record set to _etcd._tcp.-scheme.
Nov 10 18:40:49 deployment-cache-text06 confd[3851]: 2020-11-10T18:40:49Z deployment-cache-text06 /usr/bin/confd[3851]: FATAL Cannot get nodes from SRV records lookup _etcd._tcp.-scheme.: no such host
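
The SRV name it is querying (_etcd._tcp.-scheme.) is malformed because the srv-domain ends up empty. With a proper domain the lookup would look more like this (a sketch, assuming an etcd SRV record exists for the project domain):

$ dig +short SRV _etcd._tcp.deployment-prep.eqiad.wmflabs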

I gave up here and disabled puppet again.

Razzi and I were about to use varnishkafka to test some Kafka webrequest partition rebalancing for T255973, but can't because varnish isn't working! :p

modules/profile/manifests/cache/varnish/frontend.pp line 87 invokes confd::file, which initializes the confd module. It has:

modules/confd/manifests/init.pp
class confd(
...
    Stdlib::Fqdn     $srv_dns       = $facts['domain'],
) {
...

Somehow $facts['domain'] is empty or not resolved properly (even though facter domain shows an entry, and puppet apply with a dummy manifest shows it can be resolved).
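
A minimal sketch of those checks (hypothetical one-liners, not necessarily the exact commands used):

$ facter domain          # shows deployment-prep.eqiad.wmflabs here
$ sudo puppet apply -e 'notice($facts["domain"])'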

On the instance puppet config page https://horizon.wikimedia.org/project/instances/6ec55bf4-600e-47a4-9526-16fd3316c512/ I have added:

Hiera Config
confd::srv_dns: deployment-prep.eqiad.wmflabs

Ran puppet, and sure enough that killed the instance: the Linux kernel OOM killer kicked in. From the console log:

[3639918.849631] Out of memory: Kill process 12621 ([TS_MAIN]) score 72 or sacrifice child
[3639918.853687] Killed process 12621 ([TS_MAIN]) total-vm:1307100kB, anon-rss:291592kB, file-rss:0kB, shmem-rss:0kB
[3668783.123857] Out of memory: Kill process 29213 ([TS_MAIN]) score 60 or sacrifice child
[3668783.127799] Killed process 29219 (traffic_crashlo) total-vm:15904kB, anon-rss:980kB, file-rss:28kB, shmem-rss:0kB
[3668788.095180] Out of memory: Kill process 29213 ([TS_MAIN]) score 60 or sacrifice child
[3668788.099651] Killed process 29213 ([TS_MAIN]) total-vm:1298900kB, anon-rss:243500kB, file-rss:0kB, shmem-rss:0kB
[3674220.052830] Out of memory: Kill process 13887 (cache-main) score 63 or sacrifice child
[3674220.057106] Killed process 13887 (cache-main) total-vm:584320kB, anon-rss:171316kB, file-rss:0kB, shmem-rss:82532kB
[3695937.672659] Out of memory: Kill process 8765 ([TS_MAIN]) score 60 or sacrifice child
[3695937.675896] Killed process 8771 (traffic_crashlo) total-vm:15904kB, anon-rss:984kB, file-rss:0kB, shmem-rss:0kB
[3695941.821309] Out of memory: Kill process 8765 ([TS_MAIN]) score 60 or sacrifice child
[3695941.825289] Killed process 8765 ([TS_MAIN]) total-vm:1298940kB, anon-rss:243904kB, file-rss:0kB, shmem-rss:0kB
[3708431.301766] Out of memory: Kill process 17280 (cache-main) score 85 or sacrifice child
[3708431.307516] Killed process 17280 (cache-main) total-vm:777852kB, anon-rss:261664kB, file-rss:0kB, shmem-rss:82512kB

I have moved the confd::srv_dns: deployment-prep.eqiad.wmflabs setting to a new puppet prefix deployment-cache at https://horizon.wikimedia.org/project/prefixpuppet/ , so that it applies to both deployment-cache-text06.deployment-prep.eqiad.wmflabs and deployment-cache-upload06.deployment-prep.eqiad.wmflabs.

I guess Varnish tries to pre-allocate some memory that ends up exceeding what is available to the instance? The instances are g2.cores2.ram4.disk40, so 4GB of RAM.
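
A rough sanity check, assuming the -s malloc,1G and MALLOC_CONF settings from the varnish-frontend unit file quoted further below, is to compare the configured cache size with what the instance actually has free:

$ free -m
$ systemctl cat varnish-frontend | grep -E -e '-s malloc|MALLOC_CONF'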

I have hard rebooted the instance and disabled puppet again.

systemctl list-units --failed shows that varnish-frontend.service fails, and from the journal:

systemd[1]: Starting varnish-frontend (Varnish HTTP Accelerator)...
varnish-frontend[834]: <jemalloc>: Invalid conf pair: lg_dirty_mult:8
varnish-frontend[834]: <jemalloc>: Invalid conf pair: lg_chunk:17
varnish-frontend[834]: Debug: Version: varnish-6.0.6 revision 29a1a8243dbef3d973aec28dc90403188c1dc8e7
varnishd[839]: Version: varnish-6.0.6 revision 29a1a8243dbef3d973aec28dc90403188c1dc8e7
varnish-frontend[834]: Debug: Platform: Linux,4.19.0-12-amd64,x86_64,-junix,-smalloc,-sdefault,-hcritbit
varnishd[839]: Platform: Linux,4.19.0-12-amd64,x86_64,-junix,-smalloc,-sdefault,-hcritbit
varnish-frontend[842]: No -T in shared memory
varnish-frontend[842]: Executing: "/usr/bin/varnishadm -n frontend vcl.load vcl-0a81a68e-0b5f-4d78-bf07-77d1712d63af /etc/varnish/wikimedia_misc-frontend.vcl"
varnish-frontend[842]: Traceback (most recent call last):
varnish-frontend[842]:   File "/usr/local/sbin/reload-vcl", line 179, in <module>
varnish-frontend[842]:     main()
varnish-frontend[842]:   File "/usr/local/sbin/reload-vcl", line 138, in main
varnish-frontend[842]:     separate_vcl_ids = [load(vadm_cmd, vcl_file) for vcl_file in args.separate_vcl_files]
varnish-frontend[842]:   File "/usr/local/sbin/reload-vcl", line 138, in <listcomp>
varnish-frontend[842]:     separate_vcl_ids = [load(vadm_cmd, vcl_file) for vcl_file in args.separate_vcl_files]
varnish-frontend[842]:   File "/usr/local/sbin/reload-vcl", line 123, in load
varnish-frontend[842]:     do_cmd(vcl_load_cmd)
varnish-frontend[842]:   File "/usr/local/sbin/reload-vcl", line 63, in do_cmd
varnish-frontend[842]:     subprocess.check_call(cmd)
varnish-frontend[842]:   File "/usr/lib/python3.7/subprocess.py", line 347, in check_call
varnish-frontend[842]:     raise CalledProcessError(retcode, cmd)
varnish-frontend[842]: subprocess.CalledProcessError: Command '['/usr/bin/varnishadm', '-n', 'frontend', 'vcl.load', 'vcl-0a81a68e-0b5f-4d78-bf07-77d1712d63af', '/etc/varnish/wikimedia_misc-frontend.vcl']' returned non-zero exit status 2.
systemd[1]: varnish-frontend.service: Control process exited, code=exited, status=1/FAILURE
varnishd[839]: Manager got SIGTERM
varnishd[839]: manager stopping child
varnishd[839]: manager dies
systemd[1]: varnish-frontend.service: Failed with result 'exit-code'.
systemd[1]: Failed to start varnish-frontend (Varnish HTTP Accelerator).
systemd[1]: varnish-frontend.service: Service RestartSec=100ms expired, scheduling restart.
systemd[1]: varnish-frontend.service: Scheduled restart job, restart counter is at 5.
systemd[1]: Stopped varnish-frontend (Varnish HTTP Accelerator).
systemd[1]: varnish-frontend.service: Start request repeated too quickly.
systemd[1]: varnish-frontend.service: Failed with result 'exit-code'.
systemd[1]: Failed to start varnish-frontend (Varnish HTTP Accelerator).

The error message seems to be "No -T in shared memory"; the -T parameter is for the management interface.

Not much else I can do unfortunately.
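
For reference, varnishadm can also be given the management address and secret explicitly instead of reading them from the shared memory of the named instance; something like the following (untested here):

$ sudo varnishadm -T 127.0.0.1:6082 -S /etc/varnish/secret ping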

I was able to get further along by doing things manually as root on one of the instances, deployment-cache-text06.

I looked at the systemd unit file, /lib/systemd/system/varnish-frontend.service, and stole the arguments from there. As root I ran

MALLOC_CONF=lg_dirty_mult:8,lg_chunk:17 ; /usr/sbin/varnishd -a :80 -a :3120 -a :3121 -a :3122 -a :3123 -a :3124 -a :3125 -a :3126 -a :3127  -T 127.0.0.1:6082 -f '' -p thread_pool_min=250 -p thread_pool_max=500 -p thread_pool_timeout=120 -p vsl_reclen=2048 -p workspace_backend=128k -p vcc_allow_inline_c=true -S /etc/varnish/secret -s malloc,1G -p thread_pool_stack=131072 -p listen_depth=4096 -p vcc_err_unref=off -p http_req_size=24576 -p gzip_level=8  -p gzip_memlevel=9 -p default_ttl=3600 -n frontend -p cc_command="exec gcc -std=gnu99 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wall -pthread -fpic -shared -Wl,-x -L/usr/local/lib/ -o %%o %%s -lmaxminddb" -P /run/varnish-frontend.pid

and it started ok. It's running varnish-6.0.6 according to its output.
Then I tried the reload-vcl script and it complained as before. The "No -T in shared memory" error, I think, means that varnishadm, invoked by this script with just the varnish instance name, looks up the management host IP and port in shared memory and fails to find them. I ran varnishadm directly, as follows, specifying the values for -T:

/usr/bin/varnishadm -T 127.0.0.1:6082 -S /etc/varnish/secret  vcl.load  vcl-0a81a68e-0b5f-4d78-bf07-77d1712d63af  /etc/varnish/wikimedia_misc-frontend.vcl

This got us an authenticated connection, but then this error message:

Message from VCC-compiler:
Incompatible VMOD re2
	File name: /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_re2.so
	VMOD wants ABI version 6.0
	varnishd provides ABI version 7.1
('/etc/varnish/alternate-domains.inc.vcl' Line 1 Pos 8)
import re2;
-------###-

Running VCC-compiler failed, exited with 2
VCL compilation failed
Command failed with error code 106
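
To double-check which side is out of date, one can compare the running daemon against the installed packages (a quick sketch, not the exact commands used):

$ /usr/sbin/varnishd -V
$ apt-cache policy varnish libvmod-re2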

I guess that an update of some sort is likely needed, but I'm hesitant to proceed further without assistance from someone with varnish expertise. I have killed the running varnish instance so everything is as before.

Ariel pointed out that /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_re2.so is the wrong version. On this task or another related one, I remembered someone marked the related package as held, and sure enough:

$ apt-mark showhold
libvmod-re2
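
Presumably the hold has to be lifted before apt will upgrade it, something along the lines of:

$ sudo apt-mark unhold libvmod-re2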

Trying to upgrade:

$ sudo apt upgrade
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Calculating upgrade... Done
The following packages were automatically installed and are no longer required:
  libopts25 python3-debconf
Use 'sudo apt autoremove' to remove them.
The following packages will be upgraded:
  libvmod-re2
The following packages will be DOWNGRADED:
  libvarnishapi2
1 upgraded, 0 newly installed, 1 downgraded, 0 to remove and 0 not upgraded.
Need to get 194 kB of archives.
After this operation, 27.6 kB of additional disk space will be used.
Do you want to continue? [Y/n] n
Abort.
hashar@deployment-cache-text06:~$ apt-cache policy libvarnishapi2
libvarnishapi2:
  Installed: 6.1.1-1+deb10u1
  Candidate: 6.0.6-1wm2
  Version table:
 *** 6.1.1-1+deb10u1 500
        500 http://deb.debian.org/debian buster/main amd64 Packages
        500 http://security.debian.org buster/updates/main amd64 Packages
        100 /var/lib/dpkg/status
     6.0.6-1wm2 1001
       1001 http://apt.wikimedia.org/wikimedia buster-wikimedia/component/varnish6 amd64 Packages
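
For what it's worth, the 1001 pin on the wikimedia component is what lets apt offer the downgrade at all: an apt pin priority above 1000 allows installing a lower version than the one installed. The explicit equivalent (a sketch only) would be:

$ sudo apt-get install libvarnishapi2=6.0.6-1wm2 libvmod-re2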

I went with the downgrade since that version would come from OUR component:

$ sudo apt-get upgrade -V
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Calculating upgrade... Done
The following packages were automatically installed and are no longer required:
   libopts25 (1:5.18.12-4)
   python3-debconf (1.5.71)
Use 'sudo apt autoremove' to remove them.
The following packages will be upgraded:
   libvmod-re2 (1.3.1-3 => 1.5.3-1)
The following packages will be DOWNGRADED:
   libvarnishapi2 (6.1.1-1+deb10u1 => 6.0.6-1wm2)
$ sudo systemctl status varnish-frontend
● varnish-frontend.service - varnish-frontend (Varnish HTTP Accelerator)
   Loaded: loaded (/lib/systemd/system/varnish-frontend.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2020-11-12 14:30:47 UTC; 10s ago

!!!!!!!!!!!!!!!!!!!!!!!!!!

Mentioned in SAL (#wikimedia-releng) [2020-11-12T14:33:03Z] <hashar> Beta cluster is available again. The frontend cache had a mixup between varnish 5 running and config being for varnish 6. The package got upgraded and it seems to serve traffic now. # T267561

I have enabled puppet again and ran it. Something I spotted:

--- /lib/systemd/system/confd.service	2020-11-06 20:04:20.522441044 +0000
+++ /tmp/puppet-file20201112-16796-1e2pun7	2020-11-12 14:35:59.678862604 +0000
@@ -4,7 +4,7 @@
 [Service]
 User=root
 Environment="CONFD_BACKEND=etcd"
-Environment="CONFD_DISCOVERY=-srv-domain  -scheme https"
+Environment="CONFD_DISCOVERY=-srv-domain deployment-prep.eqiad.wmflabs -scheme https"
 Environment="CONFD_OPTS=-interval=3"
 ExecStart=/usr/bin/confd -backend $CONFD_BACKEND $CONFD_DISCOVERY $CONFD_OPTS
 Restart=on-failure

So I have dropped the hiera hack in horizon that was setting confd::srv_dns: deployment-prep.eqiad.wmflabs, but puppet ended up broken again, so I have set it back.


What is still broken:

  • Setting confd::srv_dns is required, else confd::file fails in puppet
  • prometheus-varnishkafka-exporter and wmf_auto_restart_prometheus-varnishkafka-exporter, which are for monitoring in production
  • mtail, though I am not even sure that is needed on beta

The instance has role::cache::text, though any hiera settings defined in puppet for that role are NOT applied; they only apply in production. Regardless, it seems to work.

So I guess we can say that the issue is fixed!

Vgutierrez subscribed.

Re-opening: as mentioned in https://phabricator.wikimedia.org/T267006#6624466, deployment-cache-upload06 has been omitted and needs to be upgraded to Varnish 6.

Vgutierrez claimed this task.

Manually ran apt upgrade and puppet afterwards; everything seems OK on deployment-cache-upload06.
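
Roughly the same steps as on the text instance, i.e. something like:

$ sudo apt upgrade
$ sudo puppet agent --test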