Unable to upload files on Beta Commons
Closed, Resolved · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):
Try to upload a file on Beta Commons (https://commons.wikimedia.beta.wmflabs.org), via a special page or via the API.

What happens?:
The special pages fail with the error message: “Could not acquire locks on server rdb1.” (Sometimes also “Could not acquire locks on server rdb2.”)

The API fails with the error code: lockmanager-fail-svr-acquire

What should have happened instead?:
Successful file upload.

Other information (browser name/version, screenshots, etc.):
Judging from Special:NewFiles or Special:Uploads, this might have been broken since December 2022.

Event Timeline

If I read LabsServices.php correctly, rdb1 is deployment-memc09.deployment-prep.eqiad1.wikimedia.cloud; I can SSH into that server and systemctl status memcached looks fine.

Okay, on beta-logs I see a “Redis exception connecting to "deployment-memc09.deployment-prep.eqiad1.wikimedia.cloud"” with the message “Connection timed out”.

But that host seems to be able to reach memc09 just fine in a manual test:

lucaswerkmeister@deployment-mediawiki11:~$ time { echo version; echo quit; } | nc deployment-memc09.deployment-prep.eqiad1.wikimedia.cloud 11211
VERSION 1.5.6

real    0m0,010s
user    0m0,001s
sys     0m0,008s

Hang on, I’ve been checking the wrong service. It’s a “Redis exception connecting”, coming from RedisConnectionPool. And while redis seems to be listening on port 6379…

lucaswerkmeister@deployment-memc09:~$ sudo lsof -iTCP -sTCP:LISTEN -n -P | grep 6379
prometheu   486  prometheus    3u  IPv4     17694      0t0  TCP 172.16.3.96:16379 (LISTEN)
redis-ser   524       redis    6u  IPv4     17522      0t0  TCP *:6379 (LISTEN)

…connecting to it from mediawiki11 doesn’t work:

lucaswerkmeister@deployment-mediawiki11:~$ time timeout 5s telnet deployment-memc09.deployment-prep.eqiad1.wikimedia.cloud 6379; echo $?
Trying 172.16.3.96...

real    0m5,005s
user    0m0,006s
sys     0m0,003s
124

If I’m reading the lsof output correctly, it should be listening on all interfaces, and I can connect to it via a non-localhost IP, but only from the same host:

lucaswerkmeister@deployment-memc09:~$ { echo 'INFO'; echo 'QUIT'; } | nc -w5 172.16.3.96 6379; echo $?
-NOAUTH Authentication required.
+OK
0
lucaswerkmeister@deployment-mediawiki11:~$ { echo 'INFO'; echo 'QUIT'; } | nc -w5 172.16.3.96 6379; echo $?
1

Is there some firewall in between that’s blocking port 6379?

Hm, in Horizon I see “ALLOW IPv4 6379/tcp from 172.16.0.0/21” for memc09; mediawiki11 should be 172.16.3.203 according to ip a, which is in that range (it’s 172.16.0.1-172.16.7.254 according to DuckDuckGo).
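
A quick way to double-check the range membership (assuming python3 is available on the host; the IP is the one reported by ip a above):

$ python3 -c 'import ipaddress; print(ipaddress.ip_address("172.16.3.203") in ipaddress.ip_network("172.16.0.0/21"))'
True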

I temporarily installed mtr-tiny on mwmaint02, but it didn’t really help. The connection between mwmaint02 and memc09 is direct, no intermediate hops (same for mediawiki11, presumably); a TCP traceroute (rather than ICMP) worked on port 11211 (-TP11211), but not on port 6379 (-TP6379).
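
For reference, the two TCP traceroutes were along these lines (reconstructed from the flags above, not copied verbatim):

$ mtr --report -T -P 11211 deployment-memc09.deployment-prep.eqiad1.wikimedia.cloud   # reaches memc09
$ mtr --report -T -P 6379 deployment-memc09.deployment-prep.eqiad1.wikimedia.cloud    # no response on 6379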

As far as I can tell, all the other ports that are supposed to be allowed by the default security group are also not working:

lucaswerkmeister@deployment-mediawiki11:~$ for port in 22 5666 6379 8080 8423 8081 3811 8140 9100; do nc -w1 deployment-memc09.deployment-prep.eqiad1.wikimedia.cloud $port || echo port $port failed; done
port 22 failed
port 5666 failed
port 6379 failed
port 8080 failed
port 8423 failed
port 8081 failed
port 3811 failed
port 8140 failed
port 9100 failed

Only port 11211, from the cache policy, works. I don’t know why. (Edit: Of course memc09 isn’t listening on all those ports, but at least 22 and 9100 should work.)

But it also turns out I don’t have permission to change the deployment-prep policies anyway. I think this needs someone else’s expertise 🥺

deployment-memc09 has an iptables firewall with a default DROP policy; that would explain why some of the tests you are making get rejected.

From iptables --list -n -v:

pkts bytes target     prot opt in     out     source               destination         
2830  170K ACCEPT     tcp  --  *      *       172.16.0.0/21        0.0.0.0/0            tcp dpt:11211

The first field (pkts) is the number of packets that matched the rule, and it keeps increasing.

Still on deployment-memc09, I can see connections coming in from deployment-mediawiki11 using sudo tcpdump -n 'tcp port 11211 and host 172.16.3.203'.

If I try a connection to port 6379 (Redis) from deployment-mediawiki11 to deployment-memc09, the last INPUT rule of iptables (which logs the packet and does not accept it) shows an increased packet count. The packets are logged (to syslog iirc) with the log prefix [fw-in-drop], but I have no idea where they end up being collected.
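
Since the iptables LOG target writes to the kernel log, one place those [fw-in-drop] entries ought to show up is the kernel ring buffer / journal on memc09 itself. A guess at where to look, not something I have confirmed:

$ sudo journalctl -k | grep fw-in-drop
$ sudo dmesg | grep fw-in-drop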

So essentially the MediaWiki redis_lock service points to memcached instances which do not allow Redis traffic; a ferm rule should be added to them. I have no idea how/where to add it, though :-\

The ferm rules defined by Puppet on deployment-memc09:

root@deployment-memc09:/var/lib/puppet/state# grep /etc/ferm/conf.d /var/lib/puppet/state/resources.txt|sort
file[/etc/ferm/conf.d]
file[/etc/ferm/conf.d/00_defs]
file[/etc/ferm/conf.d/01_cumin-project-defs]
file[/etc/ferm/conf.d/02_main]
file[/etc/ferm/conf.d/10_memcached]
file[/etc/ferm/conf.d/10_metricsinfra-prometheus-all]
file[/etc/ferm/conf.d/10_prometheus-all]
file[/etc/ferm/conf.d/10_ssh-from-bastion]
file[/etc/ferm/conf.d/10_ssh-from-cumin-masters]
file[/etc/ferm/conf.d/98_filter_log_filter-bootp]
file[/etc/ferm/conf.d/98_log-everything]

Grepping those files, there is no definition for Redis known to Puppet, but systemd has a redis-instance-tcp_6379.service (up for 4 months, since Feb 13). In cloud/instances.git nothing seems to have changed for the deployment-prep/*deployment-memc* files. So I guess something has changed in operations/puppet.git.
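
Concretely, those checks boil down to something like this (commands reconstructed, not the exact ones that were run):

$ sudo grep -ri -e redis -e 6379 /etc/ferm/conf.d/        # no matches
$ systemctl status redis-instance-tcp_6379.service        # active (running) since Feb 13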

I don't know who manages that stack in production. Maybe @jijiki would know :]

Perhaps we need something like this for Redis in Puppet? (Based on the similar block in modules/profile/manifests/idp/memcached.pp, but I have basically no idea what I’m doing.)

diff --git a/modules/redis/manifests/instance.pp b/modules/redis/manifests/instance.pp
index a3c92c9412..4a21a7b184 100644
--- a/modules/redis/manifests/instance.pp
+++ b/modules/redis/manifests/instance.pp
@@ -104,4 +104,13 @@ define redis::instance(
         content => systemd_template('redis-instance'),
         restart => false,
     }
+
+    ferm::service {'redis':
+        ensure  => $ensure,
+        desc    => 'Allow connections to redis',
+        proto   => 'tcp',
+        notrack => true,
+        port    => $port,
+        srange  => "@resolve((${apereo_cas::idp_nodes.join(' ')}))",
+    }
 }

Puppet does not even install memcached on deployment-memc09.deployment-prep.eqiad1.wikimedia.cloud.

The instance has a /etc/redis/tcp_6379.conf from March 7, 2021, and looking at cloud/instances.git that matches the role::mediawiki::memcached role being applied to the instance.

In production, operations/mediawiki-config points to rdb* hosts, which have the Puppet role redis::misc::master; that seems to be connected to T267581: Phase out "redis_sessions" cluster, i.e. moving away from the memcached cluster. The patches seem to be from December 2022, which matches the last successful uploads on the beta cluster.

Eventually I found https://gerrit.wikimedia.org/r/c/operations/puppet/+/864830 which removes redis from role::mediawiki::memcached:

commit a20a38fdc17bb5526e50a73783c782b27639495f
Author: Effie Mouzeli <effie@wikimedia.org>
Date:   Mon Dec 5 20:53:20 2022 +0200

    Redis sessions: Goodbye
    
    This is it! We have not been using redis sessions for a while now,
    part of the multi-dc work, so now we can completely remove anything
    related to this component.
    
    Bug: T267581
    Change-Id: I6875a0ccc4c3805a96bfa2859bed548e0af5f3c4

diff --git a/modules/role/manifests/mediawiki/memcached.pp b/modules/role/manifests/mediawiki/memcached.pp
index 70cc411c3d3..e62cf8e6937 100644
--- a/modules/role/manifests/mediawiki/memcached.pp
+++ b/modules/role/manifests/mediawiki/memcached.pp
@@ -2,7 +2,7 @@
 class role::mediawiki::memcached{
 
     system::role { 'mediawiki::memcached':
-        description => 'memcached+redis sessions',
+        description => 'memcached',
     }
 
     include ::profile::base::production
@@ -10,5 +10,4 @@ class role::mediawiki::memcached{
     include profile::memcached::instance
     include profile::memcached::memkeys
     include profile::memcached::performance
-    include profile::redis::multidc
 }

Since Puppet no longer has profile::redis::multidc (which I assume created the redis::instance), the ferm rule has vanished (because Puppet removes the configuration bits it does not know about under /etc/ferm/conf.d/) and Redis traffic is no longer allowed.

So I guess we have the root cause.

I don't know anything about the migration task at T267581. I guess we can apply redis::misc::master to the deployment-memc* hosts and have the services co-located on the same instance. Alternatively, create some new rdb instances with just redis::misc::master (and update mediawiki-config to point to them).

@hashar or @jijiki, any idea who might be able to help solve this? The Flickr Foundation is working on a tool (called Flickypedia) to copy freely licensed images from Flickr to Commons, and wanted to test against Beta.

I merely did the troubleshooting at T340908#8985768; beyond that I don't know anything about how memcached is set up or how to drive it through Puppet. I guessed @jijiki based on the commit history, or else the serviceops team.

redis::multidc was a very complicated and not well maintained part of the infrastructure, so as soon as we moved MainStash out of Redis (T212129), we wanted it out of Puppet and off our servers as quickly as possible (T267581).

The last bit that still required Redis, and was also hosted on redis::multidc, was the Redis LockManager, a lock used during file uploads. Would it be possible (and simple) to not require such a lock when uploading on Beta?

In production, rdb1 / rdb2 / rdb3 (which point to rdb1009 and rdb1011) use the role::redis::misc::master role, so in theory we just need to set up a new box with that role?

After muddling through T349937: New Cloud VPS instance has "Failed to start Execute cloud user/final script." in its error log, this is done.

tgr@deployment-rdb01:~$ sudo -i puppet agent -tv
...
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'profile::redis::master::instances' (file: /etc/puppet/modules/profile/manifests/redis/master.pp, line: 2) on node deployment-rdb01.deployment-prep.eqiad1.wikimedia.cloud

I copied the hiera settings from deployment-maps:

profile::redis::master::instances:
- '6379'
profile::redis::master::settings:
  bind: 0.0.0.0
  maxmemory: 2gb
  save: ''
  stop-writes-on-bgsave-error: false

except with maxmemory set to 128mb instead of 2gb, since this is a 1 GB box (free -m gave free: 162, available: 609) and it's only going to be used for locks anyway.
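
To double-check that the running instance actually picked up the lower limit, something like this should work (a sketch; it assumes redis-cli is installed on the box and that $REDIS_PASS holds the configured password):

$ redis-cli -a "$REDIS_PASS" CONFIG GET maxmemory
# expected output: 1) "maxmemory"  2) "134217728" (i.e. 128mb)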

Change 969387 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[operations/mediawiki-config@master] [beta] Use dedicated redis server

https://gerrit.wikimedia.org/r/969387

Mentioned in SAL (#wikimedia-cloud) [2023-10-27T21:26:33Z] <tgr> set up deployment-rdb01 for redis (T340908)

Change 969387 merged by jenkins-bot:

[operations/mediawiki-config@master] [beta] Use dedicated redis server

https://gerrit.wikimedia.org/r/969387

Change 969393 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[operations/mediawiki-config@master] [beta] Fix Redis configuration

https://gerrit.wikimedia.org/r/969393

Change 969393 merged by jenkins-bot:

[operations/mediawiki-config@master] [beta] Fix Redis configuration

https://gerrit.wikimedia.org/r/969393

It still fails with “Could not acquire locks on server rdb1.” :( I'm trying to figure out why, but beta DoS-ing its own error log (T349944) doesn't help.

PHP Warning: RedisException: Connection timed out. A firewall issue, I guess?

Redis is listening on all IPs:

tgr@deployment-rdb01:~$ sudo lsof -iTCP -sTCP:LISTEN -n -P | grep redis
redis-ser 74039       redis    6u  IPv4 325687      0t0  TCP *:6379 (LISTEN)

iptables has a reasonable-looking rule, but it's not getting any new packets:

tgr@deployment-rdb01:~$ sudo iptables --list -n -v | grep 6379
   40  2382 ACCEPT     tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:6379

sudo tcpdump -n 'tcp port 6379' doesn't show anything, but it reports:

0 packets captured
2 packets received by filter
0 packets dropped by kernel
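
One thing worth ruling out: without -i, tcpdump captures on a single default interface, so explicitly capturing on all interfaces might behave differently (a suggestion, not something I've verified here):

$ sudo tcpdump -ni any 'tcp port 6379'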

Puppet created /etc/ferm/conf.d/10_redis_master_role which contains

&SERVICE(tcp, (6379));
&NO_TRACK(tcp, (6379));

So the ferm issue seems fixed but something is still blocking connections.

Actually, the packet count on that ACCEPT rule does increase when I do something like

tgr@deployment-mediawiki11:~$ time timeout 1s telnet deployment-rdb01.deployment-prep.eqiad1.wikimedia.cloud 6379; echo $?

(which fails). So does that mean it reaches iptables but gets filtered by something else behind it? No idea how to start debugging that.
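
One way to narrow it down might be to capture on the receiving side while poking the port from mediawiki11, to see whether the SYN arrives at all and whether anything answers (a sketch, using the hosts named above):

# on deployment-rdb01:
$ sudo tcpdump -ni any 'tcp port 6379 and host 172.16.3.203'
# on deployment-mediawiki11, in parallel:
$ nc -vz -w5 deployment-rdb01.deployment-prep.eqiad1.wikimedia.cloud 6379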

Seems like it started working for no particular reason (something something caching?). LockManager.log now has WRONGPASS invalid username-password pair.

Tgr claimed this task.

Updated the password (no idea where the old one came from, but LockManager seems to be the only thing using the Redis password) and uploads now work: https://commons.wikimedia.beta.wmflabs.org/wiki/File:Test-T340908.png
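
For the record, a quick way to confirm that the credentials MediaWiki uses now match the server (a sketch; $REDIS_PASS stands for the shared password from the MediaWiki config):

$ redis-cli -h deployment-rdb01.deployment-prep.eqiad1.wikimedia.cloud -a "$REDIS_PASS" ping
# expected output: PONG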