
Login broken by memcached ferm rules being bypassed by hiera configuration
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue:

What happens?:

Various errors depending on local cookie state as far as I can tell:

  • HTTP 400 with "invalid returnUrlToken" from login.wikimedia.beta.wmcloud.org
  • HTTP 200 with "The provided authentication token is either expired or invalid." from auth.wikimedia.beta.wmcloud.org

I do seem to end up with an authenticated session at https://auth.wikimedia.beta.wmcloud.org/metawiki/wiki/Special:UserLogin in the HTTP 200 case.

What should have happened instead?:

Returned to https://meta.wikimedia.beta.wmcloud.org/wiki/Main_Page with an authenticated session.

Event Timeline

bd808 renamed this task from Logins seem not to work after swtich to *.beta.wmcloud.org canonical domains to Logins seem not to work after switch to *.beta.wmcloud.org canonical domains. Jul 12 2025, 12:13 AM

https://beta-logs.wmcloud.org shows a lot of errors of the form Memcached error for key "{memcached-key}" on server "{memcached-server}": SERVER ERROR. The ones that look something like Memcached error for key "global:centralauth-sul3-start:centralauth:9c276ff0c2ef3e09267af520bebeb8b2" on server "127.0.0.1:11213": SERVER ERROR seem potentially related.

I'm seeing the errors mentioned in the task description when trying to log in to https://en.wikipedia.beta.wmcloud.org/wiki/Main_Page

Personally, this isn't preventing me from doing anything urgent. Just adding another datapoint.

Yeah these are both memcached errors. Tokens are stored in the micro stash, which indeed seems broken.

tgr@deployment-mwmaint03:~$ mwscript shell enwiki
Psy Shell v0.12.8 (PHP 8.1.32 — cli) by Justin Hileman
> $ms = MW::srv()->getMicroStash()
= Wikimedia\ObjectCache\MemcachedPeclBagOStuff {#974}

> $ms->set('test:foo', 'bar')
= false

> $ms->get('test:foo')
= false

The set command logs an unhelpful Memcached error for key "test:foo" on server "127.0.0.1:11213": SERVER ERROR.

Microstash uses the mcrouter-primary-dc object cache config, which is

[
    "class" => "MemcachedPeclBagOStuff",
    "serializer" => "php",
    "persistent" => false,
    "servers" => [
      "127.0.0.1:11213",
    ],
    "server_failure_limit" => 1000000000.0,
    "retry_timeout" => -1,
    "loggroup" => "memcached",
    "timeout" => 500000.0,
    "allow_tcp_nagle_delay" => false,
    "routingPrefix" => "/eqiad/mw/",
  ]

At a glance, mcrouter seems to be running and has the correct prefix:

tgr@deployment-mediawiki14:~$ ps -o command= --pid `systemctl show --property MainPID --value mcrouter` | cat
/usr/bin/mcrouter --debug-fifo-root /var/lib/mcrouter/fifos --stats-root /var/lib/mcrouter/stats -p 11213 --config file:/etc/mcrouter/config.json --route-prefix=eqiad/mw --cross-region-timeout-ms=250 --cross-cluster-timeout-ms=1000 --send-invalid-route-to-default --file-observer-poll-period-ms=1000 --file-observer-sleep-before-update-ms=100 --num-proxies=1 --probe-timeout-initial=60000 --timeouts-until-tko=10
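
The "correct prefix" check above was done by eye. A minimal Python sketch of the same comparison, assuming that ignoring leading and trailing slashes is enough to treat "/eqiad/mw/" and "eqiad/mw" as equal (the helper names are hypothetical, not part of mcrouter or MediaWiki):

```python
import shlex
from typing import Optional


def mcrouter_option(cmdline: str, name: str) -> Optional[str]:
    """Return the value of `name value` or `name=value` from a command line."""
    args = shlex.split(cmdline)
    for i, arg in enumerate(args):
        if arg == name and i + 1 < len(args):
            return args[i + 1]
        if arg.startswith(name + "="):
            return arg.split("=", 1)[1]
    return None


def same_prefix(routing_prefix: str, route_prefix: str) -> bool:
    """Compare two route prefixes, ignoring leading and trailing slashes."""
    return routing_prefix.strip("/") == route_prefix.strip("/")


# Shortened copy of the command line shown above.
cmdline = "/usr/bin/mcrouter -p 11213 --config file:/etc/mcrouter/config.json --route-prefix=eqiad/mw"
port = mcrouter_option(cmdline, "-p")
prefix = mcrouter_option(cmdline, "--route-prefix")
print(port, prefix, same_prefix("/eqiad/mw/", prefix or ""))
```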

A manual test provides marginally more information (seems like the error logging of MemcachedPeclBagOStuff is buggy):

tgr@deployment-mediawiki14:~$ cat << 'MC' | nc -NC 127.0.0.1 11213
set /eqiad/mw/test:foo 0 600 3
bar
MC
SERVER_ERROR Server unavailable. Reason: mc_res_connect_timeout
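
For reference, the bytes that the nc test sends and the reply it gets back can be sketched in Python. This is only an illustration of the memcached text protocol as used above; the function names are made up, and the "Reason:" suffix is handled simply because it appears in the mcrouter reply quoted above:

```python
def build_set(key: str, value: bytes, exptime: int, flags: int = 0) -> bytes:
    """Encode a memcached text-protocol `set` command (CRLF line endings, like nc -C)."""
    header = f"set {key} {flags} {exptime} {len(value)}\r\n".encode("ascii")
    return header + value + b"\r\n"


def classify_reply(line: str) -> str:
    """Turn the first reply line into a short human-readable verdict."""
    line = line.strip()
    if line == "STORED":
        return "ok"
    if line.startswith("SERVER_ERROR"):
        # The reply above carries a reason such as mc_res_connect_timeout.
        reason = line.rsplit("Reason:", 1)[-1].strip() if "Reason:" in line else "unknown"
        return f"server error ({reason})"
    return f"unexpected: {line}"


print(build_set("/eqiad/mw/test:foo", b"bar", 600))
print(classify_reply("SERVER_ERROR Server unavailable. Reason: mc_res_connect_timeout"))
```

A reason of mc_res_connect_timeout points at the hop between mcrouter and the memcached backends rather than at MediaWiki itself, which is what the following direct test checks.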

When connecting directly, it indeed times out:

tgr@deployment-mediawiki14:~$ jq '.pools.eqiad' /etc/mcrouter/config.json 
{
  "servers": [
    "deployment-memc11:11211:ascii:plain",
    "deployment-memc12:11211:ascii:plain"
  ]
}

tgr@deployment-mediawiki14:~$ cat << 'MC' | nc -NC deployment-memc13 11211
set test:foo 0 600 3
bar
MC

Locally, it works fine though:

tgr@deployment-memc11:~$ cat << 'MC' | nc -C 127.0.0.1 11211
set test:foo 0 600 3
bar
MC
STORED

tgr@deployment-memc11:~$ cat << 'MC' | nc -NC 127.0.0.1 11211
get test:foo        
MC
VALUE test:foo 0 3
bar
END

So, some kind of firewall issue?
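
One way to tell a firewall drop from a service that is simply not listening is to look at how the TCP connect fails. A rough Python sketch, assuming the usual behaviour that a dropped packet shows up as a timeout while a closed port is refused right away (the probe function is a hypothetical helper, not an existing tool):

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connect and report 'open', 'timeout', 'refused' or another error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: typical for a firewall that silently drops packets.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks and similar problems.
        return f"error: {exc}"
```

Calling probe("deployment-memc13", 11211) from deployment-mediawiki14 would be the equivalent of the hanging nc above. What it returns depends on the environment, so no result is claimed here.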

Mentioned in SAL (#wikimedia-releng) [2025-07-12T22:24:00Z] <Krinkle> Add port 11212 to 'default' security group in deployment-prep, similar to Redis, ref T399349

Mentioned in SAL (#wikimedia-releng) [2025-07-12T22:24:19Z] <Krinkle> Add port 11212 for Memcached to 'default' security group in deployment-prep (TCP, IPv4+IPV6), similar to Redis, ref T399349

Well, that didn't help. I still can't get a packet through.

Locally:
krinkle@deployment-memc13:~$ echo "delete krinkle0" | nc deployment-memc13 11211
NOT_FOUND
Remotely:
krinkle@deployment-mediawiki14:~$ echo "delete krinkle0" | nc deployment-memc13 11211

^C
krinkle@deployment-mediawiki14:~$

From your logged message, you added 11212 rather than 11211.

> So, some kind of firewall issue?

The security groups that @Krinkle modified should only affect traffic in/out of the project from the rest of the Cloud VPS network. I think the issue here may be that iptables is active and not allowing connections to memcached.

bd808@deployment-memc13:~$ sudo iptables -L -v -n
Chain INPUT (policy DROP 8668 packets, 1255K bytes)
 pkts bytes target     prot opt in     out     source               destination
97519   49M ACCEPT     0    --  *      *       0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED
    7   420 ACCEPT     0    --  lo     *       0.0.0.0/0            0.0.0.0/0
    0     0 ACCEPT     0    --  *      *       0.0.0.0/0            0.0.0.0/0            PKTTYPE = multicast
   17  3280 DROP       6    --  *      *       0.0.0.0/0            0.0.0.0/0            state NEW tcp flags:!0x17/0x02
    0     0 ACCEPT     1    --  *      *       0.0.0.0/0            0.0.0.0/0
    2   112 ACCEPT     0    --  *      *       172.16.6.65          0.0.0.0/0
    1    60 ACCEPT     0    --  *      *       172.16.0.229         0.0.0.0/0
    0     0 ACCEPT     6    --  *      *       172.16.1.220         0.0.0.0/0            tcp dpt:22
    5   664 ACCEPT     6    --  *      *       172.16.3.145         0.0.0.0/0            tcp dpt:22
    0     0 ACCEPT     6    --  *      *       172.16.5.168         0.0.0.0/0            tcp dpt:22
    0     0 ACCEPT     6    --  *      *       172.16.1.220         0.0.0.0/0            tcp dpt:22
    0     0 ACCEPT     6    --  *      *       172.16.2.62          0.0.0.0/0            tcp dpt:22
 2678  878K DROP       17   --  *      *       0.0.0.0/0            255.255.255.255      udp spt:67 dpt:68
 8668 1255K NFLOG      0    --  *      *       0.0.0.0/0            0.0.0.0/0            limit: avg 1/sec burst 5 nflog-prefix "[fw-in-drop]"

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 86965 packets, 118M bytes)
 pkts bytes target     prot opt in     out     source               destination
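
The listing above has no rule for port 11211, so new connections to memcached fall through to the DROP policy. A small Python sketch that scans this kind of output for port-specific ACCEPT rules. It is an illustration only: it assumes the column layout shown above and deliberately ignores blanket rules that accept all traffic from one source:

```python
from typing import List


def accepted_sources(listing: str, port: int) -> List[str]:
    """Sources with a port-specific ACCEPT rule in `iptables -L -v -n` output."""
    sources = []
    for line in listing.splitlines():
        fields = line.split()
        # Columns: pkts bytes target prot opt in out source destination [match...]
        if len(fields) >= 9 and fields[2] == "ACCEPT" and f"dpt:{port}" in fields[9:]:
            sources.append(fields[7])
    return sources


# Two rows copied from the listing above.
BEFORE_FIX = """\
    0     0 ACCEPT     6    --  *      *       172.16.1.220         0.0.0.0/0            tcp dpt:22
    2   112 ACCEPT     0    --  *      *       172.16.6.65          0.0.0.0/0
"""
print(accepted_sources(BEFORE_FIX, 11211))  # prints []
```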

The profile::memcached::instance Puppet class has logic for creating firewall::service managed rules, but we have disabled that in project local hiera:

modules/profile/manifests/memcached/instance.pp
if $firewall_src_sets {
    firewall::service { 'memcached':
        proto    => 'tcp',
        port     => $port,
        src_sets => $firewall_src_sets,
    }
}
if $firewall_srange {
    firewall::service { 'memcached':
        proto  => 'tcp',
        port   => $port,
        srange => $firewall_srange,
    }
}

https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/master/deployment-prep/deployment-memc.yaml

deployment-memc.yaml
profile::memcached::firewall_srange: null
profile::memcached::firewall_src_sets: null
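
To spell out why those two null values remove the rule entirely, here is a rough Python model of the two if blocks quoted above. It is a sketch, not the Puppet implementation: it assumes a YAML null arrives as undef, which Puppet treats as false, and the parameter types are guesses:

```python
from typing import List, Optional


def memcached_firewall_rules(
    port: int,
    src_sets: Optional[List[str]],
    srange: Optional[str],
) -> List[dict]:
    """Model which firewall::service resources the profile would declare."""
    rules = []
    # None stands in for Puppet's undef, which is false in an `if`.
    if src_sets is not None:
        rules.append({"proto": "tcp", "port": port, "src_sets": src_sets})
    if srange is not None:
        rules.append({"proto": "tcp", "port": port, "srange": srange})
    return rules


# Both keys null, as in deployment-memc.yaml above: nothing is declared.
print(memcached_firewall_rules(11211, None, None))  # prints []
```

With no firewall::service declared, no ACCEPT rule for the memcached port is generated, and the INPUT chain's DROP policy applies to every new connection.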

It looks like I was actually the jerk who did this last month on 2025-06-05 (T396109) and 2025-06-12 (T396732).

bd808 renamed this task from Logins seem not to work after switch to *.beta.wmcloud.org canonical domains to Login broken by memcached local firewall. Jul 13 2025, 10:38 PM

https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/531c91d329807702d416cfee13913a4443ac2531%5E%21/#F0

diff --git a/deployment-prep/deployment-memc.yaml b/deployment-prep/deployment-memc.yaml
index 91b3f01..f1294b6 100644
--- a/deployment-prep/deployment-memc.yaml
+++ b/deployment-prep/deployment-memc.yaml

@@ -1,2 +1,3 @@
 profile::memcached::firewall_srange: null
-profile::memcached::firewall_src_sets: null
+profile::memcached::firewall_src_sets:
+- DOMAIN_NETWORKS

bd808@deployment-memc13:~$ sudo -i puppet agent -tv
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for deployment-memc13.deployment-prep.eqiad1.wikimedia.cloud
Info: Applying configuration version '(84f67c0cb6) gitpuppet - beta: Change beta.wmcloud.org stub redirect to new Meta-Wiki canonical'
Notice: /Stage[main]/Profile::Memcached::Instance/Firewall::Service[memcached]/Ferm::Service[memcached]/File[/etc/ferm/conf.d/10_memcached]/ensure: defined content as '{sha256}96cca0b97ea0b413deba468764bf4a7e87d925be15c7cc1002b2118dc61ec3b2'
Info: /Stage[main]/Profile::Memcached::Instance/Firewall::Service[memcached]/Ferm::Service[memcached]/File[/etc/ferm/conf.d/10_memcached]: Scheduling refresh of Service[ferm]
Notice: /Stage[main]/Ferm/Service[ferm]: Triggered 'refresh' from 1 event
Notice: Applied catalog in 7.74 seconds
bd808@deployment-memc13:~$ sudo iptables -L -v -n | grep 11211
    5   300 ACCEPT     6    --  *      *       172.16.0.0/21        0.0.0.0/0            tcp dpt:11211
    0     0 ACCEPT     6    --  *      *       172.16.128.0/24      0.0.0.0/0            tcp dpt:11211
    0     0 ACCEPT     6    --  *      *       172.16.129.0/24      0.0.0.0/0            tcp dpt:11211
    0     0 ACCEPT     6    --  *      *       172.16.130.0/24      0.0.0.0/0            tcp dpt:11211
    0     0 ACCEPT     6    --  *      *       172.16.131.0/24      0.0.0.0/0            tcp dpt:11211
    0     0 ACCEPT     6    --  *      *       172.16.16.0/21       0.0.0.0/0            tcp dpt:11211
    0     0 ACCEPT     6    --  *      *       172.16.24.0/24       0.0.0.0/0            tcp dpt:11211
    0     0 ACCEPT     6    --  *      *       172.16.8.0/21        0.0.0.0/0            tcp dpt:11211
    0     0 ACCEPT     6    --  *      *       172.20.1.0/24        0.0.0.0/0            tcp dpt:11211
    0     0 ACCEPT     6    --  *      *       172.20.2.0/24        0.0.0.0/0            tcp dpt:11211
    0     0 ACCEPT     6    --  *      *       172.20.254.0/24      0.0.0.0/0            tcp dpt:11211
    0     0 ACCEPT     6    --  *      *       172.20.255.0/24      0.0.0.0/0            tcp dpt:11211
    0     0 ACCEPT     6    --  *      *       172.20.3.0/24        0.0.0.0/0            tcp dpt:11211
    0     0 ACCEPT     6    --  *      *       172.20.4.0/24        0.0.0.0/0            tcp dpt:11211
    0     0 ACCEPT     6    --  *      *       172.20.5.0/24        0.0.0.0/0            tcp dpt:11211
    0     0 ACCEPT     6    --  *      *       185.15.56.0/25       0.0.0.0/0            tcp dpt:11211
    0     0 ACCEPT     6    --  *      *       185.15.56.160/28     0.0.0.0/0            tcp dpt:11211
    0     0 ACCEPT     6    --  *      *       185.15.57.0/29       0.0.0.0/0            tcp dpt:11211
    0     0 ACCEPT     6    --  *      *       185.15.57.16/29      0.0.0.0/0            tcp dpt:11211
    0     0 ACCEPT     6    --  *      *       185.15.57.24/29      0.0.0.0/0            tcp dpt:11211
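
Whether a given client is covered by the new rules comes down to CIDR membership. A short Python sketch using a few of the ranges from the listing above. The client address 172.16.6.65 is borrowed from the earlier iptables output purely as an example; the thread does not say which address deployment-mediawiki14 uses:

```python
import ipaddress
from typing import Iterable

# A subset of the source ranges from the listing above.
ALLOWED = ["172.16.0.0/21", "172.16.8.0/21", "172.16.16.0/21", "185.15.56.0/25"]


def is_allowed(client: str, networks: Iterable[str] = ALLOWED) -> bool:
    """True if the client address falls inside any of the given networks."""
    addr = ipaddress.ip_address(client)
    return any(addr in ipaddress.ip_network(net) for net in networks)


print(is_allowed("172.16.6.65"))  # prints True: inside 172.16.0.0/21
print(is_allowed("192.0.2.1"))    # prints False: a documentation-only address
```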

> From your logged message, you added 11212 rather than 11211.

> The security groups that @Krinkle modified should only affect traffic in/out of the project from the rest of the Cloud VPS network. […]

Heh, that was only a typo in the log message. The rule was actually for 11211. In any case, I've removed the security group rules now.

After forcing a puppet run on deployment-memc11 and deployment-memc12 I seem to be able to log in again.

Fixed!

krinkle@deployment-mediawiki14:~$ echo "delete krinkle0" | nc deployment-memc11 11211
NOT_FOUND
^C
krinkle@deployment-mediawiki14:~$ echo "delete krinkle0" | nc deployment-memc12 11211
NOT_FOUND
^C
krinkle@deployment-mediawiki14:~$ echo "delete krinkle0" | nc deployment-memc13 11211
NOT_FOUND
^C
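
The same end-to-end check can be scripted without nc. A minimal Python sketch, assuming a plain-text memcached on the other end. The helper is hypothetical; any reply line, including NOT_FOUND, means the firewall let the connection through:

```python
import socket


def memcached_delete(host: str, port: int, key: str, timeout: float = 3.0) -> str:
    """Send `delete <key>` and return the first reply line, or 'unreachable: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(f"delete {key}\r\n".encode("ascii"))
            reply = sock.makefile("rb").readline()
            return reply.decode("ascii", "replace").strip()
    except OSError as exc:
        # Covers timeouts, refused connections and name resolution failures.
        return f"unreachable: {exc}"


# Usage, to be run from a MediaWiki host such as deployment-mediawiki14:
# for host in ("deployment-memc11", "deployment-memc12", "deployment-memc13"):
#     print(host, memcached_delete(host, 11211, "krinkle0"))
```
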
bd808 claimed this task.

I think the moral of the story here is that "make puppet run" is a necessary, but not always sufficient action when getting Beta to work with upstream Puppet changes. In a more awesome world the folks who are changing Puppet for production would also walk through the same changes in Beta Cluster. Until that becomes common practice we will continue to be at the mercy of the patience and persistence of folks who volunteer their time (paid or not) to keep Beta working.

bd808 renamed this task from Login broken by memcached local firewall to Login broken by memcached ferm rules being bypassed by hiera configuration. Jul 13 2025, 10:54 PM