Page MenuHomePhabricator

[betalabs] memcached listens solely on 127.0.0.1 (was: Cannot create a new user account)
Open, HighPublic

Description

Attempting to create a new user account will persistently display "An account creation for this user name is already in progress. Please wait."

Root cause

As found by @Tgr @mmodell and @hashar :

memcached was only listening on 127.0.0.1 since Sep 8th 6:42. That is due to an upgrade of the Debian memcached package which restarted the service with the stock configuration (which only listens on 127.0.0.1).

Puppet ran and updated the service definition but does not restart memcached (for good reason since that would flush the cache and we do that manually instead).

Impact

Cache were disabled and beta cluster has been slow since Sep 8th. Notably detected by:

Fix

@hashar restarted memcached on all instances on Sep 25th around 7:00 to use the Puppet provided configuration.

Related Objects

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 12 2019, 11:18 PM
zeljkofilipin triaged this task as High priority.Tue, Sep 17, 11:19 AM
zeljkofilipin moved this task from Backlog 🔙 to In Progress 🔨 on the User-zeljkofilipin board.
zeljkofilipin added a subscriber: zeljkofilipin.

Special:ListUsers page says that the last users at en.wikipedia.beta.wmflabs.org were created on 2019-09-08.

I've checked https://logstash-beta.wmflabs.org. I have created a filter url is Special:CreateAccount looking at the last 30 days but the only thing it found is 14 INFO messages from 2019-09-13.

zeljkofilipin removed zeljkofilipin as the assignee of this task.Tue, Sep 17, 12:41 PM

I'm blocked on this. I need help to figure out what is broken.

zeljkofilipin raised the priority of this task from High to Unbreak Now!.Tue, Sep 17, 3:32 PM
zeljkofilipin added a subscriber: Jdforrester-WMF.

In discussion with @Jdforrester-WMF we've decided to make this a train blocker, until it's clear that it's not happening in production.

Restricted Application added a subscriber: Liuxinyu970226. · View Herald TranscriptTue, Sep 17, 3:32 PM

I would recommend to push 1.34.0-wmf.23 to testwiki and attempt to reproduce there. If that works in production, then that is probably beta that is to blame somehow.

Note that https://logstash-beta.wmflabs.org/ seems broken, there are no events in for _type:mediawiki . Apparently due to logstash input not being able to start. Filled as T233134

Tgr added a subscriber: Tgr.Tue, Sep 17, 9:15 PM

This is a double-submit protection using a memcached lock. Memcached on beta tends to be unreliable, that's the first place I'd look. (If we had logs, that is :)

Tgr added a comment.Tue, Sep 17, 9:24 PM
Sep 17 21:09:02 deployment-deploy02:  PHP Warning:  PHP Startup: Unable to load dynamic library 'geoip.so' (tried: /usr/lib/php/20170718/geoip.so (/usr/lib/php/20170718/geoip.so: cannot open shared object file: No such file or directory), /usr/lib/php/20170718/geoip.so.so (/usr/lib/php/20170718/geoip.so.so: cannot open shared object file: No such file or directory)) in Unknown on line 0
Sep 17 21:09:02 deployment-deploy02:  PHP Warning:  PHP Startup: Unable to load dynamic library 'msgpack.so' (tried: /usr/lib/php/20170718/msgpack.so (/usr/lib/php/20170718/msgpack.so: cannot open shared object file: No such file or directory), /usr/lib/php/20170718/msgpack.so.so (/usr/lib/php/20170718/msgpack.so.so: cannot open shared object file: No such file or directory)) in Unknown on line 0
Sep 17 21:09:02 deployment-deploy02:  PHP Warning:  PHP Startup: Unable to load dynamic library 'memcached.so' (tried: /usr/lib/php/20170718/memcached.so (/usr/lib/php/20170718/memcached.so: cannot open shared object file: No such file or directory), /usr/lib/php/20170718/memcached.so.so (/usr/lib/php/20170718/memcached.so.so: cannot open shared object file: No such file or directory)) in Unknown on line 0

I guess that explains it.

I checked testwiki wmf.23 - creating accounts seem to be ok; so it's only betalabs problem.

Jdforrester-WMF lowered the priority of this task from Unbreak Now! to High.Tue, Sep 17, 9:26 PM

Not a train-blocker, then.

mmodell claimed this task.Wed, Sep 18, 5:38 PM

@Tgr: where did you see these logs specifically?

There are two versions of php7 installed on deployment-deploy02, version 7.0 and 7.3, it seems. And there isn't a php7.3-memcached package, however, php-memcached claims to provide it.

Sep 17 21:09:02 deployment-deploy02:  PHP Warning:  PHP Startup: Unable to load dynamic library 'geoip.so' (tried: /usr/lib/php/20170718/geoip.so (/usr/lib/php/20170718/geoip.so: cannot open shared object file: No such file or directory), /usr/lib/php/20170718/geoip.so.so (/usr/lib/php/20170718/geoip.so.so: cannot open shared object file: No such file or directory)) in Unknown on line 0
Sep 17 21:09:02 deployment-deploy02:  PHP Warning:  PHP Startup: Unable to load dynamic library 'msgpack.so' (tried: /usr/lib/php/20170718/msgpack.so (/usr/lib/php/20170718/msgpack.so: cannot open shared object file: No such file or directory), /usr/lib/php/20170718/msgpack.so.so (/usr/lib/php/20170718/msgpack.so.so: cannot open shared object file: No such file or directory)) in Unknown on line 0
Sep 17 21:09:02 deployment-deploy02:  PHP Warning:  PHP Startup: Unable to load dynamic library 'memcached.so' (tried: /usr/lib/php/20170718/memcached.so (/usr/lib/php/20170718/memcached.so: cannot open shared object file: No such file or directory), /usr/lib/php/20170718/memcached.so.so (/usr/lib/php/20170718/memcached.so.so: cannot open shared object file: No such file or directory)) in Unknown on line 0

I guess that explains it.

Tgr added a comment.Fri, Sep 20, 12:25 AM

@Tgr: where did you see these logs specifically?

php7.2.log, since Special:Version claims Beta is using 7.2.

So I haven't made much headway on this. I don't see any .so files actually missing (The shared object files mentioned in the log snippet actually exist on deployment-deploy02 and on the app servers)

The error message about "account creation already in progress" happens as a result of failing to acquire a lock in AuthManager.php#1193. That implies a problem with memcached. It seems that mcrouter is running and listening on the right port (0.0.0.0:11213) but I'm not sure how to test it any further.

mmodell removed mmodell as the assignee of this task.Tue, Sep 24, 2:24 AM

I'm unassigning since I'm not actively working on this currently. I might take another stab at it later but right now I'm stumped.

I don't mind helping, but there is no php7.2.log anywhere I looked at!

There is no reason to have php7.3 anywhere, unless the instance is based on Buster but there is no Buster based instance in deployment-prep.

The php-memcached that comes with Debian Stretch is intended for php7.0, the .so is copied to the extension_dir /usr/lib/php/20151012.

To get php7.2 installed on Stretch, we have an apt configuration that adds stretch-wikimedia/component/php72. There is a php-memcached there which should takes precedence over the one from Debian Stretch. It has the .so for php7.2 and installs it under /usr/lib/php/20170718

So probably the package we ship should be renamed php7.2-memcached but I digress.


Sep 17 21:09:02 deployment-deploy02:
  PHP Warning:
    PHP Startup: Unable to load dynamic library 'memcached.so'
    (tried:
      /usr/lib/php/20170718/memcached.so (/usr/lib/php/20170718/memcached.so: cannot open shared object file: No such file or directory),
      /usr/lib/php/20170718/memcached.so.so (/usr/lib/php/20170718/memcached.so.so: cannot open shared object file: No such file or directory)
    ) in Unknown on line 0

And it is definitely there :] php7.2 -m shows the module just fine.

Tgr added a comment.Tue, Sep 24, 9:01 PM

I don't mind helping, but there is no php7.2.log anywhere I looked at!

tgr@deployment-fluorine02:~$ ls /srv/mw-log/php*
/srv/mw-log/php7.0.log  /srv/mw-log/php7.2-fpm.log  /srv/mw-log/php7.2.log

More to the point:

tgr@deployment-mediawiki-07:~$ mwscript shell.php enwiki -d 1
>>> $mc = ObjectCache::getLocalClusterInstance()
>>> $mc->setDebug(true)
>>> // shell.php -d 1 calls MediaWikiServices::resetGlobalInstance() but that does not reset ObjectCache loggers :/
>>> sudo $mc->logger = \MediaWiki\Logger\LoggerFactory::getInstance('memcached') 
>>> $mc->set('foo', 1)
[debug] [memcached] MemcachedPeclBagOStuff debug: set(foo)
[error] [memcached] Memcached error for key "foo" on server "127.0.0.1:11213": SERVER ERROR
=> false

that's not super informative, but clearly a memcached problem and seems like the log error about the memcached module not being found was a false trail.

Tgr added a comment.Tue, Sep 24, 9:10 PM
tgr@deployment-mediawiki-07:~$ telnet 127.0.0.1 11213
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
set foo 0 100 3
abc
SERVER_ERROR unavailable

I guess that's mcrouter saying the server it is proxying to is unreachable?

I don't see memcaached listening on a public ip:

twentyafterfour@deployment-memc05:~$ sudo netstat -nlpt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      8954/sshd       
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN      22280/exim4     
tcp        0      0 172.16.5.17:16379       0.0.0.0:*               LISTEN      486/prometheus-redi
tcp        0      0 172.16.5.17:5666        0.0.0.0:*               LISTEN      22748/nrpe      
tcp        0      0 127.0.0.1:11211         0.0.0.0:*               LISTEN      25934/memcached 
tcp        0      0 0.0.0.0:6379            0.0.0.0:*               LISTEN      493/redis-server 0.
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      460/rpcbind     
tcp6       0      0 :::22                   :::*                    LISTEN      8954/sshd       
tcp6       0      0 ::1:25                  :::*                    LISTEN      22280/exim4     
tcp6       0      0 :::9150                 :::*                    LISTEN      494/prometheus-memc
tcp6       0      0 :::9536                 :::*                    LISTEN      30751/prometheus-ip
tcp6       0      0 :::9100                 :::*                    LISTEN      18254/prometheus-no
tcp6       0      0 :::111                  :::*                    LISTEN      460/rpcbind

Mentioned in SAL (#wikimedia-releng) [2019-09-25T07:14:03Z] <hashar> Restarted memcached on deployment-memc04, deployment-memc05 , deployment-memc06, deployment-memc07, deployment-imagescaler01 # T232796

The service is listening on 127.0.0.1 indeed:

systemctl status memcached
...
   CGroup: /system.slice/memcached.service
           └─25934 /usr/bin/memcached -vv -m 64 -p 11211 -u memcache -l 127.0.0.1

The memcached service definition seems to indicate it should listen on any port instead:

/lib/systemd/system/memcached.service
[Service]
ExecStart=/usr/bin/memcached -p 11211 -u nobody -m 3000 -c 25000 -l 0.0.0.0 -n 5 -f 1.05 -D : -R 200 -o slab_reassign

The file is from September 8th 06:46. Which smells like apt unattented upgrade reinstalling the memcached package and overriding the service. Lets verify:

/var/log/apt/history.log
Start-Date: 2019-09-08  06:42:46
Commandline: /usr/bin/unattended-upgrade
Upgrade: memcached:amd64 (1.4.21-1.1+deb8u2, 1.4.21-1.1+deb8u3)
End-Date: 2019-09-08  06:42:47
/var/log/apt/term.log
Log started: 2019-09-08  06:42:46
Preparing to unpack .../memcached_1.4.21-1.1+deb8u3_amd64.deb ...^M
Unpacking memcached (1.4.21-1.1+deb8u3) over (1.4.21-1.1+deb8u2) ...^M
Processing triggers for man-db (2.7.0.2-5) ...^M
Processing triggers for systemd (215-17+deb8u13) ...^M
Setting up memcached (1.4.21-1.1+deb8u3) ...^M
Log ended: 2019-09-08  06:42:47

Then at 06:48 puppet has replaced the Debian package provided file /lib/systemd/system/memcached.service, reloaded systemd (so it shows up in systemctl status has loaded) but has not restarted the service.

The manual fix is easy, restart the service to take in account the proper service definition:

# systemctl restart memcached
# systemctl status memcached
...

   CGroup: /system.slice/memcached.service
           └─19752 /usr/bin/memcached -p 11211 -u nobody -m 3000 -c 25000 -l 0.0.0.0 -n 5 -f 1.05 -D : -R 200 -o slab_reassign
# netstat -tlnp
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:11211           0.0.0.0:*               LISTEN      19752/memcached 
...

Kudos to @Tgr for finding the connection refused message and @mmodell for the 127.0.0.1 listening.

That is the classic issue of having the Debian package taking over Puppet which should be solved. That should be escalated to Operations to find a proper solution.

hashar renamed this task from [betalabs] Cannot create a new user account to [betalabs] memcached listens solely on 127.0.0.1 (was: Cannot create a new user account).Wed, Sep 25, 7:30 AM
hashar updated the task description. (Show Details)
hashar updated the task description. (Show Details)

This task will be completed once someone figure out how to prevent the Debian package from overriding our service definition.

Tgr added a comment.Wed, Sep 25, 8:33 AM

Does that mean the Puppet service resource is not subscribed to changes in its own configuration file? In theory those should cause a restart.

Indeed, the systemd service definition solely reloads the systemd daemon:

modules/systemd/manifests/unit.pp
$exec_label = "systemd daemon-reload for ${unit_name}"
file { $path:
    ensure  => $ensure,
    content => $content,
    mode    => '0444',
    owner   => 'root',
    group   => 'root',
    notify  => Exec[$exec_label],
}

So once puppet has run, systemctl status memcached knows about the change but the service has not been restarted. So it is up to operators to manually restart the service. I guess in production we do that to prevent memcached from being magically restarted by puppet and instead the change is gradually rolled. Same would apply for the memcached package upgrade, I guess ops would upgrade the package, run puppet, start the service.

Tgr added a comment.Wed, Sep 25, 10:08 AM

Yeah that makes sense in production since you need to avoid cache stampedes and such. In beta it should be fine to do automatic restarts, but I guess beta and production share too much puppet code for that to be easily doable?

Tgr added a comment.Wed, Sep 25, 10:13 AM

Oh, I see unit.pp has a restart flag. So all that's needed is setting that differently on beta?

@hashar is this task resolved? I was able to create an account at enwikibeta.

It is not solved per T232796#5521468. We need to prevent it from happening again.


I am still trying to figure out how we ended up having:

/usr/bin/memcached -vv -m 64 -p 11211 -u memcache -l 127.0.0.1

And eventually I found the Debian Jessie package provides a service with a wrapper consuming /etc/memcached.conf:

ExecStart=/usr/share/memcached/scripts/systemd-memcached-wrapper /etc/memcached.conf

Which has:

$ grep -v -E ^# /etc/memcached.conf|egrep -v '^$'
-d
logfile /var/log/memcached.log
-vv
-m 64
-p 11211
-u memcache
-l 127.0.0.1

And match the command line that was running after the service started.

So I guess we have to fix puppet to provide the settings in /etc/memcached.conf and just use the Package provided service definition (which would have to be restored by puppet.