
Performance Q2 2017/18 goal: Install and use mcrouter in deployment-prep
Closed, Resolved, Public

Description

Install the wikimedia mcrouter package and set it up to broadcast DELETE/SET for WANObjectCache.

Make sure the planned routing rules make sense:

{
  "pools": {
    "eqiad-mediawiki": {
      "servers": [
        "10.68.23.25:11211",
        "10.68.23.49:11211"
      ]
    },
    "codfw-mediawiki": {
      "servers": [
        "10.68.22.239:11211",
        "10.68.17.171:11211"
      ]
    }
  },
  "routes":
  [
    {
      "aliases": [ "/eqiad/mw/" ],
      "route": "PoolRoute|eqiad-mediawiki"
    },
    {
      "aliases": [ "/eqiad/mw-wan/" ],
      "route": {
        "type": "OperationSelectorRoute",
        "default_policy": "PoolRoute|eqiad-mediawiki",
        "operation_policies": {
          "set": "AllFastestRoute|Pool|eqiad-mediawiki",
          "delete": "AllFastestRoute|Pool|eqiad-mediawiki"
        }
      }
    },
    {
      "aliases": [ "/codfw/mw/" ],
      "route": "PoolRoute|codfw-mediawiki"
    },
    {
      "aliases": [ "/codfw/mw-wan/" ],
      "route": {
        "type": "OperationSelectorRoute",
        "default_policy": "PoolRoute|codfw-mediawiki",
        "operation_policies": {
          "set": "AllFastestRoute|Pool|codfw-mediawiki",
          "delete": "AllFastestRoute|Pool|codfw-mediawiki"
        }
      }
    }
  ]
}
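
One way to sanity-check the planned rules is to list, for each alias, which route type and pool every operation ends up on, and to fail if a route names a pool that is not defined. The following is a minimal Python sketch under stated assumptions: it embeds a trimmed copy of the config above (eqiad only), it understands only the shorthand strings used here (`PoolRoute|<pool>` and `AllFastestRoute|Pool|<pool>`), and the helper names `pool_of` and `summarize` are made up for illustration. It is not a mcrouter config validator.

```python
import json

# Trimmed copy of the planned config (eqiad only), for illustration.
CONFIG = json.loads("""
{
  "pools": {
    "eqiad-mediawiki": {"servers": ["10.68.23.25:11211", "10.68.23.49:11211"]}
  },
  "routes": [
    {"aliases": ["/eqiad/mw/"], "route": "PoolRoute|eqiad-mediawiki"},
    {"aliases": ["/eqiad/mw-wan/"],
     "route": {
       "type": "OperationSelectorRoute",
       "default_policy": "PoolRoute|eqiad-mediawiki",
       "operation_policies": {
         "set": "AllFastestRoute|Pool|eqiad-mediawiki",
         "delete": "AllFastestRoute|Pool|eqiad-mediawiki"
       }
     }}
  ]
}
""")


def pool_of(policy):
    """Return the pool name at the end of a 'Type|...|pool' shorthand string."""
    return policy.split("|")[-1]


def summarize(config):
    """Map each alias to {operation: (route type, pool)}.

    The 'default' key covers every operation without its own policy.
    Raises ValueError if a policy references a pool that is not defined.
    """
    summary = {}
    for entry in config["routes"]:
        route = entry["route"]
        if isinstance(route, str):
            policies = {"default": route}
        else:
            policies = {"default": route["default_policy"]}
            policies.update(route.get("operation_policies", {}))
        ops = {}
        for op, policy in policies.items():
            pool = pool_of(policy)
            if pool not in config["pools"]:
                raise ValueError(f"{policy!r} references unknown pool {pool!r}")
            ops[op] = (policy.split("|")[0], pool)
        for alias in entry["aliases"]:
            summary[alias] = ops
    return summary


SUMMARY = summarize(CONFIG)
for alias, ops in SUMMARY.items():
    print(alias, ops)
```

Read this way, the `/eqiad/mw/` alias sends every operation through `PoolRoute`, while `/eqiad/mw-wan/` switches only `set` and `delete` to `AllFastestRoute` and leaves everything else on the default policy.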

Event Timeline

Krinkle removed aaron as the assignee of this task. Apr 5 2017, 7:37 PM
Krinkle lowered the priority of this task from Medium to Low.
Krinkle subscribed.

Lowering priority pending outcome of T156938: Investigate dynomite for WANObjectCache support.

Couple of questions:

  1. What is the scope that mcrouter will have given T134811? Memcached only, or Redis too? I am asking because ops is struggling to manage basic maintenance like host reboots or even simple restarts on mc* hosts due to nutcracker and the session cache. For example, rebooting one host will cause nutcracker to temporarily exclude the shard from the cluster, and some user impact can arise (dropped sessions, CSRF alerts, etc.). It would be great to have a way to handle this use case transparently. I checked and mcrouter supports replicated shards, but it would need to be tested in depth.
  2. Related to the above: would it be possible to also think about deploying mcrouter in front of the jobqueues to limit issues like T125735?

I am available to help testing in case, let me know :)

If we were to use mcrouter in front of twemproxy+redis, I suppose we could use replication to help with maintenance. It's not really something I've considered in depth as I'm focused on just caching on this task.

mcrouter won't be usable for the jobqueue due to Lua usage.

Gilles renamed this task from Install and use mcrouter in deployment-prep to Performance Q2 2017/18 goal: Install and use mcrouter in deployment-prep. Sep 8 2017, 9:44 AM

So, running mcrouter via screen -r with the config in /etc/mcrouter/mcrouter.json on tin seems to work fine. The pool replication works and the timings are comparable to twemproxy -- often better than twemproxy.

A setup with mcrouter alongside twemproxy should be done via puppet in deployment-prep with this config.

aaron@deployment-tin:~$ mwscript eval.php enwiki

> $cmr = ObjectCache::newFromParams( [ 'class' => 'MemcachedPeclBagOStuff', 'servers' => [ '127.0.0.1:11213' ], 'persistent' => false ] );

> $ctp = ObjectCache::getLocalClusterInstance();

> $fs = function ( $c ) { $bad = 0; $t = microtime(true); for ( $i=0; $i<3000; ++$i ) { $bad += (int)!$c->set( "key$i", 1, 60 );} var_dump( microtime(true) - $t, $bad ); }

> $fg = function ( $c ) { $bad = 0; $t = microtime(true); for ( $i=0; $i<3000; ++$i ) { $bad += (int)!$c->get( "key$i" ); } var_dump( microtime(true) - $t, $bad ); }

> $fd = function ( $c ) { $bad = 0; $t = microtime(true); for ( $i=0; $i<3000; ++$i ) { $bad += (int)!$c->delete( "key$i" ); } var_dump( microtime(true) - $t, $bad ); }

> $fa = function ( $c ) { $bad = 0; $t = microtime(true); for ( $i=0; $i<3000; ++$i ) { $bad += (int)!$c->add( "key$i", 1, 60 );} var_dump( microtime(true) - $t, $bad ); }

> echo "mcrouter (SET) [sec, failures]\n"; $fs($cmr); // mcrouter => memcached
mcrouter (SET) [sec, failures]
float(2.8776490688324)
int(0)

> echo "twemproxy (SET) [sec, failures]\n";$fs($ctp); // twemproxy => memcached
twemproxy (SET) [sec, failures]
float(3.5276031494141)
int(0)

> echo "mcrouter (GET) [sec, failures]\n"; $fg($cmr); // mcrouter => memcached
mcrouter (GET) [sec, failures]
float(3.6640191078186)
int(0)

> echo "twemproxy (GET) [sec, failures]\n";$fg($ctp); // twemproxy => memcached
twemproxy (GET) [sec, failures]
float(3.9909389019012)
int(0)

> echo "mcrouter (DELETE) [sec, failures]\n"; $fd($cmr); // mcrouter => memcached
mcrouter (DELETE) [sec, failures]
float(3.0156950950623)
int(0)

> echo "twemproxy (DELETE) [sec, failures]\n";$fd($ctp); // twemproxy => memcached
twemproxy (DELETE) [sec, failures]
float(3.3627481460571)
int(0)

> echo "mcrouter (ADD) [sec, failures]\n"; $fa($cmr); // mcrouter => memcached
mcrouter (ADD) [sec, failures]
float(3.6320049762726)
int(0)

> $fd($ctp); // clear out
float(3.3900511264801)
int(0)

> echo "twemproxy (ADD) [sec, failures]\n";$fa($ctp); // twemproxy => memcached
twemproxy (ADD) [sec, failures]
float(3.6700918674469)
int(0)
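
To put the numbers above on a common scale, the totals can be converted to per-operation latency and to the share of twemproxy time that mcrouter saved. This is a small Python sketch that only does arithmetic on the timings pasted in this comment; the helper names are made up for illustration, and a single run of 3000 operations per command is not a rigorous benchmark.

```python
# Timings copied from the eval.php session above: seconds per 3000 operations.
N_OPS = 3000
TIMINGS = {
    "SET":    {"mcrouter": 2.8776490688324, "twemproxy": 3.5276031494141},
    "GET":    {"mcrouter": 3.6640191078186, "twemproxy": 3.9909389019012},
    "DELETE": {"mcrouter": 3.0156950950623, "twemproxy": 3.3627481460571},
    "ADD":    {"mcrouter": 3.6320049762726, "twemproxy": 3.6700918674469},
}


def per_op_ms(total_seconds, n_ops=N_OPS):
    """Average latency of one operation, in milliseconds."""
    return total_seconds / n_ops * 1000.0


def relative_saving(mcrouter_s, twemproxy_s):
    """Fraction of the twemproxy time that mcrouter saved (0.1 means 10%)."""
    return (twemproxy_s - mcrouter_s) / twemproxy_s


for op, t in TIMINGS.items():
    print(
        f"{op:6s} mcrouter {per_op_ms(t['mcrouter']):.3f} ms/op, "
        f"twemproxy {per_op_ms(t['twemproxy']):.3f} ms/op, "
        f"saving {relative_saving(t['mcrouter'], t['twemproxy']):.1%}"
    )
```

On these figures mcrouter comes out ahead for every command, by roughly 18% for SET, 10% for DELETE, 8% for GET, and about 1% for ADD.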

Some replication examples:

aaron@deployment-tin:~$ 
aaron@deployment-tin:~$ echo "SET (multi-DC)"
SET (multi-DC)
aaron@deployment-tin:~$ printf "set mykey 0 60 4\r\ndata\r\nquit\n" | \nc 127.0.0.1 11213
STORED
--
DC 1:
aaron@deployment-tin:~$ printf "get mykey\nquit\n" | \nc 10.68.23.25 11211
VALUE mykey 0 4
data
END
aaron@deployment-tin:~$ printf "get mykey\nquit\n" | \nc 10.68.23.49 11211
END
--
DC 2:
aaron@deployment-tin:~$ printf "get mykey\nquit\n" | \nc 10.68.22.239 11211
VALUE mykey 0 4
data
END
aaron@deployment-tin:~$ printf "get mykey\nquit\n" | \nc 10.68.17.171 11211
END
aaron@deployment-tin:~$ echo "DELETE (multi-DC)"
DELETE (multi-DC)
aaron@deployment-tin:~$ printf "delete mykey\r\nquit\n" | \nc 127.0.0.1 11213
DELETED
--
DC 1:
aaron@deployment-tin:~$ printf "get mykey\nquit\n" | \nc 10.68.23.25 11211
END
aaron@deployment-tin:~$ printf "get mykey\nquit\n" | \nc 10.68.23.49 11211
END
--
DC 2:
aaron@deployment-tin:~$ printf "get mykey\nquit\n" | \nc 10.68.22.239 11211
END
aaron@deployment-tin:~$ printf "get mykey\nquit\n" | \nc 10.68.17.171 11211
END
aaron@deployment-tin:~$ echo "ADD/INCR (single-DC)"
ADD/INCR (single-DC)
aaron@deployment-tin:~$ printf "add mykey 0 60 1\r\n1\r\nquit\n" | \nc 127.0.0.1 11213
STORED
aaron@deployment-tin:~$ printf "incr mykey 1\r\nquit\n" | \nc 127.0.0.1 11213
2
--
DC 1:
aaron@deployment-tin:~$ printf "get mykey\nquit\n" | \nc 10.68.23.25 11211
VALUE mykey 0 1
2
END
aaron@deployment-tin:~$ printf "get mykey\nquit\n" | \nc 10.68.23.49 11211
END
--
DC 2:
aaron@deployment-tin:~$ printf "get mykey\nquit\n" | \nc 10.68.22.239 11211
END
aaron@deployment-tin:~$ printf "get mykey\nquit\n" | \nc 10.68.17.171 11211
END
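
The manual `nc` checks above could be scripted. The sketch below, in Python, speaks the memcached text protocol directly: it sends `get <key>` to each backend and reports which ones hold the key. The function names are made up for illustration, the backend list has to be supplied by the caller, and error handling is minimal (a server that answers with something other than a `VALUE` block or `END` will simply run into the socket timeout).

```python
import socket


def parse_get_response(raw):
    """Return the value from a memcached text-protocol 'get' reply, or None on a miss."""
    header, _, rest = raw.partition(b"\r\n")
    if not header.startswith(b"VALUE "):
        return None  # a miss is answered with just "END\r\n"
    # Header form: VALUE <key> <flags> <bytes>
    length = int(header.split()[3])
    return rest[:length]


def fetch(host, port, key, timeout=2.0):
    """Send 'get <key>' to one memcached server and return the value or None."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"get " + key.encode() + b"\r\n")
        raw = b""
        # Read until the terminating END line; raises on timeout.
        while not raw.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            raw += chunk
    return parse_get_response(raw)


def replication_report(backends, key):
    """Map 'host:port' to True/False depending on whether the key is present."""
    return {f"{host}:{port}": fetch(host, port, key) is not None
            for host, port in backends}
```

Calling `replication_report([("10.68.23.25", 11211), ("10.68.23.49", 11211), ("10.68.22.239", 11211), ("10.68.17.171", 11211)], "mykey")` after each step would give the same picture as the transcript: after the SET the key is present on one server per DC, after the DELETE on none, and after ADD/INCR on one server in DC 1 only.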

I didn't check the code yet but from an evaluation and monitoring angle it'd be nice to get Prometheus metrics from mcrouter natively out of the box. If that's not the case yet then perhaps upstream might be interested in having support. If that fails too we'd have to write an external exporter to convert metrics into Prometheus format.
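
For the external-exporter fallback, the conversion itself is small. The Python sketch below assumes mcrouter answers a `stats` request with memcached-style `STAT <name> <value>` lines and turns the numeric ones into Prometheus text exposition lines. The `mcrouter_` prefix, the function name, and the stat names in the usage note are illustrative assumptions, not a list of what mcrouter actually exports, and a real exporter would also need metric types, labels, and an HTTP endpoint.

```python
import re


def stats_to_prometheus(stats_text, prefix="mcrouter_"):
    """Convert 'STAT <name> <value>' lines to Prometheus text exposition lines.

    Non-numeric stats (version strings and the like) are skipped, and
    characters that are not valid in a metric name are replaced by '_'.
    """
    lines = []
    for line in stats_text.splitlines():
        parts = line.split(None, 2)
        if len(parts) != 3 or parts[0] != "STAT":
            continue  # ignore END and anything that is not a STAT line
        name, value = parts[1], parts[2].strip()
        try:
            number = float(value)
        except ValueError:
            continue  # not a number, so not exportable as a sample
        metric = prefix + re.sub(r"[^a-zA-Z0-9_]", "_", name)
        lines.append(f"{metric} {number!r}")
    return "\n".join(lines) + "\n"
```

Given the hypothetical input `"STAT cmd_get_count 1500\r\nSTAT version 37.0.0\r\nEND\r\n"`, the version line is dropped because it is not numeric and the counter becomes a single `mcrouter_cmd_get_count` sample.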

This, specifically https://gerrit.wikimedia.org/r/c/392221 being cherry-picked onto the beta puppetmaster, might be the reason for T190632: Puppet errors on deployment-mediawiki07. The reported puppet error is "Unable to locate package mcrouter". deployment-mediawiki07 is the only stretch appserver in beta right now (deployment-mediawiki0[456] are jessie and don't bark). So it seems like you've only built the package for jessie and not stretch for now? Could you amend the cherry-picked commit so that it will not try to use mcrouter for stretch instances as long as there's no package for it? Thanks.

This has now been running for a while (since Apr 17) with the new packages (both debian versions, though the stretch server isn't there anymore afaik).

> both debian versions, though the stretch server isn't there anymore afaik

On the contrary, all app servers in deployment-prep have been replaced with stretch via T192071 (although you're right that the specific instance that caused problems is gone). jfyi