
Storage solution for cross-datacenter tokens
Closed, Resolved (Public)

Description

Various extensions store tokens that can be set and claimed from different data centers.

Extensions include CentralAuth, OAuth, and ConfirmEdit.

I'd suggest using mcrouter or redis+envoy (see T277183 for redis/envoy plans to replace nutcracker). Essentially, token reads/writes would just go to the master DC via envoy/mcrouter prefix routes (using $wmfActiveDatacenter).
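
For illustration, a minimal sketch of what such a backend definition could look like in wmf-config. The backend name and parameters here (e.g. the routingPrefix option on MemcachedPeclBagOStuff) are assumptions for the sketch, not the deployed configuration:

// Hypothetical $wgObjectCaches entry: reads/writes go through the local
// mcrouter proxy, which forwards them to the active DC based on the key prefix.
$wgObjectCaches['mcrouter-primary-dc'] = [
	'class' => 'MemcachedPeclBagOStuff',
	'servers' => [ '127.0.0.1:11213' ], // local mcrouter proxy
	// Prefix route: mcrouter sends these keys to the active DC's pool
	'routingPrefix' => "/$wmfActiveDatacenter/mw/",
];

// Token-storing extensions would then point at this backend, e.g.:
$wgCentralAuthTokenCacheType = 'mcrouter-primary-dc';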

Event Timeline

Change 683022 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[operations/mediawiki-config@master] Add "mcrouter-master-dc" to $wgObjectCaches

https://gerrit.wikimedia.org/r/683022

Note that the backing store can be moved again later on, making it easy to use mcrouter first.

Change 683465 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[operations/mediawiki-config@master] Set $wgCentralAuthTokenCacheType to mcrouter-master-dc

https://gerrit.wikimedia.org/r/683465

aaron triaged this task as Medium priority. (Jan 7 2022, 1:43 AM)

For future reference: I was uncertain whether tokens are likely to remain in memcached for one minute, given that they are single-use and receive no reads until claimed, and given my general impression that our memcached cluster is, by default and by design, always under eviction pressure (that is, we knowingly store more data, and for longer, than we know can fit).

So, I wrote a little script to try to empirically verify this.

<?php
/*
[01:04 UTC] krinkle at mwmaint1002.eqiad.wmnet in ~
$ mwscript eval.php --wiki aawiki
> require '/home/krinkle/krinkle-tmp.php';
Done setting 6806 keys
Sleeping for 45 seconds...
Checking...
Checked 6806 keys.
Done!
*/

use MediaWiki\MediaWikiServices;

class KrinkleTmp {
	const TIME_STORED = 60;
	const TIME_SETTING = 5;
	const TIME_WAIT = 45;
	const MAX_KEYS = 10000;

	public $stored = [];
	public $wanCache;

	public function __construct() {
		$this->wanCache = MediaWikiServices::getInstance()->getMainWANObjectCache();
	}

	public function add( int $i ): void {
		$key = $this->wanCache->makeGlobalKey( 'krinkle-tmp', 'num' . $i );
		$val = 'some_interesting_data_here' . $i;
		$this->stored[] = [
			'key' => $key,
			'val' => $val,
			'res' => $this->wanCache->set( $key, $val, self::TIME_STORED ),
			'time' => microtime( true ),
		];
	}

	public function check( array $entry ): void {
		$actual = $this->wanCache->get( $entry['key'] );
		if ( $actual !== $entry['val'] ) {
			$now = microtime( true );
			$this->report( $entry, $now, $actual );
		}
	}

	public function report( array $entry, $now, $actual ): void {
		print sprintf( "Key %s was %s after %s seconds (res: %s)\n",
			$entry['key'],
			( $actual === false ? 'missing' : json_encode( $actual ) ),
			round( $now - $entry['time'] ),
			json_encode( $entry['res'] )
		);
	}

	public function execute(): void {
		$t1 = microtime( true );
		$i = 0;
		while (
			( microtime( true ) - $t1 ) < self::TIME_SETTING &&
			$i < self::MAX_KEYS
		) {
			$this->add( $i );
			$i++;
		}

		print "Done setting $i keys\n";
		print "Sleeping for " . self::TIME_WAIT . " seconds...\n";
		// Clear the in-process cache so that get() actually hits memcached
		$this->wanCache->clearProcessCache();
		sleep( self::TIME_WAIT );

		print "Checking...\n";
		$i = 0;
		foreach ( $this->stored as $entry ) {
			$this->check( $entry );
			$i++;
		}
		print "Checked $i keys.\n";
		print "Done!\n";
	}
}

$tmp = new KrinkleTmp();
$tmp->execute();

The script stores several thousand keys within a 5-second window, waits 45 seconds, and then checks that they're all still there. I ran it several times and never saw any loss.

It's not solid evidence, but at least anecdotally we know the cluster can hold up without loss. My guess is that 1) we're not under as much pressure as I thought, so our regular evictions tend toward the tail end, e.g. cutting short TTLs from N days to N hours if unused, but we generally don't push out things stored less than a minute ago; and 2) memcached's LRU eviction logic is quite good at evicting older unused items first, before touching new ones.

It doesn't feel great long-term, and I think this might bite us in terms of how gutter pools are used, and more generally in that we don't operationally treat a partial failure or packet loss on memcached as causing hard failures for edits or logins. But for the relatively small amount of data that CentralAuth needs here, I guess it's good enough for the initial transition. We could migrate it to a small dedicated memcached cluster at some point, perhaps the same one that will replace the dc-local Redis. That cluster would hold dc-local data, small in size, with (generally) no eviction happening, so we could monitor evictions like we do for Redis and treat non-zero evictions as a sign that something is wrong.

Background info:
https://github.com/memcached/memcached/blob/1.6.15/doc/new_lru.txt#L10-L24
https://github.com/memcached/memcached/wiki/UserInternals#when-are-items-evicted

Change 683022 merged by jenkins-bot:

[operations/mediawiki-config@master] mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches

https://gerrit.wikimedia.org/r/683022

Mentioned in SAL (#wikimedia-operations) [2022-06-22T01:13:38Z] <tstarling@deploy1002> Synchronized wmf-config/mc.php: g 807158 T278392 (duration: 03m 35s)

Benchmark of cross-DC memcached shows 1 RTT (33ms) latency for get and set, but 2 RTT for incr:

[0124][tstarling@mwmaint2002:~]$ mwscript mctest.php --wiki=enwiki --i 100 --cache=mcrouter-primary-dc
Warming up connections to cache servers...done
Single and batched operation profiling/test results:
127.0.0.1:11213
 add: 100/100 3300ms   set: 100/100 3300ms   get: 100/100 (3297ms)   delete: 100/100 (3294ms)	incr: 100/100 (6589ms)
 setMulti (IB): ✓ 3296ms   getMulti (IB): 100/100 37ms   changeTTLMulti (IB): ✓ 3298ms   deleteMulti (IB): ✓ 3295ms
 setMulti (DB): ✓ 154ms   getMulti (DB): 100/100 155ms   changeTTLMulti (DB): ✓ 3299ms   deleteMulti (DB): ✓ 34ms

Further testing showed that for a non-existent key (as used by mctest.php), mcrouter maps incr to incr+add. Incrementing a key that exists only requires 1 RTT.
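
To make that concrete, here is a hypothetical snippet (the key name and TTL constant are illustrative, not from the task) showing why the first incrWithInit() on a missing key costs two cross-DC round trips while subsequent calls cost one:

$cache = ObjectCache::getInstance( 'mcrouter-primary-dc' );
$key = $cache->makeKey( 'incr-rtt-demo' ); // hypothetical key
$cache->delete( $key );

$t = microtime( true );
// Miss: mcrouter issues incr, sees the key doesn't exist, then issues add (~2 RTT)
$cache->incrWithInit( $key, $cache::TTL_MINUTE );
printf( "first incr: %.0f ms\n", ( microtime( true ) - $t ) * 1000 );

$t = microtime( true );
// Hit: a single incr suffices (~1 RTT)
$cache->incrWithInit( $key, $cache::TTL_MINUTE );
printf( "second incr: %.0f ms\n", ( microtime( true ) - $t ) * 1000 );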

Change 809326 had a related patch set uploaded (by Krinkle; author: Aaron Schulz):

[operations/mediawiki-config@master] [MultiDC] Switch $wgCentralAuthTokenCacheType to mcrouter-primary-dc

https://gerrit.wikimedia.org/r/809326

Change 683465 merged by jenkins-bot:

[operations/mediawiki-config@master] Move $wgCentralAuthTokenCacheType from redis_local to mcrouter

https://gerrit.wikimedia.org/r/683465

Mentioned in SAL (#wikimedia-operations) [2022-06-29T04:37:25Z] <tstarling@deploy1002> Synchronized wmf-config/InitialiseSettings.php: wgCentralAuthTokenCacheType -> mcrouter T278392 (duration: 03m 44s)

Change 809326 merged by jenkins-bot:

[operations/mediawiki-config@master] [MultiDC] Switch $wgCentralAuthTokenCacheType to mcrouter-primary-dc

https://gerrit.wikimedia.org/r/809326

I tested cross-DC failover handling.

I used a loop of incrWithInit on mwmaint2002:

$c = ObjectCache::getInstance( 'mcrouter-primary-dc' );
$c->set( 'test', 0 );
while ( true ) {
	// Print a timestamp and the new counter value (false on failure)
	printf( "%-18f %s\n", microtime( true ), $c->incrWithInit( 'test', 86400 ) );
	sleep( 1 );
}

I dropped outbound TLS traffic (port 11214) but allowed unencrypted traffic, which is what's used within the same DC:

iptables -v -A OUTPUT -p tcp --dport 11214 -j DROP

I left it like that for about 3 minutes, then deleted the rule.

As soon as the rule was applied, incrWithInit() started returning false. I had somehow missed the fact that cross-DC routes, e.g. /eqiad/mw/ on a codfw server, do not use FailoverWithExptimeRoute; they are routed directly to the remote pool. There is no gutter pool.
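
In other words, a remote-pool outage surfaces directly to callers as a false return. A hedged sketch of the defensive handling this implies (not code from the task; the key name is hypothetical):

$cache = ObjectCache::getInstance( 'mcrouter-primary-dc' );
$value = $cache->incrWithInit( $cache->makeKey( 'some-token-counter' ), $cache::TTL_DAY );
if ( $value === false ) {
	// The remote pool was unreachable and there is no gutter fallback,
	// so treat this as a hard failure rather than assuming the write landed.
}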

After I deleted the iptables rule, it took another 30 seconds before it reconnected. The mcrouter log showed connection attempts every 60-90 seconds. That seems like a long time. So I investigated that and found that --probe-timeout-initial was raised from 3s to 60s for T255511. The rationale was that it's fine to use the gutter pool for a while. Unfortunately this is a global configuration variable, it can't be tuned down for routes that don't have a gutter pool.