Page MenuHomePhabricator

Beta Cluster: Unable to obtain lock via objectcache (memcached add() fails)
Closed, ResolvedPublic

Description

Following https://gerrit.wikimedia.org/r/442198 (per T195314#4317397), I'm trying to see whether Navigation Timing reports have increased in Beta. But it seems they remain at zero. Given the sample rate of 1/10, I figured I'd just try locally a few times to click on the main page, and eventually got sampled, but noticed the following error in the console.

Console
[NavigationTiming] Missing or empty schema

Looking at the network fetch for the schemas from load.php, it got the following:

https://en.wikipedia.beta.wmflabs.org/w/load.php?debug=false&lang=en&modules=schema.NavigationTiming|schema.SaveTiming

load.php
mw.loader.implement("schema.NavigationTiming@02ba1tn", function($, jQuery, require, module) {
    mediaWiki.eventLog.declareSchema("NavigationTiming", {
        "schema": {},
        "revision": 18156125
    });
});
mw.loader.implement("schema.SaveTiming@10h98u7", function($, jQuery, require, module) {
    mediaWiki.eventLog.declareSchema("SaveTiming", {
        "schema": {
            "title": "SaveTiming",
            "properties": {
                "saveTiming": {
[..]

Given it only affects the NavigationTiming schema, and not the SaveTiming schema in the same request, I suspected a recent change (maybe https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/NavigationTiming/+/441934/, which changed the revision ID). However, I that didn't seem to be the cause because:

  1. NavigationTiming was working fine on my localhost MediaWiki.
  2. And, the underlying API request that EventLogging makes from load.php works fine (https://meta.wikimedia.org/w/api.php?action=jsonschema&revid=18156125&formatversion=2).

That request is triggered from EventLogging/RemoteSchema.php (source code). Which does the following:

  • get:1: Try memcGet() = false (key doesn't exist).
  • get:2: Try httpGet()
    • httpGet:1: Acquire cache->lock(), or return false.
    • httpGet:2: Fetch schema from meta.wikimedia.org over HTTP.
    • httpGet:3: Parse response as JSON.
    • httpGet:4: If fetch or parsing failed, log error and return false.
    • httpGet:5: Return parsed response.

There are no log messages about the request failing. And as mentioned, trying that request manually shows it works fine. Instead, what I find when trying to reproduce it via the REPL, is that it seems Memcached isn't allowing PHP to acquire a lock.

When manually calling $cache->get() with the lock key, it returns false suggesting the lock isn't taken. Yet, when calling $cache->add() with the lock key, it also returns false.

Terminal
beta-tin$ mwscript eval.php --wiki enwiki

> var_dump( $cache->get($key . ':lock') );
bool(false)

> var_dump( $cache->add($key . ':lock', 1, 20) );
bool(false)

> var_dump( $cache->add($key . ':lock', 1, 20) );
bool(false)

> var_dump( $cache->get($key . ':lock') );
bool(false)

If this is due an issue with MultiWrite/ReplicatedBagOStuff logic we added for mcrouter, then it might be the same root-cause as T198279.

See also:

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Krinkle triaged this task as Unbreak Now! priority.Jun 27 2018, 1:25 AM
Krinkle updated the task description. (Show Details)
Krinkle renamed this task from EventLogging/RemoteSchema unable to obtain memcached lock to Beta Cluster: EventLogging/RemoteSchema unable to obtain memcached lock.Jun 27 2018, 1:31 AM
Krinkle updated the task description. (Show Details)
This comment was removed by aaron.

Hm.. well, the load.php url is still experiencing the same issue. I also see the same happen on CLI still.

I confirmed via Special:Version that the commit was applied (at 8faba225fefd7).

Terminal
krinkle@deployment-tin:~$ mwscript eval.php --wiki enwiki
> $s = new RemoteSchema('NavigationTiming', 18156125);
> var_dump( $s->get() );
bool(false)

> var_dump( $s->key );
string(74) "global:eventlogging-schema:https%3A//meta.wikimedia.org/w/api.php:18156125"

> echo get_class($s->cache);
ReplicatedBagOStuff
> var_dump( $s->cache->get( $s->key ) );
bool(false)

> var_dump( $s->cache->get( $s->key . ':lock', 1, 20 ) );
bool(false)

> var_dump( $s->cache->add( $s->key . ':lock', 1, 20 ) );
bool(false)

Digging a bit further, it seems the added key does get shown by get() but add() is returning false regardless. Demonstrated below:

get and set seem fine
> var_dump( $s->cache->get( 'krinkle:test' ) );
bool(false)
> var_dump( $s->cache->set( 'krinkle:test', 'x', 5 ) );
bool(true)
> var_dump( $s->cache->get( 'krinkle:test' ) );
string(1) "x"

> var_dump( $s->cache->get( 'krinkle:test' ) );
bool(false)
add has issues
> var_dump( $s->cache->get('krinkle:test') );
bool(false)
> var_dump( $s->cache->add('krinkle:test', 'x', 20) );
bool(false)

// Above: add() returns false.
// Below: get() shows the key did actually get added..

> var_dump( $s->cache->get( 'krinkle:test' ) );
string(1) "x"
> var_dump( $s->cache->get( 'krinkle:test' ) );
string(1) "x"
> var_dump( $s->cache->set( 'krinkle:test', 'y', 5 ) );
bool(true)
> var_dump( $s->cache->get( 'krinkle:test' ) );
string(1) "y"

> var_dump( $s->cache->get('krinkle:test') );
bool(false)

// And again:

> var_dump( $s->cache->add('krinkle:test', 'z', 20) );
bool(false)

> var_dump( $s->cache->get('krinkle:test') );
string(1) "z"

Maybe for certain keys, when the mc server used by nutcracker matches that of the mcrouter one, add() fails since it has to succeed on *all* backends for MultiWriteBagOStuff to return true and the second write would see the first one.

$s = new RemoteSchema('NavigationTiming', 18156125);
$s->cache

From source code which does wfGetCache( CACHE_ANYTHING );.

Krinkle renamed this task from Beta Cluster: EventLogging/RemoteSchema unable to obtain memcached lock to Beta Cluster: Unable to obtain lock via objectcache (memcached).Jun 28 2018, 1:35 AM
Krinkle renamed this task from Beta Cluster: Unable to obtain lock via objectcache (memcached) to Beta Cluster: Unable to obtain lock via objectcache (memcached add() fails).

I get:

> $s = new RemoteSchema('NavigationTiming', 18156125);

> var_dump( $s->cache->lock( $s->key, 1, 20 ) );
bool(true)

> var_dump( $s->cache->lock( $s->key, 1, 20 ) );
bool(false)

> var_dump( $s->cache->unlock( $s->key, 1, 20 ) );
bool(true)

> var_dump( $s->cache->add($key . ':lock', 1, 20) );
bool(true)

> var_dump( $s->cache->add($key . ':lock', 1, 20) );
bool(false)

> var_dump( $s->cache->get($key . ':lock') );
int(1)

Change 442813 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] objectcache: make MultiWriteBagOStuff handle duplicate add() operations

https://gerrit.wikimedia.org/r/442813

Change 442902 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/EventLogging@master] Add better logging in RemoteSchema.php for failed locking and http

https://gerrit.wikimedia.org/r/442902

Change 442813 merged by jenkins-bot:
[mediawiki/core@master] objectcache: make MultiWriteBagOStuff handle duplicate add() operations

https://gerrit.wikimedia.org/r/442813

Change 442902 merged by jenkins-bot:
[mediawiki/extensions/EventLogging@master] Add better logging in RemoteSchema.php for failed locking and http

https://gerrit.wikimedia.org/r/442902

As of now, I get (as expected):

> $s = new RemoteSchema('NavigationTiming', 18156125);

> var_dump( $s->cache->get('krinkle:test' ) );
bool(false)

> var_dump( $s->cache->add('krinkle:test', 'x', 20) );
bool(true)

> var_dump( $s->cache->add('krinkle:test', 'x', 20) );
bool(false)

> var_dump( $s->cache->get('krinkle:test' ) );
string(1) "x"
Krinkle assigned this task to aaron.

Change 445005 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.32.0-wmf.10] objectcache: make MultiWriteBagOStuff handle duplicate add() operations

https://gerrit.wikimedia.org/r/445005

Change 445005 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.10] objectcache: make MultiWriteBagOStuff handle duplicate add() operations

https://gerrit.wikimedia.org/r/445005