Page MenuHomePhabricator

Title blacklist intermittently failing, allowing users to edit things they shouldn't be able to
Closed, ResolvedPublic

Description

https://en.wikipedia.org/w/index.php?title=Template:Editnotices/Page/European_Union&diff=prev&oldid=639828544 was an IP user editing Template:Editnotices/Page/European_Union, even though it's titleblacklist protected. The problem isn't occurring, and neither our blacklist, our whitelist, Meta's blacklist, nor Meta's whitelist has been changed recently. This doesn't appear to be an exploit, since other users (not with tboverride) on enwiki were able to edit the page as well, and didn't do anything special to do so. See https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#Access_rights_failure for more details.

Event Timeline

Jackmcbarn raised the priority of this task from to High.
Jackmcbarn updated the task description. (Show Details)
Jackmcbarn added a project: TitleBlacklist.
Jackmcbarn subscribed.

Could be some sort of cache poisoning. If so, a cache miss log showing number of TB entries saved to the cache would help sort it out.

Change 182973 had a related patch set uploaded (by CSteipp):
Log cache misses of blacklist

https://gerrit.wikimedia.org/r/182973

Patch-For-Review

I poked at this again this morning, and happened to have an affected eval.php. TitleBlacklist::singleton()->getBlacklist() was returning a blacklist containing only the entries from m:Title blacklist, nothing from enwiki's blacklist. So cache poisoning seems likely, and since it's the whole blacklist missing I'd guess that the root cause is probably in the call to getBlacklist() rather than parseBlacklist().

I ran a script over each host in pybal apaches and api that did the getBlacklist() and parseBlacklist() calls via eval.php and checked the results, but it didn't seem that any one host consistently failed the check. I did get a failure on mw1226, but retesting only failed 3 times in 61 attempts. But when it did fail, the data output confirmed that it was loading the default for the MediaWiki:TitleBlacklist message rather than the customized version. The log messages output in this situation were:

[memcached] get(enwiki:messages:en:hash)
[memcached-serious] Memcached error for key "enwiki:messages:en:hash" on server "127.0.0.1:11212": CONNECTION FAILURE
[memcached] get(enwiki:messages:en)
[memcached-serious] Memcached error for key "enwiki:messages:en" on server "127.0.0.1:11212": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
[memcached] add(enwiki:messages:en:status)
[memcached-serious] Memcached error for key "enwiki:messages:en:status" on server "127.0.0.1:11212": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
[memcached] get(enwiki:messages:en:status)
[memcached-serious] Memcached error for key "enwiki:messages:en:status" on server "127.0.0.1:11212": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
[memcached] add(enwiki:messages:en:lock)
[memcached-serious] Memcached error for key "enwiki:messages:en:lock" on server "127.0.0.1:11212": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
[memcached] set(enwiki:test)
[memcached-serious] Memcached error for key "enwiki:test" on server "127.0.0.1:11212": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
[memcached] delete(enwiki:messages:en:lock)
[memcached-serious] Memcached error for key "enwiki:messages:en:lock" on server "127.0.0.1:11212": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
[memcached] get(enwiki:messages:en)
[memcached-serious] Memcached error for key "enwiki:messages:en" on server "127.0.0.1:11212": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
[memcached] add(enwiki:messages:en:status)
[memcached-serious] Memcached error for key "enwiki:messages:en:status" on server "127.0.0.1:11212": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
[MessageCache] MessageCache::load: Loading en... global cache is empty, waited for other thread to complete, global cache is empty, could not acquire status key., loading FAILED - cache is disabled

I see a lot of these messages in memcached-serious.log on fluorine, BTW.

These log messages are coming from MessageCache::load(), and it looks like MessageCache::getMsgFromNamespace() isn't going to fall back to the database if MessageCache::load() fails entirely.

So I see two issues here:

  • Intermittent memcached failures
  • MessageCache ignoring the database in case of the above

Change 182973 merged by Legoktm:
Log cache misses of blacklist

https://gerrit.wikimedia.org/r/182973

It can't load from the DB with uncontrolled concurrency, so it tries to acquire a global lock which would allow it to read from the DB, but that fails because of the timed retry. Timed retry is meant to be completely disabled, the retry time is set to -1.

Actually, timed retry does appear to be disabled -- the error message is incorrect. An unrelated failure after the retry period completes is incorrectly reported as MEMCACHED_SERVER_TEMPORARILY_DISABLED, because in_timeout is set at the relevant place in _memcached_connect().

csteipp subscribed.

Tim is doing most of the work on this now

Looks like the software's ability to enforce the title blacklist on enwiki is down again. Are there any updates on completing the fix?

This is now occurring on Commons as well.

File:10612668 1469368410002666 5095054231275383394 n.jpg

clearly fails the following regex when tested in PHP:

File:\d{4,}(_\d{6,}){2,}_(A|B|N|O|S)(?:_\(\d\))?\.\w+  <reupload|errmsg=titleblacklist-custom-filename>  #Facebook

@tstarling: Any progress/success finding out the reason? Any news to share?

The theory is that this is a duplicate of T75949. I'll add some notes there.

It is probably fixed, but I have enabled some extra logging to be sure. We will need 24 hours of logs to confirm the fix.

Confirmed, no instances of an empty blacklist being saved to cache were logged. Message cache failure is still a risk, not just for TB, it can also cause cache pollution. Perhaps a better policy would be to throw an exception if the message cache fails to load, rather than using the defaults.

It all seems to be on mw1118. The new log file is reporting 243 entries for all non-mw1118, and mostly 141 entries for mw1118.

Different errors in the log on fluorine this time, too:

2015-01-25 20:11:15 mw1118 commonswiki: Memcached error for key "commonswiki:title_blacklist_entries" on server "/var/run/nutcracker/nutcracker.sock:0": SYSTEM ERROR
2015-01-25 20:11:30 mw1118 commonswiki: Memcached error for key "commonswiki:title_blacklist_entries" on server "/var/run/nutcracker/nutcracker.sock:0": SYSTEM ERROR
2015-01-25 20:11:50 mw1118 commonswiki: Memcached error for key "commonswiki:title_blacklist_entries" on server "/var/run/nutcracker/nutcracker.sock:0": SYSTEM ERROR
2015-01-25 20:25:47 mw1118 commonswiki: Memcached error for key "commonswiki:title_blacklist_entries" on server "/var/run/nutcracker/nutcracker.sock:0": SYSTEM ERROR
2015-01-25 20:29:14 mw1118 commonswiki: Memcached error for key "commonswiki:title_blacklist_entries" on server "/var/run/nutcracker/nutcracker.sock:0": SYSTEM ERROR
2015-01-25 20:31:40 mw1118 commonswiki: Memcached error for key "commonswiki:title_blacklist_entries" on server "/var/run/nutcracker/nutcracker.sock:0": SYSTEM ERROR
2015-01-25 20:40:03 mw1118 commonswiki: Memcached error for key "commonswiki:title_blacklist_entries" on server "/var/run/nutcracker/nutcracker.sock:0": SYSTEM ERROR

I depooled mw1118 for now so we can investigate it.