Smart caching logic for handling cross-DC network outages
Closed, Resolved (Public)

Description

During partitions which cause unbounded DB lag increases:

a) Have WANObjectCache automatically use small TTLs if slave lag gets too close to the HOLDOFF threshold

This is useful in a full network partition between DCs. A DC-local cached value could trigger this. A deferred update, run by one thread at a time, can contact the master and wait for slaves to make sure things are OK. If that fails, the WAN cache will know (from reading the key) that something is wrong and will lower all set() TTLs, so that stale data does not stay cached for the whole set() TTL (which could be days or weeks).

b) Likewise with CDN expiry headers

c) Trigger wfReadOnly() in the slave DC to discourage form submission (if we use pt-heartbeat this will happen for free).
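
As a rough illustration of (a) and (b), the TTL adjustment could look like the sketch below. It is written in Python rather than MediaWiki PHP, and every name and number in it (`adaptive_ttl`, `HOLDOFF_TTL`, `LAGGED_TTL`, `SAFETY_MARGIN`) is a hypothetical stand-in chosen for this example, not the WANObjectCache API or its real constants.

```python
from typing import Optional

HOLDOFF_TTL = 10      # seconds; assumed stand-in for the HOLDOFF threshold
LAGGED_TTL = 30       # seconds; assumed cap applied while lag looks unsafe
SAFETY_MARGIN = 0.5   # assumed: lag above 50% of HOLDOFF counts as "too close"


def adaptive_ttl(requested_ttl: int, replica_lag: Optional[float]) -> int:
    """Return the TTL to use for a cache set() or a CDN expiry header.

    replica_lag is the measured lag in seconds, or None when the lag is
    unknown (for example, the master is unreachable during a partition).
    """
    if replica_lag is None or replica_lag >= HOLDOFF_TTL * SAFETY_MARGIN:
        # Lag is unknown or too close to the threshold: cap the TTL so a
        # stale value cannot stay cached for the full requested duration.
        return min(requested_ttl, LAGGED_TTL)
    return requested_ttl
```

With these assumed values, a one-day TTL is kept when lag is 0.2 s, and cut to 30 s when lag is 7 s or unknown. A TTL that is already shorter than the cap is left alone.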

Event Timeline

aaron claimed this task.
aaron raised the priority of this task from to Medium.
aaron updated the task description.
aaron added a project: Sustainability.
aaron added subscribers: Krinkle, jcrespo, Glaisher and 12 others.

Change 240958 had a related patch set uploaded (by Aaron Schulz):
[WIP] Added WANObjectCache::setAcceptableLagMode() method

https://gerrit.wikimedia.org/r/240958

Change 240958 abandoned by Aaron Schulz:
[WIP] Added WANObjectCache handling for lagged-slave mode

Reason:
Going with a caller-aware route

https://gerrit.wikimedia.org/r/240958

Change 242783 had a related patch set uploaded (by Aaron Schulz):
[WIP] Added IDatabase::getSessionLagStatus() method

https://gerrit.wikimedia.org/r/242783

Change 242812 had a related patch set uploaded (by Aaron Schulz):
Lower CDN cache TTL when slave lag is high

https://gerrit.wikimedia.org/r/242812

aaron renamed this task from Caching and DB logic for handling cross-DC network outages to Smart caching logic for handling cross-DC network outages. (Oct 2 2015, 6:45 PM)

Change 242783 merged by jenkins-bot:
Make WANObjectCache sets account for slave lag

https://gerrit.wikimedia.org/r/242783

Change 242812 merged by jenkins-bot:
Lower CDN cache TTL when slave lag is high

https://gerrit.wikimedia.org/r/242812

Done already (with T111266 already tracking the pt-heartbeat bit).

This is causing my local install to throw stacktraces:

( ! ) Notice: Undefined index: lag in /Users/hartman/Development/wikimedia-git/mediawiki-core/includes/db/Database.php on line 2636
Call Stack
# Time Memory Function Location
1 0.0010 243992 {main}( ) .../index.php:0
2 0.0011 244896 require( '/Users/hartman/Development/wikimedia-git/mediawiki-core/includes/WebStart.php' ) .../index.php:40
3 0.0060 1685208 require_once( '/Users/hartman/Development/wikimedia-git/mediawiki-core/includes/Setup.php' ) .../WebStart.php:133
4 0.0139 2811464 MediaWiki\Session\SessionManager::getGlobalSession( ) .../Setup.php:751
5 0.0139 2811624 WebRequest->getSession( ) .../SessionManager.php:121
6 0.0139 2811688 MediaWiki\Session\SessionManager->getSessionForRequest( ) .../WebRequest.php:703
7 0.0139 2811920 MediaWiki\Session\SessionManager->getSessionInfoForRequest( ) .../SessionManager.php:182
8 0.0145 2880592 MediaWiki\Session\CookieSessionProvider->provideSessionInfo( ) .../SessionManager.php:666
9 0.0148 2906464 MediaWiki\Session\UserInfo::newFromId( ) .../CookieSessionProvider.php:119
10 0.0150 3019016 User->load( ) .../UserInfo.php:88
11 0.0150 3019456 User->loadFromId( ) .../User.php:406
12 0.0157 3056056 User->loadFromDatabase( ) .../User.php:441
13 0.0176 3557616 DatabaseBase->selectRow( ) .../User.php:1274
14 0.0176 3558152 DatabaseBase->select( ) .../Database.php:1293
15 0.0179 3560136 DatabaseBase->query( ) .../Database.php:1234
16 0.0179 3561600 DatabaseBase->begin( ) .../Database.php:811

( ! ) Notice: Undefined index: since in /Users/hartman/Development/wikimedia-git/mediawiki-core/includes/db/Database.php on line 2636
Call Stack
# Time Memory Function Location
1 0.0010 243992 {main}( ) .../index.php:0
2 0.0011 244896 require( '/Users/hartman/Development/wikimedia-git/mediawiki-core/includes/WebStart.php' ) .../index.php:40
3 0.0060 1685208 require_once( '/Users/hartman/Development/wikimedia-git/mediawiki-core/includes/Setup.php' ) .../WebStart.php:133
4 0.0139 2811464 MediaWiki\Session\SessionManager::getGlobalSession( ) .../Setup.php:751
5 0.0139 2811624 WebRequest->getSession( ) .../SessionManager.php:121
6 0.0139 2811688 MediaWiki\Session\SessionManager->getSessionForRequest( ) .../WebRequest.php:703
7 0.0139 2811920 MediaWiki\Session\SessionManager->getSessionInfoForRequest( ) .../SessionManager.php:182
8 0.0145 2880592 MediaWiki\Session\CookieSessionProvider->provideSessionInfo( ) .../SessionManager.php:666
9 0.0148 2906464 MediaWiki\Session\UserInfo::newFromId( ) .../CookieSessionProvider.php:119
10 0.0150 3019016 User->load( ) .../UserInfo.php:88
11 0.0150 3019456 User->loadFromId( ) .../User.php:406
12 0.0157 3056056 User->loadFromDatabase( ) .../User.php:441
13 0.0176 3557616 DatabaseBase->selectRow( ) .../User.php:1274
14 0.0176 3558152 DatabaseBase->select( ) .../Database.php:1293
15 0.0179 3560136 DatabaseBase->query( ) .../Database.php:1234
16 0.0179 3561600 DatabaseBase->begin( ) .../Database.php:811

Not yet figured out why.

And it's gone. No idea what was going on there.
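
For reference, PHP emits an "Undefined index" notice when an array is read at a key it does not contain; here the missing keys were "lag" and "since". The root cause was never identified in this thread, so the following is only a language-neutral sketch in Python of reading such a status defensively. The function name `lag_status` and the `None` defaults are hypothetical and are not taken from Database.php.

```python
def lag_status(raw: dict) -> dict:
    """Normalise a lag-status mapping so callers can always read both keys."""
    return {
        "lag": raw.get("lag"),      # None means the lag is unknown
        "since": raw.get("since"),  # None means there is no measurement time
    }
```

A caller can then treat a `None` lag as "unknown" instead of failing on a missing key.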