Smart caching logic for handling cross-DC network outages
Closed, Resolved; Public

Description

During network partitions that cause DB replication lag to grow without bound:

a) Have WANObjectCache automatically use short TTLs when slave lag gets too close to the HOLDOFF threshold

This is useful for a full network partition between DCs. A DC-local cached value could trigger this mode. A deferred update, run by one thread at a time, can contact the master and wait for the slaves to confirm that replication is healthy. If that check fails, the WAN cache will know (from reading the key) that something is wrong and will lower all set() TTLs, so that stale data does not stay cached for the whole set() TTL (which could be days or weeks).

b) Likewise, lower the CDN expiry headers

c) Trigger wfReadOnly() in the slave DC to discourage form submission (if we use pt-heartbeat this will happen for free).
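
The TTL capping described in (a) and (b) can be sketched roughly as follows. This is a minimal illustration in Python, not the MediaWiki implementation: the function names `effective_ttl` and `cdn_cache_control`, the constants, and their values are all hypothetical, and the `snapshot_age` parameter is an assumption about how the age of the DB read might be folded into the lag estimate.

```python
from typing import Optional

# Hypothetical values for illustration only; not MediaWiki's constants.
NEAR_HOLDOFF_SECONDS = 7.0   # assumed lag level considered "too close" to the HOLDOFF threshold
LAGGED_TTL_SECONDS = 30      # assumed short TTL used while replication looks unhealthy


def effective_ttl(requested_ttl: int, lag: Optional[float], snapshot_age: float = 0.0) -> int:
    """Return the TTL to actually use for a cache set().

    lag is the observed slave lag in seconds, or None if it could not be
    determined (for example, replication is stopped or the DC is cut off).
    snapshot_age is how long ago the data being cached was read.
    """
    if lag is None or (lag + snapshot_age) > NEAR_HOLDOFF_SECONDS:
        # Data may be stale: never let it live longer than the short TTL.
        return min(requested_ttl, LAGGED_TTL_SECONDS)
    return requested_ttl


def cdn_cache_control(requested_ttl: int, lag: Optional[float], snapshot_age: float = 0.0) -> str:
    """Build a Cache-Control value whose s-maxage follows the same capping rule."""
    ttl = effective_ttl(requested_ttl, lag, snapshot_age)
    return f"s-maxage={ttl}, must-revalidate, max-age=0"


if __name__ == "__main__":
    print(effective_ttl(86400, 0.5))      # healthy replication: TTL unchanged
    print(effective_ttl(86400, None))     # unknown lag: capped to the short TTL
    print(cdn_cache_control(3600, 20.0))  # lagged: CDN expiry is capped as well
```

Capping with `min()` rather than overwriting means callers that already asked for a very short TTL keep it, and treating unknown lag the same as high lag errs toward shorter caching during an outage.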

Details

Subject	Repo	Branch	Lines +/-
Lower CDN cache TTL when slave lag is high	mediawiki/core	master	+60 -5
Make WANObjectCache sets account for slave lag	mediawiki/core	master	+190 -56
[WIP] Added WANObjectCache handling for lagged-slave mode	mediawiki/core	master	+27 -0

Related Objects

Status	Subtype	Assigned	Task
Resolved		aaron	T88445 MediaWiki active/active datacenter investigation and work (tracking)
Resolved		aaron	T113204 Smart caching logic for handling cross-DC network outages

Event Timeline

aaron created this task. Sep 21 2015, 12:01 AM

aaron claimed this task.

aaron raised the priority of this task from to Medium.

aaron updated the task description.

aaron added a project: Sustainability.

aaron added subscribers: Krinkle, • jcrespo, Glaisher and 12 others.

aaron moved this task from Tag to Doing on the Sustainability board. Sep 25 2015, 3:55 AM

Change 240958 had a related patch set uploaded (by Aaron Schulz):
[WIP] Added WANObjectCache::setAcceptableLagMode() method

https://gerrit.wikimedia.org/r/240958

gerritbot added a project: Patch-For-Review. Sep 25 2015, 3:57 AM

aaron added a project: Performance-Team. Sep 28 2015, 7:42 PM

aaron set Security to None.

Change 240958 abandoned by Aaron Schulz:
[WIP] Added WANObjectCache handling for lagged-slave mode

Reason:
Going with a caller-aware route

https://gerrit.wikimedia.org/r/240958

Change 242783 had a related patch set uploaded (by Aaron Schulz):
[WIP] Added IDatabase::getSessionLagStatus() method

https://gerrit.wikimedia.org/r/242783

aaron updated the task description. Oct 1 2015, 7:23 AM

Change 242812 had a related patch set uploaded (by Aaron Schulz):
Lower CDN cache TTL when slave lag is high

https://gerrit.wikimedia.org/r/242812

aaron renamed this task from Caching and DB logic for handling cross-DC network outages to Smart caching logic for handling cross-DC network outages. Oct 2 2015, 6:45 PM

aaron updated the task description. Oct 2 2015, 6:49 PM

aaron moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board. Oct 5 2015, 6:05 PM

Change 242783 merged by jenkins-bot:
Make WANObjectCache sets account for slave lag

https://gerrit.wikimedia.org/r/242783

aaron mentioned this in rMWdb0b9ef2649c: Make WANObjectCache sets account for slave lag. Oct 6 2015, 12:11 AM

ReleaseTaggerBot added projects: MW-1.27-release-notes, MW-1.27-release (WMF-deploy-2015-10-06_(1.27.0-wmf.2)). Oct 6 2015, 1:01 AM

Change 242812 merged by jenkins-bot:
Lower CDN cache TTL when slave lag is high

https://gerrit.wikimedia.org/r/242812

aaron mentioned this in rMWc7b932af6bbc: Lower CDN cache TTL when slave lag is high. Oct 7 2015, 2:02 AM

ReleaseTaggerBot added a project: MW-1.27-release (WMF-deploy-2015-10-13_(1.27.0-wmf.3)). Oct 7 2015, 7:02 PM

Done already (with T111266 already tracking the pt-heartbeat bit).

This is causing my local install to throw stacktraces:

P3149 Broken getApproximateLagStatus()

( ! ) Notice: Undefined index: lag in /Users/hartman/Development/wikimedia-git/mediawiki-core/includes/db/Database.php on line 2636
Call Stack
# Time Memory Function Location
1 0.0010 243992 {main}( ) .../index.php:0
2 0.0011 244896 require( '/Users/hartman/Development/wikimedia-git/mediawiki-core/includes/WebStart.php' ) .../index.php:40
3 0.0060 1685208 require_once( '/Users/hartman/Development/wikimedia-git/mediawiki-core/includes/Setup.php' ) .../WebStart.php:133
4 0.0139 2811464 MediaWiki\Session\SessionManager::getGlobalSession( ) .../Setup.php:751
5 0.0139 2811624 WebRequest->getSession( ) .../SessionManager.php:121
6 0.0139 2811688 MediaWiki\Session\SessionManager->getSessionForRequest( ) .../WebRequest.php:703
7 0.0139 2811920 MediaWiki\Session\SessionManager->getSessionInfoForRequest( ) .../SessionManager.php:182
8 0.0145 2880592 MediaWiki\Session\CookieSessionProvider->provideSessionInfo( ) .../SessionManager.php:666
9 0.0148 2906464 MediaWiki\Session\UserInfo::newFromId( ) .../CookieSessionProvider.php:119
10 0.0150 3019016 User->load( ) .../UserInfo.php:88
11 0.0150 3019456 User->loadFromId( ) .../User.php:406
12 0.0157 3056056 User->loadFromDatabase( ) .../User.php:441
13 0.0176 3557616 DatabaseBase->selectRow( ) .../User.php:1274
14 0.0176 3558152 DatabaseBase->select( ) .../Database.php:1293
15 0.0179 3560136 DatabaseBase->query( ) .../Database.php:1234
16 0.0179 3561600 DatabaseBase->begin( ) .../Database.php:811

( ! ) Notice: Undefined index: since in /Users/hartman/Development/wikimedia-git/mediawiki-core/includes/db/Database.php on line 2636
Call Stack
# Time Memory Function Location
1 0.0010 243992 {main}( ) .../index.php:0
2 0.0011 244896 require( '/Users/hartman/Development/wikimedia-git/mediawiki-core/includes/WebStart.php' ) .../index.php:40
3 0.0060 1685208 require_once( '/Users/hartman/Development/wikimedia-git/mediawiki-core/includes/Setup.php' ) .../WebStart.php:133
4 0.0139 2811464 MediaWiki\Session\SessionManager::getGlobalSession( ) .../Setup.php:751
5 0.0139 2811624 WebRequest->getSession( ) .../SessionManager.php:121
6 0.0139 2811688 MediaWiki\Session\SessionManager->getSessionForRequest( ) .../WebRequest.php:703
7 0.0139 2811920 MediaWiki\Session\SessionManager->getSessionInfoForRequest( ) .../SessionManager.php:182
8 0.0145 2880592 MediaWiki\Session\CookieSessionProvider->provideSessionInfo( ) .../SessionManager.php:666
9 0.0148 2906464 MediaWiki\Session\UserInfo::newFromId( ) .../CookieSessionProvider.php:119
10 0.0150 3019016 User->load( ) .../UserInfo.php:88
11 0.0150 3019456 User->loadFromId( ) .../User.php:406
12 0.0157 3056056 User->loadFromDatabase( ) .../User.php:441
13 0.0176 3557616 DatabaseBase->selectRow( ) .../User.php:1274
14 0.0176 3558152 DatabaseBase->select( ) .../Database.php:1293
15 0.0179 3560136 DatabaseBase->query( ) .../Database.php:1234
16 0.0179 3561600 DatabaseBase->begin( ) .../Database.php:811

I have not figured out why yet.

bd808 unsubscribed. May 19 2016, 11:33 PM

And it's gone. No idea what was going on there.
