Spike: Avoid use of merge() in Flow caches
Status: Closed, Resolved · Public · 3 Story Points

Description

As discussed IRL, these will need to change for all the upcoming multi-Data Center stuff.

See also T94028: DB master connections requested by Flow on GET/HEAD requests

aaron created this task. Mar 26 2015, 2:41 PM
aaron updated the task description.
aaron set the priority of this task to Normal.
aaron claimed this task.
aaron added projects: Epic, MediaWiki-Core-Team.
aaron added subscribers: Gilles, GWicke, mark and 8 others.
aaron removed a project: Epic. Mar 26 2015, 2:47 PM
aaron set Security to None.
EBernhardson raised the priority of this task from Normal to High.
DannyH updated the task description. Mar 30 2015, 6:06 PM
DannyH renamed this task from "Avoid use of merge() in Flow caches" to "Spike: Avoid use of merge() in Flow caches". Mar 30 2015, 6:18 PM
DannyH edited a custom field.
aaron removed aaron as the assignee of this task. May 23 2015, 12:47 AM
aaron added a subscriber: Legoktm.

Not sure if there is an additional task for this (other than T94028: DB master connections requested by Flow on GET/HEAD requests), but we've talked about changing the backend to drop the Index layer as a solution to this.

aaron added a comment.Sep 13 2015, 2:33 AM

Is this on anyone's plate to work on? The use of merge() is not just slow (like master queries) but completely broken for multi-DC MediaWiki. I wonder if it would be easier to mostly remove the caching layers rather than try to replace them.
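For context on why merge() is a problem: it is a read-modify-write primitive built on memcached-style compare-and-swap. A minimal Python analogue (class and method names here are illustrative, not MediaWiki's actual BagOStuff API) shows the CAS loop; the token returned by gets() is only meaningful against one cache cluster, which is why the pattern cannot work across data centers:

```python
# Illustrative analogue of a CAS-based merge(): read the current value,
# apply a callback, and write back only if nothing changed in between.
class CasCache:
    def __init__(self):
        self._data = {}  # key -> (value, version)

    def gets(self, key):
        """Return (value, cas_token), like memcached's gets command."""
        value, version = self._data.get(key, (None, 0))
        return value, version

    def cas(self, key, value, token):
        """Store only if nobody wrote since the matching gets()."""
        _, version = self._data.get(key, (None, 0))
        if version != token:
            return False
        self._data[key] = (value, version + 1)
        return True

    def merge(self, key, callback, attempts=10):
        """Read-modify-write loop, retried on CAS conflicts."""
        for _ in range(attempts):
            old, token = self.gets(key)
            if self.cas(key, callback(old), token):
                return True
        return False

cache = CasCache()
cache.merge("topic:posts", lambda old: (old or []) + ["post-1"])
cache.merge("topic:posts", lambda old: (old or []) + ["post-2"])
print(cache.gets("topic:posts")[0])  # ['post-1', 'post-2']
```

In a multi-DC deployment each DC has its own cache cluster, so a CAS token obtained in one DC says nothing about concurrent writes in the other; the loop silently produces lost updates rather than retrying.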

Stripping out the caching layer is probably the easiest way to go about this, and to be honest I don't think the caching layer in Flow ended up being very useful.

> Is this on anyone's plate to work on? The use of merge() is not just slow (like master queries) but completely broken for multi-DC MediaWiki. I wonder if it would be easier to mostly remove the caching layers rather than try to replace them.

Yes. I already scheduled a meeting on Wednesday to discuss this, both because it's still blocking multi-DC and because other issues are still occasionally cropping up (though not that often).

I think where we ended up in the meeting was that we should gradually reduce the memcached TTL, preferably while tracking performance metrics, and rewrite the database fallback queries.

The other possible approach is to replace merge() with delete(). WANObjectCache has special provisions to let a delete() win even if a stale set() happens shortly after the delete() but before the slave DB has caught up (hence, stale).
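The delete()-wins behavior described above works by leaving a "tombstone" for a hold-off period, during which incoming set()s are rejected as possibly stale. A simplified Python sketch of the idea (the class name, hold-off value, and injectable clock are illustrative; the real WANObjectCache is PHP and more subtle):

```python
import time

HOLDOFF_TTL = 11  # seconds a delete() rejects incoming set()s (illustrative)

class WanLikeCache:
    """Toy cache where delete() leaves a tombstone so that a set() racing
    with replication lag cannot re-store stale data."""
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}        # key -> value
        self._tombstones = {}  # key -> tombstone expiry timestamp

    def delete(self, key, holdoff=HOLDOFF_TTL):
        self._data.pop(key, None)
        self._tombstones[key] = self._clock() + holdoff

    def set(self, key, value):
        # A set() arriving during the hold-off may have been computed from
        # a lagged replica, so it is dropped rather than stored.
        if self._tombstones.get(key, 0) > self._clock():
            return False
        self._data[key] = value
        return True

    def get(self, key):
        return self._data.get(key)
```

A reader that regenerates the value from a lagged slave right after the delete() will have its set() refused, so the next reader recomputes again once replication has caught up.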

aaron added a comment.Oct 1 2015, 8:44 PM
In T94029#1690380, @Mattflaschen wrote:

> The other possible approach is to replace merge() with delete(). WANObjectCache has special provisions to let a delete() win even if a stale set() happens shortly after the delete() but before the slave DB has caught up (hence, stale).

It can't use merge(), so this seems sensible.

Implementing WANObjectCache is likely going to be more work than I had anticipated.
We use BufferedBagOStuff, which:

  • keeps all data fetched from/stored to the cache in memory for follow-up calls; we rely on that behavior (we request the same data more than once per request)
  • lets us commit() all cache operations at once

We would probably have to refactor significant pieces of code. And since we're considering throwing away the cache layer...
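The two BufferedBagOStuff behaviors Flow relies on can be sketched in Python as follows (an illustrative analogue, not the actual PHP class; the backend here is a plain dict standing in for memcached):

```python
class BufferedCache:
    """Toy analogue of Flow's BufferedBagOStuff: reads are served from an
    in-process buffer once fetched, and writes are queued until commit()."""
    def __init__(self, backend):
        self._backend = backend      # dict-like shared cache
        self._buffer = {}            # in-memory copy for repeat reads
        self._pending = []           # queued (key, value) writes

    def get(self, key):
        # First read fetches from the backend; later reads in the same
        # request hit the local buffer.
        if key not in self._buffer:
            self._buffer[key] = self._backend.get(key)
        return self._buffer[key]

    def set(self, key, value):
        self._buffer[key] = value            # visible to this request now
        self._pending.append((key, value))   # shared cache sees it on commit

    def commit(self):
        # Flush all queued writes to the shared cache at once.
        for key, value in self._pending:
            self._backend[key] = value
        self._pending.clear()
```

Neither behavior maps cleanly onto WANObjectCache, whose get/set/delete operations act on the shared cache immediately, which is why the refactor would be substantial.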

I just submitted https://gerrit.wikimedia.org/r/#/c/247575/, which will make our code stop backfilling the cache from the DB.
That also means we no longer have to read from DB_MASTER (which we did to ensure the data in cache was current).
And it means we should be able to start lowering $wgFlowCacheTime until we can set $wgFlowUseMemcache = false.
The cache will gradually hold less data, and we can monitor how badly that affects the DB.


Current summary (https://tendril.wikimedia.org/host/view/db1029.eqiad.wmnet/3306)

db1029 : replication family

| Host   | IPv4        | Release | RAM | Up   | Act. | QPS  | Rep | Lag | Tree                                 |
| db1029 | 10.64.16.18 | 5.5.30  | 64G | 945d | 0s   | 2392 | -   | -   | masters: n/a; slaves: db1031, db2009 |
| db1031 | 10.64.16.20 | 5.5.30  | 64G | 945d | 3s   | 466  | Yes | 0s  | masters: db1029; slaves: n/a         |
| db2009 | 10.192.0.12 | 10.0.15 | 64G | 307d | 9s   | 1    | Yes | 0s  | masters: db1029; slaves: n/a         |

Slow query log: https://tendril.wikimedia.org/report/slow_queries?host=family%3Adb1029

Blocked on https://gerrit.wikimedia.org/r/#/c/247575/. Only after that has merged can we start lowering the cache TTL to examine whether getting rid of the cache is viable.

> Blocked on https://gerrit.wikimedia.org/r/#/c/247575/. Only after that has merged can we start lowering the cache TTL to examine whether getting rid of the cache is viable.

It's been merged. I'm moving it back into dev so you can continue with the next steps.

Change 249402 had a related patch set uploaded (by Matthias Mullie):
Expire Flow caches after 1 day

https://gerrit.wikimedia.org/r/249402

Moved back to blocked. The first patch to lower the cache TTL is on Gerrit, but I'd like to first see the impact of https://gerrit.wikimedia.org/r/#/c/247575/ once it hits production. That'll take another week.

Current summary (https://tendril.wikimedia.org/host/view/db1029.eqiad.wmnet/3306)

db1029 : replication family

| Host   | IPv4        | Release | RAM | Up   | Act. | QPS  | Rep | Lag | Tree                                             |
| db1029 | 10.64.16.18 | 5.5.30  | 64G | 955d | 9s   | 2531 | -   | -   | masters: n/a; slaves: db1031, db2009, dbstore1002 |
| db1031 | 10.64.16.20 | 5.5.30  | 64G | 955d | 2s   | 464  | Yes | 0s  | masters: db1029; slaves: n/a                     |
| db2009 | 10.192.0.12 | 10.0.15 | 64G | 317d | 8s   | 1    | Yes | 0s  | masters: db1029; slaves: n/a                     |

Summary (https://tendril.wikimedia.org/host/view/db1029.eqiad.wmnet/3306)
Slow query log: https://tendril.wikimedia.org/report/slow_queries?host=family%3Adb1029


Prior to https://gerrit.wikimedia.org/r/#/c/247575/:

| Host   | IPv4        | Release | RAM | Up   | Act. | QPS  | Rep | Lag | Tree                                 |
| db1029 | 10.64.16.18 | 5.5.30  | 64G | 945d | 0s   | 2392 | -   | -   | masters: n/a; slaves: db1031, db2009 |
| db1031 | 10.64.16.20 | 5.5.30  | 64G | 945d | 3s   | 466  | Yes | 0s  | masters: db1029; slaves: n/a         |
| db2009 | 10.192.0.12 | 10.0.15 | 64G | 307d | 9s   | 1    | Yes | 0s  | masters: db1029; slaves: n/a         |

After:

| Host   | IPv4        | Release | RAM | Up   | Act. | QPS  | Rep | Lag | Tree                                                          |
| db1029 | 10.64.16.18 | 5.5.30  | 64G | 966d | 5s   | 979  | -   | -   | masters: n/a; slaves: db1031, db2009, dbstore1001, dbstore1002 |
| db1031 | 10.64.16.20 | 5.5.30  | 64G | 966d | 8s   | 1938 | Yes | 0s  | masters: db1029; slaves: n/a                                  |
| db2009 | 10.192.0.12 | 10.0.15 | 64G | 328d | 4s   | 1    | Yes | 0s  | masters: db1029; slaves: dbstore2001, dbstore2002             |

Traffic diff (graphs omitted):

DB_MASTER: [graph]

DB_SLAVE: [graph]
Queries to the master have decreased a lot.
I'm currently not seeing worrying stats or slow queries on flow_* tables.
Let's proceed and lower the cache TTL: https://gerrit.wikimedia.org/r/249402

I'd like to mitigate at least some of T118434: Reduce Flow DB queries on Special:Contributions before we do this. Right now, that bug's impact is probably reduced in production because of the caches.

In today's meeting, we decided to try the WANCache delete()/locally populate from slave approach.
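The approach settled on above (writes delete() the key; reads repopulate from a slave on miss) can be sketched as a read-through pattern. This is an illustrative Python analogue: the dicts stand in for the cache, the master DB, and a (possibly lagged) replica, and the class and method names are hypothetical:

```python
class ReadThroughStore:
    """Sketch of the agreed approach: writes hit the master DB and just
    delete() the cache key; readers repopulate from a replica on miss.
    No merge()/set()-on-write means nothing is DC-specific."""
    def __init__(self, cache, db_master, db_replica):
        self.cache = cache
        self.db_master = db_master
        self.db_replica = db_replica

    def write(self, key, value):
        self.db_master[key] = value
        # Invalidate only; each DC's cache refills locally on demand.
        self.cache.pop(key, None)

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.db_replica.get(key)  # may briefly lag the master
        if value is not None:
            self.cache[key] = value
        return value
```

Combined with WANObjectCache's tombstone hold-off on delete(), a reader that repopulates from a lagged replica cannot pin stale data in the cache for long.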

jmatazzoni closed this task as Resolved.Apr 12 2016, 6:46 PM
aaron added a comment.Apr 12 2016, 8:57 PM

Was this closed by mistake?

We decided there were too many tasks about this. This task was originally a spike to investigate the problem. There are remaining tasks about it, e.g. T120009: Flow: Use WAN cache delete() and slave populations to avoid merge().

Change 249402 abandoned by Mattflaschen:
Expire Flow caches after 1 day

https://gerrit.wikimedia.org/r/249402