Spike: Avoid use of merge() in Flow caches
Status: Closed, Resolved · Public · 3 Story Points

Description

As discussed IRL, these will need to change for all the upcoming multi-Data Center stuff.

See also T94028: DB master connections requested by Flow on GET/HEAD requests

aaron created this task. Mar 26 2015, 2:41 PM
aaron updated the task description.
aaron set the priority of this task to Normal.
aaron claimed this task.
aaron added projects: Epic, MediaWiki-Core-Team.
aaron added subscribers: Gilles, GWicke, mark and 8 others.
aaron removed a project: Epic. Mar 26 2015, 2:47 PM
aaron set Security to None.
EBernhardson raised the priority of this task from Normal to High.
DannyH updated the task description. Mar 30 2015, 6:06 PM
DannyH renamed this task from "Avoid use of merge() in Flow caches" to "Spike: Avoid use of merge() in Flow caches". Mar 30 2015, 6:18 PM
DannyH edited a custom field.
aaron removed aaron as the assignee of this task. May 23 2015, 12:47 AM
aaron added a subscriber: Legoktm.

Not sure if there is an additional task for this (other than T94028: DB master connections requested by Flow on GET/HEAD requests), but we've talked about changing the backend to drop the Index layer as a solution to this.

aaron added a comment.Sep 13 2015, 2:33 AM

Is this on anyone's plate to work on? The use of merge() is not just slow (like master queries) but completely broken for multi-DC MediaWiki. I wonder if it would be easier to mostly remove the caching layers rather than try to replace them.
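For context on why merge() is a problem: it is a read-modify-write primitive built on memcached-style compare-and-swap. A minimal Python analogue (class and method names here are illustrative, not MediaWiki's actual BagOStuff API) shows the CAS loop; the token returned by gets() is only meaningful against one cache cluster, which is why the pattern cannot work across data centers:

```python
# Illustrative analogue of a CAS-based merge(): read the current value,
# apply a callback, and write back only if nothing changed in between.
class CasCache:
    def __init__(self):
        self._data = {}  # key -> (value, version)

    def gets(self, key):
        """Return (value, cas_token), like memcached's gets command."""
        value, version = self._data.get(key, (None, 0))
        return value, version

    def cas(self, key, value, token):
        """Store only if nobody wrote since the matching gets()."""
        _, version = self._data.get(key, (None, 0))
        if version != token:
            return False
        self._data[key] = (value, version + 1)
        return True

    def merge(self, key, callback, attempts=10):
        """Read-modify-write loop, retried on CAS conflicts."""
        for _ in range(attempts):
            old, token = self.gets(key)
            if self.cas(key, callback(old), token):
                return True
        return False

cache = CasCache()
cache.merge("topic:posts", lambda old: (old or []) + ["post-1"])
cache.merge("topic:posts", lambda old: (old or []) + ["post-2"])
print(cache.gets("topic:posts")[0])  # ['post-1', 'post-2']
```

In a multi-DC deployment each DC has its own cache cluster, so a CAS token obtained in one DC says nothing about concurrent writes in the other; the loop silently produces lost updates rather than retrying.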

Stripping out the caching layer is probably the easiest way to go about this, and to be honest I don't think the caching layer in Flow ended up being very useful.

> Is this on anyone's plate to work on? The use of merge() is not just slow (like master queries) but completely broken for multi-DC MediaWiki. I wonder if it would be easier to mostly remove the caching layers rather than try to replace them.

Yes. I already scheduled a meeting on Wednesday to discuss this, both because it's still blocking multi-DC and because other issues are still occasionally cropping up (though not that often).

I think where we ended up in the meeting was that we should gradually reduce the memcached TTL, preferably while tracking performance metrics, and rewrite the database fallback queries.

The other possible approach is to replace merge() with delete(). WANObjectCache has special provisions to let a delete() win even if a stale set() happens shortly after the delete() but before the slave DB has caught up (hence, stale).
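The delete()-wins behavior described above works by leaving a "tombstone" for a hold-off period, during which incoming set()s are rejected as possibly stale. A simplified Python sketch of the idea (the class name, hold-off value, and injectable clock are illustrative; the real WANObjectCache is PHP and more subtle):

```python
import time

HOLDOFF_TTL = 11  # seconds a delete() rejects incoming set()s (illustrative)

class WanLikeCache:
    """Toy cache where delete() leaves a tombstone so that a set() racing
    with replication lag cannot re-store stale data."""
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}        # key -> value
        self._tombstones = {}  # key -> tombstone expiry timestamp

    def delete(self, key, holdoff=HOLDOFF_TTL):
        self._data.pop(key, None)
        self._tombstones[key] = self._clock() + holdoff

    def set(self, key, value):
        # A set() arriving during the hold-off may have been computed from
        # a lagged replica, so it is dropped rather than stored.
        if self._tombstones.get(key, 0) > self._clock():
            return False
        self._data[key] = value
        return True

    def get(self, key):
        return self._data.get(key)
```

A reader that regenerates the value from a lagged slave right after the delete() will have its set() refused, so the next reader recomputes again once replication has caught up.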

aaron added a comment.Oct 1 2015, 8:44 PM
In T94029#1690380, @Mattflaschen wrote:

> The other possible approach is to replace merge() with delete(). WANObjectCache has special provisions to let a delete() win even if a stale set() happens shortly after the delete() but before the slave DB has caught up (hence, stale).

It can't use merge(), so this seems sensible.

Implementing WANObjectCache is likely going to be more work than I had anticipated.
We use BufferedBagOStuff, which:

  • keeps all data fetched from/stored to the cache in memory for follow-up calls; we rely on that behavior (we request the same data more than once per request)
  • lets us commit() all cache operations at once

We would probably have to refactor significant pieces of code. And since we're considering throwing away the cache layer...
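The two BufferedBagOStuff behaviors Flow relies on can be sketched in Python as follows (an illustrative analogue, not the actual PHP class; the backend here is a plain dict standing in for memcached):

```python
class BufferedCache:
    """Toy analogue of Flow's BufferedBagOStuff: reads are served from an
    in-process buffer once fetched, and writes are queued until commit()."""
    def __init__(self, backend):
        self._backend = backend      # dict-like shared cache
        self._buffer = {}            # in-memory copy for repeat reads
        self._pending = []           # queued (key, value) writes

    def get(self, key):
        # First read fetches from the backend; later reads in the same
        # request hit the local buffer.
        if key not in self._buffer:
            self._buffer[key] = self._backend.get(key)
        return self._buffer[key]

    def set(self, key, value):
        self._buffer[key] = value            # visible to this request now
        self._pending.append((key, value))   # shared cache sees it on commit

    def commit(self):
        # Flush all queued writes to the shared cache at once.
        for key, value in self._pending:
            self._backend[key] = value
        self._pending.clear()
```

Neither behavior maps cleanly onto WANObjectCache, whose get/set/delete operations act on the shared cache immediately, which is why the refactor would be substantial.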

I just submitted https://gerrit.wikimedia.org/r/#/c/247575/, which will make our code stop backfilling the cache from the DB.
That also means we no longer have to read from DB_MASTER (which we did to ensure the data in cache was current).
And it means we should be able to start lowering $wgFlowCacheTime until we can set $wgFlowUseMemcache = false.
The cache will gradually hold less data, and we can monitor how badly that affects the DB.


Current summary (https://tendril.wikimedia.org/host/view/db1029.eqiad.wmnet/3306)

db1029 : replication family

| Host   | IPv4        | Release | RAM | Up   | Act. | QPS  | Rep | Lag | Tree                                 |
| db1029 | 10.64.16.18 | 5.5.30  | 64G | 945d | 0s   | 2392 | -   | -   | masters: n/a; slaves: db1031, db2009 |
| db1031 | 10.64.16.20 | 5.5.30  | 64G | 945d | 3s   | 466  | Yes | 0s  | masters: db1029; slaves: n/a         |
| db2009 | 10.192.0.12 | 10.0.15 | 64G | 307d | 9s   | 1    | Yes | 0s  | masters: db1029; slaves: n/a         |

Slow query log: https://tendril.wikimedia.org/report/slow_queries?host=family%3Adb1029

Blocked on https://gerrit.wikimedia.org/r/#/c/247575/. Only after that has merged can we start lowering the cache TTL to examine whether getting rid of the cache is viable.

> Blocked on https://gerrit.wikimedia.org/r/#/c/247575/. Only after that has merged can we start lowering the cache TTL to examine whether getting rid of the cache is viable.

It's been merged. I'm moving it back into dev so you can continue with the next steps.

Change 249402 had a related patch set uploaded (by Matthias Mullie):
Expire Flow caches after 1 day

https://gerrit.wikimedia.org/r/249402

Moved back to blocked. The first patch to lower the cache TTL is on Gerrit, but I'd like to first see the impact of https://gerrit.wikimedia.org/r/#/c/247575/ once it hits production. That'll take another week.

Current summary (https://tendril.wikimedia.org/host/view/db1029.eqiad.wmnet/3306)

db1029 : replication family

| Host   | IPv4        | Release | RAM | Up   | Act. | QPS  | Rep | Lag | Tree                                             |
| db1029 | 10.64.16.18 | 5.5.30  | 64G | 955d | 9s   | 2531 | -   | -   | masters: n/a; slaves: db1031, db2009, dbstore1002 |
| db1031 | 10.64.16.20 | 5.5.30  | 64G | 955d | 2s   | 464  | Yes | 0s  | masters: db1029; slaves: n/a                     |
| db2009 | 10.192.0.12 | 10.0.15 | 64G | 317d | 8s   | 1    | Yes | 0s  | masters: db1029; slaves: n/a                     |

Summary (https://tendril.wikimedia.org/host/view/db1029.eqiad.wmnet/3306)
Slow query log: https://tendril.wikimedia.org/report/slow_queries?host=family%3Adb1029


Prior to https://gerrit.wikimedia.org/r/#/c/247575/:

| Host   | IPv4        | Release | RAM | Up   | Act. | QPS  | Rep | Lag | Tree                                 |
| db1029 | 10.64.16.18 | 5.5.30  | 64G | 945d | 0s   | 2392 | -   | -   | masters: n/a; slaves: db1031, db2009 |
| db1031 | 10.64.16.20 | 5.5.30  | 64G | 945d | 3s   | 466  | Yes | 0s  | masters: db1029; slaves: n/a         |
| db2009 | 10.192.0.12 | 10.0.15 | 64G | 307d | 9s   | 1    | Yes | 0s  | masters: db1029; slaves: n/a         |

After:

| Host   | IPv4        | Release | RAM | Up   | Act. | QPS  | Rep | Lag | Tree                                                          |
| db1029 | 10.64.16.18 | 5.5.30  | 64G | 966d | 5s   | 979  | -   | -   | masters: n/a; slaves: db1031, db2009, dbstore1001, dbstore1002 |
| db1031 | 10.64.16.20 | 5.5.30  | 64G | 966d | 8s   | 1938 | Yes | 0s  | masters: db1029; slaves: n/a                                  |
| db2009 | 10.192.0.12 | 10.0.15 | 64G | 328d | 4s   | 1    | Yes | 0s  | masters: db1029; slaves: dbstore2001, dbstore2002             |

Traffic diff (graphs omitted):

DB_MASTER: [graph]

DB_SLAVE: [graph]
Queries to the master have decreased a lot.
I'm currently not seeing worrying stats or slow queries on flow_* tables.
Let's proceed and lower the cache TTL: https://gerrit.wikimedia.org/r/249402

I'd like to mitigate at least some of T118434: Reduce Flow DB queries on Special:Contributions before we do this. Right now, that bug's impact is probably reduced in production because of the caches.

In today's meeting, we decided to try the WANCache delete()/locally populate from slave approach.
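The approach settled on above (writes delete() the key; reads repopulate from a slave on miss) can be sketched as a read-through pattern. This is an illustrative Python analogue: the dicts stand in for the cache, the master DB, and a (possibly lagged) replica, and the class and method names are hypothetical:

```python
class ReadThroughStore:
    """Sketch of the agreed approach: writes hit the master DB and just
    delete() the cache key; readers repopulate from a replica on miss.
    No merge()/set()-on-write means nothing is DC-specific."""
    def __init__(self, cache, db_master, db_replica):
        self.cache = cache
        self.db_master = db_master
        self.db_replica = db_replica

    def write(self, key, value):
        self.db_master[key] = value
        # Invalidate only; each DC's cache refills locally on demand.
        self.cache.pop(key, None)

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.db_replica.get(key)  # may briefly lag the master
        if value is not None:
            self.cache[key] = value
        return value
```

Combined with WANObjectCache's tombstone hold-off on delete(), a reader that repopulates from a lagged replica cannot pin stale data in the cache for long.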

jmatazzoni closed this task as Resolved.Apr 12 2016, 6:46 PM
aaron added a comment.Apr 12 2016, 8:57 PM

Was this closed by mistake?

We decided there were too many tasks about this. This task was originally a spike to investigate the problem. There are remaining tasks about it, e.g. T120009: Flow: Use WAN cache delete() and slave populations to avoid merge().

Change 249402 abandoned by Mattflaschen:
Expire Flow caches after 1 day

https://gerrit.wikimedia.org/r/249402