
Spike: Avoid use of merge() in Flow caches
Closed, Resolved · Public · 3 Estimated Story Points

Description

As discussed IRL, these will need to change for all the upcoming multi-Data Center stuff.

See also T94028: DB master connections requested by Flow on GET/HEAD requests

Event Timeline

aaron claimed this task.
aaron raised the priority of this task from to Medium.
aaron updated the task description. (Show Details)
aaron added projects: Epic, MediaWiki-Core-Team.
aaron added subscribers: Gilles, GWicke, mark and 8 others.
aaron set Security to None.
EBernhardson raised the priority of this task from Medium to High. Mar 27 2015, 5:43 PM
DannyH renamed this task from "Avoid use of merge() in Flow caches" to "Spike: Avoid use of merge() in Flow caches". Mar 30 2015, 6:18 PM
DannyH edited a custom field.
aaron removed aaron as the assignee of this task. May 23 2015, 12:47 AM
aaron added a subscriber: Legoktm.

Not sure if there is an additional task for this (other than T94028: DB master connections requested by Flow on GET/HEAD requests), but we've talked about changing the backend to drop the Index layer as a solution to this.

Is this on anyone's plate to work on? The use of merge() is not just slow (like master queries) but totally broken for multi-DC MW. I wonder if it would just be easier to mostly remove the caching layers rather than try to replace them.

Stripping out the caching layer is probably the easiest way to go about this, and tbh I don't think the caching layer in Flow ended up being incredibly useful.

Is this on anyone's plate to work on? The use of merge() is not just slow (like master queries) but totally broken for multi-DC MW. I wonder if it would just be easier to mostly remove the caching layers rather than try to replace them.

Yes. I already scheduled a meeting on Wednesday to discuss this. This is both because it's still blocking multi-DC and because other issues are still occasionally cropping up (though not that often). E.g.

I think where we ended up in the meeting was that we should gradually reduce the memcached time-to-live (preferably while tracking performance metrics) and rewrite the database fallback queries.

The other possible approach is to replace merge with delete. WANObjectCache has special provisions to allow a delete() to win even if a stale set() happens shortly after the delete(), but before the slave DB is up to date (hence, stale).

In T94029#1690380, @Mattflaschen wrote:

The other possible approach is to replace merge with delete. WANObjectCache has special provisions to allow a delete() to win even if a stale set() happens shortly after the delete(), but before the slave DB is up to date (hence, stale).

It can't use merge(), so this seems sensible.
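A minimal sketch of what that delete()-based pattern could look like follows. The key construction, table and query here are illustrative, not Flow's actual code, and the exact WANObjectCache signatures have varied between MediaWiki versions:

```php
// Sketch only: key, table and query are hypothetical, not Flow's real code.
$cache = ObjectCache::getMainWANInstance();
$key = wfMemcKey( 'flow-example', $workflowId );

// Read path: populate from a slave; $setOpts lets the cache know the value may be
// stale due to replication lag, so such a set() won't outlive a later delete().
$row = $cache->getWithSetCallback(
	$key,
	86400, // TTL in seconds
	function ( $oldValue, &$ttl, array &$setOpts ) use ( $workflowId ) {
		$dbr = wfGetDB( DB_SLAVE );
		$setOpts += Database::getCacheSetOptions( $dbr );

		return $dbr->selectRow(
			'flow_workflow',
			'*',
			[ 'workflow_id' => $workflowId ],
			__METHOD__
		);
	}
);

// Write path: no merge(); just invalidate after the DB write. delete() applies a
// hold-off period so a racing set() of pre-update (stale) data will not stick.
$cache->delete( $key );
```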

Implementing WANObjectCache is likely going to be more work than I had anticipated.
We use BufferedBagOStuff, which:

  • keeps all data fetched from or stored to cache in memory for follow-up calls; we rely on that behavior (the same data is requested more than once per request)
  • lets us commit() all cache operations at once

We would probably have to refactor significant pieces of code. And since we're considering throwing away the cache layer...
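For context, the buffering behaviour described above looks roughly like this; $cache stands for Flow's BufferedBagOStuff instance, and the key names and exact signatures are illustrative:

```php
// Rough illustration of the behaviour described above; key names are made up and
// the exact BufferedBagOStuff signatures may differ from Flow's real class.

// Reads are remembered in-process, so asking for the same key twice only hits
// memcached once.
$first  = $cache->get( 'flow:topic-list:123' );
$second = $cache->get( 'flow:topic-list:123' ); // served from the in-memory buffer

// Writes are buffered locally and only pushed to memcached on commit(), so all of
// a request's cache operations land together (or not at all).
$cache->set( 'flow:topic-list:123', $newValue );
$cache->commit();
```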

I just submitted https://gerrit.wikimedia.org/r/#/c/247575/, which will make our code stop backfilling cache from DB.
That also means we no longer have to read from DB_MASTER (which we did to ensure the data in cache is current).
And this also means we should be able to start lowering $wgFlowCacheTime until we can set $wgFlowUseMemcache = false.
The cache will gradually hold less data and we can monitor how badly that affects the DB.
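In configuration terms, the ramp-down could look something like this (the concrete values are examples, not what was actually deployed):

```php
// Example values only; the real roll-out would use whatever TTLs operations are
// comfortable with at each step.
$wgFlowCacheTime = 86400;      // step 1: expire Flow caches after 1 day
// $wgFlowCacheTime = 3600;    // step 2: shorten further if DB load stays acceptable
// $wgFlowUseMemcache = false; // final step: stop caching Flow data in memcached
```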


Current summary (https://tendril.wikimedia.org/host/view/db1029.eqiad.wmnet/3306)

db1029 : replication family

| Host | IPv4 | Release | RAM | Up | Act. | QPS | Rep | Lag | Tree |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| db1029 | 10.64.16.18 | 5.5.30 | 64G | 945d | 0s | 2392 | - | - | masters: n/a; slaves: db1031, db2009 |
| db1031 | 10.64.16.20 | 5.5.30 | 64G | 945d | 3s | 466 | Yes | 0s | masters: db1029; slaves: n/a |
| db2009 | 10.192.0.12 | 10.0.15 | 64G | 307d | 9s | 1 | Yes | 0s | masters: db1029; slaves: n/a |

Slow query log: https://tendril.wikimedia.org/report/slow_queries?host=family%3Adb1029

Blocked on https://gerrit.wikimedia.org/r/#/c/247575/. Only after that has merged can we start lowering the cache TTL to examine whether getting rid of the cache is viable.

Blocked on https://gerrit.wikimedia.org/r/#/c/247575/. Only after that has merged can we start lowering the cache TTL to examine whether getting rid of the cache is viable.

It's been merged. I'm moving it back into dev so you can continue with the next steps.

Change 249402 had a related patch set uploaded (by Matthias Mullie):
Expire Flow caches after 1 day

https://gerrit.wikimedia.org/r/249402

Moved back to blocked. The first patch to lower the cache TTL is on Gerrit, but I'd like to first see the impact of https://gerrit.wikimedia.org/r/#/c/247575/ once it hits production. That'll take another week.

Current summary (https://tendril.wikimedia.org/host/view/db1029.eqiad.wmnet/3306)

db1029 : replication family

| Host | IPv4 | Release | RAM | Up | Act. | QPS | Rep | Lag | Tree |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| db1029 | 10.64.16.18 | 5.5.30 | 64G | 955d | 9s | 2531 | - | - | masters: n/a; slaves: db1031, db2009, dbstore1002 |
| db1031 | 10.64.16.20 | 5.5.30 | 64G | 955d | 2s | 464 | Yes | 0s | masters: db1029; slaves: n/a |
| db2009 | 10.192.0.12 | 10.0.15 | 64G | 317d | 8s | 1 | Yes | 0s | masters: db1029; slaves: n/a |

Summary (https://tendril.wikimedia.org/host/view/db1029.eqiad.wmnet/3306)
Slow query log: https://tendril.wikimedia.org/report/slow_queries?host=family%3Adb1029


Prior to https://gerrit.wikimedia.org/r/#/c/247575/:

| Host | IPv4 | Release | RAM | Up | Act. | QPS | Rep | Lag | Tree |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| db1029 | 10.64.16.18 | 5.5.30 | 64G | 945d | 0s | 2392 | - | - | masters: n/a; slaves: db1031, db2009 |
| db1031 | 10.64.16.20 | 5.5.30 | 64G | 945d | 3s | 466 | Yes | 0s | masters: db1029; slaves: n/a |
| db2009 | 10.192.0.12 | 10.0.15 | 64G | 307d | 9s | 1 | Yes | 0s | masters: db1029; slaves: n/a |

After:

| Host | IPv4 | Release | RAM | Up | Act. | QPS | Rep | Lag | Tree |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| db1029 | 10.64.16.18 | 5.5.30 | 64G | 966d | 5s | 979 | - | - | masters: n/a; slaves: db1031, db2009, dbstore1001, dbstore1002 |
| db1031 | 10.64.16.20 | 5.5.30 | 64G | 966d | 8s | 1938 | Yes | 0s | masters: db1029; slaves: n/a |
| db2009 | 10.192.0.12 | 10.0.15 | 64G | 328d | 4s | 1 | Yes | 0s | masters: db1029; slaves: dbstore2001, dbstore2002 |

Traffic diff
DB_MASTER:

Screen Shot 2015-11-10 at 15.51.23.png (598×1 px, 140 KB)

DB_SLAVE:
Screen Shot 2015-11-10 at 15.51.10.png (604×1 px, 135 KB)


Queries to master have decreased a lot.
I'm currently not seeing worrying stats or slow queries on the flow_* tables.
Let's proceed and lower the cache TTL: https://gerrit.wikimedia.org/r/249402

I'd like to mitigate at least some of T118434: Reduce Flow DB queries on Special:Contributions before we do this. Right now, that bug's impact is probably reduced in production because of the caches.

In today's meeting, we decided to try the WANCache delete()/locally populate from slave approach.

We decided there were too many tasks about this; this one was originally a spike to investigate the problem. There are remaining tasks about it, e.g. T120009: Flow: Use WAN cache delete() and replica-based filling to avoid merge().

Change 249402 abandoned by Mattflaschen:
Expire Flow caches after 1 day

https://gerrit.wikimedia.org/r/249402