
Fix inefficient CacheAwarePropertyInfoStore memcached access pattern
Closed, ResolvedPublic

Description

While this is not a problem now, I see PropertyInfoStore becoming a performance problem rather sooner than later, especially if we implement identifier linking on top of it and P553 finally gets split up (which it really really should).

We should have an implementation that doesn't require loading all infos from memcached (CachingPropertyInfoStore) but also doesn't end up making too many round trips to some form of storage.

This will probably be a problem very similar to the one we had with SiteList, but having a static file cache isn't a possible solution here.

https://grafana.wikimedia.org/d/000000574/t204083-investigation shows the excessive traffic moving around the various memcached hosts over the last year.

Final acceptance criteria

  • TBA other points
  • Write an ADR for what was done in Wikibase - 4 hours spike

Related Objects

Event Timeline


@Jonas It looks like this didn't get into the 23 branch, so it will not be deployed until next week.

• Jonas set Due Date to Oct 4 2018, 7:00 AM. Sep 26 2018, 10:12 AM
Restricted Application changed the subtype of this task from "Task" to "Deadline". · View Herald Transcript Sep 26 2018, 10:12 AM

This patch should go live on wikidata.org with the train on Wednesday 3rd Oct

Addshore added a comment.EditedOct 3 2018, 9:24 PM

The 10-second on-server cache of the data resulted in a pretty significant drop in traffic, with no significant increase on another server:

Krinkle removed Due Date.Oct 5 2018, 5:42 PM
Restricted Application changed the subtype of this task from "Deadline" to "Task". · View Herald Transcript Oct 5 2018, 5:42 PM
Addshore lowered the priority of this task from High to Normal.Oct 8 2018, 12:28 PM

Setting this to Normal now, as the on-server caching has reduced the traffic to the individual server. It is still higher than on other servers, but only 2-3x greater rather than 4+x greater.

The on-server cache should provide steadier rates of access, and spikes in MW requests should not result in spikes in these requests to memcached.
We could increase the local cache from 10 seconds to 15 seconds and should see a decrease in cache traffic again.
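The on-server caching described above can be sketched as a small TTL-bounded local cache sitting in front of the shared cache. This is illustrative Python only (the real Wikibase code is PHP, and the class and function names here are hypothetical): each app server refetches a hot key from memcached at most once per TTL window, instead of on every request.

```python
import time

class LocalTtlCache:
    """Per-server cache with a short TTL, shielding memcached from
    repeated reads of the same hot key (illustrative sketch only)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expiry_timestamp, value)

    def get(self, key, fetch):
        """Return the locally cached value; call `fetch` (one round
        trip to the shared cache) only after the TTL has elapsed."""
        now = time.monotonic()
        hit = self.store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh locally, no memcached traffic
        value = fetch()
        self.store[key] = (now + self.ttl, value)
        return value
```

With a 10-15 second TTL, the memcached GET rate for the key becomes bounded by the number of app servers rather than by the request rate, which matches the traffic drop reported above.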

We could also think about sharding the data in some way.
We could

  • Have n keys containing the same data, and randomly choose one to load the data from when needed. This would spread the load across the cache cluster more evenly.
  • Shard based on data type. The lookups don't know what data type they want when requesting the data, so they would have to load a shard, check whether the property lies within it, and try the next one if not. This would again reduce the traffic on a single server, but would result in more requests overall to get the same data.
  • Shard based on property ID. Simply speaking, we could put the first 500 properties in one key, the next 500 in another, and so on. This would distribute the data among multiple cache servers, reducing traffic, and it would also let us know which shard to look at instead of having to guess. Instead of hard coding this / using the numeric representations of property IDs, we could also shard based on some hash function over the serialized property ID.
  • Have a key per property. I don't think this is a great idea, as it would result in a much greater number of hits; a single request might need info on any number of properties. Currently the code doesn't have a way of knowing how many properties we are going to request before we start requesting them. Of course, we could rework this.
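The property-ID sharding option can be sketched in a few lines. This is an illustrative Python sketch, not the Wikibase implementation; the shard size, key prefix, and function name are all hypothetical. The point is that the shard for a property is computable from its ID, so a lookup knows exactly which key to fetch instead of guessing.

```python
SHARD_SIZE = 500  # hypothetical: number of properties per cache key

def shard_key(property_id, prefix="CacheAwarePropertyInfoStore"):
    """Map a property ID like 'P1234' to the cache key of its shard."""
    numeric = int(property_id[1:])  # strip the leading 'P'
    shard = numeric // SHARD_SIZE   # integer division picks the bucket
    return f"{prefix}:shard:{shard}"

# shard_key("P31")  -> "CacheAwarePropertyInfoStore:shard:0"
# shard_key("P553") -> "CacheAwarePropertyInfoStore:shard:1"
```

Because memcached distributes keys across hosts by hashing the key name, distinct shard keys naturally land on different servers, spreading both the traffic and the payload size that currently concentrate on one host.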

Change 470567 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] CachingPropertyInfoLookup local server cache 10s -> 15s

https://gerrit.wikimedia.org/r/470567

Change 470567 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] CachingPropertyInfoLookup local server cache 10s -> 15s

https://gerrit.wikimedia.org/r/470567

So the on-server caching has reduced the traffic fairly dramatically, although the key is still used a lot and the data transfer is expensive.

It looks like the traffic for the cache key has halved, which should be fine for now.
I think this can leave the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞) board but will remain open on the Wikidata board, as this issue will continue to grow as the number of properties and the number of app servers increase.

Addshore renamed this task from Investigate more efficient memcached solution for CacheAwarePropertyInfoStore to Investigate more efficient memcached solution for CacheAwarePropertyInfoStore.Dec 18 2018, 2:33 PM
Addshore updated the task description. (Show Details)

Picked back up by the campsite to write an ADR for what was done and why & how we might need to change this in the future.

Change 480731 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] Add docs/adr/0002-CacheAwarePropertyInfoStore-caching.md

https://gerrit.wikimedia.org/r/480731

Addshore closed this task as Resolved.Jan 7 2019, 9:39 PM

Resolved for now, although at some point in the future we will have to revisit this

Change 480731 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add docs/adr/0002-CacheAwarePropertyInfoStore-caching.md

https://gerrit.wikimedia.org/r/480731

Addshore updated the task description. (Show Details)Jan 17 2019, 4:17 PM
elukey added a subscriber: elukey.Mar 10 2019, 6:57 PM

Quick comment: I noticed while checking some memcached metrics that wikibase_shared/1_33_0-wmf_20-wikidatawiki-hhvm:CacheAwarePropertyInfoStore generates 30~60MB/s of GET traffic since the last deployment of mediawiki:

https://grafana.wikimedia.org/d/000000317/memcache-slabs?orgId=1&from=now-7d&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached&var-instance=mc1022&var-slab=140&panelId=60&fullscreen

Confirmed also via memkeys. Given what happened in T204083, is this meant to generate this kind of traffic volume? If so, will it increase over time, stay steady, etc.? I don't see impact on the network interface or correlation with timeouts, but better to be safe and ask :)

elukey reopened this task as Open.Mar 10 2019, 6:57 PM

I found another occurrence of timeouts in mcrouter with SETs for this key; it might be due to the TTL expiring or similar. Reducing the traffic volume to memcached is surely a good thing.

Addshore renamed this task from Investigate more efficient memcached solution for CacheAwarePropertyInfoStore to Fix inefficient CacheAwarePropertyInfoStore memcached access pattern.Mar 12 2019, 12:49 PM
Addshore raised the priority of this task from Normal to High.
Addshore moved this task from Incoming to Needs Work on the Wikidata-Campsite board.
elukey added a subscriber: aaron.EditedMar 12 2019, 3:42 PM

Another big SET burst to mc1022 caused timeouts for the MW appservers. I don't mean to pressure you guys, but is there any plan to fix this?

The SET bursts should be due to a key either being evicted or expired:

elukey@mc1022:~$ cat slab_140 | grep -i wikidata
ITEM wikibase_shared/1_33_0-wmf_20-wikidatawiki-hhvm:CacheAwarePropertyInfoStore [100621 b; 1552491470 s]
ITEM wikibase_shared/1_33_0-wmf_20-wikidatawiki-hhvm:WikiPageEntityRevisionLookup:Q56701805 [99611 b; 1552430549 s]

elukey@mc1022:~$ date -d @1552491470
Wed Mar 13 15:37:50 UTC 2019          <================= TTL 24h should match with this being just re-created

elukey@mc1022:~$ date -d @1552430549
Tue Mar 12 22:42:29 UTC 2019

elukey@mc1022:~$ date
Tue Mar 12 15:39:31 UTC 2019

Now I am wondering how the key gets re-set; it might be something to tune in the usage of the MediaWiki WANCache. Adding @aaron as FYI :)

Mcrouter timeouts/TKO registered: https://grafana.wikimedia.org/d/000000549/mcrouter?panelId=9&fullscreen&orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=All&from=1552402906212&to=1552405067164

Slab 140 SET traffic: https://grafana.wikimedia.org/d/000000317/memcache-slabs?panelId=49&fullscreen&orgId=1&from=1552403014459&to=1552404751954&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached&var-instance=mc1022&var-slab=All

It differs from T203786 since the appservers (not the apis) seem to be hit. It is bad luck that this happens on the same host though, mc1022 (but of course different slabs).

@Addshore can you write a brief summary about what changed recently? I tried to read the task but I got lost in the PHP :(

Addshore removed Addshore as the assignee of this task.Mar 12 2019, 3:49 PM

Change 495928 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] Double on server cache for PropertyInfoStore

https://gerrit.wikimedia.org/r/495928

Change 495929 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@wmf/1.33.0-wmf.20] Double on server cache for PropertyInfoStore

https://gerrit.wikimedia.org/r/495929

Change 495930 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@wmf/1.33.0-wmf.21] Double on server cache for PropertyInfoStore

https://gerrit.wikimedia.org/r/495930

Change 495930 merged by Addshore:
[mediawiki/extensions/Wikibase@wmf/1.33.0-wmf.21] Double on server cache for PropertyInfoStore

https://gerrit.wikimedia.org/r/495930

Change 495929 merged by Addshore:
[mediawiki/extensions/Wikibase@wmf/1.33.0-wmf.20] Double on server cache for PropertyInfoStore

https://gerrit.wikimedia.org/r/495929

Mentioned in SAL (#wikimedia-operations) [2019-03-12T16:42:32Z] <addshore@deploy1001> Synchronized php-1.33.0-wmf.20/extensions/Wikibase/repo/includes/Store/Sql/SqlStore.php: T97368 Double on server cache for PropertyInfoStore (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2019-03-12T16:43:48Z] <addshore@deploy1001> Synchronized php-1.33.0-wmf.21/extensions/Wikibase/repo/includes/Store/Sql/SqlStore.php: T97368 Double on server cache for PropertyInfoStore (duration: 00m 55s)

Change 495941 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] Increase APC cache for PropertyInfoLookup from 15 to 20s

https://gerrit.wikimedia.org/r/495941

Change 495942 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@wmf/1.33.0-wmf.21] Increase APC cache for PropertyInfoLookup from 15 to 20s

https://gerrit.wikimedia.org/r/495942

Change 495943 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@wmf/1.33.0-wmf.20] Increase APC cache for PropertyInfoLookup from 15 to 20s

https://gerrit.wikimedia.org/r/495943

Change 495942 merged by Addshore:
[mediawiki/extensions/Wikibase@wmf/1.33.0-wmf.21] Increase APC cache for PropertyInfoLookup from 15 to 20s

https://gerrit.wikimedia.org/r/495942

Change 495943 merged by Addshore:
[mediawiki/extensions/Wikibase@wmf/1.33.0-wmf.20] Increase APC cache for PropertyInfoLookup from 15 to 20s

https://gerrit.wikimedia.org/r/495943

Mentioned in SAL (#wikimedia-operations) [2019-03-12T17:10:13Z] <addshore@deploy1001> Synchronized php-1.33.0-wmf.20/extensions/Wikibase/repo/includes/Store/Sql/SqlStore.php: T97368 Increase APC cache for PropertyInfoLookup from 15 to 20s (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2019-03-12T17:11:20Z] <addshore@deploy1001> Synchronized php-1.33.0-wmf.21/extensions/Wikibase/repo/includes/Store/Sql/SqlStore.php: T97368 Increase APC cache for PropertyInfoLookup from 15 to 20s (duration: 00m 55s)

Another big SET burst to mc1022 caused timeouts for the MW appservers. I don't mean to pressure you guys, but is there any plan to fix this?

T218115 is the first step of the plan and is now being worked on.

Per our IRC conversation it sounds like we will want to switch this to use WANObjectCache? (I'll wait for your full comment here).

For now, as the network pipe on the machine holding this key nears saturation during spikes, I increased the APC cache for the key a little bit too (see the deployments above).

To summarize what I am currently seeing:

  • after the last mw deployment/train, the key got renamed and moved to mc1022 (was before on mc1021). This can be seen in https://grafana.wikimedia.org/d/000000316/memcache?panelId=45&fullscreen&orgId=1&from=1551860994585&to=1552057232883
  • when the key expires or gets evicted from the memcached LRU, a burst of SETs hits mc1022. This matches mcrouter's timeouts, which eventually lead to the shard being marked as temporarily down for all memcached traffic and hence spikes in MediaWiki exceptions.
  • there is a constant high volume of GETs that raises mc1022's baseline a lot and gets us closer to what the 1G NIC on the host can handle. We haven't observed any sign of packets dropped due to saturation, but in theory a normal increase in api/appserver traffic could lead to that.

To solve the bursts of SETs we could use WANObjectCache (https://doc.wikimedia.org/mediawiki-core/master/php/classWANObjectCache.html) instead of using BagOStuff directly, since it offers some nice functionality to prevent, for example, all the api/appservers from trying to SET a key when it is missing. We did the same with the MediaWiki Translate extension in T203786, solving a lot of problems.
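The stampede-protection idea behind WANObjectCache can be modelled with a toy sketch. This is illustrative Python, not the actual WANObjectCache PHP API; the class name and structure are hypothetical. When a value expires, only one caller wins the regeneration lock and issues a SET, while the others keep serving the stale copy instead of piling on.

```python
import threading
import time

class StampedeProtectedCache:
    """Toy model of stampede protection: on expiry, a single caller
    regenerates the value while concurrent callers serve the stale
    copy, turning a burst of SETs into one SET. Illustrative only."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.value = None
        self.expiry = 0.0
        self.regen_lock = threading.Lock()

    def get(self, regenerate):
        now = time.monotonic()
        if now < self.expiry:
            return self.value  # fresh hit, no regeneration
        # Expired: try to become the single regenerator.
        if self.regen_lock.acquire(blocking=False):
            try:
                self.value = regenerate()  # one SET instead of a burst
                self.expiry = time.monotonic() + self.ttl
            finally:
                self.regen_lock.release()
        # Callers that lose the race return the stale value immediately.
        return self.value
```

This is the property that matters for mc1022 here: the SET rate for the key becomes roughly one per TTL window, regardless of how many appservers observe the expiry at the same moment.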

Change 495928 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Double on server cache for PropertyInfoStore

https://gerrit.wikimedia.org/r/495928

Change 495941 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Increase APC cache for PropertyInfoLookup from 15 to 20s

https://gerrit.wikimedia.org/r/495941

We should see changes in 1.33.0-wmf.24.
It looks like the train didn't make it past group0 this week, so no change really spotted yet.

elukey added a comment.Apr 9 2019, 6:11 AM

The train was deployed yesterday, I can see an awesome improvement (green bar dropping):

Thanks a lot for all the work!

Addshore closed this task as Resolved.Apr 9 2019, 9:31 AM
Addshore claimed this task.

Amazing, I think we can resolve this then!

I just flicked back to https://grafana.wikimedia.org/d/000000574/t204083-investigation today and noticed that as of 1.34.wmf.3 a pattern similar to what we were seeing before is back?
https://tools.wmflabs.org/sal/log/AWp6A7RAEHTBTPG-zSOo

Not sure if this has been spotted anywhere else, ping @elukey, maybe you can shed some light on this?

From a quick look with memkeys on mc1033, these are the top-talking keys:

memcache key                                                                                                                        calls   objsize   req/sec  bw(kbps)
WANCache:v:global:CacheAwarePropertyInfoStore:wikidatawiki:P248                                                                    401463        92   8921.40   6566.15
WANCache:v:global:CacheAwarePropertyInfoStore:wikidatawiki:P2093                                                                   338923        85   7531.62   5121.50
WANCache:v:global:CacheAwarePropertyInfoStore:wikidatawiki:P4656                                                                    36029        82    800.64    525.22
WANCache:v:global:CacheAwarePropertyInfoStore:wikidatawiki:P214                                                                     35361       150    785.80    942.96

Is it something that was changed recently?

daniel removed a subscriber: daniel.May 14 2019, 9:41 AM
Addshore added a comment.EditedMay 14 2019, 5:11 PM

That all looks fine.
I added a new panel to https://grafana.wikimedia.org/d/000000574/t204083-investigation?orgId=1 showing the stats for WANObjectCache and these keys.

Those metrics show no real increase in requests for the keys for the spike in traffic to mc1033, perhaps it is something else?

I won't comment any more here until we have a new fresh clean ticket tracking the apparently new issue.