
Fix inefficient CacheAwarePropertyInfoStore memcached access pattern
Closed, ResolvedPublic

Description

While this is not a problem now, I see PropertyInfoStore becoming a performance problem rather sooner than later, especially if we implement identifier linking on top of it and P553 finally gets split up (which it really really should).

We should have an implementation that doesn't require loading all infos from memcached (CachingPropertyInfoStore) but also doesn't end up making too many round trips to some form of storage.

This will probably be a problem very similar to the one we had with SiteList, but having a static file cache isn't a possible solution here.

https://grafana.wikimedia.org/d/000000574/t204083-investigation shows the excessive traffic moving around the various memcached hosts over the last year.

Final acceptance criteria

  • TBA other points
  • Write an ADR for what was done in Wikibase - 4 hours spike

Related Objects

Event Timeline


@Jonas It looks like this didn't get into the 23 branch, so it will not be deployed until next week.

• Jonas set Due Date to Oct 4 2018, 7:00 AM. Sep 26 2018, 10:12 AM
Restricted Application changed the subtype of this task from "Task" to "Deadline". · View Herald Transcript Sep 26 2018, 10:12 AM

This patch should go live on wikidata.org with the train on Wednesday 3rd Oct

Addshore added a comment.EditedOct 3 2018, 9:24 PM

The 10-second on-server cache of the data resulted in a pretty significant drop in traffic, with no significant increase on another server:

Krinkle removed Due Date.Oct 5 2018, 5:42 PM
Restricted Application changed the subtype of this task from "Deadline" to "Task". · View Herald Transcript Oct 5 2018, 5:42 PM
Addshore lowered the priority of this task from High to Normal.Oct 8 2018, 12:28 PM

Setting this to Normal now, as the on-server caching has reduced the traffic to the individual server. It is still higher than on other servers, but only 2-3x greater rather than 4+x greater.

The on-server cache should provide steadier rates of access, and spikes in MW requests should not result in spikes in these requests to memcached.
We could increase the local cache from 10 seconds to 15 seconds and should see a decrease in cache traffic again.
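The on-server caching described above can be sketched as a small TTL-bounded local cache sitting in front of the shared cache. This is illustrative Python only (the real Wikibase code is PHP, and the class and function names here are hypothetical): each app server refetches a hot key from memcached at most once per TTL window, instead of on every request.

```python
import time

class LocalTtlCache:
    """Per-server cache with a short TTL, shielding memcached from
    repeated reads of the same hot key (illustrative sketch only)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expiry_timestamp, value)

    def get(self, key, fetch):
        """Return the locally cached value; call `fetch` (one round
        trip to the shared cache) only after the TTL has elapsed."""
        now = time.monotonic()
        hit = self.store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh locally, no memcached traffic
        value = fetch()
        self.store[key] = (now + self.ttl, value)
        return value
```

With a 10-15 second TTL, the memcached GET rate for the key becomes bounded by the number of app servers rather than by the request rate, which matches the traffic drop reported above.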

We could also think about sharding the data in some way.
We could

  • Have n keys containing the same data, and randomly choose one to load the data from when needed. This would spread the load across the cache cluster more evenly.
  • Shard based on data type. The lookups don't know what data type they want when requesting the data, so they would have to load a shard, check whether the property lies within it, and try the next one if not. This would again reduce the traffic on a single server, but would result in more requests overall to get the same data.
  • Shard based on property ID. Simply speaking, we could put the first 500 properties in one key, the next 500 in another, and so on. This would distribute the data among multiple cache servers, reducing traffic, and it would also let us know which shard to look at instead of having to guess. Instead of hard coding this / using the numeric representations of property IDs, we could also shard based on some hash function over the serialized property ID.
  • Have a key per property. I don't think this is a great idea, as it would result in a much greater number of hits; a single request might need info on any number of properties. Currently the code doesn't have a way of knowing how many properties we are going to request before we start requesting them. Of course, we could rework this.
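The property-ID sharding option can be sketched in a few lines. This is an illustrative Python sketch, not the Wikibase implementation; the shard size, key prefix, and function name are all hypothetical. The point is that the shard for a property is computable from its ID, so a lookup knows exactly which key to fetch instead of guessing.

```python
SHARD_SIZE = 500  # hypothetical: number of properties per cache key

def shard_key(property_id, prefix="CacheAwarePropertyInfoStore"):
    """Map a property ID like 'P1234' to the cache key of its shard."""
    numeric = int(property_id[1:])  # strip the leading 'P'
    shard = numeric // SHARD_SIZE   # integer division picks the bucket
    return f"{prefix}:shard:{shard}"

# shard_key("P31")  -> "CacheAwarePropertyInfoStore:shard:0"
# shard_key("P553") -> "CacheAwarePropertyInfoStore:shard:1"
```

Because memcached distributes keys across hosts by hashing the key name, distinct shard keys naturally land on different servers, spreading both the traffic and the payload size that currently concentrate on one host.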

Change 470567 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] CachingPropertyInfoLookup local server cache 10s -> 15s

https://gerrit.wikimedia.org/r/470567

Change 470567 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] CachingPropertyInfoLookup local server cache 10s -> 15s

https://gerrit.wikimedia.org/r/470567

So the on-server caching has reduced the traffic fairly dramatically, although the key is still used a lot and the data transfer is expensive.

It looks like the traffic for the cache key has halved, which should be fine for now.
I think this can leave the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞) board but will remain open on the Wikidata board, as this issue will continue to grow as the number of properties and the number of app servers increase.

Addshore renamed this task from Investigate more efficient memcached solution for CacheAwarePropertyInfoStore to Investigate more efficient memcached solution for CacheAwarePropertyInfoStore.Dec 18 2018, 2:33 PM
Addshore updated the task description. (Show Details)

Picked back up by the campsite to write an ADR for what was done and why & how we might need to change this in the future.

Change 480731 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] Add docs/adr/0002-CacheAwarePropertyInfoStore-caching.md

https://gerrit.wikimedia.org/r/480731

Addshore closed this task as Resolved.Jan 7 2019, 9:39 PM

Resolved for now, although at some point in the future we will have to revisit this

Change 480731 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add docs/adr/0002-CacheAwarePropertyInfoStore-caching.md

https://gerrit.wikimedia.org/r/480731

Addshore updated the task description. (Show Details)Jan 17 2019, 4:17 PM
elukey added a subscriber: elukey.Mar 10 2019, 6:57 PM

Quick comment: I noticed while checking some memcached metrics that wikibase_shared/1_33_0-wmf_20-wikidatawiki-hhvm:CacheAwarePropertyInfoStore generates 30~60MB/s of GET traffic since the last deployment of mediawiki:

https://grafana.wikimedia.org/d/000000317/memcache-slabs?orgId=1&from=now-7d&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached&var-instance=mc1022&var-slab=140&panelId=60&fullscreen

Confirmed also via memkeys. Given what happened in T204083, is this meant to generate this kind of traffic volume? If so, will it increase over time, stay steady, etc.? I don't see impact on the network interface or correlation with timeouts, but better to be safe and ask :)

elukey reopened this task as Open.Mar 10 2019, 6:57 PM

I found another occurrence of timeouts in mcrouter with SETs for this key; it might be due to the TTL expiring or similar. Reducing the traffic volume to memcached is surely a good thing.

Addshore renamed this task from Investigate more efficient memcached solution for CacheAwarePropertyInfoStore to Fix inefficient CacheAwarePropertyInfoStore memcached access pattern.Mar 12 2019, 12:49 PM
Addshore raised the priority of this task from Normal to High.
Addshore moved this task from Incoming to Needs Work on the Wikidata-Campsite board.
elukey added a subscriber: aaron.EditedMar 12 2019, 3:42 PM

Another big SET burst to mc1022 caused timeouts for the MW appservers. I don't mean to pressure you guys, but is there any plan to fix this?

The SET bursts should be due to a key either being evicted or expired:

elukey@mc1022:~$ cat slab_140 | grep -i wikidata
ITEM wikibase_shared/1_33_0-wmf_20-wikidatawiki-hhvm:CacheAwarePropertyInfoStore [100621 b; 1552491470 s]
ITEM wikibase_shared/1_33_0-wmf_20-wikidatawiki-hhvm:WikiPageEntityRevisionLookup:Q56701805 [99611 b; 1552430549 s]

elukey@mc1022:~$ date -d @1552491470
Wed Mar 13 15:37:50 UTC 2019          <================= TTL 24h should match with this being just re-created

elukey@mc1022:~$ date -d @1552430549
Tue Mar 12 22:42:29 UTC 2019

elukey@mc1022:~$ date
Tue Mar 12 15:39:31 UTC 2019

Now I am wondering how the key gets re-set; it might be something to tune in the usage of the MediaWiki WANCache. Adding @aaron as FYI :)

Mcrouter timeouts/TKO registered: https://grafana.wikimedia.org/d/000000549/mcrouter?panelId=9&fullscreen&orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=All&from=1552402906212&to=1552405067164

Slab 140 SET traffic: https://grafana.wikimedia.org/d/000000317/memcache-slabs?panelId=49&fullscreen&orgId=1&from=1552403014459&to=1552404751954&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached&var-instance=mc1022&var-slab=All

It differs from T203786 since the appservers (not the apis) seem to be hit. It is bad luck that this happens on the same host though, mc1022 (but of course different slabs).

@Addshore can you write a brief summary about what changed recently? I tried to read the task but I got lost in the PHP :(

Addshore removed Addshore as the assignee of this task.Mar 12 2019, 3:49 PM

Change 495928 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] Double on server cache for PropertyInfoStore

https://gerrit.wikimedia.org/r/495928

Change 495929 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@wmf/1.33.0-wmf.20] Double on server cache for PropertyInfoStore

https://gerrit.wikimedia.org/r/495929

Change 495930 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@wmf/1.33.0-wmf.21] Double on server cache for PropertyInfoStore

https://gerrit.wikimedia.org/r/495930

Change 495930 merged by Addshore:
[mediawiki/extensions/Wikibase@wmf/1.33.0-wmf.21] Double on server cache for PropertyInfoStore

https://gerrit.wikimedia.org/r/495930

Change 495929 merged by Addshore:
[mediawiki/extensions/Wikibase@wmf/1.33.0-wmf.20] Double on server cache for PropertyInfoStore

https://gerrit.wikimedia.org/r/495929

Mentioned in SAL (#wikimedia-operations) [2019-03-12T16:42:32Z] <addshore@deploy1001> Synchronized php-1.33.0-wmf.20/extensions/Wikibase/repo/includes/Store/Sql/SqlStore.php: T97368 Double on server cache for PropertyInfoStore (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2019-03-12T16:43:48Z] <addshore@deploy1001> Synchronized php-1.33.0-wmf.21/extensions/Wikibase/repo/includes/Store/Sql/SqlStore.php: T97368 Double on server cache for PropertyInfoStore (duration: 00m 55s)

Change 495941 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] Increase APC cache for PropertyInfoLookup from 15 to 20s

https://gerrit.wikimedia.org/r/495941

Change 495942 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@wmf/1.33.0-wmf.21] Increase APC cache for PropertyInfoLookup from 15 to 20s

https://gerrit.wikimedia.org/r/495942

Change 495943 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@wmf/1.33.0-wmf.20] Increase APC cache for PropertyInfoLookup from 15 to 20s

https://gerrit.wikimedia.org/r/495943

Change 495942 merged by Addshore:
[mediawiki/extensions/Wikibase@wmf/1.33.0-wmf.21] Increase APC cache for PropertyInfoLookup from 15 to 20s

https://gerrit.wikimedia.org/r/495942

Change 495943 merged by Addshore:
[mediawiki/extensions/Wikibase@wmf/1.33.0-wmf.20] Increase APC cache for PropertyInfoLookup from 15 to 20s

https://gerrit.wikimedia.org/r/495943

Mentioned in SAL (#wikimedia-operations) [2019-03-12T17:10:13Z] <addshore@deploy1001> Synchronized php-1.33.0-wmf.20/extensions/Wikibase/repo/includes/Store/Sql/SqlStore.php: T97368 Increase APC cache for PropertyInfoLookup from 15 to 20s (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2019-03-12T17:11:20Z] <addshore@deploy1001> Synchronized php-1.33.0-wmf.21/extensions/Wikibase/repo/includes/Store/Sql/SqlStore.php: T97368 Increase APC cache for PropertyInfoLookup from 15 to 20s (duration: 00m 55s)

Another big SET burst to mc1022 caused timeouts for the MW appservers. I don't mean to pressure you guys, but is there any plan to fix this?

T218115 is the first step of the plan and is now being worked on.

Per our IRC conversation it sounds like we will want to switch this to use WANObjectCache? (I'll wait for your full comment here).

For now, as the network pipe on the machine holding this key nears saturation during spikes, I increased the APC cache for the key a little bit too (see the deployments above).

To summarize what I am currently seeing:

  • after the last mw deployment/train, the key got renamed and moved to mc1022 (was before on mc1021). This can be seen in https://grafana.wikimedia.org/d/000000316/memcache?panelId=45&fullscreen&orgId=1&from=1551860994585&to=1552057232883
  • when the key expires or gets evicted from the memcached LRU, a burst of SETs hits mc1022. This matches mcrouter's timeouts, which eventually lead to the shard being marked as temporarily down for all memcached traffic and hence spikes in MediaWiki exceptions.
  • there is a constant high volume of GETs that raises mc1022's baseline a lot and gets us closer to what the 1G NIC on the host can handle. We haven't observed any sign of packets dropped due to saturation, but in theory a normal increase in api/appserver traffic could lead to that.

To solve the bursts of SETs we could use WANObjectCache (https://doc.wikimedia.org/mediawiki-core/master/php/classWANObjectCache.html) instead of using BagOStuff directly, since it offers some nice functionality to prevent, for example, all the api/appservers from trying to SET a key when it is missing. We did the same with the MediaWiki Translate extension in T203786, solving a lot of problems.
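The stampede-protection idea behind WANObjectCache can be modelled with a toy sketch. This is illustrative Python, not the actual WANObjectCache PHP API; the class name and structure are hypothetical. When a value expires, only one caller wins the regeneration lock and issues a SET, while the others keep serving the stale copy instead of piling on.

```python
import threading
import time

class StampedeProtectedCache:
    """Toy model of stampede protection: on expiry, a single caller
    regenerates the value while concurrent callers serve the stale
    copy, turning a burst of SETs into one SET. Illustrative only."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.value = None
        self.expiry = 0.0
        self.regen_lock = threading.Lock()

    def get(self, regenerate):
        now = time.monotonic()
        if now < self.expiry:
            return self.value  # fresh hit, no regeneration
        # Expired: try to become the single regenerator.
        if self.regen_lock.acquire(blocking=False):
            try:
                self.value = regenerate()  # one SET instead of a burst
                self.expiry = time.monotonic() + self.ttl
            finally:
                self.regen_lock.release()
        # Callers that lose the race return the stale value immediately.
        return self.value
```

This is the property that matters for mc1022 here: the SET rate for the key becomes roughly one per TTL window, regardless of how many appservers observe the expiry at the same moment.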

Change 495928 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Double on server cache for PropertyInfoStore

https://gerrit.wikimedia.org/r/495928

Change 495941 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Increase APC cache for PropertyInfoLookup from 15 to 20s

https://gerrit.wikimedia.org/r/495941

We should see changes in 1.33.0-wmf.24.
It looks like the train didn't make it past group0 this week, so no change really spotted yet.

elukey added a comment.Apr 9 2019, 6:11 AM

The train was deployed yesterday, I can see an awesome improvement (green bar dropping):

Thanks a lot for all the work!

Addshore closed this task as Resolved.Apr 9 2019, 9:31 AM
Addshore claimed this task.

Amazing, I think we can resolve this then!

I just flicked back to https://grafana.wikimedia.org/d/000000574/t204083-investigation today and noticed that as of 1.34.wmf.3 a pattern similar to what we were seeing before is back?
https://tools.wmflabs.org/sal/log/AWp6A7RAEHTBTPG-zSOo

Not sure if this has been spotted anywhere else, ping @elukey, maybe you can shed some light on this?

From a quick look with memkeys on mc1033, these are the top-talking keys:

memcache key                                                                                                                        calls   objsize   req/sec  bw(kbps)
WANCache:v:global:CacheAwarePropertyInfoStore:wikidatawiki:P248                                                                    401463        92   8921.40   6566.15
WANCache:v:global:CacheAwarePropertyInfoStore:wikidatawiki:P2093                                                                   338923        85   7531.62   5121.50
WANCache:v:global:CacheAwarePropertyInfoStore:wikidatawiki:P4656                                                                    36029        82    800.64    525.22
WANCache:v:global:CacheAwarePropertyInfoStore:wikidatawiki:P214                                                                     35361       150    785.80    942.96

Is it something that was changed recently?

daniel removed a subscriber: daniel.May 14 2019, 9:41 AM
Addshore added a comment.EditedMay 14 2019, 5:11 PM

That all looks fine.
I added a new panel to https://grafana.wikimedia.org/d/000000574/t204083-investigation?orgId=1 showing the stats for WANObjectCache and these keys.

Those metrics show no real increase in requests for the keys for the spike in traffic to mc1033, perhaps it is something else?

I won't comment any more here until we have a new fresh clean ticket tracking the apparently new issue.