Page MenuHomePhabricator

WANCache:commonswiki:pagetranslation:sourcepages key is generating a lot of traffic
Closed, ResolvedPublic4 Estimated Story PointsSecurity

Assigned To
Authored By
jijiki
Jun 3 2024, 9:05 AM
Referenced Files
F57120958: image.png
Aug 6 2024, 2:19 PM
F56821391: image.png
Jul 31 2024, 2:35 PM
F56821334: image.png
Jul 31 2024, 2:35 PM
F56672329: image.png
Jul 25 2024, 10:56 AM
F56646390: image.png
Jul 24 2024, 2:59 PM
F56201895: T366455-fix.patch
Jul 3 2024, 2:21 PM
F54897725: image.png
Jun 3 2024, 9:05 AM
F54897072: image.png
Jun 3 2024, 9:05 AM

Description

I have noticed that WANCache:commonswiki:pagetranslation:sourcepages key has been our top key in traffic for a few weeks now. I am wondering if this temporary, and if it is not, if there is anyway we can shard this key or find another way to mitigate this problem.

image.png (932×257 px, 95 KB)

image.png (952×277 px, 95 KB)

https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&viewPanel=58

https://grafana.wikimedia.org/d/000000316/memcache

I added the security tag as this is an issue that has the potential of causing latency problems if exploited.

Details

Author Affiliation
WMF Technology
Related Changes in Gerrit:

Event Timeline

jijiki triaged this task as High priority.

I've helped Effie identify the extension and maintainer. Untagging our team as this isn't in scope of any components we own/maintain.

@Language-Team: As I understand it, the issue here is not with call frequency but with response size. We have far "hotter" keys, but the multiplication of value-size * call-rate is sufficiently high that it saturates the individual memc host that this key maps to. The recommendation is thus to evaluate if the data could be split over multiple separate keys. Even if they are fetched together unconditionally, this would help because they'd be routed to separate hosts. If the code path is very latency sensitive, you can use getMulti to fetch these together, but 2-3 calls serially should generally be fine.

You can get additional visiblity from MW-BagOStuff and WANObjectCache at https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group.

Note: I might be the only one in Language team with security task access, and given maintainers were not subscribed for this task, this could have gone unnoticed for a long time.

Code is in \MediaWiki\Extension\Translate\PageTranslation\TranslatablePage::isSourcePage. Value stored like looks something like:

$a = [ 234324 => true, 234234 => true ];
echo serialize( $a );
> a:2:{i:234324;b:1;i:234234;b:1;}

Might be even worse if the keys are strings.

Some optimizations in order of expected benefits (smallest to largest):

  • Store as a list of integers
  • Store as a string of [choose your favorite separator]-separated numbers
  • Store as vector of bigints (binary, not sure if that is allowed)
  • Hard-coded sharding (e.g always split into 4 values)

Code is in \MediaWiki\Extension\Translate\PageTranslation\TranslatablePage::isSourcePage. Value stored like looks something like:

$a = [ 234324 => true, 234234 => true ];
echo serialize( $a );
> a:2:{i:234324;b:1;i:234234;b:1;}

Might be even worse if the keys are strings.

Some optimizations in order of expected benefits (smallest to largest):

  • Store as a list of integers
  • Store as a string of [choose your favorite separator]-separated numbers
  • Store as vector of bigints (binary, not sure if that is allowed)
  • Hard-coded sharding (e.g always split into 4 values)

From my end, even splitting this traffic between 2-4 memcached servers (aka sharding it more), would be just great.
I think we should sort this soon, given that in the event of abnormal load (eg heavy traffic, other keys hosted on that memcached server become very popular), it has the potential of causing a lot of additional latency, as it may saturate the shard's network.

Decided to store as comma separated string so that it could be segmented.

WANObjectCache: Store source page ids as string and set as segmentable
                                                                      
Storing the message bundle and translatable page ids as comma         
separated strings in the cache. Also adding a trailing comma in       
order to make contains search simpler.                                
                                                                      
Update cache version and also mark the cache as segmentable.          
                                                                      
Bug: T366455                                                          
Change-Id: I1d5af304bdc1d3313c555e14a187aa722cd81723

Please review.

Decided to store as comma separated string so that it could be segmented.

WANObjectCache: Store source page ids as string and set as segmentable
                                                                      
Storing the message bundle and translatable page ids as comma         
separated strings in the cache. Also adding a trailing comma in       
order to make contains search simpler.                                
                                                                      
Update cache version and also mark the cache as segmentable.          
                                                                      
Bug: T366455                                                          
Change-Id: I1d5af304bdc1d3313c555e14a187aa722cd81723

Please review.

It would be great if you could please add a URL to point to the change, thank you!

Decided to store as comma separated string so that it could be segmented.

WANObjectCache: Store source page ids as string and set as segmentable
                                                                      
Storing the message bundle and translatable page ids as comma         
separated strings in the cache. Also adding a trailing comma in       
order to make contains search simpler.                                
                                                                      
Update cache version and also mark the cache as segmentable.          
                                                                      
Bug: T366455                                                          
Change-Id: I1d5af304bdc1d3313c555e14a187aa722cd81723

Please review.

It would be great if you could please add a URL to point to the change, thank you!

Since this is a sensitive issue with the security tag set, I figured that we'd apply the patch on production servers before applying it via Gerrit. But let me know if I can submit it directly via Gerrit.

Since this is a sensitive issue with the security tag set, I figured that we'd apply the patch on production servers before applying it via Gerrit. But let me know if I can submit it directly via Gerrit.

Thank you very much for this, I think it is alright to have it in gerrit. Regarding the change itself:

While this would work if the key was large (over 500k), this is not the case here I am afraid. Our problems start with how often this key is fetched by mediawiki, which in combination of both value size+traffic.

That being said, echoing what @Nikerabbit said earlier, we should either hard code shard this into 4 different keys, eg WANCache:commonswiki:pagetranslation:sourcepages-1, WANCache:commonswiki:pagetranslation:sourcepages-2 etc, or investigate what other options we have (eg apcu?)

I am happy to discuss this on IRC or in a meeting too if that will help!

I submitted: 1053611: WANObjectCache: Store source page ids as string and set as segmentable | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1053611

My thoughts:

  • Using string reduces the size of the cache on commons from 35,653 to 21,548 serialized characters
  • I also introduced a static variable to act like a cache so atleast during a single process, we'll not be hitting the cache repeatedly.

This will reduce the size of the value in the cache, and will also reduce the number of times we hit the cache. It may not get rid of the problem completely but should help.

I'd like to think about the potential implications of using separate keys. We'd have to store N keys in a single cache, and then use a new cache key. We'll have to resort to using WANObjectCache::set instead of WANObjectCache::getWithSetCallback which I'm not keen on.

abi_ set the point value for this task to 4.

I submitted: 1053611: WANObjectCache: Store source page ids as string and set as segmentable | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1053611

My thoughts:

  • Using string reduces the size of the cache on commons from 35,653 to 21,548 serialized characters
  • I also introduced a static variable to act like a cache so atleast during a single process, we'll not be hitting the cache repeatedly.

This will reduce the size of the value in the cache, and will also reduce the number of times we hit the cache. It may not get rid of the problem completely but should help.

I'd like to think about the potential implications of using separate keys. We'd have to store N keys in a single cache, and then use a new cache key. We'll have to resort to using WANObjectCache::set instead of WANObjectCache::getWithSetCallback which I'm not keen on.

Thank you! If this change makes the next train, I will post here how things have worked out

I abandoned the previous patch on Timo's suggestions and chose a different approach using WANObjectCache::getMultiWithSetCallback. The patch is here: 1054515: TranslatablePage: Split translatable page id cache into multiple shards | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1054515

The cache will now be divided into 3 different caches. To refresh the cache, we're now purging the values instead of marking them as stale.

To sum up, the translate extension uses the built-in utilities to communicate with memcached, such as BagOStoff and WANCache.

Many thanks to @Krinkle for showing us around a bit to how those things work, @abi_'s patch looks ready for merging

Unfortunately the backports (wmf.14, wmf.15) had to be reverted (wmf.14, wmf.15, master) because memcached started behaving much worse than before – e.g. TX bandwidth usage shot up:

image.png (1,888×786 px, 221 KB)

(@jijiki can probably explain the problems a lot better.)

From looking at the patch, my best guess would be that this is not directly because of the sharding (the patch only introduced three shards – two of which actually seemed to end up on the same host, presumably by coincidence – but some metrics went up by much more than a factor of three), but rather because of the changed cache (in)validation mechanism, removing checkKeys and instead using $cache->delete() in clearSourcePageCache(). But I don’t really understand WANObjectCache well enough to go further from there. (I’ve uploaded an alternative checkKeys version, but I have no idea if it even makes sense.)

Unfortunately the backports (wmf.14, wmf.15) had to be reverted (wmf.14, wmf.15, master) because memcached started behaving much worse than before – e.g. TX bandwidth usage shot up:

image.png (1,888×786 px, 221 KB)

(@jijiki can probably explain the problems a lot better.)

From looking at the patch, my best guess would be that this is not directly because of the sharding (the patch only introduced three shards – two of which actually seemed to end up on the same host, presumably by coincidence – but some metrics went up by much more than a factor of three), but rather because of the changed cache (in)validation mechanism, removing checkKeys and instead using $cache->delete() in clearSourcePageCache(). But I don’t really understand WANObjectCache well enough to go further from there. (I’ve uploaded an alternative checkKeys version, but I have no idea if it even makes sense.)

Thinking about this some more, this could be the reason for the unexpected behavior. The cache is cleared on various events:

  1. Translatable page is moved
  2. Translatable page is unmarked
  3. Page is marked for translation
  4. Page has <translate> tag added

So purging the cache instead of marking it as stale could lead to a stampede-like situation.

Unfortunately the backports (wmf.14, wmf.15) had to be reverted (wmf.14, wmf.15, master) because memcached started behaving much worse than before – e.g. TX bandwidth usage shot up:

image.png (1,888×786 px, 221 KB)

(@jijiki can probably explain the problems a lot better.)

Looking at the graphs, there is not really much more to add rather than, we managed to triple the traffic. Just to add the following graphs

https://grafana.wikimedia.org/goto/LiDh0yXSR?orgId=1

image.png (2,970×1,324 px, 278 KB)

We seem to making also 3 times the requests, so, there is something in the code making more calls than it should?

@abi_ in the previous version, you had mentioned that key contents are stored in a var, so to avoid many requests to memcached.

Lastly, as we have mentioned this in another discussion, it is worth exploring the possibility of using APCu here, which would speed things up and reduce the number or requests to the memcached clusters

The next attempt also had to be reverted. (Just mentioning it since this task has no gerritbot / stashbot comments because security.)

Forgive me if this is a stupid question, but why are we putting all the translatable page IDs in a single cached value at all? If TranslatablePage::isSourcePage() is called with a single $page, could we put that page’s ID in the cache key and (in the “set” callback) only query for this page ID as well? (The cache was apparently added in Reduce query bloat, back in 2010, and as far as I can tell the previous state was to have no caching at all. That doesn’t mean that per-page caching was never tried, but if it was, I couldn’t find a record of it.)

[mediawiki/extensions/Translate] WANObjectCache: Store source page ids as string and set as segmentable

We tried out the getMulti approach but it caused large spikes in traffic. I think we can use this as a temp solution until we figure out the issues/mistakes with that approach. See T366455#10014164

I'm not sure how segmentable would help. For two reasons:

  1. Segmentation is a feature for splitting up values that are larger than can fit into a single memcached blob. This is something you enable if today your cache keys are experiencing a high "miss" rate, paired with "TOOBIG" warning in the logs. In other words, the value is ignored by Memcached and never stored. The threshold that segmentable => true checks for is, afaik, well above the size of your actual values. Thus this would effectively a be no-op. I have not verified this for Translate specifically. Feel free to verify this locally if you like. If you see in the debug log locally that your patch results in several separate keys being fetched from this function, let me know! Perhaps I've missed or forgotten something.
  1. Segmentation works by splitting a value that is too large, into N chunks that are each of a permissable size. Again, in your case, I believe this would do nothing, since the value smaller fits in a single chunk. But let's say I'm wrong and segmentation would be triggered. Here is how it works: We split the value, store each chunk in separate keys like <key>-chunk1, <key>-chunk2, until we've stored the entire value, and then under the real key, we store a special placeholder value that instructs BagOStuff to read and combine these other named keys. This means that the request rate for your actual key will remain identical.

This is why I suggested getMulti. Not because of a preference in coding style, but because it means you have separate top-level keys (and you know staticially in code how many there are and what each chunk's key is), which means our deterministic hash will shard them to separate memcached hosts, and thus the multiple of req rate * value size will be 1/N smaller, and spread over multiple hosts.

The fact that the patch caused the request rate on a multiple servers to go very far up, suggests something in our thinking was wrong. I don't know yet what, but that might be worth looking into it.

Alternatively, given the rate is so high, perhaps as Effie suggested, it makes sense to consider a more server-local cache. The reason this task exists and why there are no other tasks like it, is that when data is in such high demand, there is usually a correlation with a weakening in guruantees. Right now the Translate extension is guruanteeing that it can purge/invalidate the value in a way that is immediately reflected in the local DC. Is that really required, or could it be tolerated that the value is stale by e.g. 10 seconds? I suspect that would be fine, in which case you can store it in APCU via MediaWiki->getLocalServerCache first, and only use WANObjectCache as the second layer. You would purge or touch only the WAN cache.

srvCache =services->getLocalServerCache

srvCache->getWithSet(key, 10, () =>
  wanCache->getWithSet(key, , () =>
    $this->loadThing();
  )
)

The reason I suspect this is fine is because in a environment as big as WMF, these guruantees likely already don't hold up because:

  • On cache miss, you load from a DB replica. We tolerate DB replication lag for allowed upto 6 seconds.
  • WANCache broadcasts is expected to purge secondary DCs within ~10 seconds.
  • For code that needs the latest data, e.g. during a POST request that makes an edit, should bypass the cache, and/or verify in some way that it is up-to-date.

I'm not sure how segmentable would help.

I will get rid of that. Effie mentioned that we need to deal with soon. The idea with that patch is to reduce the size of the cache, to have a short respite from the high bandwidth usage while we work on a more long term solution using either WANObjectCache::getMulti or MediaWikiServices::getLocalServerObjectCache

I'd like to try the other approaches in subsequent patches.

If WANObjectCache: Store source page ids as string can be used as a really temporary bandaid, we could consider having a go indeed.

However, echoing what @Krinkle said, if APCu can be used here, it would permanently solve this issue. @abi_ how soon can you start working on a long term solution?

I see three long term approaches that have been discussed here:

  1. Store a cache entry per translatable page: This would result in a lot of cache entries. Problem is if we are missing an entry, we don't know if its because its been deleted or its not a translatable page or is the cache outdated.
  2. Using getMulti to spread the cache into smaller entries: While the approach makes sense, we're seeing large spikes in traffic when we deployed this solution on production
  3. Use a local cache with 10-20 second expiry, and when the value is stale hit the WAN cache to fetch the value.

For now I'll try the 3rd approach as a long term solution.

If WANObjectCache: Store source page ids as string can be used as a really temporary bandaid, we could consider having a go indeed.

However, echoing what @Krinkle said, if APCu can be used here, it would permanently solve this issue. @abi_ how soon can you start working on a long term solution?

I'll get started on it today or tomorrow. Should have a patch for review soon after that.

If WANObjectCache: Store source page ids as string can be used as a really temporary bandaid, we could consider having a go indeed.

However, echoing what @Krinkle said, if APCu can be used here, it would permanently solve this issue. @abi_ how soon can you start working on a long term solution?

I'll get started on it today or tomorrow. Should have a patch for review soon after that.

Two patches for review:

  1. 1053611: TranslatablePage: Store source page ids as string in WAN cache | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1053611 - Low risk, expected to offer some respite.
  2. 1058204: TranslatablePage: Use local cache to reduce calls to the WAN cache | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1058204 - Builds on top of previous patch. Uses the approach of storing the value in the Local cache for a short duration. Bit riskier but should help reduce traffic further.

TranslatablePage::isSourcePage is a heavily used method, and caching issues here might cause subtle problems that will be difficult to identify immediately.

We deployed the following patch: 1053611: TranslatablePage: Store source page ids as string in WAN cache | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1053611

Doesn't appear to have much of an impact on TX bandwidth usage.

image.png (1,917×799 px, 521 KB)

We deployed at the highlighted point in the image above.

The total hit good latency for the key has dropped:

image.png (1,338×304 px, 20 KB)

Lets keep monitoring to see if it has helped in anyway but atleast it doesn't seem to have caused any spikes

In the meantime I have submitted a patch to use the local cache that could use a review: 1058204: TranslatablePage: Use local cache to reduce calls to the WAN cache | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1058204

We deployed the following patch: 1053611: TranslatablePage: Store source page ids as string in WAN cache | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1053611

Lets keep monitoring to see if it has helped in anyway but atleast it doesn't seem to have caused any spikes

In the meantime I have submitted a patch to use the local cache that could use a review: 1058204: TranslatablePage: Use local cache to reduce calls to the WAN cache | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1058204

Indeed it looks slightly better than before, thanks for this! I trust that real impact will be visible after 1058204: TranslatablePage: Use local cache to reduce calls to the WAN cache. Let me know how I can help!

Post deployment of: 1058204: TranslatablePage: Use local cache to reduce calls to the WAN cache the load on memcache seems to have eased up quite a bit:

image.png (2,513×1,104 px, 530 KB)

See: https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&viewPanel=58&from=now-1h&to=now

I'm planning to tweak the $ttl for the local cache to 5-6 seconds in order to reduce the amount of time we see the stale data.

@abi_ this is grand!

Looking at the following graphs:

@Krinkle you think we may close it? Anything standing out that we are not seeing?

jijiki lowered the priority of this task from High to Medium.Aug 7 2024, 9:09 AM
jijiki removed a project: Security.
Reedy changed the visibility from "Custom Policy" to "Public (No Login Required)".Aug 8 2024, 3:19 PM
Reedy changed the edit policy from "Custom Policy" to "All Users".

Thanks for your help Timo and Effie. Marking this as resolved.