Unable to store parser output in StashEdit (Memcached error: ITEM TOO BIG)
Closed, Resolved · Public · PRODUCTION ERROR

Description

Error

Request ID: W6EtzQrAEG4AADcpupoAAADE

message
Memcached error for key "{memcached-key}" on server "{memcached-server}": ITEM TOO BIG

Failed to cache parser output for key '{cachekey}' ('{title}').

Sample

channel: memcached, level: ERROR

> Memcached error for key "nlwiki:prepared-edit:9b3e941f53710ae6070ea3e0047d1741:31bf20ff0cd7eafb4b2bca0f76e9b7ef3a6decfe:795f570f7c680526ee619d5ae5e99feb" on server "127.0.0.1:11213": ITEM TOO BIG


channel: StashEdit, level: ERROR

> Failed to cache parser output for key 'nlwiki:prepared-edit:9b3e941f53710ae6070ea3e0047d1741:31bf20ff0cd7eafb4b2bca0f76e9b7ef3a6decfe:795f570f7c680526ee619d5ae5e99feb' ('Lijst van Radio 2-Top 2000's').

Notes

This is a regression that started around 12 September. Each of the two errors above has occurred about 3,500 times since then, whereas similar errors occurred fewer than 130 times in the 23 days before that.

I suspect something was changed or deployed around that time that made the ParserOutput object significantly bigger, causing it to exceed the threshold of what we can store in Memcached.

  • Impact: Slower edits - the stash is unable to function, requiring the edit to be parsed a second time between auto-stash and save.
  • Impact: Slower views - parser output may need to come from the sql-parsercache fallback instead of memcached.

Event Timeline

Krinkle triaged this task as Unbreak Now! priority. Sep 18 2018, 5:22 PM

I don't see anything recent in https://phabricator.wikimedia.org/source/mediawiki/history/master/includes/parser/ParserOutput.php that strikes me as relevant. I added the $mWrapperDivClasses field, which should be tiny. The same patch changed the parser to emit slightly less HTML. If anything, that should have made ParserOutput objects smaller.

Did we also fail to cache during saving, or only when stashing?

Adding $mWrapperDivClasses did break php serialization when reading from the parser cache in some cases. Maybe this is somehow related?

To find out what is getting so big, I suggest adding something like the following to includes/api/ApiStashEdit.php on a debug host:

file_put_contents( "/tmp/T204742.stashInfo.$key.txt", print_r( $stashInfo, true ) );

This should go right before the line that logs the "Failed to cache parser output" message; in wmf.20 that's line 232. I couldn't figure out how to patch that in myself, sorry.

Anyway, once this is in place, it should be possible to reproduce the error on a debug box by visiting https://en.wikibooks.org/w/index.php?title=Mirad_Version_2&oldid=3466476&action=edit and triggering stashing. When the API responds with something like {"stashedit":{"status":"error_cache","texthash":"0256b8fb47c43bbb3df3a9efe6178e5efb16c693"}}, there should now be a file in the /tmp/ folder that contains the data that could not be stashed for some reason.

Should the output of print_r not be sufficient to spot the issue, I suggest trying var_export and serialize.

I note that in ParserCache, we have the line $this->mMemc->set( $parserOutputKey, $parserOutput, $expire ); with no check of the return value. Perhaps we should also log when that returns false, to see if we also fail to write to the ParserCache, not just the edit stash.
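A minimal sketch of what that could look like, assuming it runs inside ParserCache::save() and that wfDebugLog is acceptable for the logging (the channel name and message here are made up for illustration):

// Check the return value of the cache write and log when memcached
// rejects the value (e.g. ITEM TOO BIG), instead of failing silently.
$ok = $this->mMemc->set( $parserOutputKey, $parserOutput, $expire );
if ( !$ok ) {
	wfDebugLog( 'ParserCache', "Failed to store parser output for key '$parserOutputKey'." );
}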

Ok, I recovered a dump of the data that is failing to be stashed using the method above. It's 4379975 bytes of print_r output, and 4298008 bytes serialized. What's the memcached limit? 4MB doesn't seem too terrible...

Turns out, these files are too large for Phabricator... I could upload them somewhere later if need be, but at a glance, there's nothing wrong with them.

I see a bunch of ITEM TOO BIG errors from the blob store cache, too. If big pages don't fit into the edit stash, nor the parser cache, nor the blob store cache, they'll be really slow...

Krinkle lowered the priority of this task from Unbreak Now! to High. Sep 20 2018, 4:52 PM
Krinkle moved this task from Apr 2019 / 1.33.wmf.25+ to Older on the Wikimedia-production-error board.

For the record, a log line showing the equivalent issue for the parser cache:
Memcached error for key "plwiki:pcache:idhash:4336033-0!canonical" on server "/var/run/nutcracker/nutcracker.sock:0": ITEM TOO BIG

Lowering prio to "high", since this seems to be simply a matter of large pages hitting internal limits, rather than being caused by some bug.

Recommendation: increase memcached limits.

Found this task when reviewing memcached errors on logstash for an unrelated issue. The memcached limit is 1MB, which is the maximum slab class size that can be allocated in the version we are running (the Debian Jessie one). There is a task to upgrade memcached to a new version (T213089) but it will require time and effort. Recent versions of memcached no longer impose a fixed/hardcoded limit, but from what I've read in various user reports it is a feature that needs to be used very wisely.

Note: the last comment mentions nutcracker, but we should only be using mcrouter nowadays. I can still see errors related to nutcracker from mediawiki in logstash though; T214275 has been opened to investigate.

mcrouter offers the ability to split values above a certain size threshold into multiple pieces/keys that are reassembled when a GET is issued. Not sure how reliable the functionality is, just adding notes/ideas.

Is there any chance to figure out other resolution paths, rather than increasing memcached limits?

Reporting some thoughts after a chat with @Krinkle during all hands. It would be good to instruct MediaWiki to avoid logging an exception when the stash edit is bigger than the memcached limit, to avoid adding noise to the exception logs (if possible). This would help us not get derailed during memcached error reviews/investigations. It would be worthwhile even if we moved the slab page limit to 4MB (currently fixed at 1MB, but newer versions of memcached allow it).

I'd like the limit to be bumped to around $wgMaxArticleSize after upgrade (2MB).

Following the procedure at https://github.com/facebook/mcrouter/issues/26 , I'd be OK with experimenting with --big-value-split-threshold . It would only be enabled for things that would fail otherwise anyway. It still puts everything in one server. One thing I don't see is a way to enforce, for sanity, the size limit once you do that.
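For reference, the split threshold is a startup flag on the mcrouter process, so the experiment would amount to something along these lines (the config path, port, and 1 MB value are illustrative):

mcrouter --config file:/etc/mcrouter/config.json -p 11213 --big-value-split-threshold=1048576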

Change 493775 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Make ApiStashEdit use a separate key for the parser output due to size

https://gerrit.wikimedia.org/r/493775

Change 493775 merged by jenkins-bot:
[mediawiki/core@master] Make ApiStashEdit use a separate key for the parser output due to size

https://gerrit.wikimedia.org/r/493775

> I'd like the limit to be bumped to around $wgMaxArticleSize after upgrade (2MB).
>
> Following the procedure at https://github.com/facebook/mcrouter/issues/26 , I'd be OK with experimenting with --big-value-split-threshold . It would only be enabled for things that would fail otherwise anyway. It still puts everything in one server. One thing I don't see is a way to enforce, for sanity, the size limit once you do that.

This is something that we need to test and discuss; I am not super confident about having mcrouter do this kind of mangling of the keys for us. For example, I am afraid of getting into a state in which checking keys on the mc10XX shards becomes impossible due to this fragmentation, but I might be too pessimistic :)
EDIT: I only got now (re-reading the comment) that you'd like to experiment with it only for keys with payload > 1M; that could be a good test for deployment-prep. Lemme know if you want some help testing!

If I got it correctly, this change should break the prepared-edit key into smaller pieces, right @aaron?

Two pieces, one of them still huge. I'm working on a generic segmentation wrapper for BagOStuff atm, but that will take longer to do (lots of ways to arrange the classes, hard to choose from).
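Roughly, the write side of such a wrapper could look like the sketch below. This is a simplified illustration, not the actual patch; the helper name, key scheme, and 512 KiB chunk size are made up:

// Store a large value as N chunks plus a small index entry listing the
// chunk keys; a matching read path would fetch the index first, then
// reassemble and unserialize the chunks.
function setSegmented( BagOStuff $cache, $key, $value, $ttl, $chunkSize = 524288 ) {
	$blob = serialize( $value );
	$chunkKeys = [];
	foreach ( str_split( $blob, $chunkSize ) as $i => $chunk ) {
		$chunkKey = "$key:segment:$i";
		$cache->set( $chunkKey, $chunk, $ttl );
		$chunkKeys[] = $chunkKey;
	}
	return $cache->set( $key, [ 'segments' => $chunkKeys ], $ttl );
}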

> Two pieces, one of them still huge. I'm working on a generic segmentation wrapper for BagOStuff atm, but that will take longer to do (lots of ways to arrange the classes, hard to choose from).

That would presumably involve defining a set blob size on the MediaWiki side, right? (Whereas this is configurable on the MC side).

Ultimately, this issue would need one of the following solutions:

  1. Tell MW the blob size, and split values up into chunks smaller than that.
  2. Tell MW the blob size, and don't use the MC tier for those values in the multi-write/multi-tier BagOStuff; use only the SQL layer that we have for this already (see the sketch after this list).
  3. Don't tell MW the blob size, and ensure that the "TOO BIG" error is not logged to Logstash (or only with debug/info severity). Gracefully and automatically degrade to the same behaviour as solution 2 (SQL only). Downside is a wasted roundtrip on SET. Upside is that it is dynamically growable, configurable, and more fault-tolerant to future changes.
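For illustration, option 2 could look roughly like this inside a multi-tier write path; the 1 MiB constant, the function, and the tier parameters are assumptions for the sketch, not existing MediaWiki code:

// Sketch of option 2: skip the memcached tier for oversized blobs and
// rely only on the SQL-backed tier, which has no such size limit.
const ASSUMED_MEMCACHED_LIMIT = 1048576; // 1 MiB, assumed

function storeParserOutput( BagOStuff $memc, BagOStuff $sql, $key, $output, $ttl ) {
	if ( strlen( serialize( $output ) ) <= ASSUMED_MEMCACHED_LIMIT ) {
		$memc->set( $key, $output, $ttl );
	}
	$sql->set( $key, $output, $ttl );
}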

I'm torn between two and three, but I'm a bit pessimistic about one - is the added complexity and maintenance cost of splitting up values in a generic way with MC worth the benefit compared to an ES roundtrip? How common are these in production? It would appear that the issue does not happen very frequently on the whole. On the other hand, thinking about it in the abstract, I would assume most popular articles have their ParserOutput object serialisation larger than 1 MB, but... maybe not.

Which SQL layer? We have SQLBagOStuff using pcxxxx servers, which I suppose could be co-opted for this. As for ES, I wouldn't want to spam a bunch of one-off blobs in there since it is meant to be append-only. Depending on how much code it is I don't mind a bit of complexity. I'm still experimenting around with different ways to do it, but I don't think it has to be that complex.

It's also useful as a tunable option for spreading I/O across cache nodes, rather than only being used to get around hard limits.

Sorry about the confusion. I mixed up parser cache (which caches ParserOutput objects with Memcached+SQL via multi-backend BagOStuff), and edit stash (which also caches ParserOutput objects, but only temporary ones, not yet associated with a rev ID).

The edit stash is naturally not persisted and also not stored anywhere else. My idea was we could forego the "extra" tier of Memcached (over SQL) for blobs that are too large. But there isn't any SQL tier for edit stash, and I entirely agree that it doesn't make sense to add that, either (I wasn't proposing that).

Carry on :)

Change 495321 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] [WIP] objectcache: add object segmentation support to BagOStuff

https://gerrit.wikimedia.org/r/495321

Change 495321 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] objectcache: add object segmentation support to BagOStuff

https://gerrit.wikimedia.org/r/495321

Change 495321 merged by jenkins-bot:
[mediawiki/core@master] objectcache: add object segmentation support to BagOStuff

https://gerrit.wikimedia.org/r/495321

Change 517679 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@master] parsercache: use WRITE_ALLOW_SEGMENTS for cached ParserOutput values

https://gerrit.wikimedia.org/r/517679

Change 517679 merged by Krinkle:
[mediawiki/core@master] parsercache: use WRITE_ALLOW_SEGMENTS for cached ParserOutput values

https://gerrit.wikimedia.org/r/517679
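For context, the flag is passed on writes to a BagOStuff that supports segmentation, roughly like this (an illustrative call, not the exact line from the patch above):

$cache->set( $parserOutputKey, $parserOutput, $expiry, BagOStuff::WRITE_ALLOW_SEGMENTS );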

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:09 PM

Change 858437 had a related patch set uploaded (by Krinkle; author: Aaron Schulz):

[mediawiki/core@master] PageEditStash: Serialize ad-hoc to restore WRITE_ALLOW_SEGMENTS

https://gerrit.wikimedia.org/r/858437

Change 858437 merged by jenkins-bot:

[mediawiki/core@master] PageEditStash: Serialize ad-hoc to restore WRITE_ALLOW_SEGMENTS

https://gerrit.wikimedia.org/r/858437