
Ensure that stashing backend for the VE API has sufficient capacity
Closed, Resolved (Public)

Description

When switching the VisualEditor API to DirectParsoidClient, we will start using the stashing backend configured in the ParsoidCacheConfig setting under the StashType key, which defaults to the backend configured by the MainStash setting.

This is currently configured to be:

$wgObjectCaches['db-mainstash'] = [
	'class' => 'SqlBagOStuff',
	'cluster' => 'extension2',
	'dbDomain' => 'mainstash',
	'globalKeyLbDomain' => 'mainstash',
	'tableName' => 'objectstash',
	'multiPrimaryMode' => true,
	'purgePeriod' => 100,
	'purgeLimit' => 1000,
	'reportDupes' => false
];

It is not clear that this backend has sufficient capacity for handling stashing for VE edits across all sites. We may want to configure a similar backend on the cluster that holds the ParserCache.
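To illustrate what such a backend could look like, here is a rough sketch modelled on the config above; the 'parsercache' cluster name, the 'db-parsoid-stash' backend name and the table name are placeholders, not an actual configuration:

// Hypothetical sketch only: a dedicated stash backend on the cluster that
// hosts the ParserCache tables. Names below are placeholders.
$wgObjectCaches['db-parsoid-stash'] = [
	'class' => 'SqlBagOStuff',
	'cluster' => 'parsercache',        // placeholder: whichever cluster hosts the ParserCache
	'dbDomain' => 'parsoidstash',      // placeholder domain, analogous to 'mainstash' above
	'globalKeyLbDomain' => 'parsoidstash',
	'tableName' => 'parsoid_stash',    // placeholder table name
	'multiPrimaryMode' => true,
	'purgePeriod' => 100,
	'purgeLimit' => 1000,
	'reportDupes' => false
];

// Point the VE/Parsoid stash at it instead of the MainStash default:
$wgParsoidCacheConfig['StashType'] = 'db-parsoid-stash';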

Based on metrics from the stash backend used by RESTBase, we estimate a need for about 140GB of storage capacity for the VE edit stash. This is based on a measured rate of about 100 writes per second and a 24-hour TTL, with an average of 20KB per entry.
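As a back-of-envelope sanity check (illustrative arithmetic only, not part of the original estimate), multiplying write rate, TTL, and entry size gives a number of the same order:

// Rough steady-state size assuming every entry lives for the full TTL
// and nothing is cleaned up early (same assumptions as the estimate above).
$writesPerSecond = 100;
$ttlSeconds = 24 * 3600;        // 24-hour TTL
$bytesPerEntry = 20 * 1024;     // ~20KB per entry
$bytes = $writesPerSecond * $ttlSeconds * $bytesPerEntry;
echo round( $bytes / 1024 ** 3 ) . " GiB\n";   // ~165 GiB, the same order as the ~140GB figure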

NOTE: this cache is write-heavy, we see about 10x more writes than reads!

See also {T309016: Determine storage requirements for stashing parsoid output for VE edits}

Event Timeline

Krinkle updated the task description.

Hey @lbowmaker do you have any timelines for this task? This would help us prioritize our backlog, thanks!


We'll need to start using the stashing function on some low-traffic wikis and put the appropriate metrics in place. Then we can estimate the capacity required for all wikis. Since we have to take the roll-out slowly and need a while to gather sufficient data, I'd expect this to take about four weeks.

x2 was 0.5GB before this; dumping 140GB on it could cause all sorts of issues, replication problems among them. Here are some ideas:

  • Shorten the TTL: 24 hours is way too long for the edit stash; no one is going to take 24 hours to make an edit.
  • Compress the entry before storage; that'd cut the size roughly in half.
  • Make sure it's actually going to be 140GB; I'm honestly not sure. Do we really get 100 requests a second consistently, all the time? 20KB per entry is quite large too.

x2 was 0.5GB before this; dumping 140GB on it could cause all sorts of issues, replication problems among them. Here are some ideas:

  • Shorten the TTL: 24 hours is way too long for the edit stash; no one is going to take 24 hours to make an edit.

If the stash expires, the edit fails. People leave tabs open while they go for lunch or take a nap... This would be a product decision; I don't think we can change this short term.

  • Compress the entry before storage; that'd cut the size roughly in half.

Is 50% going to make much of a difference?

  • Make sure it's actually going to be 140GB; I'm honestly not sure. Do we really get 100 requests a second consistently, all the time? 20KB per entry is quite large too.

These are the stats I got from @Eevans for the corresponding Cassandra store. But there is only one way to find out for sure: try it.

If it turns out we can't use MainStash/X2 for this, how about we use a MySQL stash in a table that lives on the cluster that also hosts the ParserCache? How hard would that be?

From IRC:

Amir1> I'm happy with letting enwiki go but we might need to revert it right after if metrics don't look good

I think we'll try enwiki on Monday, considering that we haven't seen so much as a blip from all the edits on all the small and medium wikis. If things go bad, we can easily and safely switch enwiki back to using RESTBase.

x2 was 0.5GB before this; dumping 140GB on it could cause all sorts of issues, replication problems among them. Here are some ideas:

  • Shorten the TTL: 24 hours is way too long for the edit stash; no one is going to take 24 hours to make an edit.

If the stash expires, the edit fails. People leave tabs open while they go for lunch or take a nap... This would be a product decision; I don't think we can change this short term.

Who takes a nap in the middle of an edit? And 24 hours for a nap? :P That's more like anesthesia.

Jokes aside, two things come to mind:

  • The majority of people don't take a nap during their edits. Is it going to remove the entry once the edit is saved? If that's the case, I'm sure this won't be 140GB.
  • I highly recommend adding some metrics on how long it usually takes to save an edit; then we can look into what would be a reasonable number. Currently we are just speculating about people's sleeping habits.

  • Compress the entry before storage; that'd cut the size roughly in half.

Is 50% going to make much of a difference?

Yes. A big one: network, memory, the latency of MariaDB committing the change, replication, etc.

  • Make sure it's actually going to be 140GB; I'm honestly not sure. Do we really get 100 requests a second consistently, all the time? 20KB per entry is quite large too.

These are the stats I got from @Eevans for the corresponding Cassandra store. But there is only one way to find out for sure: try it.

Indeed

And to emphasize, and to make sure we are all on the same page (I feel there is a misunderstanding here): I'm not blocking this, and I still think x2 is a good place to store it. But I think it needs optimizations.

  • The majority of people don't take a nap during their edits. Is it going to remove the entry once the edit is saved? If that's the case, I'm sure this won't be 140GB.
  • I highly recommend adding some metrics on how long it usually takes to save an edit; then we can look into what would be a reasonable number. Currently we are just speculating about people's sleeping habits.

We know that about 90% of edits are just abandoned (by closing the tab, hitting the back button, etc). We need to stash context when the editor is opened, but we don't have a good way to get a signal when it's killed. We can do cleanup on a successful save, but it will only help in 10% of all cases.
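As an illustrative sketch of what cleanup-on-save could look like (the function name and key scheme below are hypothetical placeholders, not the actual VE/Parsoid code):

// Hypothetical sketch only: once the edit that used a stashed rendering has
// been saved, drop the entry instead of waiting for the 24-hour TTL.
// $stash is the BagOStuff selected via ParsoidCacheConfig/StashType;
// the 'parsoid-stash' key scheme here is a made-up placeholder.
function onEditSavedCleanupStash( BagOStuff $stash, string $renderId ): void {
	$key = $stash->makeGlobalKey( 'parsoid-stash', $renderId );
	$stash->delete( $key );
}

As noted above, that would still leave the ~90% of abandoned sessions to expire via the TTL.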

Is 50% going to make much of a difference?

Yes. A big one: network, memory, the latency of MariaDB committing the change, replication, etc.

OK. MemcachedBagOStuff implements this kind of thing. We may be able to pull it into a trait or something and reuse it.
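As an illustration of that idea (a standalone sketch, not the actual MemcachedBagOStuff code), compression could be wrapped around the stash reads and writes with PHP's zlib functions:

// Sketch of transparent compression around a BagOStuff. Values are deflated
// before the write and inflated on read; a short prefix marks compressed
// payloads so plain entries written earlier still read back correctly.
function stashSetCompressed( BagOStuff $stash, string $key, string $value, int $ttl ): bool {
	return $stash->set( $key, 'gz:' . gzdeflate( $value, 6 ), $ttl );
}

function stashGetCompressed( BagOStuff $stash, string $key ) {
	$raw = $stash->get( $key );
	if ( !is_string( $raw ) ) {
		return false;
	}
	return ( strncmp( $raw, 'gz:', 3 ) === 0 ) ? gzinflate( substr( $raw, 3 ) ) : $raw;
}

Whether the actual ratio on real stash entries is closer to 50% or better would need measuring.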

x2 was 0.5GB before this; dumping 140GB on it could cause all sorts of issues, replication problems among them. Here are some ideas:

  • Shorten the TTL: 24 hours is way too long for the edit stash; no one is going to take 24 hours to make an edit.

If the stash expires, the edit fails. People leave tabs open while they go for lunch or take a nap... This would be a product decision; I don't think we can change this short term.

Thank you for flagging this, @daniel, and I appreciate you acting on the instinct to be curious about edit session lengths, @Ladsgroup.

Here is a ticket for us to investigate time-to-save by platform: T338634.

Note: as far as I can tell, the last time we calculated time-to-save was in 2018 by way of T202137.

x2 was 0.5GB before this; dumping 140GB on it could cause all sorts of issues, replication problems among them. Here are some ideas:

  • Shorten the TTL: 24 hours is way too long for the edit stash; no one is going to take 24 hours to make an edit.

If the stash expires, the edit fails. People leave tabs open while they go for lunch or take a nap... This would be a product decision; I don't think we can change this short term.

  • Compress the entry before storage; that'd cut the size roughly in half.

Is 50% going to make much of a difference?

  • Make sure it's actually going to be 140GB; I'm honestly not sure. Do we really get 100 requests a second consistently, all the time? 20KB per entry is quite large too.

These are the stats I got from @Eevans for the corresponding Cassandra store. But there is only one way to find out for sure: try it.

Do you remember what the context was here? What stats specifically? I just (double-)checked storage, and it currently looks to be ~32GB for enwiki.

enwiki_T_parsoidphpAetaDBIXR_iWCeS5MknjkVQgnFw is RESTBase for enwiki parsoidphp-stash (don't ask), and the figure on the graph is "live storage" (the amount of space the tables are taking up on disk) in the codfw datacenter. That figure isn't adjusted for replication (we use 3-way replication), so 28GB/3 would be ~9.3GB without it. The compression ratio for those tables looks to be about 29.5% though, so figure ~32GB of actual (uncompressed) data.

Deployment to enwiki is looking good. Stash writes went from ~20 per minute to ~60 per minute. This makes me wonder whether our original estimate of 100 writes per second has the wrong unit... perhaps it's really a total of 100 writes per minute? That's about 1.7 people per second opening VE or switching modes. That sounds about right...

Screenshot 2023-06-12 154111.png (357×906 px, 70 KB)
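For reference, the same back-of-envelope arithmetic redone with the observed enwiki rate (illustrative only):

// With ~60 stash writes per minute on enwiki and the same 20KB / 24-hour
// assumptions, the steady-state footprint is tiny compared to 140GB.
$writesPerMinute = 60;
$entriesPerDay = $writesPerMinute * 60 * 24;     // 86,400 entries per 24-hour TTL window
$bytesPerEntry = 20 * 1024;
echo round( $entriesPerDay * $bytesPerEntry / 1024 ** 3, 1 ) . " GiB\n";   // ~1.6 GiB for enwiki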

Do you remember what the context was here? What stats specifically? I just (double-)checked storage, and it currently looks to be ~32GB for enwiki.

The context was, I believe, the creation of this ticket. We were trying to get an estimate of the required storage space for the stash. For some reason, we looked at the average row size and multiplied by write rate and TTL. I don't recall the details of the conversation, but I do recall that the hard part was mainly knowing which keyspace does what...

Can you double check the write rate for the stash?

daniel claimed this task.

We have been using the new backend for a couple of months now. It has been working fine.