
Convert various core/extension cache users to ReplicatedBagOStuff
Closed, Resolved · Public

Description

Using Redis or MySQL, some code stashes objects or data in places where both DCs should see them.
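
For concreteness, a minimal sketch of how such a replicated stash might be configured; the cache name, server names, and the exact ReplicatedBagOStuff/RedisBagOStuff parameters are illustrative assumptions, not the production setup:

```
// LocalSettings.php (illustrative only)
$wgObjectCaches['redis-replicated'] = [
	'class' => 'ReplicatedBagOStuff',
	// Writes go to the master Redis instance in the primary DC...
	'writeFactory' => [
		'class' => 'RedisBagOStuff',
		'args'  => [ [ 'servers' => [ 'redis-master.example.net:6379' ] ] ],
	],
	// ...while reads may be served by a (possibly lagged) local replica.
	'readFactory' => [
		'class' => 'RedisBagOStuff',
		'args'  => [ [ 'servers' => [ 'redis-replica.example.net:6379' ] ] ],
	],
];
```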

Examples include:

Related Objects

Event Timeline

aaron claimed this task.
aaron raised the priority of this task from to Medium.
aaron updated the task description. (Show Details)
aaron removed a project: Patch-For-Review.
aaron set Security to None.
aaron added subscribers: Gilles, GWicke, mark and 7 others.

Change 207718 had a related patch set uploaded (by Aaron Schulz):
Added ObjectStash factory class and $wgMainStash/$wgObjectStashes

https://gerrit.wikimedia.org/r/207718

aaron removed a project: Epic.

Change 207718 merged by jenkins-bot:
Added ObjectCache::getMainStashInstance() and $wgMainStash

https://gerrit.wikimedia.org/r/207718
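
As a rough sketch of the conversion pattern for call sites (only getMainStashInstance() and $wgMainStash come from the change above; the stash name, key, and TTL are illustrative):

```
// LocalSettings.php: point the main stash at one of the $wgObjectCaches
// entries, e.g. the replicated one sketched in the description.
$wgMainStash = 'redis-replicated';

// Call sites swap their DC-local cache for the shared stash:
$stash = ObjectCache::getMainStashInstance();
$key = wfMemcKey( 'example-feature', 'some-token' ); // illustrative key
$stash->set( $key, $data, 3600 ); // visible in both DCs once replicated
$data = $stash->get( $key );      // may briefly return stale or missing data
```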

Change 212715 had a related patch set uploaded (by Aaron Schulz):
Conversion to using getMainStashInstance()

https://gerrit.wikimedia.org/r/212715

Change 212717 had a related patch set uploaded (by Aaron Schulz):
[WIP] Conversion to using getMainStashInstance()

https://gerrit.wikimedia.org/r/212717

Change 213762 had a related patch set uploaded (by Aaron Schulz):
[WIP] Conversion to using getMainStashInstance()

https://gerrit.wikimedia.org/r/213762

Just to clarify, this would be for things where it is important for all instances to see the data, i.e., sort of like a short-lived key-value store? (Just curious, because OATHAuth might need to use this for storing token expiration data to prevent replay attacks.)

Yes, although note that get() can have lag. I was actually thinking about having a flag to avoid lag for the replicated cache class.
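
A sketch of what that flag might look like; the constant name, and passing it to get(), is purely hypothetical at this point, it is just the idea floated above:

```
// Hypothetical read flag asking ReplicatedBagOStuff to read from the
// master rather than a replica, trading latency for freshness.
$stash = ObjectCache::getMainStashInstance();
$value = $stash->get( $key, BagOStuff::READ_LATEST ); // flag name is an assumption
if ( $value === false ) {
	// Key genuinely absent (not merely unreplicated yet).
}
```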

Nonce tokens are interesting (I was also thinking about that a lot today). If the GET/POST distinction is complete and only the latter really mutates anything (like edits/comments), then the nonce cache could be DC-local for performance: POSTs would all go to one DC and have full deduplication, while GETs would only allow one extra replay in the other DC (if fast enough and if HTTP was being used) and would not change anything anyway. Of course, if one is paranoid, they can use add() on the stash BagOStuff :)
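
The add() trick mentioned above would look roughly like this (the key scheme and TTL are illustrative):

```
// BagOStuff::add() only succeeds when the key does not exist yet, so on
// the replicated stash it doubles as a "has this nonce been seen?" check.
$stash = ObjectCache::getMainStashInstance();
$key = wfMemcKey( 'nonce-token', sha1( $token ) ); // illustrative key scheme
if ( !$stash->add( $key, 1, 300 ) ) {
	// Token already used somewhere: treat as a replay and reject.
	// Still subject to replication lag unless all writes hit one DC.
}
```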

Change 223790 had a related patch set uploaded (by Aaron Schulz):
Conversion to using getMainStashInstance()

https://gerrit.wikimedia.org/r/223790

Change 212715 merged by jenkins-bot:
Conversion to using getMainStashInstance()

https://gerrit.wikimedia.org/r/212715

Change 212717 merged by jenkins-bot:
Conversion to using getMainStashInstance()

https://gerrit.wikimedia.org/r/212717

Change 213762 merged by jenkins-bot:
Conversion to using getMainStashInstance()

https://gerrit.wikimedia.org/r/213762

Change 221994 had a related patch set uploaded (by Aaron Schulz):
Conversion to using getMainStashInstance()

https://gerrit.wikimedia.org/r/221994

@aaron do we really want to rely on Redis replication cross-datacenter? Or did I get that wrong? Redis replication is known not to be extraordinarily reliable even in a small-lag, same-DC setup; did you do some experiments to see how reliable it would be?

If not, I'd love to help.

In T97620#1492791, @Joe wrote:

@aaron do we really want to rely on Redis replication cross-datacenter? Or did I get that wrong? Redis replication is known not to be extraordinarily reliable even in a small-lag, same-DC setup; did you do some experiments to see how reliable it would be?

If not, I'd love to help.

We used replication from tampa => eqiad during the switchover (though I don't think the consistent hashing was done correctly on the MW side). I was assuming we likewise have replication from eqiad => codfw.

@aaron at the moment we don't, as replicating Redis would, for instance, result in the codfw job queues processing the same jobs as the eqiad ones all over again.

Also, while I can think of that as a solution for a "definitive" switchover, I don't think it's a good idea long-term. But I'll look into options for making it as reliable as possible.

I'm just talking about the mc* redis instances (this bug is just about BagOStuff). We've done replication for that before.

I think the queues could be replicated in any case (the only duplicate jobs would be those whose ACK was not replicated before the switchover, which is tolerable), but that discussion can go in another task.

Change 221994 merged by jenkins-bot:
Conversion to using getMainStashInstance()

https://gerrit.wikimedia.org/r/221994

Change 234190 had a related patch set uploaded (by Aaron Schulz):
Conversion to using getMainStashInstance()

https://gerrit.wikimedia.org/r/234190

Change 234191 had a related patch set uploaded (by Aaron Schulz):
Conversion to using getMainStashInstance()

https://gerrit.wikimedia.org/r/234191

Change 223790 merged by jenkins-bot:
Conversion to using getMainStashInstance()

https://gerrit.wikimedia.org/r/223790

Change 234190 merged by jenkins-bot:
Conversion to using getMainStashInstance()

https://gerrit.wikimedia.org/r/234190

Change 234191 merged by jenkins-bot:
Conversion to using getMainStashInstance()

https://gerrit.wikimedia.org/r/234191

CA tokens may end up being handled in T108253, if needed.