Page MenuHomePhabricator

Improve Echo seentime code for multi-DC access
Open, NormalPublic

Description

As part of T212129: Use a multi-dc aware store for ObjectCache's MainStash if needed., we want to migrate the storage for Echo's seen-time from Redis to Cassandra (via Kask and RESTBagOStuff). To facilitate that, we need the following:

NOTE: TTLs will be configured Kask-side, requiring no action on the part of the Echo extension

Event Timeline

Catrope created this task.May 8 2019, 11:25 PM
Restricted Application added a project: Growth-Team. · View Herald TranscriptMay 8 2019, 11:25 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Migrating from one storage backend to another, and from individual keys to pairs, would be a bit complicated. @mobrovac suggested that it would be easier if we could do both of these migrations at the same time. One way to do that would be to create a config setting that takes an array of storage backends, and have the code try these in order when reading. That way we can configure it to read first from the new backend, then the old backend (and then the user preference, which we still have fallback code for today), but only write new values to the first one, and we could make the pair vs separate keys thing a property of each storage backend as well.

Another issue @mobrovac pointed out is that there is no set date when we could stop reading from the old backend, because the keys in it never expire. Instead, I think we would have to write a migration script that fetches the seentime for every user through the fallback chain, and writes it to the new backend (if we write values from the fallback storage to the primary storage on get, this could just loop over all users and call the getter). After this script has run, we could then remove the fallback backends from the config (and finally remove the hidden prefence!)..

As for TTLs, I propose that we set a long but finite TTL (say, 90 days). When there's no seentime for the user because it's expired from the store, we'd fall back to the UNIX epoch, meaning we'd always think they had unseen notifications. For sanity, we would have to make sure that a user who has no seentime but also has no unread notifications is not considered to have unseen notifications.

@Catrope MultiWriteBagOStuff does what you're talking about, so it's probably not necessary to replicate it in the Echo code. We can do it with a configuration.

I think we'd still need a migration script, though.

@Catrope also, I talked to @kaldari by email yesterday, and he thought that all the data was going into the main stash. Is that true? I might need to review the Echo code again.

@Catrope also, I talked to @kaldari by email yesterday, and he thought that all the data was going into the main stash. Is that true? I might need to review the Echo code again.

No, that's not true. He repeated that in our team meeting this morning (he thought all data was in Redis), but that's not true and hasn't been for a long time. The rest of the data is all stored in the database. (And the number of unread notifications for each user is cached in WANCache.)

  • Set TTLs to allow seentimes to expire (right now they are stored forever)

I wonder whether setting TTLs is needed or desirable, really. The reason why this is a potential issue now is that Redis might evict some keys if it doesn't have room to store new ones. However, since we are going to migrate necessarily to a multi-DC-enabled solution, we can afford to store all needed keys indefinitely. In other words, you have been treating the main stash as permanent storage so far, and unless there are explicit application-level reasons to start TTL'ing the data, I'd suggest to keep it that way.

As for TTLs, I propose that we set a long but finite TTL (say, 90 days). When there's no seentime for the user because it's expired from the store, we'd fall back to the UNIX epoch, meaning we'd always think they had unseen notifications. For sanity, we would have to make sure that a user who has no seentime but also has no unread notifications is not considered to have unseen notifications.

If I'm not wrong, using the epoch time would mean larger tables scans in the DB, which is undesirable. I'd like to understand whether TTL'ing is necessary first, before we try finding solutions to the data has expired problem.

OK, if we can get away with not TTLing the data at all, that would certainly be simpler!

Yes, I think no TTL is the way to go. We don't want to lose this info.

kostajh moved this task from Inbox to Q2 2019-20 on the Growth-Team board.Jul 18 2019, 12:51 AM
kostajh added a subscriber: kostajh.

Setting for Q2 as Q1 is getting full. If someone submits a patch we could review it earlier.

Eevans added a subscriber: Eevans.

@Catrope In light of our meeting last Thursday, what needs to change here?

From our discussion I recall that:

  • We did want to use TTLs to prevent unbounded growth
  • We'd store the timestamps separately if atomic updates of the pair were not feasible
  • Sets would be performed using a GET (not POST), to avoid race-conditions under multi-master replication

Is this correct?

Yes, that sounds correct to me. Based on your second bullet point, I propose that we do the configurable storage backend and the TTL change, but that we keep storing timestamps separately for now unless and until we find (or you tell us) that it's better to store them in pairs. In other words, we'd do the first and third item from the check list in the task description, but we'd hold off on the second item for now. Does that sound OK?

Eevans added a comment.Tue, Oct 1, 1:16 AM

Yes, that sounds correct to me. Based on your second bullet point, I propose that we do the configurable storage backend and the TTL change, but that we keep storing timestamps separately for now unless and until we find (or you tell us) that it's better to store them in pairs.

Based on (my understanding of) your earlier description of how these are accessed, having them separate seemed like a reasonable way of modeling; I'm a little curious what prompted us to want them paired.

In other words, we'd do the first and third item from the check list in the task description, but we'd hold off on the second item for now. Does that sound OK?

Yeah, sounds OK to me.

Eevans renamed this task from Improve Echo seentime code for multi-DC to Improve Echo seentime code for multi-DC access.Tue, Oct 1, 8:32 PM
Eevans triaged this task as Normal priority.
Eevans updated the task description. (Show Details)
Eevans updated the task description. (Show Details)

Change 540729 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/extensions/Echo@master] [WIP] Make EchoSeenTime cache type configurable

https://gerrit.wikimedia.org/r/540729

Change 540731 had a related patch set uploaded (by Catrope; owner: Catrope):
[operations/mediawiki-config@master] [WIP] Config changes for Echo kask migration

https://gerrit.wikimedia.org/r/540731

I've put two patches in Gerrit that sketch out how I think this would be done. I'll come back to them tomorrow and test them properly etc.

Change 540975 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/extensions/Echo@master] Make EchoSeenTime cache entries expire after 1 year

https://gerrit.wikimedia.org/r/540975

I've put two patches in Gerrit that sketch out how I think this would be done. I'll come back to them tomorrow and test them properly etc.

These are now ready for review

Catrope updated the task description. (Show Details)Fri, Oct 4, 10:58 PM