Page MenuHomePhabricator

Improve Echo seentime code for multi-DC access
Closed, ResolvedPublic

Description

As part of T212129: Move MainStash out of Redis to a simpler multi-dc aware solution, we want to migrate the storage for Echo's seen-time from Redis to Cassandra (via Kask and [[ https://gerrit.wikimedia.org/g/mediawiki/core/+/bcbd880b1310acabd22046ea4ccfc7fc157b17c8/includes/libs/objectcache/RESTBagOStuff.php | RESTBagOStuff ]]). To facilitate that, we need the following:

NOTE: TTLs will be configured Kask-side, requiring no action on the part of the Echo extension

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Migrating from one storage backend to another, and from individual keys to pairs, would be a bit complicated. @mobrovac suggested that it would be easier if we could do both of these migrations at the same time. One way to do that would be to create a config setting that takes an array of storage backends, and have the code try these in order when reading. That way we can configure it to read first from the new backend, then the old backend (and then the user preference, which we still have fallback code for today), but only write new values to the first one, and we could make the pair vs separate keys thing a property of each storage backend as well.

Another issue @mobrovac pointed out is that there is no set date when we could stop reading from the old backend, because the keys in it never expire. Instead, I think we would have to write a migration script that fetches the seentime for every user through the fallback chain, and writes it to the new backend (if we write values from the fallback storage to the primary storage on get, this could just loop over all users and call the getter). After this script has run, we could then remove the fallback backends from the config (and finally remove the hidden prefence!)..

As for TTLs, I propose that we set a long but finite TTL (say, 90 days). When there's no seentime for the user because it's expired from the store, we'd fall back to the UNIX epoch, meaning we'd always think they had unseen notifications. For sanity, we would have to make sure that a user who has no seentime but also has no unread notifications is not considered to have unseen notifications.

@Catrope MultiWriteBagOStuff does what you're talking about, so it's probably not necessary to replicate it in the Echo code. We can do it with a configuration.

I think we'd still need a migration script, though.

@Catrope also, I talked to @kaldari by email yesterday, and he thought that all the data was going into the main stash. Is that true? I might need to review the Echo code again.

@Catrope also, I talked to @kaldari by email yesterday, and he thought that all the data was going into the main stash. Is that true? I might need to review the Echo code again.

No, that's not true. He repeated that in our team meeting this morning (he thought all data was in Redis), but that's not true and hasn't been for a long time. The rest of the data is all stored in the database. (And the number of unread notifications for each user is cached in WANCache.)

  • Set TTLs to allow seentimes to expire (right now they are stored forever)

I wonder whether setting TTLs is needed or desirable, really. The reason why this is a potential issue now is that Redis might evict some keys if it doesn't have room to store new ones. However, since we are going to migrate necessarily to a multi-DC-enabled solution, we can afford to store all needed keys indefinitely. In other words, you have been treating the main stash as permanent storage so far, and unless there are explicit application-level reasons to start TTL'ing the data, I'd suggest to keep it that way.

As for TTLs, I propose that we set a long but finite TTL (say, 90 days). When there's no seentime for the user because it's expired from the store, we'd fall back to the UNIX epoch, meaning we'd always think they had unseen notifications. For sanity, we would have to make sure that a user who has no seentime but also has no unread notifications is not considered to have unseen notifications.

If I'm not wrong, using the epoch time would mean larger tables scans in the DB, which is undesirable. I'd like to understand whether TTL'ing is necessary first, before we try finding solutions to the data has expired problem.

OK, if we can get away with not TTLing the data at all, that would certainly be simpler!

Yes, I think no TTL is the way to go. We don't want to lose this info.

kostajh subscribed.

Setting for Q2 as Q1 is getting full. If someone submits a patch we could review it earlier.

Eevans subscribed.

@Catrope In light of our meeting last Thursday, what needs to change here?

From our discussion I recall that:

  • We did want to use TTLs to prevent unbounded growth
  • We'd store the timestamps separately if atomic updates of the pair were not feasible
  • Sets would be performed using a GET (not POST), to avoid race-conditions under multi-master replication

Is this correct?

Yes, that sounds correct to me. Based on your second bullet point, I propose that we do the configurable storage backend and the TTL change, but that we keep storing timestamps separately for now unless and until we find (or you tell us) that it's better to store them in pairs. In other words, we'd do the first and third item from the check list in the task description, but we'd hold off on the second item for now. Does that sound OK?

Yes, that sounds correct to me. Based on your second bullet point, I propose that we do the configurable storage backend and the TTL change, but that we keep storing timestamps separately for now unless and until we find (or you tell us) that it's better to store them in pairs.

Based on (my understanding of) your earlier description of how these are accessed, having them separate seemed like a reasonable way of modeling; I'm a little curious what prompted us to want them paired.

In other words, we'd do the first and third item from the check list in the task description, but we'd hold off on the second item for now. Does that sound OK?

Yeah, sounds OK to me.

Eevans renamed this task from Improve Echo seentime code for multi-DC to Improve Echo seentime code for multi-DC access.Oct 1 2019, 8:32 PM
Eevans triaged this task as Medium priority.
Eevans updated the task description. (Show Details)
Eevans updated the task description. (Show Details)

Change 540729 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/extensions/Echo@master] [WIP] Make EchoSeenTime cache type configurable

https://gerrit.wikimedia.org/r/540729

Change 540731 had a related patch set uploaded (by Catrope; owner: Catrope):
[operations/mediawiki-config@master] [WIP] Config changes for Echo kask migration

https://gerrit.wikimedia.org/r/540731

I've put two patches in Gerrit that sketch out how I think this would be done. I'll come back to them tomorrow and test them properly etc.

Change 540975 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/extensions/Echo@master] Make EchoSeenTime cache entries expire after 1 year

https://gerrit.wikimedia.org/r/540975

I've put two patches in Gerrit that sketch out how I think this would be done. I'll come back to them tomorrow and test them properly etc.

These are now ready for review

@Catrope are you needing further reviews from core platform team?

Change 540729 merged by jenkins-bot:
[mediawiki/extensions/Echo@master] Make EchoSeenTime cache type configurable

https://gerrit.wikimedia.org/r/540729

Change 540975 merged by jenkins-bot:
[mediawiki/extensions/Echo@master] Make EchoSeenTime cache entries expire after 1 year

https://gerrit.wikimedia.org/r/540975

Change 544184 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/extensions/Echo@master] SeenTime: Do not set the cache key to Unix epoch

https://gerrit.wikimedia.org/r/544184

Change 544199 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/mediawiki-config@master] rename service definition

https://gerrit.wikimedia.org/r/544199

Change 544184 merged by jenkins-bot:
[mediawiki/extensions/Echo@master] SeenTime: Do not set the cache key to Unix epoch

https://gerrit.wikimedia.org/r/544184

Change 544199 merged by jenkins-bot:
[operations/mediawiki-config@master] rename service definition

https://gerrit.wikimedia.org/r/544199

Mentioned in SAL (#wikimedia-operations) [2019-10-24T18:15:35Z] <urbanecm@deploy1001> Synchronized wmf-config/: SWAT: 84c48df: rename service definition (T222851) (duration: 00m 53s)

Change 540731 merged by jenkins-bot:
[operations/mediawiki-config@master] Config changes for Echo kask migration

https://gerrit.wikimedia.org/r/540731

Mentioned in SAL (#wikimedia-operations) [2019-10-28T18:07:45Z] <urbanecm@deploy1001> Synchronized wmf-config/: SWAT: ddaa534: Config changes for Echo kask migration (T222851) (duration: 00m 55s)

Change 546672 had a related patch set uploaded (by Urbanecm; owner: Urbanecm):
[operations/mediawiki-config@master] Revert "Config changes for Echo kask migration"

https://gerrit.wikimedia.org/r/546672

Change 546672 merged by Urbanecm:
[operations/mediawiki-config@master] Revert "Config changes for Echo kask migration"

https://gerrit.wikimedia.org/r/546672

Mentioned in SAL (#wikimedia-operations) [2019-10-28T18:26:28Z] <urbanecm@deploy1001> Synchronized wmf-config/: SWAT: c48271d: Revert "Config changes for Echo kask migration" (T222851) (duration: 00m 53s)

The config change was rolled out in SWAT today. It was rolled back a short time later, out of an abundance of caution, because it seemed to apply more broadly than just testwiki (our expectation). During the time it was deployed, we were seeing ~1k/s requests. Sampling the records in Cassandra, the vast majority were of the form global:echo:seen:{alert,message}:time:%d. A number of them however were for enwiki, ruwiki, idwiki, outreachwiki, wikidatawiki, and others.

@Catrope, is this to be expected?

FWIW, everything seemed to be working correctly up until it was rolled back.

Change 546731 had a related patch set uploaded (by Eevans; owner: Catrope):
[operations/mediawiki-config@master] Config changes for Echo kask migration

https://gerrit.wikimedia.org/r/546731

To summarize a conversation with @Catrope on IRC:

There were redundant definitions of wgEchoSeenTimeCacheType, one that conditionally assigned a Kask store, the other that assigned it unconditionally. The latter took precedence and echostore was deployed everywhere (without impact). This has been fixed in r/546731.

testwiki may not be the best way to test this, because it uses global notifications. Meaning: If we configure testwiki to use Kask, while the default continues to use Redis, than we'll have a split-brain scenario where notifications viewed on testwiki will not be reflected elsewhere, and vice-versa. The new gerrit uses officewiki for a first deploy instead, since it does not have global notifications enabled.

Change 546731 merged by jenkins-bot:
[operations/mediawiki-config@master] Config changes for Echo kask migration

https://gerrit.wikimedia.org/r/546731

Mentioned in SAL (#wikimedia-operations) [2019-10-28T23:19:25Z] <catrope@deploy1001> Synchronized wmf-config/ProductionServices.php: Deploy Echo kask migration to officewiki for testing, part 1 (T222851) (duration: 00m 54s)

Change 546760 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/deployment-charts@master] echostore: Set TTL to 1 year (31536000)

https://gerrit.wikimedia.org/r/546760

Mentioned in SAL (#wikimedia-operations) [2019-10-28T23:20:32Z] <catrope@deploy1001> Synchronized wmf-config/CommonSettings.php: Deploy Echo kask migration to officewiki for testing, part 2 (T222851) (duration: 00m 52s)

Change 546760 merged by jenkins-bot:
[operations/deployment-charts@master] echostore: Set TTL to 1 year (31536000)

https://gerrit.wikimedia.org/r/546760

Mentioned in SAL (#wikimedia-operations) [2019-10-28T23:23:41Z] <catrope@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Deploy Echo kask migration to officewiki for testing, part 3 (T222851) (duration: 00m 52s)

@Catrope given the difficulty in doing a phased migration (vis-a-vis global notifications), do you have any objections to transitioning to the new store in a single go (i.e. updating it to be the default)? We accidentally did this yesterday, and TTBMK, there were no issues. We'd of course keep the multi-write configuration in place for the time being, so if rolling back were necessary, up-to-date seen times would continue to exist in Redis.

Change 547022 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/mediawiki-config@master] Migrate to Kask for Echo seen-time storage

https://gerrit.wikimedia.org/r/547022

Change 547022 merged by jenkins-bot:
[operations/mediawiki-config@master] Migrate to Kask for Echo seen-time storage

https://gerrit.wikimedia.org/r/547022

Mentioned in SAL (#wikimedia-operations) [2019-10-30T18:33:11Z] <jforrester@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T222851 Migrate to Kask for Echo seen-time storage (duration: 01m 01s)

Eevans claimed this task.

I believe this task to be done.

Eevans updated the task description. (Show Details)

I believe this task to be done.

Actually, no I don't.

  • Perform timestamp sets with GET instead of POST so that storage operations remain DC-local

This still needs to be done.

Summarizing an IRC discussion: @Catrope will pick this up mid-November(ish), and we'll target deployment for sometime after the November freeze (27th–29th), and before the December freeze (December 23rd-January 3rd).

Summarizing an IRC discussion: @Catrope will pick this up mid-November(ish), and we'll target deployment for sometime after the November freeze (27th–29th), and before the December freeze (December 23rd-January 3rd).

Any updates on this @Catrope ?

Change 563617 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/extensions/Echo@master] Use GET rather than POST for action=markseen

https://gerrit.wikimedia.org/r/563617

Change 563617 merged by jenkins-bot:
[mediawiki/extensions/Echo@master] Use GET rather than POST for action=markseen

https://gerrit.wikimedia.org/r/563617

Catrope updated the task description. (Show Details)