Improve Echo seentime code for multi-DC access
Closed, ResolvedPublic
Actions

Description

As part of T212129: Move MainStash out of Redis to a simpler multi-dc aware solution, we want to migrate the storage for Echo's seen-time from Redis to Cassandra (via Kask and [[ https://gerrit.wikimedia.org/g/mediawiki/core/+/bcbd880b1310acabd22046ea4ccfc7fc157b17c8/includes/libs/objectcache/RESTBagOStuff.php | RESTBagOStuff ]]). To facilitate that, we need the following:

Storage must be configurable (right now it is hard-coded to use mainstash): https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/540729 and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/540731
Use TTLs to allow seen-times to expire (they are currently stored indefinitely): https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/540975
Perform timestamp sets with GET instead of POST so that storage operations remain DC-local

NOTE: TTLs will be configured Kask-side, requiring no action on the part of the Echo extension

Details

Subject	Repo	Branch	Lines +/-
Use GET rather than POST for action=markseen	mediawiki/extensions/Echo	master	+4 -20
Migrate to Kask for Echo seen-time storage	operations/mediawiki-config	master	+1 -2
echostore: Set TTL to 1 year (31536000)	operations/deployment-charts	master	+3 -3
Config changes for Echo kask migration	operations/mediawiki-config	master	+36 -0
Revert "Config changes for Echo kask migration"	operations/mediawiki-config	master	+0 -38
Config changes for Echo kask migration	operations/mediawiki-config	master	+38 -0
rename service definition	operations/mediawiki-config	master	+4 -4
SeenTime: Do not set the cache key to Unix epoch	mediawiki/extensions/Echo	master	+0 -2
Make EchoSeenTime cache entries expire after 1 year	mediawiki/extensions/Echo	master	+2 -2
Make EchoSeenTime cache type configurable	mediawiki/extensions/Echo	master	+24 -10

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	aaron	T252951 ResourceLoader DepStore lock acquired twice?
Resolved	None	T234286 Multi-DC Echo Notification Storage
Resolved	None	T234289 Migrate Wikimedia Echo notification timestamps from MainStash to Kask
Resolved	CCicalese_WMF	T234294 Configurable timestamp storage
Resolved	aaron	T88445 MediaWiki active/active datacenter investigation and work (tracking)
Resolved	Krinkle	T270223 FY2021-2022: Enable basic Multi-DC operations for read traffic (tracking)
Resolved	Krinkle	T113916 Switch ResourceLoader file dependency tracking to MultiDC-friendly backend
Resolved	jijiki	T267581 Phase out "redis_sessions" cluster and away from memcached cluster
Resolved	aaron	T212129 Move MainStash out of Redis to a simpler multi-dc aware solution
Resolved	Eevans	T222851 Improve Echo seentime code for multi-DC access

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Restricted Application added a project: Growth-Team. · View Herald TranscriptMay 8 2019, 11:25 PM

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

• Catrope added a parent task: T212129: Move MainStash out of Redis to a simpler multi-dc aware solution.May 8 2019, 11:25 PM

Migrating from one storage backend to another, and from individual keys to pairs, would be a bit complicated. @mobrovac suggested that it would be easier if we could do both of these migrations at the same time. One way to do that would be to create a config setting that takes an array of storage backends, and have the code try these in order when reading. That way we can configure it to read first from the new backend, then the old backend (and then the user preference, which we still have fallback code for today), but only write new values to the first one, and we could make the pair vs separate keys thing a property of each storage backend as well.

Another issue @mobrovac pointed out is that there is no set date when we could stop reading from the old backend, because the keys in it never expire. Instead, I think we would have to write a migration script that fetches the seentime for every user through the fallback chain, and writes it to the new backend (if we write values from the fallback storage to the primary storage on get, this could just loop over all users and call the getter). After this script has run, we could then remove the fallback backends from the config (and finally remove the hidden prefence!)..

As for TTLs, I propose that we set a long but finite TTL (say, 90 days). When there's no seentime for the user because it's expired from the store, we'd fall back to the UNIX epoch, meaning we'd always think they had unseen notifications. For sanity, we would have to make sure that a user who has no seentime but also has no unread notifications is not considered to have unseen notifications.

SBisson mentioned this in W2269 Echo Tasks.May 9 2019, 12:30 AM

• Mholloway subscribed.May 9 2019, 1:03 PM

@Catrope MultiWriteBagOStuff does what you're talking about, so it's probably not necessary to replicate it in the Echo code. We can do it with a configuration.

I think we'd still need a migration script, though.

@Catrope also, I talked to @kaldari by email yesterday, and he thought that all the data was going into the main stash. Is that true? I might need to review the Echo code again.

In T222851#5170947, @EvanProdromou wrote:

@Catrope also, I talked to @kaldari by email yesterday, and he thought that all the data was going into the main stash. Is that true? I might need to review the Echo code again.

No, that's not true. He repeated that in our team meeting this morning (he thought all data was in Redis), but that's not true and hasn't been for a long time. The rest of the data is all stored in the database. (And the number of unread notifications for each user is cached in WANCache.)

Set TTLs to allow seentimes to expire (right now they are stored forever)

I wonder whether setting TTLs is needed or desirable, really. The reason why this is a potential issue now is that Redis might evict some keys if it doesn't have room to store new ones. However, since we are going to migrate necessarily to a multi-DC-enabled solution, we can afford to store all needed keys indefinitely. In other words, you have been treating the main stash as permanent storage so far, and unless there are explicit application-level reasons to start TTL'ing the data, I'd suggest to keep it that way.

In T222851#5169105, @Catrope wrote:

As for TTLs, I propose that we set a long but finite TTL (say, 90 days). When there's no seentime for the user because it's expired from the store, we'd fall back to the UNIX epoch, meaning we'd always think they had unseen notifications. For sanity, we would have to make sure that a user who has no seentime but also has no unread notifications is not considered to have unseen notifications.

If I'm not wrong, using the epoch time would mean larger tables scans in the DB, which is undesirable. I'd like to understand whether TTL'ing is necessary first, before we try finding solutions to the data has expired problem.

OK, if we can get away with not TTLing the data at all, that would certainly be simpler!

Yes, I think no TTL is the way to go. We don't want to lose this info.

Setting for Q2 as Q1 is getting full. If someone submits a patch we could review it earlier.

@Catrope In light of our meeting last Thursday, what needs to change here?

From our discussion I recall that:

We did want to use TTLs to prevent unbounded growth
We'd store the timestamps separately if atomic updates of the pair were not feasible
Sets would be performed using a GET (not POST), to avoid race-conditions under multi-master replication

Is this correct?

Yes, that sounds correct to me. Based on your second bullet point, I propose that we do the configurable storage backend and the TTL change, but that we keep storing timestamps separately for now unless and until we find (or you tell us) that it's better to store them in pairs. In other words, we'd do the first and third item from the check list in the task description, but we'd hold off on the second item for now. Does that sound OK?

In T222851#5536403, @Catrope wrote:

Yes, that sounds correct to me. Based on your second bullet point, I propose that we do the configurable storage backend and the TTL change, but that we keep storing timestamps separately for now unless and until we find (or you tell us) that it's better to store them in pairs.

Based on (my understanding of) your earlier description of how these are accessed, having them separate seemed like a reasonable way of modeling; I'm a little curious what prompted us to want them paired.

In other words, we'd do the first and third item from the check list in the task description, but we'd hold off on the second item for now. Does that sound OK?

Yeah, sounds OK to me.

CCicalese_WMF added a parent task: T234294: Configurable timestamp storage.Oct 1 2019, 1:01 PM

Eevans renamed this task from Improve Echo seentime code for multi-DC to Improve Echo seentime code for multi-DC access.Oct 1 2019, 8:32 PM

Eevans triaged this task as Medium priority.

Eevans updated the task description. (Show Details)

Change 540729 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/extensions/Echo@master] [WIP] Make EchoSeenTime cache type configurable

https://gerrit.wikimedia.org/r/540729

gerritbot added a project: Patch-For-Review.Oct 4 2019, 12:01 AM

Change 540731 had a related patch set uploaded (by Catrope; owner: Catrope):
[operations/mediawiki-config@master] [WIP] Config changes for Echo kask migration

https://gerrit.wikimedia.org/r/540731

I've put two patches in Gerrit that sketch out how I think this would be done. I'll come back to them tomorrow and test them properly etc.

Change 540975 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/extensions/Echo@master] Make EchoSeenTime cache entries expire after 1 year

https://gerrit.wikimedia.org/r/540975

In T222851#5545642, @Catrope wrote:

I've put two patches in Gerrit that sketch out how I think this would be done. I'll come back to them tomorrow and test them properly etc.

These are now ready for review

• Catrope updated the task description. (Show Details)Oct 4 2019, 10:58 PM

@Catrope are you needing further reviews from core platform team?

Change 540729 merged by jenkins-bot:
[mediawiki/extensions/Echo@master] Make EchoSeenTime cache type configurable

https://gerrit.wikimedia.org/r/540729

ReleaseTaggerBot added a project: MW-1.35-notes (1.35.0-wmf.3; 2019-10-22).Oct 18 2019, 12:00 PM

Change 540975 merged by jenkins-bot:
[mediawiki/extensions/Echo@master] Make EchoSeenTime cache entries expire after 1 year

https://gerrit.wikimedia.org/r/540975

Change 544184 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/extensions/Echo@master] SeenTime: Do not set the cache key to Unix epoch

https://gerrit.wikimedia.org/r/544184

Change 544199 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/mediawiki-config@master] rename service definition

https://gerrit.wikimedia.org/r/544199

Change 544184 merged by jenkins-bot:
[mediawiki/extensions/Echo@master] SeenTime: Do not set the cache key to Unix epoch

https://gerrit.wikimedia.org/r/544184

Change 544199 merged by jenkins-bot:
[operations/mediawiki-config@master] rename service definition

https://gerrit.wikimedia.org/r/544199

Mentioned in SAL (#wikimedia-operations) [2019-10-24T18:15:35Z] <urbanecm@deploy1001> Synchronized wmf-config/: SWAT: 84c48df: rename service definition (T222851) (duration: 00m 53s)

Change 540731 merged by jenkins-bot:
[operations/mediawiki-config@master] Config changes for Echo kask migration

https://gerrit.wikimedia.org/r/540731

Mentioned in SAL (#wikimedia-operations) [2019-10-28T18:07:45Z] <urbanecm@deploy1001> Synchronized wmf-config/: SWAT: ddaa534: Config changes for Echo kask migration (T222851) (duration: 00m 55s)

Maintenance_bot removed a project: Patch-For-Review.Oct 28 2019, 6:10 PM

Change 546672 had a related patch set uploaded (by Urbanecm; owner: Urbanecm):
[operations/mediawiki-config@master] Revert "Config changes for Echo kask migration"

https://gerrit.wikimedia.org/r/546672

Change 546672 merged by Urbanecm:
[operations/mediawiki-config@master] Revert "Config changes for Echo kask migration"

https://gerrit.wikimedia.org/r/546672

Mentioned in SAL (#wikimedia-operations) [2019-10-28T18:26:28Z] <urbanecm@deploy1001> Synchronized wmf-config/: SWAT: c48271d: Revert "Config changes for Echo kask migration" (T222851) (duration: 00m 53s)

The config change was rolled out in SWAT today. It was rolled back a short time later, out of an abundance of caution, because it seemed to apply more broadly than just testwiki (our expectation). During the time it was deployed, we were seeing ~1k/s requests. Sampling the records in Cassandra, the vast majority were of the form global:echo:seen:{alert,message}:time:%d. A number of them however were for enwiki, ruwiki, idwiki, outreachwiki, wikidatawiki, and others.

@Catrope, is this to be expected?

FWIW, everything seemed to be working correctly up until it was rolled back.

Maintenance_bot removed a project: Patch-For-Review.Oct 28 2019, 7:11 PM

Change 546731 had a related patch set uploaded (by Eevans; owner: Catrope):
[operations/mediawiki-config@master] Config changes for Echo kask migration

https://gerrit.wikimedia.org/r/546731

gerritbot added a project: Patch-For-Review.Oct 28 2019, 8:52 PM

To summarize a conversation with @Catrope on IRC:

There were redundant definitions of wgEchoSeenTimeCacheType, one that conditionally assigned a Kask store, the other that assigned it unconditionally. The latter took precedence and echostore was deployed everywhere (without impact). This has been fixed in r/546731.

testwiki may not be the best way to test this, because it uses global notifications. Meaning: If we configure testwiki to use Kask, while the default continues to use Redis, than we'll have a split-brain scenario where notifications viewed on testwiki will not be reflected elsewhere, and vice-versa. The new gerrit uses officewiki for a first deploy instead, since it does not have global notifications enabled.

Change 546731 merged by jenkins-bot:
[operations/mediawiki-config@master] Config changes for Echo kask migration

https://gerrit.wikimedia.org/r/546731

Mentioned in SAL (#wikimedia-operations) [2019-10-28T23:19:25Z] <catrope@deploy1001> Synchronized wmf-config/ProductionServices.php: Deploy Echo kask migration to officewiki for testing, part 1 (T222851) (duration: 00m 54s)

Change 546760 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/deployment-charts@master] echostore: Set TTL to 1 year (31536000)

https://gerrit.wikimedia.org/r/546760

Mentioned in SAL (#wikimedia-operations) [2019-10-28T23:20:32Z] <catrope@deploy1001> Synchronized wmf-config/CommonSettings.php: Deploy Echo kask migration to officewiki for testing, part 2 (T222851) (duration: 00m 52s)

Change 546760 merged by jenkins-bot:
[operations/deployment-charts@master] echostore: Set TTL to 1 year (31536000)

https://gerrit.wikimedia.org/r/546760

Eevans mentioned this in rDEPLOYCHARTS994baef95157: echostore: Set TTL to 1 year (31536000).Oct 28 2019, 11:22 PM

Mentioned in SAL (#wikimedia-operations) [2019-10-28T23:23:41Z] <catrope@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Deploy Echo kask migration to officewiki for testing, part 3 (T222851) (duration: 00m 52s)

Maintenance_bot removed a project: Patch-For-Review.Oct 29 2019, 12:10 AM

@Catrope given the difficulty in doing a phased migration (vis-a-vis global notifications), do you have any objections to transitioning to the new store in a single go (i.e. updating it to be the default)? We accidentally did this yesterday, and TTBMK, there were no issues. We'd of course keep the multi-write configuration in place for the time being, so if rolling back were necessary, up-to-date seen times would continue to exist in Redis.

Change 547022 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/mediawiki-config@master] Migrate to Kask for Echo seen-time storage

https://gerrit.wikimedia.org/r/547022

gerritbot added a project: Patch-For-Review.Oct 29 2019, 7:34 PM

Change 547022 merged by jenkins-bot:
[operations/mediawiki-config@master] Migrate to Kask for Echo seen-time storage

https://gerrit.wikimedia.org/r/547022

Mentioned in SAL (#wikimedia-operations) [2019-10-30T18:33:11Z] <jforrester@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T222851 Migrate to Kask for Echo seen-time storage (duration: 01m 01s)

• Mholloway unsubscribed.Oct 30 2019, 6:33 PM

Maintenance_bot removed a project: Patch-For-Review.Oct 30 2019, 7:10 PM

I believe this task to be done.

Eevans updated the task description. (Show Details)Oct 31 2019, 9:01 PM

Eevans updated the task description. (Show Details)

In T222851#5624749, @Eevans wrote:

I believe this task to be done.

Actually, no I don't.

Perform timestamp sets with GET instead of POST so that storage operations remain DC-local

This still needs to be done.

Summarizing an IRC discussion: @Catrope will pick this up mid-November(ish), and we'll target deployment for sometime after the November freeze (27th–29th), and before the December freeze (December 23rd-January 3rd).

CCicalese_WMF added a project: Core Platform Team Initiatives (Multi-DC Echo Notification Storage).Nov 4 2019, 6:30 PM

In T222851#5626849, @Eevans wrote:

Summarizing an IRC discussion: @Catrope will pick this up mid-November(ish), and we'll target deployment for sometime after the November freeze (27th–29th), and before the December freeze (December 23rd-January 3rd).

Any updates on this @Catrope ?

Krenair subscribed.Dec 14 2019, 11:56 PM

Change 563617 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/extensions/Echo@master] Use GET rather than POST for action=markseen

https://gerrit.wikimedia.org/r/563617

gerritbot added a project: Patch-For-Review.Jan 11 2020, 12:14 AM

Change 563617 merged by jenkins-bot:
[mediawiki/extensions/Echo@master] Use GET rather than POST for action=markseen

https://gerrit.wikimedia.org/r/563617

ReleaseTaggerBot edited projects, added MW-1.35-notes (1.35.0-wmf.15; 2020-01-14); removed MW-1.35-notes (1.35.0-wmf.3; 2019-10-22).Jan 14 2020, 10:00 AM

Maintenance_bot removed a project: Patch-For-Review.Jan 14 2020, 10:11 AM

• Catrope closed this task as Resolved.Jan 14 2020, 5:13 PM

• Catrope updated the task description. (Show Details)

Krinkle mentioned this in T252945: Avoid constant evictions on Redis main stash.May 16 2020, 5:23 PM

Krinkle mentioned this in T212129: Move MainStash out of Redis to a simpler multi-dc aware solution.Jun 2 2020, 5:07 PM