
Preemptive refresh in getMultiWithSetCallback() and getMultiWithUnionSetCallback() pollutes cache
Closed, ResolvedPublic

Description

I updated translatewiki.net code two days ago (after a two week break).

Today I started getting reports that edits are going to the wrong pages. I was not able to reproduce it myself, but I found some history which looks wrong.

One example:

The version of December 5th, 2017 [?oldid=XXX now shows correct content for this language because the cache was purged] was supposed to be a two-byte change overall (with an edit summary saying some word is replaced). However, looking at the diff or viewing it directly, we see completely unrelated text:

https://translatewiki.net/w/i.php?title=MediaWiki:Wm-license-cecill-text/fr&diff=7714145&oldid=7692568

The same unrelated text is also seen at https://translatewiki.net/w/i.php?title=MediaWiki:Cx-notification-deleted-draft/fr&diff=9049005&oldid=9046073

Impact

  • Wrong content is displayed in page history diffs, both on translatewiki.net and on Wikimedia wikis with Translate
  • Wrong content is getting exported (not 100% sure this is caused by this bug)

Current status

  • It has been determined that the database is okay; the issue lies with the caching of content.
  • It is suspected that something is wrong with WANObjectCache::getMultiWithUnionSetCallback that causes wrong content to be cached.
  • The problematic caching has been disabled to prevent further corruption of caches. This has been deployed to translatewiki.net and Wikimedia wikis.
  • Caches and translation pages have been purged on translatewiki.net (multiple times)
  • Caches and translation pages have been purged on Wikimedia wikis

Event Timeline


Change 543730 had a related patch set uploaded (by Krinkle; owner: Daniel Kinzler):
[mediawiki/core@wmf/1.35.0-wmf.2] SqlBlobStore HOT FIX: remove caching from getBlobBatch

https://gerrit.wikimedia.org/r/543730

The master patch was backported to wmf.1 and +2'ed in master before wmf.2 was cut, but it got lost in Zuul land somehow. This must be deployed before group2.

Interesting, I've seen a similar caching issue where revision content got mixed up on a 1.33 wiki with $wgMultiContentRevisionSchemaMigrationStage = SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_NEW;. I haven't yet been able to consistently reproduce it, though.

Huh... I'd love to know more about this. Was that with external storage enabled, or without? What was the caching config?

Just to follow up on this, the issue I was encountering was due to an unrelated misconfiguration. I was a bit too quick to connect the dots here. Sorry about the false alarm.

Change 542328 merged by jenkins-bot:
[mediawiki/core@master] SqlBlobStore HOT FIX: remove caching from getBlobBatch

https://gerrit.wikimedia.org/r/542328

Piramidion added a comment.EditedOct 17 2019, 9:25 PM

Privacy policy/Proposed Revisions/uk on Meta-wiki is broken, the bug is still there (just sayin')

UPD: it seems that I'm stuck again: Privacy policy/FAQ/uk is broken too.

An observation:

  1. In both of the cases above, next to a correct translation of the correct item, an incorrect one is (randomly?) inserted. The incorrect one is the immediately preceding translated message (you can even see it in the left diff column, in the summary field). There are other similar examples: diff1, diff2, or even this one, without an actual translation (I guess it was a null edit).
  2. I noticed that when I open my contributions and look at my translations, sometimes an edit made to a target /uk page is shown as if it was done earlier than the actual creation of the page in the Translations: namespace. I wonder what the timestamps for those actions are, and whether there is some kind of lag in the creation of pages in the Translations: namespace. For example, this edit is shown in my contributions as done earlier than the creation of this page.
greg added a subscriber: greg.Oct 17 2019, 9:38 PM

The hotfix by itself didn't purge anything. Purging needs to be done separately and no mass purge has yet been done on Wikimedia wikis.

Is this still the case? Who is working on the cache purging?

jeena added a subscriber: jeena.Oct 17 2019, 11:57 PM
jeena added a comment.EditedOct 18 2019, 12:03 AM

I think the hotfix still needs to be merged to 1.35.0-wmf.2 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/543730

@WDoranWMF can you confirm?

I am also impacted: translating Help:New_filters_for_edit_review/Filtering/fr produces a mess in the generated text. IMPORTANT: translation should be blocked if edits are liable to be cancelled later, to avoid useless work!

https://www.mediawiki.org/w/index.php?title=Help:New_filters_for_edit_review/Filtering/fr&action=edit&undoafter=3462762&undo=3462764 which, after the last two attempts, produces unwanted duplicated text as follows:

or an incoherent correspondence with a previously correct translated text:

I think the hotfix still needs to be merged to 1.35.0-wmf.2 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/543730

Yes, I believe @Krinkle added this as a blocker for the train to ensure https://gerrit.wikimedia.org/r/c/mediawiki/core/+/543730 gets deployed. After that, this bug can be removed as a blocker for the train, but it should be kept open until the caches have been purged.

I don't know if this is related, but there's now a very strange behavior with lost sessions somewhere in the wiki.

https://translatewiki.net/wiki/Thread%3ASupport/TranslateUserManager

You can see strange behavior of "TranslateUserManager" creating log entries with repeated (and public) incorrect statements, which are also privacy breaches for logged-in users. Internally there were incorrect attempts by TranslateUserManager to automatically create a user account from my IPv6 address. This could occur when the server loses the session and I need to log in again. It may simply have failed to get some session cookie, or there may be simultaneous random use of IPv4 and IPv6 caused by background JSON requests using a different DNS resolver somewhere in the JavaScript framework.

But it may also be another related problem of synchronisation between the main server and the cache servers. At least this gives a start date for this issue: 15 August, to be correlated with recent version updates on translatewiki.net and with incorrect assumptions in the servers (like: the same logged-in user cannot send requests from BOTH IPv4 and IPv6, or, when this happens, the incorrect assumption that there is a one-to-one mapping).

I have not specifically chosen IPv4 or IPv6; both may be used with HTTPS. But I note that the server now serves MIXED content over HTTP and HTTPS, and their routing may be very different. This date also seems to coincide with a change in Google Chrome, which now tries to enforce the switch from HTTP to HTTPS wherever it works, even if the requests initially specify HTTP (in JSON requests or XMLHttpRequests). I note that my Google Chrome constantly displays deprecation warnings.

Be aware that mixed HTTP/HTTPS content will soon be banned and is being phased out quite aggressively. Many JavaScript frameworks have to be updated (including for example jQuery, but also MediaWiki itself, and requests to all external sites, including Wikimedia Commons for files displayed on translatewiki.net from an unrelated domain).

Maybe this new bug reported in TWN Support should have its own bug here in Phabricator; in that case, move this message to the appropriate place.

Verdy_p rescinded a token.Oct 21 2019, 12:08 AM
Verdy_p awarded a token.

The last comment has nothing to do with this issue.

Change 543730 merged by jenkins-bot:
[mediawiki/core@wmf/1.35.0-wmf.2] SqlBlobStore HOT FIX: remove caching from getBlobBatch

https://gerrit.wikimedia.org/r/543730

Mentioned in SAL (#wikimedia-operations) [2019-10-21T11:25:54Z] <mobrovac@deploy1001> Synchronized php-1.35.0-wmf.2/includes/Storage/SqlBlobStore.php: SqlBlobStore HOT FIX: remove caching from getBlobBatch; file 1/3 - T235188 (duration: 01m 00s)

Mentioned in SAL (#wikimedia-operations) [2019-10-21T11:28:05Z] <mobrovac@deploy1001> Synchronized php-1.35.0-wmf.2/includes/libs/objectcache/wancache/WANObjectCache.php: SqlBlobStore HOT FIX: remove caching from getBlobBatch; file 2/3 - T235188 (duration: 00m 59s)

Mentioned in SAL (#wikimedia-operations) [2019-10-21T11:30:35Z] <mobrovac@deploy1001> Synchronized php-1.35.0-wmf.2/tests/phpunit/includes/Storage/SqlBlobStoreTest.php: SqlBlobStore HOT FIX: remove caching from getBlobBatch; file 3/3 - T235188 (duration: 01m 00s)

mobrovac lowered the priority of this task from Unbreak Now! to High.Oct 21 2019, 11:31 AM
mobrovac added a subscriber: mobrovac.

Lowering the priority as the hot fix has been applied to all production branches.

Verdy_p added a comment.EditedOct 21 2019, 9:51 PM
In T235188#5590245, @Nikerabbit wrote:

The last comment has nothing to do with this issue.

Do you mean "TranslateUserManager" repeatedly trying to create a fake account from my IPv6 address, even though I'm logged in with my named account?
It does that about 4 or 5 times each day I'm connected, and has since August 15. It probably does that because it temporarily cannot get some optional cookie that was not renewed correctly, or because session cookies are renewed in a different order than expected when I log in, or because cookies get flushed quite often on my PC if they are not used for more than about an hour.
The configuration has been the same since August 15. I think it may be related to the recent change in Chrome, which now enforces HTTPS as much as it can and uses its own DNS client over HTTPS (but may fall back to Google's DNS servers at 8.8.8.8, with problems for some originating IPv6 addresses).
There are still not a lot of people with "native" IPv4+IPv6 internet connections (in fact, IPv4 on my ISP is now channelled through an internal IPv4-over-IPv6 tunnel created by the ISP's router; IPv6 is now preferred by my ISP for its gigabit fiber accesses). But I do not trust my ISP's DNS and have configured the router to use OpenDNS instead (and that is also what is broadcast in my DHCP configuration).
Or could it be a recent defect in OpenDNS, with strange responses for Wikimedia and TWN sites?

How is it possible that my IPv6 address is shown by fake account creations (which in fact never occurred) from TranslateUserManager, when I am in fact logged in? Such public logging should never occur, as it breaches the Wikimedia policies, so this is a severe bug.

Is it related to https://phabricator.wikimedia.org/T236011, which you created yesterday after my message?
(It was initially reported for IPv4, but now occurs for IPv6 as well.)

@Verdy_p Yes, TranslateUserManager is not related to this issue in any way. You are spamming dozens of people subscribed to this task with irrelevant updates. Please stop it. I have already replied on translatewiki.net and filed T236011: Some newusers log entries show the IP address of the currently viewing user.

Pols12 added a subscriber: Pols12.Oct 23 2019, 12:00 AM

Lowering the priority as the hot fix has been applied to all production branches.

So it is supposed to be unnoticeable currently on wikis? Is that wrong content in edit mode another bug?

So it is supposed to be unnoticeable currently on wikis? Is that wrong content in edit mode another bug?

Like I wrote above, the hotfix does not purge caches. It only prevents further cache corruption.

Current state: The root cause that was causing new pages and revisions to be broken has been plugged. This means we are no longer expanding the range of corrupted content.

Next step: Repair the corruption that has occurred during the past two weeks. In particular:

  • Revision text cache in Memcached: Maybe slowly drop all memc keys that are prefixed with one of the affected wikis (those having Translate installed) and contain the sub-key component for the revision text cache. This isn't something we've done recently, so it will need to be done carefully and in coordination with ServiceOps to monitor the Memcached impact.
    • This affects page views of the corrupted page.
    • This affects loading the editor on the corrupted page. As such, even though the root cause was fixed, there can still be new corruption happening when people save edits to these pages, due to loading the wrong wikitext to start the edit with. I imagine this also affects bots, which wouldn't know that it is the wrong content.
    • This affects parsing of existing pages if they transclude the corrupted page.
  • ParserOutput: You can probably use the RejectParserCacheValue hook on these wikis to reject caches saved between two weeks ago and now (where "now" = after Memcached is fixed).
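As a rough illustration of that ParserCache step: the real hook is PHP in operations/mediawiki-config, so this is only a hypothetical Python sketch, and the window bounds below are placeholders, not the real deployment dates.

```python
# Hypothetical sketch of the RejectParserCacheValue idea: reject any
# cached ParserOutput whose save timestamp falls inside the corruption
# window. MediaWiki's TS_MW format is a fixed-width YYYYMMDDHHMMSS
# string, so lexicographic comparison matches chronological order.

CORRUPTION_START = "20191010000000"  # placeholder, not the real start date
CORRUPTION_END = "20191030000000"    # placeholder; set to current-ish time once Memcached is purged

def should_reject(cache_time: str) -> bool:
    """True if an entry saved at this TS_MW timestamp must be rejected."""
    return CORRUPTION_START <= cache_time <= CORRUPTION_END
```

Entries outside the window are kept, so the hook only forces re-parses for content that could have been generated from polluted blobs.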
WDoranWMF added a subscriber: Holger.EditedOct 23 2019, 5:52 PM

Adding serviceops, based on conversation with @Krinkle. The outstanding pieces are:

  1. A fix for WANCache that @aaron is working on, to facilitate the redeploy of the MCR patch
  2. Core Platform Team needs to look at using the RejectParserCacheValue hook to selectively reject values for the affected wikis during the corruption window; this is @Krinkle's suggestion
  3. Core Platform Team will need support from serviceops to purge the affected cache; @Krinkle suggests we'll also need to provide a means to identify the items to be purged.

@Anomie Can you take a look at 2, @Holger can you work with @Anomie to get a signature for 3 to allow purging?

serviceops what will you need from us/what's the process for purging the corrupt items?

Change 545647 had a related patch set uploaded (by Anomie; owner: Anomie):
[operations/mediawiki-config@master] RejectParserCacheValue to reject possibly-corrupted entries

https://gerrit.wikimedia.org/r/545647

  1. Core Platform Team needs to look at using RejectParserCacheValue hook to selectively reject values for the affected Wikis during the corruption window - this is @Krinkle suggestion

Patch submitted. It wasn't clear to me which dates to purge, so I left it as WIP. Replace the two "FILL THIS IN" strings with the appropriate dates in TS_MW format.

The SqlBlobStore code using the bad WANObjectCache method was merged in 1.34.0-wmf.21, which was much more than two weeks ago. But maybe nothing was using the affected SqlBlobStore code path until more recently? That's not clear to me.

  1. Core Platform Team will need support from serviceops to purge the affected cache - @Krinkle suggests we'll also need to provide a means to identify the items to be purged.

The "revision text cache" is probably any key beginning with "global:SqlBlobStore-blob:$id:", where $id is the wiki ID of any of the affected wikis. If we want to go back before 1.34.0-wmf.23, also "global:BlobStore:address:$id:".

RevisionStore's cache shouldn't be affected, as it doesn't cache the blob content. At a quick glance through I didn't see any other relevant layers of caching.

@Krinkle Would you have time to review the patch?

The corrupted entries started happening in wmf/1.35.0-wmf.1 when Translate started using that interface.

Apart from revision text cache, I also recommend diff cache to be purged since wrong contents have also been cached in them.

Joe added a subscriber: Joe.Oct 24 2019, 10:58 AM

serviceops what will you need from us/what's the process for purging the corrupt items?

There is no existing process to do this. It will require us to get inventive if there is no way to enumerate the cache keys from software (i.e. MediaWiki), and it will take significant time.

From what @Anomie said, it should be possible to go onto all 18 memcached servers, somehow dump the list of all keys, filter for the ones we want by prefix (but we'll need a more precise list), and delete them all with a script.

Anything new that could be shared with communities?

jijiki added a subscriber: jijiki.Oct 24 2019, 3:54 PM
Krinkle added a comment.EditedOct 24 2019, 4:28 PM
  1. Core Platform Team needs to look at using RejectParserCacheValue hook to selectively reject values for the affected Wikis during the corruption window - this is @Krinkle suggestion

Patch submitted. It wasn't clear to me which dates to purge, so I left it as WIP. [..]

[operations/mediawiki-config@master] RejectParserCacheValue to reject possibly-corrupted entries
https://gerrit.wikimedia.org/r/545647

I've updated the patch to use this start date. However, the end date for ParserCache should not be set to the above, as Memcached still has to be purged. Once that is done, set it to the current-ish time.

aaron renamed this task from Some revisions' contents are incorrect in the cache - wrong contents shown in history & diffs to Preemptive refresh in getMultiWithSetCallback() and getMultiWithUnionSetCallback() pollutes cache.Oct 24 2019, 5:05 PM

@Joe Apologies for the following newb questions: is access to those memcached instances through SRE? And if so, what do you want us to prepare to make this as straightforward as possible on the SRE side?

Would it be most helpful for us to:

  1. Set clear criteria for the keys to be removed
  2. Define how to dump and filter the keys
  3. Provide the scripts to do the above

Is there an instance we can use to test the above against?

From what @Anomie said, it should be possible to go on all the 18 memcached servers, dump somehow the list of all keys, filter for the ones we want by prefix (but we'll need a more precise list) and delete them all with a script.

Looks like the list of prefixes is

global:SqlBlobStore-blob:advisorswiki: 
global:SqlBlobStore-blob:amwikimedia: 
global:SqlBlobStore-blob:bdwikimedia: 
global:SqlBlobStore-blob:betawikiversity: 
global:SqlBlobStore-blob:bewikimedia: 
global:SqlBlobStore-blob:brwikimedia: 
global:SqlBlobStore-blob:cawikimedia: 
global:SqlBlobStore-blob:collabwiki: 
global:SqlBlobStore-blob:commonswiki: 
global:SqlBlobStore-blob:frwiktionary: 
global:SqlBlobStore-blob:hiwikimedia: 
global:SqlBlobStore-blob:idwikimedia: 
global:SqlBlobStore-blob:incubatorwiki: 
global:SqlBlobStore-blob:legalteamwiki: 
global:SqlBlobStore-blob:maiwikimedia: 
global:SqlBlobStore-blob:mediawikiwiki: 
global:SqlBlobStore-blob:metawiki: 
global:SqlBlobStore-blob:nowikimedia: 
global:SqlBlobStore-blob:otrs_wikiwiki: 
global:SqlBlobStore-blob:outreachwiki: 
global:SqlBlobStore-blob:punjabiwikimedia: 
global:SqlBlobStore-blob:ruwikimedia: 
global:SqlBlobStore-blob:sourceswiki: 
global:SqlBlobStore-blob:specieswiki: 
global:SqlBlobStore-blob:testcommonswiki: 
global:SqlBlobStore-blob:testwiki: 
global:SqlBlobStore-blob:testwikidatawiki: 
global:SqlBlobStore-blob:uawikimedia: 
global:SqlBlobStore-blob:wbwikimedia: 
global:SqlBlobStore-blob:wikidatawiki: 
global:SqlBlobStore-blob:wikimania2012wiki: 
global:SqlBlobStore-blob:wikimania2013wiki: 
global:SqlBlobStore-blob:wikimania2014wiki: 
global:SqlBlobStore-blob:wikimania2015wiki: 
global:SqlBlobStore-blob:wikimania2016wiki: 
global:SqlBlobStore-blob:wikimania2017wiki: 
global:SqlBlobStore-blob:wikimania2018wiki: 
global:SqlBlobStore-blob:wikimaniawiki:

If you want to purge diff caches too as suggested in T235188#5602261, it looks like that would be

advisorswiki:diff:
advisorswiki:inline-diff:
amwikimedia:diff:
amwikimedia:inline-diff:
bdwikimedia:diff:
bdwikimedia:inline-diff:
betawikiversity:diff:
betawikiversity:inline-diff:
bewikimedia:diff:
bewikimedia:inline-diff:
brwikimedia:diff:
brwikimedia:inline-diff:
cawikimedia:diff:
cawikimedia:inline-diff:
collabwiki:diff:
collabwiki:inline-diff:
commonswiki:diff:
commonswiki:inline-diff:
frwiktionary:diff:
frwiktionary:inline-diff:
hiwikimedia:diff:
hiwikimedia:inline-diff:
idwikimedia:diff:
idwikimedia:inline-diff:
incubatorwiki:diff:
incubatorwiki:inline-diff:
legalteamwiki:diff:
legalteamwiki:inline-diff:
maiwikimedia:diff:
maiwikimedia:inline-diff:
mediawikiwiki:diff:
mediawikiwiki:inline-diff:
metawiki:diff:
metawiki:inline-diff:
nowikimedia:diff:
nowikimedia:inline-diff:
otrs_wikiwiki:diff:
otrs_wikiwiki:inline-diff:
outreachwiki:diff:
outreachwiki:inline-diff:
punjabiwikimedia:diff:
punjabiwikimedia:inline-diff:
ruwikimedia:diff:
ruwikimedia:inline-diff:
sourceswiki:diff:
sourceswiki:inline-diff:
specieswiki:diff:
specieswiki:inline-diff:
testcommonswiki:diff:
testcommonswiki:inline-diff:
testwikidatawiki:diff:
testwikidatawiki:inline-diff:
testwiki:diff:
testwiki:inline-diff:
uawikimedia:diff:
uawikimedia:inline-diff:
wbwikimedia:diff:
wbwikimedia:inline-diff:
wikidatawiki:diff:
wikidatawiki:inline-diff:
wikimania2012wiki:diff:
wikimania2012wiki:inline-diff:
wikimania2013wiki:diff:
wikimania2013wiki:inline-diff:
wikimania2014wiki:diff:
wikimania2014wiki:inline-diff:
wikimania2015wiki:diff:
wikimania2015wiki:inline-diff:
wikimania2016wiki:diff:
wikimania2016wiki:inline-diff:
wikimania2017wiki:diff:
wikimania2017wiki:inline-diff:
wikimania2018wiki:diff:
wikimania2018wiki:inline-diff:
wikimaniawiki:diff:
wikimaniawiki:inline-diff:
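The two prefix lists above are mechanical derivations from the set of affected wiki IDs; a sketch of how they could be generated (the wiki list is abbreviated here, and the full set is the ~38 wikis listed above):

```python
# Sketch: derive the Memcached key prefixes to purge from the list of
# affected wiki IDs. Abbreviated list; the real one covers all wikis
# with the Translate extension installed.
AFFECTED_WIKIS = ["advisorswiki", "amwikimedia", "metawiki", "wikimaniawiki"]

def blob_prefixes(wikis):
    """Revision text cache keys: global:SqlBlobStore-blob:<wiki>:..."""
    return [f"global:SqlBlobStore-blob:{w}:" for w in wikis]

def diff_prefixes(wikis):
    """Diff caches: <wiki>:diff:... and <wiki>:inline-diff:..."""
    return [p for w in wikis for p in (f"{w}:diff:", f"{w}:inline-diff:")]
```

Generating the prefixes from one canonical wiki list avoids the two lists drifting apart if more affected wikis are identified later.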
Joe added a comment.Oct 28 2019, 2:54 PM

@Joe Apologies for the following newb questions: is access to those memcached instances through SRE? And if so, what do you want us to prepare to make this as straightforward as possible on the SRE side?

Would it be most helpful for us to:

  1. Set clear criteria for the keys to be removed
  2. Define how to dump and filter the keys
  3. Provide the scripts to do the above

Is there an instance we can use to test the above against?

I don't think we're strictly needed in order to do this. Or rather, someone should write a script that:

  • connects to memcached on $host:$port (given on the CLI)
  • cycles through all keys on the server, using one of the techniques described in https://www.darkcoding.net/software/memcached-list-all-keys/
  • deletes keys that match certain criteria, this time connecting via mcrouter and sending deletions to all datacenters

We can do that, but I think we're only needed, strictly speaking, to run the command. I'll get back to you in a few.
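The filter-and-delete part of such a script could look like the following sketch; the memcached client is abstracted away (an assumption for testability, not the real tooling), since key enumeration and the mcrouter connection are the server-specific parts:

```python
# Sketch of the deletion pass described above, with the memcached
# client injected so the filtering logic stands on its own. A real run
# would enumerate keys per server and send deletes through mcrouter so
# they reach all datacenters.

def purge_matching(keys, prefixes, delete):
    """Delete every key starting with one of the given prefixes.

    `keys` is an iterable of key names dumped from one server,
    `delete` is the client's delete function. Returns the deleted keys.
    """
    doomed = [k for k in keys if any(k.startswith(p) for p in prefixes)]
    for key in doomed:
        delete(key)
    return doomed
```

As Joe notes in the next comment, the hard part in practice turned out to be the enumeration step, not this filtering.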

Joe added a comment.Oct 28 2019, 4:56 PM

After some digging, it appears impossible to reliably get all keys from a memcached server as loaded as our production ones.

Whether I use stats cachedump by hand or memcdump from libmemcached-tools, I get around 3M keys. If I look at the stats, though, the server seems to have 44M items.

In this situation, I guess our best bet is to perform a rolling restart of the memcached servers, which will wipe those memcached keys; doing it safely will take up to a week.

If no one else has brilliant ideas, I will start working on it.

Joe added a comment.Oct 28 2019, 5:07 PM

To better illustrate the cache issue:

{ echo 'stats items'; sleep 1; } | telnet mc1019 11211 | grep -F items:17:
STAT items:17:number 8433040
STAT items:17:age 88690
STAT items:17:evicted 446703589
STAT items:17:evicted_nonzero 446703231
STAT items:17:evicted_time 88690
STAT items:17:outofmemory 0
STAT items:17:tailrepairs 0
STAT items:17:reclaimed 395768512
STAT items:17:expired_unfetched 221048687
STAT items:17:evicted_unfetched 119428161
STAT items:17:crawler_reclaimed 0
STAT items:17:lrutail_reflocked 0

So it seems there are 8433040 objects in the cache. Trying to retrieve them, though, yields different results:

$ { echo 'stats cachedump 17 8433040'; sleep 1; } | telnet mc1019 11211 | wc -l
Connection closed by foreign host.
26174

Mentioned in SAL (#wikimedia-operations) [2019-10-28T17:11:41Z] <_joe_> starting rolling restart of memcached servers in eqiad, beginning with mc1019 T235188

Mentioned in SAL (#wikimedia-operations) [2019-10-28T18:41:35Z] <rlazarus> restarted memcached on mc1020 T235188

Mentioned in SAL (#wikimedia-operations) [2019-10-28T19:50:30Z] <rlazarus> restarted memcached on mc1021 (T235188)

@Joe Sorry, I only got this far down in my email; I see the restart is already in progress. For the future, what I meant above was that we're happy to take on writing the script and working through it. I just wanted to check whether we would need permissions for the access. Should we follow along with the rolling restart, or can we help out at all?

Mentioned in SAL (#wikimedia-operations) [2019-10-28T20:56:36Z] <cdanis> restart memcached on mc1022 T235188

Mentioned in SAL (#wikimedia-operations) [2019-10-29T07:01:58Z] <_joe_> restart memcached on mc1024-1036, 1 hour apart, via cumin (T235188)

Joe added a comment.Oct 30 2019, 6:38 AM

@WDoranWMF what I found out is that expunging values by prefix from memcached is impossible to do in a clean way without severely impacting performance.

So we just needed to perform the rolling restart, which is now done. That part of the cache is now clean.

Change 547696 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] objectcache: fix cache pollution during Multi* method preemptive refreshes

https://gerrit.wikimedia.org/r/547696

Is the parser cache purging hook in place yet? I need it to be able to run the cleanup script for translatable pages.

  1. Core Platform Team needs to look at using RejectParserCacheValue hook to selectively reject values for the affected Wikis during the corruption window - this is @Krinkle suggestion

Patch submitted. It wasn't clear to me which dates to purge, so I left it as WIP. [..]

[operations/mediawiki-config@master] RejectParserCacheValue to reject possibly-corrupted entries
https://gerrit.wikimedia.org/r/545647

I've updated the patch to use this as start date. However the end date for ParserCache requires Memcached to be purged. Once done, set it to the current-ish time.

… the rolling [memcached] restart is now done.

I've updated https://gerrit.wikimedia.org/r/545647 which is now ready to be rolled out.

CCicalese_WMF added a subscriber: CCicalese_WMF.

Removing Core Platform Team, since this is already on our Clinic Duty workboard. But, we believe there is no more work for us here. Or, were you looking for assistance deploying, @Krinkle?

Yeah, there is CR and deployment of the config, and potentially other steps I'm missing. I'm looking for you to continue to lead on this one and close the loop with Language engineering.

There is also an objectcache patch pending CR, which is to prevent the same thing from re-occurring in the future; that in turn would unblock some of the Revision refactor patches that Daniel/Marko were working on (which is what led to this incident originally).
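For background on the class of bug that the pending objectcache patch addresses, here is a deliberately simplified toy model in Python (not the actual WANObjectCache code, which is PHP, and all names here are hypothetical): if a preemptive refresh regenerates only a subset of the batch but pairs the callback's results with cache keys by position rather than by ID, values land under the wrong keys.

```python
# Toy illustration of multi-get preemptive-refresh cache pollution:
# pairing regenerated values with keys by position instead of by the
# id each value belongs to mis-stores them once the refreshed subset
# diverges from the original key order.

cache = {}

def fetch_batch(ids):
    # Stand-in for the regeneration callback: id -> fresh content.
    return {i: f"content-of-{i}" for i in ids}

def bad_refresh(all_keys, stale_ids):
    # BUG: zips the regenerated values against the full key list.
    values = list(fetch_batch(stale_ids).values())
    for key, value in zip(all_keys, values):
        cache[key] = value  # wrong key/value pairing

def good_refresh(key_for_id, stale_ids):
    # FIX: derive each cache key from the id the value belongs to.
    for i, value in fetch_batch(stale_ids).items():
        cache[key_for_id(i)] = value
```

In the broken variant, refreshing only item "b" of a two-item batch stores b's content under a's key, which matches the symptom in this task: one revision's text showing up under a different revision.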

Change 545647 abandoned by Anomie:
RejectParserCacheValue to reject possibly-corrupted entries

Reason:
James is correct, as far as I know. If all bad entries were before October 30, they should have expired as of last week.

https://gerrit.wikimedia.org/r/545647

Mentioned in SAL (#wikimedia-operations) [2019-12-11T09:04:51Z] <Nikerabbit> running Translate/refresh-translatable-pages.php --jobqueue for Translate wikis - T235027 T235188

Mentioned in SAL (#wikimedia-operations) [2019-12-11T10:34:14Z] <Nikerabbit> Finished running Translate/refresh-translatable-pages.php --jobqueue for Translate wikis - T235027 T235188

Nikerabbit updated the task description. (Show Details)Dec 11 2019, 10:43 AM

I finished my part of the cleanup. As far as I know, everything has now been done to address the issues caused by the cache pollution. There is still the follow-up of re-enabling the caching once it works reliably, so I'm handing the ball to Core Platform Team to handle the open patches linked to this task.

Thank you everyone for helping to debug a mysterious cache pollution bug and helping to clean up after it.

Change 542961 abandoned by Krinkle:
WANObjectCache: disable preemptive refresh for multi-get.

Reason:
Superseded by I239a3e1922f478c74c9 https://gerrit.wikimedia.org/r/547696

https://gerrit.wikimedia.org/r/542961

Change 547696 merged by jenkins-bot:
[mediawiki/core@master] objectcache: fix cache pollution in WANObectCache Multi* methods

https://gerrit.wikimedia.org/r/547696

Krinkle closed this task as Resolved.Feb 3 2020, 3:59 AM
Krinkle assigned this task to aaron.
Krinkle edited projects, added MediaWiki-Cache; removed MediaWiki-General.
Krinkle moved this task from Untriaged to libs/objectcache on the MediaWiki-Cache board.