
Preemptive refresh in getMultiWithSetCallback() and getMultiWithUnionSetCallback() pollutes cache
Open, High, Public

Description

I updated translatewiki.net code two days ago (after a two week break).

Today I started getting reports that edits are going to the wrong pages. I was not able to reproduce it myself, but I found some history which looks wrong.

One example:

Click to view the version of December 5th 2017 [?oldid=XXX now shows correct content for this language because the cache was purged]. It was supposed to be a two-byte change overall (with a comment saying some word was replaced). However, looking at the diffs or viewing it directly, we see completely unrelated text:

https://translatewiki.net/w/i.php?title=MediaWiki:Wm-license-cecill-text/fr&diff=7714145&oldid=7692568

The same unrelated text is also seen at https://translatewiki.net/w/i.php?title=MediaWiki:Cx-notification-deleted-draft/fr&diff=9049005&oldid=9046073

Impact

  • Wrong content is displayed in page histories and diffs, both on translatewiki.net and on Wikimedia wikis with Translate
  • Wrong content is getting exported (not 100% sure if caused by this)

Current status

  • It has been determined that the database is okay; the issue lies with the caching of contents.
  • It is suspected that something is wrong with WANObjectCache::getMultiWithUnionSetCallback that causes wrong contents to be cached.
  • The problematic caching has been disabled to prevent further corruption of caches. This has been deployed to translatewiki.net and Wikimedia wikis.
  • Caches have been purged on translatewiki.net (multiple times)
  • Caches have not been purged on Wikimedia wikis
  • Badly cached entries are still findable on translatewiki.net. This hints that there may be other causes, or issues with the cache purging strategy. Running the purge script again DOES make the wrong content go away, so maybe there is yet another place where wrong content can be cached, which then repopulates these caches?

Event Timeline

There are a very large number of changes, so older changes are hidden.

According to:
https://www.php.net/manual/fr/functions.anonymous.php

The closure parameter "&$id" in getMultiWithUnionSetCallback() should not be passed by reference but by value (without the "&"): otherwise all closures created in a loop over multiple IDs share the same reference to the parent "$id" variable, instead of copying its current value into the closure (all closures created in the loop will be called much later). It then becomes impossible to process multiple IDs in a union for the same batch...
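
A minimal standalone sketch of the pitfall described above (an illustration, not the WANObjectCache code itself): closures created in a loop with use ( &$id ) all share one variable, whereas use ( $id ) snapshots the current value at each iteration.

	<?php
	// By-reference capture: all closures share the single $id variable, so when
	// they run later (as deferred batch callbacks do), they all see its final value.
	$byRef = [];
	$byVal = [];
	foreach ( [ 'a', 'b', 'c' ] as $id ) {
		$byRef[] = function () use ( &$id ) {
			return $id;
		};
		$byVal[] = function () use ( $id ) {
			return $id;
		};
	}
	foreach ( $byRef as $fn ) {
		echo $fn(); // prints "ccc": every closure reads the shared, final $id
	}
	echo "\n";
	foreach ( $byVal as $fn ) {
		echo $fn(); // prints "abc": each closure kept its own copy
	}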

lines 1811-1814:

		// Wrap $callback to match the getWithSetCallback() format while passing $id to $callback
		$id = null; // current entity ID
		$func = function ( $oldValue, &$ttl, &$setOpts, $oldAsOf )
			use ( $callback, &$id, $newValsById, $newTTLsById, $newSetOpts )
		{...}

How can the callback get the $id variable's value? It cannot come from this parent context; it must be passed explicitly to the callback itself.

line 1835:

			$values[$key] = $this->getWithSetCallback( $key, $ttl, $func, $opts );

The IDs are not known at this point; getWithSetCallback() will use its own local iterator with its own local $id variable, passing the $id it extracts from the iterator to the callback by value.

The use(&$id) on line 1817 is wrong: it is not the one effectively used on line 1699 in getMultiWithSetCallback() (which is called multiple times, once for each member of the union, each time with a different iterator and a different local variable), and it does not affect the value of $id declared and initialized to null on line 1812. This code thus uses different variables named $id in very different contexts.

Change 542962 abandoned by Daniel Kinzler:
SqlBlobStore: test caching behavior.

Reason:
no need to port this to the deployment branch

https://gerrit.wikimedia.org/r/542962

Change 542963 merged by MaxSem:
[mediawiki/core@wmf/1.35.0-wmf.1] SqlBlobStore HOT FIX: remove caching from getBlobBatch

https://gerrit.wikimedia.org/r/542963

Nikerabbit updated the task description. Oct 15 2019, 8:02 AM

The hotfix was emergency deployed to Wikimedia wikis yesterday. The caches still need to be purged. It's not clear how this will be done or who is going to do it.

Ata added a subscriber: Ata. Oct 15 2019, 5:24 PM
Verdy_p added a comment. Edited Oct 15 2019, 11:39 PM

Note that despite the (selective?) cache purge on translatewiki.net, an incorrect cached version (including "Le champ est obsolète. $1", but not only this text) is still recurring (look at some of my edit history on TWN for comments mentioning "T235188").

I think that disabling Memcached for Fuzzybot's edits (using SQL requests directly) is still not deployed on TWN (or disabling it causes major performance degradation on that smaller server).


Side note:

I already observe severe performance or access degradation when looking for a set of 100 messages to review in large message modules (such as MediaWiki core, which is also a clear sign that this set is way too large now, with over 35000 messages, and that it should be modularized like Phabricator; no translation module should exceed about 500 messages):

Trying to load them frequently results in a server-side error from overly long requests just to find the messages that still need work for the connected user, for new translations or for reviews. Memcached is supposed to help, but it also fails (or its eviction policy has problems, or Memcached has some memory leaks causing unexpected disk I/O for swapping; in that case, Memcached's memory usage and swap I/O should also be monitored on its instance server, in case they cause excessive delays for normal operations in the Translate UI when loading many messages with user-specific state filters).

I don't know if this can be the cause of the problems now found on TWN for languages with high levels of completion (notably French and Russian), or if Memcached swapping I/O can be the cause of the problems now reported on Wikimedia wikis with many translations and high completion levels for languages with active translators, possibly causing Memcached to unexpectedly report empty data for keys that are supposed to be present (in that case, the Translate extension may be assuming that such requests to Memcached can never fail when reloading the same data multiple times for several validation steps in large batches of messages).

Nikerabbit updated the task description. Oct 16 2019, 6:55 AM
Piramidion added a comment. Edited Oct 16 2019, 7:42 AM

Yesterday I encountered a message on Meta-wiki that had been showing wrong content. Unlike other previously reported messages, it wasn't properly purged after the hotfix was deployed. I managed to purge it with a real edit instead of a null one (added a character > saved, removed the character > saved), if this info helps in any way.

The hotfix by itself didn't purge anything. Purging needs to be done separately and no mass purge has yet been done on Wikimedia wikis.

The wrong content is still visible in diffs for example: https://meta.wikimedia.org/w/index.php?title=Translations:Privacy_policy/286/uk&diff=prev&oldid=19461426

Change 543730 had a related patch set uploaded (by Krinkle; owner: Daniel Kinzler):
[mediawiki/core@wmf/1.35.0-wmf.2] SqlBlobStore HOT FIX: remove caching from getBlobBatch

https://gerrit.wikimedia.org/r/543730

The master patch was backported to wmf.1 and +2'ed in master before wmf.2 was cut, but it got lost in Zuul land somehow. This must be deployed before group2.

Interesting, I've seen a similar caching issue where revision content got mixed up on a 1.33 wiki with $wgMultiContentRevisionSchemaMigrationStage = SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_NEW;. I haven't yet been able to consistently reproduce it, though.

Huh... I'd love to know more about this. Was that with external storage enabled, or without? What was the caching config?

Just to follow up on this, the issue I was encountering was due to an unrelated misconfiguration. I was a bit too quick to connect the dots here. Sorry about the false alarm.

Change 542328 merged by jenkins-bot:
[mediawiki/core@master] SqlBlobStore HOT FIX: remove caching from getBlobBatch

https://gerrit.wikimedia.org/r/542328

Piramidion added a comment. Edited Oct 17 2019, 9:25 PM

Privacy policy/Proposed Revisions/uk on Meta-wiki is broken; the bug is still there (just sayin')

UPD: it seems that I'm stuck again: Privacy policy/FAQ/uk is broken too.

An observation:

  1. In both of the cases above, next to a correct translation of a correct item, an incorrect one is (randomly?) inserted. And that incorrect one is the very previous translated message (you can even see it in the left diff column, in the summary field). There are also other similar examples: diff1, diff2, or even this one, without an actual translation (I guess it was a null edit)
  2. I noticed that when I open my contributions and look at my translations, sometimes an edit made to a target /uk page is shown as if it was done earlier than the actual creation of the page in the Translations: namespace. I wonder what the timestamps for those actions are, and whether there's some kind of lag in the creation of pages in the Translations: namespace. Say, this edit is shown in my contributions as done earlier than the creation of this page
greg added a subscriber: greg. Oct 17 2019, 9:38 PM

The hotfix by itself didn't purge anything. Purging needs to be done separately and no mass purge has yet been done on Wikimedia wikis.

Is this still the case? Who is working on the cache purging?

jeena added a subscriber: jeena. Oct 17 2019, 11:57 PM
jeena added a comment. Edited Oct 18 2019, 12:03 AM

I think the hotfix still needs to be merged to 1.35.0-wmf.2 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/543730

@WDoranWMF can you confirm?

I am also impacted: translating Help:New_filters_for_edit_review/Filtering/fr produces a mess in the generated text. IMPORTANT => translation must be blocked if edits are subject to being cancelled later, to avoid useless work!

https://www.mediawiki.org/w/index.php?title=Help:New_filters_for_edit_review/Filtering/fr&action=edit&undoafter=3462762&undo=3462764 which, after the last two attempts, shows unwanted duplicated text as follows:

or incoherence in the correspondence with a previously correct translated text:

I think the hotfix still needs to be merged to 1.35.0-wmf.2 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/543730

Yes, I believe @Krinkle added this as a blocker for the train to ensure https://gerrit.wikimedia.org/r/c/mediawiki/core/+/543730 gets deployed. After that, this bug can be removed as a blocker for the train, but it should be kept open until the caches have been purged.

I don't know if this is related, but there's now some very strange behavior with lost sessions somewhere in the wiki.

https://translatewiki.net/wiki/Thread%3ASupport/TranslateUserManager

You can see strange behavior of "TranslateUserManager" creating log entries with repeated (and public) incorrect statements, which are also privacy breaches for connected users. Internally there were incorrect attempts by TranslateUserManager to automatically create a user account from my IPv6 address. This could occur when the server loses the session and I need to log in again. It may just be that it failed to get some session cookie, or there may be simultaneous random uses of IPv4 and IPv6 caused by background JSON requests using a different DNS resolver somewhere in the JavaScript framework.

But it may also be another related problem of synchronisation between the main server and the cache servers. At least this gives a start date for this issue: 15 August, to be correlated with recent version updates on translatewiki.net and incorrect assumptions in servers (like: the same connected user cannot send requests from BOTH IPv4 and IPv6; or, when this happens, the incorrect assumption that there's a one-to-one mapping).

I have not specifically chosen IPv4 or IPv6; both may be used with HTTPS, but I note that the server now uses MIXED content over HTTP and HTTPS, and their routing may be very different. Also, this date seems to coincide with a change in Google Chrome, which now wants to enforce the switch from HTTP to HTTPS whenever it works, even if the requests are initially instructed otherwise (in JSON requests or XMLHttpRequests). I note that my Google Chrome constantly displays deprecation warnings.

Be aware that mixed HTTP/HTTPS content will soon be banned and is being phased out quite aggressively. And many JavaScript frameworks have to be updated (including for example jQuery, but also MediaWiki itself, and requests to all external sites, including Wikimedia Commons for files displayed on translatewiki.net from an unrelated domain).

Maybe this new bug reported in TWN Support should have its own bug here in Phabricator; in that case, move this message to the appropriate place.


The last comment has nothing to do with this issue.

Change 543730 merged by jenkins-bot:
[mediawiki/core@wmf/1.35.0-wmf.2] SqlBlobStore HOT FIX: remove caching from getBlobBatch

https://gerrit.wikimedia.org/r/543730

Mentioned in SAL (#wikimedia-operations) [2019-10-21T11:25:54Z] <mobrovac@deploy1001> Synchronized php-1.35.0-wmf.2/includes/Storage/SqlBlobStore.php: SqlBlobStore HOT FIX: remove caching from getBlobBatch; file 1/3 - T235188 (duration: 01m 00s)

Mentioned in SAL (#wikimedia-operations) [2019-10-21T11:28:05Z] <mobrovac@deploy1001> Synchronized php-1.35.0-wmf.2/includes/libs/objectcache/wancache/WANObjectCache.php: SqlBlobStore HOT FIX: remove caching from getBlobBatch; file 2/3 - T235188 (duration: 00m 59s)

Mentioned in SAL (#wikimedia-operations) [2019-10-21T11:30:35Z] <mobrovac@deploy1001> Synchronized php-1.35.0-wmf.2/tests/phpunit/includes/Storage/SqlBlobStoreTest.php: SqlBlobStore HOT FIX: remove caching from getBlobBatch; file 3/3 - T235188 (duration: 01m 00s)

mobrovac lowered the priority of this task from Unbreak Now! to High. Mon, Oct 21, 11:31 AM
mobrovac added a subscriber: mobrovac.

Lowering the priority as the hot fix has been applied to all production branches.

Verdy_p added a comment. Edited Mon, Oct 21, 9:51 PM
In T235188#5590245, @Nikerabbit wrote:

The last comment has nothing to do with this issue.

Do you mean "TranslateUserManager" trying repeatedly to create a fake account on my IPv6 address, even though I'm logged in with my named account?
It does that about 4 or 5 times each day I'm connected, and has since August 15. It probably does that because it temporarily cannot get some optional temporary cookie that was not renewed correctly, or because session cookies are renewed in a different order than expected when I log on, or because cookies get flushed quite often on my PC if they are not used for more than about one hour.
The configuration has been the same since August 15. I think it may be related to the recent change in Chrome that now enforces HTTPS as much as it can, as well as using its own DNS client over HTTPS (but it may fall back to Google's DNS servers at 8.8.8.8, with problems for some originating IPv6 addresses).
There are still not a lot of people with "native" IPv4+IPv6 internet connections (in fact, IPv4 on my ISP now goes through an internal IPv4-over-IPv6 tunnel created by the ISP's router; IPv6 is now preferred by my ISP for its gigabit fiber accesses). But I do not trust my ISP's DNS and have configured the router to use OpenDNS instead (and that's also what is broadcast in my DHCP configuration).
Or could it be a recent defect of OpenDNS with strange responses for Wikimedia and TWN sites ?

How is it possible that my IPv6 address is shown in fake account creations (that in fact never occurred) by TranslateUserManager, when I'm in fact logged in? Such public logging should never occur, as it breaches the Wikimedia policies, so this is a severe bug.

Is it related to https://phabricator.wikimedia.org/T236011 that you just created yesterday after my message ?
(which was initially reported for IPv4, but now occurs as well for IPv6)

@Verdy_p Yes, TranslateUserManager is not related to this issue in any way. You are spamming dozens of people subscribed to this task with irrelevant updates. Please stop it. I have already replied on translatewiki.net and filed T236011: Some newusers log entries show the IP address of the currently viewing user.

Pols12 added a subscriber: Pols12. Wed, Oct 23, 12:00 AM

Lowering the priority as the hot fix has been applied to all production branches.

So is it supposed to be unnoticeable on wikis currently? Is the wrong content in edit mode another bug?

So is it supposed to be unnoticeable on wikis currently? Is the wrong content in edit mode another bug?

Like I wrote above, the hotfix does not purge caches. It only prevents further cache corruption.

Current state: The root cause that was causing new pages and revisions to be broken has been plugged. This means we are no longer expanding the range of corrupted content.

Next step: Repair the corruption that has occurred during the past two weeks. In particular:

  • Revision text cache in Memcached: Maybe slowly drop all memc keys that are prefixed with one of the affected wikis (those having Translate installed) and contain the sub-key component for the revision text cache. This isn't something we've done recently, so it will need to be done carefully and in coordination with ServiceOps to monitor the impact on Memc.
    • This affects page views of the corrupted page.
    • This affects loading the editor on the corrupted page. As such, even though the root cause was fixed, there can still be new corruption happening when people save edits to these pages, due to loading the wrong wikitext to start the edit with. I imagine this also affects bots, which wouldn't know that it is the wrong content.
    • This affects parsing of existing pages if they transclude the corrupted page.
  • ParserOutput: You can probably use the RejectParserCacheValue hook on these wikis to reject caches saved between 2 weeks ago and now (where now = after Memc is fixed).
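
For illustration, a minimal sketch of what such a hook could look like in wiki configuration. The window timestamps are placeholders (the actual config patch, uploaded later in this thread, left them as "FILL THIS IN"), and this is not necessarily the code that was deployed:

	<?php
	// Hypothetical RejectParserCacheValue handler: reject any ParserCache entry
	// saved during the assumed corruption window, forcing a reparse. In production
	// this would also be restricted to the affected wikis.
	$wgHooks['RejectParserCacheValue'][] = function ( $value, $wikiPage, $popts ) {
		$windowStart = '20191009000000'; // placeholder start date, TS_MW format
		$windowEnd = '20191030000000';   // placeholder: set after Memcached is purged
		$cachedAt = $value->getCacheTime(); // TS_MW, so string comparison is safe
		if ( $cachedAt >= $windowStart && $cachedAt < $windowEnd ) {
			return false; // reject this cached entry
		}
		return true;
	};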
WDoranWMF added a subscriber: Holger. Edited Wed, Oct 23, 5:52 PM

Adding serviceops, based on conversation with @Krinkle. The outstanding pieces are:

  1. Fix for WANCache that @aaron is working on, to facilitate the redeploy of the MCR patch
  2. Core Platform Team needs to look at using the RejectParserCacheValue hook to selectively reject values for the affected wikis during the corruption window - this is @Krinkle's suggestion
  3. Core Platform Team will need support from serviceops to purge the affected cache - @Krinkle suggests we'll also need to provide a means to identify the items to be purged.

@Anomie Can you take a look at 2? @Holger, can you work with @Anomie to get a signature for 3, to allow purging?

serviceops: what will you need from us / what's the process for purging the corrupt items?

Change 545647 had a related patch set uploaded (by Anomie; owner: Anomie):
[operations/mediawiki-config@master] RejectParserCacheValue to reject possibly-corrupted entries

https://gerrit.wikimedia.org/r/545647

  1. Core Platform Team needs to look at using the RejectParserCacheValue hook to selectively reject values for the affected wikis during the corruption window - this is @Krinkle's suggestion

Patch submitted. It wasn't clear to me which dates to purge, so I left it as WIP. Replace the two "FILL THIS IN" strings with the appropriate dates in TS_MW format.

The SqlBlobStore code using the bad WANObjectCache method was merged in 1.34.0-wmf.21, which was much more than 2 weeks ago. But maybe nothing was using the affected SqlBlobStore code path until more recently? That's not clear to me.

  1. Core Platform Team will need support from serviceops to purge the affected cache - @Krinkle suggests we'll also need to provide a means to identify the items to be purged.

The "revision text cache" is probably any key beginning with "global:SqlBlobStore-blob:$id:" where $id is the wiki id any of the affected wikis. If we want to go back before 1.34.0-wmf.23, also "global:BlobStore:address:$id:".

RevisionStore's cache shouldn't be affected, as it doesn't cache the blob content. At a quick glance through I didn't see any other relevant layers of caching.

@Krinkle Would you have time to review the patch?

The corrupted entries started happening in wmf/1.35.0-wmf.1 when Translate started using that interface.

Apart from the revision text cache, I also recommend that the diff cache be purged, since wrong contents have been cached there as well.

Joe added a subscriber: Joe. Thu, Oct 24, 10:58 AM

serviceops: what will you need from us / what's the process for purging the corrupt items?

There is no existing process for doing this. We will need to get inventive if there is no way to enumerate the cache keys from software (that is, from MediaWiki), and it will require significant time.

From what @Anomie said, it should be possible to go to all 18 memcached servers, somehow dump the list of all keys, filter for the ones we want by prefix (but we'll need a more precise list), and delete them all with a script.

Anything new that could be shared with communities?

jijiki added a subscriber: jijiki. Thu, Oct 24, 3:54 PM
Krinkle added a comment. Edited Thu, Oct 24, 4:28 PM
  1. Core Platform Team needs to look at using the RejectParserCacheValue hook to selectively reject values for the affected wikis during the corruption window - this is @Krinkle's suggestion

Patch submitted. It wasn't clear to me which dates to purge, so I left it as WIP. [..]

[operations/mediawiki-config@master] RejectParserCacheValue to reject possibly-corrupted entries
https://gerrit.wikimedia.org/r/545647

I've updated the patch to use this start date. However, the end date for ParserCache should not be set to the above, as the issue also involves Memcached, which is still to be purged. Once that is done, set it to the current-ish time.

aaron renamed this task from Some revisions' contents are incorrect in the cache - wrong contents shown in history & diffs to Preemptive refresh in getMultiWithSetCallback() and getMultiWithUnionSetCallback() pollutes cache. Thu, Oct 24, 5:05 PM

@Joe Apologies for the following newb questions - is access to those memcached instances through SRE? And if so, what do you want us to prepare for you to make this as straightforward as possible on the SRE side?

Would it be most helpful for us to:

  1. Set the clear criteria for the keys to be removed
  2. Define how to dump and filter the keys
  3. Provide the scripts to do the above

Is there an instance we can use to test the above against?

From what @Anomie said, it should be possible to go on all the 18 memcached servers, dump somehow the list of all keys, filter for the ones we want by prefix (but we'll need a more precise list) and delete them all with a script.

Looks like the list of prefixes is

global:SqlBlobStore-blob:advisorswiki: 
global:SqlBlobStore-blob:amwikimedia: 
global:SqlBlobStore-blob:bdwikimedia: 
global:SqlBlobStore-blob:betawikiversity: 
global:SqlBlobStore-blob:bewikimedia: 
global:SqlBlobStore-blob:brwikimedia: 
global:SqlBlobStore-blob:cawikimedia: 
global:SqlBlobStore-blob:collabwiki: 
global:SqlBlobStore-blob:commonswiki: 
global:SqlBlobStore-blob:frwiktionary: 
global:SqlBlobStore-blob:hiwikimedia: 
global:SqlBlobStore-blob:idwikimedia: 
global:SqlBlobStore-blob:incubatorwiki: 
global:SqlBlobStore-blob:legalteamwiki: 
global:SqlBlobStore-blob:maiwikimedia: 
global:SqlBlobStore-blob:mediawikiwiki: 
global:SqlBlobStore-blob:metawiki: 
global:SqlBlobStore-blob:nowikimedia: 
global:SqlBlobStore-blob:otrs_wikiwiki: 
global:SqlBlobStore-blob:outreachwiki: 
global:SqlBlobStore-blob:punjabiwikimedia: 
global:SqlBlobStore-blob:ruwikimedia: 
global:SqlBlobStore-blob:sourceswiki: 
global:SqlBlobStore-blob:specieswiki: 
global:SqlBlobStore-blob:testcommonswiki: 
global:SqlBlobStore-blob:testwiki: 
global:SqlBlobStore-blob:testwikidatawiki: 
global:SqlBlobStore-blob:uawikimedia: 
global:SqlBlobStore-blob:wbwikimedia: 
global:SqlBlobStore-blob:wikidatawiki: 
global:SqlBlobStore-blob:wikimania2012wiki: 
global:SqlBlobStore-blob:wikimania2013wiki: 
global:SqlBlobStore-blob:wikimania2014wiki: 
global:SqlBlobStore-blob:wikimania2015wiki: 
global:SqlBlobStore-blob:wikimania2016wiki: 
global:SqlBlobStore-blob:wikimania2017wiki: 
global:SqlBlobStore-blob:wikimania2018wiki: 
global:SqlBlobStore-blob:wikimaniawiki:

If you want to purge diff caches too as suggested in T235188#5602261, it looks like that would be

advisorswiki:diff:
advisorswiki:inline-diff:
amwikimedia:diff:
amwikimedia:inline-diff:
bdwikimedia:diff:
bdwikimedia:inline-diff:
betawikiversity:diff:
betawikiversity:inline-diff:
bewikimedia:diff:
bewikimedia:inline-diff:
brwikimedia:diff:
brwikimedia:inline-diff:
cawikimedia:diff:
cawikimedia:inline-diff:
collabwiki:diff:
collabwiki:inline-diff:
commonswiki:diff:
commonswiki:inline-diff:
frwiktionary:diff:
frwiktionary:inline-diff:
hiwikimedia:diff:
hiwikimedia:inline-diff:
idwikimedia:diff:
idwikimedia:inline-diff:
incubatorwiki:diff:
incubatorwiki:inline-diff:
legalteamwiki:diff:
legalteamwiki:inline-diff:
maiwikimedia:diff:
maiwikimedia:inline-diff:
mediawikiwiki:diff:
mediawikiwiki:inline-diff:
metawiki:diff:
metawiki:inline-diff:
nowikimedia:diff:
nowikimedia:inline-diff:
otrs_wikiwiki:diff:
otrs_wikiwiki:inline-diff:
outreachwiki:diff:
outreachwiki:inline-diff:
punjabiwikimedia:diff:
punjabiwikimedia:inline-diff:
ruwikimedia:diff:
ruwikimedia:inline-diff:
sourceswiki:diff:
sourceswiki:inline-diff:
specieswiki:diff:
specieswiki:inline-diff:
testcommonswiki:diff:
testcommonswiki:inline-diff:
testwikidatawiki:diff:
testwikidatawiki:inline-diff:
testwiki:diff:
testwiki:inline-diff:
uawikimedia:diff:
uawikimedia:inline-diff:
wbwikimedia:diff:
wbwikimedia:inline-diff:
wikidatawiki:diff:
wikidatawiki:inline-diff:
wikimania2012wiki:diff:
wikimania2012wiki:inline-diff:
wikimania2013wiki:diff:
wikimania2013wiki:inline-diff:
wikimania2014wiki:diff:
wikimania2014wiki:inline-diff:
wikimania2015wiki:diff:
wikimania2015wiki:inline-diff:
wikimania2016wiki:diff:
wikimania2016wiki:inline-diff:
wikimania2017wiki:diff:
wikimania2017wiki:inline-diff:
wikimania2018wiki:diff:
wikimania2018wiki:inline-diff:
wikimaniawiki:diff:
wikimaniawiki:inline-diff:
Joe added a comment. Mon, Oct 28, 2:54 PM

@Joe Apologies for the following newb questions - is access to those memcached instances through SRE? And if so, what do you want us to prepare for you to make this as straightforward as possible on the SRE side?
Would it be most helpful for us to:

  1. Set the clear criteria for the keys to be removed
  2. Define how to dump and filter the keys
  3. Provide the scripts to do the above

Is there an instance we can use to test the above against?

I don't think we're strictly needed in order to do this. Or better: someone should write a script that:

  • connects to memcached on $host:$port (given on the CLI)
  • cycles through all keys in the server, using one of the techniques described in https://www.darkcoding.net/software/memcached-list-all-keys/
  • deletes keys that match some criteria, this time connecting via mcrouter and sending the deletions to all datacenters.

We can do that but I think we're only needed, strictly speaking, to run the command. I'll get back to you in a few.
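
For illustration, a rough sketch of such a script in PHP, under the assumptions above; the host, port, and prefix list are examples only. Note that PHP's Memcached::getAllKeys() relies on the same key-enumeration mechanism whose unreliability the next comment demonstrates:

	<?php
	// Hypothetical one-off cleanup script, not a tested tool: enumerate keys on
	// one memcached server and delete those matching the corrupted-cache prefixes.
	$prefixes = [
		'global:SqlBlobStore-blob:metawiki:', // example prefixes from the lists above
		'metawiki:diff:',
		'metawiki:inline-diff:',
	];

	$mc = new Memcached();
	$mc->addServer( $argv[1] ?? 'localhost', (int)( $argv[2] ?? 11211 ) );

	$keys = $mc->getAllKeys(); // uses "stats cachedump"; may be incomplete
	if ( $keys === false ) {
		fwrite( STDERR, "Could not enumerate keys\n" );
		exit( 1 );
	}

	$doomed = [];
	foreach ( $keys as $key ) {
		foreach ( $prefixes as $prefix ) {
			if ( strpos( $key, $prefix ) === 0 ) {
				$doomed[] = $key;
				break;
			}
		}
	}
	// In production the deletions would instead be sent via mcrouter so that
	// all datacenters are purged.
	$mc->deleteMulti( $doomed );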

Joe added a comment. Mon, Oct 28, 4:56 PM

After some digging, it appears impossible to reliably get all keys from a memcached server as loaded as our production ones.

Whether using stats cachedump by hand or memcdump from libmemcached-tools, I get only around 3M keys. If I look at the stats, though, the server seems to have 44M items.

In this situation, I guess our best bet is to perform a rolling restart of the memcached servers, which will take up to 1 week to perform safely, because it will wipe part of our memcached keys.

If no one else has brilliant ideas, I will start working on it.

Joe added a comment. Mon, Oct 28, 5:07 PM

To show the cache issue better:

{ echo 'stats items'; sleep 1; } | telnet mc1019 11211 | grep -F items:17:
STAT items:17:number 8433040
STAT items:17:age 88690
STAT items:17:evicted 446703589
STAT items:17:evicted_nonzero 446703231
STAT items:17:evicted_time 88690
STAT items:17:outofmemory 0
STAT items:17:tailrepairs 0
STAT items:17:reclaimed 395768512
STAT items:17:expired_unfetched 221048687
STAT items:17:evicted_unfetched 119428161
STAT items:17:crawler_reclaimed 0
STAT items:17:lrutail_reflocked 0

So it seems there are 8433040 objects in the cache. Trying to retrieve them, though, yields different results:

$ { echo 'stats cachedump 17 8433040'; sleep 1; } | telnet mc1019 11211 | wc -l
Connection closed by foreign host.
26174

Mentioned in SAL (#wikimedia-operations) [2019-10-28T17:11:41Z] <_joe_> starting rolling restart of memcached servers in eqiad, beginning with mc1019 T235188

Mentioned in SAL (#wikimedia-operations) [2019-10-28T18:41:35Z] <rlazarus> restarted memcached on mc1020 T235188

Mentioned in SAL (#wikimedia-operations) [2019-10-28T19:50:30Z] <rlazarus> restarted memcached on mc1021 (T235188)

@Joe Sorry, I only got this far down in my email; I guess the restart is in progress now. For the future: what I meant above was that we're happy to take on writing the script and working through it. I just wanted to check whether we'd need perms for the access. Do you want us to follow the rolling restart, or can we help out at all?

Mentioned in SAL (#wikimedia-operations) [2019-10-28T20:56:36Z] <cdanis> restart memcached on mc1022 T235188

Mentioned in SAL (#wikimedia-operations) [2019-10-29T07:01:58Z] <_joe_> restart memcached on mc1024-1036, 1 hour apart, via cumin (T235188)

Joe added a comment. Wed, Oct 30, 6:38 AM

@WDoranWMF What I found out is that expunging values by prefix from memcached is impossible to do cleanly without severely impacting performance.

So we just needed to perform the rolling restart, which is now done. That part of the cache is now clean.

Change 547696 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] objectcache: fix cache pollution during Multi* method preemptive refreshes

https://gerrit.wikimedia.org/r/547696
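
For illustration only (this is not necessarily what change 547696 does): one way to avoid the shared-reference hazard during deferred preemptive refreshes is to bind each ID by value into its own wrapper closure, so a refresh that fires long after the loop still sees the ID it was built for:

	<?php
	// Sketch of per-key binding; makeKeyCallback() is a hypothetical helper.
	function makeKeyCallback( $id, callable $callback ) {
		return function ( $oldValue, &$ttl, &$setOpts, $oldAsOf ) use ( $id, $callback ) {
			// $id is a private copy here, immune to later loop iterations.
			// $ttl and $setOpts still flow through by reference if $callback
			// declares those parameters by reference.
			return $callback( $id, $oldValue, $ttl, $setOpts, $oldAsOf );
		};
	}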

Is the parser cache purging hook in place yet? I need it to be able to run the cleanup script for translatable pages.