Publish dataset about Memcached traffic for caching research
Open, Low, Public


Researchers in the caching community typically focus on one specific type of cache (such as a CDN cache or an in-memory cache) when they research and publish. However, different types of cache share certain similarities and also differ in fundamental ways, which we would like to explore and understand. We believe that understanding these similarities and differences can help improve caching performance in the future.

Wikimedia has made great contributions to CDN caching research in the past (see T225538, T128132, T144187). However, no trace is available for an in-memory caching service, so we would like to request a trace of the in-memory cache (memcache?).

Specifically, we are looking for a non-sampled cache trace that contains, for each request, a relative timestamp, an anonymized request key, the operation (set/get), and the size.
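To make the requested fields concrete, here is a minimal sketch of how raw request events could be turned into such a trace. It is only an illustration under stated assumptions: the input tuple layout, the `TraceRecord` type, the `SECRET_SALT` value, and the example key names are hypothetical and do not describe any existing Wikimedia tooling. Keys are anonymized with a keyed hash (HMAC-SHA256), so the same key always maps to the same identifier without revealing the original string.

```python
import hashlib
import hmac
from dataclasses import dataclass

# Hypothetical secret: it would be generated privately and never published,
# so that anonymized identifiers cannot be recomputed from guessed keys.
SECRET_SALT = b"replace-with-a-private-random-value"


@dataclass
class TraceRecord:
    rel_time: float  # seconds since the first request in the trace
    key_id: str      # anonymized request key
    op: str          # "get" or "set"
    size: int        # value size in bytes


def anonymize_key(key: str, salt: bytes = SECRET_SALT) -> str:
    """Map a cache key to a stable identifier using HMAC-SHA256."""
    digest = hmac.new(salt, key.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for compactness; an assumption, not a requirement


def to_trace(events):
    """Convert (abs_time, key, op, size) tuples into anonymized records."""
    records = []
    start = None
    for abs_time, key, op, size in events:
        if start is None:
            start = abs_time  # the first event defines time zero
        records.append(TraceRecord(abs_time - start, anonymize_key(key), op, size))
    return records


if __name__ == "__main__":
    # Hypothetical events; real key names and sizes will differ.
    sample = [
        (1000.0, "example:key:1", "get", 512),
        (1000.5, "example:key:2", "set", 2048),
    ]
    for record in to_trace(sample):
        print(record)
```

Using a keyed hash rather than a plain hash is a design choice: it keeps repeated accesses to the same key linkable, which caching research needs, while making it harder to confirm a guessed key name.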

As researchers, we are happy to contribute logging/tracing scripts if needed.

Event Timeline

1a1a11a created this task. Dec 11 2019, 8:46 PM
1a1a11a updated the task description. Dec 11 2019, 8:49 PM
Krinkle updated the task description. Feb 3 2020, 4:06 AM
Krinkle edited projects, added serviceops, Operations; removed MediaWiki-Cache.
Krinkle added a subscriber: Krinkle.

This is not a bug or feature request about MediaWiki core's ability to cache data. Rather it appears to be a request for data gathering and publication by WMF from the production Memcached and/or php-APCu services.

Joe added a subscriber: Joe. Feb 3 2020, 6:43 AM
jijiki triaged this task as Low priority. Feb 5 2020, 6:41 PM
jijiki added a subscriber: jijiki.
leila added a subscriber: leila.

I removed the Research tag as it refers to the work of the research team in WMF. However, if I can be of any help to the SRE team with this particular request, please ping me.

Krinkle renamed this task from Request for a in-memory caching data set for caching research to Research dataset about in-memory caching. Apr 18 2020, 11:21 PM
Krinkle renamed this task from Research dataset about in-memory caching to Publish dataset about Memcached traffic for caching research. Apr 18 2020, 11:23 PM

Narrowing scope to be about Memcached, which is presumably more widely applicable to the industry. Our Memcached cluster is horizontally scaled (sharded) and accessed by all web servers within a given data centre with values living up to a week or longer depending on restarts.

There is also an in-memory store local to each web server. It is restarted more regularly and might not be as interesting, but feel free to file a separate ticket for it.

Hi Krinkle, this is what we are looking for. It would be great to have such a dataset, even if it is sampled. Thank you! Meanwhile, may I ask what kind of data is stored in the Memcached cluster?

BTW, we can help contribute some log collection scripts if needed.

CDanis added a subscriber: CDanis. Apr 18 2020, 11:32 PM

[…] what kind of data is stored in memcached cluster?

Pretty much anything and everything you can imagine relating to the MediaWiki software and Wikipedia.

Some links that may be of use:

elukey added a subscriber: elukey. Apr 19 2020, 7:18 AM

Hi Krinkle, we would just like to check whether there is any update on this issue. To clarify the request: for the dataset, we are looking for the following information: "timestamp, key, size of value, TTL, operation". Thank you!
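As an illustration of what a file with these five fields could look like, the sketch below writes and reads them as CSV. The column names (`timestamp`, `key`, `value_size`, `ttl`, `operation`), the units, and the choice of CSV are assumptions made for this example only; nothing here describes an agreed format.

```python
import csv
import io

# Assumed column names for the five requested fields.
FIELDS = ["timestamp", "key", "value_size", "ttl", "operation"]


def write_trace(rows, out):
    """Write an iterable of dicts with the FIELDS keys as CSV to a text stream."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


def read_trace(src):
    """Yield typed records from a CSV text stream produced by write_trace."""
    for row in csv.DictReader(src):
        yield {
            "timestamp": float(row["timestamp"]),  # seconds (assumed unit)
            "key": row["key"],                     # already-anonymized key
            "value_size": int(row["value_size"]),  # bytes
            "ttl": int(row["ttl"]),                # seconds (assumed unit)
            "operation": row["operation"],         # e.g. "get" or "set"
        }


if __name__ == "__main__":
    # Hypothetical rows for demonstration.
    rows = [
        {"timestamp": 0.0, "key": "a1b2", "value_size": 512, "ttl": 3600, "operation": "set"},
        {"timestamp": 0.2, "key": "a1b2", "value_size": 512, "ttl": 3600, "operation": "get"},
    ]
    buffer = io.StringIO()
    write_trace(rows, buffer)
    buffer.seek(0)
    for record in read_trace(buffer):
        print(record)
```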