Dec 8 2019
Dec 4 2019
Thank you for setting up the detailed description on wikitech. This is really great!
Nov 21 2019
@lexnasser this is great, thank you! I really like the idea of a dedicated wikitech page. I can contribute some parsing and processing scripts in the future, and we might link them there, too (if people are interested).
Nov 14 2019
Thank you, @Nuria . Having the upload data set without a save flag makes perfect sense and is great!
Nov 12 2019
Thanks @lexnasser , that's very helpful. Is there a way to check how frequently submit queries happen for hosts other than upload?
@lexnasser : is this for a text_cache or an upload_cache (like cp5006)? I expect that only text caches (like cp5008) would see submit queries.
@lexnasser : yes I can confirm that the fields (except "save") are: relative_unix, hashed_host_path_query, image_type, response_size, time_firstbyte
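For anyone who wants to start scripting against the dump, here's a minimal reader sketch (the tab-separated layout is my assumption; adjust to the actual file format):

```python
import csv

FIELDS = ["relative_unix", "hashed_host_path_query", "image_type",
          "response_size", "time_firstbyte"]

# Minimal reader sketch (assumption: TSV rows in the field order above).
def read_trace(path):
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            rec = dict(zip(FIELDS, row))
            rec["relative_unix"] = int(rec["relative_unix"])
            rec["response_size"] = int(rec["response_size"])
            rec["time_firstbyte"] = float(rec["time_firstbyte"])
            yield rec
```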
Nov 9 2019
Thank you for your comment. I agree that with two records, this would be somewhat hard. But I believe that we don't need the second record, just the first. The goal is to detect writes (via submit requests), which we assume invalidate the cache immediately.
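To make this concrete, here's a minimal sketch of the replay logic I have in mind (field names as in the dataset above; the is_submit flag is hypothetical and would be derived from how submit requests are marked in the trace):

```python
# Sketch: replay the trace and treat each submit (write) request as an
# immediate invalidation of the cached object.
cache = {}  # hashed_host_path_query -> size (or other metadata)

def process(req):
    key = req["hashed_host_path_query"]
    if req.get("is_submit"):       # hypothetical write marker
        cache.pop(key, None)       # write: drop the entry right away
        return "invalidate"
    if key in cache:
        return "hit"
    cache[key] = req["response_size"]
    return "miss"
```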
Nov 8 2019
Ok, I looked up the current Varnish/ATS server assignment in puppet/conftool-data/node.
I have some thoughts and questions about the overall dataset / the example above, and about the save flag.
Nov 2 2019
Relative timestamps are a great idea and absolutely sufficient for caching research. I should've come up with that myself and mentioned it earlier.
Oct 29 2019
Let me first answer why we need a timestamp field. At a high level, the goal of most caching research projects is to come up with a new algorithm, compare it to the original system, and show that we get better performance. Timestamps occur in both parts of this process. There are many algorithms that explicitly rely on timestamps as an input, so without timestamps we cannot simulate these algorithms. And for the actual comparison of systems, timestamps are necessary to compute almost all of the interesting performance metrics.
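As a small example of what relative timestamps already enable (field names as in the dataset discussed above):

```python
from collections import defaultdict

# Sketch: compute per-object inter-arrival times from relative timestamps.
# Many admission/eviction policies and most performance metrics build on
# exactly this kind of time-axis information.
def interarrival_times(trace):
    last_seen, gaps = {}, defaultdict(list)
    for req in trace:
        t, key = req["relative_unix"], req["hashed_host_path_query"]
        if key in last_seen:
            gaps[key].append(t - last_seen[key])
        last_seen[key] = t
    return gaps
```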
Oct 3 2019
Aug 1 2019
@Nuria thank you for clarifying the timeline and sorry for my misunderstanding of the quarterly schedule.
Jul 25 2019
Jun 14 2019
Thank you for your feedback, @Nuria , that's very helpful.
I forgot to add that the source of the query above is the readme file from T128132 . I've adapted it to 2019. Also, as indicated in the task description, it would be very helpful to cover two different servers, i.e., to repeat the query for
x_cache like '%cp3033%'
x_cache like '%cp3034%'
@fdans thank you for responding so quickly. Gathering the data is not very urgent. It would be most helpful if we could gather the data before the annual Fall paper deadlines (early September). Otherwise, people won't be able to see papers written based on this data until March 2021 (due to the annual publication and presentation cycle: submit in Fall, present in Spring).
Jun 11 2019
Nov 8 2016
Here's a graph of the hit ratio for various cache sizes.
Nov 7 2016
Here's some VCL/inline C that implements the exp-size admission policy for Varnish.
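For readers who don't use Varnish, the same rule in a quick Python sketch (assumption: exp-size admits an object with probability e^(-size/c), where c is a tunable size parameter, not a value from this task):

```python
import math
import random

# Sketch of size-exponential admission: small objects are almost always
# admitted, large objects only rarely.
def exp_size_admit(size_bytes, c=256 * 1024):
    return random.random() < math.exp(-size_bytes / c)
```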
Ok, here are the new results for cache sizes between 50GB and 400GB. For now, I only looked at the Filter and Exp admission policies.
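For reference, a simplified sketch of this kind of simulation loop (not the actual simulator: LRU eviction with a pluggable admission policy, field names as in the proposed dataset):

```python
from collections import OrderedDict

def simulate(trace, cache_bytes, admit=lambda size: True):
    """Replay a trace and return the object hit ratio (OHR)."""
    cache, used, hits, total = OrderedDict(), 0, 0, 0
    for req in trace:
        total += 1
        key, size = req["hashed_host_path_query"], req["response_size"]
        if key in cache:
            hits += 1
            cache.move_to_end(key)        # LRU: refresh recency
            continue
        if not admit(size):               # admission policy (Filter, Exp, ...)
            continue
        while cache and used + size > cache_bytes:
            _, evicted_size = cache.popitem(last=False)  # evict LRU victim
            used -= evicted_size
        if used + size <= cache_bytes:
            cache[key] = size
            used += size
    return hits / total if total else 0.0
```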
Sep 24 2016
Happy to take input on what to simulate next. Each simulation run takes about 48h.
- 128 GB front mem cache? What's the actual size of cp4006?
- disk cache (750 GB)?
- more caching policies (I have a ton already, and I'm willing to implement more)?
I finally got some simulation results.
Sep 13 2016
I have finally been able to take a look at the data set: it's great - exactly what we need to analyze caching performance.
Sep 1 2016
Aug 31 2016
I have been working for some time with Ramesh on cache admission policies for Akamai workloads, and I can contribute the following takeaways:
- frequency-based admission - admitting an object only after N requests - often works much better when checking not only for one-hit-wonders but also for two-hit-wonders, etc. I have often seen optimal N values between 4 and 16, though I cannot yet say what works best for WMF's workloads. (N=1 is easy to implement when one chooses Bloom filters, but N>1 requires more work, as "counting" Bloom filters are a lot harder to manage.)
- a new idea that simplifies frequency-based admission is probabilistic admission: admit with probability 1/N. In expectation (geometric distribution), this means that objects get admitted after N requests without any need for bookkeeping (data structures, etc.). In our experiments, this achieved 98% of the hit ratio of frequency-based admission. (Both variants are sketched after this list.)
- for many CDN caching hierarchies, it makes sense to focus the front/memory cache on maximizing the object hit ratio (OHR), whereas the back/disk cache focuses on the byte hit ratio (BHR). This essentially means that the front/memory cache caches mostly small objects, which significantly decreases the eviction volume, so that it serves a higher fraction of requests (but possibly fewer bytes). For the back/disk cache, this has the advantage that there are fewer random reads and that the average request is larger, which additionally increases the sequentiality of disk/SSD reads and improves throughput.
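Here are quick sketches of the first two ideas (simplified: a plain Counter stands in for the counting Bloom filter that would bound memory in practice):

```python
import random
from collections import Counter

class FrequencyAdmission:
    """Admit an object on its N-th request. A Counter stands in for a
    counting Bloom filter, which would bound the memory footprint."""
    def __init__(self, n=8):
        self.n, self.requests = n, Counter()

    def admit(self, key):
        self.requests[key] += 1
        return self.requests[key] >= self.n

def probabilistic_admit(n=8):
    # Admit with probability 1/N: in expectation (geometric distribution)
    # an object gets admitted after N requests, with no bookkeeping at all.
    return random.random() < 1.0 / n
```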
Aug 18 2016
The 1-day data set looks great, and the trends (e.g., for one-hit-wonders) follow the expectations from previous comments.
Aug 9 2016
Thank you for clarifying the X_cache field; this helps a lot.
It seems then that the current Hive query (x_cache like '%cp4006%') allows us to reproduce the cache hit ratio of both the memory (front) cache and the SSD cache in cp4006, right?
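If so, something along these lines should recover per-layer hit/miss statuses for one host (the x_cache format here is an assumption, e.g. comma-separated "hostname status" entries; please correct me if the field looks different):

```python
# Hedged sketch: extract hit/miss statuses for one host from x_cache.
# Assumed format: comma-separated entries like "cp4005 miss, cp4006 hit/3";
# verify against the real field before relying on this.
def host_statuses(x_cache, host="cp4006"):
    statuses = []
    for entry in x_cache.split(","):
        parts = entry.strip().split()
        if len(parts) >= 2 and parts[0].startswith(host):
            statuses.append(parts[1].split("/")[0])  # "hit" or "miss"
    return statuses  # one entry per cache layer that named this host
```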
I cannot answer the first point, as we did not include the http_status column in this data set. Thanks for pointing this out; including http_status might help clean up the data.
Aug 8 2016
I've run the 1h data set through R and there's a brief summary below, if anyone is interested.
Aug 5 2016
Yes, you're correct (I had not thought of this).
I'll try to answer some of these questions:
Jul 25 2016
Starting with a 1G dataset is a great idea. I don't know about the max file size (on datasets.wikimedia.org), but the largest files I've seen there are about 300-500M. I guess 1G sounds reasonable, and we can always divide the compressed file into chunks.
Mar 23 2016
Mar 3 2016
Here's another idea for getting a smaller dataset.
As the eventual goal is to reproduce the cache performance, we can focus on requests issued to just one or two caches (e.g., one in ulsfo and one in esams). I'm not sure how many cache hostnames there are in total, but since there are already 30 caches in esams, the rate might be almost two orders of magnitude smaller?
Feb 25 2016
I can see that 100 reqs/sec is too much to handle. What would be a reasonable request rate we can handle?
An incrementing counter was also used in the 2007 dataset. Seems like a good solution to me.
The 2007 dataset covers a large time span: September 19th 2007 until January 2nd 2008. With an average of 2 GB of logs per day, that's about 250 GB overall.
I understand that today's request rates would make such a thing unfeasible.