Research on caching systems and algorithms remains an active topic in the academic community, fueled by changing workloads and newly available hardware*. The WMF has supported this research community in the past by making anonymized request traces available in 2007 [1] and 2016 [2]. In turn, advances in caching systems (e.g., Varnish / Apache Traffic Server) help make WMF websites faster and more efficient.
Specifically, there are two current research trends:
a) replacing human heuristics with machine learning-based caching decisions [3], which significantly improves performance (latency, hit ratios).
b) building better flash/SSD storage engines [4], which significantly reduces wear-out and prolongs hardware lifetime.
Unfortunately, the 2007 dataset does not include important fields for these studies (as described in T128132). The 2016 dataset is too short either to learn good policies (a) or to validate large flash/SSD drives (b). It also covers only an upload/media server, whereas text servers have significantly different workloads.
I would therefore like to ask for an updated dataset of anonymized user requests. This time, we would ideally pick one busy upload/media server (e.g., cp3033 in esams) and one busy text server (e.g., cp3034 in esams).
For an example of internal use at WMF, see T144187 of the Operations team. For examples of the numerous research papers that have benefited from these datasets, see [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13].
To make this task self-contained (although you can see some discussion in T128132): these datasets will be anonymized and will not contain personally identifiable information such as URLs, user IP addresses, or geolocation information.
The requested dataset is different from existing datasets (like clickstream or page stats) as it contains all requests (not sampled) to a single server, over a consecutive time period. Without having all requests, it is impossible to reconstruct the system's caching decisions.
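To illustrate why an unsampled trace matters, here is a minimal sketch of how a researcher might replay such a trace through a simulated cache. It assumes a plain LRU policy with a byte capacity, which is only a stand-in and not necessarily what the WMF cache servers run; the function name `replay_lru` and the toy trace are hypothetical.

```python
from collections import OrderedDict


def replay_lru(trace, capacity_bytes):
    """Replay (key, size) requests through a byte-capacity LRU cache.

    Returns the object hit ratio. LRU is an assumed policy for illustration.
    """
    cache = OrderedDict()  # key -> size, ordered from least to most recently used
    used = 0
    hits = 0
    total = 0
    for key, size in trace:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
            continue
        if size > capacity_bytes:
            continue  # object can never fit, so it is not admitted
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)  # evict least recently used
            used -= evicted_size
        cache[key] = size
        used += size
    return hits / total if total else 0.0


# Toy trace of (hashed_host_path, response_size) pairs.
full_trace = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1), ("b", 1)]
sampled_trace = full_trace[::2]  # keeps every second request only

full_ratio = replay_lru(full_trace, capacity_bytes=2)
sampled_ratio = replay_lru(sampled_trace, capacity_bytes=2)
```

On this toy input, the sampled trace yields a noticeably higher hit ratio than the full trace, because the requests that would have caused evictions are missing. The same effect is why a sampled dataset cannot stand in for the complete request stream of one server.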
To be more specific about the dataset, I am asking for the following four fields per request:
- hashed_host_path = HASH(uri_host, uri_path)
- uri_query
- content_type
- response_size
Three additional fields would be very helpful, but are not strictly necessary:
- timestamp (ts), rounded to minute precision
- time_firstbyte
- x_cache
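The fields above could be derived from a raw request record roughly as follows. This is a sketch under stated assumptions, not the procedure used for the earlier datasets: HASH is realized here as HMAC-SHA256 with a secret key (any keyed or salted hash would serve), minute precision is obtained by truncating seconds, and the names `SECRET_KEY`, `hash_host_path`, `truncate_to_minute`, and `anonymize` as well as the input dictionary layout are hypothetical.

```python
import hashlib
import hmac
from datetime import datetime, timezone

# Hypothetical secret key; it would be generated randomly and never published.
SECRET_KEY = b"replace-with-a-random-secret"


def hash_host_path(uri_host, uri_path, key=SECRET_KEY):
    """Keyed hash of host and path (assumed choice: HMAC-SHA256).

    The NUL separator keeps ("a", "bc") and ("ab", "c") from colliding.
    """
    message = uri_host.encode("utf-8") + b"\x00" + uri_path.encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def truncate_to_minute(ts):
    """Drop seconds and microseconds so only minute precision remains."""
    return ts.replace(second=0, microsecond=0).strftime("%Y-%m-%dT%H:%M")


def anonymize(request):
    """Map a raw request record to the four required and three optional fields."""
    return {
        "hashed_host_path": hash_host_path(request["uri_host"], request["uri_path"]),
        "uri_query": request["uri_query"],
        "content_type": request["content_type"],
        "response_size": request["response_size"],
        "ts": truncate_to_minute(request["ts"]),
        "time_firstbyte": request["time_firstbyte"],
        "x_cache": request["x_cache"],
    }


# Made-up example record for illustration only.
example = anonymize({
    "uri_host": "upload.example.org",
    "uri_path": "/some/file.jpg",
    "uri_query": "",
    "content_type": "image/jpeg",
    "response_size": 12345,
    "ts": datetime(2019, 1, 1, 12, 34, 56, tzinfo=timezone.utc),
    "time_firstbyte": 0.0002,
    "x_cache": "hit",
})
```

The same host and path always map to the same hash, so researchers can still track object popularity and reuse, while the original URL is not recoverable without the key.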
I would also like to ask for a dataset covering 1-2 months, compared to the two weeks from 2016. As outlined above, the current two-week dataset has limited benefit for research trends a) and b). Ideally, this would be repeated for two servers from different clusters (upload/text), e.g., cp3033 and cp3034.
[1] http://www.wikibench.eu
[2] https://analytics.wikimedia.org/datasets/archive/public-datasets/analytics/caching/
[3] Berger. "Towards Lightweight and Robust Machine Learning for CDN Caching". ACM HotNets, November 2018.
[4] Li et al. "Pannier: Design and analysis of a container-based flash cache for compound objects." ACM Transactions on Storage (TOS) 13.3 (2017): 24.
[5] Einziger et al. "TinyLFU: A highly efficient cache admission policy." ACM Transactions on Storage (TOS) 13.4 (2017): 35.
[6] Blankstein et al. "Hyperbolic caching: Flexible caching for web applications." USENIX Annual Technical Conference. 2017.
[7] Basat et al. "Randomized admission policy for efficient top-k and frequency estimation." IEEE Conference on Computer Communications. 2017.
[8] Berger et al. "AdaptSize: Orchestrating the hot object memory cache in a content delivery network." Symposium on Networked Systems Design and Implementation. 2017.
[9] Berger et al. "Practical bounds on optimal caching with variable object sizes." Proceedings of the ACM on Measurement and Analysis of Computing Systems 2.2 (2018): 32.
[10] Einziger et al. "Adaptive software cache management." International Middleware Conference. 2018.
[11] Rogers et al. "Cache-conscious wavefront scheduling." International Symposium on Microarchitecture. 2012.
[12] Krioukov et al. "NapSAC: Design and implementation of a power-proportional web cluster." ACM SIGCOMM workshop on Green networking. 2010.
[13] Calheiros et al. "Workload prediction using ARIMA model and its impact on cloud applications’ QoS." IEEE Transactions on Cloud Computing 3.4 (2014): 449-458.
*The new hardware refers to the next generation of zoned-namespace SSDs (ZNS), which will be released in the first quarter of 2020. ZNS devices promise significant cost reductions together with performance improvements. However, because their interface differs from that of conventional SSDs, building caching systems on ZNS is a new challenge.