In 2007, the WMF publicly released an anonymized trace containing 10% of user requests issued to Wikipedia [1]. This dataset has been widely used to evaluate the performance of new caching algorithms, e.g., for the Caffeine caching framework for Java [2]. Such algorithms can significantly increase cache hit ratios, which may in turn benefit the Wikipedia community.
The 2007 dataset has two shortcomings:
- It does not contain the response size, which essentially forces its users to assume that all objects (text, figures, etc.) have the same size. This introduces significant errors into performance evaluations.
- Request characteristics have changed significantly over the last nine years (e.g., the increasing role of mobile devices). As a result, the 2007 dataset does not represent caching performance under modern request streams well.
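
To illustrate the first shortcoming, here is a minimal sketch of how the same LRU simulation can report different hit ratios depending on whether each object is charged its real size or a uniform cost of one. The trace, object names, sizes, and capacities are made up for illustration and are not taken from [1].

```python
from collections import OrderedDict

def lru_hit_ratio(trace, capacity, use_sizes):
    """Simulate an LRU cache over (object_id, size) pairs.

    capacity is in bytes when use_sizes is True, otherwise in objects.
    Returns the fraction of requests served from the cache.
    """
    cache = OrderedDict()  # object_id -> cost charged for that object
    used = 0
    hits = 0
    for obj, size in trace:
        cost = size if use_sizes else 1
        if obj in cache:
            hits += 1
            cache.move_to_end(obj)  # mark as most recently used
            continue
        if cost > capacity:
            continue  # object can never fit, so it bypasses the cache
        while used + cost > capacity:
            _, evicted_cost = cache.popitem(last=False)  # evict LRU entry
            used -= evicted_cost
        cache[obj] = cost
        used += cost
    return hits / len(trace)

# Hypothetical trace: four small objects (1 byte each) and one large one.
small = [("s1", 1), ("s2", 1), ("s3", 1), ("s4", 1)]
trace = small + [("big", 16)] + small

# Without sizes, every object costs one slot (assumed capacity: 2 objects).
unit_ratio = lru_hit_ratio(trace, capacity=2, use_sizes=False)
# With sizes, the cache holds 8 bytes, so all small objects fit at once.
sized_ratio = lru_hit_ratio(trace, capacity=8, use_sizes=True)
print(unit_ratio, sized_ratio)
```

In this toy setup the unit-size model reports no hits at all, while the size-aware model serves the second round of small objects from the cache. A real trace with a `response_size` field would allow this kind of comparison on actual data.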
I would like to ask for an updated dataset of user requests.
According to the Hive documentation [3], the data would be available in the table wmf.webrequest.
Using this table's column names, I would specifically ask for the following fields, which are based only on server-side information.
| sequence | unique request number (replaces time stamp to preserve privacy) |
| uri_host | host part of the request URL |
| uri_path | path part of the request URL |
| uri_query | needed to compile the save flag as in [1] |
| cache_status | needed to compile the save flag as in [1] |
| http_method | needed to compile the save flag as in [1] |
| response_size | additional field compared to [1] |
Additionally, it would be nice to have the following fields.
| hostname | to study cache load balancing |
| content_type | to study hit rates per content type |
| time_firstbyte | for performance/latency comparison |
| x_cache | more cache statistics (cache hierarchy) |
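
As a rough sketch of the requested export, the snippet below keeps only the fields listed above and drops everything else from a record. The sample record, its values, and the dropped `ip` and `user_agent` keys are hypothetical and only serve to show that client-side information is not part of the request.

```python
# Fields requested above; the names follow the wmf.webrequest columns.
REQUIRED = ["sequence", "uri_host", "uri_path", "uri_query",
            "cache_status", "http_method", "response_size"]
OPTIONAL = ["hostname", "content_type", "time_firstbyte", "x_cache"]

def project(record, include_optional=False):
    """Return a copy of record restricted to the requested fields."""
    fields = REQUIRED + (OPTIONAL if include_optional else [])
    return {name: record.get(name) for name in fields}

# Hypothetical input row; all values are invented for illustration.
row = {
    "sequence": 1, "uri_host": "en.wikipedia.org", "uri_path": "/wiki/Cache",
    "uri_query": "", "cache_status": "hit", "http_method": "GET",
    "response_size": 12345, "hostname": "cache-host-1",
    "content_type": "text/html", "time_firstbyte": 0.001,
    "x_cache": "cache-host-1 hit", "ip": "192.0.2.1", "user_agent": "example",
}

minimal = project(row)
extended = project(row, include_optional=True)
print(minimal)
```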
[1] http://www.wikibench.eu/?page_id=60
[2] https://github.com/ben-manes/caffeine/wiki/Efficiency
[3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest

