In 2007, the WMF publicly released an anonymized trace containing 10% of user requests issued to Wikipedia. This data set has been widely used for performance evaluations of new caching algorithms, e.g., of the Caffeine caching framework for Java. Such new caching algorithms significantly increase cache hit ratios, which may in turn benefit the Wikipedia community.
The 2007 dataset has two shortcomings:
- It does not contain response sizes, which essentially forces its users to assume that all objects (text, images, etc.) have the same size. This assumption introduces significant errors into performance evaluations.
- Request characteristics have changed significantly over the last nine years (e.g., the growing share of mobile devices), so the 2007 dataset no longer represents caching performance under modern request streams well.
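To make the first point concrete, here is a toy Python sketch (with hypothetical request sizes and hit outcomes) showing how the object hit ratio and the byte hit ratio can diverge once object sizes differ:

```python
# Hypothetical mini-trace: a small text page is cached, a large image is not.
requests = [
    {"obj": "page",  "size": 10_000,    "hit": True},
    {"obj": "page",  "size": 10_000,    "hit": True},
    {"obj": "image", "size": 1_000_000, "hit": False},
]

# Object hit ratio: fraction of requests served from cache.
object_hit_ratio = sum(r["hit"] for r in requests) / len(requests)

# Byte hit ratio: fraction of traffic volume served from cache.
byte_hit_ratio = (sum(r["size"] for r in requests if r["hit"])
                  / sum(r["size"] for r in requests))

print(f"object hit ratio: {object_hit_ratio:.2f}")  # 0.67
print(f"byte hit ratio:   {byte_hit_ratio:.2f}")    # 0.02
```

Under the equal-size assumption both metrics collapse into one, hiding exactly this kind of gap.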
I would like to ask for an updated dataset of user requests.
According to the Hive documentation, the data would be available in the table wmf.webrequest.
Using this table's column names, I would specifically ask for the following fields:
| Field | Purpose |
|---|---|
| ts | timestamp in ms for request order |
| uri_host | host part of the request URL |
| uri_path | path part of the request URL |
| uri_query | needed to compile the save flag as in the 2007 dataset |
| cache_status | needed to compile the save flag as in the 2007 dataset |
| http_method | needed to compile the save flag as in the 2007 dataset |
| response_size | additional field compared to the 2007 dataset |
Additionally, it would be interesting to have:

| Field | Purpose |
|---|---|
| hostname | to study cache load balancing |
| sequence | to uniquely order requests within the same millisecond |
| content_type | to study hit ratios per content type |
| access_method | to study hit ratios per access method |
| time_firstbyte | for performance/latency comparisons |
| x_cache | additional cache statistics (cache hierarchy) |
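For concreteness, a HiveQL query along the following lines could extract these fields. This is only a sketch: the partition filter values and the `rand()`-based ~10% sampling are assumptions for illustration, not a prescription for how the sampling should actually be done.

```sql
-- Sketch: extract the requested fields from wmf.webrequest.
-- The partition filter and the sampling predicate are example values.
SELECT
  ts,
  sequence,
  hostname,
  uri_host,
  uri_path,
  uri_query,
  cache_status,
  http_method,
  response_size,
  content_type,
  access_method,
  time_firstbyte,
  x_cache
FROM wmf.webrequest
WHERE year = 2016 AND month = 1 AND day = 1  -- example partition filter
  AND rand() <= 0.1                          -- example: ~10% sample, as in 2007
```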