
Compile a request data set for caching research and tuning
Closed, Resolved · Public · 8 Story Points

Description

In 2007, the WMF publicly released an anonymized trace containing 10% of user requests issued to Wikipedia [1]. This data set has been used widely for performance evaluations of new caching algorithms, e.g., for the new Caffeine caching framework for Java [2]. Such new caching algorithms significantly increase cache hit ratios, which may in turn benefit the Wikipedia community.

The 2007 dataset has two shortcomings:

  • it does not contain information about the response size, which essentially forces its users to assume that all objects (text, images, ...) have the same size. This introduces significant errors into performance evaluations.
  • request characteristics have changed significantly over the last nine years (e.g., the increasing role of mobile devices). This means that the 2007 dataset does not represent caching performance under modern request streams well.

I would like to ask for an updated dataset of user requests.

According to the Hive documentation [3], the data would be available in the table wmf.webrequest.
Using this table's column names, I would specifically ask for the following fields, which are based only on server-side information.

  • sequence: unique request number (replaces the time stamp to preserve privacy)
  • uri_host: URL of request
  • uri_path: URL of request
  • uri_query: needed to compile the save flag as in [1]
  • cache_status: needed to compile the save flag as in [1]
  • http_method: needed to compile the save flag as in [1]
  • response_size: additional field compared to [1]

Additionally, it would be nice to have the following fields.

  • hostname: to study cache load balancing
  • content_type: to study hit rates per content type
  • time_firstbyte: for performance/latency comparison
  • x_cache: more cache statistics (cache hierarchy)

[1] http://www.wikibench.eu/?page_id=60
[2] https://github.com/ben-manes/caffeine/wiki/Efficiency
[3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest

Event Timeline

Restricted Application added subscribers: StudiesWorld, Aklapper. Feb 25 2016, 9:35 PM

@Danielsberger How big does the data set need to be? One hour? One day? It gets big very quickly.

Nuria added a subscriber: Nuria. Edited Feb 25 2016, 10:05 PM

For privacy, this data needs to be sampled, and timestamps cannot be disclosed as-is but rather as increments from a (non-disclosed) time in the past.

The 2007 dataset covers a large time span: September 19th 2007 until January 2nd 2008. With on average 2GB of logs per day, that's about 250 GB overall.
I understand that today's request rates would make such a thing unfeasible.

In my experience, the dataset needs to cover at least one week for cache metrics to become stable. Two to four weeks would make it much more interesting.

There seems to be a clear trade-off between the sampling rate and the time span we can cover. The 2007 dataset is sampled at 1:10. If possible, we should have a similarly high sampling rate to capture most of the temporal locality (1:10 to 1:100).

An incrementing counter was also used in the 2007 dataset. Seems like a good solution to me.

@Danielsberger : your best bet is to restrict the dataset to a project, maybe? We get >100,000 reqs per sec.

Danielsberger added a comment. Edited Feb 25 2016, 10:18 PM

I can see that 100,000 reqs/sec is too much to handle. What would be a reasonable request rate we can handle?

The 2007 dataset has on average 10 million requests per hour (sampled 1:10).
This roughly corresponds to the number of pageviews of the English Wikipedia over the last few months (according to https://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm).
If each page view triggers about 10 cache requests on average (wild guess), then focusing on the English Wikipedia would be similar to the request rate of the 2007 dataset.

The 2007 dataset needs roughly 20 bytes per request with gzip compression (112 bytes per request without compression).
Let's say we have 30 bytes per request.
A naive calculation for 1:10 sampling gives something like
10M reqs/hour x 24 hours x 30 bytes ≈ 7 GB/day.

Is this roughly correct and would this be a manageable dataset size?

Nuria added a comment. Feb 26 2016, 4:26 PM

Is this roughly correct and would this be a manageable dataset size?

Actually, I think what you are requesting is about 200 bytes gzipped per request (approx.). But, to be honest, I do not have a good gauge for what a manageable dataset size looks like; I will discuss with the team and get back to you.

Milimetric triaged this task as Normal priority. Feb 29 2016, 5:08 PM
Milimetric moved this task from Incoming to Modern Event Platform on the Analytics board.

Here's another idea for getting a smaller dataset.
As the eventual goal is to reproduce the cache performance, we can focus on requests issued to just one or two caches (e.g., one in ulsfo and one in esams). I'm not sure how many cache hostnames there are in total, but since there are already 30 caches in esams, the rate might be almost two orders of magnitude smaller?

JAllemandou edited projects, added Analytics-Kanban; removed Analytics.
JAllemandou removed JAllemandou as the assignee of this task.May 9 2016, 4:47 PM
JAllemandou edited projects, added Analytics; removed Analytics-Kanban.
JAllemandou added a subscriber: JAllemandou.
Milimetric moved this task from Backlog (Later) to Dashiki on the Analytics board. Jun 2 2016, 5:04 PM
ema added a subscriber: ema. Jun 21 2016, 8:45 AM
elukey added a subscriber: elukey. Jun 21 2016, 8:45 AM
Milimetric assigned this task to Nuria. Jul 7 2016, 5:54 PM
Milimetric edited projects, added Analytics-Kanban; removed Analytics.
Nuria edited projects, added Analytics; removed Analytics-Kanban. Jul 21 2016, 4:21 PM
Nuria moved this task from Dashiki to Operational Excellence Future on the Analytics board.
Nuria added a comment. Edited Jul 21 2016, 5:04 PM

We will have flat file(s) on datasets.wikimedia.org.

How much data can we handle on an http endpoint? Seems that this data will compress pretty well; let's start with 1G compressed.

Nuria set the point value for this task to 8. Jul 21 2016, 5:04 PM
Nuria edited projects, added Analytics-Kanban; removed Analytics.

Starting with a 1G dataset is a great idea. I don't know about the max file size (on datasets.wikimedia.org), but the largest files I've seen there are about 300-500M. I guess 1G sounds reasonable, and we can always divide the compressed file into chunks.

I had two other thoughts.

1) Focus on single cache instead of sampling

The Hive query can select a specific data center and a single first-level cache via the x_cache field. For example, we could select one cache in the San Francisco data center, say "cp4006".
Every request that enters through this cache will have its x_cache field set to something like
"cp1063 miss, cp4014 hit/3, cp4006 hit/15".
So, we can text-search for "cp4006" and select only those requests served through cp4006.

Due to load balancing, selecting a cache like cp4006 is pretty much the same as sampling, and brings down the request rate to a few thousand per second. Unlike sampling, this has the advantage of allowing a cache simulation that reproduces the caching performance of a real cache (in this case cp4006).

2) Hash Page URLs

Cache simulations won't actually need to know the full URL. So, we can save a lot of bits by hashing the URL. I'd also say a few hash clashes won't matter too much for cache performance, so we can go with a cheap hash function like Hive's HASH(), which yields an INT.
If, additionally, we compress the "save flag" as a simple 0/1, we only have four small numbers per request: (seq#, hash, save-flag, response_size).
I bet compressing the resulting text file would still make a huge difference, but the hash idea can help minimize the Hive response size even before the final compression step.

Nuria added a comment. Jul 25 2016, 5:18 PM

Excellent suggestion with hashing; hopefully we can get started on this this week.

Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board. Jul 27 2016, 7:10 PM
Nuria moved this task from In Progress to Next Up on the Analytics-Kanban board.
Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board. Aug 4 2016, 7:41 PM
Nuria added a comment. Edited Aug 4 2016, 7:50 PM

Does this select below seem ok? Note that:

  1. there is no time information
  2. it selects all projects and all content types

SELECT
  uri_host, HASH(uri_path, uri_query), cache_status, http_method, response_size,
  hostname, content_type, time_firstbyte, x_cache
FROM wmf.webrequest
WHERE
  x_cache LIKE '%cp4006%'
  AND year = 2016
  AND month = 7
  AND agent_type = 'user'
  AND access_method = 'desktop'
LIMIT <some>

Should uri_host be hashed with the URL too?

Nuria added a comment. Aug 4 2016, 8:45 PM
This comment was removed by Nuria.
Nuria added a subscriber: BBlack. Aug 5 2016, 7:47 PM

Per discussion with @BBlack, it turns out that "cache_status is completely different and wrong in ways that are difficult to even correlate to the real world".

Cache status can be inferred from X-Cache.

X-Cache can be interpreted with a regex as follows to put things into the four basic dispositions,
in this order: /hit/ => hit, /int/ => int, /pass,[^,]+$/ => pass, /miss/ => miss, else it's unknown (a bug?).
Many X-Cache lines will match more than one of those regexes, but the first one that matches, in the order above, gives the overall disposition.
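
A minimal R sketch of that classification rule (illustrative only, not the production parser; it simply applies the stated regexes in the stated order):

classify_xcache <- function(x_cache) {
  # Dispositions in precedence order: hit, int, pass, miss; else unknown.
  if (grepl("hit", x_cache))          return("hit")
  if (grepl("int", x_cache))          return("int")
  if (grepl("pass,[^,]+$", x_cache))  return("pass")
  if (grepl("miss", x_cache))         return("miss")
  "unknown"   # should not happen; possibly a bug in the header
}

classify_xcache("cp1063 miss, cp4014 hit/3, cp4006 hit/15")   # "hit"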

Cache metrics here: https://grafana.wikimedia.org/dashboard/db/varnish-caching

Nuria added a comment. Aug 5 2016, 8:00 PM

@BBlack: let us know if you have a better idea for sampling. Thus far I am using x_cache like '%cp4006%' to get (I thought) a consistent dataset without sampling requests at 1/1000 or similar, which I can also do.

I'll try to answer some of these questions:

With regard to having no time information: that seems fine to me. It would be nice to know the time stamps of the first and last request in the trace, say at second or minute accuracy.

With regard to the hashing: I think it would be better to include uri_host in the hash, but exclude uri_query. Further, I'd advocate having uri_host and uri_query in the clear.
A HASH(uri_host, uri_path) ensures uniqueness for each caching object (including uri_query can create different hashes for the same object).
Having the uri_query in the clear seems necessary to compile the "save flag" (which was discussed on the mailing list a while ago). Technically, we could compress the uri_query and http_method into a binary (save flag) field, but I'm not sure the matching is easy to do in Hive directly (?). We can check this out with the small test data set.

With regard to projects and content types: incorporating all projects/languages and all content types would be great. If (in order to bring down the request rate) we need to choose between content types, I'd prefer the upload category (and not the text category) because it's more interesting from a caching perspective: the hit ratio is significantly lower for the "upload" caching clusters, which may indicate greater room for improvement.

With regard to cache_status: thank you for following up on this. From your comment, dropping cache_status clearly makes sense as we have X-Cache.

Did limiting to a single cache (x_cache like '%cp4006%') work as expected?

Nuria added a comment. Aug 5 2016, 8:35 PM

Having the uri_query in the clear seems necessary to compile the "save flag" (which was discussed on the mailing list a while ago).

I leave that up to you, but most requests to upload are not going to have a uri_query as they are requests for static resources.

Yes, you're correct (I had not thought of this).

In fact, our current query (selecting only traffic served through cp4006) already limits the trace to upload content, as this cache is assigned to the upload cluster (as far as I know). Thus, I should expect empty entries in the uri_query field.

I think we should still keep the uri_query field for compatibility with traces for text content. If space and time permit, we may be able to include another single cache that serves text traffic, such as cp4008.

Nuria added a comment. Edited Aug 5 2016, 10:38 PM

1 hour of data here: https://datasets.wikimedia.org/limn-public-data/caching/

Note that it is almost 2M gzipped.

Select:

SELECT
  HASH(uri_host, uri_path) hashed_host_path, uri_query, content_type, http_method,
  response_size, time_firstbyte, x_cache
FROM wmf.webrequest
WHERE
  x_cache LIKE '%cp4006%'
  AND year = 2016
  AND month = 7
  AND day = 01
  AND hour = 01
  AND agent_type = 'user'
  --AND content_type='text/html'
LIMIT 1000000000;

Nuria moved this task from In Progress to Paused on the Analytics-Kanban board. Aug 8 2016, 4:08 PM

I've run the 1h data set through R and there's a brief summary below, if anyone is interested.

My main finding is that
a) the data is consistent and will prove very useful for analyzing cache performance;
b) a very large data set (7 days?) will be needed to analyze a memory cache in WMF's Varnish deployment (to evaluate alternatives for the current deployment).

.

Detailed explanation for point b)

The key component of any caching system is the cache replacement policy, which selects which object to evict from the cache ("replace") when there is no space to store a newly requested one.

There are various replacement policies which perform better or worse depending on the workload. When considering different caching systems (e.g., T96853), the replacement policy thus plays an important role.

In order to evaluate the impact of different replacement policies, we need to frequently trigger replacement decisions, so that the impact of those decisions is maximized.

The smallest types of deployed caches (WMF's Front Varnishes) have about 100 GB of capacity. In order to trigger replacement, a data set has to include at least 100 GB of different ("unique") objects. Otherwise, the cache can just store all objects. To get statistically significant results, we'd need, say, 500 GB of unique objects.

In the current 1h data set, there is 165 GB of total request traffic, of which 6.4 GB are unique objects. The volume of unique objects is going to grow as we consider longer request periods, mainly because there's always a fraction of roughly 12% "one-hit-wonders" (see the stats below).

As a lower bound, we can assume that every hour gives us another 6.4 GB of unique objects. Then, we'd need a data set covering about 78h.

As an upper bound, we can assume that only the 12% one-hit-wonders increase the volume of unique objects. So, past the initial 6.4 GB, every hour only adds about 0.8 GB of unique objects. Then, we'd need about 26 days of traffic.

That's assuming the current "hour" is somewhat representative.

The truth is certainly somewhere in the middle (the pool of not-so-well-known pages is large).
I personally conjecture that a 7-day trace will be ok. This still means a huge data set, about 168x the current size. That's very large, probably about 142 GB uncompressed or 31 GB compressed (assuming the current compression ratio of 4.6).
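
For reference, the bound arithmetic behind the 78h and 26-day figures, using the numbers above (a back-of-the-envelope R sketch; the 500 GB target is the assumption stated earlier):

target_gb        <- 500   # unique-object volume needed to exercise replacement
gb_per_hour      <- 6.4   # lower bound: every hour adds 6.4 GB of new unique objects
one_hit_per_hour <- 0.8   # upper bound: only the ~12% one-hit-wonders add new objects

target_gb / gb_per_hour                             # ~78 hours
(target_gb - gb_per_hour) / one_hit_per_hour / 24   # ~26 days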

.

Statistical findings for a)

There are

  • 165 GB (5 million requests) that arrived directly through the cache cp4006
  • 4x25 GB (4 million requests) that arrived through 4 caching peers in the ulsfo data center (caches cp4005/7/13/15)

The requests are 99.9% GET, and 0.07% HEAD requests.

The mean request/object size is 34 KB (median 3.6 KB), and the largest size is 223 MB. The distribution is shown below.

The fastest time to first byte was 0.03ms and the slowest was 61s.

3.5% of requests had zero response size, which, I assume, are aborted requests. That seems like a reasonable number to me.

As these are upload caches, the content type distribution is:

  • 57% image/png
  • 41% image/jpeg
  • 1% image/gif
  • 1% others

The popularity distribution follows a typical Zipf distribution with a really heavy tail; that has been verified for WMF traffic before. 61% of objects are "one-hit-wonders" (requested only once, never again). These account for 12% of the total request traffic.

There are a few weird queries like "?0.4574536606" or "?t=1467335" but less than 1000 overall.

.

Here is the request/object size distribution.

Here's the popularity (Zipf) distribution plot.

BBlack added a comment. Aug 8 2016, 8:49 PM

3.5% of requests had zero response size, which, I assume, are aborted requests. That seems like a reasonable number to me.

It sounds a little higher than I would've thought (but certainly within reason). Any chance this is due to 304s, or are you filtering for 2xx first?

61% of objects are "one-hit-wonders" (requested only once, never again). This makes for 12% of the total request traffic.

12% of traffic (bytes or requests?) as one-hit-wonders on cache_upload doesn't seem to align with our independent near-realtime stats. Those show a true-miss rate that's usually just under 3% of requests, and any one-hit-wonder should be a miss. In any case, especially on cache_text, we know there's a one-hit-wonder problem. It's on our long-term radar to look at using a bloom filter (or similar) to only cache a given object on its second request within a certain timeframe, with overlapping filters that reset periodically.
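
A toy R sketch of that admission idea (illustrative only, not WMF's design; a plain lookup table stands in for the bloom filter, and the reset window is omitted):

seen <- new.env(hash = TRUE)

admit <- function(key) {
  # Cache an object only on its second (or later) request.
  if (exists(key, envir = seen, inherits = FALSE)) return(TRUE)
  assign(key, TRUE, envir = seen)   # remember the first sighting, do not cache yet
  FALSE
}

admit("objA")   # FALSE: first request, not cached
admit("objA")   # TRUE: second request, now admitted to the cache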

I cannot answer the first point, as we did not include the http_status column in this data set. Thanks for pointing this out; including http_status might help clean up the data.

As for the second point: I realized that my one-hit-wonder statistic from above is imprecise. Unfortunately, that still does not explain the high fraction of one-hit-wonders.

I should first clarify that the 12% figure did not account for the Varnish cache request routing, so it's all requests and objects in the whole 1h trace. There are

  • 9.2 million requests
  • 1.9 million unique objects
  • of which 1.1 million objects are requested only once

That's how I arrive at 12% one-hit-wonder requests and 58% (not 61%) one-hit-wonder objects.

Let's consider only those requests that were answered by cp4006 as the front cache. I assume I can catch those by matching the "last" x_cache entry against cp4006.
For example,

  • directly answered by cp4006 (included): cp1050 hit/11, cp2022 hit/1, cp4005 hit/58, cp4006 hit/3
  • indirectly answered by cp4006 (excluded): cp1048 hit/21, cp2017 hit/10, cp4006 hit/6, cp4015 miss

In this case, there are

  • 4.9 million requests
  • 1.4 million unique objects
  • of which 0.9 million objects are requested only once

That's 19% one-hit-wonder requests and 65% one-hit-wonder objects.

I also independently verified these numbers by running both traces (all requests, cp4006-front requests) in my cache simulator. For an infinitely large cache, I get a 79.1% cache hit ratio for all requests and a 70.9% cache hit ratio for cp4006-front. This is roughly consistent with the one-hit-wonder statistics from above.
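
For an infinitely large cache, only the first request to each object misses, so the hit ratio is simply 1 - unique objects / total requests; a quick check against the figures above:

1 - 1.9e6 / 9.2e6   # ~0.79 for all requests
1 - 1.4e6 / 4.9e6   # ~0.71 for cp4006-front requests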


In conclusion, there seems to be a significant inconsistency (just as you pointed out). Four possible explanations come to mind:

  1. my analysis methodology is incorrect
  2. the hash used for this data set is different from Varnish's hash
  3. a 1h data set is too short to extrapolate one-hit-wonders and cache hit ratios
  4. there is data inconsistency between Hive and the Varnish hit ratio statistics

I would argue that a combination of 2) and 3) is the most likely explanation.

As for 3): as I argued in the comment before, we need a lot more data to run proper caching simulations. I did already run a few, but it's just too little data. I think this supports 3).

As for 2), we should definitely check this, although it seems to me that HASH(uri_host, uri_path) uniquely identifies an object. Or does the request routing interfere with our focus on "cp4006" in an unexpected way?

As for 1), I'm happy to share a little R script to reproduce the numbers (it assumes the cache.txt.gz has been gunzipped and is in the same folder):
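
The original attachment is not reproduced here; a minimal sketch of that kind of analysis, assuming a tab-separated cache.txt with the columns from the Hive query above and no header row, could look like this:

d <- read.delim("cache.txt", header = FALSE, stringsAsFactors = FALSE,
                col.names = c("hashed_host_path", "uri_query", "content_type",
                              "http_method", "response_size", "time_firstbyte", "x_cache"))

reqs_per_obj <- table(d$hashed_host_path)    # requests per unique object
total_reqs   <- nrow(d)
unique_objs  <- length(reqs_per_obj)
one_hit_objs <- sum(reqs_per_obj == 1)

one_hit_objs / total_reqs    # share of requests going to one-hit wonders
one_hit_objs / unique_objs   # share of objects that are one-hit wonders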


Not related: as for the bloom filter idea, I think I might be able to help with that, in the future.

I should first clarify that the 12% figure did not account for the Varnish cache request routing, so it's all requests and objects in the whole 1h trace. There are

  • 9.2 million requests
  • 1.9 million unique objects
  • of which 1.1 million objects are requested only once

That's how I arrive at 12% one-hit-wonder requests and 58% (not 61%) one-hit-wonder objects.

That it's 1h might be a factor as well: a one-hit wonder in a 1h time window might not be one in a longer window.

Let's consider only those requests that were answered by cp4006 as the front cache. I assume I can catch those by matching the "last" x_cache entry against cp4006.
For example,

  • directly answered by cp4006 (included): cp1050 hit/11, cp2022 hit/1, cp4005 hit/58, cp4006 hit/3
  • indirectly answered by cp4006 (excluded): cp1048 hit/21, cp2017 hit/10, cp4006 hit/6, cp4015 miss

To clarify how our X-Cache (and caching in general) works:

  1. From the X-Cache point of view, requests first arrive at the rightmost cache, and then go deeper to the left towards the applayer. It makes sense when looked at functionally: the X-Cache header is appended to on the outbound (response) side as a response traverses its way back up our cache layers towards the user.
  2. Reading from the right, once you find a "hit" entry, anything else further to the left is historical (that is, those were the X-Cache entries from the deeper layers when the hit-object was first fetched some time ago. X-Cache is stored in the cache with the object being hit).
  3. Our frontend-most cache layer is small (~96-128GB per machine) and covers all requests. LVS hashes on client IP to distribute the request load to several frontends, and they can cache the whole dataset (with some minor exceptions in the next point). When counting all frontend caches as a whole, the effective storage set size will be the average of the cache storage of all frontend nodes.
  4. Frontends refuse to cache certain objects (Range requests and very large files), which would be marked as pass in X-Cache for the front layer.
  5. Traffic that doesn't hit in the frontend proceeds through multiple backend layers (one layer at each datacenter it passes through, starting at the same DC as the frontend cache). The first backend layer is the local backends, which co-exist on the same physical hosts as the frontends.
  6. Traffic coming into backends is hashed (consistent hashing) on the URL to spread the set of objects across multiple machines, and the backends have large SSD storage (typically 720GB in one host). Due to the chashing, the effective storage size of an entire backend layer is the sum of the nodes' storage. For example, in ulsfo (cp4xxx) we have 6 upload cache hosts, so the ulsfo backend layer has a total effective storage size of 4.3TB.

In your two example lines, the 'direct' example was answered from a hit in the ~96GB memory cache in the cp4006 frontend cache. In the 'indirect' example, what happened is: the ~96GB frontend memory cache on cp4015 missed, then cp4015 chashed the object's URL to decide to contact cp4006's backend cache (720G SSD which is effectively dedicated to 1/6 of the total objects), which found a hit. No matter which frontend the request came through, that given object would always hash into cp4006's backend cache.

https://grafana.wikimedia.org/dashboard/db/varnish-caching is our live view of caching on the various clusters, and is parsed from X-Cache data.

Thank you for clarifying the x_cache field; this helps a lot.
It seems then that the current Hive query (x_cache like '%cp4006%') allows us to reproduce the cache hit ratio of the memory (Front) cache in cp4006 and of the SSD cache in cp4006, right?

To simulate cp4006's memory (front) cache, we consider only those requests where, reading from the right, the first x_cache entry is "cp4006" (either hit or miss, because that's going to be determined by the simulator).
To simulate cp4006's SSD cache, we consider only those requests where, reading from the right, the first entry is a miss and the second entry is "cp4006" (now the hit/miss of the second layer is determined by the simulator).
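
A minimal R sketch of those two filters (illustrative only; it assumes a data frame d as in the earlier script, with x_cache as its last column):

entries   <- strsplit(d$x_cache, ",\\s*")
rightmost <- sapply(entries, function(e) e[length(e)])
second    <- sapply(entries, function(e) if (length(e) >= 2) e[length(e) - 1] else NA)

# Front (memory) view: the request entered through cp4006's frontend cache.
front_trace <- d[grepl("^cp4006 ", rightmost), ]

# SSD (backend) view: the frontend missed and cp4006's backend was consulted next.
disk_trace  <- d[grepl("miss", rightmost) & !is.na(second) & grepl("^cp4006 ", second), ]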

These simulations can then indicate the effect on the hit ratio of deploying variants of the Bloom filter or other changes.

We could also try to simulate a complete data center, including the effect of hashing and the complete (distributed) SSD cache, but I fear the amount of data needed to reach steady state for such a simulation would be far too high for now (?). I believe that sampling cache requests via (x_cache like '%cp4006%') is a cheaper way to get representative cache simulations.


From your statement, it seems that our current HASH(uri_host, uri_path) is correct, as the effect of consistent hashing is resolved by considering the x_cache field as above.
Do you think we should move forward with a larger data set (for which I will again calculate the statistics so we can check its consistency)?

Nuria moved this task from Paused to In Progress on the Analytics-Kanban board. Aug 12 2016, 3:02 PM
Nuria added a comment. Edited Aug 12 2016, 7:29 PM

I will try to get a larger dataset (hopefully one week), excluding any response that is not a 200. That should reduce the data a bit and provide you with what seems like a cleaner dataset.

From what I see, only counting 200s gives us 4% less data per hour.

I will also remove http_method as it seems redundant.

I have put a "1 day" dataset here: https://datasets.wikimedia.org/limn-public-data/caching/

Please take a look; if it looks good we can provide 2 weeks' worth of files.

Nuria moved this task from In Progress to Paused on the Analytics-Kanban board. Aug 16 2016, 8:21 PM

The 1-day data set looks great, and the trends (e.g., for one-hit-wonders) follow the expectations from previous comments.

  • "backwards consistent": I've compared the first (1h) data set to the first hour of this new data set. The only difference lies in the exclusion of non-200 responses. As expected, everything is the same, except that the fraction of zero-response-size requests decreases from 3.6% to 0.1%. This fraction is also 0.1% for the overall new (1d) data set, which seems great.
  • "plausible one-hit-wonders": The 12% of requests to one-hit-wonders was overly high in the first (1h) data set. In this new (1d) data set, we have a more realistic figure of 2.6%.

Moving forward with a 2-week data set sounds great to me.


Here is a summary comparing the current 1h and the 1day data sets.

Cache Metric | old (1h) data set | new (1d) data set
total cache req | 9213253 | 198260505
total cache vol | 299GB | 6407GB
unique obj count | 1871496 | 10564786
unique obj vol | 151GB | 1238GB
cp4006 disk requests | 55.2 % (of reqs) | 54.9 % (of reqs)
cp4006 front requests | 53.1 % (of reqs) | 54.1 % (of reqs)
cp4006 disk volume | 160GB | 3449GB
cp4006 front volume | 164GB | 3535GB
requests to 1-hit objs | 12.4 % (of reqs), 61.3 % (of objs) | 2.6 % (of reqs), 48.5 % (of objs)
requests to 2-hit objs | 6.3 % (of reqs), 31.2 % (of objs) | 1.6 % (of reqs), 29.1 % (of objs)
requests to 3-hit objs | 4.2 % (of reqs), 20.9 % (of objs) | 1.2 % (of reqs), 22 % (of objs)
requests to 4-hit objs | 3.2 % (of reqs), 15.6 % (of objs) | 1 % (of reqs), 18.3 % (of objs)
requests to 5-hit objs | 2.5 % (of reqs), 12.3 % (of objs) | 0.9 % (of reqs), 16 % (of objs)
requests to most popular obj | 1.5 % (of reqs) | 1.2 % (of reqs)
requests w/ zero size | 3.6 % (of reqs) | 0.1 % (of reqs)
max object size | 223MB | 674MB
mean request size | 34KB | 33KB
Nuria moved this task from Paused to In Progress on the Analytics-Kanban board. Aug 19 2016, 2:57 PM
Nuria added a comment. Aug 29 2016, 8:07 PM

FYI, still working on the 2-week dataset; need to harvest the last couple of days.

Nuria moved this task from In Code Review to Done on the Analytics-Kanban board. Aug 31 2016, 8:35 PM
Krinkle added a subscriber: Krinkle.
Danielsberger closed this task as Resolved. Sep 13 2016, 12:55 PM

I have finally been able to take a look at the data set: it's great - exactly what we need to analyze caching performance.

Below are the overall trace statistics. It's 2.8 billion requests overall and the volume of unique objects is just over 5 TB, which is great. Overall, this seems to be well in line with what we learned from the two smaller datasets.

I think that Nuria has compiled a clean and incredibly valuable data set - thank you so much!


I had one minor question: Am I right to assume that the dates of this dataset are August 17 to August 31?


Cache Metric | full (2-week) data set
total cache req | 2806464496
total cache vol | 89TB
unique obj count | 37268394
unique obj vol | 5058GB
cp4006 disk requests | 54.8 % (of reqs)
cp4006 front requests | 54.2 % (of reqs)
cp4006 disk volume | 48TB
cp4006 front volume | 49TB
requests to 1-hit objs | 0.6 % (of reqs), 41.5 % (of objs)
requests to 2-hit objs | 0.4 % (of reqs), 27.8 % (of objs)
requests to 3-hit objs | 0.3 % (of reqs), 21.9 % (of objs)
requests to 4-hit objs | 0.2 % (of reqs), 18.8 % (of objs)
requests to 5-hit objs | 0.2 % (of reqs), 16.6 % (of objs)
requests to most popular obj | 1.2 % (of reqs)
requests w/ zero size | 0.1 % (of reqs)
max object size | 1217MB
mean request size | 34KB
Nuria added a comment. Sep 13 2016, 6:51 PM

I had one minor question: Am I right to assume that the dates of this dataset are August 17 to August 31?

No, the dates are not those; since we did not need timestamps for this type of data, I did not include them.

dayyoung0324 added a subscriber: dayyoung0324. Edited Apr 16 2018, 2:25 PM

Hello Danielsberger and everyone,

Since I am working on an algorithm to analyze workloads (e.g., page read or page create) over a certain period of time (e.g., one day) in a social network (or a distributed system), as shown in the attached file SocialNetwork.pdf, I need a realistic dataset including patterns of user access to web servers in a decentralized hosting environment. In other words, I expect each trace record in the dataset to have at least four attributes: timestamp, web server id, page size, and operation (e.g., create, read, or update a page). It seems not to be easy to get a realistic back-end dataset. Nuria already advised me that Wikimedia downloads or the API do not provide a dataset like the one I am interested in, and that the closest data to what I am asking for might be the one discussed here. I have two questions.
  1. Do you provide any available downloads of your data with a smaller size?
  2. Could I use a SELECT ... FROM clause similar to the following example to crawl what I need from Wikimedia?

SELECT
  HASH(uri_host, uri_path) hashed_host_path, uri_query, content_type,
  response_size, time_firstbyte, x_cache
FROM wmf.webrequest
WHERE
  x_cache LIKE '%cp4006%'
  AND year = 2016
  AND month = 7
  AND day = 12
  --AND hour=01
  AND agent_type = 'user'
  AND http_status = '200'
  --AND content_type='text/html'
LIMIT 1000000000;

Nuria added a comment. Apr 16 2018, 3:03 PM

@dayyoung0324 Please do not post on tickets that are closed. As I mentioned, the data available that most resembles your request is at: https://analytics.wikimedia.org/datasets/archive/public-datasets/analytics/caching/

Do you provide any available downloads for your data with smaller size?
Could I use the similar sql SELECT FROM clause as the following example to crawl what I need from Wikimedia?

We do not provide any ability to query raw data.

@Nuria Since I have some questions related to this task T128132, should I create a new task to post my questions? For example, the dataset sizes at https://analytics.wikimedia.org/datasets/archive/public-datasets/analytics/caching/ are too large; I can't download any of them over my network. Thank you.

Nuria added a comment. Apr 16 2018, 4:39 PM

Since I have some questions related to this task T128132, should I create a new task to post my question?

Using the analytics@ e-mail list would be fine.

Krinkle removed a subscriber: Krinkle. Apr 16 2018, 11:32 PM