
Request for a large request data set for caching research and tuning
Closed, Resolved · Public

Description

Research on caching systems and algorithms continues to be a hot topic in the academic community. This is fueled by changing workloads and new hardware becoming available*. The WMF helped this research community in the past by making anonymized request traces available in 2007 [1] and 2016 [2]. In turn, advances in caching systems (e.g., Varnish / Apache Traffic Server) help make WMF websites faster and more efficient.

Specifically, there are two current research trends:
a) replacing human heuristics with machine learning-based caching decisions [3], which significantly improves performance (latency, hit ratios).
b) building better flash/SSD storage engines [4], which significantly reduces wear-out and prolongs hardware lifetime.

Unfortunately, the 2007 dataset does not include important fields for these studies (as described in T128132). In addition, the 2016 dataset is too short to either learn good policies (a) or validate large flash/SSD drives (b). The 2016 dataset also covers only an upload/media server, whereas text servers have significantly different workloads.

I would therefore like to ask for an updated dataset of anonymized user requests. This time, we would ideally pick one busy upload/media server (e.g., cp3033 in esams) and one busy text server (e.g., cp3034 in esams).

For an example of internal use at WMF, see T144187 of the Operations team. For examples of the numerous research papers that have benefited from these datasets see: [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13].

To make this task self-contained (although you can see some discussion in T128132): these datasets will be anonymized and will not contain personally identifiable information, like URLs, user IP addresses, or geolocation information.
The requested dataset is different from existing datasets (like clickstream or page stats) as it contains all requests (not sampled) to a single server, over a consecutive time period. Without having all requests, it is impossible to reconstruct the system's caching decisions.

To be more specific about the dataset, I am asking for the following four fields per request:

  • hashed_host_path = HASH(uri_host, uri_path)
  • uri_query
  • content_type
  • response_size
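To make the four required fields concrete, here is a minimal sketch of how one raw log record could be reduced to them. It is only an illustration: the salted SHA-256 digest is an assumed stand-in for whatever HASH() function the query engine provides, and `SALT`, `hashed_host_path()`, and `anonymize()` are hypothetical names, not part of any WMF tooling.

```python
import hashlib

# Assumption: a secret salt keeps the digest from being reversed by
# hashing known URLs. The value here is purely hypothetical.
SALT = b"hypothetical-secret-salt"


def hashed_host_path(uri_host: str, uri_path: str, salt: bytes = SALT) -> str:
    """Return a salted SHA-256 digest of host and path (stand-in for HASH())."""
    # A NUL separator keeps ("a", "bc") and ("ab", "c") from colliding.
    data = salt + uri_host.encode("utf-8") + b"\x00" + uri_path.encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def anonymize(record: dict) -> dict:
    """Keep only the four requested fields; every other key is dropped."""
    return {
        "hashed_host_path": hashed_host_path(record["uri_host"], record["uri_path"]),
        "uri_query": record["uri_query"],
        "content_type": record["content_type"],
        "response_size": record["response_size"],
    }
```

The same host and path always map to the same digest, which is all a cache simulator needs in order to recognize repeated requests for one object.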

Three additional fields would be very helpful, but are not strictly necessary:

  • rounded timestamp (ts) to minute precision
  • time_firstbyte
  • x_cache

I would also like to ask for a dataset covering 1-2 months, compared to the two weeks from 2016. As outlined above, the current two-week dataset has limited benefit for research trends a) and b). Ideally, this would be repeated for two servers from different clusters (upload/text), e.g., cp3033 and cp3034.


[1] http://www.wikibench.eu
[2] https://analytics.wikimedia.org/datasets/archive/public-datasets/analytics/caching/
[3] Berger. "Towards Lightweight and Robust Machine Learning for CDN Caching". ACM HotNets, November 2018.
[4] Li, Cheng, et al. "Pannier: Design and analysis of a container-based flash cache for compound objects." ACM Transactions on Storage (TOS) 13.3 (2017): 24.
[5] Einziger et al. "Tinylfu: A highly efficient cache admission policy." ACM Transactions on Storage (ToS) 13.4 (2017): 35.
[6] Blankstein et al. "Hyperbolic caching: Flexible caching for web applications." USENIX Annual Technical Conference. 2017.
[7] Basat et al. "Randomized admission policy for efficient top-k and frequency estimation." IEEE Conference on Computer Communications. 2017.
[8] Berger et al. "AdaptSize: Orchestrating the hot object memory cache in a content delivery network." Symposium on Networked Systems Design and Implementation. 2017.
[9] Berger et al. "Practical bounds on optimal caching with variable object sizes." Proceedings of the ACM on Measurement and Analysis of Computing Systems 2.2 (2018): 32.
[10] Einziger, Gil, et al. "Adaptive software cache management." International Middleware Conference. 2018.
[11] Rogers et al. "Cache-conscious wavefront scheduling." International Symposium on Microarchitecture. 2012.
[12] Krioukov, Andrew, et al. "Napsac: Design and implementation of a power-proportional web cluster." ACM SIGCOMM workshop on Green networking. 2010.
[13] Calheiros, et al. "Workload prediction using ARIMA model and its impact on cloud applications’ QoS." IEEE Transactions on Cloud Computing 3.4 (2014): 449-458.


*New hardware becoming available concerns the next generation of zoned-namespace SSDs (ZNS), which will be released in the first quarter of 2020. ZNS devices promise significant cost reductions with simultaneous performance improvements. However, due to a changed interface, building caching systems on ZNS is a new challenge.

Event Timeline

Restricted Application added a subscriber: Aklapper. Jun 11 2019, 4:23 PM
Nuria edited projects, added Analytics; removed Analytics-Kanban. Jun 13 2019, 4:13 PM
fdans added a subscriber: fdans. Jun 13 2019, 4:59 PM

@Danielsberger we can help you gather this data but we won't be able to do this ourselves until Q2-Q3

fdans triaged this task as Medium priority. Jun 13 2019, 5:00 PM
fdans moved this task from Incoming to Mentoring on the Analytics board.
Nuria raised the priority of this task from Medium to High. Jun 13 2019, 5:00 PM

@fdans thank you for responding so quickly. Gathering the data is not very urgent. It would be most helpful if we could gather the data before the annual Fall paper deadlines (early September). Otherwise, people won't be able to see papers written based on this data until March 2021 (due to the annual publication and presentation cycle: submit in Fall, present in Spring).

If the webrequest log hasn't changed since @Nuria worked out the last dataset (T128132), it should be relatively quick to run the query. Something like the following, if I'm not mistaken.

SELECT
  HASH(uri_host, uri_path) AS hashed_host_path,
  uri_query,
  content_type,
  response_size,
  time_firstbyte,
  ROUND(ts/60),
  x_cache
FROM wmf.webrequest
WHERE
  x_cache LIKE '%cp3033%'
  AND year = 2019
  AND month = 5
  AND agent_type = 'user'
  AND http_status = '200';

Hi @Nuria - Can you confirm the above request is correct for generating the data?

I forgot to add that the source of the query above is the readme file from T128132. I've adapted it to 2019. Also, as indicated in the task description, it would be very helpful to cover two different servers, i.e., to repeat the query for

x_cache like '%cp3033%'

and

x_cache like '%cp3034%'
Nuria added a comment. Jun 14 2019, 3:16 PM

The request as is sweeps through quite a bit of data (0.3 petabytes?), so one concern would be the size of the resulting data files. Also, there are privacy concerns with releasing a dataset with two months of data, especially with timestamps. We need to look at that more closely; the timestamps were omitted from the first dataset on purpose.

So, to sum up, we cannot release results of that query as is. I actually wonder, given size of results you are asking for, whether we can release them at all.

Thank you for your feedback, @Nuria , that's very helpful.

I was hoping that rounding to coarser granularity would minimize any privacy concern with the timestamp field.

  • Can we consider hour-granularity instead of minute-granularity?
  • Otherwise, we can just drop this field altogether. The dataset will still be immensely useful.

Going back to the old dataset (which spanned two weeks), the compressed size is about 55GB and uncompressed it's about 100GB. So, I'd expect eight weeks to be about 400-500GB. I agree that that's quite large.

  • The largest number of bytes is due to x_cache field, about 40-50% of the bytes per line. If we drop this field, I'd estimate the trace size to be around 250-300GB and I expect much higher compression ratios.
  • It would also be fine to go down to four or five weeks, which should further save another 30-40%, definitely below 200GB uncompressed.
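The estimates above can be reproduced with a short back-of-envelope calculation. This sketch assumes that the trace size scales linearly with its duration and that x_cache accounts for 40-50% of the bytes per line, as stated in the bullets; `scaled_size_gb()` and `without_column()` are hypothetical helper names.

```python
def scaled_size_gb(base_gb: float, base_weeks: float, target_weeks: float) -> float:
    """Scale a trace size linearly with its duration (an assumption)."""
    return base_gb * target_weeks / base_weeks


def without_column(size_gb: float, column_share: float) -> float:
    """Size left after dropping a column that makes up column_share of the bytes."""
    return size_gb * (1.0 - column_share)


BASE_GB = 100.0    # uncompressed size of the two-week 2016 trace
BASE_WEEKS = 2.0

for weeks in (8, 5, 4):
    full = scaled_size_gb(BASE_GB, BASE_WEEKS, weeks)
    low = without_column(full, 0.50)   # x_cache is 50% of each line
    high = without_column(full, 0.40)  # x_cache is 40% of each line
    print(f"{weeks} weeks: {full:.0f} GB with x_cache, "
          f"{low:.0f}-{high:.0f} GB without")
```

These are uncompressed figures; the compressed files would be smaller still.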

Does this sound more reasonable?

In summary, something like the following query

SELECT
  HASH(uri_host, uri_path) AS hashed_host_path,
  uri_query,
  content_type,
  response_size,
  time_firstbyte
FROM wmf.webrequest
WHERE
  x_cache LIKE '%cp3033%'
  AND year = 2019
  AND month = 5
  AND agent_type = 'user'
  AND http_status = '200';

Hi Francisco (@fdans) and @Nuria ,

Since we're now in Q3 and a few weeks have passed, I wanted to check in with you. It would be great to release this dataset in August, so that researchers can use this in the annual paper submission cycle (as mentioned above).

After some additional research, I found out that the dataset size would be much larger for a "text" than for an "upload" cache, due to the higher request rate. After some deliberation with other researchers in the field (at Princeton, CMU, Warwick, and TU Berlin), I believe that we can safely limit the total number of rows to about 4-5 billion (as a LIMIT). However, these researchers pointed out the importance of having several different locations, e.g., eqsin, esams, and eqiad.

So, I would like to ask for traces from "upload" caches (low request rate), where x_cache like

  • %cp5001%
  • %cp3034%
  • %cp1076%

As discussed above, I also validated that the following five columns minimize the data amount, while being sufficient for caching research purposes:

  • HASH(uri_host, uri_path) hashed_host_path, uri_query, content_type, response_size, time_firstbyte

Please let me know if you have any other questions, concerns, and suggestions.

Thank you,
Daniel

Nuria added a comment. Jul 26 2019, 1:52 AM

@Danielsberger we are in Q1: our fiscal year starts on July 1st, so it has only just begun. This quarter our team is a couple of people short due to family leave, so it will be hard for us to do this before next quarter, Q2.

@Nuria thank you for clarifying the timeline and sorry for my misunderstanding of the quarterly schedule.

All the best,
Daniel

Hi @Nuria and @fdans ,

I just wanted to politely check in about this quarter's schedule. I'm hoping some people are back and someone has a few cycles to create and export this trace :).

Thank you and all the best,
Daniel

Nuria added a subscriber: Lex. Edited Oct 16 2019, 3:26 PM

hello, @lexnasser will be working on this in Q2

Hi @Danielsberger, I’m working on compiling this new public dataset for your caching research. I had a few questions that I hope you could answer so that I could get a better understanding of your specific wants and needs for this new release:

  • What’s the reason for requesting a timestamp field now for this new dataset as opposed to in 2016 for the previous dataset?
  • What specifically would a timestamp field be used for? What advantage does it provide?
  • How does timestamp granularity affect the utility of the dataset?
  • What is the optimal compromise for you between timestamp granularity and overall dataset duration? (ex. 10 minute granularity over 3 days vs. 1 day granularity over 14 days vs. no timestamp over 8 weeks, etc.) This tradeoff is most related to privacy considerations.
  • Are there any feature additions or changes other than timestamp in this new dataset to improve its utility for you that I could look into implementing?

Let me know if you have any questions or concerns that I could address as well. Thanks!

Hi @lexnasser ,

Let me first answer why we need a timestamp field. At a high level, the goal of most caching research projects is to come up with a new algorithm, compare it to the original system, and show that it gets better performance. Timestamps occur in both parts of the process. There are many algorithms that explicitly rely on timestamps as an input. So, without timestamps we cannot simulate these algorithms. For the actual comparison of systems, timestamps are necessary to get almost all the interesting performance metrics.

For example, one of the most important metrics is response time (aka latency). Response time largely depends on how many requests arrive within a short period of wall clock time, as they're queued up and processed step by step. Without accurate timestamps, we need to make assumptions on the statistical properties of the arrival process, e.g., we could assume that all requests arrive equally spaced out over time, as a Poisson process, or something like that. Unfortunately, while it is known that these models are incorrect (they lead to arbitrarily inaccurate response time numbers), it is not well known what a realistic assumption on the arrival process would be. Consequently, traces without timestamps have led to caching system research that is not useful in practice, e.g., research that leads to very high response times. This means that the Wikimedia Foundation's caches are less (power and cost) efficient than they could be if academic research were applicable.
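A toy example may help show why the arrival pattern matters. The sketch below simulates a single first-come-first-served server, which is a deliberately simplified model and not how the production caches work; `mean_response_time()` is a hypothetical helper. The same ten requests are fed in once evenly spaced and once as a single burst.

```python
def mean_response_time(arrivals, service_time):
    """Simulate one FIFO server; return the mean time from arrival to completion.

    `arrivals` must be sorted in ascending order.
    """
    server_free_at = 0.0
    total = 0.0
    for t in arrivals:
        start = max(t, server_free_at)   # wait if the server is still busy
        server_free_at = start + service_time
        total += server_free_at - t      # queueing delay plus service time
    return total / len(arrivals)


SERVICE_TIME = 0.5                       # seconds per request (assumed)
evenly_spaced = [float(i) for i in range(10)]   # one request per second
single_burst = [0.0] * 10                       # all ten requests at t = 0

print(mean_response_time(evenly_spaced, SERVICE_TIME))
print(mean_response_time(single_burst, SERVICE_TIME))
```

With even spacing no request ever waits, whereas in the burst each request queues behind all earlier ones, so the mean response time is several times higher even though the request count and service time are identical.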

Our larger goal is to enable caching researchers to produce more practically useful algorithms and systems. This is not possible without timestamps for the trace.

The timestamp granularity has a massive impact on the dataset's utility. This is because bursts of arrivals happen at all timescales, e.g., at the sub-millisecond, second, minute, and even hour level. For example, in some new traces of other large websites, we can show that with nanosecond timestamps, we can very accurately reproduce the response time of the original system. It is not currently understood how exactly the timestamp granularity affects the accuracy of the evaluation result. We can't even study this question, because 99% of traces don't have accurate timestamps.

What is the actual accuracy from the underlying timestamp in the database? I expect we have only seconds?

Can you elaborate on how the granularity vs length trade-off was decided? Some of the servers see 2000-4000 requests per second, whereas others see 30000 requests per second (during the day's peak period). If we focus on the busiest server and the busiest hour of the week, shouldn't that minimize any privacy concerns as many individual requests are grouped together? (Note also that an individual user request is randomly routed to a server, so there's an additional sampling factor there). What's the threshold at which we believe that a sufficient number of user requests are grouped together in order to minimize privacy concerns? E.g., 10000 grouped together?
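One way to make the threshold question concrete is to count how many requests share each timestamp bucket and compare the smallest count against a chosen threshold. This is only a sketch of that idea: the threshold values and the helper names `min_requests_per_bucket()` and `granularity_is_acceptable()` are hypothetical, and buckets without any request are ignored.

```python
from collections import Counter


def min_requests_per_bucket(timestamps, bucket_seconds):
    """Smallest number of requests that share one timestamp bucket."""
    counts = Counter(ts // bucket_seconds for ts in timestamps)
    return min(counts.values())


def granularity_is_acceptable(timestamps, bucket_seconds, threshold):
    """True if every non-empty bucket groups at least `threshold` requests."""
    return min_requests_per_bucket(timestamps, bucket_seconds) >= threshold


# Tiny made-up example: integer timestamps in seconds.
example = [0, 0, 0, 1, 1, 2]
print(min_requests_per_bucket(example, 1))       # per-second buckets
print(min_requests_per_bucket(example, 3))       # three-second buckets
print(granularity_is_acceptable(example, 3, 5))  # hypothetical threshold of 5
```

Coarser buckets group more requests together, which is the trade-off between timestamp granularity and privacy discussed here.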

With regard to feature additions and other changes, I think we might want to postpone that. My experience in the past was that this is determined by how many GBs of data you can write out at once. It might be more interesting to have traces from several servers, maybe at different dates, than to have an individual dataset with many more columns.

Hi @Danielsberger, thanks for the thorough response. I'm currently reviewing all the different configurations of the features of the dataset and will try to accommodate your needs as much as practical. And yes, the underlying timestamp uses second-granularity.

I was wondering if it would be sufficient to include relative timestamps, where the first entry has, let's say, a timestamp of 0, and every other entry has a timestamp equal to the number of seconds since that first entry. This is as opposed to an absolute timestamp, with which you'd be able to determine the specific date and/or time at which the entries occur.
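The relative-timestamp idea amounts to subtracting the first entry's timestamp from every entry. A minimal sketch, assuming second-granularity Unix timestamps; `to_relative()` is a hypothetical helper and the input values are made up:

```python
def to_relative(timestamps):
    """Shift timestamps so that the earliest entry becomes 0."""
    if not timestamps:
        return []
    ordered = sorted(timestamps)
    first = ordered[0]
    return [ts - first for ts in ordered]


# Made-up absolute Unix timestamps, in seconds.
absolute = [1572000000, 1572000000, 1572000001, 1572000079]
print(to_relative(absolute))
```

Differences between entries are preserved, so inter-arrival times survive, but the absolute date and time can no longer be read off the trace.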

If you have any suggestions regarding the balance of timestamp sensitivity and timestamp utility, I'm all ears. Have a good weekend :)

Hi @lexnasser ,

Relative timestamps are a great idea and absolutely sufficient for caching research. I should've come up with that myself and mentioned it earlier.

Let me think about the balance of timestamp sensitivity.
Daniel

lexnasser added a comment. Edited Nov 6 2019, 9:08 PM

Hi @Danielsberger ,
I'm almost finished compiling the data. This is what the dataset would look like:

NOTE THAT THIS IS FAKE EXAMPLE DATA.

Notes                      | relative_unix | hashed_host_path_query | image_type | response_size | time_firstbyte | x_cache
First row                  | 0             | 3284820126             | jpeg       | 18136         | 6.7234E-05     | cp1050 hit/2, cp2005 hit/20
                           | 0             | -2920338696            | jpeg       | 32783         | 6.628E-05      | cp1074 hit/11, cp2005 hit/2
                           | 1             | -5944228               | png        | 1451          | 0.075938463    | cp1048 miss, cp2005 hit/1
                           | 1             | 4779824171             | jpeg       | 1485          | 4.6015E-05     | cp1072 hit/13, cp2005 hit/20
                           | 2             | 27896322315            | png        | 1920          | 4.1008E-05     | cp1064 hit/22, cp2005 hit/35
                           | ...           | ...                    | ...        | ...           | ...            | ...
79 seconds since first row | 79            | 1627943997             | jpeg       | 7252          | 0.039715528    | cp1074 hit/37, cp2005 hit/3
                           | ...           | ...                    | ...        | ...           | ...            | ...

Let me know if you see any issues or have any questions, requests, or recommendations with the format, content, or whatever. Thanks!

@Danielsberger

Also, I saw that in your 2016 dataset request (link) that you wanted a separate query field for a save flag.

Is this still needed?
If so, is it sufficient to either hash all the query values that are not null or just use a boolean variable if the query is not null instead of keeping them as plain text?
There were some values in the query field in the 2016 dataset that included URLs, which would not mesh well with the new timestamps for privacy reasons, so I won't be able to include them in plain text as before.

Hi @lexnasser ,

I have some thoughts and questions about the overall dataset / example from above and on the save flag.

Overall dataset:

  • Are we narrowing the query to a single server, e.g., via WHERE x_cache like '%cp3033%' ?
  • Which server are we using? Ideally we'd actually create two datasets, one for a cache_text and one for a cache_upload server, but since the ATS deployment (replacing Varnish) I can't figure out the right x_cache query.
  • I'm afraid that we'll have too much data, as Nuria previously pointed out. The x_cache field is one of the largest, we had this in the last dataset and no researcher / paper (afaik) used it. I think we can drop the x_cache column in the output (but keep it in the where clause).
  • How are we limiting the response size? It would be great to cover a longer time (say 4 weeks) period.

On the save flag:

We dropped the save flag in 2016, because we focused on a cache_upload server. Due to the very different statistical properties of cache_text and cache_upload it is important that we include a cache_text this time (one of the most frequent requests from other researchers). For cache_text , we'll need the save flag.

The save flag can be binary (and has been in the even older 2007 dataset). In 2007, the save flag was computed via an awk script after the actual export as described on the Analytics Mailing list (Sender: Tim Starling, Subject "Re: [Analytics] Request stream data set for cache tuning", on 25 Feb 2016). I've copied the function from vu.awk from Tim's email:

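# Return "save" when a URL ending in action=submit was answered with a
# cache miss and a 302 redirect; otherwise return "-".
function savemark(url, code) {
    if (url ~ /action=submit$/ && code == "TCP_MISS/302")
        return "save"
    return "-"
}

# Skip lines whose field 5 starts with one of the listed IP prefixes,
# then print field 3, field 9 (the URL), and the save mark.
$5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
    print $3, $9, savemark($9, $6)
}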

This actually emits a text response ("save" or "-"), but we really just need 0/1. My reading is that we can get the action=submit part from the uri_query and the code from cache_status. Can you confirm that?
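For reference, the 0/1 variant of that function could look like the sketch below. It is a direct port of the awk condition under the assumption, still to be confirmed, that uri_query and cache_status carry the same information as the URL and status code fields in the old logs; `save_flag()` is a hypothetical name.

```python
def save_flag(uri_query: str, cache_status: str) -> int:
    """Return 1 when the request looks like an edit save, else 0.

    Mirrors the awk rule: the query must end in "action=submit" and the
    status must be exactly "TCP_MISS/302".
    """
    is_submit = uri_query.endswith("action=submit")
    return int(is_submit and cache_status == "TCP_MISS/302")


print(save_flag("?title=Draft:IM_Entertainment&action=submit", "TCP_MISS/302"))
print(save_flag("?title=Draft:IM_Entertainment&action=submit", "TCP_HIT/200"))
```

Note that, like the awk regular expression, this only matches when action=submit is the last parameter of the query string.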

Ok, I looked up the current Varnish/ATS server assignment in puppet/conftool-data/node.

I think esams looks like an interesting and stable workload. Specifically, two servers: cp3050 as a cache_text server (running Varnish) and cp3051 as a cache_upload server (running ATS). Unfortunately, according to the instance drill-down, they only have about 2 weeks' worth of past requests. In fact, due to the ongoing deployment changes, I could find few (busy) servers that have a longer history.

I propose that we compile a small data set (similar to how we've done it in 2016) to test everything and estimate the overall dataset size for cp3050 (and cp3051 if time allows). By next week or the week after, we'll probably have enough data to export the whole dataset.

Let me know what you think.
Thank you,
Daniel

Nuria added a comment. Nov 8 2019, 8:01 PM

we'll need the save flag.

If I understood your requirement correctly, you need a unique identifier that links the page with the "save", so as to know when the cache has expired for a given item. Example:

This means that for a request like

en.wikipedia.org	/w/index.php	?title=Draft:IM_Entertainment&action=submit

you need to get a record like:

2019-01-01 13:45  Draft:IM_Entertainment saved

and also later

2019-01-01 14:00 Draft:IM_Entertainment  some-cache-info

Besides concerns with data size around releasing the cache_text dataset this means that this data cannot be obtained with a simple select as it requires quite a bit of url parsing.

Hi @Nuria ,

Thank you for your comment. I agree that with two records, this would be somewhat hard. But I believe that we don't need the second record, just the first. The goal is to detect writes (via submit), which we assume to immediately invalidate the cache.

Starting with this request

en.wikipedia.org /w/index.php ?title=Draft:IM_Entertainment&action=submit

the output could be just two columns, where the first column hashes the page identifier and the second column is binary (condition uri_query %like% "action=submit").

hash(en.wikipedia.org/w/index.php?title=Draft:IM_Entertainment) 1

I don't know the query interface well enough, but if it's hard to add a conditional column, we could also split the export in "likely reads" (where uri_query not %like% "action=submit") and "likely writes" (where uri_query %like% "action=submit").

Does this sound more reasonable?

Thank you,
Daniel

Hi @Danielsberger,

Are we narrowing the query to a single server, e.g., via WHERE x_cache like '%cp3033%' ?

Yes. I’m using WHERE x_cache like '%cp5006%' .

Which server are we using? Ideally we'd actually create two datasets, one for a cache_text and one for a cache_upload server, but since the ATS deployment (replacing Varnish) I can't figure out the right x_cache query.

As above, I’m using 5006, which is for images only via upload.wikimedia.org.

I'm afraid that we'll have too much data, as Nuria previously pointed out. The x_cache field is one of the largest, we had this in the last dataset and no researcher / paper (afaik) used it. I think we can drop the x_cache column in the output (but keep it in the where clause).

To confirm, the remaining fields are: relative_unix, hashed_host_path_query, image_type, response_size, time_firstbyte . Is that proper?

How are we limiting the response size? It would be great to cover a longer time (say 4 weeks) period.

I’m not sure what you mean by limiting the response size - I currently have no filters on the response size. I’ll have to consider the longer time period.

Let me know if I missed or misunderstood anything. Thanks!

Nuria added a comment. Edited Nov 11 2019, 9:07 PM

Let's first narrow down the upload dataset. From your request the text dataset is quite a different one.

hash(en.wikipedia.org/w/index.php?title=Draft:IM_Entertainment)

This would work if requests for this page also came in this form, but they do not; the URL looks quite different, more like 'en.wikipedia.org/wiki/SomeTitle'. So, in order to create the data, page title extraction needs to happen for every submit URL.

@lexnasser : yes I can confirm that the fields (except "save") are: relative_unix, hashed_host_path_query, image_type, response_size, time_firstbyte

@Nuria : ok, let's do upload first. We don't need to know the exact page title, this is just my ignorance of how things are stored. Hashing the host/path/query like you did in the previous dataset seems fine (both for upload and text). The only difference would be the save column, which is 1 if uri_query %like% "action=submit" and 0 otherwise.

Nuria added a comment. Nov 12 2019, 4:46 PM

The only difference would be the save column, which is 1 if uri_query %like% "action=submit" and 0 otherwise.

If you do not identify pages per title, you would not know which objects are expiring. A page can be accessed via a mobile URL and seconds later be edited via a desktop URL; the object it represents in the cache will expire, but the URLs by which those two actions happen are not similar. Does this make sense?

@Danielsberger

The only difference would be the save column, which is 1 if uri_query %like% "action=submit" and 0 otherwise.

The upload(.wikimedia.org) uri_query field does not contain an action=submit parameter for any entry.

@lexnasser : is this for a text_cache or for upload_cache (like cp5006)? I expect that only text caches (like cp5008) would see submit queries.

@Nuria : thank you again for clarifying, I now understand the extraction point. Is your concern the running time of the query, which has to apply a regex to every row? In that case, would it be possible to do the two extra columns (save flag + hashed title) only for a subset of the text queries, e.g., the first few hundred million rows?

@Danielsberger

is this for a text_cache or for upload_cache (like cp5006)? I expect that only text caches (like cp5008) would see submit queries.

This is for all cpXXXX under the uri_host upload.wikimedia.org, which deals mainly with image types, as was used in the 2016 set.

The submit queries are relevant to other uri_hosts like en.wikipedia.org, etc.

Thanks @lexnasser , that's very helpful. Is there a way to check how frequently submit queries happen for hosts other than upload?

@Danielsberger

I'm not sure if there's a public-facing way to check the frequency of submit queries. Will have to defer to @Nuria about that.

That said, I believe that *.wikimedia.org uri_hosts are the main sources of submit queries.

Nuria added a comment. Nov 13 2019, 4:36 PM

@Danielsberger clarifying a bit:

  • the upload dataset that will be provided will not have any "save" flags, @lexnasser is finalizing that one
  • we can work on a different dataset, a text one that could incorporate "save" flags for documents. There are about 200 edits per minute in Wikipedia, so that is the number of "saves" you might see in a minute of data for the whole fleet. Again, we cannot possibly provide a dataset as large as two weeks with all requests and all "saves"; it is much too large. Also, I am not sure how this text dataset fits with the idea of having only requests that reach one host: submit requests are not cached and, I think, can hit any host.

In that case, would it be possible to do the two extra columns (save flag + hashed title) only for a subset of the text queries, e.g., the first few hundred million rows?

Few hundred million rows is not actually small either but in any case, yes, we would need to limit the data

Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board. Nov 14 2019, 5:00 PM

Thank you, @Nuria . Having the upload data set without a save flag makes perfect sense and is great!

I did not know that there are only 200 edits per minute across all servers / shards. With that knowledge, we probably can/should ignore edits. The reason why timestamps at second-accuracy are safe (from a privacy perspective) is that we aggregate thousands of requests under a single timestamp. This won't be true for edits, so I think we won't be able to export them.

In conclusion, we can also move ahead with exporting a text dataset (e.g., for cp5008) without the edit/save and title columns.

@Danielsberger

Checking in again.

I have documented the tentative info about the upload dataset on Wikitech here. Note that the links on the page point to the 2016 release, but will be updated upon this upcoming release.

Let me know if you see any issues with it. I don't have a precise timeframe for this release, but the data will undergo a privacy review this week and will hopefully be released soon after that if all is good.

Also, from my understanding, a save flag for the text dataset is no longer required. Let me know if my understanding is correct. Under that assumption, I will be able to provide the details about that version soon.

@lexnasser this is great, thank you! I really like the idea of a dedicated wikitech page. I can contribute some parsing and processing scripts in the future, and we might link them there, too (if people are interested).

For the text dataset: yes, let's go ahead without the save flag.

I'm sorry I held up the process with edits, which apparently occur so rarely (and which I could have gotten from official Wikimedia statistics). If you have some time after this, it would be great to do some supplemental analysis on when edits occur and whether they happen on pages that are also read very frequently. When caching on modern flash drives (SSDs/NVMe), it really makes sense to store pages and media objects that are read many times but almost never updated/written (as writing to flash is costly at scale). So, understanding these distributions would be very interesting. I could open another issue for that (but I also don't want to bother the analytics team with something that is mostly interesting to infrastructure folks).

@Danielsberger contributions to wikitech page are most welcome!

@Danielsberger

Updated Wikitech (LINK) once again with a description about the text data. Let me know if you see any last-minute issues.

The plan is to publish everything this Wednesday, December 4.

@Ottomata ^ I'll need your help to move the files to the correct location for release.

Ok! Find me on IRC let's do it!

Hi @lexnasser ,

Thank you for setting up the detailed description on wikitech. This is really great!

One thing that I'm wondering about is that the text traces are so small. From my understanding, the trace size is mostly determined by how many requests are in the file and text caches have a higher request rate than upload caches. So, shouldn't the text file be larger than the upload trace file?

Specifically, if we look at the instance drill down for cp5006 (which is an upload cache, and you mentioned before): the non-purge request rate is between 250 and 1.75K operations per second (tls).
Looking at the instance drill down for cp5008 (a text cache): the rate is between 650 and 3.2K operations per second (tls).

From these numbers, I might expect the text trace file to be about 2x larger than the upload file. It might be slightly smaller than that, because the text files have one less field (a single integer).

Am I missing something?

Nuria added a comment. Dec 4 2019, 3:42 PM

@Danielsberger It is the other way around: the volume of requests to upload is a lot higher. Consider that a web page is made up of one text document and many images, and the images are requested from upload.

1a1a11a added a subscriber: 1a1a11a. Dec 5 2019, 9:55 PM

The data has been released!

URL: https://analytics.wikimedia.org/published/datasets/caching/2019/

Wikitech will be updated soon. I hope you find this data very useful. Let me know if you experience any issues with it.

Thanks to Nuria for all the help!

Nuria moved this task from In Code Review to Done on the Analytics-Kanban board. Dec 6 2019, 7:05 PM
ema awarded a token. Dec 8 2019, 9:25 AM
Nuria closed this task as Resolved. Dec 11 2019, 5:02 PM

Hi @Nuria, @lexnasser and everyone else, thank you for the datasets, they are great assets to the research community!
I have the same question as Daniel: Grafana shows that the text cache serves a request rate several times higher than the upload cache (https://grafana.wikimedia.org/d/000000450/varnish-traffic-instance-breakdown?orgId=1&from=now-7d&to=now&var-datasource=esams%20prometheus%2Fops&var-cache_type=upload&var-server=All&var-layer=frontend); however, the text dataset is much smaller than the upload dataset, which seems odd.
I understand Nuria says upload cache servers have larger volumes because each page has several images, but there are also a lot of pages not having any image.
Is it possible to have a check? Thank you!

@1a1a11a

Thanks for the question! Nuria is more aware of the intricacies of the source of the data than I, but I believe the main other factor that limits the amount of data from the text cache is that the text data is filtered by is_pageview .

This means that it only includes html page webrequests on Wikipedia, which account for less than 10% of all text webrequests. Looking at Grafana, even though the text cache_type accounts for roughly twice the number of requests as the upload cache_type, pageview text webrequests should be outnumbered by image requests ~5:1, meaning that that factor alone should explain roughly half of the file-size difference. That, along with the lack of an image_type field, exclusion of non-Wikipedia sources (Wiktionary, etc.), and several other filters/factors should make up the remaining disparity.
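The ~5:1 figure follows directly from the two numbers quoted above. A short sketch of that arithmetic, using the rough values from this comment (text sees about twice as many requests as upload, and pageviews are at most 10% of text requests); `upload_requests_per_pageview()` is a hypothetical helper:

```python
def upload_requests_per_pageview(text_to_upload_ratio: float,
                                 pageview_share: float) -> float:
    """How many upload requests there are for each pageview text request."""
    pageviews_per_upload_request = text_to_upload_ratio * pageview_share
    return 1.0 / pageviews_per_upload_request


# Rough inputs from the comment above; 0.10 is the stated upper bound.
ratio = upload_requests_per_pageview(text_to_upload_ratio=2.0, pageview_share=0.10)
print(f"about {ratio:.0f} upload requests per pageview request")
```

Because the pageview share is an upper bound, the actual ratio would be at least this large.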

Let me know if you have any more questions related to this topic. Thanks!

Hi @lexnasser, thank you for the quick reply! This is helpful. For me to better understand the filtering, what is this filter by is_pageview? What are the requests that are not pageview? Thank you!

Nuria added a comment. Dec 13 2019, 4:05 PM

@1a1a11a This is the pageview definition for requests that have is_Pageview=1: https://meta.wikimedia.org/wiki/Research:Page_view

To sum it up: pageviews and "all requests to text cache" are quite different things.

I see, thank you again for explaining and providing the trace!