
Roll-up raw sessionTick data into distribution
Closed, Resolved · Public

Description

Context

Having spoken with the Product Analytics team, the general consensus is that aggregations of the raw sessionTick data can be computed to support the following:

Acceptance Criteria
  • As a data scientist or analyst, I am able to query the SessionLength data to display the following distributions
    • Average Session Length
    • Minimum Session Length
    • Maximum Session Length
    • Median Session Length
    • Distribution by Quantile
    • Distribution by Count "bucket"
Note
  • Sampling is a requirement to be decided upon before deployment to all wikis, but it does not block this work from being done.
Data-QA sign off
  • Data in the sessionized table mforns.session_length looks good

Event Timeline

If we consider a simplified view of the sessionTick data, like:

timestamp                domain           tick
2021-01-01T00:00:00      en.wikipedia     1
2021-01-01T00:00:01      fr.wikipedia     1
2021-01-01T00:00:03      en.wikipedia     2
...

One possible path is to use a Spark-Scala job to transform it into a data set with 1 row per session, like:

session_end_timestamp      domain           session_length
2021-01-01T00:00:03        en.wikipedia     123
2021-01-01T00:00:05        fr.wikipedia     234
...

This is not strictly an aggregation, but it would allow us to use Presto-SQL's approx_percentile function to calculate and chart percentiles (25, 50, 75, 95, etc.) in Superset.
The advantage of this approach is that Superset could show percentile values for any specified time period or domain set, given that the Superset query would aggregate counts for the specified slice/dice and only then calculate the percentiles.
It remains to be seen whether this solution performs well, given that the intermediate data set is not aggregated and is therefore bigger in size.
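
For illustration only, a minimal Spark-Scala sketch of that sessionization step could look like the one below. It assumes a hypothetical session_id field on the raw events (not shown in the simplified view above) so that ticks can be grouped per session; field names follow the simplified view, and the output table name is a placeholder, not the actual job's.

```scala
// Illustrative sketch only, not the actual job. Assumes a hypothetical
// session_id field on the raw events so that ticks can be grouped per session;
// the output table name is a placeholder.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SessionizeTicks {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sessionize-ticks").getOrCreate()

    val ticks = spark.read.table("event.mediawiki_client_session_tick")

    // One row per session: the highest tick seen is the session length,
    // and the latest timestamp marks the end of the session.
    val sessions = ticks
      .groupBy("session_id", "domain")
      .agg(
        max("timestamp").as("session_end_timestamp"),
        max("tick").as("session_length")
      )
      .drop("session_id")

    sessions.write.mode("overwrite").saveAsTable("some_db.session_length_draft")
  }
}
```

Superset could then have Presto compute approx_percentile(session_length, ...) over whatever slice of this output is selected.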

Another option could be to aggregate across the time and tick dimensions. Considering that the tick interval is 1 minute, the resulting data would look like:

day             domain           session_length_minutes       count
2021-01-01      en.wikipedia     1                            2534
2021-01-01      en.wikipedia     2                            8456
2021-01-01      en.wikipedia     3                            735
2021-01-01      fr.wikipedia     1                            376
2021-01-01      fr.wikipedia     2                            543
2021-01-01      fr.wikipedia     3                            87
...

This would make the intermediate data set much smaller in size, and simpler to query than the raw one.
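
As a rough sketch, assuming a one-row-per-session DataFrame shaped like the example above (session_end_timestamp, domain, session_length, with lengths already in minutes), the aggregation could look like:

```scala
// Rough sketch: count sessions per day, domain and session length (in minutes),
// starting from a one-row-per-session DataFrame like the example above.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def aggregateSessionLengths(sessions: DataFrame): DataFrame =
  sessions
    .withColumn("day", to_date(col("session_end_timestamp")))
    .withColumnRenamed("session_length", "session_length_minutes")
    .groupBy("day", "domain", "session_length_minutes")
    .count()
```
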
However, I see a couple of details that seem inconvenient:

  • There might be a long tail of buckets for higher session lengths. Maybe we could consider reorganizing the buckets exponentially (e.g.: 0-1m, 1-5m, 5-15m, 15-60m, 1-3h, 3h+)? See the sketch after this list.
  • Superset will not be able to calculate percentiles on top of this data with approx_percentile. We could either have yet another intermediate dataset that allows that, or have Superset graph "bucket percentages" instead of percentiles.
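
A minimal sketch of the exponential bucketing idea from the first bullet, with bucket boundaries taken from the example there (the helper itself is hypothetical):

```scala
// Hypothetical helper mapping a session length in minutes to an exponential bucket.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

def lengthBucket(minutes: Column): Column =
  when(minutes < 1, "0-1m")
    .when(minutes < 5, "1-5m")
    .when(minutes < 15, "5-15m")
    .when(minutes < 60, "15-60m")
    .when(minutes < 180, "1-3h")
    .otherwise("3h+")

// The aggregation could then count per bucket instead of per exact minute, e.g.:
// sessions.withColumn("bucket", lengthBucket(col("session_length_minutes")))
//         .groupBy("day", "domain", "bucket").count()
```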

By "bucket percentage" I mean something like:

filter by: period=2021-01-01->2021-01-10 domain=en.wikipedia
split by: time(granularity=day)

2021-01-01
0-1m: 10% of total
1-5m: 15% of total
5-15m: 18% of total
...

2021-01-02
0-1m: 11% of total
1-5m: 14% of total
5-15m: 19% of total
...

To me this is like an "inverted percentile": it contains the same information as the percentiles, is much lighter in terms of performance, and is more malleable! If visualized in a stacked chart it's pretty informative, IMO.
However, it's not standard like percentiles, and as Kate says, it doesn't show "a number" for session length. For this reason, the Presto-SQL option above might be better...
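
For illustration, here's a minimal sketch of how those bucket percentages could be computed from bucketed daily counts (day, domain, bucket, count); the function and column names are hypothetical:

```scala
// Illustrative sketch: share of sessions per bucket within each day/domain slice.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

def bucketPercentages(bucketCounts: DataFrame): DataFrame = {
  val perSlice = Window.partitionBy("day", "domain")
  bucketCounts.withColumn(
    "pct_of_total",
    round(col("count") * 100.0 / sum(col("count")).over(perSlice), 1)
  )
}
```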

Hey all :]
I looked a bit into the size and length of the session_tick data that we're collecting right now, to determine what sampling rate we'll need to use.

tl;dr
I believe ~1/60 is the maximum sampling rate that we can use.
But it would be good to not force it. If we could do with less data, I'd choose 1/100.
Feel free to disagree, and let's discuss!

Long version:

CURRENT SIZE

  • The current length (#records) of the data is around 20M records per day, peaking at 400-500 ticks per second.
  • The current size (bytes) of the data is around 2.2GB per day (if we count 3x HDFS replication: 6.8GB per day).
  • To compare it with a relatively large existing dataset, pageview_hourly: session_tick is 30 times smaller in length (#records) and 7 times smaller in size (bytes) than pageview_hourly.

CURRENT SOURCES

  • We're currently collecting data from group0 and group1 wikis, which by my (potentially wrong!) calculations account for approx. 1/12th of total traffic.
  • We're currently not sampling the data collection. Rate = 1/1.

PROJECTION

  • When we deploy session_tick to all wikis, the size and length of the data might grow to about 12 times their current values.
  • If we keep the sampling rate at 1/1, that would mean session_tick could have roughly 240M records per day (less than half the #records of pageview_hourly), and 26GB per day (~40% more bytes than pageview_hourly).
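
(Spelling out the arithmetic: 20M records/day * 12 = 240M records/day, and 2.2GB/day * 12 ≈ 26.4GB/day.)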

INTERMEDIATE TABLE

  • The intermediate table would be smaller than the raw session_ticks table.
  • I did a couple of queries and it seems that the records with tick=0 are 1/10th of the total. This indicates that the intermediate table would have about 1/10th as many records as the session_tick table.
  • The intermediate table would only store the necessary fields (year, month, day, wiki, session_length), unlike the raw session_ticks table, which stores lots of meta fields. After checking, the records of the intermediate table are about 300 times lighter than the records in the raw session_ticks table. Combined with the #records reduction, this would make the intermediate table about 3000 times smaller in size (!!!) than the raw table.

PRESTO DATA SIZE

  • As mentioned in Presto's documentation, Presto is able to query a data set like MediaWikiHistory, which is about 750GB and 3B rows overall (spread over 20 years of data).
  • The corresponding daily size would be: 750GB/7300days=100MB/day, and 3B/7300days=400K rows/day.

CONCLUSION

  • The intermediate table has to be 100MB/day and 400K rows/day maximum.
  • Then, the raw table has to be 4GB/day (40*100MB) and 4M rows/day (10*400K) maximum.
  • With 1/1 sampling rate, though, the raw table is still 26GB/day and 240M rows/day.
  • For data size, the maximum sample rate would be 1/6. But for #records, the maximum sample rate would be 1/60.
  • With further additions to the intermediate table, like new dimensions, the data size would grow, but the #records would not. So I'd put more trust in the maximum sampling rate derived from the #records: 1/60.
  • We might want to use a sampling rate below that maximum, to leave some slack for safety.
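
(In other words, using the figures above: max rate by size = 4GB / 26GB ≈ 1/6.5 → ~1/6; max rate by #records = 4M / 240M = 1/60.)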

OTHER NOTES

  • The raw session_tick data is going to be deleted after 90 days, as part of the automatic purging of privacy-sensitive data. I think that's OK, given that we'll keep the data in the intermediate table.
  • 90 days of raw data would sum up to 26GB/60*90=39GB total. I think that's acceptable.
  • With a sampling rate of 1/60, the instrumentation would produce about 400*12/60=80 events per second to Kafka, which is perfectly OK.

One minor note: the instrument is currently active on wiki groups 0 and 1 (not just 1). This probably doesn't affect the analysis above meaningfully, though, given the probably very low traffic on group 0 wikis.

Oh, thanks for the detail! I believe the calculations are still valid, because I based them on the current size of the collected data, but will update the comment above!

> Hey all :]
> I looked a bit into the size and length of the session_tick data that we're collecting right now, to determine what sampling rate we'll need to use.

Thank you @mforns for crunching the numbers and writing all of this up

> tl;dr
> I believe ~1/250 is the maximum sampling rate that we can use.
> But it would be good to not force it. If we could do with less data, I'd choose 1/1000.

I'm for sampling aggressively, especially at first. 1/1000 does not sound like a problem. We may find that we have insufficient data for certain projects etc. We can solve that problem when we encounter it.

Hey all, an update on this:
I'm testing some queries for aggregation and got to see some examples of what the intermediate table might look like.
However, I think we might have an issue with the 1/1000 sampling rate. With this rate, lots of smaller wikis would not reach the number of measurements necessary to have statistically sound percentiles.
So, bad news. But there's good news too :]
All this led me to review the calculations for the maximum sampling rate above, and I calculated again how much the raw data is going to grow when we switch it on for all wikis.
In the initial projection, I said 50x the current size, but after looking a bit closer (using the actual list of all wikis in group2, and querying wmf.projectview_hourly) I found that the growth factor should be about 12x.
I will update the calculation comment above with the new numbers, and highlight all changes in bold.

I also checked the real size of the aggregated intermediate table, and it is waay smaller than the raw one.
I guesstimated it would be 40 times smaller... Well, it's about 3000 times smaller. I updated the sampling rate calculation above accordingly (changes in bold).

Hey all! In this Superset chart you can see what the session length data looks like for the first intermediate-table approach.
This example shows 0.25, 0.5, 0.75 and 0.95 percentiles for ca.wikipedia from 2021-01-01 to 2021-01-08 (daily).

After looking a bit into the intermediate table data, I've seen that lots of small wikis end up having very few session_length measurements per day (<1000).
To get an idea of how many measurements we need to produce statistically sound percentiles, how many wikis satisfy that threshold, and how the sampling rate might affect them, I put together this spreadsheet:
https://docs.google.com/spreadsheets/d/1ae8GG4KsvGLbYyFMlcJ-4nqIsUG_B_TlmJ2U7sibFB8/edit#gid=0

I discussed with the Analytics team how much session tick data our pipeline could possibly take, given that aggressive sampling rates result in very low accuracy for our metric.
The conclusion we arrived at is that the pipeline, as it is today, could take up to 1/10 of all session tick data, theoretically without problems.
If necessary, we could consider collecting all data un-sampled, but that would probably require some extra work, e.g. scaling EventGate, making sure the session tick stream does not starve smaller streams when importing and processing the data, setting up a specific deletion schedule to avoid keeping TBs of data, etc.

kzimmerman triaged this task as Medium priority. Feb 8 2021, 7:30 PM

Status Update: Currently performing data-QA (added checkbox to description). Will update status this week.

The QA checks performed to validate the data between the raw table (event.mediawiki_client_session_tick) and the intermediate table (mforns.session_length) are available in this Notebook.