Description

While we are resetting the data, it might make sense to just start counting over 7 days. This way we can track the impact of the changes we make more accurately (a 30-day moving average tends to be a weaker indicator).
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | mforns | T117615 Provide weekly app session metrics separately for Android and iOS, and move to 7 day counts [13 pts]
Duplicate | | None | T117637 Move App session data to 7 day counts
Event Timeline
(To record our in-person discussion about this from a few days ago here, and to expand a bit on it:)
It makes sense to me that a 7-day window might make the information a bit more actionable. (Per T86535#1016026 the main reason for choosing 30 days was just that "the old ad-hoc reports" used that window; on the other hand, Nuria cautioned there that anything below one week would probably not yield very meaningful data.)
However, T117615 is explicitly not about resetting the data, but about adding separate rows: "Since we have already collected quite a bit of historical data at this point for the aggregated (iOS & Android) metric, we should keep generating it as before, and add the platform-specific data separately."
It should be fine though to use a different window for the new platform-specific rows, as long as the value of the date_range column is set correctly.
@JKatzWMF this requires testing: calculating sessions requires quite a bit of data, and until we test we will not know if doing it every 7 days renders meaningful data. My recommendation: get one of your devs to test this out. It should be easy to test (but lengthy), given that it is a matter of modifying the parameters of the current report and running it 10 to 20 times for the ten weeks of data we normally hold in the cluster.
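For illustration, a first pass at such a test might look like the following in Hive. This is only a sketch: the `wmf.webrequest` columns, the `wmfuuid` key, and the user-agent filter are assumptions about where the raw data lives, not the actual report code.

```sql
-- Sketch: count app installs that had a "session" (> 1 request) in one
-- 7-day window. Rerunning this per week over the ~10 weeks of retained
-- data would show whether 7-day counts are stable enough to be meaningful.
SELECT COUNT(*) AS installs_with_session
FROM (
  SELECT x_analytics_map['wmfuuid'] AS install_id  -- assumed install-ID key
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year = 2015 AND month = 11 AND day BETWEEN 1 AND 7  -- 7-day window
    AND user_agent LIKE '%WikipediaApp%'                    -- assumed app UA filter
  GROUP BY x_analytics_map['wmfuuid']
  HAVING COUNT(*) > 1
) sessions;
```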
@Nuria thanks for your response. Can you explain to me why the following holds true?
until we test we will not know if doing it every 7 days renders meaningful data
Is this about the % of sessions that span the start and end of the period in question? Given the low median we see, I would be very surprised if there were a large number of sessions that were cut off by this. Certainly the volume running through both apps is significant enough at this point that 7 days should be enough...
@JKatzWMF:
Is this about the % of sessions that span the start and end of the period in question?
Right, the number of users that have a session within the period. A session is defined as having more than 1 hit/request in the period for which it is calculated. The longer the period, the more sessions you have. To report percentiles 1, 50, 90 and 99 you need a significant volume of data for which percentile reporting makes sense.
We need to test that that is the case (and you might be totally right that there is enough volume of data)
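On the percentile question specifically, one quick stability check would be to compute the reported percentiles over each 7-day window and compare them across weeks. A minimal sketch, assuming a hypothetical `sessions_7d` table holding one `session_length` value (in seconds) per session in a given window:

```sql
-- Hypothetical: sessions_7d holds one row per session in one 7-day window.
-- If p1/p50/p90/p99 jump around week to week, the window is too small.
SELECT percentile_approx(session_length, array(0.01, 0.50, 0.90, 0.99))
  AS p1_p50_p90_p99
FROM sessions_7d;
```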
FWIW, the dataset has contained some sessions that are longer than a week, although losing these outliers to truncation effects would probably increase rather than decrease data quality ;)
(1121930 seconds = 13 days)
```
hive (wmf)> SELECT date_range, max FROM mobile_apps_session_metrics WHERE type = 'SessionLength';

date_range               max
2015-5-3 -- 2015-6-1     89947
2015-5-10 -- 2015-6-8    66993
2015-5-17 -- 2015-6-15   70800
2015-5-24 -- 2015-6-22   70800
2015-5-31 -- 2015-6-29   70800
2015-6-7 -- 2015-7-6     70800
2015-6-14 -- 2015-7-13   63407
2015-6-21 -- 2015-7-20   84005
2015-6-28 -- 2015-7-27   688800
2015-7-5 -- 2015-8-3     1121930
2015-7-12 -- 2015-8-10   1121930
2015-7-19 -- 2015-8-17   1121930
2015-7-26 -- 2015-8-24   605873
2015-8-2 -- 2015-8-31    52522
2015-8-9 -- 2015-9-7     52522
2015-8-16 -- 2015-9-14   52522
2015-8-23 -- 2015-9-21   49544
2015-8-30 -- 2015-9-28   366136
2015-9-6 -- 2015-10-5    366136
2015-9-13 -- 2015-10-12  486039
2015-9-20 -- 2015-10-19  514369
2015-9-27 -- 2015-10-26  514369
2015-10-4 -- 2015-11-2   514369
```
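Since only aggregates are stored, the existing table can at least bound how often this happens. Using the same columns as above, the following counts how many of the 30-day reporting periods saw a maximum session length above 7 days:

```sql
-- 604800 seconds = 7 days; these are the periods whose longest session
-- would have been truncated by a 7-day window.
SELECT COUNT(*) AS periods_with_week_plus_sessions
FROM mobile_apps_session_metrics
WHERE type = 'SessionLength'
  AND max > 604800;
```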
We need to test that that is the case (and you might be totally right that there is enough volume of data)
I'm really not concerned about this given our overall volume and Tilman's point about artificial session length. Regardless, do you have a sense for when this could be prioritized and where this fits into your queue?
At best, we will be able to get to this by the end of this quarter, not before, but we can probably give you a more precise estimate later this week.
This is what I understand: we should change the report to add a column for "Device", or change the type column to reflect the device type, and produce per-device and overall metrics. We should also change the jobs to run both over a 30-day period and over 7-day periods. The 7-day periods are needed for more granularity, but since they do not exist going back to May 2015, we will continue to generate the 30-day periods for historical comparison. All the data could go into the same table; different report date ranges will reflect the time periods the reports are run over.
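In Hive terms, and purely as a sketch (the real table is produced by the reporting job, so the actual change would live in that job's code rather than in ad-hoc DDL), this could amount to:

```sql
-- Hypothetical: add a platform column to the existing table; date_range
-- then distinguishes 7-day from 30-day rows.
ALTER TABLE mobile_apps_session_metrics
  ADD COLUMNS (platform STRING COMMENT 'android, ios, or all');

-- Resulting series in one table:
--   platform = 'all'             with 30-day date_range (historical series)
--   platform = 'android'/'ios'   with 7-day date_range (new series)
```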
Hi Nuria, this was already specified in the main task (T117615; TL;DR: yes, for now we should continue the existing weekly reports covering 30 days' worth of data for all devices).
Jon created a new ticket here to discuss a separate change to be implemented in those new platform-specific metrics, namely that they should cover 7 days instead of 30 days (and then we won't need 30-day per-platform metrics). I understand that this won't pose significant challenges besides deciding on the format in which to store the data (I think it's fine to use the existing table as specified in T117615, considering that it already has a column that clearly specifies the timespan, but I am open to other options).
Thanks! Yes, that's correct (I posted my comment above without seeing that you had already provided a summary). To be extra clear, we don't need 30-day per-platform reports or 7-day overall reports.