
Move App session data to 7 day counts
Closed, Duplicate · Public

Description

While we are resetting the data, it might make sense to just start counting 7 days. This way we can track the impact of changes we make more accurately (a 30-day moving average tends to be a weaker indicator).

Event Timeline

JKatzWMF raised the priority of this task from to Needs Triage.
JKatzWMF updated the task description.

(to record our in-person discussion about this from some days ago here, and expand a bit on it:)

It makes sense to me that a 7-day window might make the information a bit more actionable. (Per T86535#1016026 the main reason for choosing 30 days was just that "the old ad-hoc reports" used that window; on the other hand, Nuria cautioned there that anything below one week would probably not yield very meaningful data.)

However, T117615 is explicitly not about resetting the data, but about adding separate rows: "Since we have already collected quite a bit of historical data at this point for the aggregated (iOS & Android) metric, we should keep generating it as before, and add the platform-specific data separately."

It should be fine though to use a different window for the new platform-specific rows, as long as the value of the date_range column is set correctly.

@JKatzWMF this requires testing: calculating sessions requires quite a bit of data, and until we test we will not know if doing it every 7 days renders meaningful data. My recommendation: get one of your devs to test this out. It should be easy (but lengthy) to test, given that it is a matter of modifying the parameters of the current report and running it 10/20 times for the ten weeks of data we normally hold in the cluster.

@Nuria thanks for your response. Can you explain to me why the following holds true?

until we test we will not know if doing it every 7 days renders meaningful data

Is this around the % of sessions that take place during the start and end of the period in question? Given the low median we see, I would be very surprised if there was a large number of sessions that were cut off by this. Certainly the volume running through both apps is significant enough at this point that 7 days should be enough...

@JKatzWMF:

Is this around the % of sessions that take place during the start and end of the period in question?

Right, the number of users that have a session within the period. A session is defined as having more than 1 hit/request in the period for which it is calculated. The longer the period, the more sessions you have. To report percentiles 1, 50, 90 and 99, you need a volume of data large enough for percentile reporting to make sense.

We need to test that that is the case (and you might be totally right that there is enough volume of data)
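To make the volume concern concrete, here is a minimal Python sketch of reporting percentiles 1, 50, 90 and 99 with the nearest-rank method. The thread does not say how the actual report computes percentiles, so the method, the function names (`nearest_rank_percentile`, `report`) and the sample values are all assumptions for illustration.

```python
def nearest_rank_percentile(sorted_values, p):
    """Return the p-th percentile (0 < p <= 100) by the nearest-rank method.

    Assumption: nearest-rank is used here only for illustration; the real
    report may compute percentiles differently.
    """
    n = len(sorted_values)
    if n == 0:
        raise ValueError("need at least one value")
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = (p * n + 99) // 100
    return sorted_values[max(rank, 1) - 1]


def report(session_lengths, percentiles=(1, 50, 90, 99)):
    """Map each requested percentile to its value for the given sessions."""
    ordered = sorted(session_lengths)
    return {p: nearest_rank_percentile(ordered, p) for p in percentiles}


# Hypothetical session lengths in seconds (not taken from the real dataset).
small_sample = [30, 45, 60, 120, 600]
print(report(small_sample))
```

With only five sessions, the 90th and 99th percentiles both land on the single largest value, which is the kind of degenerate result a too-small window could produce. Whether 7 days of real app traffic is anywhere near that regime is exactly what the proposed test would show.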

FWIW, the dataset has contained some sessions that are longer than a week, although losing these outliers to truncation effects would probably increase rather than decrease data quality ;)

(1121930 seconds = 13 days)

hive (wmf)> SELECT date_range, max FROM mobile_apps_session_metrics WHERE type = 'SessionLength';

date_range	max
2015-5-3 -- 2015-6-1	89947
2015-5-10 -- 2015-6-8	66993
2015-5-17 -- 2015-6-15	70800
2015-5-24 -- 2015-6-22	70800
2015-5-31 -- 2015-6-29	70800
2015-6-7 -- 2015-7-6	70800
2015-6-14 -- 2015-7-13	63407
2015-6-21 -- 2015-7-20	84005
2015-6-28 -- 2015-7-27	688800
2015-7-5 -- 2015-8-3	1121930
2015-7-12 -- 2015-8-10	1121930
2015-7-19 -- 2015-8-17	1121930
2015-7-26 -- 2015-8-24	605873
2015-8-2 -- 2015-8-31	52522
2015-8-9 -- 2015-9-7	52522
2015-8-16 -- 2015-9-14	52522
2015-8-23 -- 2015-9-21	49544
2015-8-30 -- 2015-9-28	366136
2015-9-6 -- 2015-10-5	366136
2015-9-13 -- 2015-10-12	486039
2015-9-20 -- 2015-10-19	514369
2015-9-27 -- 2015-10-26	514369
2015-10-4 -- 2015-11-2	514369
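
As a quick check of the figures above, the following Python sketch flags which 30-day windows recorded a maximum session longer than 7 days. The rows are copied from the query output; the names `ROWS` and `windows_exceeding` are made up for this illustration.

```python
# (date_range, max session length in seconds), copied from the query output.
ROWS = [
    ("2015-5-3 -- 2015-6-1", 89947),
    ("2015-5-10 -- 2015-6-8", 66993),
    ("2015-5-17 -- 2015-6-15", 70800),
    ("2015-5-24 -- 2015-6-22", 70800),
    ("2015-5-31 -- 2015-6-29", 70800),
    ("2015-6-7 -- 2015-7-6", 70800),
    ("2015-6-14 -- 2015-7-13", 63407),
    ("2015-6-21 -- 2015-7-20", 84005),
    ("2015-6-28 -- 2015-7-27", 688800),
    ("2015-7-5 -- 2015-8-3", 1121930),
    ("2015-7-12 -- 2015-8-10", 1121930),
    ("2015-7-19 -- 2015-8-17", 1121930),
    ("2015-7-26 -- 2015-8-24", 605873),
    ("2015-8-2 -- 2015-8-31", 52522),
    ("2015-8-9 -- 2015-9-7", 52522),
    ("2015-8-16 -- 2015-9-14", 52522),
    ("2015-8-23 -- 2015-9-21", 49544),
    ("2015-8-30 -- 2015-9-28", 366136),
    ("2015-9-6 -- 2015-10-5", 366136),
    ("2015-9-13 -- 2015-10-12", 486039),
    ("2015-9-20 -- 2015-10-19", 514369),
    ("2015-9-27 -- 2015-10-26", 514369),
    ("2015-10-4 -- 2015-11-2", 514369),
]

SECONDS_PER_DAY = 86400
WEEK_SECONDS = 7 * SECONDS_PER_DAY


def windows_exceeding(rows, limit_seconds):
    """Return the rows whose max session length is above limit_seconds."""
    return [(date_range, mx) for date_range, mx in rows if mx > limit_seconds]


over_a_week = windows_exceeding(ROWS, WEEK_SECONDS)
longest = max(mx for _, mx in ROWS)

for date_range, mx in over_a_week:
    print(f"{date_range}: {mx / SECONDS_PER_DAY:.1f} days")
```

Only these windows contain a session that a 7-day window would necessarily truncate; every other maximum fits within a week.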

@Nuria

We need to test that that is the case (and you might be totally right that there is enough volume of data)

I'm really not concerned about this given our overall volume and Tilman's point about artificial session length. Regardless, do you have a sense for when this could be prioritized and where this fits into your queue?

At best, we will be able to get to this by the end of this quarter, not before, but we can probably give you a more precise estimate later this week.

Then @Tbayer, @JKatzWMF

Would you mind clarifying what is the requirement?

You want monthly report to continue? (if so no changes to that report will be done)

You also want weekly metrics per device? (if so a new report of weekly metrics per device will be created)

@Nuria,

this is what I understand - We should change the report to add a column for "Device" or change the type column to reflect the device type, and produce per device and overall metrics. We should also change the jobs to run both over a 30 day period and over 7 day periods. The 7 day periods are needed for more granularity, but since these are not available from May 2015, we will continue to generate the 30 day periods for historical comparison. All the data could go into the same table, there will be different report date ranges to reflect the time periods the reports are run over.

Then @Tbayer, @JKatzWMF

Would you mind clarifying what is the requirement?

You want monthly report to continue? (if so no changes to that report will be done)

You also want weekly metrics per device? (if so a new report of weekly metrics per device will be created)

Hi Nuria, this was already specified in the main task (T117615 - TLDR: yes, for now we should continue the existing weekly reports covering 30 days' worth of data for all devices).
Jon created a new ticket here to discuss a separate change to be implemented in those new platform-specific metrics, namely that they should cover 7 days instead of 30 days (and then we won't need 30-day per-platform metrics). I understand that this won't pose significant challenges besides deciding about the format in which to store the data (I think it's fine to use the existing table as specified in T117615, considering that it already has a column that clearly specifies the timespan, but am open to other options).

@Nuria,

this is what I understand - We should change the report to add a column for "Device" or change the type column to reflect the device type, and produce per device and overall metrics. We should also change the jobs to run both over a 30 day period and over 7 day periods. The 7 day periods are needed for more granularity, but since these are not available from May 2015, we will continue to generate the 30 day periods for historical comparison. All the data could go into the same table, there will be different report date ranges to reflect the time periods the reports are run over.

Thanks! Yes, that's correct (I posted my comment above without seeing that you had already provided a summary). To be extra clear, we don't need 30-day per-platform reports or 7-day overall reports.
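
The agreed plan (30-day overall rows plus 7-day per-platform rows in the same table) could be sketched in Python as below. This is only an illustration: the date_range format is inferred from the query output earlier in the thread, while the assumption that both windows end on the same day, the `"all"` label, and the function names are hypothetical.

```python
from datetime import date, timedelta


def format_date_range(end, days):
    """Format an inclusive window of `days` days ending on `end`.

    Matches the style seen in the existing table, e.g. '2015-5-3 -- 2015-6-1'
    (no zero padding).
    """
    start = end - timedelta(days=days - 1)
    return (f"{start.year}-{start.month}-{start.day} -- "
            f"{end.year}-{end.month}-{end.day}")


def planned_rows(end):
    """Hypothetical row plan for one weekly run: (platform, date_range).

    Assumption: the 30-day overall window and the 7-day per-platform
    windows share the same end date.
    """
    rows = [("all", format_date_range(end, 30))]
    rows += [(platform, format_date_range(end, 7))
             for platform in ("iOS", "Android")]
    return rows


for platform, date_range in planned_rows(date(2015, 11, 2)):
    print(platform, date_range)
```

Because each row carries its own date_range value, the 30-day and 7-day metrics can share a table without ambiguity, and no 30-day per-platform or 7-day overall rows are generated.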

Nuria triaged this task as Medium priority. Nov 30 2015, 7:41 PM
Nuria raised the priority of this task from Medium to High. Dec 3 2015, 6:17 PM