
Huge reports that could clog Wikimetrics may happen accidentally, add a warning
Closed, DeclinedPublic

Description

Add a warning when a user tries to run a report that would return more than X data points, where X is sufficiently large.


Version: unspecified
Severity: enhancement

Details

Reference
bz58754

Related Objects

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:40 AM
bzimport set Reference to bz58754.
bzimport added a subscriber: Unknown Object (MLST).

bingle-admin wrote:

Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/1347

Rather than a warning (which in my experience users often do not read), I think it would be better to determine the threshold "X" and not let users run reports large enough to cause problems.

I suggested a hard limit, and people like Dario and Jamie were against that. It's possible that in some rare cases people might need to run very large reports. I agree with you about the warning, but without getting into user roles and different privileges, it's the only solution I see.
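A minimal sketch of the pre-run check being debated here, assuming a report produces one data point per (user, metric, time slice): estimate the size up front and either warn or refuse. The names (MAX_DATA_POINTS, check_report_size, allow_large) and the threshold value are illustrative assumptions, not part of the actual Wikimetrics code.

MAX_DATA_POINTS = 1_000_000  # assumed threshold "X"


def estimated_data_points(cohort_size, metric_count, time_slices):
    """One data point per (user, metric, time slice) combination."""
    return cohort_size * metric_count * time_slices


def check_report_size(cohort_size, metric_count, time_slices, allow_large=False):
    """Return (ok, message); block oversized reports unless explicitly allowed."""
    points = estimated_data_points(cohort_size, metric_count, time_slices)
    if points <= MAX_DATA_POINTS:
        return True, None
    if allow_large:
        return True, "Warning: report will produce about %d data points." % points
    return False, ("Report would produce about %d data points, which exceeds "
                   "the limit of %d." % (points, MAX_DATA_POINTS))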

If there are huge cohorts, maybe the tool could automatically split them into subcohorts, schedule each subcohort separately, and create a temporary report storing results that would then be aggregated.
This is possible if the SQL queries contain only aggregatable items (all of them should be aggregatable, because Wikimetrics should only be used to generate aggregate data, to respect users' privacy).
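As an illustration of the splitting idea (not actual Wikimetrics code), a large cohort could be cut into fixed-size slices, each of which becomes its own scheduled job whose partial results are stored and combined later. The function name and the default size are assumptions.

def split_cohort(user_ids, subcohort_size=5000):
    """Yield consecutive slices of the cohort, each at most subcohort_size users."""
    for start in range(0, len(user_ids), subcohort_size):
        yield user_ids[start:start + subcohort_size]


# Usage: each slice is scheduled as a separate job.
jobs = list(split_cohort(list(range(100_000))))  # 20 subcohorts of 5000 users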

So all data columns should specify the type of aggregate they use: COUNT, MIN, MAX, SUM.

Derived aggregates can be computed in a scheduled way using only these basic aggregates: this includes AVG (using SUM and COUNT) and STDDEV or VAR (using SUM(data), SUM(data^2), and COUNT).
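A sketch of how per-subcohort partial aggregates could be merged into the derived ones mentioned above, using the standard identities AVG = SUM/COUNT and VAR = SUM(data^2)/COUNT - AVG^2. The dictionary field names are illustrative assumptions.

import math


def merge_partials(partials):
    """Each partial is a dict with keys: count, sum, sum_sq, min, max."""
    count = sum(p['count'] for p in partials)
    total = sum(p['sum'] for p in partials)
    total_sq = sum(p['sum_sq'] for p in partials)
    avg = total / count
    var = max(total_sq / count - avg ** 2, 0.0)  # E[x^2] - E[x]^2, clamped for rounding
    return {
        'count': count,
        'min': min(p['min'] for p in partials),
        'max': max(p['max'] for p in partials),
        'avg': avg,
        'var': var,
        'stddev': math.sqrt(var),
    }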

The scheduler would then report the status of each subcohort processed. If needed, it could be paused at any time, when it has already run for too long but enough data has been generated to create a valid report, and resumed later when the servers experience lower workloads. The scheduler should also be able to monitor the time or workload taken by each subcohort, in order to estimate and adjust the size of the next subcohort, or to insert variable delays before processing the next subcohort.
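The adaptive sizing could look roughly like the sketch below: measure how long a subcohort took and scale the next one toward a target duration. The target, the bounds, and the function name are assumptions for illustration only.

TARGET_SECONDS = 60              # desired runtime per subcohort (assumed)
MIN_SIZE, MAX_SIZE = 100, 20000  # assumed bounds on subcohort size


def next_subcohort_size(current_size, elapsed_seconds):
    """Shrink or grow the subcohort size so the next run lasts ~TARGET_SECONDS."""
    if elapsed_seconds <= 0:
        return current_size
    scaled = int(current_size * TARGET_SECONDS / elapsed_seconds)
    return max(MIN_SIZE, min(MAX_SIZE, scaled))


# Example: a 5000-user subcohort that took 150 s suggests ~2000 users next time.
print(next_subcohort_size(5000, 150))  # -> 2000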

An SQL server admin could also kill a query that takes too much time or too many resources: that query would fail, the scheduler would detect the failure and pause processing until the cohort parameters are adjusted and the scheduler is relaunched to restart the work from the last failed subcohort. This would allow manual tuning of the subcohort sizes.
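One possible shape for that pause-and-resume behaviour, assuming a killed query surfaces as an exception; the class and method names are hypothetical, not from Wikimetrics.

class SubcohortScheduler:
    def __init__(self, subcohorts, run_subcohort):
        self.subcohorts = subcohorts        # list of subcohorts to process
        self.run_subcohort = run_subcohort  # callable executing one subcohort's query
        self.next_index = 0
        self.paused = False

    def run(self):
        while self.next_index < len(self.subcohorts) and not self.paused:
            subcohort = self.subcohorts[self.next_index]
            try:
                self.run_subcohort(subcohort)
            except Exception as error:       # e.g. query killed by an admin
                self.paused = True            # stop; resume() restarts at this subcohort
                print("Paused at subcohort %d: %s" % (self.next_index, error))
                return
            self.next_index += 1

    def resume(self):
        """Relaunch processing starting from the last failed subcohort."""
        self.paused = False
        self.run()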

(The cohort uploader may also consider splitting the cohort themselves into multiple cohorts of reasonable size. The same cohort creator should not have multiple cohorts being processed at the same time, but they could schedule them in order.)

mforns subscribed.

Declining because Wikimetrics is being discontinued. See: T211835.