
Make MobileWebUIClickTracking schema usable (too big)
Closed, ResolvedPublic

Description

An EventLogging user should be able to run queries on mobile web click data in under an hour.

MobileWebUIClickTracking_10742159 is getting too big (~300 GB).
https://meta.wikimedia.org/wiki/Schema:MobileWebUIClickTracking

  • Remove search events from schema.
  • Sample at 10%

Event Timeline

JKatzWMF raised the priority of this task from to Needs Triage.
JKatzWMF updated the task description. (Show Details)
JKatzWMF added a project: Web-Team-Backlog.
Jdlrobson set Security to None.

We already sample. We can sample more but this is going to be a lot of data regardless of what we do.

The main issue here, rather than sampling, is that purging is not happening as it should. Can you ask one of the analytics team to purge old items? (I'm not sure how to do that.)

I agree, we should aggregate the old data and then purge it.

Does this have to do with the recent emails about aggregation and purging?
(e.g. removing the edit count to make it bucketed, etc.)

Jhernandez renamed this task from make MobileWebUIClickTracking schema usable to Make MobileWebUIClickTracking schema usable (too big).Aug 14 2015, 11:34 AM
Jhernandez edited projects, added MobileFrontend; removed Web-Team-Backlog.

@Jhernandez, yes I think so. This task was linked in the email if I'm not mistaken.

No, separate concerns. This was instigated by my running the 'beta header' analysis.

Jhernandez removed a project: Web-Team-Backlog.

Who should we ping on Analytics about this?

@Jhernandez, I would check with @kevinator about who could help us with the purging. In the meantime, separating the search out from the rest of the clicks would help considerably as the vast, vast majority are search.

Given there is MobileWebSearch, do we even need this? Can that be used to the same effect?

@Jdlrobson I would recommend removing the search events. We could also do decreased sampling (smaller %), but only if we could sample at different rates for beta and stable. For beta, sampling any further will reduce accuracy to an unacceptable level.

Jdlrobson updated the task description. (Show Details)

Change 261192 had a related patch set uploaded (by Bmansurov):
Remove the 'search' event from Schema:MobileWebUIClickTracking

https://gerrit.wikimedia.org/r/261192

Is purging old data part of the task?

@bmansurov don't worry about that; I can sync with @Nuria to make that happen in sign-off.

Change 261192 merged by jenkins-bot:
Remove the 'search' event from Schema:MobileWebUIClickTracking

https://gerrit.wikimedia.org/r/261192

@Nuria I see the purging strategy is 90 days, per https://meta.wikimedia.org/wiki/Schema_talk:MobileWebUIClickTracking
Can you confirm this?
E.g. come February this table should begin shrinking in size automatically.

@JKatzWMF / @dr0ptp4kt @Tbayer is there a pressing need to purge sooner?

@Jdlrobson, to follow up on our videoconference: I'm indifferent. @Tbayer, for the sake of you being able to conduct analysis, should the existing records older than 90 days be deleted?

To address this once rather than in a more protracted manner, it seems simplest to change from the documented UPDATE strategy to a DELETE purge strategy for records older than 90 days, which would also ensure we generally conserve disk space automatically in the future.
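For illustration only, a purge along those lines might look something like the sketch below (the table name is the real one from this task, but the cutoff expression is a hypothetical example; the analytics team would decide on the actual mechanism and any batching):

-- Hypothetical sketch of a 90-day DELETE purge, assuming the usual
-- EventLogging yyyymmddhhmmss string timestamps; not the actual purge job.
DELETE FROM log.MobileWebUIClickTracking_10742159
WHERE timestamp < DATE_FORMAT(NOW() - INTERVAL 90 DAY, '%Y%m%d%H%i%S');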

@bmansurov, speaking to the aggregation: do you know whether the Limn charts would retain the historical aggregates even if we applied the DELETE purge strategy? I seem to recall for some Limn charts the entire chartable output is being fully regenerated with every cronjob execution, but for others the cronjob only looks at, for example, the previous day's records and effectively appends the chartable output.

@dr0ptp4kt, I also remember Limn appending new records, rather than re-generating everything all over. @Milimetric, do you happen to remember whether "the Limn charts would retain the historical aggregates even if we applied the DELETE purge strategy"?

@dr0ptp4kt I don't have current or planned analyses involving this schema, so no objection from me on the purging.

In general however, I wonder if we should veer more in the direction of reducing the sampling rate (perhaps excepting beta, per JonK above).

Per the task description, the goal is to be able to run queries "in under an hour", and I can confirm from experience with e.g. MobileWebSectionUsage (a related mobile web table) that this can be an issue - I've had queries take up to three days. The task description doesn't specify which queries the one-hour goal refers to. I would propose interpreting it as relatively simple queries (without JOINs, etc.) covering a timespan of one week (to account for the strong weekly periodicity of the data), though I haven't tested how long these currently take.

But as far as I'm aware, EventLogging tables already come with an index on the timestamp field. That means the overall size of the database (i.e. the aspect addressed by the 90-day purge) does not have a notable effect on query times as long as queries are restricted to a certain timespan. To make them faster and achieve this task's goal, one would need to reduce the sample rate instead (as proposed for the aforementioned MobileWebSectionUsage schema in T120292).
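To make that concrete, the kind of one-week query I have in mind is roughly the following (the grouped column is illustrative rather than a confirmed field of this schema):

-- Illustrative "simple query" over one week of data; event_name is a
-- placeholder for whatever click identifier the schema actually records.
SELECT event_name, COUNT(*) AS clicks
FROM log.MobileWebUIClickTracking_10742159
WHERE timestamp BETWEEN '20151201000000' AND '20151208000000'
GROUP BY event_name;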

A simple count(*) on events from November 2015 took 3.5 minutes, and in this case it looks like it was actually about 100 million rows scanned. I suppose reducing this down to a week and doing stuff like joins could still result in pretty long queries.
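For concreteness, the count was of roughly this shape (reconstructed from memory rather than pasted verbatim):

SELECT COUNT(*) FROM log.MobileWebUIClickTracking_10742159
WHERE timestamp LIKE '201511%';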

I agree that reduction of the sampling rate (and not logging the search event) should largely simplify things. @bmansurov's patch (https://gerrit.wikimedia.org/r/#/c/261192/) handles that, so we'll be covered, I think.

@Nuria, is the size of the table too big for the old data such that old rows need to be deleted? Or is it okay as is?

A simple count(*) on events from November 2015 took 3.5 minutes, and in this case it looks like it was actually about 100 million rows scanned. I suppose reducing this down to a week and doing stuff like joins could still result in pretty long queries.

A simple COUNT(*) will only use the index and hence be much faster than a comparable query that actually accesses the table's content (JOINs or not). So this particular kind of query shouldn't be our benchmark here.

(I did not understand this myself either until AaronH pointed it out to me last month, in the case of the MobileWebSectionUsage_14321266 table. The examples that had puzzled me back then:

  1. This one took 1 hour and 6 minutes, for (according to AaronH) 1.7 million rows back then:
SELECT SUM(IF(event_hasServiceWorkerSupport =1,1,0))/SUM(1) FROM log.MobileWebSectionUsage_14321266 WHERE timestamp LIKE '20151115%' AND wiki = 'enwiki';
  2. On the other hand, this one only took 3 seconds, despite covering the same rows (or even more, if the "wiki" field is indexed too, which I understand is the case):
SELECT COUNT(*) FROM log.MobileWebSectionUsage_14321266 WHERE timestamp LIKE '20151115%';

In other words, the execution time for a COUNT(*) and a simple query without JOINs etc. for the same rows can differ by a factor of more than 1000.)
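One way to check this directly is to run EXPLAIN on the two statements above: for the COUNT(*) the Extra column should report something like "Using index", whereas the SUM/IF query has to read the underlying rows.

EXPLAIN SELECT COUNT(*) FROM log.MobileWebSectionUsage_14321266
WHERE timestamp LIKE '20151115%';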

I agree that reduction of the sampling rate (and not logging the search event) should largely simplify things. @bmansurov's patch (https://gerrit.wikimedia.org/r/#/c/261192/) handles that, so we'll be covered, I think.

Ah OK, wasn't aware that the patch already reduces the sampling rate, too. Should be OK to wait and see then, until we have a concrete use case again.

@Nuria, is the size of the table too big for the old data such that old rows need to be deleted? Or is it okay as is?

Well, looking at the size I would say it is huge (see below).

Now, it is true that purging is deactivated until we finish some DB updates that @jcrespo is doing, but I doubt a table of almost 500 GB is queryable in any timely manner.

mysql:research@analytics-store.eqiad.wmnet [log]> SELECT table_name, (DATA_LENGTH + INDEX_LENGTH)/1024/1024/1024 AS 'TOTAL SIZE (GB)', ENGINE, CREATE_OPTIONS FROM information_schema.tables WHERE TABLE_SCHEMA='log' AND table_name LIKE 'Mobile%' AND (DATA_LENGTH + INDEX_LENGTH) > 1 ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC;

+-------------------------------------------+------------------+--------+-----------------------------+
| table_name                                | TOTAL SIZE (GB)  | ENGINE | CREATE_OPTIONS              |
+-------------------------------------------+------------------+--------+-----------------------------+
| MobileWebUIClickTracking_10742159         | 479.602417084388 | TokuDB | compression='tokudb_zlib'   |
| MobileWikiAppToCInteraction_10375484      | 121.233774635941 | TokuDB | compression='tokudb_zlib'   |
| MobileWebEditing_8599025                  |  81.684224204160 | TokuDB | compression='tokudb_zlib'   |
| MobileWikiAppToCInteraction_8461467       |  79.390994886868 | TokuDB | compression='tokudb_zlib'   |
| MobileWikiAppSearch_10641988              |  72.639844453894 | TokuDB | compression='tokudb_zlib'   |
| MobileWikiAppArticleSuggestions_11448426  |  67.397623375989 | TokuDB | compression='tokudb_zlib'   |
| MobileWebSectionUsage_14321266            |  44.017015001737 | TokuDB | compression='tokudb_zlib'   |
| MobileWikiAppMediaGallery_10923135        |  40.088205121458 | TokuDB | compression='tokudb_zlib'   |
| MobileWebSectionUsage_15038458            |  33.697622315958 | TokuDB | compression='tokudb_zlib'   |
| MobileWikiAppArticleSuggestions_12443791  |  30.586776386015 | TokuDB | compression='tokudb_zlib'   |
| MobileWikiAppArticleSuggestions_10590869  |  24.782805022784 | TokuDB | compression='tokudb_zlib'   |
| MobileBetaWatchlist_5281061               |  13.858823995106 | TokuDB | compression='tokudb_zlib'   |
| MobileWikiAppOnboarding_9123466           |  12.828234877437 | TokuDB | compression='tokudb_zlib'   |
| MobileWikiAppShareAFact_11331974          |  11.165589747019 | TokuDB | compression='tokudb_zlib'   |
| MobileWikiAppOperatorCode_8983918         |  10.578740695491 | TokuDB | compression='tokudb_zlib'   |
+-------------------------------------------+------------------+--------+-----------------------------+