Decide on sampling rate for EventLogging
Closed, ResolvedPublic

Description

Once we make the feature the default on the Chinese, Greek and Catalan Wikipedias, we'll need a sampling rate of less than 1.

Page views per month (from http://stats.wikimedia.org/EN/Sitemap.htm)

| Language | Page views per month | Link |
| --- | --- | --- |
| Chinese | 601,992,893 | http://stats.wikimedia.org/EN/SummaryZH.htm |
| Greek | 30,853,193 | http://stats.wikimedia.org/EN/SummaryEL.htm |
| Catalan | 38,498,078 | http://stats.wikimedia.org/EN/SummaryCA.htm |

We also have ~22,620 people that have the feature activated on English Wikipedia.

Event Timeline

Prtksxna created this task. Jan 30 2015, 11:26 PM
Prtksxna raised the priority of this task from to High.
Prtksxna updated the task description.
Prtksxna added a project: Page-Previews.
Prtksxna added subscribers: Aklapper, Prtksxna, leila and 2 others.

@Quiddity, you had page view numbers for each WP? Added!

Prtksxna updated the task description. Feb 7 2015, 1:29 AM
Prtksxna set Security to None.

@Nuria, what other information do we need to decide a sampling rate?

Nuria added a comment. Feb 7 2015, 1:32 AM

@Prtksxna I guess we just need to know events per pageview to calculate how many events per second we will be seeing.

> @Prtksxna I guess we just need to know events per pageview to calculate how many events per second we will be seeing.

How do you suggest we go about that? Currently our events only log sessionId and have no page-specific information. Is there a way to estimate this?

Nuria added a comment. Feb 7 2015, 1:50 AM

Let's see:

  • Is every user to which a page is presented going to see a hovercard?
  • Does the hovercard view trigger 1 event or many?

What questions do you want to answer with the schema?

> Is every user to which a page is presented going to see a hovercard?

If they hover over a link, then yes!

> Does the hovercard view trigger 1 event or many?

One per link hovered or clicked on.

> What questions do you want to answer with the schema?

That is a discussion we (@Jaredzimmerman-WMF @DarTar and @leila) have had a couple of times. I'll update T88164 soon to reflect our discussion.

I have updated the description of T88166.

Nuria added a comment. Feb 7 2015, 2:17 AM

If you are potentially logging several events per pageview and (adding up all your pageviews above) you have about 250 pageviews per second, I would try to log no more than 2 events per second or so. That would mean 1/100 (rounding big time) if events were sent once per pageview, but if you think that users might click, say, 5 times per page to see hovercards, I would divide this by 5, so 1/500.

That would give you about 200k events every day, and you can run the experiment as long as needed. Note that we are not only concerned about a reasonable amount of throughput that EL can sustain but also about getting an amount of data that we can look at: if we log too much, we have to throw data away, as queries will not run.
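
The arithmetic behind that estimate can be sketched roughly as follows. This is only an illustration: the page view figures come from the task description, while the 30-day month and the function names are assumptions made for the example, not anything EventLogging provides.

```python
# Monthly page views copied from the task description.
MONTHLY_PAGEVIEWS = {
    "Chinese": 601_992_893,
    "Greek": 30_853_193,
    "Catalan": 38_498_078,
}
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month
SECONDS_PER_DAY = 24 * 60 * 60


def pageviews_per_second(monthly=MONTHLY_PAGEVIEWS):
    """Average page views per second across all listed wikis."""
    return sum(monthly.values()) / SECONDS_PER_MONTH


def sampling_denominator(target_events_per_second, events_per_pageview=1.0):
    """Return N such that logging 1 in N events lands near the target rate."""
    unsampled = pageviews_per_second() * events_per_pageview
    return unsampled / target_events_per_second


def events_per_day(denominator, events_per_pageview=1.0):
    """Expected number of logged events per day at a 1-in-`denominator` rate."""
    per_second = pageviews_per_second() * events_per_pageview / denominator
    return per_second * SECONDS_PER_DAY


if __name__ == "__main__":
    print(round(pageviews_per_second()))        # roughly 250-260 page views/s
    print(round(sampling_denominator(2)))       # one event per page view
    print(round(sampling_denominator(2, 5)))    # five events per page view
    print(round(events_per_day(100)))           # on the order of 200k per day
```

With these inputs the unrounded denominators come out somewhat above 100 and 500, which is consistent with the "rounding big time" caveat above.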

Prtksxna updated the task description. Feb 7 2015, 3:44 AM

Thanks @Nuria! We'll use that formula to come to a number. In the case that we collect too many or too few events in the first week, we can revisit and tweak the number. Is that alright?

@leila, how many events per day do you think we need to do a successful analysis?
@Jaredzimmerman-WMF, do you have an estimate of how many links people hover over on a single page view? Can you extrapolate on the basis of your own usage?

Nuria added a comment. Feb 7 2015, 4:07 AM

Right, iterating sounds good.

leila added a comment. Feb 9 2015, 6:06 PM

@Prtksxna, it depends on what questions you want to answer at the end of the three months. Let's touch base on this when we meet this week. We can update this thread afterwards.

> @Prtksxna, it depends on what questions you want to answer at the end of the three months.

I think you stated the question quite clearly on Trello. I had copied it to the description of T88166 too.

@Nuria I think we have to look to someone with more familiarity in EventLogging to know what sampling rate would be safe, and not cause issues. Who is in the best place to make that recommendation based on the number of active PV on those wikis?

Nuria added a comment. Feb 18 2015, 1:11 AM

I already commented on a ticket about this, and @Prtksxna answered. Let me see if I can find the ticket.

Nuria added a comment. Feb 18 2015, 1:14 AM

Ah, it was this ticket!

"If you are potentially logging several events per pageview and (adding up all your pageviews above) you have about 250 pageviews per second, I would try to log no more than 2 events per second or so. That would mean 1/100 (rounding big time) if events were sent once per pageview, but if you think that users might click, say, 5 times per page to see hovercards, I would divide this by 5, so 1/500.

That would give you about 200k events every day, and you can run the experiment as long as needed. Note that we are not only concerned about a reasonable amount of throughput that EL can sustain but also about getting an amount of data that we can look at: if we log too much, we have to throw data away, as queries will not run."

Is this enough?

@leila, do you think that gives you enough data for a high level of confidence when comparing enabled vs. disabled user events? (Obviously we don't know what the opt-out rate will be.) Gergo said they used 1/1000 for MediaViewer.

Nuria added a comment. Feb 18 2015, 1:26 AM

The rate is not as important as the "amount" of data you will have at the end to do your analysis. Each feature has different usage and thus a different sampling rate.

@Jaredzimmerman-WMF, how you sample depends on what you want to do with the data.

If you want to be able to study how users use Hovercards within sessions, then you should sample sessions. If you want to see how users interact with Hovercards over the course of 90 days, then you should have userTokens with the expiry date set to more than 90 days, and you should sample users (something which is not in the schema as of now). If you go with the former, we should know something about the probability of each action; otherwise we can't use the sampled data to reconstruct the actual number of each action type. If this schema is already collecting data in Beta, we can use that data to compute the probabilities.
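
One common way to sample whole sessions (or users) rather than individual events is to decide from a hash of the token, so that every event carrying the same token gets the same decision. The sketch below is a generic illustration, not how EventLogging or Hovercards implement sampling; `in_sample` and `estimate_totals` are hypothetical names, and the scale-up step assumes every token is equally likely to be selected.

```python
import hashlib


def in_sample(token: str, one_in: int) -> bool:
    """Deterministically keep roughly 1 in `one_in` tokens.

    The same token always gets the same answer, so a sampled session
    (or user) contributes all of its events and an unsampled one none.
    """
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % one_in == 0


def estimate_totals(sampled_counts: dict, one_in: int) -> dict:
    """Scale per-action counts from the sample up to the full population.

    Assumption: tokens are sampled uniformly, so each count is simply
    multiplied by the inverse of the sampling rate.
    """
    return {action: count * one_in for action, count in sampled_counts.items()}
```

Passing a sessionId as the token samples sessions; passing a longer-lived userToken samples users, which is the distinction drawn above.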

The biggest question for sampling and schema itself is: what do you want to be able to say after 90 days? What should happen that you call the feature successful and call for pushing it to all wikis? The sampling and schema should reflect that goal.

Seems like an easy place to start would be counting hovercard displays per pageview; that would be easier than doing sessions (MediaWiki only has true sessions for logged-in users).

We can randomly sample pageviews and get that data.

Like:

if rand(1, 10) > 5:  // integers 1..10 inclusive, so this samples 50% of pageviews
    send all hovercard impressions for this page_id
    also send whether the user was logged in

Gather data over a long enough time period that we think we have data for the majority of the user base (anonymous or otherwise).
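
As a rough sketch of the pageview-level sampling described above: the decision is made once per pageview, and a sampled pageview reports all of its hovercard impressions. The function names and event fields here are made up for illustration and do not reflect the actual schema.

```python
import random


def should_sample_pageview(rate: float, rng=random) -> bool:
    """Decide once per pageview whether its impressions get logged."""
    return rng.random() < rate


def events_for_pageview(page_id, impressions, logged_in, rate=0.5, rng=random):
    """Return the events to send for one pageview (empty if not sampled).

    `impressions` is the list of links whose hovercards were shown.
    The field names below are hypothetical.
    """
    if not should_sample_pageview(rate, rng):
        return []
    return [
        {"pageId": page_id, "link": link, "loggedIn": logged_in}
        for link in impressions
    ]
```

Because the coin is flipped per pageview rather than per event, the number of hovercards per sampled pageview stays intact, which is what makes "hovercard displays per pageview" measurable.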

If you want to get session data things become a bit more complicated as you need to have a token that persists across page transitions.

More detail:

  1. Get an identifier for the user's visit to this page, for example visitId = randomIdentifier + pageId + timestamp (no need to persist it other than in memory, as it is meant to last only for the visit to that one page).
  2. Send visitId with every hovercard impression (including how long it was displayed).
  3. When analyzing, aggregate by visitId to get all hovercards displayed for one page.
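
The three steps above could look something like this. It is a minimal sketch under assumptions: the identifier format, the field names, and the helper names are invented for the example and are not part of the Hovercards schema.

```python
import secrets
import time
from collections import defaultdict


def new_visit_id(page_id: int) -> str:
    """Step 1: an in-memory identifier for a single visit to a single page."""
    return f"{secrets.token_hex(8)}-{page_id}-{int(time.time() * 1000)}"


def impression_event(visit_id: str, link: str, shown_ms: int) -> dict:
    """Step 2: one event per hovercard impression, tagged with the visitId."""
    return {"visitId": visit_id, "link": link, "shownMs": shown_ms}


def impressions_per_visit(events) -> dict:
    """Step 3: group events by visitId to see all hovercards shown per visit."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["visitId"]].append(event)
    return dict(grouped)
```

Since the identifier is never stored beyond the page visit, it cannot link one pageview to another, which keeps this simpler than session tracking.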

So long as we can compare hovercard views to page views…

https://gerrit.wikimedia.org/r/196190 adds a sampling rate of 1.
@leila Do we have a number now? I'll update the patch accordingly.

leila added a comment. Mar 19 2015, 1:56 AM

@Prtksxna, I just reviewed 11528589. Can you specify what events will be logged when the Hovercard is turned off by the user?

> @Prtksxna, I just reviewed 11528589. Can you specify what events will be logged when the Hovercard is turned off by the user?

Updated. You can see 11625443 now.

leila added a comment. Mar 19 2015, 6:13 PM

Thanks, @Prtksxna. Please log 1 out of 10 events.

Prtksxna closed this task as Resolved. Mar 20 2015, 5:01 AM
Prtksxna claimed this task.

As per @leila: the sampling rate has been decided as 1 out of 10 events.

We'll revisit this number if needed.

Quiddity moved this task from Next Up to Done on the Page-Previews board. Mar 28 2015, 1:14 AM