Increase sampling rates for search metrics on smaller language wikis
Open, Normal · Public

Description

The sampling rates are too low for some combinations in the language/project breakdown, particularly for smaller projects and smaller languages (see https://discovery.wmflabs.org/metrics/#langproj_breakdown).

Let's increase these sampling rates so we get more meaningful graphs.

Event Timeline

EBjune created this task. Jun 13 2018, 3:35 PM
Restricted Application added a subscriber: Aklapper. Jun 13 2018, 3:35 PM
TJones added a subscriber: TJones. Jun 14 2018, 5:48 PM
chelsyx moved this task from Triage to Backlog on the Product-Analytics board.
chelsyx added a subscriber: chelsyx.
Vvjjkkii renamed this task from Increase sampling rates for search metrics on smaller language wikis to i3aaaaaaaa. Jul 1 2018, 1:05 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from i3aaaaaaaa to Increase sampling rates for search metrics on smaller language wikis. Jul 2 2018, 1:52 PM
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.
MBinder_WMF triaged this task as Normal priority. Aug 2 2018, 8:22 PM
Nuria added subscribers: kzimmerman, EBernhardson, Nuria. Edited Feb 7 2019, 10:45 PM

I think the path to doing this could be as follows:

  1. Move the backend of the current dashboards (for some reports) to pull data from Hadoop via reportupdater rather than from MySQL. The frontend stays as is, but, similar to https://analytics.wikimedia.org/dashboards/browsers/#all-sites-by-os, the dashboards read TSVs generated from Hadoop instead of querying MySQL. (cc @chelsyx, @kzimmerman)
  2. Once the backend has been changed for (some of) the dashboards and reports, stop publishing the data for those schemas to MySQL. See the current data stream: https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&var-schema=TestSearchSatisfaction2
  3. Change the sampling rates here (per @EBernhardson), https://github.com/wikimedia/mediawiki-extensions-WikimediaEvents/blob/master/modules/all/ext.wikimediaEvents.searchSatisfaction.js#L128, so that every wiki is sampled at 1/10.
  4. Check whether this sampling is sufficient for minority languages.

Of these tasks, the one that would fall to Product-Analytics is the first, for which they would need some Analytics support.
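
To make "sampled at 1/10" in step 3 concrete, here is a minimal Python sketch of a deterministic one-in-N sampler keyed on a session identifier. It is an illustration only and is not taken from ext.wikimediaEvents.searchSatisfaction.js; the function name `in_sample`, the use of SHA-256, and the `session-<i>` identifiers are all assumptions made for the example.

```python
import hashlib


def in_sample(session_id: str, one_in: int) -> bool:
    """Return True if this session falls into a 1-in-`one_in` sample.

    Hypothetical helper: hashing the session ID keeps the decision stable
    for the whole session, so all of its events are either logged or not.
    """
    if one_in < 1:
        raise ValueError("one_in must be >= 1")
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % one_in == 0


# Made-up session IDs, used only to check that roughly 1 in 10 is selected.
sessions = [f"session-{i}" for i in range(10_000)]
sampled = [s for s in sessions if in_sample(s, 10)]
observed_rate = len(sampled) / len(sessions)
```

For step 4, the same helper can be pointed at a small wiki's expected session volume to see how many sessions per day would actually be logged at a given rate.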

Change 508935 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@master] Increase search satisfaction logging by 10x

https://gerrit.wikimedia.org/r/508935

I should note that we were already sampling at 1:10 for any wiki not explicitly named in the configuration. The patch applies a 10x multiplier to all sampling rates, but caps the result at 1:10. For low-volume wikis this means no change; mid-volume wikis will see their data increase by less than 10x. The largest wikis (those that still have explicit sampling rates) will see a full 10x increase in collected data.

The name and description of the ticket, though, seem to suggest that our existing 1:10 sampling does not yield enough data. Do we need to consider something more like 1:4?
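
The "10x, capped at 1:10" rule described above can be sketched as follows. This is a Python illustration under stated assumptions, not the patch itself: rates are written as "1 in N" denominators, and the wiki names and their starting values (`largewiki`, `midwiki`) are hypothetical.

```python
DEFAULT_ONE_IN = 10  # wikis not named in the config are already sampled 1:10


def boosted_one_in(one_in: int, factor: int = 10, cap_one_in: int = 10) -> int:
    """Multiply a 1-in-N sampling rate by `factor`, never going above 1:cap_one_in.

    A higher rate means a smaller denominator, so the cap is a lower bound on N.
    Integer division is used here for simplicity.
    """
    return max(cap_one_in, one_in // factor)


# Hypothetical explicit config: wiki name -> N in "1 in N".
hypothetical_config = {"largewiki": 2000, "midwiki": 50}

boosted = {wiki: boosted_one_in(n) for wiki, n in hypothetical_config.items()}
# largewiki: 1:2000 -> 1:200 (full 10x more data)
# midwiki:   1:50   -> 1:10  (capped, so only 5x more data)
# unlisted:  1:10   -> 1:10  (no change)
unlisted = boosted_one_in(DEFAULT_ONE_IN)
```

Moving the cap to 1:4, as raised above, would only require calling `boosted_one_in` with `cap_one_in=4` in this sketch.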

Change 508935 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Increase search satisfaction logging by 10x

https://gerrit.wikimedia.org/r/508935

I thought the goal was logging 100% for the smaller projects. 1:10 or even less is probably okay for the giant wikis, though sampling wikis at different rates still makes relative percentages of volume either wrong or complicated to compute.
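
To illustrate why mixed sampling rates complicate relative percentages, here is a small Python sketch that compares naive shares of sampled events with shares computed after scaling each wiki's count by its sampling denominator. The wiki names, counts, and rates are made up for the example.

```python
def estimated_totals(sampled_counts: dict, one_in: dict) -> dict:
    """Scale each wiki's sampled event count by its 1-in-N denominator."""
    return {wiki: count * one_in[wiki] for wiki, count in sampled_counts.items()}


def shares(counts: dict) -> dict:
    """Return each wiki's fraction of the total count."""
    total = sum(counts.values())
    return {wiki: count / total for wiki, count in counts.items()}


# Hypothetical data: both wikis logged the same number of sampled events,
# but at very different sampling rates.
sampled_counts = {"largewiki": 500, "smallwiki": 500}
one_in = {"largewiki": 200, "smallwiki": 10}

naive = shares(sampled_counts)  # treats sampled counts as if they were totals
corrected = shares(estimated_totals(sampled_counts, one_in))
```

In this made-up case the naive shares put both wikis at 50%, while the corrected shares put `smallwiki` below 5%. Logging every wiki at the same rate removes the need for the correction step.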

Nuria added a comment. May 13 2019, 2:38 PM

@TJones we can log at a higher rate but we should ramp up logging gradually.

Change 512056 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@master] Increase TestSearchSatisfaction sampling by 5x

https://gerrit.wikimedia.org/r/512056

Change 512056 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Increase TestSearchSatisfaction sampling by 5x

https://gerrit.wikimedia.org/r/512056

debt added a subscriber: debt. Jun 11 2019, 5:30 PM

Moving enwiki to full sampling is still the plan; we'll get it halfway there next week when the train starts running again. It's currently sampling at 1:40 and will move up slowly until it reaches 1:1.

Change 523826 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@master] Give all wikis the same search schema sampling rate

https://gerrit.wikimedia.org/r/523826

Change 523826 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Give all wikis the same search schema sampling rate

https://gerrit.wikimedia.org/r/523826

debt added a comment. Jul 23 2019, 5:33 PM

We're on all wikis now, but not quite at 100%

debt added a comment. Tue, Jul 30, 5:23 PM

We keep pushing the sampling rate up slightly and looking at the data coming in.