Increase default sampling ratio of ReadingDepth
Closed, Resolved | Public | 2 Estimated Story Points

Description

Set wgWMEReadingDepthSamplingRate to 0.1

Background:
The ReadingDepth schema is currently sampled at 0.001 (0.1% of sessions). It was reduced to this rate last year, down from the originally planned 0.5%, because of load issues with the old MariaDB EventLogging infrastructure. Since then, the new Hadoop EL environment has become available, which doesn't have these rate constraints. Last week, Analytics Engineering blacklisted ReadingDepth from MariaDB for us (T203596#4577520), because we are about to increase the event rate via a separate sample that will send ReadingDepth events as part of the Page Issues A/B test (T200792, sampled at 20% of sessions).

This task is about increasing the default sampling rate too. We are about to launch a separate research project where @Groceryheist will need to use this data for questions where 0.1% will be too low (e.g. how dwell time depends on content).

Acceptance criteria

  • Let analytics know this is happening before the deploy.
  • Monitor ReadingDepth event traffic post-deploy and that it matches expectations.
  • Check with analytics post-deploy.
  • Analyse any errors that are introduced in the EventLogging pipeline relating to this change (use stat1004 and kafkacat - T196904 has some good pointers).
  • Check error rate is not increased. If it is, understand the root cause and fix.

Event Timeline

Change 462042 had a related patch set uploaded (by HaeB; owner: HaeB):
[operations/mediawiki-config@master] Increase sampling ratio for ReadingDepth

https://gerrit.wikimedia.org/r/462042

ovasileva set the point value for this task to 2.
Jdlrobson raised the priority of this task from High to Needs Triage. Sep 24 2018, 5:15 PM
Jdlrobson updated the task description.

Change 462535 had a related patch set uploaded (by Pmiazga; owner: Pmiazga):
[operations/mediawiki-config@master] Increate ReadingDepthSamplingRate to 0.1

https://gerrit.wikimedia.org/r/462535

Change 462535 abandoned by Pmiazga:
Increate ReadingDepthSamplingRate to 0.1

Reason:
Abandoned in favour of I7501c3ff20b73af140b74f1297221aded950df1e

https://gerrit.wikimedia.org/r/462535

Monitor ReadingDepth event traffic post-deploy and that it matches expectations.

Per rOMWC6bf9cdc796ba: wme: Set ReadingDepth sampling rate to 0.1%:

The current sampling rate, 0.5% (the default), results in a peak rate of ~2500 events/minute.

So we should expect fewer events than that. Regardless, we saw a peak rate of over 18,000 events/minute for the Page Previews instrumentation, which the EventLogging pipeline (ingesting into Hadoop, not MariaDB) seemed to handle just fine.
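
As a rough sanity check on the expectation above, the peak rate quoted from the commit message can be scaled to another sampling rate. This is only a sketch: it assumes event volume scales linearly with the sampling rate, and the function name `scaled_peak_rate` is made up for illustration. The example uses the two rates named in the quoted commit (0.5% and 0.1%); other rates can be plugged in the same way.

```python
def scaled_peak_rate(baseline_events_per_min, baseline_rate, new_rate):
    """Estimate a peak event rate at new_rate from a peak observed at baseline_rate.

    Assumption (not from the task): volume scales linearly with sampling rate.
    Rates are fractions of sessions, e.g. 0.005 for 0.5%.
    """
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return baseline_events_per_min * (new_rate / baseline_rate)


# Figures from the quoted commit message: ~2500 events/minute at 0.5%.
estimate = scaled_peak_rate(2500, 0.005, 0.001)
print(round(estimate))  # roughly 500 events/minute at 0.1%
```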

Change 462042 merged by jenkins-bot:
[operations/mediawiki-config@master] Increase sampling ratio for ReadingDepth

https://gerrit.wikimedia.org/r/462042

Mentioned in SAL (#wikimedia-operations) [2018-09-25T11:20:31Z] <pmiazga@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:462042|Increase sampling ratio for ReadingDepth (T205176)]] (duration: 00m 50s)

Monitor ReadingDepth event traffic post-deploy and that it matches expectations.

Per https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&var-schema=ReadingDepth&from=1537874100000&to=now, we're seeing ~550 events/minute and 0 errors. For the latter, I've also been tailing the eventlogging_EventError Kafka topic [0] and I've seen no errors there either.

[0]
kafkacat -q -C -b kafka-jumbo1001.eqiad.wmnet -t eventlogging_EventError | grep ReadingDepth
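
If a per-schema tally is more useful than a plain grep, something like the following could replace the `grep` stage. It is a hedged sketch, not part of the original check: it assumes each message on the topic is one JSON object per line with the failing schema under `event.schema` (the same field the Hive queries in this task read), and the helper name `count_errors_by_schema` is hypothetical.

```python
import json
from collections import Counter


def count_errors_by_schema(lines):
    """Tally error events per schema from an iterable of JSON lines.

    Assumed message shape (not confirmed here): {"event": {"schema": "..."}}.
    Lines that are not valid JSON are counted under "<unparseable>".
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts["<unparseable>"] += 1
            continue
        event = record.get("event") if isinstance(record, dict) else None
        if isinstance(event, dict):
            counts[event.get("schema", "unknown")] += 1
        else:
            counts["unknown"] += 1
    return counts


sample = [
    '{"event": {"schema": "ReadingDepth"}}',
    '{"event": {"schema": "unknown"}}',
]
print(count_errors_by_schema(sample))
```

Passing `sys.stdin` instead of `sample` would let the function sit at the end of the kafkacat pipe.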

Further to T205176#4614616:

Between 12 AM and 6 AM today we saw an error rate of between 0.0005% and 0.0067%. The upper bound assumes that all erroneous events that couldn't be identified as belonging to any schema are ReadingDepth events.


[0]
select
    count(*) as n
from
    event.readingdepth
where
    year = 2018 and
    month = 9 and
    day = 26 and

    hour >= 0 and hour < 6
;

+-----------+
|     n     |
+-----------+
| 10754243  |
+-----------+
[1]
select
    event.schema as schema,
    count(*) as n
from
    event.eventerror
where
    year = 2018 and
    month = 9 and
    day = 26 and

    hour >= 0 and hour < 6 and

    event.schema in ("ReadingDepth", "unknown")
group by
    event.schema
;

+---------------+------+
|    schema     |  n   |
+---------------+------+
| ReadingDepth  | 56   |
| unknown       | 661  |
+---------------+------+
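
The bounds quoted above can be reproduced from the two query results. A minimal sketch of the arithmetic, assuming (as the figures suggest) that the denominator is the count of valid ReadingDepth events rather than valid plus erroneous events; the function name is illustrative only.

```python
def error_rate_bounds(valid_events, schema_errors, unknown_errors):
    """Return (lower, upper) error rates as percentages.

    lower: only errors attributed to the schema itself.
    upper: additionally treats every "unknown" error as belonging to the schema.
    """
    lower = 100.0 * schema_errors / valid_events
    upper = 100.0 * (schema_errors + unknown_errors) / valid_events
    return lower, upper


# Counts from queries [0] and [1] for 2018-09-26, hours 0 to 5.
lower, upper = error_rate_bounds(10754243, 56, 661)
print(f"{lower:.4f}% to {upper:.4f}%")  # 0.0005% to 0.0067%
```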

Thanks! Does this mean we can consider the AC "Analyse any errors that are introduced in the EventLogging pipeline relating to this change" fulfilled?

Sure! I've checked it off and I'll update the task description.

The event rate before and after deploy looks plausible from a glance at Grafana - closing this task now.