Sharp drop of navtiming daemon metrics report rate on 2021-01-21
Closed, Resolved, Public

Description

The drop corresponds to this deployment:

14:57 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Migrate QuickSurveys schemas to EventGate on all wikis - T271165, T271166 (duration: 01m 05s)

This change affects navtiming processing of QuickSurveys (https://phabricator.wikimedia.org/T271166#6765813), but it also seems to affect processing of all the other schemas the navtiming daemon consumes from Kafka.

NavigationTiming:

Screenshot 2021-01-21 at 17.33.13.png (322×446 px, 26 KB)

SaveTiming:

Screenshot 2021-01-21 at 17.34.33.png (363×367 px, 32 KB)

Event Timeline

Gilles added a subscriber: dpifke.

From: https://phabricator.wikimedia.org/T271166#6765916

Weird. No, your client should be working just fine with the change. The events are flowing into Kafka just fine:

e.g. https://grafana-rw.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&var-schema=QuickSurveyInitiation

Here's an example of a QuickSurveyInitiation I see now:

{
  "event": {
    "surveySessionToken": "xxxx-quicksurveys",
    "pageviewToken": "xxxxx",
    "surveyCodeName": "perceived-performance-survey",
    "eventName": "impression",
    "performanceNow": 27310
  },
  "schema": "QuickSurveyInitiation",
  "webHost": "es.m.wikipedia.org",
  "wiki": "eswiki",
  "$schema": "/analytics/legacy/quicksurveyinitiation/1.0.0",
  "client_dt": "2021-01-21T16:38:01.469Z",
  "meta": {
    "stream": "eventlogging_QuickSurveyInitiation",
    "domain": "es.m.wikipedia.org",
    "id": "xxxxxxx",
    "dt": "2021-01-21T16:38:21.486Z",
    "request_id": "xxxxxx"
  },
  "dt": "2021-01-21T16:38:01.470Z",
  "http": {
    "client_ip": "xxxxx",
    "request_headers": {
      "user-agent": "xxxxxx"
    }
  }
}
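
For reference, a minimal sketch of how a consumer might pull the identifying fields out of an event shaped like the one above. It uses only the standard library; `describe_event` and the trimmed `SAMPLE` payload are hypothetical names for illustration and are not taken from the navtiming code.

```python
import json

# Trimmed copy of the example event above (placeholder values kept as-is).
SAMPLE = """
{
  "event": {
    "surveyCodeName": "perceived-performance-survey",
    "eventName": "impression",
    "performanceNow": 27310
  },
  "schema": "QuickSurveyInitiation",
  "wiki": "eswiki",
  "$schema": "/analytics/legacy/quicksurveyinitiation/1.0.0",
  "meta": {
    "stream": "eventlogging_QuickSurveyInitiation",
    "domain": "es.m.wikipedia.org"
  }
}
"""


def describe_event(raw: str) -> dict:
    """Return the fields that identify an event's schema and stream.

    Hypothetical helper: a real consumer may look at different fields.
    """
    event = json.loads(raw)
    return {
        "schema": event["schema"],
        "schema_uri": event["$schema"],
        "stream": event["meta"]["stream"],
        "event_name": event["event"]["eventName"],
    }


if __name__ == "__main__":
    print(describe_event(SAMPLE))
```

The legacy `schema` name is still present next to the new `$schema` URI and `meta.stream`, so a consumer keyed on either one can still recognise the event.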

Looking at navtiming journalctl on webperf1001:

Jan 21 16:45:17 webperf1001 python3[31026]: kafka.errors.UnsupportedCodecError: UnsupportedCodecError: Libraries for snappy compression codec not found

I'm guessing that EventGate is producing these compressed with snappy, but eventlogging-processor is not!
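
A quick way to confirm this kind of failure on a host is to check whether the codec library can be imported at all. The sketch below assumes the codec comes from a Python module importable as `snappy` (the module name python3-snappy installs); `codec_available` is a hypothetical helper, not part of kafka-python or navtiming.

```python
import importlib.util


def codec_available(module_name: str) -> bool:
    """Return True if the named top-level module can be found on this host.

    This only checks that the module is installed; it does not import it
    or verify that compression actually works.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: the snappy bindings are importable as "snappy".
    name = "snappy"
    status = "available" if codec_available(name) else "MISSING"
    print(f"{name}: {status}")
```

If this reports the module as missing, a consumer reading snappy-compressed messages would be expected to fail with an error like the one in the journal above.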

Change 657639 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Install python3-snappy for webperf navtiming

https://gerrit.wikimedia.org/r/657639

Change 657639 merged by Ottomata:
[operations/puppet@production] Install python3-snappy for webperf navtiming

https://gerrit.wikimedia.org/r/657639

Gilles claimed this task.

It does look fixed, thanks!