Page MenuHomePhabricator

NavigationTiming Extension schemas Event Platform Migration
Closed, ResolvedPublic

Description

See: https://wikitech.wikimedia.org/wiki/Event_Platform/EventLogging_legacy

Unless otherwise notified, client IP and consequently geocoded data will no longer be collected for this event data after this migration. Please let us know if this should continue to be captured. See also T262626.

Schemas

Status
2021-01-26:

  • have been created in secondary repo (merged)
  • wiki pages have been protected
  • Hive table have been evolved
  • Migrated on testwiki (untested)

2021-01-27:

  • Migrated on group0 and group1 wikis

2021-01-28

  • Migrated on all wikis

2021-02-01

  • disabled eventlogging-processor backend for these, using new Refine job.

2021-02-23

  • Removed mediawiki-config overrides. DONE!

Details

ProjectBranchLines +/-Subject
operations/mediawiki-configmaster+0 -15
operations/puppetproduction+26 -0
performance/navtimingmaster+5 -1
performance/navtimingmaster+2 -0
performance/navtimingmaster+6 -0
performance/coalmaster+5 -0
performance/navtimingmaster+2 -9
performance/navtimingmaster+6 -0
mediawiki/extensions/NavigationTimingmaster+11 -11
operations/mediawiki-configmaster+9 -24
performance/navtimingmaster+5 -5
operations/mediawiki-configmaster+66 -0
operations/mediawiki-configmaster+16 -0
operations/mediawiki-configmaster+56 -0
schemas/event/secondarymaster+2 K -0
schemas/event/secondarymaster+423 -0
schemas/event/secondarymaster+6 -4
schemas/event/secondarymaster+9 -4
schemas/event/secondarymaster+410 -0
schemas/event/secondarymaster+395 -0
schemas/event/secondarymaster+500 -0
schemas/event/secondarymaster+437 -0
schemas/event/secondarymaster+362 -0
Show related patches Customize query in gerrit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 658430 had a related patch set uploaded (by Mforns; owner: Mforns):
[schemas/event/secondary@master] Add PaintTiming to analytics/legacy

https://gerrit.wikimedia.org/r/658430

Change 658430 merged by Mforns:
[schemas/event/secondary@master] Add PaintTiming to analytics/legacy

https://gerrit.wikimedia.org/r/658430

Change 658431 had a related patch set uploaded (by Mforns; owner: Mforns):
[schemas/event/secondary@master] Add ElementTiming to analytics/legacy

https://gerrit.wikimedia.org/r/658431

Change 658431 merged by Mforns:
[schemas/event/secondary@master] Add ElementTiming to analytics/legacy

https://gerrit.wikimedia.org/r/658431

Change 658434 had a related patch set uploaded (by Mforns; owner: Mforns):
[schemas/event/secondary@master] Add LayoutShift to analytics/legacy

https://gerrit.wikimedia.org/r/658434

Change 658434 merged by Mforns:
[schemas/event/secondary@master] Add LayoutShift to analytics/legacy

https://gerrit.wikimedia.org/r/658434

Change 658435 had a related patch set uploaded (by Mforns; owner: Mforns):
[schemas/event/secondary@master] Add FeaturePolicyViolation to analytics/legacy

https://gerrit.wikimedia.org/r/658435

Change 658435 merged by Mforns:
[schemas/event/secondary@master] Add FeaturePolicyViolation to analytics/legacy

https://gerrit.wikimedia.org/r/658435

Change 658438 had a related patch set uploaded (by Mforns; owner: Mforns):
[schemas/event/secondary@master] Fix examples of analytics/legacy/elementtiming

https://gerrit.wikimedia.org/r/658438

Change 658438 merged by Mforns:
[schemas/event/secondary@master] Fix examples of analytics/legacy/elementtiming

https://gerrit.wikimedia.org/r/658438

Change 658439 had a related patch set uploaded (by Mforns; owner: Mforns):
[schemas/event/secondary@master] Correct examples of analytics/legacy/LayoutShift

https://gerrit.wikimedia.org/r/658439

Change 658439 merged by Mforns:
[schemas/event/secondary@master] Correct examples of analytics/legacy/LayoutShift

https://gerrit.wikimedia.org/r/658439

Change 658442 had a related patch set uploaded (by Mforns; owner: Mforns):
[schemas/event/secondary@master] Add FirstInputTiming to analytics/legacy

https://gerrit.wikimedia.org/r/658442

Change 658442 merged by Mforns:
[schemas/event/secondary@master] Add FirstInputTiming to analytics/legacy

https://gerrit.wikimedia.org/r/658442

@Gilles, for all 'analytics' event streams, we've been automating some monitoring and ingestion by injecting artificial canary events into the streams a few times an hour. For Hive, we then can differentiate between hours that have no data, and hours for which the ingestion pipeline is broken.

We haven't enabled canary events for production event streams yet, as the consumers of the data are varied and would need to know the canary events are coming and to filter them out.

What should we do with these? Canary events can be identified by where meta.domain === 'canary'. We'd like to enable canary events everywhere we can, but if this will break your navtiming consumers, we can avoid doing this for now.

Change 658668 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[schemas/event/secondary@master] Migrate 5 NavigationTiming extension schemas from metawiki

https://gerrit.wikimedia.org/r/658668

Change 658668 merged by Ottomata:
[schemas/event/secondary@master] Migrate 5 NavigationTiming extension schemas from metawiki

https://gerrit.wikimedia.org/r/658668

Change 658673 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Declare 5 NavigationTiming eventlogging streams and migrate on testwiki

https://gerrit.wikimedia.org/r/658673

Change 658673 merged by Ottomata:
[operations/mediawiki-config@master] Declare 5 NavigationTiming eventlogging streams and migrate on testwiki

https://gerrit.wikimedia.org/r/658673

Mentioned in SAL (#wikimedia-operations) [2021-01-26T20:53:33Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Migrate 5 NavigationTiming schemas to Event Platform on testwiki - T271208 (duration: 01m 17s)

Change 659010 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Migrat 5 NavigationTiming eventlogging streams on group0 and group1 wikis

https://gerrit.wikimedia.org/r/659010

Change 659010 merged by Ottomata:
[operations/mediawiki-config@master] Migrate 5 NavigationTiming eventlogging streams on group0 and group1 wikis

https://gerrit.wikimedia.org/r/659010

Mentioned in SAL (#wikimedia-operations) [2021-01-27T15:31:42Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Migrate 5 NavigationTiming schemas to Event Platform on group0 and group1 - T271208 (duration: 01m 07s)

Change 659022 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/mediawiki-config@master] Declare 6 more NavigationTiming eventlogging streams and migrate on testwiki

https://gerrit.wikimedia.org/r/659022

Change 659022 merged by jenkins-bot:
[operations/mediawiki-config@master] Declare 6 more NavigationTiming eventlogging streams and migrate on testwiki

https://gerrit.wikimedia.org/r/659022

Mentioned in SAL (#wikimedia-operations) [2021-01-27T19:17:00Z] <urbanecm@deploy1001> Synchronized wmf-config/InitialiseSettings.php: cabb2e2009f97bb86c1b8827c3cc61cc991c41a9: Declare 6 more NavigationTiming eventlogging streams and migrate on testwiki (T271208) (duration: 01m 00s)

The migration of the 5 schemas to group0 and group1 yesterday caused a huge drop in report rate, something's not working.

NavigationTiming schema for group 0:

Group 2 is also affected for NavigationTiming, but doesn't drop to no data:

SaveTiming is also affected (group 1 shown here):

PaintTiming schema is also partially affected:

Checking if the data is in Hive for all of these, comparing the amount of rows between 2021-01-26 and 2021-01-27:

SELECT day, COUNT(*) FROM event.NavigationTiming WHERE year = 2021 AND month = 1 AND day IN (26,27) GROUP BY day;
Schema2021-01-262021-01-27Difference
NavigationTiming15511431509344- 2.7%
ResourceTiming10634921035328- 2.7%
SaveTiming277036268259-3.2%
CentralNoticeTiming109269802-11.3%
CpuBenchmark1006498987759-1.9%

The differences seen aren't consistent with the drop seen in the graphs and likely due to the overall traffic difference between those 2 days. The different for CentralNotice may also depend on ongoing notice campaigns.

This suggests that the issue might be in the pipeline that turns these events into statsd metrics.

This comment was removed by Gilles.

Seems caused by the navtiming agent expecting a userAgent key:

Jan 28 11:32:19 webperf1001 python3[11796]: Traceback (most recent call last):
Jan 28 11:32:19 webperf1001 python3[11796]:   File "/srv/deployment/performance/navtiming-cache/revs/ba6cd0d970d18aa71d3745eaa49fa09dea7dbe21/navtiming/__init__.py", line 831, in run
Jan 28 11:32:19 webperf1001 python3[11796]:     for stat in f(meta):
Jan 28 11:32:19 webperf1001 python3[11796]:   File "/srv/deployment/performance/navtiming-cache/revs/ba6cd0d970d18aa71d3745eaa49fa09dea7dbe21/navtiming/__init__.py", line 650, in handle_navigation_timing
Jan 28 11:32:19 webperf1001 python3[11796]:     if not self.is_compliant(event, meta['userAgent']):
Jan 28 11:32:19 webperf1001 python3[11796]: KeyError: 'userAgent'

I must have missed a possible execution path when I prepared navtiming's code for this change.

Change 659240 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/navtiming@master] Check NavigationTiming compliance after normalizing UA

https://gerrit.wikimedia.org/r/659240

Change 659240 merged by jenkins-bot:
[performance/navtiming@master] Check NavigationTiming compliance after normalizing UA

https://gerrit.wikimedia.org/r/659240

Seems like the fix worked as expected:

Oh no, sorry Gilles! I should have tested with you all a little more closely with events from testwiki.

@Gilles, interesting, would it be possible to add checking if meta.domain === 'canary' and considering such an event 'non compliant', or discarding it altogether? See my comment here: https://phabricator.wikimedia.org/T271208#6778241

@Gilles, FYI, since things seem to be ok now, we will proceed with the migration to all wikis.

Change 659264 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Migrate NavigationTiming schemas to Event Platform on all wikis

https://gerrit.wikimedia.org/r/659264

Change 659288 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Finalize NavigationTiming extension Event Platform migration

https://gerrit.wikimedia.org/r/659288

Change 659264 merged by Ottomata:
[operations/mediawiki-config@master] Migrate NavigationTiming schemas to Event Platform on all wikis

https://gerrit.wikimedia.org/r/659264

Mentioned in SAL (#wikimedia-operations) [2021-01-28T15:25:26Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Migrate NavigationTiming schemas to Event Platform on all wikis - T271208 (duration: 01m 11s)

Change 659308 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/extensions/NavigationTiming@master] Finalize Event Platform migration

https://gerrit.wikimedia.org/r/659308

@Gilles, interesting, would it be possible to add checking if meta.domain === 'canary' and considering such an event 'non compliant', or discarding it altogether? See my comment here: https://phabricator.wikimedia.org/T271208#6778241

Sure, I'll add that.

Change 659737 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/navtiming@master] Ignore canary events

https://gerrit.wikimedia.org/r/659737

Change 659308 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] Finalize Event Platform migration

https://gerrit.wikimedia.org/r/659308

It seems like this latest deployment of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/659264/ caused another partial drop of report rate:

This time, I'm noticing a much less frequent error coming from the navtiming daemon in regards to its processing of events from the QuickSurveyInitiation schema:

Jan 29 10:30:04 webperf1001 python3[28492]:   File "/srv/deployment/performance/navtiming-cache/revs/446e5dfdca4ec41d741f72b05bce2967783e4fcc/navtiming/__init__.py", line 831, in run
Jan 29 10:30:04 webperf1001 python3[28492]:     for stat in f(meta):
Jan 29 10:30:04 webperf1001 python3[28492]:   File "/srv/deployment/performance/navtiming-cache/revs/446e5dfdca4ec41d741f72b05bce2967783e4fcc/navtiming/__init__.py", line 489, in handle_quick_survey_initiation
Jan 29 10:30:04 webperf1001 python3[28492]:     wiki = meta['wiki']
Jan 29 10:30:04 webperf1001 python3[28492]: KeyError: 'wiki'

Is it expected that the QuickSurveyInitiation schema sometimes lacks a "wiki" field? It seems to be present in the schema definition:

https://github.com/wikimedia/schemas-event-secondary/blob/master/jsonschema/analytics/legacy/quicksurveyinitiation/1.0.0.json#L118-L121

Is it one of those canary events? I notice that the error happens every 15 minutes like clockwork:

Jan 29 10:00:05 webperf1001 python3[28492]: KeyError: 'wiki'
Jan 29 10:15:03 webperf1001 python3[28492]: KeyError: 'wiki'
Jan 29 10:30:04 webperf1001 python3[28492]: KeyError: 'wiki'
Jan 29 10:45:04 webperf1001 python3[28492]: KeyError: 'wiki'

Change 659737 merged by jenkins-bot:
[performance/navtiming@master] Ignore canary events

https://gerrit.wikimedia.org/r/659737

Mentioned in SAL (#wikimedia-operations) [2021-01-29T11:06:39Z] <gilles@deploy1001> Started deploy [performance/navtiming@125f6be]: T271208 Ignore canary events

Mentioned in SAL (#wikimedia-operations) [2021-01-29T11:06:44Z] <gilles@deploy1001> Finished deploy [performance/navtiming@125f6be]: T271208 Ignore canary events (duration: 00m 05s)

Ignoring canary events didn't stop the error 🙁

Change 659804 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/navtiming@master] Log instead of hard error on missing “wiki” field

https://gerrit.wikimedia.org/r/659804

Change 659804 merged by jenkins-bot:
[performance/navtiming@master] Log instead of hard error on missing “wiki” field

https://gerrit.wikimedia.org/r/659804

Mentioned in SAL (#wikimedia-operations) [2021-01-29T11:26:33Z] <gilles@deploy1001> Started deploy [performance/navtiming@e7712c3]: T271208 Log instead of hard error on missing wiki field

Mentioned in SAL (#wikimedia-operations) [2021-01-29T11:26:39Z] <gilles@deploy1001> Finished deploy [performance/navtiming@e7712c3]: T271208 Log instead of hard error on missing wiki field (duration: 00m 06s)

It is a canary message, I was just checking for those incorrectly:

Invalid QuickSurveyInitiation event, missing wiki field: {'http': {'client_ip': '127.0.0.1', 'request_headers': {'user-agent': 'Apache-HttpClient/4.5.12 (Java/1.8.0_272)'}}, 'client_dt': '2020-04-02T19:11:20.942Z', 'event': {'surveyCodeName': 'perceived-performance-survey'}, 'meta': {'domain': 'canary', 'request_id': 'cecead0e-b801-4cc8-a7e5-64c78fd68993', 'dt': '2021-01-29T11:30:05.026Z', 'id': 'b0caf18d-6c7f-4403-947d-2712bbe28610', 'stream': 'eventlogging_QuickSurveyInitiation'}, 'dt': '2020-04-02T19:11:20.942Z', 'schema': 'QuickSurveyInitiation', '$schema': '/analytics/legacy/quicksurveyinitiation/1.0.0'}

Fix incoming.

Change 659811 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/navtiming@master] Fix canary event check

https://gerrit.wikimedia.org/r/659811

Change 659811 merged by jenkins-bot:
[performance/navtiming@master] Fix canary event check

https://gerrit.wikimedia.org/r/659811

Mentioned in SAL (#wikimedia-operations) [2021-01-29T11:39:03Z] <gilles@deploy1001> Started deploy [performance/navtiming@ae8310a]: T271208 Fix canary event check

Mentioned in SAL (#wikimedia-operations) [2021-01-29T11:39:08Z] <gilles@deploy1001> Finished deploy [performance/navtiming@ae8310a]: T271208 Fix canary event check (duration: 00m 05s)

Change 659821 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/coal@master] Filter out canary events

https://gerrit.wikimedia.org/r/659821

Change 659821 merged by jenkins-bot:
[performance/coal@master] Filter out canary events

https://gerrit.wikimedia.org/r/659821

Mentioned in SAL (#wikimedia-operations) [2021-01-29T12:00:32Z] <gilles@deploy1001> Started deploy [performance/coal@b0d3b59]: T271208 Filter out canary events

Mentioned in SAL (#wikimedia-operations) [2021-01-29T12:00:39Z] <gilles@deploy1001> Finished deploy [performance/coal@b0d3b59]: T271208 Filter out canary events (duration: 00m 06s)

Canary events are now correctly filtered out. I'm a little skeptical that this will let us recover the missing 15% of report rate, when this error was only occurring once every 15 minutes. I have no idea where this drop caused by https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/659264/ might be coming from.

@Ottomata are events from the old pipeline still supposed to make it through to Kafka at this point for all these schemas? I think that a number of clients with cached old JS are going to keep sending events to the old endpoints for some time.

are events from the old pipeline still supposed to make it through to Kafka at this point for all these schemas?

Until we merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/659288, yes, they should still make it through ok.

During these migrations, I usually merge that change a day or two after the full migration to all wikis. But, since NavTiming is so wide spread, I was planning to wait until Monday. I'll do some analysis before I do to see how many of these events are still coming through the old pipeline.

Is it expected that the QuickSurveyInitiation schema sometimes lacks a "wiki" field? It seems to be present in the schema definition

Unless a field is required in a schema, code should probably check if the field exists before accessing it.

@Gilles as far as I can tell just from message throughput of the few of the ext:NavigationTiming schemas, things look just about normal week over week.

etc.

What is loadEventEnd?

loadEventEnd is just one of the fields recorded by the NavigationTiming schema. I notice that only the NavigationTiming schema is affected by the 15% drop, not PaintTiming, for example. On the dashboard you've linked to, I see that the amount of non-compliant events has dropped significantly.

I think this might be related to the fact that I swapped the order between the compliance check and the "get_navigation_timing_context" call that normalizes the UA string. This suggests that the issue lies in the navtiming daemon. I'll add more debugging calls to look into it.

Change 660762 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/navtiming@master] Add more debug logging

https://gerrit.wikimedia.org/r/660762

Change 660762 merged by jenkins-bot:
[performance/navtiming@master] Add more debug logging

https://gerrit.wikimedia.org/r/660762

Mentioned in SAL (#wikimedia-operations) [2021-02-01T08:53:44Z] <gilles@deploy1001> Started deploy [performance/navtiming@1e02d76]: T271208 Add more debug logging

Mentioned in SAL (#wikimedia-operations) [2021-02-01T08:53:49Z] <gilles@deploy1001> Finished deploy [performance/navtiming@1e02d76]: T271208 Add more debug logging (duration: 00m 05s)

Change 660765 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/navtiming@master] browser_minor is needed for Mobile Safari allowlist

https://gerrit.wikimedia.org/r/660765

Change 660765 merged by jenkins-bot:
[performance/navtiming@master] browser_minor is needed for Mobile Safari allowlist

https://gerrit.wikimedia.org/r/660765

Mentioned in SAL (#wikimedia-operations) [2021-02-01T09:04:20Z] <gilles@deploy1001> Started deploy [performance/navtiming@3215510]: T271208 browser_minor is needed for Mobile Safari allowlist

Mentioned in SAL (#wikimedia-operations) [2021-02-01T09:04:25Z] <gilles@deploy1001> Finished deploy [performance/navtiming@3215510]: T271208 browser_minor is needed for Mobile Safari allowlist (duration: 00m 05s)

Change 659288 merged by Ottomata:
[operations/puppet@production] Finalize NavigationTiming extension Event Platform migration

https://gerrit.wikimedia.org/r/659288

FYI I just restricted the legacy eventlogging backend from accepting any of these events. For NavigationTiming on January 31, it looks like 638 out of 1461032 valid events received on that day went through the legacy eventlogging backend, so 0.04%.

Ottomata updated the task description. (Show Details)

Change 666372 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Remove wgEventLoggingSchemas overrides for QuickSurvey and NavigationTiming

https://gerrit.wikimedia.org/r/666372

Change 666372 merged by Ottomata:
[operations/mediawiki-config@master] Remove wgEventLoggingSchemas overrides for QuickSurvey and NavigationTiming

https://gerrit.wikimedia.org/r/666372