Page MenuHomePhabricator

Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform
Closed, ResolvedPublic

Description

We'll use this migration to implement the transparent backwards compatibility code in EventLogging to allow existing instrumentation to continue working via POSTing to EventGate.

Details

SubjectRepoBranchLines +/-
mediawiki/extensions/WikimediaEventsmaster+1 -1
operations/puppetproduction+2 -1
operations/puppetproduction+107 -47
mediawiki/extensions/WikimediaEventsmaster+3 -117
schemas/event/secondarymaster+449 -0
operations/mediawiki-configmaster+3 -0
operations/mediawiki-configmaster+1 -6
operations/mediawiki-configmaster+10 -10
operations/mediawiki-configmaster+1 -1
operations/puppetproduction+2 -2
operations/mediawiki-configmaster+1 -1
operations/mediawiki-configmaster+42 -2
mediawiki/extensions/EventLoggingmaster+1 -1
operations/mediawiki-configmaster+22 -0
analytics/refinery/sourcemaster+559 -148
operations/mediawiki-configmaster+4 -1
operations/mediawiki-configmaster+10 -7
operations/deployment-chartsmaster+3 -3
schemas/event/secondarymaster+231 -173
eventgate-wikimediamaster+25 -3
schemas/event/secondarymaster+1 K -0
mediawiki/extensions/EventLoggingmaster+142 -36
Show related patches Customize query in gerrit

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 589074 merged by Ottomata:
[schemas/event/secondary@master] Preserve camelCase capitalization in analytics/legacy schemas

https://gerrit.wikimedia.org/r/589074

Change 592664 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics-external - Support backwards compatible eventlogging_ topic prefixing

https://gerrit.wikimedia.org/r/592664

Change 592664 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics-external - Support backwards compatible eventlogging_ topic prefixing

https://gerrit.wikimedia.org/r/592664

Change 586447 merged by jenkins-bot:
[analytics/refinery/source@master] Unify Refine transform functions and add user agent parser transform

https://gerrit.wikimedia.org/r/586447

Change 592726 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] wgEventStreams - Add SearchSatisfaction stream config and remove beta specific overrides

https://gerrit.wikimedia.org/r/592726

Change 592726 merged by Ottomata:
[operations/mediawiki-config@master] wgEventStreams - Add SearchSatisfaction stream config and remove beta specific overrides

https://gerrit.wikimedia.org/r/592726

Mentioned in SAL (#wikimedia-operations) [2020-04-27T18:02:46Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: wgEventStreams: configure SearchSatisfaction - T249261 (duration: 00m 58s)

Change 592735 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] wgEventStreams - properly prefix legacy eventlogging analytics stream names with eventlogging_

https://gerrit.wikimedia.org/r/592735

Change 592735 merged by Ottomata:
[operations/mediawiki-config@master] wgEventStreams - properly prefix legacy eventlogging analytics stream names with eventlogging_

https://gerrit.wikimedia.org/r/592735

Change 593610 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] [WIP] Add eventlogging_legacy job to refine EventLogging events from EventGate

https://gerrit.wikimedia.org/r/593610

Change 595032 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Set wgEventLoggingStreamNames with initial streams EventLogging is allowed to produce

https://gerrit.wikimedia.org/r/595032

Change 595032 merged by jenkins-bot:
[operations/mediawiki-config@master] Set wgEventLoggingStreamNames with initial streams EventLogging is allowed to produce

https://gerrit.wikimedia.org/r/595032

I think we are ready to move here. The annoying part is that our switch is defined in the extension.json as an extension attribute, not in mediawiki-configs. This means that we aren't able to do partial rollouts via MW config, we have to do it e.g. with the regular code train and/or swat deploys. I guess that's ok. The change to start emitting SearchSatisfaction via eventgate will be rolled out as part of a regular weekly train. We'll just have to watch it when the train happens to make sure all is well in prod.

Ah! Yes we can! Timo just pointed out:

https://gerrit.wikimedia.org/g/mediawiki/extensions/EventLogging/+/dbd4953dad286b9bccedb13a242a8a094beaf85f/includes/EventLoggingHooks.php#60

So we can use $wgEventLoggingSchemas to override the extension attributes for an incremental rollout. VERY COOL!

Change 595634 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Configure wgEventLoggingSchemas overrides in beta and testwiki

https://gerrit.wikimedia.org/r/595634

Change 595634 merged by Ottomata:
[operations/mediawiki-config@master] Configure wgEventLoggingSchemas overrides in beta and testwiki

https://gerrit.wikimedia.org/r/595634

Mentioned in SAL (#wikimedia-operations) [2020-05-12T14:33:44Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: EventLogging to EventGate: - Test everywhere, SearchSatisfaction on testwiki only - T249261 (duration: 01m 06s)

Change 596049 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/extensions/EventLogging@master] Use array_merge instead of array + when merging wgEventLoggingSchemas

https://gerrit.wikimedia.org/r/596049

t(ಠ益ಠt)

PHP! From https://www.php.net/manual/en/language.operators.array.php:

The + operator returns the right-hand array appended to the left-hand array; for keys that exist in both arrays, the elements from the left-hand array will be used, and the matching elements from the right-hand array will be ignored.

WHY WOULD LEFT TAKE PRECEDENCE OVER RIGHT?!

Will make patch to EL to use array_merge instead.

Change 596049 merged by Ottomata:
[mediawiki/extensions/EventLogging@master] wgEventLoggingSchemas should override extension attributes

https://gerrit.wikimedia.org/r/596049

Not sure if it's related, but I've noticed one of my data pipelines that noticed a missing partition. Particularly event.mediawiki_cirrussearch_request/datacenter=eqiad/year=2020/month=5/day=8/hour=10 is missing. I've instructed airflow to pretend this step was a success and continue the process, but if we can load that data it could be re-run.

Hm, not related to this ticket, but possibly to T252203. Looking

Change 601360 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Fix schema uri for SearchSatisfaction on testwiki for wgEventLoggingSchemas

https://gerrit.wikimedia.org/r/601360

Change 601360 merged by Ottomata:
[operations/mediawiki-config@master] Fix schema uri for SearchSatisfaction on testwiki for wgEventLoggingSchemas

https://gerrit.wikimedia.org/r/601360

Mentioned in SAL (#wikimedia-operations) [2020-06-01T16:48:28Z] <otto@deploy1001> sync-file aborted: EventLogging - fix searchsatisfaction schema URI - testwiki only - T249261 (duration: 00m 02s)

Mentioned in SAL (#wikimedia-operations) [2020-06-01T16:49:32Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: EventLogging - fix searchsatisfaction schema URI - testwiki only - T249261 (duration: 00m 59s)

test.wikipeda.org is now successfully sending SearchSatisfaction events via EventGate. I'll check to see that Refine handles them properly, and then perhaps enable this to other wikis throughout this week.

Change 601732 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] EventLogging - use EventGate on group0 wikis for SearchSatisfaction

https://gerrit.wikimedia.org/r/601732

Change 605955 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] refine.pp - bump refinery jar version and make eventlogging_analytics use event_transforms

https://gerrit.wikimedia.org/r/605955

Mentioned in SAL (#wikimedia-analytics) [2020-06-16T19:17:28Z] <ottomata> deploying refinery source 0.0.127 for eventlogging -> eventgate migration - T249261

Mentioned in SAL (#wikimedia-operations) [2020-06-16T19:18:17Z] <otto@deploy1001> Started deploy [analytics/refinery@8b8ce6e]: deploying refinery source 0.0.127 for eventlogging -> eventgate migration - T249261

Change 605955 merged by Ottomata:
[operations/puppet@production] refine.pp - bump version and make eventlogging_analytics use event_transforms

https://gerrit.wikimedia.org/r/605955

Evolved event.searchsatisfaction:

20/06/16 19:59:36 INFO DataFrameToHive: Running Hive DDL statement:
ALTER TABLE event.searchsatisfaction
ADD COLUMNS (
`_schema` STRING,
`client_dt` STRING,
`http` STRUCT<`method`: STRING, `status_code`: BIGINT, `client_ip`: STRING, `has_cookies`: BOOLEAN, `request_headers`: MAP<STRING, STRING>, `response_headers`: MAP<STRING, STRING>>,
`meta` STRUCT<`uri`: STRING, `request_id`: STRING, `id`: STRING, `dt`: STRING, `domain`: STRING, `stream`: STRING>
)

Change 606185 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Bump SearchSatisfaction schema version to 1.1.0 to pick up client_dt field

https://gerrit.wikimedia.org/r/606185

Change 606185 merged by Ottomata:
[operations/mediawiki-config@master] Bump SearchSatisfaction schema version to 1.1.0 to pick up client_dt field

https://gerrit.wikimedia.org/r/606185

Mentioned in SAL (#wikimedia-operations) [2020-06-17T13:31:20Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: EventLogging to EventGate: - SearchSatisfaction on testwiki version 1.1.0 - T249261 (duration: 00m 58s)

Change 601732 merged by Ottomata:
[operations/mediawiki-config@master] EventLogging - use EventGate on group0 wikis for SearchSatisfaction

https://gerrit.wikimedia.org/r/601732

Mentioned in SAL (#wikimedia-analytics) [2020-06-17T16:57:34Z] <ottomata> produce searchsatisfaction events on group0 wikis via eventgate - T249261

Mentioned in SAL (#wikimedia-operations) [2020-06-17T16:57:58Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: EventLogging to EventGate: - SearchSatisfaction on group0 wikis - T249261 (duration: 00m 56s)

Change 606515 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] EventLogging - use EventGate on group1 wikis for SearchSatisfaction

https://gerrit.wikimedia.org/r/606515

Change 606515 merged by Ottomata:
[operations/mediawiki-config@master] EventLogging - use EventGate on group1 wikis for SearchSatisfaction

https://gerrit.wikimedia.org/r/606515

Mentioned in SAL (#wikimedia-operations) [2020-06-18T19:44:02Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: EventLogging to EventGate: - SearchSatisfaction on group1 wikis - T249261 (duration: 00m 57s)

Change 606517 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] EventLogging - use EventGate on all wikis for SearchSatisfaction

https://gerrit.wikimedia.org/r/606517

Change 606517 merged by Ottomata:
[operations/mediawiki-config@master] EventLogging - use EventGate on all wikis for SearchSatisfaction

https://gerrit.wikimedia.org/r/606517

Mentioned in SAL (#wikimedia-operations) [2020-06-18T20:17:19Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: EventLogging to EventGate: - SearchSatisfaction on all wikis - T249261 (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2020-06-18T22:30:48Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: EventLogging to EventGate: - SearchSatisfaction on all wikis - T249261 (duration: 00m 56s)

Ah! Everything looked great on group0 today so I proceeded with group1 and then all wikis. Something didn't work though; it looks like even though the EventLogging extension picked up my config change to produce the event in the new MEP format, it did not POST the event to EventGate; instead it just sent it via GET to the varnish beacon endpoint. This caused it to fail processing by eventlogging server side! I don't know why this happened; perhaps the required EventLogging extension changes have not been deployed? We merged them many weeks ago though, so I'd be surprised if they weren't. Will investigate.

We are currently missing SearchSatisfaction events for about 2 hours today. I will backfill these from EventError tomorrow. Sorry @EBernhardson!

So I just tried to backfill from the EventError table. After I did, the count of events for the offending hours still didn't look right. I investigated more, and it seems there are two types of errors that eventlogging-processor was (and is) encountering. The first one I saw I backfilled for. This type is able to parse the URL encoded JSON string of MEP style data, but isn't able to extract the EventLogging 'scid' (schema, revision), because this doesn't make sense in the MEP data. I backfilled about 100K events for hours 20:00 to 00:00 yesterday. (Backfill method here). The counts after I did this are:

hour |  _col1
  19 | 1634895
  20 |  705602
  21 |  111156
  22 |  460784
  23 |  956821

Compared with day=17 these hours should be around 1.5 million too. So I missed something...

The second type of error is a parse error, the url encoded JSON query string could not be decoded by eventlogging-processor:

Extra data: line 1 column 624 (char 623)"

I have to figure out why these events failed parsing, fix the raw encoded event string, and then parse and backfill them somehow.

Overall though, something is wrong with the EventLogging extension changes we've made. I still need to investigate into why this happened in the first place.

The parse error comes from a very strange extra '=' at the end of the URL encoded data sent by EventLogging. I cannot reproduce and I cannot see how or why EventLogging would add this, especially due to my configuration change.

But, I found the main reason for failure yesterday. mediawiki-config has

// EventLogging will POST events to this URI.
'wgEventLoggingServiceUri' => [
	'default' => false,
	'group0' => 'https://intake-analytics.wikimedia.org/v1/events?hasty=true',
],

Which means that when I switched SearchSatisfaction to POST to EventGate on non group0 wikis, the non group0 wikis would just do nothing with those events.

I just gathered all SearchSatisfaction EventError messages for the offending timeframe, including both types of errors I saw. There are only about 60K more of the weird parse errors than the could not read scid errors. So, only 166260 available events to backfill total, which is still way off. This makes sense, as during this time period the wikis were mostly not sending the events at all. I don't know why we got any events at all with this misconfiguration, but I'm willing to chalk it up to some strange misconfiguration + ResourceLoader caching issues.

This means that I effectively disabled SearchSatisfaction events on non group0 wikis for about 2 hours yesterday. Those events are lost.

@EBernhardson I'm sorry about this, I wish I could backfill these for you. :(

Change 607308 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[schemas/event/secondary@master] Add analytics/legacy/templatewizard schema

https://gerrit.wikimedia.org/r/607308

Change 607308 merged by Ottomata:
[schemas/event/secondary@master] Add analytics/legacy/templatewizard schema

https://gerrit.wikimedia.org/r/607308

Mentioned in SAL (#wikimedia-operations) [2020-06-24T14:42:25Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Migrate SearchSatisfaction from EventLogging to EventGate on group0 - T249261 (duration: 01m 06s)

Mentioned in SAL (#wikimedia-operations) [2020-06-24T19:38:52Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Revert Migrate SearchSatisfaction from EventLogging to EventGate on group1 - T249261 (duration: 01m 06s)

Alright, Today I deployed config to migrate SearchSatisfaction to EventGate on group1 wikis. I have since reverted it. It was live on group1 from about 14:42 until 19:38 UTC.

For the most part, everything worked. However, there are two very strange issues I've run into. Marcel and I just spent a couple of hours debugging but are pretty stumped.


The first issue:

EventLogging seems to correctly get the $wgEventLoggingSchemas config change that tells EventLogging to produce the event to EventGate. It sees this config change and then formats the event for EventGate with fields like e.g. $schema and client_dt and meta.domain. However, the event is then sent as usual to the varnish beacon and makes its way into eventlogging-client-side topic in Kafka and then fails processing by eventlogging-processor. Here is an example ErrorEvent produced by eventlogging-processor:

{
  "event": {
    "code": "validation",
    "message": "scid could not be extracted from event.",
    "rawEvent": "{\"$schema\": \"/analytics/legacy/searchsatisfaction/1.1.0\", \"client_dt\": \"2020-06-24T19:19:59.543Z\", \"dt\": \"2020-06-24T19:19:59.543000Z\", \"event\": {\"action\": \"searchResultPage\", \"didYouMeanVisible\": \"no\", \"extraParams\": \"{\\\"offset\\\":0,\\\"fallback\\\":{\\\"mainResults\\\":{\\\"name\\\":\\\"__main__\\\",\\\"action\\\":null},\\\"querySuggestion\\\":null}}\", \"hitsReturned\": 26056, \"msToDisplayResults\": 7699, \"mwSessionId\": \"xxxxxx\", \"pageViewId\": \"xxxxxx\", \"query\": \"Magdeburg\", \"sampleMultiplier\": 1, \"scroll\": false, \"searchSessionId\": \"xxxxx\", \"searchToken\": \"xxxxx\", \"source\": \"fulltext\", \"uniqueId\": \"xxxxxx\"}, \"ip\": \"93.231.252.251\", \"meta\": {\"domain\": \"commons.wikimedia.org\"}, \"recvFrom\": \"cp3052.esams.wmnet\", \"schema\": \"SearchSatisfaction\", \"seqId\": 48284269, \"userAgent\": {\"browser_family\": \"Mobile Safari\", \"browser_major\": \"11\", \"browser_minor\": \"0\", \"device_family\": \"iPad\", \"is_bot\": false, \"is_mediawiki\": false, \"os_family\": \"iOS\", \"os_major\": \"11\", \"os_minor\": \"2\", \"wmf_app_version\": \"-\"}, \"uuid\": \"xxxxxxxx\", \"webHost\": \"commons.wikimedia.org\", \"wiki\": \"commonswiki\"}",
    "revision": -1,
    "schema": "unknown"
  },
  "recvFrom": "eventlog1002.eqiad.wmnet",
  "revision": 14035058,
  "schema": "EventError",
  "timestamp": 1593026400,
  "uuid": "b2471874b64f11ea96a61418775b0d42",
  "wiki": ""
}

It fails parsing because eventlogging-processor is expecting the revision field to be set, which it is not in Event Platform world.

This only happens for some events, but it does seem to happen consistently for specific browser installs out there, and much more frequently for some browsers:

presto:event> 
select json_extract(event.rawevent, '$.wiki'), json_extract(event.rawevent, '$.userAgent.browser_family'), count(*) 
from eventerror 
where event.message = 'scid could not be extracted from event.' and 
event.rawevent like '%meta%' 
and year=2020 and month=6 and day=24 
group by json_extract(event.rawevent, '$.wiki'), json_extract(event.rawevent, '$.userAgent.browser_family') 
order by count(*) desc;

      _col0      |      _col1      | _col2
-----------------+-----------------+-------
 "commonswiki"   | "IE"            |  2319
 "enwiktionary"  | "IE"            |   690
 "dewiktionary"  | "IE"            |   420
 "enwiktionary"  | "Safari"        |   347
 "mediawikiwiki" | "IE"            |   241
 "commonswiki"   | "Firefox"       |   145
 "commonswiki"   | "Safari"        |   138
 "cawiki"        | "IE"            |    91
 "frwiktionary"  | "IE"            |    90
 "enwikiquote"   | "IE"            |    72
 "jawiktionary"  | "IE"            |    56
 "hewiki"        | "IE"            |    50
 "enwikisource"  | "IE"            |    43
 "wikidatawiki"  | "Firefox"       |    43
 "dewikibooks"   | "IE"            |    41
 "dewikibooks"   | "Opera"         |    38
 "wikidatawiki"  | "IE"            |    35
 "kowikisource"  | "IE"            |    27
 "frwikibooks"   | "IE"            |    24
 "frwiktionary"  | "Safari"        |    24
 "elwiktionary"  | "IE"            |    17
 "dewikisource"  | "IE"            |    16
 "metawiki"      | "Safari"        |    14
 "viwiktionary"  | "IE"            |    14
 "kowiktionary"  | "IE"            |    12
 "frwiktionary"  | "Firefox"       |    10
 "itwiktionary"  | "IE"            |    10
 "ruwiktionary"  | "IE"            |    10
 "ruwiktionary"  | "Safari"        |    10
 "dewiktionary"  | "Firefox"       |     9
 "enwiktionary"  | "Pale Moon"     |     9
 "dewikivoyage"  | "IE"            |     8
 "enwiki"        | "Safari"        |     8
 "commonswiki"   | "IE Mobile"     |     8
 "cawiki"        | "Safari"        |     7
 "eswiktionary"  | "Safari"        |     7
 "enwikibooks"   | "Firefox"       |     7
 "test2wiki"     | "Safari"        |     7
 "mediawikiwiki" | "Safari"        |     7
 "svwiktionary"  | "IE"            |     7
 "enwikivoyage"  | "IE"            |     6
 "enwikibooks"   | "IE"            |     6
 "mediawikiwiki" | "Firefox"       |     5
 "enwiktionary"  | "Maxthon"       |     5
 "frwikisource"  | "IE"            |     4
 "frwiki"        | "Safari"        |     4
 "enwikiquote"   | "Mobile Safari" |     4
 "trwiktionary"  | "IE"            |     4
 "enwikivoyage"  | "Safari"        |     3
 "enwiktionary"  | "Mobile Safari" |     3
 "zhwikisource"  | "IE"            |     3
 "nlwiktionary"  | "IE"            |     3
 "hewiki"        | "Edge"          |     3
 "frwikinews"    | "IE"            |     2
 "specieswiki"   | "IE"            |     2
 "enwikibooks"   | "Safari"        |     2
 "dewiki"        | "Firefox"       |     2
 "cawiki"        | "Firefox"       |     2
 "arwikiversity" | "Safari"        |     2
 "mediawikiwiki" | "Chromium"      |     2
 "ruwikivoyage"  | "Firefox"       |     1
 "frwiktionary"  | "Mobile Safari" |     1
 "plwiktionary"  | "IE"            |     1
 "mediawikiwiki" | "Mobile Safari" |     1
 "plwiktionary"  | "Safari"        |     1
 "mediawikiwiki" | "Chrome"        |     1
 "dewiktionary"  | "Safari"        |     1
 "enwiktionary"  | "Firefox"       |     1
 "mlwiktionary"  | "Safari"        |     1
(69 rows)

HOW? Most of the events DO get sent to EventGate properly. I can see my own SearchSatisfaction events from my browser going through EventGate. The same config switch that tells EventLogging to format the event for EventGate should end up also telling EventLogging to POST the event to EventGate.

https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/EventLogging/+/master/modules/ext.eventLogging/core.js#77

			if ( typeof revisionOrSchemaUri === 'string' ) {
				event.$schema = revisionOrSchemaUri;

Is the case that causes EventLogging to set e.g. event.$schema, and then later:

https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/EventLogging/+/master/modules/ext.eventLogging/core.js#208

			// Assume that if $schema was set by core.prepare(), this
			// event should be POSTed to EventGate.
			if ( event.$schema ) {
				core.submit( makeLegacyStreamName( schemaName ), event );

Is it possible that this code could evaluate differently in e.g. I.E. vs. Chrome?

When we also analyzed the IPs that were associated with these errors, they were relatively consistent; meaning the same IPs encountered the same bug repeatedly. This would make sense if the bug was specific to certain browser installs.


The other confusing problem has to do with which wikis send these events causing this problem. Note that in the query output above, there are some of these errors originating from non group0 and group1 wikis, e.g. enwiki. There was a 2 hour period last week between 2020-06-18T20:17 and 2020-06-18T22:23 where this config was enabled on all wikis. It was reverted and has not since been reenabled. How is it that any wikipedia is even attempting to format the event for EventGate, when the config that causes this to happen is not deployed to any wikipedia? The actual code to do this has been live for weeks now, and there were no errors of this type until I started attempting to migrate to EventGate.

Are some browsers caching code + config differently than we'd expect?

CC @Krinkle if you find some time to help. Thank you!

[…] Here is an example ErrorEvent produced by eventlogging-processor:

{  "event": {
    "code": "validation",
    "message": "scid could not be extracted from event.",
    "rawEvent": "{\"$schema\": \"/analytics/legacy/searchsatisfaction/1.1.0\", \"client_dt\": \"2020-06-24T19:19:59.543Z\", \"dt\": \"2020-06-24T19:19:59.543000Z\", \"event\": {\"action\": \"searchResultPage\", \"didYouMeanVisible\": \"no\", \"extraParams\": \"{\\\"offset\\\":0,\\\"fallback\\\":{\\\"mainResults\\\":{\\\"name\\\":\\\"__main__\\\",\\\"action\\\":null},\\\"querySuggestion\\\":null}}\", \"hitsReturned\": 26056, \"msToDisplayResults\": 7699, \"mwSessionId\": \"xxxxxx\", \"pageViewId\": \"xxxxxx\", \"query\": \"Magdeburg\", \"sampleMultiplier\": 1, \"scroll\": false, \"searchSessionId\": \"xxxxx\", \"searchToken\": \"xxxxx\", \"source\": \"fulltext\", \"uniqueId\": \"xxxxxx\"}, \"ip\": \"93.231.252.251\", \"meta\": {\"domain\": \"commons.wikimedia.org\"}, \"recvFrom\": \"cp3052.esams.wmnet\", \"schema\": \"SearchSatisfaction\", \"seqId\": 48284269, \"userAgent\": {\"browser_family\": \"Mobile Safari\", \"browser_major\": \"11\", \"browser_minor\": \"0\", \"device_family\": \"iPad\", \"is_bot\": false, \"is_mediawiki\": false, \"os_family\": \"iOS\", \"os_major\": \"11\", \"os_minor\": \"2\", \"wmf_app_version\": \"-\"}, \"uuid\": \"xxxxxxxx\", \"webHost\": \"commons.wikimedia.org\", \"wiki\": \"commonswiki\"}",
    "revision": -1,
    "schema": "unknown" }

[…]

https://gerrit.wikimedia.org/g/mediawiki/extensions/EventLogging/+/4f69fe5a/modules/ext.eventLogging/core.js#77

Method prepare()
			if ( typeof revisionOrSchemaUri === 'string' ) {
				event.$schema = revisionOrSchemaUri;

Is the case that causes EventLogging to set e.g. event.$schema, and then later:

https://gerrit.wikimedia.org/g/mediawiki/extensions/EventLogging/+/4f69fe5a/modules/ext.eventLogging/core.js#208

Method logEvent()
			event = core.prepare( schemaName, eventData ),
			// […]

			// Assume that if $schema was set by core.prepare(), this
			// event should be POSTed to EventGate.
			if ( event.$schema ) {
				core.submit( makeLegacyStreamName( schemaName ), event );

Is it possible that this code could evaluate differently in e.g. I.E. vs. Chrome? […]

I'm first retracing the same three steps:

  • (Assumption) The entry point is mw.eventLog.logEvent.
  • This calls prepare() and then find revisionOrSchemaUri as a string, which would assign event.$schema, event.dt, event.client_dt and event.meta.domain.
    • The above sample indeed shows that these properties are all set.
  • When we pick it up again in logEvent(), the event.$schema condition would have to be met given it's a non-empty string. Which we see in the sample, but just in case it may've been lost and set by other means, I also checked your your Gerrit change 607520 and its diffConfig Jenkins job. Plus I recall that from our IRC conversation, we also checked mwscript eval.php and from load.php and confirmed it was making its way through all the layers unharmed.
  • In the event.$schema branch of logEvent() it would then have to call mw.logEvent.submit().

So far so good. I'll continue from there.

But first, I recalled you talked about POST vs GET on IRC, so I also scrolled by the sendBeacon method:
https://gerrit.wikimedia.org/g/mediawiki/extensions/EventLogging/+/4f69fe5a/modules/ext.eventLogging/core.js#173

Method sendBeacon( url )
			if ( navigator.sendBeacon ) {
				try {
					navigator.sendBeacon( url );
				} catch ( e ) {}
			} else {
				document.createElement( 'img' ).src = url;
			}

Note that where the Beacon API uses POST, the Image request will be GET in older browsers. You mentioned Internet Explorer and Safari as standing out; and IE and Safari were both late to the Beacon API party, so this might be relevant somehow… Or not :)

Anyway, let's continue from the steps above. We were at submit():

https://gerrit.wikimedia.org/g/mediawiki/extensions/EventLogging/+/4f69fe5a/modules/ext.eventLogging/core.js#325

Method submit()
		// […]
		eventData.meta = eventData.meta || {};
		eventData.meta.stream = streamName;

		// […]
		eventData.client_dt = eventData.client_dt || new Date().toISOString();

		// […]
		if ( config.serviceUri ) {
			core.enqueue( function () {
				navigator.sendBeacon(
					config.serviceUri,
					JSON.stringify( eventData )
				);
			} );
		}

The first thing that stood out to me here is that the EventGate submit code unconditionally calls navigator.sendBeacon – it does not mw.eventLog.sendBeacon that I reviewd a few lines up. I couldn't find any earlier conditional, bail out, or fallback anywhere. So, unless proven otherwise, I assume this causes an uncaught exception in older browsers. The good news is that it is called from inside core.enqueue which means the callstack that it blows up is an async call stack originating from setTimeout dedicated to EventLogging-purposes. It means it has litlte to no impact on other code. Except within the BackgroundQueue for other enqueued callbacks by other instrumetnations, because each is called from within a while loop in in BackgroundQueue. Iif one of those throws, the rest of the queue does not execute. If the execution was triggered from a "last chance" event like tab closing this would lose some events, in other cases they would get a second chance during the next next 30s interval, or from the actual "last chance" event.

The second thing I noticed is that this method further mutates the eventData object. It sets eventData.meta.stream. However, this is not reflected in the sample payload that you captured. Hence, it seems most likely the submit() function was in fact never called, and therefore that my starting assumption about the entry poihnt may be wrong.

So, in trying to determine that, I code searched for "SearchSatisfaction":

ext.wikimediaEvents/searchSatisfaction.js
mw.loader.using( [ 'ext.eventLogging' ] ).then( function () {
	eventLog = eventLog || extendMwEventLog();
	eventLog.logEvent( 'SearchSatisfaction', evt );
} );
/* … */
mw.loader.using( [ 'ext.eventLogging' ] ).then( function () {
	eventLog = eventLog || extendMwEventLog();
	eventLog.logEvent( 'SearchSatisfactionErrors', evt );
} );

This code seems to predate T205744/T187207 and still uses a no-op using() callback to ensure the ext.eventLogging module, which is already required by the bundle that ships searchSatisfaction.js. This is fairly harmless apart from a confusing the reader, but would be good to clean up :)

The novel eventLog = eventLog || extendMwEventLog(); statement also stands out as suspicious. I've not seen EL used like this before. It suggests to me that a developer may have (ab)using a system without communicating to its maintainers (you) that they need help in understanding how it works, how it should be used for their use case, or that they want additional features from it.

And indeed, that appears to be the case. extendMwEventLog() returns a a copy of the mw.eventLog object with two unsupported modifications:

  1. The sendBeacon method is shadowed by a local version that never uses navigator.sendBeacon, but always uses an HTML image (that seems like a regression), and adds some custom logic to it that tracks whether the Image request succeeded/failed, and until such time, keeps track of the beacon URL in a local object. It then uses window.onbeforeunload to persist any unfinished beacons to localStorage. These are then attempted to be replayed during subsequent page loads on the same wiki. The code in question notes that this "adds latency cost" and "adds chance of duplicate events" and trades it for an "increased deliverability guarantee".
  2. The logEvent method is overriden as well but is no different than the original was (keyword being "was'"). I believe it was copied here because its author found that mw.eventLog is protected against this kind of tempering on the secondBeacon method. EventLogging always uses its local reference to the method, so the only way to get logEvent to use it a different one is to copy the method wholesale, including a call to the @private method mw.eventLog.prepare().

So then, the actual entry point to our story is this custom method, and not mw.eventLog.logEvent.

I said the custom "was" identical to the original. It no longer is, since EventLogging has since been updated to support EventGate. This custom/stale version uses the EventLogging server URL always.

This leaves us with one question still – Why does the problem only happen for a subset of events? Or phrased differently - What is making this work in the first place for what appears to be a majority of cases?

This took me a while to figure out. I was not able to reproduce it in Firefox, Safari, or Chrome. I suspected for a moment that the lazy-init nature of extendMwEventLog() may be causing a race condition that sometimes makes it unable to cause the above trouble, e.g. due to it somehow being refused or flipping back before it is called on - and that this race condition would somehow be "lucky" in Chrome.

After reading through the lazy-init code a few times, I finally noticed that despite being named "extend mw log", it does in fact not always extend. There is an early return I glanced over:

	function extendMwEventLog() {
		// …
		if ( navigator.sendBeacon ||  ) {
			return mw.eventLog;

So yeah, while this isn't strictly a bug with the sendBeacon API or our fallback for it, it did turn out to be triggered only in conditions where a fallback would normally reside. The whole chain reaction above only happens for browsers where sendBeacon is not natively supported.

Note that the "correct" path requires sendBeacon and POST for EventGate and is indeed used in all browsers where it can be used. This means the errors we're seeing are only happening for events that would otherwise be lost. Instead of throwing an uncaught error (as I speculate above) , the bug in SearchSatisfaction is accidentally preventing that and instead funneling those cases to legacy EventLogging.

THIS IS SO SO GREAT! Thank you so much Timo! I'll talk to @EBernhardson and get the offending code removed or fixed up from WikimediaEvents.

Change 608318 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/extensions/WikimediaEvents@master] searchSatisfaction.js - remove extendMwLogEvent

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/ /608318

Change 608318 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] searchSatisfaction.js - remove extendMwEventLog

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/ /608318

Change 593610 merged by Ottomata:
[operations/puppet@production] Add eventlogging_legacy Refine job for events migrated to EventGate

https://gerrit.wikimedia.org/r/c/operations/puppet/ /593610

Mentioned in SAL (#wikimedia-operations) [2020-07-06T16:09:44Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Migrate SearchSatisfaction from EventLogging to EventGate on group1 - T249261 (duration: 00m 58s)

Mentioned in SAL (#wikimedia-operations) [2020-07-06T17:50:29Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Migrate SearchSatisfaction from EventLogging to EventGate on all wikis - T249261 (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2020-07-06T17:54:04Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Migrate SearchSatisfaction from EventLogging to EventGate on all wikis - T249261 - take 2 (duration: 00m 56s)

Change 609845 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/extensions/WikimediaEvents@master] Set SearchSastisfaction schema to Event Platform schema URI

https://gerrit.wikimedia.org/r/609845

Change 610055 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Refine SearchSatisfaction using new eventlogging_legacy job

https://gerrit.wikimedia.org/r/610055

Change 610055 merged by Ottomata:
[operations/puppet@production] Refine SearchSatisfaction using new eventlogging_legacy job

https://gerrit.wikimedia.org/r/610055

Change 609845 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Set SearchSatisfaction schema to Event Platform schema URI

https://gerrit.wikimedia.org/r/609845

Last week we fully migrated SearchSatisfaction events to eventgate-analytics-external. However, there seem to be still a few hundred events per hour (out of about 1.5 million per hour) that are sent using the legacy /beacon/event URI. For the most part, these events are being refined properly. However, we only noticed this because for one hour, one of these events was the first event in the hour. We currently use the first event in an hour to look up the $schema URI for that hour, and in this case the $schema field was NULL, as this was a legacy formatted and sent event, and the refine job failed for that hour. For the rest of them, the data is being refined properly, but is missing the new event platform fields.

I've looked at some of the data, and I don't see any of the same user agent trends we saw before.

scala> d0.where("`$schema` IS NULL").groupBy("userAgent.browser_family", "userAgent.browser_major").count().show(false)
+--------------+-------------+-----+
|browser_family|browser_major|count|
+--------------+-------------+-----+
|Firefox       |66           |6    |
|Firefox       |52           |8    |
|Firefox       |58           |4    |
|Safari        |13           |232  |
|IE            |11           |20   |
|Firefox       |76           |6    |
|Edge          |81           |7    |
|SeaMonkey     |2            |1    |
|Chrome        |49           |1    |
|Firefox       |78           |72   |
|Firefox       |74           |1    |
|Edge          |17           |7    |
|Chrome        |81           |2    |
|Chrome        |69           |8    |
|Safari        |11           |4    |
|Safari        |12           |5    |
|Chrome        |83           |326  |
|Chrome        |40           |1    |
|Edge          |83           |26   |
|Firefox       |77           |77   |
+--------------+-------------+-----+
only showing top 20 rows

This is different than the problem Timo found in https://phabricator.wikimedia.org/T249261#6261156. In this case, the events are being formatted in the legacy way properly and are being sent to the legacy GET /beacon/event endpoint. They are making their way through eventlogging-processor just fine.

I'm not sure how this would be, but the fact that the rate of these types of events is only a few hundred per hour makes me suspect this is a caching issue? We only have a few days back to look at, but it does seem the # of these events per day is slowly decreasing:

presto:event> select day, count(*) from searchsatisfaction where seqId IS NOT NULL and year=2020 and month=7 and day IN (8,9,10,11,12) group by day order by day;
 day | _col1
-----+-------
   8 | 55563
   9 | 29289
  10 | 17149
  11 | 13692
  12 | 10236
(5 rows)

I'd guess this is just browsers slowly forgetting the old code.

I'll force the one offending hour to Refine. T251609: Automate ingestion and refinement into Hive of event data from Kafka using stream configs and canary/heartbeat events should help with this kind of issue in the future (where the first event in an hour has schema loading problems that cause the whole hour to fail.)

The backfill of the offending SearchSatisfaction hour (/wmf/data/event/SearchSatisfaction/year=2020/month=7/day=10/hour=16) required some manual steps in spark scala. There is no way in Refine right now to override the schema loader for a specific hour, so I had to write a little code on the Spark REPL to load the schema explicitly.

// sudo -u analytics /usr/bin/spark2-shell --name refine_searchsatisfaction_backfill_otto0 --files /etc/hive/conf/hive-site.xml,/srv/deployment/analytics/refinery/artifacts/hive-jdbc-1.1.0-cdh5.10.0.jar,/srv/deployment/analytics/refinery/artifacts/hive-service-1.1.0-cdh5.10.0.jar --master yarn --queue production --driver-memory 8G --executor-memory 4G --conf spark.driver.extraClassPath=/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-common.jar:hive-jdbc-1.1.0-cdh5.10.0.jar:hive-service-1.1.0-cdh5.10.0.jar --conf spark.dynamicAllocation.maxExecutors=64  --principal analytics/an-launcher1002.eqiad.wmnet@WIKIMEDIA --keytab /etc/security/keytabs/analytics/analytics.keytab --jars /home/otto/refinery-source/refinery-job/target/refinery-job-0.0.130-SNAPSHOT.jar

import org.wikimedia.analytics.refinery.job.refine._
import org.wikimedia.analytics.refinery.spark.connectors.DataFrameToHive
import org.wikimedia.analytics.refinery.spark.connectors.DataFrameToHive.TransformFunction


 val target = RefineTarget(
    spark, 
    "/wmf/data/raw/eventlogging/eventlogging_SearchSatisfaction/hourly/2020/07/10/16",
    "/wmf/data/event/SearchSatisfaction/year=2020/month=7/day=10/hour=16",
    ExplicitSchemaLoader(Some(spark.table("event.searchsatisfaction").schema))
)

val transformFunctions: Seq[TransformFunction] = Seq(event_transforms.apply, filter_allowed_domains.apply)

val insertedDf = DataFrameToHive(
    spark,
    target.inputPartitionedDataFrame(),
    () => target.writeDoneFlag(),
    transformFunctions
)

val recordCount = insertedDf.df.count
println(s"Finished refinement of dataset $target. " +
                    s"(# refined records: $recordCount)")

// Finished refinement of dataset /wmf/data/raw/eventlogging/eventlogging_SearchSatisfaction/hourly/2020/07/10/16 -> `event`.`SearchSatisfaction` (year=2020,month=7,day=10,hour=16). (# refined records: 1433179)

If all events actually came through EventGate, this wouldn't ever happen, but as long as any are still coming through via eventlogging-processor, it could happen again. We could excludelist the migrated schemas somehow from the eventlogging client-side processor configuration. We've done this in the past when we used to use the eventlogging-valid-mixed topic to feed data to the eventlogging mysql consumer. This will be a bit of annoying and brittle configuration to maintain during this migration though.

Alternatively, if we do T251609: Automate ingestion and refinement into Hive of event data from Kafka using stream configs and canary/heartbeat events and are able to load schemas using stream config rather than examining a record for a given hour, we will also avoid this problem, as dataset -> schema mapping will be explicit.