Page MenuHomePhabricator

eventgate_validation_error for NewcomerTask, HomepageTask, and HomepageVisit schemas
Closed, ResolvedPublic

Description

eventgate-validation logstash dashboard reports validation errors in the following schemas:

NewcomerTask schema: '.event.pageviews' should be integer - total count 109

HomepageVisit schema: seems to be be on testwiki wmf.29only: '.event.is_mobile' should be boolean, '.event.referer_namespace' should be integer, '.event.user_editcount' should be integer, '.event.user_id' should be integer - total count 18

HomepageModule schema:
'.event' should have required property 'module', '.event' should have required property 'mode' - total count 5
'.event.state' should be equal to one of the allowed values - total count 1 (already filed as T273084: HomepageModule events with validation errors).

Event Timeline

Couple of questions

  • Do such errors result in the events not being recorded (the same question @MMiller_WMF asked)? I checked, for example, .event.state is not required.
  • Why I did not see the validation errors when checking the Console with mw.trackSubscribe( 'event.HomepageModule', console.log); ?

Couple of questions

  • Do such errors result in the events not being recorded (the same question @MMiller_WMF asked)? I checked, for example, .event.state is not required.

I assume that if validation errors occur, this means that the event is not persisted in Kafka and so is not eventually saved somewhere. @Ottomata could you please confirm that? I scanned https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate and that was my understanding of what the validation error means.

  • Why I did not see the validation errors when checking the Console with mw.trackSubscribe( 'event.HomepageModule', console.log); ?

I think that's just because mw.trackSubscribe() shows you what we sent to the event ingestion service, it doesn't tell you anything about whether the service accepted or rejected that data.

In the past I used kafkacat (https://wikitech.wikimedia.org/wiki/Kafka#kafkacat) on the analytics machines to verify event logging fixes when we were backporting changes or toggling feature flags, that might be a more reliable tool to QA analytics event ingestion, although I don't know if this shows you the event stream before validation or after validation.

I don't understand why this error is occurring (logstash link). It's saying that "banner" is not a valid module name, but that was added in schema 1.1.0 which was deployed on January 20 (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/656284), the error is from today.

Change 661422 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/extensions/GrowthExperiments@master] EventGate: Fallback to integer value if pageviews is null

https://gerrit.wikimedia.org/r/661422

I assume that if validation errors occur, this means that the event is not persisted in Kafka and so is not eventually saved somewhere.

https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate#Validation_Errors

They are not persisted to the expected downstream table, but are emitted into a validation error stream that gets persisted to the event.eventgate_analytics_external_error_validation table. Your events should be ones with errored_stream_name = 'eventlogging_HomepageModule', and the original event JSON string can be found in the raw_event field.

I think that's just because mw.trackSubscribe() shows you what we sent to the event ingestion service, it doesn't tell you anything about whether the service accepted or rejected that data.

Yup! The EventLogging extension doesn't do any schema validation, it just submits the events. If you want to dev schema validation, see: https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate#Local_development

In the past I used kafkacat (https://wikitech.wikimedia.org/wiki/Kafka#kafkacat) on the analytics machines to verify event logging
although I don't know if this shows you the event stream before validation or after validation.

It shows you the data after validation. You can use the same tool to consume the validation error stream topics: (eqiad|codfw).eventgate-analytics-external.error.validation and see live validation errors. These are the same error messages that show up this logstash dashboard.

BTW, we are working on an internal instance of eventstreams that will have all streams available. This means you'll be able to consume these streams in your browser just like you can at https://stream-beta.wmflabs.org/v2/ui/#/, but with production data. You'll have to access this via an ssh tunnel, however. We'll announce this and put up docs once it is up.

I don't understand why this error is occurring (logstash link). It's saying that "banner" is not a valid module name, but that was added in schema 1.1.0 which was deployed on January 20

Ah, yes, this is confusing. That log is from the legacy eventlogging-processor backend. You are looking at the legacy eventlogging EventError validation stream, not the eventgate-analytics.external.validation.error stream. The legacy eventlogging-processor backend still processes and validates events that it receives from clients, but it will skip producing them to kafka. All JS clients should be configured to POST to EventGate now for migrated schemas, so if this is happening it means there are probably cached versions of the old JS out there that aren't doing this correctly.

Anyway, for a stream that is on Event Platform, the correct stream for validation errors is eventgate-analytics-external.error.validation, and the correct logstash dashboard is this one: https://logstash.wikimedia.org/app/kibana#/dashboard/AXN5OoJu3_NNwgAUlbUT

Hope that helps.

Change 661422 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] EventGate: Conditionally add pageviews to logData

https://gerrit.wikimedia.org/r/661422

Change 661449 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/extensions/GrowthExperiments@master] EventGate: Fix missing module-name and mode for startediting-cta

https://gerrit.wikimedia.org/r/661449

Oh fun, we get some XSS attempts showing up as validation errors for the HomepageVisit schema, e.g.

'personaltoolslink()"&%1\'-;<xss_4622669>\'' is not one of ['userpagetab', 'usertalkpagetab', 'personaltoolslink', 'direct', 'specialconfirmemail', 'specialcontributions', 'specialwelcomesurvey', 'postedit-panel', 'other']

and

"';return true;var foo='" is not one of ['userpagetab', 'usertalkpagetab', 'personaltoolslink', 'direct', 'specialconfirmemail', 'specialcontributions', 'specialwelcomesurvey', 'postedit-panel', 'other']

I don't suppose there's much to be done about that type of thing?

@Ottomata I was also unsure what to make of '.event.is_mobile' should be boolean, '.event.referer_namespace' should be integer, '.event.user_editcount' should be integer, '.event.user_id' should be integer . Just looking at the first one, is_mobile should always be boolean (that field is set to the value returned from $skin instanceof SkinMinerva;), but it's getting recorded as "0" in this particular validation error.

Change 661455 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/extensions/GrowthExperiments@master] EventGate: Actually check for task.pageviews not being null

https://gerrit.wikimedia.org/r/661455

I don't suppose there's much to be done about that type of thing?

Naw, not much we can do! Some of these might even pass validation and make it into Hive. THere is a is_wmf_domain flag that can be useful for filtering, but someone could fake that too.

I was also unsure what to make of...

It is certainly possible that these are not real events from your well behaved client. But, hm, this is one, right?

https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-2021.02.03?id=gs39aHcBoqQO90dBR31W

It looks like those are all from test.wikipedia.org, which is interesting. In raw_event, you can see that is_mobile:"0", which is a string, as are the other should be integer fields like user_editcount:"99". Sounds like something is not serializing the event properly to JSON.

OH OH OH, of course.

https://phabricator.wikimedia.org/T267333#6685630

This is a MW PHP sent event, and T272863: EventLogging PHP EventServiceClient should use EventBus->send(). is blocking our migration of these. The default PHP JSON serializer settings are terrible. We're trying to merge and deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventLogging/+/658441 but there are some CI issues standing in our way. HomepageVisit is migrated only on testwiki waiting for movement on this.

Yup! The EventLogging extension doesn't do any schema validation, it just submits the events. If you want to dev schema validation, see: https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate#Local_development

I don't think that's a pragmatic option, especially for QA testers.
I wonder if it would be hard to publicly expose a "test" endpoint which does the validation and returns the results in the HTTP response? There could be a cookie or similar mechanism to direct event logging queries there.

I wonder if it would be hard to publicly expose a "test" endpoint which does the validation and returns the results in the HTTP response?

In production? Hm. It would be possible. The default eventgate POST API without the hasty=true query parameter does this. Perhaps in debug mode the hasty=true parameter could be removed?

Or, if in beta (or if they can ssh tunnel to the TBD eventstreams-internal instance), they can stream both the expected stream and the validation error stream in their browser in one tab, trigger the event in another tab, and then look for either the actual event or the validation error event.

Example in beta:
https://stream-beta.wmflabs.org/v2/ui/#/?streams=eventlogging_Test,eventgate-analytics-external.error.validation

Then in the JS browser console at https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page at did:

mw.eventLog.logEvent('Test', {'OtherMessage': 'hello there'});  // valid event
mw.eventLog.logEvent('Test', {'OtherMessage': 123}); // invalid event

After a few seconds (waiting for the EL browser queue to submit the event), at https://stream-beta.wmflabs.org/v2/ui/#/?streams=eventlogging_Test,eventgate-analytics-external.error.validation I got two events:

{"event":{"OtherMessage":"hello there"},"schema":"Test","webHost":"en.wikipedia.beta.wmflabs.org","wiki":"enwiki","$schema":"/analytics/legacy/test/1.2.0","client_dt":"2021-02-04T16:29:40.570Z","meta":{"stream":"eventlogging_Test","domain":"en.wikipedia.beta.wmflabs.org","id":"35c057db-5d59-4cf7-92ba-549a39a59c47","dt":"2021-02-04T16:29:47.240Z","request_id":"3210d240-6706-11eb-9a30-0e1a8cd8dbdc","topic":"eventlogging_Test","partition":0,"offset":13},"dt":"2021-02-04T16:29:40.570Z","http":{"request_headers":{"user-agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}}}

{"meta":{"id":"af3a06a9-a76a-4266-8673-0760daf1c07f","dt":"2021-02-04T16:29:47.250Z","uri":"unknown","domain":"en.wikipedia.beta.wmflabs.org","request_id":"32114770-6706-11eb-a17f-a8553f7376c3","stream":"eventgate-analytics-external.error.validation","topic":"eqiad.eventgate-analytics-external.error.validation","partition":0,"offset":5410},"emitter_id":"eventgate-analytics-external","raw_event":"{\"event\":{\"OtherMessage\":123},\"schema\":\"Test\",\"webHost\":\"en.wikipedia.beta.wmflabs.org\",\"wiki\":\"enwiki\",\"$schema\":\"/analytics/legacy/test/1.2.0\",\"client_dt\":\"2021-02-04T16:29:40.570Z\",\"meta\":{\"stream\":\"eventlogging_Test\",\"domain\":\"en.wikipedia.beta.wmflabs.org\",\"id\":\"7059b2f9-5ea1-4d1a-8fee-3435a21f2bab\",\"dt\":\"2021-02-04T16:29:47.241Z\",\"request_id\":\"32114770-6706-11eb-a17f-a8553f7376c3\"},\"dt\":\"2021-02-04T16:29:40.570Z\",\"http\":{\"request_headers\":{\"user-agent\":\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36\"}}}","message":"'.event.OtherMessage' should be string","$schema":"/error/1.0.0","errored_schema_uri":"/analytics/legacy/test/1.2.0","errored_stream_name":"eventlogging_Test"}

The first event was my valid event, the second was the validation error event, which included "message":"'.event.OtherMessage' should be string"

or if they can ssh tunnel to the TBD eventstreams-internal instance

Ah, actually, we finished this today!

https://wikitech.wikimedia.org/wiki/Event_Platform/Instrumentation_How_To#In_production

Change 661449 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] EventGate: Fix missing module-name and mode for startediting-cta

https://gerrit.wikimedia.org/r/661449

Change 661455 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] EventGate: Actually check for task.pageviews not being null

https://gerrit.wikimedia.org/r/661455

I think all of the items here have been fixed, but wmf.30 isn't yet in group2.

Thanks everybody for the comments! Lots of info to digest, it's great.

wmf.30 has been deployed and it seems that there are no Growth projects related errors on eventgate-validation dashboard.

kostajh moved this task from QA to Ready for Development on the Growth-Team (Current Sprint) board.

'.event.user_editcount' should be integer is seen again this week. From the event data ({"event":{"action":"postedit-close","action_data":"","user_id":0,"user_editcount":null,"context":"postedit","editor_interface":"wikitext","is_suggested_task":true,"is_mobile":false,"page_id":0,"page_title":"","page_ns":0,"user_can_edit":true,"page_prot) it looks like the user was logged out some time after completing an edit but before closing the postedit dialog. We should probably add a check in HelpPanelLogger.log() to double check that we have a user ID before logging the event.

kostajh triaged this task as Medium priority.May 12 2021, 12:36 PM

Change 689856 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/GrowthExperiments@master] HelpPanelLogger: Require user ID before logging

https://gerrit.wikimedia.org/r/689856

Change 689856 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] HelpPanelLogger: Require user ID before logging

https://gerrit.wikimedia.org/r/689856