Page MenuHomePhabricator

Streaming services errors should be routed to an error event topic.
Closed, ResolvedPublic

Description

Background/Goal

Currently, we output Flink application errors to a logging sink.

We should this general purpose error schema for error side outputs and produce into a dedicated (declared per application) kafka topic. The event that caused the error should be included as a raw json string in this error events.

Acceptance Criteria

  • Convention on error side outputs decided and documented: <job_name>.error
  • Errors eventutilties stream_manager pipeline cause an error event to be emitted to an error stream.

Artifacts/resources

Details

ReferenceSource BranchDest BranchAuthorTitle
repos/data-engineering/eventutilities-python!36job_name0mainottoMake job_name a mandatory param and default error_destination with it
repos/data-engineering/eventutilities-python!29T326536_error_event_improvement0mainottoerror.py - improve error messages
repos/data-engineering/mediawiki-event-enrichment!14minimal-argseventutilities-python-version-bumpottoAdd minimal CLI parameterization for stream descriptors
repos/data-engineering/eventutilities-python!25T326536_error_eventmainottoFlinkStreamManager error process function handling and error event sideoutput
repos/data-engineering/eventutilities-python!23T326565_file_sinkmainottoEventDataStreamFactory - add row_file_sink method
repos/data-engineering/eventutilities-python!22T326536_FlinkStreamManager_refactormainottoRefactor FlinkStreamManager construction
Customize query in GitLab

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
JArguello-WMF renamed this task from [NEEDS GROOMING] Streaming services errors should be routed to an error event topic. to Q3 Streaming services errors should be routed to an error event topic..Jan 25 2023, 3:22 PM
JArguello-WMF triaged this task as Medium priority.
JArguello-WMF set the point value for this task to 5.
Ottomata renamed this task from Q3 Streaming services errors should be routed to an error event topic. to [NEEDS GROOMING] Streaming services errors should be routed to an error event topic..Jan 25 2023, 3:23 PM
Ottomata raised the priority of this task from Medium to Needs Triage.
Ottomata updated the task description. (Show Details)
Ottomata removed the point value for this task.
JArguello-WMF renamed this task from [NEEDS GROOMING] Streaming services errors should be routed to an error event topic. to Streaming services errors should be routed to an error event topic..Jan 25 2023, 3:41 PM

Error event work in this branch.

I was thrown off for a while by the fact that we (of course) have to emit error events to side output as Rows, since the output_tag takes a specific Flink TypeInformation. I now convert the error event dict to a Row in the ProcessFunction before emitting it to the side output.

I asked the mailing list about this need, and someone quickly confirmed it is a bug and submitted a fix!

Change 895749 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] mediawiki-page-content-change-enrichment - bump to 1.1.0

https://gerrit.wikimedia.org/r/895749

Change 895749 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki-page-content-change-enrichment - bump to 1.1.0

https://gerrit.wikimedia.org/r/895749

Change 895796 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] wgEventStreams - Declare rc1.enrichment.mediawiki_page_content_change.error stream

https://gerrit.wikimedia.org/r/895796

Change 895796 merged by jenkins-bot:

[operations/mediawiki-config@master] wgEventStreams - Declare rc1.enrichment.mediawiki_page_content_change.error stream

https://gerrit.wikimedia.org/r/895796

Change 895800 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] mediawiki-page-content-change-enrichment - set error-destination

https://gerrit.wikimedia.org/r/895800

Change 895803 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] wgEventStreams - Declare rc1.enrichment.mediawiki_page_content_change.error stream

https://gerrit.wikimedia.org/r/895803

Change 895803 merged by jenkins-bot:

[operations/mediawiki-config@master] wgEventStreams - Declare rc1.enrichment.mediawiki_page_content_change.error stream

https://gerrit.wikimedia.org/r/895803

Change 895805 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] wgEventStreams - fix typo in new error stream name

https://gerrit.wikimedia.org/r/895805

Change 895805 merged by Ottomata:

[operations/mediawiki-config@master] wgEventStreams - fix typo in new error stream name

https://gerrit.wikimedia.org/r/895805

Mentioned in SAL (#wikimedia-operations) [2023-03-08T15:05:07Z] <otto@deploy2002> Synchronized wmf-config/ext-EventLogging.php: wgEventStreams - Declare rc1.enrichment.mediawiki_page_content_change.error stream - T326536 (duration: 11m 33s)

Mentioned in SAL (#wikimedia-operations) [2023-03-08T15:22:44Z] <otto@deploy2002> Synchronized wmf-config/ext-EventLogging.php: wgEventStreams - Fix typo in rc1.enrichment.mediawiki_page_content_change.error stream - T326536 (duration: 06m 41s)

Change 895800 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki-page-content-change-enrichment - set error-destination

https://gerrit.wikimedia.org/r/895800

We have our first real error!!!

kafkacat -C -u -b kafka-jumbo1006.eqiad.wmnet:9092 -t eqiad.rc1.enrichment.mediawiki_page_content_change.error | jq .

{
  "$schema": "/error/2.0.0",
  "dt": "2023-03-08T15:54:06.563791992Z",
  "emitter_id": "Flink Enrichment Job: rc1.mediawiki.page_change ->rc1.mediawiki.page_content_change",
  "errored_schema_uri": "/mediawiki/page/change/1.0.0",
  "errored_stream_name": "rc1.mediawiki.page_change",
  "message": "'pages'",
  "meta": {
    "domain": "fr.wikipedia.org",
    "dt": "2023-03-08T15:54:23.293253Z",
    "id": "891f7ec7-8173-4a91-b3d4-2279baec6876",
    "request_id": "e6ce97a3-076f-456b-9886-0ea4c7145b71",
    "uri": "https://fr.wikipedia.org/wiki/Spidart",
    "stream": "rc1.enrichment.mediawiki_page_content_change.error"
  },
  "raw_event": "...",
  "stack": "  File \"/opt/lib/python/site-packages/eventutilities_python/manager.py\", line 134, in process_element\n    yield self.func(event)\n  File \"/srv/app/pipeline.py\", line 37, in enrich\n    content_body = main_slot[\"query\"][\"pages\"][0][\"revisions\"][0][\"slots\"][\"main\"][\n"
}

Oh interesting!  We should have more about the actual error than just the message. In the log, the error prints as

KeyError: 'pages'

On it...

I investigated a couple of the errors. One seems to be because the page was deleted before the enrichment job could enrich it!

https://en.wikipedia.org/w/index.php?title=Talk:Rob_Run_Corleone&oldid=1143574763

We should handle these errors, probably as not errors? Just skipping the event?

@gmodena, To do this, we might need to figure out how to make process able to filter events as we discussed.

So: since we decided to use plain on datastream.map when error_destination = False (instead of our datastream.process w error handler), the job was dying when it encountered these errors. This makes sense I think. If stream manager error handling is disabled, then it is up to the user to catch exceptions in their own map function, right?

Alternatively, we could always use process w error handler, but and log but not emit events that encounter errors? I think I prefer the current behavior.

@gmodena, thoughts?

So: since we decided to use plain on datastream.map when error_destination = False (instead of our datastream.process w error handler), the job was dying when it encountered these errors. This makes sense I think. If stream manager error handling is disabled, then it is up to the user to catch exceptions in their own map function, right?

Yes. That's expected behaviour. Pipeline code delegates error handling to library (which in this case was missing).

Alternatively, we could always use process w error handler, but and log but not emit events that encounter errors? I think I prefer the current behavior.

Mmm... this is dangerous because it adds a level of obfuscation that will bite us later on. We have a pattern for error reporting which should be followed, or have the pipeline blow up in our face when not. +1 for keeping current behaviour.

@gmodena @tchin and I talked today and decided on the convention of:

<app_name>.error

Where app name corresponds to the deployed helmfile service, as well as the Flink job_name.

I will implement this as the default before we close this ticket.

Change 920378 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] Declare mediawiki.page_change.v1 stream and produce from Eventbus

https://gerrit.wikimedia.org/r/920378

Change 920378 merged by jenkins-bot:

[operations/mediawiki-config@master] Declare mediawiki.page_change.v1 stream and produce from Eventbus

https://gerrit.wikimedia.org/r/920378