As a platform engineer, I need to implement proper error handling in a streaming processor so that errors are understood and corrected.
As with any other network call, errors are likely to happen when fetching data from the MW API. Such errors should be handled properly so that data loss is limited.
Eventual consistency issues are also likely to happen in such an architecture. Those problems should be mitigated as much as possible to increase the consistency of the output of this processor.
What has proven to work relatively well:
- retry any error at most 3 times, with a 1 s interval between retries
- retry events that are considered recent; in our case a good value for "recent" is 10 s (cf. T279698)
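The retry policy above could be sketched roughly as follows. This is a minimal, framework-free illustration, not the processor's actual code: `fetch_with_retry`, `NotFoundError`, and `is_recent` are hypothetical names, and the sketch assumes that "retry events that are considered recent" means a not-found result is only retried while the event is younger than 10 s. The constants come from the values listed above.

```python
import time
from datetime import datetime, timedelta, timezone

MAX_RETRIES = 3                            # from the task: at most 3 retries
RETRY_INTERVAL_S = 1.0                     # from the task: 1 s between retries
RECENT_THRESHOLD = timedelta(seconds=10)   # from the task: "recent" is 10 s


class NotFoundError(Exception):
    """Hypothetical error raised when the revision or title is not found."""


def is_recent(event_time, now=None):
    """Return True if the event is at most RECENT_THRESHOLD old."""
    now = now or datetime.now(timezone.utc)
    return now - event_time <= RECENT_THRESHOLD


def fetch_with_retry(fetch, event_time, sleep=time.sleep):
    """Call fetch(), retrying failures up to MAX_RETRIES times.

    Assumption: a not-found error is only worth retrying while the event
    is recent, since it may then be caused by eventual consistency.
    """
    attempt = 0
    while True:
        try:
            return fetch()
        except NotFoundError:
            if attempt >= MAX_RETRIES or not is_recent(event_time):
                raise  # old event or retries exhausted: unrecoverable
        except Exception:
            if attempt >= MAX_RETRIES:
                raise  # retries exhausted: unrecoverable
        attempt += 1
        sleep(RETRY_INTERVAL_S)
```

An error that escapes `fetch_with_retry` is the kind of failure that would be treated as unrecoverable. The `sleep` parameter is injectable only so the policy can be exercised without waiting.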
Unrecoverable errors should be captured in a Flink side output so that they can be analyzed after the fact.
Ideally, errors should be classified to ease their analysis. Ideas taken from the Wikidata service updater:
- not found: the revision or title is not found; the cause might be eventual consistency, or the page was actually deleted before we had a chance to read its content (likely to happen during backfills)
- no content: on HTTP 204
- unexpected response: on HTTP >= 300
- empty response: valid HTTP code but no body in the response
- unknown: for everything else
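The classification above might translate into something like the sketch below. `ErrorType` and `classify` are hypothetical names, and the mapping of "not found" to HTTP 404 is an assumption made for illustration: the MW API may signal a missing revision or title differently (for example in the response payload), in which case that check would need to inspect the body instead.

```python
from enum import Enum
from typing import Optional


class ErrorType(Enum):
    NOT_FOUND = "not_found"
    NO_CONTENT = "no_content"
    UNEXPECTED_RESPONSE = "unexpected_response"
    EMPTY_RESPONSE = "empty_response"
    UNKNOWN = "unknown"


def classify(status: Optional[int], body: Optional[bytes]) -> Optional[ErrorType]:
    """Map an HTTP outcome to an error class, or None for a usable response.

    status is None when no HTTP response was received at all
    (for example a connection failure).
    """
    if status is None:
        return ErrorType.UNKNOWN
    if status == 404:  # assumption: not found is signalled by HTTP 404
        return ErrorType.NOT_FOUND
    if status == 204:
        return ErrorType.NO_CONTENT
    if status >= 300:
        return ErrorType.UNEXPECTED_RESPONSE
    if 200 <= status < 300:
        # Valid HTTP code: an error only if the body is missing or empty.
        return None if body else ErrorType.EMPTY_RESPONSE
    return ErrorType.UNKNOWN  # everything else, e.g. 1xx
```

The 404 check is placed before the generic `>= 300` check so that a not-found response is not swallowed by the "unexpected response" class.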
- Transient network errors do not cause loss of data
- Eventual consistency issues are understood and mitigated
- Unrecoverable errors are classified and captured in a side-output
- (optional) The error side output has a proper schema, is stored in Kafka, and is persisted in Hive for analysis purposes