
[Shared Event Platform] Implement error handling and retry logic when fetching data from the MW api
Open, Needs Triage, Public

Description

User Story
As a platform engineer, I need to implement proper error handling in a streaming processor so that errors are understood and corrected.

As with any network call, errors are likely to happen when fetching data from the MW API. Such errors should be handled properly so that data loss is limited.
Eventual consistency issues are also likely to happen in such an architecture. Those problems should be mitigated as much as possible to increase the consistency of this processor's output.

What has proven to work relatively well:

  • retry any error at most 3 times, with a 1 sec interval between retries
  • retry events that are considered recent; in our case a good value for recent is 10s (cf. T279698)
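
The two rules above can be sketched roughly as follows. This is a minimal illustration in Python, not the processor's actual code: `fetch_with_retry` and `is_recent` are hypothetical names, "at most 3 times" is read as 3 retries after the initial attempt, and the 10s window is assumed to be measured against the event time.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

MAX_RETRIES = 3         # from the task: retry at most 3 times
RETRY_INTERVAL_S = 1.0  # from the task: 1 sec between retries
RECENT_WINDOW_S = 10.0  # from the task: 10s is a good value for "recent"


def is_recent(event_time_s: float, now_s: float,
              window_s: float = RECENT_WINDOW_S) -> bool:
    """True when the event is young enough to be worth retrying."""
    return (now_s - event_time_s) <= window_s


def fetch_with_retry(fetch: Callable[[], T],
                     max_retries: int = MAX_RETRIES,
                     interval_s: float = RETRY_INTERVAL_S,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch(); on any exception retry up to max_retries times, then re-raise."""
    attempt = 0
    while True:
        try:
            return fetch()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: let the caller treat it as unrecoverable
            attempt += 1
            sleep(interval_s)
```

The `sleep` parameter is injectable only so that the wait can be replaced in tests; a caller would normally leave the default.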

Unrecoverable errors should be captured in a Flink side-output so that they can be analyzed after the fact.
Ideally, errors should be classified to ease that analysis. Ideas taken from the Wikidata service updater:

  • not found: the revision or title is not found; the cause might be eventual consistency, or the page was actually deleted before we had a chance to read its content (likely to happen during backfills)
  • no content: on HTTP 204
  • unexpected response: on HTTP >= 300
  • empty response: valid HTTP code but no body in the response
  • unknown: for everything else
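
As an illustration of the classification above, a small sketch in Python. The names `FetchError` and `classify` are hypothetical, and two details are assumptions rather than part of the task: a 404 is mapped to "not found" (checked before the generic >= 300 rule), and the `missing` flag stands for a response whose payload reports the revision or title as missing.

```python
from enum import Enum
from typing import Optional


class FetchError(Enum):
    NOT_FOUND = "not_found"
    NO_CONTENT = "no_content"
    UNEXPECTED_RESPONSE = "unexpected_response"
    EMPTY_RESPONSE = "empty_response"
    UNKNOWN = "unknown"


def classify(status: Optional[int], body: Optional[bytes],
             missing: bool = False) -> Optional[FetchError]:
    """Return the error class for a response, or None when it is usable."""
    if status is None:
        return FetchError.UNKNOWN  # no HTTP response at all (e.g. connection failure)
    if missing or status == 404:   # assumption: 404 also counts as "not found"
        return FetchError.NOT_FOUND
    if status == 204:
        return FetchError.NO_CONTENT
    if status >= 300:
        return FetchError.UNEXPECTED_RESPONSE
    if status < 200:
        return FetchError.UNKNOWN  # informational codes fall under "everything else"
    if not body:
        return FetchError.EMPTY_RESPONSE
    return None
```

The order of the checks matters: the specific cases (not found, 204) have to be tested before the broader ">= 300" and "no body" rules, otherwise they would never match.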

Done is:
  • Transient network errors do not cause loss of data
  • Eventual consistency issues are understood and mitigated
  • Unrecoverable errors are classified and captured in a side-output
  • (optional) the error side-output has a proper schema, is stored in Kafka and persisted in Hive for analysis purposes
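
To make the side-output idea concrete, here is a rough sketch in plain Python, without Flink. Everything in it is hypothetical: the `ErrorRecord` fields are one possible shape for an error schema, not an agreed one, and `split_outputs` only mimics routing failures to a second output.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Iterable, List, Tuple


class FetchFailed(Exception):
    """Hypothetical exception carrying the error class of a failed fetch."""

    def __init__(self, error_type: str, message: str):
        super().__init__(message)
        self.error_type = error_type


@dataclass(frozen=True)
class ErrorRecord:
    """Hypothetical schema for one entry of the error side-output."""
    rev_id: int
    error_type: str
    message: str


def split_outputs(rev_ids: Iterable[int],
                  enrich: Callable[[int], dict]) -> Tuple[List[dict], List[str]]:
    """Return (main output, side output); failures become JSON error records."""
    main: List[dict] = []
    side: List[str] = []
    for rev_id in rev_ids:
        try:
            main.append(enrich(rev_id))
        except Exception as exc:
            # Exceptions without an error_type are classified as "unknown".
            error_type = getattr(exc, "error_type", "unknown")
            record = ErrorRecord(rev_id, error_type, str(exc))
            side.append(json.dumps(asdict(record), sort_keys=True))
    return main, side
```

Keeping the record flat and typed should make it straightforward to give it a proper schema later.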

Event Timeline

I used this in the past for building Python based services.

https://tenacity.readthedocs.io/en/latest/#waiting-before-retrying

Does anything exist like this for Scala? Maybe this...

https://index.scala-lang.org/hipjim/scala-retry

This could be an easy way for teams that implement services to set their own retry logic for calls.

There are a number of libraries that support retries (with the typical backoff mechanisms you'd expect). Not that the choice is set in stone, but this was one of the requirements I looked at when choosing an HTTP library for the PoC.

The library I ended up using allows for pluggable backends (concurrency models), which should make retry logic on network errors pretty much zero cost to implement. The backend I'm using is Future based, and this would be their recommended retry plugin: https://github.com/softwaremill/retry. There are more details at https://sttp.softwaremill.com/en/latest/resilience.html?highlight=retry#backend-specific-retries

It gets a bit more tricky when we want to retry on _successful_ requests; for example, a query for a rev_id not yet available in the db replica would return a 200 with a payload that describes the miss (we currently log these kinds of responses).
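
A rough sketch of what retrying on a successful response could look like, in Python for brevity. The names `is_miss` and `fetch_until_found` are hypothetical, and the payload shape (misses listed under `query.badrevids`, keyed by revision id) is an assumption that would need to be checked against the real responses.

```python
import time
from typing import Callable, Optional


def is_miss(payload: dict, rev_id: int) -> bool:
    """Assumed shape: a miss is reported under query.badrevids, keyed by rev id."""
    bad = payload.get("query", {}).get("badrevids", {})
    return str(rev_id) in bad


def fetch_until_found(fetch: Callable[[int], dict], rev_id: int,
                      max_retries: int = 3, interval_s: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> Optional[dict]:
    """Retry while the (HTTP 200) payload reports a miss; None if it never shows up."""
    for attempt in range(max_retries + 1):
        payload = fetch(rev_id)
        if not is_miss(payload, rev_id):
            return payload
        if attempt < max_retries:
            sleep(interval_s)  # give the replica time to catch up
    return None
```

The point is that the retry condition is a predicate on the result rather than on an exception, so it would have to be supplied separately from the network-error retries.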