The research Spark job for HTML-based edit types has a 60s timeout (previously 300s). Judging from the errors in the research edit-types table, some processing did indeed reach the timeout limit. Note that the research pipeline does not have to apply the diff; it only computes edit types.
spark.sql("select edit_types.error from research.edit_types_html").groupBy("error").count().show(truncate=False)
+---------------------------+---------+
|error                      |count    |
+---------------------------+---------+
|null                       |124586530|
|timeout error (300 seconds)|2759     |
|timeout error (60 seconds) |4043     |
|None                       |1469     |
+---------------------------+---------+

We should have timeouts for the feature counts enrichment job as well:
- in applying the diff
- in computing edit-types
Without timeouts, some processing may take a long time and block the pipeline. More importantly, if the pipeline fails for whatever reason because of this long processing time (OOM?), the Flink app will crash. We should pre-emptively raise on timeout so that these events go into the error sink, which saves the app and keeps events flowing.
- With recent changes, the app should auto-restart with a backoff time, but if the same event is retried repeatedly, the pipeline will not progress and will remain stuck.
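
A minimal sketch of the idea, not the actual enrichment code: each step (applying the diff, computing edit types) runs under a timeout, and a timed-out or failed event is returned as an error record instead of raising. The names `run_with_timeout`, `enrich`, `apply_diff` and `compute_edit_types` are hypothetical placeholders, and the error string only mimics the format seen in the research table. Note the assumption that a thread-based timeout is acceptable here: it stops waiting for the work but does not kill it, so a real job would more likely need a process- or signal-based timeout.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_timeout(fn, arg, timeout_s):
    """Run fn(arg) and return (result, error); error is None on success."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, arg)
    try:
        return future.result(timeout=timeout_s), None
    except FutureTimeout:
        # Same shape as the errors in the research table (assumed format).
        return None, f"timeout error ({timeout_s} seconds)"
    except Exception as exc:
        return None, f"{type(exc).__name__}: {exc}"
    finally:
        # Do not block on the worker; the thread itself is NOT interrupted.
        pool.shutdown(wait=False)


def enrich(event, apply_diff, compute_edit_types, timeout_s=60):
    """Return (enriched_event, error_record); exactly one of them is None.

    apply_diff and compute_edit_types are hypothetical callables standing in
    for the two steps listed above.
    """
    html, err = run_with_timeout(apply_diff, event, timeout_s)
    edit_types = None
    if err is None:
        edit_types, err = run_with_timeout(compute_edit_types, html, timeout_s)
    if err is not None:
        # This record would go to the error sink instead of crashing the app.
        return None, {"event": event, "error": err}
    return {"event": event, "edit_types": edit_types}, None
```

Because a problematic event ends up as an error record rather than an exception, an app restart would not hit the same event again and get stuck on it.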