Mitigate consequences of Gobblin hiccups generating late events and alerts
Closed, Resolved (Public)

Description

Gobblin hiccups are currently generating late events, which in turn trigger unnecessary alerts. We see the problem on webrequest because that is where we have alerts, but the same applies to events.

Here is a proposal for an easy fix:
1/ Increase the sensing window from 2 consecutive hours to 3.
2/ Delay the jobs by 1 hour using the DelayedTimeTable mechanism.

For these 3 DAGs:

  • refine_webrequest_hourly_text
  • refine_webrequest_hourly_upload
  • refine_to_hive_hourly
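
The two proposals above can be sketched in plain Python. This is only an illustration of the arithmetic, not the actual DAG code: the function names `hours_to_sense` and `delayed_run_time` are hypothetical, the DelayedTimeTable mechanism itself is not used here, and it is an assumption that the sensing window covers the data hour plus the hours that follow it.

```python
from datetime import datetime, timedelta


def hours_to_sense(data_hour, window_hours=3):
    """Proposal 1 (assumed reading): list the consecutive hourly
    partitions a sensor waits for before processing `data_hour`."""
    return [data_hour + timedelta(hours=i) for i in range(window_hours)]


def delayed_run_time(data_hour, delay_hours=1):
    """Proposal 2 (assumed reading): an hourly job normally becomes
    schedulable when its data hour ends; push that by `delay_hours`."""
    end_of_interval = data_hour + timedelta(hours=1)
    return end_of_interval + timedelta(hours=delay_hours)


if __name__ == "__main__":
    hour = datetime(2024, 1, 1, 10)
    # Window of 2 (current) versus 3 (proposed) consecutive hours.
    print([h.hour for h in hours_to_sense(hour, window_hours=2)])
    print([h.hour for h in hours_to_sense(hour, window_hours=3)])
    # Earliest launch time with a 1-hour delay.
    print(delayed_run_time(hour))
```

With a window of 3, the job for the 10:00 hour only starts once the 10:00, 11:00 and 12:00 partitions are all present, so an event that Gobblin writes one hour late is still picked up.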

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Make webrequest and refine wait 2 gobblin hours | repos/data-engineering/airflow-dags!1729 | joal | update_gobblin_sensors | main

Event Timeline

I think we should go for option 1: it bases the delay on data rather than on calendar time, which makes it reliable even in case of reruns.
This means that event and pageview data will show up one hour later than they currently do. Is that OK from a product perspective?
Pinging my usual suspects @nshahquinn-wmf, @Mayakp.wiki and @Hghani on this, as I don't know whom else to tag :)
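
To illustrate the rerun argument, here is a minimal sketch comparing a data-based check with a calendar-based one. Both functions (`ready_by_data`, `ready_by_calendar`) are hypothetical and only model the idea; they are not the sensors used in the DAGs, and the window and delay values are assumptions taken from the proposal.

```python
from datetime import datetime, timedelta


def ready_by_data(data_hour, landed_hours, window_hours=3):
    """Data-based: ready only when every hour in the window has landed."""
    needed = {data_hour + timedelta(hours=i) for i in range(window_hours)}
    return needed <= set(landed_hours)


def ready_by_calendar(data_hour, now, delay_hours=1):
    """Calendar-based: ready once wall-clock time passes the delayed start."""
    return now >= data_hour + timedelta(hours=1 + delay_hours)


if __name__ == "__main__":
    hour = datetime(2024, 1, 1, 10)
    # A rerun happens days later while the 12:00 partition is still missing.
    rerun_time = datetime(2024, 1, 5, 9)
    landed = [datetime(2024, 1, 1, 10), datetime(2024, 1, 1, 11)]
    print(ready_by_calendar(hour, rerun_time))  # the delay has long elapsed
    print(ready_by_data(hour, landed))          # still waits for the data
```

In a rerun the calendar delay has always already elapsed, so it offers no protection, whereas the data-based check still waits for the missing partition.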

speaking for our needs, I think that's totally fine! :)

Agreed, I don't think an extra hour will cause any issues!

Thank you both for your answers :)
I'll move forward with this.