The network_internal druid load job fails if data is not present
Closed, Declined (Public)

Description

We have seen alerts caused by the lack of input data for the eventlogging_to_druid_network_flows_internal_hourly job.
This data has been generated only sporadically, because the only current source is the new drmrs data centre.

There is a question in my mind as to whether or not this job should exit with an error when data is not present.

Event Timeline

Change 764737 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Absent the eventlogging_to_druid_job job temporarily

https://gerrit.wikimedia.org/r/764737

Change 764737 merged by Btullis:

[operations/puppet@production] Absent the eventlogging_to_druid_job job temporarily

https://gerrit.wikimedia.org/r/764737

Perhaps this behaviour is desired. Under normal circumstances the routers would always be generating data, so perhaps it would be correct for the job to exit with an error if there is no source data.
The reason for the lack of input data at the moment is that drmrs is the only data centre to be producing this sflow data so far and it's still being set up.

Happy to take advice on whether or not the current behaviour is ideal.
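
For illustration, the two behaviours being weighed here can be sketched roughly as follows. This is a minimal Python sketch, not the actual job: `has_input_data`, `run_load_job`, and the `fail_on_missing` flag are hypothetical names, the input is assumed to be a directory of files, and the Druid load step itself is omitted.

```python
import sys
from pathlib import Path


def has_input_data(input_dir: Path) -> bool:
    """Return True if input_dir holds at least one non-empty regular file."""
    if not input_dir.is_dir():
        return False
    return any(p.is_file() and p.stat().st_size > 0 for p in input_dir.rglob("*"))


def run_load_job(input_dir: Path, fail_on_missing: bool = True) -> int:
    """Return an exit code: 0 on success, 1 if input is missing in strict mode."""
    if not has_input_data(input_dir):
        if fail_on_missing:
            # Strict mode: missing input is treated as a failure, so it alerts.
            print(f"no input data under {input_dir}", file=sys.stderr)
            return 1
        # Lenient mode: missing input is logged and the run is skipped.
        print(f"no input data under {input_dir}; skipping", file=sys.stderr)
        return 0
    # Hypothetical placeholder: a real job would load the data into Druid here.
    return 0
```

With `fail_on_missing=True`, an empty hour surfaces as an alert, which suits sources that should always be producing data. The lenient mode avoids noise while a source is still being set up, at the cost of hiding genuine outages.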

BTullis updated the task description.

Change 772877 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Reenable the sflow job

https://gerrit.wikimedia.org/r/772877

odimitrijevic subscribed.

Now that the drmrs DC is operational, this should be decided upon. As part of the work, let's ensure that the data is collected as expected and that it is reported correctly. Are there any quality checks that should be implemented for due diligence as part of a different task?

Change 772877 merged by Btullis:

[operations/puppet@production] Reenable the sflow job

https://gerrit.wikimedia.org/r/772877

I have re-enabled the task to collect the sflow data.

Once it has run we should be able to verify that the data is correctly loaded to druid by using this Turnilo dashboard.

BTullis triaged this task as Medium priority.

I can see data in Turnilo for the network_flows_internal stream, so I think everything is good here.

image.png (908×1 px, 95 KB)

As for the question this task raises, whether the job should fail if data isn't present, I think the consensus is that it should fail, and therefore we should decline the task.

Adding @ayounsi to help check whether all expected data is present. I think that we're expecting to see some data from eqiad in here soon. Is that right?

The region field says 'Unknown' for all of the data at the moment. That's supposed to be populated by the DC code, isn't it?

image.png (337×1 px, 31 KB)

> Adding @ayounsi to help check whether all expected data is present. I think that we're expecting to see some data from eqiad in here soon. Is that right?

Yes, I'm not sure yet why eqiad's data isn't showing up in Kafka.

> The region field says 'Unknown' for all of the data at the moment. That's supposed to be populated by the DC code, isn't it?

Turnilo has a delay of roughly 5 to 6 hours, so the region field should start showing correct values once that has elapsed. The data is now correct in Kafka.