
Refine event pipeline currently refines data in hourly partitions without knowing if the partition is complete
Open, Medium, Public

Description

The Refine event pipeline refines raw data in hourly partitions without knowing whether the partition is complete, as there is no marker that tells the refine flows that the import from Kafka for that hour has completed.

This becomes especially problematic in the case of datacenter switches, for jobs that need to consume data for one topic from two different datacenter-prefixed partitions, as the data for a given hour would be split across the two.

The webrequest refine pipeline does not have this issue because it uses the CamusPartitionChecker [1] to assess whether the hour has been completely imported from Kafka and is ready to be refined (see the sketch below). We should augment the event refine workflow to take advantage of this. However, we have many topics with few events, and for those a check like this would fail, since some hours might be completely empty. Using the partition checker widely therefore implies that we have done the work to have canary events in all topics: T250844: MEP: canary events so we know events are flowing through pipeline.

[1] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-camus/src/main/scala/org/wikimedia/analytics/refinery/camus/CamusPartitionChecker.scala
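
For context, here is a minimal sketch of the hour-completeness idea, assuming per-partition latest-imported timestamps are available: an hour of a topic is only "complete" once every Kafka partition has been imported past the end of that hour. The object, method, and parameter names below are illustrative and are not taken from refinery-camus.

```scala
// Illustrative completeness check; not the real CamusPartitionChecker API.
import java.time.Instant

object HourCompletenessSketch {

  // An hour is considered fully imported only when every Kafka partition of the
  // topic has a latest-imported record timestamp at or past the end of the hour.
  def hourIsComplete(
    hourEnd: Instant,                             // exclusive end of the hour being checked
    latestImportedPerPartition: Map[Int, Instant] // latest imported timestamp per partition
  ): Boolean = {
    latestImportedPerPartition.nonEmpty &&
      latestImportedPerPartition.values.forall(ts => !ts.isBefore(hourEnd))
  }

  def main(args: Array[String]): Unit = {
    val hourEnd = Instant.parse("2020-05-18T11:00:00Z")
    // Partition 1 has not yet been imported past the hour boundary,
    // so this hour is not ready to be flagged and refined.
    val imported = Map(
      0 -> Instant.parse("2020-05-18T11:03:12Z"),
      1 -> Instant.parse("2020-05-18T10:58:40Z")
    )
    println(hourIsComplete(hourEnd, imported)) // prints: false
  }
}
```

Note that for an hour with zero events this kind of check cannot distinguish "empty hour" from "not yet imported", which is exactly why canary events (T250844) are needed before using it widely.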

Event Timeline

ping @Milimetric, this ticket documents our recent discussion in the batcave about issues brought up on the recent events job. (cc @JAllemandou and @Ottomata)

Ok, so I don't see any reason to block work on the wikidata item_page_link improvement until we have this working perfectly. For me, as long as this solution is coming down the pipeline, I'm ok with a workaround for now. So the question is, what workaround would work? If we hardcode the datacenter, we'd have to know for each hour which dataset to use, otherwise the dependent jobs could wait forever on the wrong empty folder.

Another solution is to run a delayed job for each event dataset that we need. This job would check the datacenter=eqiad and datacenter=codfw folders for the hour four hours ago, and write a success flag in datacenter=all/year=.../month=.../day=.../hour=<4 hours ago> (see the sketch after this comment).

What does everyone think?
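
To make the proposed workaround concrete, here is a rough Scala sketch under assumed conventions: the base path, the _SUCCESS flag name, and the fixed 4-hour delay are all illustrative, and this is not an existing refinery job.

```scala
// Delayed-job sketch: look at both datacenter-prefixed partitions for the hour
// that is now 4 hours old and, if either contains a _SUCCESS flag, write a
// corresponding flag under datacenter=all for downstream jobs to depend on.
import java.time.{ZonedDateTime, ZoneOffset}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object DelayedSuccessFlagger {

  // Build an hourly partition path like .../datacenter=eqiad/year=2020/month=05/day=18/hour=11
  def hourPartitionPath(base: String, dc: String, t: ZonedDateTime): Path =
    new Path(
      f"$base/datacenter=$dc/year=${t.getYear}/month=${t.getMonthValue}%02d/" +
      f"day=${t.getDayOfMonth}%02d/hour=${t.getHour}%02d"
    )

  def main(args: Array[String]): Unit = {
    val baseDir = args.headOption.getOrElse("/wmf/data/event/my_dataset") // hypothetical path
    val fs = FileSystem.get(new Configuration())

    val targetHour = ZonedDateTime.now(ZoneOffset.UTC).minusHours(4)
    val dcDone = Seq("eqiad", "codfw").filter { dc =>
      // Assumes a _SUCCESS flag marks a refined hourly partition.
      fs.exists(new Path(hourPartitionPath(baseDir, dc, targetHour), "_SUCCESS"))
    }

    if (dcDone.nonEmpty) {
      // At least one datacenter has data for this hour; publish a single
      // datacenter=all flag so dependent jobs don't have to guess the DC.
      val allFlag = new Path(hourPartitionPath(baseDir, "all", targetHour), "_SUCCESS")
      fs.create(allFlag, true).close()
      println(s"Wrote $allFlag (found: ${dcDone.mkString(", ")})")
    } else {
      println(s"No data found yet for $targetHour in eqiad or codfw; will retry later.")
    }
  }
}
```

Downstream jobs would then wait only on the datacenter=all flag, regardless of which datacenter was active for that hour.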

> Another solution is to run a delayed job for each event dataset that we need. This job would check the datacenter=eqiad and datacenter=codfw folders for the hour four hours ago, and write a success flag in datacenter=all/year=.../month=.../day=.../hour=<4 hours ago>.

Moving the datacenter switch into an Oozie workflow, in my opinion, seems brittle. Let's discuss on the other ticket.

I'd rather hard-code the dataset folder to eqiad and read from both. The job will fail its SLAs when a DC switch happens, and we'll have to restart it with a patch. I hope the solution for events' partition correctness will be available before the DC switch, or at least not too long after it, so that not too many manual restarts are needed.

> I'd rather hard-code the dataset folder to eqiad and read from both. The job will fail its SLAs when a DC switch happens, and we'll have to restart it with a patch.

+1 to this

fdans triaged this task as Medium priority. May 18 2020, 4:07 PM
fdans moved this task from Incoming to Event Platform on the Analytics board.

This is now possible for any stream that sets canary_events_enabled: true in EventStreamConfig.
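
For illustration only, here is a YAML-style sketch of a stream entry with this setting. The stream name, schema title, and destination service below are hypothetical, and the exact shape of EventStreamConfig entries in production may differ.

```yaml
# Hypothetical stream entry; the relevant part is canary_events_enabled.
mediawiki.my_example_stream:
  schema_title: my/example/schema                  # illustrative schema title
  destination_event_service: eventgate-analytics   # assumed, not from this task
  canary_events_enabled: true                      # opts the stream into canary events,
                                                   # so an empty hour can still be seen as complete
```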

Ottomata moved this task from Next Up to Done on the Analytics-Kanban board.