
Refine event pipeline currently refines data in hourly partitions without knowing if the partition is complete
Closed, Resolved, Public

Description

The Refine event pipeline refines raw data in hourly partitions without knowing whether the partition is complete, as there is no marker that tells the refine flows that the import from Kafka for that hour has completed.

This becomes especially problematic in the case of datacenter switches, for jobs that need to consume data for one topic from two different datacenter-prefixed partitions, as the data for a given hour would be split across the two.

The webrequest refine pipeline does not have this issue because it uses the CamusPartitionChecker[1] to assess whether the hour has been completely imported from Kafka and is ready to be refined. We should augment the event refine workflow to take advantage of this. However, we have very many topics with few events, and for those a check like this would fail, since some hours might be totally empty; using the partition checker widely therefore implies that we have first done the work to have canary events in all topics. T250844: MEP: canary events so we know events are flowing through pipeline

[1] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-camus/src/main/scala/org/wikimedia/analytics/refinery/camus/CamusPartitionChecker.scala
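
As a rough illustration of the rule that checker enforces (a conceptual Python sketch only, not the actual Scala implementation linked in [1]): an hour of a topic is only safe to refine once every Kafka partition of that topic has been imported past the end of that hour, which is also why fully empty hours cannot be told apart from not-yet-imported hours without canary events.

```python
# Conceptual sketch only; the real CamusPartitionChecker is Scala and reads Camus
# offset/history files. This just illustrates the completeness rule it applies.
from datetime import datetime, timedelta, timezone

def hour_is_complete(latest_imported_ts_per_partition: list[datetime],
                     hour_start: datetime) -> bool:
    """An hour is complete once every partition has imported data past the hour's end."""
    hour_end = hour_start + timedelta(hours=1)
    return bool(latest_imported_ts_per_partition) and all(
        ts >= hour_end for ts in latest_imported_ts_per_partition
    )

# Example: one partition is still lagging inside the hour, so the hour is not complete.
hour = datetime(2020, 5, 18, 14, 0, tzinfo=timezone.utc)
print(hour_is_complete(
    [datetime(2020, 5, 18, 15, 2, tzinfo=timezone.utc),
     datetime(2020, 5, 18, 14, 40, tzinfo=timezone.utc)],
    hour,
))  # False
```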

Event Timeline

ping @Milimetric, this ticket documents our recent discussion in the batcave about issues brought up on the recent events job. (cc @JAllemandou and @Ottomata)

Ok, so I don't see any reason to block work on the wikidata item_page_link improvement until we have this working perfectly. For me, as long as this solution is coming down the pipeline, I'm ok with a workaround for now. So the question is, what workaround would work? If we hardcode the datacenter, we'd have to know for each hour which dataset to use, otherwise the dependent jobs could wait forever on the wrong empty folder.

Another solution is to run a delayed job for each event dataset that we need. It would check the datacenter=eqiad / datacenter=codfw folders for the hour 4 hours ago, and write a success flag in datacenter=all/year=.../month=.../day=.../hour=4-hours-ago.
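
As a rough sketch of what such a delayed flag job could look like (Python shelling out to the hdfs CLI; the /wmf/data/event base path, the unpadded partition values, and the dataset name are illustrative assumptions, not the real layout):

```python
# Rough sketch of the proposed delayed flag job; paths and dataset names are assumptions.
import subprocess
from datetime import datetime, timedelta, timezone

BASE = "/wmf/data/event"  # assumed base path for illustration


def hdfs_exists(path: str) -> bool:
    # `hdfs dfs -test -e` exits 0 when the path exists.
    return subprocess.run(["hdfs", "dfs", "-test", "-e", path]).returncode == 0


def write_flag_if_ready(dataset: str) -> None:
    hour = datetime.now(timezone.utc) - timedelta(hours=4)
    suffix = f"year={hour.year}/month={hour.month}/day={hour.day}/hour={hour.hour}"

    # The hour is usable once data exists in either (or both) datacenter partitions.
    has_data = any(
        hdfs_exists(f"{BASE}/{dataset}/datacenter={dc}/{suffix}")
        for dc in ("eqiad", "codfw")
    )
    if has_data:
        # Write an empty success flag under a virtual datacenter=all partition that
        # downstream jobs can wait on instead of guessing the active datacenter.
        flag_dir = f"{BASE}/{dataset}/datacenter=all/{suffix}"
        subprocess.run(["hdfs", "dfs", "-mkdir", "-p", flag_dir], check=True)
        subprocess.run(["hdfs", "dfs", "-touchz", f"{flag_dir}/_SUCCESS"], check=True)


if __name__ == "__main__":
    write_flag_if_ready("mediawiki_cirrussearch_request")  # hypothetical dataset name
```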

What does everyone think?


Moving the datacenter switch to an Oozie workflow, in my opinion, seems brittle. Let's discuss on the other ticket.

I'd rather hard-code the dataset folder to eqiad and read from both. The job will fail its SLAs when a DC switch happens, and we'll have to restart it with a patch. I hope the solution for events' partition correctness will be available before the DC switch, or at least not too long after, so that not too many manual switches are needed.


+1 to this

fdans triaged this task as Medium priority. May 18 2020, 4:07 PM
fdans moved this task from Incoming to Event Platform on the Analytics board.

This is now possible for any stream that sets canary_events_enabled: true in EventStreamConfig.
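
For illustration, selecting which streams to produce canaries for then boils down to filtering stream configs on that flag (a minimal Python sketch; stream_configs is a hypothetical mapping standing in for whatever EventStreamConfig returns, and the real ProduceCanaryEvents job may differ):

```python
# Illustration only: `stream_configs` is a hypothetical stream-name -> settings mapping,
# not the real EventStreamConfig API response format.
def streams_needing_canaries(stream_configs: dict) -> list[str]:
    """Return the streams that have opted in to canary events."""
    return [
        name
        for name, settings in stream_configs.items()
        if settings.get("canary_events_enabled") is True
    ]

print(streams_needing_canaries({
    "mediawiki.cirrussearch-request": {"canary_events_enabled": True},
    "some.low-volume-stream": {},
}))  # ['mediawiki.cirrussearch-request']
```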

Ottomata moved this task from Next Up to Done on the Analytics-Kanban board.

We just encountered an issue related to this ticket. I'll add it here to not lose context.

On 2022-08-04 15:59, Sandra deployed refinery as part of Data Engineering's ops week weekly train deploy. A few things went wrong; notably, the artifacts failed the git-fat part of the sync on an-launcher1002. This was not noticed or fixed until 1.5 hours later, at around 18:19. This meant that there was a full hour during which systemd timers that needed refinery artifact .jar files were failing on an-launcher1002, including the produce_canary_events job.

In /var/log/refinery/produce_canary_events/produce_canary_events.log:

Aug  4 16:45:00 an-launcher1002 produce_canary_events[429]: Error: Could not find or load main class org.wikimedia.analytics.refinery.job.ProduceCanaryEvents
Aug  4 17:00:02 an-launcher1002 produce_canary_events[2339]: Error: Could not find or load main class org.wikimedia.analytics.refinery.job.ProduceCanaryEvents
Aug  4 17:15:00 an-launcher1002 produce_canary_events[23815]: Error: Could not find or load main class org.wikimedia.analytics.refinery.job.ProduceCanaryEvents
Aug  4 17:30:01 an-launcher1002 produce_canary_events[3400]: Error: Could not find or load main class org.wikimedia.analytics.refinery.job.ProduceCanaryEvents
Aug  4 17:45:00 an-launcher1002 produce_canary_events[13158]: Error: Could not find or load main class org.wikimedia.analytics.refinery.job.ProduceCanaryEvents
Aug  4 18:00:00 an-launcher1002 produce_canary_events[22951]: Error: Could not find or load main class org.wikimedia.analytics.refinery.job.ProduceCanaryEvents
Aug  4 18:15:02 an-launcher1002 produce_canary_events[9752]: 2022-08-04T18:15:02.462 INFO ProduceCanaryEvents Loaded ProduceCanaryEvents config:
[... successful runs after this time ...]

Because a full hour went by without ANY canary events being produced, all Kafka topics that usually have no volume (many of the codfw ones) had no events for hour=17. As a result, no Hive partition for codfw hour=17 was created, which caused some downstream search Airflow jobs that use the event.mediawiki_cirrussearch-request Hive table to fail while waiting for data.

Timeline of the deploy failures can be found at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log

It'd be nice if we produced canary events from something more robust than a single systemd timer on an-launcher1002. Perhaps an airflow job, or perhaps even better some k8s job or service that periodically did this.
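
For example, an Airflow DAG shaped like the current 15-minute timer might look roughly like this (Airflow 2.x style; the schedule, command, and dag_id are placeholders rather than a real deployment):

```python
# Illustrative Airflow 2.x DAG only; the command the systemd timer runs today, and any
# future deployment, may look quite different.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="produce_canary_events",
    start_date=datetime(2022, 8, 1),
    schedule_interval=timedelta(minutes=15),  # matches the current 15-minute cadence
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    BashOperator(
        task_id="produce_canary_events",
        # Placeholder command; substitute the real ProduceCanaryEvents invocation.
        bash_command="/usr/local/bin/produce-canary-events",
    )
```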

@Ottomata: Removing task assignee as this open task has been assigned for more than two years - See the email sent to the task assignee on February 22nd, 2023.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator. Thanks!

Ottomata claimed this task.

I'm going to resolve this task. We should track further work in T266798: [Event Platform] Enable canary events for all MediaWiki streams.