
Refine event pipeline currently refines data in hourly partitions without knowing if the partition is complete
Open, Medium, Public

Description

The Refine event pipeline refines raw data in hourly partitions without knowing whether the partition is complete, as there is no marker that tells the refine flows that the import from Kafka for that hour has completed.

This becomes especially problematic in the case of datacenter switches, for jobs that need to consume data for one topic from two different datacenter-prefixed partitions, as the data for a given hour would be split across the two.

The webrequest refine pipeline does not have this issue because it uses the CamusPartitionChecker [1] to assess whether the hour has been completely imported from Kafka and is ready to be refined (see the sketch below). We should augment the event refine workflow to take advantage of this. However, we have many topics with few events, and for those a check like this would fail, since some hours might be completely empty. Using the partition checker widely therefore implies that we have done the work to have canary events in all topics: T250844: MEP: canary events so we know events are flowing through pipeline.

[1] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-camus/src/main/scala/org/wikimedia/analytics/refinery/camus/CamusPartitionChecker.scala
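
For context, here is a minimal sketch of the hour-completeness idea, assuming per-partition latest-imported timestamps are available: an hour of a topic is only "complete" once every Kafka partition has been imported past the end of that hour. The object, method, and parameter names below are illustrative and are not taken from refinery-camus.

```scala
// Illustrative completeness check; not the real CamusPartitionChecker API.
import java.time.Instant

object HourCompletenessSketch {

  // An hour is considered fully imported only when every Kafka partition of the
  // topic has a latest-imported record timestamp at or past the end of the hour.
  def hourIsComplete(
    hourEnd: Instant,                             // exclusive end of the hour being checked
    latestImportedPerPartition: Map[Int, Instant] // latest imported timestamp per partition
  ): Boolean = {
    latestImportedPerPartition.nonEmpty &&
      latestImportedPerPartition.values.forall(ts => !ts.isBefore(hourEnd))
  }

  def main(args: Array[String]): Unit = {
    val hourEnd = Instant.parse("2020-05-18T11:00:00Z")
    // Partition 1 has not yet been imported past the hour boundary,
    // so this hour is not ready to be flagged and refined.
    val imported = Map(
      0 -> Instant.parse("2020-05-18T11:03:12Z"),
      1 -> Instant.parse("2020-05-18T10:58:40Z")
    )
    println(hourIsComplete(hourEnd, imported)) // prints: false
  }
}
```

Note that for an hour with zero events this kind of check cannot distinguish "empty hour" from "not yet imported", which is exactly why canary events (T250844) are needed before using it widely.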

Event Timeline

ping @Milimetric, this ticket documents our recent discussion in the batcave about issues brought up on the recent events job. (cc @JAllemandou and @Ottomata)

Ok, so I don't see any reason to block work on the wikidata item_page_link improvement until we have this working perfectly. For me, as long as this solution is coming down the pipeline, I'm ok with a workaround for now. So the question is, what workaround would work? If we hardcode the datacenter, we'd have to know for each hour which dataset to use, otherwise the dependent jobs could wait forever on the wrong empty folder.

Another solution is to run a delayed job for each event dataset that we need. This job would check the datacenter=eqiad and datacenter=codfw folders for the hour four hours ago, and write a success flag in datacenter=all/year=.../month=.../day=.../hour=<4 hours ago> (see the sketch after this comment).

What does everyone think?
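
To make the proposed workaround concrete, here is a rough Scala sketch under assumed conventions: the base path, the _SUCCESS flag name, and the fixed 4-hour delay are all illustrative, and this is not an existing refinery job.

```scala
// Delayed-job sketch: look at both datacenter-prefixed partitions for the hour
// that is now 4 hours old and, if either contains a _SUCCESS flag, write a
// corresponding flag under datacenter=all for downstream jobs to depend on.
import java.time.{ZonedDateTime, ZoneOffset}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object DelayedSuccessFlagger {

  // Build an hourly partition path like .../datacenter=eqiad/year=2020/month=05/day=18/hour=11
  def hourPartitionPath(base: String, dc: String, t: ZonedDateTime): Path =
    new Path(
      f"$base/datacenter=$dc/year=${t.getYear}/month=${t.getMonthValue}%02d/" +
      f"day=${t.getDayOfMonth}%02d/hour=${t.getHour}%02d"
    )

  def main(args: Array[String]): Unit = {
    val baseDir = args.headOption.getOrElse("/wmf/data/event/my_dataset") // hypothetical path
    val fs = FileSystem.get(new Configuration())

    val targetHour = ZonedDateTime.now(ZoneOffset.UTC).minusHours(4)
    val dcDone = Seq("eqiad", "codfw").filter { dc =>
      // Assumes a _SUCCESS flag marks a refined hourly partition.
      fs.exists(new Path(hourPartitionPath(baseDir, dc, targetHour), "_SUCCESS"))
    }

    if (dcDone.nonEmpty) {
      // At least one datacenter has data for this hour; publish a single
      // datacenter=all flag so dependent jobs don't have to guess the DC.
      val allFlag = new Path(hourPartitionPath(baseDir, "all", targetHour), "_SUCCESS")
      fs.create(allFlag, true).close()
      println(s"Wrote $allFlag (found: ${dcDone.mkString(", ")})")
    } else {
      println(s"No data found yet for $targetHour in eqiad or codfw; will retry later.")
    }
  }
}
```

Downstream jobs would then wait only on the datacenter=all flag, regardless of which datacenter was active for that hour.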

> Another solution is to run a delayed job for each event dataset that we need. This job would check the datacenter=eqiad and datacenter=codfw folders for the hour four hours ago, and write a success flag in datacenter=all/year=.../month=.../day=.../hour=<4 hours ago>.

Moving the datacenter switch into an Oozie workflow, in my opinion, seems brittle. Let's discuss on the other ticket.

I'd rather hard-code the dataset folder to eqiad and read from both. The job will fail its SLAs when a DC switch happens, and we'll have to restart it with a patch. I hope the solution for events' partition correctness will be available before the DC switch, or at least not too long after it, so that not too many manual restarts are needed.

> I'd rather hard-code the dataset folder to eqiad and read from both. The job will fail its SLAs when a DC switch happens, and we'll have to restart it with a patch.

+1 to this

fdans triaged this task as Medium priority. May 18 2020, 4:07 PM
fdans moved this task from Incoming to Event Platform on the Analytics board.

This is now possible for any stream that sets canary_events_enabled: true in EventStreamConfig.
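
For illustration only, here is a YAML-style sketch of a stream entry with this setting. The stream name, schema title, and destination service below are hypothetical, and the exact shape of EventStreamConfig entries in production may differ.

```yaml
# Hypothetical stream entry; the relevant part is canary_events_enabled.
mediawiki.my_example_stream:
  schema_title: my/example/schema                  # illustrative schema title
  destination_event_service: eventgate-analytics   # assumed, not from this task
  canary_events_enabled: true                      # opts the stream into canary events,
                                                   # so an empty hour can still be seen as complete
```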

Ottomata moved this task from Next Up to Done on the Analytics-Kanban board.