[Event Platform] Enable canary events for mediawiki.revision-visibility-change
Closed, Duplicate · Public

Description

After work on T340880, we now consume the hourly partitions that are generated from the mediawiki.revision-visibility-change event stream. This Airflow job stalled recently waiting on partition datacenter=eqiad/year=2023/month=9/day=17/hour=4. This partition never materialized as there were 0 events in that hour range.
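To illustrate why an empty hour stalls the job, here is a minimal sketch of how the expected partition path could be derived from the hour being processed. The function name `hourly_partition` is hypothetical and is not taken from the Airflow codebase; the path layout simply mirrors the partition quoted above. If no event (canary or real) lands in that hour, nothing creates this path and a sensor waiting on it never succeeds.

```python
from datetime import datetime


def hourly_partition(dt: datetime, datacenter: str = "eqiad") -> str:
    """Build the Hive partition path for the hour containing `dt`.

    Hypothetical helper: values are not zero-padded (month=9, hour=4),
    matching the partition path quoted in the description.
    """
    return (
        f"datacenter={datacenter}/year={dt.year}/month={dt.month}"
        f"/day={dt.day}/hour={dt.hour}"
    )


# The hour that stalled the job in the description.
print(hourly_partition(datetime(2023, 9, 17, 4)))
# -> datacenter=eqiad/year=2023/month=9/day=17/hour=4
```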

In a recent Slack thread, we discovered that canary events are not being produced for mediawiki.revision-visibility-change:

		'mediawiki.revision-visibility-change' => [
			'schema_title' => 'mediawiki/revision/visibility-change',
			'destination_event_service' => 'eventgate-main',
			'canary_events_enabled' => false,
		],

(GitHub: https://github.com/wikimedia/operations-mediawiki-config/blob/6aea7f20e164f1f7f6f84b2160cc8b16427b5a85/wmf-config/ext-EventStreamConfig.php#L1358-L1361 )

As per @Ottomata, this stream predates the introduction of canary events, and thus does not have them enabled. But for having T340880 work reliably, we now want these canary events to happen.

In this task we want to:

  • Enable canary events for mediawiki.revision-visibility-change for the reliable consumption of the downstream HDFS table.
  • Given this stream is likely consumed by other folks, we need to announce the change and explain to folks how to filter out these events.
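For the announcement, a consumer-side filter could look roughly like the sketch below. It assumes canary events are marked with `meta.domain` set to `"canary"`, which is my understanding of the Event Platform convention; please verify against the stream's schema before relying on it. The helper names and the sample events are hypothetical.

```python
from typing import Any, Iterable, Iterator

# Assumption: canary events carry this value in meta.domain.
CANARY_DOMAIN = "canary"


def is_canary(event: dict[str, Any]) -> bool:
    """Return True if the event looks like a canary event."""
    meta = event.get("meta") or {}
    return meta.get("domain") == CANARY_DOMAIN


def drop_canaries(events: Iterable[dict[str, Any]]) -> Iterator[dict[str, Any]]:
    """Yield only the events that are not canaries."""
    return (event for event in events if not is_canary(event))


# Hypothetical sample events, for illustration only.
sample = [
    {"meta": {"domain": "canary"}, "rev_id": 1},
    {"meta": {"domain": "en.wikipedia.org"}, "rev_id": 2},
]
print([event["rev_id"] for event in drop_canaries(sample)])
# -> [2]
```

Consumers reading the Hive table rather than the stream would apply the equivalent predicate on the `meta.domain` column.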

Event Timeline

CC @lbowmaker and @VirginiaPoundstone. We just found this out, and it is an eventual blocker for productionizing Dumps 2.0.

@xcollazo, @Ottomata - is a wikitech-l announcement sufficient (like we did for page_change)? We could work on drafting something like that.

> But for having T340880 work reliably, we now want these canary events to happen.

Do you need the canary events, or just the Hive partition to be present? IMHO, if reasonably feasible, this is something that should be addressed in airflow (be resilient to missing partitions), and not introduce behaviour changes upstream.

> Do you need the canary events, or just the Hive partition to be present?

I need the Hive partition.

> this is something that should be addressed in airflow (be resilient to missing partitions), and not introduce behaviour changes upstream.

Ah, but part of the reason canary events were invented was so that Hive consumers can reliably find and sense partitions. Most of our Airflow DAGs as of today depend on partitions being created. It's a common pattern.

> @xcollazo, @Ottomata - is a wikitech-l announcement sufficient (like we did for page_change)? We could work on drafting something like that.

I'll defer to folks that know downstream consumers better?

Here is another reason why we need the canary events enabled:

Right now, we are telling Airflow to only consume partitions under datacenter=eqiad:

hive_event_mediawiki_revision_visiblity_change:
  datastore: hive
  table_name: event.mediawiki_revision_visibility_change
  partitioning: "@hourly" 
  pre_partitions: ["datacenter=eqiad"]

We recently switched datacenters, and, because of current configuration, we now have to manually tell Airflow to start consuming datacenter=codfw instead.

But for the hour in which the switchover happened, both datacenters have some data to be consumed. Thus, for correctness, we should always ingest both, like so:

hive_event_mediawiki_revision_visiblity_change:
  datastore: hive
  table_name: event.mediawiki_revision_visibility_change
  partitioning: "@hourly" 
  pre_partitions: [["datacenter=eqiad", "datacenter=codfw"]]

But we cannot achieve this because we do not have canary events, and thus we would wait forever for the currently inactive datacenter. On event.mediawiki_page_content_change_v1 this is not an issue because that one does indeed have canary events enabled.
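To make the failure mode concrete, here is a small sketch of the check a partition sensor effectively performs when both datacenters are required. It is not the actual Airflow sensor code: `missing_partitions` is a hypothetical helper, and the hour used below is made up for illustration.

```python
def missing_partitions(
    datacenters: list[str], hour_suffix: str, existing: set[str]
) -> list[str]:
    """Return the required partitions that do not exist yet.

    A sensor that requires every datacenter can only succeed once
    this list is empty.
    """
    required = [f"datacenter={dc}/{hour_suffix}" for dc in datacenters]
    return [partition for partition in required if partition not in existing]


# Hypothetical hour after a switchover to codfw.
hour = "year=2023/month=10/day=1/hour=0"

# Without canary events, the inactive datacenter (eqiad) produces nothing,
# so its partition never appears and the sensor keeps waiting.
without_canaries = {f"datacenter=codfw/{hour}"}
print(missing_partitions(["eqiad", "codfw"], hour, without_canaries))
# -> ['datacenter=eqiad/year=2023/month=10/day=1/hour=0']

# With canary events, both datacenters get a partition every hour.
with_canaries = {f"datacenter=eqiad/{hour}", f"datacenter=codfw/{hour}"}
print(missing_partitions(["eqiad", "codfw"], hour, with_canaries))
# -> []
```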

Discussed this with @Milimetric briefly, and we speculate that the changes in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventStreamConfig/+/957949 will make canary events more reliable, which can help the use case described in this ticket.

Also, while we're talking canaries, if it's just as easy, enabling them for all EventBus-sourced streams is a good idea. Otherwise we have the problem Xabriel explains above. Here's an example job using the page move events.

> Also, while we're talking canaries, if it's just as easy, enabling them for all EventBus-sourced streams is a good idea. Otherwise we have the problem Xabriel explains above. Here's an example job using the page move events.

I think that was @Ottomata's intention with the parent ticket T266798.

> @xcollazo, @Ottomata - is a wikitech-l announcement sufficient (like we did for page_change)? We could work on drafting something like that.

Yes, I think that would be fine. If we are going to announce this, we should probably plan to do T266798: [Event Platform] Enable canary events for all MediaWiki streams as well. We don't have to do them right now, but we should indicate that we are going to enable canaries for all streams by some date.

cc @dcausse (does search use revision-visibility-change?)

Noting that this has now happened multiple times:

2023-09-17, 04:00:00
2023-09-18, 04:00:00
2023-09-23, 02:00:00
2023-09-27, 05:00:00

An up-to-date list of occurrences can be seen by SSHing to the analytics Airflow instance and navigating to http://localhost:8600/taskinstance/list/?_flt_0_task_id=wait_for_event_mediawiki_revision_visibility_change_partitions&_flt_0_state=failed

Ahoelzl renamed this task from "Enable canary events for mediawiki.revision-visibility-change" to "[Event Platform] Enable canary events for mediawiki.revision-visibility-change". Oct 23 2023, 8:21 PM

We're going to do all the streams at once, so I'll merge this into the parent.