Missing hourly partition for event.mediawiki_revision_recommendation_create
Closed, Resolved (Public)

Description

These events have canary events enabled, so there should be at least one event in every hourly partition. event.mediawiki_revision_recommendation_create is missing the partition for datacenter=eqiad/year=2021/month=5/day=14/hour=1.
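
A check for gaps like this one can be sketched as follows. This is a minimal illustration rather than the tooling actually used: the `partition_path` and `missing_hours` helpers and the `existing` listing are hypothetical, and only the partition format is taken from the path above.

```python
from datetime import datetime, timedelta


def partition_path(datacenter: str, hour: datetime) -> str:
    """Format an hourly partition the way it appears in the task description."""
    return (
        f"datacenter={datacenter}/year={hour.year}/month={hour.month}"
        f"/day={hour.day}/hour={hour.hour}"
    )


def missing_hours(existing, datacenter, start, end):
    """Return the expected hourly partitions in [start, end) absent from `existing`."""
    missing = []
    current = start
    while current < end:
        path = partition_path(datacenter, current)
        if path not in existing:
            missing.append(path)
        current += timedelta(hours=1)
    return missing


# Hypothetical partition listing: hours 0 and 2 exist, hour 1 does not.
existing = {
    "datacenter=eqiad/year=2021/month=5/day=14/hour=0",
    "datacenter=eqiad/year=2021/month=5/day=14/hour=2",
}
gaps = missing_hours(
    existing, "eqiad", datetime(2021, 5, 14, 0), datetime(2021, 5, 14, 3)
)
print(gaps)
```

With canary events enabled, any entry in `gaps` points at a problem with the canary producer or the refine job rather than a quiet stream.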

Event Timeline

Hmm, is this a new limitation? We first deployed this stream in February and no events came through for a few months, but we were getting hourly partitions anyway from the canary events. Git log suggests this is not new, so I'm not sure what changed.

Canary events are enabled for this stream. They are used to create the partitions, but are filtered out of the refined event database. So they will be in the stream and in the raw data, but not in the final event table.
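
Roughly, the behaviour described above looks like the sketch below. It is only an illustration under stated assumptions, not the actual refine job: the `refine` function is hypothetical, and it assumes canary events can be recognised by `meta.domain == "canary"` and that each event carries an ISO timestamp in `meta.dt`.

```python
from collections import defaultdict
from datetime import datetime


def is_canary(event: dict) -> bool:
    # Assumption: canary events are tagged with meta.domain == "canary".
    return event.get("meta", {}).get("domain") == "canary"


def refine(raw_events):
    """Group raw events into hourly partitions, dropping canary events from the rows."""
    partitions = defaultdict(list)
    for event in raw_events:
        dt = datetime.fromisoformat(event["meta"]["dt"])
        key = (dt.year, dt.month, dt.day, dt.hour)
        rows = partitions[key]  # looking up the key is what creates the partition
        if not is_canary(event):
            rows.append(event)
    return dict(partitions)


# Hypothetical raw data: hour 0 has a canary and a real event,
# hour 2 has only a canary, hour 1 has nothing at all.
raw = [
    {"meta": {"dt": "2021-05-14T00:10:00", "domain": "canary"}},
    {"meta": {"dt": "2021-05-14T00:20:00", "domain": "en.wikipedia.org"}},
    {"meta": {"dt": "2021-05-14T02:05:00", "domain": "canary"}},
]
refined = refine(raw)
```

Hour 2 ends up as an empty partition because its only raw event was a canary, while hour 1 gets no partition because there was no raw data to trigger one.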

There was an outage of the canary events producer last week, fixed by https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/691232/.

https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/691236 will help, but we need to do T270138 ("produce_canary_events job should not fail if a schema is missing examples") to keep that from happening in general.

A short outage (less than an hour) of canary events is fine, but yeah... if it is broken for more than an hour, a topic with no other data will not get any Hive event table partitions created for that period.

I'll try to bump the priority of T283084; this really shouldn't happen.

Change 692934 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery/source@master] ProduceCanaryEvents - produce events one at a time for better error handling

https://gerrit.wikimedia.org/r/692934

Change 692934 merged by Ottomata:

[analytics/refinery/source@master] ProduceCanaryEvents - produce events one at a time for better error handling

https://gerrit.wikimedia.org/r/692934

^ will be deployed next week, that should keep this from happening again.

Change 695340 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Bump refinery::job::canary_events to 0.1.12

https://gerrit.wikimedia.org/r/695340

Change 695340 merged by Ottomata:

[operations/puppet@production] Bump refinery::job::canary_events to 0.1.12

https://gerrit.wikimedia.org/r/695340

Change 695443 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery/source@master] Fix for ProduceCanaryEvents exit val

https://gerrit.wikimedia.org/r/695443

Change 695443 merged by jenkins-bot:

[analytics/refinery/source@master] Fix for ProduceCanaryEvents exit val

https://gerrit.wikimedia.org/r/695443

Change 695480 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Bump refinery::job::canary_events to 0.1.13

https://gerrit.wikimedia.org/r/695480

Change 695480 merged by Ottomata:

[operations/puppet@production] Bump refinery::job::canary_events to 0.1.13

https://gerrit.wikimedia.org/r/695480

Heya @Ottomata - Could you please provide a status summary on this (asked by @Gehel on IRC) - thanks :)

Ottomata added a project: Analytics-Kanban.

Oh, I think I forgot to update this because we never groomed it via Analytics tag.

I haven't done any work to backfill any partitions; there really were no events at all for that hour because ProduceCanaryEvents failed during that time.

ProduceCanaryEvents is now fixed so that a failure to produce for a single stream no longer prevents canary events from being produced for all the other streams, so hopefully this won't happen again (at least it shouldn't for that reason).
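
The idea behind producing events one at a time can be sketched like this. The real job lives in analytics/refinery/source; the Python below is just an illustration, and `produce_canary_events`, `send`, `fake_send` and the stream names are all hypothetical. Reporting a non-zero exit value on partial failure is also an assumption.

```python
def produce_canary_events(streams, send):
    """Produce one canary event per stream, isolating failures per stream.

    `send` is a hypothetical callable that posts a single event and raises on error.
    Returns (succeeded, failed) lists of stream names.
    """
    succeeded, failed = [], []
    for stream in streams:
        try:
            send(stream)
            succeeded.append(stream)
        except Exception as error:
            # Keep going so the remaining streams still get their canary events.
            print(f"canary event failed for {stream}: {error}")
            failed.append(stream)
    return succeeded, failed


def fake_send(stream):
    # Stand-in producer: simulate one stream whose schema has no examples.
    if stream == "stream.without_examples":
        raise ValueError("schema is missing examples")


ok, bad = produce_canary_events(
    ["stream.a", "stream.without_examples", "stream.b"], fake_send
)
exit_code = 1 if bad else 0  # assumption: surface partial failure to the scheduler
```

Sending everything in one batch would lose `stream.a` and `stream.b` along with the broken stream; handling each stream separately limits the damage to the one that actually failed.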