
mediawiki-event-enrichment - set deterministic meta.id
Open, Medium, Public

Description

The Event Platform meta.id field is used as a unique identifier for a specific event. This field is used for deduplication when ingesting into the Data Lake.

When serializing events, the wikimedia-event-utilities Java library will automatically set this field to a new random UUID if it is not already set.

In a streaming pipeline with an at-least-once guarantee, the same input event can be processed multiple times. Each pass produces an output event with a different meta.id, so the resulting 'duplicate' output events cannot be detected by comparing meta.id.

Instead, we could set the output event's meta.id to a deterministic uuid5 based on the input event's own meta.id.

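Something like:

import uuid
uuid_namespace = uuid.UUID('<static uuid4 here>')
# meta.id is a string field, so convert the UUID object before assigning it
event['meta']['id'] = str(uuid.uuid5(uuid_namespace, source_event['meta']['id']))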
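A slightly fuller sketch of the same idea, to show that the derived id is stable across reprocessing. The namespace value and the `deterministic_meta_id` helper name are placeholders invented for illustration, not anything that exists in mediawiki-event-enrichment today.

```python
import uuid

# Hypothetical namespace: any fixed UUID works, as long as it never changes
# once events have been produced with it.
UUID_NAMESPACE = uuid.UUID("2f1d3c6e-8a4b-4c7d-9e0f-1a2b3c4d5e6f")


def deterministic_meta_id(source_event: dict) -> str:
    """Derive the output event's meta.id from the input event's meta.id."""
    return str(uuid.uuid5(UUID_NAMESPACE, source_event["meta"]["id"]))


source_event = {"meta": {"id": "b0caf18d-6c7f-4403-947d-2712bbe28610"}}
other_event = {"meta": {"id": "7c9e6679-7425-40de-944b-e07fc1f90ae7"}}

first = deterministic_meta_id(source_event)
second = deterministic_meta_id(source_event)  # simulates reprocessing the same input
other = deterministic_meta_id(other_event)
```

Here `first` and `second` are identical, while `other` differs, so a downstream deduplication keyed on meta.id would collapse the reprocessed event but keep distinct ones.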

We could consider upstreaming this operation into the event-utilities Java library, but I'm not sure we want the serialization framework to always overwrite meta.id. We should probably keep control of the value of meta.id in the hands of the producer.

Done is

  • a reusable function, map, or process step in mediawiki-event-enrichment (or in eventutilities-python?) that sets meta.id to a deterministic id.
  • deterministic meta.id set for:
    • html content enrichment
    • html feature count enrichment
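One possible shape for the reusable step, as a rough sketch only. It is a plain Python function, so it does not assume any particular mediawiki-event-enrichment or eventutilities-python API. The function name, the base namespace value, and the idea of deriving one namespace per enrichment job (so that two jobs reading the same input event do not emit the same meta.id) are all assumptions for discussion.

```python
import uuid

# Hypothetical base namespace; a real one would be generated once and kept fixed.
BASE_NAMESPACE = uuid.UUID("2f1d3c6e-8a4b-4c7d-9e0f-1a2b3c4d5e6f")

# Assumption: one namespace per job, derived from a job label.
HTML_CONTENT_NS = uuid.uuid5(BASE_NAMESPACE, "html content enrichment")
HTML_FEATURE_COUNT_NS = uuid.uuid5(BASE_NAMESPACE, "html feature count enrichment")


def with_deterministic_meta_id(output_event: dict, source_event: dict,
                               namespace: uuid.UUID) -> dict:
    """Return a copy of output_event whose meta.id is derived from source_event's meta.id."""
    source_id = source_event.get("meta", {}).get("id")
    if not source_id:
        raise ValueError("source event has no meta.id to derive from")
    meta = dict(output_event.get("meta", {}))  # copy so the input is not mutated
    meta["id"] = str(uuid.uuid5(namespace, source_id))
    return {**output_event, "meta": meta}


source = {"meta": {"id": "b0caf18d-6c7f-4403-947d-2712bbe28610"}}
enriched = {"meta": {"stream": "example.output"}, "page_id": 1}

content_event = with_deterministic_meta_id(enriched, source, HTML_CONTENT_NS)
feature_event = with_deterministic_meta_id(enriched, source, HTML_FEATURE_COUNT_NS)
```

Because the function returns a new dict and takes the namespace as a parameter, it can be wrapped in whatever map or process step each job already uses.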

It would be nice to do this for all other enrichment jobs too, but let's discuss and consider before we alter them.


Event Timeline

Restricted Application added a subscriber: Aklapper.

The Event Platform meta.id field is used as a unique identifier for a specific event. This field is used for deduplication when ingesting into the Data Lake

We will still have duplicate events due to event producer resends, no? Because those will show a different meta.id?

We will still have duplicate events due to event producer resends, no?

Yes. This change would only prevent the undetected duplicates that arise when reprocessing a message causes a different meta.id to be set. For example, if a job restarts and has to resume from the latest checkpointed offset, it will reprocess a few messages.
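A toy simulation of that restart scenario, as an illustration rather than a model of the real pipeline. The `enrich`, `run_with_restart`, and `dedupe_by_meta_id` helpers and the namespace value are made up for this sketch. The "restart" is modelled simply as replaying the tail of the input.

```python
import uuid

# Hypothetical fixed namespace for the deterministic variant.
NAMESPACE = uuid.UUID("2f1d3c6e-8a4b-4c7d-9e0f-1a2b3c4d5e6f")


def enrich(source_event: dict, deterministic: bool) -> dict:
    """Produce an output event with either a derived or a random meta.id."""
    source_id = source_event["meta"]["id"]
    if deterministic:
        new_id = str(uuid.uuid5(NAMESPACE, source_id))
    else:
        new_id = str(uuid.uuid4())
    return {"meta": {"id": new_id}, "source_id": source_id}


def run_with_restart(source_events: list, deterministic: bool, replay_from: int) -> list:
    """Process every event, then replay events from replay_from as a restart would."""
    processed = list(source_events) + list(source_events[replay_from:])
    return [enrich(event, deterministic) for event in processed]


def dedupe_by_meta_id(events: list) -> list:
    """Keep the first event seen for each meta.id."""
    seen = {}
    for event in events:
        seen.setdefault(event["meta"]["id"], event)
    return list(seen.values())


source_events = [{"meta": {"id": f"source-{i}"}} for i in range(5)]

random_out = dedupe_by_meta_id(run_with_restart(source_events, False, replay_from=3))
stable_out = dedupe_by_meta_id(run_with_restart(source_events, True, replay_from=3))
```

With random ids, the two replayed events survive deduplication, so `random_out` holds seven events. With derived ids, `stable_out` collapses back to the five distinct inputs. Producer resends that arrive with a different input meta.id would still pass through in both cases.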

Ahoelzl triaged this task as Medium priority.