The Event Platform meta.id field is a unique identifier for a specific event. This field is used for deduplication when ingesting into the Data Lake.
When serializing events, the wikimedia-event-utilities Java library automatically sets this field to a new random UUID if it is not already set.
In a streaming pipeline with an at-least-once guarantee, the same input event can be processed multiple times. Each reprocessing produces a 'duplicate' output event with a different meta.id, so deduplication on meta.id will not catch it.
Instead, we could set the output event's meta.id to a deterministic uuid5 based on the input event's own meta.id.
Something like:
uuid_namespace = uuid.UUID(<static uuid4 here>)
event['meta']['id'] = str(uuid.uuid5(uuid_namespace, source_event['meta']['id']))
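A slightly fuller sketch of the same idea, to show that reprocessing the same input yields the same output meta.id. The namespace value, the function names (`deterministic_meta_id`, `enrich`) and the event shape are placeholders for illustration, not existing code in mediawiki-event-enrichment or eventutilities-python:

```python
import uuid

# Placeholder namespace for illustration only. A real job would pick its own
# static UUID once and never change it, since changing it changes every derived id.
ENRICHMENT_UUID_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def deterministic_meta_id(source_event: dict, namespace: uuid.UUID = ENRICHMENT_UUID_NAMESPACE) -> str:
    """Derive a stable meta.id for an output event from the input event's meta.id."""
    return str(uuid.uuid5(namespace, source_event["meta"]["id"]))


def enrich(source_event: dict) -> dict:
    """Hypothetical enrichment step that only sets the output event's meta.id."""
    return {"meta": {"id": deterministic_meta_id(source_event)}}


source_event = {"meta": {"id": "b0caf18d-6c7f-4403-947d-2712bbe0713e"}}

# Simulate at-least-once delivery: the same input event is processed twice.
first = enrich(source_event)
second = enrich(source_event)

# Deduplicating on meta.id now collapses the two outputs into one.
unique_ids = {event["meta"]["id"] for event in (first, second)}
```

Using a distinct namespace per enrichment job would keep the outputs of different jobs from sharing a meta.id when they consume the same input event; whether that is wanted here is an open design choice.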
We could consider upstreaming this operation into the event-utilities Java library, but I'm not sure we want the serialization framework to always overwrite meta.id. We should probably keep control of the value of meta.id in the hands of the producer.
Done is:
- reusable function or map or process step in mediawiki-event-enrichment (or in eventutilities-python?) to set meta.id to a deterministic id.
- deterministic meta.id set for:
  - html content enrichment
  - html feature count enrichment
It would be nice to do this for all other enrichment jobs too, but let's discuss and consider before we alter them.