eventutilities-python source and destination stream must be versioned
Closed, Resolved · Public · 3 Estimated Story Points

Description

Background/Goal

Source and destination streams must declare a schema version (latest should not be allowed)

Key Tasks/Dependencies

  • implement versioning for stream schemas
Acceptance criteria:
  • stream names contain a version according to the following scheme <stream>:<version>
  • dependent applications (e.g. mediawiki-event-enrichment) have been updated
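
A minimal sketch of how the `<stream>:<version>` scheme in the acceptance criteria could be parsed. The function name, the `StreamDescriptor` type, and the example stream name are hypothetical and are not taken from eventutilities-python.

```python
from typing import NamedTuple


class StreamDescriptor(NamedTuple):
    """Hypothetical container for a versioned stream reference."""
    name: str
    version: str


def parse_stream_descriptor(descriptor: str) -> StreamDescriptor:
    """Split '<stream>:<version>' into its two parts.

    Splits on the last ':' and rejects descriptors where either the
    stream name or the version is missing.
    """
    name, sep, version = descriptor.rpartition(":")
    if not sep or not name or not version:
        raise ValueError(
            f"expected '<stream>:<version>', got {descriptor!r}"
        )
    return StreamDescriptor(name, version)


# 'example.page_change' is an illustrative stream name only.
parsed = parse_stream_descriptor("example.page_change:1.0.0")
```

An unversioned name such as `example.page_change` raises `ValueError` here, which is one way to make the version mandatory at configuration time.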

Event Timeline

I think a user specifying latest for sources is okay. Or, perhaps a major version compatibility (although that would be annoying to handle). Maybe log a warning message when using latest for sources?

But, the code can look up the actual version of latest and use that in its eventutilities code when reading the schema.
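
A rough sketch of that idea, assuming a caller-supplied `lookup_latest` function that returns the concrete version currently behind latest. Both `resolve_version` and `lookup_latest` are hypothetical names, not existing eventutilities APIs.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def resolve_version(
    stream: str,
    version: str,
    lookup_latest: Callable[[str], str],
) -> str:
    """Return a concrete schema version for the stream.

    If the caller asked for 'latest', look up the actual version once,
    log a warning so the resolved value is on record, and use that
    concrete version from then on.
    """
    if version != "latest":
        return version
    resolved = lookup_latest(stream)
    logger.warning(
        "Stream %s requested 'latest'; resolved to schema version %s",
        stream,
        resolved,
    )
    return resolved
```

Resolving once and passing the concrete version along means logs and bookkeeping always show which schema was actually read.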

> I think a user specifying latest for sources is okay. Or, perhaps a major version compatibility (although that would be annoying to handle). Maybe log a warning message when using latest for sources?

I would not want source / destination to have different behaviour though. That can get confusing for end users and ops. Either they both allow latest or neither should IMHO.

> But, the code can look up the actual version of latest and use that in its eventutilities code when reading the schema.

+1.
I can see benefits in resolving the version for logging / bookkeeping. Are there other use cases I might be missing?

JArguello-WMF set the point value for this task to 3. Jan 25 2023, 3:25 PM

> I would not want source / destination to have different behaviour though. That can get confusing for end users and ops. Either they both allow latest or neither should IMHO.

Sinks must specify the version. Producer code (which is usually why someone is writing one of these pipelines) implicitly knows the schema that it will produce, as it is setting the fields in the dict. The enrichment pipeline is THE producer of this data.

Sources however are multi-use, and many consumers can use them. Consumers almost always want the latest schema, and as long as schemas are backwards compatible, their code won't care about the version. Fields added in a newer schema will not be used or referenced by older consumer code.

Perhaps: the stream_manager interface should require the version, so that production pipelines are very specific, but the flink.py EventDataStreamFactory stuff wouldn't mind either way? I mostly want the tooling to be more flexible. Making the production pipeline be strict sounds good.

> Sinks must specify the version. Producer code (which is usually why someone is writing one of these pipelines) implicitly knows the schema that it will produce, as it is setting the fields in the dict. The enrichment pipeline is THE producer of this data.
>
> Sources however are multi-use, and many consumers can use them. Consumers almost always want the latest schema, and as long as schemas are backwards compatible, their code won't care about the version. Fields added in a newer schema will not be used or referenced by older consumer code.

Makes sense. I updated the design doc with this comment. My concern was mostly around the stream_manager API, which you addressed below.

> Perhaps: the stream_manager interface should require the version, so that production pipelines are very specific, but the flink.py EventDataStreamFactory stuff wouldn't mind either way? I mostly want the tooling to be more flexible. Making the production pipeline be strict sounds good.

+1. I'm following this principle in the implementation.
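
For illustration only, the principle agreed above could be sketched as a single check: sinks always pin a version, while sources may use latest unless the caller is the strict production interface. `validate_version` and its `is_sink` / `strict` parameters are hypothetical and do not reflect the actual implementation.

```python
LATEST = "latest"


def validate_version(version: str, *, is_sink: bool, strict: bool) -> str:
    """Reject 'latest' where a concrete schema version is required.

    - Sinks must always name a concrete version.
    - Sources may use 'latest' in flexible tooling (strict=False),
      but not through a strict, production-facing interface.
    """
    if version == LATEST and (is_sink or strict):
        kind = "sink" if is_sink else "source"
        raise ValueError(
            f"{kind} stream must pin a concrete schema version, not 'latest'"
        )
    return version
```

Under this sketch, a stream_manager-style entry point would call the check with `strict=True` for both sources and sinks, while lower-level factory code would pass `strict=False` for sources.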

lbowmaker updated the task description.