
Decide how to handle avro schemas changes
Closed, DeclinedPublic

Description

Messages are produced with a particular Avro schema, but that schema is not encoded in the message itself. We need some way to manage changing a schema and having the different moving parts pick that change up.

Event Timeline

EBernhardson raised the priority of this task from to Needs Triage.
EBernhardson updated the task description.
EBernhardson added a project: CirrusSearch.
EBernhardson subscribed.
Restricted Application added a subscriber: Aklapper.

Here is one solution that has worked at LinkedIn:

https://issues.apache.org/jira/browse/AVRO-1124?focusedCommentId=13564387&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13564387

Our setup works as follows:

  1. We have a giant directory of version controlled schemas for the whole company. We have another directory called "includes" which includes any shared record type that is included in multiple schemas.
  2. We always fully expand referenced types in the schemas. So if you have a type Header defined in your record and it is not found in that same file, we look in the includes directory and try to get it from there.
  3. We don't use the idl since our setup predates that.
  4. We send each message with a schema id which is the checksum of the fully expanded schema.
  5. This means that any change to either the record or any includes effectively changes the version, but this is fine, since version changes are handled automatically when the md5 of the fully expanded schema changes. This approach has worked really well for us. The way to think about includes is just as a concise notation for the fully expanded schema.
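The fingerprinting idea in steps 4 and 5 can be sketched as follows. This is a minimal illustration, not LinkedIn's actual code: it canonicalizes a schema by re-serializing the parsed JSON with sorted keys, then takes the MD5 digest. A real deployment would canonicalize via Avro's parsing canonical form, but the property is the same: any change to the fully expanded schema changes the fingerprint.

```python
import hashlib
import json

def schema_fingerprint(schema_json: str) -> bytes:
    """Return a 16-byte MD5 fingerprint of a fully expanded Avro schema.

    Canonicalization here is just re-serializing the parsed JSON with
    sorted keys and no insignificant whitespace; whitespace and key
    ordering differences do not change the fingerprint, but any real
    change to the expanded schema does.
    """
    parsed = json.loads(schema_json)
    canonical = json.dumps(parsed, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).digest()

# Two textually different but logically identical schemas fingerprint the same.
a = '{"type": "record", "name": "Header", "fields": []}'
b = '{"name":"Header","fields":[],"type":"record"}'
assert schema_fingerprint(a) == schema_fingerprint(b)
assert len(schema_fingerprint(a)) == 16
```

Producers would send this 16-byte digest alongside each message; consumers look it up in a shared schema store to decode the payload.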

I think the above applies to our setup as follows:

  1. giant directory === meta.wikimedia.org/wiki/Schema:* ? We aren't doing anything complicated yet like sharing definitions between schemas; I think we can safely punt on that for now.
  2. punted
  3. not relevant
  4. I looked around a bit in Camus, and while it uses Avro by default I haven't figured out yet how it does schema resolution. Maybe we could use some sort of envelope when producing messages through Kafka. The envelope would be something like:
{
    "type": "record",
    "name": "LogEnvelope",
    "namespace": "org.wikimedia.mediawiki.logging",
    "fields": [
        { "name": "fingerprint", "type": { "type": "fixed", "name": "MD5", "size": 16 } },
        { "name": "payload", "type": "bytes" }
    ]
}
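As a rough illustration of how a producer and consumer might use such an envelope: the fingerprint occupies a fixed 16-byte prefix and the opaque payload follows. This sketch uses plain byte concatenation rather than real Avro binary encoding, and the function names are hypothetical.

```python
import hashlib
import json

def wrap(payload: bytes, schema_json: str) -> bytes:
    """Prepend the 16-byte MD5 fingerprint of the (fully expanded) schema
    to the serialized payload, mirroring the envelope record above:
    a fixed[16] fingerprint followed by the payload bytes."""
    canonical = json.dumps(json.loads(schema_json), sort_keys=True)
    fingerprint = hashlib.md5(canonical.encode("utf-8")).digest()
    return fingerprint + payload

def unwrap(envelope: bytes) -> tuple[bytes, bytes]:
    """Split an envelope back into (fingerprint, payload). A consumer
    would look the fingerprint up in a schema store to find the writer
    schema needed to decode the payload."""
    return envelope[:16], envelope[16:]

env = wrap(b"serialized-avro-record", '{"type": "record", "name": "X", "fields": []}')
fingerprint, payload = unwrap(env)
assert len(fingerprint) == 16
assert payload == b"serialized-avro-record"
```

The point of the fixed-size prefix is that a consumer can always peel off the fingerprint without knowing anything about the payload's schema.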

But I really need to figure out what the Camus->Hive pipeline looks like before deciding anything.

It looks like, at least in theory (untested), we just create external tables in Hive that point at the files Camus creates, with the correct Avro schema, and it will "just work".

Aklapper removed a project: Discovery-ARCHIVED.
Gehel subscribed.

Not relevant anymore.