We don't have any way to add an arbitrary flag like "We ran a language detection on this" to our search data in Hive. This will help us do that, which will be useful when we're analysing our data.
|operations/mediawiki-config : master||Use event-schemas repository for avro schemas|
|operations/mediawiki-config : master||Add 2 payloads map<string,string> fields to CirrusSearchRequestSet avro schema|
|mediawiki/extensions/CirrusSearch : master||Add 2 map<string,string> payloads to CirrusSearchRequestSet|
|analytics/refinery/source : master||Add CirrusSearchRequestSet avro schema to local schema repo|
|analytics/refinery : master||Create CirrusSearchRequestSet table|
|Resolved||dcausse||T117575 Setup oozie task for adding and removing CirrusSearchRequestSet partitions in hive|
|Resolved||dcausse||T118570 Add a map<string,string> field to CirrusSearchRequestSet|
|Resolved||None||T121483 Camus not reading in CirrusSearchRequestSet events with the new schema identifier and schema version|
In fact it works with Hive and old data, but it does not work for our chain: mediawiki -> kafka -> camus.
If we want to handle schema updates we will have to find a way to keep track of the schema used on mediawiki to produce the avro message.
Today we use AvroDecoder with one schema only.
But it looks like, if you want to handle schema updates, you need to use another constructor:
new GenericDatumReader(writerSchema, readerSchema);
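To illustrate what that two-schema read accomplishes, here is a minimal sketch in plain Java (no Avro dependency; the class, field names, and `resolve` helper are all hypothetical, just modelling the idea): fields present in the reader schema but absent from the written record get filled in with the reader schema's defaults.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SchemaResolutionSketch {
    // A "schema" here is just an ordered list of field names plus defaults.
    static Map<String, Object> resolve(Map<String, Object> writtenRecord,
                                       List<String> readerFields,
                                       Map<String, Object> readerDefaults) {
        Map<String, Object> out = new HashMap<>();
        for (String field : readerFields) {
            if (writtenRecord.containsKey(field)) {
                out.put(field, writtenRecord.get(field));  // value from the writer
            } else {
                out.put(field, readerDefaults.get(field)); // new field: use its default
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> oldRecord = new HashMap<>();
        oldRecord.put("query", "foo");                     // written with the old schema
        Map<String, Object> defaults = new HashMap<>();
        defaults.put("payload", null);                     // new optional field
        Map<String, Object> resolved =
            resolve(oldRecord, java.util.Arrays.asList("query", "payload"), defaults);
        System.out.println(resolved.get("query") + " " + resolved.get("payload"));
    }
}
```

The catch, as discussed below, is that this only works when the consumer knows which writer schema produced each message.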
So I'm not sure what to do here...
I can see that some projects use a SchemaRegistry service: https://github.com/confluentinc/schema-registry
We could maybe implement a hackish solution where the topic name includes the schema version.
Let's work on this together tomorrow. We should not need a schema registry. Just specifying new fields with union defaults should be sufficient:
Removal of fields is not supported for backwards compatibility.
I was also wondering why it works with Hive.
In fact Hive stores the schema with each record.
That is why it can read old records: the schema used to generate the avro binary is always available, so it can apply its conversion technique using both writerSchema and readerSchema.
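As a toy illustration of that "schema travels with the data" idea, here is a plain-Java sketch (hypothetical format, not the real Avro object container file layout) that stores the writer's schema JSON in front of each payload, so any future reader can recover exactly the schema that was used:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SchemaWithRecordSketch {
    // Pack: [4-byte schema length][schema JSON bytes][payload bytes]
    static byte[] pack(String schemaJson, byte[] payload) {
        byte[] schema = schemaJson.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + schema.length + payload.length)
                .putInt(schema.length)
                .put(schema)
                .put(payload)
                .array();
    }

    // Unpack: the reader recovers the exact writer schema, which is what
    // makes writerSchema/readerSchema resolution possible later.
    static String unpackSchema(byte[] packed) {
        ByteBuffer buf = ByteBuffer.wrap(packed);
        byte[] schema = new byte[buf.getInt()];
        buf.get(schema);
        return new String(schema, StandardCharsets.UTF_8);
    }
}
```

Storing the full schema per record is what Hive-side container files can afford; a raw Kafka message stream cannot, which is why the options below prefix an id or hash instead.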
Unless we find a way to work around this problem, here is a list of possible solutions:
- Do not support schema evolution:
- stop the camus cronjob for mediawiki
- deploy the new schema in mediawiki
- flush the topic
- deploy the schema to camus
- restart camus
- never update the schema again
- Support schema evolution:
- Add a schema rev_id in the message (kafka header?, an integer before the avro binary payload, topic name?)
- Adapt AvroMessageDecoders in refinery camus to support schema rev_id
- Keep track of all schema revisions on the consumer: write a custom SchemaRegistry that supports rev_id
- On any schema update, refinery-camus must be deployed first.
- Do not use avro binary but avro json
- Never tested
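The schema rev_id envelope from the second option could be sketched like this (plain Java; the 4-byte big-endian prefix is one possible choice of encoding, not an agreed format):

```java
import java.nio.ByteBuffer;

public class RevIdEnvelopeSketch {
    // Producer side: prepend the schema revision id to the avro binary payload.
    static byte[] wrap(int revId, byte[] avroPayload) {
        return ByteBuffer.allocate(4 + avroPayload.length)
                .putInt(revId)
                .put(avroPayload)
                .array();
    }

    // Consumer side: read the rev_id, look up the matching writer schema,
    // then hand the remaining bytes to the avro decoder.
    static int revId(byte[] message) {
        return ByteBuffer.wrap(message).getInt();
    }

    static byte[] payload(byte[] message) {
        byte[] out = new byte[message.length - 4];
        System.arraycopy(message, 4, out, 0, out.length);
        return out;
    }
}
```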
- Never updating the schema seems like a giant pain; I think we almost certainly want the ability to add fields in the future as we realize different things we need. This can be somewhat worked around using the payload fields from this patch, though.
- A packed integer preceding the message would be relatively easy to implement on the PHP side; not sure how hard that would be on the Java (camus) end. Rather than a single packed integer, we could pack a sha (or md5, whatever) hash of the schema used to the beginning, which is more resilient. I know when I talked to ottomata about this previously he was dubious of using a special envelope format that any potential kafka consumer would have to understand. The consumer already requires special knowledge about the topic => schema mapping, though, so perhaps it's not the end of the world.
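The hash variant amounts to a consumer-side map from schema hash to schema, i.e. a poor man's registry. A plain-Java sketch (the SHA-1 digest and map-based "registry" are illustrative choices, not a decided design):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class HashEnvelopeSketch {
    // Consumer-side "registry": schema hash -> schema JSON, built from all known revisions.
    static final Map<String, String> REGISTRY = new HashMap<>();

    static String hash(String schemaJson) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(schemaJson.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-1 is always available on the JVM
        }
    }

    static void register(String schemaJson) {
        REGISTRY.put(hash(schemaJson), schemaJson);
    }

    // The producer prefixes each payload with hash(schema); the consumer strips
    // the prefix and looks the writer schema up here before decoding.
    static String lookup(String schemaHash) {
        return REGISTRY.get(schemaHash);
    }
}
```

The resilience point above is that a hash self-identifies the schema content, so producer and consumer never have to agree on revision numbering.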
- Producing avro json from PHP is possible, but currently unsupported: the avro PHP library only has the ability to generate binary. I think with some reading of the documentation we could work up whatever transformation is needed to generate it, though.
Worked on this with @dcausse and he is correct: union types, when not present, are "represented" with an "empty" byte on the binary payload. Thus, if we evolve the schema such that we add a new field, it will not be able to validate old records, as they will not include this "null" byte. Will try a bit more to work around this problem but I am not very hopeful.
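To make that "empty byte" concrete: Avro encodes a union as the zig-zag varint index of the chosen branch, followed by the value, and a null value itself contributes zero bytes. A plain-Java sketch of that varint, matching Avro's long encoding as I understand it:

```java
public class UnionIndexSketch {
    // Zig-zag + varint encoding of a long, as used by Avro's binary format:
    // zig-zag maps signed to unsigned, then 7 bits per byte, LSB first,
    // high bit set on all but the last byte.
    static byte[] encodeLong(long n) {
        long z = (n << 1) ^ (n >> 63);          // zig-zag
        byte[] buf = new byte[10];
        int len = 0;
        while ((z & ~0x7FL) != 0) {
            buf[len++] = (byte) ((z & 0x7F) | 0x80);
            z >>>= 7;
        }
        buf[len++] = (byte) z;
        byte[] out = new byte[len];
        System.arraycopy(buf, 0, out, 0, len);
        return out;
    }

    public static void main(String[] args) {
        // A null in a ["null","string"] union is branch 0 plus no value bytes:
        byte[] nullBranch = encodeLong(0);      // -> the single byte 0x00
        byte[] stringBranch = encodeLong(1);    // -> the single byte 0x02
        System.out.printf("%02x %02x%n", nullBranch[0], stringBranch[0]);
    }
}
```

So an old record simply lacks the branch-index byte that the new schema's extra union field expects, which is exactly why decoding old records against the new schema fails.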
- Agreed with @EBernhardson, doesn't seem doable
- This is not possible without a schema registry; with one, no changes are needed to the decoders, as Java supports id + avro payload out of the box. But, again, supporting schema evolution requires a registry rather than (as we had planned) always validating with the latest schema.
3. This seems the best of the three options: if schema evolution includes only the addition of fields, we will not have any issues validating always with the latest schema.
This looks mostly ready. The analytics end of the pipeline has been updated. The necessary code for mediawiki will roll out with the train on December 10th. I will deploy the config changes necessary for mediawiki to start producing the new schema version in the 4pm EST/12pm UTC SWAT window after the train deployment.