
Add a map<string,string> field to CirrusSearchRequestSet
Closed, Resolved · Public

Description

We don't have any way to add an arbitrary flag like "we ran language detection on this" to our search data in Hive. This field will help us do that, which will be useful when we're analysing our data.


Event Timeline

EBernhardson raised the priority of this task to Needs Triage.
EBernhardson updated the task description.
EBernhardson added a subscriber: EBernhardson.
Restricted Application added a project: Discovery. · Nov 13 2015, 4:31 PM
Restricted Application added subscribers: StudiesWorld, Aklapper.
dcausse claimed this task. · Nov 13 2015, 4:39 PM
dcausse set Security to None.

Change 252956 had a related patch set uploaded (by DCausse):
Add 2 payloads map<string,string> fields to CirrusSearchRequestSet avro schema

https://gerrit.wikimedia.org/r/252956

Change 252957 had a related patch set uploaded (by DCausse):
Add 2 payloads map<string,string> fields to CirrusSearchRequestSet avro schema

https://gerrit.wikimedia.org/r/252957

Change 252958 had a related patch set uploaded (by DCausse):
Add 2 payloads map<string,string> fields to CirrusSearchRequestSet avro schema

https://gerrit.wikimedia.org/r/252958

Nuria added a subscriber: Nuria. · Nov 16 2015, 7:51 PM

Did you try to validate your data with avro-tools (in Java), to rule out that data produced by PHP might not work with the Java schema bindings?

In fact it works with Hive and old data, but it does not work for our mediawiki -> kafka -> camus chain.

If we want to handle schema updates, we will have to find a way to keep track of the schema that mediawiki used to produce the Avro message.

Today we use AvroDecoder with one schema only.
But it looks like, if you want to handle schema updates, you need to use another constructor:
new GenericDatumReader(writerSchema, readerSchema);
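
For reference, a minimal sketch of what that two-schema read looks like (writerSchema, readerSchema and messageBytes are placeholders):

    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.DecoderFactory;

    // writerSchema describes the bytes as the producer wrote them,
    // readerSchema is the (possibly newer) schema the consumer wants.
    GenericDatumReader<GenericRecord> reader =
        new GenericDatumReader<GenericRecord>(writerSchema, readerSchema);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(messageBytes, null);
    GenericRecord record = reader.read(null, decoder);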

So I'm not sure what to do here...
I can see that some projects use a SchemaRegistry service: https://github.com/confluentinc/schema-registry

We could maybe implement a hackish solution where the topic name includes the schema version:
mediawiki.CirrusSearchRequestSet_1

Nuria added a comment. · Nov 16 2015, 9:53 PM

Let's work on this together tomorrow. We should not need a schema registry; just specifying new fields with union defaults should be sufficient:

See:

https://gerrit.wikimedia.org/r/#/c/253393/2/refinery-camus/src/main/avro/CirrusSearchRequestSet3.avsc

Note that removal of fields is not supported, for backwards-compatibility reasons.

dcausse added a comment. Edited · Nov 17 2015, 10:43 AM

My test was misleading.
I've uploaded a new one where I write a binary file with an old schema and then try to read it with the new schema and the union type.
It fails with EOFException :(
(see https://issues.apache.org/jira/browse/AVRO-1661 and http://s.apache.org/2IM)
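
For reference, a sketch of that test (schemas inlined as JSON strings; the record and field names are illustrative, not the real CirrusSearchRequestSet schema):

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    public class EvolutionTest {
        public static void main(String[] args) throws Exception {
            Schema oldSchema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":"
                + "[{\"name\":\"query\",\"type\":\"string\"}]}");
            // The new schema adds an optional field as a union with a null
            // default, as in the patch linked above.
            Schema newSchema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":"
                + "[{\"name\":\"query\",\"type\":\"string\"},"
                + "{\"name\":\"payload\",\"type\":[\"null\",\"string\"],"
                + "\"default\":null}]}");

            // Write one record with the old schema.
            GenericRecord rec = new GenericData.Record(oldSchema);
            rec.put("query", "some search");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(oldSchema).write(rec, enc);
            enc.flush();

            // Read it back with only the new schema, the way a single-schema
            // decoder does: the reader expects a union index byte for
            // "payload" that was never written, so it runs off the end of
            // the buffer.
            BinaryDecoder dec =
                DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
            new GenericDatumReader<GenericRecord>(newSchema)
                .read(null, dec); // throws EOFException

            // With both schemas, Avro resolves the difference and fills in
            // the default instead:
            // new GenericDatumReader<GenericRecord>(oldSchema, newSchema)
        }
    }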

I was also wondering why it works with Hive.
In fact Hive stores the schema with each record:
http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop.hive/hive-serde/0.7.1-cdh3u5/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java

That is why it can read old records: the schema used to generate the Avro binary is always available, so Hive can apply Avro's resolution technique with writerSchema and readerSchema.

Unless we find a way to work around this problem, here is a list of possible solutions:

  1. Do not support schema evolution:
    • stop the camus cronjob for mediawiki
    • deploy the new schema in mediawiki
    • flush the topic
    • deploy the schema to camus
    • restart camus
    • never update the schema again
  2. Support schema evolution (see the sketch after this list):
    • Add a schema rev_id to the message (kafka header? an integer before the avro binary payload? topic name?)
    • Adapt the AvroMessageDecoders in refinery-camus to support the schema rev_id
    • Keep track of all schema revisions on the consumer: write a custom SchemaRegistry that supports rev_id
    • On any schema update, refinery-camus must be deployed first.
  3. Do not use avro binary but avro json
    • Never tested
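
To make option 2 concrete, here is a rough sketch of the consumer side, assuming a 4-byte rev_id prefix before the Avro binary and a hand-rolled registry (the class and its names are hypothetical):

    import java.nio.ByteBuffer;
    import java.util.Map;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.DecoderFactory;

    // Hypothetical decoder for messages framed as <int rev_id><avro binary>.
    class RevisionAwareDecoder {
        private final Map<Integer, Schema> schemaByRevId; // custom "registry"
        private final Schema latestSchema;                // reader schema

        RevisionAwareDecoder(Map<Integer, Schema> schemaByRevId,
                             Schema latestSchema) {
            this.schemaByRevId = schemaByRevId;
            this.latestSchema = latestSchema;
        }

        GenericRecord decode(byte[] message) throws Exception {
            ByteBuffer buf = ByteBuffer.wrap(message);
            // The writer schema is whichever revision mediawiki produced with.
            Schema writerSchema = schemaByRevId.get(buf.getInt());
            BinaryDecoder dec = DecoderFactory.get().binaryDecoder(
                message, buf.position(), buf.remaining(), null);
            // Resolve the producer's schema against the latest one.
            return new GenericDatumReader<GenericRecord>(writerSchema, latestSchema)
                .read(null, dec);
        }
    }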
EBernhardson added a comment. Edited · Nov 17 2015, 6:17 PM
  1. Never updating the schema seems like a giant pain; I think we almost certainly want the ability to add fields in the future as we realize different things we need. This can be somewhat worked around using the payload fields from this patch, though.
  2. A packed integer preceding the message would be relatively easy to implement on the PHP side; not sure how hard that would be on the Java (camus) end. Rather than a single packed integer we could pack a sha (or md5, whatever) hash of the schema used to the beginning, which is more resilient. I know when I talked to ottomata about this previously he was dubious of using a special envelope format that any potential kafka consumer would have to understand. The consumer already requires special knowledge about the topic => schema mapping, so perhaps it's not the end of the world.
  3. Producing avro json from PHP is possible, but currently unsupported. The avro php library only has the ability to generate binary. I think with some reading of the documentation we could work up whatever transformation is needed to generate it, though (see the sketch below for what consuming it would look like).
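
For what it's worth, consuming avro json on the Java side looks simple; a minimal sketch, assuming the producer emits Avro's JSON encoding (schema and jsonString are placeholders):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.JsonDecoder;

    // jsonString holds one record in Avro's JSON encoding; schema is the
    // schema it was written with.
    JsonDecoder dec = DecoderFactory.get().jsonDecoder(schema, jsonString);
    GenericRecord record =
        new GenericDatumReader<GenericRecord>(schema).read(null, dec);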

Worked on this with @dcausse and he is correct: union types, when not present, are "represented" with an "empty" byte in the binary payload. Thus, if we evolve the schema such that we add a new field, it will not be able to validate old records, as they will not include this "null" byte. Will try a bit more to work around this problem but I am not very hopeful.

  1. Agreed with @EBernhardson, doesn't seem doable.
  2. This is not possible without a schema registry; with one, no changes are needed to the decoders, as Java supports id + avro payload out of the box. But, again, supporting schema evolution would require a registry rather than (as we had planned) always validating with the latest schema.
  3. This seems the best of the three options: if schema evolution only ever adds fields, we will not have any issues validating always with the latest schema.

Deskana updated the task description. · Nov 19 2015, 5:52 PM

Change 255101 had a related patch set uploaded (by DCausse):
Add 2 map<string,string> payloads to CirrusSearchRequestSet

https://gerrit.wikimedia.org/r/255101

Change 252956 merged by Nuria:
Create CirrusSearchRequestSet table

https://gerrit.wikimedia.org/r/252956

Change 252958 merged by Nuria:
Add CirrusSearchRequestSet avro schema to local schema repo

https://gerrit.wikimedia.org/r/252958

Change 255101 merged by jenkins-bot:
Add 2 map<string,string> payloads to CirrusSearchRequestSet

https://gerrit.wikimedia.org/r/255101

Change 252957 abandoned by DCausse:
Add 2 payloads map<string,string> fields to CirrusSearchRequestSet avro schema

Reason:
agreed

https://gerrit.wikimedia.org/r/252957

Change 255135 had a related patch set uploaded (by EBernhardson):
Use event-schemas repository for avro schemas

https://gerrit.wikimedia.org/r/255135

This looks mostly ready. The analytics end of the pipeline has been updated. The necessary code for mediawiki will roll out with the train on December 10th. I will deploy the config changes necessary for mediawiki to start producing the new schema version in the 4pm EST/12pm UTC SWAT window after the train deployment.

Change 255135 merged by jenkins-bot:
Use event-schemas repository for avro schemas

https://gerrit.wikimedia.org/r/255135

Seems to work: Hive can query the latest partition, created on 2015/12/15 at 17h.

Deskana closed this task as Resolved. · Dec 17 2015, 5:45 AM
Deskana triaged this task as Medium priority.
Deskana added a subscriber: Deskana.