Somehow we need to take the list of query+page combinations we want to train the model with, and collect all the relevant features. It should be relatively easy to produce all this data from spark into a kafka channel. There will be some additional complexity in finding out when exactly the feature generation is complete so we can collect all the data.
- Prior to generating any message record the end position of all partitions in the channel we expect to receive results back on. This is necessary to ensure we read all responses back.
- Generate one message per query to elasticsearch. This message should basically be a pre-formed query ready to send to elasticsearch.
- Message should include which server to send the query to, to make it easier to transition from relforge to a production cluster later, and to run tests against relforge
- Messages should include a token identifying the feature generation 'run'. This will allow multiple feature generation runs to possibly happen in parallel without stepping on each other toes
- After all possible queries have been produced generate an 'end run' sigil and send it to all partitions of the topic we were producing to
- Start consuming from the offsets recorded at the beginning. Suck up all generated features, filtering for our run id, and keep consuming until the end run sigil is found
- May be some difficulty doing this with spark, specifically that there may be quite a lot of data coming back and we don't necessarily want to push the entire kafka log into memory. Ideally we want to read from kafka and stream the results we care about to disk. Later we want to load these results into spark for processing with perhaps many more partitions (couple hundred?) than we have partitions on the kafka topic.
- mini-batches in spark streaming may be capable of doing what we want here, running a mini-batch every few minutes per topic and writing out a file to hdfs.
- There needs to be some sort of handling for error conditions, such as if the consumer at the other end somehow disapeared and stops sending back features, or never sends the end run sigil back to us.