
Kafka consumer to take learn to rank queries from a queue and run them against elasticsearch to generate relevance labels.
Closed, Resolved | Public

Description

Open questions:

  • Do we really want to use the relforge servers for this? We could instead point the script at the hot spare Elasticsearch cluster. Initially we should probably use relforge, but long term we may want to switch to the hot spare cluster so that labels are generated from the most up-to-date information.
  • How does the data get back into Kafka? Through a log4j handler, or should the consumer parse the Elasticsearch response and produce to Kafka directly?
    • Using log4j makes it harder to run against a production server, because changes to the plugin or to the log4j settings require a full cluster restart.

Deliverable:

  • The consumer reads Elasticsearch queries from Kafka (analytics cluster) and sends them to Elasticsearch.
  • The results of the queries are produced back into a different Kafka log. Ideally these should be parsed down to a minimal representation of query + result page + detected feature values.
  • The consumer should be generic enough that changes to the LTR pipeline can be tested by changing only the code in the analytics cluster, while the code running in production stays the same.
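
A minimal sketch of the loop described in the deliverable, not the code that was merged for this task. It assumes the "parse the response in the consumer" option from the open questions. The message layout (`query_id`, `query`), the per-hit `features` mapping, and the function names are hypothetical; `messages`, `search` and `produce` stand in for a Kafka consumer, an Elasticsearch search call and a Kafka producer, so the sketch stays independent of any client library.

```python
import json
from typing import Any, Callable, Dict, Iterable


def minimize_response(query_id: str, response: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a search response to query + result pages + feature values.

    Assumption: each hit carries its logged feature values under
    hit["features"] as a {name: value} mapping (hypothetical layout).
    """
    hits = response.get("hits", {}).get("hits", [])
    return {
        "query_id": query_id,
        "hits": [
            {"page_id": hit["_id"], "features": hit.get("features", {})}
            for hit in hits
        ],
    }


def run_consumer(
    messages: Iterable[bytes],
    search: Callable[[Dict[str, Any]], Dict[str, Any]],
    produce: Callable[[bytes], None],
) -> int:
    """Read queries, run each one, and emit a minimal result record.

    The consumer does not interpret the query body: it forwards whatever
    the analytics side put on the queue, so pipeline changes do not
    require changes here.
    """
    handled = 0
    for raw in messages:
        request = json.loads(raw)
        response = search(request["query"])
        record = minimize_response(request["query_id"], response)
        produce(json.dumps(record, sort_keys=True).encode("utf-8"))
        handled += 1
    return handled
```

Because the query body is passed through untouched, the analytics side can change which features it requests without redeploying the consumer, which is the property the last deliverable asks for.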

Event Timeline

debt moved this task from needs triage to Up Next on the Discovery-Search board. May 11 2017, 5:12 PM

Change 361010 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[search/MjoLniR@master] Add feature collection over kafka

https://gerrit.wikimedia.org/r/361010

Change 361010 merged by DCausse:
[search/MjoLniR@master] Add feature collection over kafka

https://gerrit.wikimedia.org/r/361010

debt closed this task as Resolved. Jul 7 2017, 9:05 PM