Comparison Tool for Relevance Lab
Closed, Resolved (Public)

Description

Depends on T116869, T116870, and T116871, but can start before any of them is finished by using current output or stubs for the various components under development.

This is the top level Relevance Lab command line interface. As such, it's pretty complicated. It may make sense to specify a configuration file rather than try to cram all this into command line arguments, at the developer's discretion.

Inputs:

  • Comparison: specify a name, and two query runs:
    • First query run: specify a name for the run, a query file, and optionally a runSearchConf.json file
    • Second query run: specify a name for the run, a query file, and optionally a runSearchConf.json file
    • (Should the relevancelab/ directory be in a canonical place, or specified by config file or command line? I suggest config file, but leave it to the developer's discretion)
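
One way to picture these inputs is as a small data model. The following is a minimal sketch only; the class and field names (QueryRun, Comparison, search_conf, and so on) are hypothetical and do not come from the task or from any existing code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryRun:
    """One query run: a name, a query file, and an optional config file."""
    name: str
    query_file: str
    # Path to a runSearchConf.json file; None when the run uses defaults.
    search_conf: Optional[str] = None


@dataclass
class Comparison:
    """A named comparison between exactly two query runs."""
    name: str
    first: QueryRun
    second: QueryRun


# Example values are illustrative only.
comparison = Comparison(
    name="Baseline vs. experiment",
    first=QueryRun(name="Baseline", query_file="baseline.q"),
    second=QueryRun(name="Experiment", query_file="experiment.q",
                    search_conf="experiment.json"),
)
```

Keeping the optional config as None rather than an empty string makes it easy to tell "no config given" apart from a bad path.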

Processing/Output:

  • make a reasonable directory name for the comparison and query runs based on their names (e.g., remove all non-alphanumerics)
  • set up an SSH tunnel between your local vagrant instance and the hypothesis testing cluster in labs
    • (Is the cluster host known in advance, or is it specified by command line argument or config file?)
  • run runSearch.php with the first query file and the first runSearchConf.json
    • copy the runSearchConf.json to relevancelab/queries/<query1_dir>
    • store the output of runSearch.php (one line of JSON for each query) in relevancelab/queries/<query1_dir>/results
  • run runSearch.php with the second query file and the second runSearchConf.json
    • copy the runSearchConf.json to relevancelab/queries/<query2_dir>
    • store the output of runSearch.php (one line of JSON for each query) in relevancelab/queries/<query2_dir>/results
  • run the diff tool (see T116870), with
    • the two files relevancelab/queries/<query1_dir>/results and relevancelab/queries/<query2_dir>/results
    • output directory relevancelab/comparisons/<comparison_dir>/diffs
  • run the metrics/report tool (see T116871), with
    • name "comparison name"
    • the two files relevancelab/queries/<query1_dir>/results and relevancelab/queries/<query2_dir>/results
    • output directory relevancelab/comparisons/<comparison_dir>
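
The directory naming and layout described above could be sketched roughly as follows. This is an illustration under assumptions, not the eventual implementation: the helper names dir_name and layout are hypothetical, and "remove all non-alphanumerics" is read here as keeping only ASCII letters and digits.

```python
import re
from pathlib import PurePosixPath


def dir_name(name: str) -> str:
    """Drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^A-Za-z0-9]", "", name)


def layout(base: str, comparison: str, run1: str, run2: str) -> dict:
    """Return the paths named in the task for one comparison."""
    root = PurePosixPath(base)
    query1_dir = root / "queries" / dir_name(run1)
    query2_dir = root / "queries" / dir_name(run2)
    comparison_dir = root / "comparisons" / dir_name(comparison)
    return {
        "results1": query1_dir / "results",   # runSearch.php output, run 1
        "results2": query2_dir / "results",   # runSearch.php output, run 2
        "diffs": comparison_dir / "diffs",    # diff tool output directory
        "report": comparison_dir,             # metrics/report output directory
    }


paths = layout("relevancelab", "A vs. B", "Test 1", "Test 2")
```

Note that stripping characters can make two different names collapse into the same directory name (for example "Test 1" and "Test-1"), so a real tool would probably want to detect that case.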

Event Timeline

TJones raised the priority of this task to High.
TJones updated the task description.
TJones added subscribers: TJones, dcausse, EBernhardson, Smalyshev.

@Smalyshev, this would probably be good to prioritize over our Hadoop work. We have a tentative goal of a code freeze in December, so we need to focus on the language goals this month and shift the Hadoop work into December. Sorry to keep switching things up on you!

Proposal: the runner gets as input a file like this:

[settings]
labHost = suggesty.eqiad.wmflabs
searchCommand = sudo -u vagrant mwscript extensions/CirrusSearch/maintenance/runSearch.php 
workDir = /tmp/relevance/
jsonDiffTool = jsondiff.py
metricTool = metric

[test1]
name = Test 1
queries = test1.q

[test2]
name = Test 2
queries = test2.q
config = test2.json
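
A file in this format can be read with Python's standard configparser module. The sketch below is only an illustration of that idea and is not the code from the linked patch sets; the function name load_runs and the returned dictionary shape are assumptions.

```python
import configparser

# The proposed file, inlined so the example is self-contained.
SAMPLE = """\
[settings]
labHost = suggesty.eqiad.wmflabs
searchCommand = sudo -u vagrant mwscript extensions/CirrusSearch/maintenance/runSearch.php
workDir = /tmp/relevance/
jsonDiffTool = jsondiff.py
metricTool = metric

[test1]
name = Test 1
queries = test1.q

[test2]
name = Test 2
queries = test2.q
config = test2.json
"""


def load_runs(text: str):
    """Parse the runner config into global settings and two run descriptions."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # configparser lowercases option names by default, so the key
    # "labHost" appears as "labhost" in this dictionary.
    settings = dict(parser["settings"])
    runs = []
    for section in ("test1", "test2"):
        sec = parser[section]
        runs.append({
            "name": sec["name"],
            "queries": sec["queries"],
            # "config" is optional; .get() returns None when it is absent.
            "config": sec.get("config"),
        })
    return settings, runs


settings, runs = load_runs(SAMPLE)
```

If the mixed-case keys need to be preserved, setting parser.optionxform = str before reading turns off the lowercasing.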

Change 250909 had a related patch set uploaded (by Smalyshev):
Initial version of relevance lab runner

https://gerrit.wikimedia.org/r/250909

I like the .ini config file! Otherwise the command line is untenably complex.

Change 251177 had a related patch set uploaded (by Smalyshev):
Relevancy runner in python

https://gerrit.wikimedia.org/r/251177

Change 251177 merged by Smalyshev:
Relevancy runner in python

https://gerrit.wikimedia.org/r/251177

Change 250909 abandoned by Smalyshev:
Initial version of relevance lab runner

Reason:
moved to python

https://gerrit.wikimedia.org/r/250909