Create UDFs to replicate the Python script used for parsing the Cirrus logs
Closed, ResolvedPublic3 Story Points

Ironholds created this task.Nov 5 2015, 9:21 PM
Ironholds updated the task description. (Show Details)
Ironholds raised the priority of this task from to Needs Triage.
Ironholds added a project: Discovery.
Ironholds added a subscriber: Ironholds.
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald TranscriptNov 5 2015, 9:21 PM
Ironholds moved this task from Needs triage to Analysis on the Discovery board.Nov 5 2015, 9:21 PM
Ironholds set Security to None.
Ironholds removed a subscriber: StudiesWorld.
Deskana triaged this task as Normal priority.Nov 10 2015, 9:11 PM
Deskana added a subscriber: Deskana.

@EBernhardson here's the ticket we have. Basically we just need UDFs to take one of the big complex types and:

  1. Work out if the last one was full-text
  2. Work out if the last one was prefix-search
  3. Work out if the last one returned zero results or not.

Want me to put it in the Cirrus sprint so you can keep track of it? Thanks for your help with this!

These shouldn't really need UDF's, but if you feel its necessary i can work them up. They can be expressed in hive directly. There is always guaranteed to be at least one element in requests, so these should be safe. For the zero hits, i think we should instead migrate to using a sum of the hits by all requests in the request, since we display multiples. If this isn't accurate enough we should probably add an explicit value for the number of results shown.

  1. requests[size(requests)-1].querytype = 'full_text'
  2. requests[size(requests)-1].querytype = 'prefix'
  3. array_sum( requests.hitstotal, -1 ) > 0 (this requires the ArraySumUDF i wrote that's in code review)

i'm also wondering if perhaps we should more explicitly track the query type, for example i noticed a number of queries to zhwiki that do a number of near_match searches, then finish off with a single full_text search. I'm not even sure where these come from yet, but they should probably be classed as near match instead of full_text. Oddities abound...

So, we'd classify things as a success if at one point or another we showed /some/ results, even if their final outcome was nil?

EBernhardson added a comment.EditedNov 25 2015, 5:18 PM

I think there is a misunderstanding of what this requests field is. This is the list of requests made between mediawiki and elasticsearch within a single php execution. The result of all of these internal requests is bunched together and is the final result. This is not showing results at some point or another, all of these internal requests to elasticsearch are returned to the user at the same moment in time.

EBernhardson added a comment.EditedNov 25 2015, 5:22 PM

To take an example, first does a prefix search, and then it does a full text search. These are reported independantly within the requests array. The result of the prefix search (title match) is added to the page here:
The result of the full text search is added to the page here:

These results are all in the same page. While this particular example would make it rather unlikely that they get a prefix match without also getting full text results, the concept is the same in other portions of the code base.

But I thought queries were being stored with the executorID unique to each row? IOW, the outcome for the "last" set of queries in a row is the one we're interested in.

EBernhardson added a comment.EditedNov 25 2015, 5:32 PM

the executorID represents a single php execution. The entirety of the requests array in a CirrusSearchRequestSet row have the same executorID, they all occured within a single execution of php. For reference php starts a new execution context on every web request (also known as a shared nothing architecture). The last query made within a php execution is arbitrary and not necessarily the most important one.

I should add this is exactly what is being recorded in CirrusSearchUserTesting logs, the top level hits parameter (4th field in tsv) is a sum of the hits for all the individual requests (which are represented as arrays within json encoded values in the 3rd and 9th fields of the tsv)

And does this single PHP execution represent a single user interaction? Because otherwise you've just told me that the logs we've been using for A/B testing for an entire quarter do not represent what we have been basing our analysis on.

And if it represents multiple user interactions (say: going to the search box, typing away, getting some hits on prefix, typing nonsense, hitting enter and getting fulltext) is there any way we can _extract_ the "last" search?

EBernhardson added a comment.EditedDec 10 2015, 3:29 PM

a single PHP execution is a single web request. (but also a single time boxed run of one job type, or a single maintenance script). For an example: If you click search on zhwiki, and we convert it into varients and perform 10 queries in the backend and return 1 combined result set, that will be 10 entries in the requests field of a single CirrusSearchRequestSet row. Same for the User Testing logs. Each request in these logs refers to a request between mediawiki and elasticsearch. The whole reason we created the UserTesting and CirrusSearchRequestSet logs was so we had one log row for one user interaction.

Aha, gotcha! Awesome; totally makes sense.

(I will try to get to this task and add more detail/thunk on what we are doing with it when I've sent this host of emails)

Ironholds added a comment.EditedDec 10 2015, 4:29 PM

Okay, so, here's where we at.

We need (either) UDFs or a hive query or a series of hive queries that produce, for a given set of partitions, the equivalent set of values to the existing Python code - but for the logs in Hive.

So that's:

  1. Zero results rate for queries with suggestions;
  2. Zero results rate for full-text queries;
  3. Zero results rate for prefix-search queries;
  4. Zero results rate for all queries.

(Alternately, just a count of how many queries fall into each possible permutation; we can easily roll that up into something that fits our existing data format)

I'm gonna move this into the Cirrus board, if that's okay, since it looks like it's an engineering rather than analysis task.

EBernhardson added a comment.EditedDec 10 2015, 8:32 PM

Basically you want this, you can further aggregate down from there. Basically run this query in hive from R, get back ~4k results, run another aggregation step that gives the final answers.

ADD JAR ${artifacts_directory}/org/wikimedia/analytics/refinery/refinery-hive-${refinery_jar_version}.jar

    COUNT(1) AS total,
    SUM(IF(zero_result, 1, 0)) AS zero_results
        wikiid AS wiki_id,
        length(concat_ws('', requests.suggestion)) > 0 AS has_suggestion,
        array_contains(requests.suggestionrequested, TRUE) AS requested_suggestion,
        requests[size(requests)-1].querytype AS query_type,
        array_sum(requests.hitstotal, -1) = 0 AS zero_result
        year = ${year} AND month = ${month} AND day = ${day} AND hour = ${hour}
) data_source
Ironholds claimed this task.
Ironholds edited a custom field.

actually minor adjustment, the array_sum usage should be:

array_sum(requests.hitstotal, -1) = 0 AS zero_result

the change is adding -1 as the second parameter, so something where the first request got 1 result, and the second request completely failed, doesn't result in a value of 0. On an arbitrarily chosen day (2015/12/5) there were only 2 requests that match this condition so it's not a big deal, but it's more correct :)

Change 259054 had a related patch set uploaded (by OliverKeyes):
Replace Python with Hive/R

Change 259054 merged by Bearloga:
Replace Python with Hive/R

Deskana closed this task as Resolved.Dec 16 2015, 6:12 AM