Page MenuHomePhabricator

Create UDFs to replicate the Python script used for parsing the Cirrus logs
Closed, ResolvedPublic3 Estimated Story Points

Event Timeline

Ironholds raised the priority of this task from to Needs Triage.
Ironholds updated the task description. (Show Details)
Ironholds added a project: Discovery-ARCHIVED.
Ironholds subscribed.
Ironholds set Security to None.
Ironholds removed a subscriber: StudiesWorld.
Deskana subscribed.

@EBernhardson here's the ticket we have. Basically we just need UDFs to take one of the big complex types and:

  1. Work out if the last one was full-text
  2. Work out if the last one was prefix-search
  3. Work out if the last one returned zero results or not.

Want me to put it in the Cirrus sprint so you can keep track of it? Thanks for your help with this!

These shouldn't really need UDF's, but if you feel its necessary i can work them up. They can be expressed in hive directly. There is always guaranteed to be at least one element in requests, so these should be safe. For the zero hits, i think we should instead migrate to using a sum of the hits by all requests in the request, since we display multiples. If this isn't accurate enough we should probably add an explicit value for the number of results shown.

  1. requests[size(requests)-1].querytype = 'full_text'
  2. requests[size(requests)-1].querytype = 'prefix'
  3. array_sum( requests.hitstotal, -1 ) > 0 (this requires the ArraySumUDF i wrote that's in code review)

i'm also wondering if perhaps we should more explicitly track the query type, for example i noticed a number of queries to zhwiki that do a number of near_match searches, then finish off with a single full_text search. I'm not even sure where these come from yet, but they should probably be classed as near match instead of full_text. Oddities abound...

So, we'd classify things as a success if at one point or another we showed /some/ results, even if their final outcome was nil?

I think there is a misunderstanding of what this requests field is. This is the list of requests made between mediawiki and elasticsearch within a single php execution. The result of all of these internal requests is bunched together and is the final result. This is not showing results at some point or another, all of these internal requests to elasticsearch are returned to the user at the same moment in time.

To take an example, https://github.com/wikimedia/mediawiki/blob/master/includes/specials/SpecialSearch.php#L272-L273 first does a prefix search, and then it does a full text search. These are reported independantly within the requests array. The result of the prefix search (title match) is added to the page here: https://github.com/wikimedia/mediawiki/blob/master/includes/specials/SpecialSearch.php#L366
The result of the full text search is added to the page here: https://github.com/wikimedia/mediawiki/blob/master/includes/specials/SpecialSearch.php#L379

These results are all in the same page. While this particular example would make it rather unlikely that they get a prefix match without also getting full text results, the concept is the same in other portions of the code base.

But I thought queries were being stored with the executorID unique to each row? IOW, the outcome for the "last" set of queries in a row is the one we're interested in.

the executorID represents a single php execution. The entirety of the requests array in a CirrusSearchRequestSet row have the same executorID, they all occured within a single execution of php. For reference php starts a new execution context on every web request (also known as a shared nothing architecture). The last query made within a php execution is arbitrary and not necessarily the most important one.

I should add this is exactly what is being recorded in CirrusSearchUserTesting logs, the top level hits parameter (4th field in tsv) is a sum of the hits for all the individual requests (which are represented as arrays within json encoded values in the 3rd and 9th fields of the tsv)

And does this single PHP execution represent a single user interaction? Because otherwise you've just told me that the logs we've been using for A/B testing for an entire quarter do not represent what we have been basing our analysis on.

And if it represents multiple user interactions (say: going to the search box, typing away, getting some hits on prefix, typing nonsense, hitting enter and getting fulltext) is there any way we can _extract_ the "last" search?

a single PHP execution is a single web request. (but also a single time boxed run of one job type, or a single maintenance script). For an example: If you click search on zhwiki, and we convert it into varients and perform 10 queries in the backend and return 1 combined result set, that will be 10 entries in the requests field of a single CirrusSearchRequestSet row. Same for the User Testing logs. Each request in these logs refers to a request between mediawiki and elasticsearch. The whole reason we created the UserTesting and CirrusSearchRequestSet logs was so we had one log row for one user interaction.

Aha, gotcha! Awesome; totally makes sense.

(I will try to get to this task and add more detail/thunk on what we are doing with it when I've sent this host of emails)

Okay, so, here's where we at.

We need (either) UDFs or a hive query or a series of hive queries that produce, for a given set of partitions, the equivalent set of values to the existing Python code - but for the logs in Hive.

So that's:

  1. Zero results rate for queries with suggestions;
  2. Zero results rate for full-text queries;
  3. Zero results rate for prefix-search queries;
  4. Zero results rate for all queries.

(Alternately, just a count of how many queries fall into each possible permutation; we can easily roll that up into something that fits our existing data format)

I'm gonna move this into the Cirrus board, if that's okay, since it looks like it's an engineering rather than analysis task.

Basically you want this, you can further aggregate down from there. Basically run this query in hive from R, get back ~4k results, run another aggregation step that gives the final answers.

ADD JAR ${artifacts_directory}/org/wikimedia/analytics/refinery/refinery-hive-${refinery_jar_version}.jar
CREATE TEMPORARY FUNCTION array_sum AS 'org.wikimedia.analytics.refinery.hive.ArraySumUDF';

SELECT
    wiki_id,
    source,
    has_suggestion,
    requested_suggestion,
    query_type,
    COUNT(1) AS total,
    SUM(IF(zero_result, 1, 0)) AS zero_results
FROM (
    SELECT
        wikiid AS wiki_id,
        source,
        length(concat_ws('', requests.suggestion)) > 0 AS has_suggestion,
        array_contains(requests.suggestionrequested, TRUE) AS requested_suggestion,
        requests[size(requests)-1].querytype AS query_type,
        array_sum(requests.hitstotal, -1) = 0 AS zero_result
    FROM
        ${source_table}
    WHERE
        year = ${year} AND month = ${month} AND day = ${day} AND hour = ${hour}
) data_source
GROUP BY
    wiki_id,
    source,
    has_suggestion,
    requested_suggestion,
    query_type
;

actually minor adjustment, the array_sum usage should be:

array_sum(requests.hitstotal, -1) = 0 AS zero_result

the change is adding -1 as the second parameter, so something where the first request got 1 result, and the second request completely failed, doesn't result in a value of 0. On an arbitrarily chosen day (2015/12/5) there were only 2 requests that match this condition so it's not a big deal, but it's more correct :)

Change 259054 had a related patch set uploaded (by OliverKeyes):
Replace Python with Hive/R

https://gerrit.wikimedia.org/r/259054

Change 259054 merged by Bearloga:
Replace Python with Hive/R

https://gerrit.wikimedia.org/r/259054