Create UDFs for categorising types of queries based on the nature of the query
Closed, Resolved · Public · 5 Story Points

Description

Create UDFs that categorise queries:

  1. queries with logical operators (AND, OR, NOT, -, !); a-e) each operator independently
  2. queries with double quotes

    a) queries consisting of a single "

    b) queries with an odd number of double quotes

    c) queries with an even number of double quotes
  3. queries with wildcards (* or ?)

    a) queries consisting of a single *

    b) queries consisting of a single ?

    c) queries with a final ?
  4. queries with a tilde

    a) queries consisting of a single ~

    b) queries with a word-final tilde (fuzzy search)

    c) queries with a query-initial tilde (results!)
  5. queries with insource:

    a) queries with insource: followed by something that looks like a regex /.../
  6. queries with prefix:
  7. queries with intitle:
  8. queries with incategory:
  9. queries that do not contain any of the letters A to Z and are not made up entirely of numbers, punctuation, and spaces; so having numbers is okay if there are some other "interesting" characters, too. This is a rough attempt at catching non-ASCII writing systems. If you want to do something more clever, that's very cool.
  10. queries with quot as a separate word
  11. queries with {searchTerms}
  12. queries with {@}
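For illustration, several of the rules above could be sketched as simple string checks. This is a hypothetical Python sketch (the actual UDF is implemented in Java in refinery-hive), and the feature labels are illustrative, not the ones the production code emits:

```python
import re

def deconstruct(query: str) -> list:
    """Hypothetical sketch of a few of the categorisation rules above.
    The real DeconstructQueryUDF is Java; this only illustrates the idea."""
    features = []
    # 1) logical operators: AND/OR/NOT as whole words, or -/! prefixing a term
    if re.search(r'\b(AND|OR|NOT)\b|(^|\s)[-!]\S', query):
        features.append('has logical operator')
    # 2) double quotes: a lone ", or odd vs even counts
    n_quotes = query.count('"')
    if query.strip() == '"':
        features.append('is a single "')
    elif n_quotes % 2 == 1:
        features.append('has odd double quotes')
    elif n_quotes > 0:
        features.append('has even double quotes')
    # 3) wildcards: lone * or ?, or a query-final ?
    if query.strip() in ('*', '?'):
        features.append('is a lone wildcard')
    elif '*' in query or '?' in query:
        features.append('has wildcard')
        if query.rstrip().endswith('?'):
            features.append('ends with ?')
    # 4) tilde: lone ~, word-final ~ (fuzzy), or query-initial ~
    if query.strip() == '~':
        features.append('is a lone ~')
    elif re.search(r'\w~(\s|$)', query):
        features.append('is fuzzy search')
    elif query.lstrip().startswith('~'):
        features.append('has query-initial ~')
    # 5-8) special search keywords
    for kw in ('insource:', 'prefix:', 'intitle:', 'incategory:'):
        if kw in query:
            features.append('has ' + kw)
    return features
```

For example, `deconstruct('word1 AND word2')` would include `'has logical operator'`, and `deconstruct('say "hello')` would include `'has odd double quotes'`.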
Ironholds updated the task description. (Show Details)
Ironholds raised the priority of this task to Needs Triage.
Ironholds added a subscriber: Ironholds.
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald Transcript · Nov 9 2015, 11:09 PM
Ironholds updated the task description. (Show Details) · Nov 9 2015, 11:11 PM
Ironholds set Security to None.
Deskana triaged this task as Normal priority. · Nov 10 2015, 9:11 PM
Deskana added a subscriber: Deskana.
mpopov claimed this task. · Nov 18 2015, 10:06 PM
mpopov edited a custom field.
mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.

Change 254461 had a related patch set uploaded (by Bearloga):
Functions for categorizing queries. (Work In Progress)

https://gerrit.wikimedia.org/r/254461

@mpopov so this is actually a 10% time task (I should've specified that in the description); if you wanna do UDF-werk, the replacement for the random Python script is probably more valuable.

@Ironholds Given how every time I started working on it I was pulled away to work on some urgent broken thing, this was effectively a 10% task ;P

Deskana renamed this task from Create query categoriser UDFs to Create UDFs for categorising types of queries based on the nature of the query. · Dec 17 2015, 9:19 PM

Tweaked title somewhat to differentiate this task from T115919: Create UDFs for categorising referers; feel free to correct or refine further.

@Nuria & @JAllemandou,

Hello! So I'm wrapping up my work on https://gerrit.wikimedia.org/r/#/c/254461/ and I'm getting weird errors that I'm not sure how to address.

hive> ADD JAR /home/bearloga/Code/refineries/refinery-hive.jar;
Added [/home/bearloga/Code/refineries/refinery-hive.jar] to class path
Added resources: [/home/bearloga/Code/refineries/refinery-hive.jar]
hive> CREATE TEMPORARY FUNCTION deconstruct AS 'org.wikimedia.analytics.refinery.hive.DeconstructQueryUDF';
OK

hive> SELECT deconstruct("rèserve d'un hèritage") AS test;
OK
test
[has non-ASCII and is not all numbers/punctuation/spaces]

hive> SELECT deconstruct("word1 AND word2 OR word3") AS test;
OK
test
[has AND, has OR]

So that works fine and dandy like it does in unit tests.

SELECT requests.query[SIZE(requests.query)-1] AS query
  FROM wmf_raw.CirrusSearchRequestSet
  WHERE year = 2016 AND month = 1 AND day = 25 AND hour = 12
    AND requests.querytype[SIZE(requests.query)-1] = 'full_text'
  LIMIT 10;

Returns search queries just fine and dandy. But then trying to use the deconstruct UDF on queries, as in

SELECT query, deconstruct(query) AS features
FROM (
  SELECT requests.query[SIZE(requests.query)-1] AS query
  FROM wmf_raw.CirrusSearchRequestSet
  WHERE year = 2016 AND month = 1 AND day = 25 AND hour = 12
    AND requests.querytype[SIZE(requests.query)-1] = 'full_text'
  LIMIT 10) AS queries;

…consistently yields:

Hadoop job information for Stage-1: number of mappers: 10; number of reducers: 1
2016-01-27 00:03:37,426 Stage-1 map = 0%,  reduce = 0%
2016-01-27 00:03:43,729 Stage-1 map = 40%,  reduce = 0%, Cumulative CPU 7.82 sec
2016-01-27 00:03:44,782 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 14.3 sec
2016-01-27 00:03:45,837 Stage-1 map = 90%,  reduce = 0%, Cumulative CPU 19.33 sec
2016-01-27 00:03:46,892 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 21.2 sec
2016-01-27 00:04:09,140 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 21.2 sec
MapReduce Total cumulative CPU time: 21 seconds 200 msec
Ended Job = job_1441303822549_423144 with errors

Error: java.lang.RuntimeException: Error in configuring object
	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:409)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
	... 9 more
Caused by: java.lang.RuntimeException: Reduce operator initialization failed
	at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.configure(ExecReducer.java:166)
	... 14 more
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.wikimedia.analytics.refinery.hive.DeconstructQueryUDF
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.getUdfClass(GenericUDFBridge.java:134)
	at org.apache.hadoop.hive.ql.exec.FunctionRegistry.isStateful(FunctionRegistry.java:1706)
	at org.apache.hadoop.hive.ql.exec.FunctionRegistry.isDeterministic(FunctionRegistry.java:1669)
	at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.isDeterministic(ExprNodeGenericFuncEvaluator.java:152)
	at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluatorFactory.iterate(ExprNodeEvaluatorFactory.java:97)
	at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluatorFactory.toCachedEvals(ExprNodeEvaluatorFactory.java:71)
	at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:59)
	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:385)
	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:469)
	at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:425)
	at org.apache.hadoop.hive.ql.exec.Operator.initializeOp(Operator.java:410)
	at org.apache.hadoop.hive.ql.exec.LimitOperator.initializeOp(LimitOperator.java:41)
	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:385)
	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:469)
	at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:425)
	at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:65)
	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:385)
	at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.configure(ExecReducer.java:159)
	... 14 more
Caused by: java.lang.ClassNotFoundException: org.wikimedia.analytics.refinery.hive.DeconstructQueryUDF
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:278)
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.getUdfClass(GenericUDFBridge.java:132)
	... 31 more


FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

Please help.

Hi @mpopov, just followed the exact same path you described and got a working result:

  1. Build jar with your patch
  2. Start hive, add jar and create UDF (copy pasted your code, just changed the path for the jar)
  3. Execute tests (for UDF and query text)
  4. Execute the presumably faulty query:
SELECT query, deconstruct(query) AS features
FROM (
  SELECT requests.query[SIZE(requests.query)-1] AS query
  FROM wmf_raw.CirrusSearchRequestSet
  WHERE year = 2016 AND month = 1 AND day = 25 AND hour = 12
    AND requests.querytype[SIZE(requests.query)-1] = 'full_text'
  LIMIT 10) AS queries;

Got result:

query	features
Kamar	[is simple]
pit druk do wypełnienia	[has non-ASCII and is not all numbers/punctuation/spaces]
rèserve d'un hèritage	[has non-ASCII and is not all numbers/punctuation/spaces]
Ghost ship	[is simple]
Lagu Pop Indonesia	[is simple]
patient data	[is simple]
Diagrama de peirce	[is simple]
پای مر	[has non-ASCII and is not all numbers/punctuation/spaces]
michel angelo virus	[is simple]
masdfghjku	[is simple]
Time taken: 30.386 seconds, Fetched: 10 row(s)

It may be that Hive needed to be restarted after multiple jar versions were loaded (Hive doesn't handle that well)?
Contact us on IRC if it still doesn't work for you!

Nuria added a comment. · Jan 27 2016, 5:59 PM

Please take a look at how to override the main jar in Hive to change an existing UDF:

https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/QueryUsingUDF#Testing_changes_to_existing_udf

This is likely the source of your problem.

mpopov added a subscriber: TJones. · Jan 27 2016, 6:12 PM

@JAllemandou, hi! Awesome; yeah, it just started working for me too. Phew!

@TJones, you might be interested in this UDF :)

hive> SELECT features, n_queries
FROM (
   SELECT features, COUNT(1) AS n_queries
   FROM (
     SELECT deconstruct(requests.query[SIZE(requests.query)-1]) AS features
     FROM wmf_raw.CirrusSearchRequestSet
     WHERE year = 2016 AND month = 1 AND day = 25
       AND requests.querytype[SIZE(requests.query)-1] = 'full_text'
   ) AS queries
   GROUP BY features
) AS aggregates
ORDER BY n_queries DESC
LIMIT 20;

features	n_queries
[is simple]	31748994
[has non-ASCII and is not all numbers/punctuation/spaces]	5910927
[has OR, has wildcard, has even double quotes]	917259
[has OR, has logic inversion (-), has wildcard, has even double quotes]	724192
[has logic inversion (-)]	463032
[has even double quotes]	285775
[has non-ASCII and is not all numbers/punctuation/spaces, has even double quotes]	199582
[has logic inversion (-), has non-ASCII and is not all numbers/punctuation/spaces]	88284
[has quot]	75626
[has logic inversion (-), has even double quotes]	69251
[has OR, has even double quotes]	69121
[is intitle]	32720
[has wildcard, ends with ?]	27067
[has wildcard]	25211
[has OR, has logic inversion (-), has even double quotes]	23802
[is fuzzy search]	12467
[is fuzzy search, is intitle]	10892
[is incategory, has even double quotes]	9023
[is incategory, has non-ASCII and is not all numbers/punctuation/spaces, has even double quotes]	8745
[has AND, has quot]	7980

@mpopov, thanks for the link. I am interested—so interested I wrote the description! (Hence the slightly out of place "If you want to do something more clever, that's very cool." comment.)

Still so much quot! I thought @Ironholds found them and told them to behave.

(BTW, #9 was not just some non-ASCII, which is interesting, but no ASCII, so it would count queries wholly in non-Latin writing systems.)
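The distinction TJones describes (no ASCII letters at all, rather than merely some non-ASCII) could be sketched like this. This is a hypothetical Python illustration of the intent of rule #9, not the production Java code:

```python
import string

def no_ascii_letters_and_interesting(query: str) -> bool:
    """True when the query has no ASCII letters A-Z/a-z and is not made up
    entirely of digits, punctuation, and spaces. Hypothetical sketch of the
    intent behind rule #9 (catching wholly non-Latin writing systems)."""
    has_ascii_letter = any(c in string.ascii_letters for c in query)
    boring = set(string.digits + string.punctuation + ' ')
    all_boring = all(c in boring for c in query)
    return not has_ascii_letter and not all_boring
```

Under this sketch, a wholly non-Latin query like "پای مر" would match, while "123 !?" (only numbers, punctuation, and spaces) would not.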

Cool stuff!

Nuria added a comment (edited). · Jan 27 2016, 7:41 PM

This ticket explains the what but not the why of doing this; can we get some info in that regard, please?

Because being able to see what proportion of queries have common syntactic attributes is a great way of being able to identify repetitive automata and/or identify what features are worth testing - nobody cares about a tweak on the OR operator if it's only used sixty times a day.

> Because being able to see what proportion of queries have common syntactic attributes is a great way of being able to identify repetitive automata and/or identify what features are worth testing - nobody cares about a tweak on the OR operator if it's only used sixty times a day.

We definitely care. However, as you say, if something isn't used a massive amount then the priority for making improvements to it is usually much lower. Having that information would be very helpful.

@TJones Do you have a specific preference for the format of the deconstruction output? Currently it's: [feature1, ..., featureN]

If there's a particular format you want, let me know :) And let me know if you want to keep the human-friendly is's and has's (my personal preference but I'm also the secondary consumer on this).

TJones added a comment (edited). · Jan 28 2016, 6:04 PM

@mpopov, the format above looks fine.

This is awaiting code review by Analytics.

The code was reviewed! There are some changes which we need to make, nothing major. Moving back into the backlog for now.

Deskana removed mpopov as the assignee of this task.
mpopov claimed this task. · Feb 18 2016, 12:06 AM
mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.

Change 254461 merged by Nuria:
Functions for categorizing queries.

https://gerrit.wikimedia.org/r/254461

Deskana closed this task as Resolved. · Mar 31 2016, 10:26 PM