Code Review - Error
secondary index must be enabled for comment:pginer
The same applies to the message: and file: operators (hence tentatively marking this as a blocker of bug 36467 and bug 49674).
I hear that «as part of the secondary indexing change it's no longer allowed
to run in its naive form. I think that is OK, we are going to
encourage everyone to turn on secondary indexing.»
Is this the infamous Lucene search? If yes, is this a WONTFIX or what blocks it other than caching (<https://code.google.com/p/gitblit/issues/detail?id=274#c9>)?
P.S.: Pasting the IRC discussion about the URL above here, for lack of a better place. http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-dev/20140217.txt
[10:37:35] <ori> Krinkle: have you seen mwgrep yet?
[10:38:23] <ori> ssh to tin and run: mwgrep mw.user.anon
[10:40:40] <hoo> ori: ooowwwm, super handy... thanks :)
[10:41:09] <hoo> I've been using a hacky bash script for that until now :P
[10:41:38] <ori> how did it work?
[10:42:06] <hoo> ori: It basically just pulled in all JS pages I was interested in (mostly common.js) and ran grep on them :P
[10:42:14] <ori> ahh
[10:48:49] <Krinkle> ori: how performant is that (how much is search engine, how much is local script), you think we could expose this publicly somehow?
[10:49:28] <Krinkle> I wrote two separate toolserver tools over the past years to do essentially that, all of them extremely hacky and slow.
[10:49:44] <hoo> hoo@terbium:~$ time mwgrep 453894375dfs
[10:49:44] <hoo> [...]
[10:49:44] <hoo> real 0m0.176s
[10:49:52] <ori> the local script is basically constructing the URL for a request; it's all elasticsearch
[10:50:09] <Krinkle> couldn't use any API or search backend for it at the time, as it mangled code quite badly (punctuation normalisation etc.), and no way to search multiple wikis
[10:50:11] <ori> nik (manybubbles) said the load it generates is laughably small
[10:50:28] <Krinkle> yeah, looked at the source code already
[10:50:41] <Krinkle> wasn't sure where ""_source['text'].contains('%s')" % args.term" ran though
[10:50:42] <hoo> ori: Does that service used in there have a public host?
[10:50:58] <hoo> http://search.svc.eqiad.wmnet:9200
[10:51:09] <ori> you could change it to a regexp search and as long as it was restricted to the mw ns it would be fine (again, per nik)
[10:51:53] <ori> it's not publicly-accessible; pathological queries will run it into the ground, so you need to have a layer in front that restricts the types of searches
[10:52:02] <ori> in our case, the mediawiki search interface
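The "layer in front" ori describes amounts to validating and constraining queries before they ever reach the private Elasticsearch endpoint. A minimal hypothetical sketch of such a check (the whitelist rules, limits, and names here are illustrative, not mwgrep's actual ones):

```python
# Hypothetical front-layer check before forwarding a user-supplied term
# to a private Elasticsearch endpoint: reject anything outside a narrow,
# known-cheap class of searches (all names and limits are illustrative).
MAX_TERM_LEN = 100
ALLOWED_NAMESPACE = 8  # the MediaWiki namespace, as mwgrep restricts to

def is_safe_query(term, namespace):
    if namespace != ALLOWED_NAMESPACE:
        return False
    if not term or len(term) > MAX_TERM_LEN:
        return False
    # Forbid quote and escape characters so the term cannot break out of
    # the script string literal it gets interpolated into.
    return not any(c in term for c in "'\"\\")
```

This is the general shape of the restriction; in production the MediaWiki search interface plays this role, as ori notes below.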
[10:52:43] <Krinkle> ori: Hm.. it currently doesn't take regex though, right? Would need a minor change to the script. Just verifying I'm not using it incorrectly
[10:52:56] <ori> yeah, doesn't take a regexp at the moment
[10:53:29] <ori> because this particular query is not risky (because ns:8, title ends with 'js' or 'css' restricts the pool of candidates), we could have a public front-end for it, but chad's plan is to make the entire elastic index queryable from labs
[10:53:40] <Nemo_bis> hoo: does this help the non-protocol-relative scripts searches?
[10:54:07] <hoo> Nemo_bis: Not much w/o regexp support, yet
[10:54:13] <Nemo_bis> ok
[10:54:29] <hoo> But I already have quite a list of stuff to clean up which I gathered using my old bash script
[10:55:52] <ori> Krinkle: http://p.defau.lt/?Yswucu3jR5S9QqZfFre7Vg is the query (first line is path, everything below is the json query, which should be sent as POST payload)
[10:56:56] <ori> Krinkle, hoo: you can replace 'contains' with 'matches' to make it a regexp search
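The query behind the p.defau.lt link is a script filter whose script is interpolated as `"_source['text'].contains('%s')" % args.term`. A hedged Python sketch of how such a payload is plausibly built, with the contains/matches swap ori suggests (the surrounding query structure is an assumption inferred from the log, not mwgrep's verbatim code):

```python
import json

# Sketch of an Elasticsearch script-filter payload like the one mwgrep
# reportedly sends. The filter layout is inferred from the IRC log and
# is an assumption, not the actual mwgrep source.
def build_query(term, regex=False):
    # The script is evaluated per document: String.contains for a
    # literal search, String.matches for a Java-style regexp search.
    # Note the '%s' interpolation splices the term into the script
    # unescaped -- exactly the injection hazard joked about below.
    method = "matches" if regex else "contains"
    script = "_source['text'].%s('%s')" % (method, term)
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"script": {"script": script}},
            }
        }
    }

# Sent as the POST payload to the search endpoint.
payload = json.dumps(build_query("mw.user.anon"))
```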
[10:57:29] <Krinkle> ori: what script language is that?
[10:57:44] <hoo> python
[10:57:48] <Krinkle> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-script-filter.html
[10:57:56] <Krinkle> ok
[10:58:24] <ori> Krinkle: no, MVEL
[10:58:26] <ori> <http://github.com/mvel/mvel>
[10:58:36] <ori> context: http://p.defau.lt/?EZ6aPeZRMWdjZtwpOnFgWQ
[10:58:53] <Krinkle> yeah, it says that
[10:59:05] <Krinkle> <a>scripts</a> -> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html
[10:59:09] <Krinkle> uses mvel by default
[10:59:24] <ori> ah, yep
[11:00:27] <hoo> ori: mh... regexp don't seem to work after I did the change you suggested...
[11:00:34] <hoo> or aren't that perl-style regexp?
[11:02:18] <Krinkle> Im trying to find docs on mvel regarding available string methods
[11:02:20] <Krinkle> no luck
[11:03:18] <Krinkle> http://mvel.codehaus.org/MVEL+2.0+Operators shows a different syntax ( foo contains "x", and, foo ~= '[a-z].+' )
[11:03:47] <ori> Krinkle: "By default, like Java, MVEL provides all java.lang.* classes pre-imported as part of the compiler and runtime."
[11:03:54] <Krinkle> right
[11:04:23] <ori> so since _source['text'] is a string, you can call http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#matches(java.lang.String)
[11:05:06] <ori> I don't know how that is functionally different from using MVEL's ~= operator
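One practical difference worth knowing: Java's String.matches anchors the pattern against the entire string, which is likely why hoo's grep-style patterns "don't seem to work" above. The same distinction can be illustrated in Python with re.fullmatch (Java matches semantics) versus re.search (grep semantics):

```python
import re

text = "mw.user.anon is referenced here"

# Java String.matches semantics: the pattern must cover the whole string,
# so a bare pattern fails against a longer document.
assert re.fullmatch(r"mw\.user\.anon", text) is None

# grep-style semantics: match anywhere in the string. To get the same
# result under matches(), the pattern must be wrapped in '.*'.
assert re.search(r"mw\.user\.anon", text) is not None
assert re.fullmatch(r".*mw\.user\.anon.*", text) is not None
```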
[11:05:10] <Krinkle> Ah, this is going to be important to document and protect against. Working 3 or 4 levels of abstraction deep to end up evaluating java.
[11:05:22] <Krinkle> java, mvel, json, http
[11:05:52] <ori> elastic's http interface is designed to be a private api tho
[11:06:00] <Krinkle> yeah
[11:06:10] <ori> but yeah, it is a bit like a matryoshka doll
[11:06:29] <ori> https://upload.wikimedia.org/wikipedia/commons/4/45/7babushkadolls.JPG
[11:06:33] <Krinkle> except that it's bigger on the inside
[11:06:34] <ori> i'll let you decide which one is java :P
[11:06:51] <Krinkle> http < json < mvel < java
[11:07:16] <Krinkle> well, kinda sorta
[11:07:17] <Krinkle> nvm
[11:07:41] <ori> don't forget shell-style argv parsing > json
[11:08:06] <Krinkle> yeah, but our final API isn't going to do that :P
[11:08:31] <ori> at least i hope not
[11:08:59] <Krinkle> ApiAwesome.php, $apiResult = `"/bin/mwgrep escapeshellarg($_GET['q'])"`
[11:09:25] <ori> I'm sure we could come up with something more hideous than that
[11:09:33] <ori> it needs Puppet
[11:09:41] <ori> and ideally also XML
[11:10:00] <Krinkle> and php serialise/wakeup
[11:10:37] <ori> see also <http://stackoverflow.com/questions/7797217/how-to-index-source-code-with-elasticsearch> for the "proper" way to do code searches without resorting to this sort of trickery
[11:12:17] <ori> basically teach lucene to run <https://code.google.com/p/google-caja/source/browse/trunk/src/com/google/caja/parser/js/Parser.java> against the text and index names
[11:12:35] * ori chuckles at 'import com.google.caja.SomethingWidgyHappenedError;'
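The "proper" approach from the Stack Overflow link is to run the source through a real parser at index time and store the extracted identifiers in a dedicated field, instead of letting the default analyzer mangle the code. A toy Python sketch of the idea (a regex stands in for a real JS parser like Caja's; everything here is illustrative):

```python
import re

# Toy identifier extractor standing in for a real JavaScript parser:
# pull out identifier-like tokens so the search index can match on
# names rather than on analyzer-mangled raw text.
def extract_identifiers(source):
    tokens = re.findall(r"[A-Za-z_$][\w$]*", source)
    # A tiny, incomplete keyword list -- a real parser would know the
    # full grammar and distinguish identifiers properly.
    keywords = {"var", "function", "return", "if", "else", "new"}
    return sorted({t for t in tokens if t not in keywords})

doc = "var user = mw.user; if (user.anon) { notify(user); }"
ids = extract_identifiers(doc)
# These identifiers would be stored in a separate index field for search.
```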
[11:13:56] <Nemo_bis> If ElasticSearch can be used for proper code searches, maybe gitblit can be convinced to use it instead of Lucene?
[11:14:23] <Nemo_bis> If yes please file a bug upstream, the request for gitweb-like grep search was rejected.
[11:14:37] <ori> elastic uses lucene under the hood
[11:16:04] <ori> I've been poking at its overall performance (<https://code.google.com/p/gitblit/issues/detail?id=274#c9>), it's hideously slow atm
[11:18:37] <Nemo_bis> I know it uses Lucene, hence I'd think it would be slightly easier to convince them to adopt it. :)
[11:19:21] <Nemo_bis> Oh, thanks for the link, I thought I had starred that one.
[11:22:47] <Nemo_bis> And thanks for working on that