Code Review - Error
secondary index must be enabled for comment:pginer
Also for operators message: and file: (hence tentatively marking as blocker of bug 36467, bug 49674).
I hear that «as part of the secondary indexing change its no longer allowed
to run in its naive form. I think that is OK, we are going to
encourage everyone to turn on secondary indexing.»
Is this the infamous Lucene search? If yes, is this a WONTFIX or what blocks it other than caching (https://code.google.com/p/gitblit/issues/detail?id=274#c9)?
P.s.: Pasting here IRC discussion about the URL above for lack of a better place. http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-dev/20140217.txt
[10:37:35] <ori> Krinkle: have you seen mwgrep yet? [10:38:23] <ori> ssh to tin and run: mwgrep mw.user.anon [10:40:40] <hoo> ori: ooowwwm, super handy... thanks :) [10:41:09] <hoo> I've been using a hacky bash script for that until now :P [10:41:38] <ori> how did it work? [10:42:06] <hoo> ori: It basically just pulled in all JS pages I wan interested in (mostly common.js) and ran grep on them :P [10:42:11] <hoo> * was [10:42:14] <ori> ahh [10:48:49] <Krinkle> ori: how performant is that (how much is search engine, how much is local script), you think we could expose this publicly somehow? [10:49:28] <Krinkle> I wrote two separate toolserver tools over the past years to do essentially that, all of them extremely hacky and slow. [10:49:44] <hoo> hoo@terbium:~$ time mwgrep 453894375dfs [10:49:44] <hoo> [...] [10:49:44] <hoo> real 0m0.176s [10:49:52] <ori> the local script is basically constructing the URL for a request; it's all elasticsearch [10:50:09] <Krinkle> couldn't use any API or search backend for it at the time, as it mangled code quite badly (punctuation normalisation etc.), and no way to search multiple wikis [10:50:11] <ori> nik (manybubbles) said the load it generates is laughably small [10:50:28] <Krinkle> yeah, looked at the source code already [10:50:41] <Krinkle> wasn't sure where ""_source['text'].contains('%s')" % args.term" ran though [10:50:42] <hoo> ori: Does that service used in there have a public host? [10:50:58] <hoo> http://search.svc.eqiad.wmnet:9200 [10:51:09] <ori> you could change it to a regexp search and as long as it was restricted to the mw ns it would be fine (again, per nik) [10:51:53] <ori> it's not publicly-accessible; pathological queries will run it into the ground, so you need to have a layer in front that restricts the types of searches [10:52:02] <ori> in our case, the mediawiki search interface [10:52:43] <Krinkle> ori: Hm.. it currently doesn't take regex though, right? Would need a minor change to the script. Just verifying I'm not using it incorrectly [10:52:56] <ori> yeah, doesn't take a regexp at the moment [10:53:29] <ori> because this particular query is not risky (because ns:8, title ends with 'js' or 'css' restricts the pool of candidates), we could have a public front-end for it, but chad's plan is to make the entire elastic index queryable from labs [10:53:40] <Nemo_bis> hoo: does this help the non-protocol-relative scripts searches? [10:54:07] <hoo> Nemo_bis: Not much w/o regexp support, yet [10:54:13] <Nemo_bis> ok [10:54:29] <hoo> But I already have quite a list of stuff to clean up which I gathered using my old bash script [10:55:52] <ori> Krinkle: http://p.defau.lt/?Yswucu3jR5S9QqZfFre7Vg is the query (first line is path, everything below is the json query, which should be sent as POST payload) [10:56:56] <ori> Krinkle, hoo: you can replace 'contains' with 'matches' to make it a regexp search [10:57:29] <Krinkle> ori: what script language is that? [10:57:44] <hoo> python [10:57:48] <Krinkle> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-script-filter.html [10:57:56] <Krinkle> ok [10:58:24] <ori> Krinkle: no, MVEL [10:58:26] <ori> <http://github.com/mvel/mvel> [10:58:36] <ori> context: http://p.defau.lt/?EZ6aPeZRMWdjZtwpOnFgWQ [10:58:53] <Krinkle> yeah, it says that [10:59:05] <Krinkle> <a>scripts</a> -> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html [10:59:09] <Krinkle> uses by default mvel [10:59:24] <ori> ah, yep [11:00:27] <hoo> ori: mh... regexp don't seem to work after I did the change you suggested... [11:00:34] <hoo> or aren't that perl-style regexp? [11:02:18] <Krinkle> Im trying to find docs on mvel regarding available string methods [11:02:20] <Krinkle> no luck [11:03:18] <Krinkle> http://mvel.codehaus.org/MVEL+2.0+Operators shows a different syntax ( foo contains "x", and, foo ~= '[a-z].+' ) [11:03:47] <ori> Krinkle: "By default, like Java, MVEL provides all java.lang.* classes pre-imported as part of the compiler and runtime." [11:03:54] <Krinkle> right [11:04:23] <ori> so since _source['text'] is a string, you can call http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#matches(java.lang.String) [11:05:06] <ori> I don't know how that is functionally different from using MVEL's ~= operator [11:05:10] <Krinkle> Ah, this is going to be important to document and protect against. Working 3 or 4 levels of abstraction deep to end up evaluating java. [11:05:22] <Krinkle> java, mvel, json, http [11:05:52] <ori> elastic's http interface is designed to be a private api tho [11:06:00] <Krinkle> yeah [11:06:10] <ori> but yeah, it is a bit like a matryoshka doll [11:06:29] <ori> https://upload.wikimedia.org/wikipedia/commons/4/45/7babushkadolls.JPG [11:06:33] <Krinkle> except that its bigger on the inside [11:06:34] <ori> i'll let you decide which one is java :P [11:06:51] <Krinkle> http < json < mvel < java [11:07:16] <Krinkle> well, kinda sorta [11:07:17] <Krinkle> nvm [11:07:41] <ori> don't forget shell-style argv parsing > json [11:08:06] <Krinkle> yeah, but out final API isn't going to do that :P [11:08:31] <ori> at least i hope not [11:08:59] <Krinkle> ApiAwesome.php, $apiResult = `"/bin/mwgrep escapeshellarg($_GET['q'])"` [11:09:25] <ori> I'm sure we could come up with something more hideous than that [11:09:33] <ori> it needs Puppet [11:09:41] <ori> and ideally also XML [11:10:00] <Krinkle> and php serialise/wakeup [11:10:37] <ori> see also <http://stackoverflow.com/questions/7797217/how-to-index-source-code-with-elasticsearch> for the "proper" way to do code searches without resulting to this sort of trickery [11:12:17] <ori> basically teach lucene to run <https://code.google.com/p/google-caja/source/browse/trunk/src/com/google/caja/parser/js/Parser.java> against the text and index names [11:12:35] * ori chuckles at 'import com.google.caja.SomethingWidgyHappenedError;'\x01 [11:13:56] <Nemo_bis> If ElasticSearch can be used for proper code searches, maybe gitblit can be convinced to use it instead of Lucene? [11:14:23] <Nemo_bis> If yes please file a bug upstream, the request for gitweb-like grep search was rejected. [11:14:37] <ori> elastic uses lucene under the hood [11:16:04] <ori> I've been poking at its overall performance (<https://code.google.com/p/gitblit/issues/detail?id=274#c9>), it's hideously slow atm [11:18:37] <Nemo_bis> I know it uses Lucene, hence I'd think it would be slightly easier to convince them to adopt it. :) [11:19:21] <Nemo_bis> Oh, thanks for the link, I thought I had starred that one. [11:22:47] <Nemo_bis> And thanks for working on that