Think of all the awesome tools people could write with search data :)
Major things to figure out:
- replication strategy (including skipping private wikis)
- making sure we have the resources in Labs to host a replica of the prod indices
Use cases:
- Gives the community access to query arbitrary data out of Elasticsearch. The MediaWiki search API pales in comparison to what can be done with the ES API directly.
- The Elasticsearch query and document format is, for many tasks, much easier to use. In the Labs MySQL database, getting a page and all its information requires a complicated join. In ES it is a simple query[1], and by default the returned document contains everything we know about the page[2].
- Lets the community build tools that take advantage of all this data in Elasticsearch.
- Gives the Discovery department access to constantly updated indices in Labs for analysis and research into potential changes
- Replaces mwgrep (which can only be used by people with shell access)
- Allows people to search across multiple projects for deprecated JS code
[1] http://elasticsearch/enwiki_content/page/_search?q=title:Jimmy_Wales
[2] https://en.wikipedia.org/wiki/Jimmy_Wales?action=cirrusdump
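
To give a rough idea of how a tool might use the query in [1], here is a minimal Python sketch. It only assumes the URL shape shown in [1]; the host name "elasticsearch" is the placeholder from that footnote, the helper names are made up for illustration, and the real replica endpoint, index names, and access rules are still to be decided.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url, index, doc_type, query):
    """Build a URI search URL shaped like footnote [1]."""
    # urlencode percent-escapes the query, so "title:Jimmy_Wales"
    # becomes "title%3AJimmy_Wales" on the wire.
    params = urlencode({"q": query})
    return f"{base_url.rstrip('/')}/{index}/{doc_type}/_search?{params}"


def extract_documents(response):
    """Return the stored documents from a parsed search response.

    Elasticsearch wraps results as response["hits"]["hits"], and each
    hit carries the stored document under "_source".
    """
    return [hit["_source"] for hit in response.get("hits", {}).get("hits", [])]


def search(base_url, index, doc_type, query, timeout=10):
    """Run the search and return the matching documents.

    Assumption: base_url points at a reachable replica; no such
    endpoint exists yet, so this function is illustrative only.
    """
    url = build_search_url(base_url, index, doc_type, query)
    with urlopen(url, timeout=timeout) as resp:
        return extract_documents(json.load(resp))


if __name__ == "__main__":
    # Placeholder host taken from footnote [1]; prints the URL only.
    print(build_search_url("http://elasticsearch", "enwiki_content",
                           "page", "title:Jimmy_Wales"))
```

Compared with the equivalent join against the MySQL replica, the whole lookup is one HTTP GET plus unpacking hits.hits[]._source, which is most of the appeal for tool authors.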