Hi,
I'd like to request Elasticsearch access for my tool 'similarity', which is the backend of a browser extension I'm working on [1]. My plan is to index plaintext versions of (most of) the articles in enwiki's Category:All_articles_needing_additional_references in ES, then have the browser extension run MoreLikeThis [2] queries against that index, using text extracted from the page the user is currently viewing on news websites.
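To make the query side concrete, here is a rough sketch of the request body the backend might send for each page. The field name `text`, the function name, and the parameter values are placeholders I made up for illustration, not a settled design; only the `more_like_this` query and its `fields`, `like`, `min_term_freq` and `max_query_terms` parameters come from the ES documentation [2].

```python
import json


def build_mlt_query(page_text, fields=("text",), max_query_terms=25, size=5):
    """Build a more_like_this request body from text extracted from a page.

    `fields`, `max_query_terms` and `size` are illustrative defaults
    (assumptions), to be tuned once real data is indexed.
    """
    return {
        "size": size,  # number of similar articles to return
        "query": {
            "more_like_this": {
                "fields": list(fields),      # indexed field(s) to compare against
                "like": page_text,           # free text extracted from the news page
                "min_term_freq": 1,          # keep terms that occur once in the input
                "max_query_terms": max_query_terms,
            }
        },
    }


if __name__ == "__main__":
    body = build_mlt_query("Text extracted from a news article ...")
    print(json.dumps(body, indent=2))
```

The body would be sent to the index's `_search` endpoint; the sketch only builds the JSON and does not talk to a cluster.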
This amounts to about 180,000 documents, or ~1.6 GiB on my local machine. I can clean up the articles further or downsample if that's too much, but ideally I'd first index them in their current format to validate the approach.
Thank you!
[1] See https://lists.wikimedia.org/pipermail/cloud/2017-September/000003.html for previous discussion
[2] https://www.elastic.co/guide/en/elasticsearch/reference/5.6/query-dsl-mlt-query.html