
Elasticsearch credential request for 'similarity'
Closed, Resolved (Public)

Description

Hi,

I'd like to get Elasticsearch access for my tool 'similarity', which is the backend of a browser extension I'm working on [1]. My plan is to index plaintext versions of (most of) the articles under enwiki's Category:All_articles_needing_additional_references in ES. On news websites, the browser extension would then perform MoreLikeThis [2] queries using text extracted from the current page.
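For illustration, a request body for such a query might look like the sketch below. The `more_like_this` query and its `fields`, `like`, `min_term_freq`, and `max_query_terms` parameters come from the Elasticsearch documentation linked in [2]; the field name `text`, the helper name `build_mlt_query`, and the chosen values are assumptions, not the tool's actual setup.

```python
import json


def build_mlt_query(page_text, fields=("text",), max_query_terms=25, size=5):
    """Build a search body that asks for documents similar to page_text.

    The field name "text" is a placeholder for whichever field holds
    the plaintext article in the index.
    """
    return {
        "size": size,  # number of similar articles to return
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": page_text,
                # Count a term even if it appears only once in the input.
                "min_term_freq": 1,
                "max_query_terms": max_query_terms,
            }
        },
    }


body = build_mlt_query("Example news article text about a city council vote.")
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of a search request against the index; how the request is sent depends on the client in use.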

This amounts to about 180,000 documents, or ~1.6 GiB on my local machine. I can try to further clean up the articles or downsample if that's too much, but ideally I'd index them in the current format at first to validate this approach.
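As a rough sanity check on those numbers, the average document size and the effect of downsampling can be estimated as follows. This is only arithmetic on the figures above; the 50% sampling fraction is an arbitrary example, and on-disk index size in Elasticsearch will differ from raw text size.

```python
DOC_COUNT = 180_000
TOTAL_GIB = 1.6  # raw size measured locally, per the figures above

# Average raw size per document, in KiB.
avg_kib = TOTAL_GIB * 1024**3 / DOC_COUNT / 1024


def downsampled_gib(fraction):
    """Estimated raw size if only `fraction` of the documents are kept."""
    return TOTAL_GIB * fraction


print(f"average document: {avg_kib:.1f} KiB")
print(f"keeping 50% of documents: {downsampled_gib(0.5):.1f} GiB")
```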

Thank you!

[1] See https://lists.wikimedia.org/pipermail/cloud/2017-September/000003.html for previous discussion
[2] https://www.elastic.co/guide/en/elasticsearch/reference/5.6/query-dsl-mlt-query.html

Event Timeline

bd808 claimed this task.

@Surlycyborg your credentials are in /data/project/similarity/.elasticsearch.ini
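A minimal sketch of loading such a file from a tool, assuming it is a standard INI file. The section name `elasticsearch` and the keys `user` and `password` are guesses for illustration only; check the actual file for its real layout. The sample text below stands in for the file contents.

```python
import configparser

# Hypothetical contents; the real .elasticsearch.ini may use other names.
SAMPLE_INI = """\
[elasticsearch]
user = similarity
password = not-a-real-password
"""


def load_credentials(ini_text, section="elasticsearch"):
    """Return (user, password) from INI-formatted text."""
    # interpolation=None keeps '%' characters in passwords literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return parser[section]["user"], parser[section]["password"]


user, password = load_credentials(SAMPLE_INI)
print(user)
```

To read the real file, `parser.read(path)` can be used in place of `read_string`.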

This Elasticsearch cluster is multi-tenant, and nothing currently limits your disk or CPU usage. Please use it responsibly and keep an eye on how fast you add new documents to your indexes. If other tenants complain about performance issues, your access may be revoked temporarily or permanently while we adjust things.
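One way to keep the indexing rate modest is to send documents in small bulk batches with a pause between them. The sketch below builds bodies in the newline-delimited format of the Elasticsearch bulk API and leaves the actual HTTP call to a caller-supplied `send` function. The index name, document type, batch size, and pause length are all illustrative assumptions, not recommended values for this cluster.

```python
import json
import time


def chunked(docs, batch_size):
    """Yield lists of at most batch_size documents."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_body(batch, index, doc_type):
    """Build a bulk body: one action line, then one source line per doc."""
    lines = []
    for doc in batch:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk format ends with a newline


def index_slowly(docs, send, batch_size=500, pause_seconds=1.0,
                 index="similarity", doc_type="article"):
    """Send docs in batches, sleeping between batches to limit load."""
    sent = 0
    for batch in chunked(docs, batch_size):
        send(bulk_body(batch, index, doc_type))
        sent += len(batch)
        time.sleep(pause_seconds)
    return sent


# Dry run: collect the payloads instead of sending them anywhere.
payloads = []
demo_docs = ({"text": f"doc {i}"} for i in range(5))
count = index_slowly(demo_docs, payloads.append, batch_size=2, pause_seconds=0)
print(count, len(payloads))
```

In real use, `send` would POST each body to the cluster's `_bulk` endpoint with the tool's credentials, and the pause could be raised if other tenants report slowdowns.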

Have fun building your tool. It sounds interesting. After you get something working as a proof of concept you might want to tell the Search Platform folks and find out if there is a way you can get the data you need from the main CirrusSearch elasticsearch backend which is kept up to date with the live edit feed. Someday™ we hope to have a mirror of those indexes in Data-Services to make building tools like this easier.