Elasticsearch credential request for 'similarity'
Closed, Resolved · Public

Assigned To

Authored By

	Surlycyborg
	Oct 29 2017, 7:01 PM

Description

Hi,

I'd like to get Elasticsearch access for my tool 'similarity', which is the backend of a browser extension I'm working on [1]. My plan is to index plaintext versions of (most of) the articles under enwiki's Category:All_articles_needing_additional_references in ES, then have the browser extension perform MoreLikeThis [2] queries with text extracted from the current page when the user is on a news website.

This amounts to about 180000 documents, or ~1.6GiB on my local machine. I can try to further clean up the articles or downsample if that's too much, but ideally I'd index them in the current format at first to validate this approach.

Thank you!

[1] See https://lists.wikimedia.org/pipermail/cloud/2017-September/000003.html for previous discussion
[2] https://www.elastic.co/guide/en/elasticsearch/reference/5.6/query-dsl-mlt-query.html
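For illustration, the MoreLikeThis lookup described above could be expressed as a request body like the one built below. This is only a sketch under assumptions: the field name `text`, the helper name `build_mlt_query`, and the parameter values are hypothetical and not taken from the tool itself; `fields`, `like`, `min_term_freq` and `max_query_terms` are parameters of the `more_like_this` query documented in [2].

```python
import json


def build_mlt_query(page_text, field="text", size=5, max_query_terms=25):
    """Build a search request body around a more_like_this query.

    `field` is a hypothetical name for the indexed plaintext of each
    article; `page_text` is the text extracted from the current page.
    """
    return {
        "size": size,  # number of similar articles to return
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": page_text,
                "min_term_freq": 1,  # assumed value; tune for short inputs
                "max_query_terms": max_query_terms,
            }
        },
    }


if __name__ == "__main__":
    body = build_mlt_query("Text extracted from a news article ...")
    # The JSON below is what would be sent to the index's _search endpoint.
    print(json.dumps(body, indent=2))
```

Keeping the query construction in a small function like this makes it easy to experiment with `min_term_freq` and `max_query_terms` while validating the approach.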

Event Timeline

Surlycyborg created this task. Oct 29 2017, 7:01 PM

Restricted Application added a subscriber: Aklapper. Oct 29 2017, 7:01 PM

bd808 added a project: cloud-services-team (Kanban). Oct 29 2017, 7:07 PM

bd808 moved this task from Inbox to Clinic Duty on the cloud-services-team (Kanban) board.

@Surlycyborg your credentials are in /data/project/similarity/.elasticsearch.ini

This Elasticsearch cluster is multi-tenant, and nothing really limits your disk and CPU usage. Please try to use it responsibly and keep an eye on how fast you add new documents to your indexes. If other tenants complain about performance issues, your access may be revoked temporarily or permanently while we try to adjust things.
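One way to keep the indexing rate modest is to send documents in small bulk batches with a pause between them. The following is a minimal sketch, not a prescribed method: the function names, the index and type names, the batch size and the pause length are all hypothetical, and it only prepares newline-delimited JSON bodies in the format the `_bulk` endpoint accepts, leaving the HTTP call to whatever client you use.

```python
import json
import time
from itertools import islice


def batches(docs, batch_size):
    """Yield lists of at most batch_size documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def bulk_body(index, doc_type, batch):
    """Serialize a batch as newline-delimited JSON: one action line,
    then one source line per document, with a trailing newline."""
    lines = []
    for doc in batch:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


def paced_bulk_bodies(docs, index, doc_type, batch_size=500,
                      pause_seconds=1.0, sleep=time.sleep):
    """Yield one bulk body per batch, pausing between batches so the
    shared cluster is not flooded. All defaults are assumed values."""
    for i, batch in enumerate(batches(docs, batch_size)):
        if i:
            sleep(pause_seconds)  # no pause before the first batch
        yield bulk_body(index, doc_type, batch)
```

Each yielded body would then be POSTed to `_bulk`; lowering `batch_size` or raising `pause_seconds` reduces the load on other tenants at the cost of a longer initial import.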

Have fun building your tool. It sounds interesting. Once you have something working as a proof of concept, you might want to tell the Search Platform folks and find out whether you can get the data you need from the main CirrusSearch Elasticsearch backend, which is kept up to date with the live edit feed. Someday™ we hope to have a mirror of those indexes in Data-Services to make building tools like this easier.

Restricted Application added a project: User-bd808. Oct 31 2017, 10:01 PM

bd808 moved this task from Clinic Duty to Done on the cloud-services-team (Kanban) board. Oct 31 2017, 11:36 PM

bd808 moved this task from To Do to Done on the User-bd808 board. Jul 15 2020, 9:17 PM
