Re-evaluate mapping for keywords
Closed, ResolvedPublic
Assigned To

Authored By

	EBernhardson
	Feb 28 2017, 5:45 AM

Description

We decided to avoid the keyword type and use text for everything, relying on truncation in the analyzer stage to keep tokens within the Lucene limits. It seems this hasn't worked, though: importing the enwiki dumps into relforge hit an error on the external_link field for the page https://en.wikipedia.org/wiki/Wikipedia:Editor_assistance/Requests/Archive_122?action=cirrusdump

The error was:
java.lang.IllegalArgumentException: Document contains at least one immense term in field="external_link" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms.

This should be found at https://relforge1001.eqiad.wmnet:9243/crosswiki_enwiki_general/page/45641485 but the entire document was not indexed because of the above exception. We need to put together a test case and fix our mapping to handle this case.
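As a starting point for such a test case, here is a minimal Python sketch. It assumes the intended fix is a custom analyzer that pairs the keyword tokenizer with a truncate token filter. The names `keyword_truncated` and `truncate_keyword` and the length of 5000 are illustrative assumptions, not the values CirrusSearch actually uses, and `simulate_keyword_analyzer` only approximates the filter by cutting the string to a fixed number of characters.

```python
# Lucene's hard limit on the UTF-8 byte length of a single indexed term,
# as quoted in the error message above.
LUCENE_MAX_TERM_BYTES = 32766

# Hypothetical analysis settings: the analyzer name, filter name and length
# are assumptions made for illustration only.
ANALYSIS_SETTINGS = {
    "analysis": {
        "filter": {
            "truncate_keyword": {"type": "truncate", "length": 5000},
        },
        "analyzer": {
            "keyword_truncated": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["truncate_keyword"],
            },
        },
    },
}


def simulate_keyword_analyzer(value, settings=ANALYSIS_SETTINGS):
    """Approximate keyword tokenizer + truncate filter: one token, cut to `length` characters."""
    length = settings["analysis"]["filter"]["truncate_keyword"]["length"]
    return value[:length]


def is_immense(term):
    """True if the term's UTF-8 encoding exceeds the Lucene term limit."""
    return len(term.encode("utf-8")) > LUCENE_MAX_TERM_BYTES


# A synthetic URL long enough to trigger the error when indexed untruncated.
huge_url = "http://example.org/?q=" + "a" * 40000
token = simulate_keyword_analyzer(huge_url)
```

With these assumed settings the raw URL is immense while the truncated token is not. A real test would index such a document into a 5.x index and check that it is accepted.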

Related Objects

Status	Assigned	Task
Resolved	debt	T151324 [epic] System level upgrade for cirrus / elasticsearch
Resolved	• Deskana	T154501 [Epic, Q3 Goal] Upgrade search systems to Elasticsearch 5
Resolved	Gehel	T156150 Install ES 5.x to relforge100[12]
Resolved	• Deskana	T158680 Upgrade codfw to ES 5.x
Resolved	EBernhardson	T159203 Re-evaluate mapping for keywords

Event Timeline

This seems a bit of a blocker for release, although for the entire enwiki content+general there was only a single error. This may end up more common on wikis that primarily use multi-byte UTF-8 characters, but 32k is still a hell of a big url either way.
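To illustrate why the script matters, the sketch below compares the UTF-8 byte length of strings with the same number of characters. It assumes the field stores the characters unencoded (not percent-encoded); the sample strings are synthetic.

```python
# Lucene's limit applies to the UTF-8 byte length of a term, not its character count.
LUCENE_MAX_TERM_BYTES = 32766


def utf8_len(text):
    """Number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


CHARS = 12000
latin = "a" * CHARS      # 1 byte per character in UTF-8
cyrillic = "я" * CHARS   # 2 bytes per character in UTF-8
cjk = "検" * CHARS       # 3 bytes per character in UTF-8

for label, sample in (("latin", latin), ("cyrillic", cyrillic), ("cjk", cjk)):
    over = utf8_len(sample) > LUCENE_MAX_TERM_BYTES
    print(label, len(sample), "chars,", utf8_len(sample), "bytes, over limit:", over)
```

The same 12000 characters stay under the limit in Latin or Cyrillic script but exceed it in CJK, so a character-based truncation length has to leave room for the widest characters expected.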

Fixing the mapping and making sure we upgrade appropriately when 5.x loads 2.x both need to be done. I'll look into this tomorrow.

On this index I can't seem to find the definition of the keyword analyzer (which should include the truncate token filter). Is it possible that you recreated this index by copying the mapping and settings from production directly?
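A check like this can be scripted against the index settings. The sketch below assumes the usual shape of an Elasticsearch get-settings response (index name, then `settings`, `index`, `analysis`) and only recognises custom filters declared with type `truncate`. The helper name and the sample response are hypothetical.

```python
def analyzers_using_truncate(settings_response, index_name):
    """Return the names of analyzers in the index that reference a custom truncate filter."""
    analysis = (
        settings_response.get(index_name, {})
        .get("settings", {})
        .get("index", {})
        .get("analysis", {})
    )
    # Custom filters whose type is "truncate".
    truncate_filters = {
        name
        for name, cfg in analysis.get("filter", {}).items()
        if cfg.get("type") == "truncate"
    }
    found = []
    for name, cfg in analysis.get("analyzer", {}).items():
        filters = cfg.get("filter", [])
        if isinstance(filters, str):
            filters = [filters]
        if truncate_filters.intersection(filters):
            found.append(name)
    return sorted(found)


# Hypothetical response for illustration; real analyzer and filter names may differ.
sample_response = {
    "crosswiki_enwiki_general": {
        "settings": {
            "index": {
                "analysis": {
                    "filter": {
                        "truncate_keyword": {"type": "truncate", "length": "5000"},
                    },
                    "analyzer": {
                        "keyword": {
                            "type": "custom",
                            "tokenizer": "keyword",
                            "filter": ["truncate_keyword"],
                        },
                        "plain": {
                            "type": "custom",
                            "tokenizer": "standard",
                            "filter": ["lowercase"],
                        },
                    },
                }
            }
        }
    }
}

print(analyzers_using_truncate(sample_response, "crosswiki_enwiki_general"))
```

An empty result for the index would match the observation above that the truncating analyzer definition is missing.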

EBernhardson claimed this task. Feb 28 2017, 6:05 PM

EBernhardson moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

Did some tests, so basically:

Create a page with the current 2.x code: everything works fine.
Upgrade Elasticsearch from 2.x to 5.x without reindexing on 5.x: indexing fails (separate bug, we have to turn off super_detect_noop). After disabling super_detect_noop, indexing works.
Update the cirrus/elastica code to the es5 branch: indexing succeeds.
Reindex in 5.x: indexing succeeds.

The initial problem seen in relforge is strictly related to how we use a 2.x mapping there but don't (can't) load it into Elasticsearch as 2.x. So no bug here.
