
Re-evaluate mapping for keywords
Closed, ResolvedPublic


We decided to avoid the keyword type and use text for everything, truncating in the analyzer stage to keep tokens within Lucene's limits. That hasn't worked, though: importing the enwiki dumps into relforge hit an error on the external_link field for the page:

The error was:
java.lang.IllegalArgumentException: Document contains at least one immense term in field="external_link" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms.

This should be found at https://relforge1001.eqiad.wmnet:9243/crosswiki_enwiki_general/page/45641485 but the entire document was not indexed because of the above exception. We need to put together a test case and fix our mapping to handle this case.
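For the fix, the analysis chain needs a truncate filter that leaves headroom for multi-byte characters. Below is a minimal sketch of what such index settings could look like; the filter/analyzer names and the length value are illustrative assumptions, not the actual CirrusSearch config. Note that the truncate token filter's `length` parameter counts characters, not bytes, so a limit of 4096 characters stays under Lucene's 32766-byte cap even if every character takes the UTF-8 maximum of 4 bytes:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "truncate_keyword": {
          "type": "truncate",
          "length": 4096
        }
      },
      "analyzer": {
        "keyword_truncated": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["truncate_keyword"]
        }
      }
    }
  },
  "mappings": {
    "page": {
      "properties": {
        "external_link": {
          "type": "text",
          "analyzer": "keyword_truncated"
        }
      }
    }
  }
}
```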

Event Timeline

This seems like a bit of a blocker for release, although across the entire enwiki content+general there was only a single error. It may be more common on wikis that primarily use multi-byte UTF-8 scripts, but 32k is still a hell of a big URL either way.
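To make the byte-versus-character distinction concrete, here is a small illustrative Python sketch (not project code). Lucene's limit applies to the UTF-8 byte length of a term, so a URL made of non-ASCII characters hits it at half (or a quarter) the character count of an ASCII one:

```python
# Illustrative only: why a character-count truncation is not the same as
# Lucene's 32766-byte term limit once multi-byte UTF-8 is involved.
LUCENE_MAX_TERM_BYTES = 32766  # hard limit per indexed term

def utf8_len(s: str) -> int:
    """Byte length of a string encoded as UTF-8, which is what Lucene measures."""
    return len(s.encode("utf-8"))

ascii_term = "a" * 20000      # 1 byte per character in UTF-8
cyrillic_term = "я" * 20000   # 2 bytes per character in UTF-8

print(utf8_len(ascii_term))      # 20000 bytes: fits under the limit
print(utf8_len(cyrillic_term))   # 40000 bytes: same character count, over the limit
print(utf8_len(cyrillic_term) > LUCENE_MAX_TERM_BYTES)
```

So truncating to a character count only works if the count is chosen with the worst-case bytes-per-character in mind.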

Both fixing the mapping and making sure we upgrade appropriately when 5.x loads a 2.x index need to be done. I'll look into this tomorrow.

On this index I can't seem to find the definition of the keyword analyzer (which should include the truncate token filter). Is it possible that you recreated this index by copying the mappings and settings directly from production?
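One way to check: fetching the live settings (e.g. `GET crosswiki_enwiki_general/_settings`) should return an analysis section along these lines if the index was created with the intended chain (the filter name and length here are assumptions for illustration, not the production values). Its absence would confirm the index was created from copied production mappings without the accompanying analysis settings:

```json
{
  "crosswiki_enwiki_general": {
    "settings": {
      "index": {
        "analysis": {
          "filter": {
            "truncate_keyword": {
              "type": "truncate",
              "length": "4096"
            }
          }
        }
      }
    }
  }
}
```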

EBernhardson closed this task as Resolved. Edited Feb 28 2017, 6:39 PM

Did some tests, so basically:

  • Create a page with the current 2.x code: everything works fine.
  • Upgrade Elasticsearch from 2.x to 5.x without reindexing: indexing fails (separate bug; we have to turn off super_detect_noop). After disabling super_detect_noop, indexing works.
  • Update the cirrus/elastica code to the es5 branch: indexing succeeds.
  • Reindex on 5.x: indexing succeeds.

The initial problem seen in relforge is strictly related to how we use a 2.x mapping but don't (can't) load it into Elasticsearch as 2.x. So no bug here.