Evaluate problems reported by the elasticsearch migration plugin on prod search clusters
Closed, Resolved, Public

Description

Run the elasticsearch-migration plugin and determine what, if any, actions need to be taken based on the errors it reports.

Problems reported related to node configuration:

Node attributes move to attr namespace:

File Descriptors:

  • At least 65536 file descriptors must be available to Elasticsearch - We have 65535, worth fixing?
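One way to confirm the limit each node actually sees is to read `max_file_descriptors` from the nodes stats API (`GET /_nodes/stats/process`). The sketch below assumes the response has already been fetched and parsed as JSON. The helper name `nodesBelowFdLimit` and the sample node data are hypothetical.

```javascript
// Minimum number of file descriptors the migration plugin check asks for.
const REQUIRED_FDS = 65536;

// Given a parsed `GET /_nodes/stats/process` response, return the names of
// nodes whose max_file_descriptors is below the required minimum.
function nodesBelowFdLimit(stats, required = REQUIRED_FDS) {
  return Object.values(stats.nodes || {})
    .filter((node) => node.process && node.process.max_file_descriptors < required)
    .map((node) => node.name);
}

// Hypothetical sample response, trimmed to the relevant fields.
const fdSample = {
  nodes: {
    a1: { name: 'node-a', process: { max_file_descriptors: 65535 } },
    b2: { name: 'node-b', process: { max_file_descriptors: 65536 } },
  },
};

console.log(nodesBelowFdLimit(fdSample)); // [ 'node-a' ]
```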

Index settings

  • index.indexing.slowlog.threshold.index.debug can no longer be set in the config file
  • index.indexing.slowlog.threshold.index.info can no longer be set in the config file
  • index.indexing.slowlog.threshold.index.trace can no longer be set in the config file
  • index.indexing.slowlog.threshold.index.warn can no longer be set in the config file
  • index.merge.scheduler.max_thread_count can no longer be set in the config file
  • index.search.slowlog.threshold.fetch.debug can no longer be set in the config file
  • index.search.slowlog.threshold.fetch.info can no longer be set in the config file
  • index.search.slowlog.threshold.fetch.trace can no longer be set in the config file
  • index.search.slowlog.threshold.fetch.warn can no longer be set in the config file
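Since these settings can no longer live in the config file, they would need to be applied per index instead, for example through the index settings API or an index template. A minimal sketch of separating them out, assuming the old config is available as a flat key/value object. The helper `splitIndexSettings` and all setting values shown are hypothetical.

```javascript
// Split a flat config object into node-level settings (which can stay in the
// config file) and index-level settings (keys starting with "index.").
function splitIndexSettings(config) {
  const node = {};
  const index = {};
  for (const [key, value] of Object.entries(config)) {
    (key.startsWith('index.') ? index : node)[key] = value;
  }
  return { node, index };
}

// Hypothetical values, for illustration only.
const oldConfig = {
  'cluster.name': 'example',
  'index.search.slowlog.threshold.fetch.warn': '1s',
  'index.merge.scheduler.max_thread_count': 1,
};

const split = splitIndexSettings(oldConfig);
// split.index could then be used as the body of an index settings update,
// or placed in the "settings" section of an index template.
console.log(JSON.stringify(split.index, null, 2));
```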

Thread pool settings

  • threadpool.bulk.queue_size has been renamed to thread_pool.bulk.queue_size
  • threadpool.bulk.size has been renamed to thread_pool.bulk.size
  • threadpool.bulk.type has been renamed to thread_pool.bulk.type
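The rename is a pure prefix change, so it can be applied mechanically. A small sketch, assuming the settings are held as a flat key/value object. The helper `renameThreadPoolKeys` and the sample values are hypothetical.

```javascript
// Rename keys from the old "threadpool." prefix to the new "thread_pool."
// prefix, leaving every other key untouched.
function renameThreadPoolKeys(settings) {
  const OLD_PREFIX = 'threadpool.';
  const NEW_PREFIX = 'thread_pool.';
  const renamed = {};
  for (const [key, value] of Object.entries(settings)) {
    const newKey = key.startsWith(OLD_PREFIX)
      ? NEW_PREFIX + key.slice(OLD_PREFIX.length)
      : key;
    renamed[newKey] = value;
  }
  return renamed;
}

// Hypothetical values, for illustration only.
const threadPoolSample = {
  'threadpool.bulk.queue_size': 1000,
  'threadpool.bulk.size': 8,
  'cluster.name': 'example',
};

console.log(Object.keys(renameThreadPoolKeys(threadPoolSample)));
// [ 'thread_pool.bulk.queue_size', 'thread_pool.bulk.size', 'cluster.name' ]
```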

Removed settings

Renamed settings

Unknown settings

  • action.disable_shutdown will be moved to the archived namespace on upgrade
      • The shutdown API was removed without replacement, so this setting is no longer necessary. https://gerrit.wikimedia.org/r/333969
  • cluster.routing.allocation.balance.primary will be moved to the archived namespace on upgrade
      • This was deprecated in 1.3.8, and hasn't existed since 2.x. Elasticsearch is just finally getting strict about unknown configuration. Removed in https://gerrit.wikimedia.org/r/333969
  • discovery.zen.ping.multicast.enabled will be moved to the archived namespace on upgrade
      • The multicast-discovery plugin has been removed. We already use unicast, so no big deal. Does require some re-jiggering of puppet code. Removed in https://gerrit.wikimedia.org/r/333969
  • discovery.zen.ping.multicast.group will be moved to the archived namespace on upgrade
      • The multicast-discovery plugin has been removed. We already use unicast, so no big deal. Does require some re-jiggering of puppet code. Removed in https://gerrit.wikimedia.org/r/333969
  • foreground will be moved to the archived namespace on upgrade
      • The elasticsearch bootstrap process in 2.x sets this. It will just disappear in 5.x.
  • indices.cache.filter.size will be moved to the archived namespace on upgrade
      • Renamed to indices.queries.cache.size in 2.0, so this hasn't done anything for some time. Renamed in https://gerrit.wikimedia.org/r/333969 TODO: Remove instead?
  • indices.recovery.concurrent_streams will be moved to the archived namespace on upgrade
      • Deprecated in 1.x, removed in 5.x. Superseded by cluster.routing.allocation.node_concurrent_recoveries (which is already set). Removed in https://gerrit.wikimedia.org/r/333969
  • monitor.jvm.gc.ConcurrentMarkSweep.debug will be moved to the archived namespace on upgrade
  • monitor.jvm.gc.ConcurrentMarkSweep.info will be moved to the archived namespace on upgrade
  • monitor.jvm.gc.ConcurrentMarkSweep.warn will be moved to the archived namespace on upgrade
  • monitor.jvm.gc.ParNew.debug will be moved to the archived namespace on upgrade
  • monitor.jvm.gc.ParNew.info will be moved to the archived namespace on upgrade
  • monitor.jvm.gc.ParNew.warn will be moved to the archived namespace on upgrade
      • After reviewing the JvmGcMonitorService, AFAICT the above monitor.jvm.gc settings are all completely valid. Elasticsearch 5 also happily starts up without complaining when they are set. Leaving as-is.
  • profile will be moved to the archived namespace on upgrade
      • This was used for our tests of the language detection plugin, but we went in a different direction. Removed in https://gerrit.wikimedia.org/r/333969

Problems reported related to indices

  • Indices created before v2.0.0 must be reindexed with the Reindex Helper
      • Reindex is in progress: T157505

Mappings

  • default similarity renamed to classic
      • Patches merged to use 'BM25' similarity, rather than default.type = BM25. Reindexing all indices will finish the fix.
  • Geo-point parameters geohash, geohash_prefix, geohash_precision, and lat_lon no longer supported
      • Upon loading the indices into 5.x these will be ignored.
  • Completion field [titlesuggest]:suggest will not be compatible with new completion fields in 5.x
  • Completion field [titlesuggest]:suggest-geo will not be compatible with new completion fields in 5.x
  • Completion field [titlesuggest]:suggest-stop will not be compatible with new completion fields in 5.x
  • Completion field [titlesuggest]:suggest-stop-geo will not be compatible with new completion fields in 5.x
  • Completion field [titlesuggest]:suggest-subphrases will not be compatible with new completion fields in 5.x
      • The completion suggester will be disabled during the deploy. Indices will be rebuilt with es5.x before re-enabling.
  • Unknown index settings - index.cache.field.type will be moved to the archived namespace on upgrade
      • Only exists in apifeatureusage. https://gerrit.wikimedia.org/r/338469
  • New indices may not have more than 1000 fields. This index has 1763.
      • Only exists in stas_wikidata_test. Tested, and this does not block loading the 2.x index into 5. Will be resolved at T158278
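To get a rough idea of which indices are near the field limit, the fields in a mapping can be counted by walking its `properties` and multi-field `fields` entries. This is only an approximation under stated assumptions: Elasticsearch's own count may differ (for example in how it treats metadata fields), and the helper `countFields` and the sample mapping are hypothetical.

```javascript
// Approximate the number of fields in a mapping by counting every entry
// under "properties" (objects and leaf fields), every multi-field under
// "fields", and recursing into nested properties.
function countFields(mapping) {
  let count = 0;
  for (const field of Object.values(mapping.properties || {})) {
    count += 1;                                       // the field or object itself
    count += Object.keys(field.fields || {}).length;  // its multi-fields
    count += countFields(field);                      // any nested properties
  }
  return count;
}

// Hypothetical mapping, for illustration only.
const sampleMapping = {
  properties: {
    title: { type: 'text', fields: { keyword: { type: 'keyword' } } },
    labels: { properties: { en: { type: 'text' }, de: { type: 'text' } } },
  },
};

console.log(countFields(sampleMapping)); // 5
```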

Event Timeline

First order of business: the plugin makes no attempt to deduplicate errors, so we get ~3000 problem reports. This bit of JavaScript (it relies on jQuery) does the deduplication:

(function (undefined) {
  function hashCode(str) {
    var hash = 0, i, chr, len;
    if (str.length === 0) return hash;
    for (i = 0, len = str.length; i < len; i++) {
      chr   = str.charCodeAt(i);
      hash  = ((hash << 5) - hash) + chr;
      hash |= 0; // Convert to 32bit integer
    }
    return hash;
  }

  function merge($container, target_selector) {
    var errors = {},
        $targets = $container.children(target_selector);
  
    $targets.each(function (idx, el) {
      var $el = $(el),
          name = $el.children('span').text(),
          hash = hashCode($el.children('ul').html());
  
      if (errors[hash] === undefined) {
        errors[hash] = {
          indices: [name],
          content: $el.children('ul').html(),
          status: $el.attr('class')
        };
      } else {
        errors[hash].indices.push(name);
      }
    });
  
 
    $targets.remove();
 
    // Declare hash locally so it does not leak as an implicit global.
    for (var hash in errors) {
      if (!errors.hasOwnProperty(hash)) {
        continue;
      }
      $container.append($('<li>')
          .attr('class', errors[hash].status)
          .html('<span><code>' + errors[hash].indices.join('</code></span><span><code>') + '</code></span><ul>' + errors[hash].content + '</ul>')
      );
    }
  }

  function mergeIndices() {
    // We don't care about warmers, and they vary
    $( 'li.section.warmers.blue' ).remove();
    merge($('span:contains("Indices")').siblings('ul'), 'li.index:not(.status.green)');
  }

  function mergeNodes() {
    merge($('span:contains("Node Settings")').siblings('ul'), 'li.node:not(.status.green)');
  }

  mergeIndices();
  mergeNodes();
})();
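The same grouping idea can be shown without the DOM. The sketch below keys directly on the report content with a `Map` instead of a 32-bit hash, which avoids merging unrelated reports if two contents happen to hash to the same value. The function `groupByContent` and the sample reports are hypothetical and not part of the plugin.

```javascript
// Group reports that share identical content, collecting the names of the
// indices (or nodes) that produced each distinct report.
function groupByContent(reports) {
  const groups = new Map();
  for (const { name, content } of reports) {
    if (!groups.has(content)) {
      groups.set(content, []);
    }
    groups.get(content).push(name);
  }
  return Array.from(groups, ([content, indices]) => ({ content, indices }));
}

// Hypothetical reports, for illustration only.
const sampleReports = [
  { name: 'index_a', content: 'default similarity renamed to classic' },
  { name: 'index_b', content: 'default similarity renamed to classic' },
  { name: 'index_c', content: 'New indices may not have more than 1000 fields' },
];

const grouped = groupByContent(sampleReports);
console.log(grouped.length); // 2
```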

Everything on the list has in-progress work. For reference, the deduplicated list of patches/tickets that need to be resolved is:

Even after the update to use BM25 directly and reindexing, I'm still seeing a handful of wikis in codfw (the only cluster that has finished so far) referencing classic similarity. Will investigate more.

Ignore that: those wikis are supposed to refer to classic similarity, since they are the spaceless languages.

I re-reviewed the output and everything in eqiad and codfw looks good to go now. All related patches have been merged; moving to done.