Use unicast instead of multicast for node communication
Closed, Resolved (Public)

Description

This is recommended by Elastic. Quoting their guide:

https://www.elastic.co/guide/en/elasticsearch/guide/current/_important_configuration_changes.html

Elasticsearch is configured to use multicast discovery out of the box. Multicast works by sending UDP pings across your
local network to discover nodes. Other Elasticsearch nodes will receive these pings and respond. A cluster is formed
shortly after.

Multicast is excellent for development, since you don't need to do anything. Turn a few nodes on, and they automatically
find each other and form a cluster.

This ease of use is the exact reason you should disable it in production. The last thing you want is for nodes to
accidentally join your production network, simply because they received an errant multicast ping. There is nothing wrong
with multicast per se. Multicast simply leads to silly problems, and can be a bit more fragile (for example, a network
engineer fiddles with the network without telling you, and all of a sudden nodes can't find each other anymore).

This is also recommended by a lot of scaling blogs and white papers:

https://www.loggly.com/blog/nine-tips-configuring-elasticsearch-for-high-performance/

We are having some host discovery issues at the moment, and dynamic node discovery is an extra layer of complication we don't need in a setup that does not scale out dynamically.
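
For illustration, here is a minimal sketch of what the change could look like in `elasticsearch.yml`. It assumes the Elasticsearch 1.x setting names from the guide linked above (`discovery.zen.ping.multicast.enabled` and `discovery.zen.ping.unicast.hosts`) and the default transport port 9300. The helper name `unicast_discovery_settings` and the choice of seed hosts are hypothetical, not what is actually deployed.

```python
import json


def unicast_discovery_settings(hosts, port=9300):
    """Render elasticsearch.yml lines that disable multicast and list unicast seeds.

    `hosts` is a list of host names; which hosts should act as seeds is an
    assumption left to the caller. 9300 is the default transport port.
    """
    if not hosts:
        raise ValueError("at least one unicast seed host is required")
    seeds = ["%s:%d" % (host, port) for host in hosts]
    return "\n".join([
        "discovery.zen.ping.multicast.enabled: false",
        # A JSON list of strings is also a valid YAML flow sequence.
        "discovery.zen.ping.unicast.hosts: " + json.dumps(seeds),
    ])


if __name__ == "__main__":
    # Illustrative seed list only; the real list lives in the puppet config.
    print(unicast_discovery_settings([
        "elastic1001.eqiad.wmnet",
        "elastic1002.eqiad.wmnet",
    ]))
```

The seed list does not need to contain every node: a new node only has to reach one listed host to learn about the rest of the cluster.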

Event Timeline

Mentioned in SAL [2016-04-28T12:50:37Z] <gehel> restarting elasticsearch server elastic1005.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-28T13:59:26Z] <gehel> restarting elasticsearch server elastic1006.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-28T15:12:24Z] <gehel> restarting elasticsearch server elastic1007.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-28T16:15:26Z] <gehel> restarting elasticsearch server elastic1008.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-28T16:57:03Z] <gehel> restarting elasticsearch server elastic1009.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-28T18:19:15Z] <gehel> restarting elasticsearch server elastic1010.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-28T19:47:43Z] <gehel> restarting elasticsearch server elastic1011.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-28T21:08:19Z] <gehel> restarting elasticsearch server elastic1012.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-29T04:58:33Z] <gehel> restarting elasticsearch server elastic1013.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-29T05:42:50Z] <gehel> restarting elasticsearch server elastic1014.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-29T07:52:37Z] <gehel> restarting elasticsearch server elastic1015.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-29T08:20:49Z] <gehel> restarting elasticsearch server elastic1016.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-29T09:42:58Z] <gehel> restarting elasticsearch server elastic1016.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-29T09:43:03Z] <gehel> restarting elasticsearch server elastic1017.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-29T12:30:15Z] <gehel> restarting elasticsearch server elastic1018.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-29T13:09:35Z] <gehel> restarting elasticsearch server elastic1019.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-29T14:32:56Z] <gehel> restarting elasticsearch server elastic1020.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-29T15:17:08Z] <gehel> restarting elasticsearch server elastic1021.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-29T16:22:31Z] <gehel> restarting elasticsearch server elastic1022.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-29T16:56:47Z] <gehel> restarting elasticsearch server elastic1023.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-29T17:45:57Z] <gehel> restarting elasticsearch server elastic1024.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-29T19:07:24Z] <gehel> restarting elasticsearch server elastic1025.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-29T19:29:11Z] <gehel> restarting elasticsearch server elastic1026.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-29T20:59:15Z] <gehel> restarting elasticsearch server elastic1027.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-30T06:16:55Z] <gehel> restarting elasticsearch server elastic1028.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-30T06:32:46Z] <gehel> restarting elasticsearch server elastic1029.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-30T07:15:55Z] <gehel> restarting elasticsearch server elastic1030.eqiad.wmnet (T110236)

Mentioned in SAL [2016-04-30T08:28:16Z] <gehel> restarting elasticsearch server elastic1031.eqiad.wmnet (T110236)

First restart to enable unicast completed on eqiad and codfw. Second restart to come...

Mentioned in SAL [2016-05-02T09:54:29Z] <gehel> restart elasticsearch cluster to ensure multicast configuration is disabled (T110236)

Change 286410 had a related patch set uploaded (by Gehel):
Remove multicast from Elasticsearch

https://gerrit.wikimedia.org/r/286410

Change 286410 merged by Gehel:
Remove multicast from Elasticsearch

https://gerrit.wikimedia.org/r/286410

Mentioned in SAL [2016-05-02T20:21:45Z] <gehel> starting rolling restart of elasticsearch codfw cluster to disable multicast (T110236)

Mentioned in SAL [2016-05-02T20:23:37Z] <gehel> restarting elasticsearch server elastic2001.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-03T08:18:54Z] <gehel> restarting elasticsearch server elastic2002.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-03T09:11:10Z] <gehel> restarting elasticsearch server elastic2003.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-03T12:31:02Z] <gehel> restarting elasticsearch server elastic2004.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-03T13:02:26Z] <gehel> restarting elasticsearch server elastic2005.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-03T13:40:09Z] <gehel> restarting elasticsearch server elastic2006.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-03T14:36:29Z] <gehel> restarting elasticsearch server elastic2007.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-03T15:08:45Z] <gehel> restarting elasticsearch server elastic2008.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-03T15:48:46Z] <gehel> restarting elasticsearch server elastic2009.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-03T16:43:39Z] <gehel> restarting elasticsearch server elastic2010.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-03T17:13:39Z] <gehel> restarting elasticsearch server elastic2011.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-03T17:59:12Z] <gehel> restarting elasticsearch server elastic2012.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-03T19:01:18Z] <gehel> restarting elasticsearch server elastic2013.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-03T19:47:43Z] <gehel> restarting elasticsearch server elastic2014.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-04T06:20:38Z] <gehel> restarting elasticsearch server elastic2015.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-04T08:42:18Z] <gehel> restarting elasticsearch server elastic2016.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-04T10:05:11Z] <gehel> restarting elasticsearch server elastic2017.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-04T10:33:46Z] <gehel> restarting elasticsearch server elastic2018.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-04T11:03:26Z] <gehel> restarting elasticsearch server elastic2019.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-04T11:51:27Z] <gehel> restarting elasticsearch server elastic2020.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-04T12:08:39Z] <gehel> restarting elasticsearch server elastic2021.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-04T12:50:15Z] <gehel> restarting elasticsearch server elastic2022.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-04T13:24:44Z] <gehel> restarting elasticsearch server elastic2023.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-04T14:12:12Z] <gehel> restarting elasticsearch server elastic2024.codfw.wmnet (T110236)

Mentioned in SAL [2016-05-04T15:28:06Z] <gehel> restarting elasticsearch server elastic1001.eqiad.wmnet (T110236)

Mentioned in SAL [2016-05-04T16:03:04Z] <gehel> restarting elasticsearch server elastic1002.eqiad.wmnet (T110236)

Mentioned in SAL [2016-05-04T16:37:06Z] <gehel> restarting elasticsearch server elastic1003.eqiad.wmnet (T110236)

Mentioned in SAL [2016-05-04T17:14:11Z] <gehel> restarting elasticsearch server elastic1004.eqiad.wmnet (T110236)

Mentioned in SAL [2016-05-04T17:49:53Z] <gehel> restarting elasticsearch server elastic1005.eqiad.wmnet (T110236)

Mentioned in SAL [2016-05-04T18:10:32Z] <gehel> restarting elasticsearch server elastic1006.eqiad.wmnet (T110236)

Mentioned in SAL [2016-05-05T04:55:26Z] <gehel> restarting elasticsearch server elastic1007.eqiad.wmnet (T110236)

Mentioned in SAL [2016-05-05T05:52:49Z] <gehel> restarting elasticsearch server elastic1008.eqiad.wmnet (T110236)

Mentioned in SAL [2016-05-05T21:37:18Z] <gehel> restarting elasticsearch server elastic1009.eqiad.wmnet (T110236)

Mentioned in SAL [2016-05-06T05:57:01Z] <gehel> restarting elasticsearch server elastic1010.eqiad.wmnet (T110236)

Mentioned in SAL [2016-05-06T08:05:14Z] <gehel> restarting elasticsearch server elastic1011.eqiad.wmnet (T110236)

Mentioned in SAL [2016-05-06T08:36:49Z] <gehel> restarting elasticsearch server elastic1012.eqiad.wmnet (T110236)

Mentioned in SAL [2016-05-06T13:53:06Z] <gehel> restarting elasticsearch server elastic1013.eqiad.wmnet (T110236)

Mentioned in SAL [2016-05-06T14:45:58Z] <gehel> restarting elasticsearch server elastic1014.eqiad.wmnet (T110236)

Mentioned in SAL [2016-05-08T06:07:44Z] <gehel> restarting elasticsearch server elastic1015.eqiad.wmnet (T110236)

Mentioned in SAL [2016-05-08T07:44:16Z] <gehel> restarting elasticsearch server elastic1016.eqiad.wmnet (T110236)

Mentioned in SAL [2016-05-09T06:11:21Z] <gehel> restarting elasticsearch server elastic1017.eqiad.wmnet (T110236)

Mentioned in SAL [2016-05-09T07:38:36Z] <gehel> restarting elasticsearch server elastic1018.eqiad.wmnet (T110236)

Mentioned in SAL [2016-05-09T08:18:22Z] <gehel> restarting elasticsearch server elastic1019.eqiad.wmnet (T110236)

Mentioned in SAL [2016-05-09T09:27:58Z] <gehel> restarting elasticsearch server elastic1020.eqiad.wmnet (T110236), includes JDK upgrade

Mentioned in SAL [2016-05-09T10:54:51Z] <gehel> restarting elasticsearch server elastic1021.eqiad.wmnet (T110236), includes JDK upgrade

Mentioned in SAL [2016-05-09T11:17:46Z] <gehel> restarting elasticsearch server elastic1022.eqiad.wmnet (T110236), includes JDK upgrade

Mentioned in SAL [2016-05-09T12:32:19Z] <gehel> restarting elasticsearch server elastic1023.eqiad.wmnet (T110236), includes JDK upgrade

Mentioned in SAL [2016-05-09T12:53:22Z] <gehel> restarting elasticsearch server elastic1024.eqiad.wmnet (T110236), includes JDK upgrade

Mentioned in SAL [2016-05-09T13:14:27Z] <gehel> restarting elasticsearch server elastic1025.eqiad.wmnet (T110236), includes JDK upgrade

Mentioned in SAL [2016-05-09T13:37:39Z] <gehel> restarting elasticsearch server elastic1026.eqiad.wmnet (T110236), includes JDK upgrade

Mentioned in SAL [2016-05-09T14:06:56Z] <gehel> restarting elasticsearch server elastic1027.eqiad.wmnet (T110236), includes JDK upgrade

Mentioned in SAL [2016-05-09T14:32:30Z] <gehel> restarting elasticsearch server elastic1028.eqiad.wmnet (T110236), includes JDK upgrade

Mentioned in SAL [2016-05-09T15:41:56Z] <gehel> restarting elasticsearch server elastic1029.eqiad.wmnet (T110236), includes JDK upgrade

Mentioned in SAL [2016-05-09T16:12:40Z] <gehel> restarting elasticsearch server elastic1030.eqiad.wmnet (T110236), includes JDK upgrade

Mentioned in SAL [2016-05-09T16:34:23Z] <gehel> restarting elasticsearch server elastic1031.eqiad.wmnet (T110236), includes JDK upgrade

Mentioned in SAL [2016-05-09T17:05:49Z] <gehel> cluster restart completed for eqiad / codfw elasticsearch (T110236)

Mentioned in SAL [2016-05-09T20:26:18Z] <gehel> restarting logstash server logstash1001.eqiad.wmnet (T110236)upgrade

Mentioned in SAL [2016-05-09T20:29:31Z] <gehel> restarting logstash server logstash100[26].eqiad.wmnet (T110236)

All nodes restarted (including logstash), documentation updated.
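
For future rolling restarts, a possible gate between nodes is sketched below. The log above does not record how each restart was gated, so this is an assumption: it checks a parsed `_cluster/health` response and only allows the next restart once the cluster is green with the expected node count and no shards in motion. The function name `safe_to_restart_next` and the sample values are hypothetical.

```python
def safe_to_restart_next(health, expected_nodes):
    """Return True if a parsed _cluster/health response looks fully recovered.

    `health` is the JSON body of GET /_cluster/health as a dict.
    `expected_nodes` is the node count the cluster should have once the
    previously restarted node has rejoined.
    """
    return (
        health.get("status") == "green"
        and health.get("number_of_nodes") == expected_nodes
        and health.get("relocating_shards", 0) == 0
        and health.get("initializing_shards", 0) == 0
        and health.get("unassigned_shards", 0) == 0
    )


if __name__ == "__main__":
    # Sample response with made-up numbers, for illustration only.
    sample = {
        "status": "yellow",
        "number_of_nodes": 31,
        "relocating_shards": 0,
        "initializing_shards": 4,
        "unassigned_shards": 12,
    }
    print(safe_to_restart_next(sample, expected_nodes=31))
```

With unicast discovery, the node count check is the useful part: a restarted node that cannot reach any seed host will start but never rejoin, and the count will stay one short.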

debt added a subscriber: debt.

Looks like this is resolved - closing.