
rack/setup/install conf1004-conf1006
Closed, Resolved (Public)

Description

This task will track the receiving, racking and setup of three new conf hosts in eqiad, ordered on T162429.

Racking Proposal: Keep these out of the racks of the existing conf1001-1003 hosts, so not in racks A2, C7, or D8. This will increase our horizontal redundancy. Please maximize horizontal redundancy by placing one in row B, if possible.

conf1004:

  • receive in system on T162429
  • rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • bios/drac/serial setup/testing
  • mgmt dns entries added for both asset tag and hostname
  • production dns entries added (internal subnet) https://gerrit.wikimedia.org/r/#/c/363372/
  • network port setup (description, enable, internal vlan)
  • operations/puppet update (install_server at minimum, other files if possible)
  • OS installation
  • puppet/salt accept/initial run
  • handoff for service implementation

conf1005:

  • receive in system on T162429
  • rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • bios/drac/serial setup/testing
  • mgmt dns entries added for both asset tag and hostname
  • production dns entries added (internal subnet) https://gerrit.wikimedia.org/r/#/c/363372/
  • network port setup (description, enable, internal vlan)
  • operations/puppet update (install_server at minimum, other files if possible)
  • OS installation
  • puppet/salt accept/initial run
  • handoff for service implementation

conf1006:

  • receive in system on T162429
  • rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • bios/drac/serial setup/testing
  • mgmt dns entries added for both asset tag and hostname
  • production dns entries added (internal subnet) https://gerrit.wikimedia.org/r/#/c/363372/
  • network port setup (description, enable, internal vlan)
  • operations/puppet update (install_server at minimum, other files if possible)
  • OS installation
  • puppet/salt accept/initial run
  • handoff for service implementation

Event Timeline


Racking request is just that these new machines go in different rows. They can even be in the racks of the other conf* systems as those old systems will be eventually decommissioned.

Racked these in A4/B4/D4. Updated racktables w/basic info and rack location.
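
As a quick sanity check of the placement above, here is a minimal sketch that tests a list of rack labels against the two constraints discussed in this task: one host per row, and (from the original proposal only) no reuse of racks A2, C7, or D8. The function names and the assumption that the first character of a rack label is its row are illustrative and not part of any Wikimedia tooling.

```python
# Racks holding the existing conf1001-1003 hosts, per the racking proposal.
EXISTING_RACKS = {"A2", "C7", "D8"}


def rack_row(rack: str) -> str:
    """Return the row letter of a rack label such as 'B4' (assumed format)."""
    return rack[0].upper()


def check_placement(racks):
    """Return a list of problems; an empty list means the placement passes."""
    problems = []
    rows = [rack_row(r) for r in racks]
    # Constraint 1: every host sits in a different row.
    if len(set(rows)) != len(rows):
        problems.append("two or more hosts share a row")
    # Constraint 2 (original proposal; relaxed in a later comment):
    # avoid the racks of the existing conf hosts.
    reused = sorted(set(racks) & EXISTING_RACKS)
    if reused:
        problems.append("reuses existing conf racks: " + ", ".join(reused))
    return problems


print(check_placement(["A4", "B4", "D4"]))
```

Since the later comment says rack reuse is acceptable, the second check could simply be dropped; only the row check reflects the final request.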

Change 360377 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns for conf1004-6 T166081

https://gerrit.wikimedia.org/r/360377

Change 360377 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns for conf1004-6 T166081

https://gerrit.wikimedia.org/r/360377

@RobH if you have the time to get these going that would be great

irc update: these will need to be installed with jessie

Change 363372 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] conf100[456] production dns entries

https://gerrit.wikimedia.org/r/363372

Change 363372 merged by RobH:
[operations/dns@master] conf100[456] production dns entries

https://gerrit.wikimedia.org/r/363372

Change 363374 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] set install params for conf100[456]

https://gerrit.wikimedia.org/r/363374

Change 363374 merged by RobH:
[operations/puppet@production] set install params for conf100[456]

https://gerrit.wikimedia.org/r/363374

RobH removed a project: Patch-For-Review.
RobH updated the task description.

Assigned to @elukey for service implementation. (If this isn't done by you, but someone else, please assign this task to them.)

This task can be resolved once whoever is implementing services on these hosts is aware they are ready.

Just had a chat with Joe, and the approach that we'd like to follow is:

  1. expand the current conf100[123] cluster with the conf100[456] nodes
  2. verify that everything works fine
  3. decom conf100[123] one at a time

In this way all the services that leverage zk/etcd (Kafka, Pybal, Hadoop, etc..) will not need to be stopped.
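
To illustrate why this approach avoids downtime, here is a rough sketch of the quorum arithmetic, assuming the usual majority-quorum rule that ZooKeeper and etcd rely on. It only prints numbers for each cluster size along the way (growing from 3 to 6 members, then shrinking back to 3); it does not talk to any cluster, and the helper names are made up for this example.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of voting members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be down while a majority is still reachable."""
    return members - quorum(members)


# Add conf100[456] one by one, then remove conf100[123] one by one.
for size in [3, 4, 5, 6, 5, 4, 3]:
    print(f"members={size} quorum={quorum(size)} "
          f"tolerated_failures={tolerated_failures(size)}")
```

At every step the cluster can lose at least one member and still keep a majority, which is consistent with changing membership one node at a time while clients stay connected.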

Note for adding a node to etcd: https://wikitech.wikimedia.org/wiki/Etcd#Adding_a_new_member_to_the_cluster

Due to the new Kafka Jumbo cluster (and other things like the Eventlogging cleaner script) I haven't had much time to schedule or plan this work, so it may slip to the end of Q1 or the beginning of Q2.

Nuria moved this task from Radar to Incoming on the Analytics board.

Change 395863 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add conf100[456] with role spare

https://gerrit.wikimedia.org/r/395863

Change 395863 merged by Dzahn:
[operations/puppet@production] site: add conf100[456] with role spare

https://gerrit.wikimedia.org/r/395863

Given that this task has been stalled for a while now, should we reimage these servers with stretch before eventually putting them into production?

I think that we can proceed in this way:

  1. Check in labs what zookeeper version would end up in stretch. On conf100[123] we have 3.4.5+dfsg-2+deb8u2 and I believe that we should keep the same on stretch.
  2. Reimage conf100[456] to stretch
  3. Deploy a temporary role to conf100[456] containing only zookeeper profile/hiera/etc.. and expand the current conf100[123] cluster. Eventually decom conf100[123].
  4. Whenever we have time, migrate etcd to conf100[456]. This might mean deploying a new etcd 3 cluster or just expanding the current one.

  1. Check in labs what zookeeper version would end up in stretch. On conf100[123] we have 3.4.5+dfsg-2+deb8u2 and I believe that we should keep the same on stretch.

Stretch has 3.4.9-3, that should be compatible, it's the same 3.4.x series after all?

  1. Check in labs what zookeeper version would end up in stretch. On conf100[123] we have 3.4.5+dfsg-2+deb8u2 and I believe that we should keep the same on stretch.

Stretch has 3.4.9-3, that should be compatible, it's the same 3.4.x series after all?

IIRC you patched 3.4.5+dfsg-2+deb8u2 right? Are you suggesting to move zookeeperd to its Debian upstream version, rather than the cdh one? The alternative would be to build/upload 3.4.5+dfsg-2+deb8u2 to stretch-wikimedia and use that one..

Nevermind I am stupid, I confused the long version 3.4.5+cdh5.10.0+104-1.cdh5.10.0.p0.71~jessie-cdh5.10.0 with 3.4.5+dfsg-2+deb8u2, now I got it. So on conf100[123] we are not using the cdh version, so I'll proceed with my tests to see if a cluster mixed with 3.4.5+dfsg-2+deb8u2 and 3.4.9-3 can work fine.
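
The "same 3.4.x series" observation can be checked mechanically. The following is a minimal sketch that pulls the upstream major.minor out of the two Debian version strings quoted in this thread. The helper name is hypothetical, and matching series is only a hint: it does not prove that a mixed cluster works, which is why the labs test is still needed.

```python
import re


def upstream_series(debian_version: str) -> tuple:
    """Return (major, minor) of the upstream part of a Debian version string."""
    # Drop an optional epoch prefix such as "1:" if one is present.
    version = debian_version.split(":", 1)[-1]
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse version: {debian_version!r}")
    return int(match.group(1)), int(match.group(2))


jessie = "3.4.5+dfsg-2+deb8u2"   # version on conf100[123]
stretch = "3.4.9-3"              # version shipped in stretch

print(upstream_series(jessie), upstream_series(stretch))
print("same series:", upstream_series(jessie) == upstream_series(stretch))
```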

Change 410940 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] linux-host-entries: set stretch for conf100[456]

https://gerrit.wikimedia.org/r/410940

Change 410940 merged by Elukey:
[operations/puppet@production] linux-host-entries: set stretch for conf100[456]

https://gerrit.wikimedia.org/r/410940

Ack, the deb8u2 patch for jessie was for a security fix which is also fixed in the stretch version.

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['conf1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201802151440_elukey_20899.log.

Reporting just in case:

Loading Linux 4.9.0-5-amd64 ...
Loading initial ramdisk ...
[    0.078831] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)
  WARNING: Failed to connect to lvmetad. Falling back to device scanning.
/dev/md0: clean, 37269/1831424 files, 401579/7319808 blocks
[    5.635107] power_meter ACPI000D:00: Ignoring unsafe software power cap!

Debian GNU/Linux 9 conf1004 ttyS1

conf1004 login:

It doesn't seem a serious bug since everything boots fine but :)

Completed auto-reimage of hosts:

['conf1004.eqiad.wmnet']

and were ALL successful.

Change 410957 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::zookeeper::server: remove explicit java-7 dependency

https://gerrit.wikimedia.org/r/410957

In labs I've extended the one-zookeeper-node analytics project's cluster to three nodes, adding two stretch hosts. Except for the puppet issue with java 7 (https://gerrit.wikimedia.org/r/410957) I can see the cluster working fine.

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['conf1005.eqiad.wmnet', 'conf1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201802151631_elukey_7934.log.

Completed auto-reimage of hosts:

['conf1005.eqiad.wmnet', 'conf1006.eqiad.wmnet']

and were ALL successful.

Change 412744 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/zookeeper@master] Simplify zookeeper's default template to be systemd friendly

https://gerrit.wikimedia.org/r/412744

Change 410957 merged by Elukey:
[operations/puppet@production] profile::zookeeper::server: remove explicit java-7 dependency

https://gerrit.wikimedia.org/r/410957

Change 412744 merged by Elukey:
[operations/puppet/zookeeper@master] Simplify zookeeper's default template to be systemd friendly

https://gerrit.wikimedia.org/r/412744

Change 412857 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Update zookeeper's module to its latest revision

https://gerrit.wikimedia.org/r/412857

Change 412857 merged by Elukey:
[operations/puppet@production] Update zookeeper's module to its latest revision

https://gerrit.wikimedia.org/r/412857

Change 412859 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::zookeeper::server: add the correct package name for default-jdk

https://gerrit.wikimedia.org/r/412859

Change 412859 merged by Elukey:
[operations/puppet@production] profile::zookeeper::server: add the correct package name for default-jdk

https://gerrit.wikimedia.org/r/412859

Change 419358 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] etcd: add class for v3 basic installation

https://gerrit.wikimedia.org/r/419358

Change 420012 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] profile::etcd::tlsproxy: allow more configuration options

https://gerrit.wikimedia.org/r/420012

Change 420014 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] role: add configcluster_stretch

https://gerrit.wikimedia.org/r/420014

Change 420012 merged by Giuseppe Lavagetto:
[operations/puppet@production] profile::etcd::tlsproxy: allow more configuration options

https://gerrit.wikimedia.org/r/420012

Change 421814 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] conf: update the netboot recipe for conf servers with SSDs

https://gerrit.wikimedia.org/r/421814

Change 421814 merged by Giuseppe Lavagetto:
[operations/puppet@production] conf: update the netboot recipe for conf servers with SSDs

https://gerrit.wikimedia.org/r/421814

Change 421832 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/dns@master] Add the SRV record for the new etcd cluster

https://gerrit.wikimedia.org/r/421832

Change 421832 merged by Giuseppe Lavagetto:
[operations/dns@master] Add the SRV record for the new etcd cluster

https://gerrit.wikimedia.org/r/421832

Change 419358 merged by Giuseppe Lavagetto:
[operations/puppet@production] etcd: add class for v3 basic installation

https://gerrit.wikimedia.org/r/419358

Change 421856 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] site: apply configcluster_stretch to conf1004-6

https://gerrit.wikimedia.org/r/421856

Change 420014 merged by Giuseppe Lavagetto:
[operations/puppet@production] role: add configcluster_stretch

https://gerrit.wikimedia.org/r/420014

Change 421856 merged by Giuseppe Lavagetto:
[operations/puppet@production] site: apply configcluster_stretch to conf1004-6

https://gerrit.wikimedia.org/r/421856

Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts:

['conf1005.eqiad.wmnet', 'conf1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201803261352_oblivian_10356.log.

Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts:

['conf1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201803261435_oblivian_20906.log.

Change 422911 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::configcluster_stretch: add IPv6 static addresses

https://gerrit.wikimedia.org/r/422911

Before proceeding I'd wait for @Joe's confirmation. I'd like to:

  1. add static IPv6 addresses to conf100[456] with https://gerrit.wikimedia.org/r/422911
  2. add those as AAAA records in ops/dns
  3. update the zookeeper's network constants with new ipv4 and new ipv6 addresses

Change 422911 merged by Elukey:
[operations/puppet@production] role::configcluster_stretch: add IPv6 static addresses

https://gerrit.wikimedia.org/r/422911

Change 425292 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Add AAAA and PTR records for conf100[456]

https://gerrit.wikimedia.org/r/425292

Change 425292 merged by Elukey:
[operations/dns@master] Add AAAA and PTR records for conf100[456]

https://gerrit.wikimedia.org/r/425292
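
For readers unfamiliar with how an AAAA record pairs with its PTR record, here is a small sketch using Python's standard `ipaddress` module to derive the reverse-DNS name for an IPv6 address. The address below is from the 2001:db8::/32 documentation range and is purely a placeholder, not one of the real conf100[456] addresses.

```python
import ipaddress


def ptr_name(address: str) -> str:
    """Return the reverse-DNS (PTR) owner name for an IPv4 or IPv6 address."""
    return ipaddress.ip_address(address).reverse_pointer


# Placeholder address from the IPv6 documentation prefix (assumption).
example = "2001:db8::1004"
print(ptr_name(example))
```

The resulting name ends in `ip6.arpa` and spells out all 32 nibbles of the address in reverse order, which is the owner name a PTR record for that address would use.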

Zookeeper has been moved out as part of https://phabricator.wikimedia.org/T182924, so only etcd remains. Removing myself from the task since we'd need to figure out the next steps first, but I am willing to work on etcd too if needed!