rack/setup/deploy conf200[123]
Closed, Resolved · Public · 5 Story Points

Description

@Papaul:

The systems for this task have been ordered on procurement task T130080. Once they arrive on-site, they should NOT be racked in the same rack. Presently, racks a5, c5, and d5 are the misc services racks and have the most space, so please rack one system in each of them.

conf2001

  • - receive in normally via T130080
  • - rack in a5-codfw
  • - add mgmt dns entries for both asset tag and hostname
  • - add production dns entries (private vlan)
  • - create sub-task with network port info for setup (can include all three conf200[123] hosts on sub-task)
  • - update install_server module (use raid1, lvm, ext4, srv settings)
  • - install OS - Jessie
  • - service implementation (hand off to @Ottomata for this as initial requestor on T121882)

conf2002

  • - receive in normally via T130080
  • - rack in c5-codfw
  • - add mgmt dns entries for both asset tag and hostname
  • - add production dns entries (private vlan)
  • - create sub-task with network port info for setup (can include all three conf200[123] hosts on sub-task)
  • - update install_server module (use raid1, lvm, ext4, srv settings)
  • - install OS - Jessie
  • - service implementation (hand off to @Ottomata for this as initial requestor on T121882)

conf2003

  • - receive in normally via T130080
  • - rack in d5-codfw
  • - add mgmt dns entries for both asset tag and hostname
  • - add production dns entries (private vlan)
  • - create sub-task with network port info for setup (can include all three conf200[123] hosts on sub-task)
  • - update install_server module (use raid1, lvm, ext4, srv settings)
  • - install OS - Jessie
  • - service implementation (hand off to @Ottomata for this as initial requestor on T121882)
RobH created this task. Apr 6 2016, 5:55 PM
Restricted Application added a project: Operations. Apr 6 2016, 5:55 PM
Restricted Application added a subscriber: Aklapper.
RobH added a subtask: Unknown Object (Task). Apr 6 2016, 5:55 PM
RobH triaged this task as "Normal" priority.
RobH changed the title from "rack conf100[123]" to "rack/setup/deploy conf100[123]".
Southparkfan changed the title from "rack/setup/deploy conf100[123]" to "rack/setup/deploy conf200[123]". Apr 8 2016, 7:49 PM
Papaul closed subtask Unknown Object (Task) as "Resolved". Apr 15 2016, 3:56 PM
Papaul reopened subtask Unknown Object (Task) as "Open". Apr 15 2016, 3:59 PM
Papaul edited the task description. Apr 15 2016, 5:30 PM
mark closed subtask Unknown Object (Task) as "Resolved". Apr 18 2016, 10:12 AM
Papaul edited the task description. Apr 18 2016, 5:40 PM
Papaul edited the task description. Apr 18 2016, 6:20 PM
Papaul edited the task description. Apr 18 2016, 6:33 PM
Papaul edited the task description. Apr 18 2016, 9:48 PM
Papaul reassigned this task from Papaul to Ottomata.

Setup and installation complete

Ottomata reassigned this task from Ottomata to elukey. Apr 19 2016, 1:32 PM
Restricted Application added a subscriber: TerraCodes. Apr 19 2016, 1:32 PM
Joe added a subscriber: Joe. Apr 19 2016, 1:33 PM

@elukey don't install etcd on these machines for now; we need to come up with a good plan for that first.

Change 285393 had a related patch set uploaded (by Elukey):
Add conf200[123] Zookeeper Service nodes in codfw.

https://gerrit.wikimedia.org/r/285393

Restricted Application added a subscriber: Southparkfan. Apr 26 2016, 1:31 PM

I submitted a code review reflecting my understanding of the change, but I gave it a -1 since I have doubts. In my opinion the cleanest way to proceed would be to move the zookeeper variables in hiera from common to eqiad/codfw-specific config files, but I think this might get us into trouble during a DC switch. The change would work for the Kafka main-{eqiad,codfw} clusters, since they will be replicated completely, but it might affect outstanding connections and values stored in ZK for the analytics hadoop and kafka clusters (which won't be replicated in codfw). If my concerns are valid, we could set a new hiera config value called zookeeper_host_main_clusters (or something similar) in both the eqiad and codfw yaml files, leaving zookeeper_host untouched in common.yaml.

Thoughts?
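
For illustration only, the proposal above (a per-site zookeeper_host_main_clusters key next to an untouched zookeeper_host in common) can be modelled as a layered lookup in which the site-specific file is consulted before common. This is a simplified sketch, not hiera itself: the lookup function, the dictionaries standing in for the yaml files, and the eqiad hostname are all hypothetical placeholders. Only the two key names and the conf200[123] codfw hosts come from this task.

```python
from typing import Dict, List

# Hypothetical stand-in for common.yaml; the eqiad hostname is a placeholder.
COMMON: Dict[str, List[str]] = {
    "zookeeper_host": ["zk-placeholder.eqiad.example"],
}

# Hypothetical stand-ins for the eqiad and codfw yaml files.
PER_SITE: Dict[str, Dict[str, List[str]]] = {
    "eqiad": {
        "zookeeper_host_main_clusters": ["zk-placeholder.eqiad.example"],
    },
    "codfw": {
        "zookeeper_host_main_clusters": [
            "conf2001.codfw.wmnet",
            "conf2002.codfw.wmnet",
            "conf2003.codfw.wmnet",
        ],
    },
}


def lookup(key: str, site: str) -> List[str]:
    """Return the value for key, preferring the site layer over common."""
    for layer in (PER_SITE.get(site, {}), COMMON):
        if key in layer:
            return layer[key]
    raise KeyError(key)


if __name__ == "__main__":
    for site in ("eqiad", "codfw"):
        print(site, lookup("zookeeper_host", site))
        print(site, lookup("zookeeper_host_main_clusters", site))
```

In this model, anything that reads zookeeper_host gets the same value in both sites, while only consumers that opt in to the new key see a site-local cluster.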

Naw, this won’t matter. The info in the 2 zookeeper clusters is totally
independent. Everything in eqiad will only talk to the eqiad zookeeper
cluster. Same goes for codfw. You have done the right thing! I commented
some on the gerrit patch.

Change 285393 merged by Elukey:
Add conf200[123] Zookeeper Service nodes in codfw.

https://gerrit.wikimedia.org/r/285393

elukey@conf2001:~$ /usr/share/zookeeper/bin/zkServer.sh status
JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
Mode: follower

elukey@conf2002:~$ /usr/share/zookeeper/bin/zkServer.sh status
JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
Mode: leader

elukey@conf2003:~$ /usr/share/zookeeper/bin/zkServer.sh status
JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
Mode: follower
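
The manual check above (one leader, the rest followers) could be automated roughly as follows. This is a minimal sketch that only parses text shaped like the zkServer.sh status output pasted in this task. The helper names parse_mode and ensemble_looks_healthy are hypothetical, and collecting the output from each host is left out.

```python
import re
from typing import Dict, Optional


def parse_mode(status_output: str) -> Optional[str]:
    """Return the value of the 'Mode:' line, or None if there is none."""
    match = re.search(r"^Mode:\s*(\S+)", status_output, flags=re.MULTILINE)
    return match.group(1) if match else None


def ensemble_looks_healthy(modes: Dict[str, Optional[str]]) -> bool:
    """True when every node is a leader or follower and exactly one leads."""
    values = list(modes.values())
    all_known = all(v in ("leader", "follower") for v in values)
    return all_known and values.count("leader") == 1


# Template matching the output shown in this task.
SAMPLE = (
    "JMX enabled by default\n"
    "Using config: /etc/zookeeper/conf/zoo.cfg\n"
    "Mode: {mode}\n"
)

if __name__ == "__main__":
    modes = {
        "conf2001": parse_mode(SAMPLE.format(mode="follower")),
        "conf2002": parse_mode(SAMPLE.format(mode="leader")),
        "conf2003": parse_mode(SAMPLE.format(mode="follower")),
    }
    print(modes, ensemble_looks_healthy(modes))
```

A node that prints no Mode line at all (for example because the server is not running) yields None here, so the check fails instead of silently passing.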

@Ottomata: if you want to check before closing, the zk nodes should be ready :)

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.
JAllemandou set the point value for this task to 5. Apr 28 2016, 4:30 PM
elukey moved this task from In Progress to Done on the Analytics-Kanban board. Apr 29 2016, 1:02 PM
elukey closed this task as "Resolved". May 3 2016, 8:08 AM
Dzahn added a subscriber: Dzahn. Edited · Mar 4 2017, 12:35 AM

Icinga said "CRITICAL - degraded: The system is operational but one or more units failed." on conf2002.codfw.wmnet

Looking at the check_command for that, I saw it parses the output of /bin/systemctl. Running that manually showed which service had failed:

service "etcdmirror-conftool-eqiad-wmnet.service not-found failed failed etcdmirror-conftool-eqiad-wmnet.service" on node
Dzahn added a comment. Mar 4 2017, 12:40 AM

This is running: etcdmirror--eqiad-wmnet.service loaded active running Etcd mirrormaker
But this is not found: etcdmirror-conftool-eqiad-wmnet.service not-found failed failed etcdmirror-conftool-eqiad-wmnet.service
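
To spot this situation without reading the full listing, the unit lines can be filtered for the combination seen here: LOAD is not-found and ACTIVE is failed. This is a rough sketch that assumes the whitespace-separated UNIT / LOAD / ACTIVE / SUB / DESCRIPTION column layout of the two lines quoted above. The function name is hypothetical and the input is plain text, so nothing here calls systemctl itself.

```python
from typing import List


def find_missing_failed_units(systemctl_output: str) -> List[str]:
    """Return unit names whose LOAD column is not-found and ACTIVE is failed."""
    missing = []
    for line in systemctl_output.splitlines():
        fields = line.split()
        # Some systemd versions prefix failed units with a marker; drop it.
        if fields and fields[0] in ("●", "*"):
            fields = fields[1:]
        if len(fields) >= 4 and fields[1] == "not-found" and fields[2] == "failed":
            missing.append(fields[0])
    return missing


# The two unit lines quoted in the comment above.
SAMPLE = (
    "etcdmirror--eqiad-wmnet.service loaded active running Etcd mirrormaker\n"
    "etcdmirror-conftool-eqiad-wmnet.service not-found failed failed "
    "etcdmirror-conftool-eqiad-wmnet.service\n"
)

if __name__ == "__main__":
    print(find_missing_failed_units(SAMPLE))
```

Units reported this way no longer have a unit file on disk, so restarting them cannot clear the alert.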

Mentioned in SAL (#wikimedia-operations) [2017-03-04T00:52:25Z] <mutante> conf2002 - ran "systemctl reset-failed" to fix Icinga alert about broken systemd state due to formerly existing but failed service etcdmirror-eqiad-wmnet. turns out you need this to remove missing units. found on http://serverfault.com/questions/606520/how-to-remove-missing-systemd-units (T131959)

Dzahn added a comment. Mar 4 2017, 12:54 AM

The fix was to run "systemctl reset-failed", which clears the failed state and gets rid of the removed and now missing unit.

< icinga-wm> RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational