
rack/setup/deploy conf200[123]
Closed, Resolved · Public · 5 Estimated Story Points

Description

@Papaul:

The systems for this task were ordered on procurement task T130080. Once they arrive on-site, they should NOT all be racked in the same rack. Racks a5, c5, and d5 are currently misc services racks and have the most free space, so please rack one system in each of them.

conf2001

  • receive in normally via T130080
  • rack in a5-codfw
  • add mgmt dns entries for both asset tag and hostname (see the DNS spot-check sketch after this list)
  • add production dns entries (private vlan)
  • create sub-task with network port info for setup (can include all three conf200[123] hosts on the sub-task)
  • update install_server module (use raid1, lvm, ext4, srv settings)
  • install OS (Debian Jessie)
  • service implementation (hand off to @Ottomata, the initial requestor on T121882)
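
As a concrete illustration of the DNS steps, the new records can be spot-checked once the zone changes are merged. A minimal sketch, assuming the record names follow the naming scheme above (the asset-tag record is omitted here because the tag is only known at receiving):

dig +short conf2001.mgmt.codfw.wmnet   # mgmt entry for the hostname
dig +short conf2001.codfw.wmnet        # production entry on the private vlan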

conf2002

  • receive in normally via T130080
  • rack in c5-codfw
  • add mgmt dns entries for both asset tag and hostname
  • add production dns entries (private vlan)
  • create sub-task with network port info for setup (can include all three conf200[123] hosts on the sub-task)
  • update install_server module (use raid1, lvm, ext4, srv settings)
  • install OS (Debian Jessie)
  • service implementation (hand off to @Ottomata, the initial requestor on T121882)

conf2003

  • receive in normally via T130080
  • rack in d5-codfw
  • add mgmt dns entries for both asset tag and hostname
  • add production dns entries (private vlan)
  • create sub-task with network port info for setup (can include all three conf200[123] hosts on the sub-task)
  • update install_server module (use raid1, lvm, ext4, srv settings)
  • install OS (Debian Jessie)
  • service implementation (hand off to @Ottomata, the initial requestor on T121882)

Event Timeline

Restricted Application added a subscriber: Aklapper.
RobH renamed this task from rack conf100[123] to rack/setup/deploy conf100[123]. Apr 6 2016, 5:55 PM
RobH triaged this task as Medium priority.
RobH added a subtask: Unknown Object (Task).
Southparkfan renamed this task from rack/setup/deploy conf100[123] to rack/setup/deploy conf200[123]. Apr 8 2016, 7:49 PM
Papaul closed subtask Unknown Object (Task) as Resolved. Apr 15 2016, 3:56 PM
Papaul reopened subtask Unknown Object (Task) as Open. Apr 15 2016, 3:59 PM
mark closed subtask Unknown Object (Task) as Resolved. Apr 18 2016, 10:12 AM
Papaul updated the task description.

Setup and installation complete

@elukey: don't install etcd on these machines for now; we need to come up with a good plan for that first.

Change 285393 had a related patch set uploaded (by Elukey):
Add conf200[123] Zookeeper Service nodes in codfw.

https://gerrit.wikimedia.org/r/285393

I submitted a code review with my understanding of the change, but -1ed it since I still have doubts. The cleanest way to proceed, in my opinion, would be to move the zookeeper variables in hiera from common to eqiad/codfw-specific config files, but that approach might get us into trouble during a DC switchover. The change would work for the Kafka main-{eqiad,codfw} clusters, since they will be replicated completely, but it might affect outstanding connections and values stored in ZK for the analytics Hadoop and Kafka clusters (which won't be replicated in codfw). If these concerns are valid, we could instead set a new hiera config value called zookeeper_host_main_clusters (or something similar) in both the eqiad and codfw yaml files, leaving zookeeper_host untouched in common.yaml; a sketch follows.
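
A minimal sketch of that hiera layout, using the zookeeper_host_main_clusters name from above; the file paths and host-to-id mappings are illustrative assumptions, not the contents of change 285393:

# hieradata/common.yaml -- untouched; everything keeps resolving
# zookeeper_host to the eqiad cluster (ids are placeholders):
zookeeper_host:
  conf1001.eqiad.wmnet: "1"
  conf1002.eqiad.wmnet: "2"
  conf1003.eqiad.wmnet: "3"

# hieradata/codfw.yaml -- new key scoped to the replicated main clusters
# (eqiad.yaml would carry the same key pointing at conf100[123]):
zookeeper_host_main_clusters:
  conf2001.codfw.wmnet: "1"
  conf2002.codfw.wmnet: "2"
  conf2003.codfw.wmnet: "3"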

Thoughts?

Naw, this won't matter. The info in the 2 zookeeper clusters is totally independent. Everything in eqiad will only talk to the eqiad zookeeper cluster, and the same goes for codfw. You have done the right thing! I commented some on the gerrit patch.

Change 285393 merged by Elukey:
Add conf200[123] Zookeeper Service nodes in codfw.

https://gerrit.wikimedia.org/r/285393

elukey@conf2001:~$ /usr/share/zookeeper/bin/zkServer.sh status
JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
Mode: follower

elukey@conf2002:~$ /usr/share/zookeeper/bin/zkServer.sh status
JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
Mode: leader

elukey@conf2003:~$ /usr/share/zookeeper/bin/zkServer.sh status
JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
Mode: follower
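
With three nodes the expected quorum shape is one leader and two followers, which matches the output above. As an extra sanity check beyond zkServer.sh, ZooKeeper's four-letter-word commands can be queried on the client port; a minimal sketch, assuming the default client port 2181 and that nc is available on the hosts:

echo ruok | nc localhost 2181   # prints "imok" if the server is running
echo stat | nc localhost 2181   # prints version, mode (leader/follower) and connections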

@Ottomata: if you want to check before closing, the zk nodes should be ready :)

JAllemandou set the point value for this task to 5. Apr 28 2016, 4:30 PM

Icinga said "CRITICAL - degraded: The system is operational but one or more units failed." on conf2002.codfw.wmnet

Looking at the check_command for that, I saw that it parses the output of /bin/systemctl. Running that manually gave details on which service had failed:

etcdmirror-conftool-eqiad-wmnet.service   not-found   failed   failed   etcdmirror-conftool-eqiad-wmnet.service

This is running: etcdmirror--eqiad-wmnet.service loaded active running Etcd mirrormaker
But this is not found: etcdmirror-conftool-eqiad-wmnet.service not-found failed failed etcdmirror-conftool-eqiad-wmnet.service

Mentioned in SAL (#wikimedia-operations) [2017-03-04T00:52:25Z] <mutante> conf2002 - ran "systemctl reset-failed" to fix Icinga alert about broken systemd state due to formerly existing but failed service etcdmirror-eqiad-wmnet. turns out you need this to remove missing units. found on http://serverfault.com/questions/606520/how-to-remove-missing-systemd-units (T131959)

The fix was systemctl reset-failed, which gets rid of the failed state left behind by the removed, now-missing unit; a short recap follows.
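
A recap of the diagnosis and fix as shell commands (the unit name is taken from the output above; run as root or via sudo):

systemctl --state=failed                                         # list units stuck in the failed state
systemctl reset-failed etcdmirror-conftool-eqiad-wmnet.service   # drop the stale entry for the removed unit
systemctl --state=failed                                         # the unit should no longer be listed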

< icinga-wm> RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational