
Setup poolcounter servers for codfw
Closed, Resolved (Public)

Description

procure and install 2 spare servers as poolcounters in codfw

System Deployment Steps:

  • - system mgmt setup and tested - T93272 & T93284 @Papaul
  • - system dns setup (both mgmt and production entries) - internal vlans/ips - @RobH handled via patchset 198108
  • - network switch setup (port description & vlan) - @RobH to handle, dependent on T93272 & T93284
  • - install-server module updated (dhcp and netboot/partitioning) @Dzahn gerrit:198250, gerrit:198245
  • - install OS done. subra and suhail both up. @Dzahn
  • - accept/sign puppet/salt keys @Dzahn
  • - service implementation - poolcounter running @Dzahn

Event Timeline

Dzahn raised the priority of this task from to Needs Triage.
Dzahn updated the task description.
Dzahn added a project: acl*sre-team.
Dzahn added a subscriber: Dzahn.

the current poolcounter servers in eqiad are potassium (WMF3287) and helium (WMF3137). the new ones should be similar. helium is a Dell PowerEdge R310, potassium is unlabeled

Dzahn renamed this task from poolcounter servers for codfw to Setup poolcounter servers for codfw. Mar 19 2015, 8:54 PM
Dzahn set Security to None.

ticket from the past to setup the ones in eqiad: https://rt.wikimedia.org/Ticket/Display.html?id=3407

so, the requirements for these are:

  • they should be in different racks
  • they don't need public IPs; a private vlan is enough
  • hardware should be like "Dell PowerEdge R310, Single Intel Xeon X3450, 8GB Memory (2) 500GB 3.5 SATA", and RobH says he has spares like that in codfw
RobH triaged this task as Medium priority. Mar 19 2015, 10:16 PM
RobH updated the task description.
RobH updated the task description.
RobH added a subscriber: RobH.

Change 198245 had a related patch set uploaded (by Dzahn):
add hosts subra and suhail to DHCP (poolcounter)

https://gerrit.wikimedia.org/r/198245

Change 198245 merged by Dzahn:
add hosts subra and suhail to DHCP (poolcounter)

https://gerrit.wikimedia.org/r/198245

Change 198250 had a related patch set uploaded (by Dzahn):
add subra,suhail to netboot. partman/raid1-1part

https://gerrit.wikimedia.org/r/198250

Change 198250 merged by Dzahn:
add subra,suhail to netboot. partman/raid1-1part

https://gerrit.wikimedia.org/r/198250

Dzahn updated the task description.
Dzahn added a subscriber: Papaul.

on subra:

PXE-E61: Media test failure, check cable
PXE-M0F: Exiting Broadcom PXE ROM.

No boot device available.

on suhail:

no DHCP offers...

on suhail: carbon never receives any requests from suhail's MAC. suhail then fails with "no offers".

is the VLAN config on the switch correct?

RobH updated the task description.

I just finished setting up the network ports, so I'm now assigning this task to @Dzahn for install.

thanks. retried. confirmed, both subra and suhail now talk to carbon unlike before.

But now:

Mar 20 18:55:39 carbon dhcpd: DHCPDISCOVER from d4:ae:52:ad:62:75 via 10.192.16.2: network 10.192.16.0/22: no free leases

Mar 20 18:53:51 carbon dhcpd: DHCPDISCOVER from 90:b1:1c:00:ae:28 via 10.192.0.3: network 10.192.0.0/22: no free leases
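For triage, log lines like the two above can be pulled apart to see which client MAC arrived via which relay and which subnet dhcpd matched it to. This is only a minimal sketch, assuming the message format is exactly as quoted in this comment; the function name `parse_no_free_leases` and the returned keys are hypothetical and not part of any Wikimedia tooling.

```python
import ipaddress
import re

# Matches the "no free leases" DHCPDISCOVER lines quoted in this comment.
# Assumption: the format is exactly as shown; other dhcpd messages are ignored.
LINE_RE = re.compile(
    r"DHCPDISCOVER from (?P<mac>[0-9a-f:]{17}) via (?P<relay>[\d.]+): "
    r"network (?P<network>[\d./]+): no free leases"
)


def parse_no_free_leases(line):
    """Return MAC, relay and network from one log line, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    relay = ipaddress.ip_address(m.group("relay"))
    network = ipaddress.ip_network(m.group("network"))
    return {
        "mac": m.group("mac"),
        "relay": str(relay),
        "network": str(network),
        # True when the relay address lies inside the subnet dhcpd picked.
        "relay_in_network": relay in network,
    }
```

Feeding it both lines shows that each host was matched to a different /22, which is the kind of detail worth comparing against the DHCP and DNS entries for subra and suhail.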

Change 198286 had a related patch set uploaded (by RobH):
My initial patch had subra/suhail rows reversed from reality

https://gerrit.wikimedia.org/r/198286

Change 198286 merged by RobH:
My initial patch had subra/suhail rows reversed from reality

https://gerrit.wikimedia.org/r/198286

suhail works now and boots into the installer, the next problem it faces is the partman recipe:

┌────────────────────┤ [!!] Partition disks ├─────────────────────┐
│                                                                 │
│                   Error while setting up RAID                   │
│ An unexpected error occurred while setting up a preseeded RAID  │
│ configuration.                                                  │

subra still gets the "no free leases"

/lib/partman/init.d/25md-devices: *******************************************************
/lib/partman/init.d/30parted: *******************************************************
parted_server: ======= Starting the server
parted_server: main_loop: iteration 1
parted_server: Opening infifo
parted_server: Read command: PARTITIONS
parted_server: The device =dev=sda is not opened.
parted_server: Line 1420. CRITICAL ERROR!!!  EXITING.
/lib/partman/init.d/30parted: IN: OPEN =dev=sda /dev/sda

Change 198295 had a related patch set uploaded (by Dzahn):
netboot: raid1-lvm partman recipe for subra/suhail

https://gerrit.wikimedia.org/r/198295

Change 198295 merged by Dzahn:
netboot: raid1-lvm partman recipe for subra/suhail

https://gerrit.wikimedia.org/r/198295

Tried different partman recipes because they failed with different errors.

https://gerrit.wikimedia.org/r/#/c/198304/ is the one that should have worked: per git blame, and after looking for the old hostname(s) these machines had before, this recipe was used on the identical hardware.

on subra now:

┌──────────────────────────┤ [!] Detect disks ├───────────────────────────┐
│                                                                         │
│ No disk drive was detected. If you know the name of the driver needed   │
│ by your disk drive, you can select it from the list.                    │

:/

Change 198437 had a related patch set uploaded (by Dzahn):
add subra/suhail to site.pp as codfw poolcounters

https://gerrit.wikimedia.org/r/198437

update: suhail now works. OS install finished.

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=suhail

just subra has the issue detecting disks.

Change 198437 merged by Dzahn:
add subra/suhail to site.pp as codfw poolcounters

https://gerrit.wikimedia.org/r/198437

applied poolcounter role on suhail:

Notice: /Stage[main]/Poolcounter/Package[poolcounter]/ensure: ensure changed 'purged' to 'present'

(yay, package is here for trusty)

ii  poolcounter                         1.0.3                            amd64        Network resource manager by Platonides for Wikimedia.

poolcou+ 19590  0.0  0.0  12956  1576 ?        S    00:08   0:00 /usr/bin/poolcounterd

Change 198440 had a related patch set uploaded (by Dzahn):
have base::firewall on codfw poolcounters

https://gerrit.wikimedia.org/r/198440

Change 198442 had a related patch set uploaded (by Dzahn):
add ferm service for poolcounterd

https://gerrit.wikimedia.org/r/198442

monitoring now also includes check for process running and TCP connect to 7531 working
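The TCP part of that check can be reproduced by hand. This is a minimal sketch of a connect test against port 7531, not the actual Icinga check; the function name `tcp_port_open` and the default timeout are assumptions made for illustration.

```python
import socket


def tcp_port_open(host, port=7531, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Calling `tcp_port_open("suhail.codfw.wmnet")` from a host inside the allowed networks would be one way to use it; that fully qualified name is a guess and does not appear in this task.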

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=suhail

patches pending to add firewalling

Change 198442 merged by Dzahn:
add ferm service for poolcounterd

https://gerrit.wikimedia.org/r/198442

Change 198440 merged by Dzahn:
have base::firewall on codfw poolcounters

https://gerrit.wikimedia.org/r/198440

subra is also up now.

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=subra

poolcounter - OK
Poolcounter connection - TCP OK - 0.043 second response time on port 7531

[subra:~] $ ps aux | grep pool
poolcou+ 7359 0.0 0.0 12956 1580 ? S 05:03 0:00 /usr/bin/poolcounterd

also has firewalling and accepts from all our own networks:

sudo iptables -L | grep 7531

@Dzahn I'll prepare and submit the mediawiki config change to use those two servers in codfw, thanks!

Change 200608 had a related patch set uploaded (by RobH):
subra/suhail need to be raided

https://gerrit.wikimedia.org/r/200608

Change 200608 merged by RobH:
subra/suhail need to be raided

https://gerrit.wikimedia.org/r/200608

unfortunately we have to install them one more time because we didn't get RAID set up, just LVM without RAID.

reinstalled. both are up again, new puppet certs etc. and have a software RAID now. /dev/md0
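To avoid repeating the "LVM without RAID" surprise, the presence of the md array can be confirmed by reading /proc/mdstat after install. This is a simplified sketch that assumes the usual `mdN : active raidX member[n] ...` line format; the function name `parse_mdstat` is hypothetical, and states such as "active (auto-read-only)" or inactive arrays are deliberately not handled.

```python
import re

# Assumed line shape: "md0 : active raid1 sdb1[1] sda1[0]"
ARRAY_RE = re.compile(r"^(md\d+) : (\w+) (raid\d+|linear) (.+)$")


def parse_mdstat(text):
    """Map each md device name to its state, RAID level and member devices."""
    arrays = {}
    for line in text.splitlines():
        m = ARRAY_RE.match(line)
        if not m:
            continue
        name, state, level, members = m.groups()
        arrays[name] = {
            "state": state,
            "level": level,
            # Strip the "[n]" role index and any trailing flags like "(F)".
            "members": [re.sub(r"\[\d+\].*$", "", dev) for dev in members.split()],
        }
    return arrays

# Usage on a real host (not executed here):
#   with open("/proc/mdstat") as f:
#       print(parse_mdstat(f.read()).get("md0"))
```

An empty result, or a missing `md0` key, would indicate that the host came up without software RAID.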