
Decommission osmium.eqiad.wmnet
Closed, ResolvedPublic

Description

Follows-up: T158837: Consolidate performance website and related software

osmium is no longer used by the Performance Team. We're currently working on refreshing our hardware. Part of that is consolidating and moving around most of our services. While our plans do include moving some services to a new host, it probably isn't worth reclaiming this server given its age. (Also, our current draft plan is aiming to use VMs, so we wouldn't need osmium anyway.)


Checklist

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp (replace with role(spare::system) if system isn't shut down immediately during this process.)

START NON-INTERRUPTIBLE STEPS

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update racktables with result
  • - switch port configuration removed from switch once system is unracked.
  • - mgmt dns entries removed.

Event Timeline

Krinkle created this task.Sep 5 2017, 10:49 PM
Krinkle updated the task description.
Krinkle edited projects, added Performance-Team (Radar); removed Performance-Team.
Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.

Change 376151 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] jsbench: Prep osmium for decom and remove 've' and 'jsbench' roles

https://gerrit.wikimedia.org/r/376151

Krinkle moved this task from Radar to Inbox on the Performance-Team board.Sep 6 2017, 5:25 PM
Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).
Dzahn claimed this task.Sep 6 2017, 6:12 PM

Change 376151 merged by Dzahn:
[operations/puppet@production] jsbench: Prep osmium for decom and remove 've' and 'jsbench' roles

https://gerrit.wikimedia.org/r/376151

Dzahn updated the task description.Sep 7 2017, 12:06 AM

It is using role spare::system now; the only remnants are partman/DHCP, which should usually stay until the end.

Giving the task back to the pool to continue with the uninterruptible steps (see checkboxes); I can't do them due to lack of switch access.

Dzahn removed Dzahn as the assignee of this task.Sep 7 2017, 12:07 AM
Dzahn added a project: ops-eqiad.
Dzahn added a subscriber: RobH.
Dzahn added a subscriber: Dzahn.

HW warranty expiration: 2017-03-23

https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=2160

^ so i assume this is final decom? or misc spares out of warranty? @RobH

Cmjohnson moved this task from Backlog to Decommission on the ops-eqiad board.Sep 7 2017, 4:01 PM
Krinkle moved this task from Inbox to Radar on the Performance-Team board.Sep 12 2017, 7:55 AM
Krinkle edited projects, added Performance-Team (Radar); removed Performance-Team.
Dzahn triaged this task as Medium priority.Sep 15 2017, 8:28 PM
Krinkle removed a project: Patch-For-Review.
Krinkle removed a subscriber: Krinkle.Sep 22 2017, 6:08 PM

Is it too late for us to run something on that machine before it's fully decommissioned? We've been trying something in labs for T176361 and we're at the point where we're wondering if bare metal would be better for what we're trying to achieve.

If the experiment works out, we might still want to decommission osmium and request new hardware instead. But we would like a spare bare metal machine to run something for a couple of weeks and @fgiunchedi suggested osmium, since it was our team's.

AFAICT the machine is online with spare::system role applied. Once a role giving performance team access is applied it should be usable again. (cc @RobH @Dzahn )

Dzahn changed the task status from Open to Stalled.Oct 20 2017, 4:16 PM

setting ticket to stalled so it doesn't get fully decom'ed/wiped yet.

Change 385399 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] osmium: tmp restore access for perf-roots admins

https://gerrit.wikimedia.org/r/385399

Change 385399 merged by Dzahn:
[operations/puppet@production] osmium: tmp restore access for perf-roots admins

https://gerrit.wikimedia.org/r/385399

Dzahn added a subscriber: Krinkle.Oct 20 2017, 4:28 PM

@Gilles Your access has been (temp) restored. You should be able to log in again (and @Krinkle as well). Note that it doesn't have any special role, just "spare", which installs basic tools and adds standard Icinga monitoring, but that is about it. Let us know when the experiment has concluded. I think you want to request new hardware either way because this one is out of warranty.

@Dzahn perfect, thank you!

@Dzahn is networking restricted? I can't seem to be able to download things from the outside world. I need to install a few things that aren't available as Debian packages.

Dzahn added a comment.Oct 23 2017, 5:10 PM

@Gilles You'll have to use an http_proxy. See https://wikitech.wikimedia.org/wiki/HTTP_proxy. Let me know if that fixes the issue for you.
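
For readers unfamiliar with the approach, here is a minimal sketch of routing outbound requests through an HTTP proxy from Python's standard library. The proxy URL below is a hypothetical placeholder, not the real proxy host; the actual host and port are the ones documented on the wikitech page linked above. The function names `build_proxied_opener` and `fetch` are likewise illustrative.

```python
import urllib.request

# Hypothetical placeholder: substitute the proxy host and port documented
# on the wikitech HTTP_proxy page.
PROXY_URL = "http://webproxy.example.internal:8080"


def build_proxied_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS requests via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url: str, proxy_url: str = PROXY_URL, timeout: float = 30.0) -> bytes:
    """Download url through the proxy and return the response body."""
    opener = build_proxied_opener(proxy_url)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Many command-line tools take the same setting from the `http_proxy` and `https_proxy` environment variables, so exporting those in the shell is often enough for one-off downloads.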

elukey added a subscriber: elukey.Oct 24 2017, 8:12 AM

Removed the apache2 logrotate cron on osmium to avoid the following cronspam:

/etc/cron.daily/logrotate:
Job for apache2.service failed because the control process exited with error code.
See "systemctl status apache2.service" and "journalctl -xe" for details.
error: error running shared postrotate script for '/var/log/apache2/*.log '
run-parts: /etc/cron.daily/logrotate exited with return code 1

Dzahn added a comment.Nov 6 2017, 8:43 AM

I see that puppet has been disabled on this host for 2 weeks. Is this necessary for the temp test?

Gilles changed the task status from Stalled to Open.Nov 6 2017, 4:47 PM

Yes, it was to avoid "random" network requests happening. Our test is over, you can resume taking osmium behind the barn and decommissioning it.

RobH claimed this task.Nov 6 2017, 5:51 PM

I'll old yeller this server.

Change 389971 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Remove chromium module

https://gerrit.wikimedia.org/r/389971

RobH updated the task description.
RobH updated the task description.
RobH added a comment.Nov 8 2017, 4:50 PM

@Krinkle: Please note in my puppet cleanup, these two files reference this host. I did not touch them, as they are scripts and the hostname reference may just be cosmetic. However, they should likely be cleaned up to the new host that replaced osmium:

modules/osm/files/process-osm-data.sh:# - osmium-tool
modules/osm/files/process-osm-data.sh: osmium apply-changes -v --fsync "$PLANET_DIR/osm-data.osm.pbf" "$PLANET_DIR/changes.osc" -o "$PLANET_DIR/osm-data-new.osm.pbf"

Change 390028 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom of osmium

https://gerrit.wikimedia.org/r/390028

Change 390029 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] remove osmium production dns entries

https://gerrit.wikimedia.org/r/390029

Change 390029 merged by RobH:
[operations/dns@master] remove osmium production dns entries

https://gerrit.wikimedia.org/r/390029

Change 390028 merged by RobH:
[operations/puppet@production] decom of osmium

https://gerrit.wikimedia.org/r/390028

RobH reassigned this task from RobH to Cmjohnson.Nov 8 2017, 4:57 PM
RobH removed a project: Patch-For-Review.
RobH updated the task description.

All non-interruptible steps are complete; this is now pending the on-site wipe and the remaining checkbox steps. Assigned to Chris for completion.

[..]. I did not touch them, as they are scripts and the hostname reference may just be cosmetic.

modules/osm/files/process-osm-data.sh:# - osmium-tool
modules/osm/files/process-osm-data.sh: osmium apply-changes -v --fsync "$PLANET_DIR/osm-data.osm.pbf" "$PLANET_DIR/changes.osc" -o "$PLANET_DIR/osm-data-new.osm.pbf"

These are unrelated (not even cosmetic). OpenStreetMap (Osm) has a thing called Osmium which is unrelated to our hostname.
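
As an illustration of why a bare search for "osmium" produces these false positives, here is a small sketch that only treats a line as a host reference when the name appears as part of a fully qualified `.wmnet` hostname. The pattern and the function name `is_host_reference` are assumptions made for this example, not part of any existing cleanup tooling.

```python
import re

HOSTNAME = "osmium"

# Match the name only when it is followed by a domain ending in .wmnet,
# e.g. osmium.eqiad.wmnet. A bare "osmium" (the OpenStreetMap tool) is ignored.
HOST_REF = re.compile(
    r"\b" + re.escape(HOSTNAME) + r"\.(?:[a-z0-9-]+\.)*wmnet\b"
)


def is_host_reference(line: str) -> bool:
    """Return True if line mentions the host by its fully qualified name."""
    return HOST_REF.search(line) is not None


if __name__ == "__main__":
    samples = [
        "# - osmium-tool",
        'osmium apply-changes -v --fsync "$PLANET_DIR/osm-data.osm.pbf"',
        "node 'osmium.eqiad.wmnet' {",
    ]
    for sample in samples:
        print(is_host_reference(sample), sample)
```

Searching for the qualified name misses places where a host is referred to by short name only, so a manual review of the remaining bare matches is still needed.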

Change 389971 merged by Alexandros Kosiaris:
[operations/puppet@production] Remove chromium module

https://gerrit.wikimedia.org/r/389971

Cmjohnson moved this task from Decommission to Up next on the ops-eqiad board.Apr 4 2018, 7:23 PM

Change 425326 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Reoving mgmt dns for osmium

https://gerrit.wikimedia.org/r/425326

Change 425326 merged by Cmjohnson:
[operations/dns@master] Reoving mgmt dns for osmium

https://gerrit.wikimedia.org/r/425326

Cmjohnson closed this task as Resolved.Apr 10 2018, 6:09 PM
Cmjohnson updated the task description.