rack/setup/install torrelay1001.wikimedia.org
Closed, ResolvedPublic

Description

This task will track the racking, setup, and installation of the replacement tor relay server for eqiad. As we've moved more hostnames with dedicated roles to role-based naming, this seems a prime candidate, as it cannot do anything but act as a tor relay.

Once this system is fully online and in service, radium will be decommissioned due to its age.

Hostname proposal: torrelay1001.wikimedia.org

Racking Proposal: This can go in any 1G rack, as the system it's replacing is in our normal public subnet(s).

torrelay1001:

  • - receive in system on procurement task T195417
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run - as role spare
  • - handoff for service implementation
  • - implement service and switch over

migration plan:

goal: keep the same fingerprints

  • stop tor service on radium
  • rsync datadir contents (/var/lib/tor/) from radium to torrelay1001
  • delete datadir and config on radium or otherwise ensure it can't come back with the same fingerprints
  • start service on torrelay1001
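
The ordering of the steps above can be sketched as a small dry-run helper. This is only an illustration under stated assumptions: the systemd unit name `tor`, the `rsync -a` invocation over the default remote shell, and the `rm -rf` cleanup are guesses at how each step might be carried out, not commands taken from this task. The helper only builds and prints the commands; it runs nothing, and the "delete config" part of step three is left out because the config location is not stated here.

```python
import shlex


def build_migration_plan(old_host, new_host, datadirs=("/var/lib/tor/",), unit="tor"):
    """Return the migration steps as ordered (host, argv) pairs.

    Assumptions (not from the task): the systemd unit is called `unit`,
    and rsync is run on the new host, pulling from the old one.
    """
    # 1. Stop the relay on the old host so its keys stop changing.
    steps = [(old_host, ["systemctl", "stop", unit])]
    # 2. Copy every data directory; the identity keys in there
    #    determine the relay fingerprint.
    for d in datadirs:
        steps.append((new_host, ["rsync", "-a", f"{old_host}:{d}", d]))
    # 3. Remove the old copies so the old host cannot come back
    #    with the same fingerprint.
    for d in datadirs:
        steps.append((old_host, ["rm", "-rf", "--", d.rstrip("/")]))
    # 4. Start the relay on the new host.
    steps.append((new_host, ["systemctl", "start", unit]))
    return steps


def render(steps):
    """Format each step as 'host$ command' for review before running by hand."""
    return [
        f"{host}$ {' '.join(shlex.quote(a) for a in argv)}"
        for host, argv in steps
    ]


if __name__ == "__main__":
    for line in render(build_migration_plan("radium", "torrelay1001")):
        print(line)
```

Passing every data directory in `datadirs` keeps the copy and cleanup steps in one place, so a directory cannot be copied without also being scheduled for removal on the old host.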

Related Objects

Status      Subtype   Assigned      Task
Resolved              Dzahn
Resolved              Jclark-ctr

Event Timeline

RobH triaged this task as Medium priority.Jun 7 2018, 9:13 PM
RobH created this task.
Vvjjkkii renamed this task from rack/setup/install torrelay1001.wikimedia.org to efbaaaaaaa.Jul 1 2018, 1:05 AM
Vvjjkkii removed Cmjohnson as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from efbaaaaaaa to rack/setup/install torrelay1001.wikimedia.org.Jul 2 2018, 1:22 PM
CommunityTechBot assigned this task to Cmjohnson.
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added a subscriber: Aklapper.

You can assign this to me after the initial setup to implement service.

Change 449460 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt/productin dns torrelay1001

https://gerrit.wikimedia.org/r/449460

Change 449460 merged by Cmjohnson:
[operations/dns@master] Adding mgmt/productin dns torrelay1001

https://gerrit.wikimedia.org/r/449460

Cmjohnson updated the task description. (Show Details)
Cmjohnson updated the task description. (Show Details)
Cmjohnson moved this task from Racking Tasks to Blocked on the ops-eqiad board.

This is ready to be installed, assigning to @RobH for help finishing the installation.

Change 451033 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] adding torrelay1001 ipv6 entries

https://gerrit.wikimedia.org/r/451033

Change 451033 merged by RobH:
[operations/dns@master] adding torrelay1001 ipv6 entries

https://gerrit.wikimedia.org/r/451033

Change 451044 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] torrelay1001 install params

https://gerrit.wikimedia.org/r/451044

Change 451044 merged by RobH:
[operations/puppet@production] torrelay1001 install params

https://gerrit.wikimedia.org/r/451044

RobH removed a project: Patch-For-Review.
RobH updated the task description. (Show Details)
RobH edited projects, added Tor; removed ops-eqiad.

IRC Sync/Update:

I've chatted with @Dzahn via irc and he is expecting this task reassignment. He'll be handling pushing this into service, and filing a decommission-hardware task for radium once this is done.

Change 455742 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] fix typo: torrealy -> torrelay

https://gerrit.wikimedia.org/r/455742

Change 455742 merged by Dzahn:
[operations/dns@master] fix typo: torrealy -> torrelay

https://gerrit.wikimedia.org/r/455742

Change 455744 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] move tor_relay role to torrelay1001, decom radium

https://gerrit.wikimedia.org/r/455744

migration plan:

goal: keep the same fingerprints

  • stop tor service on radium
  • rsync datadir contents (/var/lib/tor/) from radium to torrelay1001
  • delete datadir and config on radium or otherwise ensure it can't come back with the same fingerprints
  • start service on torrelay1001

Change 455745 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] tor_relay: temp allow rsync of datadir for migration

https://gerrit.wikimedia.org/r/455745

Plan looks good, two things to consider:

  • We use the debs provided by the Tor project, not the ones from Debian. In jessie this was in the thirdparty component, which was present by default. On stretch the thirdparty/tor component needs to be explicitly added via puppet.
  • The puppet manifests ensure that the tor service is running, so masking the systemd unit will not be enough. Maybe simply rename /var/lib/tor to /var/lib/to instead (so that it's still around in case of a rollback).

Change 455745 merged by Dzahn:
[operations/puppet@production] tor_relay: temp allow rsync of datadir for migration

https://gerrit.wikimedia.org/r/455745

Change 456056 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] tor::relay: add configurable thirdparty APT source

https://gerrit.wikimedia.org/r/456056

Change 456056 merged by Dzahn:
[operations/puppet@production] tor::relay: add configurable thirdparty APT source

https://gerrit.wikimedia.org/r/456056

Change 458339 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] tor: make it possible to config service running/stopped in Hiera

https://gerrit.wikimedia.org/r/458339

Plan looks good, two things to consider:

  • ..On stretch the thirdparty/tor component needs to be explicitly added via puppet

Solved by https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/456056/

  • The puppet manifests ensure that the tor service is running, so masking the systemd unit will not be enough

Should be solved by https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458339/

Change 458339 merged by Dzahn:
[operations/puppet@production] tor: make it possible to config service running/stopped in Hiera

https://gerrit.wikimedia.org/r/458339

Change 455744 merged by Dzahn:
[operations/puppet@production] site: add tor_relay role to torrelay1001

https://gerrit.wikimedia.org/r/455744

Change 458839 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] replace radium with torrelay1001 as tor-eqiad-1.wikimedia.org

https://gerrit.wikimedia.org/r/458839

Change 458840 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] lower TTL for tor-eqiad-1.wikimedia.org to 5M

https://gerrit.wikimedia.org/r/458840

Change 458840 merged by Dzahn:
[operations/dns@master] lower TTL for tor-eqiad-1.wikimedia.org to 5M

https://gerrit.wikimedia.org/r/458840

Change 458848 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] tor: ensure libzstd1 is installed and required if on stretch

https://gerrit.wikimedia.org/r/458848

Change 458848 merged by Dzahn:
[operations/puppet@production] tor: ensure libzstd1 is installed and required if on stretch

https://gerrit.wikimedia.org/r/458848

Change 458932 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] tor_relay: stop service on radium, start on torrelay1001

https://gerrit.wikimedia.org/r/458932

Change 458932 merged by Dzahn:
[operations/puppet@production] tor_relay: stop service on radium, start on torrelay1001

https://gerrit.wikimedia.org/r/458932

Change 458839 merged by Dzahn:
[operations/dns@master] replace radium with torrelay1001 as tor-eqiad-1.wikimedia.org

https://gerrit.wikimedia.org/r/458839

Mentioned in SAL (#wikimedia-operations) [2018-09-08T00:16:30Z] <mutante> tor relay switched over from radium to torrelay1001, fixed /var/lib/tor permissions, restarted service, flipped DNS CNAME (5M TTL), traffic can be seen with "arm", monitoring all green (T196701)

Change 458944 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] tor: enable logging at 'notice' level (recommended)

https://gerrit.wikimedia.org/r/458944

Change 458944 merged by Dzahn:
[operations/puppet@production] tor: enable logging at 'notice' level (recommended)

https://gerrit.wikimedia.org/r/458944

Change 458946 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: turn radium into a spare system

https://gerrit.wikimedia.org/r/458946

Dzahn changed the status of subtask T203861: decom radium from Open to Stalled.Sep 8 2018, 1:00 AM

Change 458946 merged by Dzahn:
[operations/puppet@production] site: turn radium into a spare system

https://gerrit.wikimedia.org/r/458946

Mentioned in SAL (#wikimedia-operations) [2018-09-08T01:10:35Z] <mutante> also rsyncing /var/lib/tor-instances/ data for second instance and restarting service (T196701)

wikimedia-eqiad1 is fine.

Unfortunately, the secondary one, wikimediaeqiad2, got started with a different fingerprint at first (I synced /var/lib/tor but not /var/lib/tor-instances before re-enabling puppet). I corrected that after the fact.

https://metrics.torproject.org/rs.html#search/wikimedia
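
A check like the following could catch a missed data directory before the service is re-enabled. It is a minimal sketch, assuming each tor data directory contains a `fingerprint` file (tor writes one for relays) and that copies of the old and new trees are reachable as local paths. The function names are illustrative and not part of any existing tooling.

```python
from pathlib import Path


def read_fingerprints(root):
    """Map each `fingerprint` file under root (by relative path) to its contents."""
    root = Path(root)
    return {
        str(p.relative_to(root)): p.read_text().strip()
        for p in sorted(root.rglob("fingerprint"))
        if p.is_file()
    }


def diff_fingerprints(before, after):
    """Return the relative paths whose fingerprint is missing or changed in `after`."""
    return sorted(path for path, value in before.items() if after.get(path) != value)


if __name__ == "__main__":
    # Hypothetical local copies of the old and new hosts' /var/lib trees.
    old = read_fingerprints("/srv/backup/radium/var/lib")
    new = read_fingerprints("/var/lib")
    for path in diff_fingerprints(old, new):
        print(f"fingerprint mismatch or missing: {path}")
```

Running it against a parent directory that holds both /var/lib/tor and /var/lib/tor-instances means a second instance is compared along with the primary one.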

Dzahn changed the status of subtask T203861: decom radium from Stalled to Open.Sep 13 2018, 12:59 AM