rack/setup/install scandium.eqiad.wmnet (parsoid test box)
Closed, Resolved (Public)

Description

This task will track the racking, setup, and installation of scandium.eqiad.wmnet. This is the new parsoid test box, replacing the 5+ year old system ruthenium. Once this system is online and services migrated, a task will need to be created (or linked) for the decommission-hardware of ruthenium.

Racking Proposal: Any 1G rack will do; this will use the internal vlan.

Hostname Proposal: Not sure if we should call this something other than a misc element name, parsoid-test1001? Open to suggestion. Otherwise scandium was selected as a currently unused element name.

scandium.eqiad.wmnet:

  • - receive in system on procurement task T195418
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - ensure uprightdiff package exists on stretch
  • - install npm from stretch-backports and node-10 from stretch-wikimedia, component node10 and puppetize it
  • - fix various puppet dependency problems / git cloning
  • - handoff for service implementation

Details

Project            Branch      Lines +/-  Subject
operations/puppet  production  +1 -1
operations/puppet  production  +12 -5
operations/puppet  production  +9 -1
operations/puppet  production  +1 -2
operations/puppet  production  +2 -0
operations/puppet  production  +6 -6
operations/puppet  production  +8 -2
operations/puppet  production  +9 -2
operations/puppet  production  +3 -3
operations/puppet  production  +1 -0
operations/puppet  production  +10 -2
operations/puppet  production  +5 -1
operations/puppet  production  +23 -0
operations/puppet  production  +4 -2
operations/puppet  production  +0 -1
operations/puppet  production  +9 -8
operations/puppet  production  +1 -1
operations/puppet  production  +12 -9
operations/dns     master      +2 -1
operations/puppet  production  +12 -1
operations/dns     master      +1 -4

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2018-09-18T17:29:19Z] <mutante> scandium - move from role(spare) to role(parsoid_testing), making it equal to ruthenium (T201366)

Change 461169 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid::testing: move Hiera values from host to role level

https://gerrit.wikimedia.org/r/461169

Change 461169 merged by Dzahn:
[operations/puppet@production] parsoid::testing: move Hiera values from host to role level

https://gerrit.wikimedia.org/r/461169

@ssastry @RobH

I have fixed the issue with multiple roles being applied on ruthenium by refactoring the puppet code. Now there is just "parsoid_testing" applied on ruthenium (without changing anything effectively on ruthenium).

After that i merged Rob's original change to add the role on scandium which was now possible without getting a jenkins-bot downvote.

So all the parsoid packages and config have been installed on scandium.

Then i noticed there are no shell users on scandium yet. This is because the admin groups were added in the past based on the (hardcoded) host name ruthenium in Hiera.

In https://gerrit.wikimedia.org/r/461169 i moved that to the role level so that it automatically applies to any host using the parsoid_testing role and we won't have to manually change this anymore in the future.

After that, all the shell users were also created on scandium and you should now be able to log in.
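
To illustrate why moving the admin groups from the host level to the role level matters, here is a minimal Python sketch of a two-level lookup. It is not the real Hiera implementation and not the real data: the key name `admin::groups`, the group name `example-group` and the `lookup` helper are assumptions made up for this example.

```python
# Hypothetical data for illustration only, not the real Hiera contents.
HOST_ROLES = {"ruthenium": "parsoid_testing", "scandium": "parsoid_testing"}

# Before: the groups were attached to one hardcoded host name.
HOST_LEVEL = {"ruthenium": {"admin::groups": ["example-group"]}}

# After: the groups are attached to the role, so every host with it inherits them.
ROLE_LEVEL = {"parsoid_testing": {"admin::groups": ["example-group"]}}


def lookup(key, host, host_data, role_data):
    """Return the host-level value if present, else the role-level value."""
    if key in host_data.get(host, {}):
        return host_data[host][key]
    role = HOST_ROLES.get(host)
    return role_data.get(role, {}).get(key, [])


# With host-level data only, the new host gets no groups (no shell users).
before = lookup("admin::groups", "scandium", HOST_LEVEL, {})
# With role-level data, the new host gets the groups without any extra change.
after = lookup("admin::groups", "scandium", {}, ROLE_LEVEL)
print(before, after)
```

With the role-level data, any future host that gets the parsoid_testing role would pick up the same groups without a further change.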

It also changed the logstash config like this:

-  name: parsoid
+  name: parsoid-tests

That being said, there are more issues due to the fact that this is now on stretch and not jessie anymore. They are at least:

  • E: Unable to locate package npm
  • E: Unable to locate package uprightdiff

And i don't think i can help with these specifically. This would need discussion how to get these on stretch / whether parsoid-testing can be on stretch as of today.

For now i just scheduled a really long downtime for the "puppet" and "parsoid" Icinga alerts but have notifications for the host itself and other basic checks enabled.

Dzahn changed the task status from Open to Stalled. Sep 18 2018, 5:52 PM

You should be able to SSH to scandium and use it, but setting this to stalled because of the missing packages, which prevent it from fully working.

@Muehlenhoff What are the chances of getting npm and uprightdiff packages on stretch?

@RobH It might have to be reinstalled with jessie (for now).

@ssastry Did you expect either jessie or stretch specifically? Aware of not having npm in stretch?

https://packages.debian.org/search?suite=jessie&searchon=names&keywords=npm
https://packages.debian.org/search?keywords=npm&searchon=names&suite=stable&section=all

Aware of not having npm in stretch?

The nodejs package should include the npm bin

We should try putting this role on a cloud VPS and then manually install the jessie packages on stretch. As pointed out by Moritz this might be a valid workaround here and better than reinstalling with jessie.

Added downtime up to Jan 31st, since icinga was complaining about parsoid not running. I don't have a lot of context, so feel free to remove the downtime and disable notifications/alerts instead if that is better :)

Dzahn changed the task status from Stalled to Open. Jan 3 2019, 9:00 PM

unstalling because npm is now available for stretch via backports

@RobH @ssastry @Arlolra This should finally be unblocked now.

@ssastry i don't see any mention of the npm package in the puppet code, yet it is installed on ruthenium. was it installed manually?

edit: nevermind, i found it. it comes from the testreduce classes

Change 482150 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: if on stretch, use stretch-backports to get npm package

https://gerrit.wikimedia.org/r/482150

npm 5.8 is now finally available in stretch-backports: https://lists.debian.org/debian-backports-changes/2018/12/threads.html

It now finds the package but is running into these issues:

The following packages have unmet dependencies:
 npm : Depends: node-abbrev (>= 1.1.1~) but 1.0.9-1 is to be installed
       Depends: node-ansi-regex (>= 3.0~) but 2.0.0-1 is to be installed
       Depends: node-cacache (>= 10.0.4~) but it is not going to be installed
       Depends: node-config-chain (>= 1.1.11~) but it is not going to be installed
       Depends: node-glob (>= 7.1.2~) but 7.1.1-1 is to be installed
       Depends: node-hosted-git-info (>= 2.6~) but 2.1.5-1 is to be installed
       Depends: node-ini (>= 1.3.5~) but 1.1.0-1 is to be installed
       Depends: node-npm-package-arg but it is not going to be installed
       Depends: node-jsonstream (>= 1.3.2~) but 1.0.3-4 is to be installed
       Depends: node-libnpx (>= 10.0.1~) but it is not going to be installed
       Depends: node-lockfile (>= 1.0.3~) but 0.4.1-1 is to be installed
       Depends: node-lru-cache (>= 4.1.1~) but 4.0.2-1 is to be installed
       Depends: node-move-concurrently (>= 1.0.1~) but it is not going to be installed
       Depends: node-normalize-package-data (>= 2.4~) but 2.3.5-2 is to be installed
       Depends: node-gyp (>= 3.6.2~) but 3.4.0-1 is to be installed
       Depends: node-resolve-from (>= 4.0~) but 2.0.0-1 is to be installed
       Depends: node-npmlog (>= 4.1.2~) but 0.0.4-1 is to be installed
       Depends: node-osenv (>= 0.1.5~) but 0.1.0-1 is to be installed
       Depends: node-read-package-json (>= 2.0.13~) but 1.2.4-1 is to be installed
       Depends: node-request (>= 2.83~) but 2.26.1-1 is to be installed
       Depends: node-retry (>= 0.10.1~) but 0.6.0-1 is to be installed
       Depends: node-rimraf (>= 2.6.2~) but 2.5.4-2 is to be installed
       Depends: node-semver (>= 5.5~) but 5.3.0-1 is to be installed
       Depends: node-sha (>= 2.0.1~) but 1.2.3-1 is to be installed
       Depends: node-slide (>= 1.1.6~) but 1.1.4-1 is to be installed
       Depends: node-strip-ansi (>= 4.0~) but 3.0.1-1 is to be installed
       Depends: node-tar (>= 4.4~) but 2.2.1-1 is to be installed
       Depends: node-boxen (>= 1.2.1~) but it is not going to be installed
       Depends: node-which (>= 1.3~) but 1.2.11-1 is to be installed
E: Unable to correct problems, you have held broken packages.
apt-get -t stretch-backports install npm
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 npm : Depends: node-cacache (>= 10.0.4~) but it is not going to be installed
       Depends: node-move-concurrently (>= 1.0.1~) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
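
The two apt runs above differ in a way that is easy to miss in the raw output: in the first run most dependencies resolve to versions that are too old, while with -t stretch-backports only two packages remain that are "not going to be installed". A small Python sketch that classifies such "Depends:" lines is below. The regular expression and the `classify` helper are my own illustration and not part of apt; they only assume the line shape shown in the output above.

```python
import re

# Matches lines such as:
#   "npm : Depends: node-abbrev (>= 1.1.1~) but 1.0.9-1 is to be installed"
#   "Depends: node-npm-package-arg but it is not going to be installed"
DEPENDS_RE = re.compile(
    r"Depends:\s+(?P<name>\S+)"
    r"(?:\s+\((?P<op>[<>=]+)\s+(?P<wanted>[^)]+)\))?"
    r"\s+but\s+(?P<rest>.+)$"
)


def classify(line):
    """Return (package, kind, found_version) or None for non-matching lines.

    kind is "missing" when apt reports no install candidate at all, and
    "wrong-version" when a candidate exists but does not satisfy the constraint.
    """
    match = DEPENDS_RE.search(line)
    if match is None:
        return None
    rest = match.group("rest").strip()
    if rest == "it is not going to be installed":
        return (match.group("name"), "missing", None)
    return (match.group("name"), "wrong-version", rest.split()[0])


sample = [
    " npm : Depends: node-abbrev (>= 1.1.1~) but 1.0.9-1 is to be installed",
    "       Depends: node-cacache (>= 10.0.4~) but it is not going to be installed",
    "Reading package lists... Done",
]
for line in sample:
    print(classify(line))
```

Grouping the results by kind separates packages that only need a newer version from packages that apt cannot install at all.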

Change 482150 merged by Dzahn:
[operations/puppet@production] testreduce: if on stretch, use stretch-backports to get npm package

https://gerrit.wikimedia.org/r/482150

Mentioned in SAL (#wikimedia-operations) [2019-01-04T23:07:31Z] <mutante> scandium apt-get remove nodejs nodejs-legacy ; puppet agent -tv - after merging gerrit:482150 this fixed "you have held broken packages" issue, now we are at a puppet dependency cycle with apt::pin T201366

Change 482380 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: break dependency cycle between apt::pin and require_package

https://gerrit.wikimedia.org/r/482380

Change 482380 merged by Dzahn:
[operations/puppet@production] testreduce: break dependency cycle between apt::pin and require_package

https://gerrit.wikimedia.org/r/482380
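
For readers unfamiliar with what a dependency cycle means here: Puppet orders resources as a directed graph and cannot apply a catalog whose ordering edges loop back on themselves. The Python sketch below finds one such loop with a depth-first search. The resource titles and edges in `EXAMPLE_GRAPH` are hypothetical and chosen only for illustration; they are not taken from the actual testreduce catalog.

```python
def find_cycle(graph):
    """Return one cycle as a list of nodes (first == last), or None.

    graph maps each resource to the resources it must be applied after.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {node: WHITE for node in graph}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for dep in graph.get(node, ()):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path, so we found a loop
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in list(graph):
        if state[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical edges: each resource lists what it requires.
EXAMPLE_GRAPH = {
    "Package[npm]": ["Apt::Pin[npm]"],
    "Apt::Pin[npm]": ["Exec[apt-get update]"],
    "Exec[apt-get update]": ["Package[npm]"],
}
print(find_cycle(EXAMPLE_GRAPH))
```

Removing any single edge in the loop breaks the cycle in this sketch.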

Change 482381 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: use regular package{} instead of require_package

https://gerrit.wikimedia.org/r/482381

Change 482381 merged by Dzahn:
[operations/puppet@production] testreduce: use regular package{} instead of require_package

https://gerrit.wikimedia.org/r/482381

some issues solved (no more broken packages, icinga happy),

but blocked on T212987 and still has a dependency issue with apt::pin

Thanks to Legoktm uploading the package in T212987 and puppet, the uprightdiff package has been installed automatically.

from:
Jan 11 14:55:53 scandium puppet-agent[23431]: (/Stage[main]/Visualdiff/Git::Clone[integration/visualdiff]/Exec[git_clone_integration/visualdiff]) Dependency Package[uprightdiff] has failures: true

to:
Jan 11 15:25:46 scandium puppet-agent[29778]: (/Stage[main]/Packages::Uprightdiff/Package[uprightdiff]/ensure) created

Change 483889 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: also pin nodejs,nodejs-dev,npm to stretch-backports

https://gerrit.wikimedia.org/r/483889

Change 483889 merged by Dzahn:
[operations/puppet@production] testreduce: also pin nodejs,nodejs-dev,npm to stretch-backports

https://gerrit.wikimedia.org/r/483889

Change 483891 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] service::node: do not install nodejs-legacy if on stretch

https://gerrit.wikimedia.org/r/483891

After quite a fight, we now have nodejs 8 and npm installed via puppet, and APT pinning finally works.

The next issue is that the service::node class attempts to install the nodejs-legacy package as well, which conflicts. So my next change above is about no longer doing that on stretch, for all services.

Dzahn raised the priority of this task from Medium to High. Jan 12 2019, 2:01 AM

Change 483891 merged by Dzahn:
[operations/puppet@production] service::node: do not install nodejs-legacy if on stretch

https://gerrit.wikimedia.org/r/483891

Change 484342 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] visualdiff: ensure git clone happens before creating pngs dir

https://gerrit.wikimedia.org/r/484342

Change 484343 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid: ensure /srv/deployment/parsoid exists before cloning

https://gerrit.wikimedia.org/r/484343

Change 484579 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: use component/node10 instead of stretch-backports

https://gerrit.wikimedia.org/r/484579

Change 484343 merged by Dzahn:
[operations/puppet@production] service: ensure parent dir exists before git cloning

https://gerrit.wikimedia.org/r/484343

Change 484602 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] services: add missing 'mediawiki/services' prefix to git cloning

https://gerrit.wikimedia.org/r/484602

Change 484342 merged by Dzahn:
[operations/puppet@production] visualdiff: ensure git clone happens before creating pngs dir

https://gerrit.wikimedia.org/r/484342

Change 484579 merged by Dzahn:
[operations/puppet@production] testreduce: use component/node10 for node 10 on stretch

https://gerrit.wikimedia.org/r/484579

Change 484811 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: no require_package for nodejs, avoid dependency cycle

https://gerrit.wikimedia.org/r/484811

Mentioned in SAL (#wikimedia-operations) [2019-01-23T01:05:05Z] <mutante> scandium - deleting /etc/apt/preferences.d/stretch_backports.pref ; apt-get remove nodejs ; apt-get install -t stretch-backports npm ; now has nodejs 10 and npm from backports installed (T201366)

Mentioned in SAL (#wikimedia-operations) [2019-01-23T01:12:36Z] <mutante> scandium - git cloning parsoid from gerrit - mediawiki/services/parsoid/deploy to /srv/deployment/parsoid/deploy ; still needs https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/484602/ (T201366)

Mentioned in SAL (#wikimedia-operations) [2019-01-23T01:15:28Z] <mutante> scandium - puppet run now without errors for the first time for the parsoid testing role on stretch instead of jessie. nodejs 10. - @subbu @Arlolra you can start using it to replace ruthenium (T201366)

Mentioned in SAL (#wikimedia-operations) [2019-01-23T01:15:28Z] <mutante> scandium - puppet run now without errors for the first time for the parsoid testing role on stretch instead of jessie. nodejs 10. - @ssastry @Arlolra you can start using it to replace ruthenium (T201366)

Thanks a lot! So, a few more things:

  • on ruthenium, we've "sudo chgrp -R wikidev" and "sudo chmod -R g+w" all the code in /srv/deployment/parsoid so that update_parsoid.sh works (no matter who does the code update) ... should we just do that manually one time? I am okay with doing it .. but just flagging this in case you want to adjust anything in puppet.
  • we should grant permissions to scandium to access the testreduce and testreduce_0715 databases (I suppose that is a phab ticket?) .. let us wait on this till next week.
  • once the above two are done, we should update DNS entries for parsoid-rt-tests.wikimedia.org and parsoid-vd-tests.wikimedia.org. Is that a ticket? Or another todo item for this ticket?

Change 486185 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: pin npm to stretch-backports and use install_options

https://gerrit.wikimedia.org/r/486185

  • on ruthenium, we've "sudo chgrp -R wikidev" and "sudo chmod -R g+w" all the code in /srv/deployment/parsoid so that update_parsoid.sh works (no matter who does the code update) ... should we just do that manually one time? I am okay with doing it .. but just flagging this in case you want to adjust anything in puppet.

We should be able to just use the "group" parameter of git clone, and i see we do. This is my fault because i cloned manually. Don't do anything and let me fix that.

  • we should grant permissions to scandium to access the testreduce and testreduce_0715 databases (I suppose that is a phab ticket?) .. let us wait on this till next week.

Yes, database access changes need a patch and then deployment from a DBA, so that would be good as a sub-task.

  • once the above two are done, we should update DNS entries for parsoid-rt-tests.wikimedia.org and parsoid-vd-tests.wikimedia.org. Is that a ticket? Or another todo item for this ticket?

Not necessary for that, i can handle the DNS change as part of this ticket. Let's just make it a checkbox.

Change 486420 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: pin npm to stretch-backports

https://gerrit.wikimedia.org/r/486420

Change 486420 merged by Dzahn:
[operations/puppet@production] testreduce: pin npm to stretch-backports

https://gerrit.wikimedia.org/r/486420

Mentioned in SAL (#wikimedia-operations) [2019-01-25T03:03:53Z] <mutante> scandium - apt-get -t stretch-backports install npm ; run puppet ; remove manually created /apt/preferences.d/npm.pref ; puppet created npm_stretch_backports.pref ; puppet run without errors again (T201366)

Now we have this puppetized APT pinning setup:

Pinned packages:
     nodejs -> 10.4.0~dfsg-1+wmf2 with priority 1005
     nodejs -> 6.11.0~dfsg-1+wmf5 with priority 1005
     npm -> 5.8.0+ds6-2~bpo9+1 with priority 1004
     nodejs-dev -> 10.4.0~dfsg-1+wmf2 with priority 1005
     nodejs-dev -> 6.11.0~dfsg-1+wmf5 with priority 1005

except i still had to run a single command to install npm; avoiding that is hard, and the details are in https://gerrit.wikimedia.org/r/c/operations/puppet/+/486185#message-9b8bd084e4d0479eed4fa11adf155ebb6422e8d4

nevertheless.. the box can be used and puppet is happy now
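
To read the "Pinned packages" listing above: nodejs and nodejs-dev each appear twice with the same priority, and APT prefers the highest pin priority and, among equal priorities, the most recent version. The Python sketch below applies that rule to the listing. It is a simplification and not APT's real algorithm: the `numeric_prefix` helper is my own and only compares the leading dotted numeric part of a version, which happens to be enough for these entries but does not implement full Debian version comparison.

```python
import re

# The listing from this task, copied as-is.
PINNED_LISTING = """\
     nodejs -> 10.4.0~dfsg-1+wmf2 with priority 1005
     nodejs -> 6.11.0~dfsg-1+wmf5 with priority 1005
     npm -> 5.8.0+ds6-2~bpo9+1 with priority 1004
     nodejs-dev -> 10.4.0~dfsg-1+wmf2 with priority 1005
     nodejs-dev -> 6.11.0~dfsg-1+wmf5 with priority 1005
"""

PIN_RE = re.compile(r"^\s*(\S+) -> (\S+) with priority (\d+)\s*$")


def numeric_prefix(version):
    """Return the leading dotted numbers of a version as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()


def pick_candidates(listing):
    """Pick one version per package: highest priority, then highest version."""
    best = {}
    for line in listing.splitlines():
        match = PIN_RE.match(line)
        if not match:
            continue
        name, version, priority = match.group(1), match.group(2), int(match.group(3))
        key = (priority, numeric_prefix(version))
        if name not in best or key > best[name][0]:
            best[name] = (key, version)
    return {name: version for name, (_, version) in best.items()}


print(pick_candidates(PINNED_LISTING))
```

Under this simplified rule the nodejs 10 packages are selected, which matches the statement that the box ended up with nodejs 10.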

Mentioned in SAL (#wikimedia-operations) [2019-01-25T03:12:43Z] <mutante> scandium sudo chgrp -R wikidev /srv/deployment/parsoid/deploy/ ; sudo chmod -R g+w /srv/deployment/parsoid/deploy/ (T201366)
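
For reference, the effect of the two commands in the log entry above (recursive group change plus adding group write permission) can be sketched in Python as follows. This is only an illustration of what the commands do, not something that was run on the host. The `make_group_writable` helper is a made-up name, and unlike `chmod -R` on the command line it skips symlinks entirely.

```python
import os
import shutil
import stat


def make_group_writable(root, group=None):
    """Add group write permission to root and everything below it.

    If group is given, also change the group owner (like chgrp -R), which
    usually needs elevated privileges. Returns the number of paths changed.
    """
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)

    changed = 0
    for path in paths:
        if os.path.islink(path):
            continue  # do not touch symlinks or their targets
        if group is not None:
            shutil.chown(path, group=group)
        mode = stat.S_IMODE(os.stat(path).st_mode)
        os.chmod(path, mode | stat.S_IWGRP)
        changed += 1
    return changed
```

A call such as `make_group_writable("/srv/deployment/parsoid/deploy", group="wikidev")` would correspond to the logged commands, assuming the group exists and the caller has the needed privileges.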

@ssastry nodejs 10 and npm are installed, the puppet run is not broken and i changed the ownership of the parsoid deployment files.

There are 2 pending changes in Gerrit that would be nice to have but don't block anything; they just eliminate the 2 manual commands and only matter the next time we set up a parsoid test host.

Change 486423 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] varnish/trafficserver: switch parsoid-tests backend, rename director

https://gerrit.wikimedia.org/r/486423

  • on ruthenium, we've "sudo chgrp -R wikidev" and "sudo chmod -R g+w" all the code in /srv/deployment/parsoid

done (and a long-term fix waiting in puppet but don't worry)

  • we should grant permissions to scandium to access the testreduce and testreduce_0715 databases (I suppose that is a phab ticket?) .. let us wait on this till next week.

Yes please, if you could do that and tag it DBA.

  • once the above two are done, we should update DNS entries for parsoid-rt-tests.wikimedia.org and parsoid-vd-tests.wikimedia.org. Is that a ticket? Or another todo item for this ticket?

I checked and that's actually not DNS, it's Varnish/Trafficserver. I prepared the necessary change for it:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/486423

It needs an ACK and/or merge from Traffic team but no extra ticket. I added them as reviewers and said it should only be merged after you say it's fine.

Mentioned in SAL (#wikimedia-operations) [2019-02-04T22:05:42Z] <mutante> scandium - systemctl start parsoid-vd (T201366)

Change 487964 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga/parsoid: no monitoring notifications on test servers

https://gerrit.wikimedia.org/r/487964

Change 486423 merged by Dzahn:
[operations/puppet@production] varnish/trafficserver: switch parsoid-tests backend, rename director

https://gerrit.wikimedia.org/r/486423

@ssastry ^ After i copied /srv/visualdiff/testreduce/testrun.ids from ruthenium to scandium and restarted parsoid-vd, it has now run for over 5 hours and i could confirm in Icinga that it has not been alerting in between.

So i merged the caching server change and your backend is now scandium. Please go ahead and check.

@ssastry also regarding our chat about how to handle the testrun.ids next time .. i just saw T215049 has been created

Dzahn lowered the priority of this task from High to Medium. Feb 8 2019, 1:54 AM

lowering priority since Subbu is unblocked and can use the new box, and we have switched varnish over. the remaining part is just some cleanup i should do to make it better the next time we upgrade

Change 487964 merged by Dzahn:
[operations/puppet@production] icinga/parsoid: no monitoring notifications on test servers

https://gerrit.wikimedia.org/r/487964

Change 484811 merged by Dzahn:
[operations/puppet@production] testreduce: no require_package for nodejs, avoid dependency cycle

https://gerrit.wikimedia.org/r/484811

Change 486185 merged by Dzahn:
[operations/puppet@production] testreduce: pin npm to backports, use install_options

https://gerrit.wikimedia.org/r/486185

Dzahn updated the task description.

Mentioned in SAL (#wikimedia-operations) [2019-02-13T20:54:11Z] <mutante> ruthenium - shell access for parsoid-testers revoked by puppet, please use scandium.eqiad.wmnet (T201366)

Change 484602 merged by Dzahn:
[operations/puppet@production] services: add missing 'mediawiki/services' prefix to git cloning

https://gerrit.wikimedia.org/r/484602

Mentioned in SAL (#wikimedia-operations) [2019-02-14T01:52:48Z] <mutante> scandium - removing parsoid deploy dir and letting puppet re-clone it after merging gerrit fix 484602 - replace manual clone with proper puppetization (T201366)

Change 490662 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] service::deploy::gitclone: add '/deploy' suffix to clone dir

https://gerrit.wikimedia.org/r/490662

Change 490662 merged by Dzahn:
[operations/puppet@production] service::deploy::gitclone: add '/deploy' suffix to clone dir

https://gerrit.wikimedia.org/r/490662

Mentioned in SAL (#wikimedia-operations) [2019-04-02T15:52:58Z] <mutante> icinga - re-enabling notifications for scandium. setup task is resolved yet systemd is alerting, should not have been turned off anymore (T201366)

Mentioned in SAL (#wikimedia-operations) [2019-04-02T15:55:46Z] <mutante> scandium - systemctl start parsoid-vd was failed (T201366)