rack/setup/install scandium.eqiad.wmnet (parsoid test box)
Open, High, Public

Description

This task will track the racking, setup, and installation of scandium.eqiad.wmnet. This is the new parsoid test box, replacing the 5+ year old system ruthenium. Once this system is online and services migrated, a task will need to be created (or linked) for the decommission of ruthenium.

Racking Proposal: Any 1G rack will do, this will use the internal vlan.

Hostname Proposal: Not sure if we should call this something other than a misc element name, parsoid-test1001? Open to suggestion. Otherwise scandium was selected as a currently unused element name.

scandium.eqiad.wmnet:

  • - receive in system on procurement task T195418
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - ensure uprightdiff package exists on stretch
  • - install npm and nodejs 8 from stretch-backports, puppetize it
  • - fix various puppet dependency problems / git cloning
  • - handoff for service implementation
Cmjohnson moved this task from Backlog to Racking Tasks on the ops-eqiad board.Aug 7 2018, 2:01 PM
Cmjohnson updated the task description. Aug 8 2018, 3:05 PM
Cmjohnson updated the task description. Aug 16 2018, 3:23 PM

Change 453150 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Removing second mgmt dns entry for scandium

https://gerrit.wikimedia.org/r/453150

Change 453150 merged by Cmjohnson:
[operations/dns@master] Removing second mgmt dns entry for scandium

https://gerrit.wikimedia.org/r/453150

Cmjohnson updated the task description. Aug 21 2018, 4:32 PM
Cmjohnson reassigned this task from Cmjohnson to RobH.
Cmjohnson moved this task from Racking Tasks to Blocked on the ops-eqiad board.

@RobH this is ready for install

Change 454423 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] adding scandium install params

https://gerrit.wikimedia.org/r/454423

Change 454426 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] scandium prod dns entries

https://gerrit.wikimedia.org/r/454426

Change 454423 merged by RobH:
[operations/puppet@production] adding scandium install params

https://gerrit.wikimedia.org/r/454423

Change 454426 merged by RobH:
[operations/dns@master] scandium prod dns entries

https://gerrit.wikimedia.org/r/454426

RobH lowered the priority of this task from High to Normal.Aug 21 2018, 10:34 PM
RobH removed projects: Patch-For-Review, ops-eqiad.
RobH updated the task description.
RobH removed RobH as the assignee of this task.Aug 21 2018, 11:20 PM
RobH updated the task description.

I'm not quite sure who on the Parsoid handling team would be involved in pushing this into service to replace ruthenium.

If no one chimes in by next Monday, I'll be listing this in the SRE team meeting.

I'm not quite sure who on the Parsoid handling team would be involved in pushing this into service to replace ruthenium.

If no one chimes in by next Monday, I'll be listing this in the SRE team meeting.

As far as we are concerned, we don't care about the specifics as long as (a) the services on this match what is available on ruthenium .. which hopefully puppet will guarantee :) (b) server has sufficient cpu / ram / disk similar to ruthenium (which I presume will be the case).

Once scandium is up and hooked up, please give me and @Arlolra a heads up and we can test our services there and then switch over and you all can decomm ruthenium at that time.

Change 454443 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] pushing scandium into parsoid test service

https://gerrit.wikimedia.org/r/454443

RobH added a comment.Aug 22 2018, 12:38 AM

So the change to apply things the same as ruthenium won't work quite right, since we applied more than one role to that system.

Change 456402 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] Add a parameter to generate(User|Title)Vars hooks to specify context

https://gerrit.wikimedia.org/r/456402

Daimona added a subscriber: Daimona.

Ergh, sorry, wrong task.

Daimona removed a subscriber: Daimona.Aug 30 2018, 3:06 PM

So the change to apply things the same as ruthenium won't work quite right, since we applied more than one role to that system.

Sorry, I lost track of this. @Dzahn, @mobrovac .. something you could help us with here? I don't fully understand the issue here.

Dzahn added a comment.EditedSep 14 2018, 7:22 PM

@ssastry The issue Rob is mentioning is that there are direct includes in site.pp which are a violation of puppet lint/style checks:

00:35:26 wmf-style: total violations delta 4
00:35:26 NEW violations:
00:35:26 manifests/site.pp:2010 wmf-style: node 'scandium.eqiad.wmnet' includes class ::role::test
00:35:26 manifests/site.pp:2011 wmf-style: node 'scandium.eqiad.wmnet' includes class ::role::parsoid::rt_server
00:35:26 manifests/site.pp:2012 wmf-style: node 'scandium.eqiad.wmnet' includes class ::role::parsoid::rt_client
00:35:26 manifests/site.pp:2013 wmf-style: node 'scandium.eqiad.wmnet' includes class ::role::parsoid::diffserver

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/454443/1/manifests/site.pp

This is also the case on ruthenium. It only shows up here because the change adds new code, so the check reports it as a new violation rather than an existing one.

The fix is to have just a single role per node. A role like role(parsoid::testing) should include everything that is needed without extra includes. If it turns out we have a different combo of classes for different nodes then we should just have 2 separate roles.

This single role should be applied using the special "role()" keyword.

(The other fix is to override jenkins and ignore the V-1 and merge it anyways and then fix both ruthenium and scandium together later).
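The single-role rule described above can be illustrated with a small checker. This is only a sketch and not the actual wmf-style lint plugin: the regular expressions, the `check_nodes` helper and the sample manifest are assumptions made up for illustration, and the parser only handles simple node blocks without nested braces.

```python
import re
from typing import Dict, List

# Matches simple node blocks without nested braces (a simplifying assumption).
NODE_RE = re.compile(r"node\s+'([^']+)'\s*\{(.*?)\}", re.DOTALL)
# Direct includes of role classes, e.g. "include ::role::test".
INCLUDE_RE = re.compile(r"^\s*include\s+(?:::)?(role::[\w:]+)", re.MULTILINE)
# The role() keyword, e.g. "role(parsoid_testing)".
ROLE_CALL_RE = re.compile(r"^\s*role\(([^)]*)\)", re.MULTILINE)

# Illustrative manifest only; not a copy of the real site.pp.
SAMPLE = """
node 'ruthenium.eqiad.wmnet' {
    role(parsoid_testing)
}

node 'scandium.eqiad.wmnet' {
    include ::role::test
    include ::role::parsoid::rt_server
    include ::role::parsoid::rt_client
    include ::role::parsoid::diffserver
}
"""


def check_nodes(manifest: str) -> Dict[str, List[str]]:
    """Return {node: [issues]} for nodes that break the one-role rule."""
    problems: Dict[str, List[str]] = {}
    for name, body in NODE_RE.findall(manifest):
        # Every direct include of a role class counts as one issue.
        issues = [f"includes class ::{cls}" for cls in INCLUDE_RE.findall(body)]
        # Collect all role names passed to role(), across all calls.
        roles = [
            role.strip()
            for call in ROLE_CALL_RE.findall(body)
            for role in call.split(",")
            if role.strip()
        ]
        if len(roles) != 1:
            issues.append(f"expected exactly one role(), found {len(roles)}")
        if issues:
            problems[name] = issues
    return problems


if __name__ == "__main__":
    for node, issues in check_nodes(SAMPLE).items():
        for issue in issues:
            print(f"{node}: {issue}")
```

In this sample only the scandium block is flagged, because the ruthenium block applies a single role through role().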

Dzahn added a comment.Sep 14 2018, 7:37 PM

The best fix would be to rename/convert ::role::parsoid::rt_server, ::role::parsoid::vd_server, ::role::parsoid::rt_client, ::role::parsoid::vd_client and ::role::parsoid::diffserver all to profiles instead of roles and only keep the production "role(parsoid)" and "role(parsoid_testing)" as roles. Only these 2 are actually applied directly to nodes. All other "roles" are just included elsewhere and are not really roles in the newer sense.

Change 460605 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid: role/profile refactoring

https://gerrit.wikimedia.org/r/460605

Dzahn added a comment.EditedSep 17 2018, 11:41 PM

@RobH @ssastry @mobrovac the change above should unblock this. It refactors the puppet code to profiles and ensures there is only a single role on the parsoid test host.

This compiler output shows it doesn't touch anything on prod parsoid and has changes on parsoid-test that are only related to renaming resources.

https://puppet-compiler.wmflabs.org/compiler1002/12484/

So if you are fine with that we can merge and then you go ahead putting this on scandium.

Coincidentally, this should also help with (the comments on) https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/460064/ where it is discussed whether hosts using the "test" role should send Icinga notifications or not. After the change a parsoid-test host won't be using the "test" role anymore, so it would still send notifications as before.

Change 460605 merged by Dzahn:
[operations/puppet@production] parsoid: role/profile refactoring

https://gerrit.wikimedia.org/r/460605

Change 454443 merged by Dzahn:
[operations/puppet@production] pushing scandium into parsoid test service

https://gerrit.wikimedia.org/r/454443

Mentioned in SAL (#wikimedia-operations) [2018-09-18T17:29:19Z] <mutante> scandium - move from role(spare) to role(parsoid_testing), making it equal to ruthenium (T201366)

Change 461169 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid::testing: move Hiera values from host to role level

https://gerrit.wikimedia.org/r/461169

Change 461169 merged by Dzahn:
[operations/puppet@production] parsoid::testing: move Hiera values from host to role level

https://gerrit.wikimedia.org/r/461169

Dzahn added a comment.EditedSep 18 2018, 5:51 PM

@ssastry @RobH

I have fixed the issue with multiple roles being applied on ruthenium by refactoring the puppet code. Now there is just "parsoid_testing" applied on ruthenium (without changing anything effectively on ruthenium).

After that i merged Rob's original change to add the role on scandium which was now possible without getting a jenkins-bot downvote.

So all the parsoid packages and config have been installed on scandium.

Then i noticed there are no shell users on scandium yet. This is because the admin groups were added in the past based on the (hardcoded) host name ruthenium in Hiera.

In https://gerrit.wikimedia.org/r/461169 i moved that to the role level so that it automatically applies to any host using the parsoid_testing role and we won't have to manually change this anymore in the future.

After that also all the shell users have been created on scandium and you should now be able to login.
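The effect of moving the admin groups from the host level to the role level can be sketched as a two-level lookup. This is a simplified model and not how Hiera is actually implemented; the `lookup` helper, the key name and the group name `example-admins` are assumptions chosen for illustration.

```python
from typing import Any, Dict, Optional

# Hypothetical data: before the change the value hangs off the hostname,
# after the change it hangs off the role.
HOST_LEVEL = {"ruthenium": {"admin::groups": ["example-admins"]}}
ROLE_LEVEL = {"parsoid_testing": {"admin::groups": ["example-admins"]}}


def lookup(
    key: str,
    host: str,
    role: str,
    host_data: Dict[str, Dict[str, Any]],
    role_data: Dict[str, Dict[str, Any]],
) -> Optional[Any]:
    """Return the value for key; the host level wins over the role level."""
    for level in (host_data.get(host, {}), role_data.get(role, {})):
        if key in level:
            return level[key]
    return None


if __name__ == "__main__":
    # Host-level data only: the new host gets nothing.
    print(lookup("admin::groups", "scandium", "parsoid_testing", HOST_LEVEL, {}))
    # Role-level data: any host with the role gets the groups.
    print(lookup("admin::groups", "scandium", "parsoid_testing", {}, ROLE_LEVEL))
```

With the value keyed on the role, adding another host with the same role needs no further data change.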

What it also changed is that logstash config changed like:

-  name: parsoid
+  name: parsoid-tests

That being said, there are more issues because this host is now on stretch and no longer on jessie. They are at least:

  • E: Unable to locate package npm
  • E: Unable to locate package uprightdiff

And i don't think i can help with these specifically. This would need discussion how to get these on stretch / whether parsoid-testing can be on stretch as of today.

For now i just scheduled a really long downtime for the "puppet" and "parsoid" Icinga alerts but have notifications for the host itself and other basic checks enabled.

Dzahn changed the task status from Open to Stalled.Sep 18 2018, 5:52 PM

You should be able to SSH to scandium and use it.. but setting this to stalled because of the missing packages which prevents it from fully working.

@Muehlenhoff What are the chances of getting npm and uprightdiff packages on stretch?

@RobH It might have to be reinstalled with jessie (for now).

@ssastry Did you expect either jessie or stretch specifically? Aware of not having npm in stretch?

https://packages.debian.org/search?suite=jessie&searchon=names&keywords=npm
https://packages.debian.org/search?keywords=npm&searchon=names&suite=stable&section=all

Aware of not having npm in stretch?

The nodejs package should include the npm bin

Dzahn added a comment.Sep 18 2018, 6:59 PM

We should try putting this role on a cloud VPS and then manually install the jessie packages on stretch. As pointed out by Moritz this might be a valid workaround here and better than reinstalling with jessie.

ssastry moved this task from Backlog to Non-Parsoid Tasks on the Parsoid board.Sep 20 2018, 5:13 PM
elukey added a subscriber: elukey.Dec 20 2018, 7:15 AM

Added downtime up to Jan 31st, icinga was complaining about parsoid not running. Don't have a lot of context but feel free to remove downtime and add notifications/alerts disabled in case it is better :)

Dzahn changed the task status from Stalled to Open.Thu, Jan 3, 9:00 PM

unstalling because npm is now available for stretch via backports

@RobH @ssastry @Arlolra This should finally be unblocked now.

Dzahn added a comment.EditedThu, Jan 3, 9:08 PM

@ssastry i don't see any mention of the npm package in the puppet code, yet it is installed on ruthenium. was it installed manually?

edit: nevermind, i found it. it comes from the testreduce classes

Change 482150 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: if on stretch, use stretch-backports to get npm package

https://gerrit.wikimedia.org/r/482150

Dzahn added a comment.Thu, Jan 3, 9:28 PM

npm 5.8 is now finally available in stretch-backports: https://lists.debian.org/debian-backports-changes/2018/12/threads.html

It now finds the package but is running into these issues:

The following packages have unmet dependencies:
 npm : Depends: node-abbrev (>= 1.1.1~) but 1.0.9-1 is to be installed
       Depends: node-ansi-regex (>= 3.0~) but 2.0.0-1 is to be installed
       Depends: node-cacache (>= 10.0.4~) but it is not going to be installed
       Depends: node-config-chain (>= 1.1.11~) but it is not going to be installed
       Depends: node-glob (>= 7.1.2~) but 7.1.1-1 is to be installed
       Depends: node-hosted-git-info (>= 2.6~) but 2.1.5-1 is to be installed
       Depends: node-ini (>= 1.3.5~) but 1.1.0-1 is to be installed
       Depends: node-npm-package-arg but it is not going to be installed
       Depends: node-jsonstream (>= 1.3.2~) but 1.0.3-4 is to be installed
       Depends: node-libnpx (>= 10.0.1~) but it is not going to be installed
       Depends: node-lockfile (>= 1.0.3~) but 0.4.1-1 is to be installed
       Depends: node-lru-cache (>= 4.1.1~) but 4.0.2-1 is to be installed
       Depends: node-move-concurrently (>= 1.0.1~) but it is not going to be installed
       Depends: node-normalize-package-data (>= 2.4~) but 2.3.5-2 is to be installed
       Depends: node-gyp (>= 3.6.2~) but 3.4.0-1 is to be installed
       Depends: node-resolve-from (>= 4.0~) but 2.0.0-1 is to be installed
       Depends: node-npmlog (>= 4.1.2~) but 0.0.4-1 is to be installed
       Depends: node-osenv (>= 0.1.5~) but 0.1.0-1 is to be installed
       Depends: node-read-package-json (>= 2.0.13~) but 1.2.4-1 is to be installed
       Depends: node-request (>= 2.83~) but 2.26.1-1 is to be installed
       Depends: node-retry (>= 0.10.1~) but 0.6.0-1 is to be installed
       Depends: node-rimraf (>= 2.6.2~) but 2.5.4-2 is to be installed
       Depends: node-semver (>= 5.5~) but 5.3.0-1 is to be installed
       Depends: node-sha (>= 2.0.1~) but 1.2.3-1 is to be installed
       Depends: node-slide (>= 1.1.6~) but 1.1.4-1 is to be installed
       Depends: node-strip-ansi (>= 4.0~) but 3.0.1-1 is to be installed
       Depends: node-tar (>= 4.4~) but 2.2.1-1 is to be installed
       Depends: node-boxen (>= 1.2.1~) but it is not going to be installed
       Depends: node-which (>= 1.3~) but 1.2.11-1 is to be installed
E: Unable to correct problems, you have held broken packages.
Dzahn added a comment.Thu, Jan 3, 9:41 PM
apt-get -t stretch-backports install npm
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 npm : Depends: node-cacache (>= 10.0.4~) but it is not going to be installed
       Depends: node-move-concurrently (>= 1.0.1~) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
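To compare the two apt runs above it can help to split the unmet dependencies into "candidate version too old" and "not installable at all". The following is a rough sketch that parses lines in the format shown above; the `parse_unmet` helper and its regular expressions are assumptions written for this output and are not part of apt.

```python
import re
from typing import List, Optional, Tuple

# "Depends: <pkg> [(<constraint>)] but <reason>"
DEP_RE = re.compile(
    r"Depends:\s+(?P<pkg>\S+)"
    r"(?:\s+\((?P<constraint>[^)]+)\))?"
    r"\s+but\s+(?P<reason>.+)$"
)
# "<version> is to be installed" means a candidate exists but is too old.
CANDIDATE_RE = re.compile(r"^(\S+) is to be installed$")

# Three lines copied from the output above.
SAMPLE = """\
 npm : Depends: node-abbrev (>= 1.1.1~) but 1.0.9-1 is to be installed
       Depends: node-cacache (>= 10.0.4~) but it is not going to be installed
       Depends: node-npm-package-arg but it is not going to be installed
"""


def parse_unmet(output: str) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Return (package, version constraint, candidate version) per Depends line."""
    rows = []
    for line in output.splitlines():
        match = DEP_RE.search(line)
        if not match:
            continue
        candidate = CANDIDATE_RE.match(match.group("reason").strip())
        rows.append(
            (
                match.group("pkg"),
                match.group("constraint"),
                candidate.group(1) if candidate else None,
            )
        )
    return rows


if __name__ == "__main__":
    for pkg, constraint, candidate in parse_unmet(SAMPLE):
        state = f"candidate {candidate} too old" if candidate else "not installable"
        print(f"{pkg}: needs {constraint or 'any version'}, {state}")
```

Run against both outputs, this would show that with -t stretch-backports only the "not going to be installed" entries for node-cacache and node-move-concurrently remain.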

Change 482150 merged by Dzahn:
[operations/puppet@production] testreduce: if on stretch, use stretch-backports to get npm package

https://gerrit.wikimedia.org/r/482150

Mentioned in SAL (#wikimedia-operations) [2019-01-04T23:07:31Z] <mutante> scandium apt-get remove nodejs nodes-legacy ; puppet agent -tv - after merging gerrit:482150 this fixed "you have held broken packages" issue, now we are at a puppet dependecy cycle with apt::pin T201366

Change 482380 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: break dependency cycle between apt::pin and require_package

https://gerrit.wikimedia.org/r/482380

Change 482380 merged by Dzahn:
[operations/puppet@production] testreduce: break dependency cycle between apt::pin and require_package

https://gerrit.wikimedia.org/r/482380

Change 482381 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: use regular package{} instead of require_package

https://gerrit.wikimedia.org/r/482381

Change 482381 merged by Dzahn:
[operations/puppet@production] testreduce: use regular package{} instead of require_package

https://gerrit.wikimedia.org/r/482381

some issues solved (no more broken packages, icinga happy),

but blocked on T212987 and still has a dependency issue with apt::pin

Dzahn claimed this task.Sat, Jan 5, 12:26 AM
Dzahn added a comment.Fri, Jan 11, 7:55 PM

Thanks to Legoktm uploading the package in T212987 and puppet, the uprightdiff package has been installed automatically.

from:
Jan 11 14:55:53 scandium puppet-agent[23431]: (/Stage[main]/Visualdiff/Git::Clone[integration/visualdiff]/Exec[git_clone_integration/visualdiff]) Dependency Package[uprightdiff] has failures: true

to:
Jan 11 15:25:46 scandium puppet-agent[29778]: (/Stage[main]/Packages::Uprightdiff/Package[uprightdiff]/ensure) created

Change 483889 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: also pin nodejs,nodejs-dev,npm to stretch-backports

https://gerrit.wikimedia.org/r/483889

Change 483889 merged by Dzahn:
[operations/puppet@production] testreduce: also pin nodejs,nodejs-dev,npm to stretch-backports

https://gerrit.wikimedia.org/r/483889

Change 483891 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] service::node: do not install nodejs-legacy if on stretch

https://gerrit.wikimedia.org/r/483891

Dzahn added a comment.Sat, Jan 12, 2:00 AM

After quite a fight, we now have nodejs 8 and npm installed via puppet, and APT pinning finally works.

The next issue is that the service::node class attempts to install nodejs-legacy package as well which conflicts. So my next change above is about not doing that anymore if on stretch for all services.
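For reference, the APT pinning mentioned above comes down to a preferences stanza that tells apt to prefer stretch-backports for the listed packages. The sketch below only renders such a stanza as text. The `render_pin` helper is hypothetical and the priority value of 1001 is an assumption; the value actually used in puppet is not stated in this task.

```python
from typing import List


def render_pin(packages: List[str], release: str, priority: int) -> str:
    """Render an apt preferences stanza pinning packages to a release archive."""
    return (
        f"Package: {' '.join(packages)}\n"
        f"Pin: release a={release}\n"
        f"Pin-Priority: {priority}\n"
    )


if __name__ == "__main__":
    # Package list taken from change 483889; the priority is an assumed value.
    print(render_pin(["nodejs", "nodejs-dev", "npm"], "stretch-backports", 1001), end="")
```

A stanza like this would normally live in a file under /etc/apt/preferences.d/, and `apt-cache policy npm` can be used to check which version apt then selects.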

Dzahn raised the priority of this task from Normal to High.Sat, Jan 12, 2:01 AM

Change 483891 merged by Dzahn:
[operations/puppet@production] service::node: do not install nodejs-legacy if on stretch

https://gerrit.wikimedia.org/r/483891

Change 484342 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] visualdiff: ensure git clone happens before creating pngs dir

https://gerrit.wikimedia.org/r/484342

Change 484343 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid: ensure /srv/deployment/parsoid exists before cloning

https://gerrit.wikimedia.org/r/484343

Dzahn updated the task description. Tue, Jan 15, 2:07 AM

Change 484579 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: use component/node10 instead of stretch-backports

https://gerrit.wikimedia.org/r/484579

Change 484343 merged by Dzahn:
[operations/puppet@production] service: ensure parent dir exists before git cloning

https://gerrit.wikimedia.org/r/484343

Change 484602 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] services: add missing 'mediawiki/services' prefix to git cloning

https://gerrit.wikimedia.org/r/484602

Change 484342 merged by Dzahn:
[operations/puppet@production] visualdiff: ensure git clone happens before creating pngs dir

https://gerrit.wikimedia.org/r/484342

Change 484579 merged by Dzahn:
[operations/puppet@production] testreduce: use component/node10 for node 10 on stretch

https://gerrit.wikimedia.org/r/484579

Change 484811 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: no require_package for nodejs, avoid dependency cycle

https://gerrit.wikimedia.org/r/484811