
rack and setup wtp1025-1048
Closed, ResolvedPublic

Description

This task will be used to track the receiving, racking, and setup of 24 new wtp systems. These systems were ordered on task T155645.

These will use the same hostname format as past wtp systems, so wtp1025-wtp1048.

Since there are 24 systems to install, this task shows a single checklist; it should be applied to each system as it is installed.

Racking Plan: WTP is presently spread out across racks and rows. @Cmjohnson has gone ahead and racked these with 3 per rack and two racks housing wtp systems per row: 3 (per rack) * 2 (racks per row) * 4 (rows) = 24 systems.

  • receive in system on procurement task
  • rack system per the proposed racking plan (see above) & update racktables (include all system info plus location)
  • bios/drac/serial setup/testing
  • mgmt dns entries added for both asset tag and hostname (see the zone-file sketch after this list)
  • production dns entries added
  • network port setup (description, enable, vlan)
  • operations/puppet update (install_server at minimum, other files if possible)
  • OS installation
  • puppet/salt accept/initial run
  • handoff for service implementation
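
For the mgmt DNS step, a purely illustrative sketch of the convention of adding records for both the hostname and the asset tag, in BIND zone-file syntax (the asset tag and IP below are invented for the example, not the real values):

; both names resolve to the same mgmt IP (values invented for illustration)
wtp1025.mgmt    1H  IN A    10.65.2.25
wmf1234.mgmt    1H  IN A    10.65.2.25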

Event Timeline


Change 355106 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns for new parsoid wtp1025-1048 T165520

https://gerrit.wikimedia.org/r/355106

Change 357860 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076

https://gerrit.wikimedia.org/r/357860

Change 357860 merged by Cmjohnson:
[operations/puppet@production] Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076

https://gerrit.wikimedia.org/r/357860

Change 357870 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding production dns for several new servers, wtp1025-48, ganeti1005-1008, kubestage1001/1002, dumpsdata1001/2, labvirt1015-18 T165173 T166264 T165531 T165520 T162216 T166076

https://gerrit.wikimedia.org/r/357870

Change 357870 merged by Cmjohnson:
[operations/dns@master] Adding production dns for several new servers, wtp1025-48, ganeti1005-1008, kubestage1001/1002, dumpsdata1001/2, labvirt1015-18 and stat1005/6 T165366 T165368 T165173 T166264 T165531 T165520 T162216 T166076

https://gerrit.wikimedia.org/r/357870

Change 357879 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] Revert "Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076"

https://gerrit.wikimedia.org/r/357879

Change 357879 abandoned by RobH:
Revert "Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076"

https://gerrit.wikimedia.org/r/357879

Change 358960 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Adding mac addresses for wtp1025-48 T165520

https://gerrit.wikimedia.org/r/358960

Change 358960 merged by Cmjohnson:
[operations/puppet@production] Adding mac addresses for wtp1025-48 T165520

https://gerrit.wikimedia.org/r/358960

I've confirmed that these are having issues with jessie booting faster than the disks can spin up. Many of them (wtp1027, for example) won't quite get the disks spun up in time for jessie to mount the filesystems during POST.

If we add a rootdelay to the command line arguments for the OS load, it gives the disks enough time to spin up. rootdelay=15 is long enough for these disks to spin up and be detected.

We can manually set it for the initial boot post-installation, and then have puppet set it afterwards. I'm not sure we're OK using it this way, though we have in the past.
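
For illustration, a minimal sketch of applying the workaround by hand on one installed host, assuming a stock Debian GRUB setup (the puppet-managed version would live in operations/puppet instead):

# append rootdelay=15 to the kernel command line, then regenerate grub.cfg
sed -i 's/^GRUB_CMDLINE_LINUX="/&rootdelay=15 /' /etc/default/grub
update-grub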

Change 360876 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] adding rootdelay to jessie installs

https://gerrit.wikimedia.org/r/360876

I had a brief chat in IRC with Faidon about this. The boot issue where the LVM fails to detect (because the disks haven't been detected yet) should technically be fixed by addressing its root cause, not by slapping a 5-second delay in. (Though we've totally done that in the past, systemd should fix this issue now.)

So I'm holding off on pushing the rootdelay patch.

Additionally, these may be candidates for stretch, rather than jessie.

Chatted with Alex.

These need to be jessie. Additionally, the codfw wtp systems have a rootdelay=5 installed on them. Alex mentioned it used to be in the installer and was very recently removed. So we may want to re-introduce it, or figure out the fix Faidon mentioned.
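
For context, a hypothetical illustration of how such a delay rides along on the installer's kernel command line in a PXE config (label and paths invented; the real setting lived in our installer configuration):

LABEL stretch-installer
    KERNEL debian-installer/amd64/linux
    APPEND initrd=debian-installer/amd64/initrd.gz rootdelay=5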

Why do these need to be jessie? Has anyone checked whether that rootdelay=5 workaround is still needed in stretch?

> Why do these need to be jessie? Has anyone checked whether that rootdelay=5 workaround is still needed in stretch?

Parsoid. It's nodejs 6.9.1 (from our own repos) and AFAICT we haven't yet ported nodejs 6.9.1 to stretch nor conducted any kind of tests of nodejs 6.x or parsoid on stretch. In the interest of not blocking this I've said let's go with jessie and we can reimage later on after we've tested this.

We talked about this a little bit on IRC. I think we agreed to try stretch with node.js 6, since we're going to have to do that at some point anyway and there doesn't seem to be any reason to leave this piece of tech debt behind. Someone will have to put node.js 6 (possibly 6.11, that's the latest LTS and in Debian experimental right now?) in stretch-wikimedia, perhaps @MoritzMuehlenhoff?

I emailed out about fixing the disk detection issue a couple of days ago and am still awaiting feedback on the rootdelay versus systemd solution. If no answer is provided shortly, I've already discussed with @mark just using the short root delay for now.

@faidon's last update leaves this at installing stretch, so I'll get to reinstalling these with stretch and adding in the root delay on stretch (not jessie). Please correct me if this isn't right!

Let's see if stretch without the rootdelay works -- if not, feel free to add the rootdelay to unblock this and I'll try to take one of the machines out of rotation to investigate this further.

Sounds good! I'll work on re-imaging these shortly with stretch. Will test without root delay first.

Change 361937 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] install stretch on wtp1025 through wtp1048

https://gerrit.wikimedia.org/r/361937

Change 361937 merged by RobH:
[operations/puppet@production] install stretch on wtp1025 through wtp1048

https://gerrit.wikimedia.org/r/361937

I thought the issue was gone, but it's still there; stretch just handles it far more elegantly:

WARNING: Failed to connect to lvmetad. Falling back to device scanning.
Volume group "wtp1029-vg" not found
Cannot process volume group wtp1029-vg
WARNING: Failed to connect to lvmetad. Falling back to device scanning.
/dev/md0: clean, 35404/3055616 files, 497521/12198656 blocks

Seems the disks fail to spin up fast enough, and then it simply scans all devices a second time and succeeds.

wtp1025-wtp1028 good to go, puppet/salt signed and OS installed.

wtp1029-wtp1038 good to go, puppet/salt signed.

   ┌────────────────┤ [!!] Download installer components ├─────────────────┐
   │                                                                       │
   │ No kernel modules were found. This probably is due to a mismatch      │
   │ between the kernel used by this version of the installer and the      │
   │ kernel version available in the archive.                              │
   │                                                                       │
   │ If you're installing from a mirror, you can work around this problem  │
   │ by choosing to install a different version of Debian. The install     │
   │ will probably fail to work if you continue without kernel modules.    │
   │                                                                       │
   │ Continue the install without loading kernel modules?                  │
   │                                                                       │
   │     <Go Back>                                       <Yes>    <No>     │
   │                                                                       │
   └───────────────────────────────────────────────────────────────────────┘

This is now happening on stretch installs for the remainder of this wtp series. Switching to jessie works, so it seems to be an issue with our stretch mirror/settings.

Likely because of a mismatch between our netboot image and Debian's kernel image. I've updated our netboot image; can you try again?

> Likely because of a mismatch between our netboot image and Debian's kernel image. I've updated our netboot image; can you try again?

Still happening as of 2017-06-30 @ 16:00 GMT.

RobH updated the task description.
RobH subscribed.

These systems are all online with stretch, puppet/salt signed. They have not been added to site.pp specifically, so they are just getting defaults.

Assigning to Alex for implementation. This task can be used to track implementation, or simply resolved.

> We talked about this a little bit on IRC. I think we agreed to try stretch with node.js 6, since we're going to have to do that at some point anyway and there doesn't seem to be any reason to leave this piece of tech debt behind. Someone will have to put node.js 6 (possibly 6.11, that's the latest LTS and in Debian experimental right now?) in stretch-wikimedia, perhaps @MoritzMuehlenhoff?

Sure, I can build nodejs for stretch-wikimedia. There's a new nodejs security release announced for the 10th of July, so the package will soon need an update anyway. How urgent is that? Can the stretch package wait for the new release?

akosiaris changed the task status from Open to Stalled. Jul 3 2017, 8:12 AM

> We talked about this a little bit on IRC. I think we agreed to try stretch with node.js 6, since we're going to have to do that at some point anyway and there doesn't seem to be any reason to leave this piece of tech debt behind. Someone will have to put node.js 6 (possibly 6.11, that's the latest LTS and in Debian experimental right now?) in stretch-wikimedia, perhaps @MoritzMuehlenhoff?

> Sure, I can build nodejs for stretch-wikimedia. There's a new nodejs security release announced for the 10th of July, so the package will soon need an update anyway. How urgent is that? Can the stretch package wait for the new release?

Yes, we are not in any rush.

Change 360876 abandoned by RobH:
adding rootdelay to jessie installs

https://gerrit.wikimedia.org/r/360876

I've now built nodejs 6.11 with the recent security fixes and imported it to stretch-wikimedia.

wtp1031/wtp1032 are not fully installed; it seems like the initial puppet run after the installation didn't happen?

akosiaris changed the task status from Stalled to Open. Sep 13 2017, 11:14 AM

Getting back to this

Change 377960 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Assign to wtp1025-wtp1048 the parsoid role

https://gerrit.wikimedia.org/r/377960

Change 377960 merged by Alexandros Kosiaris:
[operations/puppet@production] Assign to wtp1025-wtp1048 the parsoid role

https://gerrit.wikimedia.org/r/377960
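
For reference, the merged change boils down to a site.pp node stanza along these lines (a sketch; the exact regex and file live in operations/puppet):

# match wtp1025-wtp1048 in eqiad and apply the parsoid role
node /^wtp10(2[5-9]|3[0-9]|4[0-8])\.eqiad\.wmnet$/ {
    role(parsoid)
}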

Change 377966 had a related patch set (by Alexandros Kosiaris) published:
[mediawiki/services/parsoid/deploy@master] Remove the dsh-targets file and use the dsh group. That file is automatically generated from confd and is guaranteed to be up to date, which is not the case for the dsh-targets file.

https://gerrit.wikimedia.org/r/377966

Change 377966 merged by Alexandros Kosiaris:
[mediawiki/services/parsoid/deploy@master] Remove the dsh-targets file and use the dsh group.

https://gerrit.wikimedia.org/r/377966

Mentioned in SAL (#wikimedia-operations) [2017-09-14T09:00:23Z] <akosiaris@tin> Started deploy [parsoid/deploy@cec7d17]: test deploy using dsh groups. T165520

Mentioned in SAL (#wikimedia-operations) [2017-09-14T09:02:48Z] <akosiaris@tin> Finished deploy [parsoid/deploy@cec7d17]: test deploy using dsh groups. T165520 (duration: 02m 25s)

Mentioned in SAL (#wikimedia-operations) [2017-09-14T09:32:03Z] <akosiaris@tin> Started deploy [parsoid/deploy@cec7d17]: test deploy using dsh groups. T165520

Mentioned in SAL (#wikimedia-operations) [2017-09-14T09:41:33Z] <akosiaris@tin> Finished deploy [parsoid/deploy@cec7d17]: test deploy using dsh groups. T165520 (duration: 09m 30s)

Change 377987 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Re-enable notifications for wtp1025-wtp1048

https://gerrit.wikimedia.org/r/377987

Change 377987 merged by Alexandros Kosiaris:
[operations/puppet@production] Re-enable notifications for wtp1025-wtp1048

https://gerrit.wikimedia.org/r/377987

I think we are done. The new parsoid boxes are up and running. They are running Debian stretch and nodejs 6.11. They are not pooled and do not serve any kind of traffic currently. Icinga is green and it seems like all checks are running fine. @ssastry @Arlolra, would you like to test the new boxes and give a thumbs up (or down...)?

> I think we are done. The new parsoid boxes are up and running. They are running Debian stretch and nodejs 6.11. They are not pooled and do not serve any kind of traffic currently. Icinga is green and it seems like all checks are running fine. @ssastry @Arlolra, would you like to test the new boxes and give a thumbs up (or down...)?

I logged onto wtp1025, ran parser tests, and ran a parse of the Barack Obama page on the command line ... those smoke tests passed just fine.
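
For anyone repeating the smoke test, one way to exercise a box directly is an HTTP request against Parsoid's v3 HTML endpoint, assuming the service listens on its usual port 8000 (a sketch, not the exact commands used above):

curl -s http://wtp1025.eqiad.wmnet:8000/en.wikipedia.org/v3/page/html/Barack_Obama | head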

> I think we are done. The new parsoid boxes are up and running. They are running Debian stretch and nodejs 6.11. They are not pooled and do not serve any kind of traffic currently. Icinga is green and it seems like all checks are running fine. @ssastry @Arlolra, would you like to test the new boxes and give a thumbs up (or down...)?

@Arlolra and I are seeing live traffic reaching these machines, except the Linter configuration wasn't updated to handle this.

Change 378073 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[operations/mediawiki-config@master] Whitelist wtp10[25-48] for Linter

https://gerrit.wikimedia.org/r/378073

Currently,

arlolra@tin:/srv/deployment/parsoid/deploy$ confctl select dc=.*,cluster=parsoid,service=parsoid get
{"wtp2011.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp2020.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp2008.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp2009.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp2010.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp2014.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp2002.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp2003.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp2004.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp2005.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp2006.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp2007.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp2012.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp2013.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp2001.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp2019.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp2017.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp2016.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp2018.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp2015.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp1028.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1008.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1010.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1016.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1038.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1027.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1037.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1035.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1015.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1017.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1018.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1019.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1034.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1025.eqiad.wmnet": {"pooled": "no", "weight": 1}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1039.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1012.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1013.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1014.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1023.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1001.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1002.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1006.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1007.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1020.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1044.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1036.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1022.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1041.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1031.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1030.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1003.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1005.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1011.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1024.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1047.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1040.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1032.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1043.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1026.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1029.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1042.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1045.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1004.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1009.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1021.eqiad.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1033.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1046.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
{"wtp1048.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}

but I'm going to depool wtp10(25-48).
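
For the record, depooling the whole range can be done with a single confctl invocation like the following (the regex here is written for illustration, mirroring the select syntax in the get example above):

confctl select 'name=wtp10(2[5-9]|3[0-9]|4[0-8])\.eqiad\.wmnet' set/pooled=no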

This is really peculiar then, and the fault is on me for not double-checking the state. That being said, the default for parsoid is to have to explicitly pool a node[1]. And as you can see from SAL [2] I've only temporarily pooled wtp1025 with a very low weight. I'm also perplexed by the pooled state of wtp1042, wtp1044 and wtp1048. Why were they depooled while the others were pooled? Maybe we've hit a bug?

[1] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/conftool-data/service/services.yaml;7bb62adc2462f417ffe68b377495081b7b7074d2$22
[2] https://tools.wmflabs.org/sal/log/AV5_xUbbwg13V6285mnf

There are a couple of problems with this change:
https://gerrit.wikimedia.org/r/#/c/377966/

/etc/dsh/group/parsoid contains ruthenium.eqiad.wmnet, resulting in:

17:55:07 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'parsoid/deploy', '-g', 'default', 'fetch', '--refresh-config'] on ruthenium.eqiad.wmnet returned [255]: ssh: connect to host ruthenium.eqiad.wmnet port 22: Connection timed out


https://github.com/wikimedia/mediawiki-services-parsoid-deploy/blob/master/scap/checks.yaml#L7
17:55:07 connection to ruthenium.eqiad.wmnet failed and future stages will not be attempted for this target
parsoid/deploy: fetch stage(s): 100% (ok: 6; fail: 1; left: 0)                  
17:55:07 1 targets had deploy errors

But, since failure_limit: 1 is set, the deploy finished.

Second, because nodes are repooled after scap updates the code, the new servers are back in production, as above. Once again, I'm going to manually depool them. But we'll need an effective way of preventing that from happening again.

> There are a couple of problems with this change:
> https://gerrit.wikimedia.org/r/#/c/377966/
>
> /etc/dsh/group/parsoid contains ruthenium.eqiad.wmnet, resulting in:

I am thinking we should remove ruthenium.eqiad.wmnet from the scap::dsh::group::parsoid hieradata. Strictly speaking, ruthenium is not a parsoid box but rather a parsoid test box. Code is not deployed there via scap anyway (the #1 user of dsh groups). I even doubt there is a valid reason for the box to be defined in dsh groups at all, and if there is, it definitely should not be via scap::dsh::group. I'll upload a patch to remove it.
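
A rough sketch of the kind of hieradata change meant here, assuming a plain host list keyed by group (the actual key layout in operations/puppet may differ):

# scap dsh group for parsoid (structure assumed for illustration)
scap::dsh::groups:
  parsoid:
    - wtp1025.eqiad.wmnet
    # ... wtp1026 through wtp1048 ...
    # ruthenium.eqiad.wmnet removed: it is a parsoid test box, not a parsoid box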

> 17:55:07 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'parsoid/deploy', '-g', 'default', 'fetch', '--refresh-config'] on ruthenium.eqiad.wmnet returned [255]: ssh: connect to host ruthenium.eqiad.wmnet port 22: Connection timed out
>
> https://github.com/wikimedia/mediawiki-services-parsoid-deploy/blob/master/scap/checks.yaml#L7
> 17:55:07 connection to ruthenium.eqiad.wmnet failed and future stages will not be attempted for this target
> parsoid/deploy: fetch stage(s): 100% (ok: 6; fail: 1; left: 0)
> 17:55:07 1 targets had deploy errors
>
> But, since failure_limit: 1 is set, the deploy finished.
>
> Second, because nodes are repooled after scap updates the code, the new servers are back in production, as above. Once again, I'm going to manually depool them. But we'll need an effective way of preventing that from happening again.

This is the reason the depooled boxes were pooled back the other day. I've been having discussions with @Joe about this. In order to address this cleanly, we need a conftool schema change, effectively exposing an admin_pooled and an oper_pooled attribute (name bikeshedding welcome). admin_pooled would always reflect the actions of a human; oper_pooled would reflect the actions of tools like pooler-loop. We kind of have that already in the form of inactive, but it is not sufficient, as setting a host to inactive means scap will skip it during deploys, which is not desirable in this case, right?

Note that pybal already has its own oper_pooled mechanism (not really named that way) which is related to the health checks it does, and it depools/pools hosts accordingly.

This is going to take a while to resolve though, and might even be overkill given that we now know what the problem is. Should we just pool all the new wtp hosts?

Change 379200 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Remove ruthenium from scap::dsh::groups::parsoid

https://gerrit.wikimedia.org/r/379200

> We kind of have that already in the form of inactive, but it is not sufficient, as setting a host to inactive means scap will skip it during deploys, which is not desirable in this case, right?

Right, we want the new hosts in sync, though being able to deploy code to a node while it remains depooled and being able to mark a node as inactive both seem like useful operations. (For posterity, what command do I need to issue to mark a node as inactive?)

> This is going to take a while to resolve though, and might even be overkill given that we now know what the problem is. Should we just pool all the new wtp hosts?

That's fine with me; they seem to work. But it's blocked on https://gerrit.wikimedia.org/r/#/c/378073/, otherwise we get a ton of logspam.

> We kind of have that already in the form of inactive, but it is not sufficient, as setting a host to inactive means scap will skip it during deploys, which is not desirable in this case, right?
>
> Right, we want the new hosts in sync, though being able to deploy code to a node while it remains depooled and being able to mark a node as inactive both seem like useful operations. (For posterity, what command do I need to issue to mark a node as inactive?)

It's just a different argument to pooled. Instead of pooled=yes or pooled=no, you say pooled=inactive. Up to now we've used it exclusively for hosts going through extensive maintenance (RMAs, kernel problems and so on).
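
Concretely, that looks like the following (hostname picked for illustration):

confctl select 'name=wtp1042.eqiad.wmnet' set/pooled=inactive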

> This is going to take a while to resolve though, and might even be overkill given that we now know what the problem is. Should we just pool all the new wtp hosts?
>
> That's fine with me; they seem to work. But it's blocked on https://gerrit.wikimedia.org/r/#/c/378073/, otherwise we get a ton of logspam.

I can deploy that. It has two +1s already; I'll do so.

Change 378073 merged by Alexandros Kosiaris:
[operations/mediawiki-config@master] Whitelist wtp10[25-48] for Linter

https://gerrit.wikimedia.org/r/378073

> It's just a different argument to pooled. Instead of pooled=yes or pooled=no, you say pooled=inactive. Up to now we've used it exclusively for hosts going through extensive maintenance (RMAs, kernel problems and so on).

Thanks, yes, I was thinking of cases like T146113.

Mentioned in SAL (#wikimedia-operations) [2017-09-21T08:35:26Z] <akosiaris> pool all of wtp1025 to wtp1048 T165520

All wtp1025 to wtp1048 hosts are pooled now. I've also updated https://grafana.wikimedia.org/dashboard/db/parsoid-servers-cpu-usage?orgId=1 to list the new servers and they look fine, albeit slightly low on load. I am guessing we could increase the weight factor in conftool, but we are going to deprecate the old hosts anyway, so that should do it as well :)

I am gonna resolve this.

Change 379200 merged by Alexandros Kosiaris:
[operations/puppet@production] Remove ruthenium from scap::dsh::groups::parsoid

https://gerrit.wikimedia.org/r/379200