Reimage ores* hosts with Debian Stretch
Closed, Resolved (Public)

Assigned To

	akosiaris

Authored By

	Halfak
	Jul 27 2017, 2:12 PM

Description

We'd like to start using the new stat* machines to build models. It looks like they are running Stretch, but our new ORES machines were installed with Jessie. What can be done to bring these machines into alignment?

Details

Subject	Repo	Branch	Lines +/-
Remove the cluster server group and related stuff	mediawiki/services/ores/deploy	master	+14 -22
Remove ORES profile from scb	operations/puppet	production	+10 -61
Add the ORES cluster to wikimedia_clusters	operations/puppet	production	+6 -0
Disable notification for role::ores	operations/puppet	production	+3 -0
Add ores scap::dsh::groups	operations/puppet	production	+7 -0
ores: Allow oresX00X to reach respective oresrdb	operations/puppet	production	+18 -0
ores: Set oresX00X hosts as role::ores	operations/puppet	production	+45 -16
Remove ores::stresstest as its no longer needed	operations/puppet	production	+0 -30
Set ores1* as spare::system	operations/puppet	production	+1 -1
Reimage ores100* as stretch	operations/puppet	production	+18 -0

Related Objects

Status	Assigned	Task
Resolved	akosiaris	T162039 Prepare to service applications from kubernetes
Resolved	akosiaris	T162041 Expand the infrastructure to codfw
Resolved	RobH	T161700 CODFW: (4) hardware access request for kubernetes
Resolved	RobH	T142578 codfw/eqiad:(9+9) hardware access request for ORES
		Unknown Object (Task)
Resolved	akosiaris	T165170 rack/setup/install ores2001-2009
Resolved	None	T176324 Scoring platform team FY18 Q2
Declined	None	T179501 Use external dsh group to list pooled ORES nodes
Resolved	akosiaris	T168073 Switch ORES to dedicated cluster
Resolved	Halfak	T185901 Preliminary deployment of ORES to new cluster
Resolved	akosiaris	T171851 Reimage ores* hosts with Debian Stretch
Resolved	awight	T169246 Stress/capacity test new ores* cluster
Resolved	Halfak	T174402 Review and fix file handle management in worker and celery processes
Resolved	awight	T177036 Clean up file handle and Redis connection management in ORES worker and celery processes
Resolved	None	T175736 Give ores admins read access to /srv/log/ores/main.log*
Resolved	awight	T179095 Wheels built on ores-misc-01 are incompatible with ores* and scb*
Resolved	awight	T181806 Problem with Redis server configuration on new ORES cluster
Resolved	Halfak	T182799 Make sure ORES is compatible with stretch
Resolved	Halfak	T184072 Rebuild ORES models on Stretch
Resolved	Halfak	T184073 Provision a Stretch box we can use to build ORES models
Resolved	akosiaris	T184074 Verify that all enchant/spelling dictionaries are available on Stretch. Port if needed.
Resolved	Halfak	T184135 Rebuild ORES wheels on Stretch
Resolved	Halfak	T184296 Convert CloudVPS instances to stretch.
Resolved	Halfak	T184766 Convert ores-misc-01 to stretch
Resolved	Sumit	T184765 Back up ores-misc-01 to ores-staging-01

Event Timeline

There are a very large number of changes, so older changes are hidden.

In T171851#3478511, @Halfak wrote:

I just talked to @fgiunchedi about this in IRC and he said that reimaging while they aren't getting traffic makes sense. @akosiaris, how difficult is this?

Quite easy. But let's do T169246 first

\o/ Sounds good.

Halfak edited projects, added Machine-Learning-Team (Active Tasks); removed Machine-Learning-Team. Aug 16 2017, 5:26 PM

Halfak edited projects, added Machine-Learning-Team; removed Machine-Learning-Team (Active Tasks). Aug 21 2017, 4:46 PM

awight moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board. Sep 6 2017, 5:18 PM

Halfak added a subtask: T169246: Stress/capacity test new ores* cluster. Oct 27 2017, 3:45 PM

@akosiaris We ran into a little glitch: wheels created on ores-misc-01 aren't compatible with the new ores* cluster. I'm thinking we should reorder the dependencies and reimage ores-misc-01 and this cluster before we finish stress testing. Are you okay with that?

Update: I've worked around the wheels problem, see T179095

Halfak added a parent task: T168073: Switch ORES to dedicated cluster. Oct 30 2017, 6:05 PM

awight closed subtask T169246: Stress/capacity test new ores* cluster as Resolved. Dec 6 2017, 8:04 PM

Halfak mentioned this in T169246: Stress/capacity test new ores* cluster. Dec 6 2017, 8:39 PM

awight reopened subtask T169246: Stress/capacity test new ores* cluster as Open. Dec 7 2017, 8:05 PM

akosiaris added a parent task: T165170: rack/setup/install ores2001-2009. Dec 8 2017, 1:10 PM

awight mentioned this in T182799: Make sure ORES is compatible with stretch. Dec 13 2017, 4:59 PM

awight closed subtask T169246: Stress/capacity test new ores* cluster as Resolved. Dec 14 2017, 9:22 PM

Change 399596 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Reimage ores100* as stretch

https://gerrit.wikimedia.org/r/399596

Change 399596 merged by Alexandros Kosiaris:
[operations/puppet@production] Reimage ores100* as stretch

https://gerrit.wikimedia.org/r/399596

@akosiaris, is this done?

Halfak added a parent task: T185901: Preliminary deployment of ORES to new cluster. Jan 29 2018, 4:33 PM

It's half done (codfw but not eqiad). I've been stalling it on T182799 so that we don't end up with hosts alerting in a non-working state. We could do eqiad as well and follow the codfw approach of just not having the roles applied, but that would not really buy us much.

Sorry for the confusion. T182799 is done. Will resolve.

akosiaris added a subtask: T182799: Make sure ORES is compatible with stretch. Jan 31 2018, 3:34 PM

Change 407018 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Set ores1* as spare::system

https://gerrit.wikimedia.org/r/407018

Change 407018 merged by Alexandros Kosiaris:
[operations/puppet@production] Set ores1* as spare::system

https://gerrit.wikimedia.org/r/407018

Mentioned in SAL (#wikimedia-operations) [2018-01-31T15:44:19Z] <akosiaris> reimage ores100{1..9} T171851

Just checked on the hosts and it seems they are still in progress. @akosiaris, are you still working on the re-imaging?

Yes, I am, but I'm currently at FOSDEM, so it'll take a bit longer. ETA is probably Tuesday.

Works for me. Thanks :)

Change 408558 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Remove ores::stresstest as its no longer needed

https://gerrit.wikimedia.org/r/408558

Change 408559 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ores: Set oresX00X hosts as role::ores

https://gerrit.wikimedia.org/r/408559

Change 408560 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Remove ORES profile from scb

https://gerrit.wikimedia.org/r/408560

Change 408564 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ores: Allow oresX00X to reach respective oresrdb

https://gerrit.wikimedia.org/r/408564

Change 408558 merged by Alexandros Kosiaris:
[operations/puppet@production] Remove ores::stresstest as its no longer needed

https://gerrit.wikimedia.org/r/408558

Change 408564 merged by Alexandros Kosiaris:
[operations/puppet@production] ores: Allow oresX00X to reach respective oresrdb

https://gerrit.wikimedia.org/r/408564

Change 408559 merged by Alexandros Kosiaris:
[operations/puppet@production] ores: Set oresX00X hosts as role::ores

https://gerrit.wikimedia.org/r/408559

Change 408799 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Add ores scap::dsh::groups

https://gerrit.wikimedia.org/r/408799

Change 408799 merged by Alexandros Kosiaris:
[operations/puppet@production] Add ores scap::dsh::groups

https://gerrit.wikimedia.org/r/408799

Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:17:18Z] <akosiaris@tin> Started deploy [ores/deploy@eb0f776]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:18:39Z] <akosiaris@tin> Finished deploy [ores/deploy@eb0f776]: T171851 (duration: 01m 22s)

Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:19:31Z] <akosiaris@tin> Started deploy [ores/deploy@eb0f776]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:20:26Z] <akosiaris@tin> Finished deploy [ores/deploy@eb0f776]: T171851 (duration: 00m 55s)

Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:32:51Z] <akosiaris@tin> Started deploy [ores/deploy@eb0f776]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:33:04Z] <akosiaris@tin> Finished deploy [ores/deploy@eb0f776]: T171851 (duration: 00m 06s)

Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:34:09Z] <akosiaris@tin> Started deploy [ores/deploy@eb0f776]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:35:30Z] <akosiaris@tin> Finished deploy [ores/deploy@eb0f776]: T171851 (duration: 01m 21s)

Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:35:51Z] <akosiaris@tin> Started deploy [ores/deploy@eb0f776]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:38:35Z] <akosiaris@tin> Finished deploy [ores/deploy@eb0f776]: T171851 (duration: 02m 45s)

Mentioned in SAL (#wikimedia-operations) [2018-02-07T14:03:47Z] <akosiaris@tin> Started deploy [ores/deploy@eb0f776]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-07T14:05:34Z] <akosiaris@tin> Finished deploy [ores/deploy@eb0f776]: T171851 (duration: 01m 47s)

Change 408817 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Disable notification for role::ores

https://gerrit.wikimedia.org/r/408817

akosiaris reopened subtask T182799: Make sure ORES is compatible with stretch as Open. Feb 7 2018, 3:21 PM

The ORES deploy has failed (see T182799#3952787 and T184135#3873517), but the topic of this task, which is to reimage the hosts to stretch, has been successfully completed, so I am thinking I should resolve this.

Change 408817 merged by Alexandros Kosiaris:
[operations/puppet@production] Disable notification for role::ores

https://gerrit.wikimedia.org/r/408817

Halfak closed subtask T182799: Make sure ORES is compatible with stretch as Resolved. Feb 8 2018, 5:07 PM

I found this in our deploy repo.

P6677 (An Untitled Masterwork)

commit 354bff862627e9dbf6f617a72d438a22a4d81912
Author: Alexandros Kosiaris <akosiaris@wikimedia.org>
Date: Wed Feb 7 13:15:02 2018 +0000

    Update ORES production scap configuration

    Remove the cluster server group and related stuff
    Add a new scb server group, to be removed soon
    Remove the ores-worker server group from production
    Move the production configs under [wmnet] stanza
    Add checks for the default group
    Pass --no-download to virtualenv as it looks like it's required in
    stretch

    Bug: T171851

Not sure what is going on as this change was not submitted to gerrit AFAICT

OK I put all of the changes in "alex_stuff"

halfak@tin:/srv/deployment/ores/deploy$ git branch -l 
  CELERY_4
  STABLE
  STABLE_REVSCORING_1
  alex_stuff
* master

In T171851#3959677, @Halfak wrote:

I found this in our deploy repo.

P6677 (An Untitled Masterwork)
commit 354bff862627e9dbf6f617a72d438a22a4d81912
Author: Alexandros Kosiaris <akosiaris@wikimedia.org>
Date: Wed Feb 7 13:15:02 2018 +0000

    Update ORES production scap configuration

    Remove the cluster server group and related stuff
    Add a new scb server group, to be removed soon
    Remove the ores-worker server group from production
    Move the production configs under [wmnet] stanza
    Add checks for the default group
    Pass --no-download to virtualenv as it looks like it's required in
    stretch

    Bug: T171851

Not sure what is going on as this change was not submitted to gerrit AFAICT

Yes, that was me trying to clean up scap.cfg a bit so that I could deploy to ores[12]00[1-9]; then other things preempted me and I didn't manage to finish it. The size limit is required since it seems that tin can't keep up with that many processes trying to clone from it simultaneously. I will also look into increasing Apache's MaxRequestWorkers on tin itself, but having scap behave nicely as well wouldn't hurt.
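
For illustration only, the idea behind a size limit can be sketched in Python: expand the ores[12]00[1-9] pattern into hostnames, then hand them out in small batches so that only a few targets would be cloning from the deploy server at once. This is not scap's implementation; the function names `expand_hosts` and `batches` and the batch size of 4 are hypothetical choices made for the sketch.

```python
from typing import Iterator, List


def expand_hosts() -> List[str]:
    """Short hostnames matching ores[12]00[1-9] (hypothetical helper)."""
    return [f"ores{dc}00{n}" for dc in (1, 2) for n in range(1, 10)]


def batches(hosts: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` hosts."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


if __name__ == "__main__":
    # The batch size of 4 is an arbitrary value chosen for the example.
    for group in batches(expand_hosts(), 4):
        print(" ".join(group))
```

A smaller batch size lowers the number of simultaneous clones at the cost of a longer overall run, which is the trade-off raised about rollback speed later in this thread.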

akosiaris mentioned this in rORESDEPLOYc18e42251dbb: Remove the cluster server group and related stuff Add a new scb server group…. Feb 12 2018, 4:23 PM

akosiaris mentioned this in rORESDEPLOYb2950478078e: Remove the cluster server group and related stuff Add a new scb server group…. Feb 12 2018, 4:27 PM

akosiaris mentioned this in rORESDEPLOY8aaa0cf97af2: Remove the cluster server group and related stuff Add a new scb server group…. Feb 12 2018, 4:29 PM

akosiaris mentioned this in rORESDEPLOY9d7482fc9adf: Remove the cluster server group and related stuff Add a new scb server group…. Feb 12 2018, 4:33 PM

@akosiaris Do you know whether that group_size change will apply to rollback as well? We want the rollback to be as fast as possible...

Yes, it will, which is why I am already experimenting with fetch_batch_size [1].

[1] https://github.com/wikimedia/scap/blob/master/docs/scap3/repo_config.rst

Change 409932 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[mediawiki/services/ores/deploy@master] Remove the cluster server group and related stuff

https://gerrit.wikimedia.org/r/409932

akosiaris mentioned this in rORESDEPLOYf613e60e518b: Remove the cluster server group and related stuff. Feb 12 2018, 5:11 PM

Change 409962 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Add the ORES cluster to wikimedia_clusters

https://gerrit.wikimedia.org/r/409962

Change 409962 merged by Alexandros Kosiaris:
[operations/puppet@production] Add the ORES cluster to wikimedia_clusters

https://gerrit.wikimedia.org/r/409962

Mentioned in SAL (#wikimedia-operations) [2018-02-12T17:34:17Z] <akosiaris@tin> Started deploy [ores/deploy@f7e23f4]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-12T17:47:35Z] <akosiaris@tin> Finished deploy [ores/deploy@f7e23f4]: T171851 (duration: 13m 18s)

Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:00:00Z] <akosiaris@tin> Started deploy [ores/deploy@f7e23f4]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:12:30Z] <akosiaris@tin> Finished deploy [ores/deploy@f7e23f4]: T171851 (duration: 12m 30s)

Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:27:53Z] <akosiaris@tin> Started deploy [ores/deploy@f7e23f4]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:34:39Z] <akosiaris@tin> Finished deploy [ores/deploy@f7e23f4]: T171851 (duration: 06m 47s)

Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:35:55Z] <akosiaris@tin> Started deploy [ores/deploy@f7e23f4]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:48:08Z] <akosiaris@tin> Finished deploy [ores/deploy@f7e23f4]: T171851 (duration: 12m 14s)

Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:50:20Z] <akosiaris@tin> Started deploy [ores/deploy@f7e23f4]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:57:59Z] <akosiaris@tin> Finished deploy [ores/deploy@f7e23f4]: T171851 (duration: 07m 39s)

It looks like this is done. Is that right?

Yes that's right. Resolving

akosiaris mentioned this in rORESDEPLOY34c361233e70: Remove the cluster server group and related stuff. Feb 14 2018, 8:37 AM

Change 408560 merged by Alexandros Kosiaris:
[operations/puppet@production] Remove ORES profile from scb

https://gerrit.wikimedia.org/r/408560

Mentioned in SAL (#wikimedia-operations) [2018-02-22T11:39:33Z] <akosiaris> purge ORES from scb hosts T168073 T171851

Change 409932 abandoned by Alexandros Kosiaris:
Remove the cluster server group and related stuff

Reason:
I think whatever changes I had done on this one have already been done in the past 2 months.

https://gerrit.wikimedia.org/r/409932

awight mentioned this in Blog Post: Status Update (May 2, 2018). May 2 2018, 6:43 PM

akosiaris mentioned this in rORESDEPLOY72a8cfe6a120: Create change. Jun 10 2018, 12:22 PM

akosiaris mentioned this in rORESDEPLOY57fbb41bd5f0: Create patch set 2.
