Reimage ores* hosts with Debian Stretch
Closed, Resolved, Public

Description

We'd like to start using the new stat* machines to build models. They appear to be running Stretch, but our new ORES machines were installed with Jessie. What can be done to bring these machines into alignment?

Event Timeline

I just talked to @fgiunchedi about this on IRC and he said that reimaging while the hosts aren't getting traffic makes sense. @akosiaris, how difficult is this?

Quite easy. But let's do T169246 first

\o/ Sounds good.

awight added a subscriber: awight. Edited Oct 27 2017, 4:55 PM

@akosiaris We ran into a little glitch: wheels created on ores-misc-01 aren't compatible with the new ores* cluster. I'm thinking we should reorder the dependencies so that we reimage ores-misc-01 and this cluster before we finish stress testing. Are you okay with that?

Update: I've worked around the wheels problem, see T179095

Change 399596 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Reimage ores100* as stretch

https://gerrit.wikimedia.org/r/399596

Change 399596 merged by Alexandros Kosiaris:
[operations/puppet@production] Reimage ores100* as stretch

https://gerrit.wikimedia.org/r/399596

It's half done (codfw but not eqiad). I've been stalling it on T182799 so that we don't end up with hosts alerting in a non-working state. We could do eqiad as well and follow the codfw approach of simply not applying the roles, but that would not really buy us much.

Sorry for the confusion. T182799 is done. Will resolve.

Change 407018 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Set ores1* as spare::system

https://gerrit.wikimedia.org/r/407018

Change 407018 merged by Alexandros Kosiaris:
[operations/puppet@production] Set ores1* as spare::system

https://gerrit.wikimedia.org/r/407018

Mentioned in SAL (#wikimedia-operations) [2018-01-31T15:44:19Z] <akosiaris> reimage ores100{1..9} T171851

Just checked on the hosts and it seems they are still in progress. @akosiaris, are you still working on the re-imaging?

Yes I am, but I'm currently at FOSDEM, so it'll take a bit longer. ETA is probably Tuesday.

Halfak added a comment. Feb 5 2018, 4:15 PM

Works for me. Thanks :)

Change 408558 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Remove ores::stresstest as its no longer needed

https://gerrit.wikimedia.org/r/408558

Change 408559 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ores: Set oresX00X hosts as role::ores

https://gerrit.wikimedia.org/r/408559

Change 408560 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Remove ORES profile from scb

https://gerrit.wikimedia.org/r/408560

Change 408564 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ores: Allow oresX00X to reach respective oresrdb

https://gerrit.wikimedia.org/r/408564

Change 408558 merged by Alexandros Kosiaris:
[operations/puppet@production] Remove ores::stresstest as its no longer needed

https://gerrit.wikimedia.org/r/408558

Change 408564 merged by Alexandros Kosiaris:
[operations/puppet@production] ores: Allow oresX00X to reach respective oresrdb

https://gerrit.wikimedia.org/r/408564

Change 408559 merged by Alexandros Kosiaris:
[operations/puppet@production] ores: Set oresX00X hosts as role::ores

https://gerrit.wikimedia.org/r/408559

Change 408799 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Add ores scap::dsh::groups

https://gerrit.wikimedia.org/r/408799

Change 408799 merged by Alexandros Kosiaris:
[operations/puppet@production] Add ores scap::dsh::groups

https://gerrit.wikimedia.org/r/408799

Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:17:18Z] <akosiaris@tin> Started deploy [ores/deploy@eb0f776]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:18:39Z] <akosiaris@tin> Finished deploy [ores/deploy@eb0f776]: T171851 (duration: 01m 22s)

Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:19:31Z] <akosiaris@tin> Started deploy [ores/deploy@eb0f776]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:20:26Z] <akosiaris@tin> Finished deploy [ores/deploy@eb0f776]: T171851 (duration: 00m 55s)

Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:32:51Z] <akosiaris@tin> Started deploy [ores/deploy@eb0f776]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:33:04Z] <akosiaris@tin> Finished deploy [ores/deploy@eb0f776]: T171851 (duration: 00m 06s)

Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:34:09Z] <akosiaris@tin> Started deploy [ores/deploy@eb0f776]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:35:30Z] <akosiaris@tin> Finished deploy [ores/deploy@eb0f776]: T171851 (duration: 01m 21s)

Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:35:51Z] <akosiaris@tin> Started deploy [ores/deploy@eb0f776]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:38:35Z] <akosiaris@tin> Finished deploy [ores/deploy@eb0f776]: T171851 (duration: 02m 45s)

Mentioned in SAL (#wikimedia-operations) [2018-02-07T14:03:47Z] <akosiaris@tin> Started deploy [ores/deploy@eb0f776]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-07T14:05:34Z] <akosiaris@tin> Finished deploy [ores/deploy@eb0f776]: T171851 (duration: 01m 47s)

Change 408817 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Disable notification for role::ores

https://gerrit.wikimedia.org/r/408817

The ORES deploy has failed (see T182799#3952787 and T184135#3873517), but the topic of this task, which is to reimage the hosts to stretch, has been successfully completed, so I am thinking I should resolve this.

Change 408817 merged by Alexandros Kosiaris:
[operations/puppet@production] Disable notification for role::ores

https://gerrit.wikimedia.org/r/408817

Halfak added a comment. Feb 9 2018, 9:58 PM

I found this in our deploy repo.

commit 354bff862627e9dbf6f617a72d438a22a4d81912
Author: Alexandros Kosiaris <akosiaris@wikimedia.org>
Date: Wed Feb 7 13:15:02 2018 +0000

    Update ORES production scap configuration

    Remove the cluster server group and related stuff
    Add a new scb server group, to be removed soon
    Remove the ores-worker server group from production
    Move the production configs under [wmnet] stanza
    Add checks for the default group
    Pass --no-download to virtualenv as it looks like it's required in
    stretch

    Bug: T171851

Not sure what is going on as this change was not submitted to gerrit AFAICT

OK I put all of the changes in "alex_stuff"

halfak@tin:/srv/deployment/ores/deploy$ git branch -l 
  CELERY_4
  STABLE
  STABLE_REVSCORING_1
  alex_stuff
* master

Yes, that was me trying to clean up scap.cfg a bit so that I could deploy to ores[12]00[1-9]; then other things preempted me and I didn't manage to finish it. The size limit is required since it seems that tin can't keep up with that many processes trying to clone from it simultaneously. I will also look into increasing the apache MaxRequestWorkers on tin itself, but having scap behave nicely as well wouldn't hurt.
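
To illustrate the idea behind a size limit, here is a minimal Python sketch that expands the ores[12]00[1-9] shorthand and splits the resulting hosts into smaller groups, so that only one group hits the deploy server at a time. This is not scap's implementation; the helper names `expand_hosts` and `split_into_groups` and the group size of 4 are hypothetical and chosen only for the example.

```python
from itertools import islice
from typing import Iterator, List


def expand_hosts(datacenters: str = "12", first: int = 1, last: int = 9) -> List[str]:
    """Expand the shorthand ores[12]00[1-9] into explicit host names."""
    return [f"ores{dc}00{n}" for dc in datacenters for n in range(first, last + 1)]


def split_into_groups(hosts: List[str], group_size: int) -> Iterator[List[str]]:
    """Yield consecutive sub-groups of at most group_size hosts."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    remaining = iter(hosts)
    while True:
        group = list(islice(remaining, group_size))
        if not group:
            return
        yield group


hosts = expand_hosts()
# group_size=4 is an arbitrary value for illustration, not the production setting.
groups = list(split_into_groups(hosts, group_size=4))
for index, group in enumerate(groups, start=1):
    print(f"group {index}: {', '.join(group)}")
```

The trade-off is the one raised in this thread: smaller groups put less simultaneous load on tin, but a full pass over all hosts takes more rounds.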

@akosiaris Do you know whether that group_size change will apply to rollback as well? We want the rollback to be as fast as possible...

Yes, it will, which is why I am already experimenting with fetch_batch_size [1].

[1] https://github.com/wikimedia/scap/blob/master/docs/scap3/repo_config.rst
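
As a rough illustration of what capping the fetch stage buys, the sketch below runs a simulated fetch against every host while never allowing more than a fixed number to run at once. It assumes that a per-stage batch size limits how many targets fetch in parallel; the function `run_with_batch_size`, the `ConcurrencyProbe` class and the value 3 are hypothetical and are not part of scap.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


class ConcurrencyProbe:
    """Simulated fetch that records the peak number of simultaneous calls."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def fetch(self, host: str) -> bool:
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.01)  # stand-in for cloning from the deploy server
        with self._lock:
            self.active -= 1
        return True


def run_with_batch_size(
    hosts: List[str], fetch: Callable[[str], bool], batch_size: int
) -> Dict[str, bool]:
    """Run fetch on every host with at most batch_size calls in flight."""
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        outcomes = list(pool.map(fetch, hosts))
    return dict(zip(hosts, outcomes))


hosts = [f"ores{dc}00{n}" for dc in "12" for n in range(1, 10)]
probe = ConcurrencyProbe()
# batch_size=3 is an arbitrary illustrative value.
results = run_with_batch_size(hosts, probe.fetch, batch_size=3)
print(f"fetched {len(results)} hosts, peak concurrency {probe.peak}")
```

Because the cap applies only to the stage that pulls from the deploy server, the other stages can in principle stay fully parallel, which is relevant to the concern about keeping rollbacks fast.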

Change 409932 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[mediawiki/services/ores/deploy@master] Remove the cluster server group and related stuff

https://gerrit.wikimedia.org/r/409932

Change 409962 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Add the ORES cluster to wikimedia_clusters

https://gerrit.wikimedia.org/r/409962

Change 409962 merged by Alexandros Kosiaris:
[operations/puppet@production] Add the ORES cluster to wikimedia_clusters

https://gerrit.wikimedia.org/r/409962

Mentioned in SAL (#wikimedia-operations) [2018-02-12T17:34:17Z] <akosiaris@tin> Started deploy [ores/deploy@f7e23f4]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-12T17:47:35Z] <akosiaris@tin> Finished deploy [ores/deploy@f7e23f4]: T171851 (duration: 13m 18s)

Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:00:00Z] <akosiaris@tin> Started deploy [ores/deploy@f7e23f4]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:12:30Z] <akosiaris@tin> Finished deploy [ores/deploy@f7e23f4]: T171851 (duration: 12m 30s)

Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:27:53Z] <akosiaris@tin> Started deploy [ores/deploy@f7e23f4]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:34:39Z] <akosiaris@tin> Finished deploy [ores/deploy@f7e23f4]: T171851 (duration: 06m 47s)

Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:35:55Z] <akosiaris@tin> Started deploy [ores/deploy@f7e23f4]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:48:08Z] <akosiaris@tin> Finished deploy [ores/deploy@f7e23f4]: T171851 (duration: 12m 14s)

Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:50:20Z] <akosiaris@tin> Started deploy [ores/deploy@f7e23f4]: T171851

Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:57:59Z] <akosiaris@tin> Finished deploy [ores/deploy@f7e23f4]: T171851 (duration: 07m 39s)

It looks like this is done. Is that right?

akosiaris closed this task as Resolved. Feb 14 2018, 8:31 AM
akosiaris claimed this task.

Yes that's right. Resolving

Change 408560 merged by Alexandros Kosiaris:
[operations/puppet@production] Remove ORES profile from scb

https://gerrit.wikimedia.org/r/408560

Change 409932 abandoned by Alexandros Kosiaris:
Remove the cluster server group and related stuff

Reason:
I think whatever changes I had done on this one have already been done in the past 2 months.

https://gerrit.wikimedia.org/r/409932