We'd like to start using the new stat* machines to build models. They appear to be running Stretch, but our new ORES machines were installed with Jessie. What can be done to bring these machines into alignment?
Event Timeline
@akosiaris We ran into a little glitch: wheels created on ores-misc-01 aren't compatible with the new ores* cluster. I'm thinking we should reorder the dependencies so that we reimage ores-misc-01 and this cluster before we finish stress testing. Are you okay with that?
Update: I've worked around the wheels problem; see T179095.
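For context, the thread does not state why the wheels were incompatible (the details are in T179095). One common cause when moving between Debian releases is the interpreter version: Jessie ships Python 3.4 and Stretch ships Python 3.5, and a wheel with compiled extensions carries a version-specific tag in its filename. The following is a minimal sketch of that check, assuming this was the kind of mismatch involved; the wheel filename and the helper names are hypothetical.

```python
import sys


def wheel_tags(filename):
    """Return the (python, abi, platform) tags of a wheel filename."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel: " + filename)
    parts = filename[:-len(".whl")].split("-")
    # The last three dash-separated fields are the compatibility tags.
    return tuple(parts[-3:])


def matches_python(filename, version=None):
    """Check only the Python tag against a (major, minor) version."""
    major, minor = version or sys.version_info[:2]
    accepted = {
        "cp{}{}".format(major, minor),
        "py{}{}".format(major, minor),
        "py{}".format(major),
    }
    python_tag = wheel_tags(filename)[0]
    # A tag such as "py2.py3" is a compressed set of alternatives.
    return any(tag in accepted for tag in python_tag.split("."))


# Hypothetical wheel built against CPython 3.4.
example = "example_pkg-1.0-cp34-cp34m-linux_x86_64.whl"
print(matches_python(example, (3, 4)))  # Jessie's python3
print(matches_python(example, (3, 5)))  # Stretch's python3
```

This only inspects the Python tag; a real installer also checks the ABI and platform tags, so treat it as an illustration rather than a full compatibility test.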
Change 399596 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Reimage ores100* as stretch
Change 399596 merged by Alexandros Kosiaris:
[operations/puppet@production] Reimage ores100* as stretch
It's half done (codfw but not eqiad). I've been stalling it on T182799 so that we don't end up with hosts in a non-working state triggering alerts. We could do eqiad as well and follow the codfw approach of simply not applying the roles, but that would not really buy us much.
Change 407018 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Set ores1* as spare::system
Change 407018 merged by Alexandros Kosiaris:
[operations/puppet@production] Set ores1* as spare::system
Mentioned in SAL (#wikimedia-operations) [2018-01-31T15:44:19Z] <akosiaris> reimage ores100{1..9} T171851
Just checked on the hosts and it seems they are still in progress. @akosiaris, are you still working on the re-imaging?
Change 408558 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Remove ores::stresstest as its no longer needed
Change 408559 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ores: Set oresX00X hosts as role::ores
Change 408560 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Remove ORES profile from scb
Change 408564 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ores: Allow oresX00X to reach respective oresrdb
Change 408558 merged by Alexandros Kosiaris:
[operations/puppet@production] Remove ores::stresstest as its no longer needed
Change 408564 merged by Alexandros Kosiaris:
[operations/puppet@production] ores: Allow oresX00X to reach respective oresrdb
Change 408559 merged by Alexandros Kosiaris:
[operations/puppet@production] ores: Set oresX00X hosts as role::ores
Change 408799 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Add ores scap::dsh::groups
Change 408799 merged by Alexandros Kosiaris:
[operations/puppet@production] Add ores scap::dsh::groups
Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:17:18Z] <akosiaris@tin> Started deploy [ores/deploy@eb0f776]: T171851
Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:18:39Z] <akosiaris@tin> Finished deploy [ores/deploy@eb0f776]: T171851 (duration: 01m 22s)
Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:19:31Z] <akosiaris@tin> Started deploy [ores/deploy@eb0f776]: T171851
Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:20:26Z] <akosiaris@tin> Finished deploy [ores/deploy@eb0f776]: T171851 (duration: 00m 55s)
Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:32:51Z] <akosiaris@tin> Started deploy [ores/deploy@eb0f776]: T171851
Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:33:04Z] <akosiaris@tin> Finished deploy [ores/deploy@eb0f776]: T171851 (duration: 00m 06s)
Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:34:09Z] <akosiaris@tin> Started deploy [ores/deploy@eb0f776]: T171851
Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:35:30Z] <akosiaris@tin> Finished deploy [ores/deploy@eb0f776]: T171851 (duration: 01m 21s)
Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:35:51Z] <akosiaris@tin> Started deploy [ores/deploy@eb0f776]: T171851
Mentioned in SAL (#wikimedia-operations) [2018-02-07T13:38:35Z] <akosiaris@tin> Finished deploy [ores/deploy@eb0f776]: T171851 (duration: 02m 45s)
Mentioned in SAL (#wikimedia-operations) [2018-02-07T14:03:47Z] <akosiaris@tin> Started deploy [ores/deploy@eb0f776]: T171851
Mentioned in SAL (#wikimedia-operations) [2018-02-07T14:05:34Z] <akosiaris@tin> Finished deploy [ores/deploy@eb0f776]: T171851 (duration: 01m 47s)
Change 408817 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Disable notification for role::ores
The ORES deploy has failed (see T182799#3952787 and T184135#3873517), but the topic of this task, which is to reimage the hosts to stretch, has been successfully completed, so I am thinking I should resolve this.
Change 408817 merged by Alexandros Kosiaris:
[operations/puppet@production] Disable notification for role::ores
I found this in our deploy repo.
Not sure what is going on, as this change was not submitted to gerrit AFAICT.
OK I put all of the changes in "alex_stuff"
halfak@tin:/srv/deployment/ores/deploy$ git branch -l
  CELERY_4
  STABLE
  STABLE_REVSCORING_1
  alex_stuff
* master
Yes, that was me trying to clean up scap.cfg a bit so that I could deploy to ores[12]00[1-9]; other things then preempted me and I didn't manage to finish it. The size limit is required because tin apparently can't keep up with that many processes trying to clone from it simultaneously. I will also look into increasing Apache's MaxRequestWorkers on tin itself, but having scap behave nicely as well wouldn't hurt.
@akosiaris Do you know whether that group_size change will apply to rollback as well? We want the rollback to be as fast as possible...
Yes, it will, which is why I am already experimenting with fetch_batch_size [1].
[1] https://github.com/wikimedia/scap/blob/master/docs/scap3/repo_config.rst
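To illustrate the trade-off discussed above, here is a rough sketch of how the two settings would affect the 18 hosts in ores[12]00[1-9]. It assumes, as the linked configuration doc suggests, that group_size splits the targets into subgroups deployed one after another, while fetch_batch_size only caps how many hosts fetch from tin at once. The value 4 and the function names are purely illustrative, and none of this is scap's actual implementation.

```python
import math


def ores_hosts():
    """Expand ores[12]00[1-9] into the 18 host names mentioned above."""
    return ["ores{}00{}".format(dc, n) for dc in (1, 2) for n in range(1, 10)]


def sequential_groups(hosts, group_size):
    """Split hosts into subgroups that would be deployed one after another."""
    return [hosts[i:i + group_size] for i in range(0, len(hosts), group_size)]


def fetch_waves(hosts, batch_size):
    """Number of waves needed if at most batch_size hosts fetch at once."""
    return math.ceil(len(hosts) / batch_size)


hosts = ores_hosts()
groups = sequential_groups(hosts, 4)  # 4 is an arbitrary illustrative value
print(len(hosts), len(groups), [len(g) for g in groups])
print(fetch_waves(hosts, 4))
```

Under these assumptions, sequential subgroups serialize every stage of the deploy, rollback included, whereas a fetch-only cap limits load on tin during cloning and leaves the other stages at full parallelism.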
Change 409932 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[mediawiki/services/ores/deploy@master] Remove the cluster server group and related stuff
Change 409962 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Add the ORES cluster to wikimedia_clusters
Change 409962 merged by Alexandros Kosiaris:
[operations/puppet@production] Add the ORES cluster to wikimedia_clusters
Mentioned in SAL (#wikimedia-operations) [2018-02-12T17:34:17Z] <akosiaris@tin> Started deploy [ores/deploy@f7e23f4]: T171851
Mentioned in SAL (#wikimedia-operations) [2018-02-12T17:47:35Z] <akosiaris@tin> Finished deploy [ores/deploy@f7e23f4]: T171851 (duration: 13m 18s)
Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:00:00Z] <akosiaris@tin> Started deploy [ores/deploy@f7e23f4]: T171851
Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:12:30Z] <akosiaris@tin> Finished deploy [ores/deploy@f7e23f4]: T171851 (duration: 12m 30s)
Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:27:53Z] <akosiaris@tin> Started deploy [ores/deploy@f7e23f4]: T171851
Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:34:39Z] <akosiaris@tin> Finished deploy [ores/deploy@f7e23f4]: T171851 (duration: 06m 47s)
Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:35:55Z] <akosiaris@tin> Started deploy [ores/deploy@f7e23f4]: T171851
Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:48:08Z] <akosiaris@tin> Finished deploy [ores/deploy@f7e23f4]: T171851 (duration: 12m 14s)
Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:50:20Z] <akosiaris@tin> Started deploy [ores/deploy@f7e23f4]: T171851
Mentioned in SAL (#wikimedia-operations) [2018-02-12T18:57:59Z] <akosiaris@tin> Finished deploy [ores/deploy@f7e23f4]: T171851 (duration: 07m 39s)
Change 408560 merged by Alexandros Kosiaris:
[operations/puppet@production] Remove ORES profile from scb
Mentioned in SAL (#wikimedia-operations) [2018-02-22T11:39:33Z] <akosiaris> purge ORES from scb hosts T168073 T171851
Change 409932 abandoned by Alexandros Kosiaris:
Remove the cluster server group and related stuff
Reason:
I think whatever changes I had done on this one have already been done in the past 2 months.