
switch releases.wikimedia.org from eqiad to codfw
Closed, Declined (Public)

Description

This service is still in eqiad.

Also, it is in service_catalog and marked "active-active", but in practice it is not: it's really just an alias for eqiad.

So that needs follow-up in the future like other services that are not in the catalog yet.

And for now it should be switched over just like we did with miscweb static sites hosted on miscweb*.

This wasn't covered because it has its own dedicated backends, releases*.

Event Timeline

[releases2002:~] $ host releases.discovery.wmnet
releases.discovery.wmnet is an alias for releases1002.eqiad.wmnet.
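
The check above could be scripted. This is only a minimal sketch, assuming the `host` output format shown above and targets named `<host>.<site>.wmnet`; `alias_site` is a hypothetical helper, not an existing tool.

```python
import re
from typing import Optional

# Matches lines like:
#   "releases.discovery.wmnet is an alias for releases1002.eqiad.wmnet."
ALIAS_RE = re.compile(r"^(\S+) is an alias for (\S+?)\.?$")


def alias_site(host_output: str) -> Optional[str]:
    """Return the site (e.g. 'eqiad') of the alias target, or None if no alias line is found."""
    for line in host_output.splitlines():
        match = ALIAS_RE.match(line.strip())
        if match:
            labels = match.group(2).split(".")
            # Assumption: targets are named <host>.<site>.wmnet.
            if len(labels) == 3 and labels[2] == "wmnet":
                return labels[1]
    return None
```

Fed the output above, this would report `eqiad`, which is the condition this task set out to change.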

Change 893576 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] switch releases.wikimedia.org from eqiad to codfw

https://gerrit.wikimedia.org/r/893576

Change 893577 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] switch releases.wikimedia.org backends rsync direction

https://gerrit.wikimedia.org/r/893577

Mentioned in SAL (#wikimedia-operations) [2023-03-02T01:08:25Z] <mutante> releases2002 - stopping apache2 to test alerting (active server is 1002 but should be switched) T327975 T330960

Change 893829 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] httpbb: fix tests for releases.wikimedia.org, remove parsoid

https://gerrit.wikimedia.org/r/893829

Change 893829 merged by Dzahn:

[operations/puppet@production] httpbb: fix tests for releases.wikimedia.org, remove parsoid

https://gerrit.wikimedia.org/r/893829

I confirmed that the files in /srv/org/wikimedia/releases/ (the document root) are identical on both backends and that the existing timer and rsync command work. I also ran it manually and there was no diff to transfer.

/usr/bin/rsync --delete -av rsync://releases1002.eqiad.wmnet/srv-org-wikimedia-releases-releases2002.codfw.wmnet /srv/org/wikimedia/releases/
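
For a check that cannot transfer or delete anything, the same command can be run with rsync's `--dry-run` flag. A minimal sketch that builds such a command; the module naming pattern (path with slashes turned into dashes, then the destination host) is inferred from the command above, and `rsync_check_command` is a hypothetical helper.

```python
from typing import List


def rsync_check_command(source_host: str, dest_host: str, path: str) -> List[str]:
    """Build an rsync dry-run command that lists differences without copying."""
    # Assumption: module name is the path with "/" replaced by "-",
    # followed by "-<destination host>", as in the command above.
    module = path.strip("/").replace("/", "-") + "-" + dest_host
    return [
        "/usr/bin/rsync",
        "--dry-run",  # report what would change, transfer nothing
        "--delete",
        "-av",
        f"rsync://{source_host}/{module}",
        path.rstrip("/") + "/",
    ]
```

The resulting list can be passed to `subprocess.run` on the destination host; if the output lists no files, the two document roots match.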

compiler output: https://puppet-compiler.wmflabs.org/output/893577/39933/

Then I switched the rsync direction around.

Change 893577 merged by Dzahn:

[operations/puppet@production] switch releases.wikimedia.org backends rsync direction

https://gerrit.wikimedia.org/r/893577

Mentioned in SAL (#wikimedia-operations) [2023-03-03T00:13:39Z] <mutante> switching releases.wikimedia.org from eqiad to codfw - T330960

Mentioned in SAL (#wikimedia-operations) [2023-03-03T01:12:27Z] <mutante> releases1002: deleting /usr/local/sbin/sync-srv-org-wikimedia-reprepro-releases1002.eqiad.wmnet which confusingly contains an rsync command to rsync from releases1001 which does not exist anymore T330960

Change 893828 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] releases: add monitor for releases-jenkins.wikimedia.org

https://gerrit.wikimedia.org/r/893828

Change 893576 merged by Dzahn:

[operations/dns@master] switch releases.wikimedia.org from eqiad to codfw

https://gerrit.wikimedia.org/r/893576

This is switched! I am sending an email to all members in admin groups on releases* servers.

I am just keeping it open to add docs to wikitech tomorrow.

hashar added subscribers: jnuche, hashar.

From my mail earlier today, that broke the release Jenkins service.

The service configuration files are managed by scap via the jenkins-deploy repository, and this is currently a work in progress. Notably, we do not vary the configuration between the hosts, so the Jenkins on releases2002 attempts to attach releases1002 as an agent, which does not work due to a lack of iptables rules (and in any case Jenkins should attach releases2002 as an agent).

The jobs are also not under configuration management, for which we will need to find a solution (either storing the flat XML files in the jenkins-deploy repository or using Jenkins Job Builder to generate them). The Jenkins running on releases1002 thus does not have any job defined.

Jenkins on releases2002 is still up and running, so I guess the job used to automatically cut the branch and update the wmf branch release notes will eventually kick in successfully, but since the service entry ( https://releases-jenkins.wikimedia.org/ ) got switched to the other host, we can't access it.

I guess we should have opted out of switching over the releases hosts because of Jenkins (like we opted out the contint servers).

Change 893778 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] Revert "switch releases.wikimedia.org backends rsync direction"

https://gerrit.wikimedia.org/r/893778

Change 893779 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/dns@master] Revert "switch releases.wikimedia.org from eqiad to codfw"

https://gerrit.wikimedia.org/r/893779

Change 893778 merged by Dzahn:

[operations/puppet@production] Revert "switch releases.wikimedia.org backends rsync direction"

https://gerrit.wikimedia.org/r/893778

Change 893779 merged by Dzahn:

[operations/dns@master] Revert "switch releases.wikimedia.org from eqiad to codfw"

https://gerrit.wikimedia.org/r/893779

https://releases-jenkins.wikimedia.org/ is back on releases1002 which has the fully configured Jenkins instance.

I have manually stopped, disabled and masked Jenkins on releases2002. I ran Puppet again and it has stayed that way.

Dzahn changed the task status from Open to Stalled. (Mar 3 2023, 3:12 PM)
Dzahn claimed this task.

> From my mail earlier today, that broke the release Jenkins service.

The code comment below, from profile::releases::common, seems relevant here: we allow rsyncing Jenkins data between servers, but it does not happen automatically.

# allow syncing jenkins data between servers for migrations
# but do not automatically do it
rsync::quickdatacopy { "var-lib-jenkins-${releases_server}":
  ensure      => present,
  auto_sync   => false,
  delete      => true,
  source_host => $primary_server,
  dest_host   => $releases_server,
  module_path => '/var/lib/jenkins',
}

Change 894090 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] releases: ensure rsync timers are removed when switching backends

https://gerrit.wikimedia.org/r/894090

Change 894090 merged by Dzahn:

[operations/puppet@production] releases: ensure rsync timers are removed when switching backends

https://gerrit.wikimedia.org/r/894090

Change 894072 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] Revert "releases: ensure rsync timers are removed when switching backends"

https://gerrit.wikimedia.org/r/894072

Change 894072 merged by Dzahn:

[operations/puppet@production] Revert "releases: ensure rsync timers are removed when switching backends"

https://gerrit.wikimedia.org/r/894072

^ Merging the change above and then reverting it again (after realizing it would not work) did actually fix the issue for now.

First, the merge removed the timer that should not exist, but along with it the rsyncd config that _needs_ to exist. Then the revert added the rsyncd config back, but not the timer that should not exist.

So for now all is OK: the timer exists only on releases2002, which can and does pull from releases1002, and nothing is broken on releases1002.
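
The state described above can be written down as a small model. This is only a sketch, assuming the rsyncd config belongs on the source (primary) host and the pull timer on the other host; `expected_sync_resources` and its key names are hypothetical and not part of the Puppet code.

```python
from typing import Dict, List


def expected_sync_resources(primary: str, servers: List[str]) -> Dict[str, Dict[str, bool]]:
    """Per host: should it have the rsyncd config, and should it run the pull timer?"""
    return {
        server: {
            # Assumption: the source must expose the rsync module so others can pull.
            "rsyncd_config": server == primary,
            # Assumption: only non-primary hosts pull, so only they get a timer.
            "sync_timer": server != primary,
        }
        for server in servers
    }
```

With releases1002 as primary, this yields a timer only on releases2002 and rsyncd config only on releases1002, which matches the working state after the merge and revert.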

@hashar A follow up question, is there a plan to configure releases2002 so that the switchover is possible? I'm asking both in terms of testing this within the DC switchover period and longer term operational expectations.

Change 893828 merged by Dzahn:

[operations/puppet@production] releases: add monitor for releases-jenkins.wikimedia.org

https://gerrit.wikimedia.org/r/893828

> @hashar A follow up question, is there a plan to configure releases2002 so that the switchover is possible? I'm asking both in terms of testing this within the DC switchover period and longer term operational expectations.

Not for this switchover period. I have considered it, but that clashed with ongoing work to fully manage the Jenkins configuration. The first step is done for the release Jenkins: one can automatically provision its installation and configuration. However, there are still a couple of blockers:

  • we need to manage the job configuration, which is currently manually crafted
  • the current deployment spins up two live Jenkins instances (on releases2002 and releases1002), both up and running, which would cause timed jobs to trigger twice (once on each instance)

I don't think we can realistically achieve those over the next few days. The CI Jenkins is further behind since we are not fully managing its configuration yet, which is one reason I opted it out of the switchover.

I am inclined to decline this task about switching releases.wikimedia.org with the reason: Jenkins is not easily switchable yet.

For the record, it would not be hard (on the ATS side) to separate releases.wikimedia.org from releases-jenkins.wikimedia.org and switch the former while not touching the latter.

If anyone sees value in that, it's an option as well. All we would have to do is use a different DNS discovery name.

Regarding "over the next few days", I don't think there was such an expectation, because there is nothing to switch back if we already reverted the switch.

So it should all be just about next time a switch happens in the eqiad -> codfw direction.

fwiw I don't think we normally decline tasks just because _right now_ they can't be done if we still want to do them in the future.

Change 897999 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] releases: limit monitoring of releases-jenkins to active server

https://gerrit.wikimedia.org/r/897999

^ We need to limit monitoring to the active server for now to avoid false alerts because we are getting the 503 from the codfw instance.
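
The intent can be sketched as a filter on check results. This is an illustration only, assuming a healthy check returns HTTP 200; `hosts_to_alert` is a hypothetical helper and not how the Puppet change implements it.

```python
from typing import Dict, List


def hosts_to_alert(status_by_host: Dict[str, int], active_host: str) -> List[str]:
    """Return hosts that should alert: only the active one, and only on a non-200 status."""
    return [
        host
        for host, status in status_by_host.items()
        # The passive host is skipped: Jenkins is stopped there on purpose.
        if host == active_host and status != 200
    ]
```

With releases1002 active and healthy, a 503 from releases2002 produces no alert.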

Change 897999 merged by Dzahn:

[operations/puppet@production] releases: limit monitoring of releases-jenkins to active server

https://gerrit.wikimedia.org/r/897999

> @hashar A follow up question, is there a plan to configure releases2002 so that the switchover is possible? I'm asking both in terms of testing this within the DC switchover period and longer term operational expectations.

> I am inclined to decline this task about switching releases.wikimedia.org with the reason: Jenkins is not easily switchable yet.

I think that makes sense as long as we have a follow up task for the Jenkins config work. Is there an existing one we can link here or do we need a new one?

closing as declined based on comments by both Hashar and LSobanski above.

declined and not resolved to make it clear in the context of the DC switchover task that this did not happen this time.

> I think that makes sense as long as we have a follow up task for the Jenkins config work. Is there an existing one we can link here or do we need a new one?

The root task is T319406: Automate integration Jenkins deployment and config changes. The one for the releases Jenkins has been solved, and there is one ongoing for the CI Jenkins (T328920). After that we have to investigate how to ship the job configuration, for which we haven't filed any task yet because we haven't started looking into it ;)