replace backends for releases.wikimedia.org with buster VMs
Closed, Resolved · Public

Description

releases1001/releases2001 are the backends for https://releases.wikimedia.org and https://releases-jenkins.wikimedia.org

They are currently running stretch and should be replaced with releases1002/2002 running buster.

time frame: by the end of Q4 2020

Details

Project            Branch      Lines +/-
operations/puppet  production  +6 -3
operations/puppet  production  +1 -1
operations/puppet  production  +6 -6
operations/puppet  production  +9 -2
operations/puppet  production  +39 -33
operations/puppet  production  +9 -5
operations/puppet  production  +2 -1
operations/puppet  production  +12 -8
operations/puppet  production  +13 -0
operations/dns     master      +2 -2
operations/puppet  production  +2 -2
operations/puppet  production  +1 -3
operations/puppet  production  +29 -0
operations/puppet  production  +7 -0
operations/puppet  production  +11 -1
operations/puppet  production  +9 -7
operations/puppet  production  +27 -23
operations/puppet  production  +13 -6
operations/puppet  production  +24 -17
operations/puppet  production  +6 -2
operations/puppet  production  +1 -1
operations/dns     master      +4 -4
operations/dns     master      +10 -2

Event Timeline

Dzahn created this task. · Mar 14 2020, 12:05 AM
Restricted Application added a subscriber: Aklapper. · Mar 14 2020, 12:05 AM
Dzahn triaged this task as Medium priority. · Mar 14 2020, 8:26 PM
Dzahn claimed this task. · Jun 12 2020, 7:04 AM

Change 605176 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add IPs for releases1002/releases2002

https://gerrit.wikimedia.org/r/605176

Change 605176 merged by Dzahn:
[operations/dns@master] add IPs for releases1002/releases2002

https://gerrit.wikimedia.org/r/605176

The buster VMs releases1002/releases2002 have been created in the subtask.

Change 606022 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add releases role to releases1002/2002

https://gerrit.wikimedia.org/r/606022

Change 607305 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] move releases1002 from row A to row to rebalance VMs

https://gerrit.wikimedia.org/r/607305

Change 607305 merged by Dzahn:
[operations/dns@master] move releases1002 from row A to row to rebalance VMs

https://gerrit.wikimedia.org/r/607305

Change 606022 merged by Dzahn:
[operations/puppet@production] site: add releases role to releases1002/2002

https://gerrit.wikimedia.org/r/606022

Change 607641 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases::mediawiki:: support buster / PHP 7.3

https://gerrit.wikimedia.org/r/607641

Dzahn added a subscriber: hashar. · Edited · Jun 24 2020, 11:41 PM

@hashar For some reason on releases1002/2002 (new VMs on buster), after applying the releases role, one gets 2 jenkins processes like this:

jenkins  12995  0.0  0.0   5712    40 ?        S    22:33   0:00 /usr/bin/daemon --name=jenkins --inherit --env=JENKINS_HOME=/var/lib/jenkins --output=/var/log/jenkins/jenkins.log --pidfile=/var/run/jenkins/jenkins.pid -- /bin/java -Djava.awt.headless=true -jar /usr/share/jenkins/jenkins.war --webroot=/var/cache/jenkins/war --httpPort=8080
jenkins  12997  0.6  7.2 3658092 292708 ?      Sl   22:33   0:24 /bin/java -Djava.awt.headless=true -jar /usr/share/jenkins/jenkins.war --webroot=/var/cache/jenkins/war --httpPort=8080

This then makes Icinga alert because there is more than one: "PROCS CRITICAL: 2 processes with regex args '.*/bin/java .*-jar /usr/share/jenkins/jenkins.war'".

On releases1001 though it looks quite different:

jenkins  26288  2.1 15.9 3603772 645220 ?      Ssl  21:42   2:19 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Djava.awt.headless=true -Dhudson.plugins.git.GitSCM.verbose=true -Dhudson.model.ParametersAction.keepUndefinedParameters=true -Djava.util.logging.config.file=/etc/jenkins/logging.properties -Dhudson.udp=-1 -Dhudson.DNSMultiCast.disabled=true -Djenkins.model.Jenkins.buildsDir=${ITEM_ROOTDIR}/builds -Djenkins.model.Jenkins.workspacesDir=${ITEM_ROOTDIR}/workspace -Dhudson.model.DirectoryBrowserSupport.CSP=sandbox; default-src 'none'; img-src 'self'; style-src 'self' 'unsafe-inline'; media-src 'self' -jar /usr/share/jenkins/jenkins.war --accessLoggerClassName=winstone.accesslog.SimpleAccessLogger --simpleAccessLogger.format=combined --simpleAccessLogger.file=/var/log/jenkins/access.log --webroot=/var/cache/jenkins/war --pluginroot=/var/cache/jenkins/plugins --httpPort=8080 --prefix=/

Both are using the same puppet role so this is unexpected.

I ran systemctl start jenkins followed by systemctl status jenkins, and the unit had failed:

  Active: failed (Result: exit-code) since Wed 2020-06-24 23:40:31 UTC; 3s ago
 Process: 27634 ExecStart=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Djava.awt.headless=true -Dhudson.plugins.git.GitSCM.verbose=true -Dhudson.model.ParametersAction.keepUndef
Main PID: 27634 (code=exited, status=1/FAILURE)

Edit: systemctl stop jenkins -> rogue second jenkins process still running -> kill that process -> start properly again with systemctl start jenkins -> icinga recovers (same for both machines 1002 and 2002)

Mentioned in SAL (#wikimedia-operations) [2020-06-24T23:43:12Z] <mutante> releases1002 - kill rogue jenkins process, start jenkins with systemctl start jenkins (T247652)

Mentioned in SAL (#wikimedia-operations) [2020-06-24T23:44:46Z] <mutante> releases2002 - systemctl stop jenkins, kill 15244 (rogue jenkins process), start jenkins with systemctl start jenkins (T247652)

The server with /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java has been started by our systemd unit and is correct.

The other was started by /usr/bin/daemon, which in turn forks to start a /usr/bin/java. That originates from the init.d script shipped by the Jenkins Debian package.

We would want either:

  • the Debian package to be installed without starting the service, then have puppet install the systemd unit and start it.
  • the systemd unit to be installed without starting the service, followed by the Debian package.

I don't think either is easily doable, so in the end one has to manually kill the stray process and restart the unit. Once the systemd unit is installed, this is no longer an issue.
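
The manual cleanup described above can be sketched as a small shell helper. This is only an illustration under stated assumptions, not something taken from operations/puppet: the function name rogue_jenkins_pids is made up here, and it assumes the PID reported by `systemctl show -p MainPID --value jenkins` belongs to the only Jenkins process that should exist.

```shell
# Hypothetical helper (not part of operations/puppet).
# Reads "PID ARGS" lines on stdin, e.g. from `ps -eo pid=,args=`, and prints
# the PID of every process running jenkins.war whose PID differs from the
# given systemd main PID. Pass 0 when the unit is stopped.
rogue_jenkins_pids() {
    awk -v main="$1" '
        /-jar \/usr\/share\/jenkins\/jenkins\.war/ && $1 != main { print $1 }
    '
}

# Possible use on an affected host (assumption: run as root):
#   systemctl stop jenkins
#   ps -eo pid=,args= | rogue_jenkins_pids 0 | xargs -r kill
#   systemctl start jenkins
```

Both the /usr/bin/daemon wrapper and the java process it forked carry jenkins.war in their arguments, so both would be listed.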

jcrespo added a subscriber: jcrespo. · Edited · Jun 25 2020, 7:57 AM

This is the Bacula output when trying to back up releases2002:

25-Jun 04:05 backup1001.eqiad.wmnet JobId 239095: No prior Full backup Job record found.
25-Jun 04:05 backup1001.eqiad.wmnet JobId 239095: No prior or suitable Full backup found in catalog. Doing FULL backup.
25-Jun 04:49 backup1001.eqiad.wmnet JobId 239095: Start Backup JobId 239095, Job=releases2002.codfw.wmnet-Monthly-1st-Fri-production-srv-org-wikimedia.2020-06-25_04.05.01_50
25-Jun 04:49 backup1001.eqiad.wmnet JobId 239095: Using Device "FileStorageProduction" to write.
25-Jun 04:49 backup1001.eqiad.wmnet JobId 239095: Warning: bsockcore.c:203 Could not connect to Client: releases2002.codfw.wmnet-fd on releases2002.codfw.wmnet:9102. ERR=Connection refused
Retrying ...
25-Jun 04:52 backup1001.eqiad.wmnet JobId 239095: Fatal error: bsockcore.c:209 Unable to connect to Client: releases2002.codfw.wmnet-fd on releases2002.codfw.wmnet:9102. ERR=Connection refused
25-Jun 04:52 backup1001.eqiad.wmnet JobId 239095: Fatal error: No Job status returned from FD.
25-Jun 04:52 backup1001.eqiad.wmnet JobId 239095: Error: Bacula backup1001.eqiad.wmnet 9.4.2 (04Feb19):
  Build OS:               x86_64-pc-linux-gnu debian buster/sid
  JobId:                  239095
  Job:                    releases2002.codfw.wmnet-Monthly-1st-Fri-production-srv-org-wikimedia.2020-06-25_04.05.01_50
  Backup Level:           Full (upgraded from Incremental)
  Client:                 "releases2002.codfw.wmnet-fd" 
  FileSet:                "srv-org-wikimedia" 2013-08-27 22:09:41
  Pool:                   "production" (From Job resource)
  Catalog:                "production" (From Client resource)
  Storage:                "backup1001-FileStorageProduction" (From Pool resource)
  Scheduled time:         25-Jun-2020 04:05:01
  Start time:             25-Jun-2020 04:49:42
  End time:               25-Jun-2020 04:52:42
  Elapsed time:           3 mins 
  Priority:               10
  FD Files Written:       0
  SD Files Written:       0
  FD Bytes Written:       0 (0 B)
  SD Bytes Written:       0 (0 B)
  Rate:                   0.0 KB/s
  Software Compression:   None
  Comm Line Compression:  None
  Snapshot/VSS:           no
  Encryption:             no
  Accurate:               no
  Volume name(s):         
  Volume Session Id:      14474
  Volume Session Time:    1586342727
  Last Volume Bytes:      311,710,128,846 (311.7 GB)
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Waiting on FD
  Termination:            *** Backup Error ***

This happened consistently on retry. My first guess is that either the bacula-fd daemon is not running or a firewall is blocking the connection. Backups work normally on other buster hosts.

There are ferm rules:

iptables --list -v|grep bacula
   36  2160 ACCEPT     tcp  --  any    any     backup1001.eqiad.wmnet  anywhere             tcp dpt:bacula-fd
    0     0 ACCEPT     tcp  --  any    any     helium.eqiad.wmnet   anywhere             tcp dpt:bacula-fd

bacula-fd is running:

systemctl status bacula-fd
● bacula-fd.service - Bacula File Daemon service
   Loaded: loaded (/lib/systemd/system/bacula-fd.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2020-06-24 22:29:36 UTC; 10h ago

But it is only listening on 127.0.0.1:

Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.1:9102          0.0.0.0:*               LISTEN      1202/bacula-fd

Puppet installed the package which I guess started the daemon:

Jun 24 22:29:28 releases2002 puppet-agent[583]: Applying configuration version '(40446aa8ee) Dzahn - site: add releases role to releases1002/2002'
...
Jun 24 22:29:37 releases2002 puppet-agent[583]: (/Stage[main]/Bacula::Client/Package[bacula-fd]/ensure) created

The configuration file then got created to replace the default config file shipped by the package:

Jun 24 22:29:37 releases2002 puppet-agent[583]: (/Stage[main]/Bacula::Client/File[/etc/bacula/bacula-fd.conf]/content) +++ /tmp/puppet-file20200624-583-6b4lhi#0112020-06-24 22:29:37.090606910 +0000
...
 FileDaemon {
  ...
-     FDAddress = 127.0.0.1
+    # FDAddresses = # For director connections
  ...
}

Puppet triggered a refresh:

Jun 24 22:29:37 releases2002 puppet-agent[583]: (/Stage[main]/Bacula::Client/File[/etc/bacula/bacula-fd.conf]) Scheduling refresh of Service[bacula-fd]

But that apparently did not cause the new file to be taken into account, so I guess the fix is just to restart the bacula-fd systemd unit. Bonus points for finding out in puppet why the daemon does not pick up a new configuration when it is notified.

Mentioned in SAL (#wikimedia-operations) [2020-06-25T08:42:34Z] <hashar> releases2002: restarted bacula-fd to take in account the puppet provided configuration # T247652

@jcrespo should be good now:

# netstat -tlnp|grep bacula
tcp        0      0 0.0.0.0:9102            0.0.0.0:*               LISTEN      26877/bacula-fd
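
The same check (is bacula-fd bound to more than loopback?) could be scripted roughly as follows. A sketch only: listens_beyond_loopback is a made-up name, and it assumes `netstat -tln` style output in which the local address is the fourth column.

```shell
# Hypothetical helper. Reads `netstat -tln` output on stdin and succeeds only
# if a LISTEN socket on the given port is bound to a non-loopback address.
listens_beyond_loopback() {
    awk -v port="$1" '
        /LISTEN/ && $4 ~ (":" port "$") && $4 !~ /^(127\.|::1:)/ { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Possible use (assumption: run as root on the client host):
#   netstat -tln | listens_beyond_loopback 9102 || systemctl restart bacula-fd
```

With the 127.0.0.1:9102 line shown earlier the function fails, and with the 0.0.0.0:9102 line it succeeds.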

Thanks, it ran successfully now:

239105  Full          21    22.89 K  OK       25-Jun-20 08:44 releases2002.codfw.wmnet-Monthly-1st-Fri-production-srv-org-wikimedia

Although it only backed up 21 files and 22.89K (the other servers had much more data).

Dzahn added a comment. · Jun 25 2020, 2:41 PM

> Thanks, it ran successfully now:
>
> 239105  Full          21    22.89 K  OK       25-Jun-20 08:44 releases2002.codfw.wmnet-Monthly-1st-Fri-production-srv-org-wikimedia
>
> Although it only backed up 21 files and 22.89K (the other servers had much more data).

Thank you both. This was not expected to happen; I have never had to manually restart bacula-fd before. It must be some race condition.

The low file count is normal so far, since nothing has been deployed to these hosts yet. They are brand new.

Change 607838 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases::mediawiki: only install PHP packages if pre-buster

https://gerrit.wikimedia.org/r/607838

Change 607838 merged by Dzahn:
[operations/puppet@production] releases::mediawiki: only install PHP packages if pre-buster

https://gerrit.wikimedia.org/r/607838

Change 607641 abandoned by Hashar:
releases::mediawiki:: support buster / PHP 7.3

Reason:
We are instead removing the PHP packages from the release hosts: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/607858/

https://gerrit.wikimedia.org/r/607641

Change 610193 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases::mediawiki: support rsyncing files to multiple secondaries

https://gerrit.wikimedia.org/r/610193

Change 610195 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases: also sync blubber,parsoid,reprepro files to multiple servers

https://gerrit.wikimedia.org/r/610195

Change 610193 merged by Dzahn:
[operations/puppet@production] releases::mediawiki: support rsyncing files to multiple secondaries

https://gerrit.wikimedia.org/r/610193

Mentioned in SAL (#wikimedia-operations) [2020-07-08T21:56:54Z] <mutante> deleting files from releases2001 that are not existing on releases1001 to make them mirrors. rsync with --delete and the command from quickdatacopy class (T247652)
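
Before letting rsync --delete remove anything, it can help to preview which files exist only on the secondary. A minimal sketch, assuming two plain-text file listings with one path per line (for example produced by `find . -type f` in the same directory on each host); the function name and the listing file names are hypothetical:

```shell
# Hypothetical helper. $1 = listing from the primary, $2 = listing from the
# secondary. Prints the paths that appear only in the secondary listing,
# i.e. the files a mirror run with --delete would remove.
extra_on_secondary() {
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file.
    # grep exits 1 when it prints nothing; treat that as success.
    grep -Fxv -f "$1" "$2" || [ "$?" -eq 1 ]
}

# Possible use:
#   extra_on_secondary primary.list secondary.list
```

rsync's own --dry-run flag, combined with --delete, gives a similar preview without building the listings by hand.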

Change 610389 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] rsync::quickdatacopy: add optional parameter to let rsync --delete files

https://gerrit.wikimedia.org/r/610389

Change 610405 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases: switch reprepro file sync to support multiple destinations

https://gerrit.wikimedia.org/r/610405

Change 610195 abandoned by Dzahn:
[operations/puppet@production] releases: also sync blubber,parsoid,reprepro files to multiple servers

Reason:
for blubber and parsoid this would be duplicate, they are already included in the mediawiki sync. upload other patches that move code around to make that more clear

https://gerrit.wikimedia.org/r/610195

Change 610405 merged by Dzahn:
[operations/puppet@production] releases: switch reprepro file sync to support multiple destinations

https://gerrit.wikimedia.org/r/610405

Change 610389 merged by Dzahn:
[operations/puppet@production] rsync::quickdatacopy: add optional parameter to let rsync --delete files

https://gerrit.wikimedia.org/r/610389

Change 618412 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] ATS: switch releases.wm to new buster backend servers

https://gerrit.wikimedia.org/r/618412

Change 618411 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases: use --delete when rsyncing files between servers

https://gerrit.wikimedia.org/r/618411

Change 618415 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] httpbb: add test file for releases.wm.org

https://gerrit.wikimedia.org/r/618415

Change 618559 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases: open firewall hole for http from deployment_server

https://gerrit.wikimedia.org/r/618559

Change 618559 merged by Dzahn:
[operations/puppet@production] releases: open firewall hole for http from deployment_server

https://gerrit.wikimedia.org/r/618559

Change 618621 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] hiera: switch releases server to releases1001, remove 1001/2001

https://gerrit.wikimedia.org/r/618621

Change 618415 merged by Dzahn:
[operations/puppet@production] httpbb: add directory and test file for releases.wm.org

https://gerrit.wikimedia.org/r/618415

Change 618621 merged by Dzahn:
[operations/puppet@production] hiera: switch releases server to releases1001, remove 1001/2001

https://gerrit.wikimedia.org/r/618621

Change 619393 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases: allow http connections also from cumin masters

https://gerrit.wikimedia.org/r/619393

Change 619393 merged by Dzahn:
[operations/puppet@production] releases: allow http connections also from cumin masters

https://gerrit.wikimedia.org/r/619393

Change 618412 merged by Dzahn:
[operations/dns@master] ATS: switch releases.wikimedia.org to buster backends

https://gerrit.wikimedia.org/r/618412

Mentioned in SAL (#wikimedia-operations) [2020-08-10T23:53:14Z] <mutante> https://releases.wikimedia.org switched to new backends running Debian buster. files have been synced. httpbb tests have been created and pass. (T247652)

Mentioned in SAL (#wikimedia-operations) [2020-08-11T00:08:22Z] <mutante> releases-jenkins.wikimedia.org currently under maintenance (T247652)

Mentioned in SAL (#wikimedia-operations) [2020-08-11T00:24:10Z] <mutante> reverting switch of releases.wikimedia.org for today since releases-jenkins.wikimedia.org is tied to it and new jenkins still needs some config and plugins (T247652)

Change 619822 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases: allow rsyncing jenkins data between releases servers

https://gerrit.wikimedia.org/r/619822

Change 619826 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] ATS: temp. set backend for releases-jenkins to releases1001

https://gerrit.wikimedia.org/r/619826

Change 619826 merged by Dzahn:
[operations/puppet@production] ATS: temp. set backend for releases-jenkins to releases1001

https://gerrit.wikimedia.org/r/619826

Change 619822 merged by Dzahn:
[operations/puppet@production] releases: allow rsyncing jenkins data between releases servers

https://gerrit.wikimedia.org/r/619822

Change 620099 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases: add quickdatacopy rsync on the primary as well

https://gerrit.wikimedia.org/r/620099

Change 620109 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases: set releases1001 as primary to sync jenkins config

https://gerrit.wikimedia.org/r/620109

Change 620109 merged by Dzahn:
[operations/puppet@production] releases: set releases1001 as primary to sync jenkins config

https://gerrit.wikimedia.org/r/620109

Change 620099 merged by Dzahn:
[operations/puppet@production] releases: rsync needs to be on all servers incl the primary

https://gerrit.wikimedia.org/r/620099

Mentioned in SAL (#wikimedia-operations) [2020-08-13T21:11:24Z] <mutante> rsyncing /var/lib/jenkins from releases1001 to releases1002 and then all other releases* servers. 57GB, overwriting existing data from manual config (T247652)

Change 620135 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases: reprepro rsync needs to be on all servers

https://gerrit.wikimedia.org/r/620135

Change 620135 merged by Dzahn:
[operations/puppet@production] releases: reprepro rsync needs to be on all servers

https://gerrit.wikimedia.org/r/620135

Change 620137 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases: avoid adding rsync when source and dest are the same

https://gerrit.wikimedia.org/r/620137

Change 620137 merged by Dzahn:
[operations/puppet@production] releases: avoid adding rsync when source and dest are the same

https://gerrit.wikimedia.org/r/620137

Change 620141 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases: ensure to have motd warnings on all secondary servers

https://gerrit.wikimedia.org/r/620141

Change 620141 merged by Dzahn:
[operations/puppet@production] releases: ensure to have motd warnings on all secondary servers

https://gerrit.wikimedia.org/r/620141

Change 620143 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases: update contents of the warning MOTD template

https://gerrit.wikimedia.org/r/620143

Change 620143 merged by Dzahn:
[operations/puppet@production] releases: update contents of the warning MOTD template

https://gerrit.wikimedia.org/r/620143

jijiki moved this task from Incoming 🐫 to Unsorted on the serviceops board. · Aug 17 2020, 11:45 PM

Change 621059 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases: stop jenkins on releases1001, start it on releases1001

https://gerrit.wikimedia.org/r/621059

Change 621059 merged by Dzahn:
[operations/puppet@production] releases: stop jenkins on releases1001, start it on releases1002

https://gerrit.wikimedia.org/r/621059

Both https://releases.wikimedia.org and https://releases-jenkins.wikimedia.org have been switched to the new backends running on buster.

Dzahn closed this task as Resolved. · Aug 18 2020, 10:24 PM
Dzahn changed the status of subtask T260742: decom releases1001 and releases2001 from Open to Stalled. · Aug 26 2020, 6:54 PM