I'm not sure if this is a scap or network thing:
scap deploy-log -f scap/log/scap-sync-2017-11-29-0003-1-g532bd0b.log
Subject | Repo | Branch | Lines +/-
---|---|---|---
scap: Remove git_rev setting | mediawiki/services/ores/deploy | master | +0 -1
Restricted Differential Revision | | |
Restricted Differential Revision | | |
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | None | | T176324 Scoring platform team FY18 Q2
Declined | None | | T179501 Use external dsh group to list pooled ORES nodes
Resolved | | akosiaris | T168073 Switch ORES to dedicated cluster
Resolved | | awight | T181661 Connection timeout from tin to new ores servers
Resolved | | akosiaris | T182347 SCAP: Upload debian package version 3.7.4-3
Resolved | | akosiaris | T183046 scap 3.7.4-2 is broken
Invalid | None | | T182801 New, mysterious scap failure
Interesting...
How do I get more verbosity, i.e. for ssh?
Hm, judging by https://phabricator.wikimedia.org/source/scap/browse/master/scap/ssh.py;0d2afa233f3ee489ddbfb9e7dbfaf467452e50db$41, it would probably require some changes in scap's code for scap's -v to propagate to ssh as well.
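For illustration, the change could look roughly like this. `build_ssh_argv` and its argument layout are hypothetical, not scap's actual API; the point is just that the ssh argv is built in one place, so a verbosity flag can be appended there.

```python
# Hypothetical sketch: propagate scap's own -v into the ssh argv.
# BASE_SSH and build_ssh_argv are illustrative names, not scap code.
BASE_SSH = ['/usr/bin/ssh', '-oBatchMode=yes']

def build_ssh_argv(host, remote_cmd, verbose=False):
    """Return the ssh argv for a remote deploy command.

    When scap itself runs with -v, append ssh's -v as well so the
    debug1: lines show up in scap's verbose output.
    """
    argv = list(BASE_SSH)
    if verbose:
        argv.append('-v')
    argv += [host, '--', remote_cmd]
    return argv
```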
That sounds like a thoroughly sensible change. I'll create a patch now.
I'd like to push the latest scap code to production this week if I can get an opsen to upload the package. I'll create a new release tag today. That will at least help to debug this issue since -v will be passed on to ssh as suggested by @akosiaris above.
This is affecting me in production, now:
Timeout, server scb2004.codfw.wmnet not responding.
16:01:39 connection to scb2004.codfw.wmnet failed and future stages will not be attempted for this target
@awight: can you try it now with the -v flag? scap should include the ssh logs in verbose output now.
Mentioned in SAL (#wikimedia-operations) [2017-12-11T19:17:56Z] <awight@tin> Started deploy [ores/deploy@1c0ede0]: (non-production) Testing parallel ORES deployment, T181661
Mentioned in SAL (#wikimedia-operations) [2017-12-11T19:19:07Z] <awight@tin> Finished deploy [ores/deploy@1c0ede0]: (non-production) Testing parallel ORES deployment, T181661 (duration: 01m 12s)
Done. Logs are in scap-sync-2017-12-09-0004.log (sic; note that the dates are still misleading)
Thanks for this feature, it looks helpful!
Something really strange is going on...
debug1: Offering RSA public key: /etc/keyholder.d/deploy_service.pub
debug1: Server accepts key: pkalg ssh-rsa blen 279
debug1: Authentication succeeded (publickey).
Authenticated to ores1004.eqiad.wmnet ([10.64.16.95]:22).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: Sending command: /usr/bin/scap deploy-local -v --repo ores/deploy -g cluster fetch --refresh-config http://tin.eqiad.wmnet/ores/deploy/.git
Timeout, server ores1004.eqiad.wmnet not responding.
19:18:57 [tin] connection to ores1004.eqiad.wmnet failed and future stages will not be attempted for this target
So it would appear that this deploy-local command freezes: /usr/bin/scap deploy-local -v --repo ores/deploy -g cluster fetch --refresh-config http://tin.eqiad.wmnet/ores/deploy/.git
@thcipriani, @akosiaris does that seem right to you or am I missing something?
I'd like to see the corresponding syslog on the target machine; there might be clues in /var/log/auth.log or /var/log/daemon.log.
Pardon me, but I have to ask: why is a file whose log entries are dated Dec 11th, and whose local filesystem timestamp (on tin) is also December 11th, named 2017-12-09, giving the impression it was created on Dec 9th? I guess a bug?
In any case, daemon.log is not really expected to hold anything of significant value for Dec 11th 19:00-19:30, but here goes:
Dec 11 19:03:15 ores1004 puppet-agent[7736]: Retrieving pluginfacts
Dec 11 19:03:15 ores1004 puppet-agent[7736]: Retrieving plugin
Dec 11 19:03:15 ores1004 puppet-agent[7736]: Loading facts
Dec 11 19:03:27 ores1004 puppet-agent[7736]: Caching catalog for ores1004.eqiad.wmnet
Dec 11 19:03:28 ores1004 puppet-agent[7736]: Applying configuration version '1513018998'
Dec 11 19:03:30 ores1004 systemd[1]: Reloading.
Dec 11 19:03:30 ores1004 systemd[1]: Started ACPI event daemon.
Dec 11 19:03:30 ores1004 systemd[1]: Listening on ACPID Listen Socket.
Dec 11 19:03:30 ores1004 systemd[1]: Mounted /.
Dec 11 19:03:36 ores1004 puppet-agent[7736]: Finished catalog run in 8.37 seconds
Dec 11 19:22:44 ores1004 dbus[921]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service'
Dec 11 19:22:44 ores1004 systemd[1]: Starting Time & Date Service...
Dec 11 19:22:44 ores1004 systemd[1]: Failed to reset devices.list on /system.slice: Invalid argument
Dec 11 19:22:44 ores1004 dbus[921]: [system] Successfully activated service 'org.freedesktop.timedate1'
Dec 11 19:22:44 ores1004 systemd-timedated[13695]: /etc/localtime should be a symbolic link to a time zone data file in /usr/share/zoneinfo/.
Dec 11 19:22:44 ores1004 systemd[1]: Started Time & Date Service.
Dec 11 19:33:20 ores1004 puppet-agent[16060]: Retrieving pluginfacts
Dec 11 19:33:20 ores1004 puppet-agent[16060]: Retrieving plugin
Dec 11 19:33:20 ores1004 puppet-agent[16060]: Loading facts
Dec 11 19:33:28 ores1004 puppet-agent[16060]: Caching catalog for ores1004.eqiad.wmnet
Dec 11 19:33:30 ores1004 puppet-agent[16060]: Applying configuration version '1513020803'
Dec 11 19:33:31 ores1004 systemd[1]: Reloading.
Dec 11 19:33:31 ores1004 systemd[1]: Started ACPI event daemon.
Dec 11 19:33:31 ores1004 systemd[1]: Listening on ACPID Listen Socket.
Dec 11 19:33:31 ores1004 systemd[1]: Mounted /.
Dec 11 19:33:37 ores1004 puppet-agent[16060]: Finished catalog run in 7.96 seconds
Dec 11 19:52:44 ores1004 dbus[921]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service'
Dec 11 19:52:44 ores1004 systemd[1]: Starting Time & Date Service...
Dec 11 19:52:44 ores1004 systemd[1]: Failed to reset devices.list on /system.slice: Invalid argument
Dec 11 19:52:44 ores1004 dbus[921]: [system] Successfully activated service 'org.freedesktop.timedate1'
Dec 11 19:52:44 ores1004 systemd-timedated[20231]: /etc/localtime should be a symbolic link to a time zone data file in /usr/share/zoneinfo/.
Dec 11 19:52:44 ores1004 systemd[1]: Started Time & Date Service.
Dec 11 20:04:05 ores1004 puppet-agent[22822]: Retrieving pluginfacts
auth.log is just a tad more useful, though mostly a repeat of what we've seen before:
Dec 11 19:17:50 ores1004 sshd[11976]: Set /proc/self/oom_score_adj to 0
Dec 11 19:17:50 ores1004 sshd[11976]: Connection from 208.80.155.119 port 46296 on 10.64.16.95 port 22
Dec 11 19:17:50 ores1004 sshd[11976]: Connection closed by 208.80.155.119 [preauth]
Dec 11 19:17:55 ores1004 sshd[11978]: Set /proc/self/oom_score_adj to 0
Dec 11 19:17:55 ores1004 sshd[11978]: Connection from 208.80.153.74 port 45262 on 10.64.16.95 port 22
Dec 11 19:17:55 ores1004 sshd[11978]: Connection closed by 208.80.153.74 [preauth]
Dec 11 19:17:56 ores1004 sshd[11980]: Set /proc/self/oom_score_adj to 0
Dec 11 19:17:56 ores1004 sshd[11980]: Connection from 10.64.0.196 port 57196 on 10.64.16.95 port 22
Dec 11 19:17:56 ores1004 sshd[11980]: Postponed publickey for deploy-service from 10.64.0.196 port 57196 ssh2 [preauth]
Dec 11 19:17:56 ores1004 sshd[11980]: Accepted publickey for deploy-service from 10.64.0.196 port 57196 ssh2: RSA 6d:54:92:8b:39:10:f5:9b:84:40:36:ef:3c:9a:6d:d8
Dec 11 19:17:56 ores1004 sshd[11980]: pam_unix(sshd:session): session opened for user deploy-service by (uid=0)
Dec 11 19:17:56 ores1004 sshd[11980]: User child is on pid 11995
Dec 11 19:17:56 ores1004 sshd[11995]: Starting session: command for deploy-service from 10.64.0.196 port 57196
Dec 11 19:18:01 ores1004 sudo: nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/lib/nagios/plugins/check_ferm
Dec 11 19:18:01 ores1004 sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
Dec 11 19:18:01 ores1004 sudo: pam_unix(sudo:session): session closed for user root
Dec 11 19:18:01 ores1004 CRON[12265]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Dec 11 19:18:01 ores1004 CRON[12265]: pam_unix(cron:session): session closed for user prometheus
Dec 11 19:18:04 ores1004 sudo: nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/local/lib/nagios/plugins/check_raid md
Dec 11 19:18:04 ores1004 sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
Dec 11 19:18:04 ores1004 sudo: pam_unix(sudo:session): session closed for user root
Dec 11 19:18:06 ores1004 sudo: nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/lib/nagios/plugins/check_ferm
Dec 11 19:18:06 ores1004 sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
Dec 11 19:18:06 ores1004 sudo: pam_unix(sudo:session): session closed for user root
Dec 11 19:18:17 ores1004 sudo: nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/local/lib/nagios/plugins/check_raid md
Dec 11 19:18:17 ores1004 sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
Dec 11 19:18:17 ores1004 sudo: pam_unix(sudo:session): session closed for user root
Dec 11 19:18:50 ores1004 sshd[12515]: Set /proc/self/oom_score_adj to 0
Dec 11 19:18:50 ores1004 sshd[12515]: Connection from 208.80.155.119 port 58488 on 10.64.16.95 port 22
Dec 11 19:18:50 ores1004 sshd[12515]: Connection closed by 208.80.155.119 [preauth]
Dec 11 19:18:55 ores1004 sshd[12517]: Set /proc/self/oom_score_adj to 0
Dec 11 19:18:55 ores1004 sshd[12517]: Connection from 208.80.153.74 port 57728 on 10.64.16.95 port 22
Dec 11 19:18:55 ores1004 sshd[12517]: Connection closed by 208.80.153.74 [preauth]
Dec 11 19:19:01 ores1004 sudo: nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/lib/nagios/plugins/check_ferm
Dec 11 19:19:01 ores1004 sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
Dec 11 19:19:01 ores1004 sudo: pam_unix(sudo:session): session closed for user root
Dec 11 19:19:01 ores1004 CRON[12550]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Dec 11 19:19:02 ores1004 CRON[12550]: pam_unix(cron:session): session closed for user prometheus
Dec 11 19:19:04 ores1004 sudo: nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/local/lib/nagios/plugins/check_raid md
Dec 11 19:19:04 ores1004 sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
Dec 11 19:19:04 ores1004 sudo: pam_unix(sudo:session): session closed for user root
Dec 11 19:19:06 ores1004 sudo: nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/lib/nagios/plugins/check_ferm
Dec 11 19:19:06 ores1004 sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
Dec 11 19:19:06 ores1004 sudo: pam_unix(sudo:session): session closed for user root
Dec 11 19:19:13 ores1004 sshd[11995]: Connection closed by 10.64.0.196
Dec 11 19:19:13 ores1004 sshd[11995]: Transferred: sent 76296, received 3484 bytes
Dec 11 19:19:13 ores1004 sshd[11995]: Closing connection to 10.64.0.196 port 57196
Dec 11 19:19:13 ores1004 sshd[11980]: pam_unix(sshd:session): session closed for user deploy-service
Since syslog was mentioned, here it is as well
Dec 11 19:00:01 ores1004 CRON[7031]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:01:01 ores1004 CRON[7244]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:02:01 ores1004 CRON[7464]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:03:01 ores1004 CRON[7690]: (root) CMD (/usr/local/sbin/puppet-run > /dev/null 2>&1)
Dec 11 19:03:01 ores1004 CRON[7691]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:03:15 ores1004 puppet-agent[7736]: Retrieving pluginfacts
Dec 11 19:03:15 ores1004 puppet-agent[7736]: Retrieving plugin
Dec 11 19:03:15 ores1004 puppet-agent[7736]: Loading facts
Dec 11 19:03:27 ores1004 puppet-agent[7736]: Caching catalog for ores1004.eqiad.wmnet
Dec 11 19:03:28 ores1004 puppet-agent[7736]: Applying configuration version '1513018998'
Dec 11 19:03:30 ores1004 systemd[1]: Reloading.
Dec 11 19:03:30 ores1004 systemd[1]: Started ACPI event daemon.
Dec 11 19:03:30 ores1004 systemd[1]: Listening on ACPID Listen Socket.
Dec 11 19:03:30 ores1004 systemd[1]: Mounted /.
Dec 11 19:03:32 ores1004 crontab[8226]: (root) LIST (root)
Dec 11 19:03:32 ores1004 crontab[8228]: (root) LIST (prometheus)
Dec 11 19:03:36 ores1004 puppet-agent[7736]: Finished catalog run in 8.37 seconds
Dec 11 19:04:01 ores1004 CRON[8868]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:05:01 ores1004 CRON[9089]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 11 19:05:01 ores1004 CRON[9090]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:06:01 ores1004 CRON[9302]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:07:01 ores1004 CRON[9530]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:08:01 ores1004 CRON[9762]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:09:01 ores1004 CRON[9975]: (root) CMD ( [ -x /usr/lib/php5/sessionclean ] && /usr/lib/php5/sessionclean)
Dec 11 19:09:01 ores1004 CRON[9976]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:10:01 ores1004 CRON[10222]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:11:01 ores1004 CRON[10433]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:12:02 ores1004 CRON[10660]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:13:01 ores1004 CRON[10888]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:14:01 ores1004 CRON[11109]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:15:01 ores1004 CRON[11329]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 11 19:15:01 ores1004 CRON[11330]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:16:01 ores1004 CRON[11553]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:17:01 ores1004 CRON[11782]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec 11 19:17:01 ores1004 CRON[11783]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:18:01 ores1004 CRON[12266]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:19:01 ores1004 CRON[12551]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:20:01 ores1004 CRON[12765]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:21:01 ores1004 CRON[13215]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:22:01 ores1004 CRON[13522]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:22:44 ores1004 dbus[921]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service'
Dec 11 19:22:44 ores1004 systemd[1]: Starting Time & Date Service...
Dec 11 19:22:44 ores1004 systemd[1]: Failed to reset devices.list on /system.slice: Invalid argument
Dec 11 19:22:44 ores1004 dbus[921]: [system] Successfully activated service 'org.freedesktop.timedate1'
Dec 11 19:22:44 ores1004 systemd-timedated[13695]: /etc/localtime should be a symbolic link to a time zone data file in /usr/share/zoneinfo/.
Dec 11 19:22:44 ores1004 systemd[1]: Started Time & Date Service.
Dec 11 19:23:01 ores1004 CRON[13781]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:24:01 ores1004 CRON[14005]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:25:01 ores1004 CRON[14210]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 11 19:25:01 ores1004 CRON[14211]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:26:01 ores1004 CRON[14431]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:27:01 ores1004 CRON[14657]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:28:01 ores1004 CRON[14900]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:29:01 ores1004 CRON[15121]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:30:01 ores1004 CRON[15341]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:31:01 ores1004 CRON[15560]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:32:01 ores1004 CRON[15786]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:32:17 ores1004 systemd-timesyncd[791]: interval/delta/delay/jitter/drift 2048s/+0.000s/0.000s/0.000s/+8ppm
Dec 11 19:33:01 ores1004 CRON[15999]: (root) CMD (/usr/local/sbin/puppet-run > /dev/null 2>&1)
Dec 11 19:33:01 ores1004 CRON[16004]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec 11 19:33:20 ores1004 puppet-agent[16060]: Retrieving pluginfacts
Dec 11 19:33:20 ores1004 puppet-agent[16060]: Retrieving plugin
Dec 11 19:33:20 ores1004 puppet-agent[16060]: Loading facts
Dec 11 19:33:28 ores1004 puppet-agent[16060]: Caching catalog for ores1004.eqiad.wmnet
Dec 11 19:33:30 ores1004 puppet-agent[16060]: Applying configuration version '1513020803'
Dec 11 19:33:31 ores1004 systemd[1]: Reloading.
Dec 11 19:33:31 ores1004 systemd[1]: Started ACPI event daemon.
Dec 11 19:33:31 ores1004 systemd[1]: Listening on ACPID Listen Socket.
Dec 11 19:33:31 ores1004 systemd[1]: Mounted /.
Dec 11 19:33:33 ores1004 crontab[16520]: (root) LIST (root)
Dec 11 19:33:33 ores1004 crontab[16522]: (root) LIST (prometheus)
Dec 11 19:33:37 ores1004 puppet-agent[16060]: Finished catalog run in 7.96 seconds
Dec 11 19:34:01 ores1004 CRON[17085]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Now for the more interesting stuff. I've tried to run /usr/bin/scap deploy-local -v --repo ores/deploy -g cluster fetch --refresh-config manually as deploy-service on ores1004.eqiad.wmnet, and it croaks with a stack trace:
{"name": "deploy-local", "created": 1513091889.277656, "args": [], "msecs": 277.65607833862305, "filename": "cli.py", "levelno": 30, "msg": "Unhandled error:", "lineno": 230, "exc_text": "Traceback (most recent call last):\n File \"/usr/lib/python2.7/dist-packages/scap/cli.py\", line 329, in run\n exit_status = app.main(app.extra_arguments)\n File \"/usr/lib/python2.7/dist-packages/scap/deploy.py\", line 146, in main\n getattr(self, stage)()\n File \"/usr/lib/python2.7/dist-packages/scap/deploy.py\", line 316, in fetch\n git.checkout(rev_dir, self.rev)\n File \"/usr/lib/python2.7/dist-packages/scap/git.py\", line 349, in checkout\n git.checkout('--force', '--quiet', rev)\n File \"/usr/lib/python2.7/dist-packages/scap/sh.py\", line 1428, in __call__\n return RunningCommand(cmd, call_args, stdin, stdout, stderr)\n File \"/usr/lib/python2.7/dist-packages/scap/sh.py\", line 775, in __init__\n self.wait()\n File \"/usr/lib/python2.7/dist-packages/scap/sh.py\", line 793, in wait\n self.handle_command_exit_code(exit_code)\n File \"/usr/lib/python2.7/dist-packages/scap/sh.py\", line 816, in handle_command_exit_code\n raise exc\nErrorReturnCode_128: \n\n RAN: /usr/bin/git checkout --force --quiet b4f2b0249a60754cfa4e18650fdd45c62ef4a09a\n\n STDOUT:\n\n\n STDERR:\nfatal: reference is not a tree: b4f2b0249a60754cfa4e18650fdd45c62ef4a09a\n", "funcName": "_handle_exception", "relativeCreated": 1288.093090057373} {"name": "deploy-local", "created": 1513091889.280495, "args": ["deploy-local", "ErrorReturnCode_128", {"full_cmd": "/usr/bin/git checkout --force --quiet b4f2b0249a60754cfa4e18650fdd45c62ef4a09a", "stderr": "fatal: reference is not a tree: b4f2b0249a60754cfa4e18650fdd45c62ef4a09a\n", "stdout": ""}], "msecs": 280.49492835998535, "filename": "cli.py", "levelno": 40, "msg": "%s failed: <%s> %s", "lineno": 232, "exc_text": null, "funcName": "_handle_exception", "relativeCreated": 1290.9319400787354}
Simply put, b4f2b0249a60754cfa4e18650fdd45c62ef4a09a does not exist in whichever repo scap tries to find it in. For some more information:
deploy-service@ores1004:/srv/deployment/ores/deploy-cache$ ls -la . revs/
.:
total 20
drwxr-xr-x  4 deploy-service deploy-service 4096 Dec  1 14:24 .
drwxr-xr-x  4 deploy-service root           4096 Dec  1 14:23 ..
-rw-r--r--  1 deploy-service deploy-service 1851 Dec 12 15:18 .config
lrwxrwxrwx  1 deploy-service deploy-service   45 Dec  1 14:24 .done -> revs/532bd0bc6e923b8594aba71295e8c6db9bbc5f8e
drwxr-xr-x  6 deploy-service deploy-service 4096 Dec 12 15:18 cache
lrwxrwxrwx  1 deploy-service deploy-service   45 Dec  1 14:22 current -> revs/532bd0bc6e923b8594aba71295e8c6db9bbc5f8e
drwxr-xr-x 10 deploy-service deploy-service 4096 Dec 11 19:21 revs

revs/:
total 40
drwxr-xr-x 10 deploy-service deploy-service 4096 Dec 11 19:21 .
drwxr-xr-x  4 deploy-service deploy-service 4096 Dec  1 14:24 ..
drwxr-xr-x  6 deploy-service deploy-service 4096 Nov  6 14:40 29905e509984050a2ea675725f3c110033fc67a0
drwxr-xr-x  6 deploy-service deploy-service 4096 Nov 29 18:04 532bd0bc6e923b8594aba71295e8c6db9bbc5f8e
drwxr-xr-x  6 deploy-service deploy-service 4096 Dec  5 15:53 6baed71ef9a02264660f8637fb2313c9c560f71d
drwxr-xr-x  6 deploy-service deploy-service 4096 Nov  6 17:33 82a13ae1173ec570ea563f389ab22e1a69aa154a
drwxr-xr-x  6 deploy-service deploy-service 4096 Oct 26 16:43 971be2294a51772d309c967465b32f4f9a9b66f5
drwxr-xr-x  6 deploy-service deploy-service 4096 Dec  1 20:18 9afbf148ec3007a6787ab099978bf5a08360879d
drwxr-xr-x  6 deploy-service deploy-service 4096 Dec 11 19:21 b4f2b0249a60754cfa4e18650fdd45c62ef4a09a
drwxr-xr-x  6 deploy-service deploy-service 4096 Nov 29 13:17 e58bfbf2827be1ac453bc3d07dfe7fe41a08dc66
So at first look the object does exist in the deploy-cache. However...
deploy-service@ores1004:/srv/deployment/ores/deploy-cache/revs/b4f2b0249a60754cfa4e18650fdd45c62ef4a09a$ git show
commit 82dfd5652c1550dbee362100204d7aa5242e59ab
Author: halfak <aaron.halfaker@gmail.com>
Date:   Mon Jun 26 17:15:23 2017 -0500

    Bumps submodules for editquality/ores/wheels (1.3.15 update)

    Change-Id: I72eabc15cf84ca1abc38b1f19883eb229d2bf778
OK, what's going on here?
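Worth noting: a revs/<sha> directory existing only means scap created a checkout directory named after that revision; whether the sha exists as a git object in that checkout is a separate question. A generic git probe (illustrative helper, not scap code) for distinguishing the two:

```python
import subprocess

def commit_exists(repo, sha):
    """Return True iff `sha` is an object in `repo`'s object database.

    git cat-file -e exits 0 when the object exists (including objects
    reachable via the repo's alternates), non-zero otherwise.
    """
    return subprocess.call(
        ['git', '-C', repo, 'cat-file', '-e', sha],
        stderr=subprocess.DEVNULL) == 0
```

Running that inside revs/b4f2b02... would show whether "reference is not a tree" is about a missing object rather than a missing directory.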
This is a known bug, T181171: scap logs sometimes have incorrect date. The date is derived from the git commit being deployed, rather than the current system date.
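If that is right, the filename comes from something like the deployed commit's date rather than the wall clock. A generic git sketch of that derivation (illustrative, not scap's actual code):

```python
import subprocess

def log_date_from_commit(repo):
    """Date string derived from the last commit's committer date.

    A log named from this value can lag the actual deploy time,
    e.g. scap-sync-2017-12-09-... for a deploy run on Dec 11.
    """
    out = subprocess.check_output(
        ['git', '-C', repo, 'log', '-1',
         '--format=%cd', '--date=format:%Y-%m-%d'])
    return out.decode().strip()
```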
Mentioned in SAL (#wikimedia-operations) [2017-12-12T15:58:58Z] <akosiaris@tin> Started deploy [ores/deploy@b4f2b02]: T181661
Mentioned in SAL (#wikimedia-operations) [2017-12-12T15:59:08Z] <akosiaris@tin> Finished deploy [ores/deploy@b4f2b02]: T181661 (duration: 00m 09s)
Mentioned in SAL (#wikimedia-operations) [2017-12-12T15:59:19Z] <akosiaris@tin> Started deploy [ores/deploy@b4f2b02]: T181661
Mentioned in SAL (#wikimedia-operations) [2017-12-12T15:59:28Z] <akosiaris@tin> Finished deploy [ores/deploy@b4f2b02]: T181661 (duration: 00m 04s)
FWIW, if you pass -Dlog_json:False you get non-json output that's easier to read: /usr/bin/scap deploy-local -v -Dlog_json:False --repo ores/deploy -g cluster fetch --refresh-config
> simply put b4f2b0249a60754cfa4e18650fdd45c62ef4a09a does not exist in whichever repo scap tries to find it at. [...] So the object from a first look does exist in the deploy-cache. However... [...] OK, what's going on here ?
This is a bit of scap internals. The origin remote of anything in the revs dirs is the deploy-cache/cache dir. The object is likely not there because the directory exists (somehow) and you didn't pass --force; see:
https://github.com/wikimedia/scap/blob/master/scap/deploy.py#L289-L295
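Given that model, one way to pin down where the object is actually missing is to compare the rev dir's origin remote against the cache dir and probe both repos for the sha. The diagnose_rev helper below is illustrative, not part of scap:

```python
import subprocess

def sha_in_repo(repo, sha):
    # git cat-file -e exits 0 iff the object exists in this repo.
    return subprocess.call(
        ['git', '-C', repo, 'cat-file', '-e', sha],
        stderr=subprocess.DEVNULL) == 0

def diagnose_rev(rev_dir, cache_dir, sha):
    """Report whether the rev checkout's origin is the cache dir,
    and whether `sha` is present in each of the two repos."""
    origin = subprocess.check_output(
        ['git', '-C', rev_dir, 'config', '--get', 'remote.origin.url']
    ).decode().strip()
    return {
        'origin_is_cache': origin.rstrip('/') == cache_dir.rstrip('/'),
        'sha_in_rev_dir': sha_in_repo(rev_dir, sha),
        'sha_in_cache': sha_in_repo(cache_dir, sha),
    }
```

If sha_in_cache comes back False, the "reference is not a tree" failure in the rev dir follows directly, since the rev checkout fetches from the cache.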
Mentioned in SAL (#wikimedia-operations) [2017-12-12T16:22:38Z] <akosiaris@tin> Started deploy [ores/deploy@b4f2b02]: T181661
akosiaris@tin:/srv/deployment/ores/deploy$ scap deploy -v -l 'ores1004.eqiad.wmnet' T181661
fails reproducibly. The log is:
16:22:37 <Command u'/usr/bin/git submodule foreach --recursive git update-server-info', pid 33404>: process completed
16:22:37 Started deploy [ores/deploy@b4f2b02]: T181661
16:22:37 == CLUSTER == :* ores1004.eqiad.wmnet
16:22:37 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'cluster', 'fetch', '--refresh-config']
16:22:37 Using key: /etc/keyholder.d/deploy_service.pub
16:22:39 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'cluster', 'fetch', '--refresh-config'] on ores1004.eqiad.wmnet returned [70]:
OpenSSH_6.7p1 Debian-5+deb8u3, OpenSSL 1.0.2m 2 Nov 2017
debug1: Reading configuration data /dev/null
debug1: Connecting to ores1004.eqiad.wmnet [10.64.16.95] port 22.
debug1: Connection established.
debug1: identity file /etc/keyholder.d/deploy_service.pub type 1
debug1: key_load_public: No such file or directory
debug1: identity file /etc/keyholder.d/deploy_service.pub-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_6.7p1 Debian-5+deb8u3
debug1: Remote protocol version 2.0, remote software version OpenSSH_6.7p1 Debian-5+deb8u3
debug1: match: OpenSSH_6.7p1 Debian-5+deb8u3 pat OpenSSH* compat 0x04000000
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-ctr umac-128-etm@openssh.com none
debug1: kex: client->server aes128-ctr umac-128-etm@openssh.com none
debug1: sending SSH2_MSG_KEX_ECDH_INIT
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ECDSA 5d:bd:36:84:e8:2a:57:85:03:ef:80:cc:3e:fc:fb:6f
debug1: Host 'ores1004.eqiad.wmnet' is known and matches the ECDSA host key.
debug1: Found key in /etc/ssh/ssh_known_hosts:1116
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,keyboard-interactive
debug1: Next authentication method: publickey
debug1: Offering RSA public key: /etc/keyholder.d/deploy_service.pub
debug1: Server accepts key: pkalg ssh-rsa blen 279
debug1: Authentication succeeded (publickey).
Authenticated to ores1004.eqiad.wmnet ([10.64.16.95]:22).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: Sending command: /usr/bin/scap deploy-local -v --repo ores/deploy -g cluster fetch --refresh-config http://tin.eqiad.wmnet/ores/deploy/.git /srv/deployment/ores/deploy-cache/cache
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug1: client_input_channel_req: channel 0 rtype eow@openssh.com reply 0
debug1: channel 0: free: client-session, nchannels 1
debug1: fd 1 clearing O_NONBLOCK
Transferred: sent 3432, received 119092 bytes, in 1.4 seconds
Bytes per second: sent 2459.9, received 85360.0
debug1: Exit status 70
This correlates with trying to run scap deploy-local manually.
@mmodell Ideas on how I can debug this further?
Mentioned in SAL (#wikimedia-operations) [2017-12-12T17:03:29Z] <akosiaris@tin> Started deploy [ores/deploy@b4f2b02]: T181661
Mentioned in SAL (#wikimedia-operations) [2017-12-12T17:11:14Z] <akosiaris@tin> Started deploy [ores/deploy@b4f2b02]: T181661
Mentioned in SAL (#wikimedia-operations) [2017-12-12T17:11:50Z] <akosiaris@tin> Finished deploy [ores/deploy@b4f2b02]: T181661 (duration: 00m 36s)
Mentioned in SAL (#wikimedia-operations) [2017-12-12T17:12:00Z] <akosiaris@tin> Started deploy [ores/deploy@b4f2b02]: T181661
Mentioned in SAL (#wikimedia-operations) [2017-12-12T17:12:10Z] <akosiaris@tin> Finished deploy [ores/deploy@b4f2b02]: T181661 (duration: 00m 03s)
Ah indeed, that's more useful.
simply put b4f2b0249a60754cfa4e18650fdd45c62ef4a09a does not exist in whichever repo scap tries to find it at. For some more information
deploy-service@ores1004:/srv/deployment/ores/deploy-cache$ ls -la . revs/ .: total 20 drwxr-xr-x 4 deploy-service deploy-service 4096 Dec 1 14:24 . drwxr-xr-x 4 deploy-service root 4096 Dec 1 14:23 .. -rw-r--r-- 1 deploy-service deploy-service 1851 Dec 12 15:18 .config lrwxrwxrwx 1 deploy-service deploy-service 45 Dec 1 14:24 .done -> revs/532bd0bc6e923b8594aba71295e8c6db9bbc5f8e drwxr-xr-x 6 deploy-service deploy-service 4096 Dec 12 15:18 cache lrwxrwxrwx 1 deploy-service deploy-service 45 Dec 1 14:22 current -> revs/532bd0bc6e923b8594aba71295e8c6db9bbc5f8e drwxr-xr-x 10 deploy-service deploy-service 4096 Dec 11 19:21 revs revs/: total 40 drwxr-xr-x 10 deploy-service deploy-service 4096 Dec 11 19:21 . drwxr-xr-x 4 deploy-service deploy-service 4096 Dec 1 14:24 .. drwxr-xr-x 6 deploy-service deploy-service 4096 Nov 6 14:40 29905e509984050a2ea675725f3c110033fc67a0 drwxr-xr-x 6 deploy-service deploy-service 4096 Nov 29 18:04 532bd0bc6e923b8594aba71295e8c6db9bbc5f8e drwxr-xr-x 6 deploy-service deploy-service 4096 Dec 5 15:53 6baed71ef9a02264660f8637fb2313c9c560f71d drwxr-xr-x 6 deploy-service deploy-service 4096 Nov 6 17:33 82a13ae1173ec570ea563f389ab22e1a69aa154a drwxr-xr-x 6 deploy-service deploy-service 4096 Oct 26 16:43 971be2294a51772d309c967465b32f4f9a9b66f5 drwxr-xr-x 6 deploy-service deploy-service 4096 Dec 1 20:18 9afbf148ec3007a6787ab099978bf5a08360879d drwxr-xr-x 6 deploy-service deploy-service 4096 Dec 11 19:21 b4f2b0249a60754cfa4e18650fdd45c62ef4a09a drwxr-xr-x 6 deploy-service deploy-service 4096 Nov 29 13:17 e58bfbf2827be1ac453bc3d07dfe7fe41a08dc66So the object from a first look does exist in the deploy-cache. However...
deploy-service@ores1004:/srv/deployment/ores/deploy-cache/revs/b4f2b0249a60754cfa4e18650fdd45c62ef4a09a$ git show
commit 82dfd5652c1550dbee362100204d7aa5242e59ab
Author: halfak <aaron.halfaker@gmail.com>
Date:   Mon Jun 26 17:15:23 2017 -0500

    Bumps submodules for editquality/ores/wheels (1.3.15 update)

    Change-Id: I72eabc15cf84ca1abc38b1f19883eb229d2bf778

OK, what's going on here?
This is a bit of scap internals. The origin remote of anything in the revs dirs is the deploy-cache/cache dir. The object is likely not there because the directory exists (somehow) and you didn't pass --force; see:
https://github.com/wikimedia/scap/blob/master/scap/deploy.py#L289-L295
--force did not solve the issue unfortunately :(
Moving the deploy-cache/cache directory aside did solve the issue (partially?) and moved on until...
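For context on why a broken cache directory takes the rev checkouts down with it: scap's rev clones borrow objects from the cache via git's `--reference`/alternates mechanism. Here's a minimal local sketch of that layout, with made-up temp paths standing in for the deploy-cache directories:

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/server"
git -C "$tmp/server" -c user.email=a@b -c user.name=a commit -q --allow-empty -m c1
# this clone plays the role of deploy-cache/cache
git clone -q --no-local "$tmp/server" "$tmp/cache"
# a rev checkout borrows objects from the cache via --reference
git clone -q --no-local --reference "$tmp/cache" "$tmp/server" "$tmp/rev"
# the rev clone's object lookups fall through to the cache's object store
cat "$tmp/rev/.git/objects/info/alternates"
```

Because every rev checkout's object lookups fall through to the cache's object store, a corrupted or stale cache can surface as missing objects in any rev — which is consistent with moving the cache aside helping here.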
17:18:27 acquiring buffering lock for flushing buffer
17:18:27 got buffering lock for flushing buffer
17:18:27 released buffering lock for flushing buffer
17:18:27 <Command u'/usr/bin/git submodule update --init --recursive --reference /srv/deployment/ores/deploy-cache/cache'>.<Process 3440 ['/usr/bin/git', 'submodule', 'update', '--init', '--recursive', '--reference', '/srv/deployment/ores/deploy-cache/cache']>.stdout: got chunk size 0 to flush: ''
17:18:27 acquiring buffering lock for flushing buffer
17:18:27 got buffering lock for flushing buffer
17:18:27 released buffering lock for flushing buffer
17:18:27 got chunk size 0 to flush: ''
17:18:27 Unhandled error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 329, in run
    exit_status = app.main(app.extra_arguments)
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 146, in main
    getattr(self, stage)()
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 321, in fetch
    reference=self.context.cache_dir)
  File "/usr/lib/python2.7/dist-packages/scap/git.py", line 388, in update_submodules
    git.submodule(*cmd)
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 1428, in __call__
    return RunningCommand(cmd, call_args, stdin, stdout, stderr)
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 775, in __init__
    self.wait()
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 793, in wait
    self.handle_command_exit_code(exit_code)
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 816, in handle_command_exit_code
    raise exc
ErrorReturnCode_1:

  RAN: /usr/bin/git submodule update --init --recursive --reference /srv/deployment/ores/deploy-cache/cache

  STDOUT:
Submodule 'submodules/draftquality' (http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/draftquality) registered for path 'submodules/draftquality'
Submodule 'submodules/editquality' (http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/editquality) registered for path 'submodules/editquality'
Submodule 'submodules/ores' (http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/ores) registered for path 'submodules/ores'
Submodule 'submodules/wheels' (http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/wheels) registered for path 'submodules/wheels'
Submodule 'submodules/wikiclass' (http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/wikiclass) registered for path 'submodules/wikiclass'
Submodule path 'submodul... (258 more, please see e.stdout)

  STDERR:
Cloning into 'submodules/draftquality'...
Cloning into 'submodules/editquality'...
fatal: reference is not a tree: 15d5283b7422919d85203b5ba907027f9356e421
Cloning into 'submodules/ores'...
fatal: reference is not a tree: b7413d6abf80688d44b42da4a7d3a4d1a80ca150
Cloning into 'submodules/wheels'...
Cloning into 'submodules/wikiclass'...
Unable to checkout '15d5283b7422919d85203b5ba907027f9356e421' in submodule path 'submodules/editquality'
Unable to checkout 'b7413d6abf80688d44b42da4a7d3a4d1a80ca150' in submodule path 'submodules/ores'

17:18:27 deploy-local failed: <ErrorReturnCode_1>

  RAN: /usr/bin/git submodule update --init --recursive --reference /srv/deployment/ores/deploy-cache/cache

  STDOUT:
Submodule 'submodules/draftquality' (http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/draftquality) registered for path 'submodules/draftquality'
Submodule 'submodules/editquality' (http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/editquality) registered for path 'submodules/editquality'
Submodule 'submodules/ores' (http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/ores) registered for path 'submodules/ores'
Submodule 'submodules/wheels' (http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/wheels) registered for path 'submodules/wheels'
Submodule 'submodules/wikiclass' (http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/wikiclass) registered for path 'submodules/wikiclass'
Submodule path 'submodul... (258 more, please see e.stdout)

  STDERR:
Cloning into 'submodules/draftquality'...
Cloning into 'submodules/editquality'...
fatal: reference is not a tree: 15d5283b7422919d85203b5ba907027f9356e421
Cloning into 'submodules/ores'...
fatal: reference is not a tree: b7413d6abf80688d44b42da4a7d3a4d1a80ca150
Cloning into 'submodules/wheels'...
Cloning into 'submodules/wikiclass'...
Unable to checkout '15d5283b7422919d85203b5ba907027f9356e421' in submodule path 'submodules/editquality'
Unable to checkout 'b7413d6abf80688d44b42da4a7d3a4d1a80ca150' in submodule path 'submodules/ores'
rOEQ15d5283b7422 doesn't exist in the editquality repo. Somehow the submodule pointers are wrong.
But it does exist on tin:
akosiaris@tin:/srv/deployment/ores/deploy/.git/modules/submodules/editquality$ git show 15d5283b7422919d85203b5ba907027f9356e421
commit 15d5283b7422919d85203b5ba907027f9356e421
Merge: 529371c 4813659
Author: Aaron Halfaker <aaron.halfaker@gmail.com>
Date:   Wed Dec 6 11:47:01 2017 -0600

    Merge pull request #108 from wiki-ai/eswikiquote

    Add eswikiquote reverted models and revisions for review
For completeness' sake:
deploy-service@ores1004:/srv/deployment/ores/deploy-cache/cache$ cat .gitmodules
[submodule "submodules/wikiclass"]
    path = submodules/wikiclass
    url = http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/wikiclass
[submodule "submodules/editquality"]
    path = submodules/editquality
    url = http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/editquality
[submodule "submodules/ores"]
    path = submodules/ores
    url = http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/ores
[submodule "submodules/wheels"]
    path = submodules/wheels
    url = http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/wheels
[submodule "submodules/draftquality"]
    path = submodules/draftquality
    url = http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/draftquality
deploy-service@ores1004:/srv/deployment/ores/deploy-cache/cache$ cat .git/config
[core]
    repositoryformatversion = 0
    filemode = true
    bare = false
    logallrefupdates = true
[remote "origin"]
    url = http://tin.eqiad.wmnet/ores/deploy/.git
    fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
    remote = origin
    merge = refs/heads/master
[submodule "submodules/draftquality"]
    url = http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/draftquality
[submodule "submodules/editquality"]
    url = http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/editquality
[submodule "submodules/ores"]
    url = http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/ores
[submodule "submodules/wheels"]
    url = http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/wheels
[submodule "submodules/wikiclass"]
    url = http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/wikiclass
But a git fetch indeed does not fetch anything new.
Another thing: I'm having difficulty just cloning the editquality submodule.
It's so large that git pack-objects takes over 100 seconds of CPU time and uses ~5.5 GB of RAM:
   RES    SHR S  %CPU %MEM     TIME+ COMMAND
5.665g 486352 S  1822  9.0 102:16.34 git pack-objects --revs --thin --stdout --delta-base-offset
I don't think there is anything really wrong with scap; it's just an unusably large git repo.
@mmodell is on to something, though, with the comment about that commit not being in the repo:
akosiaris@tin:/srv/deployment/ores/deploy/.git/modules/submodules/editquality$ git branch --contains 15d5283b7422919d85203b5ba907027f9356e421
akosiaris@tin:/srv/deployment/ores/deploy/.git/modules/submodules/editquality$ git tag --contains 15d5283b7422919d85203b5ba907027f9356e421
akosiaris@tin:/srv/deployment/ores/deploy/.git/modules/submodules/editquality$
So this commit does not exist in any of the branches or tags.
Hmm, indeed, if the object does not exist on any branch or tag then it likely won't be fetched by the "dumb" git http transport (which is what we still use in scap).
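That behaviour is straightforward to reproduce locally (a sketch with throwaway repos; nothing here is the actual ORES setup): an object that is not reachable from any branch or tag simply isn't transferred.

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/server"
git -C "$tmp/server" -c user.email=a@b -c user.name=a commit -q --allow-empty -m base
git -C "$tmp/server" -c user.email=a@b -c user.name=a commit -q --allow-empty -m orphan
orphan=$(git -C "$tmp/server" rev-parse HEAD)
git -C "$tmp/server" reset -q --hard HEAD~1   # orphan is now unreachable
# --no-local forces a real object transfer instead of hardlinking the store
git clone -q --no-local "$tmp/server" "$tmp/client"
git -C "$tmp/client" cat-file -e "$orphan" 2>/dev/null \
  && echo "orphan transferred" || echo "orphan NOT transferred"
```

The `--no-local` flag matters for the sketch: a plain local-path clone hardlinks the whole object store, unreachable objects included, which would hide the effect.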
Probably wouldn't be fetched by other transports either, though I guess it might get fetched along with other objects if it's in the same pack file.
Just fetching this one repo (editquality) from phabricator is causing inordinate load on the server. It's nothing that phab1001 can't handle; still, it's surprising how much work it takes for git to clone a single ~3 GB repository.
load average: 18.23, 26.87, 28.41
Behavior is erratic as well
akosiaris@bast1001:~$ git clone https://phabricator.wikimedia.org/source/editquality.git
Cloning into 'editquality'...
error: RPC failed; result=22, HTTP code = 504
fatal: The remote end hung up unexpectedly
Ahem...
Yeah, that repo is 334M in the current workdir, but the .git is 2.1 GB. That doesn't seem too unreasonable, but apparently git can't handle it. It's especially bad cloning from phabricator over https (upstream: T4369).
A fresh clone of http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/editquality on bast1001 does not have rOEQ15d5283b7422 either, so aside from the difficult-to-fetch repo we have identified at least one more cause of the problem. It seems the editquality submodule referenced by the deploy repo needs to be updated to contain that change as well, rather than sitting in the "HEAD detached at ac3042e" state. @awight, care to have the honours of updating that submodule?
I can't help but wonder why scap is trying to checkout rOEQ15d5283b7422, however. Where does it get that info from? master?
@akosiaris: scap should be getting the hash from the submodule pointers contained at HEAD of tin.eqiad.wmnet/ores/deploy/.git
That said, I can't actually find that hash anywhere in the repo on tin, so I'm baffled really; I can't figure out where that commit is coming from at all.
The ores submodule, btw, is in the exact same state and also fails to checkout:
akosiaris@tin:/srv/deployment/ores/deploy/ores$ git status
HEAD detached at f3a4a63
nothing to commit, working tree clean
akosiaris@tin:/srv/deployment/ores/deploy/ores$ cd ../
akosiaris@tin:/srv/deployment/ores/deploy$ git status
On branch master
Ah ha! I figured _something_ out at least! The rOEQ15d5283b7422 commit is in origin/master; it just hasn't been merged to master on tin. A git pull would fast-forward it, and then that commit would exist. The question that this raises, however, is this:
Why is scap trying to fetch that commit? It would appear as if scap is fetching the repo from origin instead of tin, at least initially, then attempting to fetch only the submodules from tin (but not finding the commit in question because it hasn't been pulled there)
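This state is easy to reproduce locally (a sketch with throwaway repos; the paths are made up, with one clone standing in for tin): after a fetch, a new upstream commit is reachable only from the remote-tracking ref, so `git branch --contains` on local branches finds nothing until a pull/merge.

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/origin"
git -C "$tmp/origin" -c user.email=a@b -c user.name=a commit -q --allow-empty -m c1
git clone -q --no-local "$tmp/origin" "$tmp/tin"
# upstream gains a commit after the clone
git -C "$tmp/origin" -c user.email=a@b -c user.name=a commit -q --allow-empty -m c2
new=$(git -C "$tmp/origin" rev-parse HEAD)
git -C "$tmp/tin" fetch -q origin
git -C "$tmp/tin" branch --contains "$new"      # empty: no local branch has it
git -C "$tmp/tin" branch -r --contains "$new"   # only the remote-tracking ref does
```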
(If this is true, it's incorrect behavior, I think, as scap should always fetch from tin unless instructed otherwise by scap.cfg)
Well, it's getting the submodule pointers from origin/master instead, I'd say. See 42cf53249ffacc078aa786a581fe165e00bb1520.
That said, I can't actually find that hash anywhere in the repo on tin, so I'm baffled really; I can't figure out where that commit is coming from at all.
Indeed, and that's the important detail that was eluding me before. And I think that I have found the culprit: scap/scap.cfg has this line: git_rev: origin/master
So @awight, can you enlighten me about your scap.cfg? Is git_rev: origin/master intentional? If you want to deploy from origin, then I think we need to use git_upstream_submodules = True
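For reference, the two configurations discussed would look roughly like this in scap.cfg. This is a sketch based only on the option names mentioned in this thread; check the scap documentation for the exact spelling and syntax before relying on it:

```ini
; Option 1: drop the override so scap deploys the local checkout on tin
; (i.e. simply delete the git_rev line)

; Option 2: keep deploying from origin, but then also take the
; submodule pointers from upstream rather than from tin
git_rev: origin/master
git_upstream_submodules: True
```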
Aha! Nice find. It looks like it's been there since the very beginning; see fd1067ff4da. It has undergone a few minor changes since then, but nothing major.
I've crafted a commit on tin removing that line and am retrying a scap deploy from tin just for ores1004.
I think I should add a NOTICE to scap that says something along the lines of "Deploying from non-default origin: $GIT_REV" just to make this more obvious ;)
I've crafted a commit on tin removing that line and am retrying a scap deploy from tin just for ores1004.
OK, success! The deploy proceeded fine on ores1004 for the first time. @mmodell you rock! Thanks!
Now to see if we have solved the parallel scap issue on all ores* boxes as well
Hurray!!! I had to manually (via cumin) rm -rf /srv/deployment/ores/deploy/deploy-cache/cache on all ores* hosts, but it now works.
AFAICT, next steps forward should probably be:
Change 398067 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[mediawiki/services/ores/deploy@master] scap: Remove git_rev setting
Change 398067 merged by Awight:
[mediawiki/services/ores/deploy@master] scap: Remove git_rev setting
@mmodell Tangential note, I've been happy using git clone --depth 1 on personal projects. Would that make any sense for scap?
@awight: from what I understand, git has to do a lot of extra work on the server side in order to build a shallow clone. I may look into this a bit further; it would certainly save space on the deployment target, but it probably doesn't solve the issue of slow fetches.
I may actually be wrong about this; upon reviewing my sources, I think I was just remembering something incorrectly. It's only hard on the server when you subsequently try to fetch into a shallow clone: git has to do more work to figure out what needs to be fetched, and it makes the server work harder than simply cloning the whole repo to begin with.
So --depth=1 might work as long as we don't have to fetch from the repo later. I'm not sure how that would be better than the --reference method that scap is using right now but I am definitely open to considering anything which can improve performance and stability.
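As a quick sanity check on what `--depth 1` buys: a shallow clone really does carry only the tip of history (throwaway repo, made-up paths):

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/server"
for i in 1 2 3; do
  git -C "$tmp/server" -c user.email=a@b -c user.name=a commit -q --allow-empty -m "c$i"
done
# a file:// URL forces a real transport, so --depth actually applies
git clone -q --depth 1 "file://$tmp/server" "$tmp/shallow"
git -C "$tmp/shallow" rev-list --count HEAD   # → 1
```

So the history (and the corresponding pack data) is gone from the target; the trade-off discussed above is what happens on later fetches into that shallow clone.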
Looks like I'm getting the same error.
commit b67bba77acb7c0ffc678201c9f3f54f198da6650
scap deploy -v -l "ores*" "(non-production) Update ORES on new cluster"
scap-sync-2017-12-12-0006-3-gb67bba7.log
Mentioned in SAL (#wikimedia-operations) [2017-12-14T11:12:07Z] <akosiaris@tin> Started deploy [ores/deploy@b67bba7]: T181661
Mentioned in SAL (#wikimedia-operations) [2017-12-14T11:13:02Z] <akosiaris@tin> Finished deploy [ores/deploy@b67bba7]: T181661 (duration: 00m 55s)
Mentioned in SAL (#wikimedia-operations) [2017-12-14T11:13:59Z] <akosiaris@tin> Started deploy [ores/deploy@b67bba7]: T181661
Mentioned in SAL (#wikimedia-operations) [2017-12-14T11:14:08Z] <akosiaris@tin> Finished deploy [ores/deploy@b67bba7]: T181661 (duration: 00m 03s)
I cannot reproduce this. A deploy of b67bba77a (see T181661#3836939) just worked fine. In fact, it worked fine twice in a row. The command used was scap deploy -v -l "ores*" T181661.
Mentioned in SAL (#wikimedia-operations) [2017-12-14T11:16:12Z] <akosiaris@tin> Started deploy [ores/deploy@b67bba7]: T181661
Mentioned in SAL (#wikimedia-operations) [2017-12-14T11:16:21Z] <akosiaris@tin> Finished deploy [ores/deploy@b67bba7]: T181661 (duration: 00m 03s)
Not sure if this is related, but now I'm seeing a deploy-local failure with no diagnostics other than error code 70:
scap deploy-log -v reveals more:
11:27:10 [ores1001.eqiad.wmnet] Unhandled error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 329, in run
    exit_status = app.main(app.extra_arguments)
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 146, in main
    getattr(self, stage)()
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 321, in fetch
    reference=self.context.cache_dir)
  File "/usr/lib/python2.7/dist-packages/scap/git.py", line 388, in update_submodules
    git.submodule(*cmd)
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 1428, in __call__
    return RunningCommand(cmd, call_args, stdin, stdout, stderr)
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 775, in __init__
    self.wait()
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 793, in wait
    self.handle_command_exit_code(exit_code)
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 816, in handle_command_exit_code
    raise exc
ErrorReturnCode_1:

  RAN: /usr/bin/git submodule update --init --recursive --reference /srv/deployment/ores/deploy-cache/cache

  STDOUT:

  STDERR:
fatal: reference is not a tree: 15d5283b7422919d85203b5ba907027f9356e421
fatal: reference is not a tree: a3a50bb0b5aea50178e428430438ae3323d7376f
Unable to checkout '15d5283b7422919d85203b5ba907027f9356e421' in submodule path 'submodules/editquality'
Unable to checkout 'a3a50bb0b5aea50178e428430438ae3323d7376f' in submodule path 'submodules/ores'
Those revisions aren't in gerrit. I think the github -> gerrit mirroring broke when we were messing around with pointing to git-lfs repos?
The Phabricator control panels look happy:
- https://phabricator.wikimedia.org/source/editquality/manage/uris/ shows that we're observing GitHub.
- https://github.com/wiki-ai/editquality/commit/15d5283b7422919d85203b5ba907027f9356e421 shows that commit has been in GitHub for 3 days.
- https://gerrit.wikimedia.org/r/#/q/15d5283b7422919d85203b5ba907027f9356e421 shows it's not in Gerrit yet.
- https://phabricator.wikimedia.org/source/editquality/manage/basics/ says "Last updated Thu, Dec 14, 7:07 AM (9 m, 19 s ago)."
Oh, what's this? A clue:
https://phabricator.wikimedia.org/source/editquality/history/master/ <-- this does contain the revision.
Oops, we aren't expecting this repo to be mirrored to gerrit. So the surprise is that the revision exists in Phabricator but can't be found by the deployment tool. Making a new task: T182865: Source revision is in Phabricator, but can't be found by deployment tools
@awight: yeah, I'm getting to the bottom of it now. The issue is that the commit does not exist on a local branch on tin; it's only in refs/remotes/origin/, which scap doesn't fetch automatically, as the default refspec only fetches refs/heads/* and not refs/remotes/*.
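That default refspec is easy to confirm on any fresh clone (sketch with a throwaway repo): only the server's refs/heads/* are mapped into the clone, so anything the server keeps under refs/remotes/* is never pulled over.

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/server"
git -C "$tmp/server" -c user.email=a@b -c user.name=a commit -q --allow-empty -m c1
git clone -q --no-local "$tmp/server" "$tmp/clone"
# only refs/heads/* on the remote are covered; refs/remotes/* are not
git -C "$tmp/clone" config remote.origin.fetch   # → +refs/heads/*:refs/remotes/origin/*
```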
Using a workaround for T182865, where we go into submodules and checkout master before submodule updating from the ORES root, I was able to successfully parallel deploy to ores*, and in under a minute! It looks like the SSH timeout is resolved!!
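For the record, the workaround pattern looks roughly like this as a local sketch (throwaway repos with made-up paths; on tin the equivalent was done inside the deploy repo's submodule directories): make the wanted commit reachable inside the submodule, then record the updated pointer from the superproject root.

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/sub"
git -C "$tmp/sub" -c user.email=a@b -c user.name=a commit -q --allow-empty -m s1
git init -q "$tmp/super"
cd "$tmp/super"
git -c protocol.file.allow=always submodule --quiet add "$tmp/sub" sub
git -c user.email=a@b -c user.name=a commit -q -m 'add sub'
# upstream submodule moves ahead; the superproject's pointer is now stale
git -C "$tmp/sub" -c user.email=a@b -c user.name=a commit -q --allow-empty -m s2
new=$(git -C "$tmp/sub" rev-parse HEAD)
# the workaround: inside the submodule, fetch and check out the branch tip...
git -C sub fetch -q origin
git -C sub checkout -q "$new"
# ...then record the updated gitlink from the superproject root
git add sub
git -c user.email=a@b -c user.name=a commit -q -m 'bump sub pointer'
git rev-parse :sub   # the staged gitlink now matches the new submodule commit
```

(The thread's actual workaround checked out master in each submodule; the sketch checks out the fetched tip by hash to avoid assuming a branch name.)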
I'll close this task for now, and will reopen if anything untoward rears its head again. Thanks everyone for the great work debugging!