Thu, Jan 18
forgot to add the bug to the scap message so it'd show up here, but this is now deployed: https://tools.wmflabs.org/sal/log/AWEK1UGSkkJ8OkTwh3Dt
Fix should be live now
Wed, Jan 17
This fix was deployed to wmf.16 wikis yesterday and verified by @Addshore, closing.
Leaving this task open for all the git_fat -> git_binary_manager changes I made, but since the new version of scap was released yesterday, that doesn't need to happen as desperately anymore.
Tue, Jan 16
1.31.0-wmf.16 is now live everywhere.
Noticed this happening in 1.31.0-wmf.16 just now.
Full stack trace:
This started when I pushed out wmf.16 to the wikipedia wikis.
Mon, Jan 15
Sun, Jan 14
Current status is that repo owners depending on git_fat either have to wait for scap 3.7.6-1 to be released, or merge the patch I made against their repository from this ticket in order to deploy.
This error is currently in scap version 3.7.5-1 that was released to production on Friday and that means that any repo currently using the:
Hello, I just pushed new scap tags for version 3.7.6/debian/3.7.6-1. This has the fix for T184882.
Fri, Jan 12
- Things I like
For reference, this is why this is happening:
Looking at the phabricator-jessie-commits job on T184118#3897095 I had to refresh my memory on what was happening.
Changing priority because this is currently "fixed" because I pushed a new commit onto scap's master branch.
For whatever reason beta debs got built from the release branch? https://integration.wikimedia.org/ci/job/phabricator-jessie-debs/280/console
happening again but maybe with a different cause. Are beta packages being built from the release branch now?
to horizon project puppet for the integration project and docker is running again. Bringing integration-slave-docker-1005 back online in jenkins and calling this done.
I think I found the problem:
I disconnected this machine in jenkins for now.
Thu, Jan 11
Wed, Jan 10
FWIW, I pushed https://phabricator.wikimedia.org/rMSCA58f5eca67156879a0999422c4662b1d1c13acd96 on the 3rd that should have fixed the cause of this immediate task. Overall beta deployment could use some rethinking, but this specific task is effectively resolved, I think.
Mon, Jan 8
Dug a little deeper on this today and realized that this didn't actually make it out to production afaict (thanks to @zeljkofilipin), although the error rate on the canaries was high enough to spike the overall error rate. The task I filed a few comments back is still a thing we need to do. I'm going to call this investigation complete though.
Fri, Jan 5
18:08:10 ['/vagrant/scap/bin/scap', 'deploy-local', '-v', '--repo', 'mockbase/deploy', '--force', '-g', 'default', 'fetch', '--refresh-config'] on scap-target-01 returne$ : OpenSSH_7.4p1 Debian-10+deb9u1, OpenSSL 1.0.2l 25 May 2017
debug1: Reading configuration data /dev/null
debug1: Connecting to scap-target-01 [192.168.122.133] port 22.
debug1: Connection established.
debug1: identity file /home/vagrant/.ssh/id_rsa type 1
debug1: key_load_public: No such file or directory
debug1: identity file /home/vagrant/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/vagrant/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/vagrant/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/vagrant/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/vagrant/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/vagrant/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/vagrant/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u1
debug1: Remote protocol version 2.0, remote software version OpenSSH_7.4p1 Debian-10+deb9u1
debug1: match: OpenSSH_7.4p1 Debian-10+deb9u1 pat OpenSSH* compat 0x04000000
debug1: Authenticating to scap-target-01:22 as 'vagrant'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: email@example.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: firstname.lastname@example.org MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ecdsa-sha2-nistp256 SHA256:XdTvHeZ/g7GyMh5Y2UylYB6/sNKetKlxvRKNIJ18OSA
debug1: Host 'scap-target-01' is known and matches the ECDSA host key.
debug1: Found key in /etc/ssh/ssh_known_hosts:2
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<ssh-ed25519,ssh-rsa,ssh-dss,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,password
debug1: Next authentication method: publickey
debug1: Offering RSA public key: /home/vagrant/.ssh/id_rsa
debug1: Server accepts key: pkalg ssh-rsa blen 279
debug1: Authentication succeeded (publickey).
Authenticated to scap-target-01 ([192.168.122.133]:22).
debug1: channel 0: new [client-session]
debug1: Requesting email@example.com
debug1: Entering interactive session.
debug1: pledge: network
debug1: client_input_global_request: rtype firstname.lastname@example.org want_reply 0
debug1: Sending command: /vagrant/scap/bin/scap deploy-local -v --repo mockbase/deploy --force -g default fetch --refresh-config http://192.168.122.1/mockbase/deploy/.git
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug1: client_input_channel_req: channel 0 rtype email@example.com reply 0
debug1: channel 0: free: client-session, nchannels 1
debug1: fd 1 clearing O_NONBLOCK
Transferred: sent 2828, received 5516 bytes, in 1.4 seconds
Bytes per second: sent 1994.8, received 3890.9
debug1: Exit status 70
Thu, Jan 4
0 pages to fix, 0 were resolvable.
beta-scap-eqiad seems happy. Calling this closed :)
Now I'm getting
Tried locally, seems to yield:
For the time being, I reverted the patch that added deployment-snapshot01 to the mediawiki deployment targets in beta, just so beta-scap-eqiad will stop alerting. Whenever deployment-snapshot01 is all set up we can remove:
This one seems related to https://github.com/wikimedia/scap/commit/5acb672bea5beecf677e791eb0ca332793412201 and T182865. @mmodell, do you have some time to check this out?
This one was resolved with the release of scap 3.7.4-3
Wed, Jan 3
Tue, Jan 2
I filed T183999: Scap canary has a shifting baseline to address the main problem I see here: when a deployment spikes the error rate, fails its canary check, and is then redeployed, the second canary check runs against a new baseline that already includes the spike.
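A rough illustration of the shifting-baseline problem, not scap's actual canary logic: the function name, the threshold rule, and the sample error rates below are all made up for the sketch. It assumes the check compares the post-deploy error rate to a baseline measured just before the deploy.

```python
def canary_passes(baseline, current, threshold=10.0):
    """Hypothetical rule: pass unless errors exceed threshold x baseline."""
    # Floor the baseline at 1.0 so a near-zero baseline does not fail everything.
    return current <= max(baseline, 1.0) * threshold


def mean(samples):
    return sum(samples) / len(samples)


# Made-up error rates (errors per second) on the canary hosts.
before_deploy = [1.0, 1.2, 0.8]
after_bad_deploy = [50.0, 48.0, 52.0]

# First attempt: baseline is taken from the quiet period, so the spike is caught.
first_attempt = canary_passes(mean(before_deploy), mean(after_bad_deploy))

# Redeploy: the baseline window now covers the spike itself, so the same
# error rate is compared against an inflated baseline and slips through.
second_attempt = canary_passes(mean(after_bad_deploy), mean(after_bad_deploy))

print(first_attempt, second_attempt)
```

In this sketch the first check fails and the retry passes even though nothing about the code being deployed has changed, which is the behavior described above.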
It looks like this change eventually merged and that this test eventually passed (https://integration.wikimedia.org/ci/job/mediawiki-core-phpcs-docker/3352/console is a run of the same change). Lowering the priority from UBN accordingly.
Scap does seem to have failed along with that error spike: http://tools.wmflabs.org/sal/log/AWC3SqMzwg13V6286YVJ
Logstash graph based on the query looks like it should have caught it: https://logstash.wikimedia.org/goto/800d886d5e05b8e1f9d11454717cf183
Dec 20 2017
Dec 19 2017
Aww, phab formatting ruined my commit message...