Gage (Jeff Gage)
Disabled

Projects

User Details

User Since
Oct 24 2014, 11:27 PM (509 w, 2 h)
Roles
Disabled
LDAP User
Gage
MediaWiki User
Unknown

Recent Activity

Sep 23 2015

Gage added a comment to T98481: check_puppetrun: print "agent disabled" reason.

The script source doesn't say so, but I've noticed that it's written by ripienaar. Latest upstream implements this feature:

Sep 23 2015, 7:43 AM · patch-welcome, SRE, Icinga, observability

Aug 3 2015

Nemo_bis awarded T92604: IPSec: roll-out plan a Yellow Medal token.
Aug 3 2015, 8:06 PM · SRE, Patch-For-Review, Interdatacenter-IPsec

Jul 22 2015

Gage closed T100478: Labs homedirs owned by root for new projects as Resolved.

Yeah, I haven't seen a recurrence of this, so I'm closing the ticket. Thanks.

Jul 22 2015, 5:50 PM · Cloud-Services

Jul 21 2015

Gage added a member for Wikidata-Query-Service: Gage.
Jul 21 2015, 10:13 PM

Jul 16 2015

Gage added a comment to T105705: Evaluate traffic flow between the Jobrunners and the Cirrus cluster.

TLDR: the rough estimate is about 32Mbit/sec from jobrunners to elasticsearch nodes. Traffic is bursty so I advise planning for a 50-60Mbit ceiling.

Jul 16 2015, 8:59 PM · SRE

Jul 13 2015

Gage claimed T105705: Evaluate traffic flow between the Jobrunners and the Cirrus cluster.
Jul 13 2015, 4:42 PM · SRE

Jul 10 2015

Gage added a comment to T98173: install/setup/deploy server rhodium as puppetmaster (scaling out).

Regarding running puppetmaster on !Precise: when I tried with Trusty I got this:

Error: Could not retrieve catalog from remote server: Error 400 on SERVER: stack level too deep

It seems to be due to a bad interaction between Rails and ActiveRecord (deb: ruby-activerecord-3.2). This ticket proposes some workarounds:
https://projects.puppetlabs.com/issues/9290

Jul 10 2015, 7:58 PM · Puppet, Puppet-infrastructure-modernization, SRE, Patch-For-Review

Jul 9 2015

Gage added a project to T102397: icinga log rotation wipes out portions of history: observability.
Jul 9 2015, 5:49 PM · SRE, observability
Gage added a project to T102394: Implement pybal pool state monitoring and alerting via icinga: observability.
Jul 9 2015, 5:48 PM · SRE, Patch-For-Review, PyBal, observability, Traffic
Gage added a comment to T93776: remove ganglia(old), replace with ganglia_new.

logstash1001-1003: These hosts are older than 1004-1006, and run Precise instead of Jessie. Gmond wouldn't stop or start.

gage@logstash1002:~$ sudo /usr/sbin/gmond -f
[apache_status] Received the following parameters
{'url': 'http://127.0.0.1:80/server-status', 'collect_ssl': 'False', 'metric_group': 'apache'}
Fatal Python error: PyThreadState_Get: no current thread
Jul  9 00:06:37 logstash1002 kernel: [13984018.998725] init: ganglia-monitor main process (14196) killed by ABRT signal
Jul  9 00:06:37 logstash1002 kernel: [13984018.998751] init: ganglia-monitor main process ended, respawning
Jul  9 00:06:37 logstash1002 kernel: [13984019.086671] init: ganglia-monitor main process (14210) killed by ABRT signal
Jul  9 00:06:37 logstash1002 kernel: [13984019.086698] init: ganglia-monitor main process ended, respawning

The above error messages were unhelpful, so I used strace -f /usr/sbin/gmond -f, saw that gmond parses /etc/ganglia/conf.d/* before aborting, and then did a binary search to remove files in conf.d/ until the problem disappeared.
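
The binary search described above can be sketched in Python. This is a minimal sketch, not the procedure actually run on logstash1002: the starts_ok callback is hypothetical (in practice it would leave only the given files in conf.d/ and try to start gmond), and the sketch assumes exactly one file is at fault and that startup fails whenever that file is present.

```python
from typing import Callable, Sequence


def find_bad_file(files: Sequence[str],
                  starts_ok: Callable[[Sequence[str]], bool]) -> str:
    """Bisect `files` to find the single file that makes startup fail.

    `starts_ok(subset)` is a hypothetical callback: it should return True
    if the daemon starts with only `subset` present in conf.d/.
    Assumes exactly one bad file.
    """
    candidates = list(files)
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # If the first half alone starts cleanly, the culprit is in the second half.
        candidates = second if starts_ok(first) else first
    return candidates[0]


if __name__ == "__main__":
    # Simulated check; these file names are illustrative only.
    conf = ["a.pyconf", "b.pyconf", "c.pyconf", "d.pyconf", "e.pyconf"]
    bad = "d.pyconf"
    print(find_bad_file(conf, lambda subset: bad not in subset))
```

With N files this needs roughly log2(N) start attempts instead of N, which is the reason to bisect rather than remove files one at a time.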

Jul 9 2015, 12:56 AM · SRE, observability, Patch-For-Review

Jun 29 2015

Gage added a comment to T104019: Nested ".d" dirs in /etc/apt/.

Because new images were not built, I tried to work around this myself like so:

sudo mv /etc/apt/apt.conf.d/apt.conf.d/* /etc/apt/apt.conf.d/
sudo mv /etc/apt/preferences.d/preferences.d/* /etc/apt/preferences.d/
sudo mv /etc/apt/sources.list.d/sources.list.d/* /etc/apt/sources.list.d/
sudo rmdir /etc/apt/apt.conf.d/apt.conf.d/
sudo rmdir /etc/apt/preferences.d/preferences.d/
sudo rmdir /etc/apt/sources.list.d/sources.list.d/

However, that results in this error:

N: Ignoring file 'puppet_base_2.7' in directory '/etc/apt/preferences.d/' as it has an invalid filename extension

So I gave it the proper extension:

sudo mv /etc/apt/preferences.d/puppet_base_2.7{,.pref}

However, that installs puppet 2.7.11-1ubuntu2 instead of 3.4.3-1~ubuntu12.04.1. Puppet 2.7's version of the 'exec' type doesn't support the 'umask' attribute, resulting in this cryptic error when I apply role::puppet::self:

err: Failed to apply catalog: Invalid parameter umask at /etc/puppet/modules/git/manifests/clone.pp:147

Solution:

sudo rm /etc/apt/preferences.d/puppet_base_2.7.pref
Jun 29 2015, 6:54 AM · Patch-For-Review, Cloud-Services

Jun 26 2015

Gage added a comment to T104019: Nested ".d" dirs in /etc/apt/.

I'm surprised to report that this is still happening on instances created after the patch was merged. I tried twice.

Jun 26 2015, 10:39 PM · Patch-For-Review, Cloud-Services
Gage updated the task description for T104019: Nested ".d" dirs in /etc/apt/.
Jun 26 2015, 6:46 PM · Patch-For-Review, Cloud-Services
Gage created T104019: Nested ".d" dirs in /etc/apt/.
Jun 26 2015, 6:43 PM · Patch-For-Review, Cloud-Services

Jun 18 2015

Restricted Application added a project to T101199: Wikitech often loses track of internal openstack/nova session: Cloud-Services.
Jun 18 2015, 12:21 AM · MW-1.27-release (WMF-deploy-2016-02-09_(1.27.0-wmf.13)), Patch-For-Review, Cloud-Services, wikitech.wikimedia.org, MediaWiki-extensions-OpenStackManager

Jun 14 2015

Gage added a comment to T92618: Ganglia broken for labstore1001 (again).

This is broken again since June 8:

Jun 14 2015, 6:29 PM · Labs-Sprint-102, Cloud-Services

Jun 8 2015

Gage closed T96111: Strongswan: security association reauthentication failure as Resolved.

Strongswan 5.3.0-1+wmf2 is currently in our apt repo. I'll make a separate task for config values.

Jun 8 2015, 4:59 PM · SRE, Patch-For-Review, Interdatacenter-IPsec

Jun 5 2015

Gage closed T92603: Monitor IPsec status as Resolved.
Jun 5 2015, 3:22 PM · SRE, Patch-For-Review, Interdatacenter-IPsec, observability

Jun 4 2015

Gage reopened T97380: analytics1013 crashed, investigate... as "Open".

This machine crashed again. All the errors are on socket 0, so we should probably replace that DIMM.

Jun 4 2015, 9:55 PM · SRE, Analytics

Jun 2 2015

Gage added a comment to T100959: graphite2001 bios config issue.

a) no grub prompt
b) yes, I see kernel output
c) yes, I see getty on the serial port.

Jun 2 2015, 11:39 PM · SRE
Gage added a comment to T78616: Fix syslog error "nslcd[29117]: error writing to client: Broken pipe".

There's also a Debian bug report discussing this: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=685504

Jun 2 2015, 10:35 PM · LDAP, cloud-services-team (Kanban), Cloud-VPS

Jun 1 2015

Gage updated subscribers of T100954: Wikitech: update Bacula article.
Jun 1 2015, 11:18 PM · SRE, Documentation

May 31 2015

Gage created T100959: graphite2001 bios config issue.
May 31 2015, 10:46 PM · SRE
Gage created T100954: Wikitech: update Bacula article.
May 31 2015, 9:01 PM · SRE, Documentation

May 27 2015

Gage added a comment to T99845: analytics1036 can't talk cross row?.

I discussed this problem with a friend in neteng at Twitter, who says he has seen similar behavior in Juniper switches before. He recommends, and I agree: let's reboot the switch (asw-d2-eqiad).

May 27 2015, 1:47 AM · SRE, ops-eqiad
Gage renamed T100478: Labs homedirs owned by root for new projects from Labs homedirs owned by root for new instances to Labs homedirs owned by root for new projects.
May 27 2015, 12:14 AM · Cloud-Services
Gage created T100478: Labs homedirs owned by root for new projects.
May 27 2015, 12:13 AM · Cloud-Services

May 26 2015

Gage added projects to T82576: Enable STARTTLS (both inbound and outbound) on lists: Mail, Wikimedia-Mailing-lists.
May 26 2015, 11:55 PM · SRE, Wikimedia-Mailing-lists, Mail

May 22 2015

Gage added a comment to T99845: analytics1036 can't talk cross row?.

Also let's try to ensure the NIC gets rebooted. My expectation is that racadm serveraction powercycle will do it. I've seen NICs get into weird states that persisted across warm reboots before.

May 22 2015, 7:51 PM · SRE, ops-eqiad
Gage added a comment to T99845: analytics1036 can't talk cross row?.

I recommend checking all bios settings against 1035 for accidental changes (I can't imagine what setting would cause this, but we might as well rule it out), which will also result in rebooting the box to confirm that this behavior is reproducible.

May 22 2015, 7:45 PM · SRE, ops-eqiad

May 20 2015

Gage added a comment to T99845: analytics1036 can't talk cross row?.

This is mysterious. I've compared the problem host, analytics1036 (ge-2/0/5), with healthy analytics1035 (ge-2/0/4), which sits right next to it in rack D-2.

  • Confirmed the problem:
    • Neon (row A), iron (row B), and analytics1029 (row C) can ping analytics1035 but not analytics1036.
    • Other analytics hosts in row D (analytics1035 in D-2, analytics1041 in D-4) can ping analytics1036.
    • None of the 12 other analytics hosts rebooted today have this problem.
    • It's possible to reach the affected host from another host in the same rack. ssh -A through iron -> analytics1035 -> analytics1036 works.
  • Same running kernel
  • Both hosts rebooted today
  • ip addr show output looks the same: same netmasks etc.
  • netstat -rn looks the same
  • /etc/network/interfaces looks the same
  • lldpctl output looks the same
  • On asw-d-eqiad, show interfaces output for ge-2/0/4 and ge-2/0/5 looks the same, and show vlans analytics1-d-eqiad shows both ports as members
    • The only thing that looks odd to me is this part of the running config which seems to have some redundancy and doesn't treat all ports consistently (2/0/3 and 2/0/7 not in 'member-range', 4/0/* not in 'member', but those are not the ports in question!):
interface-range vlan-analytics1-d-eqiad {
    member "ge-2/0/[0-3]";
    member "ge-2/0/[4-6]";
    member ge-2/0/7;
    member-range ge-2/0/0 to ge-2/0/2;
    member-range ge-2/0/4 to ge-2/0/6;
    member-range ge-4/0/1 to ge-4/0/4;
    unit 0 {
        family ethernet-switching {
            vlan {
                members analytics1-d-eqiad;
            }
        }
    }
}
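
The inconsistency between the 'member' and 'member-range' statements can be checked mechanically. The following is a rough Python sketch, not a Junos parser: it handles only the three statement shapes that appear in the excerpt above, and it assumes a member-range stays within one FPC/PIC prefix. The function names are made up for illustration.

```python
import re

# The member statements copied from the interface-range excerpt above.
CONFIG = """\
member "ge-2/0/[0-3]";
member "ge-2/0/[4-6]";
member ge-2/0/7;
member-range ge-2/0/0 to ge-2/0/2;
member-range ge-2/0/4 to ge-2/0/6;
member-range ge-4/0/1 to ge-4/0/4;
"""

PORT = r"(ge-\d+/\d+/)"  # interface prefix, e.g. "ge-2/0/"


def expand(prefix, lo, hi):
    """Return the set of port names prefix+lo .. prefix+hi (inclusive)."""
    return {f"{prefix}{n}" for n in range(int(lo), int(hi) + 1)}


def parse(config):
    """Return (ports named by 'member', ports named by 'member-range')."""
    member, member_range = set(), set()
    for line in config.splitlines():
        line = line.strip().rstrip(";")
        m = re.fullmatch(rf'member "?{PORT}\[(\d+)-(\d+)\]"?', line)
        if m:
            member |= expand(*m.groups())
            continue
        m = re.fullmatch(rf'member "?{PORT}(\d+)"?', line)
        if m:
            member |= expand(m.group(1), m.group(2), m.group(2))
            continue
        m = re.fullmatch(rf"member-range {PORT}(\d+) to {PORT}(\d+)", line)
        # Simplification: only ranges within a single prefix are expanded.
        if m and m.group(1) == m.group(3):
            member_range |= expand(m.group(1), m.group(2), m.group(4))
    return member, member_range


if __name__ == "__main__":
    member, member_range = parse(CONFIG)
    print("only in member:      ", sorted(member - member_range))
    print("only in member-range:", sorted(member_range - member))
```

Both ge-2/0/4 and ge-2/0/5 land in both sets, consistent with the observation that the odd-looking ports are not the ones in question.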
May 20 2015, 11:19 PM · SRE, ops-eqiad
Gage triaged T99833: Puppet function: ipresolve: throw an error if lookup fails, refactor into wmflib as Medium priority.
May 20 2015, 9:13 PM · SRE, Patch-For-Review, Interdatacenter-IPsec, Puppet
Gage created T99833: Puppet function: ipresolve: throw an error if lookup fails, refactor into wmflib.
May 20 2015, 9:08 PM · SRE, Patch-For-Review, Interdatacenter-IPsec, Puppet

May 19 2015

Gage updated subscribers of T98161: Build Kafka 0.8.1.1 package for Jessie and upgrade Brokers to Jessie..
May 19 2015, 3:05 AM · SRE, Analytics-Clusters
Gage added a comment to T98161: Build Kafka 0.8.1.1 package for Jessie and upgrade Brokers to Jessie..

I followed this: https://git.wikimedia.org/blob/operations%2Fpuppet.git/2cdd08f9686b040816bd0dd8e63e712f4b084a7a/modules%2Fpackage_builder%2FREADME.md

May 19 2015, 3:04 AM · SRE, Analytics-Clusters
Gage closed T88536: Implement a big IPsec off switch as Resolved.

Deployed, tested, documented: https://wikitech.wikimedia.org/wiki/IPsec#Emergency_shutdown

May 19 2015, 1:59 AM · SRE, Patch-For-Review, Interdatacenter-IPsec
Gage added a comment to T98620: Degraded RAID-1 arrays on new logstash hosts: [UU__].

Patch: https://gerrit.wikimedia.org/r/#/c/211931/

May 19 2015, 1:55 AM · SRE, Patch-For-Review
Gage lowered the priority of T98620: Degraded RAID-1 arrays on new logstash hosts: [UU__] from Unbreak Now! to Medium.
May 19 2015, 1:49 AM · SRE, Patch-For-Review

May 15 2015

Gage added a member for Traffic: Gage.
May 15 2015, 4:06 PM

May 13 2015

Gage added a comment to T98620: Degraded RAID-1 arrays on new logstash hosts: [UU__].

Eventually I'd like to see the partman recipe fixed and tested by reinstalling one of these hosts, but I've fixed the running config so that the arrays no longer appear as degraded:

gage@logstash1006:~$ cat /proc/mdstat
Personalities : [raid1] [raid0]
md0 : active raid1 sda2[0] sdb2[1]
      249869312 blocks super 1.2 [4/2] [UU__]
      bitmap: 2/2 pages [8KB], 65536KB chunk
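
In /proc/mdstat, a status such as [4/2] [UU__] means the array is configured for four members but only two are active. A check for that condition could look like the Python sketch below. It is an illustration only: degraded_arrays is a made-up helper, the sample text is the excerpt quoted above, and real mdstat output has more line shapes (resync progress, spares) that this does not try to handle.

```python
import re

# Sample input: the /proc/mdstat excerpt quoted above.
MDSTAT = """\
Personalities : [raid1] [raid0]
md0 : active raid1 sda2[0] sdb2[1]
      249869312 blocks super 1.2 [4/2] [UU__]
      bitmap: 2/2 pages [8KB], 65536KB chunk
"""


def degraded_arrays(mdstat_text):
    """Return {array: (expected, active)} for arrays with missing members."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+) :", line)
        if header:
            current = header.group(1)
            continue
        # Matches the "[expected/active] [UU__]" status on the blocks line.
        status = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                result[current] = (expected, active)
    return result


if __name__ == "__main__":
    print(degraded_arrays(MDSTAT))
```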
May 13 2015, 5:02 PM · SRE, Patch-For-Review

May 8 2015

Gage triaged T84907: Kafka logging to Logstash as Low priority.
May 8 2015, 5:59 PM · observability, Analytics, Wikimedia-Logstash
Gage triaged T84908: Zookeeper logging to Logstash as Low priority.
May 8 2015, 5:58 PM · observability, Analytics-Engineering, Wikimedia-Logstash
Gage updated the task description for T98620: Degraded RAID-1 arrays on new logstash hosts: [UU__].
May 8 2015, 5:51 PM · SRE, Patch-For-Review
Gage created T98620: Degraded RAID-1 arrays on new logstash hosts: [UU__].
May 8 2015, 5:21 PM · SRE, Patch-For-Review

May 7 2015

Gage added a comment to T97411: Build a non-trunk 3.19 kernel for jessie.

This kernel is now installed on berkelium & curium.

  • IPsec ESNs work (fixed in 3.19.3)
  • Aesni security patch for CVE-2015-3331 is included (fixed in 3.19.3)
  • Aes256gcm does not work (fixed in 4.0, but we don't care because we plan to use aes128gcm, which works in 3.19).
May 7 2015, 6:03 PM · SRE, Patch-For-Review, Traffic
Gage created T98488: rsyslog: use high precision timestamps or explain why not.
May 7 2015, 3:06 PM · Observability-Logging
Gage added a comment to T98481: check_puppetrun: print "agent disabled" reason.

Relatedly, I have learned that the reason must be quoted, or only the first word is stored:

May 7 2015, 2:46 PM · patch-welcome, SRE, Icinga, observability
Gage created T98481: check_puppetrun: print "agent disabled" reason.
May 7 2015, 2:41 PM · patch-welcome, SRE, Icinga, observability

May 6 2015

Gage added a comment to T92601: Migrate host lists out of cache.pp to reference values in Hiera.

manifests/role/cache.pp has been refactored into modules/role/manifests/cache/* which reference hieradata/common/cache/*, hence the redundant data described in this task is eliminated.

May 6 2015, 6:37 PM · Traffic, SRE
Gage changed the status of T94320: Improve monitoring of https://git.wikimedia.org/ from Open to Stalled.
May 6 2015, 6:00 PM · SRE, Gitblit, observability
Gage changed the status of T87840: Retire Torrus from Open to Stalled.
May 6 2015, 6:00 PM · observability, SRE, Technical-Debt
Gage removed a subtask for T81543: Enable IPSec between datacenters: T85823: IPsec: add firewall rules.
May 6 2015, 5:54 PM · SRE, Traffic, Interdatacenter-IPsec
Gage removed a parent task for T85823: IPsec: add firewall rules: T81543: Enable IPSec between datacenters.
May 6 2015, 5:54 PM · SRE, Interdatacenter-IPsec
Gage closed Restricted Task, a subtask of T81543: Enable IPSec between datacenters, as Declined.
May 6 2015, 5:53 PM · SRE, Traffic, Interdatacenter-IPsec
Gage updated the task description for T82698: shutdown sodium after mailman has migrated to jessie VM.
May 6 2015, 5:49 PM · SRE

May 4 2015

Gage added a comment to T81543: Enable IPSec between datacenters.

Thanks, Brandon. I'll reply in order:

May 4 2015, 5:52 PM · SRE, Traffic, Interdatacenter-IPsec
Gage added a comment to T96111: Strongswan: security association reauthentication failure.

To summarize remaining work:

  • Strongswan 5.3.0 is needed but is currently only in Experimental. It won't be coming to Jessie so it needs to be imported to WMF's apt repo.
  • Determine appropriate values for prod: lifetime, margin, auto.
May 4 2015, 5:52 PM · SRE, Patch-For-Review, Interdatacenter-IPsec
Gage updated the task description for T92604: IPSec: roll-out plan.
May 4 2015, 5:52 PM · SRE, Patch-For-Review, Interdatacenter-IPsec
Gage closed Restricted Task, a subtask of T81543: Enable IPSec between datacenters, as Declined.
May 4 2015, 5:43 PM · SRE, Traffic, Interdatacenter-IPsec

May 3 2015

Gage added a comment to T94417: Fix ipv6 autoconf issues.

The token-based solution (Proposal 1) sounds good to me; it seems like the only barrier to adoption is making a policy decision to go with a proposal which doesn't support Precise, correct?

May 3 2015, 9:23 PM · SRE, Patch-For-Review, Interdatacenter-IPsec

Apr 22 2015

Gage created T96928: Add icinga-wm bot to #wikimedia-analytics.
Apr 22 2015, 10:12 PM · SRE, Patch-For-Review, MediaWiki-extensions-EventLogging

Apr 21 2015

Gage closed T94820: Kernel panics on Jessie (3.16.0-4-amd64) during IPsec load test as Resolved.

This seems to be fixed in linux-image-3.19.0-trunk-amd64 version 3.19.3-1~exp1, currently in Debian/Experimental.

Apr 21 2015, 10:21 PM · SRE, Interdatacenter-IPsec

Apr 17 2015

Gage triaged T95835: Better backup coverage for X1 database cluster as High priority.
Apr 17 2015, 8:43 PM · SRE, Patch-For-Review, Blocked-on-Operations, Scrum-of-Scrums, incident-20150410-flowdataloss, DBA
Gage triaged T96017: Migrate SCA cluster to SCB (Jessie and Node 4.2) as High priority.
Apr 17 2015, 8:36 PM · Services (done), User-mobrovac, SRE
Gage triaged T95896: mw1031 has a bad uplink as Medium priority.
Apr 17 2015, 8:35 PM · SRE, ops-eqiad
Gage closed T96164: Requesting access to hafnium for mforns as Resolved.
Apr 17 2015, 7:59 PM · SRE, SRE-Access-Requests
Gage closed T96163: Requesting access to tin.eqiad.wmnet for mforns as Resolved.

Done!

Apr 17 2015, 7:57 PM · SRE, Patch-For-Review, SRE-Access-Requests
Gage added a comment to T96164: Requesting access to hafnium for mforns.

Hi Marcel,

Apr 17 2015, 7:12 PM · SRE, SRE-Access-Requests
Gage closed T95905: Give joal access to eventlog1001.eqiad.wmnet as Resolved.

Added joal to eventlogging-roots + eventlogging-admins to match nuria, millimetric, mforns. This gives him sudo on eventlog1001 as well as access to hafnium.

Apr 17 2015, 7:09 PM · SRE, SRE-Access-Requests
Gage added a comment to T96146: Update 3.19 kernel to 3.19.3.

Berkelium and Curium are now upgraded to Debian's 3.19.3 kernels containing the IPsec patch. Next, I will test the aes256gcm and ESN behavior.

Apr 17 2015, 5:37 PM · SRE, Interdatacenter-IPsec

Apr 16 2015

Gage added a comment to T96111: Strongswan: security association reauthentication failure.

Strongswan 5.3.0 has been uploaded to Debian/experimental, and is now running on Berkelium & Curium. So far the problem has not recurred.

Apr 16 2015, 10:12 PM · SRE, Patch-For-Review, Interdatacenter-IPsec

Apr 15 2015

Gage added a comment to T96111: Strongswan: security association reauthentication failure.

Ok, good news. Further discussion with ecdsa has revealed that this problem is fixed in 5.3.0, which is released but not yet packaged for Debian.

Apr 15 2015, 2:29 PM · SRE, Patch-For-Review, Interdatacenter-IPsec
Gage added a comment to T96111: Strongswan: security association reauthentication failure.

I reduced some timeouts in order to recreate the problem; config changes suggested by ecdsa have not yet been made:

conn %default
        ikelifetime=6m
        lifetime=6m
        rekeyfuzz=0%
        margintime=1m
        keyingtries=%forever

Which quickly triggered the behavior. This time it's the IPv6 connection which is not established:

gage@curium:~$ sudo ipsec statusall
Status of IKE charon daemon (strongSwan 5.2.1, Linux 3.19.0-trunk-amd64, x86_64):
  uptime: 22 minutes, since Apr 15 13:13:53 2015
  malloc: sbrk 2580480, mmap 0, used 381120, free 2199360
  worker threads: 11 of 16 idle, 5/0/0/0 working, job queue: 0/0/0/0, scheduled: 2
  loaded plugins: charon aes rc2 sha1 sha2 md5 random nonce x509 revocation constraints pubkey pkcs1 pkcs7 pkcs8 pkcs12 pgp dnskey sshkey pem openssl fips-prf gmp agent xcbc hmac gcm attr kernel-netlink resolve socket-default stroke updown
Listening IP addresses:
  10.64.0.170
  2620:0:861:101:862b:2bff:fefd:be6d
  2620:0:861:101:10:64:0:170
Connections:
curium.eqiad.wmnet-berkelium.eqiad.wmnet_by_ipv4:  10.64.0.170...10.64.0.169  IKEv1/2
curium.eqiad.wmnet-berkelium.eqiad.wmnet_by_ipv4:   local:  [CN=curium.eqiad.wmnet] uses public key authentication
curium.eqiad.wmnet-berkelium.eqiad.wmnet_by_ipv4:    cert:  "CN=curium.eqiad.wmnet"
curium.eqiad.wmnet-berkelium.eqiad.wmnet_by_ipv4:   remote: [CN=berkelium.eqiad.wmnet] uses public key authentication
curium.eqiad.wmnet-berkelium.eqiad.wmnet_by_ipv4:   child:  dynamic === dynamic TRANSPORT
curium.eqiad.wmnet-berkelium.eqiad.wmnet_by_ipv6:  2620::861:101:10:64:0:170...2620::861:101:10:64:0:169  IKEv1/2
curium.eqiad.wmnet-berkelium.eqiad.wmnet_by_ipv6:   local:  [CN=curium.eqiad.wmnet] uses public key authentication
curium.eqiad.wmnet-berkelium.eqiad.wmnet_by_ipv6:    cert:  "CN=curium.eqiad.wmnet"
curium.eqiad.wmnet-berkelium.eqiad.wmnet_by_ipv6:   remote: [CN=berkelium.eqiad.wmnet] uses public key authentication
curium.eqiad.wmnet-berkelium.eqiad.wmnet_by_ipv6:   child:  dynamic === dynamic TRANSPORT
Security Associations (1 up, 0 connecting):
curium.eqiad.wmnet-berkelium.eqiad.wmnet_by_ipv4[9]: ESTABLISHED 2 minutes ago, 10.64.0.170[CN=curium.eqiad.wmnet]...10.64.0.169[CN=berkelium.eqiad.wmnet]
curium.eqiad.wmnet-berkelium.eqiad.wmnet_by_ipv4[9]: IKEv2 SPIs: 46a8b8b8776b91a6_i 18deeb67a2053fe6_r*, public key reauthentication in 2 minutes
curium.eqiad.wmnet-berkelium.eqiad.wmnet_by_ipv4[9]: IKE proposal: AES_GCM_16_128/PRF_HMAC_SHA2_384/ECP_384_BP
curium.eqiad.wmnet-berkelium.eqiad.wmnet_by_ipv4{9}:  INSTALLED, TRANSPORT, ESP SPIs: c24cb949_i c533bc7c_o
curium.eqiad.wmnet-berkelium.eqiad.wmnet_by_ipv4{9}:  AES_GCM_16_128, 0 bytes_i, 0 bytes_o, rekeying in 2 minutes
curium.eqiad.wmnet-berkelium.eqiad.wmnet_by_ipv4{9}:   10.64.0.170/32 === 10.64.0.169/32
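
For reference, the ipsec.conf documentation gives the rekey time as lifetime - (margintime + random(0, margintime * rekeyfuzz)). The Python sketch below works through that arithmetic for the conn %default values shown earlier in this comment. It only illustrates the formula; it is not strongSwan code, and the function name and the second set of values are assumptions chosen for the example.

```python
import random


def rekey_time(lifetime_s, margintime_s, rekeyfuzz_pct, rng=random):
    """Seconds after establishment at which rekeying is attempted.

    Follows the formula from the ipsec.conf documentation:
    lifetime - (margintime + random(0, margintime * rekeyfuzz)).
    """
    fuzz_window = margintime_s * rekeyfuzz_pct / 100.0
    return lifetime_s - (margintime_s + rng.uniform(0, fuzz_window))


if __name__ == "__main__":
    # lifetime=6m, margintime=1m, rekeyfuzz=0% as in the test config above.
    print(rekey_time(6 * 60, 60, 0))
    # Illustrative values only: 1h lifetime, 9m margin, 100% fuzz.
    print(rekey_time(3600, 9 * 60, 100))
```

With rekeyfuzz=0% the random term vanishes, so rekeying is attempted a fixed one minute before the six-minute lifetime expires, which makes the failure quick to reproduce.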
Apr 15 2015, 1:56 PM · SRE, Patch-For-Review, Interdatacenter-IPsec
Gage added a comment to T96111: Strongswan: security association reauthentication failure.

I spoke with Tobias from Strongswan on IRC about this:

Apr 15 2015, 1:25 PM · SRE, Patch-For-Review, Interdatacenter-IPsec
Gage updated the task description for T96111: Strongswan: security association reauthentication failure.
Apr 15 2015, 2:54 AM · SRE, Patch-For-Review, Interdatacenter-IPsec
Gage created T96111: Strongswan: security association reauthentication failure.
Apr 15 2015, 2:32 AM · SRE, Patch-For-Review, Interdatacenter-IPsec

Apr 14 2015

Gage closed T96053: Access for new Analytics Dev: Madhu Viswanathan as Resolved.

Patches merged. User account is created and access is granted.

Apr 14 2015, 10:33 PM · SRE, Patch-For-Review, SRE-Access-Requests

Apr 13 2015

Gage closed T95928: Analytics1017 (Hadoop datanode) crashed as Resolved.
Apr 13 2015, 6:52 PM · SRE, Analytics-Clusters
Gage renamed T95928: Analytics1017 (Hadoop datanode) crashed from Fwd: Analytics1017 (Hadoop datanode) crashed to Analytics1017 (Hadoop datanode) crashed.
Apr 13 2015, 6:46 PM · SRE, Analytics-Clusters
Gage created T95928: Analytics1017 (Hadoop datanode) crashed.
Apr 13 2015, 6:43 PM · SRE, Analytics-Clusters

Apr 9 2015

Gage added a comment to T95596: Urgent: Statsite changes semantics of timer rate metrics, need metric rename.

Discussed in IRC; ran this at approximately UTC 22:32:

sudo stop carbon/relay
sudo stop carbon/cache NAME=a
sudo stop carbon/cache NAME=b
sudo stop carbon/cache NAME=c
sudo stop carbon/cache NAME=d
sudo stop carbon/cache NAME=e
sudo stop carbon/cache NAME=f
sudo stop carbon/cache NAME=g
sudo stop carbon/cache NAME=h
find /var/lib/carbon/whisper/{cassandra,restbase} -type f -name 'sample_rate.wsp' -printf "%h\n" | while read i; do sudo mv "$i/rate.wsp" "$i"/sample_rate.wsp ; done
sudo start carbon/cache NAME=a
sudo start carbon/cache NAME=b
sudo start carbon/cache NAME=c
sudo start carbon/cache NAME=d
sudo start carbon/cache NAME=e
sudo start carbon/cache NAME=f
sudo start carbon/cache NAME=g
sudo start carbon/cache NAME=h
sudo start carbon/relay
Apr 9 2015, 10:36 PM · SRE, Grafana

Apr 8 2015

Gage added a comment to T92603: Monitor IPsec status.

Patch is submitted: https://gerrit.wikimedia.org/r/#/c/199787/

Apr 8 2015, 6:15 PM · SRE, Patch-For-Review, Interdatacenter-IPsec, observability
Gage added a project to T92603: Monitor IPsec status: Patch-For-Review.
Apr 8 2015, 6:13 PM · SRE, Patch-For-Review, Interdatacenter-IPsec, observability

Apr 6 2015

Gage added a comment to T94820: Kernel panics on Jessie (3.16.0-4-amd64) during IPsec load test.

Thanks for the feedback. Steps to reproduce are in the task description; I used IPv4:

while true ; do wget -nv -O /dev/null http://10.64.0.170/index.nginx-debian.html ; sleep 1 ; done

With the following ciphers in /etc/ipsec.conf:

ike=aes128gcm16-null-prfsha384-ecp384bp!
esp=aes128gcm16-null-ecp384bp-esn!
Apr 6 2015, 8:33 PM · SRE, Interdatacenter-IPsec

Apr 3 2015

Gage added a comment to T94820: Kernel panics on Jessie (3.16.0-4-amd64) during IPsec load test.

Ok, this is reproducible and seems to be the primary problem I was having yesterday: enabling Extended Sequence Numbers (ESN, http://kernelnewbies.org/Linux_2_6_39#head-87ffd4407af29460251c521e0228fe0ac9219d4b) causes a crash within 5 wgets.

Apr 3 2015, 2:58 AM · SRE, Interdatacenter-IPsec
Gage added a comment to T94820: Kernel panics on Jessie (3.16.0-4-amd64) during IPsec load test.

Got another one: I changed ciphers on berkelium and restarted the daemon there; before I had a chance to restart the daemon on curium for the corresponding change, curium panicked. Admittedly, this is not a circumstance we'll often encounter in prod:

Apr 3 2015, 2:52 AM · SRE, Interdatacenter-IPsec

Apr 2 2015

Gage created T94820: Kernel panics on Jessie (3.16.0-4-amd64) during IPsec load test.
Apr 2 2015, 10:18 AM · SRE, Interdatacenter-IPsec

Mar 31 2015

Gage added a comment to T93730: asw-c4-eqiad hardware fault?.

I will switch the hadoop active namenode from analytics1001 to analytics1002 (which is in another rack) before this replacement action is taken.

Mar 31 2015, 6:29 PM · SRE, Incident-20141130-eqiad-C4, ops-eqiad

Mar 30 2015

Gage claimed T92603: Monitor IPsec status.
Mar 30 2015, 4:41 PM · SRE, Patch-For-Review, Interdatacenter-IPsec, observability
Gage claimed T92604: IPSec: roll-out plan.
Mar 30 2015, 4:41 PM · SRE, Patch-For-Review, Interdatacenter-IPsec
Gage triaged T92602: Secure inter-datacenter web request log (Kafka) traffic as Medium priority.
Mar 30 2015, 4:05 PM · Patch-For-Review, SRE, Traffic, Analytics-Clusters, Interdatacenter-IPsec

Mar 28 2015

Gage renamed T94320: Improve monitoring of https://git.wikimedia.org/ from Monitor https://git.wikimedia.org/ to Improve monitoring of https://git.wikimedia.org/.
Mar 28 2015, 11:57 PM · SRE, Gitblit, observability
Gage created T94320: Improve monitoring of https://git.wikimedia.org/.
Mar 28 2015, 11:54 PM · SRE, Gitblit, observability

Mar 27 2015

Gage created T94215: decommission cp3001 & cp3002.
Mar 27 2015, 6:58 PM · DC-Ops, SRE, Patch-For-Review, ops-esams

Mar 23 2015

Gage updated subscribers of T29392: Please create a mailing list for Persian wikinews.

I have:

  1. Discussed this with @JohnLewis
  2. Re-added Mjbmri@gmail.com as a list admin
  3. Changed the list admin password and emailed it to Mjbmr and Atgigabyte
Mar 23 2015, 7:08 PM · SRE, Wikimedia-Mailing-lists
Gage added a comment to T87840: Retire Torrus.

Not sure, as @faidon declined my ticket to monitor the Netapp stats with LibreNMS. So those stats are still only being collected by Torrus.

Mar 23 2015, 4:55 AM · observability, SRE, Technical-Debt

Mar 21 2015

Gage added a comment to T92476: cp4009 hardware fault.

This order arrived at 200 Paul today.

Mar 21 2015, 12:05 AM · SRE, ops-ulsfo

Mar 13 2015

Gage added a comment to T85823: IPsec: add firewall rules.

If implemented, this task will be completed after T92604. Once we are confident in traffic flows over IPsec we may wish to use firewall rules to explicitly disallow unencrypted traffic.

Mar 13 2015, 5:37 AM · SRE, Interdatacenter-IPsec
Gage created T92604: IPSec: roll-out plan.
Mar 13 2015, 5:36 AM · SRE, Patch-For-Review, Interdatacenter-IPsec
Gage created T92603: Monitor IPsec status.
Mar 13 2015, 5:36 AM · SRE, Patch-For-Review, Interdatacenter-IPsec, observability