Jun 11 2018
Sep 23 2015
The script source doesn't say so, but I've noticed that it's written by ripienaar. Latest upstream implements this feature:
Aug 3 2015
Jul 22 2015
Yeah I haven't seen recurrence of this so I'm closing the ticket. Thanks.
Jul 21 2015
Jul 16 2015
TLDR: the rough estimate is about 32Mbit/sec from jobrunners to elasticsearch nodes. Traffic is bursty so I advise planning for a 50-60Mbit ceiling.
Jul 13 2015
Jul 10 2015
Regarding running puppetmaster on !Precise: when I tried with Trusty I got this:
Error: Could not retrieve catalog from remote server: Error 400 on SERVER: stack level too deep
It seems to be caused by a bad interaction between Rails and ActiveRecord (deb: ruby-activerecord-3.2); this ticket proposes some workarounds:
https://projects.puppetlabs.com/issues/9290
Jul 9 2015
logstash1001-1003: These hosts are older than 1004-1006, and run Precise instead of Jessie. Gmond wouldn't stop or start.
gage@logstash1002:~$ sudo /usr/sbin/gmond -f
[apache_status] Received the following parameters
{'url': 'http://127.0.0.1:80/server-status', 'collect_ssl': 'False', 'metric_group': 'apache'}
Fatal Python error: PyThreadState_Get: no current thread
Jul 9 00:06:37 logstash1002 kernel: [13984018.998725] init: ganglia-monitor main process (14196) killed by ABRT signal
Jul 9 00:06:37 logstash1002 kernel: [13984018.998751] init: ganglia-monitor main process ended, respawning
Jul 9 00:06:37 logstash1002 kernel: [13984019.086671] init: ganglia-monitor main process (14210) killed by ABRT signal
Jul 9 00:06:37 logstash1002 kernel: [13984019.086698] init: ganglia-monitor main process ended, respawning
The above error messages were unhelpful, so I used strace -f /usr/sbin/gmond -f, saw that gmond parses /etc/ganglia/conf.d/* before aborting, and then did a binary search to remove files in conf.d/ until the problem disappeared.
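The bisection over conf.d/ can be sketched as below. This is a minimal illustration, not the procedure actually run on the host: the file names and the `starts_ok` predicate are hypothetical stand-ins (in practice the predicate would enable only the given files and check whether gmond stays up), and it assumes exactly one file is at fault.

```python
from typing import Callable, Optional, Sequence


def find_bad_file(files: Sequence[str],
                  starts_ok: Callable[[Sequence[str]], bool]) -> Optional[str]:
    """Bisect `files` to find the single entry that makes startup fail.

    `starts_ok(subset)` must return True when the daemon starts cleanly
    with only `subset` enabled. Assumes at most one offending file.
    """
    if starts_ok(files):
        return None  # nothing is broken, nothing to find
    candidates = list(files)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # If the first half alone still fails, the culprit is in it;
        # otherwise it must be in the remaining half.
        if not starts_ok(half):
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0]


# Hypothetical file names and culprit, for illustration only.
conf_files = ["mod_a.conf", "mod_b.conf", "mod_c.conf", "mod_d.conf"]
culprit = "mod_c.conf"
result = find_bad_file(conf_files, lambda enabled: culprit not in enabled)
print(result)
```

With N files this needs roughly log2(N) restarts instead of N, which matters when every attempt means restarting the daemon by hand.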
Jul 1 2015
Jun 29 2015
Because new images were not built, I tried to work around this myself like so:
sudo mv /etc/apt/apt.conf.d/apt.conf.d/* /etc/apt/apt.conf.d/
sudo mv /etc/apt/preferences.d/preferences.d/* /etc/apt/preferences.d/
sudo mv /etc/apt/sources.list.d/sources.list.d/* /etc/apt/sources.list.d/
sudo rmdir /etc/apt/apt.conf.d/apt.conf.d/
sudo rmdir /etc/apt/preferences.d/preferences.d/
sudo rmdir /etc/apt/sources.list.d/sources.list.d/
However that results in this error:
N: Ignoring file 'puppet_base_2.7' in directory '/etc/apt/preferences.d/' as it has an invalid filename extension
So I gave it the proper extension:
sudo mv /etc/apt/preferences.d/puppet_base_2.7{,.pref}
However that installs puppet 2.7.11-1ubuntu2 instead of 3.4.3-1~ubuntu12.04.1. Puppet 2.7's version of the 'exec' type doesn't support the 'umask' attribute, resulting in this cryptic error when I apply role::puppet::self:
err: Failed to apply catalog: Invalid parameter umask at /etc/puppet/modules/git/manifests/clone.pp:147
Solution:
sudo rm /etc/apt/preferences.d/puppet_base_2.7.pref
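The "invalid filename extension" notice earlier in this comment can be explained with a small sketch. It assumes the naming rule as I read it in apt_preferences(5): files in preferences.d must have either no extension or a `.pref` extension. Under that rule the `.7` in `puppet_base_2.7` is treated as an extension. The function name is hypothetical and this only approximates apt's real logic.

```python
import re

# Assumption from apt_preferences(5): no extension or "pref" is accepted,
# and names may only contain alphanumerics, hyphen, underscore and period.
ALLOWED_EXTENSIONS = {"", "pref"}
NAME_RE = re.compile(r"^[A-Za-z0-9_.-]+$")


def apt_would_read(filename: str) -> bool:
    """Approximate whether apt reads `filename` from preferences.d."""
    if not NAME_RE.match(filename):
        return False
    _, dot, ext = filename.rpartition(".")
    if not dot:
        ext = ""  # no period at all means no extension
    return ext in ALLOWED_EXTENSIONS


for name in ("puppet_base_2.7", "puppet_base_2.7.pref", "puppet_base"):
    print(name, apt_would_read(name))
```

This would also explain why the pin only took effect, and pulled in puppet 2.7, once the file was renamed with `.pref`.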
Jun 26 2015
I'm surprised to report that this is still happening on instances created after the patch was merged. I tried twice.
Jun 18 2015
Jun 14 2015
This is broken again since June 8:
Jun 8 2015
Strongswan 5.3.0-1+wmf2 is currently in our apt repo. I'll make a separate task for config values.
Jun 5 2015
Jun 4 2015
This machine crashed again. All the errors are on socket 0, so we should probably replace that DIMM.
Jun 3 2015
Jun 2 2015
a) no grub prompt
b) yes, I see kernel output
c) yes, I see getty on the serial port.
There's also a Debian bug report discussing this: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=685504
Jun 1 2015
May 31 2015
May 27 2015
I discussed this problem with a friend in neteng at Twitter, who says he has seen similar behavior in Juniper switches before. He recommends, and I agree: let's reboot the switch (asw-d2-eqiad).
May 26 2015
May 22 2015
Also let's try to ensure the NIC gets rebooted. My expectation is that racadm serveraction powercycle will do it. I've seen NICs get into weird states that persisted across warm reboots before.
I recommend checking all bios settings against 1035 for accidental changes (I can't imagine what setting would cause this, but we might as well rule it out), which will also result in rebooting the box to confirm that this behavior is reproducible.
May 20 2015
This is mysterious. I've compared the problem host, analytics1036 (ge-2/0/5), with healthy analytics1035 (ge-2/0/4), which sits right next to it in rack D-2.
- Confirmed the problem:
- Neon (row A), iron (row B), and analytics1029 (row C) can ping analytics1035 but not analytics1036.
- Other analytics hosts in row D (analytics1035 in D-2, analytics1041 in D-4) can ping analytics1036.
- None of the 12 other analytics hosts rebooted today have this problem
- It's possible to reach the affected host from another host in the same rack. ssh -A through iron -> analytics1035 -> analytics1036 works.
- Same running kernel
- Both hosts rebooted today
- ip addr show output looks the same: same netmasks etc.
- netstat -rn looks the same
- /etc/network/interfaces looks the same
- lldpctl output looks the same
- On asw-d-eqiad, show interfaces output for ge-2/0/4 and ge-2/0/5 looks the same, and show vlans analytics1-d-eqiad lists both ports as members
- The only thing that looks odd to me is this part of the running config which seems to have some redundancy and doesn't treat all ports consistently (2/0/3 and 2/0/7 not in 'member-range', 4/0/* not in 'member', but those are not the ports in question!):
interface-range vlan-analytics1-d-eqiad {
    member "ge-2/0/[0-3]";
    member "ge-2/0/[4-6]";
    member ge-2/0/7;
    member-range ge-2/0/0 to ge-2/0/2;
    member-range ge-2/0/4 to ge-2/0/6;
    member-range ge-4/0/1 to ge-4/0/4;
    unit 0 {
        family ethernet-switching {
            vlan {
                members analytics1-d-eqiad;
            }
        }
    }
}
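The inconsistency between the 'member' and 'member-range' statements can be checked mechanically. The sketch below hardcodes the values from the config excerpt above and expands both forms into port sets. It handles only the simple patterns seen here (a single bracketed range in the last position, and ranges within one FPC/PIC), so it is an illustration and not a general Junos parser. The helper names are hypothetical.

```python
import re

# Values copied from the interface-range stanza above.
members = ["ge-2/0/[0-3]", "ge-2/0/[4-6]", "ge-2/0/7"]
member_ranges = [("ge-2/0/0", "ge-2/0/2"),
                 ("ge-2/0/4", "ge-2/0/6"),
                 ("ge-4/0/1", "ge-4/0/4")]


def expand_member(pattern: str) -> set:
    """Expand 'ge-2/0/[0-3]' style patterns; plain names pass through."""
    m = re.fullmatch(r"(ge-\d+/\d+/)\[(\d+)-(\d+)\]", pattern)
    if m is None:
        return {pattern}
    prefix, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
    return {f"{prefix}{n}" for n in range(lo, hi + 1)}


def expand_range(start: str, end: str) -> set:
    """Expand 'start to end' where only the port number differs."""
    prefix, lo = start.rsplit("/", 1)
    end_prefix, hi = end.rsplit("/", 1)
    if prefix != end_prefix:
        raise ValueError("sketch only supports ranges within one FPC/PIC")
    return {f"{prefix}/{n}" for n in range(int(lo), int(hi) + 1)}


by_member = set().union(*(expand_member(p) for p in members))
by_range = set().union(*(expand_range(a, b) for a, b in member_ranges))

only_member = sorted(by_member - by_range)
only_range = sorted(by_range - by_member)
print("only in 'member':", only_member)
print("only in 'member-range':", only_range)
```

Both ge-2/0/4 and ge-2/0/5 fall in the intersection of the two sets, which is consistent with the observation that the odd-looking ports are not the ones in question.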
May 19 2015
Deployed, tested, documented: https://wikitech.wikimedia.org/wiki/IPsec#Emergency_shutdown
May 15 2015
May 14 2015
May 13 2015
Eventually I'd like to see the partman recipe fixed and tested by reinstalling one of these hosts, but I've fixed the running config so that the arrays no longer appear as degraded:
gage@logstash1006:~$ cat /proc/mdstat
Personalities : [raid1] [raid0]
md0 : active raid1 sda2[0] sdb2[1]
      249869312 blocks super 1.2 [4/2] [UU__]
      bitmap: 2/2 pages [8KB], 65536KB chunk
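For reading the status field in /proc/mdstat, here is a minimal sketch. It relies on the usual convention that `[n/m]` means n devices configured and m active, with one `U` per working member and one `_` per missing member. The function name is hypothetical, and the sample line is the md0 line from the excerpt above. Only that one excerpt appears in this feed, so the sketch simply reports how the line reads under that convention.

```python
import re
from typing import Optional


def md_status(line: str) -> Optional[dict]:
    """Parse the '[n/m] [UU__]' status field from an mdstat blocks line."""
    m = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
    if m is None:
        return None
    expected, active, flags = int(m.group(1)), int(m.group(2)), m.group(3)
    return {
        "expected": expected,          # devices the array is configured for
        "active": active,              # devices currently in use
        "missing": flags.count("_"),   # slots with no working member
        "degraded": active < expected,
    }


sample = "249869312 blocks super 1.2 [4/2] [UU__]"
status = md_status(sample)
print(status)
```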
May 11 2015
May 8 2015
May 7 2015
This kernel is now installed on berkelium & curium.
- IPsec ESNs work (fixed in 3.19.3)
- AES-NI security patch for CVE-2015-3331 is included (fixed in 3.19.3)
- aes256gcm does not work (fixed in 4.0, but we don't care because we plan to use aes128gcm, which works in 3.19).
Relatedly, I have learned that the reason must be quoted, or only the first word is stored:
May 6 2015
manifests/role/cache.pp has been refactored into modules/role/manifests/cache/* which reference hieradata/common/cache/*, hence the redundant data described in this task is eliminated.
May 5 2015
May 4 2015
Thanks, Brandon. I'll reply in order:
To summarize remaining work:
- Strongswan 5.3.0 is needed but is currently only in Experimental. It won't be coming to Jessie so it needs to be imported to WMF's apt repo.
- Determine appropriate values for prod: lifetime, margin, auto.
May 3 2015
The token-based solution (Proposal 1) sounds good to me; it seems like the only barrier to adoption is making a policy decision to go with a proposal which doesn't support Precise, correct?
Apr 23 2015
Apr 22 2015
Apr 21 2015
This seems to be fixed in linux-image-3.19.0-trunk-amd64 version 3.19.3-1~exp1, currently in Debian/Experimental.