I just edited the 56.15.185.in-addr.arpa object in the RIPE database to point the nameservers directly to labs-ns0/1. This should work now, no need for classless (RFC 2317-style) delegation :)
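Once caches expire, the delegation can be double-checked with e.g.:
$ dig +trace NS 56.15.185.in-addr.arpa
which should end up listing labs-ns0/1.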
Thu, Sep 13
As I mentioned in my second-to-last update above, they are blacklisted for queued TRIM, which is suboptimal of course. However, the data corruption issues with synchronous TRIM have long been resolved -- they already were back in 2016, and they certainly seem to be in the kernels we're running now.
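For reference, the blacklist in question is the set of ATA_HORKAGE_NO_NCQ_TRIM entries in drivers/ata/libata-core.c, and basic discard support can be sanity-checked with e.g. (device name just an example):
$ grep -B1 NO_NCQ_TRIM drivers/ata/libata-core.c   # in a kernel source tree
$ lsblk --discard /dev/sda                         # non-zero DISC-GRAN/DISC-MAX => discard supported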
Wed, Sep 12
Mon, Sep 10
We're still getting RAID alerts about this host.
Sat, Sep 8
Fri, Sep 7
Has anything happened on this? IIRC, at our meetings we talked about investigating this further, e.g. with the help of JTAC, and exploring whether we should disable JunOS' DDoS protection.
Thu, Sep 6
Wed, Sep 5
OK, I created a Gerrit repo under operations/software/keyholder and imported the existing history with:
git clone ~/wikimedia/puppet/ keyholder
git remote rm origin
git branch -m master
Thanks for tracking that down @Dzahn!
Mon, Sep 3
Fri, Aug 31
Thu, Aug 30
Thanks Chase, but I'm afraid we don't have anything to do with wikimediafoundation.org's operations -- it's completely out of our control. The other tag is correct though and it may get the attention of the website's operators.
Oh also, would it be possible to keep the (operations/puppet) history such as commit messages etc.? git filter-branch etc. should make this possible right?
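Something along these lines ought to do it, assuming the keyholder code lives under modules/keyholder (the path is a guess on my part):
$ git clone ~/wikimedia/puppet/ keyholder
$ cd keyholder
$ git filter-branch --prune-empty --subdirectory-filter modules/keyholder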
My vote is under operations/software, if not under some non-operations hierarchy.
Tue, Aug 28
Thanks for filing this! I lost about an hour debugging and (re-)fixing the above issue today, so +1 to everything you said :)
This was logged in netmon1002's /var/log/auth.log every time a login was attempted:
Aug 28 00:08:07 netmon1002 /ssh-agent-proxy: [<class '__main__.SshAgentProtocolError'>] SSH2_AGENTC_SIGN_REQUEST: Bad flags 0x4
Aug 24 2018
Aug 23 2018
Aug 21 2018
Aug 16 2018
Aug 3 2018
@Joe gave more timestamps from etcd logs on IRC:
- Aug 2 13:59:52
- Aug 3 01:19-01:20
- Aug 3 01:28-01:29
- Aug 3 01:50-01:51
- Aug 3 02:06-02:07
These are potentially network partition hiccups, and they seem to correlate well with the other events (dbproxy/db etc.) listed here.
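For anyone cross-checking other hosts' logs against these windows, something like this should work (assuming etcd runs as a systemd unit of that name), e.g. for the third window:
$ journalctl -u etcd --since '2018-08-03 01:28' --until '2018-08-03 01:30'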
So... what's the status of this? What else has been observed, what has been done to troubleshoot, and what's the latest from Juniper? I tried to access the Juniper case for more insight, but unfortunately I don't seem to have the right permissions (unrelated to this task and low-prio, but perhaps @ayounsi or @RobH can work with Juniper to figure out why?)
I'm investigating unrelated issues in asw2-b-eqiad and this port is flapping (probably boot-looping into PXE), so I disabled it. @RobH, feel free to un-disable when you're about to install.
I'm investigating unrelated issues in asw2-b-eqiad and these ports are flapping (probably boot-looping into PXE), so I disabled them. @RobH, feel free to un-disable when you're about to install them.
Jul 25 2018
Jul 20 2018
Is there any progress and/or timeline for this? Thanks!
Jul 19 2018
Jul 18 2018
What's going on with this?
It's been a couple of years since I filed this and I don't remember much since, so unfortunately I don't have any more insight at this point. These kinds of widespread network events are very rare and there have been no such outages recently, I'm afraid. We could figure out ways to simulate them from e.g. mwdebug, but I doubt that anyone has the time to investigate this in such depth, so I don't particularly disagree with resolving this task instead.
Jul 17 2018
I filed T199816 for removing that page; we can follow up there and, if implemented, resolve this task and its parent.
Jul 11 2018
Jul 10 2018
I'm using servermon for fact queries regularly, but I think I'm one of the very few :) I admit I haven't played around much with puppetboard to see whether it covers my use cases, so it may be something that could potentially work (with the caveats that Riccardo mentioned above, however).
Jul 4 2018
There seems to be another step missing: Racktables is inconsistent. The new unit is listed as "new-mr1-eqiad", while the old one is still "mr1-eqiad". Can someone fix that?
Jun 27 2018
Sure, that's fine :)
Jun 25 2018
Yes, let's not block this for yet another week! Consider this approved, please go ahead.
Jun 14 2018
Our email logs can be pretty sensitive, especially since they include our corporate emails passing through (senders, recipients, timestamps etc.).
So for at least labvirt1019 it was indeed about PXE not working (the card worked under Linux) and that was due to a BIOS misconfiguration (the "network boot" option for the card set to disabled). T194964#4283034 has more details and troubleshooting steps.
OK, I managed to get this server to boot from its 10G interfaces. The issue was fairly straightforward to resolve ("network boot" was set to "disabled" for the 10G ports and only set to "network boot" for the first 1G port), but here are the steps I took to troubleshoot for future reference:
- Live-hacked install1002 to update the DHCP config with the 10G port's MAC address, as this was still pointing to the 1G interface.
- Attempted to boot with "network boot" (ESC-@ I think) and verified that I couldn't, as I was getting "media check failed" from the Broadcom PXE menu. I was running tcpdump -i any port 67 or port 68 on install1002 simultaneously to grab DHCP requests, but we didn't get that far, as the PXE option ROM wasn't even attempting DHCP. This pointed to either the card or the cable not working, or, more likely, to this being the option ROM for a different interface, e.g. one of the 1G ones.
- Booted into the previously installed system (running Debian) from the console and verified that the port works in Linux. I did that by setting the interface (eno49) as up, then checking the switch on the other end (asw2-b-eqiad:xe-4/0/16) with show interfaces description and show configuration interfaces xe-4/0/16 | display inheritance, and verifying that it sees the link as "up up" and that the config is correct. Then I ran ethtool on the system itself and verified that it sees the link as negotiated/up and with the right speed. Finally, I ran dhclient eno49 there, and it worked and got an IP assigned. With all that, I verified that both the card and the cable actually work and that the network configuration is correct, so the issues were confined to PXE (the full command sequence is summarized after this list).
- Rebooted and then entered the system config. In the BIOS/Platform config (RBSU) and the PCI interface, I disabled the 4x1G card (Embedded LOM). This is not actually required, but it made things a bit easier to debug as I could figure out e.g. whether the PXE prompt you get is from the 1G card or the 10G card.
- In the 10G card's configuration, I disabled "HP Shared Memory", per T167299, although I'm not sure if this is actually required anymore. From that task, it sounds like it would affect the network past the PXE stage and in the installer, but I had verified that it works in Linux, so that was probably not needed (but we also don't use these features as far as I know). I also disabled SR-IOV for good measure since we don't use it, although I doubt it would affect this.
- In the BIOS/Platform config (RBSU), under Network Options > Network Boot Options, the option "Embedded FlexibleLOM 1 Port 1" was set to "Disabled". I set that to "Network boot". This is certainly related and was likely the entire cause of these issues.
- After enabling, you immediately get a warning that says "Important: When enabling network boot support for an Embedded FlexibleLOM embedded NIC, the NIC boot option does not appear in the UEFI Boot Order or Legacy IPL lists until the next system reboot.". So I just did a server reboot after that (easy).
- After that, I booted normally, hit ESC-@ for network boot and was presented with a PXE prompt; from there on, network boot worked, d-i started loading and also acquired an IP and the preseed configuration. It stopped with an error at a partman prompt (likely because of a misconfigured partman profile, unrelated to all this).
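For quick reference, the command sequence from the steps above, consolidated (interface and switch port names are this particular host's):
# on install1002
$ tcpdump -i any port 67 or port 68
# on the host itself, booted into the installed Debian
$ ip link set eno49 up
$ ethtool eno49       # expect "Link detected: yes" and the right speed
$ dhclient eno49      # should acquire a lease
# on asw2-b-eqiad
> show interfaces description
> show configuration interfaces xe-4/0/16 | display inheritance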
Jun 13 2018
What are the symptoms?
@Cmjohnson I'm afraid I don't understand fully what steps you've taken on which server, port or switch. So perhaps let's look at the current status: could you describe where each of labvirt1019's and labvirt1020's ports are connected to, and specifically to which ports on the switch and with what kind of cable? Thanks!
Jun 10 2018
Jun 8 2018
@Cmjohnson regarding flerovium, sure, no problem, go ahead. (The others would need coordination with their respective service owners)
Jun 7 2018
So this backfired, but thankfully the fix was as simple as starting exim :) Good thinking @herron!
The cause was the prep for T175361, in combination with a couple of unexpected misconfigurations/SPOFs, given it had been years since the mx1001->mx2001 switchover was last tested.
Jun 6 2018
It's been a few months now, what's the status of this?
Jun 5 2018
So we need to do something in a very short amount of time (~two months) -- does anyone have a game plan? @Jrbranaa what's the latest?
Jun 4 2018
We have a number of spreadsheets tracking inventory, refreshes, CapEx budgets etc. Which one are you referring to specifically (doc & sheet)?
May 25 2018
May 23 2018
OK, so it looks like 126.96.36.199/24 is proposed to be used immediately in eqiad, to replace 188.8.131.52/25 in the next ~6 months. Additionally, 184.108.40.206/24 is proposed to be reserved (but not assigned), to be used tentatively in Q3 FY18-19 in codfw for a region 2 deployment. Both of these sound good to me and you can proceed :)
May 21 2018
The RAID still shows as degraded -- @RobH (or someone else), could you have a look? Thanks!
May 18 2018
May 17 2018
Confirmed, thanks @Papaul!
May 16 2018
The /25 -> /24 renumbering seems fairly straightforward, but given a) IPv4's depletion (we effectively cannot get more IPv4 space from any of the RIRs), b) the Neutron redesign and c) Cloud Services' growth and needs like T122406's, I think it's worthwhile to look at it a bit more broadly in order to make sure we avoid e.g. depletion or fragmentation of our IP space. Perhaps for instance we need to be looking at a larger assignment :)
May 15 2018
radium is super old hardware (2011 era) and its refresh is imminent, as part of T189317. No reason to spend time to reimage at this point :)
May 8 2018
Thanks @Deskana :) I think that all seems sufficient and we should just go ahead with this. 2018-08-01 sounds reasonable, and we can always extend this if there's a need.
May 7 2018
I'm not sure if this needs my approval, but if it does, it has it, as long as:
- The console data contain PII, so an NDA would be absolutely required with whomever we'd need to give access to this. Presumably this company is under a contract with us and that probably includes a confidentiality clause? @Deskana, can you confirm?
- Without knowing much about this, this sounds like a one-off project that has a start and an end date -- is that right? If so, we should make sure to revoke access to that account when the project is over (and especially if the contract, alongside its confidentiality clause, expires). We have an "expiration date" field for shell accounts, so we could do something similar here.
Apr 26 2018
As far as periodicity goes, note that MaxMind states that GeoIP2 Country and City are updated every Tuesday and the rest every 1-4 weeks, so a weekly cronjob every Wednesday sounds like it would do the trick.
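E.g. a crontab entry along these lines (the script path is hypothetical):
# m h dom mon dow  command
0 4 * * 3          /usr/local/sbin/update-geoip   # every Wednesday at 04:00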
Apr 25 2018
@danstillman this is very useful information (and good news!), thank you for the detailed update! It still seems like the only option is running a Docker image which embeds custom builds of Firefox and Node.js, though, and that comes with certain maintenance challenges.
My two cents:
- I don't see this hiera knob used anywhere in the tree right now; has anyone expressed interest in using it in its current state, especially when the stability of the system is potentially at risk? I personally doubt it'll be very useful and it's yet another thing that we'll have parameterized (in the humongous base class with dozens of parameters no less). As a general rule, I think we should be avoiding adding hiera knobs unless there's a very good reason for it (including at least an existing user in the tree!) and rely on sane defaults and/or other properties of the systems via facter.
- Right now, setting profile::base::atop_enabled will still behave differently on jessie and stretch hosts given the -R difference upstream, so this still comes with the potential minefield that resulted in this task. I can e.g. imagine a new hire that isn't aware of this task enabling this knob in a year on a jessie host, then a month later reimaging the host as stretch and scratching their head :)
- Installing the atop package while disabling the cron job isn't going to be particularly useful: atop's value proposition is its recording function; tools like top and htop are at least equally good or superior in the runtime/realtime stuff.
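For context, the recording in question is the daily raw log that the cron job writes, replayed with e.g.:
$ atop -r /var/log/atop/atop_20180424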
Apr 24 2018
So, this task has been open for a couple of months now, with the underlying issues having been present for far longer than that. In case it wasn't clear from the lengthy and detailed task description, there are currently two deadlines here:
- Firefox 52 ESR (which this is indirectly based on) EOLs in August 2018.
- Ubuntu 14.04 trusty EOLs in April 2019.
These are externally set, and affect security support among other things, so they're unfortunately hard deadlines.
Let's just use both of them to also set up the stand-in that you mentioned above?
Apr 23 2018
Easy enough, +1 :) Maybe add a /* comment */ linking to the NLNOG filter guide?
I took a careful look at this -- it looks pretty good, but I'd suggest rolling it out slowly in phases just to be on the safe side. That could be separate phases for either the three different things it does (prefix length, bogon ASNs, long AS paths), the sites/BGP groups it's applied in, or both.
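For the first phase, the bogon-ASN part alone could look roughly like this (syntax from memory and untested; policy name and regexes are just a sketch, the full list is in the NLNOG guide):
policy-options {
    /* Bogon ASNs, per the NLNOG filter guide */
    as-path-group bogon-asns {
        as-path zero ".* 0 .*";
        as-path as-trans ".* 23456 .*";
    }
    policy-statement peer-in {
        term reject-bogon-asns {
            from as-path-group bogon-asns;
            then reject;
        }
    }
}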
Apr 19 2018
I'm a Pivot newbie -- how could this be inferred? I've tried adding an Ip ~ ":" expression, but that can only appear as a filter, not under split; under split I can only add "Ip" as a field, but that of course just lists individual IPs, not a boolean IPv6-or-not dimension.
Apr 18 2018
Seems fine :) Welcome back Sean!
I don't feel strongly about this, but I'm a bit skeptical about keeping this in puppet/volatile, given these are fairly out of scope for Puppet (it wouldn't really ever use this data AIUI). It'd be easy to forget, breakages wouldn't be immediately obvious, etc.
WMF3565 is > 5 years old, so there's really no point in setting up hardware that old right now.
Apr 17 2018
Yup, a replacement is underway as part of T189317 :)
Apr 16 2018
kafkacat 1.3.1-1~bpo9+1 should be available from Debian's stretch-backports on all stretch hosts:
$ rmadison -a amd64 kafkacat
kafkacat | 1.3.0-1+b1     | stable            | amd64
kafkacat | 1.3.1-1~bpo9+1 | stretch-backports | amd64
kafkacat | 1.3.1-1        | testing           | amd64
kafkacat | 1.3.1-1        | unstable          | amd64
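On a stretch host with backports enabled in sources.list, it can then be pulled in with:
$ sudo apt-get install -t stretch-backports kafkacat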
Apr 13 2018
@Ottomata pinged me last week about that, I guess I hadn't seen this task or forgot about it entirely, sorry about that!
Apr 5 2018
Ah! That's a regular mainboard/SATA controller, so these two drives wouldn't be able to participate in RAID groups. We've done that before I think, at least with Dells, where we had the system drives connected separately.
I don't understand :) Could you clarify which disks are in which slots, and how/where are they connected?
OK, I just saw above that this is a HPE Smart Array P440ar controller. According to the specs, the controller has "Internal: 8 SAS/SATA physical links across 2 x4 ports". So I think each of the ports connects to one of the internal cages (1I and 2I), with each holding 4 disks. That's all normal and according to the specs, and 8 disks is the maximum that this controller can hold. Where are the other two disks located (front/back?), and where are they connected?