Is there any progress and/or timeline for this? Thanks!
Thu, Jul 19
Wed, Jul 18
What's going on with this?
It's been a couple of years since I filed this and I don't remember much since then, so unfortunately I don't have any more insight at this point. These kinds of widespread network events are very rare and there have been no such outages recently, I'm afraid. We could figure out ways to simulate them from e.g. mwdebug, although I doubt that anyone has the time to investigate this in such depth, so I don't particularly disagree with resolving this task instead.
Tue, Jul 17
I filed T199816 for removing that page, we can follow up on that and if implemented, resolve this task and its parent.
Wed, Jul 11
Tue, Jul 10
I'm using servermon for fact queries regularly, but I think I'm one of the very few :) I admit I haven't played around much with puppetboard to adapt it to my use cases, so that may be something that could potentially work (with the caveats that Riccardo mentioned above, however).
Wed, Jul 4
There seems to be another step missing: Racktables is inconsistent. The new device is listed as "new-mr1-eqiad", while the old one as "mr1-eqiad". Can someone fix that?
Tue, Jul 3
The argument that switches between stat boxes are expensive in staff time, so we should make them less often, doesn't resonate much with me (maybe we should just make them more often, to avoid getting too attached to individual servers :), but I'm happy to approve a purchase as well -- it's in the budget as @Ottomata mentioned, and it does sound like a reasonable expense in the grand scheme of things. Please go ahead!
Wed, Jun 27
Sure, that's fine :)
Mon, Jun 25
That spare assignment sounds good to me, consider it approved. @RobH, you can go ahead :)
Yes, let's not block this for yet another week! Consider this approved, please go ahead.
Jun 14 2018
Our email logs can be pretty sensitive, especially since they include our corporate emails passing through (senders, recipients, timestamps etc.).
So for at least labvirt1019 it was indeed about PXE not working (the card worked under Linux) and that was due to a BIOS misconfiguration (the "network boot" option for the card set to disabled). T194964#4283034 has more details and troubleshooting steps.
OK, I managed to get this server to boot from its 10G interfaces. The issue was fairly straightforward to resolve ("network boot" was set to "disabled" for the 10G ports and only set to "network boot" for the first 1G port), but here are the steps I took to troubleshoot for future reference:
- Live-hacked install1002 to update the DHCP config with the 10G port's MAC address, as this was still pointing to the 1G interface.
- Attempted to boot with "network boot" (ESC-@ I think) and verified that I couldn't, as I was getting "media check failed" from the Broadcom PXE menu. I was running tcpdump -i any port 67 or port 68 on install1002 simultaneously to grab DHCP requests, but we didn't get that far, as the PXE option ROM wasn't even attempting to do DHCP. This pointed to either the card or cable not working, or more likely to this being the option ROM for a different interface, e.g. one of the 1G ones.
- Booted into the previously installed system (running Debian) from the console and verified that the port works in Linux. I did that by setting the interface (eno49) as up, then checking the switch on the other end (asw2-b-eqiad:xe-4/0/16) with show interfaces description and show configuration interfaces xe-4/0/16 | display inheritance and verifying that it sees the link as "up up", and that the config is correct. Then I ran ethtool on the system itself, and verified that it sees the link as negotiated/up and with the right speed. Finally, I ran dhclient eno49 there and it worked and got an IP assigned. By all that I verified that both the card and the cable actually work and that the network configuration is correct, and thus the issues were just about PXE.
- Rebooted and then entered the system config. In the BIOS/Platform config (RBSU) and the PCI interface, I disabled the 4x1G card (Embedded LOM). This is not actually required, but it made things a bit easier to debug as I could figure out e.g. whether the PXE prompt you get is from the 1G card or the 10G card.
- In the 10G card's configuration, I disabled "HP Shared Memory", per T167299, although I'm not sure if this is actually required anymore. From that task, it sounds like it would affect the network past the PXE stage and in the installer, but I had verified that it works in Linux, so that was probably not needed (but we also don't use these features as far as I know). I also disabled SR-IOV for good measure since we don't use it, although I doubt it would affect this.
- In the BIOS/Platform config (RBSU), under Network Options > Network Boot Options, the option "Embedded FlexibleLOM 1 Port 1" was set to "Disabled". I set that to "Network boot". This is certainly related and likely the entire cause of these issues.
- After enabling, you immediately get a warning that says "Important: When enabling network boot support for an Embedded FlexibleLOM embedded NIC, the NIC boot option does not appear in the UEFI Boot Order or Legacy IPL lists until the next system reboot.". So I just did a server reboot after that (easy).
- After that, I booted normally, hit ESC-@ for network boot and was presented with a PXE prompt; from there on, network boot worked, d-i started loading and also acquired an IP and the preseed configuration. It stopped with an error at a partman prompt (likely because of a misconfigured partman profile, unrelated to all this).
Jun 13 2018
What are the symptoms?
@Cmjohnson I'm afraid I don't understand fully what steps you've taken on which server, port or switch. So perhaps let's look at the current status: could you describe where each of labvirt1019's and labvirt1020's ports are connected to, and specifically to which ports on the switch and with what kind of cable? Thanks!
Jun 10 2018
Jun 8 2018
@Cmjohnson regarding flerovium, sure, no problem, go ahead. (The others would need coordination with their respective service owners)
Jun 7 2018
So this backfired, but thankfully the fix was as simple as starting exim :) Good thinking @herron!
The cause was the prep for T175361, in combination with a couple of unexpected misconfigurations/SPOFs, given it's been years since the switchover from mx1001->mx2001 has been tested.
Jun 6 2018
It's been a few months now, what's the status of this?
Jun 5 2018
So we need to do something in a very short amount of time (~two months) -- does anyone have a game plan? @Jrbranaa what's the latest?
Jun 4 2018
We have a number of spreadsheets tracking inventory, refreshes, CapEx budgets etc. Which one are you referring to specifically (doc & sheet)?
May 25 2018
May 23 2018
OK, so it looks like 18.104.22.168/24 is proposed to be used immediately in eqiad, to replace 22.214.171.124/25 in the next ~6 months. Additionally, 126.96.36.199/24 is proposed to be reserved (but not assigned) to be used tentatively in Q3 FY18-19 in codfw, for a region 2 deployment. Both of these sound good to me and you can proceed :)
May 21 2018
The RAID still shows as degraded -- @RobH -or someone else- could you have a look? Thanks!
May 18 2018
May 17 2018
Confirmed, thanks @Papaul!
May 16 2018
The /25 -> /24 renumbering seems fairly straightforward, but given a) IPv4's depletion (we effectively cannot get more IPv4 space from any of the RIRs), b) the Neutron redesign and c) Cloud Services' growth and needs like T122406's, I think it's worthwhile to look at it a bit more broadly in order to make sure we avoid e.g. depletion or fragmentation of our IP space. Perhaps for instance we need to be looking at a larger assignment :)
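For illustration, the arithmetic of a /25 -> /24 change can be sketched as follows (using documentation prefixes from 192.0.2.0/24, not our real allocations):

```python
import ipaddress

# Placeholder prefixes, not our real allocations: going from a /25 to a /24
# roughly doubles the usable host space, and the /25 nests inside the /24,
# so hosts can be renumbered in place.
old = ipaddress.ip_network("192.0.2.0/25")
new = ipaddress.ip_network("192.0.2.0/24")

print(old.num_addresses - 2)  # 126 usable hosts
print(new.num_addresses - 2)  # 254 usable hosts
print(old.subnet_of(new))     # True: the /25 renumbers in place inside the /24
```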
May 15 2018
radium is super old hardware (2011 era) and its refresh is imminent, as part of T189317. No reason to spend time to reimage at this point :)
May 8 2018
Thanks @Deskana :) I think that all seems sufficient and we should just go ahead with this. 2018-08-01 sounds reasonable, and we can always extend this if there's a need.
May 7 2018
I'm not sure if this needs my approval, but if it does, it has it, as long as:
- The console data contain PII, so an NDA would be absolutely required with whomever we'd need to give access to this. Presumably this company is under a contract with us and that probably includes a confidentiality clause? @Deskana, can you confirm?
- Without knowing much about this, this sounds like a one-off project that has a start and an end date -- is that right? If so, we should make sure to revoke access to that account when the project is over (and especially if the contract, alongside its confidentiality clause, expires). We have an "expiration date" field for shell accounts, so we could do something similar here.
Apr 26 2018
As far as periodicity goes, note that MaxMind states that GeoIP2 Country and City are updated every Tuesday and the rest every 1-4 weeks, so a weekly cronjob every Wednesday sounds like it would do the trick.
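As a sketch, the crontab entry could look something like this (the script path is hypothetical, just to illustrate the schedule):

```
# Fetch MaxMind database updates every Wednesday at 04:30
# (day-of-week field: 3 = Wednesday)
30 4 * * 3 root /usr/local/sbin/update-geoip-databases
```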
Apr 25 2018
@danstillman this is very useful information (and good news!), thank you for the detailed update! It still seems like the remaining option is running a Docker image that embeds custom builds of Firefox and Node.js though, which comes with certain maintenance challenges.
My two cents:
- I don't see this hiera knob used anywhere in the tree right now; has anyone expressed interest in using it in its current state, especially when the stability of the system is potentially at risk? I personally doubt it'll be very useful and it's yet another thing that we'll have parameterized (in the humongous base class with dozens of parameters no less). As a general rule, I think we should be avoiding adding hiera knobs unless there's a very good reason for it (including at least an existing user in the tree!) and rely on sane defaults and/or other properties of the systems via facter.
- Right now, setting profile::base::atop_enabled will still produce different behavior on jessie and stretch hosts given the -R difference upstream, so this still comes with the potential minefield that resulted in this task. I can e.g. imagine a new hire that isn't aware of this task enabling this knob in a year on a jessie host, then a month later reimaging the host as stretch and scratching their heads :)
- Installing the atop package while disabling the cron job isn't going to be particularly useful: atop's value proposition is its recording function; tools like top and htop are at least equally good or superior in the runtime/realtime stuff.
Apr 24 2018
So, this task has been open for a couple of months now, with the underlying issues having been present for far longer than that. In case it wasn't clear from the lengthy and detailed task description, there are currently two deadlines here:
- Firefox 52 ESR (which this is indirectly based on) EOLs in August 2018.
- Ubuntu 14.04 trusty EOLs in April 2019.
These are externally set, and affect security support among other things, so they're unfortunately hard deadlines.
Let's just use both of them to also set up the stand-in that you mentioned above?
Apr 23 2018
Easy enough, +1 :) Maybe add a /* comment */ linking to the NLNOG filter guide?
I took a careful look at this -- it looks pretty good, but I'd suggest rolling it out slowly in phases just to be on the safe side. That could be separate phases for either the three different things it does (prefix length, bogon ASNs, long AS paths), the sites/BGP groups it's applied in, or both.
Apr 19 2018
I'm a Pivot newbie -- how could this be inferred? I've tried adding an Ip ~ ":" but that can only appear as a filter, not under split; in split I can only add "Ip" as a field, but that of course just lists different IPs, not the boolean state of IPv6 or not.
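For reference, the boolean state I'm after is trivially derivable from the address itself -- e.g. in Python (just to illustrate the logic, not Pivot/Druid syntax):

```python
import ipaddress

def is_ipv6(ip: str) -> bool:
    """Return True if the given address literal is an IPv6 address."""
    return ipaddress.ip_address(ip).version == 6

print(is_ipv6("2620:0:861:1::1"))  # True
print(is_ipv6("203.0.113.7"))      # False
```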
Apr 18 2018
Seems fine :) Welcome back Sean!
I don't feel strongly about this, but I'm a bit skeptical about keeping this in puppet/volatile, given these are fairly out of scope for Puppet (it wouldn't really ever use this data, AIUI). It'd be easy to forget, breakages wouldn't be immediately obvious etc.
WMF3565 is > 5 years old, so there's really no point in setting up hardware that old right now.
Apr 17 2018
Yup, a replacement is underway as part of T189317 :)
Apr 16 2018
kafkacat 1.3.1-1~bpo9+1 should be available from Debian's stretch-backports on all stretch hosts:
$ rmadison -a amd64 kafkacat
kafkacat | 1.3.0-1+b1     | stable            | amd64
kafkacat | 1.3.1-1~bpo9+1 | stretch-backports | amd64
kafkacat | 1.3.1-1        | testing           | amd64
kafkacat | 1.3.1-1        | unstable          | amd64
Apr 13 2018
@Ottomata pinged me last week about that, I guess I hadn't seen this task or forgot about it entirely, sorry about that!
Apr 5 2018
Ah! That's a regular mainboard/SATA controller, so these two drives wouldn't be able to participate in RAID groups. We've done that before I think, at least with Dells, where we had the system drives connected separately.
I don't understand :) Could you clarify which disks are in which slots, and how/where are they connected?
OK, I just saw above that this is an HPE Smart Array P440ar controller. According to the specs, the controller has "Internal: 8 SAS/SATA physical links across 2 x4 ports". So I think each of the ports connects to one of the internal cages (1I and 2I), with each holding 4 disks. That's all normal and according to the specs, and 8 disks is the maximum that this controller can hold. Where are the other two disks located (front/back?), and where are they connected?
@Cmjohnson @RobH This has been going on for weeks now, and this is too much of a delay for setting up these systems. I'm elevating this task's priority, let's get to the bottom of this ASAP. A lot of the delays were just on our side, but I see that HPE is delaying this further too; please escalate within HPE and/or with me if you are not getting timely responses.
Apr 4 2018
In terms of code, what would the changes required be? What are these deprecation warnings that you mentioned above? Are we tracking fixes for these somewhere and are we making sure new ones don't crop up?
Apr 3 2018
Mar 28 2018
These seem to be under warranty for another 2 months, so we should hurry up.
Mar 27 2018
Mar 26 2018
There's firstname.lastname@example.org that just gets discarded. I'm not sure if it would be a good fit for your purpose though -- wouldn't it be possible to just remove the email address completely from those accounts and/or just disable those entirely where applicable?
Is there any way we can help? Do you have logs or more information about the "trusty will only image from eth0" that we could perhaps help troubleshoot together?
Mar 22 2018
I have only hunches and no data to back any of this, but I think ElasticSearch, Hadoop, WMCS, Backups, plus probably Ganeti and Kafka would be good candidates to go 10G-only. Kubernetes I could see it going either way, depending on the density we'll come up with.
Mar 19 2018
I don't disagree with any of that (if anything, they're all great ideas), but I'm not sure if we should be spending time on it right now. Revamping our status page and providing a proper status page that reflects our true status, and is also used for short text announcements by humans, is definitely on my radar, and depending on how hiring and onboarding goes, might even happen in the next 18 months or so. Thoughts?
Mar 17 2018
Mar 14 2018
These are now attached and configured, resolving.
Figured this out with @Papaul on IRC (thanks!).
I rebooted furud and it is not booting right now, saying:
The total number of enclosures connected to connector 01, has exceeded the maximum allowable limit of 4 enclosures. Please remove the extra enclosures and then restart your system.
@RobH could you perhaps help out with the topology here?
Mar 13 2018
So post-mortem, I think there are 4 different things here:
- T189519: Audit switch ports/descriptions/enable (and do this on an ongoing basis)
- T189522: Detect IP address collisions
- General enhancements on our server provisioning and decommissioning pipeline, which has a bunch of long-standing issues, but also requires a more dedicated long-term effort. I'm sure there's one or more tasks related to this, but more broadly, this work stream is something that has been incorporated into our (draft) annual plan as a major item next year.
- (Tangential) Triage the decom queue in a more prompt way to avoid servers lingering for months after their service decom.
We're happy to announce that your RIPE Atlas anchor is functioning properly and is now connected to the RIPE Atlas network.
You can see your anchor when logged in to the RIPE Atlas website.
The direct link to the probe page for the anchor is here:
Mar 12 2018
I just ran into a similar thing today in eqiad with T188045, so I reworded the task to make it generic and for both data centers. I also added a sentence to make sure this doesn't happen again, e.g. by adding an alert, or a Juniper slax script to make sure enabled ports always have a description.
Just heard from RIPE:
I just finished the provisioning of sg-sin-as14907.anchors.atlas.ripe.net and noticed that port 5666 is filtered.
I'd like all the 5 shelves (array3-7) connected to furud, but not the 2 old ones (array1-2) until further notice. Can we just bypass array1-2 by disconnecting them entirely, and creating a chain with just array3-7?
email@example.com> show arp no-resolve | match 10.64.0.17
78:2b:cb:2d:fa:e6  10.64.0.17  ae1.1017  none
Thanks for taking care of this before your trip! I checked this out last week, and it seemed then (and now that I double-checked it) that only three shelves (36 disks) are visible, rather than 5 (array3-7).
Mar 9 2018
That is correct to my knowledge -- that was the case with our other anchors.
Mar 7 2018
I believe this was blocked until today on an SFP replacement (T188923). It seems that the IP of the Atlas is responding now, and we even receive an SSH banner. So I just submitted the form on the RIPE Atlas panel. Now we're waiting on RIPE before this is fully online:
Thank you for installing the software for your RIPE Atlas anchor!
It may take up to a week to run the tests for your anchor.
We will keep you informed throughout the process of finalising your anchor.
This has been discussed in bigger requests a couple of times before (T103893, T84201) for Greenhouse specifically, plus a bunch of other times for other third-party services. The TL;DR is that we don't really like whitelisting in SPF/DKIM/DMARC for wikimedia.org all of the third-party services that we use, because that opens up attack vectors like email spoofing and CEO fraud, via entities that we neither control nor are able to vet the security of. The alternative we had proposed before was to use a separate subdomain (careers.wikimedia.org). It's still non-ideal, but it's better than allowing them and others like them to send emails as <insert ED name>@wikimedia.org for instance.
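To illustrate the subdomain approach with a sketch (the record values here are made up; the real policy would differ):

```
; wikimedia.org stays locked down; the subdomain delegates to the third party
wikimedia.org.          IN TXT "v=spf1 mx -all"
careers.wikimedia.org.  IN TXT "v=spf1 include:_spf.thirdparty.example ~all"
```

A spoofed mail claiming to be from someone@wikimedia.org would then still fail SPF, while the vendor can legitimately send as someone@careers.wikimedia.org.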
Mar 2 2018
Let's keep the existing arrays (array1 & array2) offline, and just connect all of the new ones.
To answer my own earlier question: I was looking at nftables' wiki about the supported features compared to xtables and the updates to the Linux kernel per version. Several systems (mostly WMCS) are still using trusty and Linux 3.13, which shipped the very first release of nftables, with multiple pretty basic features missing (e.g. REJECT, MASQUERADE etc.). Our latest and greatest right now is 4.9, and even that is apparently missing NOTRACK (added in 4.10), which is something we're using in a few places (e.g. DNSes).
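For context, the NOTRACK usage in question looks like this in iptables-speak (illustrative rules, not our exact ruleset; nftables only gained the equivalent "notrack" statement in 4.10):

```
# Skip connection tracking for DNS traffic, in the raw table
# (which is traversed before conntrack kicks in)
iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK
iptables -t raw -A OUTPUT     -p udp --sport 53 -j NOTRACK
```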
Mar 1 2018
First, I don't think we should be thinking in terms of "using software from the 90s", at least not for something that is still as widely used and well-maintained as iptables (and compared to something as seldom used as nftables). This is not something we should judge software by; we can talk instead in terms of amounts of bugs, maintainability, upstream response times, when the last release was, if/when it was deprecated by upstream(s) etc.
Feb 28 2018
I don't think it's easy for anyone to calculate the amount of effort required for this, but the stated 1-2 year long migration sounds longer than I thought and... pretty scary. I'd like to at least be conscious of the amount of effort required here, and foresee clear, tangible benefits at the end of the line to be able to justify the effort both for the migration itself, plus all the associated risks, learning curve and confusion in the meantime.