faidon (Faidon Liambotis)
SRE

Projects (10)

User Details

User Since
Oct 7 2014, 10:21 AM (197 w, 4 d)
Availability
Available
IRC Nick
paravoid
LDAP User
Faidon Liambotis
MediaWiki User
Faidon Liambotis (WMF) [ Global Accounts ]

Recent Activity

Yesterday

faidon added a comment to T197242: Transition citoid to use Zotero's translation-server-v2.

Is there any progress and/or timeline for this? Thanks!

Fri, Jul 20, 2:10 PM · Citoid, Services (watching), VisualEditor, Operations

Thu, Jul 19

Framawiki awarded T199816: Sunset Watchmouse's status.wikimedia.org a Love token.
Thu, Jul 19, 10:43 PM · monitoring, Patch-For-Review, Operations

Wed, Jul 18

faidon added a comment to T196886: Replace wtp1043's sda.

What's going on with this?

Wed, Jul 18, 11:40 AM · DC-Ops, ops-eqiad, Operations
faidon added a comment to T143367: Users getting logged-out during minor network glitches.

It's been a couple of years since I filed this and I don't remember much since, so unfortunately I don't have any more insight at this point. These kinds of widespread network events are very rare and there have been no such outages recently, I'm afraid. We could figure out ways to simulate them from e.g. mwdebug, although I doubt that anyone has the time to investigate this in such depth, so I don't particularly disagree with resolving this task instead.

Wed, Jul 18, 1:47 AM · Performance-Team, Availability (MediaWiki-MultiDC), MediaWiki-Authentication-and-authorization

Tue, Jul 17

faidon added a project to T199816: Sunset Watchmouse's status.wikimedia.org: monitoring.
Tue, Jul 17, 4:28 PM · monitoring, Patch-For-Review, Operations
faidon added a comment to T115945: status.wikimedia.org should not load Google Analytics.

I filed T199816 for removing that page, we can follow up on that and if implemented, resolve this task and its parent.

Tue, Jul 17, 4:27 PM · Security-Core, Operations, Privacy, monitoring
faidon triaged T199816: Sunset Watchmouse's status.wikimedia.org as Normal priority.
Tue, Jul 17, 4:26 PM · monitoring, Patch-For-Review, Operations

Wed, Jul 11

faidon reopened Unknown Object (Task), a subtask of T196485: WDQS diskspace is low, as Open.
Wed, Jul 11, 1:24 PM · Operations, Discovery, Wikidata, Wikidata-Query-Service

Tue, Jul 10

faidon added a comment to T198939: Decommission servermon.

I'm using servermon for fact queries regularly, but I think I'm one of the very few :) I admit I haven't played around much with puppetboard to adjust my use cases, so that may be something that could potentially work (with the caveats that Riccardo mentioned above, however).

Tue, Jul 10, 6:03 PM · Patch-For-Review, Operations
faidon triaged T199251: furud: disconnect and power down all disk shelves as High priority.
Tue, Jul 10, 5:06 PM · ops-codfw, Operations, DC-Ops

Wed, Jul 4

faidon added a comment to T185171: replace mr1-eqiad.

There seems to be another step missing: Racktables seems inconsistent. The new one is listed as "new-mr1-eqiad", while the old one as "mr1-eqiad". Can someone fix that?

Wed, Jul 4, 4:07 PM · ops-eqiad, Operations, netops

Tue, Jul 3

faidon added a comment to T196345: eqiad: (1) new stat box to offload users from stat1005.

The argument that switches between stat boxes are expensive in staff time, so we should make them less often doesn't resonate much with me (maybe we should just make them more often to avoid getting too attached to individual servers :), but happy to approve a purchase as well -- it's in the budget as @Ottomata mentioned, and it does sound like a reasonable expense to make in the grand scheme of things. Please go ahead!

Tue, Jul 3, 10:10 AM · hardware-requests, Operations, Analytics

Wed, Jun 27

faidon triaged T198344: Get Papaul access to network equipment as Normal priority.
Wed, Jun 27, 6:23 PM · SRE-Access-Requests, netops, Operations
faidon closed T197857: Add @pmiazga @Niedzielski and @phuedx to the deploy-service group, a subtask of T186748: New service request: chromium-render/deploy, as Resolved.
Wed, Jun 27, 3:58 PM · Patch-For-Review, Readers-Web-Kanbanana-Board, Services (blocked), Service-deployment-requests, Readers-Web-Backlog, Proton, Electron-PDFs, Operations
faidon closed T197857: Add @pmiazga @Niedzielski and @phuedx to the deploy-service group as Resolved.

Sure, that's fine :)

Wed, Jun 27, 3:58 PM · Proton, Operations, SRE-Access-Requests

Mon, Jun 25

faidon reassigned T196345: eqiad: (1) new stat box to offload users from stat1005 from elukey to RobH.

That spare assignment sounds good to me, consider it approved. @RobH, you can go ahead :)

Mon, Jun 25, 12:43 PM · hardware-requests, Operations, Analytics
faidon added a comment to T197237: Requesting access for mbsantos.

Yes, let's not block this for yet another week! Consider this approved, please go ahead.

Mon, Jun 25, 9:48 AM · Patch-For-Review, Analytics, SRE-Access-Requests, Operations

Jun 14 2018

faidon added a comment to T197173: Ship MX logs to ELK.

Our email logs can be pretty sensitive, especially since they include our corporate emails passing through (senders, recipients, timestamps etc.).

Jun 14 2018, 3:52 PM · User-herron, Wikimedia-Logstash, Mail, Operations
faidon added a comment to T197169: 10G ports seem not to work on new HP hardware.

So for at least labvirt1019 it was indeed about PXE not working (the card worked under Linux) and that was due to a BIOS misconfiguration (the "network boot" option for the card set to disabled). T194964#4283034 has more details and troubleshooting steps.

Jun 14 2018, 3:27 PM · Cloud-Services, Operations
faidon added a comment to T194964: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020.

OK, I managed to get this server to boot from its 10G interfaces. The issue was fairly straightforward to resolve ("network boot" was set to "disabled" for the 10G ports and only set to "network boot" for the first 1G port), but here are the steps I took to troubleshoot for future reference:

  • Live-hacked install1002 to update the DHCP config with the 10G port's MAC address, as this was still pointing to the 1G interface.
  • Attempted to boot with "network boot" (ESC-@ I think) and verified that I couldn't, as I was getting "media check failed" from the Broadcom PXE menu. I was running tcpdump -i any port 67 or port 68 on install1002 simultaneously to grab DHCP requests, but we didn't get that far, as the PXE option ROM wasn't even attempting to do DHCP. This suggested that either the card or cable wasn't working or, more likely, that this was the option ROM for a different interface, e.g. one of the 1G ones.
  • Booted into the previously installed system (running Debian) from the console and verified that the port works in Linux. I did that by setting the interface (eno49) as up, then checking the switch on the other end (asw2-b-eqiad:xe-4/0/16) with show interfaces description and show configuration interfaces xe-4/0/16 | display inheritance and verifying that it sees the link as "up up", and that the config is correct. Then I ran ethtool on the system itself, and verified that it sees the link as negotiated/up and with the right speed. Finally, I ran dhclient eno49 there and it worked and got an IP assigned. By all that I verified that both the card and the cable actually work and that the network configuration is correct, and thus the issues were just about PXE.
  • Rebooted and then entered the system config. In the BIOS/Platform config (RBSU) and the PCI interface, I disabled the 4x1G card (Embedded LOM). This is not actually required, but it made things a bit easier to debug as I could figure out e.g. whether the PXE prompt you get is from the 1G card or the 10G card.
  • In the 10G card's configuration, I disabled "HP Shared Memory", per T167299, although I'm not sure if this is actually required anymore. From that task, it sounds like it would affect the network past the PXE stage and in the installer, but I had verified that it works in Linux, so that was probably not needed (but we also don't use these features as far as I know). I also disabled SR-IOV for good measure since we don't use it, although I doubt it would affect this.
  • In the BIOS/Platform config (RBSU), under Network Options > Network Boot Options, the option "Embedded FlexibleLOM 1 Port 1" was set to "Disabled". I set that to "Network boot". This is certainly related and likely the entire cause of this issue.
  • After enabling, you immediately get a warning that says "Important: When enabling network boot support for an Embedded FlexibleLOM embedded NIC, the NIC boot option does not appear in the UEFI Boot Order or Legacy IPL lists until the next system reboot.". So I just did a server reboot after that (easy).
  • After that, I booted normally, hit ESC-@ for network boot and was presented with a PXE prompt; from there on, network boot worked, d-i started loading and also acquired an IP and the preseed configuration. It stopped with an error at a partman prompt (likely because of a misconfigured partman profile, unrelated to all this).
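
The Linux-side checks in the steps above can be sketched as a small command builder. This is only an illustration: the function names are hypothetical, and `ip link set dev ... up` is an assumption about how the interface was brought up (the comment only says it was set as up). The `ethtool`, `dhclient` and `tcpdump` invocations mirror the ones quoted in the steps.

```python
from __future__ import annotations

import shlex


def linux_side_checks(iface: str) -> list[list[str]]:
    """Commands that confirm a NIC and its cable work from a running Linux system."""
    return [
        # Assumption: the comment does not say which command set the link up.
        ["ip", "link", "set", "dev", iface, "up"],
        # Shows whether the link is detected and at which negotiated speed.
        ["ethtool", iface],
        # Requests a DHCP lease over that port.
        ["dhclient", iface],
    ]


def dhcp_capture(iface: str = "any") -> list[str]:
    """tcpdump invocation for watching DHCP requests on the install server."""
    return ["tcpdump", "-i", iface, "port", "67", "or", "port", "68"]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in linux_side_checks("eno49") + [dhcp_capture()]:
        print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere; the commands themselves need root on the host in question.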
Jun 14 2018, 3:23 PM · Cloud-Services, Patch-For-Review, Operations, ops-eqiad

Jun 13 2018

faidon added a comment to T197169: 10G ports seem not to work on new HP hardware.

What are the symptoms?

Jun 13 2018, 9:18 PM · Cloud-Services, Operations
faidon added a comment to T194964: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020.

@Cmjohnson I'm afraid I don't understand fully what steps you've taken on which server, port or switch. So perhaps let's look at the current status: could you describe where each of labvirt1019's and labvirt1020's ports are connected to, and specifically to which ports on the switch and with what kind of cable? Thanks!

Jun 13 2018, 9:16 PM · Cloud-Services, Patch-For-Review, Operations, ops-eqiad
Tgr awarded T170150: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible a Like token.
Jun 13 2018, 4:28 PM · Patch-For-Review, monitoring, Operations

Jun 10 2018

Gerrit Code Review <gerrit@wikimedia.org> committed rOSNBa6cc3323c8e9: Final NoteDb migration updates (authored by faidon).
Final NoteDb migration updates
Jun 10 2018, 2:50 AM
Gerrit Code Review <gerrit@wikimedia.org> committed rOSNB8689c87c5105: Create change (authored by faidon).
Create change
Jun 10 2018, 2:50 AM

Jun 8 2018

faidon raised the priority of T193655: rack/setup/install labstore1008 & labstore1009 from Normal to High.
Jun 8 2018, 2:35 PM · cloud-services-team (Kanban), Patch-For-Review, ops-eqiad, Cloud-VPS, Operations
faidon added a comment to T196651: rack upgraded storage capacity in labstore100[67].eqiad.wmnet.

@Cmjohnson regarding flerovium, sure, no problem, go ahead. (The others would need coordination with their respective service owners)

Jun 8 2018, 2:26 PM · Patch-For-Review, Datasets-General-or-Unknown, ops-eqiad, Cloud-VPS, Operations

Jun 7 2018

faidon added a comment to T175361: Upgrade mx1001/mx2001 to stretch.

So this backfired, but thankfully the fix was as simple as starting exim :) Good thinking @herron!

Jun 7 2018, 2:15 AM · User-herron, Patch-For-Review, Operations, Mail
faidon closed T196598: Phab and Gerrit emails stopped at around 1900 UTC 6th June as Resolved.

The cause was the prep for T175361, in combination with a couple of unexpected misconfigurations/SPOFs, given it's been years since the switchover from mx1001->mx2001 has been tested.

Jun 7 2018, 2:11 AM · Phabricator, Gerrit, Mail, Operations

Jun 6 2018

faidon added a comment to T185171: replace mr1-eqiad.

It's been a few months now, what's the status of this?

Jun 6 2018, 2:27 PM · ops-eqiad, Operations, netops

Jun 5 2018

faidon added a comment to T187194: zotero translation server: code stewardship request.

So we need to do something in a very short amount of time (~two months) -- does anyone have a game plan? @Jrbranaa what's the latest?

Jun 5 2018, 10:58 AM · User-Ryasmeen, VisualEditor, Citoid, Services (watching), Operations, Code-Stewardship-Reviews

Jun 4 2018

faidon added a comment to T196072: Analytics hosts missing in Inventory/Refresh.

We have a number of spreadsheets tracking inventory, refreshes, CapEx budgets etc. Which one are you referring to specifically (doc & sheet)?

Jun 4 2018, 12:30 PM · procurement, Operations, Analytics, DC-Ops

May 25 2018

faidon added a comment to T175361: Upgrade mx1001/mx2001 to stretch.

mx2001 has been running Stretch for a few days and has been stable. I think we're in good shape to move on to mx1001. However, there are a few configs with mx1001 hardcoded as the smtp server in the puppet repo. I'll work on removing those to simplify the depool process before rebuilding mx1001.

May 25 2018, 1:16 PM · User-herron, Patch-For-Review, Operations, Mail

May 23 2018

faidon added a comment to T193496: Allocate public v4 IPs for Neutron setup in eqiad.

OK, so it looks like 185.15.56.0/24 is proposed to be used immediately in eqiad, to replace 208.80.155.128/25 in the next ~6 months. Additionally, 185.15.57.0/24 is proposed to be reserved (but not assigned) to be used tentatively in Q3 FY18-19 in codfw, for a region 2 deployment. Both of these sound good to me and you can proceed :)
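
As a rough illustration of the sizes involved, the prefixes named in this comment can be compared with Python's standard `ipaddress` module. The variable and function names are made up for the example; only the three prefixes come from the comment.

```python
import ipaddress

# Prefixes quoted in the comment above.
current = ipaddress.ip_network("208.80.155.128/25")
eqiad_new = ipaddress.ip_network("185.15.56.0/24")
codfw_reserved = ipaddress.ip_network("185.15.57.0/24")


def summarize(networks):
    """Collapse adjacent or overlapping networks into the fewest covering prefixes."""
    return list(ipaddress.collapse_addresses(networks))


if __name__ == "__main__":
    print(current.num_addresses, "->", eqiad_new.num_addresses)
    print(summarize([eqiad_new, codfw_reserved]))
```

The move from the /25 to the /24 doubles the address count from 128 to 256, and because the two /24s are adjacent and aligned they collapse into a single 185.15.56.0/23.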

May 23 2018, 12:28 PM · Cloud-Services, netops, Operations

May 21 2018

faidon added a comment to T193394: Degraded RAID on wasat.

The RAID still shows as degraded -- @RobH -or someone else- could you have a look? Thanks!

May 21 2018, 8:17 AM · Operations, ops-codfw

May 18 2018

faidon committed rOSNBf3c3f6d18405: Edit Project Config (authored by faidon).
Edit Project Config
May 18 2018, 7:56 PM
faidon committed rOSNB6a1caa720b90: Add Wikimedia's initial data (authored by faidon).
Add Wikimedia's initial data
May 18 2018, 7:56 PM
faidon committed rOSNB07a7facdf44d: Allow custom fields in the Device CSV form (authored by faidon).
Allow custom fields in the Device CSV form
May 18 2018, 7:56 PM

May 17 2018

faidon closed T194798: furud: disconnect furud-array[3-7]; connect furud-array[1-2] as Resolved.

Confirmed, thanks @Papaul!

May 17 2018, 3:15 PM · Operations, ops-codfw

May 16 2018

faidon added a comment to T193496: Allocate public v4 IPs for Neutron setup in eqiad.

The /25 -> /24 renumbering seems fairly straightforward, but given a) IPv4's depletion (we effectively cannot get more IPv4 space from any of the RIRs), b) the Neutron redesign and c) Cloud Services' growth and needs like T122406's, I think it's worthwhile to look at it a bit more broadly in order to make sure we avoid e.g. depletion or fragmentation of our IP space. Perhaps for instance we need to be looking at a larger assignment :)

May 16 2018, 12:42 PM · Cloud-Services, netops, Operations

May 15 2018

faidon closed T194796: replace/reinstall radium with a stretch system as Declined.

radium is super old hardware (2011 era) and its refresh is imminent, as part of T189317. No reason to spend time to reimage at this point :)

May 15 2018, 8:31 PM · Operations
faidon triaged T194798: furud: disconnect furud-array[3-7]; connect furud-array[1-2] as High priority.
May 15 2018, 8:29 PM · Operations, ops-codfw

May 8 2018

faidon added a comment to T192893: Access to Google Search Console for Go Fish Digital.

Thanks @Deskana :) I think that all seems sufficient and we should just go ahead with this. 2018-08-01 sounds reasonable, and we can always extend this if there's a need.

May 8 2018, 11:27 AM · SRE-Access-Requests, Operations

May 7 2018

faidon added a comment to T192893: Access to Google Search Console for Go Fish Digital.

I'm not sure if this needs my approval, but if it does, it has it, as long as:

  • The console data contain PII, so an NDA would be absolutely required with whomever we'd need to give access to this. Presumably this company is under a contract with us and that probably includes a confidentiality clause? @Deskana, can you confirm?
  • Without knowing much about this, this sounds like a one-off project, that has a start and an end date -- is that right? If so, we should make sure to revoke access to that account when the project is over (and especially if the contract, alongside its confidentiality clause, expires). We have an "expiration date" field for shell accounts, so we could do something similar here.
May 7 2018, 8:35 AM · SRE-Access-Requests, Operations

Apr 26 2018

faidon added a comment to T136732: Puppetize job that saves old versions of Maxmind geoIP database.

As far as periodicity goes, note that MaxMind states that GeoIP2 Country and City are updated every Tuesday and the rest every 1-4 weeks, so a weekly cronjob every Wednesday sounds like it would do the trick.
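
A minimal sketch of that schedule, assuming the archive job should run on the first Wednesday on or after a given date. The function name and the 06:00 run time in the crontab string are arbitrary assumptions, not part of the task.

```python
from datetime import date, timedelta

WEDNESDAY = 2  # date.weekday() counts from Monday = 0


def next_archive_day(today: date) -> date:
    """Return the first Wednesday on or after `today`."""
    return today + timedelta(days=(WEDNESDAY - today.weekday()) % 7)


# Equivalent crontab schedule fields (minute hour day-of-month month day-of-week).
# Day-of-week 3 is Wednesday; the 06:00 time is an assumption for illustration.
CRON_SCHEDULE = "0 6 * * 3"

if __name__ == "__main__":
    print(next_archive_day(date(2018, 4, 26)))
```

Running a day after the Tuesday release leaves a buffer in case the upstream update is published late in the day.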

Apr 26 2018, 11:22 AM · Puppet, Patch-For-Review, Analytics-Kanban

Apr 25 2018

faidon added a comment to T187194: zotero translation server: code stewardship request.

@danstillman this is very useful information (and good news!), thank you for the detailed update! It still seems like the options are either running a Docker image which embeds custom builds of Firefox and Node.js though, which comes with certain maintenance challenges.

Apr 25 2018, 2:10 PM · User-Ryasmeen, VisualEditor, Citoid, Services (watching), Operations, Code-Stewardship-Reviews
faidon reopened T192551: atop on stretch overloading a host as "Open".

My two cents:

  • I don't see this hiera knob used anywhere in the tree right now; has anyone expressed interest in using it in its current state, especially when the stability of the system is potentially at risk? I personally doubt it'll be very useful and it's yet another thing that we'll have parameterized (in the humongous base class with dozens of parameters no less). As a general rule, I think we should be avoiding adding hiera knobs unless there's a very good reason for it (including at least an existing user in the tree!) and rely on sane defaults and/or other properties of the systems via facter.
  • Right now setting profile::base::atop_enabled will still result in different results in jessie and stretch hosts given the -R difference upstream, so this still comes with the potential minefield that resulted in this task. I can e.g. imagine a new hire that isn't aware of this task enabling this knob in a year on a jessie host, then a month later reimaging the host as stretch and scratching their heads :)
  • Installing the atop package while disabling the cron job isn't going to be particularly useful: atop's value proposition is its recording function; tools like top and htop are at least equally good or superior in the runtime/realtime stuff.
Apr 25 2018, 1:30 PM · Upstream, Patch-For-Review, monitoring, Operations

Apr 24 2018

faidon updated subscribers of T187194: zotero translation server: code stewardship request.

So, this task has been open for a couple of months now, with the underlying issues having been present for far longer than that. In case it wasn't clear from the lengthy and detailed task description, there are currently two deadlines here:

  • Firefox 52 ESR (which this is indirectly based on) EOLs in August 2018.
  • Ubuntu 14.04 trusty EOLs in April 2019.

These are externally set, and affect security support among other things, so they're unfortunately hard deadlines.

Apr 24 2018, 1:35 PM · User-Ryasmeen, VisualEditor, Citoid, Services (watching), Operations, Code-Stewardship-Reviews
faidon added a comment to T184293: rack/setup/install lvs101[3-6].

@Cmjohnson I think @BBlack's question above was for you -- task description seems to point at a few of the steps on your side being still pending at least.

Apr 24 2018, 1:03 PM · Patch-For-Review, ops-eqiad, Operations, Traffic
faidon added a comment to T192185: request to assign spare systems as terbium equivalent.

Let's just use both of them to also set up the stand-in that you mentioned above?

Apr 24 2018, 10:49 AM · Patch-For-Review, hardware-requests, Operations

Apr 23 2018

faidon added a comment to T190323: Implement BGP graceful shutdown.

Easy enough, +1 :) Maybe add a /* comment */ linking to the NLNOG filter guide?

Apr 23 2018, 1:30 PM · netops, Operations
faidon added a comment to T190317: Update BGP_sanitize_in filter.

I took a careful look at this -- it looks pretty good, but I'd suggest rolling it out slowly in phases just to be on the safe side. That could be separate phases for either the three different things it does (prefix length, bogon ASNs, long AS paths), the sites/BGP groups it's applied in, or both.
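
To make the three checks and the phased rollout concrete, here is a rough Python model. It is not the router filter itself: the maximum prefix lengths, the AS path limit and the function names are assumptions chosen for illustration, while the ASN ranges are the commonly cited reserved, documentation and private-use blocks.

```python
import ipaddress

# Assumed thresholds, not taken from the task.
MAX_PREFIXLEN = {4: 24, 6: 48}
MAX_AS_PATH_LEN = 50

# Reserved, documentation and private-use ASN ranges (inclusive).
BOGON_ASN_RANGES = [
    (0, 0),
    (23456, 23456),
    (64496, 64511),
    (64512, 65534),
    (65535, 65535),
    (65536, 65551),
    (65552, 131071),
    (4200000000, 4294967294),
    (4294967295, 4294967295),
]

ALL_CHECKS = ("prefix_length", "bogon_asn", "as_path_length")


def is_bogon_asn(asn: int) -> bool:
    """True if the ASN falls in one of the reserved/private ranges above."""
    return any(low <= asn <= high for low, high in BOGON_ASN_RANGES)


def rejection_reasons(prefix: str, as_path: list, checks=ALL_CHECKS) -> list:
    """Return the names of the enabled checks that this route fails."""
    net = ipaddress.ip_network(prefix)
    reasons = []
    if "prefix_length" in checks and net.prefixlen > MAX_PREFIXLEN[net.version]:
        reasons.append("prefix_length")
    if "bogon_asn" in checks and any(is_bogon_asn(asn) for asn in as_path):
        reasons.append("bogon_asn")
    if "as_path_length" in checks and len(as_path) > MAX_AS_PATH_LEN:
        reasons.append("as_path_length")
    return reasons
```

The `checks` argument stands in for the phases: enabling one check at a time, for example `checks=("prefix_length",)`, shows which routes each phase would start rejecting before the next one is turned on.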

Apr 23 2018, 1:29 PM · Operations, netops

Apr 19 2018

faidon added a comment to T138396: Create ops dashboard with info like ipv6 traffic split .

I'm a Pivot newbie -- how could this be inferred? I've tried adding an Ip ~ ":" but that can only appear as a filter, not under split; in split I can only add "Ip" as a field, but that of course just lists different IPs, not the boolean state of IPv6 or not.
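
Outside of Pivot, the boolean dimension being asked for could be derived along these lines. This sketch does not use any Pivot API; it only shows, with hypothetical function names, how an "is IPv6" flag and the resulting split might be computed from a list of address strings.

```python
import ipaddress
from collections import Counter


def is_ipv6(ip: str) -> bool:
    """Classify an address string by parsing it, rather than matching on ':'."""
    return ipaddress.ip_address(ip).version == 6


def ipv6_split(ips) -> dict:
    """Return the share of IPv6 and IPv4 addresses among `ips`."""
    counts = Counter(is_ipv6(ip) for ip in ips)
    total = sum(counts.values())
    if total == 0:
        return {"ipv6": 0.0, "ipv4": 0.0}
    return {"ipv6": counts[True] / total, "ipv4": counts[False] / total}
```

For instance, `ipv6_split(["2001:db8::1", "192.0.2.1"])` gives an even split between the two families.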

Apr 19 2018, 4:06 PM · Analytics
faidon added a comment to T136732: Puppetize job that saves old versions of Maxmind geoIP database.

We could do that, but we wanted something centralized and reproducible (e.g. include a puppet class, get the historical dbs). We would have just put this as is in gerrit and auto-committed to it, but we can't host it anywhere publicly, since we pay for these files.

Apr 19 2018, 2:22 PM · Puppet, Patch-For-Review, Analytics-Kanban

Apr 18 2018

faidon changed the status of T191478: Requesting access to shell (snapshot, dumpsdata) for springle from Stalled to Open.

Seems fine :) Welcome back Sean!

Apr 18 2018, 4:57 PM · Patch-For-Review, Operations, SRE-Access-Requests
faidon added a comment to T136732: Puppetize job that saves old versions of Maxmind geoIP database.

I don't feel strongly about this, but I'm a bit skeptical about keeping this in puppet/volatile, given these are fairly out of scope for Puppet (it wouldn't really ever use this data AIUI). It'd be easy to forget, breakages wouldn't be immediately obvious etc.

Apr 18 2018, 3:14 PM · Puppet, Patch-For-Review, Analytics-Kanban
faidon added a comment to T192185: request to assign spare systems as terbium equivalent.

WMF3565 is > 5 years old, so there's really no point in setting up hardware that old right now.

Apr 18 2018, 12:32 AM · Patch-For-Review, hardware-requests, Operations

Apr 17 2018

faidon added a member for acl*procurement-review: LGoto.
Apr 17 2018, 11:46 PM
faidon added a comment to T192280: sda failure in hydrogen.wikimedia.org.

Yup, a replacement is underway as part of T189317 :)

Apr 17 2018, 4:05 PM · ops-eqiad, Traffic, Operations

Apr 16 2018

faidon added a comment to T182163: Update to latest kafkacat.

kafkacat 1.3.1-1~bpo9+1 should be available from Debian's stretch-backports on all stretch hosts:

$ rmadison -a amd64 kafkacat
kafkacat   | 1.3.0-1+b1     | stable            | amd64
kafkacat   | 1.3.1-1~bpo9+1 | stretch-backports | amd64
kafkacat   | 1.3.1-1        | testing           | amd64
kafkacat   | 1.3.1-1        | unstable          | amd64
Apr 16 2018, 4:38 PM · Analytics-Kanban, Patch-For-Review, Analytics, Services (watching)

Apr 13 2018

faidon added a comment to T182163: Update to latest kafkacat.

@Ottomata pinged me last week about that, I guess I hadn't seen this task or forgot about it entirely, sorry about that!

Apr 13 2018, 12:51 PM · Analytics-Kanban, Patch-For-Review, Analytics, Services (watching)

Apr 5 2018

faidon added a comment to T187373: Rebuild raids on labvirt1019 and 1020.

Ah! That's a regular mainboard/SATA controller, so these two drives wouldn't be able to participate in RAID groups. We've done that before I think, at least with Dells, where we had the system drives connected separately.

Apr 5 2018, 2:54 PM · cloud-services-team (Kanban), Operations, Cloud-Services
faidon added a comment to T187373: Rebuild raids on labvirt1019 and 1020.

I don't understand :) Could you clarify which disks are in which slots, and how/where are they connected?

Apr 5 2018, 1:01 PM · cloud-services-team (Kanban), Operations, Cloud-Services
faidon added a comment to T187373: Rebuild raids on labvirt1019 and 1020.

OK, I just saw above that this is a HPE Smart Array P440ar controller. According to the specs, the controller has "Internal: 8 SAS/SATA physical links across 2 x4 ports". So I think each of the ports connects to one of the internal cages (1I and 2I), with each holding 4 disks. That's all normal and according to the specs, and 8 disks is the maximum that this controller can hold. Where are the other two disks located (front/back?), and where are they connected?

Apr 5 2018, 9:17 AM · cloud-services-team (Kanban), Operations, Cloud-Services
faidon raised the priority of T187373: Rebuild raids on labvirt1019 and 1020 from Normal to High.

@Cmjohnson @RobH This has been going on for weeks now, and this is too much of a delay for setting up these systems. I'm elevating this task's priority, let's get to the bottom of this ASAP. A lot of the delays were just on our side, but I see that HPE is delaying this further too; please escalate within HPE and/or with me if you are not getting timely responses.

Apr 5 2018, 9:08 AM · cloud-services-team (Kanban), Operations, Cloud-Services

Apr 4 2018

faidon added a comment to T184564: Plan Puppet 5 upgrade.

In terms of code, what would the changes required be? What are these deprecation warnings that you mentioned above? Are we tracking fixes for these somewhere and are we making sure new ones don't crop up?

Apr 4 2018, 10:17 AM · Puppet, Operations

Apr 3 2018

faidon added a comment to T183937: rack/setup/install labvirt102[12].

So @ayounsi found this: https://help.ubuntu.com/community/Installation/Netboot#Multiple_Network_Interface_Note

This seems to describe our issue. However, I'm uncertain it's worth hacking around it when we can just put in a 10G spot that is free. @faidon advised to move ahead on this install, but that was before we had a potential solution.

Apr 3 2018, 9:08 PM · cloud-services-team (Kanban), Operations
faidon added a comment to T183937: rack/setup/install labvirt102[12].

So there is an issue where trusty expects the OS to be on eth0, and it's on eth3. However, after discussion in IRC, @ayounsi pointed out the new switch in this rack is 10G.

Apr 3 2018, 10:08 AM · cloud-services-team (Kanban), Operations

Mar 28 2018

faidon updated subscribers of T190540: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error.

These seem to be under warranty for another 2 months, so we should hurry up.

Mar 28 2018, 2:31 PM · Traffic, Operations, ops-codfw

Mar 27 2018

faidon reassigned T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs from Eevans to RobH.
Mar 27 2018, 4:23 PM · User-fgiunchedi, Services (blocked), Operations, Cassandra, User-Eevans

Mar 26 2018

faidon added a comment to T190719: Create @wikimedia.org e-mail that just discards things sent to it.

There's no-reply@wikimedia.org that just gets discarded. I'm not sure if it could be a good fit to your purpose though -- wouldn't it be possible to just remove the email address completely from those accounts and/or just disable those entirely where applicable?

Mar 26 2018, 9:44 PM · Operations, Office-IT
faidon added a comment to T183937: rack/setup/install labvirt102[12].

Is there any way we can help? Do you have logs or more information about the "trusty will only image from eth0" that we could perhaps help troubleshoot together?

Mar 26 2018, 11:37 AM · cloud-services-team (Kanban), Operations

Mar 22 2018

faidon added a comment to T190364: eqiad 10G ports needs.

I have only hunches and no data to back any of this, but I think ElasticSearch, Hadoop, WMCS, Backups, plus probably Ganeti and Kafka would be good candidates to go 10G-only. Kubernetes I could see it going either way, depending on the density we'll come up with.

Mar 22 2018, 3:34 PM · Operations, netops

Mar 19 2018

faidon added a comment to T189763: status.wikimedia.org should have an alternative privacy policy.

I don't disagree with any of that (if anything, they're all great ideas), but I'm not sure if we should be spending time on it right now. Revamping our status page and providing a proper status page that reflects our true status, and is also used for short text announcements by humans, is definitely on my radar, and depending on how hiring and onboarding goes, might even happen in the next 18 months or so. Thoughts?

Mar 19 2018, 12:19 PM · monitoring, Operations, Privacy, Security

Mar 17 2018

faidon updated the task description for T185153: attach furud's new arrays (furud-array[3-7]).
Mar 17 2018, 11:28 AM · ops-codfw, Operations

Mar 14 2018

faidon closed T185153: attach furud's new arrays (furud-array[3-7]) as Resolved.

These are now attached and configured, resolving.

Mar 14 2018, 9:03 PM · ops-codfw, Operations
faidon added a comment to T185153: attach furud's new arrays (furud-array[3-7]).

Figured this out with @Papaul on IRC (thanks!).

Mar 14 2018, 5:08 PM · ops-codfw, Operations
faidon reassigned T185153: attach furud's new arrays (furud-array[3-7]) from faidon to Papaul.

I rebooted furud and it is not booting right now, saying:

The total number of enclosures connected to connector 01, has exceeded
the maximum allowable limit of 4 enclosures. Please remove the extra enclosures
and then restart your system.

@RobH could you perhaps help out with the topology here?

Mar 14 2018, 3:13 PM · ops-codfw, Operations

Mar 13 2018

faidon added a comment to T188045: wdqs1004 broken.

So post-mortem, I think there are 4 different things here:

  • T189519: Audit switch ports/descriptions/enable (and do this on an ongoing basis)
  • T189522: Detect IP address collisions
  • General enhancements on our server provisioning and decommissioning pipeline, which has a bunch of long-standing issues, but also requires a more dedicated long-term effort. I'm sure there's one or more tasks related to this, but more broadly, this work stream is something that has been incorporated into our (draft) annual plan as a major item next year.
  • (Tangential) Triage the decom queue more promptly to avoid servers lingering for months after their service decom.
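
As an illustration of what detecting IP address collisions (T189522) could look like, here is a small sketch that flags IPs seen with more than one MAC address. It assumes ARP entries collected as text lines with the MAC in the first column and the IP in the second; the function name and the collection method are hypothetical, not what the task implements.

```python
import ipaddress
from collections import defaultdict


def find_ip_collisions(arp_lines) -> dict:
    """Map each IP seen with more than one MAC address to the sorted MACs."""
    seen = defaultdict(set)
    for line in arp_lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        mac, ip = fields[0].lower(), fields[1]
        try:
            ipaddress.ip_address(ip)  # skip header lines and other non-entries
        except ValueError:
            continue
        seen[ip].add(mac)
    return {ip: sorted(macs) for ip, macs in seen.items() if len(macs) > 1}
```

A single ARP table normally holds one entry per IP, so a check like this is only useful when fed entries gathered over time or from several devices.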
Mar 13 2018, 5:02 PM · netops, Discovery-Wikidata-Query-Service-Sprint, ops-eqiad, Discovery, Wikidata-Query-Service, Wikidata, Operations
faidon reassigned T179042: Setup eqsin RIPE Atlas anchor from faidon to ayounsi.

We're happy to announce that your RIPE Atlas anchor is functioning properly and is now connected to the RIPE Atlas network.

You can see your anchor when logged in to the RIPE Atlas website.

The direct link to the probe page for the anchor is here:
https://atlas.ripe.net/probes/6345/

[…]

Mar 13 2018, 12:01 PM · Patch-For-Review, Traffic, ops-eqsin, netops, Operations

Mar 12 2018

faidon closed Unknown Object (Task), a subtask of T156031: Turn up network links for Asia Cache DC, as Resolved.
Mar 12 2018, 6:47 PM · Operations, Traffic
faidon triaged T189522: Detect IP address collisions as High priority.
Mar 12 2018, 6:40 PM · Patch-For-Review, Operations, netops
faidon added a comment to T189519: Audit switch ports/descriptions/enable.

I just ran into a similar thing today in eqiad with T188045, so I reworded the task to make it generic and for both data centers. I also added a sentence to make sure this doesn't happen again, e.g. by adding an alert, or a Juniper slax script to make sure enabled ports always have a description.
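
The kind of check such an alert would perform can be sketched as follows. The `Port` record and its fields are a hypothetical data shape for the example, not the output of any Juniper tooling.

```python
from dataclasses import dataclass


@dataclass
class Port:
    name: str
    enabled: bool
    description: str = ""


def undocumented_ports(ports) -> list:
    """Return the names of enabled ports that have no description."""
    return [p.name for p in ports if p.enabled and not p.description.strip()]
```

A non-empty result would be the condition to alert on; disabled ports are ignored, since only enabled ports need to carry a description.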

Mar 12 2018, 6:37 PM · ops-eqiad, Operations, netops, ops-codfw
faidon renamed T189519: Audit switch ports/descriptions/enable from audit codfw switch ports/descriptions/enable to Audit switch ports/descriptions/enable.
Mar 12 2018, 6:35 PM · ops-eqiad, Operations, netops, ops-codfw
faidon reassigned T179042: Setup eqsin RIPE Atlas anchor from faidon to ayounsi.

Just heard from RIPE:

I just finished the provisioning of sg-sin-as14907.anchors.atlas.ripe.net and noticed that port 5666 is filtered.
Mar 12 2018, 2:53 PM · Patch-For-Review, Traffic, ops-eqsin, netops, Operations
faidon added a comment to T185153: attach furud's new arrays (furud-array[3-7]).

I'd like all the 5 shelves (array3-7) connected to furud, but not the 2 old ones (array1-2) until further notice. Can we just bypass array1-2 by disconnecting them entirely, and creating a chain with just array3-7?

Mar 12 2018, 2:51 PM · ops-codfw, Operations
faidon raised the priority of T188045: wdqs1004 broken from High to Unbreak Now!.
faidon@re0.cr1-eqiad> show arp no-resolve | match 10.64.0.17 
78:2b:cb:2d:fa:e6 10.64.0.17      ae1.1017                 none
Mar 12 2018, 2:43 PM · netops, Discovery-Wikidata-Query-Service-Sprint, ops-eqiad, Discovery, Wikidata-Query-Service, Wikidata, Operations
faidon reassigned T185153: attach furud's new arrays (furud-array[3-7]) from faidon to Papaul.

Thanks for taking care of this before your trip! I checked this out last week, and it seemed then (and now that I double-checked it) that only three shelves (36 disks) are visible, rather than 5 (array3-7).

Mar 12 2018, 1:20 PM · ops-codfw, Operations
faidon reopened Unknown Object (Task), a subtask of T156031: Turn up network links for Asia Cache DC, as Open.
Mar 12 2018, 11:45 AM · Operations, Traffic

Mar 9 2018

faidon added a comment to T179042: Setup eqsin RIPE Atlas anchor.

That is correct to my knowledge -- that was the case with our other anchors.

Mar 9 2018, 12:35 PM · Patch-For-Review, Traffic, ops-eqsin, netops, Operations

Mar 7 2018

faidon updated the task description for T179042: Setup eqsin RIPE Atlas anchor.
Mar 7 2018, 3:10 PM · Patch-For-Review, Traffic, ops-eqsin, netops, Operations
faidon added a comment to T179042: Setup eqsin RIPE Atlas anchor.

I believe this was blocked until today on an SFP replacement (T188923). It seems that the IP of the Atlas is responding now, and we even receive an SSH banner. So I just submitted the form on the RIPE Atlas panel. Now we're waiting on RIPE before this is fully online:

Thank you for installing the software for your RIPE Atlas anchor!

It may take up to a week to run the tests for your anchor.
We will keep you informed throughout the process of finalising your anchor.

Mar 7 2018, 3:10 PM · Patch-For-Review, Traffic, ops-eqsin, netops, Operations
faidon added a comment to T189065: Outbound mail from Greenhouse is broken.

This has been discussed in bigger requests a couple of times before (T103893, T84201) for Greenhouse specifically, plus a bunch of other times for other third-party services. The TL;DR is that we don't really like whitelisting in SPF/DKIM/DMARC for wikimedia.org for all of the third-party services that we use, because that opens up attack vectors like email spoofing and CEO fraud to entities that we do not control and whose security we are unable to vet. The alternative we had proposed before was to use a separate subdomain (careers.wikimedia.org). It's still non-ideal, but it's better than allowing them and others like them to send emails as <insert ED name>@wikimedia.org for instance.

Mar 7 2018, 2:19 PM · User-herron, Patch-For-Review, DNS, Operations, Mail
Restricted Application added a project to T103893: DNS Change for GreenHouse: Traffic.
Mar 7 2018, 2:19 PM · Traffic, Operations, Mail, DNS

Mar 2 2018

faidon closed Unknown Object (Task), a subtask of T156031: Turn up network links for Asia Cache DC, as Resolved.
Mar 2 2018, 6:54 PM · Operations, Traffic
faidon reassigned T185153: attach furud's new arrays (furud-array[3-7]) from faidon to Papaul.

Let's keep the existing arrays (array1 & array2) offline, and just connect all of the new ones.

Mar 2 2018, 6:44 PM · ops-codfw, Operations
faidon added a comment to T187994: netfilter software at WMF: iptables vs nftables.

To answer my own earlier question: I was looking at nftables' wiki about the supported features compared to xtables and the updates to the Linux kernel per version. Several systems (mostly WMCS) are still using trusty and Linux 3.13, which is really the first release of nftables and with multiple pretty basic features missing (e.g. REJECT, MASQUERADE etc.). Our latest and greatest right now is 4.9, and even that is apparently missing NOTRACK (added in 4.10), which is something we're using in a few places (e.g. DNSes).

Mar 2 2018, 3:21 PM · Operations

Mar 1 2018

faidon added a comment to T187994: netfilter software at WMF: iptables vs nftables.

First, I don't think we should be thinking in terms of "using software from the 90s", at least not for something that is still as widely used and well-maintained as iptables (and to something that is as seldom used as nftables). This is not something we should judge software by; we can talk instead in terms of number of bugs, maintainability, upstream response times, when the last release was, if/when it was deprecated by upstream(s) etc.

Mar 1 2018, 3:32 PM · Operations

Feb 28 2018

faidon added a comment to T187994: netfilter software at WMF: iptables vs nftables.

I don't think it's easy for anyone to calculate the amount of effort required for this, but the stated 1-2 year long migration sounds longer than I thought and... pretty scary. I'd like to at least be conscious of the amount of effort required here, and foresee clear, tangible benefits at the end of the line to be able to justify the effort both for the migration itself, plus all the associated risks, learning curve and confusion in the meantime.

Feb 28 2018, 3:23 AM · Operations

Feb 26 2018

faidon reassigned T183937: rack/setup/install labvirt102[12] from Cmjohnson to RobH.
Feb 26 2018, 7:03 PM · cloud-services-team (Kanban), Operations