faidon (Faidon Liambotis)
SRE

User Details

User Since
Oct 7 2014, 10:21 AM (206 w, 6 d)
Availability
Available
IRC Nick
paravoid
LDAP User
Faidon Liambotis
MediaWiki User
Faidon Liambotis (WMF) [ Global Accounts ]

Recent Activity

Today

faidon added a comment to T199374: Delegate public IP range for Eqiad1-r OpenStack deployment to designate (neutron).

I just edited the 56.15.185.in-addr.arpa object in the RIPE database to point nameservers directly to labs-ns0/1. This should work now, no need for classful classless delegation :)
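
As a rough illustration of why a whole /24 needs no RFC 2317-style classless delegation: a /24 falls on an octet boundary, so it maps onto exactly one in-addr.arpa zone. The sketch below derives that zone name; the helper name `reverse_zone_for_slash24` is made up for this example, and the assumption that the object corresponds to 185.15.56.0/24 is inferred from the zone name itself.

```python
import ipaddress


def reverse_zone_for_slash24(cidr: str) -> str:
    """Return the in-addr.arpa zone name covering an IPv4 /24 (hypothetical helper)."""
    net = ipaddress.ip_network(cidr)
    if net.version != 4 or net.prefixlen != 24:
        raise ValueError("this sketch only handles IPv4 /24 networks")
    # Drop the host octet, then reverse the remaining three octets.
    first_three = str(net.network_address).split(".")[:3]
    return ".".join(reversed(first_three)) + ".in-addr.arpa"


print(reverse_zone_for_slash24("185.15.56.0/24"))  # 56.15.185.in-addr.arpa
```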

Mon, Sep 24, 9:00 PM · Patch-For-Review, Cloud-Services

Yesterday

faidon created P7582 Racktables labels categorized.
Sun, Sep 23, 12:55 PM

Thu, Sep 13

faidon added a comment to T89584: Enable TRIM for SSDs for Cassandra software raid.

As I mentioned above in my second-to-last update, they are blacklisted for queued TRIM which is suboptimal of course. However, the data corruption issues with synchronous TRIM have been long resolved -- they were already back in 2016, they certainly seem to be in the kernels we're running with now.

Thu, Sep 13, 8:46 PM · Operations

Wed, Sep 12

faidon added a project to T163438: VisualEditor broken on wikitech when codfw is primary: "Error loading data from server: apierror-visualeditor-docserver-http: HTTP 500.": Operations.
Wed, Sep 12, 3:11 PM · Operations, Patch-For-Review, Datacenter-Switchover-2018, Parsing-Team, codfw-rollout, Cloud-Services

Mon, Sep 10

faidon reopened T196886: Replace wtp1043's sda as "Open".

We're still getting RAID alerts about this host.

Mon, Sep 10, 9:37 AM · Parsing-Team, DC-Ops, ops-eqiad, Operations

Sat, Sep 8

herron awarded T203883: Implement MTA-STS a Like token.
Sat, Sep 8, 5:01 PM · Mail, Operations
faidon triaged T203883: Implement MTA-STS as Normal priority.
Sat, Sep 8, 4:28 PM · Mail, Operations

Fri, Sep 7

faidon added a comment to T201139: Intermittent connectivity issues in eqiad's row C.

Has anything happened on this? IIRC at our meetings we talked about investigating this further e.g. with the help of JTAC, and exploring whether we should disable the JunOS' DDoS protection.

Fri, Sep 7, 2:16 PM · Operations, netops

Thu, Sep 6

faidon moved T184230: Disavow emails from wikipedia.com from Backlog to Up Next on the Mail board.
Thu, Sep 6, 11:59 PM · Patch-For-Review, Operations, Mail

Wed, Sep 5

faidon added a member for WMF-NDA: EBjune.
Wed, Sep 5, 11:53 PM
faidon added a project to T203108: Create keyholder gerrit repo: Operations.
Wed, Sep 5, 5:54 PM · User-MModell, Operations, Release-Engineering-Team (Kanban)
faidon added a comment to T203108: Create keyholder gerrit repo.

OK, I created a Gerrit repo under operations/software/keyholder and imported the existing history with:

git clone ~/wikimedia/puppet/ keyholder
git remote rm origin
git branch -m master
Wed, Sep 5, 5:53 PM · User-MModell, Operations, Release-Engineering-Team (Kanban)
faidon added a comment to T203290: syncing Ubuntu mirror fail.

Thanks for tracking that down @Dzahn!

Wed, Sep 5, 11:28 AM · Operations

Mon, Sep 3

faidon added a comment to T203108: Create keyholder gerrit repo.

Oh also, would it be possible to keep the (operations/puppet) history such as commit messages etc.? git filter-branch etc. should make this possible right?

I don't think there's much in the Diffusion fork that's useful right? Maybe the README etc. that we could cherry-pick or commit separately?

Mon, Sep 3, 4:11 PM · User-MModell, Operations, Release-Engineering-Team (Kanban)

Fri, Aug 31

faidon triaged T203261: cr2-eqdfw (MX204) vhclient log noise as Normal priority.
Fri, Aug 31, 4:10 PM · netops, Operations
faidon moved T203260: Outdated TLS config for MXes from Backlog to Up Next on the Mail board.
Fri, Aug 31, 3:46 PM · Patch-For-Review, User-herron, Mail, Operations
faidon triaged T203260: Outdated TLS config for MXes as Normal priority.
Fri, Aug 31, 3:46 PM · Patch-For-Review, User-herron, Mail, Operations
faidon moved T41785: Create a Cloud VPS SMTP smarthost from Backlog to Up Next on the Mail board.
Fri, Aug 31, 3:39 PM · Operations, Cloud-Services, Mail
faidon moved T166291: Exim panics when spamd reaches maxchildren from Backlog to Up Next on the Mail board.
Fri, Aug 31, 3:39 PM · Mail, Operations

Thu, Aug 30

faidon removed a project from T203172: wikimediafoundation.org in deutsch shows 'suspended or shutdown': Operations.

Thanks Chase, but I'm afraid we don't have anything to do with wikimediafoundation.org's operations -- it's completely out of our control. The other tag is correct though and it may get the attention of the website's operators.

Thu, Aug 30, 3:55 PM · wikimediafoundation.org
faidon added a comment to T203108: Create keyholder gerrit repo.

Oh also, would it be possible to keep the (operations/puppet) history such as commit messages etc.? git filter-branch etc. should make this possible right?

Thu, Aug 30, 12:11 PM · User-MModell, Operations, Release-Engineering-Team (Kanban)
faidon added a comment to T203108: Create keyholder gerrit repo.

My vote is under operations/software, if not under some non-operations hierarchy.

Thu, Aug 30, 12:08 PM · User-MModell, Operations, Release-Engineering-Team (Kanban)

Tue, Aug 28

faidon added a comment to T203003: Keyholder phab repo duplicate work.

Thanks for filing this! I lost about an hour debugging and (re-)fixing the above issue today, so +1 to everything you said :)

Tue, Aug 28, 3:51 PM · Release-Engineering-Team, Operations
faidon added a comment to T202952: rancid pubkey auth to Junos 17.4 failure.

This was logged every time a login was attempted, in netmon1002's /var/log/auth.log with this:
Aug 28 00:08:07 netmon1002 /ssh-agent-proxy[12127]: [<class '__main__.SshAgentProtocolError'>] SSH2_AGENTC_SIGN_REQUEST: Bad flags 0x4
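
For readers unfamiliar with the flags field in that log line, here is a minimal sketch that decodes it. The two flag constants come from the SSH agent protocol draft (draft-miller-ssh-agent); the function name `decode_sign_flags` is hypothetical. Reading 0x4 as a request for an RSA SHA2-512 signature that ssh-agent-proxy did not recognise is an interpretation, not something the comment states.

```python
from typing import List

# Flag bits for SSH2_AGENTC_SIGN_REQUEST, per the SSH agent protocol draft.
SSH_AGENT_RSA_SHA2_256 = 0x02
SSH_AGENT_RSA_SHA2_512 = 0x04

KNOWN_FLAGS = {
    SSH_AGENT_RSA_SHA2_256: "SSH_AGENT_RSA_SHA2_256",
    SSH_AGENT_RSA_SHA2_512: "SSH_AGENT_RSA_SHA2_512",
}


def decode_sign_flags(flags: int) -> List[str]:
    """Return the names of the flag bits set in a sign request (hypothetical helper)."""
    names = [name for bit, name in KNOWN_FLAGS.items() if flags & bit]
    # Report any bits that are set but not listed in KNOWN_FLAGS.
    unknown = flags & ~sum(KNOWN_FLAGS)
    if unknown:
        names.append(f"unknown(0x{unknown:x})")
    return names


print(decode_sign_flags(0x4))  # ['SSH_AGENT_RSA_SHA2_512']
```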

Tue, Aug 28, 12:14 PM · Patch-For-Review, Operations, netops

Aug 24 2018

faidon raised the priority of T199125: rack/setup/install cloudvirt102[34] from Normal to High.
Aug 24 2018, 2:21 PM · Patch-For-Review, cloud-services-team (Kanban), ops-eqiad, Cloud-VPS, Operations

Aug 23 2018

faidon reassigned T199125: rack/setup/install cloudvirt102[34] from RobH to Muehlenhoff.
Aug 23 2018, 12:02 PM · Patch-For-Review, cloud-services-team (Kanban), ops-eqiad, Cloud-VPS, Operations

Aug 21 2018

faidon assigned T202329: SRE query: Is it possible to measure how many e-mails are sent to "black hole" e-mail addresses? to herron.
Aug 21 2018, 10:46 PM · Product-Analytics, User-herron, Mail, Operations, Notifications, Growth-Team

Aug 16 2018

faidon added a comment to T196886: Replace wtp1043's sda.

@RobH, @Cmjohnson, this has been open for two months now -- why is this taking such a long time to resolve?

Aug 16 2018, 10:10 AM · Parsing-Team, DC-Ops, ops-eqiad, Operations

Aug 3 2018

faidon renamed T201139: Intermittent connectivity issues in eqiad's row C from Intermitent connectivity issues between eqiad servers? to Intermitent connectivity issues in eqiad's row C.
Aug 3 2018, 10:35 AM · Operations, netops
faidon added a comment to T201139: Intermittent connectivity issues in eqiad's row C.

@Joe gave more timestamps from etcd logs on IRC:

  • Aug 2 13:59:52
  • Aug 3 01:19-01:20
  • Aug 3 01:28-01:29
  • Aug 3 01:50-01:51
  • Aug 3 02:06-02:07

These are potentially network partition hiccups/events. These seem to correlate well with the other events (dbproxy/db etc.) listed here.

Aug 3 2018, 10:34 AM · Operations, netops
faidon updated subscribers of T201039: connectivity issues between several hosts on asw2-b-eqiad.

So... what's the status of this? What else has been observed, what has been done to troubleshoot and what's the latest from Juniper? I tried to access the Juniper case for more insight, but unfortunately I don't seem to have the right permissions to access this case (unrelated to this task and low-prio, but perhaps @ayounsi or @RobH can work with Juniper to figure out why?)

Aug 3 2018, 10:29 AM · Operations, netops
faidon added a comment to T196685: rack/setup/install rdb10[09|10].eqiad.wmnet.

I'm investigating unrelated issues in asw2-b-eqiad and this port is flapping (probably boot-looping into PXE), so I disabled it. @RobH, feel free to un-disable when you're about to install.

Aug 3 2018, 9:35 AM · User-Joe, User-Elukey, Operations
faidon added a comment to T199125: rack/setup/install cloudvirt102[34].

I'm investigating unrelated issues in asw2-b-eqiad and these ports are flapping (probably boot-looping into PXE), so I disabled them. @RobH, feel free to un-disable when you're about to install them.

Aug 3 2018, 9:35 AM · Patch-For-Review, cloud-services-team (Kanban), ops-eqiad, Cloud-VPS, Operations
faidon triaged T201149: cr1/2-eqiad PFE_FW_SYSLOG_IP6_GEN log entries as High priority.
Aug 3 2018, 9:15 AM · Operations, netops
faidon triaged T201148: dbproxy1006 iDRAC IP conflict as High priority.
Aug 3 2018, 9:10 AM · Operations, ops-eqiad
faidon triaged T201145: asw2-a-eqiad FPC5 gets disconnected every 10 minutes as High priority.
Aug 3 2018, 9:04 AM · Wikimedia-Incident, Operations, netops
faidon moved T98006: Anycast (Auth)DNS from Configuration to Troubleshooting on the netops board.
Aug 3 2018, 1:57 AM · Patch-For-Review, netops, Operations, Traffic

Jul 25 2018

faidon added a project to T200338: Address mass overload errors in ORES (July 2018, UW origin): Operations.
Jul 25 2018, 3:31 PM · Operations, ORES, Scoring-platform-team (Current)

Jul 20 2018

faidon added a comment to T197242: Transition citoid to use Zotero's translation-server-v2.

Is there any progress and/or timeline for this? Thanks!

Jul 20 2018, 2:10 PM · VisualEditor (Current work), Patch-For-Review, Citoid, Services (watching), Operations

Jul 19 2018

Framawiki awarded T199816: Sunset Watchmouse's status.wikimedia.org a Love token.
Jul 19 2018, 10:43 PM · User-fgiunchedi, monitoring, Patch-For-Review, Operations

Jul 18 2018

faidon added a comment to T196886: Replace wtp1043's sda.

What's going on with this?

Jul 18 2018, 11:40 AM · Parsing-Team, DC-Ops, ops-eqiad, Operations
faidon added a comment to T143367: Users getting logged-out during minor network glitches.

It's been a couple of years since I filed this and I don't remember much since, so unfortunately I don't have any more insight at this point. These kinds of widespread network events are very rare and there have been no such outages recently, I'm afraid. We could figure out ways to simulate them from e.g. mwdebug, although I doubt that anyone has the time to investigate this in such depth, so I don't particularly disagree with resolving this task instead.

Jul 18 2018, 1:47 AM · Performance-Team, Availability (MediaWiki-MultiDC), MediaWiki-Authentication-and-authorization

Jul 17 2018

faidon added a project to T199816: Sunset Watchmouse's status.wikimedia.org: monitoring.
Jul 17 2018, 4:28 PM · User-fgiunchedi, monitoring, Patch-For-Review, Operations
faidon added a comment to T115945: status.wikimedia.org should not load Google Analytics.

I filed T199816 for removing that page, we can follow up on that and if implemented, resolve this task and its parent.

Jul 17 2018, 4:27 PM · Security-Core, Operations, Privacy, monitoring
faidon triaged T199816: Sunset Watchmouse's status.wikimedia.org as Normal priority.
Jul 17 2018, 4:26 PM · User-fgiunchedi, monitoring, Patch-For-Review, Operations

Jul 11 2018

faidon reopened Unknown Object (Task), a subtask of T196485: WDQS diskspace is low, as Open.
Jul 11 2018, 1:24 PM · Discovery, Operations, Wikidata-Query-Service, Wikidata

Jul 10 2018

faidon added a comment to T198939: Decommission servermon.

I'm using servermon for fact queries regularly, but I think I'm one of the very few :) I admit I haven't played around much with puppetboard to adjust my use cases, so that may be something that could potentially work (with the caveats that Riccardo mentioned above, however).

Jul 10 2018, 6:03 PM · Patch-For-Review, Operations
faidon triaged T199251: furud: disconnect and power down all disk shelves as High priority.
Jul 10 2018, 5:06 PM · ops-codfw, Operations, DC-Ops

Jul 4 2018

faidon added a comment to T185171: replace mr1-eqiad.

There seems to be another step missing: Racktables seems inconsistent. The new one is listed as "new-mr1-eqiad", while the old one as "mr1-eqiad". Can someone fix that?

Jul 4 2018, 4:07 PM · ops-eqiad, Operations, netops

Jun 27 2018

faidon triaged T198344: Get Papaul access to network equipment as Normal priority.
Jun 27 2018, 6:23 PM · SRE-Access-Requests, netops, Operations
faidon closed T197857: Add @pmiazga @Niedzielski and @phuedx to the deploy-service group, a subtask of T186748: New service request: chromium-render/deploy, as Resolved.
Jun 27 2018, 3:58 PM · User-notice, Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q1), Patch-For-Review, Readers-Web-Kanbanana-Board-Old, Services (blocked), Service-deployment-requests, Proton, Operations, Electron-PDFs
faidon closed T197857: Add @pmiazga @Niedzielski and @phuedx to the deploy-service group as Resolved.

Sure, that's fine :)

Jun 27 2018, 3:58 PM · Proton, Operations, SRE-Access-Requests

Jun 25 2018

faidon added a comment to T197237: Requesting access for mbsantos.

Yes, let's not block this for yet another week! Consider this approved, please go ahead.

Jun 25 2018, 9:48 AM · Patch-For-Review, Analytics, Operations, SRE-Access-Requests

Jun 14 2018

faidon added a comment to T197173: Ship MX logs to ELK.

Our email logs can be pretty sensitive, especially since they include our corporate emails passing through (senders, recipients, timestamps etc.).

Jun 14 2018, 3:52 PM · User-herron, Wikimedia-Logstash, Mail, Operations
faidon added a comment to T197169: 10G ports seem not to work on new HP hardware.

So for at least labvirt1019 it was indeed about PXE not working (the card worked under Linux) and that was due to a BIOS misconfiguration (the "network boot" option for the card set to disabled). T194964#4283034 has more details and troubleshooting steps.

Jun 14 2018, 3:27 PM · Cloud-Services, Operations
faidon added a comment to T194964: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020.

OK, I managed to get this server to boot from its 10G interfaces. The issue was fairly straightforward to resolve ("network boot" was set to "disabled" for the 10G ports and only set to "network boot" for the first 1G port), but here are the steps I took to troubleshoot for future reference:

  • Live-hacked install1002 to update the DHCP config with the 10G port's MAC address, as this was still pointing to the 1G interface.
  • Attempted to boot with "network boot" (ESC-@ I think) and verified that I couldn't, as I was getting "media check failed" from the Broadcom PXE menu. I was running tcpdump -i any port 67 or port 68 on install1002 simultaneously to grab DHCP requests, but we didn't get that far, as the PXE option ROM wasn't even attempting to do DHCP. This suggested that either the card or cable wasn't working or, more likely, that this was the option ROM for a different interface, e.g. one of the 1G ones.
  • Booted into the previously installed system (running Debian) from the console and verified that the port works in Linux. I did that by setting the interface (eno49) as up, then checking the switch on the other end (asw2-b-eqiad:xe-4/0/16) with show interfaces description and show configuration interfaces xe-4/0/16 | display inheritance and verifying that it sees the link as "up up", and that the config is correct. Then I ran ethtool on the system itself, and verified that it sees the link as negotiated/up and with the right speed. Finally, I ran dhclient eno49 there and it worked and got an IP assigned. By all that I verified that both the card and the cable actually work and that the network configuration is correct, and thus the issues were just about PXE.
  • Rebooted and then entered the system config. In the BIOS/Platform config (RBSU) and the PCI interface, I disabled the 4x1G card (Embedded LOM). This is not actually required, but it made things a bit easier to debug as I could figure out e.g. whether the PXE prompt you get is from the 1G card or the 10G card.
  • In the 10G card's configuration, I disabled "HP Shared Memory", per T167299, although I'm not sure if this is actually required anymore. From that task, it sounds like it would affect the network past the PXE stage and in the installer, but I had verified that it works in Linux, so that was probably not needed (but we also don't use these features as far as I know). I also disabled SR-IOV for good measure since we don't use it, although I doubt it would affect this.
  • In the BIOS/Platform config (RBSU), under Network Options > Network Boot Options, the option "Embedded FlexibleLOM 1 Port 1" was set to "Disabled". I set that to "Network boot". This is certainly related and likely the entire cause of this issue.
  • After enabling, you immediately get a warning that says "Important: When enabling network boot support for an Embedded FlexibleLOM embedded NIC, the NIC boot option does not appear in the UEFI Boot Order or Legacy IPL lists until the next system reboot.". So I just did a server reboot after that (easy).
  • After that, I booted normally, hit ESC-@ for network boot and was presented with a PXE prompt; from there on, network boot worked, d-i started loading and also acquired an IP and the preseed configuration. It stopped with an error at a partman prompt (likely because of a misconfigured partman profile, unrelated to all this).
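
The Linux-side link check from the steps above could be scripted roughly as follows. This is only a sketch: `link_status` is a hypothetical helper, and the sample text is an abbreviated, made-up stand-in for real `ethtool eno49` output rather than output captured from labvirt1019.

```python
import re


def link_status(ethtool_output: str) -> dict:
    """Extract negotiated speed and link state from `ethtool <iface>` output."""
    speed = re.search(r"^\s*Speed:\s*(\S+)", ethtool_output, re.MULTILINE)
    link = re.search(r"^\s*Link detected:\s*(\S+)", ethtool_output, re.MULTILINE)
    return {
        "speed": speed.group(1) if speed else None,
        "link_up": bool(link) and link.group(1) == "yes",
    }


# Abbreviated, illustrative sample (not real output from the host).
SAMPLE = """\
Settings for eno49:
\tSpeed: 10000Mb/s
\tDuplex: Full
\tLink detected: yes
"""

print(link_status(SAMPLE))
```

On a real host the text would come from running ethtool itself, e.g. via `subprocess.run(["ethtool", "eno49"], capture_output=True, text=True).stdout`.
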
Jun 14 2018, 3:23 PM · Cloud-Services, Patch-For-Review, ops-eqiad, Operations

Jun 13 2018

faidon added a comment to T197169: 10G ports seem not to work on new HP hardware.

What are the symptoms?

Jun 13 2018, 9:18 PM · Cloud-Services, Operations
faidon added a comment to T194964: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020.

@Cmjohnson I'm afraid I don't understand fully what steps you've taken on which server, port or switch. So perhaps let's look at the current status: could you describe where each of labvirt1019's and labvirt1020's ports are connected to, and specifically to which ports on the switch and with what kind of cable? Thanks!

Jun 13 2018, 9:16 PM · Cloud-Services, Patch-For-Review, ops-eqiad, Operations
Tgr awarded T170150: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible a Like token.
Jun 13 2018, 4:28 PM · Patch-For-Review, monitoring, Operations

Jun 10 2018

Gerrit Code Review <gerrit@wikimedia.org> committed rOSNBa6cc3323c8e9: Final NoteDb migration updates (authored by faidon).
Final NoteDb migration updates
Jun 10 2018, 2:50 AM
Gerrit Code Review <gerrit@wikimedia.org> committed rOSNB8689c87c5105: Create change (authored by faidon).
Create change
Jun 10 2018, 2:50 AM

Jun 8 2018

faidon raised the priority of T193655: rack/setup/install cloudstore1008 & cloudstore1009 from Normal to High.
Jun 8 2018, 2:35 PM · cloud-services-team (Kanban), Patch-For-Review, ops-eqiad, Cloud-VPS, Operations
faidon added a comment to T196651: rack upgraded storage capacity in labstore100[67].eqiad.wmnet.

@Cmjohnson regarding flerovium, sure, no problem, go ahead. (The others would need coordination with their respective service owners)

Jun 8 2018, 2:26 PM · Patch-For-Review, Datasets-General-or-Unknown, ops-eqiad, Cloud-VPS, Operations

Jun 7 2018

faidon added a comment to T175361: Upgrade mx1001/mx2001 to stretch.

So this backfired, but thankfully the fix was as simple as starting exim :) Good thinking @herron!

Jun 7 2018, 2:15 AM · User-herron, Patch-For-Review, Operations, Mail
faidon closed T196598: Phab and Gerrit emails stopped at around 1900 UTC 6th June as Resolved.

The cause was the prep for T175361, in combination with a couple of unexpected misconfigurations/SPOFs, given it's been years since the switchover from mx1001->mx2001 has been tested.

Jun 7 2018, 2:11 AM · Phabricator, Gerrit, Mail, Operations

Jun 6 2018

faidon added a comment to T185171: replace mr1-eqiad.

It's been a few months now, what's the status of this?

Jun 6 2018, 2:27 PM · ops-eqiad, Operations, netops

Jun 5 2018

faidon added a comment to T187194: zotero translation server: code stewardship request.

So we need to do something in a very short amount of time (~two months) -- does anyone have a game plan? @Jrbranaa what's the latest?

Jun 5 2018, 10:58 AM · User-Ryasmeen, VisualEditor, Citoid, Services (watching), Operations, Code-Stewardship-Reviews

Jun 4 2018

faidon added a comment to T196072: Analytics hosts missing in Inventory/Refresh.

We have a number of spreadsheets tracking inventory, refreshes, CapEx budgets etc. Which one are you referring to specifically (doc & sheet)?

Jun 4 2018, 12:30 PM · procurement, Operations, Analytics, DC-Ops

May 25 2018

faidon added a comment to T175361: Upgrade mx1001/mx2001 to stretch.

mx2001 has been running Stretch for a few days and has been stable. I think we're in good shape to move on to mx1001. However, there are a few configs with mx1001 hardcoded as the smtp server in the puppet repo. I'll work on removing those to simplify the depool process before rebuilding mx1001.

May 25 2018, 1:16 PM · User-herron, Patch-For-Review, Operations, Mail

May 23 2018

faidon added a comment to T193496: Allocate public v4 IPs for Neutron setup in eqiad.

OK, so it looks like 185.15.56.0/24 is proposed to be used immediately in eqiad, to replace 208.80.155.128/25 in the next ~6 months. Additionally, 185.15.57.0/24 is proposed to be reserved (but not assigned) to be used tentatively in Q3 FY18-19 in codfw, for a region 2 deployment. Both of these sound good to me and you can proceed :)
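
To put numbers on the ranges mentioned above, a small sketch using Python's standard `ipaddress` module: it compares the sizes of the old /25 and the new /24, and checks whether the two /24s happen to be adjacent. The variable names are illustrative only.

```python
import ipaddress

old = ipaddress.ip_network("208.80.155.128/25")
new = ipaddress.ip_network("185.15.56.0/24")
reserved = ipaddress.ip_network("185.15.57.0/24")

# The /24 holds twice as many addresses as the /25 it replaces.
print(old.num_addresses, new.num_addresses)  # 128 256

# The two /24s are contiguous and aligned, so they collapse into a single /23.
merged = list(ipaddress.collapse_addresses([new, reserved]))
print(merged)  # [IPv4Network('185.15.56.0/23')]
```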

May 23 2018, 12:28 PM · Cloud-Services, netops, Operations

May 21 2018

faidon added a comment to T193394: Degraded RAID on wasat.

The RAID still shows as degraded -- @RobH -or someone else- could you have a look? Thanks!

May 21 2018, 8:17 AM · Operations, ops-codfw

May 18 2018

faidon committed rOSNBf3c3f6d18405: Edit Project Config (authored by faidon).
Edit Project Config
May 18 2018, 7:56 PM
faidon committed rOSNB6a1caa720b90: Add Wikimedia's initial data (authored by faidon).
Add Wikimedia's initial data
May 18 2018, 7:56 PM
faidon committed rOSNB07a7facdf44d: Allow custom fields in the Device CSV form (authored by faidon).
Allow custom fields in the Device CSV form
May 18 2018, 7:56 PM

May 17 2018

faidon closed T194798: furud: disconnect furud-array[3-7]; connect furud-array[1-2] as Resolved.

Confirmed, thanks @Papaul!

May 17 2018, 3:15 PM · Operations, ops-codfw

May 16 2018

faidon added a comment to T193496: Allocate public v4 IPs for Neutron setup in eqiad.

The /25 -> /24 renumbering seems fairly straightforward, but given a) IPv4's depletion (we effectively cannot get more IPv4 space from any of the RIRs), b) the Neutron redesign and c) Cloud Services' growth and needs like T122406's, I think it's worthwhile to look at it a bit more broadly in order to make sure we avoid e.g. depletion or fragmentation of our IP space. Perhaps for instance we need to be looking at a larger assignment :)

May 16 2018, 12:42 PM · Cloud-Services, netops, Operations

May 15 2018

faidon closed T194796: replace/reinstall radium with a stretch system as Declined.

radium is super old hardware (2011 era) and its refresh is imminent, as part of T189317. No reason to spend time to reimage at this point :)

May 15 2018, 8:31 PM · Operations
faidon triaged T194798: furud: disconnect furud-array[3-7]; connect furud-array[1-2] as High priority.
May 15 2018, 8:29 PM · Operations, ops-codfw

May 8 2018

faidon added a comment to T192893: Access to Google Search Console for Go Fish Digital.

Thanks @Deskana :) I think that all seems sufficient and we should just go ahead with this. 2018-08-01 sounds reasonable, and we can always extend this if there's a need.

May 8 2018, 11:27 AM · SRE-Access-Requests, Operations

May 7 2018

faidon added a comment to T192893: Access to Google Search Console for Go Fish Digital.

I'm not sure if this needs my approval, but if it does, it has it, as long as:

  • The console data contain PII, so an NDA would be absolutely required with whomever we'd need to give access to this. Presumably this company is under a contract with us and that probably includes a confidentiality clause? @Deskana, can you confirm?
  • Without knowing much about this, this sounds like a one-off project, that has a start and an end date -- is that right? If so, we should make sure to revoke access to that account when the project is over (and especially if the contract, alongside its confidentiality clause, expires). We have an "expiration date" field for shell accounts, so we could do something similar here.
May 7 2018, 8:35 AM · SRE-Access-Requests, Operations

Apr 26 2018

faidon added a comment to T136732: Puppetize job that saves old versions of Maxmind geoIP database.

As far as periodicity goes, note that MaxMind states that GeoIP2 Country and City are updated every Tuesday and the rest every 1-4 weeks, so a weekly cronjob every Wednesday sounds like it would do the trick.

Apr 26 2018, 11:22 AM · Puppet, Patch-For-Review, Analytics-Kanban

Apr 25 2018

faidon added a comment to T187194: zotero translation server: code stewardship request.

@danstillman this is very useful information (and good news!), thank you for the detailed update! It still seems like the options are either running a Docker image which embeds custom builds of Firefox and Node.js though, which comes with certain maintenance challenges.

Apr 25 2018, 2:10 PM · User-Ryasmeen, VisualEditor, Citoid, Services (watching), Operations, Code-Stewardship-Reviews
faidon reopened T192551: atop on stretch overloading a host as "Open".

My two cents:

  • I don't see this hiera knob used anywhere in the tree right now; has anyone expressed interest in using it in its current state, especially when the stability of the system is potentially at risk? I personally doubt it'll be very useful and it's yet another thing that we'll have parameterized (in the humongous base class with dozens of parameters no less). As a general rule, I think we should be avoiding adding hiera knobs unless there's a very good reason for it (including at least an existing user in the tree!) and rely on sane defaults and/or other properties of the systems via facter.
  • Right now setting profile::base::atop_enabled will still result in different results in jessie and stretch hosts given the -R difference upstream, so this still comes with the potential minefield that resulted in this task. I can e.g. imagine a new hire that isn't aware of this task enabling this knob in a year on a jessie host, then a month later reimaging the host as stretch and scratching their heads :)
  • Installing the atop package while disabling the cron job isn't going to be particularly useful: atop's value proposition is its recording function; tools like top and htop are at least equally good or superior in the runtime/realtime stuff.
Apr 25 2018, 1:30 PM · Upstream, Patch-For-Review, monitoring, Operations

Apr 24 2018

faidon updated subscribers of T187194: zotero translation server: code stewardship request.

So, this task has been open for a couple of months now, with the underlying issues having been present for far longer than that. In case it wasn't clear from the lengthy and detailed task description, there are currently two deadlines here:

  • Firefox 52 ESR (which this is indirectly based on) EOLs in August 2018.
  • Ubuntu 14.04 trusty EOLs in April 2019.

These are externally set, and affect security support among other things, so they're unfortunately hard deadlines.

Apr 24 2018, 1:35 PM · User-Ryasmeen, VisualEditor, Citoid, Services (watching), Operations, Code-Stewardship-Reviews
faidon added a comment to T184293: rack/setup/install lvs101[3-6].

@Cmjohnson I think @BBlack's question above was for you -- task description seems to point at a few of the steps on your side being still pending at least.

Apr 24 2018, 1:03 PM · Patch-For-Review, ops-eqiad, Operations, Traffic
faidon added a comment to T192185: request to assign spare systems as terbium equivalent.

Let's just use both of them to also set up the stand-in that you mentioned above?

Apr 24 2018, 10:49 AM · Patch-For-Review, hardware-requests, Operations

Apr 23 2018

faidon added a comment to T190323: Implement BGP graceful shutdown.

Easy enough, +1 :) Maybe add a /* comment */ linking to the NLNOG filter guide?

Apr 23 2018, 1:30 PM · Operations, netops
faidon added a comment to T190317: Update BGP_sanitize_in filter.

I took a careful look at this -- it looks pretty good, but I'd suggest rolling it out slowly in phases just to be on the safe side. That could be separate phases for either the three different things it does (prefix length, bogon ASNs, long AS paths), the sites/BGP groups it's applied in, or both.
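
The three checks named above (prefix length, bogon ASNs, long AS paths) can be sketched as below. This is not the actual BGP_sanitize_in policy, which is Junos configuration and is not shown here. The thresholds (/24, /48, 100 hops) are assumptions modelled on common filtering practice, and `rejection_reasons` is a hypothetical helper. The ASN ranges are the reserved, private, and documentation ranges from the IANA registry.

```python
import ipaddress
from typing import List

# Assumed thresholds, not taken from the task.
MAX_V4_PREFIXLEN = 24
MAX_V6_PREFIXLEN = 48
MAX_AS_PATH_LEN = 100

# Inclusive ASN ranges that should not appear in Internet routing.
BOGON_ASN_RANGES = [
    (0, 0),                    # reserved
    (23456, 23456),            # AS_TRANS
    (64496, 64511),            # documentation
    (64512, 65534),            # private use
    (65535, 65535),            # reserved
    (65536, 65551),            # documentation
    (65552, 131071),           # reserved
    (4200000000, 4294967294),  # private use
    (4294967295, 4294967295),  # reserved
]


def is_bogon_asn(asn: int) -> bool:
    return any(low <= asn <= high for low, high in BOGON_ASN_RANGES)


def rejection_reasons(prefix: str, as_path: List[int]) -> List[str]:
    """Return the reasons a route would be rejected; an empty list means accept."""
    reasons = []
    net = ipaddress.ip_network(prefix)
    limit = MAX_V4_PREFIXLEN if net.version == 4 else MAX_V6_PREFIXLEN
    if net.prefixlen > limit:
        reasons.append("prefix too long")
    if any(is_bogon_asn(asn) for asn in as_path):
        reasons.append("bogon ASN in path")
    if len(as_path) > MAX_AS_PATH_LEN:
        reasons.append("AS path too long")
    return reasons


print(rejection_reasons("198.51.100.0/25", [1000, 64512]))
```

Because each check contributes its own reason, a phased rollout along the lines suggested above could enable one check at a time and log what it would have rejected before enforcing it.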

Apr 23 2018, 1:29 PM · Operations, netops

Apr 19 2018

faidon added a comment to T138396: Create ops dashboard with info like ipv6 traffic split .

I'm a Pivot newbie -- how could this be inferred? I've tried adding an Ip ~ ":" but that can only appear as a filter, not under split; in split I can only add "Ip" as a field, but that of course just lists different IPs, not the boolean state of IPv6 or not.
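
Outside Pivot, the boolean dimension being asked for is easy to derive. The sketch below is plain Python, not Pivot syntax, and uses made-up documentation addresses: it classifies each address as IPv6 or not and computes the split.

```python
import ipaddress


def is_ipv6(ip: str) -> bool:
    """True when the address parses as IPv6 (a stricter test than checking for ':')."""
    return ipaddress.ip_address(ip).version == 6


# Illustrative sample using documentation address ranges.
sample = ["198.51.100.7", "2001:db8::1", "203.0.113.9", "2001:db8::2"]
ipv6_share = sum(is_ipv6(ip) for ip in sample) / len(sample)
print(ipv6_share)  # 0.5
```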

Apr 19 2018, 4:06 PM · Analytics
faidon added a comment to T136732: Puppetize job that saves old versions of Maxmind geoIP database.

We could do that, but we wanted something centralized and reproducible (e.g. include a puppet class, get the historical dbs). We would have just put this as is in gerrit and auto-committed to it, but we can't host it anywhere publicly, since we pay for these files.

Apr 19 2018, 2:22 PM · Puppet, Patch-For-Review, Analytics-Kanban

Apr 18 2018

faidon changed the status of T191478: Requesting access to shell (snapshot, dumpsdata) for springle from Stalled to Open.

Seems fine :) Welcome back Sean!

Apr 18 2018, 4:57 PM · Patch-For-Review, Operations, SRE-Access-Requests
faidon added a comment to T136732: Puppetize job that saves old versions of Maxmind geoIP database.

I don't feel strongly about this, but I'm a bit skeptical about keeping this in puppet/volatile, given these are fairly out of scope for Puppet (it wouldn't really ever use this data AIUI). It'd be easy to forget, breakages wouldn't be immediately obvious etc.

Apr 18 2018, 3:14 PM · Puppet, Patch-For-Review, Analytics-Kanban
faidon added a comment to T192185: request to assign spare systems as terbium equivalent.

WMF3565 is > 5 years old, so there's really no point in setting up hardware that old right now.

Apr 18 2018, 12:32 AM · Patch-For-Review, hardware-requests, Operations

Apr 17 2018

faidon added a member for acl*procurement-review: LGoto.
Apr 17 2018, 11:46 PM
faidon added a comment to T192280: sda failure in hydrogen.wikimedia.org.

Yup, a replacement is underway as part of T189317 :)

Apr 17 2018, 4:05 PM · ops-eqiad, Traffic, Operations

Apr 16 2018

faidon added a comment to T182163: Update to latest kafkacat.

kafkacat 1.3.1-1~bpo9+1 should be available from Debian's stretch-backports on all stretch hosts:

$ rmadison -a amd64 kafkacat
kafkacat   | 1.3.0-1+b1     | stable            | amd64
kafkacat   | 1.3.1-1~bpo9+1 | stretch-backports | amd64
kafkacat   | 1.3.1-1        | testing           | amd64
kafkacat   | 1.3.1-1        | unstable          | amd64
Apr 16 2018, 4:38 PM · Analytics-Kanban, Patch-For-Review, Analytics, Services (watching)

Apr 13 2018

faidon added a comment to T182163: Update to latest kafkacat.

@Ottomata pinged me last week about that, I guess I hadn't seen this task or forgot about it entirely, sorry about that!

Apr 13 2018, 12:51 PM · Analytics-Kanban, Patch-For-Review, Analytics, Services (watching)

Apr 5 2018

faidon added a comment to T187373: Rebuild raids on labvirt1019 and 1020.

Ah! That's a regular mainboard/SATA controller, so these two drives wouldn't be able to participate in RAID groups. We've done that before I think, at least with Dells, where we had the system drives connected separately.

Apr 5 2018, 2:54 PM · cloud-services-team (Kanban), Operations, Cloud-Services
faidon added a comment to T187373: Rebuild raids on labvirt1019 and 1020.

I don't understand :) Could you clarify which disks are in which slots, and how/where are they connected?

Apr 5 2018, 1:01 PM · cloud-services-team (Kanban), Operations, Cloud-Services
faidon added a comment to T187373: Rebuild raids on labvirt1019 and 1020.

OK, I just saw above that this is a HPE Smart Array P440ar controller. According to the specs, the controller has "Internal: 8 SAS/SATA physical links across 2 x4 ports". So I think each of the ports connects to one of the internal cages (1I and 2I), with each holding 4 disks. That's all normal and according to the specs, and 8 disks is the maximum that this controller can hold. Where are the other two disks located (front/back?), and where are they connected?

Apr 5 2018, 9:17 AM · cloud-services-team (Kanban), Operations, Cloud-Services