faidon (Faidon Liambotis)
SRE

User Details

User Since
Oct 7 2014, 10:21 AM (184 w, 2 d)
Availability
Available
IRC Nick
paravoid
LDAP User
Faidon Liambotis
MediaWiki User
Faidon Liambotis (WMF)

Recent Activity

Today

faidon added a comment to T138396: Create ops dashboard with info like ipv6 traffic split .

I'm a Pivot newbie -- how could this be inferred? I've tried adding an Ip ~ ":" but that can only appear as a filter, not under split; in split I can only add "Ip" as a field, but that of course just lists different IPs, not the boolean state of IPv6 or not.
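As a minimal sketch of the derived dimension being asked for (this is not a Pivot feature, and `is_ipv6`, `split_by_family` and the sample addresses are hypothetical), the address family could be computed before aggregation, so that the split is on a boolean rather than on the raw IP:

```python
import ipaddress
from collections import Counter


def is_ipv6(ip: str) -> bool:
    """Return True if the string parses as an IPv6 address."""
    return ipaddress.ip_address(ip).version == 6


def split_by_family(ips):
    """Count addresses per family, i.e. the IPv4/IPv6 traffic split."""
    return Counter("IPv6" if is_ipv6(ip) else "IPv4" for ip in ips)


# Hypothetical sample taken from documentation address ranges.
sample = ["198.51.100.7", "2001:db8::1", "203.0.113.9"]
print(split_by_family(sample))
```

Parsing the address is a little stricter than matching on ":", since malformed values raise an error instead of being silently counted.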

Thu, Apr 19, 4:06 PM · Analytics
faidon added a comment to T136732: Puppetize job that saves old versions of Maxmind geoIP database.

We could do that, but we wanted something centralized and reproducible (e.g. include a puppet class, get the historical dbs). We would have just put this as is in gerrit and auto-committed to it, but we can't host it anywhere publicly, since we pay for these files.

Thu, Apr 19, 2:22 PM · Puppet, Patch-For-Review, Analytics-Kanban

Yesterday

faidon changed the status of T191478: Requesting access to shell (snapshot, dumpsdata) for springle from Stalled to Open.

Seems fine :) Welcome back Sean!

Wed, Apr 18, 4:57 PM · Patch-For-Review, Operations, Ops-Access-Requests
faidon added a comment to T136732: Puppetize job that saves old versions of Maxmind geoIP database.

I don't feel strongly about this, but I'm a bit skeptical about keeping this in puppet/volatile, given these are fairly out of scope for Puppet (it wouldn't really ever use this data AIUI). It'd be easy to forget, breakages wouldn't be immediately obvious etc.

Wed, Apr 18, 3:14 PM · Puppet, Patch-For-Review, Analytics-Kanban
faidon added a comment to T192185: request to assign WMF3565 as terbium equivalent.

WMF3565 is > 5 years old, so there's really no point in setting up hardware that old right now.

Wed, Apr 18, 12:32 AM · hardware-requests, Operations

Tue, Apr 17

faidon added a member for acl*procurement-review: LGoto.
Tue, Apr 17, 11:46 PM
faidon added a comment to T192280: sda failure in hydrogen.wikimedia.org.

Yup, a replacement is underway as part of T189317 :)

Tue, Apr 17, 4:05 PM · ops-eqiad, Traffic, Operations

Mon, Apr 16

faidon added a comment to T182163: Update to latest kafkacat.

kafkacat 1.3.1-1~bpo9+1 should be available from Debian's stretch-backports on all stretch hosts:

$ rmadison -a amd64 kafkacat
kafkacat   | 1.3.0-1+b1     | stable            | amd64
kafkacat   | 1.3.1-1~bpo9+1 | stretch-backports | amd64
kafkacat   | 1.3.1-1        | testing           | amd64
kafkacat   | 1.3.1-1        | unstable          | amd64
Mon, Apr 16, 4:38 PM · Analytics, Services (watching)

Fri, Apr 13

faidon added a comment to T182163: Update to latest kafkacat.

@Ottomata pinged me last week about that, I guess I hadn't seen this task or forgot about it entirely, sorry about that!

Fri, Apr 13, 12:51 PM · Analytics, Services (watching)

Thu, Apr 5

faidon added a comment to T187373: Rebuild raids on labvirt1019 and 1020.

Ah! That's a regular mainboard/SATA controller, so these two drives wouldn't be able to participate in RAID groups. We've done that before I think, at least with Dells, where we had the system drives connected separately.

Thu, Apr 5, 2:54 PM · cloud-services-team (Kanban), Operations, Cloud-Services
faidon added a comment to T187373: Rebuild raids on labvirt1019 and 1020.

I don't understand :) Could you clarify which disks are in which slots, and how/where are they connected?

Thu, Apr 5, 1:01 PM · cloud-services-team (Kanban), Operations, Cloud-Services
faidon added a comment to T187373: Rebuild raids on labvirt1019 and 1020.

OK, I just saw above that this is an HPE Smart Array P440ar controller. According to the specs, the controller has "Internal: 8 SAS/SATA physical links across 2 x4 ports". So I think each of the ports connects to one of the internal cages (1I and 2I), with each holding 4 disks. That's all normal and according to the specs, and 8 disks is the maximum that this controller can hold. Where are the other two disks located (front/back?), and where are they connected?

Thu, Apr 5, 9:17 AM · cloud-services-team (Kanban), Operations, Cloud-Services
faidon raised the priority of T187373: Rebuild raids on labvirt1019 and 1020 from Normal to High.

@Cmjohnson @RobH This has been going on for weeks now, and this is too much of a delay for setting up these systems. I'm elevating this task's priority, let's get to the bottom of this ASAP. A lot of the delays were just on our side, but I see that HPE is delaying this further too; please escalate within HPE and/or with me if you are not getting timely responses.

Thu, Apr 5, 9:08 AM · cloud-services-team (Kanban), Operations, Cloud-Services

Wed, Apr 4

faidon added a comment to T184564: Plan Puppet 5 upgrade.

In terms of code, what would the changes required be? What are these deprecation warnings that you mentioned above? Are we tracking fixes for these somewhere and are we making sure new ones don't crop up?

Wed, Apr 4, 10:17 AM · Puppet, Operations

Tue, Apr 3

faidon added a comment to T183937: rack/setup/install labvirt102[12].

So @ayounsi found this: https://help.ubuntu.com/community/Installation/Netboot#Multiple_Network_Interface_Note

This seems to describe our issue. However, I'm uncertain it's worth hacking around it when we can just put in a 10G spot that is free. @faidon advised to move ahead on this install, but that was before we had a potential solution.

Tue, Apr 3, 9:08 PM · cloud-services-team (Kanban), Operations
faidon added a comment to T183937: rack/setup/install labvirt102[12].

So there is an issue where trusty expects the OS to be on eth0, and it's on eth3. However, after discussion in IRC, @ayounsi pointed out the new switch in this rack is 10G.

Tue, Apr 3, 10:08 AM · cloud-services-team (Kanban), Operations

Wed, Mar 28

faidon updated subscribers of T190540: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error.

These seem to be under warranty for another 2 months, so we should hurry up.

Wed, Mar 28, 2:31 PM · Traffic, Operations, ops-codfw

Tue, Mar 27

faidon reassigned T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs from Eevans to RobH.
Tue, Mar 27, 4:23 PM · ops-eqiad, Services (blocked), Operations, hardware-requests, Cassandra, User-Eevans

Mon, Mar 26

faidon added a comment to T190719: Create @wikimedia.org e-mail that just discards things sent to it.

There's no-reply@wikimedia.org that just gets discarded. I'm not sure if it would be a good fit for your purpose though -- wouldn't it be possible to just remove the email address completely from those accounts and/or just disable those entirely where applicable?

Mon, Mar 26, 9:44 PM · Operations, Office-IT
faidon added a comment to T183937: rack/setup/install labvirt102[12].

Is there any way we can help? Do you have logs or more information about the "trusty will only image from eth0" that we could perhaps help troubleshoot together?

Mon, Mar 26, 11:37 AM · cloud-services-team (Kanban), Operations

Thu, Mar 22

faidon added a comment to T190364: eqiad 10G ports needs.

I have only hunches and no data to back any of this, but I think ElasticSearch, Hadoop, WMCS, Backups, plus probably Ganeti and Kafka would be good candidates to go 10G-only. Kubernetes I could see it going either way, depending on the density we'll come up with.

Thu, Mar 22, 3:34 PM · netops, Operations

Mar 17 2018

faidon updated the task description for T185153: attach furud's new arrays (furud-array[3-7]).
Mar 17 2018, 11:28 AM · ops-codfw, Operations

Mar 14 2018

faidon closed T185153: attach furud's new arrays (furud-array[3-7]) as Resolved.

These are now attached and configured, resolving.

Mar 14 2018, 9:03 PM · ops-codfw, Operations
faidon added a comment to T185153: attach furud's new arrays (furud-array[3-7]).

Figured this out with @Papaul on IRC (thanks!).

Mar 14 2018, 5:08 PM · ops-codfw, Operations
faidon reassigned T185153: attach furud's new arrays (furud-array[3-7]) from faidon to Papaul.

I rebooted furud and it is not booting right now, saying:

The total number of enclosures connected to connector 01, has exceeded
the maximum allowable limit of 4 enclosures. Please remove the extra enclosures
and then restart your system.

@RobH could you perhaps help out with the topology here?

Mar 14 2018, 3:13 PM · ops-codfw, Operations

Mar 13 2018

faidon added a comment to T188045: wdqs1004 broken.

So post-mortem, I think there are 4 different things here:

  • T189519: Audit switch ports/descriptions/enable (and do this on an ongoing basis)
  • T189522: Detect IP address collisions
  • General enhancements on our server provisioning and decommissioning pipeline, which has a bunch of long-standing issues, but also requires a more dedicated long-term effort. I'm sure there's one or more tasks related to this, but more broadly, this work stream is something that has been incorporated into our (draft) annual plan as a major item next year.
  • (Tangential) Triage the decom queue more promptly to avoid servers lingering for months after their service decom.
Mar 13 2018, 5:02 PM · netops, Discovery-Wikidata-Query-Service-Sprint, ops-eqiad, Discovery, Wikidata, Wikidata-Query-Service, Operations
faidon reassigned T179042: Setup eqsin RIPE Atlas anchor from faidon to ayounsi.

We're happy to announce that your RIPE Atlas anchor is functioning properly and is now connected to the RIPE Atlas network.

You can see your anchor when logged in to the RIPE Atlas website.

The direct link to the probe page for the anchor is here:
https://atlas.ripe.net/probes/6345/

[…]

Mar 13 2018, 12:01 PM · Patch-For-Review, Traffic, ops-eqsin, netops, Operations

Mar 12 2018

faidon closed Unknown Object (Task), a subtask of T156031: Turn up network links for Asia Cache DC, as Resolved.
Mar 12 2018, 6:47 PM · Operations, Traffic
faidon triaged T189522: Detect IP address collisions as High priority.
Mar 12 2018, 6:40 PM · Operations, netops
faidon added a comment to T189519: Audit switch ports/descriptions/enable.

I just ran into a similar thing today in eqiad with T188045, so I reworded the task to make it generic and for both data centers. I also added a sentence to make sure this doesn't happen again, e.g. by adding an alert, or a Juniper slax script to make sure enabled ports always have a description.
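The kind of check such an alert could run might look like the sketch below. It assumes the port inventory has already been exported into some structured form; the `Port` class, the `undescribed_enabled_ports` helper and the sample ports are hypothetical, and this is neither a Juniper SLAX script nor an existing alert.

```python
from dataclasses import dataclass


@dataclass
class Port:
    """Hypothetical, simplified view of one switch port."""
    name: str
    enabled: bool
    description: str = ""


def undescribed_enabled_ports(ports):
    """Return names of ports that are enabled but lack a description."""
    return [p.name for p in ports if p.enabled and not p.description.strip()]


# Hypothetical inventory; a real one would come from the switches.
ports = [
    Port("ge-0/0/1", enabled=True, description="example-host-a"),
    Port("ge-0/0/2", enabled=True),
    Port("ge-0/0/3", enabled=False),
]
print(undescribed_enabled_ports(ports))
```

An alert would fire whenever the returned list is non-empty; disabled ports without a description are deliberately ignored.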

Mar 12 2018, 6:37 PM · ops-eqiad, Operations, netops, ops-codfw
faidon renamed T189519: Audit switch ports/descriptions/enable from audit codfw switch ports/descriptions/enable to Audit switch ports/descriptions/enable.
Mar 12 2018, 6:35 PM · ops-eqiad, Operations, netops, ops-codfw
faidon reassigned T179042: Setup eqsin RIPE Atlas anchor from faidon to ayounsi.

Just heard from RIPE:

I just finished the provisioning of sg-sin-as14907.anchors.atlas.ripe.net and noticed that port 5666 is filtered.
Mar 12 2018, 2:53 PM · Patch-For-Review, Traffic, ops-eqsin, netops, Operations
faidon added a comment to T185153: attach furud's new arrays (furud-array[3-7]).

I'd like all the 5 shelves (array3-7) connected to furud, but not the 2 old ones (array1-2) until further notice. Can we just bypass array1-2 by disconnecting them entirely, and creating a chain with just array3-7?

Mar 12 2018, 2:51 PM · ops-codfw, Operations
faidon raised the priority of T188045: wdqs1004 broken from High to Unbreak Now!.
faidon@re0.cr1-eqiad> show arp no-resolve | match 10.64.0.17 
78:2b:cb:2d:fa:e6 10.64.0.17      ae1.1017                 none
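A rough sketch of how output shaped like the above could feed a collision check: compare the MAC observed for each IP against the MAC an inventory expects. The helper names, the inventory dict and the expected MAC (taken from the documentation range) are all hypothetical; only the ARP line itself comes from the router output above.

```python
def parse_arp(lines):
    """Map IP -> MAC from lines shaped like 'MAC IP interface flags'."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            table[fields[1]] = fields[0].lower()
    return table


def find_mismatches(arp_lines, expected):
    """Return {ip: (expected_mac, observed_mac)} where the two differ."""
    observed = parse_arp(arp_lines)
    return {
        ip: (expected[ip].lower(), mac)
        for ip, mac in observed.items()
        if ip in expected and expected[ip].lower() != mac
    }


arp_output = ["78:2b:cb:2d:fa:e6 10.64.0.17      ae1.1017                 none"]
# Hypothetical inventory entry: the MAC that is supposed to own this IP.
inventory = {"10.64.0.17": "00:00:5e:00:53:01"}
print(find_mismatches(arp_output, inventory))
```

A mismatch does not prove a collision by itself (the inventory may simply be stale), but it is a cheap signal that is worth a look.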
Mar 12 2018, 2:43 PM · netops, Discovery-Wikidata-Query-Service-Sprint, ops-eqiad, Discovery, Wikidata-Query-Service, Wikidata, Operations
faidon reassigned T185153: attach furud's new arrays (furud-array[3-7]) from faidon to Papaul.

Thanks for taking care of this before your trip! I checked this out last week, and it seemed then (and now that I double-checked it) that only three shelves (36 disks) are visible, rather than 5 (array3-7).

Mar 12 2018, 1:20 PM · ops-codfw, Operations
faidon reopened Unknown Object (Task), a subtask of T156031: Turn up network links for Asia Cache DC, as Open.
Mar 12 2018, 11:45 AM · Operations, Traffic

Mar 9 2018

faidon added a comment to T179042: Setup eqsin RIPE Atlas anchor.

That is correct to my knowledge -- that was the case with our other anchors.

Mar 9 2018, 12:35 PM · Patch-For-Review, Traffic, ops-eqsin, netops, Operations

Mar 7 2018

faidon updated the task description for T179042: Setup eqsin RIPE Atlas anchor.
Mar 7 2018, 3:10 PM · Patch-For-Review, Traffic, ops-eqsin, netops, Operations
faidon added a comment to T179042: Setup eqsin RIPE Atlas anchor.

I believe this was blocked until today on an SFP replacement (T188923). It seems that the IP of the Atlas is responding now, and we even receive an SSH banner. So I just submitted the form on the RIPE Atlas panel. Now we're waiting on RIPE before this is fully online:

Thank you for installing the software for your RIPE Atlas anchor!

It may take up to a week to run the tests for your anchor.
We will keep you informed throughout the process of finalising your anchor.

Mar 7 2018, 3:10 PM · Patch-For-Review, Traffic, ops-eqsin, netops, Operations
faidon added a comment to T189065: Outbound mail from Greenhouse is broken.

This has been discussed in bigger requests a couple of times before (T103893, T84201) for Greenhouse specifically, plus a bunch of other times for other third-party services. The TL;DR is that we don't really like whitelisting in SPF/DKIM/DMARC for wikimedia.org all of the third-party services that we use, because that opens up attack vectors like email spoofing and CEO fraud to entities that we do not control and whose security we are not able to vet. The alternative we had proposed before was to use a separate subdomain (careers.wikimedia.org). It's still non-ideal, but it's better than allowing them and others like them to send emails as <insert ED name>@wikimedia.org for instance.

Mar 7 2018, 2:19 PM · Patch-For-Review, DNS, Operations, Mail
Restricted Application added a project to T103893: DNS Change for GreenHouse: Traffic.
Mar 7 2018, 2:19 PM · Traffic, Operations, Mail, DNS

Mar 2 2018

faidon closed Unknown Object (Task), a subtask of T156031: Turn up network links for Asia Cache DC, as Resolved.
Mar 2 2018, 6:54 PM · Operations, Traffic
faidon reassigned T185153: attach furud's new arrays (furud-array[3-7]) from faidon to Papaul.

Let's keep the existing arrays (array1 & array2) offline, and just connect all of the new ones.

Mar 2 2018, 6:44 PM · ops-codfw, Operations
faidon added a comment to T187994: netfilter software at WMF: iptables vs nftables.

To answer my own earlier question: I was looking at nftables' wiki about the supported features compared to xtables and the updates to the Linux kernel per version. Several systems (mostly WMCS) are still using trusty and Linux 3.13, which is really the first release of nftables and with multiple pretty basic features missing (e.g. REJECT, MASQUERADE etc.). Our latest and greatest right now is 4.9, and even that is apparently missing NOTRACK (added in 4.10), which is something we're using in a few places (e.g. DNSes).
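The comparison above boils down to checking each host's kernel against the release that introduced a feature. A minimal sketch, assuming only the one threshold stated here (NOTRACK in 4.10); the `MIN_KERNEL` table and helper names are hypothetical, and thresholds for other features would have to be taken from the nftables wiki rather than guessed:

```python
# Minimum kernel per nftables feature. Only the NOTRACK value comes from
# the comment above; add others only after checking the nftables wiki.
MIN_KERNEL = {"notrack": (4, 10)}


def parse_kernel(version: str):
    """Turn '4.9' or '4.9.0-6-amd64' into a comparable (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def supports(kernel: str, feature: str) -> bool:
    """True if the given kernel is at least the feature's minimum release."""
    return parse_kernel(kernel) >= MIN_KERNEL[feature]


for kernel in ("3.13", "4.9", "4.10"):
    print(kernel, supports(kernel, "notrack"))
```

Comparing tuples rather than strings matters here: as strings, "4.9" would sort after "4.10".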

Mar 2 2018, 3:21 PM · Operations

Mar 1 2018

faidon added a comment to T187994: netfilter software at WMF: iptables vs nftables.

First, I don't think we should be thinking in terms of "using software from the 90s", at least not for something that is still as widely used and well-maintained as iptables (and compared to something that is as seldom used as nftables). This is not something we should judge software with; we can talk instead in terms of amounts of bugs, maintainability, upstream response times, when was the last release, if/when it was deprecated by upstream(s) etc.

Mar 1 2018, 3:32 PM · Operations

Feb 28 2018

faidon added a comment to T187994: netfilter software at WMF: iptables vs nftables.

I don't think it's easy for anyone to calculate the amount of effort required for this, but the stated 1-2 year long migration sounds longer than I thought and... pretty scary. I'd like to at least be conscious of the amount of effort required here, and foresee clear, tangible benefits at the end of the line to be able to justify the effort both for the migration itself, plus all the associated risks, learning curve and confusion in the meantime.

Feb 28 2018, 3:23 AM · Operations

Feb 26 2018

faidon reassigned T183937: rack/setup/install labvirt102[12] from Cmjohnson to RobH.
Feb 26 2018, 7:03 PM · cloud-services-team (Kanban), Operations
faidon added a comment to T183937: rack/setup/install labvirt102[12].

So for some reason (WMCS bad luck!), these seem to have been ordered with Intel NIC daughter cards. We have had Intel NICs only in the distant past, 99% of our 10G fleet is on QLogic (née Broadcom) these days. We still have kernel command-line options in the puppet tree to make those work with our optics, and it's very likely that we'd be able to make these work somehow.

Feb 26 2018, 7:02 PM · cloud-services-team (Kanban), Operations
faidon renamed T188075: eqiad/codfw: (4)+(4) hardware access request for videoscalers from Site: (2) hardware access request for videoscalers to eqiad/codfw: (4)+(4) hardware access request for videoscalers.
Feb 26 2018, 2:50 PM · hardware-requests, Operations
faidon assigned T188075: eqiad/codfw: (4)+(4) hardware access request for videoscalers to RobH.

Sounds good. Note that eqiad has 6 imagescalers (mw1293-mw1298) and codfw has 4 now (mw2244-2245/mw2150-2151) but let's go with reassigning 4+4 for videoscaling for symmetry. (Note that this is blocked on T188062 right now to my knowledge)

Feb 26 2018, 2:47 PM · hardware-requests, Operations

Feb 22 2018

faidon added a comment to T187994: netfilter software at WMF: iptables vs nftables.

However, iptables is being replaced by nftables.

It seems to me like nftables is still not very widely used (as also evidenced by the upstreams you mentioned not having adopted it yet) and we might be early adopters; is that your impression as well?

Feb 22 2018, 1:36 PM · Operations

Feb 20 2018

faidon closed T187688: rhenium running out of disk space on / as Resolved.

I've deleted a 7.7G file and freed up some space. As for Postgres, it's there temporarily for a high-priority and somewhat unusual situation, so I'd ask to ignore that for a couple more months. rhenium is a very old box that we'll need to replace in the next 6 months anyway, so we can decide what to do with it when we do replace it.

Feb 20 2018, 12:59 PM · netops, Operations

Feb 15 2018

faidon reassigned T181264: Refresh or replace oxygen from faidon to RobH.

OK, let's do this, approved. It's spinning rust which is unfortunate, but with 64GB of RAM we could probably fit most of the dataset in the page cache, so... :)

Feb 15 2018, 10:32 PM · hardware-requests, Operations, Analytics
faidon added a comment to T187456: Decommission labstore100[12] and their disk shelves.

My apologies, this is all confusing! I corrected the task description to reflect that labstore100[12] have been replaced by labstore100[45]. I guess we can wait until labstore100[89] are procured (T186931), but in general let's please decom systems soon after we replace them in the future :)

Feb 15 2018, 9:05 PM · cloud-services-team, Data-Services, Operations, DC-Ops, ops-eqiad
faidon updated the task description for T187456: Decommission labstore100[12] and their disk shelves.
Feb 15 2018, 9:03 PM · cloud-services-team, Data-Services, Operations, DC-Ops, ops-eqiad
faidon updated the task description for T187474: Decommission old and unused/spare servers in codfw.
Feb 15 2018, 7:05 PM · hardware-requests, Operations, DC-Ops, ops-codfw
faidon updated the task description for T187473: Decommission old and unused/spare servers in eqiad.
Feb 15 2018, 5:33 PM · hardware-requests, Operations, DC-Ops, ops-eqiad
faidon triaged T187474: Decommission old and unused/spare servers in codfw as Normal priority.
Feb 15 2018, 5:27 PM · hardware-requests, Operations, DC-Ops, ops-codfw
faidon triaged T187473: Decommission old and unused/spare servers in eqiad as Normal priority.
Feb 15 2018, 5:25 PM · hardware-requests, Operations, DC-Ops, ops-eqiad
faidon added a comment to T165781: rack/setup/install labcontrol100[34].

What is the status of this and is there an ETA? The reason I'm asking is that labcontrol100[12] (very old/replaced by this) are still online and presumably waiting for this to be done :)

Feb 15 2018, 4:54 PM · cloud-services-team (Kanban), Cloud-Services, Operations
faidon reassigned T184481: hardware request for tin replacement from faidon to RobH.

Approved then.

Feb 15 2018, 3:57 PM · hardware-requests, Operations
faidon renamed T187456: Decommission labstore100[12] and their disk shelves from Decommission labstore100[12] to Decommission labstore100[12] and their disk shelves.
Feb 15 2018, 3:46 PM · cloud-services-team, Data-Services, Operations, DC-Ops, ops-eqiad
faidon triaged T187456: Decommission labstore100[12] and their disk shelves as Low priority.
Feb 15 2018, 3:45 PM · cloud-services-team, Data-Services, Operations, DC-Ops, ops-eqiad
faidon triaged T187447: Decommission restbase-test200[123] as Low priority.
Feb 15 2018, 2:00 PM · Patch-For-Review, hardware-requests, DC-Ops, Operations, ops-codfw
faidon triaged T187446: Decommission xenon, cerium, praseodymium as Low priority.
Feb 15 2018, 1:59 PM · Patch-For-Review, hardware-requests, DC-Ops, Operations, ops-eqiad
faidon triaged T187445: Decommission osm-db200[12] and osm-web200[1234] as Low priority.
Feb 15 2018, 1:56 PM · DC-Ops, Operations, ops-codfw
faidon reopened T171179: Decommisson restbase-dev100[1-3] as "Open".

These appear to be still racked in Racktables. Reopening to investigate per IRC conversation.

Feb 15 2018, 1:42 PM · hardware-requests, ops-eqiad, Operations
faidon reopened T171179: Decommisson restbase-dev100[1-3], a subtask of T166181: rack/setup/install restbase-dev100[456], as Open.
Feb 15 2018, 1:42 PM · Patch-For-Review, Services (watching), User-fgiunchedi, DC-Ops, ops-eqiad, Operations
faidon moved T170157: decommission rcs100[12] from Backlog to Decommission on the ops-eqiad board.
Feb 15 2018, 1:34 PM · Patch-For-Review, ops-eqiad, Analytics, Operations, hardware-requests, Wikimedia-Stream
faidon edited projects for T170157: decommission rcs100[12], added: ops-eqiad; removed Patch-For-Review.
Feb 15 2018, 1:34 PM · Patch-For-Review, ops-eqiad, Analytics, Operations, hardware-requests, Wikimedia-Stream
faidon merged T181825: decommission rcs1001/1002 into T170157: decommission rcs100[12].
Feb 15 2018, 1:33 PM · Patch-For-Review, ops-eqiad, Analytics, Operations, hardware-requests, Wikimedia-Stream
faidon merged task T181825: decommission rcs1001/1002 into T170157: decommission rcs100[12].
Feb 15 2018, 1:33 PM · DC-Ops, Operations, ops-eqiad
faidon added a comment to T181825: decommission rcs1001/1002.

This is a duplicate of T170157. I'll tag that one with ops-eqiad, and close this as duplicate.

Feb 15 2018, 1:33 PM · DC-Ops, Operations, ops-eqiad
faidon added a comment to T166179: singapore caching center: eqiad staging tracking task.

@RobH, this can be resolved now, right?

Feb 15 2018, 12:09 AM · Traffic, ops-eqsin, DC-Ops, Operations

Feb 12 2018

faidon assigned T185494: Degraded RAID on restbase-dev1006 to RobH.
Feb 12 2018, 9:21 AM · User-Eevans, ops-eqiad, Operations

Feb 9 2018

faidon reopened T186814: Missing servers in racktables as "Open".

The original inquiry was for kafka1023, which is a different box than analytics1023 (confusing!).

Feb 9 2018, 9:49 AM · ops-eqiad, Operations

Feb 6 2018

faidon reassigned T184480: hardware request for bast1001 replacement from faidon to RobH.

Approved.

Feb 6 2018, 5:11 PM · hardware-requests, Operations
faidon added a comment to T186539: tools-services-01: issue with aptly repo release file.

Can't aptly (whichever version) become part of trusty-tools instead? That would remove the interdependency with production's apt, which seems especially relevant given toolforge has its own ways, apt repository et al.

Feb 6 2018, 12:05 PM · Patch-For-Review, Cloud-Services

Feb 1 2018

faidon added a comment to T185667: setup/install eventlog1002.eqiad.wmnet.

I had a look at both modules/eventlogging/files/eventloggingctl and modules/eventlogging/templates/upstart/*. They all seemed fairly easy to reimplement with systemd (with or without templates; for the former, a good reference would be e.g. the Tor package's units). It all feels to me like less than a day's effort unless I've gravely misunderstood how this all works and am underestimating it.

Feb 1 2018, 10:41 AM · Patch-For-Review, Analytics, Operations
faidon removed a project from T183970: wikidumpparse is using 1.2TB of 5T available NFS misc storage: Operations.
Feb 1 2018, 10:29 AM · cloud-services-team, Cloud-VPS

Jan 30 2018

faidon closed T185971: Add some ssd's to phab1001 and phab2001 as Declined.

This has no problem statement, diagnosis, root cause analysis or evidence of I/O starvation -- and yet we're jumping to actionables with dubious propositions ("would make everything 100% faster"). Please don't do that, and if you're experiencing issues with something, file a task about the issue you're experiencing with the symptoms that you've observed.

Jan 30 2018, 5:47 PM · Release-Engineering-Team, Phabricator, Operations
faidon merged T185796: Consider ssd's for phabricator into T185971: Add some ssd's to phab1001 and phab2001.
Jan 30 2018, 5:24 PM · Release-Engineering-Team, Phabricator, Operations
faidon merged task T185796: Consider ssd's for phabricator into T185971: Add some ssd's to phab1001 and phab2001.
Jan 30 2018, 5:24 PM · Release-Engineering-Team, Operations, Phabricator
faidon added a comment to T185667: setup/install eventlog1002.eqiad.wmnet.

Trusty has about a year left of upstream support, and likely less for our own purposes. Any reason to not switch to something more recent while we're at it?

Jan 30 2018, 12:26 AM · Patch-For-Review, Analytics, Operations

Jan 29 2018

faidon assigned T162857: Some Core availability Catchpoint tests might be more expensive than they need to be to Volans.
Jan 29 2018, 2:54 AM · monitoring, Patch-For-Review, Operations

Jan 25 2018

faidon renamed T185667: setup/install eventlog1002.eqiad.wmnet from setup/install evenlog1002.eqiad.wmnet to setup/install eventlog1002.eqiad.wmnet.
Jan 25 2018, 6:30 PM · Patch-For-Review, Analytics, Operations

Jan 23 2018

faidon added a comment to T185350: Vet reliability of the response_size field for data analysis purposes.

Seems fairly consistent: LibreNMS has recorded November as 2.45PB. October is incomplete, unfortunately, so we can't compare that :(

Jan 23 2018, 12:38 AM · Operations, Traffic, Analytics-Data-Quality
faidon added a comment to T185319: IRC RecentChanges feed: code stewardship request.

However, it's operating without any redundancy, in terms of both individual hardware failure, and datacenter failure.

This seems like something that would easily be fixed by placing an IRC server per datacenter and linking them the old way.

Making edits on any datacenter to propagate there, all using the same nickname, would be a bit harder, but probably doable with little effort, too.

Jan 23 2018, 12:33 AM · Tools, Operations, Analytics, Wikimedia-IRC-RC-Server, Code-Stewardship-Reviews

Jan 22 2018

faidon added a comment to T185350: Vet reliability of the response_size field for data analysis purposes.

Interesting! So with a ratio in:out of approximately 25:1 (based on January's figures), this means that we could estimate the other direction to be around 80TB. So the total estimated in+out would be 2018+80 = 2098TB, compared to the actual (as received from LibreNMS) 2300TB, which is… 10% off. Not too bad!
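The back-of-the-envelope estimate above can be reproduced with a few lines. The figures (2018 TB, 25:1, 2300 TB) are the ones quoted in this comment; the function name and parameter names are hypothetical.

```python
def estimate_total(measured_tb: float, ratio: float, actual_tb: float):
    """Estimate the unmeasured direction from a ratio and compare to actuals.

    Returns (other_direction_tb, estimated_total_tb, relative_gap).
    """
    other_tb = measured_tb / ratio
    total_tb = measured_tb + other_tb
    gap = (actual_tb - total_tb) / actual_tb
    return other_tb, total_tb, gap


other_tb, total_tb, gap = estimate_total(measured_tb=2018, ratio=25, actual_tb=2300)
print(f"other direction: {other_tb:.1f} TB")
print(f"estimated total: {total_tb:.1f} TB")
print(f"gap vs. LibreNMS: {gap:.1%}")
```

With unrounded inputs the gap comes out slightly under 9%, which is consistent with the "roughly 10% off" reading above.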

Jan 22 2018, 11:06 PM · Operations, Traffic, Analytics-Data-Quality
faidon added a comment to T185350: Vet reliability of the response_size field for data analysis purposes.

In the meantime, I ran a query to estimate how much data was transferred in the download direction last month overall *if* the response_size field can be relied upon.
The answer is 6200 (decimal) Terabytes, with 25 kilobytes per request on average.

SELECT 
SUM(response_size) AS total_bytes,
SUM(1) AS requests
FROM wmf.webrequest
WHERE year = 2017 AND month = 12;

total_bytes	requests
6198158138966004	240795481820
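As a quick sanity check, the two rounded figures quoted above can be derived from the raw query result. The variable names are hypothetical; the two input numbers are the ones returned by the query.

```python
# Raw values from the query result above.
total_bytes = 6198158138966004
requests = 240795481820

# Decimal terabytes (1 TB = 10**12 bytes) and mean response size.
total_tb = total_bytes / 10**12
avg_bytes = total_bytes / requests

print(f"total: {total_tb:.0f} TB")
print(f"average: {avg_bytes / 1000:.1f} kB per request")
```

This gives roughly 6198 TB in total and a bit under 26 kB per request, matching the rounded 6200 TB and 25 kB figures.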
Jan 22 2018, 9:49 PM · Operations, Traffic, Analytics-Data-Quality
faidon moved T181036: Pull netflow data in realtime from Kafka via Tranquillity/Spark from Backlog to In progress on the monitoring board.
Jan 22 2018, 4:16 PM · User-Elukey, monitoring, netops, Operations
faidon moved T183177: memory errors not showing in icinga from Backlog to Up next on the monitoring board.
Jan 22 2018, 4:16 PM · Traffic, Patch-For-Review, User-fgiunchedi, DC-Ops, Operations, monitoring
faidon merged task T119774: Upgrade graphite to 0.9.15 into T166173: Upgrade graphite from 0.9.x to 1.x.
Jan 22 2018, 4:11 PM · monitoring, WMDE-Analytics-Engineering, Graphite
faidon merged T119774: Upgrade graphite to 0.9.15 into T166173: Upgrade graphite from 0.9.x to 1.x.
Jan 22 2018, 4:11 PM · monitoring, Performance-Team (Radar), Graphite
faidon moved T183209: decom uranium from In progress to Externally blocked on the monitoring board.
Jan 22 2018, 4:09 PM · Patch-For-Review, hardware-requests, ops-eqiad, monitoring, Technical-Debt, Operations
faidon reassigned T184551: EQIAD: (1) hardware request for eventlog1001 replacement - eventlog1002. from faidon to RobH.

Sounds good, please go ahead :)

Jan 22 2018, 3:25 PM · Analytics, hardware-requests, Operations

Jan 20 2018

faidon added a comment to T185345: os_version strict distro check doesn't work.

I just pushed a change to make the rspec more extensive, including a test case for the scenario that you described here. It seems to pass fine, so I merged the change. Is there a specific server/VPS you're experiencing this on? I could give it a look.

Jan 20 2018, 1:19 AM · Patch-For-Review, Operations, Puppet
faidon added a comment to T185345: os_version strict distro check doesn't work.

I can't reproduce. I'm testing with this for example:

$is_trusty = os_version('ubuntu trusty')
notice("is trusty is ${is_trusty}")
Jan 20 2018, 12:56 AM · Patch-For-Review, Operations, Puppet

Jan 19 2018

faidon added a project to T185319: IRC RecentChanges feed: code stewardship request: Operations.
Jan 19 2018, 3:46 PM · Tools, Operations, Analytics, Wikimedia-IRC-RC-Server, Code-Stewardship-Reviews
faidon created T185319: IRC RecentChanges feed: code stewardship request.
Jan 19 2018, 3:44 PM · Tools, Operations, Analytics, Wikimedia-IRC-RC-Server, Code-Stewardship-Reviews