fgiunchedi (Filippo Giunchedi)
Awesome

Projects (18)

User Details

User Since
Oct 3 2014, 8:06 AM (194 w, 22 h)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
Filippo Giunchedi

Recent Activity

Fri, Jun 15

fgiunchedi closed T196067: Clean up cpjobqueue metrics, a subtask of T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus, as Resolved.
Fri, Jun 15, 9:49 AM · Patch-For-Review, MW-1.32-release-notes (WMF-deploy-2018-04-24 (1.32.0-wmf.1)), Services (doing), Goal, EventBus, Analytics, MediaWiki-JobQueue
fgiunchedi closed T196067: Clean up cpjobqueue metrics as Resolved.
Fri, Jun 15, 9:49 AM · User-fgiunchedi, Graphite, Operations, Services (watching), EventBus, Analytics, MediaWiki-JobQueue
fgiunchedi added a comment to T196067: Clean up cpjobqueue metrics.

List of metrics at https://phabricator.wikimedia.org/P7262, I'll remove those if the list looks good.

Fri, Jun 15, 9:30 AM · User-fgiunchedi, Graphite, Operations, Services (watching), EventBus, Analytics, MediaWiki-JobQueue

Thu, Jun 14

fgiunchedi moved T196067: Clean up cpjobqueue metrics from Backlog to Up next on the User-fgiunchedi board.
Thu, Jun 14, 3:06 PM · User-fgiunchedi, Graphite, Operations, Services (watching), EventBus, Analytics, MediaWiki-JobQueue
fgiunchedi added a project to T196067: Clean up cpjobqueue metrics: User-fgiunchedi.
Thu, Jun 14, 3:03 PM · User-fgiunchedi, Graphite, Operations, Services (watching), EventBus, Analytics, MediaWiki-JobQueue

Wed, Jun 13

fgiunchedi closed T183177: memory errors not showing in icinga as Resolved.

I researched the "panic on uncorrectable errors" question a bit, and it turns out that the machine check framework, not EDAC, already takes care of panicking (or sending SIGBUS to the process) when uncorrectable errors are reported.

Details below, it looks like for UE the kernel already does the right thing.

This conflicts with @BBlack's comment above though, T183177#4088202

When a UE hits memory that matters (corrupts memory actually in-use for data/code), the kernel should panic, as it's the only reasonable recourse at that point. Clearly, that's not currently happening via kernel or userspace tools/settings.

But

On our side we don't currently monitor process exits due to SIGBUS, though; those processes would usually get restarted by systemd.

Maybe that's the key here?

As per https://www.kernel.org/doc/html/latest/admin-guide/ras.html#module-parameters

edac_mc_panic_on_ue - Panic on UE control file

An uncorrectable error will cause a machine panic. This is usually desirable. It is a bad idea to continue when an uncorrectable error occurs - it is indeterminate what was uncorrected and the operating system context might be so mangled that continuing will lead to further corruption. If the kernel has MCE configured, then EDAC will never notice the UE.

So while looking at MCE configuration for x86-64 for the mce kernel bootparameter from https://www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt:

mce=tolerancelevel[,monarchtimeout] (number,number)
		tolerance levels:
		0: always panic on uncorrected errors, log corrected errors
		1: panic or SIGBUS on uncorrected errors, log corrected errors
		2: SIGBUS or log uncorrected errors, log corrected errors
		3: never panic or SIGBUS, log all errors (for testing only)
		Default is 1
		Can be also set using sysfs which is preferable.
		monarchtimeout:
		Sets the time in us to wait for other CPUs on machine checks. 0
		to disable.

The respective sysfs file can be checked and it is already 1 by default:

grep -H . /sys/devices/system/machinecheck/machinecheck*/tolerant

It's not clear to me when the kernel panics and when it sends a SIGBUS to the process when the mce tolerance level is 1 (which is the default).
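
The sysfs check done with grep in this comment can also be sketched in Python, for example as a building block for a monitoring script. This is only an illustration: the function name and the level descriptions table are hypothetical, and the `tolerant` file may not exist on every kernel, in which case the function simply returns an empty mapping.

```python
import glob
import os

# Short descriptions of each level, paraphrased from the boot-options.txt
# excerpt quoted in the comment.
TOLERANT_LEVELS = {
    0: "always panic on uncorrected errors",
    1: "panic or SIGBUS on uncorrected errors",
    2: "SIGBUS or log uncorrected errors",
    3: "never panic or SIGBUS (testing only)",
}


def read_tolerant(base="/sys/devices/system/machinecheck"):
    """Return {machinecheckN: tolerant level} for every CPU directory found."""
    levels = {}
    pattern = os.path.join(base, "machinecheck*", "tolerant")
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            cpu = os.path.basename(os.path.dirname(path))
            levels[cpu] = int(f.read().strip())
    return levels


if __name__ == "__main__":
    for cpu, level in read_tolerant().items():
        print(cpu, level, TOLERANT_LEVELS.get(level, "unknown level"))
```

The base directory is a parameter so that the function can be exercised against a temporary directory instead of the real sysfs tree.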

Wed, Jun 13, 12:53 PM · Traffic, Patch-For-Review, User-fgiunchedi, DC-Ops, Operations, monitoring
fgiunchedi reopened T183177: memory errors not showing in icinga as "Open".
Wed, Jun 13, 10:25 AM · Traffic, Patch-For-Review, User-fgiunchedi, DC-Ops, Operations, monitoring
fgiunchedi closed T183177: memory errors not showing in icinga as Resolved.

I'm resolving this task since we're alerting on uncorrectable memory errors found by EDAC now. Uncorrectable errors get either a kernel panic or SIGBUS to the process. See T197084: Report problems found in server's IPMI SEL and more importantly T197086: Report problems found by mcelog for followups.

Wed, Jun 13, 10:24 AM · Traffic, Patch-For-Review, User-fgiunchedi, DC-Ops, Operations, monitoring
fgiunchedi created T197086: Report problems found by mcelog.
Wed, Jun 13, 10:22 AM · monitoring, Operations
fgiunchedi created T197084: Report problems found in server's IPMI SEL.
Wed, Jun 13, 10:09 AM · Operations, monitoring
fgiunchedi added a comment to T183177: memory errors not showing in icinga.

I researched the "panic on uncorrectable errors" question a bit, and it turns out that the machine check framework, not EDAC, already takes care of panicking (or sending SIGBUS to the process) when uncorrectable errors are reported.

Wed, Jun 13, 9:47 AM · Traffic, Patch-For-Review, User-fgiunchedi, DC-Ops, Operations, monitoring
fgiunchedi raised the priority of T165252: cp1053 possible hardware issues from Normal to High.

There have been edac correctable memory errors reported for this host, raising priority to high since the cpu temp alerts also persist

Wed, Jun 13, 8:44 AM · ops-eqiad, Traffic, Operations

Tue, Jun 12

fgiunchedi added a comment to T196989: mailman listing unresponsive (fermium high latency).

Another, bigger nail in the coffin for GET requests will be enabling caching in Apache: at least for listinfo, the information doesn't change frequently and we can safely cache it for 30 minutes or so.

Tue, Jun 12, 4:23 PM · Patch-For-Review, Mail, Operations, Wikimedia-Mailing-lists
fgiunchedi added a comment to T196873: ms-be1036 in power off status, not responsive to power on commands.

@Cmjohnson ok! Thanks, I'll begin removing the machine from swift tomorrow.

Tue, Jun 12, 2:17 PM · ops-eqiad, Operations
fgiunchedi added a comment to T196989: mailman listing unresponsive (fermium high latency).

Looks like high load is back with a whole lot of listinfo requests

Tue, Jun 12, 1:47 PM · Patch-For-Review, Mail, Operations, Wikimedia-Mailing-lists
fgiunchedi triaged T196994: Open Phab tasks on SMART failure as Normal priority.
Tue, Jun 12, 1:00 PM · User-fgiunchedi, Operations, monitoring
fgiunchedi added a comment to T136312: Encrypt syslog traffic.

Latest rsyslog release containing the fix is already packaged in Debian unstable, it'd be easier to backport that to stretch instead of jessie. Once we have a replacement for lithium in place (T195416) and running stretch I'll test the backport there.

Tue, Jun 12, 12:10 PM · Patch-For-Review, monitoring, User-fgiunchedi, Operations
fgiunchedi raised the priority of T196873: ms-be1036 in power off status, not responsive to power on commands from Normal to High.

Thanks @Cmjohnson ! Please treat this with urgency, do you know if there's an ETA? If more than a couple of days I'll remove the machine from swift.

Tue, Jun 12, 12:01 PM · ops-eqiad, Operations
fgiunchedi added a comment to T195569: Degraded RAID on ms-be1034.

Yeah I think it might have been the controller barfing and the disk is actually ok. I couldn't find related logs on lithium tho so hard to know for sure. The disk can be sent back, we'll order it back if need be.

Tue, Jun 12, 8:00 AM · ops-eqiad, Operations

Mon, Jun 11

Gerrit Code Review <gerrit@wikimedia.org> committed R1903:1e4a2ff42837: Update patch set 6 (authored by fgiunchedi).
Update patch set 6
Mon, Jun 11, 10:12 AM
Gerrit Code Review <gerrit@wikimedia.org> committed R1903:3e0a30152404: Update patch set 2 (authored by fgiunchedi).
Update patch set 2
Mon, Jun 11, 10:12 AM

Sun, Jun 10

Gerrit Code Review <gerrit@wikimedia.org> committed rESCD4de4b2903f9e: Update patch set 4 (authored by fgiunchedi).
Update patch set 4
Sun, Jun 10, 1:49 PM

May 16 2018

fgiunchedi added a comment to T187962: Rack/cable/configure asw2-c-eqiad switch stack.

WRT ms-fe servers (1008 and 1007), please move to asw2 and reallocate to be in two different physical racks.

May 16 2018, 2:24 PM · Patch-For-Review, Operations, ops-eqiad, netops
fgiunchedi added a comment to T194814: Reduce amount of headers sent from web responses.

Ditto for some Thumbor headers:

May 16 2018, 10:32 AM · Performance-Team (Radar), Patch-For-Review, media-storage, Operations, Traffic

May 15 2018

fgiunchedi created T194757: cp1068 memory correctable errors.
May 15 2018, 1:30 PM · ops-eqiad, Traffic, Operations
fgiunchedi added a comment to T183177: memory errors not showing in icinga.

See updates in T190540 , quite a few codfw hosts have SEL entries for uncorrectable ECC errors that went by unnoticed (but we tend to notice on reboots).

I think there's a few things we need to think about with this situation in the general case:

  1. Uncorrectable errors (UE) are events, not states. A UE happens, and then life moves on. Other than persistent SEL logs or syslogs, we don't expect an isolated transient event to persist (it's not like the DIMM itself stores some kind of SMART-like data on past failures of itself or whatever). It's technically possible for an error to be truly-transient and never come back (e.g. "cosmic rays" or whatever). But a pattern of UE (or really, even a significant pattern of CE) is a sign that a module needs replacing.
  2. When a UE hits memory that matters (corrupts memory actually in-use for data/code), the kernel should panic, as it's the only reasonable recourse at that point. Clearly, that's not currently happening via kernel or userspace tools/settings.
  3. Either via the kernel interfaces directly, or via userspace edac tools, *something* should be logging UEs (well if they don't panic) and CEs to syslog. I think prometheus looks at sysfs directly.
  4. There were in times past, sysfs settings controlling panic_on_ue, log_ue, and log_ce, but these all seem to be missing from present kernels on cp*. Likely this stuff changed since I last looked, maybe that's considered userspace responsibility at this point?
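
As a rough illustration of point 3 in the list above (reading the kernel interfaces directly), the per-memory-controller counters that EDAC exposes in sysfs can be read as sketched below. The `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/mcN/` are the standard EDAC counters; the function names and the three-way classification are hypothetical and not taken from the actual check.

```python
import glob
import os


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {mcN: {"ce_count": int or None, "ue_count": int or None}}."""
    counts = {}
    for mc in sorted(glob.glob(os.path.join(base, "mc[0-9]*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                with open(os.path.join(mc, name)) as f:
                    entry[name] = int(f.read().strip())
            except (OSError, ValueError):
                # A missing or unparsable counter is recorded as None
                # rather than silently treated as zero.
                entry[name] = None
        counts[os.path.basename(mc)] = entry
    return counts


def classify(entry):
    """Map one controller's counters to a coarse state."""
    if entry["ce_count"] is None or entry["ue_count"] is None:
        return "unknown"
    if entry["ue_count"] > 0:
        return "uncorrectable"
    if entry["ce_count"] > 0:
        return "correctable"
    return "ok"
```

Since the counters only record that events happened since boot, a script like this reports history rather than current state, which matches point 1 above.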
May 15 2018, 11:31 AM · Traffic, Patch-For-Review, User-fgiunchedi, DC-Ops, Operations, monitoring
fgiunchedi added a comment to T190978: Update ResourceLoader dashboard to query varnishrls data from Prometheus instead.

@fgiunchedi Thanks. This makes the per-dc stacks much easier without the need to iterate over each data source. Interestingly though, while I still need the aggregated rules for more complex graphs, I did find a way to make the simpler graphs work without the global rules. Namely, Grafana supports a way to mix multiple data sources in a single graph:

May 15 2018, 8:25 AM · MediaWiki-ResourceLoader, Performance-Team

May 14 2018

fgiunchedi moved T183177: memory errors not showing in icinga from Up next to In progress on the monitoring board.
May 14 2018, 3:01 PM · Traffic, Patch-For-Review, User-fgiunchedi, DC-Ops, Operations, monitoring
fgiunchedi closed T137397: revisit swift (sys)logging as Resolved.

Resolving: swift (sys)logging was fixed a while ago, but this task was never resolved.

May 14 2018, 2:08 PM · Patch-For-Review, Operations

May 11 2018

fgiunchedi added a comment to T194012: labsdb1004 and labsdb1005 some hard disks not healthy.

For sure! It means the drive(s) are not healthy according to smartmontools. I'll add some details to https://wikitech.wikimedia.org/wiki/SMART about this but tl;dr smartctl --health /dev/bus/0 -d <DEVICE> will show why.

May 11 2018, 5:04 PM · Cloud-Services
fgiunchedi moved T151009: Provide authenticated access to Prometheus native web interface from Backlog to Up next on the User-fgiunchedi board.
May 11 2018, 10:10 AM · monitoring, Patch-For-Review, User-fgiunchedi, Operations, Prometheus-metrics-monitoring

May 10 2018

fgiunchedi added a comment to T159354: Move coal from graphite#001 nodes to webperf#001.

Also ticking off the "backup files in bacula" checkbox, because we now use the regular carbon storage, which at some point between 2015 and now has been added to the backup process (despite being considerably larger than the subset of coal metrics).

Resource: puppet:/modules/profile/manifests/backup/director.pp#L145

May 10 2018, 2:56 PM · Patch-For-Review, Performance-Team, Operations
fgiunchedi added inline comments to D1049: Swift file storage engine for Phabricator.
May 10 2018, 2:53 PM · Phabricator, media-storage
fgiunchedi merged task T190842: Compiler fails to generate html with non-ascii characters into T173518: Errors dealing with non-ascii characters in output.
May 10 2018, 12:46 PM · puppet-compiler
fgiunchedi merged T190842: Compiler fails to generate html with non-ascii characters into T173518: Errors dealing with non-ascii characters in output.
May 10 2018, 12:46 PM · puppet-compiler
fgiunchedi added a comment to D1049: Swift file storage engine for Phabricator.

Nice work! Looking forward to seeing this working in beta.

May 10 2018, 12:00 PM · Phabricator, media-storage
fgiunchedi added inline comments to D1049: Swift file storage engine for Phabricator.
May 10 2018, 8:37 AM · Phabricator, media-storage
fgiunchedi added a comment to T187962: Rack/cable/configure asw2-c-eqiad switch stack.

For swift / ms servers the requirements are as follows:

  • ms-fe* to be depooled and moved one at a time.
  • ms-be* to be moved one at a time, just a clean poweroff is enough, no depooling needed.

Agreed with @ayounsi please spread said servers across racks in row C as much as possible. I'll be on vacation starting Thurs 17th, I can assist with the move before that though.

May 10 2018, 8:26 AM · Patch-For-Review, Operations, ops-eqiad, netops

May 9 2018

fgiunchedi added a comment to T194171: rdb2002 correctable memory errors.
4   | May-06-2018 | 04:46:06 | Mem ECC Warning  | Memory                   | transition to Non-Critical from OK ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 40h
May 9 2018, 10:36 AM · Operations, ops-codfw
fgiunchedi added a comment to T194174: wtp2013 memory correctable errors.
wtp2013:~$ sudo ipmi-sel
ID  | Date        | Time     | Name             | Type                     | Event
1   | Jan-15-2015 | 23:04:45 | SEL              | Event Logging Disabled   | Log Area Reset/Cleared
2   | Dec-21-2016 | 01:41:38 | Mem ECC Warning  | Memory                   | transition to Non-Critical from OK ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
3   | Dec-21-2016 | 01:41:39 | Mem ECC Warning  | Memory                   | transition to Critical from less severe ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
4   | Dec-14-2017 | 07:21:25 | Mem ECC Warning  | Memory                   | transition to Non-Critical from OK ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
5   | Dec-14-2017 | 07:21:25 | Mem ECC Warning  | Memory                   | transition to Critical from less severe ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
6   | Feb-22-2018 | 01:27:39 | Mem ECC Warning  | Memory                   | transition to Non-Critical from OK ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
7   | Feb-22-2018 | 03:08:43 | Mem ECC Warning  | Memory                   | transition to Critical from less severe ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
8   | Apr-24-2018 | 17:56:42 | Mem ECC Warning  | Memory                   | transition to Non-Critical from OK ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
9   | Apr-24-2018 | 20:07:26 | Mem ECC Warning  | Memory                   | transition to Critical from less severe ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
May 9 2018, 10:36 AM · Operations, ops-codfw
fgiunchedi created T194249: kafka1023 correctable memory errors.
May 9 2018, 10:28 AM · Operations, ops-eqiad

May 8 2018

fgiunchedi closed T127762: Update Debian Package for Scap3 as Resolved.
May 8 2018, 4:54 PM · Patch-For-Review, Scap
fgiunchedi created T194176: wtp2020 correctable memory errors.
May 8 2018, 4:26 PM · Operations, ops-codfw
fgiunchedi created T194174: wtp2013 memory correctable errors.
May 8 2018, 4:12 PM · Operations, ops-codfw
fgiunchedi created T194172: mw2213 correctable memory errors.
May 8 2018, 4:02 PM · Operations, ops-codfw
fgiunchedi updated subscribers of T183177: memory errors not showing in icinga.

The correctable errors check has been deployed and it is yielding some results already. @herron and I took a look at the list of hosts and there seem to be a few different "classes" or "states":

  1. high count of CEs and recent kernel messages
  2. low count of CEs and no recent kernel messages
May 8 2018, 3:54 PM · Traffic, Patch-For-Review, User-fgiunchedi, DC-Ops, Operations, monitoring
fgiunchedi created T194171: rdb2002 correctable memory errors.
May 8 2018, 3:47 PM · Operations, ops-codfw
fgiunchedi merged task T133392: save grafana dashboards in revision control / puppet into T171482: Programmatic generation of grafana dashboards.
May 8 2018, 2:14 PM · User-fgiunchedi, Operations, monitoring
fgiunchedi merged T133392: save grafana dashboards in revision control / puppet into T171482: Programmatic generation of grafana dashboards.
May 8 2018, 2:14 PM · Graphite, User-fgiunchedi, monitoring, Operations
fgiunchedi closed T190081: rack/setup/install ms-be104[0-3].eqiad.wmnet as Resolved.

Rebalance has completed, resolving

May 8 2018, 2:11 PM · User-fgiunchedi, Operations
fgiunchedi added a comment to D1049: Swift file storage engine for Phabricator.

See preliminary comments inline, something else to keep in mind wrt big files: swift is limited by default to 4-5GB files as a single object. Going over that means either using SLOs or DLOs: https://docs.openstack.org/swift/latest/overview_large_objects.html
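
To make the size limit concrete, here is a minimal sketch of how a client might decide whether a file needs to be uploaded in segments and how it could be split. The 5 GiB limit and the 1 GiB segment size are assumptions for illustration (the actual limit is a cluster setting), and the function names are hypothetical; this does not call any Swift API.

```python
# Assumed single-object limit and segment size; both depend on the
# cluster configuration and are placeholders here.
ASSUMED_MAX_OBJECT_SIZE = 5 * 1024 ** 3
ASSUMED_SEGMENT_SIZE = 1024 ** 3


def needs_large_object(size, limit=ASSUMED_MAX_OBJECT_SIZE):
    """True when a file of `size` bytes cannot be stored as one object."""
    return size > limit


def plan_segments(size, segment_size=ASSUMED_SEGMENT_SIZE):
    """Split `size` bytes into (offset, length) pairs of at most segment_size."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    segments = []
    offset = 0
    while offset < size:
        length = min(segment_size, size - offset)
        segments.append((offset, length))
        offset += length
    return segments
```

Each (offset, length) pair would correspond to one segment object; an SLO manifest or a DLO prefix then ties the segments together, as described in the linked Swift documentation.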

May 8 2018, 8:41 AM · Phabricator, media-storage
fgiunchedi added a comment to T127762: Update Debian Package for Scap3.

@thcipriani for sure! package is built, LMK when available and we'll deploy it

May 8 2018, 8:23 AM · Patch-For-Review, Scap

May 7 2018

fgiunchedi added a comment to T193766: Ship host syslogs to ELK.
  • Capacity - I chatted with @Gehel at the last ops friday hangout about ELK and friends, it would be nice to get our feet wet with multiple indices instead of one single index. Syslog might be a good occasion for that, in this context I'm saying "one index" but it is really one index per day, prefixed e.g. with syslog.

Sounds good to me! Something like logstash-syslog-date should help keep things expecting logstash-* working as-is
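
A small sketch of the naming idea: a per-day syslog index whose name still matches the `logstash-*` wildcard. The exact date format (`YYYY.MM.DD`) is an assumption, and the function name is hypothetical.

```python
import datetime
import fnmatch


def syslog_index_name(day):
    """Build a per-day index name such as logstash-syslog-2018.05.07."""
    return "logstash-syslog-{:%Y.%m.%d}".format(day)


def matches_logstash_pattern(index_name, pattern="logstash-*"):
    """Check the name against the wildcard that existing consumers expect."""
    return fnmatch.fnmatchcase(index_name, pattern)


if __name__ == "__main__":
    name = syslog_index_name(datetime.date(2018, 5, 7))
    print(name, matches_logstash_pattern(name))
```

Because the prefix stays `logstash-`, anything querying `logstash-*` keeps seeing the new indices without changes.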

May 7 2018, 1:05 PM · User-herron, Patch-For-Review, Operations
fgiunchedi created T194036: mw1230 sdb "Raw_Read_Error_Rate" SMART .
May 7 2018, 12:42 PM · User-fgiunchedi, Operations
fgiunchedi updated the task description for T86552: Monitor and alarm on SMART attributes.
May 7 2018, 12:15 PM · Patch-For-Review, User-fgiunchedi, Operations, monitoring
fgiunchedi added a comment to T136312: Encrypt syslog traffic.

Upstream has fixed the issue, should be included in the next rsyslog release. When that happens we'll try it out on the central syslog servers.

May 7 2018, 12:15 PM · Patch-For-Review, monitoring, User-fgiunchedi, Operations
fgiunchedi moved T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs from Backlog to Radar on the User-fgiunchedi board.
May 7 2018, 12:14 PM · User-fgiunchedi, Services (blocked), Operations, Cassandra, User-Eevans
fgiunchedi awarded T106381: spare/unused disks on application servers a 100 token.
May 7 2018, 10:09 AM · Patch-For-Review, Operations

May 4 2018

fgiunchedi added a comment to T193793: Icinga SMART check returns OK when not getting data.

The screenshot above is from a time when the host was being reinstalled and every other check on the host was red. Is it possible that it was actually OK during this time?

May 4 2018, 1:48 PM · Patch-For-Review, Operations, monitoring
fgiunchedi added a comment to T193272: Prometheus vs. CPU usage vs. hyperthreading.

In a Prometheus world the cpu utilization is calculated from the number of seconds each cpu has spent in each mode, from the numbers in /proc/stat. e.g. https://grafana.wikimedia.org/dashboard/db/host-overview uses that in the cpu utilization, divided by the number of cores to normalize the graph at 100%. There's also more information on https://www.robustperception.io/understanding-machine-cpu-usage/. AFAICS the graphs in labs-capacity-planning are using graphite/diamond as their source, were you looking to port the dashboard to Prometheus instead?
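
The calculation described above can be sketched from two samples of the aggregate `cpu` line in /proc/stat. This is a simplified illustration, not the dashboard's actual query: it treats only the `idle` column as non-busy (an assumption; some setups also count `iowait` as idle) and uses the first eight counters, since guest time is already included in user time. Function names are hypothetical.

```python
def parse_cpu_line(stat_text):
    """Return the first eight counters of the aggregate 'cpu' line:
    user, nice, system, idle, iowait, irq, softirq, steal."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def busy_percent(before_text, after_text):
    """Percentage of CPU time spent in non-idle modes between two samples."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    deltas = [a - b for b, a in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    idle = deltas[3]  # fourth column is the idle counter
    return 100.0 * (total - idle) / total
```

Because the aggregate line already sums over all CPUs, the ratio is normalized to 100% regardless of core count, which is the same effect as dividing per-CPU rates by the number of cores.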

May 4 2018, 8:22 AM · Operations, cloud-services-team, monitoring
fgiunchedi added a comment to T193766: Ship host syslogs to ELK.

Thanks for kickstarting this! +1, having syslogs in ELK would be very useful indeed. Some partial answers to the things to figure out:

May 4 2018, 8:08 AM · User-herron, Patch-For-Review, Operations
fgiunchedi added a comment to T193793: Icinga SMART check returns OK when not getting data.

That's the current behavior of the check, i.e. when things are ok exit 0 and no output. We can change it to print "OK" or sth similar, and the values/thresholds perhaps
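
One possible shape for the change discussed here, following the usual Nagios/Icinga plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN). This is a hypothetical sketch, not the actual check: the function name and the input format (a mapping of disk name to health flag) are assumptions.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def smart_check(health):
    """Return (exit_code, message) for a mapping of disk name -> healthy flag.

    An empty or missing mapping means no data was collected, which is
    reported as UNKNOWN instead of passing silently as OK.
    """
    if not health:
        return UNKNOWN, "UNKNOWN: no SMART data collected"
    failed = sorted(disk for disk, healthy in health.items() if not healthy)
    if failed:
        return CRITICAL, "CRITICAL: unhealthy disks: " + ", ".join(failed)
    return OK, "OK: %d disk(s) healthy" % len(health)


if __name__ == "__main__":
    import sys
    code, message = smart_check({"sda": True, "sdb": True})
    print(message)
    sys.exit(code)
```

Always printing a status line, even in the OK case, makes it easier to tell a healthy result apart from a check that produced no output at all.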

May 4 2018, 7:57 AM · Patch-For-Review, Operations, monitoring

May 3 2018

fgiunchedi added a comment to T178690: Better organization for ops grafana dashboards.

Thanks for the feedback!

May 3 2018, 10:41 AM · User-fgiunchedi, monitoring, Operations

May 2 2018

fgiunchedi closed T193186: Use recording rules for k8s Prometheus alerts as Resolved.
May 2 2018, 4:38 PM · Patch-For-Review, Kubernetes, User-fgiunchedi
fgiunchedi created T193651: labstore1003 SMART failure.
May 2 2018, 3:29 PM · Cloud-VPS, ops-eqiad, cloud-services-team, Operations
fgiunchedi created T193628: tungsten disk 1 and 8 SMART failure.
May 2 2018, 1:09 PM · ops-eqiad, Operations
fgiunchedi added a comment to T193488: Make python-logstash Debian package build for python 3.
reprepro copy stretch-wikimedia jessie-wikimedia python-logstash

Aren't the same commands needed for python3-logstash?

I'm still unable to get python3-logstash from Vagrant right now, which is on Stretch. I'm able to pick up version 0.4.6-2 of python-logstash correctly, though.

May 2 2018, 12:12 PM · Performance-Team
fgiunchedi added a comment to T193488: Make python-logstash Debian package build for python 3.

@Gilles done! Should be good to go

May 2 2018, 10:36 AM · Performance-Team
fgiunchedi closed T193488: Make python-logstash Debian package build for python 3 as Resolved.

Done! For reference, the commands I used (note this package has -2 as its Debian revision, thus the upstream source is already uploaded; we are changing only the built packages):

May 2 2018, 10:36 AM · Performance-Team
fgiunchedi added a comment to T193488: Make python-logstash Debian package build for python 3.

@Gilles sounds good, can you send a gerrit review against operations/debs/python-logstash instead since I've imported the package there after the first upload?

May 2 2018, 9:04 AM · Performance-Team
fgiunchedi added a comment to T186069: Icinga: page in case all MediaWiki are throwing 5xx.

There has been a spike of 500s yesterday in codfw, looks like from search.wikimedia.org (tracked at T193600)

May 2 2018, 7:33 AM · Patch-For-Review, Wikimedia-Incident, Icinga, Operations, monitoring

Apr 30 2018

fgiunchedi moved T193186: Use recording rules for k8s Prometheus alerts from Backlog to Doing on the User-fgiunchedi board.
Apr 30 2018, 2:54 PM · Patch-For-Review, Kubernetes, User-fgiunchedi
fgiunchedi added a comment to T178690: Better organization for ops grafana dashboards.

I've put together a sample dashboard to play around with some concepts/ideas that emerged in this task at https://grafana.wikimedia.org/dashboard/db/dashboard-redesign-proposal . Notably missing is the navigation story among different dashboards, but tl;dr it would be based on dashboard tags to create dropdowns. Which grouping/dropdown menus make sense is still TBD.

Apr 30 2018, 2:18 PM · User-fgiunchedi, monitoring, Operations
fgiunchedi closed T192763: Create a prometheus exporter for mcrouter as Resolved.

Upstream has merged the changes I submitted, the Debian package has been uploaded to stretch-wikimedia and the puppetization merged. Resolving for now.

Apr 30 2018, 8:15 AM · Patch-For-Review, Performance-Team (Radar), User-fgiunchedi, User-Joe, Availability (MediaWiki-MultiDC), Operations
fgiunchedi closed T192763: Create a prometheus exporter for mcrouter, a subtask of T192370: Deploy mcrouter to production as a wancache backend, as Resolved.
Apr 30 2018, 8:15 AM · Patch-For-Review, Performance-Team (Radar), Availability (MediaWiki-MultiDC), Operations
fgiunchedi updated the task description for T136562: Audit/fix hosts with no RAID configured.
Apr 30 2018, 7:33 AM · Patch-For-Review, Operations

Apr 27 2018

fgiunchedi added a comment to T181523: labtest puppetmaster is not working for clients.

Hey @fgiunchedi could you give us some help with the certificate issue?

  • it seems we are using the labtestpuppetmaster puppet client certs to serve apache/puppetmaster clients (VMs in the labtest deployment)
  • this cert above is generated by puppetmaster1001.eqiad.wmnet and is committed to the private repo for use in the puppetmaster::ca_server puppet class (which is used in labtestpuppetmaster given the role as puppet master / CA server)
  • @akosiaris suggested that this apache/puppetmaster cert should be a self-signed cert (aka root CA) which is generated at puppet package install time
  • there seems to be some chicken/egg problem. If we need the self-signed certificate auto-generated at package install time, but we need to commit that cert to the private repo, how is that supposed to work?

    We will really appreciate any hint, docs or clarification :-) thanks!
Apr 27 2018, 11:39 AM · Patch-For-Review, cloud-services-team (Kanban), Cloud-VPS, Epic

Apr 26 2018

fgiunchedi created T193186: Use recording rules for k8s Prometheus alerts.
Apr 26 2018, 4:27 PM · Patch-For-Review, Kubernetes, User-fgiunchedi
fgiunchedi moved T183454: Deprovision Diamond collectors no longer in use from Up next to Doing on the User-fgiunchedi board.
Apr 26 2018, 10:23 AM · Patch-For-Review, User-fgiunchedi, monitoring, Operations
fgiunchedi moved T190081: rack/setup/install ms-be104[0-3].eqiad.wmnet from Backlog to Doing on the User-fgiunchedi board.
Apr 26 2018, 9:51 AM · User-fgiunchedi, Operations
fgiunchedi added a project to T190081: rack/setup/install ms-be104[0-3].eqiad.wmnet: User-fgiunchedi.
Apr 26 2018, 9:50 AM · User-fgiunchedi, Operations

Apr 25 2018

fgiunchedi added a comment to T174431: Upgrade mw* servers to Debian Stretch (using HHVM).

While investigating cronspam from recent reimages I took a look at mw1247 (for example) and noticed it has two disks but no software raid (T106381). I think we should also fix that while we're reimaging with Stretch anyways.

Apr 25 2018, 4:24 PM · Patch-For-Review, User-Elukey, HHVM, Operations
fgiunchedi added a comment to T192763: Create a prometheus exporter for mcrouter.

I sent some changes upstream that I think would be beneficial, https://github.com/Dev25/mcrouter_exporter/pull/3

Apr 25 2018, 3:35 PM · Patch-For-Review, Performance-Team (Radar), User-fgiunchedi, User-Joe, Availability (MediaWiki-MultiDC), Operations
fgiunchedi claimed T190081: rack/setup/install ms-be104[0-3].eqiad.wmnet.
Apr 25 2018, 8:00 AM · User-fgiunchedi, Operations
fgiunchedi merged task T191896: Rack and setup ms-be1040-1043 into T190081: rack/setup/install ms-be104[0-3].eqiad.wmnet.
Apr 25 2018, 7:59 AM · Patch-For-Review, Operations
fgiunchedi merged T191896: Rack and setup ms-be1040-1043 into T190081: rack/setup/install ms-be104[0-3].eqiad.wmnet.
Apr 25 2018, 7:59 AM · User-fgiunchedi, Operations
demon awarded T191525: Surface broken originals in mediawiki a Like token.
Apr 25 2018, 12:22 AM · Wikimedia-Hackathon-2018, MediaWiki-Uploading, Multimedia

Apr 24 2018

fgiunchedi added a comment to T161296: Upgrade mysqld_exporter to 0.10.0.

This could be done massively right now. Missing hosts with 0.9.0 still (that are not set as spares, waiting for decommissioning):

  • db[2033-2034,2036-2037,2042,2044,2069-2078,2080-2082,2084-2093].codfw.wmnet
  • db[1051,1053,1055-1056,1059,1063,1065,1073,1096-1099,1101,1103,1105,1107-1108,1113-1115].eqiad.wmnet
  • es[1012-1013,1017].eqiad.wmnet
  • es[2011-2019].codfw.wmnet

    However, at least some, if not most of them say: prometheus-mysqld-exporter is already the newest version (0.9.0+ds-3+b2) I guess the package is not available on stretch?
Apr 24 2018, 5:26 PM · User-fgiunchedi, Operations, Prometheus-metrics-monitoring
fgiunchedi reassigned T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs from Cmjohnson to Eevans.

I've gone ahead and reimaged restbase1010, all cassandra instances are masked ATM but the host is otherwise good to be tested again.

Apr 24 2018, 4:53 PM · User-fgiunchedi, Services (blocked), Operations, Cassandra, User-Eevans
fgiunchedi added a comment to T191896: Rack and setup ms-be1040-1043.

@Cmjohnson confirmed raid config is the same on all of those, I rebooted the hosts showing the incorrect order and indeed upon reboot the order is as expected:

Apr 24 2018, 3:17 PM · Patch-For-Review, Operations
fgiunchedi added a comment to T192899: Restore Graphite whipser data from April 23th.

We're not backing up graphite's data directory, though metrics are mirrored to codfw too so we can copy back from there. Which files do you need?

Apr 24 2018, 1:48 PM · Analytics, Operations, Graphite
fgiunchedi closed T192874: Degraded RAID on ms-be1043 as Invalid.

Host is being set up in T191896: Rack and setup ms-be1040-1043

Apr 24 2018, 1:42 PM · ops-eqiad, Operations
fgiunchedi added a comment to T191896: Rack and setup ms-be1040-1043.

Looks like 3 out of 4 hosts have sda or sdb as one of the HDDs, not SSDs. The remaining host has sda/sdb as SSDs and two additional mdadm raid arrays.

Apr 24 2018, 1:26 PM · Patch-For-Review, Operations
fgiunchedi added a project to T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs: User-fgiunchedi.
Apr 24 2018, 8:47 AM · User-fgiunchedi, Services (blocked), Operations, Cassandra, User-Eevans
fgiunchedi added a comment to T192768: wdqs-updater crashing not cleanly.

No planned upgrades ATM, though a newer upstream version might help with understanding (hopefully fixing) T192456: Prometheus metrics missing for some hosts too, so definitely welcome!

Apr 24 2018, 8:44 AM · Patch-For-Review, Discovery, Wikidata, Discovery-Wikidata-Query-Service-Sprint, Wikidata-Query-Service
fgiunchedi reassigned T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs from RobH to Cmjohnson.

@Cmjohnson restbase1010 is powered down and ready to have all of its ssd swapped

Apr 24 2018, 8:10 AM · User-fgiunchedi, Services (blocked), Operations, Cassandra, User-Eevans

Apr 23 2018

fgiunchedi added a comment to T192551: atop on stretch overloading a host.

+1 to remove atop as a daemon/cron, possibly the package altogether too

Apr 23 2018, 4:16 PM · Upstream, Patch-For-Review, monitoring, Operations
fgiunchedi added a comment to T186069: Icinga: page in case all MediaWiki are throwing 5xx.

Alerted today, real short-lived issue. Note that the alert is a single one even though its text can change over time (e.g. when more sites alert) so icinga needs to be instructed to re-alert whenever the text changes. Other improvements include printing the "worst" value found among all metrics that match the query.

Apr 23 2018, 3:23 PM · Patch-For-Review, Wikimedia-Incident, Icinga, Operations, monitoring
fgiunchedi added a comment to T192763: Create a prometheus exporter for mcrouter.

I'll be helping with mcrouter_exporter packaging/setup/etc. I tried it and it looks like it is doing the right thing (though by asking mcrouter directly, not using stats files).

Apr 23 2018, 1:56 PM · Patch-For-Review, Performance-Team (Radar), User-fgiunchedi, User-Joe, Availability (MediaWiki-MultiDC), Operations
fgiunchedi moved T192763: Create a prometheus exporter for mcrouter from Backlog to Doing on the User-fgiunchedi board.
Apr 23 2018, 1:06 PM · Patch-For-Review, Performance-Team (Radar), User-fgiunchedi, User-Joe, Availability (MediaWiki-MultiDC), Operations