fgiunchedi (Filippo Giunchedi)
Awesome

Projects (18)

User Since
Oct 3 2014, 8:06 AM (124 w, 2 d)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
Filippo Giunchedi

Recent Activity

Fri, Feb 17

fgiunchedi renamed T157355: Labs Prometheus not recording k8s stats since 2017-01-24T06:00 from "Labs Promethius not recording k8s stats since 2017-01-24T06:00" to "Labs Prometheus not recording k8s stats since 2017-01-24T06:00".
Fri, Feb 17, 4:56 PM · Prometheus-metrics-monitoring, Tools-Kubernetes, Labs, Tool-Labs
fgiunchedi added a comment to T157355: Labs Prometheus not recording k8s stats since 2017-01-24T06:00.

Sure @chasemp! Looks like the tag kubernetes_namespace has been renamed to kubernetes, possibly following the k8s upgrade as @scfc pointed out.

Fri, Feb 17, 4:37 PM · Prometheus-metrics-monitoring, Tools-Kubernetes, Labs, Tool-Labs
fgiunchedi added a comment to T158337: codfw: ms-be2028-ms-be2039 rack/setup.

@Papaul racking looks good to me, will all of those be 10G?
I'm assuming you have space to install the new hw alongside the old one? If that's not the case we can do one row at a time.

Fri, Feb 17, 10:39 AM · ops-codfw, Operations
fgiunchedi added a comment to T158338: Set up DNS caching for node services.

@GWicke when I flipped over statsd.eqiad.wmnet after the CNAME change, it was sufficient to restart the services, without changing any IP address. It looks like the DNS name is what's in the config, but it is resolved only once?

Fri, Feb 17, 10:29 AM · Services (next), codfw-rollout, codfw-rollout-Jan-Mar-2016
fgiunchedi added a comment to T157949: Thumbor leaks pipes.

@Gilles no it doesn't look like swift proxy was restarted, I checked uptime on all frontends and it is Jan 20th for swift-proxy and 2016 for memcached

Fri, Feb 17, 10:18 AM · Thumbor, Performance-Team
fgiunchedi added a comment to T157949: Thumbor leaks pipes.

I did some log spelunking on a sample instance that was showing 401s, in this case thumbor1002 thumbor@8806

Fri, Feb 17, 10:16 AM · Thumbor, Performance-Team
fgiunchedi added a comment to T156955: Standardizing our partman recipes.

+1 on reducing the number of partman recipes!

Fri, Feb 17, 9:34 AM · Patch-For-Review, Operations
fgiunchedi added a comment to T157949: Thumbor leaks pipes.

re: token life, ATM it is set to a week

Fri, Feb 17, 8:50 AM · Thumbor, Performance-Team
fgiunchedi closed T127762: Update Debian Package for Scap3 as "Resolved".
Fri, Feb 17, 8:48 AM · Patch-For-Review, Deployment-Systems, Scap

Thu, Feb 16

fgiunchedi added a comment to T157949: Thumbor leaks pipes.

@Gilles good question, I'm not sure expired swift tokens and pipe leaks are related, but they very well could be. E.g. right now I'm not seeing pipes leaking, but all thumbor instances have a short uptime.

Thu, Feb 16, 3:55 PM · Thumbor, Performance-Team
fgiunchedi added a comment to T158288: Unclean stop of jobrunner service via puppet.

The cure for the moment is to 'systemctl reset-failed jobrunner' to restore non-degraded systemd state

Thu, Feb 16, 10:07 AM · Operations
fgiunchedi renamed T158288: Unclean stop of jobrunner service via puppet from "'systemctl restart jobrunner' broken via salt" to "Unclean stop of jobrunner service via puppet".
Thu, Feb 16, 9:55 AM · Operations
fgiunchedi added a comment to T158288: Unclean stop of jobrunner service via puppet.

Updated https://wikitech.wikimedia.org/wiki/Service_restarts#Application_servers_.28also_image.2Fvideo_scalers_and_job_runners.29 with a disclaimer about stop/start

Thu, Feb 16, 9:34 AM · Operations
fgiunchedi created T158288: Unclean stop of jobrunner service via puppet.
Thu, Feb 16, 9:31 AM · Operations
fgiunchedi created T158286: Raise default logging level of prometheus-hhvm-exporter.
Thu, Feb 16, 9:08 AM · HHVM, Prometheus-metrics-monitoring
fgiunchedi closed T140927: Make git 2.2.0+ (preferably 2.8.x) available as "Resolved".

This is completed, use_experimental can be removed once deployment servers are migrated to stretch

Thu, Feb 16, 8:41 AM · Patch-For-Review, Scap, Operations, Release-Engineering-Team (Long-Lived-Branches)
fgiunchedi added a comment to T149451: Get 5xx logs into kibana/logstash.

We could set up a special varnishkafka instance for this, if that makes sense. But, hm, I think using kafkatee would be better! kafkatee supports piped output, so we don't have to do any special log file tailing. We just add a configuration to pipe the grepped logs to whatever process we want (netcat, perhaps?). Is there a logstash stdin producer? This would look something like:

output pipe 1 /bin/grep --line-buffered '"http_status":"5' | logstash-stdin-producer
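
For illustration only, the selection that the grep stage above performs could be sketched in Python. This assumes each log line is a JSON object with a string "http_status" field; the function names are hypothetical, and "logstash-stdin-producer" in the line above is itself a placeholder, not an existing tool.

```python
import json


def is_5xx(line):
    """Return True if the line is a JSON object whose http_status starts with 5."""
    try:
        record = json.loads(line)
    except ValueError:
        # Not valid JSON: skip it rather than crash the pipe.
        return False
    if not isinstance(record, dict):
        return False
    return str(record.get("http_status", "")).startswith("5")


def filter_5xx(lines):
    """Keep only the lines that describe a 5xx response."""
    return [line for line in lines if is_5xx(line)]
```

Unlike the grep, this parses the JSON, so it would not match the pattern if it appeared inside another field.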
Thu, Feb 16, 8:10 AM · Wikimedia-Logstash, Operations

Tue, Feb 14

fgiunchedi edited the description of T123728: Upgrade fluorine to trusty/jessie.
Tue, Feb 14, 3:34 PM · Patch-For-Review, Operations
fgiunchedi closed T148652: 2016-10-17 API cluster overload as "Resolved".

@greg I think it is safe to close, we've mitigated the issue by having a separate cluster for async processing in T151702 and related.

Tue, Feb 14, 1:11 PM · Wikimedia-Incident, HHVM, Release-Engineering-Team, Operations
fgiunchedi added a comment to T123728: Upgrade fluorine to trusty/jessie.
Tue, Feb 14, 10:58 AM · Patch-For-Review, Operations
fgiunchedi added a comment to T156023: Check the size of every cluster in codfw to see if it matches eqiad's capacity.

If the above counts are consistent, I'd like to:

  1. reimage 3 appservers (40 cores) as api_appservers
  2. reimage 2 appservers (40 cores) as imagescalers
  3. reimage 1 appserver (40 cores) as jobrunner
  4. reimage 2 appservers (32 cores) as videoscalers
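
As a quick sanity check of the plan above, the core totals can be added up. This is a sketch that assumes the number in parentheses is the core count of each host, not of the group; the names are illustrative.

```python
from collections import namedtuple

# role, number of hosts to reimage, cores per host (assumed to be per host)
Reimage = namedtuple("Reimage", "role hosts cores_per_host")

PLAN = [
    Reimage("api_appservers", 3, 40),
    Reimage("imagescalers", 2, 40),
    Reimage("jobrunner", 1, 40),
    Reimage("videoscalers", 2, 32),
]


def cores_by_role(plan):
    """Cores that each target role gains from the reimages."""
    return {item.role: item.hosts * item.cores_per_host for item in plan}


def total_hosts(plan):
    """Number of appservers taken out of the appserver pool."""
    return sum(item.hosts for item in plan)


def total_cores(plan):
    """Total cores moved away from the appserver pool."""
    return sum(item.hosts * item.cores_per_host for item in plan)
```

Under that assumption the plan moves 8 hosts and 304 cores in total.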
Tue, Feb 14, 9:31 AM · Patch-For-Review, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations

Mon, Feb 13

fgiunchedi added a comment to T123728: Upgrade fluorine to trusty/jessie.

For redundancy purposes it would be nice if mediawiki could send udp2log traffic to udp2log receivers in both datacenters. I don't know if mediawiki is already able to do that via its logging configuration, @bd808 you might know how/if we can do that? thanks!

The config changes needed would be done in wmf-config/logging.php. Every MediaWiki\Logger\Monolog\LegacyHandler object in that file is a path to sending log event data to fluorine. Monolog has a Monolog\Handler\GroupHandler class that could be used to replace each one with a GroupHandler containing two MediaWiki\Logger\Monolog\LegacyHandler objects, one that sends to fluorine and another that sends to mwlog1001.

The $wmgMonologHandlers['wgDebugLogFile'] handler is a special case that would either need to be treated explicitly or ignored. Generally it points to /dev/null, but on testwiki and test2wiki or when special request settings are present it gets pointed to distinct local or UDP log sinks. It would probably be easiest to just ignore all of this complexity and let it go to wherever $wmfUdp2logDest is pointing. That could be either of the two log destinations.

The other way that all of this could be handled would be to setup udp2log on mwlog1001 to relay everything it sees back to fluorine and then just switch $wmfUdp2logDest to point to mwlog1001. That would make the MediaWiki config stable and let techops control things on fluorine via udp2log configuration. This is basically the opposite of the configuration being applied in https://gerrit.wikimedia.org/r/335625. The benefit I see of this is that auditing can be done on mwlog1001 to know when all the things have been switched. If a log is on fluorine that is not on mwlog1001 then you have found something else that needs its configuration to be changed.
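
The audit described above boils down to a set difference. A minimal sketch, assuming the list of log names seen on each host has already been collected somehow (the function name and the example log names are hypothetical):

```python
def needs_migration(seen_on_fluorine, seen_on_mwlog1001):
    """Logs that still reach fluorine but never show up on mwlog1001.

    Each of these points at a sender whose configuration has not been
    switched to mwlog1001 yet.
    """
    return sorted(set(seen_on_fluorine) - set(seen_on_mwlog1001))
```

For example, `needs_migration(["api.log", "exception.log"], ["api.log"])` returns `["exception.log"]`; an empty result would mean everything has been switched.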

Mon, Feb 13, 3:32 PM · Patch-For-Review, Operations
fgiunchedi created T157972: Puppet fails only once when restarting ferm is not successful.
Mon, Feb 13, 3:22 PM · Operations
fgiunchedi edited the description of T123728: Upgrade fluorine to trusty/jessie.
Mon, Feb 13, 2:58 PM · Patch-For-Review, Operations
fgiunchedi added a comment to T157022: Suspected faulty SSD on graphite1001.

Read traffic has been switched over to graphite2001 now and seems to work.

Note that graphite2001 was unable to talk to eventlog1001, the root cause is that stat1001 was still in ferm's configuration and @resolve wouldn't work for it, thus not reloading rules (puppet didn't fail either on this)

Can you elaborate? Where was stat1001 used in ferm configuration? resolve() is only resolved during ferm startup/restarts, i.e. if an IP behind a CNAME changes, ferm needs a reload.

stat1001 is still used on eventlog1001 in ferm's rsync rules and indeed that was what prevented a ferm reload. The only place I can still see stat1001 is in statistics_servers in hieradata

eventlog1001:~$ sudo grep -ir stat1001 /etc/ferm/
/etc/ferm/conf.d/10_eventlogging_rsyncd:&R_SERVICE(tcp, 873, @resolve((stat1001.eqiad.wmnet stat1002.eqiad.wmnet stat1003.eqiad.wmnet analytics1027.eqiad.wmnet dataset1001.wikimedia.org thorium.eqiad.wmnet)));
eventlog1001:~$ host stat1001.eqiad.wmnet
Host stat1001.eqiad.wmnet not found: 3(NXDOMAIN)
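
A check along these lines could flag the problem before ferm is reloaded. This is only a sketch: it extracts hostnames from @resolve((...)) groups with a simple regular expression (not a real ferm parser) and looks each one up; the function names are made up for the example.

```python
import re
import socket

# Matches the host list inside @resolve((host1 host2 ...)).
RESOLVE_RE = re.compile(r"@resolve\(\(([^)]*)\)\)")


def hosts_in_rule(rule):
    """Return every hostname listed in @resolve((...)) groups of a rule line."""
    hosts = []
    for group in RESOLVE_RE.findall(rule):
        hosts.extend(group.split())
    return hosts


def unresolvable(hosts, resolver=socket.getaddrinfo):
    """Return the hostnames that fail to resolve (e.g. NXDOMAIN)."""
    missing = []
    for host in hosts:
        try:
            resolver(host, None)
        except socket.gaierror:
            missing.append(host)
    return missing
```

Run against the rule above, `unresolvable(hosts_in_rule(rule))` would be expected to list stat1001.eqiad.wmnet as long as the name does not resolve.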
Mon, Feb 13, 11:49 AM · Patch-For-Review, ops-eqiad, Operations
fgiunchedi created T157949: Thumbor leaks pipes.
Mon, Feb 13, 10:44 AM · Thumbor, Performance-Team

Fri, Feb 10

fgiunchedi added a comment to T157237: Degraded RAID on ms-be1012.

I am assuming this is one of the SSDs: when I pull the pd list with megacli, an SSD is missing. Please confirm. The system is out of warranty but we have spares on-site.

Fri, Feb 10, 2:46 PM · ops-eqiad, Operations
fgiunchedi added a comment to T157794: diamond crashing on hosts using systemd-timesyncd.

Should we remove /usr/share/diamond/collectors/ntpd/ if systemd-timesyncd is in use?

Fri, Feb 10, 12:22 PM · Patch-For-Review, Traffic, Monitoring, Operations
fgiunchedi added a comment to T157022: Suspected faulty SSD on graphite1001.

@fgiunchedi I have the ssds on-site. The disk is in a 3.5" internal disk bay and will need to be powered off for the replacement.

Fri, Feb 10, 12:08 PM · Patch-For-Review, ops-eqiad, Operations
fgiunchedi added a comment to T127976: Graphite DC fail-over / per-DC setup.

So basically either the connection is kept open on the client side and the name is never looked up again, or the applications cache dns indefinitely.
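
For comparison, a client that honours a TTL would look the name up again once its cached answer expires. A minimal Python sketch of that idea, with a hypothetical class name and a TTL passed in by the caller (getaddrinfo does not expose the real DNS TTL):

```python
import socket
import time


class ReResolvingUdpSender:
    """Send UDP datagrams, resolving the hostname again after ttl_seconds."""

    def __init__(self, host, port, ttl_seconds=3600.0,
                 resolver=socket.getaddrinfo, clock=time.monotonic):
        self.host = host
        self.port = port
        self.ttl_seconds = ttl_seconds
        self._resolver = resolver
        self._clock = clock
        self._cached = None      # (address family, sockaddr)
        self._expires_at = 0.0

    def current_target(self):
        """Return the cached (family, sockaddr), refreshing it when expired."""
        now = self._clock()
        if self._cached is None or now >= self._expires_at:
            infos = self._resolver(self.host, self.port, type=socket.SOCK_DGRAM)
            family, _type, _proto, _canon, sockaddr = infos[0]
            self._cached = (family, sockaddr)
            self._expires_at = now + self.ttl_seconds
        return self._cached

    def send(self, payload):
        """Send one datagram to whatever address the name currently points at."""
        family, sockaddr = self.current_target()
        with socket.socket(family, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, sockaddr)
```

A client that resolves once at startup and keeps the address forever is the behaviour described above; the only difference here is the expiry check in current_target().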

Fri, Feb 10, 11:58 AM · Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016

Thu, Feb 9

fgiunchedi added a comment to T127976: Graphite DC fail-over / per-DC setup.

While working on T157022: Suspected faulty SSD on graphite1001, and in particular the DNS switchover, I noticed that not all services are observing the DNS TTL of 1H.
Below is the list of statsd prefixes still coming in to graphite1001 several hours after flipping the DNS

Thu, Feb 9, 6:29 PM · Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016
fgiunchedi added a comment to T157022: Suspected faulty SSD on graphite1001.

Read traffic has been switched over to graphite2001 now and seems to work.

Note that graphite2001 was unable to talk to eventlog1001, the root cause is that stat1001 was still in ferm's configuration and @resolve wouldn't work for it, thus not reloading rules (puppet didn't fail either on this)

Can you elaborate? Where was stat1001 used in ferm configuration? resolve() is only resolved during ferm startup/restarts, i.e. if an IP behind a CNAME changes, ferm needs a reload.

Thu, Feb 9, 2:39 PM · Patch-For-Review, ops-eqiad, Operations
fgiunchedi added a comment to T156922: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc).

warmup is one of the things that bounds our read-only time during the switchover, in that case we could start warming up wikis sorted by e.g. their pageviews to further shorten the acceptable read-only time.

That would significantly complicate the script as well as the actual switchover process. You'd have to deploy many changes to mw-config during the switchover to gradually read-only more and more wikis. The warmup script, meanwhile, takes less than a minute to run. I doubt we'd be reasonably saving any time considering the gradual read-only switching would have to be done manually and is about saving a subset of 50 seconds time.

Thu, Feb 9, 9:35 AM · Performance-Team, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations
fgiunchedi closed T155907: Degraded RAID on ms-be1013 as "Resolved".

thanks @Cmjohnson ! disk is rebuilding

Thu, Feb 9, 9:22 AM · media-storage, ops-eqiad, Operations
fgiunchedi added a comment to T118154: determine hardware needs for dumps in eqiad and codfw.
  • @fgiunchedi mentioned that the esams swift cluster could be used to hold dumps as a viability test if we want to go that route.
Thu, Feb 9, 9:20 AM · Operations, Dumps-Generation

Tue, Feb 7

fgiunchedi added a comment to T157022: Suspected faulty SSD on graphite1001.

The replacement SSDs have arrived onsite, and planning for replacing them can take place on this task.

Tue, Feb 7, 7:18 AM · Patch-For-Review, ops-eqiad, Operations

Fri, Feb 3

fgiunchedi assigned T86556: monitor SSD wear levels to Volans.

Moving to @Volans as per hangout chat :)

Fri, Feb 3, 6:11 PM · Operations-Software-Development, Operations, Monitoring
fgiunchedi added a comment to T157022: Suspected faulty SSD on graphite1001.

graphite2001 has been added to cr1/cr2 for analytics-in4

Fri, Feb 3, 5:59 PM · Patch-For-Review, ops-eqiad, Operations
fgiunchedi added a comment to T157022: Suspected faulty SSD on graphite1001.

I've also searched for graphite1001's address in router configs and the only place it shows up is the analytics-in4 filter for carbon/statsd traffic.

Fri, Feb 3, 5:53 PM · Patch-For-Review, ops-eqiad, Operations
fgiunchedi added a comment to T157022: Suspected faulty SSD on graphite1001.

Read traffic has been switched over to graphite2001 now and seems to work.

Fri, Feb 3, 5:48 PM · Patch-For-Review, ops-eqiad, Operations
fgiunchedi added a comment to T157022: Suspected faulty SSD on graphite1001.

I've staged the patches needed for failover in a series of reviews above. There's also a graphite-codfw dashboard at https://grafana.wikimedia.org/dashboard/db/graphite-codfw for which some graphite-related metrics won't be right until the failover happens, the system metrics are correct however.

Fri, Feb 3, 9:00 AM · Patch-For-Review, ops-eqiad, Operations

Thu, Feb 2

fgiunchedi updated subscribers of T157022: Suspected faulty SSD on graphite1001.

Note that the same behaviour is now showing up on sdb too. I've asked @RobH to bump quantity to order in T157065, assuming worst case all SSDs will eventually show the same behaviour and will need replacement.

Thu, Feb 2, 10:18 PM · Patch-For-Review, ops-eqiad, Operations
fgiunchedi merged T157034: Degraded RAID on graphite1001 into T157022: Suspected faulty SSD on graphite1001.
Thu, Feb 2, 2:54 PM · Patch-For-Review, ops-eqiad, Operations
fgiunchedi merged task T157034: Degraded RAID on graphite1001 into T157022: Suspected faulty SSD on graphite1001.
Thu, Feb 2, 2:54 PM · ops-eqiad, Operations
fgiunchedi added a comment to T157022: Suspected faulty SSD on graphite1001.

model INTEL SSDSC2BB600G4

Thu, Feb 2, 2:03 PM · Patch-For-Review, ops-eqiad, Operations
fgiunchedi created T157022: Suspected faulty SSD on graphite1001.
Thu, Feb 2, 12:52 PM · Patch-For-Review, ops-eqiad, Operations
fgiunchedi updated subscribers of T123728: Upgrade fluorine to trusty/jessie.

For redundancy purposes it would be nice if mediawiki could send udp2log traffic to udp2log receivers in both datacenters. I don't know if mediawiki is already able to do that via its logging configuration, @bd808 you might know how/if we can do that? thanks!

Thu, Feb 2, 10:53 AM · Patch-For-Review, Operations
fgiunchedi added a comment to T140927: Make git 2.2.0+ (preferably 2.8.x) available.

FWIW we're using the same method via jessie-wikimedia/experimental on cache boxes for e.g. nginx or the kernel, this is the relevant puppet config:

Thu, Feb 2, 9:22 AM · Patch-For-Review, Scap, Operations, Release-Engineering-Team (Long-Lived-Branches)
fgiunchedi added a comment to T156922: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc).

thanks @Krinkle !
I have some questions, mostly due to my ignorance of what mw does with memcache: if we were to wipe the caches in codfw, say today, without touching mw config and run the warmup script against codfw, would that be a realistic test of what would happen during the switchover in terms of performance?
Also, if I understand correctly, the warmup is one of the things that bounds our read-only time during the switchover; in that case we could start warming up wikis sorted by e.g. their pageviews to further shorten the acceptable read-only time.

Thu, Feb 2, 9:13 AM · Performance-Team, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations
fgiunchedi committed rOSPJ0ee30e891c51: scap: use cassandra dsh group (authored by fgiunchedi).
scap: use cassandra dsh group
Thu, Feb 2, 8:45 AM

Wed, Feb 1

fgiunchedi added a comment to T140927: Make git 2.2.0+ (preferably 2.8.x) available.

@demon I was indeed able to build stretch's git as-is on jessie, resulting in 2.11.0-2~bpo8+1. Uploading it internally to jessie-wikimedia is a possibility but it would also mean all jessie machines get the upgrade. ATM I don't think we have a good way of pinning a particular version only for e.g. tin/mira. We could upload git to the experimental component though, and enable said component on tin/mira.

Wed, Feb 1, 3:36 PM · Patch-For-Review, Scap, Operations, Release-Engineering-Team (Long-Lived-Branches)
fgiunchedi accepted D546: Address excessive open file descriptors.

LGTM overall, just a nit on self.swift vs self.swift()

Wed, Feb 1, 12:49 PM
fgiunchedi edited the description of T145659: Port application-specific metrics from ganglia to prometheus.
Wed, Feb 1, 11:16 AM · Patch-For-Review, Prometheus-metrics-monitoring, Operations
fgiunchedi reopened T151441: Thumbor should handle "temp" thumbnail requests as "Open".

Reopening this as we've seen what looks like a fd leak on thumbor for swift connections today. In addition to that constant outbound network traffic was observed on thumbor machines, though upon restart of thumbor such traffic went away.

Wed, Feb 1, 8:48 AM · Patch-For-Review, Operations, Performance-Team, Thumbor
fgiunchedi reopened T151441: Thumbor should handle "temp" thumbnail requests, a subtask of T121388: Service-based thumbnailing re-architecture in production with Thumbor, as "Open".
Wed, Feb 1, 8:48 AM · Patch-For-Review, Performance-Team, Thumbor

Tue, Jan 31

fgiunchedi closed T131506: Broken logging configuration on Ganglia aggregator (carbon) as "Declined".

We're replacing Ganglia with Prometheus, declining

Tue, Jan 31, 10:06 AM · Monitoring

Mon, Jan 30

fgiunchedi closed T52613: identify associated log for Vanadium's ganglia graphs as "Declined".

We're replacing Ganglia with Prometheus. Nowadays the same functionality might be a blend of logstash/grafana.

Mon, Jan 30, 7:27 PM · WorkType-NewFunctionality, Wikimedia-General-or-Unknown
fgiunchedi renamed T123000: labtest in "misc" cluster from "labs (labvirt/labservices) in "misc" cluster" to "labtest in "misc" cluster".
Mon, Jan 30, 7:26 PM · Labs
fgiunchedi renamed T123000: labtest in "misc" cluster from "labs (labvirt/labservices) in "misc" ganglia cluster" to "labs (labvirt/labservices) in "misc" cluster".
Mon, Jan 30, 7:23 PM · Labs
fgiunchedi closed T81659: ganglia graphs should not have "N" as units as "Declined".

We're replacing Ganglia with Prometheus

Mon, Jan 30, 7:23 PM · Operations, Monitoring
fgiunchedi closed T119520: provide aggregated cluster data with graphite, similar to ganglia as "Declined".

Declining, this functionality is now provided by Prometheus

Mon, Jan 30, 7:10 PM · Patch-For-Review, Graphite, Operations
fgiunchedi edited the description of T123728: Upgrade fluorine to trusty/jessie.
Mon, Jan 30, 5:25 PM · Patch-For-Review, Operations
fgiunchedi added a comment to T66214: Define an official thumb API.

WRT deployment strategy note that ATM all thumb accesses after varnish go through our custom swift rewrite middleware, how would that (if?) change? Would all thumbs with arbitrary height be stored alongside existing thumbs?

Mon, Jan 30, 9:11 AM · Reading Epics (Thumbnails), Services (next), ArchCom-Has-shepherd, RfC, Traffic, Operations, Services-next, ArchCom-RfC, Zero, Wikipedia-iOS-App-Backlog, Wikipedia-Android-App-Backlog, Reading-Admin, Commons, Performance-Team, Epic, RESTBase-API, Parsoid, Performance, Multimedia, MediaWiki-File-management

Fri, Jan 27

fgiunchedi added a comment to T156475: Investigate spike in 500s during asw-c2-eqiad replacement.

Logstash during that time period ( January 26th 2017, 17:56:23.220 to January 26th 2017, 18:25:00.000): https://logstash.wikimedia.org/goto/8d722054b5849d315d057bd46fe8a894

Fri, Jan 27, 12:16 PM · MediaWiki-General-or-Unknown, Operations
fgiunchedi added a comment to T155875: asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues.

During the whole 30 minute window there was also an increased response time from the MediaWiki API, that cascaded into MobileApps issues and RESTBase alerts. This was likely an effect of the (purposefully not depooled) database servers in that rack (db1055-56-57-59) and possibly caused by MediaWiki not responding well to unresponsive hosts. Investigation for this is still on-going.

Fri, Jan 27, 12:09 PM · Labs, netops, Operations
fgiunchedi created T156475: Investigate spike in 500s during asw-c2-eqiad replacement.
Fri, Jan 27, 12:08 PM · MediaWiki-General-or-Unknown, Operations
fgiunchedi added a comment to T155872: graphite1003 short of available RAM.

Merging as T116767 duplicate, we can followup there as heavy queries were the root cause anyways.

Fri, Jan 27, 10:37 AM · Monitoring, Operations, Graphite
fgiunchedi added a comment to T155876: Increased load on graphite1003, carbon-cache not autorestarting when killed by OOM.

See also T116767: limit the impact of heavy/large graphite queries to track heavy graphite queries, closing as its duplicate.

Fri, Jan 27, 10:36 AM · Monitoring, Operations, Graphite
fgiunchedi merged task T155876: Increased load on graphite1003, carbon-cache not autorestarting when killed by OOM into T116767: limit the impact of heavy/large graphite queries.
Fri, Jan 27, 10:33 AM · Monitoring, Operations, Graphite
fgiunchedi merged task T155872: graphite1003 short of available RAM into T116767: limit the impact of heavy/large graphite queries.
Fri, Jan 27, 10:33 AM · Monitoring, Operations, Graphite
fgiunchedi merged tasks T155872: graphite1003 short of available RAM, T155876: Increased load on graphite1003, carbon-cache not autorestarting when killed by OOM into T116767: limit the impact of heavy/large graphite queries.
Fri, Jan 27, 10:33 AM · Monitoring, Operations
fgiunchedi added a comment to T116767: limit the impact of heavy/large graphite queries.

Updated wikitech Graphite troubleshooting on how to identify such queries.

Fri, Jan 27, 10:33 AM · Monitoring, Operations
fgiunchedi added a comment to T116767: limit the impact of heavy/large graphite queries.

See also T155872: graphite1003 short of available RAM for a case where heavy queries were not impacting uwsgi but carbon-cache instead using a lot of memory.

Fri, Jan 27, 10:25 AM · Monitoring, Operations
fgiunchedi added a comment to T155872: graphite1003 short of available RAM.

I've tracked this down to expensive queries on graphite1003 making carbon-cache explode in memory. Namely cassandra-related 99percentile SSTablesPerReadHistogram for all columnfamilies for all instances, can generate >100MB responses in pickle data. This is sort-of related to T116767: limit the impact of heavy/large graphite queries and I'm adding this case to it too.

Fri, Jan 27, 10:24 AM · Monitoring, Operations, Graphite

Thu, Jan 26

fgiunchedi added a comment to T104161: ms-be1015 idrac not working, no more sessions.

I've come across this error again, now documented on wikitech how to fix it: https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_RN20_Gen8#Troubleshooting

Thu, Jan 26, 1:29 PM · Operations, ops-eqiad
fgiunchedi assigned T155907: Degraded RAID on ms-be1013 to Cmjohnson.

moving to @Cmjohnson for disk replacement

Thu, Jan 26, 10:16 AM · media-storage, ops-eqiad, Operations
fgiunchedi added a comment to T156023: Check the size of every cluster in codfw to see if it matches eqiad's capacity.

re: misc, I gave a quick look at both lists of hosts. Excluding a few miscategorized hosts (provisioned but no puppet roles applied perhaps, mc / mw / aqs), the rest is either misc hostname systems, eqiad-only things (druid, thumbor, oresdb, dataset, netmon, snapshot, notebook, etc) or codfw-only ones (labtest).

Thu, Jan 26, 9:58 AM · Patch-For-Review, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations
fgiunchedi added a comment to T156023: Check the size of every cluster in codfw to see if it matches eqiad's capacity.

Clusters where eqiad has more hosts than codfw. Note that the list needs more auditing due to various factors e.g. decommissioned hosts, hosts in misc serve a plethora of functions, etc

Thu, Jan 26, 9:39 AM · Patch-For-Review, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations

Wed, Jan 25

fgiunchedi added a comment to T156023: Check the size of every cluster in codfw to see if it matches eqiad's capacity.

Yes, in fact we can already answer these questions with Prometheus. I've drafted a dashboard showing CPU/host/memory differences per-cluster in https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-audit

Wed, Jan 25, 5:22 PM · Patch-For-Review, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations
fgiunchedi accepted D542: Reintroduce request and loader metrics.

LGTM, thanks @Gilles for the explanations!

Wed, Jan 25, 4:39 PM

Tue, Jan 24

fgiunchedi created T156151: Cronspam from mwlog*.
Tue, Jan 24, 5:03 PM · User-Elukey, Operations
fgiunchedi renamed T156143: High CPU usage from swift-proxy on frontend machines from "high CPU usage from swift-proxy on frontend machines" to "High CPU usage from swift-proxy on frontend machines".
Tue, Jan 24, 4:07 PM · Operations, media-storage
fgiunchedi created T156143: High CPU usage from swift-proxy on frontend machines.
Tue, Jan 24, 4:05 PM · Operations, media-storage

Mon, Jan 23

fgiunchedi edited the description of T154658: Prepare and improve the datacenter switchover procedure.
Mon, Jan 23, 5:36 PM · DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations
fgiunchedi added inline comments to D542: Reintroduce request and loader metrics.
Mon, Jan 23, 3:45 PM
fgiunchedi added a comment to T155907: Degraded RAID on ms-be1013.

Also puppet is broken because of this, is this expected @fgiunchedi ? Shouldn't we detect the failed disk and let puppet continue running instead?

Mon, Jan 23, 2:16 PM · media-storage, ops-eqiad, Operations

Jan 20 2017

fgiunchedi added a comment to T155095: Rack and set up ms-fe100[5-7].

thanks Chris! it looks like ms-fe1008 issue with the installer is an instance of T149845: Something is wrong with installer root disk stuff for which we don't have a root cause yet. I was able to fix it by manually assembling the arrays and subsequent reboots should be fine.

Jan 20 2017, 11:33 AM · Patch-For-Review, Operations, ops-eqiad

Jan 19 2017

fgiunchedi added a comment to T123728: Upgrade fluorine to trusty/jessie.

mwlog[12]001 have been provisioned with jessie and are up and running. udp2log-mw runs as a systemd unit and so does xenon-log now.

Jan 19 2017, 3:00 PM · Patch-For-Review, Operations
fgiunchedi closed T153361: setup/install mwlog1001/WMF4724 as "Resolved".

mwlog1001 has its roles applied now, resolving and following up in T123728

Jan 19 2017, 2:26 PM · Patch-For-Review, Operations
fgiunchedi closed T153361: setup/install mwlog1001/WMF4724, a subtask of T153008: eqiad: (1) Mediawiki log host to replace fluorine, as "Resolved".
Jan 19 2017, 2:26 PM · hardware-requests, Operations
fgiunchedi added a comment to T155690: troubleshoot drac on ms-be2010.codfw.wmnet.

Host can be taken down at any time with a clean shutdown to make sure all services are stopped

Jan 19 2017, 9:20 AM · ops-codfw, Operations
fgiunchedi added a comment to T155689: ms-be2002.codfw.wmnet has drac issues.

Host can be taken down at any time with a clean shutdown to make sure all services are stopped

Jan 19 2017, 9:20 AM · ops-codfw, Operations

Jan 17 2017

fgiunchedi closed T155363: Degraded RAID on ms-be2003 as "Resolved".

Thanks @Papaul ! disk is rebuilding

Jan 17 2017, 8:58 PM · media-storage, Operations, ops-codfw

Jan 16 2017

fgiunchedi added a comment to T154864: Plot number of cached objects on a per-server per-DC basis.

The number of objects in varnish for frontend/backend is now also available at https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-dc-stats

Jan 16 2017, 1:38 PM · Wikimedia-Incident, Monitoring, Traffic, Operations
fgiunchedi added a comment to T155323: unknown error occurred in storage backend "local-swift-codfw".

Does this happen on every server side upload? On what files and times? I'm asking to better track down the error in MW logs

Jan 16 2017, 12:03 PM · media-storage, Operations

Jan 6 2017

fgiunchedi accepted D525: Add PoolCounter support.
Jan 6 2017, 12:41 AM

Jan 5 2017

fgiunchedi edited the description of T117972: swift upgrade plans: jessie and swift 2.x.
Jan 5 2017, 9:05 PM · Patch-For-Review, media-storage, Operations
fgiunchedi added a comment to T143349: Deprecate precise instances in Labs by 03/31/2017.

re: filippo-test-precise2 please nuke

Jan 5 2017, 8:55 PM · Labs-Infrastructure, Labs

Jan 4 2017

fgiunchedi created T154629: Rename/relabel restbase-test1* to restbase-dev1*.
Jan 4 2017, 11:38 PM · ops-eqiad, Operations
fgiunchedi renamed T151075: setup/install restbase-dev100[123] from "setup/install restbase-test100[123]" to "setup/install restbase-dev100[123]".
Jan 4 2017, 11:36 PM · Patch-For-Review, ops-eqiad, Services (blocked), Operations, Cassandra