fgiunchedi (Filippo Giunchedi)
/* No comment */

Projects (18)

User Details

User Since
Oct 3 2014, 8:06 AM (218 w, 6 d)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
Filippo Giunchedi [ Global Accounts ]

Recent Activity

Today

fgiunchedi moved T171482: Programmatic generation of grafana dashboards from Doing to Up next on the User-fgiunchedi board.
Thu, Dec 13, 3:41 PM · Patch-For-Review, Graphite, User-fgiunchedi, monitoring, Operations
fgiunchedi moved T203169: Logstash hardware expansion from Doing to Radar on the User-fgiunchedi board.
Thu, Dec 13, 3:41 PM · Wikimedia-Logstash, User-fgiunchedi, User-herron, Operations
fgiunchedi reassigned T211070: decommission of restbase200[1-6] (lease return in December 2018) from Eevans to RobH.

Ready for decom @RobH

Thu, Dec 13, 1:22 PM · Patch-For-Review, DC-Ops, decommission
fgiunchedi updated the task description for T211070: decommission of restbase200[1-6] (lease return in December 2018).
Thu, Dec 13, 1:21 PM · Patch-For-Review, DC-Ops, decommission
fgiunchedi closed T191315: Cassandra Graphite metrics space usage audit and cleanup as Declined.

We no longer have separate cassandra metrics hosts since moving to Prometheus.

Thu, Dec 13, 10:39 AM · User-fgiunchedi, Services (watching), Graphite, Operations
fgiunchedi moved T211018: Move restbase cassandra checks to Prometheus from Backlog to Radar on the User-fgiunchedi board.
Thu, Dec 13, 9:38 AM · User-Eevans, RESTBase-Cassandra, User-fgiunchedi
fgiunchedi moved T209921: ms-be2047 spontaneous reboots from Backlog to Radar on the User-fgiunchedi board.
Thu, Dec 13, 9:37 AM · User-fgiunchedi, Operations, ops-codfw
fgiunchedi moved T209618: rack/setup/install ms-be10[44-50].eqiad.wmnet from Backlog to Doing on the User-fgiunchedi board.
Thu, Dec 13, 9:37 AM · User-fgiunchedi, media-storage, Operations
fgiunchedi closed T209615: rack/setup/install restbase201[3-8].codfw.wmnet as Resolved.

Completed!

Thu, Dec 13, 9:04 AM · User-fgiunchedi, Patch-For-Review, Services (watching), ops-codfw, Operations
fgiunchedi closed T209395: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 as Resolved.

This is completed, modulo ms-be2047 being diagnosed in T209921

Thu, Dec 13, 9:02 AM · User-fgiunchedi, Patch-For-Review, Operations, ops-codfw
fgiunchedi added a project to T209921: ms-be2047 spontaneous reboots: User-fgiunchedi.
Thu, Dec 13, 9:02 AM · User-fgiunchedi, Operations, ops-codfw
fgiunchedi added a project to T209618: rack/setup/install ms-be10[44-50].eqiad.wmnet: User-fgiunchedi.
Thu, Dec 13, 8:50 AM · User-fgiunchedi, media-storage, Operations
fgiunchedi triaged T211859: cronspam from elasticsearch-curator on stretch as Normal priority.
Thu, Dec 13, 8:33 AM · Wikimedia-Logstash, User-fgiunchedi, User-herron, Operations
fgiunchedi added a comment to T211765: 504 from /api/rest_v1/page/random/summary.

Looks like the 504s started on Dec 3rd ~12:00

Thu, Dec 13, 8:20 AM · Reading-Infrastructure-Team-Backlog, Mobile-Content-Service, Core Platform Team Kanban (Doing), Services (doing), RESTBase

Yesterday

Eevans awarded T211750: Introduce Python code formatters usage a Cookie token.
Wed, Dec 12, 3:21 PM · Operations, Operations-Software-Development
fgiunchedi created T211765: 504 from /api/rest_v1/page/random/summary.
Wed, Dec 12, 1:48 PM · Reading-Infrastructure-Team-Backlog, Mobile-Content-Service, Core Platform Team Kanban (Doing), Services (doing), RESTBase
fgiunchedi added a comment to T208215: Metrics from wdqs updater JMX should be prefixed.

Any update?

Wed, Dec 12, 10:31 AM · Discovery-Search (Current work), Patch-For-Review, Wikidata, Wikidata-Query-Service
fgiunchedi added a comment to T211661: Automatically clean up unused thumbnails in Swift.

Thanks @Gilles for kickstarting this! For context these are the notes I took when we did the first round of cleanup a couple of years back: https://wikitech.wikimedia.org/wiki/Swift/Thumbnails_Cleanup

Wed, Dec 12, 10:05 AM · Traffic, media-storage, Operations, Performance-Team
fgiunchedi created T211750: Introduce Python code formatters usage.
Wed, Dec 12, 9:54 AM · Operations, Operations-Software-Development
fgiunchedi added a comment to T209921: ms-be2047 spontaneous reboots.

Given that the other hosts in this batch are fine and we've replaced the parts Dell wanted to replace, what's the next step?

Wed, Dec 12, 9:16 AM · User-fgiunchedi, Operations, ops-codfw
fgiunchedi reassigned T209618: rack/setup/install ms-be10[44-50].eqiad.wmnet from fgiunchedi to RobH.

@RobH looks like only ms-be1050 of these hosts is accessible from cumin atm? Ditto for logging in as my user via ssh.

Wed, Dec 12, 9:15 AM · User-fgiunchedi, media-storage, Operations

Tue, Dec 11

fgiunchedi awarded T211654: puppet-provisioned dashboards not found in Grafana 5 a Like token.
Tue, Dec 11, 5:18 PM · Patch-For-Review, Operations, monitoring, User-CDanis
fgiunchedi added a comment to T211416: Put restbase201[3-8] into conftool and LVS.

"fixed" for now by manually installing python-dnspython, following up on T209136 for a proper fix

Tue, Dec 11, 4:49 PM · Core Platform Team Kanban (Done with CPT), Services (done), User-fgiunchedi, Operations
fgiunchedi renamed T209136: python3-etcd needs python3-dnspython from python3-conftool needs python3-dns to python3-etcd needs python3-dnspython.
Tue, Dec 11, 4:09 PM · Patch-For-Review, Operations, Operations-Software-Development
fgiunchedi added a comment to T211416: Put restbase201[3-8] into conftool and LVS.

Turns out depool-restbase isn't successful:

Tue, Dec 11, 3:40 PM · Core Platform Team Kanban (Done with CPT), Services (done), User-fgiunchedi, Operations
fgiunchedi added a comment to T211124: Move mediawiki to new logging infrastructure.

Thanks @bd808 for the context/insight, I agree having the change in core is the right path. I took a stab at the patch, will need some guidance for sure on the mw core production deployment part.

The CeeFormatter patch is merged now, so the class should be live on the beta cluster wikis and will get pushed out to the prod wikis with the next train (week of December 10th).

Tue, Dec 11, 1:55 PM · Patch-For-Review, MediaWiki-Logging, Wikimedia-Logstash, Operations
fgiunchedi added a comment to T211125: Move service-runner to new logging infrastructure.

There's not that much logging happening in beta services, I would guess we should start big, so deployment-restbase0[1,2]?

Tue, Dec 11, 12:20 PM · Core Platform Team Backlog (Next), Services (next), service-runner, Wikimedia-Logstash, Operations
fgiunchedi awarded T211661: Automatically clean up unused thumbnails in Swift a Yellow Medal token.
Tue, Dec 11, 9:29 AM · Traffic, media-storage, Operations, Performance-Team
fgiunchedi triaged T211654: puppet-provisioned dashboards not found in Grafana 5 as Normal priority.
Tue, Dec 11, 8:29 AM · Patch-For-Review, Operations, monitoring, User-CDanis

Mon, Dec 10

fgiunchedi closed T207040: Graphite1001 disk usage at 96% as Resolved.

Resolving, we're onto new graphite hardware now with more resources.

Mon, Dec 10, 4:29 PM · Operations, monitoring
fgiunchedi moved T200209: Decom graphite2001 from Up next to Externally blocked on the monitoring board.
Mon, Dec 10, 4:24 PM · decommission, ops-codfw, Operations, monitoring
fgiunchedi moved T200210: Decom graphite2002 from Up next to Externally blocked on the monitoring board.
Mon, Dec 10, 4:24 PM · decommission, monitoring, Operations, ops-codfw
fgiunchedi moved T209738: decom einsteinium from In progress to Externally blocked on the monitoring board.
Mon, Dec 10, 4:23 PM · monitoring, Icinga, decommission, Operations
fgiunchedi placed T37611: Remove port 29418 from cloning process up for grabs.

Unassigning as I'm not going to work on this

Mon, Dec 10, 3:11 PM · Developer-Advocacy, Operations, Gerrit
fgiunchedi added a comment to T211416: Put restbase201[3-8] into conftool and LVS.

They are independent, though. Cassandra doesn't go into LVS at all, and RESTBase is fully functional on these nodes. Not having them in LVS/conftool does not allow us to deploy fresh code to them, so that is problematic.

Mon, Dec 10, 1:40 PM · Core Platform Team Kanban (Done with CPT), Services (done), User-fgiunchedi, Operations
fgiunchedi added a comment to T211459: rancid causes puppet to flap on netmon1002.

I guess we should change puppet to create configs too and get rid of the placeholder

Mon, Dec 10, 8:56 AM · monitoring
fgiunchedi added a comment to T211250: Create a mediawiki::cronjob define.

I recommend sending cronjob output to logstash (as well as files?). When cronjobs log to syslog, you can opt in via ./modules/profile/files/rsyslog/lookup_table_output.json

Mon, Dec 10, 8:46 AM · User-jijiki, Operations

Sat, Dec 8

Krinkle awarded T37611: Remove port 29418 from cloning process an Orange Medal token.
Sat, Dec 8, 2:11 AM · Developer-Advocacy, Operations, Gerrit

Fri, Dec 7

fgiunchedi added a project to T211416: Put restbase201[3-8] into conftool and LVS: User-fgiunchedi.

Indeed, FWIW I tend to treat restbase and cassandra separately, so this will be done as soon as the cassandra reshape (T210843) is done.

Fri, Dec 7, 1:55 PM · Core Platform Team Kanban (Done with CPT), Services (done), User-fgiunchedi, Operations
fgiunchedi added a subtask for T209615: rack/setup/install restbase201[3-8].codfw.wmnet: T211416: Put restbase201[3-8] into conftool and LVS.
Fri, Dec 7, 1:14 PM · User-fgiunchedi, Patch-For-Review, Services (watching), ops-codfw, Operations
fgiunchedi added a parent task for T211416: Put restbase201[3-8] into conftool and LVS: T209615: rack/setup/install restbase201[3-8].codfw.wmnet.
Fri, Dec 7, 1:14 PM · Core Platform Team Kanban (Done with CPT), Services (done), User-fgiunchedi, Operations
fgiunchedi awarded T210416: Upgrade grafana to 5.x a Like token.
Fri, Dec 7, 9:13 AM · Performance-Team (Radar), Patch-For-Review, Operations, monitoring, User-CDanis
fgiunchedi added a comment to T211125: Move service-runner to new logging infrastructure.

So, currently we only support sending to syslog via UDP using the node-bunyan-syslog-udp package created by @mobrovac. It's a pure-JS implementation since we want to avoid native bindings.

Searching for existing bunyan-syslog adapter libraries doesn't reveal a lot, and it's a mess. Most of the packages do not look supported or well-used. The only implementation that claims to support Unix dgram sockets seems abandoned, plus it uses the unix-dgram package under the hood, which introduces a native binding. There are some implementations of plain syslog clients, without bunyan support, but even though we just need a domain socket (supported out of the box in node) they all rely on the unix-dgram package.

To sum up, I would not be comfortable relying on any of the existing implementations. How much more preferable would Unix sockets be over UDP as a transport? I will try to implement a proof of concept that does not rely on a bulk of native code; let's see how that goes.

Fri, Dec 7, 9:09 AM · Core Platform Team Backlog (Next), Services (next), service-runner, Wikimedia-Logstash, Operations
fgiunchedi added a comment to T211124: Move mediawiki to new logging infrastructure.

I've looked briefly at how to implement prefixing syslog json messages with @cee: and I'd say we could do it on the "syslog side" i.e. ./includes/debug/logger/monolog/SyslogHandler.php or "logstash side" i.e. ./includes/debug/logger/monolog/LogstashFormatter.php. I don't have strong opinions on either really!

If I am understanding the upstream documentation correctly, the desired result in the on-the-wire UDP packet payload is something like: <PRI>DATETIME HOSTNAME PROGRAM: @cee: {...json here..}. The "@cee: " string in the middle here is the new part that needs to be added.
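As an illustration of the payload layout described above, here is a minimal Python sketch that assembles such a line. It is not the MediaWiki implementation: the function name, the hostname mw1001, the program name, and the facility/severity defaults are all assumptions made for the example, and the timestamp follows the traditional BSD syslog layout (space-padded day of month).

```python
import json
from datetime import datetime


def cee_syslog_line(record, hostname, program, facility=1, severity=6, now=None):
    """Build '<PRI>DATETIME HOSTNAME PROGRAM: @cee: {json}' (hypothetical helper)."""
    # Syslog priority value is facility * 8 + severity (user.info -> 14).
    pri = facility * 8 + severity
    now = now or datetime.now()
    # BSD-style timestamp: abbreviated month, day padded with a space, then time.
    timestamp = f"{now.strftime('%b')} {now.day:2d} {now.strftime('%H:%M:%S')}"
    # The "@cee: " cookie marks the rest of the message as structured JSON.
    payload = "@cee: " + json.dumps(record, sort_keys=True)
    return f"<{pri}>{timestamp} {hostname} {program}: {payload}"


line = cee_syslog_line(
    {"message": "hello", "level": "INFO"},
    hostname="mw1001",      # assumed hostname, for illustration only
    program="mediawiki",    # assumed program name
    now=datetime(2018, 12, 7, 8, 59, 0),
)
print(line)
```

A receiver that understands the cookie can split on "@cee: " and parse the remainder as JSON, which is why the prefix has to sit immediately before the JSON document.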

I concur with @fgiunchedi that either the handler or the formatter could be changed to implement this. In either case I would suggest doing it via a new class rather than adding a feature flag to the existing classes that are used in the WMF production configuration. To me it makes a bit more sense to implement the change in the formatter because this change is really about the message payload encoded in the syslog UDP packet rather than a change to the basic syslog packet itself.

If I was going to undertake this work (which I'm not volunteering for due to other existing commitments), I would probably introduce a new MediaWiki\Logger\Monolog\CeeFormatter class on the MediaWiki side that extends MediaWiki\Logger\Monolog\LogstashFormatter. This class would override the public function format(array $record) method of the parent class something like this:

public function format( array $record ) {
    return "@cee: " . parent::format( $record );
}

Once this change was available on all wikis configuration could be changed to use this formatter instead of the current MediaWiki\Logger\Monolog\LogstashFormatter and the handler output directed towards the rsyslog collector (i.e. deploy in week 0, config change in week 1).

There are certainly other ways to solve this--and some that may be quicker to get into production, like hacking the formatter directly into mediawiki-config.git's wmf-config/logging.php file--but I think this would be the more long term stable and supportable route.

Fri, Dec 7, 8:59 AM · Patch-For-Review, MediaWiki-Logging, Wikimedia-Logstash, Operations
fgiunchedi added a comment to T211065: rack/setup/install codfw logstash elasticsearch storage servers.

@fgiunchedi the installation is complaining about not finding any swap partition.

┌──────────────────────┤ [!!] Partition disks ├────────────────────────┐
│                                                                       │
│ You have not selected any partitions for use as swap space. Enabling  │
│ swap space is recommended so that the system can make better use of   │
│ the available physical memory, and so that it behaves better when     │
│ if you do not have enough physical memory.                            │
│                                                                       │
│ If you do not go back to the partitioning menu and assign a swap      │
│ partition, the installation will continue without swap space.         │
│                                                                       │
│ Do you want to return to the partitioning menu?

I checked logstash.cfg, and there is no line that disables swap. If you don't want to use swap, can you please add
partman-basicfilesystems partman-basicfilesystems/no_swap boolean false to logstash.cfg?

Fri, Dec 7, 8:37 AM · Patch-For-Review, Operations, ops-codfw
fgiunchedi added a comment to T211027: puppet (systemd::service) attempts to start manually masked units.

https://tickets.puppetlabs.com/browse/PUP-1253

https://github.com/puppetlabs/puppet/pull/3141

"If a service is masked, it is deemed to also be disabled. If a service is
masked and changed to enabled, it will first be unmasked since the
standard 'systemctl enable' command does not properly unmask the command
first."

Is this the reason for the change in behaviour? https://github.com/puppetlabs/puppet/pull/4770/files/7a5176c88f261402a8e73d926034e390f189a1d0#r56208804

Fri, Dec 7, 8:34 AM · Operations
fgiunchedi renamed T211027: puppet (systemd::service) attempts to start manually masked units from puppet (systemd::service) attempts to start masked units to puppet (systemd::service) attempts to start manually masked units.
Fri, Dec 7, 8:32 AM · Operations
fgiunchedi updated the task description for T210843: Reshape RESTBase Cassandra cluster for server refresh.
Fri, Dec 7, 7:53 AM · Core Platform Team, Services (doing), User-Eevans, User-fgiunchedi, ops-codfw, Operations

Wed, Dec 5

fgiunchedi added a comment to T211184: Correctly collect logs from php-fpm pools.

I took a quick look at this as well and indeed openlog() seems the simplest way. Also because altering programname in rsyslog isn't allowed, thus to fix this on the rsyslog side we'd have to use a different template for example, not really worth it IMO. Plus the bug is supposedly fixed in php 7.3 anyways.

Wed, Dec 5, 2:06 PM · Performance-Team (Radar), Patch-For-Review, User-Joe, Core Platform Team Backlog (Watching / External), User-ArielGlenn, HHVM, Operations
fgiunchedi updated the task description for T211125: Move service-runner to new logging infrastructure.
Wed, Dec 5, 11:03 AM · Core Platform Team Backlog (Next), Services (next), service-runner, Wikimedia-Logstash, Operations
fgiunchedi updated the task description for T211124: Move mediawiki to new logging infrastructure.
Wed, Dec 5, 11:02 AM · Patch-For-Review, MediaWiki-Logging, Wikimedia-Logstash, Operations

Tue, Dec 4

fgiunchedi updated the task description for T210843: Reshape RESTBase Cassandra cluster for server refresh.
Tue, Dec 4, 4:32 PM · Core Platform Team, Services (doing), User-Eevans, User-fgiunchedi, ops-codfw, Operations
fgiunchedi updated the task description for T211124: Move mediawiki to new logging infrastructure.
Tue, Dec 4, 4:30 PM · Patch-For-Review, MediaWiki-Logging, Wikimedia-Logstash, Operations
fgiunchedi triaged T211125: Move service-runner to new logging infrastructure as Normal priority.
Tue, Dec 4, 4:25 PM · Core Platform Team Backlog (Next), Services (next), service-runner, Wikimedia-Logstash, Operations
fgiunchedi added a comment to T211124: Move mediawiki to new logging infrastructure.

I've looked briefly at how to implement prefixing syslog json messages with @cee: and I'd say we could do it on the "syslog side" i.e. ./includes/debug/logger/monolog/SyslogHandler.php or "logstash side" i.e. ./includes/debug/logger/monolog/LogstashFormatter.php. I don't have strong opinions on either really!

Tue, Dec 4, 4:18 PM · Patch-For-Review, MediaWiki-Logging, Wikimedia-Logstash, Operations
fgiunchedi triaged T211124: Move mediawiki to new logging infrastructure as Normal priority.
Tue, Dec 4, 4:12 PM · Patch-For-Review, MediaWiki-Logging, Wikimedia-Logstash, Operations
fgiunchedi added a comment to T211065: rack/setup/install codfw logstash elasticsearch storage servers.

@fgiunchedi Please provide partman recipe to use. I have 4x4TB disks

Tue, Dec 4, 1:31 PM · Patch-For-Review, Operations, ops-codfw
fgiunchedi added a comment to T211065: rack/setup/install codfw logstash elasticsearch storage servers.

Also please rack these systems across different rows, any combination of rows will do. The rest of the task LGTM

Tue, Dec 4, 1:14 PM · Patch-For-Review, Operations, ops-codfw
fgiunchedi added a subtask for T203169: Logstash hardware expansion: T211065: rack/setup/install codfw logstash elasticsearch storage servers.
Tue, Dec 4, 1:10 PM · Wikimedia-Logstash, User-fgiunchedi, User-herron, Operations
fgiunchedi added a parent task for T211065: rack/setup/install codfw logstash elasticsearch storage servers: T203169: Logstash hardware expansion.
Tue, Dec 4, 1:10 PM · Patch-For-Review, Operations, ops-codfw
fgiunchedi closed T206633: Setup rsyslog to be able to produce logs to Kafka as Resolved.

This is completed!

Tue, Dec 4, 1:07 PM · Patch-For-Review, User-fgiunchedi, Wikimedia-Logstash, Operations
fgiunchedi closed T206633: Setup rsyslog to be able to produce logs to Kafka, a subtask of T205849: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal), as Resolved.
Tue, Dec 4, 1:07 PM · User-herron, Patch-For-Review, Services (watching), Core Platform Team Backlog (Watching / External), User-fgiunchedi, Wikimedia-Logstash, Operations
fgiunchedi updated the task description for T206633: Setup rsyslog to be able to produce logs to Kafka.
Tue, Dec 4, 1:06 PM · Patch-For-Review, User-fgiunchedi, Wikimedia-Logstash, Operations
fgiunchedi moved T210843: Reshape RESTBase Cassandra cluster for server refresh from Backlog to Doing on the User-fgiunchedi board.
Tue, Dec 4, 10:53 AM · Core Platform Team, Services (doing), User-Eevans, User-fgiunchedi, ops-codfw, Operations
fgiunchedi updated the task description for T211027: puppet (systemd::service) attempts to start manually masked units.
Tue, Dec 4, 10:46 AM · Operations
fgiunchedi added a comment to T211027: puppet (systemd::service) attempts to start manually masked units.

Looks like this is working as intended for systemd provider (/usr/lib/ruby/vendor_ruby/puppet/provider/service/systemd.rb)

Tue, Dec 4, 10:43 AM · Operations
fgiunchedi updated the task description for T210843: Reshape RESTBase Cassandra cluster for server refresh.
Tue, Dec 4, 10:02 AM · Core Platform Team, Services (doing), User-Eevans, User-fgiunchedi, ops-codfw, Operations
fgiunchedi added a comment to T211065: rack/setup/install codfw logstash elasticsearch storage servers.

@Papaul names replaced! thanks

Tue, Dec 4, 9:57 AM · Patch-For-Review, Operations, ops-codfw
fgiunchedi updated the task description for T211065: rack/setup/install codfw logstash elasticsearch storage servers.
Tue, Dec 4, 9:57 AM · Patch-For-Review, Operations, ops-codfw
fgiunchedi closed T208096: Degraded RAID on ms-be2021 as Resolved.

LGTM on my side too, I've reenabled the event handler.

Tue, Dec 4, 8:08 AM · Operations, ops-codfw
fgiunchedi added a project to T177747: grafana-labs often fails to generate graphs with c.datapoints is undefined: cloud-services-team.
Tue, Dec 4, 8:05 AM · cloud-services-team, Graphite, Cloud-VPS
fgiunchedi updated subscribers of T177747: grafana-labs often fails to generate graphs with c.datapoints is undefined.
Tue, Dec 4, 8:05 AM · cloud-services-team, Graphite, Cloud-VPS
fgiunchedi added a comment to T177747: grafana-labs often fails to generate graphs with c.datapoints is undefined.

tentatively resolving, graphite 0.9.15 is on labmon1001 (jessie) while production runs graphite 1.x on stretch

@fgiunchedi Is there a task to update labmon to 1.x? I know why prod is on a newer version, but I can see that tripping up future work.

Tue, Dec 4, 8:05 AM · cloud-services-team, Graphite, Cloud-VPS
fgiunchedi updated subscribers of T205712: wtp2020: correctable memory errors.

This is back, any chance for reseating or swapping memory @Papaul ?

Tue, Dec 4, 7:51 AM · Operations, ops-codfw
fgiunchedi added a comment to T209615: rack/setup/install restbase201[3-8].codfw.wmnet.

@fgiunchedi: Can you advise if these are fully online, and if so, can we start to proceed on the decommission of the older restbase systems via T211070?

Tue, Dec 4, 7:48 AM · User-fgiunchedi, Patch-For-Review, Services (watching), ops-codfw, Operations

Mon, Dec 3

fgiunchedi added a comment to T209615: rack/setup/install restbase201[3-8].codfw.wmnet.

All hosts had their first puppet run done, and restbase2013 is bootstrapping cassandra instances. On the remaining hosts I had to chmod a-x /usr/sbin/cassandra due to T211027: puppet (systemd::service) attempts to start manually masked units and we'll need to restore that one host at a time when bootstrapping time comes.

Mon, Dec 3, 4:35 PM · User-fgiunchedi, Patch-For-Review, Services (watching), ops-codfw, Operations
fgiunchedi created T211027: puppet (systemd::service) attempts to start manually masked units.
Mon, Dec 3, 4:08 PM · Operations
fgiunchedi created T211018: Move restbase cassandra checks to Prometheus.
Mon, Dec 3, 3:11 PM · User-Eevans, RESTBase-Cassandra, User-fgiunchedi
fgiunchedi added a comment to T210486: Audit "misc" cluster hosts.

@colewhite @fgiunchedi should we add a checklist of actions that need to be done in order to consider this task "Resolved"?

Mon, Dec 3, 12:09 PM · User-Marostegui, Patch-For-Review, Operations
fgiunchedi updated the task description for T210486: Audit "misc" cluster hosts.
Mon, Dec 3, 12:09 PM · User-Marostegui, Patch-For-Review, Operations
fgiunchedi closed T210990: Degraded RAID on restbase2018 as Invalid.

reimage

Mon, Dec 3, 11:30 AM · Operations, ops-codfw
fgiunchedi closed T210984: Degraded RAID on restbase2014 as Invalid.

reimage

Mon, Dec 3, 11:29 AM · Operations, ops-codfw
fgiunchedi closed T177747: grafana-labs often fails to generate graphs with c.datapoints is undefined as Resolved.

tentatively resolving, graphite 0.9.15 is on labmon1001 (jessie) while production runs graphite 1.x on stretch

Mon, Dec 3, 11:29 AM · cloud-services-team, Graphite, Cloud-VPS
fgiunchedi added a comment to T210890: Loading full versions of larger images from Commons stucks / repeatedly gets interrupted after a few MBs.

I can indeed reproduce the problem when fetching e.g. https://upload.wikimedia.org/wikipedia/commons/8/8e/Sunset_Toronto_Skyline_Panorama_from_Snake_Island.jpg

Mon, Dec 3, 11:23 AM · Patch-For-Review, Operations, media-storage, Traffic, Wikimedia-General-or-Unknown
fgiunchedi added a comment to T195847: Clean up artifacts from LaTeX based math rendering.

Is cleaning up swift global-math-render.* containers in scope for this? afaik with mathoid now these containers shouldn't be used anymore?

Mon, Dec 3, 11:18 AM · Patch-For-Review, Operations, Math
fgiunchedi added a comment to T210416: Upgrade grafana to 5.x.

It does! +1

Mon, Dec 3, 11:15 AM · Performance-Team (Radar), Patch-For-Review, Operations, monitoring, User-CDanis
fgiunchedi moved T209615: rack/setup/install restbase201[3-8].codfw.wmnet from Backlog to Doing on the User-fgiunchedi board.
Mon, Dec 3, 10:55 AM · User-fgiunchedi, Patch-For-Review, Services (watching), ops-codfw, Operations
fgiunchedi added a comment to T210863: Reconfigure hardware and reimage restbase201[3-8].codfw.wmnet.

Completed auto-reimage of hosts:

['restbase2013.codfw.wmnet']

Of which those FAILED:

['restbase2013.codfw.wmnet']
Mon, Dec 3, 9:01 AM · Patch-For-Review, User-Eevans, User-fgiunchedi, Services (watching), ops-codfw, Operations
fgiunchedi closed T210863: Reconfigure hardware and reimage restbase201[3-8].codfw.wmnet as Resolved.

This is completed, thanks @Papaul and all involved.

Mon, Dec 3, 8:48 AM · Patch-For-Review, User-Eevans, User-fgiunchedi, Services (watching), ops-codfw, Operations
fgiunchedi closed T210863: Reconfigure hardware and reimage restbase201[3-8].codfw.wmnet, a subtask of T209615: rack/setup/install restbase201[3-8].codfw.wmnet, as Resolved.
Mon, Dec 3, 8:48 AM · User-fgiunchedi, Patch-For-Review, Services (watching), ops-codfw, Operations

Fri, Nov 30

fgiunchedi added a project to T209615: rack/setup/install restbase201[3-8].codfw.wmnet: User-fgiunchedi.
Fri, Nov 30, 1:44 PM · User-fgiunchedi, Patch-For-Review, Services (watching), ops-codfw, Operations
fgiunchedi claimed T209615: rack/setup/install restbase201[3-8].codfw.wmnet.

I'll be preparing these hosts for cassandra to be bootstrapped there

Fri, Nov 30, 1:43 PM · User-fgiunchedi, Patch-For-Review, Services (watching), ops-codfw, Operations
fgiunchedi added a comment to T209863: graph server temperature metrics.

Thanks @CDanis for looking into this! re: max() I have a hunch it might be due to having two prometheus servers backing the prometheus.svc endpoint in eqiad and codfw. To test this theory I tried looking for temperatures e.g. in esams. However, with esams selected and e.g. cp3007 selected, I'm not seeing any temperatures at all.

Fri, Nov 30, 8:55 AM · Patch-For-Review, Operations, monitoring, User-CDanis
fgiunchedi added a comment to T209618: rack/setup/install ms-be10[44-50].eqiad.wmnet.

@fgiunchedi For racking, this is the space I have:

I can do at least 3 in A without a problem.

I can only do 2 in C, and that would be the same rack (C2).

B can handle 3 or more and D can handle 2 or more, with one rack having at least 2.

Fri, Nov 30, 8:44 AM · User-fgiunchedi, media-storage, Operations
fgiunchedi awarded T210750: Track NFS statistics through Prometheus a Like token.
Fri, Nov 30, 8:13 AM · cloud-services-team (Kanban)

Thu, Nov 29

fgiunchedi added a project to T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts: User-fgiunchedi.
Thu, Nov 29, 1:42 PM · User-fgiunchedi, Operations, monitoring
fgiunchedi added a comment to T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts.

Note we've been here before in T172921: Nrpe command_timeout and "Service Check Timed Out" errors and sadly the command check timeout can be changed only globally on the icinga side, not per-service.

Thu, Nov 29, 1:40 PM · User-fgiunchedi, Operations, monitoring
fgiunchedi renamed T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts from Address recurrent service check time out for "HP RAID" to Address recurrent service check time out for "HP RAID" on swift backend hosts.
Thu, Nov 29, 1:12 PM · User-fgiunchedi, Operations, monitoring
fgiunchedi created T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts.
Thu, Nov 29, 1:12 PM · User-fgiunchedi, Operations, monitoring
fgiunchedi closed T210718: [Cloud VPS alert] Puppet failure on deployment-logstash2.deployment-prep.eqiad.wmflabs as Resolved.

Yes, I fixed this a few hours ago! I was doing tests on that VM and had disabled puppet. Resolving.

Thu, Nov 29, 12:56 PM · Cloud-VPS, cloud-services-team, Puppet, Beta-Cluster-Infrastructure