
fgiunchedi (Filippo Giunchedi)
/* No comment */

User Details

User Since
Oct 3 2014, 8:06 AM (245 w, 4 d)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
Filippo Giunchedi [ Global Accounts ]

Recent Activity

Fri, Jun 14

fgiunchedi added a comment to T225721: PHP statsd client doesn't support tagging metrics.

Indeed, there's been a push to move onto Prometheus as the supported platform. We (SRE) haven't done any work towards supporting tags in statsd/graphite though, as that platform is supported but considered legacy; IOW I don't know whether it would work if mediawiki started sending tagged statsd. The graphite wmcs stack, however, is set up the same as production in terms of software components, so tagged statsd can be tested there if needed.

Fri, Jun 14, 3:22 PM · Operations, Graphite
fgiunchedi created T225813: ErrorException from line 125 of ApiQueryQueryPage.php: PHP Notice: Undefined property: stdClass::$value.
Fri, Jun 14, 12:54 PM · MW-1.34-notes (1.34.0-wmf.11; 2019-06-26), Patch-For-Review, patch-welcome, ProofreadPage, Wikimedia-production-error
fgiunchedi added a comment to T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts.

So, the timeout patch above bumped the timeouts to 100s I think. On many hosts (e.g. ms-be1036, ms-be1037) these checks seemed to take about 1.5-3 minutes to run, so this issue would not be addressed by that. However, I also wondered why such a relatively simple thing would take such a long time to execute. The response seems to be two-fold:

  1. dsa-check-hpssacli isn't that efficient, running hpssacli e.g. on ms-be1036 32 times in total. I refactored it, and this sped it up by ~10x (cf. r516725, and r516726 for ssacli). On an example run on ms-be1037, this made the run go from 1m52 to 11s, which would put it easily under the limit.
  2. There is something really off with the CPU management on those systems. Things seemed abysmally slow even when casually SSHing into the machine and running commands. Load is in the 60s-80s on a 48-core HT CPU. Despite that, the CPU frequencies (grep MHz /proc/cpuinfo) seemed to be... relatively low. The scaling governor was set to ondemand, so I tried setting it to performance (for cpu in /sys/devices/system/cpu/cpu*; do echo performance > $cpu/cpufreq/scaling_governor ; done) and... the check now runs in ~2 seconds. Even more interestingly... CPU utilization dropped from 50% to 5%, load to 14 (and even the temperatures of the CPUs dropped visibly!). This probably requires a separate task and investigation of its own...
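
The governor check in point 2 could be scripted for an audit across hosts. The following is a minimal Python sketch, assuming the standard Linux sysfs layout (cpuN/cpufreq/scaling_governor). The function names and the root parameter are illustrative and do not come from any existing tool.

```python
from collections import Counter
from pathlib import Path


def read_governors(root="/sys/devices/system/cpu"):
    """Map each CPU directory name (e.g. 'cpu0') to its scaling governor."""
    governors = {}
    # Only cpu<N> directories carry a per-CPU cpufreq/scaling_governor file.
    for path in sorted(Path(root).glob("cpu[0-9]*/cpufreq/scaling_governor")):
        cpu_name = path.parent.parent.name
        governors[cpu_name] = path.read_text().strip()
    return governors


def summarize(governors):
    """Count how many CPUs use each governor."""
    return Counter(governors.values())


if __name__ == "__main__":
    for governor, count in sorted(summarize(read_governors()).items()):
        print(f"{governor}: {count} CPUs")
```

Run on a host, this prints one line per governor in use, which makes a mixed or unexpected setting easy to spot.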
Fri, Jun 14, 9:46 AM · Patch-For-Review, User-fgiunchedi, Operations, observability
fgiunchedi added a comment to T225742: Send logs from webperf processor services to Logstash.

When the programs are already logging to journald, onboarding onto logstash is easy: add the program name to the whitelist as per https://wikitech.wikimedia.org/wiki/Logstash/Interface#UNIX_Socket_(/dev/log)

Fri, Jun 14, 7:46 AM · Wikimedia-Logstash, observability, Performance-Team

Thu, Jun 13

fgiunchedi closed T222654: ms-be2043 'sdd' throwing lots of errors as Resolved.

All done, resolving.

Thu, Jun 13, 1:33 PM · User-fgiunchedi, ops-codfw, observability, media-storage, Operations
fgiunchedi closed T222654: ms-be2043 'sdd' throwing lots of errors, a subtask of T222362: swift-drive-audit unmounting a drive doesn't produce any alerts or notifications, as Resolved.
Thu, Jun 13, 1:33 PM · observability, media-storage, Operations
fgiunchedi closed T225601: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository <wikitech-l@lists.wikimedia.org> as Resolved.

No problem @Tkshamburg ! Thanks for your report.

Thu, Jun 13, 1:18 PM · Parsoid, Operations, MediaWiki-Releasing
fgiunchedi added a comment to T225601: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository <wikitech-l@lists.wikimedia.org>.

Hi @fgiunchedi ,

thanks for creating the new key (now 10 years validity), but actually the repository was not updated:

https://releases.wikimedia.org/debian/dists/jessie-mediawiki/Release.gpg --> Release.gpg 2018-12-05 23:25 833

Maybe I have to wait for synchronization of the new file and will test it again in a few hours.

Thu, Jun 13, 12:22 PM · Parsoid, Operations, MediaWiki-Releasing
fgiunchedi updated the task description for T225713: CPU scaling governor audit.
Thu, Jun 13, 11:02 AM · media-storage, Operations
fgiunchedi created T225713: CPU scaling governor audit.
Thu, Jun 13, 11:01 AM · media-storage, Operations
fgiunchedi created T225710: Error while checking binary files for python shebang.
Thu, Jun 13, 10:41 AM · Patch-For-Review, Operations, Operations-Software-Development
fgiunchedi added a comment to T225601: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository <wikitech-l@lists.wikimedia.org>.

Instructions at https://wikitech.wikimedia.org/wiki/Releases.wikimedia.org updated with the new key id, namely sudo apt-key advanced --keyserver keys.gnupg.net --recv-keys AF380A3036A03444. @Tkshamburg could you try again with the new key?

Thu, Jun 13, 9:48 AM · Parsoid, Operations, MediaWiki-Releasing
fgiunchedi added a comment to T225601: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository <wikitech-l@lists.wikimedia.org>.

Ok this should be done now, the new key is published on the key servers and the releases repo has been signed with the new key:

Thu, Jun 13, 9:42 AM · Parsoid, Operations, MediaWiki-Releasing
fgiunchedi claimed T225601: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository <wikitech-l@lists.wikimedia.org>.

I'll be looking into renewing this key

Thu, Jun 13, 8:50 AM · Parsoid, Operations, MediaWiki-Releasing

Wed, Jun 12

fgiunchedi added a project to T225613: Swift / puppet interaction can fill up root filesystem: User-fgiunchedi.
Wed, Jun 12, 3:57 PM · User-fgiunchedi, Patch-For-Review, media-storage
fgiunchedi added a comment to T225613: Swift / puppet interaction can fill up root filesystem.

Why the remounting happens is something we need to investigate for sure; although related, I think there are two independent issues. To be clear, the intent of the patch above is to address a disk that is e.g. waiting to be replaced, making sure the correct permissions are in place. IOW we'd have this issue even if the remounting (by systemd, I believe) wasn't happening.

Wed, Jun 12, 3:03 PM · User-fgiunchedi, Patch-For-Review, media-storage
fgiunchedi merged T225633: Degraded RAID on ms-be2018 into T225630: ms-be2018 sdc unreadable sector.
Wed, Jun 12, 2:51 PM · ops-codfw, Operations
fgiunchedi added a comment to T225630: ms-be2018 sdc unreadable sector.

Also forcibly remove the physical disk

Wed, Jun 12, 2:27 PM · ops-codfw, Operations
Restricted Application added a project to T225630: ms-be2018 sdc unreadable sector: Operations.
Wed, Jun 12, 2:24 PM · ops-codfw, Operations
fgiunchedi created T225630: ms-be2018 sdc unreadable sector.
Wed, Jun 12, 2:23 PM · ops-codfw, Operations
fgiunchedi added a comment to T213976: Workflow to be able to move data files computed in jobs from analytics cluster to production .

We could use swift's expiring objects support, although that is something we'd have to deploy first (mostly puppetization of the configuration), cf. https://docs.openstack.org/swift/2.10.2/overview_expiring_objects.html
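
For illustration, Swift's expiring objects feature is driven by the X-Delete-After (relative, in seconds) and X-Delete-At (absolute Unix timestamp) request headers on object upload. The Python sketch below only builds those headers; the helper name expiry_headers is hypothetical, and it assumes the object expirer has already been deployed.

```python
def expiry_headers(ttl_seconds=None, expire_at=None):
    """Build the Swift header that marks an object for automatic deletion.

    Pass exactly one of:
      ttl_seconds -- lifetime relative to upload time (X-Delete-After)
      expire_at   -- absolute Unix timestamp (X-Delete-At)
    """
    if (ttl_seconds is None) == (expire_at is None):
        raise ValueError("pass exactly one of ttl_seconds or expire_at")
    if ttl_seconds is not None:
        return {"X-Delete-After": str(int(ttl_seconds))}
    return {"X-Delete-At": str(int(expire_at))}
```

The returned dict would be merged into the headers of the object PUT, so a data file could be uploaded with, say, a one-week lifetime and cleaned up by Swift itself.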

Wed, Jun 12, 2:13 PM · Research, Operations, Discovery, Analytics
fgiunchedi added a comment to T222166: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade.

Could you try running the upgrade twice ? But yes if the jessie image is updated then it should come with the latest rsyslog I believe, assuming jessie-wikimedia repo is included by default

Wed, Jun 12, 2:08 PM · Continuous-Integration-Infrastructure, Operations
fgiunchedi added a comment to T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts.

Right now there are 14 outstanding alerts, or about 50% of all outstanding alerts:

This is not OK. Let's please roll out -at least- a workaround ASAP.

Wed, Jun 12, 2:04 PM · Patch-For-Review, User-fgiunchedi, Operations, observability
fgiunchedi added a comment to T225553: gmail users being suspended from mediawiki-l due to excessive bounces.

Indeed Munge From seems the least intrusive, AFAICT lists administrators should be able to self-set this option for the list to test it works as expected.

Wed, Jun 12, 1:21 PM · Operations, Wikimedia-Mailing-lists
fgiunchedi created T225613: Swift / puppet interaction can fill up root filesystem.
Wed, Jun 12, 11:25 AM · User-fgiunchedi, Patch-For-Review, media-storage
fgiunchedi added a comment to T225604: log spam from mtail 3.0.0~rc19 on wezen.

Ok I've temporarily fixed the problem by installing a locally modified version of mtail from unstable. Good enough for now IMHO, the bigger plan here is to upgrade central syslog servers to Stretch or Buster (T200706: rack/setup/install centrallog1001.eqiad.wmnet and T224564: Reimage wezen to Stretch (and rename to centrallog2001)). Leaving open because we'll need the unstable version of mtail anyways even on the reimaged servers.

Wed, Jun 12, 11:04 AM · observability
fgiunchedi updated the task description for T225604: log spam from mtail 3.0.0~rc19 on wezen.
Wed, Jun 12, 10:48 AM · observability
fgiunchedi added a comment to T225604: log spam from mtail 3.0.0~rc19 on wezen.

Fixed versions for mtail: v3.0.0-rc30 v3.0.0-rc29 v3.0.0-rc28 v3.0.0-rc27 v3.0.0-rc26 v3.0.0-rc25 v3.0.0-rc24 v3.0.0-rc23 v3.0.0-rc22 v3.0.0-rc21 so upgrading to unstable version of mtail would do it (3.0.0~rc24.1-1)

Wed, Jun 12, 10:47 AM · observability
fgiunchedi created T225604: log spam from mtail 3.0.0~rc19 on wezen.
Wed, Jun 12, 10:46 AM · observability

Tue, Jun 11

fgiunchedi reopened T223518: ms-be1033 not powering up as "Open".
Tue, Jun 11, 12:53 PM · User-fgiunchedi, Operations, ops-eqiad
fgiunchedi added a project to T225296: High Prometheus TCP retransmits: User-fgiunchedi.
Tue, Jun 11, 11:24 AM · User-fgiunchedi, Cloud-Services, observability, Analytics

Fri, Jun 7

fgiunchedi moved T225108: Prometheus logs showing errors for routinator from Backlog to Radar on the observability board.
Fri, Jun 7, 12:24 PM · observability, netops, Operations

Wed, Jun 5

fgiunchedi added a comment to T225108: Prometheus logs showing errors for routinator .

Unrelated to the issue at hand, but I'd also recommend prefixing metrics with routinator_ so it is clear where they are coming from

Wed, Jun 5, 4:30 PM · observability, netops, Operations
fgiunchedi added a comment to T225108: Prometheus logs showing errors for routinator .

Yes that looks like an error on routinator side, you can also use promtool check rules to see what prometheus makes of that

Wed, Jun 5, 3:54 PM · observability, netops, Operations
fgiunchedi created T225079: Alert on unmounted swift partitions.
Wed, Jun 5, 10:09 AM · media-storage
fgiunchedi added a comment to T177779: Generate instance list of database hosts to be monitored automatically from exported resources.

I was reviewing observability backlog, to me it looks like this is a duplicate of T145072: Generate instance list of active database hosts to be monitored from prometheus ?

Wed, Jun 5, 10:01 AM · observability, DBA, Operations
fgiunchedi added a comment to T214183: Setup graphs for power usage readings in Grafana.

@fgiunchedi: https://grafana.wikimedia.org/d/cq0ZowkZz/pdus?orgId=1 lists:

The data is collected from eqiad and codfw sites PDUs via SNMP by LibreNMS, exported to Graphite and calculated as:

sum(current) * max(voltage) / sqrt(3)

PDUs in ulsfo expose pre-calculated watts readings, thus data is taken as-is.

In all cases data is cumulative for all PDUs inlet cords.

See also T171823

Since ulsfo is now using the same servertechs with snmp data in librenms, this should be updated right?
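
As a worked example of the formula quoted above, here is a minimal Python sketch. The function name and the sample readings are made up for illustration; only the formula itself comes from the dashboard description.

```python
import math


def pdu_power_watts(phase_currents_amps, phase_voltages_volts):
    """Apply the dashboard formula: sum(current) * max(voltage) / sqrt(3)."""
    return sum(phase_currents_amps) * max(phase_voltages_volts) / math.sqrt(3)


if __name__ == "__main__":
    # Hypothetical balanced load: 10 A per phase at 208 V.
    print(round(pdu_power_watts([10, 10, 10], [208, 208, 208]), 1))
```

For a balanced load with per-phase current I, sum(current) is 3·I, so the expression reduces to sqrt(3) · V · I, the usual three-phase power formula.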

Wed, Jun 5, 9:31 AM · DC-Ops, observability

Tue, Jun 4

fgiunchedi moved T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts from Backlog to Up next on the observability board.
Tue, Jun 4, 1:40 PM · Patch-For-Review, User-fgiunchedi, Operations, observability
fgiunchedi added a comment to T224236: include the 'Server:' response header in varnishkafka.

I've added a 'top 20 backend' panel to https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X ! thanks all

Tue, Jun 4, 11:52 AM · Analytics-Kanban, User-Elukey, Traffic, Analytics, Operations

Mon, Jun 3

fgiunchedi added a comment to T219825: Update dashboards to node-exporter 0.16+ metric names.

We found another instance in the cluster overview dashboard where dashboard variables were using node_boot_time, I've fixed it to use node_boot_time_seconds instead.
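
A rename like this can be applied mechanically to dashboard expressions. The Python sketch below is an assumption about how one might do it, not the method actually used; the RENAMES table contains only the one mapping mentioned in this comment, and the word-boundary match keeps already-renamed metrics untouched.

```python
import re

# Only the rename mentioned above; extend as needed.
RENAMES = {"node_boot_time": "node_boot_time_seconds"}


def rename_metrics(expr, renames=RENAMES):
    """Replace old metric names in a query expression with their new names."""
    for old, new in renames.items():
        # \b stops 'node_boot_time' from matching inside 'node_boot_time_seconds'.
        expr = re.sub(rf"\b{re.escape(old)}\b", new, expr)
    return expr
```

Because the function is idempotent, it can be run repeatedly over the same dashboard definitions without double-renaming.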

Mon, Jun 3, 10:28 AM · Patch-For-Review, observability

Fri, May 31

fgiunchedi added a comment to T220401: Introduce kask session storage service to kubernetes.
  1. We use an Exec probe that executes something like curl https://<POD_IP>:8081/healthz. This is generally suboptimal as the execution of an external command is more expensive than an HTTP GET probe from the kubelet. I am also not clear on how the pod IP would be communicated to the command, need to research that more.
  2. We amend kask to not require TLS for the /healthz endpoint. This is ugly and would complicate the code considerably I think.

We can do this. It'd technically be a different server (different Server object, different port), even if bound to the same Go process, will it still satisfy its mandate? I wonder, does it make sense to do the same with the Prometheus agent? We could add the idea of a management interface or similar, make the listen address and port configurable, and hang /health and /metrics there.
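
The management interface idea can be sketched as a second, plain-HTTP listener in the same process that serves only the health and metrics paths. kask itself is written in Go; the Python below is purely illustrative, and the handler name, the example_up metric and the default address are all hypothetical.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class ManagementHandler(BaseHTTPRequestHandler):
    """Serves only /healthz and /metrics, without TLS."""

    def do_GET(self):
        if self.path == "/healthz":
            status, body = 200, b"ok\n"
        elif self.path == "/metrics":
            # Placeholder metric in Prometheus text exposition format.
            status, body = 200, b"# TYPE example_up gauge\nexample_up 1\n"
        else:
            status, body = 404, b"not found\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def start_management_server(host="127.0.0.1", port=0):
    """Start the management listener in a background thread (port 0 = any free port)."""
    server = ThreadingHTTPServer((host, port), ManagementHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The main TLS listener would be started separately on its own port, so a kubelet HTTP GET probe can target the management port without needing the pod to expose anything sensitive there.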

Fri, May 31, 3:17 PM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
fgiunchedi assigned T218544: ms-be1043 sdk failed to Cmjohnson.

News on this @Cmjohnson ?

Fri, May 31, 9:51 AM · observability, Operations-Software-Development, Operations, ops-eqiad

Wed, May 29

fgiunchedi moved T199406: rsyslog's in:imtcp thread stuck on old sockets from Backlog to Up next on the User-fgiunchedi board.
Wed, May 29, 7:48 AM · User-fgiunchedi, Operations

Tue, May 28

fgiunchedi added a comment to T222654: ms-be2043 'sdd' throwing lots of errors.

Disk has been replaced, thanks @Papaul !

Tue, May 28, 4:09 PM · User-fgiunchedi, ops-codfw, observability, media-storage, Operations
fgiunchedi awarded T93208: (U)EFI support a Love token.
Tue, May 28, 8:17 AM · Operations

Mon, May 27

fgiunchedi moved T220590: Decom ms-be101[345] from Doing to Radar on the User-fgiunchedi board.
Mon, May 27, 1:05 PM · decommission, User-fgiunchedi, media-storage, Operations
fgiunchedi added a comment to T220590: Decom ms-be101[345].

@fgiunchedi FYI we got some email to root@ from ms-be1014 with the following:

Mon, May 27, 10:04 AM · decommission, User-fgiunchedi, media-storage, Operations
fgiunchedi added a comment to T222654: ms-be2043 'sdd' throwing lots of errors.

Thanks for taking a look!

Mon, May 27, 8:53 AM · User-fgiunchedi, ops-codfw, observability, media-storage, Operations

Thu, May 23

fgiunchedi added a comment to T219544: Make hadoop cluster able to push to swift .

Great thanks!

While we're at it I recommend creating the container with the lowlatency storage policy so that swift will allocate objects on SSDs as opposed to spinning disk

Sure...but I don't think low latency / SSDs are really needed for this use case. I can do this if you still prefer!
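
For reference, a storage policy is chosen when the container is created, via the X-Storage-Policy header on the container PUT. The Python sketch below only assembles that request; the function name, storage URL and token are placeholders, and lowlatency is the policy name mentioned above.

```python
def container_create_request(storage_url, container, token, policy="lowlatency"):
    """Describe the PUT request that creates a container with a storage policy."""
    url = f"{storage_url.rstrip('/')}/{container}"
    headers = {
        "X-Auth-Token": token,
        # The policy is fixed at creation time and cannot be changed afterwards.
        "X-Storage-Policy": policy,
    }
    return ("PUT", url, headers)
```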

Thu, May 23, 2:30 PM · Patch-For-Review, Analytics-Kanban, Research, Operations, Discovery, Analytics
fgiunchedi added a comment to T219544: Make hadoop cluster able to push to swift .

Alright, I've written a bash wrapper to help out with this. I'd do it with just the swift CLI, but we need to be able to source some env vars from another file, which I don't think the Oozie shell action will let us do.

We should be able to include this Oozie (sub)workflow at the end of a data generation job and have it upload a directory from HDFS to Swift. I've tested swift-upload.sh in deployment-prep. I'd like to test it and this Oozie workflow in prod now, but I need to get some Swift creds to try, as well as a test container. @fgiunchedi or @CDanis can yall help with that?

Thu, May 23, 12:52 PM · Patch-For-Review, Analytics-Kanban, Research, Operations, Discovery, Analytics
fgiunchedi added a comment to T220590: Decom ms-be101[345].

Also a note re: ms-be1013: its RAID failed in T220907: Degraded RAID on ms-be1013 and so far I haven't been able to make it boot again. It's not worth spending more time on it, so it should be wiped and that's it.

Thu, May 23, 10:03 AM · decommission, User-fgiunchedi, media-storage, Operations
fgiunchedi closed T220907: Degraded RAID on ms-be1013 as Resolved.

I'm resolving this since we're going to decom this host in T220907: Degraded RAID on ms-be1013, thanks @Cmjohnson !

Thu, May 23, 10:03 AM · ops-eqiad, Operations
fgiunchedi assigned T221068: decom ms-be201[345] to RobH.

Task updated with the checklist, hosts are now marked as spare in puppet and I've set netbox status to decommissioning, moving to @RobH

Thu, May 23, 9:52 AM · decommission, ops-codfw, media-storage, User-fgiunchedi, Operations
fgiunchedi updated the task description for T221068: decom ms-be201[345].
Thu, May 23, 9:51 AM · decommission, ops-codfw, media-storage, User-fgiunchedi, Operations
fgiunchedi assigned T220590: Decom ms-be101[345] to RobH.

Task updated with the checklist, hosts are now marked as spare in puppet and I've set netbox status to decommissioning, moving to @RobH

Thu, May 23, 9:49 AM · decommission, User-fgiunchedi, media-storage, Operations
fgiunchedi updated the task description for T220590: Decom ms-be101[345].
Thu, May 23, 9:45 AM · decommission, User-fgiunchedi, media-storage, Operations
fgiunchedi added a project to T199406: rsyslog's in:imtcp thread stuck on old sockets: User-fgiunchedi.
Thu, May 23, 9:25 AM · User-fgiunchedi, Operations

Tue, May 21

fgiunchedi added a comment to T223924: pybal logs into logstash.

+1! AFAICS pybal already logs everything to syslog by way of stdout captured by journald, thus the first step should be as easy as adding pybal to ./modules/profile/files/rsyslog/lookup_table_output.json in puppet
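
As a rough sketch of that first step, the Python below adds a program to a lookup table, assuming the file follows rsyslog's lookup-table JSON layout (a top-level "table" list of {"index": ..., "value": ...} entries). The actual contents of lookup_table_output.json are not shown here, so the value "example-output" is a placeholder, not the real setting.

```python
import json


def add_program(table_doc, program, value):
    """Add a program to an rsyslog-style lookup table if it is not already listed."""
    entries = table_doc.setdefault("table", [])
    if any(entry.get("index") == program for entry in entries):
        return table_doc  # already present, leave untouched
    entries.append({"index": program, "value": value})
    entries.sort(key=lambda entry: entry["index"])
    return table_doc


if __name__ == "__main__":
    # Hypothetical starting document; the value strings are placeholders.
    doc = {"version": 1, "nomatch": "local", "type": "string", "table": []}
    print(json.dumps(add_program(doc, "pybal", "example-output"), indent=2))
```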

Tue, May 21, 5:00 PM · Operations, Wikimedia-Logstash
fgiunchedi added a comment to T219544: Make hadoop cluster able to push to swift .

A less elegant alternative would be to just write a wrapper that downloaded from HDFS to local filesystem and then uploaded to swift :/

Tue, May 21, 1:37 PM · Patch-For-Review, Analytics-Kanban, Research, Operations, Discovery, Analytics
fgiunchedi added a comment to T223518: ms-be1033 not powering up.

Since the host is not coming back for another week for sure I'm going to de-weight in swift

Tue, May 21, 10:02 AM · User-fgiunchedi, Operations, ops-eqiad
fgiunchedi moved T222654: ms-be2043 'sdd' throwing lots of errors from Backlog to Doing on the User-fgiunchedi board.
Tue, May 21, 9:59 AM · User-fgiunchedi, ops-codfw, observability, media-storage, Operations
fgiunchedi moved T223518: ms-be1033 not powering up from Backlog to Doing on the User-fgiunchedi board.
Tue, May 21, 9:59 AM · User-fgiunchedi, Operations, ops-eqiad

Mon, May 20

fgiunchedi added a project to T222654: ms-be2043 'sdd' throwing lots of errors: User-fgiunchedi.
Mon, May 20, 2:43 PM · User-fgiunchedi, ops-codfw, observability, media-storage, Operations
fgiunchedi added a project to T223518: ms-be1033 not powering up: User-fgiunchedi.
Mon, May 20, 2:43 PM · User-fgiunchedi, Operations, ops-eqiad

May 17 2019

fgiunchedi added a comment to T223483: Logstash stops processing messages if a single output becomes blocked.

Sounds similar to / a duplicate of T176335: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable ?

May 17 2019, 3:29 PM · Operations, Wikimedia-Logstash
fgiunchedi added a comment to T223458: mgmt outages for cloud* systems seem to page everyone.

FWIW the paging behavior for cloud* hosts down happens even when non-mgmt is involved, i.e. production network

May 17 2019, 3:28 PM · cloud-services-team (Kanban)
fgiunchedi added a comment to T219544: Make hadoop cluster able to push to swift .

Is the tempauth middleware just for beta? I am pretty sure the same commands should work in production too, in case swift there is using something else for auth.

May 17 2019, 3:18 PM · Patch-For-Review, Analytics-Kanban, Research, Operations, Discovery, Analytics
fgiunchedi added a comment to T222654: ms-be2043 'sdd' throwing lots of errors.

I thought the "drive's position" in megacli might provide the mapping to scsi target, but that doesn't seem to be the case, i.e. in the current situation the drives groups have shifted (sdf the disk we just replaced doesn't have a group)

May 17 2019, 2:07 PM · User-fgiunchedi, ops-codfw, observability, media-storage, Operations
fgiunchedi added a comment to T222654: ms-be2043 'sdd' throwing lots of errors.

Trying to debug how slot 3 on the controller wasn't in fact mapped to sdd. In the past (on older hw generations?) the scsi address to which each disk was mapped to corresponded to the slot on the raid controller, in this case:

May 17 2019, 9:47 AM · User-fgiunchedi, ops-codfw, observability, media-storage, Operations
fgiunchedi added a comment to T222654: ms-be2043 'sdd' throwing lots of errors.

@fgiunchedi so what do you want to do here? I still have the old disk with me. Do you want me to keep it with me and not ship it back to Dell for now?

May 17 2019, 9:34 AM · User-fgiunchedi, ops-codfw, observability, media-storage, Operations
fgiunchedi placed T219404: rack/setup/install restbase10[19-27].eqiad.wmnet up for grabs.

Hosts are undergoing cassandra bootstraps by @mobrovac, unassigning

May 17 2019, 9:29 AM · Cassandra, Core Platform Team Kanban (Done with CPT), Services (done), Core Platform Team (Security, stability, performance and scalability (TEC1)), User-fgiunchedi, User-Eevans, ops-eqiad, Operations, RESTBase
fgiunchedi added a comment to T220590: Decom ms-be101[345].

Please note these show 'decommission' in netbox when they are still actively calling into puppet. So they should be active in netbox until they are added to the decommission queue and shifted to dc ops to decom them.

@fgiunchedi: I added in the decommission project so it's easier to find out why these are showing on the report listed here.

We should likely shift all those ms-be systems back to active in netbox.

May 17 2019, 9:28 AM · decommission, User-fgiunchedi, media-storage, Operations
fgiunchedi created T223518: ms-be1033 not powering up.
May 17 2019, 8:30 AM · User-fgiunchedi, Operations, ops-eqiad
fgiunchedi added a comment to T221068: decom ms-be201[345].

In this case it was a mistake by me setting decommissioning in netbox for those hosts, although they still run role swift. I'll put them back in active, and try again the decom process !

May 17 2019, 7:46 AM · decommission, ops-codfw, media-storage, User-fgiunchedi, Operations

May 16 2019

fgiunchedi added a comment to T222654: ms-be2043 'sdd' throwing lots of errors.

Thanks @Papaul! Turns out I gave you wrong instructions, and sdf is in slot 3 not sdd :(
Not a huge problem on the swift side, we'll have to figure out the right mappings device <-> slot

May 16 2019, 4:22 PM · User-fgiunchedi, ops-codfw, observability, media-storage, Operations
fgiunchedi added a comment to T219544: Make hadoop cluster able to push to swift .

I've run some tests on deployment-hadoop-test-1, and I think the problem is that on the swift side we're using the tempauth middleware to handle authentication (GET <auth_url>, username/password are in headers, and the auth token is also sent back in headers), whereas the swift hadoop client tries keystone authentication (send json to POST <auth_url>, get back tokens).
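
The mismatch between the two schemes is easier to see side by side. The Python sketch below only describes the two requests and sends nothing. The tempauth header names (X-Auth-User, X-Auth-Key) follow Swift's v1 auth convention; the Keystone body uses the v2.0 passwordCredentials shape, which is an assumption about what the hadoop client sends. Function names and arguments are illustrative.

```python
import json


def tempauth_request(auth_url, user, key):
    """tempauth: credentials go in request headers of a GET.

    The token comes back in response headers (X-Auth-Token, X-Storage-Url).
    """
    headers = {"X-Auth-User": user, "X-Auth-Key": key}
    return ("GET", auth_url, headers, None)


def keystone_v2_request(auth_url, tenant, user, password):
    """Keystone (assumed v2.0 shape): credentials go in a JSON body of a POST.

    The token comes back inside the JSON response body.
    """
    payload = {
        "auth": {
            "tenantName": tenant,
            "passwordCredentials": {"username": user, "password": password},
        }
    }
    headers = {"Content-Type": "application/json"}
    return ("POST", auth_url, headers, json.dumps(payload))
```

A server that only implements the first scheme has no handler for the POST with a JSON body, which is consistent with the failure described above.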

May 16 2019, 2:06 PM · Patch-For-Review, Analytics-Kanban, Research, Operations, Discovery, Analytics
fgiunchedi closed T214166: Improve cassandra JBOD integration post-reimage as Resolved.

All done! Documentation updated at https://wikitech.wikimedia.org/wiki/Cassandra#Add_a_new_host_to_a_multi-instance_cluster

May 16 2019, 1:15 PM · User-fgiunchedi, Core Platform Team Backlog (Watching / External), Services (watching), Cassandra, RESTBase-Cassandra
fgiunchedi added a comment to T210850: WMCS-related dashboards using Diamond metrics.

Ok! I'm fine renaming the metrics. I would need a suggestion though :-)

cloudvps.novafullstack.*?

Looks good to me

May 16 2019, 1:05 PM · cloud-services-team (Kanban), Operations
fgiunchedi added a comment to T219404: rack/setup/install restbase10[19-27].eqiad.wmnet.

Puppet ran on all hosts, next steps are bootstrapping all cassandra instances one at a time

May 16 2019, 9:05 AM · Cassandra, Core Platform Team Kanban (Done with CPT), Services (done), Core Platform Team (Security, stability, performance and scalability (TEC1)), User-fgiunchedi, User-Eevans, ops-eqiad, Operations, RESTBase
fgiunchedi updated the task description for T214166: Improve cassandra JBOD integration post-reimage.
May 16 2019, 8:59 AM · User-fgiunchedi, Core Platform Team Backlog (Watching / External), Services (watching), Cassandra, RESTBase-Cassandra
fgiunchedi added a comment to T214166: Improve cassandra JBOD integration post-reimage.

For the last problem where cassandra default instance keeps running, it is stopped according to systemd but the process/cgroup are still there:

May 16 2019, 8:06 AM · User-fgiunchedi, Core Platform Team Backlog (Watching / External), Services (watching), Cassandra, RESTBase-Cassandra
fgiunchedi updated the task description for T214166: Improve cassandra JBOD integration post-reimage.
May 16 2019, 7:55 AM · User-fgiunchedi, Core Platform Team Backlog (Watching / External), Services (watching), Cassandra, RESTBase-Cassandra

May 15 2019

fgiunchedi added a comment to T162035: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser.

Other data points, the 250px thumb has the correct c-t (image/png) although that thumb reports last-modified in 2014, as opposed to the 200px version in 2017

May 15 2019, 10:41 AM · Patch-For-Review, Traffic, Operations, media-storage
fgiunchedi moved T220709: Upgrade statsd_exporter to 0.9 from Doing to Radar on the User-fgiunchedi board.
May 15 2019, 7:30 AM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Services (watching), Analytics, EventBus, observability, User-fgiunchedi, Operations

May 14 2019

fgiunchedi added a comment to T220709: Upgrade statsd_exporter to 0.9.

All production has been updated ! Leaving open for now in case there's still upgrades to be done in k8s (cc @akosiaris )

May 14 2019, 4:30 PM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Services (watching), Analytics, EventBus, observability, User-fgiunchedi, Operations
fgiunchedi closed T223276: keyholder has just disarmed everywhere (train blocker) as Resolved.

Resolving because we're back, although feel free to reopen if we're missing something.

May 14 2019, 2:03 PM · Operations
fgiunchedi closed T223276: keyholder has just disarmed everywhere (train blocker), a subtask of T220730: 1.34.0-wmf.5 deployment blockers, as Resolved.
May 14 2019, 2:03 PM · Patch-For-Review, Release-Engineering-Team (Kanban), Release, Train Deployments
fgiunchedi added a comment to T223276: keyholder has just disarmed everywhere (train blocker).

Root cause is /usr/local/bin/ssh-agent-proxy changed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/509929 and caused keyholder to reload:

May 14 2019, 1:51 PM · Operations
fgiunchedi added a comment to T223276: keyholder has just disarmed everywhere (train blocker).

Keyholder rearmed on these hosts:

May 14 2019, 1:48 PM · Operations
fgiunchedi added a comment to T210850: WMCS-related dashboards using Diamond metrics.

Sorry folks, there are a couple of things that I don't understand. The nova_fullstack_test.py script is sending collected metrics to statsd. Are we deprecating that?
I have some shadows in the different mechanisms we use for collecting/storing/visualizing metrics, I think the one that I understand the most is prometheus.

I'm trying to figure out if the dashboard https://grafana.wikimedia.org/d/000000339/labs-nova-fullstack should be just refreshed from the server names point of view, or we need to rewrite the fullstack script, or something else.

statsd itself isn't being actively deprecated (although new "things" should use Prometheus) only Diamond is getting deprecated in this case. Diamond writes its metrics to the servers hierarchy and I see nova_fullstack_test.py also writes to servers. That is likely one of the reasons why labs-nova-fullstack dashboard came up in the audit. My suggestion is to keep the script for now but move its metrics to a different top level hierarchy so that the only producer of statsd metrics to servers is Diamond itself, does that sound good ?

Or simply ignore the hierarchy, since with Diamond being gone for good it won't matter soon ? The reason it was listed in this task is because it referenced the labnet* servers which are now gone. If the dashboard is still useful from my pov it also seems fine to simply update it to use the relevant metrics from cloudnet* servers.

May 14 2019, 9:24 AM · cloud-services-team (Kanban), Operations
fgiunchedi added a comment to T210850: WMCS-related dashboards using Diamond metrics.

Sorry folks, there are a couple of things that I don't understand. The nova_fullstack_test.py script is sending collected metrics to statsd. Are we deprecating that?
I have some shadows in the different mechanisms we use for collecting/storing/visualizing metrics, I think the one that I understand the most is prometheus.

I'm trying to figure out if the dashboard https://grafana.wikimedia.org/d/000000339/labs-nova-fullstack should be just refreshed from the server names point of view, or we need to rewrite the fullstack script, or something else.

May 14 2019, 8:53 AM · cloud-services-team (Kanban), Operations

May 13 2019

fgiunchedi added a comment to T223126: Install new PDUs into b5-eqiad.

Actions for ms-be hosts updated, to be on the safe side I'll stop swift + rsync in case power goes out. If it'll help I can poweroff hosts too. What time is this activity scheduled for ?

May 13 2019, 4:47 PM · ops-eqiad, Operations
fgiunchedi updated the task description for T223126: Install new PDUs into b5-eqiad.
May 13 2019, 4:45 PM · ops-eqiad, Operations
fgiunchedi moved T218544: ms-be1043 sdk failed from Backlog to In progress on the observability board.
May 13 2019, 3:14 PM · observability, Operations-Software-Development, Operations, ops-eqiad
fgiunchedi added a comment to T220590: Decom ms-be101[345].

Ditto for ms-be1015

May 13 2019, 10:49 AM · decommission, User-fgiunchedi, media-storage, Operations
fgiunchedi added a comment to T220590: Decom ms-be101[345].

ms-be1014 has finished swift decom, what's left is zero-bytes old quarantined files

May 13 2019, 10:07 AM · decommission, User-fgiunchedi, media-storage, Operations
fgiunchedi added a project to T214166: Improve cassandra JBOD integration post-reimage: User-fgiunchedi.
May 13 2019, 9:03 AM · User-fgiunchedi, Core Platform Team Backlog (Watching / External), Services (watching), Cassandra, RESTBase-Cassandra
fgiunchedi moved T180696: Terminate Thumbor with SSL from Backlog to Radar on the User-fgiunchedi board.
May 13 2019, 8:59 AM · User-fgiunchedi, Performance-Team (Radar), Thumbor
fgiunchedi moved T210991: Deprecate Diamond collectors in Tool Labs / Tool Forge from Backlog to Radar on the User-fgiunchedi board.
May 13 2019, 8:59 AM · cloud-services-team (Kanban), Toolforge, User-fgiunchedi, observability, Operations