
Mar 9 2020

fgiunchedi changed the status of T219924: Move mobileapps logging to new logging pipeline from Open to Stalled.

@fgiunchedi This should be finished by April at the latest for both services. AIUI that's when SCB is planned to be decommissioned.

Mar 9 2020, 11:53 AM · Product-Infrastructure-Team-Backlog-Deprecated, Page Content Service, observability, Platform Team Legacy (Watching / External), Services (watching), service-runner, Wikimedia-Logstash, SRE
fgiunchedi added a comment to T226986: Client side error logging production launch.

@fgiunchedi very few error events are flowing in now! This is live on group0 wikis. Can we hook the topics up to logstash? In kafka logging-eqiad it is eqiad.mediawiki.client.error and in kafka logging-codfw it is codfw.mediawiki.client.error.

Mar 9 2020, 11:49 AM · Analytics-Radar, MW-1.35-notes (1.35.0-wmf.24; 2020-03-17), Performance-Team (Radar), Desktop Improvements (Vector 2022), observability, Wikimedia-Logstash, User-fgiunchedi, Better Use Of Data, Product-Infrastructure-Team-Backlog-Deprecated, Epic
fgiunchedi added a comment to T245176: Add Prometheus Squid exporter.

Exporter is up, however on install1003 it doesn't seem to be able to query squid yet:

Mar 9 2020, 9:11 AM · Patch-For-Review, observability

Mar 6 2020

fgiunchedi added a comment to T238658: Migrate EventStreams to k8s deployment pipeline.

Nice!

Not totally sure what is up with the histogram based panels, must be something with the migration to service-runner prometheus instead of statsd-exporter.

express_router_request_duration_seconds is not a histogram[1] for some reason, that's why. Also, those panels still reference service_runner_request_duration_seconds_bucket, so I guess a sed 's/service_runner_request_duration_seconds_bucket/express_router_request_duration_seconds/' is in order there.
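The same rename could also be applied to an exported dashboard rather than with sed. A minimal sketch in Python, assuming the dashboard JSON keeps its queries under `panels[].targets[].expr`; the sample dashboard and the `rename_metric` helper are made up for illustration and are not taken from the real EventStreams dashboard:

```python
import json

OLD = "service_runner_request_duration_seconds_bucket"
NEW = "express_router_request_duration_seconds"


def rename_metric(dashboard: dict, old: str = OLD, new: str = NEW) -> int:
    """Replace `old` with `new` in every panel query; return how many queries changed."""
    changed = 0
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            expr = target.get("expr", "")
            if old in expr:
                target["expr"] = expr.replace(old, new)
                changed += 1
    return changed


# Hypothetical dashboard fragment used only to exercise the helper.
sample = {
    "panels": [
        {"targets": [{"expr": "sum(rate(service_runner_request_duration_seconds_bucket[5m])) by (le)"}]},
        {"targets": [{"expr": "up"}]},
    ]
}
n_changed = rename_metric(sample)
print(n_changed, json.dumps(sample["panels"][0]["targets"][0]["expr"]))
```

Since the comment notes the new metric is not currently a histogram, renamed histogram panels would presumably still need a review afterwards.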

In fact, having looked a bit at the metrics using curl 10.64.75.104:9102/metrics, I have a couple of comments

# HELP nodejs_process_heap_bytes process heap usage
# TYPE nodejs_process_heap_bytes gauge
nodejs_process_heap_bytes{service="eventstreams",type="rss"} 72445952
nodejs_process_heap_bytes{service="eventstreams",type="total"} 34713600
nodejs_process_heap_bytes{service="eventstreams",type="used"} 28104688

This is fine as a gauge, but I am not sure if type is warranted to be a label. I have a minor fear it will increase the cardinality of the metric without much gain. The statsd-exporter supported services emit it as 3 different metrics. Whatever we do we should keep it consistent of course. @fgiunchedi thoughts?
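To put a number on the cardinality question, one option is to count how many series each metric name contributes in a scrape. A rough sketch in Python using the sample output quoted above; the `series_per_metric` helper is hypothetical and is a simplified line splitter, not a full parser for the Prometheus exposition format:

```python
from collections import Counter

EXPOSITION = """\
# HELP nodejs_process_heap_bytes process heap usage
# TYPE nodejs_process_heap_bytes gauge
nodejs_process_heap_bytes{service="eventstreams",type="rss"} 72445952
nodejs_process_heap_bytes{service="eventstreams",type="total"} 34713600
nodejs_process_heap_bytes{service="eventstreams",type="used"} 28104688
"""


def series_per_metric(text: str) -> Counter:
    """Count sample lines (one per series) for each metric name."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or space (no labels).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


counts = series_per_metric(EXPOSITION)
print(counts)
```

For this sample the count is three series under one metric name, one per `type` value.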

Mar 6 2020, 12:29 PM · Analytics-Kanban, Analytics, Patch-For-Review, Release-Engineering-Team (Pipeline), Services (watching), Release Pipeline
fgiunchedi added a comment to T246997: smartd not starting properly on gen9 + buster.

Thanks for taking a look.
Of the two options you suggest, I am more inclined toward the first one so we can get rid of a component which is overruled by Prometheus anyway, no?
The idea of having to list devices is a bit scary to me, especially considering how those can change in the future with newer OS or kernel versions, and how we'd need to maintain or adapt that.

Mar 6 2020, 10:54 AM · User-fgiunchedi, SRE
fgiunchedi added a comment to T189333: Changing Kibana filters is ridiculously slow.

This is still an issue.

Editing Kibana dashboards:

  • In Safari, crashes the tab.
  • In Firefox, times out after 30 seconds; the user has to tell Firefox to "Wait for the slow script" and then wait another 20 seconds in order to open the dropdown menu once. This cycle is repeated for every action with a dashboard filter.
  • In Chrome, takes about 10-20 seconds, works most times.
Mar 6 2020, 10:36 AM · Developer Productivity, User-fgiunchedi, observability, SRE, Traffic, User-Addshore, Wikimedia-Logstash
fgiunchedi added a comment to T246997: smartd not starting properly on gen9 + buster.

Interesting find! Looks like db1078 is the first system that runs Buster and has an HP RAID controller (so the disks are "masked" behind a single device). This looks like a "regression" in smartd (6.6-1 from buster; I tried 7.1-1~bpo10+1 from buster-backports and no joy either). Note that in this case smartd is running mostly for logging purposes to track attribute changes; however, the smart-data-dump script we're using to export Prometheus metrics does support autodiscovery of hardware RAID controllers, and we alert on those SMART metrics via Prometheus.

Mar 6 2020, 10:14 AM · User-fgiunchedi, SRE
fgiunchedi added a comment to T219924: Move mobileapps logging to new logging pipeline.

Hmm, is this still worth doing if mobileapps is finally moving to k8s soon (T218733)? Same question for Proton (T219925), which is also moving over soon.

Mar 6 2020, 9:25 AM · Product-Infrastructure-Team-Backlog-Deprecated, Page Content Service, observability, Platform Team Legacy (Watching / External), Services (watching), service-runner, Wikimedia-Logstash, SRE
fgiunchedi closed T244357: Provision grafana VM in codfw as Resolved.

Nothing left to do here, resolving

Mar 6 2020, 9:20 AM · serviceops, vm-requests, observability, SRE

Mar 5 2020

fgiunchedi updated subscribers of T175087: Create a navtiming processor for Prometheus.

We're on, webperf metrics are being collected in Prometheus now! Thanks to everyone involved @Gilles @dpifke @Krinkle, there's of course followup work to do but at least now we should be able to compare metrics with coal

Mar 5 2020, 2:31 PM · NavigationTiming

Mar 4 2020

fgiunchedi added a comment to T246860: some Prometheis not scraping the full set of targets.

Curious indeed, thanks for investigating! I took a quick look and it looks like prometheus2004 didn't even know about lvs2007 before today at ~6.23, so I suspect that watching the targets files failed in some way / for some reason. I can't find any smoking gun in prometheus metrics ATM that we could check for alerting, although we should be updating to newer prometheus releases (e.g. 2.15)

Mar 4 2020, 9:24 AM · Patch-For-Review, Traffic, observability, SRE

Mar 3 2020

fgiunchedi updated the task description for T241719: Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3.
Mar 3 2020, 11:22 AM · cloud-services-team (Kanban), SRE
fgiunchedi updated the task description for T241719: Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3.
Mar 3 2020, 11:21 AM · cloud-services-team (Kanban), SRE
fgiunchedi added a comment to T145703: Horizon loses credentials every day.

Can confirm that this is still happening. I'm pretty sure I checked "remember me" yesterday, and today I'm asked to log in again in horizon. What's the supposed session duration when ticking "remember me"?

Mar 3 2020, 10:36 AM · Security, cloud-services-team (Kanban), Horizon
fgiunchedi closed T1268: swift capacity planning as Resolved.

Sure, we can resolve this

Mar 3 2020, 9:36 AM · SRE, SRE-swift-storage
fgiunchedi closed T140075: investigate swift used space spikes since June 2016, a subtask of T1268: swift capacity planning, as Declined.
Mar 3 2020, 9:36 AM · SRE, SRE-swift-storage
fgiunchedi closed T140075: investigate swift used space spikes since June 2016 as Declined.

Resolving since the root cause has been found

Mar 3 2020, 9:36 AM · SRE, SRE-swift-storage

Mar 2 2020

fgiunchedi updated the title for P10580 Audit of elasticsearch fields, 2020.02.14 to 2020.02.29 from Masterwork From Distant Lands to Audit of elasticsearch fields, 2020.02.14 to 2020.02.29.
Mar 2 2020, 2:41 PM
fgiunchedi added a comment to T86969: Send scap log directly to logstash via syslog input.

Patch has been deployed; however, I overlooked adding @cee and will follow up with a fix

Mar 2 2020, 10:51 AM · User-fgiunchedi, Patch-For-Review, Scap
fgiunchedi added a comment to T215499: Move wikimania-scholarships from udp2log to syslog.

In terms of implementation, wikimania-scholarships uses monolog, so an approach similar/equal to https://gerrit.wikimedia.org/r/c/mediawiki/core/+/477791 would work

Mar 2 2020, 10:46 AM · SRE Observability, observability, Wikimedia-Logstash, SRE
fgiunchedi added a comment to T215497: Move iegreview from udp2log to syslog.

In terms of implementation, iegreview uses monolog, so an approach similar/equal to https://gerrit.wikimedia.org/r/c/mediawiki/core/+/477791 would work

Mar 2 2020, 10:46 AM · Observability-Logging, observability, Patch-For-Review, Wikimedia-Logstash, SRE
fgiunchedi added a comment to T245604: Move wikifeeds to the logging pipeline.

Reminder/ping as we (SRE Observability) would like to deprecate all non-kafka inputs by end of Q4 FY19/20. If the service is moving (or has moved) to k8s then what's left to do is disable gelf log output and keep on stdout/stderr. If the service isn't moving to k8s then we'll also need to perform puppet-level changes. Thanks!

Mar 2 2020, 10:39 AM · Wikifeeds, observability, Wikimedia-Logstash, SRE
fgiunchedi added a comment to T245603: Move termbox to the logging pipeline.

Reminder/ping as we (SRE Observability) would like to deprecate all non-kafka inputs by end of Q4 FY19/20. If the service is moving (or has moved) to k8s then what's left to do is disable gelf log output and keep on stdout/stderr. If the service isn't moving to k8s then we'll also need to perform puppet-level changes. Thanks!

Mar 2 2020, 10:39 AM · Wikidata-Termbox, observability, Wikimedia-Logstash
fgiunchedi added a comment to T222377: Move kartotherian/tilerator logging to new logging pipeline.

Reminder/ping as we (SRE Observability) would like to deprecate all non-kafka inputs by end of Q4 FY19/20. If the service is moving (or has moved) to k8s then what's left to do is disable gelf log output and keep on stdout/stderr. If the service isn't moving to k8s then we'll also need to perform puppet-level changes. Thanks!

Mar 2 2020, 10:38 AM · Product-Infrastructure-Team-Backlog-Deprecated (Kanban), observability, Platform Team Legacy (Watching / External), Services (watching), Maps, service-runner, Wikimedia-Logstash, SRE
fgiunchedi added a comment to T219925: Move proton logging to new logging pipeline.

Reminder/ping as we (SRE Observability) would like to deprecate all non-kafka inputs by end of Q4 FY19/20. If the service is moving (or has moved) to k8s then what's left to do is disable gelf log output and keep on stdout/stderr. If the service isn't moving to k8s then we'll also need to perform puppet-level changes. Thanks!

Mar 2 2020, 10:38 AM · Product-Infrastructure-Team-Backlog-Deprecated, observability, Proton, Platform Team Legacy (Watching / External), Services (watching), service-runner, Wikimedia-Logstash, SRE
fgiunchedi added a comment to T219919: Move citoid logging to new logging pipeline.

Reminder/ping as we (SRE Observability) would like to deprecate all non-kafka inputs by end of Q4 FY19/20. If the service is moving (or has moved) to k8s then what's left to do is disable gelf log output and keep on stdout/stderr. If the service isn't moving to k8s then we'll also need to perform puppet-level changes. Thanks!

Mar 2 2020, 10:36 AM · SRE Observability, observability, Citoid, Platform Team Legacy (Watching / External), Services (watching), service-runner, Wikimedia-Logstash, SRE
fgiunchedi closed T245361: prometheus1003/prometheus1004 /srv/prometheus/ops disk space warning as Resolved.

Utilization growth has stabilized around Feb 20th and is now back to organic growth, resolving

Mar 2 2020, 8:32 AM · observability, SRE
fgiunchedi updated the task description for T226986: Client side error logging production launch.
Mar 2 2020, 8:15 AM · Analytics-Radar, MW-1.35-notes (1.35.0-wmf.24; 2020-03-17), Performance-Team (Radar), Desktop Improvements (Vector 2022), observability, Wikimedia-Logstash, User-fgiunchedi, Better Use Of Data, Product-Infrastructure-Team-Backlog-Deprecated, Epic
fgiunchedi placed T106346: setup an alertable threshold for Cassandra heap dumps up for grabs.

I never got around to deploying it and afaik it hasn't been a recurring problem, de-assigning

Mar 2 2020, 8:01 AM · Cassandra, Platform Team Legacy (Watching / External), Services (watching), SRE, RESTBase-Cassandra
fgiunchedi added a comment to T86552: Monitor and alarm on SMART attributes [tracking].

@fgiunchedi: Hi, all related patches in Gerrit have been merged or abandoned. Is there more to do in this task? Asking as you are set as task assignee. Thanks in advance! (You can change the task status via Add Action...Change Status in the dropdown menu.)

Mar 2 2020, 8:00 AM · Observability-Alerting, Epic, SRE
fgiunchedi closed T86316: graphite clustering plan, a subtask of T85451: scale graphite deployment (tracking), as Declined.
Mar 2 2020, 7:59 AM · Platform Team Legacy (Watching / External), Services (watching), Tracking-Neverending, Patch-For-Review, WMDE-Analytics-Engineering, SRE, Grafana
fgiunchedi closed T86316: graphite clustering plan as Declined.

Graphite is on its way out, declining

Mar 2 2020, 7:58 AM · SRE, Grafana

Feb 28 2020

fgiunchedi added a comment to T246097: Have monitoring of updatequerypages cronjobs.

If we're moving those to systemd timers then abstractions in puppet will take care of setting up monitoring too

Feb 28 2020, 8:36 AM · observability, SRE

Feb 27 2020

fgiunchedi updated the task description for T123918: 'swift' user/group IDs should be consistent across the fleet.
Feb 27 2020, 1:52 PM · SRE-swift-storage, SRE
fgiunchedi added a comment to T246110: PyBal BGP group prefix-limit 50 teardown.

+1 to bumping the limit, although the snippet above has 20 not 200 as the limit for pybal if I'm reading correctly

Feb 27 2020, 10:48 AM · SRE, netops

Feb 25 2020

fgiunchedi created P10521 Logging fields transformation to ecs.
Feb 25 2020, 6:07 PM
fgiunchedi closed T245512: Move service::uwsgi logs to logging pipeline, a subtask of T227080: Deprecate all non-Kafka logstash inputs, as Resolved.
Feb 25 2020, 4:50 PM · SRE Observability, observability, Patch-For-Review, Wikimedia-Logstash, SRE
fgiunchedi closed T245512: Move service::uwsgi logs to logging pipeline as Resolved.

All service::uwsgi roles now log to the logging pipeline!

Feb 25 2020, 4:50 PM · SRE-tools, observability, Wikimedia-Logstash
fgiunchedi closed T245511: Move netbox uwsgi logs to logging pipeline, a subtask of T227080: Deprecate all non-Kafka logstash inputs, as Resolved.
Feb 25 2020, 2:50 PM · SRE Observability, observability, Patch-For-Review, Wikimedia-Logstash, SRE
fgiunchedi closed T245511: Move netbox uwsgi logs to logging pipeline as Resolved.

Logs are coming in through the pipeline now and available in Kibana as type:netbox

Feb 25 2020, 2:50 PM · netbox, Wikimedia-Logstash
fgiunchedi closed T239321: Deprecate msdos partition scheme in favor of GPT, a subtask of T156955: Standardizing our partman recipes, as Declined.
Feb 25 2020, 8:02 AM · Patch-For-Review, User-fgiunchedi, SRE
fgiunchedi closed T239321: Deprecate msdos partition scheme in favor of GPT as Declined.

Parent task will be taking care of moving to GPT everywhere

Feb 25 2020, 8:02 AM · SRE

Feb 24 2020

fgiunchedi closed T225122: Migrate services using deprecated Gelf logstash input to Kafka enabled logging pipeline, a subtask of T227080: Deprecate all non-Kafka logstash inputs, as Invalid.
Feb 24 2020, 9:39 AM · SRE Observability, observability, Patch-For-Review, Wikimedia-Logstash, SRE
fgiunchedi closed T225122: Migrate services using deprecated Gelf logstash input to Kafka enabled logging pipeline as Invalid.

Resolving in favor of service-specific tasks (subtasks of T225122)

Feb 24 2020, 9:39 AM · observability, SRE, Wikimedia-Logstash
fgiunchedi added a subtask for T156955: Standardizing our partman recipes: T245810: Standard partman recipe for druid hosts.
Feb 24 2020, 9:08 AM · Patch-For-Review, User-fgiunchedi, SRE
fgiunchedi added a parent task for T245810: Standard partman recipe for druid hosts: T156955: Standardizing our partman recipes.
Feb 24 2020, 9:08 AM · Analytics, User-Elukey

Feb 21 2020

fgiunchedi added a comment to T245778: Spike in "Use of ResourceLoaderSkinModule::getAvailableLogos with $wgLogoHD set instead of $wgLogos was deprecated in MediaWiki 1.35.".

Thanks @Reedy and @jcrespo ! Not that the issue is less urgent but for the record: this issue is/was affecting only the elk7 consumers (not yet in production, but in shadow mode) cluster=misc exported_cluster=logging-eqiad group=logstash7-eqiad instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=udp_localhost-info

Feb 21 2020, 2:10 PM · Wikimedia-production-error, Wikimedia-Logstash, MediaWiki-General
fgiunchedi created T245810: Standard partman recipe for druid hosts.
Feb 21 2020, 10:09 AM · Analytics, User-Elukey
fgiunchedi renamed T245512: Move service::uwsgi logs to logging pipeline from Move debmonitor uwsgi logs to logging pipeline to Move service::uwsgi logs to logging pipeline.
Feb 21 2020, 9:26 AM · SRE-tools, observability, Wikimedia-Logstash
fgiunchedi closed T245725: ArticleEditUpdates deprecation log from FlaggedRevsHooks::onArticleEditUpdates spamming logs as Resolved.

Can confirm that the message is indeed gone now as of latest deploy, thanks @Jdforrester-WMF ! Resolving in favor of T245778 which is still happening

Feb 21 2020, 8:37 AM · Wikimedia-Logstash, MediaWiki-General

Feb 20 2020

fgiunchedi created T245725: ArticleEditUpdates deprecation log from FlaggedRevsHooks::onArticleEditUpdates spamming logs.
Feb 20 2020, 10:06 AM · Wikimedia-Logstash, MediaWiki-General

Feb 19 2020

fgiunchedi added a comment to T245512: Move service::uwsgi logs to logging pipeline.

Agreed, I'll send out patches to add switches to service::uwsgi for logging pipeline. I agree the optimal would be journald although I suspect the full Buster migration isn't imminent at this point, will try with rsyslog first.

Feb 19 2020, 4:28 PM · SRE-tools, observability, Wikimedia-Logstash
fgiunchedi closed T135385: investigate carbon-c-relay stalls/drops towards graphite2002, a subtask of T134889: put additional graphite machines in service, as Declined.
Feb 19 2020, 4:21 PM · Patch-For-Review, SRE, Grafana
fgiunchedi closed T135385: investigate carbon-c-relay stalls/drops towards graphite2002, a subtask of T134016: RESTBase Cassandra cluster: Increase instance count to 3, as Declined.
Feb 19 2020, 4:21 PM · Patch-For-Review, Services, RESTBase, Cassandra
fgiunchedi closed T135385: investigate carbon-c-relay stalls/drops towards graphite2002 as Declined.

Yes resolvable, graphite is on its way out eventually

Feb 19 2020, 4:21 PM · SRE, Grafana
fgiunchedi changed the status of T123918: 'swift' user/group IDs should be consistent across the fleet from Open to Stalled.

@fgiunchedi: Hi, the patch in Gerrit has been merged. Can this task be resolved (via Add Action...Change Status in the dropdown menu), or is there more to do in this task? Asking as you are set as task assignee. Thanks in advance!

Feb 19 2020, 4:20 PM · SRE-swift-storage, SRE
fgiunchedi created T245604: Move wikifeeds to the logging pipeline.
Feb 19 2020, 11:38 AM · Wikifeeds, observability, Wikimedia-Logstash, SRE
fgiunchedi created T245603: Move termbox to the logging pipeline.
Feb 19 2020, 11:37 AM · Wikidata-Termbox, observability, Wikimedia-Logstash
fgiunchedi awarded T245516: Move mathoid to the logging pipeline a Mountain of Wealth token.
Feb 19 2020, 10:08 AM · Math, Mathoid, observability, Wikimedia-Logstash, SRE
fgiunchedi added a comment to T245512: Move service::uwsgi logs to logging pipeline.

@fgiunchedi what's the current best practice here? Debmonitor is just using service::uwsgi that automatically logs to /srv/log/debmonitor/main.log and AFAIK doesn't log to the local syslog normal connections but just restarts of the daemon.

Feb 19 2020, 8:36 AM · SRE-tools, observability, Wikimedia-Logstash
fgiunchedi removed a member for Scap: fgiunchedi.
Feb 19 2020, 8:27 AM

Feb 18 2020

fgiunchedi created T245516: Move mathoid to the logging pipeline.
Feb 18 2020, 2:12 PM · Math, Mathoid, observability, Wikimedia-Logstash, SRE
fgiunchedi created T245515: Move restrouter to the logging pipeline.
Feb 18 2020, 2:10 PM · RESTBase, observability, Wikimedia-Logstash, SRE
fgiunchedi closed T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs as Resolved.

I'm resolving this task since all the followup is already tracked in parent T227080: Deprecate all non-Kafka logstash inputs

Feb 18 2020, 2:05 PM · SRE Observability, observability, Patch-For-Review, User-herron, SRE, Wikimedia-Logstash
fgiunchedi closed T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs, a subtask of T227080: Deprecate all non-Kafka logstash inputs, as Resolved.
Feb 18 2020, 2:05 PM · SRE Observability, observability, Patch-For-Review, Wikimedia-Logstash, SRE
fgiunchedi updated the task description for T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs.
Feb 18 2020, 2:03 PM · SRE Observability, observability, Patch-For-Review, User-herron, SRE, Wikimedia-Logstash
fgiunchedi created T245512: Move service::uwsgi logs to logging pipeline.
Feb 18 2020, 1:49 PM · SRE-tools, observability, Wikimedia-Logstash
fgiunchedi added a subtask for T227080: Deprecate all non-Kafka logstash inputs: T245511: Move netbox uwsgi logs to logging pipeline.
Feb 18 2020, 1:47 PM · SRE Observability, observability, Patch-For-Review, Wikimedia-Logstash, SRE
fgiunchedi added a parent task for T245511: Move netbox uwsgi logs to logging pipeline: T227080: Deprecate all non-Kafka logstash inputs.
Feb 18 2020, 1:47 PM · netbox, Wikimedia-Logstash
fgiunchedi created T245511: Move netbox uwsgi logs to logging pipeline.
Feb 18 2020, 1:46 PM · netbox, Wikimedia-Logstash
fgiunchedi closed T227108: Port varnishlog consumers to log to syslog / logging infra, a subtask of T227080: Deprecate all non-Kafka logstash inputs, as Resolved.
Feb 18 2020, 1:40 PM · SRE Observability, observability, Patch-For-Review, Wikimedia-Logstash, SRE
fgiunchedi closed T227108: Port varnishlog consumers to log to syslog / logging infra as Resolved.

This is now complete! All varnish logging goes through the logging pipeline

Feb 18 2020, 1:40 PM · Traffic, observability, Wikimedia-Logstash, User-fgiunchedi, SRE
fgiunchedi added a comment to T230847: Logstash missing most messages from mediawiki (Aug 2019).

Thanks @Krinkle, to answer your questions:

Feb 18 2020, 9:57 AM · Wikimedia-Incident, SRE, Wikimedia-Logstash

Feb 17 2020

fgiunchedi lowered the priority of T245361: prometheus1003/prometheus1004 /srv/prometheus/ops disk space warning from High to Medium.
Feb 17 2020, 9:21 AM · observability, SRE
fgiunchedi claimed T245361: prometheus1003/prometheus1004 /srv/prometheus/ops disk space warning.

I'll take this and resolve once space has stabilized again

Feb 17 2020, 9:21 AM · observability, SRE
fgiunchedi added a comment to T245280: logstash_formatter_key_conflict in mediawiki logs.

This should be in a much better place now (the mediawiki-config patch fixed most).. I've fixed (or created a subtask) most of the ones showing in the logs, and a few others pre-emptively too

I don't see any point backporting MW core patches, so .20 should clear up most of the other ones

Also filed T245289 to pre-emptively prevent some of these in the future

Feb 17 2020, 9:19 AM · MW-1.35-notes (1.35.0-wmf.23; 2020-03-10), PageViewInfo, Wikimedia-production-error, MediaWiki-extensions-LoginNotify, Community-Tech, MediaWiki-Core-AuthManager, MediaWiki-Debug-Logger
fgiunchedi added a comment to T245361: prometheus1003/prometheus1004 /srv/prometheus/ops disk space warning.

Thanks @Marostegui @Volans ! Indeed the space used grew because of longer retention, I added 150G to the LVs (last log is wrong, it is 100G) which should be enough to stabilize in ~15d and leave headroom too

Feb 17 2020, 9:11 AM · observability, SRE

Feb 14 2020

fgiunchedi created T245280: logstash_formatter_key_conflict in mediawiki logs.
Feb 14 2020, 4:25 PM · MW-1.35-notes (1.35.0-wmf.23; 2020-03-10), PageViewInfo, Wikimedia-production-error, MediaWiki-extensions-LoginNotify, Community-Tech, MediaWiki-Core-AuthManager, MediaWiki-Debug-Logger
fgiunchedi awarded T245242: Allow !log in #wikimedia-sre a Like token.
Feb 14 2020, 3:48 PM · Stashbot
fgiunchedi created T245242: Allow !log in #wikimedia-sre.
Feb 14 2020, 10:11 AM · Stashbot
fgiunchedi committed rLPRIc33a588c2712: hieradata: add dummy performance_arclamp key.
Feb 14 2020, 9:52 AM

Feb 13 2020

fgiunchedi added a comment to T244776: Swift container for performance flame graphs (ArcLamp).

Looking at yesterday's (2020-02-11) output, it was about 8 GB of (uncompressed) logs and 14 MB of SVGs, and about 800 files total. We can control the sampling interval to regulate how big these get, so let's assume it's relatively constant. I'll have to check if there's a reason we don't compress the logs; I feel like we should, which would dramatically reduce this. (I just now tried gzip -1 on one set of logs, and they went from 4 GB to 479 MB.)
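As a back-of-the-envelope check on what compression would save, the figures quoted above can be combined as follows. This is only an estimate: it assumes decimal units (1 GB = 1000 MB) and that the ratio measured on one set of logs holds for the whole daily output.

```python
# Figures quoted in the comment above.
daily_uncompressed_gb = 8.0   # about 8 GB of uncompressed logs per day
sample_before_gb = 4.0        # one set of logs before gzip -1
sample_after_mb = 479.0       # the same set after gzip -1

# Assumption: decimal units and a compression ratio that is uniform across logs.
ratio = sample_after_mb / (sample_before_gb * 1000)
daily_compressed_gb = daily_uncompressed_gb * ratio

print(f"compressed size is {ratio:.1%} of the original")
print(f"estimated daily volume: {daily_compressed_gb:.2f} GB")
```

Under these assumptions the daily log volume would drop to a little under 1 GB.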

Feb 13 2020, 9:48 AM · Patch-For-Review, Performance-Team, Arc-Lamp, SRE-swift-storage

Feb 11 2020

fgiunchedi added a comment to T244776: Swift container for performance flame graphs (ArcLamp).

Great to see this work! Re: authentication and permissions, it is indeed like @aaron outlined: we'd be creating a user that can create containers and upload files at will.

Feb 11 2020, 4:54 PM · Patch-For-Review, Performance-Team, Arc-Lamp, SRE-swift-storage
fgiunchedi added a comment to T242250: rack/setup/install ps[12]-60[34]-eqsin.

@RobH please let me know once the PDUs should be snmp-accessible, they'll need to be added to puppet/monitoring

Feb 11 2020, 9:48 AM · SRE, ops-eqsin
fgiunchedi added a comment to T244761: Script to point SRE local machine traffic to another LB.

+1 to /etc/hosts, I've done similar in the past and it has worked as expected. As a side note, the script could even take the form of a puppet manifest that we can then puppet apply locally.

Feb 11 2020, 9:41 AM · SRE
fgiunchedi updated the task description for T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs.
Feb 11 2020, 9:22 AM · SRE Observability, observability, Patch-For-Review, User-herron, SRE, Wikimedia-Logstash
fgiunchedi closed T242585: Move cassandra logging to logging pipeline, a subtask of T227080: Deprecate all non-Kafka logstash inputs, as Resolved.
Feb 11 2020, 9:21 AM · SRE Observability, observability, Patch-For-Review, Wikimedia-Logstash, SRE
fgiunchedi closed T242585: Move cassandra logging to logging pipeline as Resolved.

This is complete! All cassandra production clusters now log through the logging pipeline.

Feb 11 2020, 9:21 AM · Patch-For-Review, observability, Wikimedia-Logstash, User-fgiunchedi, SRE

Feb 10 2020

fgiunchedi moved T240667: Ingestion errors for production logs on ELK7 from Backlog to Doing on the User-fgiunchedi board.
Feb 10 2020, 10:25 AM · Observability-Logging, observability, SRE, Wikimedia-Logstash

Feb 7 2020

fgiunchedi added a comment to T244357: Provision grafana VM in codfw.

Added vm-requests tag and pasted vm-request form. Please add the missing data above.

Feb 7 2020, 9:32 AM · serviceops, vm-requests, observability, SRE
fgiunchedi updated the task description for T244357: Provision grafana VM in codfw.
Feb 7 2020, 9:31 AM · serviceops, vm-requests, observability, SRE

Feb 6 2020

fgiunchedi added a comment to T225125: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline.

Status update: out of the box json logging support has been introduced in elasticsearch 7 (https://github.com/elastic/elasticsearch/issues/8786). Whereas for previous versions we'd need to bring in jackson-databind, which comes with its own set of challenges (e.g. https://github.com/elastic/elasticsearch/issues/22103). Thus I'm of the opinion that waiting for the elasticsearch 7 upgrade on cirrus/relforge/cloudelastic will be easier.

Feb 6 2020, 2:23 PM · SRE Observability, observability, Discovery-Search, Elasticsearch, SRE, Wikimedia-Logstash

Feb 5 2020

fgiunchedi renamed T244208: Upgrade Grafana to 6.7 from Upgrade Grafana to 6.4 to Upgrade Grafana to 6.6.
Feb 5 2020, 5:02 PM · cloud-services-team (Kanban), Patch-For-Review, User-CDanis, SRE, observability
fgiunchedi created T244357: Provision grafana VM in codfw.
Feb 5 2020, 2:02 PM · serviceops, vm-requests, observability, SRE

Feb 4 2020

fgiunchedi created T244208: Upgrade Grafana to 6.7.
Feb 4 2020, 9:23 AM · cloud-services-team (Kanban), Patch-For-Review, User-CDanis, SRE, observability
fgiunchedi moved T242585: Move cassandra logging to logging pipeline from Backlog to Doing on the User-fgiunchedi board.
Feb 4 2020, 9:18 AM · Patch-For-Review, observability, Wikimedia-Logstash, User-fgiunchedi, SRE
fgiunchedi added a comment to T227108: Port varnishlog consumers to log to syslog / logging infra.

Had to revert in https://gerrit.wikimedia.org/r/c/operations/puppet/+/569529, at least two issues found:

  1. journald < buster has a maximum line length of 2k, thus long lines get broken into multiple lines, in turn breaking json parsing.
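The first issue can be reproduced in isolation. A small sketch in Python, assuming a 2048-character limit as a stand-in for the "2k" figure above (the real limit is counted in bytes, and the record below is made up): once a long JSON line is cut into pieces, none of the pieces parses on its own.

```python
import json

MAX_LINE = 2048  # assumed limit, standing in for the "2k" mentioned above


def split_line(line: str, limit: int = MAX_LINE) -> list[str]:
    """Cut a line into consecutive chunks of at most `limit` characters."""
    return [line[i:i + limit] for i in range(0, len(line), limit)]


def parses(fragment: str) -> bool:
    """Return True if the fragment is valid JSON by itself."""
    try:
        json.loads(fragment)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical log record that is longer than the limit.
record = json.dumps({"message": "x" * 3000, "level": "info"})
fragments = split_line(record)
results = [parses(f) for f in fragments]
print(len(fragments), results)
```

Joining the fragments back together restores a parseable record, which is why the line has to reach the JSON parser unbroken.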
Feb 4 2020, 9:01 AM · Traffic, observability, Wikimedia-Logstash, User-fgiunchedi, SRE

Feb 3 2020

fgiunchedi updated the task description for T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs.
Feb 3 2020, 3:34 PM · SRE Observability, observability, Patch-For-Review, User-herron, SRE, Wikimedia-Logstash
fgiunchedi added a comment to T227108: Port varnishlog consumers to log to syslog / logging infra.

Had to revert in https://gerrit.wikimedia.org/r/c/operations/puppet/+/569529, at least two issues found:

Feb 3 2020, 10:53 AM · Traffic, observability, Wikimedia-Logstash, User-fgiunchedi, SRE

Jan 27 2020

fgiunchedi closed T242511: Degraded RAID on ms-be1039 as Resolved.

@godog I replaced the disk, please see what you need to do to add it back to the raid. Thanks!

Jan 27 2020, 11:13 PM · SRE-swift-storage, ops-eqiad, SRE