In T219924#5949120, @Mholloway wrote: @fgiunchedi This should be finished by April at the latest for both services. AIUI that's when SCB is planned to be decommissioned.
Mar 9 2020
fgiunchedi changed the status of T219924: Move mobileapps logging to new logging pipeline from Open to Stalled.
In T226986#5949928, @Ottomata wrote: @fgiunchedi very few error events are flowing in now! This is live on group0 wikis. Can we hook the topics up to logstash? In kafka logging-eqiad it is eqiad.mediawiki.client.error and in kafka logging-codfw it is codfw.mediawiki.client.error.
The exporter is up, but on install1003 it doesn't seem to be able to query squid yet:
Mar 6 2020
In T238658#5947574, @akosiaris wrote: In T238658#5946564, @Ottomata wrote: FYI grafana dash here:
Nice!
Not totally sure what is up with the histogram based panels, must be something with the migration to service-runner prometheus instead of statsd-exporter.
express_router_request_duration_seconds is not a histogram[1] for some reason, that's why. Also, those panels still reference service_runner_request_duration_seconds_bucket, so I guess a sed 's/service_runner_request_duration_seconds_bucket/express_router_request_duration_seconds/' is in order there.
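For illustration, a rough sketch of that sed against locally exported dashboard JSON (the file path is made up; the panels would still need checking by hand since the new metric is not a histogram):

  sed -i 's/service_runner_request_duration_seconds_bucket/express_router_request_duration_seconds/g' \
    dashboards/eventstreams.json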
In fact, having looked a bit at the metrics using curl 10.64.75.104:9102/metrics, I have a couple of comments:

  # HELP nodejs_process_heap_bytes process heap usage
  # TYPE nodejs_process_heap_bytes gauge
  nodejs_process_heap_bytes{service="eventstreams",type="rss"} 72445952
  nodejs_process_heap_bytes{service="eventstreams",type="total"} 34713600
  nodejs_process_heap_bytes{service="eventstreams",type="used"} 28104688

This is fine as a gauge, but I am not sure type is warranted as a label. I have a minor fear it will increase the cardinality of the metric without much gain. The statsd-exporter supported services emit it as 3 different metrics. Whatever we do we should keep it consistent of course. @fgiunchedi thoughts?
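As a rough way to keep an eye on the cardinality concern, the per-metric series count can be checked against the Prometheus API; the host and port here are placeholders:

  curl -s 'http://localhost:9090/api/v1/series?match[]=nodejs_process_heap_bytes' | jq '.data | length'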
In T246997#5947487, @Marostegui wrote:Thanks for taking a look.
Of the two options you suggest, I am more inclined towards the first one, so we can get rid of a component which is superseded by Prometheus anyway, no?
The idea of having to list devices is a bit scary to me, especially considering how those can change in the future with newer OS or kernel versions, and how we'd need to maintain or adapt that.
In T189333#5945454, @Krinkle wrote:This is still an issue.
Editing Kibana dashboards:
- In Safari, it crashes the tab.
- In Firefox, it times out after 30 seconds; the user has to tell Firefox to "Wait for the slow script" and then wait another 20 seconds in order to open the dropdown menu once. This cycle repeats for every action with a dashboard filter.
- In Chrome, it takes about 10-20 seconds and works most of the time.
Interesting find! Looks like db1078 is the first system that we run Buster on that has an HP raid controller (so the disks are "masked" behind a single device). This looks like a "regression" in smartd (6.6-1 from buster; I tried 7.1-1~bpo10+1 from buster-backports and no joy either). Note that in this case smartd is running mostly for logging purposes to track attribute changes; however, the smart-data-dump script we're using to export Prometheus metrics does support autodiscovery of hardware raid controllers, and we alert on those SMART metrics via Prometheus.
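For reference, SMART data can still be read behind an HP Smart Array controller by addressing the physical drives explicitly, which is presumably what the autodiscovery relies on (device path and drive indexes below are illustrative):

  smartctl -d cciss,0 -a /dev/sda    # first physical drive behind the controller
  smartctl -d cciss,1 -a /dev/sda    # second drive, and so on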
In T219924#5946258, @Mholloway wrote:
Nothing left to do here, resolving
Mar 5 2020
Mar 4 2020
Curious indeed, thanks for investigating! I took a quick look and it looks like prometheus2004 didn't even know about lvs2007 before today at ~6:23, so I'm suspecting watching the targets files failed in some way / for some reason. I can't find any smoking gun in Prometheus metrics ATM that we could check for alerting, although we should be updating to newer Prometheus releases anyway (e.g. 2.15).
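For what it's worth, whether a given host is known to a Prometheus instance can be double-checked via the targets API; the port and path prefix below are assumptions:

  curl -s http://prometheus2004:9900/ops/api/v1/targets | \
    jq -r '.data.activeTargets[].labels.instance' | grep lvs2007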
Mar 3 2020
fgiunchedi updated the task description for T241719: Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3.
fgiunchedi updated the task description for T241719: Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3.
Can confirm that this is still happening: I'm pretty sure I checked "remember me" yesterday, and today I'm asked to log in again in Horizon. What's the expected session duration when ticking "remember me"?
Sure, we can resolve this
fgiunchedi closed T140075: investigate swift used space spikes since June 2016, a subtask of T1268: swift capacity planning, as Declined.
Resolving since the root cause has been found
Mar 2 2020
fgiunchedi updated the title for P10580 Audit of elasticsearch fields, 2020.02.14 to 2020.02.29 from Masterwork From Distant Lands to Audit of elasticsearch fields, 2020.02.14 to 2020.02.29.
The patch has been deployed; however, I overlooked adding @cee and will follow up with a fix.
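For context, @cee is just a cookie prefixed to the syslog message so that the JSON payload can be parsed downstream (e.g. by rsyslog's mmjsonparse); roughly like this, with the tag and fields made up:

  logger -t myapp '@cee: {"message": "request served", "severity": "info"}'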
In terms of implementation, wikimania-scholarships uses monolog, so an approach similar/equal to https://gerrit.wikimedia.org/r/c/mediawiki/core/+/477791 would work
In terms of implementation, iegreview uses monolog, so an approach similar/equal to https://gerrit.wikimedia.org/r/c/mediawiki/core/+/477791 would work
Reminder/ping as we (SRE Observability) would like to deprecate all non-kafka inputs by end of Q4 FY19/20. If the service is moving (or has moved) to k8s then what's left to do is disable gelf log output and keep on stdout/stderr. If the service isn't moving to k8s then we'll also need to perform puppet-level changes. Thanks!
Reminder/ping as we (SRE Observability) would like to deprecate all non-kafka inputs by end of Q4 FY19/20. If the service is moving (or has moved) to k8s then what's left to do is disable gelf log output and keep on stdout/stderr. If the service isn't moving to k8s then we'll also need to perform puppet-level changes. Thanks!
Reminder/ping as we (SRE Observability) would like to deprecate all non-kafka inputs by end of Q4 FY19/20. If the service is moving (or has moved) to k8s then what's left to do is disable gelf log output and keep on stdout/stderr. If the service isn't moving to k8s then we'll also need to perform puppet-level changes. Thanks!
Reminder/ping as we (SRE Observability) would like to deprecate all non-kafka inputs by end of Q4 FY19/20. If the service is moving (or has moved) to k8s then what's left to do is disable gelf log output and keep on stdout/stderr. If the service isn't moving to k8s then we'll also need to perform puppet-level changes. Thanks!
Reminder/ping as we (SRE Observability) would like to deprecate all non-kafka inputs by end of Q4 FY19/20. If the service is moving (or has moved) to k8s then what's left to do is disable gelf log output and keep on stdout/stderr. If the service isn't moving to k8s then we'll also need to perform puppet-level changes. Thanks!
fgiunchedi closed T245361: prometheus1003/prometheus1004 /srv/prometheus/ops disk space warning as Resolved.
Utilization growth has stabilized around Feb 20th and is now back to organic growth, resolving
I never got around to deploying it and afaik it hasn't been a recurring problem, de-assigning
In T86552#5930238, @Aklapper wrote: @fgiunchedi: Hi, all related patches in Gerrit have been merged or abandoned. Is there more to do in this task? Asking as you are set as task assignee. Thanks in advance! (You can change the task status via Add Action... → Change Status in the dropdown menu.)
fgiunchedi closed T86316: graphite clustering plan, a subtask of T85451: scale graphite deployment (tracking), as Declined.
Graphite is on its way out, declining
Feb 28 2020
If we're moving those to systemd timers then abstractions in puppet will take care of setting up monitoring too
Feb 27 2020
fgiunchedi updated the task description for T123918: 'swift' user/group IDs should be consistent across the fleet.
+1 to bumping the limit, although the snippet above has 20, not 200, as the limit for pybal if I'm reading correctly
Feb 25 2020
fgiunchedi closed T245512: Move service::uwsgi logs to logging pipeline, a subtask of T227080: Deprecate all non-Kafka logstash inputs, as Resolved.
All service::uwsgi roles now log to the logging pipeline!
fgiunchedi closed T245511: Move netbox uwsgi logs to logging pipeline, a subtask of T227080: Deprecate all non-Kafka logstash inputs, as Resolved.
Logs are coming in through the pipeline now and available in Kibana as type:netbox
fgiunchedi closed T239321: Deprecate msdos partition scheme in favor of GPT, a subtask of T156955: Standardizing our partman recipes, as Declined.
Parent task will be taking care of moving to GPT everywhere
Feb 24 2020
fgiunchedi closed T225122: Migrate services using deprecated Gelf logstash input to Kafka enabled logging pipeline as Invalid.
Resolving in favor of service-specific tasks (subtasks of T225122)
fgiunchedi added a parent task for T245810: Standard partman recipe for druid hosts: T156955: Standardizing our partman recipes.
Feb 21 2020
fgiunchedi added a comment to T245778: Spike in "Use of ResourceLoaderSkinModule::getAvailableLogos with $wgLogoHD set instead of $wgLogos was deprecated in MediaWiki 1.35.".
Thanks @Reedy and @jcrespo! Not that the issue is any less urgent, but for the record: this issue is/was affecting only the elk7 consumers (not yet in production, but in shadow mode):

  cluster=misc exported_cluster=logging-eqiad group=logstash7-eqiad instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=udp_localhost-info
fgiunchedi renamed T245512: Move service::uwsgi logs to logging pipeline from Move debmonitor uwsgi logs to logging pipeline to Move service::uwsgi logs to logging pipeline.
fgiunchedi closed T245725: ArticleEditUpdates deprecation log from FlaggedRevsHooks::onArticleEditUpdates spamming logs as Resolved.
Can confirm that the message is indeed gone now as of the latest deploy, thanks @Jdforrester-WMF! Resolving in favor of T245778, which is still happening.
Feb 20 2020
Feb 19 2020
Agreed, I'll send out patches to add switches to service::uwsgi for the logging pipeline. The optimal would be journald, although I suspect the full Buster migration isn't imminent at this point, so I'll try rsyslog first.
fgiunchedi closed T135385: investigate carbon-c-relay stalls/drops towards graphite2002, a subtask of T134889: put additional graphite machines in service, as Declined.
fgiunchedi closed T135385: investigate carbon-c-relay stalls/drops towards graphite2002, a subtask of T134016: RESTBase Cassandra cluster: Increase instance count to 3, as Declined.
fgiunchedi closed T135385: investigate carbon-c-relay stalls/drops towards graphite2002 as Declined.
Yes resolvable, graphite is on its way out eventually
fgiunchedi changed the status of T123918: 'swift' user/group IDs should be consistent across the fleet from Open to Stalled.
In T123918#5897578, @Aklapper wrote: @fgiunchedi: Hi, the patch in Gerrit has been merged. Can this task be resolved (via Add Action... → Change Status in the dropdown menu), or is there more to do in this task? Asking as you are set as task assignee. Thanks in advance!
fgiunchedi awarded T245516: Move mathoid to the logging pipeline a Mountain of Wealth token.
In T245512#5892948, @Volans wrote: @fgiunchedi what's the current best practice here? Debmonitor is just using service::uwsgi, which automatically logs to /srv/log/debmonitor/main.log and AFAIK doesn't log normal connections to the local syslog, just restarts of the daemon.
Feb 18 2020
fgiunchedi closed T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs as Resolved.
I'm resolving this task since all the followup is already tracked in parent T227080: Deprecate all non-Kafka logstash inputs
fgiunchedi added a parent task for T245511: Move netbox uwsgi logs to logging pipeline: T227080: Deprecate all non-Kafka logstash inputs.
fgiunchedi closed T227108: Port varnishlog consumers to log to syslog / logging infra, a subtask of T227080: Deprecate all non-Kafka logstash inputs, as Resolved.
This is now complete! All varnish logging goes through the logging pipeline
Thanks @Krinkle, to answer your questions:
Feb 17 2020
fgiunchedi lowered the priority of T245361: prometheus1003/prometheus1004 /srv/prometheus/ops disk space warning from High to Medium.
I'll take this and resolve once space has stabilized again
In T245280#5886024, @Reedy wrote: This should be in a much better place now (the mediawiki-config patch fixed most). I've fixed (or created a subtask for) most of the ones showing in the logs, and a few others pre-emptively too
I don't see any point backporting MW core patches, so .20 should clear up most of the other ones
Also filed T245289 to pre-emptively prevent some of these in the future
fgiunchedi added a comment to T245361: prometheus1003/prometheus1004 /srv/prometheus/ops disk space warning.
Thanks @Marostegui @Volans! Indeed the space used grew because of longer retention. I added 150G to the LVs (last log is wrong, it is 100G), which should be enough to stabilize in ~15d and leave headroom too.
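For the record, the resize is essentially a single lvextend with an online filesystem grow; the VG/LV names (and, strictly speaking, the size) below are illustrative:

  lvextend --resizefs --size +150G /dev/vg0/prometheus-ops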
Feb 14 2020
fgiunchedi awarded T245242: Allow !log in #wikimedia-sre a Like token.
hieradata: add dummy performance_arclamp key
Feb 13 2020
In T244776#5875581, @dpifke wrote: Looking at yesterday's (2020-02-11) output, it was about 8 GB of (uncompressed) logs and 14 MB of SVGs, and about 800 files total. We can control the sampling interval to regulate how big these get, so let's assume it's relatively constant. I'll have to check if there's a reason we don't compress the logs; I feel like we should, which would dramatically reduce this. (I just now tried gzip -1 on one set of logs, and they went from 4 GB to 479 MB.)
Feb 11 2020
Great to see this work! Re: authentication and permissions, it is indeed like @aaron outlined: we'd be creating a user that can create containers and upload files at will.
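To illustrate what that looks like in practice with python-swiftclient (the auth endpoint, credentials and names below are all made up):

  export ST_AUTH=https://swift.example.org/auth/v1.0
  export ST_USER=account:uploader
  export ST_KEY=secret
  swift post profiling-data                    # create a container
  swift upload profiling-data flamegraph.svg   # upload a file into it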
@RobH please let me know once the PDUs are snmp-accessible; they'll need to be added to puppet/monitoring.
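Once they are reachable, a quick sanity check can be as simple as walking the system subtree (the hostname and community string below are placeholders):

  snmpwalk -v2c -c public pdu1001.example.org system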
+1 to /etc/hosts, I've done similar in the past and it has worked as expected. As a side note, the script could even take the form of a puppet manifest we can then puppet apply locally (roughly sketched below).
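A minimal sketch, assuming all we need is a host entry managed locally; the name and IP are examples only:

  sudo puppet apply -e 'host { "service.example.org": ip => "192.0.2.10", ensure => present }'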
fgiunchedi closed T242585: Move cassandra logging to logging pipeline, a subtask of T227080: Deprecate all non-Kafka logstash inputs, as Resolved.
This is complete! All cassandra production clusters now log through the logging pipeline.
Feb 10 2020
fgiunchedi moved T240667: Ingestion errors for production logs on ELK7 from Backlog to Doing on the User-fgiunchedi board.
Feb 7 2020
In T244357#5853220, @Dzahn wrote: Added the vm-requests tag and pasted the vm-request form. Please add the missing data above.
Feb 6 2020
fgiunchedi added a comment to T225125: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline.
Status update: out-of-the-box JSON logging support has been introduced in elasticsearch 7 (https://github.com/elastic/elasticsearch/issues/8786), whereas for previous versions we'd need to bring in jackson-databind, which comes with its own set of challenges (e.g. https://github.com/elastic/elasticsearch/issues/22103). Thus I'm of the opinion that waiting for the elasticsearch 7 upgrade on cirrus/relforge/cloudelastic will be easier.
Feb 5 2020
fgiunchedi renamed T244208: Upgrade Grafana to 6.7 from Upgrade Grafana to 6.4 to Upgrade Grafana to 6.6.
Feb 4 2020
fgiunchedi moved T242585: Move cassandra logging to logging pipeline from Backlog to Doing on the User-fgiunchedi board.
In T227108#5844341, @fgiunchedi wrote: Had to revert in https://gerrit.wikimedia.org/r/c/operations/puppet/+/569529, at least two issues found:
- journald < buster has a maximum line length of 2k, thus long lines get broken into multiple lines, in turn breaking json parsing.
Feb 3 2020
Had to revert in https://gerrit.wikimedia.org/r/c/operations/puppet/+/569529, at least two issues found:
Jan 27 2020
In T242511#5834630, @Cmjohnson wrote: @godog I replaced the disk, please see what you need to do to add it back to the raid. Thanks!