Fri, Feb 14
Thu, Feb 13
Tue, Feb 11
Great to see this work! Re: authentication and permissions, it is indeed as @aaron outlined: we'd be creating a user that can create containers and upload files at will.
@RobH please let me know once the PDUs should be snmp-accessible, they'll need to be added to puppet/monitoring
+1 to /etc/hosts, I've done similar in the past and it has worked as expected. As a side note, the script could even take the form of a puppet manifest that we can then puppet apply locally.
This is complete! All cassandra production clusters now log through the logging pipeline.
Mon, Feb 10
Fri, Feb 7
Thu, Feb 6
Status update: out-of-the-box JSON logging support has been introduced in elasticsearch 7 (https://github.com/elastic/elasticsearch/issues/8786). For previous versions we'd need to bring in jackson-databind, which comes with its own set of challenges (e.g. https://github.com/elastic/elasticsearch/issues/22103). Thus I'm of the opinion that waiting for the elasticsearch 7 upgrade on cirrus/relforge/cloudelastic will be easier.
Wed, Feb 5
Tue, Feb 4
Mon, Feb 3
Had to revert in https://gerrit.wikimedia.org/r/c/operations/puppet/+/569529, at least two issues found:
Mon, Jan 27
Tue, Jan 21
Thumbor is now logging to localhost:11514, and from there logs are shipped to the logging pipeline. Resolving!
Mon, Jan 20
No idea right off the bat, @greg, although I see that the Gate and Submit "Resident Time" panel at the top left is working as expected, so I take it this is possible in some fashion.
Jan 17 2020
All done, service is being implemented in T243000
Jan 15 2020
In other panels I see data going back to Aug 2017 for core's gate-and-submit, e.g. https://grafana.wikimedia.org/d/000000108/releng-kpis?orgId=1&from=1496046004105&to=1579080951474&fullscreen&edit&panelId=2 so maybe it has to do with the panel's query?
Jan 14 2020
Something else I realized today: with ELK7 we dropped our custom logstash template in favor of logstash's default, although we'll need to at least bump the default number of fields per index (1000 -> 4096).
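For illustration, the field limit bump could be expressed as a small index template layered on top of the default one. This is only a sketch: `index.mapping.total_fields.limit` is the elasticsearch setting that defaults to 1000, while the `logstash-*` pattern, the `order` value and the helper name are assumptions, not what is deployed.

```python
import json


def fields_limit_template(pattern: str, limit: int) -> dict:
    """Build a template body that raises the per-index field limit."""
    return {
        # Hypothetical index pattern; adjust to the indices actually in use.
        "index_patterns": [pattern],
        # Assumed ordering so these settings apply on top of the default template.
        "order": 1,
        "settings": {"index.mapping.total_fields.limit": limit},
    }


template = fields_limit_template("logstash-*", 4096)
print(json.dumps(template, indent=2))
```

The resulting JSON would be the body sent to the template API; how it gets shipped (puppet, curl, logstash itself) is left open here.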
Jan 13 2020
Reviving this as part of this Q's OKRs to move services off logstash non-kafka inputs. I'll follow up with patches to move to the localhost-udp compatibility endpoint! (See T242609)
In case it is helpful: we can reuse the centrallog hosts in codfw/eqiad. For site-local netconsole, instead, we'll need to set up local syslog collectors anyway (on ganeti VMs) for network devices' syslog, so we could piggyback on those.
Looks like the controller freaked out (T141756), firmware upgraded and rebooted.
Jan 9 2020
Thanks @Papaul! Upon reboot the host booted into PXE, I am assuming because the first disk was present but unbootable and the host didn't fall back to booting from the second disk. Anyway, all good after a reimage, resolving.
Jan 7 2020
Jan 4 2020
Jan 2 2020
Host is back in service!
I checked all "memory free" metrics as reported by node-exporter for the varnish case and indeed the numbers match, i.e. the kernel was reporting multiple GBs of memory free at the time of the crashes:
@Papaul the host is in warranty and it looks like an SSD failed (its LED is blinking), could we get that replaced? Thanks!
Configured both ports to use PXE when booting, now the host is running the reimage correctly:
@Papaul thanks! I tried to reimage the host but the NIC shows its link as offline in the BIOS:
Analytics now publishes media access stats, might be useful to drive some/all thumbnail cleanup: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Mediarequests
Dec 20 2019
Dec 19 2019
Dec 18 2019
My two cents re: prometheus_client_php's current adapters, apcu and redis, since AIUI neither is optimal/desirable. There might be another way, inspired by what the ruby client does (see also this PromCon 2019 talk and video). Very broadly, the idea IIRC is for each process to mmap a separate file with its own metrics; then at collection time metrics are read from the files and merged.
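To make the idea concrete, here is a rough sketch in Python (not PHP, and not how the ruby client actually lays out its files): each process owns one mmap-ed file holding a single counter, and collection sums the files. The file naming scheme, the one-value-per-file layout and the fake PIDs are all assumptions for illustration.

```python
import mmap
import os
import struct
import tempfile

VALUE = struct.Struct("<d")  # one 8-byte float per file, i.e. a single counter


class MmapCounter:
    """A counter backed by a per-process memory-mapped file."""

    def __init__(self, directory: str, name: str, pid: int) -> None:
        path = os.path.join(directory, f"{name}_{pid}.db")
        self._file = open(path, "w+b")
        self._file.truncate(VALUE.size)  # zero-filled, which decodes as 0.0
        self._map = mmap.mmap(self._file.fileno(), VALUE.size)

    def inc(self, amount: float = 1.0) -> None:
        # Only the owning process writes here, so no cross-process locking.
        (current,) = VALUE.unpack_from(self._map, 0)
        VALUE.pack_into(self._map, 0, current + amount)

    def close(self) -> None:
        self._map.flush()
        self._map.close()
        self._file.close()


def collect(directory: str, name: str) -> float:
    """Sum the counter across every per-process file at scrape time."""
    total = 0.0
    for entry in os.listdir(directory):
        if entry.startswith(f"{name}_") and entry.endswith(".db"):
            with open(os.path.join(directory, entry), "rb") as handle:
                (value,) = VALUE.unpack(handle.read(VALUE.size))
            total += value
    return total


# Simulate two worker processes with made-up PIDs in a scratch directory.
with tempfile.TemporaryDirectory() as tmp:
    workers = [MmapCounter(tmp, "requests_total", pid) for pid in (1001, 1002)]
    workers[0].inc()
    workers[0].inc(2)
    workers[1].inc(4)
    for worker in workers:
        worker.close()
    total = collect(tmp, "requests_total")

print(total)
```

The appeal of this layout is that writers never contend with each other; the cost moves to the collector, which has to merge files and deal with files left behind by dead processes (not handled in this sketch).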
Dec 17 2019
FWIW I think if the current thresholds are good at detecting DDoS, we should explicitly whitelist WMCS ranges with, say, 1.5x the current thresholds and see how far that gets us.
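As a sketch of what that could look like, assuming a single numeric threshold per client: the base value of 100 and the 192.0.2.0/24 range below are placeholders (a documentation range, not the real WMCS ranges), and the function name is made up.

```python
import ipaddress

BASE_THRESHOLD = 100  # hypothetical limit; the real thresholds are not stated here
TRUSTED_MULTIPLIER = 1.5
# Placeholder documentation range standing in for the WMCS ranges.
TRUSTED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]


def threshold_for(client_ip: str) -> float:
    """Return the threshold for a client, relaxed for whitelisted ranges."""
    address = ipaddress.ip_address(client_ip)
    if any(address in network for network in TRUSTED_RANGES):
        return BASE_THRESHOLD * TRUSTED_MULTIPLIER
    return float(BASE_THRESHOLD)


print(threshold_for("192.0.2.10"))
print(threshold_for("198.51.100.7"))
```

Keeping the multiplier as one constant makes it easy to tune later if 1.5x turns out to be too tight or too loose.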
Dec 16 2019
hw raid firmware upgraded, resolving
Dec 13 2019
After getting a little more perspective in T240667 it seems that indeed res and body are sometimes sent as strings and sometimes as nested objects.
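One possible mitigation, sketched here purely as an assumption and not as what was decided in the task, is to normalize such fields before indexing so that they always carry the same type, e.g. by serializing nested objects to JSON strings. The helper name and sample events are made up.

```python
import json


def normalize_fields(event: dict, fields=("res", "body")) -> dict:
    """Return a copy of the event where the given fields are never nested."""
    normalized = dict(event)
    for field in fields:
        value = normalized.get(field)
        if isinstance(value, (dict, list)):
            # Nested values become JSON strings; plain strings pass through.
            normalized[field] = json.dumps(value, sort_keys=True)
    return normalized


as_string = normalize_fields({"res": "ok", "body": "hello"})
as_object = normalize_fields({"res": {"statusCode": 500}, "body": {"a": 1}})
print(as_string)
print(as_object)
```

The trade-off is that sub-fields of res and body stop being individually searchable; renaming the nested variant to a different field would be the alternative.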