Thumbor is now logging to localhost:11514; from there, logs are shipped to the logging pipeline. Resolving!
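For reference, a minimal sketch of what emitting to a local syslog relay on that port looks like from Python; the handler setup below is an illustration, not Thumbor's actual logging configuration:

```
import logging
import logging.handlers

# Ship log lines to the local syslog relay on UDP port 11514 (the port above);
# everything else here is illustrative, not Thumbor's real config.
handler = logging.handlers.SysLogHandler(address=("localhost", 11514))
handler.setFormatter(logging.Formatter("thumbor: %(levelname)s %(message)s"))

logger = logging.getLogger("thumbor")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("resized image in %d ms", 42)
```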
Jan 21 2020
Jan 20 2020
No idea right off the bat, @greg, although I see the Gate and Submit "Resident Time" panel on the top left is working as expected, so I take it it's possible in some fashion.
Jan 17 2020
In T125408#5811525, @Addshore wrote:@fgiunchedi Any idea if there are any sort of regular / scheduled backups of the disks for graphite nodes?
If not I'll try to put something in place for the few metrics we would like to not lose for now :)
All done; the service is being implemented in T243000.
Jan 15 2020
In other panels I see data going back to Aug 2017 for core's gate-and-submit, e.g. https://grafana.wikimedia.org/d/000000108/releng-kpis?orgId=1&from=1496046004105&to=1579080951474&fullscreen&edit&panelId=2, so maybe it has to do with the panel's query?
Jan 14 2020
Something else I realized today: with ELK7 we dropped our custom Logstash template in favor of Logstash's default, although we'll need to bump the default maximum number of fields per index (1000 -> 4096) at least.
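For reference, a sketch of bumping that limit via the index settings API; the setting is Elasticsearch's index.mapping.total_fields.limit, while the host and index pattern below are placeholders:

```
import json
import urllib.request

# Raise the per-index field limit from Elasticsearch's default of 1000 to
# 4096. Host and index pattern are placeholders, not our actual cluster.
body = json.dumps({"index.mapping.total_fields.limit": 4096}).encode()
req = urllib.request.Request(
    "http://localhost:9200/logstash-*/_settings",
    data=body,
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())
```

In practice the limit would live in the (default or custom) index template so new indices pick it up, rather than being applied index by index.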
In T232820#5799861, @sbassett wrote:In T232820#5796753, @fgiunchedi wrote:AFAICT the code is basically ready to be merged, when does security review need to happen ?
@fgiunchedi - a few points:
- Given that the patch set is a config file change and some fairly minimal JavaScript for error-tracking, this probably wouldn't warrant a full security readiness review. Those are typically performed for new MediaWiki extensions, services, etc., or substantial rewrites of them. I know our documentation isn't very clear on the various review thresholds; hopefully we can improve that at some point.
- I'd imagine the Performance-Team might be interested in having a look at this if they haven't already.
- Prior to any patch review, we'd probably require that all of the relevant CI checks are passing.
- For the patch review itself, @Reedy or I would most likely focus on ensuring that the regexWebkit and regexGecko patterns are sufficiently hardened, and on whether any potential attack vectors against EventGate are being introduced.
Jan 13 2020
Reviving this as part of this Q's OKRs to move services off Logstash non-Kafka inputs; I'll follow up with patches to move to the localhost-udp compatibility endpoint! (See T242609)
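For illustration, shipping to a localhost UDP endpoint from a service boils down to something like the sketch below; the port and event fields are assumptions for the example, not the actual compatibility endpoint's contract:

```
import json
import socket

# Emit a Logstash-style JSON event to a local UDP relay. The port and the
# field names here are placeholder assumptions for illustration only.
UDP_PORT = 11514
event = {"program": "myservice", "level": "ERROR", "message": "something broke"}

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(json.dumps(event).encode(), ("localhost", UDP_PORT))
```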
In case it is helpful: we can reuse the centrallog hosts in codfw/eqiad. For site-local netconsole we'll instead need to set up local syslog collectors anyway (on Ganeti VMs) for network devices' syslog, so we could piggyback on those.
In T232820#5783328, @chasemp wrote:In T232820#5744696, @fgiunchedi wrote:In T232820#5732793, @sbassett wrote:@fgiunchedi - any update or progress on this? If not, we might just want to decline this task for now until there's actually some code ready for review. Thanks.
Yes, progress in the form of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/553376, which is the client-side JavaScript part. The code is being reviewed and worked on, although I don't think it is yet at the security-review stage. Having said that, feel free to close this task and we'll reopen/follow up when needed.
Looks like the controller freaked out (T141756), firmware upgraded and rebooted.
@Jclark-ctr @Cmjohnson host is under warranty for another month according to Netbox, please order a replacement for the failed 4TB disk (LED is blinking), thanks!
Jan 9 2020
Thanks @Papaul ! Upon reboot the host booted into PXE, I am assuming because the first disk was present but unbootable and the host didn't fall back to booting from the second disk. Anyway, all good after a reimage, resolving.
Jan 7 2020
Thank you @fgiunchedi for sharing this story. Congratulations on working through some difficult bugs to roll out this important upgrade.
In T240520#5781082, @ArielGlenn wrote:In T240520#5778255, @EBernhardson wrote:How long does it take to list one of these swift containers, say the one for en wiki thumbs, which is probably among the largest?
This seems to get URLs from Swift at about 20k/sec; for the 1.3B Commons thumbs that works out to about 18 hours. I didn't check enwiki, assuming Commons would be an order of magnitude more than the others, but could look into it. If we want things to take less time, the work could be parallelized over the list of containers to dump (255), probably 4 at a time or some such.
The script as written will also produce a listing for commonswiki, do we want that? How long would those containers take to list?
Commonswiki was the primary goal; as above, it is around 18 hours. Compressed, the output is around 7 GB.
How does swift hold up under this? Can we run one process doing commons and another/others doing the rest, or is that going to be a noticeable load on the servers? I'm going to add @fgiunchedi for comments on this.
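To make the parallelization idea concrete, a rough sketch with python-swiftclient and a small worker pool, four container listings in flight at a time as suggested above (auth settings and the container list are placeholders):

```
from concurrent.futures import ThreadPoolExecutor

from swiftclient.client import Connection

# Placeholder auth settings; the real ones come from the environment/config.
conn_kwargs = dict(authurl="http://swift.example:8080/auth/v1.0",
                   user="account:user", key="secret")

def dump_container(name):
    """Write the full object listing of one container to a local file."""
    conn = Connection(**conn_kwargs)  # one connection per worker
    # full_listing=True makes swiftclient page through with markers for us.
    _, objects = conn.get_container(name, full_listing=True)
    with open(name + ".listing", "w") as out:
        for obj in objects:
            out.write(obj["name"] + "\n")

# One container name per line, e.g. produced by the dump script.
containers = [line.strip() for line in open("containers.txt")]
with ThreadPoolExecutor(max_workers=4) as pool:  # 4 at a time, as proposed
    list(pool.map(dump_container, containers))
```

Note that full_listing=True buffers the whole listing in memory; for the largest containers a manual marker loop that writes as it goes would be gentler.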
Jan 4 2020
Jan 2 2020
Host is back in service!
I checked all "memory free" metrics as reported by node-exporter for the varnish case and indeed the numbers match, i.e. the kernel was reporting multiple GBs of memory free at the time of the crashes.
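For reference, this is the kind of query involved, here via the Prometheus HTTP API; node_memory_MemFree_bytes is node-exporter's standard metric name on recent versions, while the Prometheus URL and instance label below are placeholders:

```
import json
import urllib.parse
import urllib.request

# Ask Prometheus how much memory the kernel reported free on one host.
# The Prometheus URL and the instance label are placeholders.
query = 'node_memory_MemFree_bytes{instance="cp1234:9100"}'
url = ("http://prometheus.example:9090/api/v1/query?"
       + urllib.parse.urlencode({"query": query}))

with urllib.request.urlopen(url) as resp:
    result = json.load(resp)
for series in result["data"]["result"]:
    _, value = series["value"]
    print("%s: %.1f GiB free" % (series["metric"]["instance"],
                                 float(value) / 2**30))
```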
In T226373#5762068, @jcrespo wrote:What is the right follow-up after a month? "I don't know" is an ok answer, I just want to clarify the status of the ticket, e.g. do the ATS people need to be involved? Or is it just "being handled, just needs time"?
@Papaul host is under warranty and it looks like an SSD failed, could we get that replaced (LED is blinking)? Thanks!
Configured both ports to use PXE when booting; now the host is running the reimage correctly.
In T239805#5770424, @fgiunchedi wrote:@Papaul thanks! I tried to reimage the host but the NIC shows its link as offline from the BIOS:
Link Status <Disconnected> *
@Papaul thanks! I tried to reimage the host but the NIC shows its link as offline from the BIOS:
Link Status <Disconnected> *
Analytics now publishes media access stats, might be useful to drive some/all thumbnail cleanup: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Mediarequests
Dec 20 2019
Dec 19 2019
Dec 18 2019
My two cents re: prometheus_client_php's current adapters, APCu and Redis, since AIUI neither is optimal/desirable. There might be another way, inspired by what the Ruby client does (see also this PromCon 2019 talk and video). Very broadly, the idea IIRC is for each process to mmap a separate file with its own metrics; at collection time, metrics are read from the files and merged.
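The official Python client implements essentially the same scheme in its multiprocess mode, which gives a concrete feel for the approach (a minimal sketch; the directory and metric name are placeholders, and the env var must be set before metrics are created):

```
import os

# Per-process metric files land in this directory; it must exist and be
# writable. Placeholder path, set before importing/creating metrics.
os.environ.setdefault("prometheus_multiproc_dir", "/tmp/prom-multiproc")
os.makedirs("/tmp/prom-multiproc", exist_ok=True)

from prometheus_client import CollectorRegistry, Counter, generate_latest
from prometheus_client import multiprocess

# Each worker process writes to its own mmap'd file; no cross-process
# locking on the hot path.
REQUESTS = Counter("app_requests_total", "Requests handled")
REQUESTS.inc()

# At scrape time, a collector reads every per-process file and merges them.
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
print(generate_latest(registry).decode())
```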
Dec 17 2019
FWIW I think if the current thresholds are good at detecting DDoS we should explicitly whitelist WMCS ranges with say 1.5x the current thresholds and see how far that gets us.
Dec 16 2019
In T232820#5732793, @sbassett wrote:@fgiunchedi - any update or progress on this? If not, we might just want to decline this task for now until there's actually some code ready for review. Thanks.
HW RAID firmware upgraded, resolving.
Dec 13 2019
After getting a little more perspective in T240667, it seems that indeed res and body are sometimes sent as strings and sometimes as nested objects.
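In other words, consumers end up normalizing both cases; a tiny illustrative helper (the field names come from the events above, the helper itself is hypothetical):

```
import json

def as_object(field):
    """Return `field` as a dict whether it arrived serialized or nested."""
    if isinstance(field, str):
        try:
            return json.loads(field)
        except ValueError:
            return {"raw": field}  # keep unparseable strings around
    return field

# res arrives as a JSON string, body as a nested object:
event = {"res": '{"status": 500}', "body": {"error": "boom"}}
res = as_object(event["res"])    # -> {'status': 500}
body = as_object(event["body"])  # -> {'error': 'boom'}
```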
Resolving as we're using metrics from swift to plot original uploads: https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&from=now-3h&to=now-1m&var-DC=eqiad&var-prometheus=eqiad%20prometheus%2Fops&refresh=5m&fullscreen&panelId=26
In T240048#5737123, @jcrespo wrote:CC @fgiunchedi , although maybe it was someone else from Foundations who worked on this?
Some more details: AFAICT it is errors from mobileapps and graphoid that fail to be emitted and parsed correctly.
Dec 12 2019
rsyslog does indeed use librdkafka, so it might be that! Re: losing logs, AFAICT that's not happening at the moment, in the sense that there are no discards reported by rsyslog-exporter (see also the dashboard at https://grafana.wikimedia.org/d/000000596/rsyslog).
In T236832#5734296, @Krinkle wrote:I thought maybe the puppet change didn't propagate to the wtp servers, but that's not the case I realise now.
@fgiunchedi That's the same class of issue, but other than being a violation of the same constraint in PHP, I see no indication that it is caused by php7-fatal-error.php.
It has a different stack trace than the one reported in this task. Or rather, these have no stack trace.
Task description: PHP Fatal Error: ob_start(): Cannot use output buffering in output buffering display handlers from line 57 of /etc/php/php7-fatal-error.php. Currently Logstash shows entries for "Cannot use output buffering" from wtp* servers: PHP Fatal error: Unknown: Cannot use output buffering in output buffering display handlers in Unknown on line 0, with exception.file "Unknown:0", exception.message "Unknown: Cannot use output buffering in output buffering display handlers", exception.trace "". Note that all fatal errors from MediaWiki and Parsoid-php in production are caught by and reported through php7-fatal-error.php. The issue in this task, however, was an output buffering bug in the php7-fatal-error.php script itself. Very confusing, I know :)
Dec 11 2019
Complete! Note that check_prometheus will fail on queries containing single quotes.
grafana.wikimedia.org defaults to UTC, logged-in users can change their settings of course!
Tentatively resolving; we'll be paging if >= 1% of global traffic is 5xx.
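Roughly the kind of ratio check involved, expressed against the Prometheus HTTP API; the metric name and Prometheus URL below are placeholder assumptions, not the production alert definition:

```
import json
import urllib.parse
import urllib.request

# Fraction of global traffic answered with a 5xx over the last 5 minutes.
# Metric name and Prometheus URL are placeholders for illustration.
query = ('sum(rate(http_requests_total{status=~"5.."}[5m]))'
         ' / sum(rate(http_requests_total[5m]))')
url = ("http://prometheus.example:9090/api/v1/query?"
       + urllib.parse.urlencode({"query": query}))

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)
ratio = float(data["data"]["result"][0]["value"][1])
if ratio >= 0.01:  # page when 5xx reaches 1% of traffic
    print("PAGE: global 5xx ratio at %.2f%%" % (ratio * 100))
```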
Dec 10 2019
Given we're moving off IPsec for most/all use cases, I'm boldly declining.
We have dashboard_links and notes_link now for check_prometheus, and monitoring::service enforces the presence of notes_url. @Volans is there anything else to do here, do you think?
We're getting rid of graphite and graphite-based alerts over time, declining but please reopen if needed!
I believe nowadays the alert is based on metrics from logstash and appears on IRC in a timely fashion. Resolving, but please do reopen if it occurs again.