thanks @JAllemandou ! I've converted the tables to use parquet and dropped the old plaintext tables
I'm assuming you'll be storing the result in swift beta; that should be ok in terms of space (i.e. around 400MB + thumbs)
This is resolved, we're running swift 2.10 and some machines in codfw/eqiad are running stretch too.
prometheus-node-exporter does create the prometheus group, IIRC it fails because of what you mentioned, i.e. prometheus user exists in labs/cloud
Wed, Jun 21
Current status: swift in esams hasn't been touched since it is slated for decom anyway. ms-be from 01 to 12 are decom'd from swift. Remaining machines are running either stretch or jessie, with swift 2.10
Reopening: since around 6/6 eventstreams has been creating a lot of metrics, consuming ~20% of graphite disk space in 8 days; it is now at around 400G
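As a rough sanity check on those numbers, a back-of-envelope sketch in Python. It assumes the ~400G figure is the space taken by the eventstreams metrics and that it corresponds to the ~20% mentioned above; growth is assumed to be linear and the variable names are made up for illustration.

```python
# Figures quoted in the comment above; everything else is derived from them.
used_gb = 400        # space taken by the eventstreams metrics (assumed reading of "400G")
days = 8             # period over which the metrics accumulated
share_percent = 20   # rough share of graphite disk space they occupy

# Assumption: linear growth over the period.
daily_growth_gb = used_gb / days

# Assumption: the 400G and the ~20% describe the same data.
implied_disk_gb = used_gb * 100 / share_percent

print(f"growth rate: ~{daily_growth_gb:.0f} GB/day")
print(f"implied graphite disk size: ~{implied_disk_gb:.0f} GB")
```

If either assumption is off (e.g. the 400G refers to something else) the derived numbers change accordingly.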
All done, 1019 BBU was swapped yesterday by @Cmjohnson
Tue, Jun 20
With help from @Cmjohnson we've restored the console on these two boxes by draining flea power
Mon, Jun 19
I caught a 500 on that file with thumbnails, looks like the imagescalers are having a hard time with some of those thumbnails:
@Dzahn almost, I'm running the last swift ring rebalance today. ETA is two/three days, I'll update/reassign this task once the machines are good to decom!
@Cmjohnson today sounds good, ping me here or on IRC
Wed, Jun 14
There are potentially other VMs across labs/cloud doing the same expensive resolving of localhost, so it seems to me it'd be more useful to understand the root cause instead
Tue, Jun 13
Mon, Jun 12
thanks @JAllemandou ! I gave it a quick try and it looks very interesting, how often is the data loaded from webrequest? IOW how much lag should we be expecting?
Fri, Jun 9
I've finished converting ms-be stretch systems to predictable network interfaces, no problems observed so far. For reference the commands:
We were getting duplicate alerts from ms-be1019 due to its hp raid check going unknown (I think). I've disabled the handler for hp raid on ms-be1019 though it'll need to be reenabled once this is fixed.
Thu, Jun 8
Just nits really, LGTM
FWIW if we also want to store mail logs off-host a simple solution would be to syslog exim logs too, syslog hosts already have 90d retention in place.
I've checked the diff and uploaded thumbor 6.3.2+git20170607-1 internally to jessie-wikimedia
I've uploaded schedule 0.3.2-1~bpo8+1 to Debian jessie-backports with its maintainer's approval.
Wed, Jun 7
I've upgraded all ms-fe2* to swift 2.10, the trusty -> stretch conversion of ms-be2* is ongoing. Regardless of the latter I think we could test some user traffic in swift codfw next week and see how that goes
09:43 <volans> I need to check later why we got 2 tasks though
09:44 <godog> my fault, the first is manual because I thought the disk was already failed on the controller but it wasn't
09:44 <godog> then I marked the disk failed manually on the controller too
Indeed, I think the problem there is that the prometheus user exists in labs but not the group. Was node-exporter working otherwise?
Tue, Jun 6
ms-be1020 had its bbu swapped, error cleared:
@Papaul this host is scheduled for decom and has otherwise no production data, don't bother replacing the disk
Mon, Jun 5
@Dereckson could you try the uploads one more time? I've disabled spooling of files to disk in nginx
So the problem is the tmpfs on /var/lib/nginx being 1G, IOW the maximum client body that nginx will spool there.
@Dereckson I've enabled debug on nginx for connections coming from terbium, can you try the uploads again? thanks!
Focusing only on one file for now, found a 500 from swift in FileOperation, now looking on the swift side
@Dereckson I saw your importImages run has finished on terbium (?) how'd it go this time?
Fri, Jun 2
proxy-server 2.10 seems to be basically working on ms-fe2005. I've asked upstream about an increase in proxy-server.errors metrics that seem related to ratelimit here: https://bugs.launchpad.net/swift/+bug/1695273
thanks! I think the version in aptly has been removed so we should be set for tools too, what's the best way I can run a command on all labs + tools ?
@Andrew indeed the uploaded version was lacking the upstart script, and falling back to jessie's init.d script won't work as you discovered, I've fixed it now in a new internal version by shipping the upstart script. Upgrading prometheus-node-exporter should fix it!