Both machines upgraded and back with 96GB, thanks @Cmjohnson !
Some more frequency distributions of size vs. number of requests, generated with bitly's data_hacks.
And a rough estimate of the long tail: ~60% of sizes were requested fewer than 1000 times in April, and only 4% of sizes were requested more than once per second (on average over April).
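For reference, a minimal sketch of how percentages like the ones above could be derived from the aggregated data (not the actual data_hacks invocation); it assumes a TSV of size<TAB>request_count rows on stdin, one row per distinct size:

```
#!/usr/bin/env python
# Rough long-tail estimation from "size<TAB>request_count" rows on stdin
# (a hypothetical dump of the Hive aggregation, one row per distinct size).
import sys

SECONDS_IN_APRIL = 30 * 24 * 3600  # ~2.6M requests == once per second on average

counts = [int(line.split('\t')[1]) for line in sys.stdin if line.strip()]
total = len(counts)
rare = sum(1 for c in counts if c < 1000)
hot = sum(1 for c in counts if c > SECONDS_IN_APRIL)

print("sizes requested <1000 times:  %.1f%%" % (100.0 * rare / total))
print("sizes requested >1/s on avg:  %.1f%%" % (100.0 * hot / total))
```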
I've started doing some analytics with Hive on the webrequest data for upload; I'm reporting the queries here for reference. Note that running a query over a month of data took ~1h; writing the query results into another table allows faster querying/processing later.
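As a sketch of that pattern (materialize the expensive month-wide aggregation into another table once, then run the cheaper follow-up queries against it); the table and column names here (wmf.webrequest, response_size, upload_sizes_april) are assumptions for illustration, not the exact queries I ran:

```
#!/usr/bin/env python
# Sketch: materialize the month-wide aggregation once, then run follow-up
# queries against the (much smaller) result table. Table and column names
# (wmf.webrequest, response_size, upload_sizes_april) are assumptions.
import subprocess

AGGREGATE = """
CREATE TABLE upload_sizes_april AS
SELECT response_size AS size, COUNT(*) AS requests
FROM wmf.webrequest
WHERE webrequest_source = 'upload'
  AND year = 2017 AND month = 4
GROUP BY response_size
"""

# the expensive part (~1h over a month of data) runs only once ...
subprocess.check_call(['hive', '-e', AGGREGATE])

# ... later analysis hits the small aggregated table instead
subprocess.check_call(
    ['hive', '-e', 'SELECT COUNT(*) FROM upload_sizes_april WHERE requests < 1000'])
```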
For reference, for switching mw to talk to deployment-ms-fe02 the configuration is here: https://horizon.wikimedia.org/project/prefixpuppet/?tab=prefix_puppet__puppet-deployment-cache-upload for the varnish bits, and here: https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep for swift.
@thcipriani yep, all done!
@Cmjohnson not ATM, initially I thought it was a HW RAID config issue but it doesn't look like it, thanks!
@Cmjohnson yeah, today at 10AM your time works for me; if not, Monday works too
Thu, Apr 27
Looks like the battery count is now reported as zero, and Cache Status: Permanently Disabled plus Cache Status Details: Cable Error are still active, though the HP RAID check reports OK.
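For the record, something along these lines could surface the cache state that the check glosses over; just a sketch, assuming hpssacli is present and that its "controller all show detail" output includes "Cache Status" lines (the field names are an assumption on my part):

```
#!/usr/bin/env python
# Sketch: surface the controller cache state that the overall RAID check
# glosses over. Assumes hpssacli is installed and that "controller all show
# detail" prints "Cache Status" / "Cache Status Details" lines.
import subprocess
import sys

out = subprocess.check_output(['hpssacli', 'controller', 'all', 'show', 'detail'])
cache_lines = [l.strip() for l in out.decode('utf-8', 'replace').splitlines()
               if l.strip().startswith('Cache Status')]

for line in cache_lines:
    print(line)

# non-zero exit if any cache status line is not plain OK
sys.exit(0 if all(l.endswith('OK') for l in cache_lines) else 1)
```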
@mmodell thanks! Also, having the systemd service file shipped with the package would be useful I think; dh-systemd makes that easy, and we could basically reduce the puppet module to installing the package and a few other things.
Wed, Apr 26
ms-be0 and ms-fe02 are up and running with swift 2.10, next steps:
@hashar for ms-be the used RAM seems to be on the order of ~12GB, so m1.large would be tight. I'll go with m1.xlarge for now; we can revisit if resources get tighter.
Indeed, we'll have to do this also because production will no longer have trusty "soon" (cf. T162609). I'll start by provisioning a jessie ms-fe since that's the easiest and will allow us to test swift 2.10 too.
Tue, Apr 25
WRT the minimum swift version: we're running 2.2 and 2.10 is on the cards (https://phabricator.wikimedia.org/T162609). Here are the relevant changelog entries between 2.2 and 2.10
Resolving as the swift upgrade is complete and varnish bandaids have been reverted.
@thcipriani package built and updated in reprepro
Mon, Apr 24
Both machines up at 96GB, thanks @Papaul !
@Cmjohnson the disk in slot 7 was marked as 'foreign config' and it looks like it contained a previous filesystem, maybe from another swift box? These disks should be wiped when used as spares. I've put the disk back in service and it is rebuilding
Indeed megacli doesn't seem happy
This is one of the new machines in this batch; I tried burning in the disks before putting them into production, but clearly it wasn't enough :(
Note that the disk is fine according to hpssacli
I've tried rebooting ms-be1036, though that didn't change anything. I think the issue is a combination of these factors:
@Cmjohnson LMK when you can do this, we can depool one machine at a time for maintenance
@Papaul LMK when you can do this, we can depool one machine at a time for maintenance
@Cmjohnson I'm ok to do this today, LMK when it is a good time for you
naos is online and in use; I think we should fix mira's NIC and deprovision it / allocate it to spares now (or decom it altogether)
Known issue with cache_upload, see T145661
Followup for trebuchet/mwdeploy fixed uid/gid: https://phabricator.wikimedia.org/T163667
Wed, Apr 19
A related issue discovered in T163278 is the handling of APT priority between components (and/or distros, if more than one) so that packages are picked up from the right place in all cases (most commonly a reimage vs. adding experimental to a machine with packages already installed).
I've downgraded hhvm-related packages back to their non-experimental version.
Upstream issue: https://github.com/prometheus/procfs/issues/40
Looking at the situation on naos, it looks like an accidental upgrade via hhvm-dbg
Tue, Apr 18
I've merged @RobH's patch and run puppet on naos; issues I've encountered so far:
tin rebooted; I've enabled HT and set the performance profile to "performance per watt (OS)". See also the Icinga task for alerting on this, and the parent task.
Confirmed sdh isn't well. @Cmjohnson do you have spares onsite?
This is completed, baremetal in service
This is completed; the decom of the equivalent old hardware is T162785: Decomission ms-be2001 - ms-be2012
@Krinkle on graphite2001, I've opened T163194: Backfill restored coal whisper files with current data to follow up on the actual backfill. Note I won't be able to work on it this week, though if you want to take a stab at it, all files should be readable.
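In case it helps whoever picks it up, the backfill itself should be doable with the whisper Python module along these lines; a rough sketch, the paths are illustrative and it assumes the restored and live files share the same retention/step:

```
#!/usr/bin/env python
# Rough sketch: copy datapoints from a live coal whisper file into the
# restored one, filling only the gaps. Paths are illustrative assumptions,
# and it assumes both files have the same retention/step.
import whisper

LIVE = '/var/lib/carbon/whisper/coal/responseStart.wsp'   # current data
RESTORED = '/srv/restored/coal/responseStart.wsp'         # file to backfill

(start, end, step), live_values = whisper.fetch(LIVE, 0)
_, restored_values = whisper.fetch(RESTORED, start, end)

points = []
for i, value in enumerate(live_values):
    # only fill points the restored file is missing
    if value is not None and restored_values[i] is None:
        points.append((start + i * step, value))

if points:
    whisper.update_many(RESTORED, points)
```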
FWIW the swift 2.2.0 upgrade is complete (from T162609)
I believe at least bast* and prometheus* are due to T150456: puppet compiler fails with modules using puppetdb
@ayounsi sounds good to me! I think for the longer time window we can start with 3x (or 2x) the current 5min and see if that helps. The usual cases I've seen are analytics hosts, db hosts (during reimage) and swift hosts tripping the alert. The latter have sometimes had really heavy swift usage by external clients that bypass varnish (i.e. with a cache-busting query string), but I don't think it'll be a problem in practice.
Hosts are gone now from servermon
Odd, I've run puppet node clean and puppet node deactivate again just in case
Thu, Apr 13
Completed! Emails to the performance-team ML should be happening now. Note that for consistency with the rest, the actual Icinga contact name is team-performance.
Wed, Apr 12
For context: fixing this would also alleviate a current problem where long-running backup jobs stall both other backup jobs and restore jobs (e.g. the dbstore1001 backup job takes several hours to complete ATM)