Finishing the setup in T148408: Put prometheus baremetal servers in service
Wed, Mar 22
eventstreams is at 110G and cleaned up periodically, good enough for now
@Cmjohnson is there an ETA to have the servers with OS installed online? thanks!
Tue, Mar 21
@hashar indeed the worker count of 100 is static in puppet. I'm not going to have time for it, but please feel free to play with the setting and see if that makes a difference! A rough sketch of making it tunable is below.
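For illustration only, a minimal hypothetical sketch of exposing such a hardcoded value as a class parameter; the class name, file path, and parameter name below are made up, since the actual manifest isn't quoted in this thread:

```
# Hypothetical example only: expose a hardcoded worker count as a class
# parameter so it can be overridden per host, e.g. via Hiera.
class someservice (
    $workers = 100,  # previously hardcoded at 100
) {
    file { '/etc/someservice/workers.conf':
        ensure  => present,
        content => "workers: ${workers}\n",
        notify  => Service['someservice'],
    }

    service { 'someservice':
        ensure => running,
    }
}
```

With that in place, an override would be a one-line Hiera setting such as `someservice::workers: 150`.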
ok @Gilles! I'll likely review/upload next week
Mon, Mar 20
Result of the discussion at today's ops meeting: approved
Thu, Mar 16
reporting from IRC
racking will be 3x systems per row, 10G where possible or 1G where we can't do 10G (see also T148647)
Upstream patch has been merged, though there's another outstanding issue https://github.com/google/mtail/issues/56 which makes mtail crash on parse errors. We've also seen mtail stop sending metrics after a while.
Wed, Mar 15
Indeed, the previous version from 2007 doesn't seem to be there; not sure there's much we can do
No issues, resolving
This was done a while ago
Thanks @Gilles for the explanation!
I was thinking about memcache connection pooling, though each thumbor instance will have its own connection(s). If it becomes a problem we can put something like nutcracker on localhost, along the lines of the sketch below.
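Roughly, the idea would be a local twemproxy/nutcracker pool that the thumbor instances talk to instead of opening their own connections to each memcached backend. The pool settings and server list below are illustrative assumptions (twemproxy's standard options), not a reviewed config, and the puppet module's exact interface should be double-checked:

```
# Assumed sketch: a single memcached pool proxied by nutcracker on
# localhost, so local clients share connections to the backends.
class { '::nutcracker':
    pools => {
        'memcached' => {
            listen             => '127.0.0.1:11212',
            hash               => 'md5',
            distribution       => 'ketama',
            auto_eject_hosts   => true,
            server_connections => 2,
            # hypothetical backend list
            servers            => [
                '10.0.0.1:11211:1',
                '10.0.0.2:11211:1',
            ],
        },
    },
}
```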
@Cmjohnson if we have 2TB spares onsite, please replace. We can reclaim the disks once decommissioning happens in a few weeks
Tue, Mar 14
FWIW this host is slated for decommissioning in a few weeks; I wouldn't spend too much time on its iDRAC, especially if there are other hosts with broken iDRACs that aren't slated for decommissioning
Mon, Mar 13
@Papaul ok! Please replace even though we're decommissioning the machines in 4-5 weeks, on the basis that the disks will be wiped and possibly reused as spares?
My mistake, host is provisioning
HTTPS for ms-fe.svc is now active in eqiad and codfw
sde is unhappy
Fri, Mar 10
Thu, Mar 9
I took a quick look at the swift logs on lithium: all DELETEs seem to be successful (i.e. HTTP 200s), with the exception of some for which swift replied 404, all around 02:00 UTC in March. Does that help? Also, re: deletion of all files but one, does that mean one was left behind for no reason, or that deletions are happening for no reason?
This is completed; I've left out the part about sending logs to both datacenters as out of scope for this. The real solution, as suggested by ottomata and tracked in T126989: MediaWiki logging & encryption, is to use Kafka for log shipping, which would also buy us encryption.
Row C works (oresrdb2001 is row B)
Wed, Mar 8
Note mwlog001 needs to be whitelisted to be reachable from the analytics VLAN (rsync-http-https term)
Tue, Mar 7
We're using codfw to test swift as docker registry backend.
Mon, Mar 6
Update from the monitoring meeting: this can be implemented via PuppetDB queries. Additionally, there are now classes to generate Prometheus config files, e.g. prometheus::class_config, which lists all hosts that have a specific class (and parameters) applied.
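For the record, usage looks roughly like the following; parameter names are from memory and should be checked against the actual define in operations/puppet ($targets_path is assumed to be set elsewhere in the manifest):

```
# Sketch: render a Prometheus targets file from a PuppetDB query for
# all hosts in this site that have prometheus::node_exporter applied.
# Parameter names are as I recall them; double-check against the
# actual prometheus::class_config define.
prometheus::class_config { "node_${::site}":
    dest       => "${targets_path}/node_${::site}.yaml",  # $targets_path assumed
    site       => $::site,
    class_name => 'prometheus::node_exporter',
    port       => 9100,
}
```

Prometheus then picks up the rendered YAML via its file_sd_configs mechanism, so target changes don't require a restart.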
Sat, Mar 4
Fri, Mar 3
Thu, Mar 2
To get started with oresrdb2001 in codfw please see/review https://gerrit.wikimedia.org/r/#/c/34048
Package is out
Resolving since tools has been upgraded as well.
@scfc that's awesome, thanks a lot for your help!
@hashar maybe the aggregation was different on graphite2001, that'd be my guess. Anyway, we're going to reprovision graphite2001 this week, so I'm tentatively resolving
I'm assuming this happened on ms-fe2001 today? It was during decommission and a legitimate error; I'll tentatively resolve