Hi, could you add some detail about what this is for? What is SWAP, please, and what do you need access to stat1002 for? Are you sure it's stat1002 and not stat1003? The requested group "researchers" is described as "Access to stat1003 and the credentials for the MariaDB slaves in /etc/mysql/conf.d/research-client.cnf." on https://wikitech.wikimedia.org/wiki/Analytics/Data_access, and the word "SWAP" doesn't seem to appear there at all. Thanks!
It would be nice to keep this open until ocg1001 is actually back in service.
Fri, Apr 28
@RobH please remove switch config (asw-b-codfw:ge-5/0/12)
2007 and 2008 should go into A5 then (still has to happen)
removed from rack
ganeti2006 has been moved to A4 @ 17. Switch: asw-a4-codfw, port ge-4/0/17. Please configure the switch.
ganeti2005 has been moved to A4 @ 16. Switch: asw-a4-codfw, port ge-4/0/16. Please configure the switch.
We have lots of room in A2 and A4. We can move into A4, but we can't move into A2 because it has a 10G switch and the server only has 1G NICs (we would need an adapter).
mainboard is being replaced right now
Thu, Apr 27
This sounds similar to the other tickets linked to tracking task T162850. We have observed downthrottling to 200MHz on other servers before. The interesting part is that those were all R320s, while lead is an R420 (or so Racktables says; is it really?). It would be unfortunate if that means more models are potentially affected by this bug.
Thanks for deploying that :)
@Cmjohnson it's still OK to take it down; it's not getting traffic. I'll wait for the IPMI issue to be resolved before putting it back to work.
looks like it's ok.
It has been reinstalled and re-added to Puppet and Salt, but I saw errors about a failed deployment of the OCG app, so I didn't repool it yet. Then, after being AFK for a while, I came back to look at it, and now OCG is running and Puppet runs without errors. Who fixed it? :)
added decom checklist template from https://wikitech.wikimedia.org/wiki/Server_Lifecycle/reclaim_checklist
Thanks, I will do a reinstall later today.
cool, thank you. taking it back
You should not need the passphrase since the key is already loaded:
Wed, Apr 26
I think what needs to happen here is adding a new "identity" for ORES deployments.
I'm uploading the change, but I'm not going to be able to deploy it since I'm not in the European time zone. I'm hoping that somebody in Europe will take it, or that it will be deployed in a SWAT window.
added "dty" since it exists now (the other 2 are still incubator redirects as of today)
Tue, Apr 25
This was actually invalid; I was simply confused by T163324 and was still looking on einsteinium instead of tegmen.
We talked on IRC. It works again.
@Capt_Swing Your key has been replaced on stat1003 and bast1001 (Puppet will update the other bastions soon). It should work again now.
@RobH @Jgreen see the DNS change above. In this case it removes both the main IP and mgmt at once; I realize that normally we do it separately in non-fr prod, but it keeps the option to reach the host per asset tag as wmf4076.mgmt.eqiad.wmnet.
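To illustrate the effect of keeping the asset-tag record (a hypothetical zone-file sketch; the host name and address here are made up, not the actual records):

```
; hypothetical mgmt zone entries, for illustration only
frdb1001.mgmt   IN A   10.195.0.10   ; host record, removed together with the main IP
wmf4076.mgmt    IN A   10.195.0.10   ; asset-tag record, kept so the box stays reachable
```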
@Cmjohnson I fixed the typo above in the mgmt DNS entry, but I still can't get on the mgmt console after that.
Any other disk to try? Can we replace sda one more time?
@Cmjohnson Somehow the new /dev/sda also seems to be broken. Maybe it was used in something else before? Or was it this disk that was broken the whole time and we replaced the wrong one? I don't know, but from /var/log/syslog in the installer shell this looks pretty much like broken hardware (again? still?).
I changed the boot order in the BIOS (port A was still first; switched to port B), but that did not change the error. It still fails with "during read on /dev/sda" at the partitioning step.
@Cmjohnson I attempted a reinstall but it consistently fails at the partitioning step with:
Mon, Apr 24
Should this exist in both DCs? One in eqiad and one in codfw by default nowadays?
I'll take a look at it today. Pretty sure we can just reinstall this.
Fri, Apr 21
@Gilles @aaron @Peter @Krinkle See the change above; it should give you the requested permissions. Feel free to try it. When logging in to Icinga, please make sure you use the lower-case version of your user name. There is a caveat: the LDAP login accepts both upper and lower case, but for Icinga to grant you the permission to run commands, the login must match the Icinga contact. Go to a service you are a contact for and try "send acknowledgement", "schedule downtime", or other commands from the drop-down menu. Let me know if it works. If there are any other services you want to be able to control, we now just have to add the contact group to them.
Thu, Apr 20
@Gilles alright, thanks. So the first step is that we need to create Icinga users ("contacts") for you. Krinkle has one, but the other 3 of you don't, and the contacts have to match the LDAP users. The login you use to see Icinga is just LDAP auth in front of it; from Icinga's point of view you are not users yet, because you are not "contacts" for anything. Icinga automatically grants the permission to run these commands for hosts and services that a user is a contact for.

The file with the Icinga contacts is in a private repo (because some contacts have phone numbers), so I'm doing that part there now. I'll use your standard wikimedia.org email addresses, but nothing will happen yet. Then I'll create a contactgroup for performance that has your users as members.

Finally, we'll add that contactgroup to the relevant service(s) via puppet, which does 2 directly connected things at once: you will get email notifications about the service, and you will get the right to execute commands for it (ACK it, schedule downtime, enable/disable notifications), which is what you are requesting. (Unless you really don't want the email notifications.)
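For what the Icinga side of this could look like, here is a minimal sketch of a contact and contactgroup definition. The names, email addresses, and group name are placeholders, and the real contact definitions live in the private repo as noted above:

```
# hypothetical sketch only; real contacts live in the private puppet repo
define contact {
    contact_name                    gilles            ; must match the LDAP/wikitech login
    email                           gilles@wikimedia.org
    host_notification_period        24x7
    service_notification_period     24x7
    host_notification_options       d,u,r,f
    service_notification_options    w,u,c,r,f
    host_notification_commands      notify-host-by-email
    service_notification_commands   notify-service-by-email
}

define contactgroup {
    contactgroup_name   perf-team                     ; placeholder group name
    alias               Performance Team
    members             gilles,krinkle
}
```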
12:09 -!- icinga-wm [~email@example.com] has joined #wikimedia-fundraising
12:11 < icinga-wm> test for T163368
@Gilles could you please define who exactly "we" is in this request? Ideally a list of LDAP/wikitech user names (what you see after "Logged in as" when on Icinga). Thanks!
The way the custom IRC notifications work:
So as of today, if "sms" does not show up in contact_groups for a host or service, individual Ops don't get email or SMS notifications. If that's correct, we're much closer than I initially thought.
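If that reading is right, the deciding bit is just whether "sms" appears in the service's contact_groups. A hypothetical service definition to illustrate (host, check, and group names are made up for the example):

```
# illustration only; not an actual service definition from our config
define service {
    host_name               einsteinium
    service_description     Example check
    check_command           check_nrpe!check_example
    contact_groups          admins,sms    ; without "sms" here, no individual Ops paging
    max_check_attempts      3
    check_period            24x7
    notification_period     24x7
}
```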
my suggestion would be:
relevant puppet code:
Wed, Apr 19
Closing this again to handle mw2256 in a subtask (please continue on T163346).
@Papaul just the kernel panic (PANIC: double fault, error_code: 0x0) and this during boot:
mw2256 died again
Puppet was not running because of an Icinga configuration error
puppet runs alright now, no errors
Since these instances have already been down for some time, and no ETA for repair/replacement yet exists,
- backups: confirmed with bconsole that naos now exists in Bacula with the same backup sets (/home and /srv/deployment are backed up on deployment servers)
- UID issue: see https://gerrit.wikimedia.org/r/#/c/348884/
Tue, Apr 18
The Icinga/graphite check "Response time for WDQS" is in status "UNKNOWN" because there are "No valid datapoints found".
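Presumably the check fetches the graphite series first and only then classifies it, so an empty series can't be called OK or CRITICAL. A rough Python sketch of that logic (the function and threshold names are made up for illustration, not the actual check_graphite code):

```python
# Sketch of how a graphite-backed check ends up UNKNOWN: graphite's
# render API returns [value, timestamp] pairs, and the value is None
# for intervals where no data was recorded.

def classify(datapoints, warn=1.0, crit=2.0):
    """Return an Icinga-style status string for a list of
    [value, timestamp] pairs from graphite."""
    values = [v for v, _ in datapoints if v is not None]
    if not values:
        # the "No valid datapoints found" case
        return "UNKNOWN"
    worst = max(values)
    if worst >= crit:
        return "CRITICAL"
    if worst >= warn:
        return "WARNING"
    return "OK"

print(classify([]))                        # → UNKNOWN (no data at all)
print(classify([[None, 1], [None, 2]]))    # → UNKNOWN (only empty intervals)
print(classify([[0.4, 1], [1.2, 2]]))      # → WARNING
```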
16:05 < icinga-wm> PROBLEM - cassandra-b service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
16:05 < icinga-wm> PROBLEM - cassandra-c service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
16:05 < icinga-wm> PROBLEM - cassandra-a SSL 10.64.48.98:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
16:05 < icinga-wm> PROBLEM - cassandra-a service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
On any other day I would probably just do that, since they are debug hosts, though right now might be a bad moment. The "is running on the canaries" check should cover mwdebug* though, shouldn't it?