Sun, Dec 9
Sat, Dec 8
Fri, Dec 7
FYI I've added this small section to the docs for running the script:
Thu, Dec 6
@Papaul @fgiunchedi Today the RAID alarm was continuously flapping and created a ton of tasks (see above) that I asked mo.brovac to close as he had access to the batch edit interface in Phabricator.
I've disabled the event handler for the 2 RAID checks in Icinga for this host. Please remember to re-enable them once fixed.
Wed, Dec 5
This is totally expected: that alias query uses the global grammar, mixing the results of three different queries to OpenStack according to the provided boolean operators.
The current OpenStack grammar allows querying with the given parameters either across all projects or on a specific one. "All projects except some" is not a feature of the current OpenStack grammar in Cumin.
So I'm not sure what the request here is.
To avoid the error, a quick git grep of the Puppet repo when deleting a project should be enough, for example as shown below. In fact, contintcloud is still mentioned in two other places in the Puppet repo besides this one.
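Something along these lines, run from a checkout of the Puppet repo (a hypothetical invocation; any leftover reference would show up in the output):
git grep -n 'contintcloud'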
Mon, Dec 3
Note to self: do not reply to a complex topic on a Friday night
True, if the usernames and user activities are public there is a lot of information available to a malicious HIBP-like site.
Fri, Nov 30
Thu, Nov 29
I wasn't aware of this task, but I contacted the Security team a few months ago with more or less the same idea. Hence here are my a-bit-more-than-2 cents:
Wed, Nov 28
@Dzahn thanks for all the fixes!
IIRC it was decided to use the UID, cc @faidon
Tue, Nov 27
Mon, Nov 26
Fri, Nov 23
I brought this up a few weeks ago in the WMCS-admin IRC channel, explaining also that Cumin's puppetization uses the Hiera variable profile::openstack::main::region in modules/profile/manifests/openstack/main/cumin/master.pp, and that they should feel free to change/override it at will based on the migration.
The other short-term option is to generate two different config files, like config-eqiad.yaml and config-eqiad1-r.yaml, and maybe add a bash alias for ease of use, for example as sketched below.
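A minimal sketch of what such aliases could look like, assuming Cumin's --config flag and the usual /etc/cumin/ location for the config files (the paths are an assumption on my side):
alias cumin-eqiad='sudo cumin --config /etc/cumin/config-eqiad.yaml'
alias cumin-eqiad1='sudo cumin --config /etc/cumin/config-eqiad1-r.yaml'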
Added read-only access to cn=wmf and confirmed it works as expected, allowing people to log in but in read-only mode. Edit/delete/add buttons are not shown and accessing edit pages redirects to the login page. Same for the Django admin panel.
Wed, Nov 21
@crusnov for the puppetization I think we could go with a simple git clone and set the Netbox config accordingly, see the sketch below. As an example, you can look at how the cookbooks in profile::spicerack are deployed.
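Roughly something like this (a sketch only; the repo URL, clone path and REPORTS_ROOT value are placeholders to illustrate the idea, the real ones would come from the puppetization):
git clone <reports-repo-url> /srv/deployment/netbox-reports
# and in Netbox's configuration.py point the reports at it:
# REPORTS_ROOT = '/srv/deployment/netbox-reports'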
ms-be2047 has been reported down by Icinga for a few minutes, unable to SSH, black screen at the console so far.
I went ahead and created the repo for the reports at:
Tue, Nov 20
Mon, Nov 19
My proposal is to start with 1+2, 6 and 8.
Sat, Nov 17
Things that I've found so far; some may be unrelated but still need a fix anyway.
Thu, Nov 15
As we'll be tackling this shortly, we should start deciding which report we want to write and what kind of puppetization and deployment method we want to choose.
The last bit might also vary based on how we want to run those reports (manually via the UI on demand, manually or automatically via the HTTP API and/or the CLI); see https://netbox.readthedocs.io/en/stable/additional-features/reports/#running-reports for more details and the quick example below.
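For reference, the non-UI ways to trigger a report should look more or less like this, if I recall the linked docs correctly (module, report name and hostname here are placeholders; the management command runs from the Netbox installation directory and the API call needs a valid token):
python3 manage.py runreport example_module
curl -s -X POST -H "Authorization: Token $NETBOX_TOKEN" https://netbox.example.org/api/extras/reports/example_module.ExampleReport/run/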
I'll try to summarize here a few options.
Wed, Nov 14
Adding Analytics, Luca and Otto in case it was missed. Also, Puppet has issues because of the read-only filesystem.
Mon, Nov 12
Regarding the few that I know about:
- profile::openstack::main::cumin::auth_group: cumin_masters doesn't actually seem to be defined elsewhere; it should probably be moved like the ones below
- profile::openstack::main::cumin::project_pub_key: undef and profile::openstack::main::cumin::project_masters: seem to be defined in hieradata/eqiad/profile/openstack/main/cumin.yaml, so they could probably be removed easily
- profile::netbox::netbox_server: netmon1002.wikimedia.org doesn't seem to be referenced anywhere
I guess he's referring to the search bar at the top-right, pending code review since July ;)
Thanks @Bstorm for formalizing our random IRC chat into this proposal 😉
Nov 9 2018
@RobH yep, known issue. The immediate fix was already scheduled in https://gerrit.wikimedia.org/r/c/operations/puppet/+/463820, but then we decided to go directly in the direction of using Swift as a backend for attachments, to avoid having to set up an rsync between the two Netbox hosts. For that reason I preferred to leave it "broken" on purpose, to avoid having to migrate existing attachments to Swift later. I just didn't have time yet to set it up; I hope to be able to set everything up in the next week or two, but please also let me know how much of a blocker this is so I can prioritize accordingly.
Nov 7 2018
[Sorry hit submit too early...]
So either shut down, run the decom script and then reimage with --new, or follow the steps that Luca outlined there the last time he did it.
Shutting down doesn't remove it from PuppetDB, revoke its Puppet certificate or remove it from Debmonitor, though.
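For the manual route, the PuppetDB and certificate parts can be done on the Puppet master with something along these lines (just a sketch with a hypothetical FQDN; the Debmonitor cleanup is a separate step):
sudo puppet node deactivate host1001.eqiad.wmnet
sudo puppet cert clean host1001.eqiad.wmnet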
Nov 6 2018
@GTirloni yeah, sorry for the trouble. I know about them, I just haven't had the time yet to fix them, as the local PuppetDB is broken (I think it happened during the migration to the new region).
I will not spend time fixing the immediate PuppetDB failure as it's an old one anyway, and we're in the process of upgrading the local PuppetDB used by the local Puppet master to the same version as production in the next few days.
If this is a blocker for anything please let me know so that I can point those instances temporarily to the Cloud puppetmasters and then back to the local one once we have the new PuppetDB up and running.
Nov 5 2018
In general I'm all in for the nose -> pytest migration and pytest is what we're using in a lot of other projects.
Regarding the Puppet repo specifically, though, there are multiple angles to look at, which makes me wonder whether those more complex scripts that require testing should be in the Puppet repo in the first place. Also, I think that if we start touching it we shouldn't just blindly replace nose with pytest, but instead re-think the whole Python testing setup within the Puppet repo.
Nov 2 2018
Oct 31 2018
Just for quick reference, the ALTER to add the column (confirmed also by the history on neodymium) should be:
set session sql_log_bin=0; ALTER TABLE ipblocks ADD ipb_sitewide bool NOT NULL default 1;
I've quickly audited the ipblocks.frm on all core DBs in all shards (s1-s8) for all schemas, and the only one missing (apart from schemas that don't have it on the masters either, because they're not in all.dblist) is ruwikiquote on db2050.
To do it quickly (as I'm no longer familiar with the current tooling around DB stuff) I took the poor man's approach, running things like:
sudo cumin 'C:mariadb::heartbeat%shard = s3' "grep -c 'ipb_sitewide' /srv/sqldata*/*/ipblocks.frm"
I've then checked the dbstores with a similar approach and, again, only dbstore2002 for s3 has that field missing.
Oct 30 2018
@Andrew did it reoccur during the last week? Do you have a list of hostnames + times by any chance?
Oct 29 2018
This hasn't reproduced in months and we're moving to stretch on the Icinga hosts. Resolving for now; feel free to re-open if this happens again.
Oct 27 2018
Oct 26 2018
Actually --new might not work either, as the host is in PuppetDB; sorry for the wrong suggestion.
Anyway, this is kind of unrelated to the reimage script, as the issue is that we don't monitor the other Icinga hosts from the active one. It's really a corner case and I'm not sure it should be fixed by hardcoding this weirdness into the reimage script.
See the --new option
Oct 24 2018
I guess the description should be updated, as we have more installations in prod now, and we actually already have a check for replication, see modules/postgresql/manifests/slave/monitoring.pp
Oct 23 2018
Oct 22 2018
Oct 19 2018
Oct 18 2018
Opened T207417 for the ferm part.
db2042 failed to start ferm at reboot due to a DNS query timeout:
Oct 18 15:53:04 db2042 ferm: DNS query for 'prometheus2003.codfw.wmnet' failed: query timed out
[...SNIP...]
Oct 18 15:53:04 db2042 systemd: Failed to start ferm firewall configuration.
Apparently the 2 Icinga checks that report it were not noticed, probably because the host was downtimed for the scheduled maintenance.
I've manually started ferm and it all worked fine, but the host had been without ferm since the reboot.
I'm opening a separate task to fix the Puppet/systemd side of it.
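The kind of fix I have in mind (just a sketch on my side, not necessarily the final solution) is a systemd drop-in that orders ferm after the network is fully online, so that name resolution works when the rules are loaded:
sudo systemctl edit ferm.service
# and add something like:
#   [Unit]
#   Wants=network-online.target
#   After=network-online.target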
That's exactly what I meant: we should have this check independently, and add other checks for the other part described in T207385 to prevent it.
Sure, we can add a step that checks the parser cache replication/heartbeat.
Could you precisely outline in which phase we need to check what, and also update the SwitchDatacenter wiki page so that it's clear the step is needed, even before we automate that in the cookbooks?
My suggestion for this kind of check was not for the passive DC, but mainly for the active one, to make sure that the parser caches are properly used. We might have changes in MediaWiki that change the hit ratio over time, and it could go below a threshold that causes issues.
I think it might be useful in general to have this check, and it would also have alarmed immediately after the switch, telling us the real cause of the issue. It's not meant to prevent the issue; for that we'll have the other ones (replication/heartbeat/cookbook).
Netbox has been upgraded to upstream 2.4.6. Report any issue you might find.
Oct 17 2018
Oct 16 2018
I think we could also consider adding an alert based on the hit ratio of the parser caches (we already have the data in Grafana).