Sorry, I didn't mention the multi-DC setup :)
Mon, Apr 23
Personally never used, +1 to drop it.
Thu, Apr 19
To summarize the work done recently: I've audited the existing checks and fixed/improved the ones that had clear errors or needed to be updated. @chasemp has very kindly offered to review the WMCS-related checks, users and groups.
Wed, Apr 18
@Joe no it would not be super easy to solve in a DRY way, I agree.
Thu, Apr 12
I've fixed it; it was a case of password misalignment, see one of the cases described in T150160.
Tue, Apr 10
Reporting it here too for future reference: to fix it, it's sufficient to replace --diff in the above command with --commit, and then re-run with --diff to verify that this time it shows no error.
Mon, Apr 9
Patch updated to overcome this problem; once reviewed and merged it should solve the issue.
Wed, Apr 4
Reports have been enabled for ~1 day without any incident. Resolving.
Tue, Apr 3
Sat, Mar 31
@jcrespo LMK if you'd like me to do anything about it during the weekend.
Fri, Mar 30
For reference, this is the max replication lag across all eqiad DBs in that time frame (Mar. 28th, ~18:30-20:30), from which it seems pretty clear that there was no noticeable lag at all.
Thu, Mar 29
@MarcoAurelio thanks for checking in and for the additional info. At this point I think this is due to some older discrepancy between the two hosts for this table: your query deleted only 14678 rows on the master, while the slave has many more rows that meet this condition:
Replication lag is back to zero and there are no other errors, though the two tables still differ. I've opened T191020 to track that, and I'm resolving this one.
The plan as of now is to enable it next Tuesday, to avoid issues over the long weekend.
From the quick test I ran yesterday, enabling reporting to puppetdb for a few minutes, I got ~200 hosts reported and showing data in Puppetboard, and I didn't notice any appreciable load/RAM/disk usage increase on the puppetdb hosts.
Moreover, our report-ttl parameter is set to 1d, so I don't expect this amount of data to be kept long term.
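For reference, this setting lives in PuppetDB's database configuration; a sketch (the exact file path varies per installation):

```ini
; puppetdb database config (e.g. under conf.d/; path is an assumption)
[database]
; automatically prune stored reports older than 1 day
report-ttl = 1d
```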
I've found the 20 rows that were missing on labsdb1004 from this delete (which removed 14677 rows), re-added them to labsdb1004, and restarted replication.
@Niharika as you might know, our DBAs are out for the rest of this week; do you think this can wait until Monday? If not, let me know and I'll try to have a look, although I'm missing some context on what analysis has already been done.
Wed, Mar 28
Mon, Mar 26
Puppetboard is now reachable via https://puppetboard.wikimedia.org (LDAP auth), resolving.
Mar 22 2018
It's now rebuilding; AFAIK a disk was replaced:
Mar 21 2018
List of hosts with puppet disabled since before the migration that are missing from the new puppetdb and would disappear from Icinga upon re-enabling puppet there:
Mar 20 2018
Mar 19 2018
Agreed in the meeting that, for now, the simple HTTP check is enough, given that we also check that the uWSGI web app is running.
Mar 14 2018
Mar 13 2018
I've split the WMCS part into a separate CR that can be merged independently of production: https://gerrit.wikimedia.org/r/c/419131
Mar 5 2018
Mar 2 2018
@aggro the fix has been merged into master and will be included in the next Cumin release.
For now, as a quick workaround to generate the log file in the current directory, you could use ./cumin.log in the configuration file instead of just cumin.log.
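Concretely, the workaround would look like this in Cumin's configuration file (a sketch; I'm assuming the standard `log_file` key in `config.yaml`):

```yaml
# cumin config.yaml: use an explicit relative path so the directory
# component is non-empty when split from the filename
log_file: ./cumin.log   # instead of: log_file: cumin.log
```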
Thanks again for reporting the issue.
Mar 1 2018
We had OOMs also with puppet disabled on tegmen, so that's not the culprit.
@aggro Thanks a lot for reporting the issue; I can confirm it.
Indeed, there is a missing check on the log path once it has been split from the filename. I'll send a fix shortly.
Feb 27 2018
Feb 23 2018
This will be fixed by https://gerrit.wikimedia.org/r/c/412894/, which is pending the full release of Cumin 3.0.1 in prod, which in turn is waiting for the full release of conftool 1.0.0 in prod, which is pending final testing and also an issue with the python3-etcd Debian package.
The pasted command is missing the 'mgmt' part; it seems to work for me after adding it:
Feb 21 2018
Feb 20 2018
Feb 19 2018
Thanks for the proposal. It seems to me a nice-to-have backend, and I don't see any conceptual problem with adding it to Cumin's backends. Other non-configuration backends, such as Icinga in our case, might also be useful at times.