We should also investigate other available tools in the container space, for example the recently released https://github.com/puppetlabs/lumogon, or https://github.com/coreos/clair from CoreOS (thanks @Joe for this one). Disclaimer: I've not yet done an extensive search for other available tools ;)
Mon, Jun 19
What @jcrespo said; see also my comment on https://gerrit.wikimedia.org/r/#/c/356383/12/modules/role/files/mariadb/eventlogging_cleaner.py@206 regarding the addition of an ORDER BY.
Sat, Jun 17
Sorry for the late reply, partially because I was too busy cleaning stuff around to reply here (thanks Reedy for the help) and partially to avoid giving too much realtime feedback to the abusers.
Thanks to everyone here who helped notify us and limit the impact whenever possible.
Is there an ETA for a permanent fix? It seems to me that we've already delayed this too much, given how frequently it's been happening lately.
Fri, Jun 16
Wed, Jun 14
@Paladox FYI I'm still getting an "Invalid SSH Key" error when trying to add my key
Tue, Jun 13
Mon, Jun 12
@akosiaris yes, we were aware of it and I spoke with @Joe last week about the requirements for the Docker part; sorry for not having mentioned/referenced it here too. The idea at this point is to have a single tool that can work for both physical hosts and Docker images, so it should fully overlap with the requirements of T167269.
Fri, Jun 9
Thu, Jun 8
Wed, Jun 7
Facter has been upgraded in production on the whole fleet apart from cp3003.esams.wmnet and labstore[1001-1002].eqiad.wmnet, which will need to be reimaged anyway. Labs was also upgraded by Faidon via Salt.
Tue, Jun 6
Mon, Jun 5
False positive, I'll add the error message to the list of ones to be skipped.
Relating it to T166965
Salt is now deprecated and we're using Cumin instead. We also have new tools to properly manage puppet runs such as run-puppet-agent.
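For the record, a minimal hypothetical example of what that looks like (host selector and batch size are illustrative only):
```
# Force a puppet run on a handful of hosts with Cumin instead of Salt.
sudo cumin --batch-size 5 'db10[48-50].eqiad.wmnet' 'run-puppet-agent'
```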
To add some data here: I'm getting very slow responses when opening an instance page, like https://horizon.wikimedia.org/project/instances/edbb1ea0-6e77-4159-8e6f-29886fad5dfa/. It takes around 15 seconds the first time and then it's quicker for a while, I guess until some of the results are cached. Then, opening the Puppet Configuration tab takes another 4~5 seconds. See the timings below with the details for the instance GET:
Thu, Jun 1
@Cmjohnson @Papaul FYI: given that the RAID alarm in Icinga can now also be triggered by a faulty BBU or a wrong WritePolicy, I've added the Icinga error on top of the get-raid output.
If the error reports problems related to the BBU or the WritePolicy, the disk status output will most likely report all OK and not be super helpful.
This is a temporary solution for the moment, until we have some time to work on refactoring/improving the RAID checks as a whole.
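If you need more detail than the disk status in those cases, something along these lines should show the BBU state and the current write policy (illustrative MegaCli invocations, assuming a MegaRAID controller; the binary may be megacli or MegaCli64 depending on the host):
```
megacli -AdpBbuCmd -GetBbuStatus -aAll                 # battery/BBU state
megacli -LDInfo -LAll -aAll | grep -i 'cache policy'   # default vs current write policy
```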
Wed, May 31
Tue, May 30
I think most of this will go away when working on T166300, probably in Q1 as part of the Salt deprecation goal. My plan is to get rid of wmf-reimage completely and have a single script that handles the whole process.
@akosiaris @Joe @faidon
I've set stringify_facts = false in my labs project and these are the facts that differ. Bear in mind that with v3 of the PuppetDB API the facts are still reported "stringified", in the sense that they have a value key that is a string, which is now a JSON-encoded string.
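Purely as an illustration of the "stringified" behaviour (hypothetical host and output, not the actual diff):
```
# With v3, even structured facts come back with a string "value", now JSON-encoded:
curl -sG http://localhost:8080/v3/facts/os \
    --data-urlencode 'query=["=", "certname", "somehost.eqiad.wmnet"]'
# => [{"certname": "somehost.eqiad.wmnet", "name": "os",
#      "value": "{\"name\":\"Debian\",\"family\":\"Debian\", ...}"}]
```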
Below are the diffs of the value property:
Mon, May 29
All looks good, resolving for now:
Should this be resolved? There is still a disk with predictive failure, but not yet failed:
Yes @akosiaris, all the times it happened were during a cron puppet run, and it seems to me only when there are changes in the generated puppet_hosts.cfg config file.
@akosiaris actually this happened ~2h after I killed the daemonized puppet on tegmen... I'm not sure this explanation is still valid, thoughts?
Now that the above has been merged, all the labvirt* instances have no diff, hence all the remaining differences are just the string vs. integer representation of $::processorcount as a class parameter.
Sun, May 28
Adding @MoritzMuehlenhoff too.
All but two diffs are related to $::processorcount:
Sat, May 27
And db1048 returned to WriteBack policy less than 1h ago 😛
Once we upgrade to PuppetDB API v4, I will move the PuppetDB queries in Cumin from GET to POST to overcome this limit and see if we hit any other limit. Unfortunately, v3 of the PuppetDB API doesn't accept POST.
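As a rough sketch of what the v4 POST query could look like (endpoint, host and query are illustrative, not the actual Cumin implementation):
```
curl -s -X POST http://localhost:8080/pdb/query/v4/nodes \
    -H 'Content-Type: application/json' \
    -d '{"query": ["~", "certname", "^db10.*\\.eqiad\\.wmnet$"]}'
```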
So far the lag is limited to 3~4 seconds according to Tendril, while in Grafana it is flat zero; maybe the dashboard is not graphing the right data?
See db1048 replication lag dashboard.
Re-opening as it alarmed again today for the write policy... the battery is reported to be from 2010, wasn't it swapped a few days ago?
tin hit this today. I've tried to rmmod mei_me and rmmod mei as suggested above, but it didn't fix the problem live; it probably needs a reboot, and I'm not rebooting it right now (see below).
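For reference, the removal order matters since mei_me depends on mei:
```
rmmod mei_me   # remove the PCI driver first
rmmod mei      # then the core module; a reboot may still be needed
```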
Fri, May 26
@BBlack yes, that is a PuppetDB error when the limit is reached.
If you already have an authoritative list of hosts in NodeSet notation (the one printed by cumin), you can use --backend direct to use that as-is without querying PuppetDB.
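For example (hypothetical host list):
```
sudo cumin --backend direct 'db10[48-50].eqiad.wmnet' 'uptime'
```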
It seems expected to me: $::processorcount is used across different modules in Puppet, and the reported diff is only in the parameters of the class.
First diff found on scb1004:
The command to run this across the fleet (skipping the hosts currently down) is:
So it seems that those flapping results are due to puppet ALSO running as a daemon on those hosts (thanks @faidon): if at any time when running the puppet agent there is a typo in the options around -t, puppet smartly decides to ignore the wrong option and runs as a daemon in the background.
Some examples were:
Thu, May 25
I've ack'ed the Icinga alarm with this task.
Facter was upgraded and verified to be a no-op across the fleet.
Wed, May 24
The upgrade will be performed with the following steps (a rough sketch follows the list):
- disable puppet reliably (waiting for any in-flight run)
- compile the catalog and output the facts to a directory
- upgrade facter
- compile the catalog again and output the facts to another directory
- compare the result of the two runs
- enable puppet
- remove temporary files
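A simplified, illustrative sketch of the fact-comparison part on a single host (paths and package name are assumptions, not the actual procedure):
```
mkdir -p /tmp/facter-compare
facter -p --json > /tmp/facter-compare/before.json   # facts from the old facter
apt-get install -y facter                            # upgrade facter
facter -p --json > /tmp/facter-compare/after.json    # facts from the new facter
diff -u /tmp/facter-compare/before.json /tmp/facter-compare/after.json
rm -rf /tmp/facter-compare                           # remove temporary files
```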
Tue, May 23
May 23 2017
This was a raid check false positive
This was a raid check false positive
Documentation updated on https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet3-diffs/Documentation