Tue, Apr 25
From the audit I got the same results as the tables in T163196#3206314, except for the following ones; all looks good now for the ipaddress6_primary version:
Mon, Apr 24
Comparison between ipaddress6 and ipaddress6_primary. All the ones with an issue are marked in bold and carry a number in square brackets that refers to the list of details at the bottom. For all the others the correct one seems to me to be ipaddress6_primary; it also matches the DNS record when present:
Comparison between ipaddress and ipaddress_primary. For all the differing ones the correct one seems to me to be ipaddress_primary; it also matches the DNS record for the host:
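For reference, a minimal sketch of the per-host check behind both comparisons, assuming the fact values are fetched out of band (the function and parameter names are mine):

```python
import socket

def audit_host(fqdn, ipaddress, ipaddress_primary, family=socket.AF_INET):
    """Resolve the host in DNS and return whichever of the two fact
    values matches the record (pass socket.AF_INET6 for the v6 pass)."""
    dns_ips = {ai[4][0] for ai in socket.getaddrinfo(fqdn, None, family)}
    for candidate in (ipaddress_primary, ipaddress):
        if candidate in dns_ips:
            return candidate
    return None  # neither fact matches DNS: flag for manual review
```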
Sun, Apr 23
Sat, Apr 22
Choosing the next one could be complex if we want to cover all cases: multiple submenu levels, menus mixing plain items and submenus, etc. It's also possible to run a specific task out of order for some reason; see the sketch below.
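A minimal sketch of one way to pick it, assuming hypothetical menu/task objects where menus expose an items list and tasks a completed flag:

```python
def next_task(menu):
    """Depth-first scan: descend into submenus in order and return the
    first task that has not run yet; None means everything completed."""
    for item in menu.items:
        if hasattr(item, 'items'):    # a submenu, possibly nested: recurse
            found = next_task(item)
            if found is not None:
                return found
        elif not item.completed:      # a plain task still pending
            return item
    return None
```

Note that this only returns the first pending task in order; supporting the out-of-order case would need an explicit cursor on top of the completion flags.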
With the above CR, the new SAL messages will be:
Fri, Apr 21
Thu, Apr 20
Relating it also to T155692
So after switching tegmen to active, we now have the issue on einsteinium:
Wed, Apr 19
@akosiaris: I've found that the catalog for tegmen doesn't have Nagios_Host and Nagios_Service resources, and I think this is due to this hack:
Increased the max ICMP out packets to 3000 to overcome the bottleneck.
Packet loss is back down to zero and the graph shows a normal trend without bottlenecks.
Ping from various codfw hosts confirms packet loss:
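A minimal sketch of that confirmation, assuming Linux iputils ping and its "X% packet loss" summary line (the host list is whatever codfw hosts are handy):

```python
import re
import subprocess

def packet_loss(target, count=100):
    """Send `count` ICMP echo requests and return the loss percentage."""
    out = subprocess.run(['ping', '-c', str(count), target],
                         capture_output=True, text=True).stdout
    match = re.search(r'([\d.]+)% packet loss', out)
    return float(match.group(1)) if match else None
```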
Tue, Apr 18
Also, why do we do the stop/sync/start all the time, instead of just syncing the files to a safe location and having a make-icinga-primary script (or similar) that performs the stop/mv/start inside a run-no-puppet, to be run manually only when needed?
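A minimal sketch of what I have in mind; the service name and paths are illustrative, not the actual puppetized layout:

```python
import subprocess

def make_icinga_primary(synced='/srv/icinga-state-sync/retention.dat',
                        live='/var/lib/icinga/retention.dat'):
    """Stop Icinga, move the synced state into place, start Icinga again.
    Meant to be run manually, and only when promoting this host."""
    subprocess.run(['service', 'icinga', 'stop'], check=True)
    subprocess.run(['mv', synced, live], check=True)
    subprocess.run(['service', 'icinga', 'start'], check=True)

if __name__ == '__main__':
    make_icinga_primary()
```

Invoked via run-no-puppet, a concurrent puppet run can't race with the move.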
Could it be that the crontab that runs every 10 minutes raced with a puppet run and made all this mess? I don't see it wrapped in a run-no-puppet:
@Dzahn I've updated the output with the result of sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli (you can find the right one by typing get- and pressing Tab to see which one is available on the specific host).
Fri, Apr 14
We had to revert the last change as an emergency because it was causing issues on commonswiki (s4) and, in general, on large wikis.
Thu, Apr 13
Re-opening because this is happening when rebooting hosts; see the root@ mails from the last few days.
Wed, Apr 12
Sun, Apr 9
Since Feb. 19th we've been getting one email every day from terbium with an error for each wiki (a ~900-line email) with:
For the past couple of days both einsteinium and tegmen have been spamming root@ every hour with certspotter errors; this time it seems that the DigiCert service is responding 400 to the check requests:
Sat, Apr 8
I've also ACK'ed the related puppet run alarm on Icinga.
Thu, Apr 6
I think it might happen when a VACUUM is running on the master; at least today, with a lot of delay on the maps-test cluster, I've noticed a VACUUM that has been running for 15 hours:
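For the record, a minimal sketch of how to spot such a VACUUM from pg_stat_activity, assuming psycopg2 and PostgreSQL 9.2+ (the connection details are illustrative):

```python
import psycopg2

conn = psycopg2.connect('dbname=postgres')
with conn.cursor() as cur:
    # Long-running VACUUMs, longest first; the `state` and `query`
    # columns exist in pg_stat_activity from PostgreSQL 9.2 onwards.
    cur.execute("""
        SELECT pid, now() - query_start AS runtime, query
          FROM pg_stat_activity
         WHERE state = 'active' AND query ILIKE '%vacuum%'
         ORDER BY runtime DESC""")
    for pid, runtime, query in cur.fetchall():
        print(pid, runtime, query)
```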
Just nitpicking: pop() returns the value, which of course you don't need, given how HeaderKeyDict is implemented.
There is a __delitem__ implemented that you could use as del self.headers['Content-Type'], if I'm not mistaken.
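A tiny illustration of the difference, using a plain dict in place of HeaderKeyDict (both expose the same mapping protocol for this purpose):

```python
headers = {'Content-Type': 'image/png', 'X-Timestamp': '1491400000'}
headers.pop('Content-Type')   # looks up and returns the value, then discarded

headers = {'Content-Type': 'image/png', 'X-Timestamp': '1491400000'}
del headers['Content-Type']   # __delitem__: same effect, nothing returned
```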
A second pass was completed successfully without any manual intervention.
Wed, Apr 5
In the medium term I have in mind a bunch of things that should help in this direction. Feel free to ping me to talk about it.
The first run of swiftrepl has finally completed! It is now in the 2-hour sleep between runs; I'll check that the next one completes without manual intervention.
The third one was:
wikipedia-commons-local-thumb.3b 3/3b/Hendrick_de_Keyser_-_gulden_cabinet.png/85px-Hendrick_de_Keyser_-_gulden_cabinet.png E-Tag mismatch: bc68f6efc732fda68647dcd65867cef9/cd3b1b810889387c0ff7bed187e87125, syncing
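For context, the mismatch above is between the MD5 ETags the two clusters report for the object; a hedged sketch of that comparison, assuming standard Swift HEAD semantics (the URLs and token handling are mine, not swiftrepl's actual code):

```python
import requests

def needs_sync(url_src, url_dst, token):
    """HEAD the object on both clusters and compare the ETags (Swift
    reports the object's MD5 there); a mismatch means the copies differ."""
    hdrs = {'X-Auth-Token': token}
    etag_src = requests.head(url_src, headers=hdrs).headers.get('Etag')
    etag_dst = requests.head(url_dst, headers=hdrs).headers.get('Etag')
    return etag_src != etag_dst
```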
Tue, Apr 4
Mon, Apr 3
Oops, I read the previous message as saying it required a restart of the puppetmasters, not puppetdb; sorry for the misunderstanding.