Fixed by Joe with https://gerrit.wikimedia.org/r/468937
Very interesting discovery today. The probe_delay_initial_ms (the time mcrouter waits before sending the first health check to a memcached server after it has been marked as TKO) is 10s. This is the timeline of one mcrouter TKO workflow on mw1347:
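To make the 10s figure concrete, here is a minimal Python sketch of when health-check probes would go out after a shard is marked TKO. Only the 10s initial delay comes from the comment above. The backoff factor, the cap (max_ms) and the function name are illustrative assumptions, not values or code taken from mcrouter.

```python
def probe_schedule(initial_ms=10_000, max_ms=60_000, probes=4, factor=2):
    """Return the offsets (ms after the TKO mark) at which each probe is sent.

    initial_ms mirrors probe_delay_initial_ms (10s in the comment above).
    max_ms and factor are hypothetical: they model a capped exponential
    backoff and are NOT taken from mcrouter's configuration.
    """
    offsets = []
    delay = initial_ms
    elapsed = 0
    for _ in range(probes):
        elapsed += delay          # wait the current delay, then probe
        offsets.append(elapsed)
        delay = min(delay * factor, max_ms)  # back off, up to the cap
    return offsets


if __name__ == "__main__":
    for i, offset in enumerate(probe_schedule(), start=1):
        print(f"probe {i}: {offset / 1000:.0f}s after TKO")
```

Under these assumptions a shard that recovers immediately after being marked TKO still receives no traffic for at least the first 10 seconds.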
Fri, Oct 19
Thu, Oct 18
Recap of what we did so far:
Wed, Oct 17
Tables dropped with Marcel on db110[7,8] (eventlogging master/slave). Marcel checked and nothing is there on HDFS.
For reference, https://www.slideshare.net/HadoopSummit/operating-and-supporting-apache-hbase-best-practices-and-improvements (slide 15) shows a similar problem on HBase, which was related to the kernel driver for the disk controller.
It also happened on the 16th, but didn't lead to any failover:
For reference, this is what https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/util/JvmPauseMonitor.html does:
I was about to send another code change for the GC, but then I took another look at the logs in the description and realized that I had missed an important bit:
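The general technique behind a pause monitor of this kind can be sketched in a few lines: sleep for a fixed interval, measure how long the sleep actually took, and report the excess as a pause (caused by GC or by the host stalling). This is a minimal Python illustration of the idea, not the HBase implementation. The function name, thresholds and return format are assumptions chosen for the example.

```python
import time


def detect_pauses(sleep_interval_s=0.5, info_threshold_s=1.0,
                  warn_threshold_s=10.0, iterations=10,
                  clock=time.monotonic, sleep=time.sleep):
    """Return a list of (level, extra_seconds) for every detected pause.

    All names and default thresholds are illustrative. clock and sleep
    are injectable so the loop can be exercised without real waiting.
    """
    events = []
    for _ in range(iterations):
        start = clock()
        sleep(sleep_interval_s)
        # Time spent beyond the requested sleep is attributed to a pause.
        extra = clock() - start - sleep_interval_s
        if extra > warn_threshold_s:
            events.append(("WARN", extra))
        elif extra > info_threshold_s:
            events.append(("INFO", extra))
    return events
```

Because the check only measures wall-clock drift, it cannot tell a GC pause from a stalled disk or a frozen VM on its own; that has to be worked out from the GC logs.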
Tue, Oct 16
@AndyRussG sorry for the lag but I had to clarify with Joseph some details :)
15:24 <icinga-wm> RECOVERY - Check systemd state on db1108 is OK: OK - running: The system is fully operational
15:29 <icinga-wm> RECOVERY - Check systemd state on db1107 is OK: OK - running: The system is fully operational
Thanks! So this might be the case of a schema present only on Hadoop and not in MySQL? If so, the logic that triggered the above check needs to be removed :)
@Gilles hi! Do you know when ResourceTiming will start registering events in Eventlogging?
This should be a protection mechanism that in this case caused a false positive. So https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/466607/ introduces a new schema in the whitelist, but probably no event for the schema has been collected from Eventlogging yet (hence no table created on the DB).
@Cmjohnson this server is OOW but the replacement will take time to arrive (still in procurement..) and this host is really important for the research users. Do we have a spare disk that we can swap?
Mon, Oct 15
All mcrouters now use 5 persistent conns to each shard; the above graph shows the increase that each memcached server observed after the rollout. We still see some connection yields, so this might not be enough; it really depends on how mcrouter handles short bursts with more than one connection.
Thanks Erik, it should work now :)
I'd need to know the following data:
Sun, Oct 14
Fri, Oct 12
Created the 0.12.3 debs on boron and deployed the new version in Labs, ready for testing!
Thu, Oct 11
Finally built and deployed the new prometheus-memcached-exporter on the mc* hosts; https://grafana.wikimedia.org/dashboard/db/memcache now shows two new metrics, including the rate of connection yields broken down by shard.
Done! Will follow up in another task to replace stat1005 with this new host.
Wed, Oct 10
As of 6.0 we (Cloudera) no longer support/build on Debian: https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_deprecated_items.html#concept_ylw_bc2_rbb Sorry to disappoint. We continue support for Debian on 5.x though.
Both changes merged, the space consumption should go down on both eventlog1002 and stat1005 after the next logrotate run. Keeping this task open to verify this.
Opened https://phabricator.wikimedia.org/T206626 to fully decom conf100[1-3] (not in service anymore and with role::spare::system).
@Cmjohnson quick question (might be wrong since I am a n00b with Juniper): is stat1007 in the analytics VLAN?
Tue, Oct 9
Mon, Oct 8
I re-examined the problem from a fresh start, and also tried to validate Joe's initial point about TKO not being handled by mcrouter removing the (failing) shard from the consistent hashing. I think that there are two main issues from what I can see:
This morning we successfully moved the Druid clusters to an-coord1001; tomorrow we will do hive/oozie and the cron jobs.
Fri, Oct 5
The host has been set up with basic functionality, and all daemons and mariadb seem to be working fine. I also set up root/hive/oozie/druid users in mariadb.
Last login: Wed Oct 3 10:23:32 2018 from 22.214.171.124
elukey@stat1005:~$ cat /etc/gitconfig
# vim: set ts=4 sw=4 et:
# This file is managed by Puppet!
# puppet:://modules/git/gitconfig.erb
# git::userconfig for 'git::systemconfig'
Hi Erik, can you give me an example of a command you run that hangs? By default git should use the http_proxy config on stat1005 (system property), but it might not work with your settings.
elukey@ganeti1001:~$ sudo gnt-instance remove bohrium.eqiad.wmnet
This will remove the volumes of the instance bohrium.eqiad.wmnet (including mirrors), thus removing all the data of the instance. Continue?
y/[n]/?: y
Thu, Oct 4
We do have a logrotate config on an1003:
Also adding @aaron to get his opinion; I have no idea how to trace back which piece of code uses the key listed above :)
Wed, Oct 3
Assigning to Rob to see if anything needs to be done from the DC ops side before closing.
We basically care about the following rates: