Fri, Jan 19
Thanks for the explanation!
Pooled and working correctly, closing!
Andrew and Joseph completed a test in labs to verify that Druid running on Java 7 would still work fine with Hadoop running Java 8, and no surprises came up.
Just tested the use case in the description on stat1004 with:
So the processor does event['uuid'] = capsule_uuid(event), where capsule_uuid is defined like this:
For Spark I believe that update-java-alternatives is enough to force it to pick up Java 8. I found this in one of the scripts called by spark(-2)*-shell:
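A rough sketch of the idea (the alternative name is the usual Debian one, and the launcher logic below is paraphrased from memory rather than copied from the script):

```
# Switch the default JVM to Java 8; check `update-java-alternatives --list`
# for the exact alternative name available on the host.
sudo update-java-alternatives --set java-1.8.0-openjdk-amd64

# Spark's launcher resolves the JVM roughly like this: JAVA_HOME wins,
# otherwise whatever `java` resolves to on $PATH is used, which is exactly
# what update-java-alternatives changes.
if [ -n "${JAVA_HOME:-}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  RUNNER="java"
fi
"${RUNNER}" -version
```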
Draft of the upgrade plan in https://etherpad.wikimedia.org/p/analytics-hadoop-java8
One thing that I noticed is that when a burst of warnings happens, the following is logged around the same time in syslog:
Thu, Jan 18
Had an interesting chat with @ayounsi and for the moment it seems that the only format expected in the netflow topic will be: tag,dst_as,as_path,peer_dst_as
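As a quick sketch of how I plan to peek at those records (kafkacat usage and the broker name are my assumptions, not the confirmed setup):

```
# Consume a handful of records from the netflow topic and split the expected
# tag,dst_as,as_path,peer_dst_as fields (broker/topic names are illustrative).
kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 -t netflow -c 5 -e \
  | awk -F',' '{ printf "tag=%s dst_as=%s as_path=%s peer_dst_as=%s\n", $1, $2, $3, $4 }'
```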
Closing the task since puppet should be ok now, please re-open otherwise!
Fixed all except j1.analytics.eqiad.wmflabs - @Ottomata do we still need this one? It seems to be running Superset, and puppet is broken on it.
The puppetdb Grafana dashboard (and its related monitoring config for nitrogen/nihal) was added in https://phabricator.wikimedia.org/T184796
Closing task since https://grafana.wikimedia.org/dashboard/db/puppetdb is almost a replica of the puppetdb localhost one.
@faidon whenever you have time, do you mind explaining a bit what data is currently pushed to the netflow topic in Kafka Jumbo and how to read it? I am planning to work on this task soon (with Joseph's supervision) but I am very ignorant about the subject :)
Wed, Jan 17
The first run completed without any errors, and then another one (cleaning up only daily data) ran as well with the following settings:
Host decommed in https://phabricator.wikimedia.org/T183895
Tue, Jan 16
metastore.sh uses HIVE_METASTORE_HADOOP_OPTS, which works fine (just tested), but there seems to be no equivalent for Hive Server (https://issues.apache.org/jira/browse/HIVE-12582).
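For reference, a minimal sketch of the metastore side in hive-env.sh (agent jar path, port and config file below are placeholders):

```
# hive-env.sh sketch: the metastore honours HIVE_METASTORE_HADOOP_OPTS, so the
# Prometheus jmx_exporter javaagent can be attached there; no such hook seems
# to exist for HiveServer2 (see HIVE-12582).
export HIVE_METASTORE_HADOOP_OPTS="${HIVE_METASTORE_HADOOP_OPTS:-} -javaagent:/usr/share/java/prometheus-jmx_javaagent.jar=9183:/etc/hive/jmx_exporter.yaml"
```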
Mon, Jan 15
I added a timeout 3 bash command and it worked fine, but then a similar issue happened again when I tried to restart the metastore service. Hive is clearly not able to run the jmx agent reliably at the moment, and Oozie is in a similar boat. I am a bit worried about the other Hadoop daemons though: everything has gone fine so far, but the confusing Hadoop init.d scripts might have some bug waiting to bite us in the future.
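To be explicit, the workaround is just wrapping the blocking step, roughly like this (the wrapped command name is a placeholder):

```
# Give the blocking step at most 3 seconds instead of letting it hang the
# service (re)start; `timeout` exits with 124 if the deadline is hit.
# (/usr/local/bin/check-jmx-agent is a placeholder name.)
timeout 3 /usr/local/bin/check-jmx-agent || echo "jmx agent check timed out or failed, continuing"
```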
So with a better ps what happens is clear:
About Hive, I tried to re-apply the changes to the metastore and this is the difference in ps:
Started a dashboard in https://grafana-admin.wikimedia.org/dashboard/db/puppetdb
Fri, Jan 12
Tried to open https://community.cloudera.com/t5/CDH-Manual-Installation/Oozie-duplicates-CATALINA-OPTS-variables-in-oozie-env-sh/m-p/63654#M1607, not sure if it is the best place but let's see if anybody answers.
The problem seems to be in the oozie debian package itself:
Finally found the root cause. Each time oozied.sh is invoked with start/stop from the init.d script, it starts with a clean environment. The duplication then happens in oozie-sys.sh, due to the symlink pointed out above and these:
This is the journalctl snippet of oozie sourcing various files:
In https://oozie.apache.org/docs/4.1.0/AG_Install.html -> Advanced/Custom Environment Settings I can't see any CATALINA_OPTS listed for oozie-env.sh, so we might need to set it elsewhere.
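If we do end up exporting CATALINA_OPTS from a file that gets sourced on every start/stop, a small guard would at least prevent the javaagent flag from being appended twice (paths and port below are placeholders):

```
# Idempotency guard sketch: only append the javaagent flag if it is not
# already present, so repeated sourcing cannot duplicate CATALINA_OPTS entries.
case "${CATALINA_OPTS:-}" in
  *jmx_javaagent*) ;;  # already configured, nothing to do
  *) export CATALINA_OPTS="${CATALINA_OPTS:-} -javaagent:/usr/share/java/prometheus-jmx_javaagent.jar=9184:/etc/oozie/jmx_exporter.yaml" ;;
esac
```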
Since rates and other things like stddev are MBean attributes, I cannot easily blacklist them; explicit rewrite rules are needed instead (rules in which we can select exactly which attributes to render). This is what I came up with:
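(Illustrative sketch of the shape of such rules only, with placeholder bean/attribute names; the idea is to whitelist exactly what we want rather than blacklist every derived rate/stddev attribute.)

```
# Write an example jmx_exporter config that renders only the Count attribute
# of the matched MBeans; attributes not matched by any rule are not collected.
cat > /etc/prometheus/hive_jmx_exporter.yaml <<'EOF'
lowercaseOutputName: true
whitelistObjectNames:
  - 'metrics:*'
rules:
  - pattern: 'metrics<name=(.+)><>Count'
    name: hive_$1_count
    type: COUNTER
EOF
```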
Current status on nitrogen (no jvm metrics displayed since they should already be ok):
I opened https://phabricator.wikimedia.org/T184794 to track down and fix the Oozie/Hive bugs; I am inclined to close this task since:
Thu, Jan 11
Just saw that there are more instances to fix. Some of them are being used for experiments at the moment; I will try to fix them ASAP though.
it does the following (maybe I am reading the code in the wrong way):
Tested in labs the procedure outlined above (install + update-java-alternatives to Java 8) and everything went fine. The following errors are ok (double-checked with Moritz):
Wed, Jan 10
The Hive server/metastore issue is more subtle: everything starts and the jmx agent returns metrics correctly, but the daemons do not bind to their ports (so they are not actually working):
I am testing in labs why oozie/hive daemons are not starting up with the -javaagent.
Created https://grafana.wikimedia.org/dashboard/db/prometheus-analytics-hadoop as a 1:1 replica of its Graphite alter ego https://grafana.wikimedia.org/dashboard/db/analytics-hadoop
Tue, Jan 9
Maintenance done; the mgmt interface is now up and running (Chris also reseated the DIMM banks).
Further refinement to avoid duplicating the JVM metrics:
Mon, Jan 8
downtime announced to engineering@ and analytics@
@Cmjohnson Would it be fine tomorrow around this time? Or whenever you prefer; I'd need to send an email to announce the downtime, better to alert people :)
Now the BMC/IPMI doesn't seem to be happy:
Sun, Jan 7
Wanted to sanity-check the data on db1107 (the EventLogging master) after the MySQL consumers backfilled the missing data from the past few days:
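The kind of check I have in mind is just comparing per-day row counts over the backfilled window, along these lines (the table name is a placeholder, and I'm assuming the usual 14-digit EventLogging timestamp strings):

```
# Per-day row counts for one schema table on the EventLogging master;
# ExampleSchema_12345 is a placeholder name in the `log` database.
mysql -h db1107.eqiad.wmnet log -e "
  SELECT LEFT(timestamp, 8) AS day, COUNT(*) AS events
  FROM ExampleSchema_12345
  WHERE timestamp >= '20180101000000'
  GROUP BY day
  ORDER BY day;"
```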
Sat, Jan 6
Fri, Jan 5
Thu, Jan 4
Wed, Jan 3
Did a scap pull, set the host to pooled=yes, and checked the Apache metrics. Everything looks good! Closing the task; let's re-open it if it causes problems again.
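For completeness, the repool itself is just a conftool change, something along these lines (the hostname is illustrative):

```
# Mark the host as pooled again in conftool/etcd and verify the new state.
sudo confctl select 'name=mw1234.eqiad.wmnet' set/pooled=yes
sudo confctl select 'name=mw1234.eqiad.wmnet' get
```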
We'll probably have to do another round of reboots next week, so the remaining Kafka hosts will be done then.
Maintenance is ongoing and it will probably last for a couple of days.
Tue, Jan 2
I've discussed this task with my team and a couple of things came up: